Power Estimation and Minimization of Digital Signal Processing Systems

BY Sumant Ramprasad Department of Computer Science University of Illinois at Urbana-Champaign

COMMITTEE Prof. Michael Faiman Prof. Ibrahim N. Hajj Prof. Saburo Muroga Prof. Farid N. Najm Prof. Naresh R. Shanbhag (Chairperson)

Thursday, January 28, 1999, 10:30 A.M., Room 419 C&SRL


© Copyright by Sumant Ramprasad, 1999

POWER ESTIMATION AND MINIMIZATION OF DIGITAL SIGNAL PROCESSING SYSTEMS

BY SUMANT RAMPRASAD B. Tech., Indian Institute of Technology, Mumbai, 1989 M.S., The Ohio State University, Columbus, 1990

THESIS Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1999

Urbana, Illinois

POWER ESTIMATION AND MINIMIZATION OF DIGITAL SIGNAL PROCESSING SYSTEMS

Sumant Ramprasad, Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1999
Naresh Shanbhag and Ibrahim Hajj, Advisors

Power dissipation has become a critical design concern in recent years, driven by the emergence of mobile applications. Reliability concerns and packaging costs have made power optimization relevant even for tethered applications. Digital signal processing (DSP) systems are used in many applications where low-power dissipation is an important goal. In this thesis, we present techniques to estimate and reduce the power dissipation in such systems.

We first present a technique to estimate the power dissipation in filters, which are used extensively in DSP applications. We present a novel methodology to determine the average number of transitions in a signal from its word-level statistical description. The proposed methodology employs: 1.) high-level signal statistics, 2.) a statistical signal generation model, and 3.) the signal encoding (or number representation) to estimate the transition activity for that signal. The proposed method is employed in the estimation of transition activity in DSP hardware.

After presenting power estimation techniques for filters, we present two techniques to reduce the power dissipation in digital filters. We first present decorrelating transformations (referred to as DECOR transformations) to reduce the power dissipation in digital filters by coding the filter coefficients and/or the input. The DECOR transform is suited for narrow-band filters because there is significant correlation between adjacent coefficients. The second power reduction technique for digital filters is applicable to Distributed Arithmetic (DA) architectures. In a DA architecture, a memory is employed to store linear combinations of coefficients. The probability distribution of addresses to the memory is usually not uniform because of temporal correlation in the input. We present a rule governing this probability distribution and use it to partition the memory such that the most frequently accessed locations are stored in the smallest memory.

After focusing on reducing the power dissipation in processing blocks (filters), we concentrate next on reducing the power dissipation in the busses that transmit data between the processing blocks. Transitions on high-capacitance busses result in considerable system power dissipation. We present: 1.) fundamental bounds on the activity reduction capability of any encoding scheme for a given source, and 2.) practical novel encoding schemes that approach these bounds. The fundamental bounds in 1.) are obtained via an information-theoretic approach, where a signal x(n) with entropy rate H is coded with R bits per sample on average. The encoding schemes in 2.) are developed via a communication-theoretic approach, whereby a data source is passed through a decorrelating function followed by a variant of an entropy coding function which reduces the transition activity.

We have so far focused on power dissipation in static CMOS circuits. Dynamic logic circuits are used in high-performance designs due to their speed and area advantage over static CMOS circuits. In this thesis, we also present an optimization technique, termed clock-generating (CG) domino, for dual-output domino logic that reduces area, clock load, and power without increasing the delay. A delayed clock, generated from certain dual-output gates, is used to convert other dual-output gates to single-output gates.


TABLE OF CONTENTS

CHAPTER ... PAGE

1 INTRODUCTION ... 1
  1.1 Thesis Outline and Contribution ... 2

2 POWER ESTIMATION IN DIGITAL FILTERS ... 5
  2.1 Preliminaries ... 8
    2.1.1 Word and Bit-Level Quantities ... 8
    2.1.2 Signal Generation Models ... 10
  2.2 Word-Level Signal Transition Activity ... 11
    2.2.1 Transition Activity For Single-bit Signals ... 11
    2.2.2 Estimation of ρ_i: The Exact Method ... 12
    2.2.3 Estimation of ρ_i: The Approximate Method ... 15
    2.2.4 Calculation of T ... 20
    2.2.5 Effect of Signal Encoding/Number Representation ... 21
  2.3 Transition Activity for DSP Architectures ... 23
    2.3.1 Propagation of Word-Level Statistics ... 23
    2.3.2 Example 1: FIR filter ... 25
    2.3.3 Example 2: Folded FIR filter ... 26
    2.3.4 Example 3: IIR filter ... 28
  2.4 Results with Realistic Benchmark Signals ... 30
    2.4.1 Realistic benchmark signals ... 30
    2.4.2 Total word-level transition activity, T, for FIR and IIR filters ... 31
  2.5 Summary ... 34

3 POWER REDUCTION IN DIGITAL FILTERS ... 35
  3.1 DECOR Transform Applied To Fixed Coefficient Filters ... 36
    3.1.1 Related work ... 37
    3.1.2 The DECOR transform ... 38
    3.1.3 Relaxed DECOR ... 42
    3.1.4 Low-power IIR filters ... 43
    3.1.5 DECOR applied to filter inputs ... 44
  3.2 DECOR Transform Applied To Adaptive Filters ... 45
  3.3 Results ... 49
    3.3.1 Analytical results ... 49
    3.3.2 Simulation Model ... 50
    3.3.3 Simulation Results ... 51
  3.4 Low-Power Distributed Arithmetic Architectures ... 56
  3.5 Low-Power DA architecture ... 59
    3.5.1 Probability distribution of memory addresses ... 59
    3.5.2 Low Power DA architecture ... 60
  3.6 Experimental Results ... 62
  3.7 Summary ... 64

4 LOWER BOUNDS ON TRANSITION ACTIVITY ... 66
  4.1 Preliminaries ... 68
  4.2 Bounds on Signal Transition Activity ... 70
    4.2.1 An Asymptotically Optimal Coding Algorithm ... 73
  4.3 Applications of Bounds on Transition Activity ... 74
    4.3.1 Fully compressed data ... 75
    4.3.2 Transition Signaling ... 75
    4.3.3 Bounds For 1-bit Redundant Codes ... 76
    4.3.4 Lower Bound On Power-Delay Product ... 76
    4.3.5 Bounds On Transition Activity For An i.i.d. Source ... 77
    4.3.6 Bounds On Transition Activity For A Markov Process ... 79
  4.4 Summary ... 81

5 LOW-POWER CODING SCHEMES ... 82
  5.1 Source-Coding Framework ... 84
    5.1.1 A Generic Communications System ... 84
    5.1.2 The Source-Coding Framework ... 86
    5.1.3 Alternatives for F ... 87
    5.1.4 Alternatives for f1 ... 88
    5.1.5 Alternatives for f2 ... 91
  5.2 Encoding Schemes ... 94
  5.3 Simulation Results ... 96
    5.3.1 Reduction In Transition Activity ... 97
    5.3.2 Reduction In Power Dissipation ... 100
    5.3.3 Extensions: Adaptive Encoding Scheme ... 102
  5.4 Summary ... 103

6 LOW-POWER DYNAMIC LOGIC ... 104
  6.1 Clock-generating (CG) domino ... 106
    6.1.1 Generation of the delayed clock ... 107
    6.1.2 Example ... 108
  6.2 Synthesis of CG domino ... 109
  6.3 Experimental Results ... 110
  6.4 Summary ... 111

7 SUMMARY AND FUTURE DIRECTIONS ... 112
  7.1 Thesis Contributions and Summary ... 112
  7.2 Future Directions ... 115

A PROOF OF LOWER BOUNDS ... 116
  A.1 Proof of Lemma 1 ... 116
  A.2 Proof of Lemma 2 ... 117
  A.3 Proof Of Asymptotic Achievability Of Lemma 2 ... 117
  A.4 Proof of Theorem 1 ... 119
  A.5 Proof Of Asymptotic Achievability Of Theorem 1 ... 121
  A.6 Proof of Asymptotic Optimality of the MLZ Algorithm ... 122

B DECOR TRANSFORMATION ... 124
  B.1 Determining α and β for FIR Filters ... 124
  B.2 Effect of Quantization in IIR Filters on DECOR ... 125

REFERENCES ... 127

VITA ... 136

LIST OF TABLES

Table ... Page

2.1  Signal details ... 14
2.2  Description of data-sets ... 17
2.3  Measured and Estimated BP0 and BP1 ... 18
2.4  Word-level transition activity for different number representations ... 21
2.5  Word-level statistics for direct form FIR filter ... 27
2.6  Total transition activity for FIR filters ... 27
2.7  Word-level statistics for folded direct form FIR filter ... 28
2.8  Total transition activity for folded direct form FIR filter ... 29
2.9  Word-level statistics for direct form IIR filter ... 30
2.10 Total transition activity for IIR filters ... 30
2.11 Measured and Estimated BP0 and BP1 ... 31
2.12 Word-level transition activity ... 31
2.13 Total transition activity for FIR filters ... 32
2.14 Total transition activity for folded direct form FIR filter ... 32
2.15 Total transition activity for IIR filters ... 33
2.16 Run times in seconds for direct form filter ... 33
3.1  Overhead in DECOR and DCM for an N Tap FIR filter ... 40
3.2  α and β for different types of FIR filters ... 40
3.3  α_N, α_D, β_N, and β_D for different types of IIR filters ... 44
3.4  Capacitive coefficients for a ROM ... 64
4.1  Transition Matrix ... 79
4.2  Entropy Code for Markov Process ... 79
5.1  Example of xor and dbm ... 89
5.2  Description of data sets ... 90
5.3  Correlation and Kullback-Leibler distance before and after xor and dbm ... 90
5.4  Example of inv, pbm, and vbm functions ... 94
5.5  Encoding Schemes ... 94
5.6  Percentage Reduction in Transition Activity ... 97
5.7  Percentage Reduction in Transition Activity With 1 Bit Redundancy ... 98
5.8  Average Word-Level Transition Activity for Real Addresses ... 99
5.9  Area-Delay-Power for Bus-Invert and xor-pbm ... 101
5.10 Area-Delay-Power for Gray, T0, and inc-xor ... 102
6.1  Possible outputs of a dual-output domino logic gate ... 104
6.2  Experimental results for ISCAS 85 benchmark circuits ... 111

LIST OF FIGURES

Figure ... Page

2.1  Measured and Theoretical t_i and ρ_i vs. bit for the AR(1) signal, SIG2 ... 13
2.2  Measured and Theoretical t_i and ρ_i vs. bit for the MA(1) signal SIG3 ... 15
2.3  Temporal correlation versus bit ... 16
2.4  Temporal correlation versus bit ... 18
2.5  Temporal correlation and transition activity for SIG2 and SIG4 ... 20
2.6  Adder, Multiplier, Multiplexor, and Delay ... 24
2.7  Direct form FIR filter ... 26
2.8  Transpose FIR filter ... 26
2.9  Folded direct form filter ... 28
2.10 IIR direct form filter and transpose ... 29
3.1  (a) Direct Form (DF), (b) DCM, and (c) DECOR Filters ... 38
3.2  DF and DECOR Code For An N Tap FIR Filter ... 40
3.3  Coefficients of a low-pass filter before and after DECOR (α = −1, β = 1) ... 41
3.4  Coefficients of a high-pass filter before and after DECOR (α = 1, β = 1, m = 1) ... 42
3.5  DECOR For Filter Input (α = −1, β = 1, m = 1) ... 45
3.6  Adaptive Filter: (a) traditional, (b) DECOR adaptive filter, (c) DECOR adaptive filter with multiple decorrelators ... 47
3.7  Traditional LMS Filter ... 48
3.8  DECOR LMS Filter ... 48
3.9  Analytical Estimate of Reduction in Transition Activity due to DECOR and DCM ... 50
3.10 Analytical Estimate of Speedup due to DECOR and DCM ... 50
3.11 Effect of cutoff frequency (filter order = 40, coefficient precision = 16 bits, data precision = 16 bits, α = −1, β = 1, m = 1) ... 53
3.12 Reduction in Transition Activity due to DECOR (coefficient precision = 16 bits, data precision = 16 bits, β = 1, m = 1) ... 53
3.13 Reduction in Transition Activity with m for an FIR filter (cutoff = 10, filter order = 40, coefficient precision = 16 bits, data precision = 16 bits, α = −1, β = 1, m = 1) ... 54
3.14 Effect of m for an IIR filter (cutoff = 5, filter order = 15, coefficient precision = 16 bits, data precision = 16 bits, m = 1) ... 54
3.15 Reduction in Transition Activity with difference bit-width (cutoff = 6, filter order = 40, coefficient precision = 16 bits, α = −1, β = 1, m = 1) ... 55
3.16 Signal to Quantization Noise Ratio (SQNR) (cutoff = 6, filter order = 40, coefficient precision = 16 bits, α = −1, β = 1, m = 1) ... 55
3.17 Effect of passband width (filter order = 40, coefficient precision = 8 bits, data precision = 17 bits, α = −1, β = 1, m = 1) ... 56
3.18 Effect of filter order (cutoff = 0.1, coefficient precision = 8 bits, data precision = 17 bits, α = −1, β = 1, m = 1) ... 56
3.19 Effect of coefficient precision (cutoff = 0.1, filter order = 40, data precision = 17 bits, α = −1, β = 1, m = 1) ... 57
3.20 Effect of order of difference m (cutoff = 0.1, filter order = 40, coefficient precision = 8 bits, data precision = 17 bits, α = −1, β = 1) ... 57
3.21 Effect of data precision (cutoff = 0.1, filter order = 40, coefficient precision = 8 bits, α = −1, β = 1, m = 1) ... 58
3.22 DA-based implementation of a 4-tap FIR filter ... 58
3.23 Transition activity versus bit position ... 59
3.24 Variation in probability with transitions in an address ... 60
3.25 Multiple memory architecture ... 61
3.26 Memory select logic ... 62
3.27 Pr1 vs. S1 ... 62
3.28 Memory bypass architecture to reduce memory accesses ... 63
3.29 Power savings vs. S1 ... 65
4.1  A plot of the function H(·) ... 69
4.2  Lower and Upper bounds on Transition Activity versus R ... 72
4.3  Lower bound on Transition Activity versus R ... 72
4.4  Lower bound on power-delay versus R for given H ... 77
4.5  Transition Activity versus Block Size for i.i.d. source ... 79
4.6  Transition Activity versus Block Size for Markov Process ... 80
5.1  A Generic Communication System ... 84
5.2  A Generic Communication System for a Noiseless Channel ... 85
5.3  A Practical Communication System for a Noiseless Channel ... 85
5.4  Linear Prediction Configuration ... 86
5.5  Framework for Low Power Encoder and Decoder ... 87
5.6  Encoder and Decoder can share hardware when f1 is xor and F is Identity ... 88
5.7  dbm function ... 89
5.8  Example of Difference-Based Mapping (dbm) ... 89
5.9  Probability distribution for V2 data before and after applying xor and dbm ... 91
5.10 The function inv(e(n)) ... 92
5.11 The algorithm to implement vbm(a) ... 93
5.12 Encoder and Decoder for inc-xor ... 95
5.13 Analytical Estimate of Transition Activity for Unsigned, Gray, T0, and inc-xor coding schemes ... 96
5.14 Encoder and Decoder for xor-pbm ... 101
5.15 Power dissipation for Bus-Invert and xor-pbm ... 101
6.1  Standard Dual-Output Domino Logic AND2 Gate ... 104
6.2  Example circuit ... 105
6.3  Clock-generating (CG) domino ... 106
6.4  CG domino when gate Q has low fanout ... 106
6.5  Circuit layout using standard dual-output domino ... 108
6.6  Circuit layout using CG domino ... 108
6.7  Timing diagram for circuit using CG domino ... 109
6.8  Synthesis of CG domino circuits ... 110
A.1  Relation between B_i, B_prev(i), and C_i ... 120

Chapter 1

INTRODUCTION

Power dissipation has become a critical design concern in recent years, driven by the emergence of mobile applications. Reliability concerns and packaging costs have made power optimization relevant even for tethered applications. As system designers strive to integrate multiple systems on-chip, power dissipation has become an equally important parameter that needs to be optimized along with area and speed. The main source of power dissipation in a CMOS logic gate is a transition of its output, and the resulting dynamic power is given by

$$P_{dyn} = \frac{1}{2}\, t\, C_{out}\, V_{dd}^2\, f, \qquad (1.1)$$

where t is the average number of times that the output of the gate toggles during a clock cycle, V_dd is the supply voltage, f is the clock frequency, and C_out is the physical capacitance at the output of the gate.

Digital Signal Processing (DSP) systems are used in a wide variety of multimedia, wireless, and portable systems where low-power dissipation is an important goal. In these systems, there is significant power dissipation in the processing units (e.g., the CPU), the memory, and the I/O subsystems (e.g., busses). Hence, techniques are required to reduce power dissipation in all of the above.

In order to develop power reduction techniques, it is important to have tools to estimate the power dissipation of competing designs. At the logic and circuit levels, techniques such as the ones proposed in [1, 2, 3, 4, 5, 6, 7] exist for power estimation. While these techniques provide relatively accurate estimates of power dissipation, they require a gate- or transistor-level description of the circuit. Therefore, such techniques are applicable only once the design has reached a substantial degree of maturity. Hence there is a need for power estimation at a higher level,

which in our case is the architectural level. In the present context, an architectural description refers to the register-transfer level (RTL) model of the system. Architectural-level power estimation tools [8] allow the system designer to choose between competing architectures, and also permit major design changes at the level where it is easiest to make them.

Once power estimation tools are available, the next step is the development of power reduction techniques. For these techniques to be effective, they have to target those parts of a system that dissipate the most power. In an IC, a significant portion of the power dissipation occurs at the I/O pads. For instance, the power dissipated at the I/O pads of an IC ranges from 10% to 80% of the total power dissipation, with a typical value of 50% for circuits optimized for low power [9]. Power dissipation on these busses occurs mainly during signal transitions, and reducing them reduces the total power dissipation. In addition to the I/O pads, power is dissipated in other sub-systems of an IC, for instance, the memory, ALUs, and state machines. Since ICs are diverse, there are many techniques to reduce power dissipation, depending, in part, on the application that the IC is targeted at.

We have so far outlined important problems in designing low-power static CMOS VLSI circuits. In recent years, dynamic CMOS circuits have become more popular because of their speed and area advantage over static CMOS circuits. Hence, designers need techniques to estimate and reduce power in dynamic CMOS as well as static CMOS.
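Before proceeding, a quick numerical illustration of (1.1). The sketch below is our own; the operating-point numbers are hypothetical and not drawn from any particular process:

```python
def dynamic_power(t, c_out, v_dd, f):
    """Average dynamic power of a CMOS node per (1.1):
    P_dyn = 0.5 * t * C_out * V_dd^2 * f, where t is the average
    number of output toggles per clock cycle."""
    return 0.5 * t * c_out * v_dd ** 2 * f

# Hypothetical node: 100 fF, toggling 0.25 times/cycle, 3.3 V, 100 MHz.
p = dynamic_power(t=0.25, c_out=100e-15, v_dd=3.3, f=100e6)
print(f"P_dyn = {p * 1e6:.2f} uW")
# Halving t halves P_dyn; lowering V_dd helps quadratically.
```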

1.1 Thesis Outline and Contribution

In this thesis, we describe techniques for high-level power estimation and power reduction with a focus on DSP systems. First, in Chapter 2, we focus on estimating the power dissipation in digital filters. We present a novel methodology to determine the average number of transitions in a signal from its word-level statistical description. The proposed methodology employs: 1.) high-level signal statistics, 2.) a statistical signal generation model, and 3.) the signal encoding (or number representation) to estimate the transition activity for that signal. The proposed method is employed in the estimation of transition activity in DSP hardware.

In Chapter 3, we move from estimating the power dissipation in digital filters to developing two power reduction techniques for filters. The first technique, termed DECOR transformations, reduces the power dissipation in the arithmetic units of digital filters by coding the filter coefficients and/or the input. The transfer function and/or the input is decorrelated such that fewer bits are required to represent the coefficients and inputs. Thus the size of the arithmetic units in the filter is reduced, thereby reducing the power dissipation. The DECOR transform is suited for narrow-band filters because there is significant correlation between adjacent coefficients. The second technique reduces the power dissipation in the memory of filters implemented using Distributed Arithmetic (DA). In a DA architecture, a memory is employed to store linear combinations of coefficients. The probability distribution of addresses to the memory is usually not uniform because of temporal correlation in the input. We present a rule governing this probability distribution and use it to partition the memory such that the most frequently accessed locations are stored in the smallest memory. Power dissipation is reduced because accesses to smaller memories dissipate less power.

In Chapter 4 and Chapter 5, we shift our focus from the processing unit and the memory to the I/O subsystem. More specifically, we concentrate on busses that carry data between the processing units. Transitions on high-capacitance busses result in considerable system power dissipation. Therefore, various coding schemes have been proposed in the literature to encode the input signal in order to reduce the number of transitions (a minimal example of one such scheme is sketched at the end of this section). In Chapter 4, we present fundamental bounds on the activity reduction capability of any encoding scheme for a given source. The fundamental bounds are obtained via an information-theoretic approach, where a signal x(n) with entropy rate H is coded with R bits per sample on average. In Chapter 5 we present practical novel encoding schemes that reduce the transition activity on busses. The encoding schemes are developed via a communication-theoretic approach, whereby a data source is passed through a decorrelating function followed by a variant of an entropy coding function which reduces the transition activity.

In Chapters 2, 3, 4, and 5 we assume that static CMOS logic is used in VLSI circuits, and hence concentrate on reducing transition activity. However, dynamic logic circuits are often used due to their speed and area advantage over static CMOS. In Chapter 6 we present our work, which focuses on reducing power dissipation in dynamic CMOS circuits. We present an optimization technique, termed clock-generating (CG) domino, for dual-output domino logic that reduces area, clock load, and power without increasing the delay. A delayed clock, generated from certain dual-output gates, is used to convert other dual-output gates to single-output gates. Simulation results with ISCAS 85 benchmark circuits indicate an average reduction in area, clock load, and power of 17%, 20%, and 24%, respectively, over dual-output domino, and a 48% power reduction for the largest circuit.
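As a small preview of the bus-coding theme of Chapters 4 and 5, the sketch below implements one well-known scheme from that literature, bus-invert coding (used as a comparison point in Chapter 5). The implementation and test data are our own illustration, not the encoding schemes developed in this thesis:

```python
import random

def transitions(words, width=8):
    """Total bit transitions on a bus carrying `words` in sequence."""
    total, prev = 0, 0
    for w in words:
        total += bin((w ^ prev) & ((1 << width) - 1)).count("1")
        prev = w
    return total

def bus_invert(words, width=8):
    """Bus-invert coding (simple textbook form): if more than half the
    bus lines would toggle, transmit the complement and raise an extra
    invert line. Returns the (bus_value, invert_bit) sequence."""
    mask = (1 << width) - 1
    out, prev = [], 0
    for w in words:
        if bin((w ^ prev) & mask).count("1") > width // 2:
            out.append(((~w) & mask, 1))
        else:
            out.append((w & mask, 0))
        prev = out[-1][0]
    return out

random.seed(0)
data = [random.getrandbits(8) for _ in range(10000)]
coded = bus_invert(data)
bus_t = transitions([v for v, _ in coded])
inv_t = transitions([b for _, b in coded], width=1)
print("unencoded :", transitions(data))
print("bus-invert:", bus_t + inv_t)
```

By construction, the scheme never causes more than width/2 data-line transitions per transfer, at the cost of one redundant line; the savings are modest for random data and larger for bursty data.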


Chapter 2

POWER ESTIMATION IN DIGITAL FILTERS

Power dissipation has become a critical design concern in recent years, driven by the emergence of mobile applications. Reliability concerns and packaging costs have made power optimization relevant even for tethered applications. As system designers strive to integrate multiple systems on-chip, power dissipation has become an equally important parameter that needs to be optimized along with area and speed. Therefore, extensive research into various aspects of low-power system design is presently being conducted. We may classify this research into: 1.) power reduction techniques [10, 11, 12]; 2.) low-power synthesis techniques [13, 14, 15]; 3.) power estimation [16, 7]; and 4.) fundamental limits on power dissipation [17, 18]. While the work presented in this chapter focuses on 3.), our eventual objective is to enable 2.).

Power reduction techniques form an integral part of low-power VLSI system design and are presently an active area of research [10, 11, 12]. These techniques have been proposed at all levels of the design hierarchy, beginning with algorithms and architectures and ending with circuits and technological innovations. Existing techniques include those at the algorithmic level (such as reduced-complexity algorithms [10]), the architectural level (such as pipelining [19, 20] and parallel processing), the logic level (logic minimization [15] and precomputation [21]), the circuit level (reduced voltage swing [22], adiabatic logic [23]), and the technological level [24]. It is now well recognized that an astute algorithmic and architectural design can have a large impact on the final power dissipation characteristics of the fabricated VLSI solution. Therefore, there is a great need for techniques which allow the evaluation of different architectures from the viewpoint of power dissipation and which can accurately estimate their power dissipation.

Power dissipation in CMOS VLSI circuits is a direct function of the number of signal transitions occurring at the capacitive nodes present in them. The terms switching activity, transition probability [16], transition density [5], and transition activity [25] have been proposed in the past to provide a measure of the number of signal transitions. Switching activity and transition probability indicate the average number of transitions at a node per clock cycle. The term transition density refers to the average number of transitions per unit time. Transition activity has been employed in [25] to indicate the average number of transitions in a clock cycle present in a bit of a signal word, in a word, and within a module. Here, we will employ the terminology transition activity as in [25] without any ambiguity.

At the logic and circuit levels, techniques such as [1, 2, 3, 4, 5, 6, 7] exist for power estimation. While these techniques provide accurate estimates of power dissipation, they require a gate- or transistor-level description of the circuit. Therefore, such techniques are applicable once the design has reached a substantial degree of maturity. Our interest in this chapter is to enable power estimation at a higher level, which in this case is the architectural level. In the present context, an architectural description refers to the register-transfer level (RTL) model of the system. Architectural-level power estimation tools will allow the system designer to choose between competing architectures and also permit major design changes when it is easiest to do so.

While a large amount of work has been done at the circuit and logic levels, not much work has been done for power estimation at the architectural level. In [8], a technique based upon the concept of entropy was presented for estimating the average transition density inside a combinational circuit. This technique employs the Boolean relationship between its input and output. The closest approach to our work, however, is the Dual Bit Type (DBT) model described in [25], where a word-level signal is broken up into: 1.) uncorrelated data bits, 2.) correlated data bits, and 3.) sign bits. The uncorrelated data bits are from the least significant bit (LSB) up to a certain break-point BP0, with a fixed transition activity. The transition activities of the sign bits, which are from the most significant bit (MSB) to another break-point BP1, are measured by an RTL simulation. A linear model is then employed for the switching activity of the correlated data bits, which lie between the sign bits and the uncorrelated data bits. Empirical equations defining BP0 and BP1 in terms of word-level statistics such as the mean (μ), variance (σ²), and autocorrelation (ρ) were also presented.

Our approach considers the same problem as [25] in that we present a methodology for estimating the average number of transitions in a signal from its word-level statistical description. However, unlike [25], where the estimation of transition activity is based on simulation, the proposed methodology is analytical, requiring: 1.) high-level signal statistics, 2.) a statistical signal generation model, and 3.) the signal encoding (or number representation) to estimate the transition activity for that signal. Therefore, the two novel features of the proposed method are: 1.) it is a completely analytical approach, and 2.) its computational complexity is independent of the length (i.e., number of samples) of the signal. Both of these features distinguish the proposed approach from most existing techniques for estimating signal transition activity. While [25] also estimates power dissipation by characterizing input capacitance, we focus only on the estimation of transition activity.

We first derive a new relation between the bit-level transition activity (t_i), the bit-level probability (p_i), and the bit-level autocorrelation (ρ_i) for a single-bit signal b_i. Then, we present two methods, the first exact but computationally expensive and the second fast but approximate, to estimate the word-level transition activity, T, employing word-level signal statistics (namely μ, σ, and ρ), signal generation models (such as auto-regressive (AR), moving-average (MA), and auto-regressive moving-average (ARMA) models), along with a certain number representation (such as unsigned, sign-magnitude, one's complement, or two's complement). In the approximate method, we divide a word into three regions based on the temporal correlations, unlike [25], where a word is divided into three regions based on the transition activities. Such an approach enables us to estimate the transition activity analytically. The approximate method also uses different and more accurate formulae than the ones in [25] for estimating the break-points BP0 and BP1. Proceeding further, we describe the propagation of the input statistics through commonly used digital signal processing (DSP) blocks such as adders, multipliers, multiplexers, and delays. The effect of the folding transformation [26] on signal statistics is also studied. The word-level transition activities of all the signals in a system composed of these DSP blocks are determined; these are then summed up to determine the total transition activity for the filter.

Even though we focus upon architectural-level power estimation in this chapter, we believe that the work presented here will lead to a formal procedure for the synthesis of low-power DSP hardware. The transition activities estimated at the inputs and outputs of blocks such as adders, multipliers, multiplexers, and delays can be used to estimate the power dissipation within each block using a power macro-model [27].

This chapter is organized as follows. In section 2.1, we present some preliminaries and summarize existing results. Determining the word-level transition activity T from word-level signal properties is described in section 2.2. In section 2.3, we compute the transition activity for various filter structures, and in section 2.4 we present simulation results for audio, video, and communication system signals and filters. The results of this chapter have appeared in [28, 29, 30].

2.1 Preliminaries

In this section, we present definitions and review existing results that will be employed in later sections. First, we define the word-level quantities: the mean (μ), the variance (σ²), and the temporal correlation (ρ). Next, we consider bit-level quantities: the probability p_i of the ith bit b_i being equal to 1, the bit-level temporal correlation ρ_i, and the bit-level transition activity t_i. Finally, the structures of the AR, MA, and ARMA models are described.

2.1.1 Word and Bit-Level Quantities

Let x(n) be a B-bit word signal given by

$$x(n) = \sum_{i=0}^{B-1} c_i\, b_i(n), \qquad (2.1)$$

where b_i(n) ∈ {0, 1} represents the ith bit, the c_i are the weights, and n is the time index. For example, in the case of the unsigned number representation we have c_i = 2^i. For x(n) in (2.1), the mean μ, i.e., the average (or expected value) of x(n), is defined as

$$\mu = E[x(n)] = \sum_{\forall k \in X} k\, \Pr(x(n) = k), \qquad (2.2)$$

where the elements of the set X are the values that x(n) can assume, and Pr(A) is the probability that event A occurs. Note that the elements of the set X are a function of the signal encoding, or the number representation. Similarly, the variance σ² of x(n) is given by

$$\sigma^2 = E[(x(n) - \mu)^2] = E[x^2(n)] - \mu^2. \qquad (2.3)$$

The variance σ² is also referred to as the signal power. The lag-i temporal correlation ρ(i) of x(n) is defined as

$$\rho(i) = \frac{E[(x(n) - \mu)(x(n-i) - \mu)]}{E[(x(n) - \mu)^2]} = \frac{E[x(n)\,x(n-i)] - \mu^2}{\sigma^2}. \qquad (2.4)$$

In this chapter, we are interested mainly in ρ(1) and therefore denote it via the simplified notation ρ. We now consider the ith bit b_i of a word-level signal x(n) defined in (2.1). Let p_i be the probability that b_i(n) is 1, i.e., p_i = Pr(b_i(n) = 1) = E[b_i(n)]. If X_i is the set of all elements in X such that the ith bit is 1, then

$$p_i = \Pr(x(n) \in X_i) \qquad (2.5)$$
$$\phantom{p_i} = \sum_{\forall j \in X_i} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(j-\mu)^2}{2\sigma^2}} \quad \text{(assuming a normal distribution)}. \qquad (2.6)$$

Clearly, the value of p_i depends on the statistical distribution of the values in X. While we have provided the example of a normal distribution here, there is no restriction on the distribution itself. Note that the probability distribution of x(n) can either be estimated or obtained from knowledge of the parameters of the signal generation models discussed in subsection 2.1.2. However, without loss of generality, we will assume that the probability distribution of x(n) is known a priori. The temporal correlation ρ_i of the ith bit is defined as

$$\rho_i = \frac{E[(b_i(n) - p_i)(b_i(n-1) - p_i)]}{E[(b_i(n) - p_i)^2]} = \frac{E[b_i(n)\, b_i(n-1)] - p_i^2}{p_i - p_i^2}. \qquad (2.7)$$

If p_i = 1 or p_i = 0, then ρ_i is defined to be 1. The transition activity (or transition probability [16]) t_i of the ith bit is defined as

$$t_i = \Pr(b_i(n) = 0 \text{ and } b_i(n-1) = 1) + \Pr(b_i(n) = 1 \text{ and } b_i(n-1) = 0). \qquad (2.8)$$

If the bits b_i(n) and b_i(n−1) are independent, then the transition activity is given by [16]

$$t_i = 2 p_i (1 - p_i). \qquad (2.9)$$

In section 2.2, we will derive an equation relating the transition activity t_i and the correlation ρ_i.

Finally, we define the word-level transition activity, T, as

$$T = \sum_{i=0}^{B-1} t_i. \qquad (2.10)$$

In section 2.2, we will show how to compute t_i and then employ (2.10) to compute T.
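The definitions above translate directly into measurement code. The following sketch is a minimal NumPy illustration with variable names of our own choosing; it assumes two's complement encoding, and measures μ, σ², ρ, and the bit-level quantities p_i, ρ_i, and t_i of (2.2)-(2.8) from a sample stream, summing the t_i into T per (2.10):

```python
import numpy as np

def bits(x, B):
    """b[i, n] = i-th bit of x(n) in B-bit two's complement."""
    u = np.asarray(x, dtype=np.int64) & ((1 << B) - 1)
    return (u[None, :] >> np.arange(B)[:, None]) & 1

def signal_stats(x, B):
    x = np.asarray(x, dtype=np.float64)
    mu = x.mean()                                       # (2.2)
    var = x.var()                                       # (2.3)
    rho = ((x[1:] - mu) * (x[:-1] - mu)).mean() / var   # (2.4), lag 1
    b = bits(x, B).astype(np.float64)
    p = b.mean(axis=1)                                  # p_i = E[b_i(n)]
    e11 = (b[:, 1:] * b[:, :-1]).mean(axis=1)           # E[b_i(n) b_i(n-1)]
    denom = p - p * p
    safe = np.where(denom > 0, denom, 1.0)
    rho_i = np.where(denom > 0, (e11 - p * p) / safe, 1.0)  # (2.7)
    t_i = np.abs(np.diff(b, axis=1)).mean(axis=1)       # (2.8), measured
    return mu, var, rho, p, rho_i, t_i, t_i.sum()       # T per (2.10)

# Example: an AR(1) stream like SIG2 of Table 2.1, x(n) = w(n) + 0.99 x(n-1)
rng = np.random.default_rng(0)
x, prev = [], 0.0
for w in rng.normal(0, 141, 100000):
    prev = w + 0.99 * prev
    x.append(round(prev))
mu, var, rho, p, rho_i, t_i, T = signal_stats(x, B=16)
print(f"rho = {rho:.3f}, T = {T:.2f}")  # T should come out near the ~5 of Table 2.4
```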

2.1.2 Signal Generation Models

As mentioned in the previous section, we employ ARMA signal generation models to calculate transition activity. These signal models are commonly employed to represent stationary signals in general and have found widespread application in speech [31] and video coding [32]. Furthermore, signals obtained from sources such as speech, audio, and video can be modeled employing ARMA models. An (N, M)-order auto-regressive moving-average model (ARMA(N, M)) can be represented as

$$x(n) = \sum_{i=0}^{N} d_i\, \omega(n-i) + \sum_{i=1}^{M} a_i\, x(n-i), \qquad (2.11)$$

where the signal ω(n) is a white (uncorrelated) noise source with zero mean, and x(n) is the signal being generated. If a given signal source, such as speech, needs to be modeled via (2.11), then we can choose the coefficients a_i and d_i to minimize a certain error measure (such as the mean-squared error) between x(n) and the given source. In that case, we say that x(n) represents the given signal source. As mentioned in the previous subsection, if the a_i's and d_i's in (2.11) are known, along with the distribution of ω(n), then we can obtain the probability distribution of x(n). The model in (2.11) is an infinite impulse response (IIR) filter with coefficients a_i and d_i, with zero-mean white noise as the input. It is also possible to transform this IIR model into one that depends only on the inputs, as shown below:

$$x(n) = \sum_{i=0}^{\infty} h_i\, \omega(n-i), \qquad (2.12)$$

where the h_i can be computed according to the following recursion,

$$h_k = d_k + \sum_{i=1}^{M} a_i\, h_{k-i}, \qquad (2.13)$$

where h_k = 0 for k < 0, d_k = 0 for k > N, and h_0 = d_0. Finally, AR and MA models are special cases of ARMA models. An Mth-order auto-regressive (AR(M)) signal model is identical to an ARMA(0, M) model. Also, an Nth-order moving-average (MA(N)) signal model is the same as an ARMA(N, 0) model.

In proving Theorem 1 in section 2.2, we will also employ the following result from [2]:

Lemma 1: $E[b_i(n)\, b_i(n-1)] = p_i - \frac{t_i}{2}$.
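A short sketch of the signal generation models (our own illustration, writing ω for the white noise source): it generates an ARMA(N, M) stream per (2.11) and computes the equivalent MA coefficients h_k by the recursion (2.13):

```python
import numpy as np

def arma(d, a, n_samples, sigma_w=1.0, seed=0):
    """Generate x(n) = sum_i d_i w(n-i) + sum_i a_i x(n-i), per (2.11).
    d = [d_0..d_N] (MA part), a = [a_1..a_M] (AR part)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, sigma_w, n_samples)
    x = np.zeros(n_samples)
    for n in range(n_samples):
        ma = sum(di * w[n - i] for i, di in enumerate(d) if n - i >= 0)
        ar = sum(a[i - 1] * x[n - i] for i in range(1, len(a) + 1) if n - i >= 0)
        x[n] = ma + ar
    return x

def ma_equivalent(d, a, K):
    """First K coefficients h_k of the pure-MA form (2.12), via the
    recursion h_k = d_k + sum_{i=1..M} a_i h_{k-i} of (2.13)."""
    h = [0.0] * K
    for k in range(K):
        dk = d[k] if k < len(d) else 0.0
        h[k] = dk + sum(a[i - 1] * h[k - i]
                        for i in range(1, len(a) + 1) if k - i >= 0)
    return h

# AR(1) with a_1 = 0.99 (the SIG2 model of Table 2.1): h_k = 0.99**k
print(ma_equivalent([1.0], [0.99], 5))  # [1.0, 0.99, 0.9801, ...]
```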

2.2 Word-Level Signal Transition Activity

In this section, we present techniques for estimating the word-level transition activity, T, of a signal x(n) from its word-level statistics. We first present a theorem relating the bit-level quantities: the transition activity t_i, the probability p_i, and the temporal correlation ρ_i. Next, two techniques for estimating ρ_i are presented. The first is referred to as the exact method, whereby ρ_i is explicitly determined for the B bits i = 0, …, B−1 in x(n). The second is called the approximate method, in which the break-points BP0 and BP1 (as defined in [25]) are determined from an ARMA model of the signal. Simulation results are provided in support of the theory.

2.2.1 Transition Activity For Single-bit Signals

For single-bit signals, we have the expression (2.9) [16] for independent bits b_i(n) and b_i(n−1). In this subsection, we present a more general result, which is also applicable when the temporal correlation between b_i(n) and b_i(n−1) (i.e., ρ_i) is not zero. This result is presented as Theorem 1:

Theorem 1: If the ith bit b_i has a probability p_i of being a `1' and has a temporal correlation of ρ_i, then its transition activity t_i is given by

$$t_i = 2 p_i (1 - p_i)(1 - \rho_i). \qquad (2.14)$$

Proof: From the definition of ρ_i in (2.7), we have

$$\rho_i = \frac{E[b_i(n)\, b_i(n-1)] - p_i^2}{p_i - p_i^2}. \qquad (2.15)$$

Substituting for E[b_i(n) b_i(n−1)] from Lemma 1 into (2.15) and solving for t_i, we get

$$t_i = 2 p_i (1 - p_i)(1 - \rho_i), \qquad (2.16)$$

which is the desired result. □

Note that substituting ρ_i = 0 (corresponding to the case of uncorrelated bits) into (2.14) reduces it to (2.9). In the subsequent subsections, we present two methods (the exact and approximate methods) for calculating ρ_i from word-level statistics. These values are then substituted into (2.14) to obtain t_i.
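Theorem 1 is easy to check numerically. The sketch below (our own illustration) synthesizes a single-bit stream with a chosen p_i and ρ_i using a two-state Markov chain and compares the measured transition activity against (2.14):

```python
import random

def markov_bits(p, rho, n, seed=0):
    """Two-state Markov bit stream with stationary P(b=1) = p and
    lag-1 correlation rho: Pr(1->1) = p + rho(1-p), Pr(0->1) = p(1-rho)."""
    rng = random.Random(seed)
    b, out = (1 if rng.random() < p else 0), []
    for _ in range(n):
        pr1 = p + rho * (1 - p) if b else p * (1 - rho)
        b = 1 if rng.random() < pr1 else 0
        out.append(b)
    return out

p, rho, n = 0.3, 0.6, 200000
stream = markov_bits(p, rho, n)
t_meas = sum(stream[i] != stream[i - 1] for i in range(1, n)) / (n - 1)
t_thm = 2 * p * (1 - p) * (1 - rho)          # (2.14)
print(f"measured t = {t_meas:.4f}, theorem = {t_thm:.4f}")  # should agree closely
```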

2.2.2 Estimation of ρ_i: The Exact Method

From (2.7), we see that it is necessary to compute p_i and E[b_i(n) b_i(n−1)] in order to estimate ρ_i. As p_i can be obtained from the probability distribution function of x(n), we now focus upon E[b_i(n) b_i(n−1)], which is given by (recall that X_i is the set of all elements in X such that the ith bit is a `1')

$$E[b_i(n)\, b_i(n-1)] = \Pr(b_i(n) = 1 \text{ and } b_i(n-1) = 1) = \Pr(x(n) \in X_i \text{ and } x(n-1) \in X_i). \qquad (2.17)$$

In particular, we employ AR(1) and MA(N) signal models to estimate E[b_i(n) b_i(n−1)]. First, we present the following result for an AR(1) model.

Theorem 2: For an AR(1) signal,

$$E[b_i(n)\, b_i(n-1)] = \sum_{\forall j \in X_i} \Pr(x(n-1) = j) \sum_{\forall k \in X_i} \Pr(\omega(n) = k - a_1 j). \qquad (2.18)$$

Proof: From the definition of E[b_i(n) b_i(n−1)] in (2.17), we have

$$E[b_i(n)\, b_i(n-1)] = \sum_{\forall j \in X_i} \sum_{\forall k \in X_i} \Pr(x(n) = k \text{ and } x(n-1) = j). \qquad (2.19)$$

Substituting the expression for an AR(1) model (obtained by setting N = 0, M = 1, and d_0 = 1 in (2.11)) into (2.19), we obtain

$$\begin{aligned}
E[b_i(n)\, b_i(n-1)] &= \sum_{\forall j \in X_i} \sum_{\forall k \in X_i} \Pr(\omega(n) + a_1 x(n-1) = k \text{ and } x(n-1) = j) \\
&= \sum_{\forall j \in X_i} \sum_{\forall k \in X_i} \Pr(\omega(n) + a_1 j = k \text{ and } x(n-1) = j) \\
&= \sum_{\forall j \in X_i} \sum_{\forall k \in X_i} \Pr(\omega(n) + a_1 j = k)\, \Pr(x(n-1) = j),
\end{aligned} \qquad (2.20)$$

where the last step is justified because ω(n) and x(n−1) are independent. Note that (2.18) can now be obtained by a simple rewriting of (2.20). Furthermore, each of the summations in (2.18) can be evaluated from knowledge of the probability distribution function. □

In order to support Theorem 2 experimentally, we compared the measured values of t_i and ρ_i for data generated by the AR(1) signal SIG2 in Table 2.1 with the values predicted by the theorem. The results, shown in Figure 2.1, indicate that the measured and theoretical values match very well; for the word-level transition activity T, a total error of less than 1% was obtained. Similar results were obtained for the other signals in Table 2.1. The signals in Table 2.1 were chosen because they represent a wide variety of signals. The signals SIG1 and SIG2 are based on AR(1) models with negative and positive correlations, respectively. The signal SIG3 is based on an MA(1) model, and the signal SIG4 is identical to SIG2 except for its mean. The signal SIG5 is derived from an ARMA(3, 5) model.

Figure 2.1 Measured and Theoretical t_i and ρ_i vs. bit for the AR(1) signal, SIG2 (plot: bit position on the x-axis; temporal correlation and transition activity on the y-axis)

We now consider an MA(1) process and present the following result.

Table 2.1 Signal details

Signal  x(n)                                                  σ_ω   σ     ρ      μ
SIG1    ω(n) − 0.5x(n−1)                                      866   1000  −0.50  0
SIG2    ω(n) + 0.99x(n−1)                                     141   1000  0.99   0
SIG3    ω(n) + 0.5ω(n−1)                                      100   111   0.40   0
SIG4    ω(n) + 0.99x(n−1)                                     141   1000  0.99   16384
SIG5    ω(n) + 0.4ω(n−1) + 0.2ω(n−2) + 0.07ω(n−3)             1000  2309  0.89   0
        + 0.5x(n−1) + 0.3x(n−2) + 0.1x(n−3)
        + 0.05x(n−4) − 0.2x(n−5)

(σ_ω denotes the standard deviation of the white noise source ω(n).)

Theorem 3: For an MA(1) signal x(n) = ω(n) + b_1 ω(n−1), let j, k, l range over the values of ω(n), ω(n−1), and ω(n−2) such that j + b_1 k ∈ X_i and k + b_1 l ∈ X_i. Then

$$E[b_i(n)\, b_i(n-1)] = \sum_{j} \sum_{k} \sum_{l} \Pr(\omega(n) = j)\, \Pr(\omega(n-1) = k)\, \Pr(\omega(n-2) = l). \qquad (2.21)$$

Proof: Employing the expression for an MA(1) signal, obtained by substituting N = 1 and M = 0 into (2.11), we get

$$\begin{aligned}
E[b_i(n)\, b_i(n-1)] &= \Pr(\omega(n), \omega(n-1), \omega(n-2) : x(n) \in X_i \text{ and } x(n-1) \in X_i) \\
&= \Pr(\omega(n), \omega(n-1), \omega(n-2) : \omega(n) + b_1\omega(n-1) \in X_i \text{ and } \omega(n-1) + b_1\omega(n-2) \in X_i).
\end{aligned} \qquad (2.22)$$

If ω(n) = j, ω(n−1) = k, and ω(n−2) = l, then we can write (2.22) as follows:

$$\begin{aligned}
E[b_i(n)\, b_i(n-1)] &= \Pr(\omega(n) = j \text{ and } \omega(n-1) = k \text{ and } \omega(n-2) = l : j + b_1 k \in X_i \text{ and } k + b_1 l \in X_i) \\
&= \sum_{j} \sum_{k} \sum_{l} \Pr(\omega(n) = j)\, \Pr(\omega(n-1) = k)\, \Pr(\omega(n-2) = l),
\end{aligned} \qquad (2.23)$$

where j + b_1 k ∈ X_i and k + b_1 l ∈ X_i, which is the desired result. □

0.6 ’measured transition activity’ ’theoretical transition activity’ ’measured temporal corrrelation’ ’theoretical temporal correlation’

Temporal correlation & Transition activity

0.5

0.4

0.3

0.2

0.1

0

-0.1 0

2

4

6

8 Bit

10

12

14

16

Figure 2.2 Measured and Theoretical ti and i vs. bit for the MA(1) signal SIG3 an MA(2) signal x(n) = (n) + b1 (n ? 1) + b2 (n ? 2), the quantity E [bi(n)bi(n ? 1)] is given by, E[bi (n)bi (n ? 1)] =

XXXX

j

k

l m

Pr( (n) = j) Pr( (n ? 1) = k) Pr( (n ? 2) = l) Pr( (n ? 3) = m);

where j; k; l; m : j + b1k + b2 l 2 Xi and k + b1l + b2m 2 Xi . It can be checked that E [bi(n)bi(n ? 1)] for AR(M ) and ARMA(N; M ) signals is dicult to calculate for M > 1 because we need to compute the joint probability distribution function of x(n) and x(n ? 1). However, we can estimate E [bi(n)bi(n ? 1)] for an AR(M ) or an ARMA(N; M ) signal by approximating the signal with an MA(N 0) signal, where N 0 is suciently large, or approximating with an AR(1) signal.
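As an illustration of the exact method, the following sketch is our own discretized rendering of Theorem 2, with hypothetical helper names. It evaluates E[b_i(n) b_i(n−1)] for an AR(1) signal on a small two's complement grid, then forms ρ_i via (2.7) and t_i via (2.14):

```python
import math

def nprob(t, sigma):
    """Unit-bin probability from the zero-mean normal density of (2.6)."""
    return math.exp(-t * t / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def exact_bit_stats(i, a1, sigma_w, sigma_x, B=10):
    """Theorem 2 for an AR(1) signal x(n) = w(n) + a1*x(n-1), on a
    B-bit two's complement grid (kept small so the double sum is cheap)."""
    lo, hi = -(1 << (B - 1)), 1 << (B - 1)
    xi = [v for v in range(lo, hi) if (v >> i) & 1]   # X_i: values with bit i = 1
    p_i = sum(nprob(j, sigma_x) for j in xi)
    # (2.18): sum over j, k in X_i of Pr(x(n-1)=j) * Pr(w(n) = k - a1*j)
    e11 = sum(nprob(j, sigma_x) * sum(nprob(k - a1 * j, sigma_w) for k in xi)
              for j in xi)
    rho_i = (e11 - p_i * p_i) / (p_i - p_i * p_i)     # (2.7)
    t_i = 2 * p_i * (1 - p_i) * (1 - rho_i)           # (2.14)
    return p_i, rho_i, t_i

# Scaled-down SIG2-like example: a1 = 0.99, sigma_w chosen so sigma_x ~ 62
a1, sw = 0.99, 8.8
sx = sw / math.sqrt(1 - a1 * a1)
for i in (0, 5, 9):
    print(i, exact_bit_stats(i, a1, sw, sx))
# rho_i should be near 0 for bit 0 and near rho for the sign bit
```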

2.2.3 Estimation of ρ_i: The Approximate Method

In the previous subsection, an exact method for computing ρ_i (i = 0, …, B−1) was presented. For large values of B, this computation can become expensive. In order to alleviate this problem, we present a computationally efficient method to estimate ρ_i from word-level statistics. As mentioned before, this method (referred to as the approximate method) uses a model similar to that described in [25].

In Figure 2.3, we plot the temporal correlation ρ_i versus bit position i for the various audio, video, and communications channel streams described in Table 2.2. It can be seen that the temporal correlation ρ_i is approximately zero for the LSBs and close to the word-level temporal correlation ρ for the MSBs. Furthermore, there is a region in between the LSBs and MSBs where the bit-level temporal correlation ρ_i increases approximately linearly. As proposed in [25], we divide the bits in the signal word into three regions of contiguous bits, referred to as the LSB, linear, and MSB regions. The break-points BP0 and BP1 separate the LSB region from the linear region and the linear region from the MSB region, respectively. The graph of temporal correlation ρ_i versus bit position i thus has slopes of zero, non-zero, and zero in the LSB, linear, and MSB regions, respectively.

Figure 2.3 Temporal correlation versus bit (bit-level temporal correlation versus bit position for the Audio3-Audio7, ATM LAN, and Video3 data-sets)

Table 2.2 Description of data-sets

Data set  Description                                    μ         σ        ρ
Audio3    2.88MB of 16 bit PCM audio data (music)        1.4285    7349.20  0.9628
Audio4    2.88MB of 16 bit PCM audio data (music)        −17.6342  4040.40  0.9712
Audio5    0.37MB of 16 bit PCM audio data (speech)       59.4566   2661.75  0.9005
Audio6    0.61MB of 16 bit PCM audio data (speech)       23.6151   2328.79  0.9647
Audio7    2.88MB of 16 bit PCM audio data (music)        −39.3460  3086.30  0.9920
ATM LAN   0.80MB of 16 bit communications channel data   0.4861    5581.60  0.2952
Video3    9.70MB (380 QCIF frames) of 8 bit video data   99.7108   55.57    0.9199

In spite of this similarity with [25], the proposed approach differs from [25] in the following ways: 1.) the word is divided into three regions based upon the correlation and not the transition activity, 2.) the way the break-points BP0 and BP1 are computed, and 3.) our use of (2.14) to compute t_i and (2.10) to compute T analytically. In particular, we do not employ simulations to estimate the transition activity of the most significant bits.

Without loss of generality, we will assume that two's complement representation is employed. By definition, ρ_i = 0 for i < BP0. Now, let ρ_i = ρ_{BP1} for i ≥ BP1 − 1. Hence, we can make the following approximation for two's complement representation:

compute ti and (2.10) to compute T analytically. In particular, we do not employ simulations to estimate transition activity of the most signi cant bits. Without loss of generality, we will assume that two's complement representation is employed. By de nition, i = 0 for i < BP0 . Now, let i = BP for i  BP1 ? 1. Hence, we can make the following approximation for two's complement representation, 1

8 > > > >
(2.24) (BP0  i < BP1 ? 1) > > > :  (i  BP1 ? 1) BP We now examine the relation between the parameters in the set fBP , BP0 , BP1 g and those in f,  , g in order to derive expressions for BP , BP0 , and BP1 . 0

(i?BP0 +1)BP1 BP1 ?BP0 1

1

1

2.2.3.1 Calculation of BP0 For an uncorrelated signal, (n), a good estimate of BP0 is given by log2  , where  is the standard deviation of (n) [25]. If the signal x(n) has non-zero correlation, then it can be modeled using a signal model, which can then be used to calculate BP0 . For instance, if x(n) is modeled using an ARMA model then it can be expressed using (2.12). Since the signals hi (n ? i) are uncorrelated, BP0 for each of the signals can be estimated as log2 jhij . Given an adder which accepts two input signals with BP0 break-points, BP01 and BP02 respectively, a good estimate for the BP0 break-point at the output of the adder is max(BP01 , BP02 ). Hence, P the break-point BP0 for a signal x(n) = i hi (n ? i) can now be estimated as the maximum of the BP0 's of the signals hi (n ? i), as shown below,

BP0 = [log2 hmax  ]; 17

(2.25)

Table 2.3 Measured and Estimated BP0 and BP1 BP0

Signal

BP1

Measured Estimated Measured Estimated 11 10 13 13 8 7 13 13 8 7 10 9 8 7 13 13 11 10 14 14

SIG1 SIG2 SIG3 SIG4 SIG5

where hmax = max(jhij) and [k] is the integer nearest to k. We veri ed (2.25) by comparing the measured and estimated values of BP0 obtained from data generated with the ve signals shown in Table 2.1. The measured value of BP0 was obtained by counting the number of bits with correlation close to 0. For instance, from Figure 2.4, we see that BP0 for the signal SIG2 is 8 because there are 8 bits with correlation close to 0. The measured and estimated values of BP0 are shown in Table 2.3, where it can be seen that the measured and estimated values match quite well. 1

0.8

SIG1 SIG2 SIG3 SIG4 SIG5

Temporal correlation

0.6

0.4

0.2

0

-0.2

-0.4 0

2

4

6

8 Bit

10

12

Figure 2.4 Temporal correlation versus bit 18

14

16

2.2.3.2 Calculation of BP1 Let the values of x(n) lie between the values xmin and xmax . In a normal distribution, xmin =  ? 3 and xmax =  +3. We de ne BP1 such that for i  BP1 ? 1, i is approximately constant. Since the dynamic range of x(n) is xmax ? xmin , the least signi cant log2 (xmax ? xmin ) bits are required to cover this range. Hence, we have

BP1 = [log2(xmax ? xmin )]; which reduces to,

BP1 = [log2 6]

(2.26)

for a normal distribution where  is the standard deviation of x(n). The estimate for BP1 in (2.26) is di erent from that in [25], which is given in (2.27) below for comparison purposes,

BP1 = [log2 (jj + 3)]:

(2.27)

When jj  3 , both (2.26) and (2.27) are approximately equal with the maximum di erence of 1 occuring at  = 0. However, in the case where jj  3 , (2.26) is more accurate than (2.27). This is due to the fact that for jj > 3 there are 3 regions in which i is a constant. The rst region consists of the bit positions i such that i < BP0 . The second region has bit positions i lying between BP1 and another break-point BP2 . The third region consists of bits with positions beyond BP2 where the bits do not have any transitions. The bits in the third region can be calculated by computing the common most signi cant bits in the binary representations of the numbers xmax and xmin . These are the numbers which lie at the two extremes of the probability distribution. We veri ed (2.26) by comparing it with the measured values of BP1 obtained from data generated by various signals in Table 2.1. The results are shown in Table 2.3 where it can be seen that the measured and estimated values match closely. To verify that BP1 is independent of the mean , we plot the bit-level temporal correlation i and transition activity ti for signals SIG2 and SIG4 in Figure 2.5. Note that from Table 2.1, SIG2 and SIG4 are identical except for their mean . It can be seen from Figure 2.5, that the value of BP1 , 13, for SIG2 and SIG4 is independent of , which is also indicated by (2.26). For SIG4 BP2 is 15 because the binary representations of xmax , 19384, and xmin , 13384, have only 1 common most signi cant bit. 19


Figure 2.5 Temporal correlation and transition activity for SIG2 and SIG4

All that now remains in the approximate method is to estimate the value of ρ_BP1. If the model for x(n) is known, then we can use the exact method to calculate ρ_BP1. If the model for x(n) is not available, then we assume that ρ_BP1 = ρ, the word-level temporal correlation. This is because in most number representations, such as sign magnitude, two's complement, and one's complement, the most significant bits have higher weight than the least significant bits. Hence the correlation of the most significant bits will be close to the word-level correlation. This is especially valid for audio and video signals (see Figure 2.3).

2.2.4 Calculation of T

Employing (2.10), (2.14), (2.25), and (2.26), we computed the value of the word-level transition activity T for the signals described in Table 2.1 in two's complement representation. The measured and estimated word-level transition activities for all the signals are shown in Table 2.4. It can be noted that the error is less than 2% for two's complement representation.
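For reference, a compact sketch of this computation (Python; the piecewise-linear ρ_i model follows the break-point description above, while p_i = 0.5 for every bit is a simplifying assumption of ours):

    def word_transition_activity(B, bp0, bp1, rho, p=0.5):
        """Sum the bit-level activities t_i = 2 p (1 - p)(1 - rho_i); cf. eq. (2.35)."""
        T = 0.0
        for i in range(B):
            if i < bp0:
                rho_i = 0.0                                 # uncorrelated LSB region
            elif i < bp1 - 1:
                rho_i = rho * (i - bp0 + 1) / (bp1 - bp0)   # linear ramp
            else:
                rho_i = rho                                 # sign-bit region
            T += 2.0 * p * (1.0 - p) * (1.0 - rho_i)
        return T

    print(word_transition_activity(B=16, bp0=8, bp1=13, rho=0.92))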


Table 2.4 Word-level transition activity for different number representations

         Unsigned, Two's complement   One's complement            Sign magnitude
Signal   Meas.  Est.   % Error        Meas.  Est.   % Error       Meas.  Est.   % Error
SIG1     8.79   8.82   0.34           8.79   8.82   0.34          6.07   6.16   1.48
SIG2     4.99   5.03   0.80           4.99   5.03   0.80          4.65   4.74   1.94
SIG3     6.97   6.94   0.43           6.97   6.94   0.43          4.20   4.15   1.19
SIG4     4.99   5.03   0.80           4.99   5.03   0.80          4.65   4.86   4.52
SIG5     6.54   6.42   1.83           6.55   6.42   1.98          5.91   5.89   0.34

2.2.5 Effect of Signal Encoding/Number Representation

The results presented so far in this section (Theorems 2 and 3) have implicitly included the effect of the signal encoding, because the elements of the sets X and X_i depend upon the signal encoding. In this subsection, we examine explicitly the effect of number representation on the transition activity. In the previous subsections, we considered two's complement number representation. The unsigned representation has the same transition activity as two's complement because the most significant bits of the former behave identically to the sign bits of the latter. Therefore, we will not consider the unsigned representation any further. We will now analyze the one's complement and sign-magnitude representations.

2.2.5.1 One's complement

The one's complement representation is identical to two's complement for positive numbers. For negative numbers, we can generate the two's complement representation from the one's complement by adding a 1 to the LSB, which will usually affect only the LSBs. In the approximate method, since we assume that the LSBs are uncorrelated, the activity of the LSBs in one's complement will be close to that of two's complement. The remaining bits will have the same temporal correlation as in the two's complement representation. Therefore, ρ_i for one's complement representation will be the same as that for two's complement representation. The measured and estimated word-level transition activity T for the signals in Table 2.1 employing one's complement are shown in the second set of three columns in Table 2.4. The measured word-level transition activity was obtained by generating data using the signal model and measuring transition activity in that data. The error in T is less than 2% for one's complement representation.

2.2.5.2 Sign magnitude

In the sign-magnitude representation there is only one sign bit, namely the most significant bit, b_{B−1}(n). This bit will have the same temporal correlation as the sign bits in two's complement representation because the temporal correlation of the sign bit depends on the sign transitions. The bits b_i(n) for i < BP0 are uncorrelated, as in the case of two's complement. We again assume a linear model for ρ_i for BP0 ≤ i < BP1 − 1. The resulting expression for ρ_i is as follows,

ρ_i = 0                                      (i < BP0)
    = (i − BP0 + 1) ρ_BP1 / (BP1 − BP0)      (BP0 ≤ i < BP1 − 1)
    = 1                                      (BP1 − 1 ≤ i < B − 1)
    = ρ_BP1                                  (i = B − 1)
                                             (2.28)

The measured and estimated word-level transition activity, T, for the signals are shown in the last three columns of Table 2.4. As always, the measured word-level transition activity was obtained by generating data using the signal model and measuring transition activity in that data. It can be seen that the error in T is less than 2% for all the signals except SIG4, where the error is less than 5%.

2.2.5.3 Discussion

From the expressions for ρ_i in (2.24) and (2.28), we see that the temporal correlation, and hence the transition activity, for unsigned, one's complement, and two's complement representations are nearly equal. Also, the transition activity for sign magnitude is less than or equal to that of two's complement because the number of sign bits in the sign-magnitude representation (one) is less than or equal to the number of sign bits in the two's complement representation. These conclusions are supported by the results in Table 2.4, which show that the transition activities for unsigned, one's complement, and two's complement are similar, while the transition activity for sign magnitude is less than that of the other three.


2.3 Transition Activity for DSP Architectures

In the previous section, we presented techniques for estimating the word-level transition activity T of signals. In this section, we apply these techniques to compute the transition activity of DSP architectures. First, we propagate the statistics of the input signal through a given DSP architecture so that word-level statistics are obtained for each signal in the architecture. Then, we calculate the transition activity for each signal employing the techniques presented in the previous section. These are added up to obtain the total transition activity of the architecture.

2.3.1 Propagation of Word-Level Statistics

In this subsection, we propagate the input statistics to the output for the following DSP operators: (1) adder, (2) multiplier, (3) multiplexor, and (4) delay. These operators were chosen due to their widespread use in DSP algorithms. We start with the adder.

2.3.1.1 Adder

In Figure 2.6, the two signals x_i(n) (i = 1, 2) at the input to the adder have statistics μ_i, σ_i, ρ_i (i = 1, 2). The mean μ3, variance σ3², and temporal correlation ρ3 at the output of the adder are given by the following equations.

μ3 = E[x3(n)] = E[x1(n) + x2(n)] = μ1 + μ2

σ3² = E[x3²(n)] − μ3² = σ1² + σ2² + 2E[x1(n)x2(n)] − 2μ1μ2

ρ3 = (E[x3(n)x3(n − 1)] − μ3²) / σ3²
   = (E[(x1(n) + x2(n))(x1(n − 1) + x2(n − 1))] − (μ1 + μ2)²) / σ3²
   = (σ1²ρ1 + σ2²ρ2 + E[x1(n)x2(n − 1)] + E[x2(n)x1(n − 1)] − 2μ1μ2) / σ3²
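These relations are easy to sanity-check numerically. The sketch below (Python; the two independent AR(1) inputs are our own choice, so the cross-expectations reduce to μ1μ2 and the last formula simplifies accordingly):

    import numpy as np

    rng = np.random.default_rng(1)

    def ar1(a, sigma_d, n):
        d = rng.normal(0.0, sigma_d, n)
        x = np.empty(n)
        x[0] = d[0]
        for k in range(1, n):
            x[k] = a * x[k - 1] + d[k]
        return x

    N = 200000
    x1, x2 = ar1(0.9, 1.0, N), ar1(0.5, 2.0, N)   # independent inputs
    x3 = x1 + x2

    def stats(x):
        mu, var = x.mean(), x.var()
        rho = ((x[1:] - mu) * (x[:-1] - mu)).mean() / var
        return mu, var, rho

    (m1, v1, r1), (m2, v2, r2) = stats(x1), stats(x2)
    m3, v3, r3 = stats(x3)
    print(m3, m1 + m2)                            # mean adds
    print(v3, v1 + v2)                            # variances add (independent inputs)
    print(r3, (v1 * r1 + v2 * r2) / (v1 + v2))    # correlation mixes by variance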

Figure 2.6 Adder, Multiplier, Multiplexor, and Delay

If x1(n) = Σ_{i=0}^{k−1} c_i x(n − i) and x2(n) = c_k x(n − k), as in the case of an FIR filter, we have

μ3 = (Σ_{i=0}^{k} c_i) μ   (2.29)

σ3² = σ² (Σ_{i=0}^{k} c_i² + 2 Σ_{i=0}^{k−1} Σ_{j=i+1}^{k} ρ(j − i) c_i c_j)   (2.30)

ρ3 = σ² (Σ_{i=0}^{k−1} c_i c_{i+1} + Σ_{i=0}^{k} Σ_{j=i}^{k} c_i c_j ρ(j − i + 1) + Σ_{i=0}^{k−2} Σ_{j=i+2}^{k} c_i c_j ρ(j − i − 1)) / σ3²   (2.31)

where μ, σ², and ρ(·) are the mean, variance, and lag correlations of the input x(n).

2.3.1.2 Multiplier

In this subsection, we examine how to propagate word-level statistics through a multiplier. In Figure 2.6, the two signals x1(n) and x2(n) at the input to the multiplier have statistics μ1, σ1, ρ1 and μ2, σ2, ρ2, respectively. The statistics at the output of the multiplier are given by the following equations,

μ3 = E[x3(n)] = E[x1(n)x2(n)]

σ3² = E[x3²(n)] − μ3² = E[(x1(n)x2(n))(x1(n)x2(n))] − E²[x1(n)x2(n)] = E[x1²(n)x2²(n)] − E²[x1(n)x2(n)]

ρ3 = (E[x3(n)x3(n − 1)] − μ3²) / σ3² = (E[x1(n)x2(n)x1(n − 1)x2(n − 1)] − E²[x1(n)x2(n)]) / σ3²

If x2(n) is a constant c1, then μ3 = c1μ1, σ3 = |c1|σ1, and ρ3 = ρ1.

2.3.1.3 Multiplexor

When two signals x1(n) and x2(n), with statistics {μ1, σ1, ρ1} and {μ2, σ2, ρ2}, respectively, are multiplexed (Figure 2.6) by a control signal with probability p_c and correlation ρ_c, then the statistics {μ3, σ3, ρ3} of x3(n) at the output of the multiplexor are given by (assuming a 0 and a 1 on the control signal select x1(n) and x2(n), respectively),

μ3 = E[x3(n)] = (1 − p_c)μ1 + p_c μ2   (2.32)

σ3² = E[x3²(n)] − μ3²
    = E[(1 − p_c)x1²(n) + p_c x2²(n)] − (1 − p_c)²μ1² − p_c²μ2² − 2p_c(1 − p_c)μ1μ2
    = (1 − p_c)σ1² + p_c(1 − p_c)μ1² + p_c σ2² + p_c(1 − p_c)μ2² − 2p_c(1 − p_c)μ1μ2   (2.33)

ρ3 = (E[x3(n)x3(n − 1)] − μ3²) / σ3²,   (2.34)

where E[x3(n)x3(n − 1)] is given by,

E[x3(n)x3(n − 1)] = (1 − p_c)(1 − p_c + p_c ρ_c) E[x1(n − 1)x1(n)]
 + p_c(1 − p_c)(1 − ρ_c) E[x1(n − 1)x2(n)]
 + p_c(1 − p_c)(1 − ρ_c) E[x2(n − 1)x1(n)]
 + p_c(p_c − p_c ρ_c + ρ_c) E[x2(n − 1)x2(n)],

where the expectations in the above formula can be obtained from the auto-correlation and cross-correlation values of the input signals. Also, BP0 for x3(n) is the maximum of BP0 for x1 (n) and x2 (n).
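As a numerical check of (2.32) and (2.33), the sketch below (Python) multiplexes two synthetic inputs with an i.i.d. control, so ρ_c = 0; the input models and parameter values are our own illustrative choices:

    import numpy as np

    rng = np.random.default_rng(2)

    def ar1(a, sigma_d, n):
        d = rng.normal(0.0, sigma_d, n)
        x = np.empty(n)
        x[0] = d[0]
        for k in range(1, n):
            x[k] = a * x[k - 1] + d[k]
        return x

    N, pc = 400000, 0.3
    x1 = ar1(0.9, 1.0, N) + 5.0          # mean-shifted input
    x2 = ar1(0.6, 2.0, N)
    sel = rng.random(N) < pc             # i.i.d. control, rho_c = 0
    x3 = np.where(sel, x2, x1)           # 0 selects x1, 1 selects x2

    m1, v1 = x1.mean(), x1.var()
    m2, v2 = x2.mean(), x2.var()
    mu3 = (1 - pc) * m1 + pc * m2                        # eq. (2.32)
    var3 = ((1 - pc) * v1 + pc * (1 - pc) * m1**2 +
            pc * v2 + pc * (1 - pc) * m2**2 -
            2 * pc * (1 - pc) * m1 * m2)                 # eq. (2.33)
    print(x3.mean(), mu3)
    print(x3.var(), var3)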

2.3.1.4 Delay

A delay shifts the signal by one time unit, which in this case is a clock period. The statistics at the output of a delay element are identical to those at the input.

2.3.2 Example 1: FIR filter

We illustrate propagating word-level statistics using the 5-tap finite impulse response (FIR) filter in Figure 2.7, where the coefficients are c1 = c5 = 0.09765625, c2 = c4 = 0.1953125, and c3 = 0.39453125. The correlations ρ10, ρ11, ρ12, and ρ13 require that the lag-2, lag-3, lag-4, and lag-5 correlations of the input be known. If they are not available, then, for most real-life signals, the lag-i correlation can be approximated by ρ^i(1). Such an approximation corresponds to approximating the signal with an AR(1) model. The statistics of signals within the filter can be calculated using (2.29), (2.30), and (2.31). As an example, the equations for the mean, variance, and temporal correlation of the output, x13(n), are given below,

μ13 = (Σ_{i=1}^{5} c_i) μ

σ13² = σ² (Σ_{i=1}^{5} c_i² + 2 Σ_{i=1}^{4} Σ_{j=i+1}^{5} ρ(j − i) c_i c_j)

ρ13 = σ² (Σ_{i=1}^{4} c_i c_{i+1} + Σ_{i=1}^{5} Σ_{j=i}^{5} c_i c_j ρ(j − i + 1) + Σ_{i=1}^{3} Σ_{j=i+2}^{5} c_i c_j ρ(j − i − 1)) / σ13²
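The sketch below (Python) evaluates these expressions for the filter coefficients above with the video3 input statistics of Table 2.5 (μ = 99.7108, σ = 55.5663, ρ(1) = 0.9199), under the AR(1) lag approximation ρ(i) = ρ^i(1); the lag-1 output covariance is computed directly as a double sum, which is equivalent to the three-way split of (2.31). The results land very close to the estimated column of Table 2.5 (μ13 ≈ 97.76, σ13 ≈ 51.9).

    import numpy as np

    c = np.array([0.09765625, 0.1953125, 0.39453125, 0.1953125, 0.09765625])
    mu, sigma, rho1 = 99.7108, 55.5663, 0.9199     # video3 input statistics (Table 2.5)
    rho = lambda m: rho1 ** abs(m)                 # AR(1) approximation of lag-m correlation

    k = len(c)
    mu13 = c.sum() * mu                                              # eq. (2.29)
    var13 = sigma**2 * sum(c[i] * c[j] * rho(j - i)                  # eq. (2.30)
                           for i in range(k) for j in range(k))
    cov1 = sigma**2 * sum(c[i] * c[j] * rho(j - i + 1)               # lag-1 covariance
                          for i in range(k) for j in range(k))
    print(mu13, np.sqrt(var13), cov1 / var13)      # ~97.76, ~51.9, ~0.98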


Figure 2.7 Direct form FIR filter

The measured and estimated word-level statistics for video3 data are shown in Table 2.5. We see that the estimated statistics match the measured statistics very closely, with errors of less than 1%. Table 2.6 shows the measured and estimated total word-level transition activity for the FIR filter in Figure 2.7 and its transpose in Figure 2.8 (when the signals from Table 2.1 are passed through the filters). The measured values were obtained by simulation using a C program. It can be seen that the total transition activity for the transpose form is always less than that for the direct form because of the lower transition activity at the inputs to the delays. The lower transition activity at the inputs to the delays arises because multiplying by a constant of magnitude less than unity reduces the variance and hence the transition activity.


Figure 2.8 Transpose FIR filter

2.3.3 Example 2: Folded FIR filter

Folding [26] is an algorithm transformation technique that allows the mapping of algorithmic operations onto a given set of hardware units. For instance, the 5-tap FIR filter in Figure 2.7, containing 5 multiplications and 4 additions, can be folded onto 3 multipliers and 2 adders using additional delays and multiplexers, as shown in Figure 2.9.

Table 2.5 Word-level statistics for direct form FIR filter

                          μ                            ρ                         σ
Signal              Meas.    Est.     % Error   Meas.   Est.    % Error   Meas.    Est.     % Error
x0, x1, x2, x3, x4  99.7108  99.7108  0.00      0.9199  0.9199  0.00      55.5663  55.5663  0.00
x5, x9              9.7445   9.7374   0.07      0.9183  0.9199  0.17      5.4648   5.4264   0.71
x6, x8              19.4827  19.4748  0.04      0.9198  0.9199  0.01      10.8646  10.8528  0.11
x7                  39.3477  39.3390  0.02      0.9196  0.9199  0.03      21.9415  21.9226  0.09
x10                 29.2272  29.2122  0.05      0.9529  0.9534  0.05      16.0293  15.9868  0.27
x11                 68.5749  68.5512  0.03      0.9660  0.9661  0.01      37.1125  37.0569  0.15
x12                 88.0576  88.0259  0.04      0.9763  0.9764  0.01      47.1728  47.1104  0.13
x13                 97.8021  97.7633  0.04      0.9811  0.9812  0.01      51.9925  51.9001  0.18

Table 2.6 Total transition activity for FIR filters

         Direct form                      Transpose
Signal   Measured  Estimated  % Error    Measured  Estimated  % Error
SIG1     148.31    148.92     0.13       145.45    145.97     0.36
SIG2     76.64     76.40      0.31       72.84     72.26      0.80
SIG3     113.15    113.64     0.43       109.00    109.44     0.40
SIG4     74.55     74.81      0.35       70.25     70.53      0.40
SIG5     104.63    102.10     2.42       101.41    98.62      2.75

The statistics of the signals of the unfolded filter can be calculated using (2.29), (2.30), and (2.31). These are used along with (2.32), (2.33), and (2.34) to calculate the statistics of the signals of the folded filter. As an example, the statistics of the signal x11,7(n), obtained by multiplexing x11(n) and x7(n), are given by the following equations,

μ11,7 = (c1 + c2 + 2c3)μ / 2

σ²11,7 = (c3²σ² + (c1² + c2² + c3² + 2c1c2 ρ(1) + 2c2c3 ρ(1) + 2c1c3 ρ(2))σ²) / 2 + (c1 + c2)²μ² / 4

ρ11,7 = (E[x11,7(n)x11,7(n − 1)] − μ²11,7) / σ²11,7,

where E[x11,7(n)x11,7(n − 1)] expands, via (2.34) with p_c = 1/2, into cross-expectations between x11 and x7 that are computed from the lag correlations of the input.

The measured and estimated word-level statistics are shown in Table 2.7. They match very closely, with errors of less than 1%. Table 2.8 shows the measured and estimated total word-level transition activity for the folded FIR filter in Figure 2.9. The error between the measured and estimated transition activity for the five signals is less than 4%. A comparison between the transition activities of the original FIR filter (see Table 2.6) and the folded architecture (see Table 2.8) indicates that folding increases the number of transitions. This conclusion is consistent with that observed in [10].

Figure 2.9 Folded direct form filter

2.3.4 Example 3: IIR filter

In this example, we propagate word-level statistics through the simple infinite impulse response (IIR) filter in Figure 2.10, where c1 = 0.1. The equations for the statistics of the signals in the direct form IIR filter are given by,

x3(n) = Σ_{i=1}^{n} c1^i x0(n − i)

E[x0(n)x3(n)] = E[Σ_{i=1}^{n} c1^i x0(n − i) x0(n)]
 = lim_{n→∞} Σ_{i=1}^{n} c1^i E[x0(n − i)x0(n)]
 = lim_{n→∞} Σ_{i=1}^{n} (c1^i ρ(i) σ0² + c1^i μ0²)

Table 2.7 Word-level statistics for folded direct form FIR filter

            μ                            ρ                         σ
Signal   Meas.    Est.     % Error   Meas.   Est.    % Error   Meas.    Est.     % Error
x0,4     99.7108  99.7108  0.00      0.7203  0.7203  0.00      55.5663  55.5663  0.00
x1,3     99.7108  99.7108  0.00      0.8150  0.8150  0.00      55.5663  55.5663  0.00
x2,2     99.7108  99.7108  0.00      0.9600  0.9600  0.00      55.5663  55.5663  0.00
x5,9     9.7444   9.7374   0.07      0.7200  0.7203  0.04      5.4648   5.4264   0.71
x6,8     19.4827  19.4748  0.04      0.8143  0.8150  0.09      10.8646  10.8528  0.11
x10,14   29.2272  29.2122  0.05      0.8122  0.8126  0.05      16.0297  15.9868  0.27
x11,7    53.9613  53.9452  0.03      0.5096  0.5094  0.04      33.8074  33.8042  0.01

Table 2.8 Total transition activity for folded direct form FIR filter

Signal   Measured  Estimated  % Error
SIG1     202.14    208.39     3.09
SIG2     119.48    123.30     3.20
SIG3     187.04    193.08     3.23
SIG4     118.18    120.64     2.08
SIG5     166.56    169.78     1.93


Figure 2.10 IIR direct form filter and transpose

E[x0(n)x3(n)] = c1 ρ(1) σ0² / (1 − c1 ρ(1)) + c1 μ0² / (1 − c1)   (assuming ρ(i) = ρ^i(1))

E[x0(n − 1)x3(n)] = c1 μ0² / (1 − c1) + σ0² Σ_{i=1}^{∞} c1^i ρ(i − 1) = c1 σ0² / (1 − c1 ρ(1)) + c1 μ0² / (1 − c1)   (assuming ρ(i) = ρ^i(1))

E[x0(n)x3(n − 1)] = c1 μ0² / (1 − c1) + σ0² Σ_{i=1}^{∞} c1^i ρ(i + 1) = c1 ρ²(1) σ0² / (1 − c1 ρ(1)) + c1 μ0² / (1 − c1)   (assuming ρ(i) = ρ^i(1))

μ1 = μ0 / (1 − c1)

σ1² = (σ0² + 2E[x0(n)x3(n)] − 2μ0μ3) / (1 − c1²)

ρ1 = (E[x1(n)x1(n − 1)] − μ1²) / σ1² = (ρ(1)σ0² + E[x0(n)x3(n − 1)] + E[x0(n − 1)x3(n)] − 2c1μ0μ1) / ((1 − c1²)σ1²)

The measured and estimated statistics are shown in Table 2.9. The error between the measured and estimated statistics is less than 1%. Table 2.10 shows the measured and estimated total word-level transition activity for the direct form IIR filter and its transpose in Figure 2.10. We see that the total transition activity is always lower for the transpose form due to the lower transition activity at the input to the latch: multiplication by a constant of magnitude less than 1 reduces the variance, which in turn reduces the transition activity.


Table 2.9 Word-level statistics for direct form IIR filter

           μ                       ρ                        σ
Signal   Meas.  Est.  % Error   Meas.   Est.    % Error   Meas.    Est.     % Error
x0       1.43   1.43  0.00      0.9628  0.9628  0.00      7349.20  7349.20  0.00
x1, x2   1.59   1.58  0.63      0.9672  0.9695  0.24      8132.59  8135.16  0.03
x3       0.16   0.16  0.00      0.9672  0.9695  0.24      812.92   813.52   0.07

Table 2.10 Total transition activity for IIR filters

         Direct form                     Transpose
Signal   Measured  Estimated  % Error   Measured  Estimated  % Error
SIG1     35.22     35.52      0.85      35.68     35.97      0.81
SIG2     18.36     18.21      0.82      16.82     16.38      2.62
SIG3     26.86     26.92      0.22      26.33     27.25      3.49
SIG4     17.77     17.86      0.51      16.11     15.92      1.18
SIG5     24.88     24.38      2.01      23.66     23.26      1.69

2.4 Results with Realistic Benchmark Signals

We have so far presented results using the stationary, synthetic signals in Table 2.1. In this section, we present simulation results for the non-stationary, naturally occurring audio, video, and communications channel signals described in Table 2.2. First, we apply the approximate method (see subsection III(C)) to compare the measured and estimated transition activity for these signals. Then, we process these signals through the direct form FIR (Figure 2.7) and IIR (Figure 2.10), transpose FIR (Figure 2.8) and IIR (Figure 2.10), and folded direct form FIR (Figure 2.9) filters to compute the total transition activity in these structures.

2.4.1 Realistic benchmark signals

For the audio, video, and communications channel data described in Table 2.2, the approximate method was employed to estimate transition activity. The results are shown in Table 2.12, where the measured transition activity was calculated directly from the data. We assumed ρ_BP1 = ρ, the word-level temporal correlation. To estimate BP0, we assumed AR(1) models for all data sets except Audio5 and Video3, for which we used MA(10) models because the AR(1) models resulted in higher errors. The measured and estimated values


Table 2.11 Measured and Estimated BP0 and BP1

              BP0                     BP1
Signal    Measured  Estimated    Measured  Estimated
Audio3    10        11           16        15
Audio4    9         10           15        15
Audio5    0         3            15        14
Audio6    4         9            14        14
Audio7    5         9            14        14
ATM LAN   12        12           15        15
Video3    1         1            8         8

Table 2.12 Word-level transition activity

          Unsigned, Two's complement   One's complement          Sign magnitude
Signal    Meas.  Est.  % Error         Meas.  Est.  % Error      Meas.  Est.  % Error
Audio3    6.42   6.32  1.56            6.43   6.32  1.71         6.17   6.24  1.13
Audio4    5.80   6.06  4.46            5.80   6.06  4.46         5.55   5.89  6.13
Audio5    4.78   4.40  7.95            4.79   4.40  8.14         4.22   4.23  0.24
Audio6    5.38   5.59  3.90            5.38   5.59  3.90         4.62   5.43  17.53
Audio7    5.05   5.52  9.31            5.05   5.52  9.31         4.78   5.44  13.81
ATM LAN   7.76   7.56  2.58            7.76   7.56  2.58         7.09   6.94  2.12
Video3    2.31   2.15  6.93            2.31   2.15  6.93         2.16   2.15  0.15

of BP0 are shown in Table 2.11. The difference between the measured and estimated values of BP0 for signals Audio5, Audio6, and Audio7 is due to the fact that the least significant bits of these signals are correlated, as can be seen from Figure 2.3. From Table 2.12, we see that for unsigned, two's complement, and one's complement representations, the estimation error in T is less than 10%. For sign-magnitude representation, the error in T is less than 18%.

2.4.2 Total word-level transition activity, T, for FIR and IIR filters

In this subsection, we present the measured and estimated transition activity with audio, video, and communications channel data for the direct form filter in Figure 2.7 and its transpose in Figure 2.8 (see Table 2.13), the folded direct form filter in Figure 2.9 (see Table 2.14), and the IIR filter and its transpose in Figure 2.10 (see Table 2.15). The errors in T for all the filters are less than 12%. Table 2.16 compares the run time for simulation with the run time

Table 2.13 Total transition activity for FIR filters

           Direct form                     Transpose
Data set   Measured  Estimated  % Error   Measured  Estimated  % Error
Audio3     102.16    100.76     1.37      99.14     98.01      1.14
Audio4     91.40     94.37      3.25      88.42     90.62      2.49
Audio5     75.80     68.42      9.74      73.23     66.55      9.12
Audio6     84.94     86.63      1.99      82.03     83.09      1.29
Audio7     78.82     85.65      8.67      76.07     82.41      8.33
ATM LAN    129.35    124.76     3.55      127.94    122.23     4.46
Video3     31.58     33.02      4.56      28.29     31.64      11.84

Table 2.14 Total transition activity for folded direct form FIR filter

Data set   Measured  Estimated  % Error
Audio3     159.14    158.29     0.53
Audio4     145.40    146.89     1.02
Audio5     123.40    132.94     7.73
Audio6     136.12    138.88     2.03
Audio7     130.60    135.04     3.40
ATM LAN    202.46    207.93     2.70
Video3     51.32     52.95      3.18

for the approximate method on an 85 MHz SparcStation 5. We see that in most cases the run time for the approximate method is an order of magnitude less than that for simulation. The run time for simulation depends on the length of the input sequence, whereas the run time for the approximate method depends on the width of the signals (8 bits for video3 and 16 bits for the rest). This is because, in our method, the computational complexity is determined by the calculation of p_i using (2.5), where the summation is over 2^B elements, B being the bit width. We can make the computation time of p_i essentially independent of bit width by calculating the sum over points in X_i spaced a certain distance (2^{BP0}) apart, with basically no loss of accuracy of the sum. The running times using this fast approximate method and the Dual Bit Type (DBT) method are also shown in Table 2.16. The run times for the approximate method can be further reduced by introducing optimizations such as setting the transition activity at the output of a delay equal to that at its input.

Table 2.15 Total transition activity for IIR filters

           Direct form                     Transpose
Data set   Measured  Estimated  % Error   Measured  Estimated  % Error
Audio3     24.36     24.26      0.41      22.92     22.56      1.57
Audio4     21.82     22.49      3.07      20.34     20.78      2.16
Audio5     18.06     17.15      5.04      16.99     15.55      8.48
Audio6     20.29     20.85      2.76      19.02     19.39      1.95
Audio7     18.87     20.33      7.74      17.38     18.59      6.96
ATM LAN    30.59     29.25      4.38      30.17     28.68      4.94
Video3     7.69      7.74       0.65      6.22      6.93       11.41

Table 2.16 Run times in seconds for direct form filter

Signal     Simulation  DBT    Approximate method  Fast method
Audio3     42.30       6.38   2.25                0.06
Audio4     40.28       6.40   2.41                0.13
Audio5     5.00        0.85   3.10                1.23
Audio6     8.60        1.46   2.58                0.18
Audio7     39.16       6.65   2.58                0.21
ATM LAN    13.05       1.86   2.21                0.05
Video3     138.91      37.95  0.01                0.01


2.5 Summary

In this chapter, we have proposed a novel methodology to determine the average number of transitions in a signal from its word-level statistical description. The proposed methodology employs: 1) high-level signal statistics, 2) a statistical signal generation model, and 3) the signal encoding (or number representation) to estimate the transition activity for that signal. In particular, the signal statistics employed are the mean (μ), variance (σ²), and autocorrelation (ρ). The signal generation models considered are auto-regressive moving-average (ARMA) models. The signal encodings include unsigned, one's complement, two's complement, and sign-magnitude representations. First, the following exact relation between the transition activity (t_i), bit-level probability (p_i), and bit-level autocorrelation (ρ_i) for a single-bit signal b_i is derived,

t_i = 2p_i(1 − p_i)(1 − ρ_i)   (2.35)

Next, two techniques are presented which employ the word-level signal statistics, the signal generation model, and the signal encoding to determine ρ_i (i = 0, …, B − 1) in (2.35) for a B-bit signal. The word-level transition activity T is obtained as a summation over t_i (i = 0, …, B − 1), where t_i is obtained from (2.35). Simulation results for 16-bit signals generated via ARMA models indicate that an error in T of less than 2% can be achieved. Employing AR(1) and MA(10) models for audio and video signals, the proposed method results in errors of less than 10%. Both analysis and simulations indicate the sign-magnitude representation to have lower transition activity than unsigned, one's complement, or two's complement. Finally, the proposed method is employed in the estimation of transition activity in digital signal processing (DSP) hardware. Signal statistics are propagated through various DSP operators, such as adders, multipliers, multiplexors, and delays, and then the transition activity T is calculated. Simulation results with ARMA inputs show that errors of less than 4% are achievable in the estimation of the total transition activity in the filters.


Chapter 3

POWER REDUCTION IN DIGITAL FILTERS

In the previous chapter, we presented techniques to estimate the power dissipation in digital filters. In this chapter, we focus on reducing the power dissipation in digital filters. In a digital filter, the output, y(n), at time n is given by,

y(n) = Σ_{k=0}^{N−1} b_k x(n − k) + Σ_{k=1}^{M} a_k y(n − k),   (3.1)

where b_k and a_k are the coefficients of the filter and x(n) is the input. Equivalently, in the z-transform domain,

Y(z) = H(z)X(z),   (3.2)

where Y(z), H(z), and X(z) are the z-transforms of the output, filter, and input, respectively. Power reduction techniques exist specifically for digital filters [33, 34, 35, 36, 37, 38, 39, 40, 41]. These techniques employ the fact that power dissipation in CMOS VLSI circuits occurs mainly during signal transitions and is given by,

P = t C_L V_dd² f,   (3.3)

where t is the transition activity, C_L is the capacitance, V_dd is the supply voltage, and f is the frequency of operation. The transition activity of a bit-level signal, b_n, is defined as,

t = Pr(b_n = 0 and b_{n−1} = 1) + Pr(b_n = 1 and b_{n−1} = 0),

where Pr(A) is the probability of occurrence of event A.

Power dissipation is reduced by reducing one or more of t, C_L, V_dd, and f in (3.3). The techniques are (with the quantities being reduced in parentheses): multi-rate architectures (C_L and V_dd), prefiltering (C_L) [33], block finite impulse response (FIR) filters (C_L and V_dd) [36], coefficient optimization (C_L), differential coefficients (C_L) [40], and coefficient reordering (t). In other work on low-power adaptive filters, in [35] the total switched capacitance is reduced by dynamically varying the filter order based on signal statistics. In [34], power reduction is achieved by a combination of powering down filter taps and modifying the coefficients. In [39], the strength reduction transformation is applied at the algorithmic level to reduce power dissipation in complex adaptive filters. In this chapter, we present two techniques to reduce power dissipation in digital filters. We first present an algorithm transformation technique, referred to as the DECOR transform, to reduce the number of bits required to represent the filter coefficients or the input samples by decorrelating the coefficients or the input. We then present a method to reduce the power dissipation in the memory used in distributed arithmetic architectures. The proposed architecture exploits the fact that the input to a filter is typically correlated, due to which the probability distribution of memory addresses is not uniform. We present a rule governing this distribution and use it to partition the memory so that the most frequently accessed locations are stored in a small memory, and use a larger memory to store the remaining data. Power dissipation is reduced because accesses to the smaller memory dissipate less power. The rest of this chapter is organized as follows. In section 3.1, the DECOR transform is presented and applied to fixed coefficient FIR and IIR filters. In section 3.2, the DECOR transform is applied to adaptive filters, and in section 3.3, analytical and simulation results for the reduction in power dissipation are presented. In section 3.5, we present our low-power DA architecture, and in section 3.6, we provide experimental results for the reduction in power dissipation. The results of this chapter have appeared in [42, 43, 44].

3.1 DECOR Transform Applied To Fixed Coefficient Filters

In this section, the DECOR transform is presented and applied to fixed coefficient FIR and IIR filters. The relaxed DECOR transform, applicable to filters which do not retain full numerical precision, is then presented. Finally, the DECOR transform is applied to the filter inputs.

3.1.1 Related work

The Signal Flow Graph Transformations (SFGT) in [38], which were developed independently, are a special case of the DECOR transform. The differences between our work and [38] are as follows. The DECOR transforms are applied to infinite impulse response (IIR) filters, adaptive filters, filters with rounding after the output of multipliers, and the inputs to a filter, in addition to fixed coefficient FIR filters which retain full numerical precision. We study the types of filters that are suitable for DECOR transforms and provide gate-level simulations to illustrate the effect of such filter parameters as cutoff frequency and filter order on the energy savings. Another approach close to DECOR in the literature is the Differential Coefficients Method (DCM) [40], in which differential coefficients are employed for FIR filters. In FIR filters, the output is given by,

y(n) = Σ_{k=0}^{N−1} b_k x(n − k).   (3.1)

The first-order differential coefficients, δ_k^1, are given by,

δ_k^1 = b_k − b_{k−1}.   (3.2)

Each product term, b_k x(n − k) (except b_0 x(n)), in (3.1) is written as,

b_k x(n − k) = δ_k^1 x(n − k) + b_{k−1} x(n − k).   (3.3)

The result of applying DCM to the DF filter in Figure 3.1(a) is shown in Figure 3.1(b). It is possible to employ second-order differences, δ_k^2, by repeating the above procedure on the first-order differential coefficients δ_k^1. The advantage of DCM is that the width of the coefficients is reduced, but N − 1 additional adders and latches are required for an N-tap filter. In addition to DCM, there are other approaches that exploit coefficient correlation. In [45], the frequency response of the filter is used to select an appropriate architecture from among the fast FIR algorithms proposed in [46].
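A small numerical illustration of the bit-width savings (Python; the 8-bit quantized coefficient set is invented for illustration and is not one of the thesis benchmark filters):

    import numpy as np

    def bits_needed(c):
        # two's-complement integer bit-width for the widest coefficient
        return int(np.max(np.abs(c))).bit_length() + 1

    b = np.array([3, 25, 57, 89, 113, 127, 113, 89, 57, 25, 3])   # assumed coefficients
    d1 = np.diff(b)                  # first-order differences  delta_k = b_k - b_{k-1}
    d2 = np.diff(d1)                 # second-order differences
    print(bits_needed(b), bits_needed(d1), bits_needed(d2))       # 8 7 6

Because adjacent coefficients of such a smooth response are close in value, each level of differencing shaves roughly a bit off the representation.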


Figure 3.1 (a) Direct Form (DF), (b) DCM, and (c) DECOR Filters

In this section, motivated by DCM, an alternative approach is proposed to realize a filter with differential coefficients. Our formulation results in the following advantages over DCM: 1) lower overhead for a given filter order, 2) overhead that is independent of the filter order, 3) energy savings over a wider range of filter bandwidths, 4) easy and efficient implementation in software, and 5) applicability to adaptive filters. Note that power reduction techniques such as parallel processing and pipelining, and those in [34, 35, 39], can be applied in addition to DECOR for further power savings.

3.1.2 The DECOR transform

In DECOR, the transfer function, H(z), is multiplied and divided by the polynomial,

f(z) = (1 + αz^{−β})^m,   (3.4)

where α, β, and m are integers that are chosen depending on the frequency profile of H(z). Thus, the z-transform of the output is given by,

Y(z) = H(z) (1 + αz^{−β})^m / (1 + αz^{−β})^m X(z).   (3.5)

The frequency response is not altered by multiplying and dividing by (1 + αz^{−β})^m as long as finite precision effects are accounted for. The numerator polynomial H(z)(1 + αz^{−β})^m results

in a filter with differential coefficients. The denominator (1 + αz^{−β})^m introduces a recursive section. The parameter α (determined in Appendix B) is either 1 or −1 and determines whether coefficients spaced β sample delays apart are added or subtracted. The parameter m is analogous to the order of difference in [40]. The coefficient bit-width typically decreases with m. There is, however, a limit to m, since the bit-widths of the coefficients cannot be less than zero and the overhead of DECOR increases with m. Figure 3.1(c) shows the filter obtained after applying DECOR with α = −1, β = 1, and m = 1 (i.e., f(z) = 1 − z^{−1}) to the DF filter in Figure 3.1(a). In Figure 3.1(c), all coefficients, except for the left-most and right-most, are differences of adjacent coefficients in the original filter. Note that the left-most coefficients in DECOR (see Figure 3.1(c)) and DF (see Figure 3.1(a)) are identical, while the right-most coefficients have opposite signs. Thus, for there to be a reduction in bit-width, the left-most and right-most coefficients in the original filter must be small in magnitude, which is true in most practical filters. The DECOR transform is applicable to filters in which the magnitude of the difference between the absolute values of adjacent coefficients of H(z) is less than the magnitude of the coefficients themselves. Hence, fewer bits are required to represent the differences compared to the actual coefficients. Thus, the size of the arithmetic units in the filter is reduced, thereby reducing the power dissipation. In addition to a reduction in power dissipation, there is also a reduction in delay and area due to the smaller bit-widths at the inputs to the multipliers. DECOR transforms are well suited to custom, hardwired, fixed-point implementations since power reduction is achieved by reducing the size of the arithmetic units. Since DECOR transforms maintain the regularity of the original direct form (DF) filter, they can also be considered for implementation on a programmable fixed-point processor. In Figure 3.1(c), the result of applying DECOR to an FIR filter is an IIR filter with additional hardware compared to the original FIR filter. The overhead in DECOR and DCM is summarized in Table 3.1. Unlike DCM, the overhead in DECOR is independent of the filter order, and for this reason is less than that of DCM. The overhead in DECOR depends on H(z). For instance, if m is 1 and all the coefficients of the original filter are non-zero, then for symmetric and anti-symmetric filters no additional multiplier is required. The overhead in DECOR also depends on whether the filter is implemented in software or hardware. As Figure 3.2 shows, it is nearly as efficient to implement a DECOR filter as a DF filter in software.
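The transform itself is a polynomial multiplication, which the sketch below carries out for α = −1, β = 1, m = 1 (Python; the coefficient set is the same illustrative one as above). The recursive section 1/(1 − z^{−1}) is just a running sum, so the output matches the direct form exactly while the coefficient bit-width drops:

    import numpy as np

    def bits_needed(c):
        return int(np.max(np.abs(c))).bit_length() + 1    # two's-complement width

    b = np.array([3, 25, 57, 89, 113, 127, 113, 89, 57, 25, 3])   # assumed coefficients
    g = np.convolve(b, [1, -1])            # H(z)(1 - z^-1): differential coefficients

    x = np.random.default_rng(3).integers(-128, 128, 1000)
    y_df = np.convolve(x, b)[:1000]        # direct form output
    v = np.convolve(x, g)[:1000]           # feed-forward section with coefficients g
    y = np.cumsum(v)                       # recursive section: y(n) = v(n) + y(n-1)
    print(np.array_equal(y, y_df), bits_needed(b), bits_needed(g))   # True 8 7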

Table 3.1 Overhead in DECOR and DCM for an N-tap FIR filter

Arithmetic Unit   DECOR   DCM
Delay             2βm     mN
Adder             2m      mN
Multiplier        m       0

    DF loop:
      ...
      rpt  #N
      mac  (*r1)+,(*r2)+,b
      ...
      clr  b
      jmp  DF loop

    DECOR loop:
      ...
      rpt  #N+1
      mac  (*r1)+,(*r2)+,b    ; Note: b is not set to 0
      ...
      jmp  DECOR loop

Figure 3.2 DF and DECOR Code For An N Tap FIR Filter

The DECOR code executes 1 additional multiply-accumulate instruction but saves on setting register b to 0, due to the presence of the recursive section in Figure 3.1(c). The values of α and β are determined in Appendix B and summarized in Table 3.2. The values of α and β in Table 3.2 were verified to be effective in reducing the bit-widths when the filter was designed employing such other methods as the window-based method, Parks-McClellan, or least-squares. In Table 3.2, the absolute value of α, |α|, is always equal to 1. Since β is less than the order of the original DF filter, it is also possible to perform an exhaustive search to determine the optimum α and β. There are 3 disadvantages if |α| is not equal to 1.

Table 3.2 α and β for different types of FIR filters

Filter Type                      α     β       f(z)
Low-pass                         −1    1       (1 − z^{−1})^m
High-pass                        1     1       (1 + z^{−1})^m
Band-pass (ωc = center freq.)    1     β_ωc    (1 + z^{−β_ωc})^m
Band-stop                        −1    2       (1 − z^{−2})^m

(1) The magnitude of the difference of the absolute values is not minimized since, typically, the magnitude of the filter coefficients increases till the center coefficient and then decreases. (2) A multiplier is needed in the recursive section of the IIR filter, resulting in extra overhead and a quantizer. (3) If |α| ≠ 1 and H(z) is either symmetric or anti-symmetric, then H(z)(1 + αz^{−β})^m is not necessarily either symmetric or anti-symmetric. Thus, if |α| ≠ 1, complexity reductions due to symmetry cannot be exploited. For the above reasons, α is assumed to be either 1 or −1 throughout this chapter.

Figure 3.3 shows the coefficients of a 41-tap low-pass FIR filter with cutoff π/5 (where π corresponds to half the sample rate) before and after applying DECOR. There is a decrease in the range of coefficients after applying DECOR once (α = −1, β = 1, m = 1) and a further decrease after applying DECOR twice (α = −1, β = 1, m = 2). Figure 3.4 shows


Figure 3.3 Coefficients of a low-pass filter before and after DECOR (α = −1, β = 1)

the coefficients of a 41-tap high-pass FIR filter with cutoff 4π/5 before and after applying DECOR. There is a decrease in the range of coefficients after applying DECOR once (α = 1, β = 1, m = 1). If the filter passband is narrow, then from the equations (B.1)-(B.4) for the ideal low-pass, high-pass, band-pass, and band-stop filters, respectively, we see that the sinc function is sampled at smaller intervals, leading to smaller differences between the magnitudes of adjacent coefficients. Thus, as the width of the passband is reduced, the reduction in coefficient bit-width is increased and the effectiveness of DECOR is increased. If the original coefficients are quantized such that the magnitude of the maximum coefficient is close to the maximum value possible for the given bit-width and m = 1, then it can be shown that the cutoff frequency, ω_c, must be less than


Figure 3.4 Coefficients of a high-pass filter before and after DECOR (α = 1, β = 1, m = 1)

0.385π (where π corresponds to half the sample rate) for there to be a reduction in the number of bits for a low-pass filter generated using the sinc function. Similarly, for a high-pass filter, the cutoff frequency must be greater than 0.615π. Simulation results are presented in section IV to illustrate the impact of cutoff frequency on the effectiveness of DECOR. Multiplying and dividing the transfer function by (1 + αz^{−β})^m introduces m poles and m zeros at each of the β-th roots of −α. Since |α| is equal to 1, all the new poles and zeros are on the unit circle. For stability, the new poles must exactly cancel the new zeros. To ensure exact cancelation of poles and zeros, the coefficients in H(z) must first be quantized and then multiplied with (1 + αz^{−β})^m. In addition, the multiplications and additions in the actual implementation of H(z) have to be exact, i.e., without rounding or truncation. The next subsection describes the issues that arise when this assumption is relaxed.

3.1.3 Relaxed DECOR

In most filters, quantization (i.e., rounding or truncation) is employed at the output of the multipliers in order to reduce the size of the adders. The noise due to quantization depends on the amount of quantization; the noise increases as more bits are dropped at the output of the multipliers. So far, no quantization has been employed at the output of the multipliers in DECOR filters in order to guarantee stability. Relaxed DECOR is identical to DECOR except that quantization is employed at the output of the multipliers. There are two issues that have to be considered in relaxed DECOR filters: stability and quantization noise. In relaxed DECOR filters, the pole on the unit circle introduced by the recursive section is not cancelled by the preceding feed-forward section. Hence, to guarantee stability, the IIR section can be modified to include a saturating adder. Alternatively, a multiplier with coefficient close to, but less than, unity can also be employed in the recursive section so as to move the pole inside the unit circle. In our simulations, a saturating adder is employed in the recursive section because, in practice, it results in lower overhead and lower deviation from the ideal (i.e., unquantized) output. The quantization at the output of the multipliers in relaxed DECOR filters has to be done such that the quantization error does not build up in the recursive part. A build-up of the quantization error can occur if the quantization error has a d.c. bias. For instance, the output, y_relaxed(n), of a relaxed DECOR filter with α = −1, β = 1, and m = 1 is given by,

y_relaxed(n) = Q[b_0 x(n)] + Σ_{k=1}^{N−1} Q[(b_k − b_{k−1}) x(n − k)] − Q[b_{N−1} x(n − N)] + y_relaxed(n − 1),

where Q[·] represents the quantization operation. If Q[x] ≤ x, then it can be shown by induction on n that the difference between the ideal (i.e., unquantized) output and the relaxed DECOR output, y(n) − y_relaxed(n), is a non-decreasing function of n. The condition Q[x] ≤ x is in fact satisfied when two's complement with truncation is employed. Hence, if two's complement arithmetic is employed, then rounding, not truncation, must be employed at the output of the multipliers in relaxed DECOR. If, however, sign-magnitude representation is employed, then it may be possible to employ either truncation or rounding in relaxed DECOR. In our simulations, rounding with two's complement arithmetic was employed.
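The d.c.-bias argument can be seen in simulation. In the sketch below (Python; the filter, input, and quantization step are our own illustrative choices), flooring every product (so Q[x] ≤ x) makes the deviation from the unquantized output grow steadily, while rounding keeps it small:

    import numpy as np

    rng = np.random.default_rng(4)
    b = np.array([3, 25, 57, 89, 113, 127, 113, 89, 57, 25, 3]) / 256.0
    g = np.convolve(b, [1, -1])                  # alpha = -1, beta = 1, m = 1
    x = rng.integers(-128, 128, 4000).astype(float)
    step = 1.0 / 16                              # assumed quantization step

    def relaxed(Q):
        y = np.zeros(len(x))
        for n in range(len(x)):
            acc = y[n - 1] if n > 0 else 0.0     # recursive section
            for k in range(len(g)):
                if n - k >= 0:
                    acc += Q(g[k] * x[n - k])    # quantize each product
            y[n] = acc
        return y

    ideal = np.convolve(x, b)[:len(x)]           # unquantized reference
    trunc = relaxed(lambda v: np.floor(v / step) * step)   # Q[x] <= x
    rnd = relaxed(lambda v: np.round(v / step) * step)
    print((ideal - trunc)[-1], (ideal - rnd)[-1])          # large positive vs. small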

3.1.4 Low-power IIR filters

In an IIR filter, the transfer function H(z) can be written as,

H(z) = N(z)/D(z),   (3.6)

where N(z) and D(z) are the numerator and denominator polynomials, respectively. To apply DECOR to IIR filters, the transfer function has to be multiplied and divided by two polynomials, i.e.,

Y(z) = N(z)/D(z) · (1 + α_N z^{−β_N})^{m_N} (1 + α_D z^{−β_D})^{m_D} / ((1 + α_N z^{−β_N})^{m_N} (1 + α_D z^{−β_D})^{m_D}) X(z),

where (α_N, β_N, m_N) and (α_D, β_D, m_D) depend on N(z) and D(z), respectively. If N(z) and D(z) are such that α_N = α_D = α and β_N = β_D = β, then H(z) needs to be multiplied and

Table 3.3 α_N, β_N, α_D, and β_D for different types of IIR filters

Filter Type                      α_N   β_N    α_D   β_D    f(z)
Low-pass                         −1    1      1     1      (1 − z^{−1})^{m_N} (1 + z^{−1})^{m_D}
High-pass                        1     1      −1    1      (1 + z^{−1})^{m_N} (1 − z^{−1})^{m_D}
Band-pass (ωc = center freq.)    1     β_ωc   −1    β_ωc   (1 + z^{−β_ωc})^{m_N} (1 − z^{−β_ωc})^{m_D}
Band-stop                        −1    2      1     2      (1 − z^{−2})^{m_N} (1 + z^{−2})^{m_D}

divided by one polynomial, i.e.,

Y(z) = N(z)(1 + αz^{−β})^m / (D(z)(1 + αz^{−β})^m) X(z).

Practical values of (α_N, β_N) and (α_D, β_D) for low-pass, high-pass, band-pass, and band-stop filters were obtained experimentally and are shown in Table 3.3. For example, for low-pass Chebyshev, Butterworth, and Elliptic filters, the values of (α_N, β_N) and (α_D, β_D) were experimentally determined as (−1, 1) and (1, 1), respectively. In Appendix B, we show that, unlike FIR filters, the output of the filter obtained after applying DECOR to an IIR filter will not be identical to the output of the original DF filter. The difference in outputs is unavoidable and is due to the quantizer in IIR filters. The DECOR transform can be combined with other algorithm transformations such as look-ahead pipelining [47]. For instance, when look-ahead pipelining is applied to the IIR filter H(z) = 1/(1 − az^{−1}), it can be shown that there is a decrease in the bit-width of the filter coefficients of the feed-forward path after applying DECOR if 1 > |a| > 2/3 and the look-ahead pipelining level is greater than 1/(2 − log2 |a|).

3.1.5 DECOR applied to filter inputs

The DECOR transform can be applied to the input of a filter by multiplying and dividing X(z) by (1 + αz^{−β})^m. The parameters α and β should be chosen depending on the region of the frequency spectrum that contains most of the energy of the input. For instance, if the input is correlated (i.e., most of the energy is concentrated in a small frequency band), then α is equal to −1 and β is equal to 1 (see the derivation of α and β for low-pass FIR filters in Appendix B).

One difference between the coefficients and the inputs is that typically the input sequence is not known a priori. Hence, if a lossless implementation is desired, then there will actually be an increase in the input bit-width after multiplication with (1 + αz^{−β})^m. The input bit-width can, however, be reduced if a certain amount of clipping of differences is acceptable. A block diagram for reducing the input bit-width by d bits is presented in Figure 3.5. If the difference between the current and previous input samples is greater than 2^{n−d−1} − 1 (less than −2^{n−d−1}), then the difference is set to 2^{n−d−1} − 1 (−2^{n−d−1}). The clipper output is employed to predict the current input sample (backward prediction) in order to reduce noise.


Figure 3.5 DECOR For Filter Input (α = −1, β = 1, m = 1)

3.2 DECOR Transform Applied To Adaptive Filters

In this section, the DECOR transform is applied to adaptive filters. In order to do so, we derive the following from (3.1),

y(n) = −α y(n − β) + Σ_{i=0}^{N+β−1} γ_i(n) x(n − i).   (3.1)

In (3.1), y(n) is the output of the adaptive filter. The γ_i(n)'s in (3.1) are defined as follows,

γ_i(n) = b_i(n)                        (0 ≤ i < β)
       = b_i(n) + α b_{i−β}(n − β)     (β ≤ i < N)
       = α b_{i−β}(n − β)              (N ≤ i < N + β)
                                       (3.2)

It is possible to employ m > 1 in (3.5) by cascading more than one decorrelating block, as shown in Figure 3.6(c). In section IV, we will see that, typically, a single decorrelating block provides the most reduction in power dissipation; hence, in most situations, multiple decorrelating blocks are not needed. After the filter has converged, the power dissipation in the WUD and decorrelating blocks can be reduced substantially by powering them down. Hence, after convergence, only the F′-block will consume power. The block diagrams for an adaptive LMS filter and a DECOR adaptive LMS filter are shown in Figure 3.7 and Figure 3.8, respectively. Note that dynamic algorithm transformations [34] can be applied in addition to the DECOR transform, in which case the WUD block would send a zero coefficient for the taps to be turned off. The decorrelating block would, as always, generate a new set of coefficients employing the coefficients from the WUD block. It may be possible to combine the WUD block with the decorrelating block by deriving update equations for γ_i(n). We can employ (3.2) and (3.3) to calculate γ_i(n) directly from its
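The definition can be checked directly: for any coefficient trajectory b_i(n), the recursive form reproduces the direct adaptive output exactly. A minimal sketch (Python; the random trajectory stands in for actual LMS updates, and the helper names are ours):

    import numpy as np

    rng = np.random.default_rng(5)
    N, beta, alpha, steps = 5, 1, -1.0, 200
    b = rng.normal(size=(steps, N))              # illustrative coefficient trajectory
    x = rng.normal(size=steps)

    def xat(n, k):
        return x[n - k] if 0 <= n - k < steps else 0.0

    def direct(n):                               # y(n) = sum_k b_k(n) x(n-k)
        return sum(b[n, k] * xat(n, k) for k in range(N))

    def gamma(n, i):                             # eq. (3.2), m = 1
        if i < beta:
            return b[n, i]
        if i < N:
            return b[n, i] + alpha * b[n - beta, i - beta]
        return alpha * b[n - beta, i - beta]

    y = [direct(n) for n in range(beta)]         # initialize the recursion
    for n in range(beta, steps):
        y.append(-alpha * y[n - beta] +
                 sum(gamma(n, i) * xat(n, i) for i in range(N + beta)))
    print(max(abs(y[n] - direct(n)) for n in range(steps)))   # ~1e-15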



Figure 3.7 Traditional LMS Filter

Figure 3.8 DECOR LMS Filter

previous values, as indicated below,

γ_i(n + 1) = γ_i(n) + μ e(n) x(n − i)                                                 (0 ≤ i < β)
           = γ_i(n − β + 1) + μ Σ_{k=0}^{β−1} (e(n − k) + α e(n − k − β)) x(n − i − k)   (β ≤ i < N)
           = γ_i(n − β + 1) + μ Σ_{k=0}^{β−1} α e(n − k − β) x(n − i − k)                (N ≤ i < N + β)
                                                                                      (3.3)

There is a small reduction in delay and area for m = 1. In Figure 3.21, the percentage reduction in power dissipation in the F′-block is nearly independent of data precision. In all the experiments, the power dissipation in blocks other than the F′-block (i.e., the WUD and decorrelating blocks) changed by less than 2%.

Figure 3.17 Effect of passband width (filter order = 40, coefficient precision = 8 bits, data precision = 17 bits, α = −1, β = 1, m = 1)


Figure 3.18 Effect of filter order (cutoff = 0.1, coefficient precision = 8 bits, data precision = 17 bits, α = −1, β = 1, m = 1)

3.4 Low-Power Distributed Arithmetic Architectures

In the previous sections, we presented a power reduction technique suited for narrow-band filters. In this section, we present our second power reduction technique, for filters implemented using distributed arithmetic. In a DA architecture [49], the multiplier is eliminated by employing a memory to store linear combinations of the coefficients.


Figure 3.19 Effect of coefficient precision (cutoff = 0.1, filter order = 40, data precision = 17 bits, α = −1, β = 1, m = 1)


Figure 3.20 Effect of order of difference m (cutoff = 0.1, filter order = 40, coefficient precision = 8 bits, data precision = 17 bits, α = −1, β = 1)

Figure 3.22 shows one possible DA-based implementation of a 4-tap FIR filter. The memory addresses are formed by grouping bits in the same bit position from successive input samples. The input is shifted in one bit at a time into the register containing x(n). The output is available once every B clocks (where B is the input precision) from the accumulator. The size of the memory for a k-tap filter is 2^k words. It is possible to reduce the memory size to 2^{k−1} words by employing extra logic [49]. In this chapter, we will concentrate on the architecture in Figure 3.22, though all our techniques are applicable to the architecture with memory of size 2^{k−1} words. In a DA-based filter, power is dissipated in the shift register, memory, adder, shifter, and accumulator. Since the memory size increases exponentially with the number of taps and the power dissipation in a memory increases with its size, a considerable amount of the total power dissipation in the filter occurs in the memory. This is particularly true as the number of taps increases. In this section, we present an architecture to reduce the power dissipation in the memory. The proposed architecture exploits the fact that the input to a filter is typically


Figure 3.21 Effect of data precision (cutoff = 0.1, filter order = 40, coefficient precision = 8 bits, α = −1, β = 1, m = 1)


Figure 3.22 DA-based implementation of a 4-tap FIR filter

correlated, due to which the probability distribution of memory addresses is not uniform. We present a rule governing this distribution and use it to partition the memory so that the most frequently accessed locations are stored in a small memory, and use a larger memory to store the remaining data. Power dissipation is reduced because accesses to the smaller memory dissipate less power. Experimental results indicate a reduction in power dissipation in the memory of up to 32% for an 8-tap filter with 8 bits of data precision. The basic idea in our approach is similar to that of caches in general-purpose microprocessors, where a small, fast, lower-power memory is used to store the most frequently accessed data [50]. A similar idea, termed precomputation [21], has also been employed in logic synthesis, where a computation is divided into two parts, with the most frequent computations being done in the first part, which consumes less power. In [51], the non-uniform probability distribution of variable-length codes is employed to partition the table used in variable-length decoding so that the most frequently accessed data are stored in a small memory. To our knowledge, memory partitioning has not been applied to DA architectures, though there are other approaches to reducing power dissipation specifically in such architectures. For instance,

in [52], the nega-binary code is employed to reduce the transitions in the shift register, whereas in [53], the shift register is eliminated by moving a pointer to the data instead of moving the data itself.
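A behavioral model of the datapath in Figure 3.22 makes the memory's role explicit (Python; the coefficient and input values are illustrative, and the address bit ordering here maps bit j to coefficient b_j, which may differ from the figure's convention). One table access is made per bit position, and the access for the sign bit is subtracted, matching the Add/Sub control:

    b = [3, -5, 7, 2]                       # illustrative 4-tap integer coefficients
    k = len(b)
    # entry at address a holds the sum of coefficients whose tap bit is set in a
    table = [sum(b[j] for j in range(k) if (a >> j) & 1) for a in range(2 ** k)]

    def da_fir(samples, B=8):
        """Bit-serial DA evaluation of y(n) for B-bit two's-complement inputs."""
        out = []
        for n in range(len(samples)):
            win = [samples[n - j] if n - j >= 0 else 0 for j in range(k)]
            acc = 0
            for i in range(B):              # one memory access per bit position
                addr = sum((((win[j] >> i) & 1) << j) for j in range(k))
                term = table[addr] << i
                acc += -term if i == B - 1 else term   # sign-bit weight is -2^(B-1)
            out.append(acc)
        return out

    x = [10, -3, 25, -17, 8, 0, -1, 12]
    ref = [sum(b[j] * (x[n - j] if n - j >= 0 else 0) for j in range(k))
           for n in range(len(x))]
    print(da_fir(x) == ref)                 # True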

3.5 Low-Power DA Architecture

In this section, we present our low-power DA architecture. Since the architecture exploits the skewed probability distribution of memory addresses, we first describe this distribution and then the low-power architecture.

3.5.1 Probability distribution of memory addresses

In a DA-based filter, the memory addresses are formed by grouping bits in the same bit position from successive input samples (see Figure 3.22). The transition activity of bits in typical audio, video, and communications channel data is shown in Figure 3.23, from which we see that the less significant bits (by less significant bits, we mean bits close to and including the least significant bit) in a typical input sample have a transition activity close to 0.5 [14, 30]. Hence, addresses formed by grouping bits in the less significant bit positions are uniformly


Figure 3.23 Transition activity versus bit position

distributed. From Figure 3.23, we also see that the more significant bits (by more significant bits, we mean the bits close to and including the most significant bit) in a typical input sample have a transition activity significantly less than 0.5. Hence, addresses generated from a more significant bit position have the characteristic that those which have fewer transitions within the address are more probable. For instance, assuming 8-bit addresses, 0x00 (00000000) and

0xFF (11111111) are the most frequent, since there are no transitions within those addresses. The addresses 0x55 (01010101) and 0xAA (10101010) are the least frequent, since they have the maximum number of seven transitions. Figure 3.24 shows the probability of occurrence of addresses in an 8-tap filter. The inputs are 8-bit video data and 16-bit audio data. The addresses are arranged in order of increasing number of transitions on the x-axis. We present the following rule, which we have found to hold for all input data we have examined and which we will use in our low-power architecture.

Rule 1: The probability of occurrence of an address in a DA-based filter with correlated input decreases as the number of transitions within that address increases.
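Rule 1 is easy to reproduce on synthetic data. The sketch below (Python; the AR(1) source and its parameters are our own stand-in for the audio/video inputs) forms addresses from the most significant bit position of 8 successive samples and averages the address probabilities by transition count; the averages fall as the count rises:

    import numpy as np

    rng = np.random.default_rng(6)

    def transitions(addr, k):
        # number of bit transitions within a k-bit address
        return bin((addr ^ (addr >> 1)) & ((1 << (k - 1)) - 1)).count("1")

    k, B, N, a = 8, 8, 100000, 0.95
    x = np.zeros(N)
    for n in range(1, N):                        # correlated AR(1) input
        x[n] = a * x[n - 1] + rng.normal(0, 12.0)
    xi = np.clip(np.round(x), -128, 127).astype(int) & 0xFF   # 8-bit two's complement

    msb = (xi >> (B - 1)) & 1                    # sign-bit stream
    addrs = [sum(msb[n - j] << j for j in range(k)) for n in range(k, N)]
    prob = np.bincount(addrs, minlength=2 ** k) / len(addrs)
    for t in range(k):
        grp = [p for a_, p in enumerate(prob) if transitions(a_, k) == t]
        print(t, np.mean(grp))                   # decreasing with t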


Figure 3.24 Variation in probability with transitions in an address

3.5.2 Low-power DA architecture

In this subsection, we present our low-power DA architecture. We place the most frequently accessed locations (which are also the addresses with fewer transitions) in a small memory, since accessing a small memory dissipates less power. Let M be the original memory containing all possible addresses. We partition M into l memories M1, M2, …, Ml, with M1 containing the most frequently accessed locations, M2 the next most frequently accessed locations, and so on. Let S_i be the size of memory M_i, E_Hi be the energy dissipated when there is a hit in M_i, E_overhead be the energy dissipated while decoding which memory to access, and Pr_i be the probability of a hit in M_i. The energy dissipated during a memory access is given by,

Energy = Σ_{i=1}^{l} Pr_i E_Hi + E_overhead   (3.2)

Figure 3.25 shows the proposed low-power DA architecture with two memories (i.e., l = 2). The memory select logic in Figure 3.25 decides which memory will be accessed. In Figure 3.25,


Figure 3.25 Multiple memory architecture

this action is performed in the previous clock cycle so that it does not increase the cycle time. To make the memory select logic efficient, we use Rule 1 by placing all addresses with j or fewer transitions in the smaller memory (where j is determined experimentally). The memory select logic outputs 0 or 1 depending on whether the number of transitions within the address is greater than, or less than or equal to, j, respectively. The memory select logic for an 8-bit address is shown in Figure 3.26, where a row of exclusive-or gates converts transitions within an address to 1's, followed by a tree of 1-bit full adders to count the number of 1's, which is then input to a comparator to compare with j. The output of the comparator is used to select the memory. Figure 3.27 shows the variation of Pr_1 with S_1 for filters with 4, 8, and 12 taps. The inputs are 8-bit video data and 16-bit audio data. For an 8-tap filter processing 8-bit video data, 10% of the addresses account for over half the accesses. The variation of Pr_1 with S_1 for the architecture containing memory of size 2^{k−1} words is similar to Figure 3.27.
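A bit-level model of Figure 3.26 (Python; the function name and the example threshold j = 2 are ours):

    def memory_select(addr, k, j):
        """0 selects the large memory M2, 1 the small memory M1."""
        marks = (addr ^ (addr >> 1)) & ((1 << (k - 1)) - 1)   # row of XOR gates
        ones = bin(marks).count("1")                          # full-adder tree
        return 1 if ones <= j else 0                          # comparator against j

    # 0x00 and 0xFF go to the small memory; 0x55 (7 transitions) does not
    print(memory_select(0x00, 8, 2), memory_select(0xFF, 8, 2),
          memory_select(0x55, 8, 2))                          # 1 1 0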

3.5.2.1 Memory bypass architecture

If S1 is small (< 8), then it is more efficient to provide the elements of M1 directly to a multiplexer rather than employing a separate memory. This is shown in Figure 3.28, where we have logic to detect the accesses to M1 and provide the data at those addresses directly. From Rule 1, we see that the logic to detect occurrence in M1 essentially detects the addresses with the fewest transitions (for instance, 00000000, 11111111, 11111100, and so on). It is possible to combine the memory bypass architecture with the multiple memory architecture. In the specific example of Figure 3.28, M1 consists of the locations 0000 and 1111.

3.6 Experimental Results

In this section, we present experimental results on the reduction in power dissipation in the memory. We do not include the rest of the filter (i.e., shift registers, adder, shifter, and accumulator), since our techniques reduce the power dissipation in the memory and the power dissipation in the rest of the filter is unchanged.

Figure 3.28 Memory bypass architecture to reduce memory accesses

We assumed the memory is implemented as a ROM and estimated its energy dissipation per access as,

Energy = C_T V_{dd}^2,    (3.3)

where V_{dd} is the supply voltage and C_T is the average capacitance switched during a single access. We used the following formula from [14] to estimate C_T,

C_T = C_0 + C_1 N_I 2^{N_I} + C_2 P_O N_O 2^{N_I} + C_3 P_O N_O + C_4 N_O,    (3.4)

where,

- C0, C1, C2, C3, and C4 are empirically determined capacitive coefficients that depend on the exact circuitry and technology used by the ROM (the values of the coefficients used in this chapter are shown in Table 3.4),

- NI is the number of address bits (from Figure 3.22, this is equal to the number of taps in the filter),

- PO is the probability of a 1 in the data stored in the ROM (assumed to be 0.5 in our calculations), and

- NO is the number of bits in the output word. Since the memory stores linear combinations of coefficients, NO, for our purposes, is the sum of the coefficient bit-width (assumed to be 8) and log2 NI.
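As an illustration, the sketch below (ours, not part of the thesis) evaluates (3.3) and (3.4) using the coefficient values listed in Table 3.4 below; the 3.3 V supply is our assumption for the example.

# A minimal sketch (ours) of the ROM energy model (3.3)-(3.4).
C = [3.693000, 0.002103, 0.003092, 0.302080, 0.343750]  # C0..C4 from Table 3.4, pF

def rom_energy_pJ(N_I, N_O, P_O=0.5, Vdd=3.3):
    # Vdd = 3.3 V is our assumption for illustration.
    C_T = (C[0] + C[1] * N_I * 2**N_I + C[2] * P_O * N_O * 2**N_I
           + C[3] * P_O * N_O + C[4] * N_O)              # switched capacitance, pF
    return C_T * Vdd**2                                  # energy per access, pJ

# 8-tap filter: N_I = 8 address bits, N_O = 8 + log2(8) = 11 output bits.
print(rom_energy_pJ(N_I=8, N_O=11))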


Table 3.4 Capacitive coefficients for a ROM

Coefficient  Value (pF)
C0           3.693000
C1           0.002103
C2           0.003092
C3           0.302080
C4           0.343750

We used the memory bypass architecture if S1 was less than 9 and the multiple memory architecture for larger values of S1. An 8-tap filter was used in the experiment, so the size of the memory in the original filter was 256 words. The basic unit in the additional logic to detect membership in M1 in Figure 3.28 is an 8-input NOR gate, which was estimated, using SPICE, to have an average switched capacitance of 0.15pF. The average switched capacitance in the memory select logic in Figure 3.25 was estimated, also using SPICE, to be 0.64pF for an 8-bit address. In Figure 3.29, we plot the reduction in power dissipation over an unpartitioned memory. In order to determine how much of the reduction is due to partitioning and how much is due to the skewed address distribution, we also plot the reduction in power dissipation over a partitioned memory assuming a uniform distribution of addresses. From Figure 3.29, we see that for 8-bit video data, the maximum reduction of 32% is obtained by the memory bypass architecture when S1 is 8, and most of this reduction is due to the skewed address distribution. From Figure 3.29 we also see that, for 8-bit data, we can obtain a 28% reduction in power in the memory just by detecting the 2 most common addresses, which are 0x00 and 0xFF. This is a powerful argument for using the techniques presented in this chapter. From Figure 3.29, we see that for 16-bit audio data, the maximum reduction of 18% is obtained by a multiple memory architecture when the sizes of M1 and M2 are both equal to 128, and most of this reduction is due to the partitioning of memory.

Figure 3.29 Power savings vs. S1

3.7 Summary

In this chapter, we presented two techniques to reduce the power dissipation in digital filters. We first presented the DECOR transform, which is a power-reduction technique suited for narrow-band filters. In this method, the transfer function and/or the input is decorrelated such that fewer bits are required to represent the coefficients and inputs. Thus the size of the arithmetic units in the filter is reduced, thereby reducing the power dissipation. The DECOR transform is suited for narrow-band filters because there is significant correlation between adjacent coefficients. Simulations with fixed-coefficient filters indicate reductions in transition activity ranging from 6% to 52% for filter bandwidths ranging from 0.30π to 0.05π respectively (where π corresponds to half the sample rate). Simulations with adaptive filters indicate reductions in transition activity in the F-block ranging from 12% to 38% for filter bandwidths ranging from 0.30π to 0.05π respectively. The DECOR transforms result in greater energy savings, and over a larger bandwidth, than existing methods.

We also presented a low-power filter using distributed arithmetic. In a DA architecture, a memory is employed to store linear combinations of coefficients. The probability distribution of addresses to the memory is usually not uniform because of temporal correlation in the input. We presented a rule governing this probability distribution and used it to partition the memory such that the most frequently accessed locations are stored in the smallest memory. Power dissipation is reduced because accesses to smaller memories dissipate less power. Experimental results with an 8-tap filter with 8 bits of data precision show a 32% power reduction in the memory. A 28% power reduction was obtained by just detecting accesses to the two most frequently accessed locations (0x00 and 0xFF).


Chapter 4 LOWER BOUNDS ON TRANSITION ACTIVITY

In the previous chapters we have concentrated on estimating and reducing power dissipation in digital filters. We now shift our focus from reducing power dissipation in processing blocks to reducing power dissipation on the busses which transmit data between the processing blocks. In this chapter, we derive lower bounds on the power dissipation on a bus. In the next chapter, we present practical coding schemes to reduce the power dissipation on a bus.

At the system level, off-chip busses have capacitances, CL, that are orders of magnitude greater than those found on signal lines internal to a chip. Therefore, transitions on these busses result in considerable system power dissipation. To address this problem, various signal encoding techniques have been proposed in the literature [54, 55, 30, 9, 56, 57] to encode the data before transmitting it on a bus so as to reduce the expected and the peak number of transitions. Hence, the signal encoding approaches in [54, 55, 30, 9, 56, 57] achieve power reduction by reducing switched capacitance while keeping the total capacitance more or less unaltered. In this chapter, we will explore the limits to which signal coding can be employed for the purpose of power reduction.

The work in [57] exploits the fact that data transfers on microprocessor address busses are often sequential (i.e., the current data value equals the previous data value plus a constant increment) due to instruction fetches. Hence, transition activity can be reduced by employing Gray codes, where only one bit changes between two successive codewords. Another encoding algorithm for address busses has been presented in [54]. In this algorithm, if the next address is greater than the current address by an increment of one, then the bus lines are not altered

and an extra "increment" bit is set to one. In [55], Fletcher presents an algorithm, termed Bus-Invert coding in [9], to reduce the number of transitions on a bus. This algorithm determines the number of bus lines that normally change state when the next output word is clocked onto the bus. When the number of transitions exceeds half the bus width, the output word is inverted before being clocked onto the bus. An extra output line is employed to signal the inversion. The Bus-Invert coding algorithm performs well when the data is uncorrelated. In [30], a two-step framework to reduce transition activity is presented in which data is passed through a decorrelating function f1, followed by a variant of an entropy coding function, f2, which reduces the transition activity.

The transition activity reducing algorithms have an analogue in the area of optical communications, where the power dissipation depends on the number of ON (1) bits. In [58], Faulkner presents an encoding algorithm to reduce power dissipation in optical circuits by assigning codewords with fewer 1's to signal samples having a higher probability of occurrence.

In this chapter, we derive lower and upper bounds on the average signal transition activity for any coding algorithm. These bounds are derived via an information-theoretic approach in which each symbol of a process (possibly correlated) with entropy rate H is coded with R bits on average. The bounds are asymptotically achievable if the process is stationary (i.e., signal statistics such as the mean do not change with time) and ergodic (i.e., the time average and the ensemble average are equal). The transition reduction efficiency of existing coding algorithms is compared with the bounds derived in this chapter.

The concept of entropy from information theory was employed in the area of high-level power estimation in [8, 59]. In [8], entropy was employed as a measure of the average activity to be expected in the final implementation of a circuit, given only its Boolean functional description. In [59], information theory is employed to estimate power dissipation at the logic and register-transfer levels. First, the output entropy of a circuit is estimated from the input entropy. Next, the entropy per circuit line is calculated from the input and output entropy values and used as an estimate for the average switching activity. In [59], the expected transition activity of a 1-bit signal is shown to be upper-bounded by one-half of its entropy under the temporal independence assumption and assuming level signaling. In contrast, the work presented here is applicable to multi-bit signals, is independent of the coding algorithm, and completely unravels the connection between the bounds on transition activity and the entropy rate. Also, the focus of

this chapter is not to estimate average switching activity, but to provide information-theoretic bounds on average switching activity.

In section 4.1, we present the preliminaries necessary for the development in the rest of the chapter. In section 4.2, the main result is presented in the form of Theorem 4. In section 4.2, we also present a coding algorithm that asymptotically achieves the lower bound for stationary and ergodic processes. In section 4.3, we employ Theorem 4 to, 1.) derive lower and upper bounds for different coding algorithms, and 2.) determine the lower bound on the power-delay product. We also present two examples where transition activity within 4% and 9% of the lower bound is achieved. The results of this chapter have appeared in [60, 61, 62].

4.1 Preliminaries

In this section, we define terms employed in the rest of the chapter. Let X be a discrete random variable with alphabet X and probability mass function p(x) = Pr(X = x), x \in X. A measure of the information content of X is given by its entropy H(X), which is defined as follows [63],

H(X) = -\sum_{x \in X} p(x) \log_2 p(x) bits.    (4.1)

This definition of the measure of information implies that the greater the uncertainty in the source output, the higher is its information content. In a similar fashion, a source with zero uncertainty would have zero information content, and therefore its entropy would be identically equal to zero (from (4.1)). The joint entropy H(X1, X2, ..., Xn) of a collection of discrete random variables (X1, X2, ..., Xn) with a joint distribution p(x1, x2, ..., xn) is defined as,

H(X_1, X_2, \ldots, X_n) = -\sum p(x_1, x_2, \ldots, x_n) \log_2 p(x_1, x_2, \ldots, x_n)    (4.2)

The entropy rate of a stochastic process {X_i} is defined as,

H = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n) bits,    (4.3)

when the limit exists. For an independent, identically distributed (i.i.d.) process, the entropy rate is equal to the entropy.
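For instance (our illustration), evaluating (4.1) for the 5-symbol distribution used later in section 4.3.5 gives 15/8 bits:

import math

# A minimal sketch (ours) of the entropy definition (4.1).
def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([1/2, 1/4, 1/8, 1/16, 1/16]))   # 1.875 = 15/8 bits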


Figure 4.1 A plot of the function H(.)

The function H(x) is defined on the real interval [0,1] as follows,

H(x) = -x \log_2 x - (1 - x) \log_2 (1 - x) bits.    (4.4)

The function H(x), shown in Figure 4.1, maps the probability of a binary-valued discrete random variable to its entropy and has the following properties:

(1) H(0) and H(1) are both defined to be 0,
(2) H(x) = H(1 - x),
(3) H(x) is a concave function, i.e., a line drawn between any two points on the curve will lie below the curve. This is also referred to as Jensen's inequality: (1/n) \sum_{i=1}^{n} H(x_i) \le H((1/n) \sum_{i=1}^{n} x_i),
(4) the derivative of H(x) with respect to x is H'(x) = \log_2 \frac{1-x}{x},
(5) H(x) is monotonically increasing in the interval (0, 1/2] because H'(x) \ge 0 in the interval, with equality occurring only at x = 1/2, and
(6) H(x) is monotonically decreasing in the interval [1/2, 1) because H'(x) \le 0 in the interval, with equality occurring only at x = 1/2.

The inverse, H^{-1}(y), of H(.), is defined on the real interval [0,1] as follows,

H^{-1}(y) = x, if y = H(x) and x \in [0, 1/2].    (4.5)

The function H^{-1}(.) maps the entropy of a binary-valued discrete random variable to a probability value that lies between 0 and 1/2. Following the convention in the literature, we denote three

different functions as H(.). In (4.1) the argument of H(.) is a single random variable, in (4.2) the argument of H(.) is a sequence of random variables, and in (4.4) the argument of H(.) is a real number between 0 and 1. It will be clear from the context which function we are referring to. The transition activity of a bit-level signal, bn, is defined as [1],

t = Pr(b_n = 0 and b_{n-1} = 1) + Pr(b_n = 1 and b_{n-1} = 0).

We define level signaling to be the case where the bit '0' is used to transmit the symbol '0' and the bit '1' is used to transmit the symbol '1'. In transition signaling, the symbol '0' is transmitted by sending the previous transmitted bit, and the symbol '1' is transmitted by sending the complement of the previous transmitted bit. Hence, a '1' is signaled with a transition and a '0' with no transition.
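Because H(x) is monotonic on (0, 1/2], the inverse H^{-1}(y) in (4.5) has no closed form but is easily evaluated numerically. A minimal sketch (ours) using bisection:

import math

def H(x):
    # Binary entropy function (4.4); H(0) = H(1) = 0 by definition.
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def H_inv(y, tol=1e-12):
    # Inverse (4.5): the unique x in [0, 1/2] with H(x) = y, found by
    # bisection, since H is monotonically increasing on (0, 1/2].
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if H(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(H_inv(5/8))   # ~0.156142, the value used in section 4.3.5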

4.2 Bounds on Signal Transition Activity

In this section, we present the main results of this chapter; namely, lower and upper bounds on the expected number of transitions. The bounds are asymptotically achievable if the process is stationary and ergodic. We also present a coding algorithm that asymptotically achieves the bounds. The proofs are presented in the Appendices. In order to derive bounds on transition activity, we will employ Lemmas 2 and 3 presented below. Lemma 2 bounds x given y \le H(x). Lemma 3 employs Lemma 2 to bound the expected number of 1's in a sequence of bits with a certain entropy rate. Theorem 4 employs Lemma 3 to bound the number of transitions per symbol of a process with a certain entropy rate, given that each symbol is coded employing an expected number of R bits.

Lemma 2: For all (x, y) such that x \in [0, 1] and y \in [0, 1], if y \le H(x) then,

H^{-1}(y) \le x \le 1 - H^{-1}(y).    (4.1)

The relation between x and y in Lemma 2 is shown in Figure 4.1.

Lemma 3: If,
(1) {B_i} is a 0-1 valued stationary and ergodic process with entropy rate greater than or equal to H,
(2) p_i = Pr(B_i = 1), and
(3) p_b = \lim_{n \to \infty} (1/n) \sum_{i=1}^{n} p_i exists,
then,

H^{-1}(H) \le p_b \le 1 - H^{-1}(H),    (4.2)

and the bounds in (4.2) are asymptotically achievable.

Theorem 4: Let,
(1) H be the entropy rate of a stationary and ergodic process {X_i},
(2) the symbols be coded in a uniquely decodable manner into bits represented by the binary random variables B_1, B_2, ..., employing an expected number of R (> H) bits/symbol,
(3) the bits be transmitted in some arbitrary manner over a finite set of wires such that a receiver can uniquely decode the bits, and
(4) T be the expected number of transitions in the bits on the wires per symbol (i.e., \lim_{n \to \infty} (1/n) \sum_{i=1}^{n} B_i \oplus B_{prev(i)} exists, where the function prev(i) returns the index of the bit that immediately precedes B_i on the same wire and \oplus is the exclusive-or operator),
then,

H^{-1}(H/R) R \le T \le (1 - H^{-1}(H/R)) R,    (4.3)

and the bounds in (4.3) are asymptotically achievable.
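To make (4.3) concrete, the following sketch (ours) evaluates the bounds numerically, reusing the H_inv helper from the sketch in section 4.1:

def transition_bounds(H_rate, R):
    # Bounds of Theorem 4, equation (4.3); requires R > H_rate.
    t = H_inv(H_rate / R)           # bit-level lower bound on activity
    return t * R, (1 - t) * R       # (lower, upper) transitions per symbol

# The i.i.d. example of section 4.3.5: H = 15/8 bits, R = 3 bits/symbol.
print(transition_bounds(15/8, 3))   # approximately (0.468426, 2.531574)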

In (4.3), H^{-1}(H/R) is the lower bound on the bit-level transition activity (this will be proved in Appendix A), which is scaled by R to give the lower bound on the expected number of transitions per symbol, since R bits (on average) are employed to encode a symbol. The lower and upper bounds on transition activity computed by Theorem 4 for different values of R are shown in Figure 4.2. Any coding algorithm (such as those in [54, 55, 30, 9, 56, 57]) will need to reside in the region shown in Figure 4.2.

Figure 4.2 Lower and Upper bounds on Transition Activity versus R

The lower bound on transition activity computed by Theorem 4 for different values of R is shown on a larger scale in Figure 4.3, where we see that the transition activity can be made arbitrarily close to 0 by increasing R. In practice, however, R will typically be less than


1.25H, then the delay component will increase, resulting in a non-optimal power-delay product. Similarly, if R < 1.25H, then the power component increases because less redundancy is being added. In order to illustrate Theorem 4, we now calculate bounds for two different types of sources, and the transition activity for coding algorithms that approach the bounds.

4.3.5 Bounds On Transition Activity For An i.i.d. Source

Consider an i.i.d. source with a 5-symbol alphabet X = {A, B, C, D, E} with probabilities 1/2, 1/4, 1/8, 1/16, and 1/16 respectively. Since the source is i.i.d., the entropy rate is equal to the entropy and is given by H = 15/8 bits. Assume an average of R = 3 bits is employed to code a symbol. Thus H/R = 5/8, H^{-1}(H/R) = 0.156142, and from Theorem 4, the bounds on transition activity are,

0.468426 \le T \le 2.531574.    (4.4)

We now calculate the actual transition activity that is achieved by various coding algorithms and compare them with the bounds in (4.4). To simplify the calculation of transition activity,

we assume transition signaling is employed to transmit the bits, i.e., a `1' is transmitted with a transition and a `0' is transmitted with no transition. Thus, the number of transitions is equal to the number of 1's. We can make the assumption of transition signaling in the examples because the purpose of the examples is to show the existence of coding algorithms that approach the lower bound.

4.3.5.1 Entropy Coding Followed By Spatial Redundancy Coding

In this algorithm, we initially employ an entropy coder to code the symbols employing the minimum expected number of bits per symbol. A code that achieves the entropy is A = 0, B = 10, C = 110, D = 1110, and E = 1111. The output of the entropy coder consists of temporally independent, uniformly distributed bits. These bits are placed in a buffer, and when there are 15 bits in the buffer, the bits are encoded employing 24 bits. Since H = 15/8, the 15 bits correspond to an average of 8 symbols, and hence employing 24 bits results in an expected number of 3 bits/symbol. Since we are encoding 15 bits employing 24 bits, we can employ the redundancy to reduce the number of transitions [56]. This is an extension of Bus-Invert coding and results in a 5-limited-weight code [67, 68] in which we employ the all-zero (or no-transitions) code, 24 codes with a single one, 276 codes (= \binom{24}{2}) with two ones, etc., up to 19817 of the codes with 5 ones. The expected number of transitions (or 1's) in each transmission of 24 bits is given by [30],

\frac{1}{2^{15}} \left( \sum_{i=0}^{4} i \binom{24}{i} + 5 \left( 2^{15} - \sum_{i=0}^{4} \binom{24}{i} \right) \right) = 4.52383 transitions.

Since 24 bits correspond to an average of 8 symbols, the expected number of transitions per symbol is given by (1/8) x 4.52383 = 0.565478 transitions/symbol, which is 20.7% above the lower bound in (4.4).
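A quick check (ours) of the 4.52383 figure from the codeword counts above:

from math import comb

# Expected 1's per 24-bit block of the 5-limited-weight code: all 2**15
# source blocks are equally likely; codewords with 0..4 ones are used first,
# and the remaining 19817 blocks get codewords with exactly 5 ones.
used = sum(comb(24, i) for i in range(5))        # 12951 codewords
ones = sum(i * comb(24, i) for i in range(5))    # 1's in those codewords
ones += 5 * (2**15 - used)
print(ones / 2**15, ones / 2**15 / 8)            # 4.52383..., 0.565478...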

4.3.5.2 Probability Based Coding

An alternative algorithm to reduce the number of transitions is to code the most probable symbol A as 000 (or no transitions), B as 001, C as 010, D as 100, and E as 011. The expected number of transitions per symbol is T = 0.5625 transitions/symbol, which is 20.1% above the lower bound in (4.4). We can further reduce the number of transitions by applying probability-based coding to a block of two symbols, and the activity can be reduced further by coding with block sizes larger than two. The transition activity per symbol for different block sizes is shown in Figure 4.5. Thus, we can achieve a transition activity within 4% of the lower bound by employing a block size of 8.


Figure 4.5 Transition Activity versus Block Size for i.i.d. source

Table 4.1 Transition Matrix

Previous state  S1    S2    S3
S1              1/2   1/4   1/4
S2              1/4   1/2   1/4
S3              0     1/2   1/2

4.3.6 Bounds On Transition Activity For A Markov Process

Consider the 3-state stationary Markov process U1, U2, ... having the transition matrix Pij in Table 4.1 [63]. Thus the probability that S1 follows S3 is equal to zero. An algorithm to encode the process will consist of 3 codes C1, C2, and C3 (one for each state S1, S2, and S3), where Ci is a code mapping from elements of the set {S1, S2, S3} into a codeword in Ci (see Table 4.2 for an example). The following algorithm will be employed to encode this Markov chain:

(1) Note the present symbol Si.
(2) Select code Ci.

Table 4.2 Entropy Code for Markov Process

     S1   S2   S3
C1   0    10   11
C2   10   0    11
C3   -    0    1


(3) Note the next symbol Sj and send the codeword in Ci corresponding to Sj.
(4) Repeat for the next symbol.

Figure 4.6 Transition Activity versus Block Size for Markov Process

The stationary distribution of this Markov chain is \mu = [2/9, 4/9, 1/3]^T. The entropy rate of the stationary Markov process is given by [63],

H(X) = -\sum_{ij} \mu_i P_{ij} \log_2 P_{ij} = 4/3 bits.    (4.5)

A code that achieves the entropy rate is shown in Table 4.2. If the symbols are coded with an average of R = 2 bits/symbol, then H/R = 2/3, H^{-1}(H/R) = 0.173952, and from Theorem 4,

0.347904 \le T \le 1.652096 transitions/symbol.    (4.6)
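A small verification sketch (ours) of the stationary distribution and the entropy rate in (4.5):

import math

# Transition matrix from Table 4.1 and its stationary distribution.
P = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
mu = [2/9, 4/9, 1/3]

# Check mu P = mu, then evaluate the entropy rate (4.5).
for j in range(3):
    assert abs(sum(mu[i] * P[i][j] for i in range(3)) - mu[j]) < 1e-12
H_rate = -sum(mu[i] * p * math.log2(p)
              for i in range(3) for p in P[i] if p > 0)
print(H_rate)   # 1.333... = 4/3 bits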

The transitions/symbol for entropy coding followed by redundancy coding and for probability-based coding is shown in Figure 4.6 for different block sizes. We can achieve a transition activity within 9% of the lower bound with a block size of 13 symbols. To summarize, the above examples demonstrate that, for the specific source and source distribution:

- We can achieve transition activities within 4% and 9% of the lower bound with block sizes of 8 symbols and 13 symbols, respectively (see Figure 4.5 and Figure 4.6).

- Transition activity can be reduced by coding larger and larger blocks of symbols (see Figure 4.5 and Figure 4.6).

- Entropy coding followed by redundancy coding does not always result in the minimum transition activity for a given block size.

4.4 Summary

In this chapter, we have derived lower and upper bounds on the average signal transition activity via an information-theoretic approach in which symbols generated by a process (possibly correlated) with entropy rate H are coded with an average of R bits per symbol. The bounds are asymptotically achievable if the process is stationary and ergodic. We have also presented a coding algorithm based on the Lempel-Ziv data compression algorithm to achieve the bounds. Bounds are also obtained on the expected number of 1's (or 0's). These results are applied to, 1.) determine the activity-reducing efficiency of different coding algorithms such as entropy coding, transition signaling, and Bus-Invert coding, and 2.) determine the lower bound on the power-delay product given H and R. Two examples are provided where transition activity within 4% and 9% of the lower bound is achieved when blocks of 8 symbols and 13 symbols, respectively, are coded at a time.


Chapter 5 LOW-POWER CODING SCHEMES

In the previous chapter, we provided lower bounds on the transition activity at the output of a source. Information-theoretic arguments indicate that these lower bounds can be approached via coding of the source. In this chapter, we present a source-coding framework for describing practical, low-power encoding schemes and then employ the framework to develop new encoding schemes. In the framework proposed here, a data source (characterized in a probabilistic manner) is first processed by a decorrelating function f1. Next, a variant of an entropy coding function, f2, is employed, which reduces the transition activity. All past work in this area [54, 55, 9, 56, 67, 57] can be shown to consist of special cases of this framework. The source-coding framework presented in this chapter is different from the coding scheme used to asymptotically achieve the lower bound, because the latter is impractical to implement.

The power dissipated at the I/O pads of an IC ranges from 10% to 80% of the total power dissipation, with a typical value of 50% for circuits optimized for low power [9]. The power dissipation at the I/O pads is high because off-chip busses have switching capacitances that are orders of magnitude greater than those internal to a chip. Power dissipation on these busses occurs mainly during signal transitions, and reducing the transitions will reduce total power dissipation. Therefore, various techniques have been proposed in the literature [54, 55, 69, 9, 56, 67, 57, 70] to encode data on a bus to reduce the average and the peak number of transitions.

Data transfers on microprocessor address busses are often sequential (i.e., the current data value equals the previous data value plus a constant increment) due to fetches of instructions and array elements. This property can be exploited to reduce transitions by the use of Gray codes

as proposed in [57]. The Gray code reduces the number of transitions for sequential accesses because consecutive code words differ in only one bit position. This code does not, however, significantly reduce transitions for data busses, because consecutive data values are typically not sequential. Another encoding scheme for address busses (denoted the T0 scheme) is presented in [54], in which, if the next address is greater than the current address by an increment of one, then the bus lines are not altered and an extra 'increment' bit is set to one. As mentioned before, this encoding scheme is not effective for data busses.

In [55], a scheme, termed Bus-Invert coding in [9], is presented to reduce the number of transitions on a bus. This scheme determines the number of bus lines that normally change state when the next output word is clocked onto the bus. When the number of transitions exceeds half the bus width, the output word is inverted before being clocked onto the bus. An extra output line is then employed to signal the inversion. The Bus-Invert coding scheme performs well when the data is uncorrelated; i.e., it assumes a decorrelating function precedes the Bus-Invert step. For uniformly distributed, uncorrelated data and a bus width of 8 bits, the reduction in transitions is approximately 18%. Bus-Invert coding also halves the peak number of transitions. This scheme does not, however, perform well for correlated data or for larger bus widths. It was proposed in [67] that larger busses be split into smaller busses and that correlated data be compressed in a lossless manner to achieve decorrelation. The power dissipation in the compression and uncompression steps, however, can be considerable, thereby offsetting any savings in power on the bus. Employing the source-coding framework in this chapter, we show that a two-step process involving the functions f1 and f2 is, in most cases, better (in terms of coding hardware overhead and reduction of transition activity) than compression followed by Bus-Invert coding.

One of the choices for the function f2 in our framework is similar to the scheme in [58], where signal samples having a higher probability of occurrence are assigned code words with fewer 'ON' bits. This scheme is suited for optical networks, where the power dissipation depends on the number of ON bits. In VLSI systems, however, power dissipation depends on the number of transitions rather than the number of ON bits.

Simulation results are presented employing the proposed encoding schemes, which indicate an average reduction in transition activity of 36% for one such encoding scheme for data streams. In addition, it is shown that a reduction in power dissipation occurs if the bus capacitance is

greater than 14pF/bit in 1.2μ CMOS technology. Simulation results with an encoding scheme for instruction address busses indicate an average reduction in transition activity by factors of 3 and 1.5 over the Gray and T0 [54] coding schemes respectively. We also present an adaptive scheme that monitors the statistics of the input and adapts its encoding scheme over time.

The rest of the chapter is organized as follows. In section 5.1, a framework for describing encoding schemes is presented. In section 5.2, we develop the proposed encoding schemes and examine the effect of adding redundancy to the encoding schemes, and in section 5.3 simulation results are presented to compare the performance of the proposed and existing schemes. The results of this chapter have appeared in [61, 71, 72].

5.1 Source-Coding Framework

In this section, we describe the components of a generic communications system and then develop the proposed framework.

5.1.1 A Generic Communications System

A generic communication system (see Figure 5.1) consists of a source coder, a channel coder, a noisy channel, a channel decoder, and a source decoder. The source coder (decoder) compresses (decompresses) the input data so that the number of bits required in the representation of the source is minimized. While the source coder removes redundancy, the channel coder adds just enough of it to combat errors that may arise due to the noise in the physical channel. This view of a communication system has been the basis for the enormous growth in the distinct areas of source coding, channel coding, and error-correcting codes.

Figure 5.1 A Generic Communication System

In the present context, we consider the bus between two chips as the physical channel, and the transmitter and receiver blocks to be a part of the pad circuitry, driving (in the case of the transmitting chip) or detecting (in the case of the receiving chip) the data signals. Furthermore, unlike in [17], we will assume here that the signal levels are sufficiently high so that the channel can be considered as being

noiseless. While this noiseless channel assumption is true for most systems today, it will not be the case for future systems, where the signal swings will be reduced to reduce power. The noiseless channel assumption allows us to eliminate the channel coder, resulting in the system shown in Figure 5.2. Here, we have an entropy coder at the transmitter, which compresses the source into a minimal representation, i.e., a representation requiring the minimum number of bits. This number is given by the entropy, H(X), [63] of the source X. In order to define the entropy, consider the source in Figure 5.2 to be a discrete source generating symbols from the set S_X = {X_0, X_1, ..., X_{L-1}} according to a probability distribution Pr(X). A measure of the information content of this source is its entropy H(X), given by (5.1), where P_i := Pr(X = X_i) for i = 0, ..., L-1 and H(X) is in bits.

H(X) = -\sum_{i=0}^{L-1} P_i \log_2(P_i) bits    (5.1)

Figure 5.2 A Generic Communication System for a Noiseless Channel

In practice, due to data dependencies and the constraint of having integer code word lengths, it becomes necessary to take data blocks {x(n), x(n-1), ..., x(n-M+1)} to create a 'supersymbol' and then apply entropy coding. Doing so, however, increases the coder hardware complexity exponentially with the block length M. To address this problem, low-complexity schemes (see Figure 5.3) involving a decorrelating function f1 followed by a scalar entropy coder f2 have been proposed. This is a sub-optimal structure if the function f1 is unable to remove all the data dependencies. If the function f1 is a linear predictor, then the structure shown in

Figure 5.4 is obtained, where F represents the vector of the impulse response of a linear filter. It can be seen that the linear predictor f1 removes all the linear dependencies in x(n); and the entropy coder, f2, then compresses the output of f1 in a lossless manner.

Figure 5.3 A Practical Communication System for a Noiseless Channel


Figure 5.4 Linear Prediction Configuration

5.1.2 The Source-Coding Framework

The proposed framework in Figure 5.5 is based upon the low-complexity source coder architecture (see Figure 5.4). As discussed before, the function f1 decorrelates the input x(n). Therefore, the prediction error, e(n), is a general function of the current value of x(n) and the prediction, x̂(n),

e(n) = f_1(x(n), x̂(n)),    (5.2)

where f1 could be a linear or a non-linear function of its arguments. The prediction, x̂(n), is a function of the past values of x(n),

x̂(n) = F(x(n-1), x(n-2), \ldots, x(n-M+1)).    (5.3)

For complexity reasons, we restrict ourselves to a value of M = 2. The function f2 employs a variant of entropy coding whereby, instead of minimizing the average number of bits at the output, it reduces the average number of transitions. The function f2 employs the error e(n) to generate an output, y (n), which has a `1' to indicate a transition and a `0' to indicate no transition. This code word is then passed through an XOR gate to generate the corresponding signal waveforms on the bus. Hence the output, y (n), of the encoder is given by,

y(n) = f_2(f_1(x(n), x̂(n))) \oplus y(n-1).    (5.4)

In summary, the function f1 decorrelates the input and, in the process, skews the input probability distribution so that f2 can reduce the transition activity by a memoryless mapping of e(n). The output, x(n), of the decoder is given by,

x(n) = f_3(f_2^{-1}(y(n-1) \oplus y(n)), x̂(n)),    (5.5)

where f3 is determined by the choice of f1. The function f2^{-1} assigns a level to the input based on an exclusive-or of the previous input y(n-1) and the current input y(n). The function f3 calculates x(n) employing the error, e(n), and the prediction, x̂(n).

Figure 5.5 Framework for Low Power Encoder and Decoder

A chip or a pad which both sends and receives data to and from a bus will need an encoder and a decoder. In order to implement the encoder and the decoder for a B-bit input, at most 4B delays and 2B exclusive-or gates are needed, in addition to the hardware required to implement the functions F, f1, f2, f2^{-1}, and f3. It is possible to reduce the hardware, depending on the actual choices for f1 and f2, by optimizing the encoder and the decoder each as a whole and also by sharing logic between the encoder and the decoder in a bidirectional pad.

In the next subsection, we propose practical choices for F, f1, and f2 and then evaluate the performance of encoding schemes employing different combinations of F, f1, and f2. Since the aim of the encoding is to reduce power dissipation, for a choice of F, f1, and f2 to be practical, the power dissipation at the encoder and decoder should be less than the savings achieved at the bus. This in turn implies that the amount of hardware in the encoder and decoder should be as small as possible. In addition, in systems where the latency for bus access has to be small, the delay through the encoder and the decoder has to be low.
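A behavioral model (ours, not the thesis's hardware) of the encoder/decoder pair defined by (5.4) and (5.5), shown for the simplest instantiation F = Identity, f1 = xor, f2 = Identity; a real f2 such as pbm or vbm would slot into the marked lines:

B = 8
MASK = (1 << B) - 1

def encode(xs):
    y_prev, x_prev = 0, 0
    for x in xs:
        e = x ^ x_prev                  # f1 = xor with x̂(n) = x(n-1)
        y = e ^ y_prev                  # (5.4), with f2 = Identity
        yield y & MASK
        y_prev, x_prev = y, x

def decode(ys):
    y_prev, x_prev = 0, 0
    for y in ys:
        e = y ^ y_prev                  # f2^{-1} = Identity
        x = e ^ x_prev                  # (5.5), with f3 = xor
        yield x & MASK
        y_prev, x_prev = y, x

data = [17, 17, 18, 250, 250, 3]
assert list(decode(encode(data))) == data   # lossless round trip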

5.1.3 Alternatives for F

5.1.3.1 Identity

The output of the Identity function is equal to the input:

Identity(a) = a    (5.6)

Hence, if Identity is employed for F, then x̂(n) = x(n-1). The Identity function requires no hardware to implement and is useful if the data source has significant correlation.

5.1.3.2 Increment

The output of the Increment function is equal to the input plus one:

Increment(a) = a + 1    (5.7)

Hence, if Increment is employed for F, then x̂(n) = x(n-1) + 1. The Increment function is useful if x(n) is the data on the address bus of a microprocessor because, due to fetches of instructions and array elements, the next address on the address bus usually equals the current address plus unity (or some power of 2). This function requires an incrementer each at the encoder and at the decoder.

5.1.4 Alternatives for f1

5.1.4.1 Exclusive-Or (xor)

The Exclusive-Or function, xor, is given by a bit-wise exclusive-or of the current input and the prediction:

xor(x(n), x̂(n)) = x(n) \oplus x̂(n)    (5.8)

An example of the xor function for a 3-bit input word is shown in Table 5.1. If the input, x(n), is B bits wide, then the xor function will require B exclusive-or gates at the encoder. It can be shown that if f1 is an xor function (at the encoder), then f3 is also an xor function (at the decoder). Hence, B exclusive-or gates are required to implement f3. If F is Identity, then the only difference between the encoder and the decoder is that the encoder employs f2, whereas the decoder employs f2^{-1}. Therefore, the hardware between the encoder and the decoder in a bidirectional pad can be shared as shown in Figure 5.6, by employing an additional control line into a multiplexer that selects either the output of f2 or that of f2^{-1}, depending on whether the input is to be encoded or decoded, respectively.

Figure 5.6 Encoder and Decoder can share hardware when f1 is xor and F is Identity

5.1.4.2 Difference-Based Mapping (dbm)

The Difference-Based Mapping, dbm, is described in Figure 5.7, where the dbm function returns the difference between x(n) and x̂(n), properly adjusted so that the output fits in the available B bits.

Table 5.1 Example of xor and dbm

x̂(n)  x(n)  xor(x(n), x̂(n))  dbm(x(n), x̂(n))
010   000   010              011
010   001   011              001
010   010   000              000
010   011   001              010
010   100   110              100
010   101   111              101
010   110   100              110
010   111   101              111

If the input, x(n), is B bits wide, then the dbm function will require 3 B-bit subtracters, B inverters, and B 4-to-1 multiplexers each at the encoder and at the decoder.

if (x(n) >= x̂(n) && 2x̂(n) >= x(n))
    dbm = 2x(n) - 2x̂(n);
else if (x(n) < x̂(n) && 2x̂(n) - x(n) < 2^B)
    dbm = 2x̂(n) - 2x(n) - 1;
else if (x̂(n) < 2^{B-1})
    dbm = x(n);
else
    dbm = 2^B - 1 - x(n);

Figure 5.7 dbm function

An example of dbm is shown in Table 5.1, where we see that the dbm output is 0 when the current and previous inputs are equal, and it increases as the distance (absolute difference) between the current input and the prediction increases. This is also shown graphically in Figure 5.8.


Figure 5.8 Example of Difference-Based Mapping (dbm)

In this chapter, we will use the audio, video, random, and ASCII data sets in Table 5.2 to provide simulation results. In Table 5.3, we compare signal correlation before and after applying xor and dbm to the data sets in Table 5.2, assuming F is the Identity function, i.e., x̂(n) = x(n-1).


Table 5.2 Description of data sets

Data  Description
A3    2.88MB of 16 bit PCM audio data (pop music)
A4    2.88MB of 16 bit PCM audio data (pop music)
A7    2.88MB of 16 bit PCM audio data (classical music)
CO    0.80MB of 16 bit communications channel data
V1    3.80MB of 8 bit video data (miss america)
V2    22.7MB of 8 bit video data (football)
V3    9.70MB (380 QCIF frames) of 8 bit video data (car phones)
R1    0.10MB of white, uniformly distributed data
PS    0.10MB Postscript file

Table 5.3 Correlation and Kullback Leibler distance before and after xor and dbm

      Correlation               Kullback Leibler distance (in bits)
Data  Orig.   xor     dbm      D(Orig.||unif.)  D(xor||unif.)  D(dbm||unif.)
A3    0.9628  0.1750  0.4752   1.17             2.76           3.25
A4    0.9712  0.1463  0.5484   2.05             4.03           4.52
A7    0.9922  0.1116  0.8079   3.20             5.28           5.80
CO    0.2952 -0.2350 -0.3430   1.64             1.28           1.38
V1    0.9672  0.2238  0.4537   2.25             3.80           4.18
V2    0.8747  0.2550  0.4470   1.05             2.20           2.56
V3    0.9199  0.2657  0.4007   0.64             2.54           2.93
R1    0.0012 -0.0013 -0.0023   0.00             0.00           0.00
PS    0.2367  0.2791  0.4954   3.75             2.72           2.44

The signal correlation, \rho, is given by,

\rho = \frac{E[(x(n) - \mu)(x(n-1) - \mu)]}{E[(x(n) - \mu)^2]},    (5.9)

where \mu is the mean of the signal x(n) and E[.] is the expectation operator. We see that there is a reduction in correlation after applying xor and dbm. The probability distribution at the output of xor and dbm for V2 data in Table 5.2 is shown in Figure 5.9, along with the original probability distribution. Both xor and dbm skew the original distribution for most of the data sets and hence enable f2 to reduce the number of transitions even more. The skew in the probability distributions at the output of f1 can be measured in terms of the Kullback Leibler distance [63] between the given probability distribution p and the uniform distribution q, and is given by (5.10). A higher Kullback Leibler distance from the uniform distribution indicates a more effective f1, because it indicates a more skewed distribution, which in turn would enable f2 to reduce the number of transitions even more.

D(p||q) = \sum_x p(x) \log_2 \left( \frac{p(x)}{q(x)} \right) bits    (5.10)


Figure 5.9 Probability distribution for V2 data before and after applying xor and dbm

5.1.5 Alternatives for f2

5.1.5.1 Invert (inv)

The inv function is given in Figure 5.10, where n1(.) returns the number of 1's in the binary representation of its argument. If the number of 1's in e(n) (or the number of transitions in x(n)) exceeds half the number of bus lines, then the input is inverted and the inversion is signaled using an extra bit. The function inv has been employed in Bus-Invert coding [9]. An example of the inv function is shown in Table 5.4. The hardware required to implement the inv function is described in [55, 9].

if (n1(e(n)) > B/2)
    inv = invert bits in e(n) and set invert bit to 1
else
    inv = do not invert bits in e(n) and set invert bit to 0

Figure 5.10 The function inv(e(n))

5.1.5.2 Probability-Based Mapping (pbm)

In the pbm function, the number of 1's in the input is reduced by assigning, as in [58], code words with fewer 1's to the more frequently occurring code words. We then map a '1' to a transition waveform and a '0' to a transitionless waveform using an exclusive-or gate. Thus, pbm satisfies (5.11), given below.

\forall (a, b): Pr(a) > Pr(b) \Rightarrow n_1(pbm(a)) \le n_1(pbm(b))    (5.11)

The probabilities in (5.11) can be computed using a representative data sequence. The function pbm reduces the number of 1's. Thus, if the most probable value of e(n) is a, then pbm(a) = 0. The next B most probable values of e(n) are mapped to 2^i (i = 0, ..., B-1) by pbm. The next \binom{B}{2} most probable values are mapped to all values with exactly two 1's, and so on. An example of pbm is shown in Table 5.4. The hardware required to implement pbm will depend on the input probability distribution. An 8-bit input typically requires approximately 800 gates each to implement pbm and pbm^{-1}. Since the hardware requirement of pbm can grow exponentially with the input bit-width, we can split wide busses into multiple narrow busses and apply pbm independently on each of the smaller busses.
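A behavioral sketch (ours; the thesis implements pbm as combinational logic) of building a pbm table that satisfies (5.11): sort the observed symbols by decreasing probability and the code words by increasing number of 1's, then pair them.

def build_pbm(freq, bits):
    # Symbols in decreasing order of measured frequency (a representative
    # data sequence supplies freq); code words in increasing popcount order.
    symbols = sorted(range(2**bits), key=lambda s: -freq.get(s, 0))
    codes = sorted(range(2**bits), key=lambda c: bin(c).count("1"))
    return {s: c for s, c in zip(symbols, codes)}

# Hypothetical 3-bit example: symbol 5 is most common, so pbm(5) = 000.
table = build_pbm({5: 90, 1: 50, 7: 30}, bits=3)
print(table[5], table[1], table[7])   # 0, then the single-1 codes 1 and 2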

5.1.5.3 Value-Based Mapping (vbm)

In Figure 5.9, we see that, after xor and dbm are applied, smaller values are generally more probable than larger values. This is especially true if f1 is dbm. We make use of this feature in vbm, in which code words with fewer 1's are assigned to smaller values; we then map a '1' to a transition waveform and a '0' to a transitionless waveform using an exclusive-or gate. The function vbm is such that it satisfies (5.12):

\forall (a, b): a < b \Rightarrow n_1(vbm(a)) \le n_1(vbm(b))    (5.12)

The function vbm makes the assumption that smaller values are more probable than larger values. An example of vbm is shown in Table 5.4. The advantage of vbm over pbm is that a representative data sequence is not needed. The reduction in transitions with vbm, however, is usually lower than with pbm. Note that the function vbm can be generated algorithmically by realizing that, if there are i ones in vbm(a), then bit k-1 of vbm(a) is 1 if a - \sum_{j=0}^{i-1} \binom{k}{j} \ge \binom{k-1}{i}. Once bit k-1 of vbm(a) is determined, the lower-order bits can be similarly determined in an iterative fashion. The number of 1's, i, in vbm(a) can be determined by finding the i such that (5.13) is satisfied:

\sum_{j=0}^{i-1} \binom{k}{j} \le a < \sum_{j=0}^{i} \binom{k}{j}    (5.13)

The algorithm to implement vbm is shown in Figure 5.11.

Determine i, the number of 1's in vbm(a), using (5.13);
val = a - \sum_{j=0}^{i-1} \binom{k}{j} + 1;
for (j = k; j > 0; j--)
    if (val <= C(j-1, i))
        set bit j of vbm to 0;
    else {
        set bit j of vbm to 1;
        val = val - C(j-1, i);
        i--;
    }

Figure 5.11 The algorithm to implement vbm(a)

This algorithm can also be employed to generate a j-limited-weight code [67] for an arbitrary j. A similar algorithm is employed at the decoder. The algorithm in Figure 5.11 requires k clocks, where k is the number of bits in the output y(n). The binomial coefficients C(i, j) can be pre-computed; (k-2)(k-4)/4 binomial coefficients need to be pre-computed. The function vbm can be implemented either algorithmically, employing the algorithm in Figure 5.11, or by employing combinational logic. An 8-bit input typically requires 700 gates each to implement vbm and vbm^{-1}.
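A direct software transcription (ours) of the Figure 5.11 procedure, using math.comb for the binomial coefficients:

from math import comb

def vbm(a, k):
    # Number of 1's, i, from (5.13): smallest i with a < sum_{j<=i} C(k,j).
    i, cum = 0, comb(k, 0)
    while a >= cum:
        i += 1
        cum += comb(k, i)
    val = a - (cum - comb(k, i)) + 1    # a - sum_{j=0}^{i-1} C(k,j) + 1
    out = 0
    for j in range(k, 0, -1):           # bits k down to 1, as in Figure 5.11
        if val > comb(j - 1, i):
            out |= 1 << (j - 1)
            val -= comb(j - 1, i)
            i -= 1
    return out

# Weight is monotone in a, as (5.12) requires.
ws = [bin(vbm(a, 3)).count("1") for a in range(8)]
assert ws == sorted(ws)                 # [0, 1, 1, 1, 2, 2, 2, 3]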


Table 5.4 Example of inv, pbm, and vbm functions

a = x(n) \oplus y(n-1)  Pr(a)   inv(a)  pbm(a)  vbm(a)
000                     0.2864  0000    000     000
001                     0.1766  0001    010     001
010                     0.2201  0010    001     010
011                     0.1712  1100    100     100
100                     0.0573  0100    011     011
101                     0.0344  1010    110     110
110                     0.0259  1001    101     101
111                     0.0280  1000    111     111

Table 5.5 Encoding Schemes

Encoding scheme        F          f1   f2
xor-pbm                Identity   xor  pbm
xor-vbm                Identity   xor  vbm
dbm-pbm                Identity   dbm  pbm
dbm-vbm                Identity   dbm  vbm
xor-inv (Bus-Invert)   Identity   xor  inv
dbm-inv                Identity   dbm  inv
inc-xor                Increment  xor  Identity

5.2 Encoding Schemes

In this section, we present reduced-transition-activity encoding schemes based upon the alternatives for the functions F, f1, and f2 defined in the previous section. The proposed encoding schemes are summarized in Table 5.5, where we have 7 encoding schemes using different combinations of F, f1, and f2. The xor-pbm scheme reduces the number of transitions by assigning fewer transitions to the more frequently occurring sets of transitions in the original signal. The closest approach in the literature to xor-pbm is [58], where signal samples having a higher probability of occurrence are assigned code words with fewer ON bits. In VLSI circuits, power dissipation depends on the number of transitions occurring at the capacitive nodes of the circuit. The xor-pbm scheme differs from [58] in two respects. It reduces the power dissipation at the capacitive nodes by reducing the number of transitions, assigning fewer transitions to the more frequently occurring sets of transitions. The xor-pbm scheme also achieves a greater reduction in transitions by skewing the input probability distribution by employing xor for f1. The xor-vbm scheme

has the advantage over xor-pbm of being an input-independent mapping and requiring less hardware to implement, at the cost of a typically smaller reduction in transitions. The scheme dbm-pbm requires more hardware than xor-pbm but also reduces transitions more, because the function dbm skews the input probability distribution more than xor.

The framework in Figure 5.5 can be employed to derive and improve existing coding schemes. For example, the Gray coding scheme can be derived by assigning f1(x(n), x̂(n)) = x(n), letting f2 be the Gray mapping, and removing the exclusive-or at the output of the encoder. The Bus-Invert [9] coding scheme can be obtained by assigning f1 = xor and f2 = inv. A variant of the Bus-Invert coding scheme can be obtained by employing dbm instead of xor for f1, resulting in the dbm-inv scheme. The scheme in [58] can be derived by assigning f1(x(n), x̂(n)) = x(n), f2 = pbm, and removing the exclusive-or at the output of the encoder. An improved version of the T0 scheme in [54] can be derived using our framework by assigning F = Increment (with overflow being ignored), f1 = xor, and f2 = Identity. This improved scheme, called inc-xor, is shown in Figure 5.12. Unlike the T0 scheme, the inc-xor scheme does not require an extra bit and, as shown in section 5.3, has a shorter critical path and provides the same or more reduction in transitions.

Figure 5.12 Encoder and Decoder for inc-xor
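A behavioral sketch (ours; the thesis builds this from an incrementer and XOR gates) of the inc-xor encoder and decoder: an in-sequence address yields e(n) = 0 and hence zero bus transitions. The 8-bit width here is arbitrary.

B = 8
MASK = (1 << B) - 1

def incxor_encode(addrs):
    y, x_prev = 0, 0
    for x in addrs:
        e = x ^ ((x_prev + 1) & MASK)   # prediction x̂(n) = x(n-1) + 1
        y ^= e                          # (5.4); bus transitions = 1's in e
        yield y
        x_prev = x

def incxor_decode(ys):
    y_prev, x_prev = 0, 0
    for y in ys:
        x = (y ^ y_prev) ^ ((x_prev + 1) & MASK)
        yield x
        y_prev, x_prev = y, x

seq = [100, 101, 102, 200, 201]
assert list(incxor_decode(incxor_encode(seq))) == seq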

The reduction in transitions due to a coding scheme can be increased by introducing either spatial redundancy, in the form of extra lines on the bus, or temporal redundancy, in the form of extra clocks to transfer the data, or both [56]. For instance, if an extra bit is added to the bus, f2 can use the extra bit to further reduce transitions. In the function pbm, the most probable value is assigned 0. Assuming the bus is B+1 bits wide, the next B+1 most probable values are assigned 2^i (i = 0, ..., B), and so on. The maximum number of transitions on the bus is then less than or equal to B/2, resulting in a perfect B/2-limited-weight code [9].

We can employ the probabilistic model in [54] to estimate the transition activity of an address stream after encoding. This model employs a parameter, q, the probability of having two consecutive addresses on the bus in two successive clock cycles. In order to simplify the analysis, this

model assumes that K lines make a transition on average when two non-consecutive addresses are issued on the bus. In general, the value of q will depend on the source generating the data on the bus, while the value of K will depend both on the source and the encoding scheme employed. With this model, the average number of transitions produced by the Unsigned, Gray, T0, and inc-xor codes can be calculated as,

T_{Unsigned} = (1 - q) K_{Unsigned} + 2q,    (5.1)

T_{Gray} = (1 - q) K_{Gray} + q,    (5.2)

T_{inc-xor} = (1 - q) K_{inc-xor},    (5.3)

T_{T0} = (1 - q) K_{T0} + 2q(1 - q).    (5.4)
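For a quick comparison (ours), evaluating (5.1)-(5.4) with a common K:

# Evaluate (5.1)-(5.4) with K_Unsigned = K_Gray = K_incxor = K_T0 = K, as in
# Figure 5.13; q is the probability of an in-sequence address.
def T_codes(q, K):
    return {"Unsigned": (1 - q) * K + 2 * q,
            "Gray":     (1 - q) * K + q,
            "inc-xor":  (1 - q) * K,
            "T0":       (1 - q) * K + 2 * q * (1 - q)}

print(T_codes(q=0.9, K=4.0))   # inc-xor lowest; T0 beats Gray for q > 1/2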

In Figure 5.13, we plot T_Unsigned, T_Gray, T_T0, and T_inc-xor as a function of the probability q, assuming K_Unsigned = K_Gray = K_inc-xor = K_T0 = K. We see that inc-xor always has the least transition activity, Unsigned always has the highest transition activity, and T0 is better than Gray only for q > 1/2.

Figure 5.13 Analytical Estimate of Transition Activity for Unsigned, Gray, T0, and inc-xor coding schemes

5.3 Simulation Results

In this section, we present simulation results for the reduction in transition activity and power dissipation using the proposed encoding schemes, and we compare them with results obtained by existing schemes. In order to compare coding schemes, we classify them into two groups, based on whether they are more suited for data busses or for address busses. For data busses, we compared the reduction in transition activity due to xor-pbm, xor-vbm, dbm-pbm, dbm-vbm,

and Bus-Invert, and the reduction in power dissipation due to xor-pbm and Bus-Invert. For address busses, we compared the reduction in transition activity and power dissipation due to T0, Gray, and inc-xor.

Table 5.6 Percentage Reduction in Transition Activity

                                          xor-pbm with
Data  xor-pbm  xor-vbm  dbm-pbm  dbm-vbm  pbm opt. for  Redn.  Adapt. scheme
A3    33       25       45       39       A3            33     25
A4    38       29       29       40       A3            37     29
A7    42       33       31       2        A3            39     33
CO    27       4        25       38       A3            20     4
V1    40       34       28       45       V2            39     40
V2    33       26       28       39       V2            33     33
V3    34       26       36       40       V2            33     33
R1    2        0        41       0        V2            0      0
PS    43       23       45       25       PS            43     43

5.3.1 Reduction In Transition Activity

The reduction in transition activity when the schemes in Table 5.5 are applied to the data sets in Table 5.2 is shown in Table 5.6. The xor-vbm scheme, as expected, results in a slightly smaller reduction in transitions than xor-pbm. We can achieve an average reduction in transitions of 36% for audio data employing xor-pbm with pbm optimized for A3 data, and an average reduction of 35% for video data employing pbm optimized for V2 data. Hence, pbm optimized for one video/audio sequence performs well for other video/audio sequences, indicating the robustness of these schemes to variations in signal statistics. There is little change in transitions for uniformly distributed, uncorrelated data (R1 data). This occurs because x(n) and x̂(n) are uncorrelated, and hence all values of f1(x(n), x̂(n)) are equally probable. Therefore f2 does not reduce the total number of transitions for R1 data. The dbm-pbm scheme reduces the transition activity around 5 percentage points more than xor-pbm, because dbm skews the input probability distribution more than xor.

The reduction in transition activity with 1 bit of spatial redundancy is shown in Table 5.7. For audio and video data, assuming 1 bit of redundancy, xor-pbm performs better than the Bus-Invert scheme in [9] or the Bus-Invert-with-compression (gzip) scheme in [56]. For R1 data, xor-pbm does approximately the same as Bus-Invert, which is optimal for the given redundancy for R1 data.

Table 5.7 Percentage Reduction in Transition Activity With 1 Bit Redundancy

                                                   Bus-    Bus-Inv.  xor-pbm with
Data  xor-pbm  xor-vbm  dbm-pbm  dbm-vbm  Inv.  dbm-inv  & gzip    pbm opt. for  Redn.  Adapt. scheme
A3    36       32       39       39       8     12       2         A3            36     32
A4    40       36       43       43       7     13       5         A3            40     36
A7    43       36       47       47       7     15       10        A3            41     36
CO    32       22       33       32       15    16       17        A3            27     22
V1    42       39       47       46       7     32       4         V2            41     42
V2    38       34       43       43       10    22       0         V2            38     38
V3    38       34       44       43       11    27       6         V2            38     38
R1    19       18       19       18       18    18       18        V2            18     18
PS    47       33       42       36       11    14       63        PS            47     46

Compression followed by Bus-Invert is better for ASCII files because gzip is very effective in compressing ASCII files. It is possible to combine gzip with xor-pbm to obtain a 63% reduction in transitions for the ASCII data. The performance of xor-vbm with redundancy is only slightly worse than that of xor-pbm with redundancy. In addition, dbm-inv does better than Bus-Invert for all the data sets except R1 data, for which the performance is approximately equal. The advantage of the dbm-inv scheme over Bus-Invert increases as the data to be encoded becomes more correlated.

In Table 5.8 we report the average word-level transition activity when the Unsigned, Gray, T0, and inc-xor encoding schemes are used to encode the addresses generated when different benchmark programs are executed on the SGI Power Challenge with a MIPS R10000 processor. As in [54], we consider three cases:

- Transitions on the instruction address bus (I),
- Transitions on the data address bus (D), and
- Transitions on a multiplexed address bus (M).

Since the addresses of successive instruction and data elements on the MIPS processor differ by 4, the encoding schemes encode only the most significant 30 bits. We ignore the transitions on the least significant 2 bits, since they are usually zero. The greatest reduction in transition activity employing inc-xor is observed for instruction address streams, because the probability of addresses being sequential is highest for such streams. We also see that inc-xor is better than T0 for 20 of the 24 streams. This is because inc-xor does not use an extra 'increment' bit, and hence saves the transitions on that bit while providing a similar reduction in transitions on the other bits.

Table 5.8  Average Word-Level Transition Activity for Real Addresses

Type  Program       Stream      Transition Activity              q         K
                    Length      Unsigned Gray   T0   inc-xor          Unsigned  Gray    T0   inc-xor
I     gzip          11722992      2.09   1.32  0.65   0.38      0.89     3.86    4.05   4.08   3.59
I     MPEG decoder  11481528      2.15   1.18  0.41   0.29      0.93     4.20    3.69   4.01   4.23
I     espresso       8455329      2.18   1.24  0.50   0.35      0.92     4.74    4.08   4.55   4.62
I     gunzip         3148893      2.22   1.23  0.51   0.36      0.92     4.65    3.92   4.52   4.70
I     postgres       2378798      2.28   1.39  1.02   0.54      0.84     3.66    3.37   4.22   3.32
I     ghostview     10884925      2.32   1.42  0.93   0.60      0.86     4.40    3.99   4.58   4.23
I     gcc             413896      2.34   1.44  1.01   0.70      0.81     3.86    3.39   3.45   3.80
I     gnuplot        1506882      2.44   1.46  0.86   0.66      0.89     5.90    5.07   5.58   5.83
D     gcc             175684      4.34   3.46  4.33   3.94      0.40     6.50    5.19   6.13   6.56
D     espresso       6544671      5.60   3.91  5.65   5.13      0.38     8.65    5.90   8.05   8.31
D     postgres        784467      6.42   5.56  6.41   6.40      0.08     7.18    6.30   6.86   6.96
D     gunzip          818875      6.46   6.02  6.43   6.56      0.05     6.92    6.50   6.68   6.89
D     MPEG decoder   3518525      6.92   6.46  6.91   7.06      0.08     7.68    7.27   7.37   7.65
D     gzip           3277008      6.97   6.48  6.96   6.91      0.06     7.47    7.03   7.34   7.38
D     ghostview      4115073      7.13   5.70  7.09   7.14      0.12     8.34    6.81   7.91   8.13
D     gnuplot         662513      7.82   6.27  7.79   7.81      0.11     9.18    7.38   8.59   8.79
M     gzip          15000000      5.34   5.38  4.88   4.37      0.50     8.74    9.70   8.91   8.69
M     gcc             589578      5.61   4.89  5.34   4.73      0.49     9.35    8.66   9.34   9.32
M     gunzip         3967768      5.70   4.98  4.89   4.58      0.55    10.12    9.82  10.10  10.14
M     MPEG decoder  15000000      6.68   5.45  5.91   5.64      0.50    11.30    9.88  11.23  11.27
M     postgres       3163265      7.55   6.92  6.97   6.61      0.45    11.99   11.76  11.98  12.00
M     gnuplot        2169395      7.55   7.03  7.14   6.82      0.38    11.03   10.74  10.97  11.01
M     ghostview     15000000      8.06   7.35  7.60   7.22      0.40    12.07   11.58  12.08  12.04
M     espresso      15000000      8.09   5.38  8.11   7.49      0.40    12.33    8.24  12.37  12.38

other bits. Of the four encoding schemes, we see that inc-xor provides the greatest reduction in transition activity for all the I-bus streams and for 6 of the 8 streams on the M-bus. The Gray encoding scheme provides the greatest reduction in transition activity for all the D-bus streams and for 2 of the 8 streams on the M-bus. The analytical estimate in Figure 5.13 matches the measured data in Table 5.8 well: for the I-bus, which has a high value of q (the probability that consecutive addresses are sequential), the transition activity is in the descending order Unsigned, Gray, T0, inc-xor. For the D-bus, which has a low value of q, the Gray code is better than inc-xor because K_Gray is typically less than K_inc-xor. The values of K_Unsigned, K_Gray, K_T0, K_inc-xor, and q for the benchmark programs are also shown in Table 5.8.
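The following sketch illustrates the comparison in Table 5.8 on a synthetic, mostly sequential address stream. The inc-xor rule used here (XOR the current address with the previous address plus the expected stride, so that sequential accesses produce an all-zero word) is a simplified reading of the scheme, and the stream statistics are illustrative rather than taken from the benchmarks.

```python
# Compare average word-level transition activity for Unsigned, Gray, and
# inc-xor encodings on a synthetic instruction-style address stream
# (stride 1 after dropping the two always-zero LSBs, with local branches).

import random

MASK = (1 << 30) - 1

def gray(x):
    return x ^ (x >> 1)

def activity(words):
    """Average Hamming distance between consecutive bus words."""
    t = sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))
    return t / (len(words) - 1)

random.seed(0)
addrs, a = [], 1 << 12
for _ in range(10000):
    if random.random() < 0.9:
        a += 1                                 # sequential fetch (q ~ 0.9)
    else:
        a += random.randrange(-1024, 1024)     # occasional local branch
    a &= MASK
    addrs.append(a)

unsigned = addrs
gray_enc = [gray(x) for x in addrs]
incxor   = [x ^ ((p + 1) & MASK) for p, x in zip([0] + addrs, addrs)]

for name, ws in [("Unsigned", unsigned), ("Gray", gray_enc), ("inc-xor", incxor)]:
    print(f"{name:8s} {activity(ws):.2f}")
```

With high q the inc-xor stream is almost entirely zero words, so its activity is lowest, reproducing the I-bus ordering described above; lowering q in the model shifts the advantage toward Gray coding, as on the D-bus.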


5.3.2 Reduction In Power Dissipation

The original, uncoded power dissipation at the bus is given by (5.1), where T_x is the word-level transition activity of the input, C_L is the bus capacitance per bit, V_dd is the supply voltage, and f is the clock frequency.

    P_{D,uncoded} = T_x C_L V_{dd}^2 f                                    (5.1)

For a given encoding and decoding scheme, we employed (5.2) to calculate the power dissipation as the sum of the power dissipation at the encoder, the bus, and the decoder.

    P_{D,coded} = P_{D,enc} + P_{D,bus} + P_{D,dec}                       (5.2)

In (5.2), the power dissipation P_{D,enc} occurs in the transmitting chip and P_{D,dec} occurs in the receiving chip. Thus the power dissipation in the transmitting and receiving chips is increased in order to reduce the total power dissipation. If T_y is the reduced word-level transition activity at the bus after encoding, then the total power dissipation is given by (5.3).

    P_{D,coded} = P_{D,enc} + T_y C_L V_{dd}^2 f + P_{D,dec}              (5.3)

We employed SIS [73] to generate the net-lists for the encoder and the decoder for xor-pbm and Bus-Invert, and PSpice [74] to estimate the power dissipation in the encoder, P_{D,enc}, and the decoder, P_{D,dec}. The encoder and the decoder for xor-pbm are shown in Figure 5.14. We employed 1.2 μm CMOS technology with a 3.3 V supply voltage and a 20 MHz clock frequency. The area-delay-power information is presented in Table 5.9. The clock frequency of 20 MHz was chosen so as to accommodate the slowest critical path of 40 ns in the xor-pbm coder. In Figure 5.15 we plot the power dissipation for different bus capacitances, C_L. The power dissipation was estimated using 30 samples of V2 input. Since the input is highly correlated, we see that for small bus capacitances (

A.1 Proof of Lemma 1

Case 1: y > 0: Note that x cannot be 0 or 1, since in either case H(x) = 0, which would violate y ≤ H(x).

For 1/2 ≤ x < 1:

      H(1 - H^{-1}(y)) ≤ H(x)           [Since y = H(H^{-1}(y)) = H(1 - H^{-1}(y)) and y ≤ H(x)]
    ⇒ x ≤ 1 - H^{-1}(y)                 [Property 6: H(·) is monotonically decreasing in [1/2, 1)]
    ⇒ H^{-1}(y) ≤ x ≤ 1 - H^{-1}(y)     [Since 0 < H^{-1}(y) ≤ 1/2 and 1/2 ≤ x < 1]

For 0 < x < 1/2:

      H^{-1}(y) ≤ x                     [Since y ≤ H(x) and H(·) is increasing in (0, 1/2]]
    ⇒ H^{-1}(y) ≤ x < 1 - H^{-1}(y)     [Since 0 < H^{-1}(y) ≤ 1/2 and 0 < x < 1/2]

Case 2: y = 0: Since H^{-1}(y) = 0 and 0 ≤ x ≤ 1, (4.1) is satisfied.

Hence the proof. □

A.2 Proof of Lemma 2

From the definitions of entropy rate in (4.3) and ℋ in the statement of Lemma 2,

    ℋ = lim_{n→∞} (1/n) H(B_1, B_2, ..., B_n)
      ≤ lim_{n→∞} (1/n) Σ_{i=1}^{n} H(B_i)       [Independence bound on entropy]
      = lim_{n→∞} (1/n) Σ_{i=1}^{n} H(p_i)       [p_i = Pr(B_i = 1)]
      ≤ H( lim_{n→∞} (1/n) Σ_{i=1}^{n} p_i )     [Property 3: Jensen's inequality and concavity of H]
    ⇒ ℋ ≤ H(p̂)                                   [By definition, p̂ = lim_{n→∞} (1/n) Σ_{i=1}^{n} p_i]

Thus, we can substitute ℋ for y and p̂ for x in Lemma 1 to obtain (4.2). □
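The bound in (4.2) is easy to evaluate numerically. The sketch below computes the binary entropy function H(p) and its inverse H^{-1}(y) restricted to [0, 1/2] by bisection, which is all that is needed to evaluate H^{-1}(ℋ) ≤ p̂ ≤ 1 - H^{-1}(ℋ); the sample entropy rate is illustrative.

```python
# Numeric helpers for Lemma 2: binary entropy H(p) and its inverse on [0, 1/2].

import math

def H(p):
    """Binary entropy in bits; H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def H_inv(y, tol=1e-12):
    """The unique p in [0, 1/2] with H(p) = y, found by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if H(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

entropy_rate = 0.5  # bits per bit, illustrative
lo = H_inv(entropy_rate)
print(f"H^-1({entropy_rate}) = {lo:.4f}  ->  {lo:.4f} <= p_hat <= {1 - lo:.4f}")
```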

A.3 Proof of Asymptotic Achievability of Lemma 2

We now present a coding algorithm, referred to as L2, that asymptotically achieves the lower bound in Lemma 2 for stationary and ergodic processes.

(1) We encode each sequence of n symbols employing n code bits. We can do this because the source alphabet consists of only 2 symbols. The Asymptotic Equipartition Property (AEP) [63] states that, given a stationary and ergodic process, for each ε_1 > 0 there exists n_1 such that for all n > n_1 the following properties hold:

(a) there is a set, called the typical set, A^{(n)}_{ε_1}, which is a subset of the set of all possible sequences of n symbols generated by the process,

(b) the number of elements in A^{(n)}_{ε_1}, |A^{(n)}_{ε_1}|, is bounded by

    (1 - ε_1) 2^{n(ℋ - ε_1)} ≤ |A^{(n)}_{ε_1}| ≤ 2^{n(ℋ + ε_1)},          (A.1)

(c) the probability that a sequence of n symbols generated by the process lies in A^{(n)}_{ε_1} is at least 1 - ε_1, and

(d) ε_1 → 0 as n → ∞.

In short, the AEP states that as the sequence length n increases, the probability that a generated sequence belongs to A^{(n)}_{ε_1} approaches unity, and the size of the typical set, |A^{(n)}_{ε_1}|, approaches 2^{nℋ}.

(2) We generate a set, C^{(n)}_{ε_2}, of codewords. Each codeword in C^{(n)}_{ε_2} is formed by drawing n code bits in an independent, identically distributed manner with probability p of being a `1'. Again, from the AEP, we know that the set C^{(n)}_{ε_2} will contain at least (1 - ε_2) 2^{n(H(p) - ε_2)} distinct codewords. We choose p such that the number of codewords in C^{(n)}_{ε_2} is at least the number of sequences in A^{(n)}_{ε_1}, i.e.,

    (1 - ε_2) 2^{n(H(p) - ε_2)} = 2^{n(ℋ + ε_1)}
    ⇒ p = H^{-1}( ℋ + ε_1 + ε_2 - log_2(1 - ε_2)/n ).

As n → ∞, both ε_1 and ε_2 → 0, and p → H^{-1}(ℋ).

(3) We assign codewords to sequences in A^{(n)}_{ε_1} from the set C^{(n)}_{ε_2}.

(4) After each sequence in A^{(n)}_{ε_1} has been assigned a codeword from C^{(n)}_{ε_2}, sequences not in A^{(n)}_{ε_1} are assigned codewords in an arbitrary manner from the remaining codewords (which may or may not be in C^{(n)}_{ε_2}).

As n → ∞, the probability of a `1' at the output of the L2 encoder is at most ((1 - ε_1) p n + ε_1 n)/n, where:

(1) the probability of a sequence being in A^{(n)}_{ε_1} is at least 1 - ε_1 (AEP),

(2) as n → ∞, the number of 1's in a codeword encoding a sequence from A^{(n)}_{ε_1} is pn (strong law of large numbers),

(3) the probability of a sequence not being in A^{(n)}_{ε_1} is at most ε_1 (AEP), and

(4) the number of 1's in a codeword encoding a sequence not from A^{(n)}_{ε_1} is at most n (since the length of a codeword is n).

Hence as n → ∞, the probability of a `1' at the output of the L2 encoder is p, or H^{-1}(ℋ), thereby achieving the lower bound in Lemma 2. The L2 coding algorithm can be modified to achieve the upper bound by exchanging 1's and 0's. □
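The codebook construction in step (2) is straightforward to mimic in simulation. The sketch below draws distinct n-bit codewords i.i.d. with bias p and confirms that the fraction of 1's across the codebook is close to p; the typical-set bookkeeping of the proof is omitted and the parameters are illustrative.

```python
# Sketch of the L2 codebook: i.i.d. biased codewords, kept distinct.

import random

random.seed(0)
n, p = 64, 0.11          # p would be H^{-1}(entropy rate) in the construction
num_codewords = 4096

codebook = set()
while len(codebook) < num_codewords:
    w = tuple(1 if random.random() < p else 0 for _ in range(n))
    codebook.add(w)      # duplicates are discarded, keeping codewords distinct

ones = sum(sum(w) for w in codebook)
print(f"fraction of 1's across codebook: {ones / (num_codewords * n):.3f} "
      f"(target {p})")
```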

A.4 Proof of Theorem 1

Outline of proof: We will first prove that the entropy rate is not altered in the transition domain, where a `1' represents a transition and a `0' represents no transition. We will then employ Lemma 2 to bound the number of 1's in the transition domain, which will in turn bound the number of transitions.

Proof: Consider any uniquely decodable coding scheme that codes the first N symbols, represented by the random variables (X_1, X_2, ..., X_N), generated by the process. The scheme codes the symbols, independently of any other symbols, into the n = NR bits represented by the binary random variables B_1, B_2, ..., B_n, and transmits these n bits in any order. Clearly, as N (or n) → ∞, we are considering all possible uniquely decodable coding schemes that encode the process employing an expected number of R bits per symbol; hence the bounds obtained as n → ∞ will hold for all such schemes. For the bits to be uniquely decodable, the first N symbols must be some function f of the n bits they are coded to, i.e.,

    (X_1, X_2, ..., X_N) = f(B_1, B_2, ..., B_n).                          (A.2)

The entropy rate of the process {B_i} satisfies

    lim_{n→∞} (1/n) H(B_1, B_2, ..., B_n)
      ≥ lim_{n→∞} (1/n) H(f(B_1, B_2, ..., B_n))
        [Joint entropy of a collection of random variables ≥ joint entropy of a function of the collection]
      = lim_{N→∞} (1/(NR)) H(X_1, X_2, ..., X_N)
      = ℋ/R
    ⇒ lim_{n→∞} (1/n) H(B_1, B_2, ..., B_n) ≥ ℋ/R.                         (A.3)

Define a function g on (B_1, B_2, ..., B_n) as follows:

    (C_1, C_2, ..., C_n) = g(B_1, B_2, ..., B_n),                          (A.4)

[Figure A.1: Relation between B_i, B_{prev(i)}, and C_i]

where

    C_i = B_i ⊕ B_{prev(i)},                                               (A.5)

where the function prev(i) returns the index of the bit that is transmitted on the same wire as B_i and immediately precedes B_i. If B_i is the first bit transmitted on the wire, then B_{prev(i)} is `0'. Hence we can recursively compute (B_1, B_2, ..., B_n) given (C_1, C_2, ..., C_n) as follows:

    B_i = C_i ⊕ B_{prev(i)}.                                               (A.6)

Figure A.1 shows the relation between B_i, B_{prev(i)}, and C_i. The two delays in Figure A.1 are initialized to the same state so that C_i can be generated uniquely given B_i, and vice versa, making the function g bijective. Note that B_i and B_{prev(i)} are transmitted on the same wire. If B_j is transmitted on a different wire from B_i, then we have another set of delays and exclusive-or gates for the generation of C_j from B_j and B_{prev(j)}. Since g is an invertible function,

    lim_{n→∞} (1/n) H(C_1, C_2, ..., C_n) = lim_{n→∞} (1/n) H(B_1, B_2, ..., B_n).    (A.7)

From (A.3) and (A.7) we get

    lim_{n→∞} (1/n) H(C_1, C_2, ..., C_n) ≥ ℋ/R.                           (A.8)

Let p_c = lim_{n→∞} (1/n) Σ_{i=1}^{n} C_i, where p_c is the probability of C_i being a `1'. The limit exists because of assumption (4) in the theorem. Since C_i is a binary random variable, p_c n is the number of 1's in (C_1, C_2, ..., C_n) for large n. Substituting p_c for p̂ and ℋ/R for ℋ in Lemma 2, we obtain

    H^{-1}(ℋ/R) ≤ p_c ≤ 1 - H^{-1}(ℋ/R).                                   (A.9)

Since there are on the average R bits per symbol and C_i is `1' if and only if there was a transition at B_i,

    T = p_c R.                                                             (A.10)

Multiplying (A.9) by R and employing (A.10), we have

    H^{-1}(ℋ/R) R ≤ T ≤ (1 - H^{-1}(ℋ/R)) R,                               (A.11)

which is the desired result. □
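The bounds (A.11) can be evaluated directly. The sketch below (using the same bisection-based H^{-1} as in the earlier sketch, redefined here to stay self-contained) prints the admissible range of transitions per symbol T for a few redundancy levels R and an illustrative entropy rate; note how increasing R lowers the achievable lower bound.

```python
# Numeric illustration of the Theorem 1 bounds (A.11).

import math

def H(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def H_inv(y, tol=1e-12):
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if H(mid) < y else (lo, mid)
    return (lo + hi) / 2

entropy_rate = 4.0                 # bits/symbol, illustrative
for R in (4, 6, 8, 12, 16):        # average code bits per symbol
    q = H_inv(entropy_rate / R)
    print(f"R={R:2d}  {q * R:6.3f} <= T <= {(1 - q) * R:6.3f}")
```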

A.5 Proof of Asymptotic Achievability of Theorem 1

We now present a coding algorithm, T1, that asymptotically achieves the lower bound on transition activity in Theorem 1 for stationary and ergodic processes. The T1 coding algorithm is similar to the L2 algorithm, with the differences being that the source is not binary and that R is no longer restricted to being 1. We provide an outline of the proof of asymptotic optimality of the T1 algorithm; the detailed proof is similar to the proof of asymptotic optimality of the L2 algorithm.

(1) We first group blocks of k symbols from the source. For large k there are 2^{kℋ} distinct blocks of symbols, all of which are equally likely (AEP).

(2) We code each block of k symbols employing kR bits. As in the L2 algorithm, each bit in a codeword encoding a block of symbols is chosen in an i.i.d. manner with probability H^{-1}(ℋ/R) of being a 1. Thus there are 2^{kR·H(H^{-1}(ℋ/R))} = 2^{kℋ} codewords (AEP). Hence we have a codeword for each block of symbols. Since each bit in each codeword has probability H^{-1}(ℋ/R) of being a 1, the bit-level probability of a 1 at the output of the T1 encoder is also H^{-1}(ℋ/R).

(3) In the last step we map a `1' to a transition waveform and a `0' to a transitionless waveform. Thus the bit-level transition activity is H^{-1}(ℋ/R), which is then scaled by R to achieve the lower bound on transitions per symbol. □

A.6 Proof of Asymptotic Optimality of the MLZ Algorithm

Outline of proof: We first prove the asymptotic optimality of another algorithm, MMLZ. The asymptotic optimality of the MLZ algorithm is then obtained by showing that MLZ is better than or equal to the MMLZ algorithm.

Proof: The modified MLZ algorithm (MMLZ) is identical to the MLZ algorithm except for pass 2 in sub-section III(A), where the bits in the prefix are chosen, not in order of increasing number of 1's, but in an i.i.d. manner with probability p of being a `1'. Since there are nR/c(n) - 1 bits in each prefix, from the AEP we know that there are at least (1 - ε) 2^{(nR/c(n) - 1)(H(p) - ε)} distinct prefixes. We choose p such that the number of prefixes is at least equal to the number of phrases, c(n), i.e.,

    c(n) = (1 - ε) 2^{(nR/c(n) - 1)(H(p) - ε)}
    ⇒ p = H^{-1}( [c(n) log_2 c(n) - c(n) log_2(1 - ε)] / [nR (1 - c(n)/(nR))] + ε ).

As n → ∞, ε → 0, and, from the proof of optimality of the original Lempel-Ziv algorithm [63], we know that

    lim_{n→∞} c(n) log_2 c(n) / n = ℋ,                                     (A.12)
    lim_{n→∞} c(n)/n = 0.                                                  (A.13)

Hence,

    lim_{n→∞} p = H^{-1}(ℋ/R).                                             (A.14)

As n → ∞, the probability of a 1 at the output of the MMLZ encoder (which includes the prefixes and the additional bits), p_mmlz, is upper bounded by

    p_mmlz ≤ c(n) ((nR/c(n) - 1) p + 1) / (nR).                            (A.15)

We can obtain (A.15) as follows:

• as n → ∞, the number of 1's in a prefix is (nR/c(n) - 1) p (strong law of large numbers),
• (nR/c(n) - 1) p + 1 is the maximum number of 1's including the prefix and the additional bit, and
• c(n) ((nR/c(n) - 1) p + 1) is the maximum number of 1's in the coded sequence, which is then divided by the total number of bits, nR, to give the upper bound on p_mmlz.

Simplifying (A.15), we obtain

    p_mmlz ≤ (1 - c(n)/(nR)) p + c(n)/(nR),                                (A.16)

which approaches p, or H^{-1}(ℋ/R), as n → ∞ (see (A.13)). Thus, by selecting the bits in the prefix in an i.i.d. manner with probability H^{-1}(ℋ/R) of being a 1, we arrive at an asymptotically optimal algorithm.

The MLZ algorithm in sub-section III(A), which selects the prefixes in order of increasing number of 1's, will always be better than or equal to the MMLZ algorithm, in which all bits in the prefix have probability H^{-1}(ℋ/R) of being a 1. The reason is as follows:

(1) Since the bits in the prefix are chosen in order of increasing number of 1's in the MLZ algorithm, the prefixes in the MLZ algorithm will have 0, 1, 2, ..., up to H^{-1}(ℋ/R)·n 1's.

(2) Since the bits in the prefixes in the MMLZ algorithm are chosen in an i.i.d. manner with probability H^{-1}(ℋ/R) of being a `1', the prefixes in the MMLZ algorithm will all have H^{-1}(ℋ/R)·n ones.

(3) Thus a prefix in the MLZ algorithm will have an equal or smaller number of 1's than the corresponding prefix in the MMLZ algorithm, due to which the output of the MLZ algorithm will have an equal or smaller number of 1's (or transitions) than the output of the MMLZ algorithm.

Since the MLZ algorithm will have an equal or smaller number of 1's (or transitions) than the MMLZ algorithm, and the MMLZ algorithm is asymptotically optimal, the MLZ algorithm is also asymptotically optimal. □
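The prefix-ordering idea behind MLZ's advantage over MMLZ can be sketched as follows: enumerate b-bit prefixes in order of increasing number of 1's and assign them to phrases, so the average prefix weight stays well below that of i.i.d. biased prefixes. Dictionary construction and the per-phrase extra bits are omitted, and the parameters are illustrative.

```python
# Assign phrases prefixes drawn in order of increasing Hamming weight.

from itertools import combinations

def prefixes_by_weight(b):
    """Yield b-bit patterns in order of increasing number of 1's."""
    for w in range(b + 1):
        for ones in combinations(range(b), w):
            yield sum(1 << i for i in ones)

b, num_phrases = 10, 200            # prefix length and phrase count c(n)
gen = prefixes_by_weight(b)
assigned = [next(gen) for _ in range(num_phrases)]

avg_ones = sum(bin(p).count("1") for p in assigned) / num_phrases
print(f"average 1's per {b}-bit prefix: {avg_ones:.2f} "
      f"(an i.i.d. fair coin would give {b / 2:.1f})")
```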


APPENDIX B

DECOR TRANSFORMATION

B.1 Determining α and δ for FIR Filters

In this section, practical values of α and δ for low-pass, high-pass, band-pass, and band-stop filters are determined. This is done by analyzing the ideal sinc prototype filter impulse responses. The coefficients of an ideal low-pass filter with cutoff ω are given by

    b_k = (ω/π) sinc(ωk),   -∞ < k < ∞.                                    (B.1)

The magnitude of the difference between adjacent coefficients is typically less than the magnitude of the original coefficients. Hence, by choosing α = -1 and δ = 1 (i.e., f(z) = (1 - z^{-1})^m), the difference between adjacent coefficients can be obtained. The coefficients of an ideal high-pass filter with cutoff (π - ω) are given by

    b_k = (-1)^k (ω/π) sinc(ωk),   -∞ < k < ∞.                             (B.2)

Due to the (-1)^k term in (B.2), the magnitude of the sum of adjacent coefficients is typically less than the magnitude of the original coefficients. Hence, by choosing α = 1 and δ = 1 (i.e., f(z) = (1 + z^{-1})^m), the sum of adjacent coefficients can be obtained. The coefficients of an ideal band-pass filter with cutoffs (ω_c ± ω) are given by

    b_k = cos(ω_c k) (ω/π) sinc(ωk),   -∞ < k < ∞.                         (B.3)

Due to the cos(ω_c k) term in (B.3), the magnitude of the sum of coefficients π/ω_c apart is typically less than the magnitude of the original coefficients. Hence, by choosing α = 1 and δ = π/ω_c (actually, π/ω_c rounded to the nearest integer), i.e., f(z) = (1 + z^{-π/ω_c})^m, the sum of coefficients spaced π/ω_c apart can be obtained.

The coefficients of an ideal band-stop filter with cutoffs ω_1 and (π - ω_2) are given by

    b_k = (ω_1/π) sinc(ω_1 k) + (-1)^k (ω_2/π) sinc(ω_2 k),   -∞ < k < ∞.  (B.4)

Due to the (-1)^k term in (B.4), adjacent coefficients will be the sum and difference of two sinc functions. Hence, the range of coefficients can be reduced by choosing α = -1 and δ = 2 (i.e., f(z) = (1 - z^{-2})^m).
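A short sketch of the low-pass case above: applying f(z) = (1 - z^{-1}) (α = -1, δ = 1, m = 1) to the coefficient sequence of a narrow-band low-pass filter yields differences that are much smaller in magnitude than the coefficients themselves, which is what DECOR exploits. The prototype and its parameters are illustrative.

```python
# First-difference DECOR on a truncated ideal low-pass prototype.

import math

def lowpass(omega, num_taps):
    """Ideal low-pass prototype b_k = (omega/pi) sinc(omega*k), truncated."""
    half = num_taps // 2
    def b(k):
        return omega / math.pi if k == 0 else math.sin(omega * k) / (math.pi * k)
    return [b(k) for k in range(-half, half + 1)]

coeffs = lowpass(omega=0.1 * math.pi, num_taps=33)       # narrow-band filter
diffs = [c2 - c1 for c1, c2 in zip(coeffs, coeffs[1:])]  # (1 - z^-1) on coeffs

print(f"max |b_k|         = {max(abs(c) for c in coeffs):.5f}")
print(f"max |b_k - b_k-1| = {max(abs(d) for d in diffs):.5f}")
```

The smaller dynamic range of the differences is what allows them to be represented with fewer effective bits, reducing switching activity in the multipliers.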

B.2 Effect of Quantization in IIR Filters on DECOR

In this section we show that the output at time n, y_df(n), of an N-tap direct-form IIR filter will not be identical to the output, y_decor(n), of the filter obtained after applying DECOR, i.e.,

    y_decor(n) = y_df(n) + e_n,

where e_n is the difference between the two outputs due to the quantizer.

Proof: Without loss of generality, assume that both the DF and DECOR filters are in the same state at time n, i.e.,

    y_df(n - i) = y_decor(n - i) = y(n - i),   for i = 1, ..., N.

The output at time n of the DF filter is given by

    y_df(n) = Q[ -Σ_{i=1}^{N} a_i y(n - i) + x(n) ],

where Q[·] represents the quantization operation. The output of the DECOR filter is given by

    y_decor(n) = Q[ -Σ_{i=1}^{N+1} (a_i + a_{i-1}) y(n - i) + x(n) + x(n - 1) ]
                 [Note: a_{N+1} = 0, a_0 = 1]

    = Q[ -Σ_{i=1}^{N} a_i y(n - i) - y(n - 1) - Σ_{i=2}^{N+1} a_{i-1} y(n - i) + x(n) + x(n - 1) ]

    = Q[ -Σ_{i=1}^{N} a_i y(n - i) - Q[ -Σ_{i=1}^{N} a_i y(n - 1 - i) + x(n - 1) ]
         - Σ_{i=1}^{N} a_i y(n - 1 - i) + x(n) + x(n - 1) ]

      [Applying the definition of y_df(n) to y(n - 1), since, by assumption,
       y_df(n - 1) = y_decor(n - 1) = y(n - 1)]

    = Q[ -Σ_{i=1}^{N} a_i y(n - i) + Σ_{i=1}^{N} a_i y(n - 1 - i) - x(n - 1) + η_n
         - Σ_{i=1}^{N} a_i y(n - 1 - i) + x(n) + x(n - 1) ]

      [η_n is the noise due to the quantizer]

    = Q[ -Σ_{i=1}^{N} a_i y(n - i) + η_n + x(n) ]

    = Q[ -Σ_{i=1}^{N} a_i y(n - i) + x(n) ] + e_n

      [e_n is the noise due to the quantizer]

    = y_df(n) + e_n    [From the definition of y_df(n)],

which is the desired result. □

REFERENCES

[1] R. Marculescu, D. Marculescu, and M. Pedram, "Switching Activity Analysis Considering Spatiotemporal Correlations," IEEE/ACM International Conference on Computer-Aided Design, pp. 294-299, San Jose CA, November 6-10, 1994.

[2] T.-L. Chou, K. Roy, and S. Prasad, "Estimation of Circuit Activity Considering Signal Correlations and Simultaneous Switching," International Conference on Computer-Aided Design, pp. 300-303, San Jose CA, November 1994.

[3] A. Ghosh, S. Devadas, K. Keutzer, and J. White, "Estimation of average switching activity in combinational and sequential circuits," 29th Design Automation Conference, pp. 253-259, June 1992.

[4] C. Huang, B. Zhang, A. Deng, and B. Swirski, "The design and implementation of PowerMill," International Symposium on Low Power Design, pp. 105-110, Dana Point CA, April 1995.

[5] F. Najm, "Transition density, a new measure of activity in digital circuits," IEEE Transactions on Computer-Aided Design, vol. 12, no. 2, pp. 310-323, February 1993.

[6] J. H. Satyanarayana and K. K. Parhi, "HEAT: Hierarchical Energy Analysis Tool," 33rd Design Automation Conference, pp. 9-14, June 1996.

[7] C.-Y. Tsui, M. Pedram, and A. Despain, "Efficient estimation of dynamic power consumption under a real delay model," International Conference on Computer-Aided Design, pp. 224-228, November 1993.

[8] M. Nemani and F. Najm, "Towards a High-Level Power Estimation Capability," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, pp. 588-598, June 1996.


[9] M. R. Stan and W. P. Burleson, "Bus-Invert Coding for Low-Power I/O," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp. 49-58, March 1995.

[10] A. Chandrakasan and R. W. Brodersen, "Minimizing power consumption in digital CMOS circuits," Proceedings of the IEEE, vol. 83, no. 4, pp. 498-523, April 1995.

[11] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen, "Optimizing Power Using Transformations," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, no. 1, pp. 12-31, January 1995.

[12] M. Horowitz, T. Indermaur, and R. Gonzalez, "Low-power digital design," IEEE Symposium on Low Power Electronics, pp. 8-11, San Diego CA, October 1994.

[13] L. Benini and G. De Micheli, "Automatic synthesis of low-power gated-clock finite-state machines," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, pp. 630-643, June 1996.

[14] P. E. Landman and J. M. Rabaey, "Activity-sensitive architectural power analysis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, June 1996.

[15] A. Shen, A. Ghosh, S. Devadas, and K. Keutzer, "On average power dissipation and random pattern testability of CMOS combinational logic networks," International Conference on Computer-Aided Design, pp. 402-407, November 1992.

[16] F. Najm, "A survey of power estimation techniques in VLSI circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, pp. 446-455, December 1994.

[17] N. R. Shanbhag, "A mathematical basis for power-reduction in digital VLSI systems," IEEE Transactions on Circuits and Systems, Part II, vol. 44, no. 11, pp. 935-951, November 1997.

[18] E. A. Vittoz, "Low-power design: Ways to approach the limits," IEEE Solid-State Circuits Conference, pp. 14-18, 1994.

[19] H. H. Loomis and B. Sinha, "High speed recursive digital filter realization," Circuits, Systems, and Signal Processing, vol. 3, no. 3, pp. 267-294, 1984.

[20] K. K. Parhi, "Algorithm transformation techniques for concurrent processors," Proceedings of the IEEE, vol. 77, pp. 1879-1895, December 1989.

[21] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou, "Precomputation-based sequential logic optimization for low-power," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp. 426-436, December 1994.

[22] Y. Nakagome, K. Itoh, M. Isoda, K. Takeuchi, and M. Aoki, "Sub-1-V swing internal bus architecture for future low-power ULSI's," IEEE Journal of Solid-State Circuits, vol. 28, no. 4, pp. 414-419, April 1993.

[23] W. C. Athas, L. J. Svensson, J. G. Koller, N. Tzartzanis, and E. Y.-C. Chou, "Low-power digital systems based on adiabatic switching principles," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 4, pp. 398-407, December 1994.

[24] B. Davari, R. H. Dennard, and G. G. Shahidi, "CMOS scaling for high-performance and low-power - The next ten years," Proceedings of the IEEE, vol. 83, no. 4, pp. 595-606, April 1995.

[25] P. E. Landman and J. M. Rabaey, "Architectural Power Analysis: The Dual Bit Type Method," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 2, pp. 173-187, June 1995.

[26] K. K. Parhi, C.-Y. Wang, and A. P. Brown, "Synthesis of Control Circuits in Folded Pipelined DSP Architectures," IEEE Journal of Solid-State Circuits, vol. 27, no. 1, pp. 181-195, January 1992.

[27] S. Gupta and F. Najm, "Power Macromodeling for High Level Power Estimation," 34th Design Automation Conference, pp. 365-370, Anaheim CA, June 1997.

[28] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "Analytical Estimation of Transition Activity for DSP Architectures," IEEE International Symposium on Circuits and Systems, pp. 1512-1515, Hong Kong, June 1997.

[29] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "Analytical Estimation of Transition Activity from Word-Level Signal Statistics," Design Automation Conference, pp. 582-587, Anaheim CA, June 9-13, 1997.

[30] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "Analytical Estimation of Signal Transition Activity from Word-Level Statistics," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 16, no. 7, pp. 718-733, July 1997.

[31] B. Atal and M. R. Schroeder, "Predictive coding of speech and subjective error criteria," IEEE Transactions on Acoustics, Speech, and Signal Processing, no. 23, pp. 247-254, June 1979.

[32] N. S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall, Englewood Cliffs NJ, 1984.

[33] J. W. Adams and A. N. Willson, "Some Efficient Digital Prefilter Structures," IEEE Transactions on Circuits and Systems, vol. 31, no. 5, pp. 260-265, May 1984.

[34] M. Goel and N. R. Shanbhag, "Dynamic Algorithm Transformations (DAT) for Low-Power Adaptive Signal Processing," International Symposium on Low Power Electronics and Design, pp. 161-166, Monterey CA, August 18-20, 1997.

[35] J. T. Ludwig, S. H. Nawab, and A. P. Chandrakasan, "Low-power digital filtering using approximate processing," IEEE Journal of Solid-State Circuits, vol. 31, no. 3, pp. 395-400, March 1996.

[36] D. N. Pearson and K. K. Parhi, "Low-Power FIR Digital Filter Architectures," IEEE International Symposium on Circuits and Systems, pp. 231-234, Seattle WA, April 30-May 3, 1995.

[37] M. Mehendale, S. D. Sherlekar, and G. Venkatesh, "Algorithmic and Architectural Transformations for Low-Power Realization of FIR Filters," International Conference on VLSI Design, pp. 12-17, Chennai India, January 4-7, 1998.

[38] M. Mehendale, S. B. Roy, S. D. Sherlekar, and G. Venkatesh, "Coefficient Transformations for Area-Efficient Implementation of Multiplier-less FIR Filters," 11th International Conference on VLSI Design, pp. 110-115, Chennai India, January 4-7, 1998.

[39] N. R. Shanbhag and M. Goel, "Low-Power Adaptive Filter Architectures and their Application to 51.84 Mb/s ATM-LAN," IEEE Transactions on Signal Processing, pp. 1276-1290, May 1997.

[40] N. Sankarayya, K. Roy, and D. Bhattacharya, "Algorithms for Low Power and High Speed FIR Filter Realization Using Differential Coefficients," IEEE Transactions on Circuits and Systems II, pp. 488-497, June 1997.

[41] N. Sankarayya, K. Roy, and D. Bhattacharya, "Optimizing Computations in a Transposed Direct Form Realization of Floating-Point LTI-FIR Systems," IEEE/ACM International Conference on Computer-Aided Design, pp. 120-125, San Jose CA, November 9-13, 1997.

[42] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "Decorrelating (DECOR) Transformations for Low-Power Adaptive Filters," IEEE International Symposium on Low-Power Electronics and Design, pp. 250-255, Monterey CA, August 10-12, 1998.

[43] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "Decorrelating (DECOR) Transformations for Low-Power Digital Filters," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing (to appear).

[44] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "Low-Power Distributed Arithmetic Architectures Using Non-Uniform Memory Partitioning," IEEE International Symposium on Circuits and Systems (to appear), Orlando FL, May 30-June 2, 1999.

[45] J.-G. Chung, Y.-B. Kim, H.-J. Jeong, and K. K. Parhi, "Efficient Parallel FIR Filter Implementations Using Frequency Spectrum Characteristics," IEEE International Symposium on Circuits and Systems, pp. V-354-V-358, Monterey CA, May 31-June 3, 1998.

[46] Z.-J. Mou and P. Duhamel, "Short-Length FIR Filters and Their Use in Fast Nonrecursive Filtering," IEEE Transactions on Signal Processing, vol. 39, no. 6, pp. 1322-1332, June 1991.

[47] K. K. Parhi and D. G. Messerschmitt, "Pipeline interleaving and parallelism in recursive digital filters - Parts I, II," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 1099-1134, July 1989.

[48] M. Hatamian and G. L. Cash, "Parallel Pipelined Multiplier," IEEE Journal of Solid-State Circuits, vol. SC-21, no. 4, pp. 505-513, August 1986.

[49] S. A. White, "Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review," IEEE ASSP Magazine, pp. 4-19, July 1989.

[50] N. Bellas, I. N. Hajj, C. D. Polychronopoulos, and G. Stamoulis, "Architectural and Compiler Support for Energy Reduction in the Memory Hierarchy of High-Performance Microprocessors," International Symposium on Low Power Electronics and Design, pp. 70-75, Monterey CA, August 10-12, 1998.

[51] S. Cho, T. Xanthopoulos, and A. P. Chandrakasan, "An Ultra Low Power Variable Length Decoder for MPEG-2 Exploiting Codeword Distribution," IEEE Custom Integrated Circuits Conference, pp. 177-180, Santa Clara CA, May 11-14, 1998.

[52] M. Mehendale, A. Sinha, and S. D. Sherlekar, "Low Power Realization of FIR Filters Implemented Using Distributed Arithmetic," Asia and South Pacific Design Automation Conference, pp. 151-156, Yokohama Japan, February 10-13, 1998.

[53] N. Tan, S. Eriksson, and L. Wanhammar, "A Power-Saving Technique for Bit-Serial DSP ASICs," IEEE International Symposium on Circuits and Systems, vol. 4, pp. 51-54, London England, May 30-June 2, 1994.

[54] L. Benini et al., "Asymptotic Zero-Transition Activity Encoding for Address Busses in Low-Power Microprocessor-Based Systems," Great Lakes Symposium on VLSI, pp. 77-82, Urbana IL, March 1997.

[55] R. J. Fletcher, "Integrated Circuit Having Outputs Configured for Reduced State Changes," U.S. Patent no. 4,667,337, May 1987.

[56] M. R. Stan and W. P. Burleson, "Two-dimensional Codes for Low-Power," International Symposium on Low-Power Electronics and Design, pp. 335-340, Monterey CA, August 12-14, 1996.

[57] C. L. Su, C. Y. Tsui, and A. M. Despain, "Saving Power in the Control Path of Embedded Processors," IEEE Design and Test of Computers, vol. 11, no. 4, pp. 24-30, Winter 1994.

[58] D. W. Faulkner, "PCM Signal Coding," U.S. Patent no. 5,062,152, October 1991.


[59] D. Marculescu, R. Marculescu, and M. Pedram, "Information theoretic measures for power analysis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, pp. 599-610, June 1996.

[60] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "Achievable Bounds on Signal Transition Activity," International Conference on Computer-Aided Design, pp. 126-129, San Jose CA, November 9-13, 1997.

[61] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "Signal Coding for Low Power: Fundamental Limits and Practical Realizations," IEEE International Symposium on Circuits and Systems, pp. II-1-II-4, Monterey CA, May 31-June 3, 1998.

[62] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "Information-Theoretic Bounds on Average Signal Transition Activity," IEEE Transactions on VLSI (accepted).

[63] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, 1991.

[64] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337-343, 1977.

[65] T. A. Welch, "A technique for high-performance data compression," Computer, vol. 17, pp. 8-19, 1984.

[66] T. C. Bell, J. G. Cleary, and I. H. Witten, Text Compression, Prentice-Hall, Englewood Cliffs NJ, 1990.

[67] M. R. Stan and W. P. Burleson, "Limited-Weight Codes for Low-Power I/O," International Workshop on Low Power Design, pp. 209-214, Napa CA, April 1994.

[68] M. R. Stan and W. P. Burleson, "Coding a Terminated Bus for Low Power," Great Lakes Symposium on VLSI, pp. 70-73, Buffalo NY, March 1995.

[69] E. Musoll, T. Lang, and J. Cortadella, "Exploiting the locality of memory references to reduce the address bus energy," International Symposium on Low Power Electronics and Design, pp. 202-207, 1997.

[70] J. Tabor, "Noise reduction using low-weight and constant weight coding techniques," M.S. thesis, Massachusetts Institute of Technology, May 1990.

[71] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "Coding for Low-Power Address and Data Busses: A Source-Coding Framework and Applications," International Conference on VLSI Design, Chennai India, January 5-9, 1998.

[72] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "Coding Schemes to Reduce Transition Activity: A Source-Coding Framework and Applications," IEEE Transactions on VLSI (to appear).

[73] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, R. K. Brayton, and A. Sangiovanni-Vincentelli, "SIS: A System for Sequential Circuit Synthesis," Memorandum No. UCB/ERL M92/41, University of California at Berkeley, May 1992.

[74] MicroSim PSpice A/D Reference Manual, MicroSim Corporation, Irvine CA, 1996.

[75] S. Iman and M. Pedram, "An Approach for Multilevel Logic Optimization Targeting Low Power," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 8, pp. 889-901, August 1996.

[76] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, 2nd Edition, Addison-Wesley, Reading MA, 1993.

[77] R. Krambeck, C. Lee, and H. Law, "High-speed compact circuits with CMOS," IEEE Journal of Solid-State Circuits, vol. SC-17, no. 3, pp. 614-619, June 1982.

[78] K. Bernstein, K. M. Carrig, C. M. Durham, P. R. Hansen, D. Hogenmiller, E. J. Nowak, and N. J. Rohrer, High Speed CMOS Design Styles, Kluwer Academic, Dordrecht Netherlands, 1998.

[79] J. D. Yetter, "Functionally complete family of self-timed dynamic logic circuits," U.S. Patent 5,208,490, May 1993.


[80] R. Puri, A. Bjorksten, and T. E. Rosser, "Logic Optimization by Output Phase Assignment in Dynamic Logic Synthesis," International Conference on Computer-Aided Design, pp. 2-7, San Jose CA, November 10-14, 1996.

[81] G. Yee and C. Sechen, "Dynamic Logic Synthesis," IEEE Custom Integrated Circuits Conference, pp. 345-348, Santa Clara CA, May 1997.


VITA

Sumant Ramprasad was born in Jamnagar, India, in 1967. He received his Bachelor of Technology degree in Computer Science and Engineering from the Indian Institute of Technology, Mumbai, in 1988, and the M.S. degree from The Ohio State University, Columbus, in 1990. He worked at Motorola from 1991 to 1996. Since February 1996 he has been employed as a research assistant in the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign (UIUC), from which he received the Ph.D. degree in 1999. His research interests are design for low power, DSP, and computer architecture.
