Analog Integrated Circuits and Signal Processing, 18, 97–105 (1999) © 1999 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Power Reduction in Custom CMOS Digital Filter Structures

PONTUS ÅSTRÖM, PETER NILSSON AND MATS TORKELSON
Dept. of Applied Electronics, Lund University, P.O. Box 118, SE-221 00 Lund, Sweden. E-mail: [email protected]

Received March 27, 1998; Accepted July 10, 1998

Abstract. Today the main optimization parameter of digital filters is the filter order. By the aid of two implemented filters we will show that both power and speed can be enhanced if the optimization effort is made on reducing the filter coefficient lengths rather than minimizing the order. Both filters have been designed from the same specification, one as a standard minimum order filter, the other as a filter with short coefficients found by a computer search. The minimum order filter is of order three with seven bits long coefficients. The coefficient optimized filter is of order six with two bits long coefficients. Both filters were implemented with bit-serial fixed coefficient arithmetic in two's complement representation in a 0.8 μm, two metal layer CMOS process. Measurements show an eightfold speedup at half the power consumption and only 30% area cost for the coefficient optimized filter.

Key Words: low power, full custom design, digital filter, lattice wave digital structure, coefficient optimization, baseband filter

1. Introduction

The main optimization parameter in filter design is the filter order. For analog filters this is an obvious choice as it reduces the number of discrete components and minimizes the impact of component variations. For digital filters it has been natural to continue this approach even though low order is not as important as for analog filters. One reason is that there are no component variations to consider as only arithmetic operations are performed.

Important properties of a final implementation are speed, power and die size. In digital filters there are mainly three parameters that affect those properties: filter order, filter coefficient length and data word length. The data word length is usually set by the required signal to quantization noise ratio and can be considered fixed. In some applications, like standard DSP designs, the coefficient length is also fixed and we end up with filter order optimization. However, for full custom designs, where it is possible to vary the word length arbitrarily, filter order might not be the most important parameter to minimize.

Speed, power and die size are all dependent on both the coefficient lengths and the order. Power and die size are almost linearly dependent on both the coefficient length, due to the large share of multipliers in digital filters, and the filter order. The power consumption also has a strong dependence on the number of ones in the coefficients. The speed, on the other hand, is mostly dependent on the length of the coefficients.

From the above it is clear that short coefficients are as important as low order for the outcome of the design. The problem is that, to our knowledge, no methods other than search methods exist for obtaining filters with good coefficients. This limits the use of coefficient optimization to filters with simple specifications.

The focus of this paper is to show what results can be obtained if the coefficients are considered, rather than how they are obtained. For that purpose two baseband digital IIR filters were designed from the same specification. The first is a minimum order filter; the second is a filter with short coefficients found by a computer search. To separate the filters in the sequel, the first will be referred to as the minimum order filter and the second as the optimized filter.

To simplify the implementation it was decided to implement the filter structure as a hardware mapped data path. The main benefit is that the control can be reduced to a bare minimum: basically only initialization of latches and adder carry reset will be needed. To make this feasible without excessive area growth, the data path is implemented in a bit-serial technique with two's complement representation.

2. System Overview

Both filters are designed to be used as digital baseband filters in a radio system project. The aim of the project is to develop a wideband mobile communication system. The system utilizes Orthogonal Frequency Division Multiplexing (OFDM) with differential QPSK modulation. This is the choice of the European Digital Audio Broadcast (DAB) [5] and Digital Video Broadcast (DVB) systems. A short introduction to the general concept of OFDM can be found in [6].

In Fig. 1 a standard direct conversion homodyne receiver is shown, where all filtering is done in the analog domain. To efficiently utilize the spectrum, a high order analog filter is needed. However, there are several drawbacks with such filters: it is very difficult to obtain high precision, and the standard solutions today, crystal or ceramic filters, are very expensive and bulky. Other drawbacks are aging effects of discrete components and the need to tune each filter individually.

Fig. 1. Block structure of a standard homodyne receiver.

The requirements on the analog filter can be reduced by moving some of the filtering to the digital domain [7]. However, to be able to do so, the sample frequency has to be increased. The procedure is exemplified in Fig. 2. First consider the standard homodyne receiver of Fig. 1. If the sample frequency is given by $f_{s1}$ in Fig. 2, an analog filter response according to the solid line is needed to remove the alias frequencies. If the sample frequency is doubled to $f_{s2}$, the mirror band is moved up in frequency. The requirements on the analog filter are lowered to a level where the filter can be integrated in a standard CMOS process [8]. Before the sample frequency can be converted down from $f_{s2}$ to $f_{s1}$, which is the frequency desired in the system, the $f_{s1}$ mirror band has to be filtered out. This is done by a high order digital filter which does not suffer from the drawbacks of high order analog filters mentioned above. Note that the analog filter helps the digital filter by about 5 dB. The total transfer function is thus slightly better than the one shown in Fig. 2.

Fig. 2. With only analog filtering and a sample frequency equal to $f_{s1}$, a high order analog filter according to the solid line is needed. If the sample frequency is doubled to $f_{s2}$, the requirement on the analog filter is relaxed (dashed line). Most of the filtering is then done in the digital domain (dotted line).

In Fig. 3 a modified homodyne receiver is shown. The ADC samples at twice the rate of the standard homodyne receiver in Fig. 1. Thus, some filtering can be done in the digital domain as described above. After the digital filter there is a down converter that reduces the rate by half. The main drawback of the method is that the A/D conversion has to be done at a higher rate and thus constitutes a more severe bottleneck here than it does in the standard homodyne receiver.

3. Filter Specifications

Two generations of the system shown in Fig. 3 have been implemented. The first generation system is designed for a user bandwidth of 2 MHz (1 bit/Hz). For this system the minimum order filter is implemented. Here the passband has to be 2 MHz, i.e. ±1 MHz centered around the origin. The main purpose of the first generation test bench is to verify the interfaces between the different blocks of the system.

Fig. 3. Block structure of the modified homodyne receiver. The sample rate is twice as high as for the standard homodyne receiver.

The second generation system is a broadband system with a user bandwidth of 20 MHz. Here the optimized filter is implemented. The passband of this filter needs to be 20 MHz, centered around the origin (±10 MHz).

To transmit data in OFDM each user bandwidth is split into several narrow sub-channels. The choice of sub-channel bandwidth is a compromise between two contradictory requirements. Due to Doppler effects each sub-channel should be wide, but to reduce multipath effects it should be narrow. According to [10], 50 kHz is a good compromise. The total number of sub-channels is thus 40 for the first generation system and 400 for the second generation system. However, the FFT has to operate on a number of channels that is a power of 2. For the single user system the closest power of 2 larger than 40 is 64. Therefore the ADC for the single user system has to operate at 64 times the channel bandwidth times the oversampling rate (a factor of two). This equals 6.4 MHz. In the same way the number of channels for the multiuser system will be 512 and the sample rate 51.2 MHz.

Two important filter parameters that depend on the sub-channel width are group delay and passband ripple. The total group delay of the system is specified to 200 ns, which is 1% of the symbol duration. The amplitude is required to be flat over each sub-channel; 0.5 dB ripple gives less than one degree phase error. The stopband attenuation is required to exceed 40 dB. Together with the analog filter this results in a total adjacent channel suppression of 55 dB. The accuracy of the system is limited to 10 bits by the ADC [11], which gives a signal to quantization noise level of 60 dB. Two extra bits have to be added internally in the filters to prevent arithmetic overflow.
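The sample-rate arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation described in the text; the function names `fft_size` and `adc_rate_hz` are hypothetical and chosen for illustration.

```python
def fft_size(n_subchannels):
    """Smallest power of two that is at least n_subchannels."""
    size = 1
    while size < n_subchannels:
        size *= 2
    return size


def adc_rate_hz(user_bandwidth_hz, subchannel_hz=50_000, oversampling=2):
    """Return (sub-channels, FFT size, ADC sample rate in Hz).

    The 50 kHz sub-channel width and the oversampling factor of two
    are the values quoted in the text.
    """
    n_sub = user_bandwidth_hz // subchannel_hz
    n_fft = fft_size(n_sub)
    return n_sub, n_fft, n_fft * subchannel_hz * oversampling


for bandwidth in (2_000_000, 20_000_000):
    n_sub, n_fft, rate = adc_rate_hz(bandwidth)
    print(f"{bandwidth / 1e6:g} MHz: {n_sub} sub-channels, "
          f"FFT size {n_fft}, ADC rate {rate / 1e6:g} MHz")
```

For the 2 MHz and 20 MHz user bandwidths this gives FFT sizes of 64 and 512 and sample rates of 6.4 MHz and 51.2 MHz, matching the figures stated above.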
It should be noted that apart from the different sample rates the specifications are identical for both filters.

Table 1. Complete filter specification.

Parameter          Value
Group delay        200 ns
Passband           0–0.2 f_sample
Passband ripple    0.5 dB
Stopband           0.3–0.5 f_sample
Stopband att.      40 dB

4. Filter Structure

It is well known that wideband digital filters are very sensitive. However, the sensitivity in passband and stopband varies between different classes of filters. For example, lattice wave digital filters (LWDFs) show low sensitivity in the passband and high sensitivity in the stopband. Apart from having low sensitivity to coefficient variations in the passband, LWDFs have several good properties that make them well suited for implementation of broadband digital filters:

* Low roundoff noise.
* Easily designed by direct mapping of analog filters, with all properties of the analog filter preserved.
* Simple and modular data-path.

The basic building blocks of LWDFs are digital approximations of analog components like capacitors and inductors. In the digital domain the capacitor is represented by a delay, or in transform notation, by $z^{-1}$. The inductor is in the same way represented by a negated delay, $-z^{-1}$. To interconnect those components an adaptor is needed. There exist different kinds of adaptors; two-port and three-port adaptors are the most common. In LWDFs only two-port adaptors are used. The symbol and internal structure of the two-port adaptor are shown in Fig. 4. A component is connected to each port. The adaptor coefficient $\alpha$ defines the ratio between the incident and the reflected wave from each port. If it is zero there is no reflected wave and the adaptor acts as a feed-through. If it equals one the whole wave is reflected and nothing passes through the adaptor. An in-depth coverage of LWDFs can be found in [2–4].

Fig. 4. (a) The adaptor symbol. (b) The internal structure of the adaptor. A = input to the adaptor, B = output from the adaptor, $\alpha$ = adaptor coefficient. The index indicates the port.
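As a rough illustration of what a two-port adaptor computes, the sketch below uses the symmetric two-port adaptor form commonly given in the wave digital filter literature, $B_1 = A_2 + \alpha(A_2 - A_1)$ and $B_2 = A_1 + \alpha(A_2 - A_1)$. This form is an assumption: the internal structure of Fig. 4 is not reproduced here and its sign convention may differ. The sketch only checks the feed-through behaviour for $\alpha = 0$ described in the text. The function name `two_port_adaptor` is hypothetical.

```python
from fractions import Fraction


def two_port_adaptor(a1, a2, alpha):
    """Assumed symmetric two-port adaptor (one multiplier, three adders).

    a1, a2 are the incident waves at ports 1 and 2; the return value is
    the pair of outgoing waves (b1, b2). The sign convention is an
    assumption and may differ from the adaptor of Fig. 4.
    """
    scaled = alpha * (a2 - a1)  # one subtraction and the only multiplication
    b1 = a2 + scaled            # second addition
    b2 = a1 + scaled            # third addition
    return b1, b2


# With a zero coefficient the adaptor is a pure feed-through:
# each input simply appears at the opposite port.
print(two_port_adaptor(4, 10, Fraction(0)))
# With a short coefficient such as 1/2 the multiplication is a single shift.
print(two_port_adaptor(4, 10, Fraction(1, 2)))
```

Under this assumed form each adaptor costs one multiplier and three adders, so the multiplier delay, and hence the coefficient length, dominates the delay through the adaptor.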

5. Filter Design

In order to make a fair comparison we want the minimum order filter to have as short coefficients as possible. One way to guarantee this is to do a computer search, starting with short coefficient lengths and then gradually extending the lengths until a filter is found. This way it is guaranteed that the minimum order filter has as short coefficients as possible, independent of any filter algorithm. The same approach is taken for the optimized filter except that the search is not limited to minimum order filters. There is one big flaw that limits the use of computer search to find filters: the search time grows exponentially with regard to both filter order and coefficient length. This effectively prohibits the use of computer search in the design of filters with tight specifications. However, many specifications, like the one here, are not very tight, and thus a computer search can be motivated.

5.1. Computer Algorithm

A computer program was written in order to find filters with short coefficients. The program is given the z-transform of a filter together with a list of the coefficients that are part of the different loops of the filter (the critical path is one of the loops). The program computes the transfer function for all coefficient combinations up to a given length, beginning with short coefficients with few ones and then gradually increasing both the length and the number of ones of the coefficients. The maximum coefficient length can be set arbitrarily but is usually set to the coefficient length of a minimum order filter that fulfills the specification. Each computed transfer function is then transformed to the frequency domain by the aid of the FFT algorithm. When the transfer function fits the specification the program stops.

Currently the program is only semi-automatic, i.e. it only checks one specific filter at a time. To check for different filter orders the program has to be manually loaded with a new transfer function each time. As the number of filters, given a specific structure, is quite small, this is not so cumbersome.
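The search order described above can be sketched as follows. This is not the authors' program: it is a minimal illustration that enumerates non-negative fractional coefficients by increasing word length and, within one length, by increasing number of ones, and stops at the first combination accepted by a caller-supplied predicate. The names `coefficient_candidates`, `search` and `meets_spec` are hypothetical, the sign bit is not counted in `frac_bits`, and `meets_spec` stands in for the FFT-based comparison against the specification.

```python
from fractions import Fraction
from itertools import product


def coefficient_candidates(max_frac_bits):
    """Yield (frac_bits, value) for coefficients k / 2**frac_bits.

    Shorter word lengths come first; within one length, codes with
    fewer ones come first. Values already reachable with a shorter
    word length are skipped.
    """
    seen = set()
    for frac_bits in range(1, max_frac_bits + 1):
        codes = sorted(range(2 ** frac_bits),
                       key=lambda k: (bin(k).count("1"), k))
        for k in codes:
            value = Fraction(k, 2 ** frac_bits)
            if value not in seen:
                seen.add(value)
                yield frac_bits, value


def search(n_coeffs, max_frac_bits, meets_spec):
    """Return (frac_bits, coefficients) for the first accepted combination.

    The number of combinations grows as 2**(frac_bits * n_coeffs),
    which is the exponential growth mentioned in the text.
    """
    for frac_bits in range(1, max_frac_bits + 1):
        pool = [value for _, value in coefficient_candidates(frac_bits)]
        for combo in product(pool, repeat=n_coeffs):
            if meets_spec(combo):
                return frac_bits, combo
    return None
```

A caller would pass a `meets_spec` function that evaluates the transfer function for the given coefficients and compares it against the passband and stopband limits of Table 1. Combinations from shorter word lengths are re-tested at each longer length, which keeps the sketch short at the cost of some redundant work.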

5.2. Design of the Minimum Order Filter

The best minimum order filter is a third order filter with seven bits long coefficients, cf. Table 2. With this choice of coefficients, the filter response is given by the dashed lines in Fig. 5. As shown in Fig. 5(b), the difference in group delay within the passband is 200 ns, equal to the specified maximum limit.

Table 2. Filter coefficients for the third order and the sixth order filter.

               Minimum Order            Optimized
Coefficient    Decimal      Binary      Decimal   Binary
α0             0.375000     0011000     0.0       00
α1             0.578125     0100101     0.5       01
α2             −0.328125    1101011     0.0       00

Fig. 5. (a) The amplitude response of both filters. Dotted line for the third order filter and solid line for the sixth order filter. (b) Group delay of the filters. The arrow indicates the largest difference of group delay within the passband.

The structure of the minimum order filter is shown in Fig. 6(a). The filter consists of one first order allpass link at the top and one second order link at the bottom. In the figure the $z^{-1}$ elements are represented by T blocks. In Fig. 6(b) the datapath is shown. It consists of three multipliers and ten adders.

Fig. 6. (a) Filter structure of third order LWDF. (b) The datapath.

In Fig. 8 the implementation of the α1 adaptor on cell level is shown. The bit-serial fixed coefficient multiplier is situated in the middle. The d elements on both sides of the multiplier are shimming delays which keep the signals in phase with the multiplied signal. The critical path of the filter is the loop through the α2 and α1 multipliers. The delay equals that of two adders and two multipliers, a total of 16 clock cycles. By rewriting the algorithm to use a negated α2, all coefficients become positive so the sign digit can be skipped, which saves two clock cycles in the critical path. With 14 clock cycles per sample the clock frequency has to be 89.6 MHz.
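The binary words in Table 2 can be checked against their decimal values with a small decoder. The sketch assumes the usual fractional two's complement interpretation, with the first bit as the sign bit (weight −1) followed by fractional bits of weight 1/2, 1/4 and so on; the function name is hypothetical.

```python
from fractions import Fraction


def twos_complement_fraction(bits):
    """Decode a bit string: sign bit first, then fractional bits."""
    weight = Fraction(1)
    value = -weight * int(bits[0])   # the sign bit carries weight -1
    for bit in bits[1:]:
        weight /= 2                  # 1/2, 1/4, 1/8, ...
        value += weight * int(bit)
    return value


# Binary words as listed in Table 2 (minimum order, then optimized).
for word in ("0011000", "0100101", "1101011", "00", "01"):
    value = twos_complement_fraction(word)
    print(word, float(value), "ones:", word.count("1"))
```

The decoded values agree with the decimal column of Table 2. The count of ones is printed as well, since the text notes that both the coefficient length and the number of ones influence power consumption.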

5.3. Design of the Optimized Filter

The best performance was found with a sixth order filter built out of two identical cascaded half band third order filters. Each third order filter link has the structure given in Fig. 6(a). The coefficients of the sixth order filter are given in Table 2 and the filter response by the solid lines in Fig. 5. Group delay within the passband varies from 263 ns to 430 ns. Thus, the maximum difference in group delay is 430 − 263 = 167 ns, which is well below the stipulated 200 ns.

Contrary to the minimum order filter it is possible to parallelize the data path of the optimized filter. This is due to the much shorter coefficient lengths, which reduce the delay of the multipliers considerably. Since both α0 and α2 equal zero, four adaptors consist of a feed-through only. The data path of the optimized filter can therefore be rewritten to the much simpler one shown in Fig. 9. Each T block in the figure equals the delay of one sample. From Fig. 9 it can be seen that the delay in each feedback path is two samples. This means that odd and even samples do not affect each other and thus can be computed independently. The data path of Fig. 9 is therefore rewritten as two parallel paths that process odd and even samples separately, as shown in Fig. 10. By merging the decimation operation shown in Fig. 3 with the filter, the size of the filter can be reduced. The only operation that has to be done is to remove the dotted part in the last filter link in Fig. 10.

Now we observe that the internal word length is 12 bits whereas the delay of a link is 3 cycles. The result is that feedback data have to be stored for 12 − 3 = 9 cycles in the T blocks of Fig. 10. This is a waste of time. Instead of storing data in registers it is possible to feed the data forward to a new processing element as shown in Fig. 7. As the delay of a filter link is three cycles, four processing elements can be connected in series. Thereafter the loop has to be closed, as the first element has finished processing the first sample. The parallelization effort above has increased throughput considerably, from one sample every 12 cycles to eight samples every 12 cycles, i.e. an eightfold parallelization that gives an average throughput of one sample every 1.5 cycles.

Fig. 7. The first filter link in the sixth order filter computing eight samples concurrently. The second filter link only consists of the upper part of the figure as only even samples are to be output.

Fig. 8. The data-path on cell level of the α1 adaptor. The middle row contains the fixed coefficient multiplier.

Fig. 13 shows the implementation on cell level of one third order filter link of the optimized filter. This figure should be compared to Fig. 8, which shows one of three adaptors in the minimum order filter.
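The claim that a two-sample feedback delay decouples odd and even samples can be illustrated with a toy recursion. The sketch below does not model the LWDF link of Fig. 9; it uses the hypothetical recursion y[n] = x[n] + c·y[n−2] purely as a stand-in that has the same two-sample feedback delay, and shows that filtering the even and odd sub-sequences separately gives the same output.

```python
from fractions import Fraction


def recursive_filter(x, c):
    """Reference: y[n] = x[n] + c * y[n-2], zero initial state."""
    y = []
    for n, sample in enumerate(x):
        feedback = y[n - 2] if n >= 2 else 0
        y.append(sample + c * feedback)
    return y


def one_sample_recursion(x, c):
    """y[m] = x[m] + c * y[m-1]: the recursion seen by one sub-sequence."""
    y, previous = [], 0
    for sample in x:
        previous = sample + c * previous
        y.append(previous)
    return y


def split_filter(x, c):
    """Process even and odd samples in two independent paths, then interleave."""
    y = [0] * len(x)
    y[0::2] = one_sample_recursion(x[0::2], c)  # even-indexed samples
    y[1::2] = one_sample_recursion(x[1::2], c)  # odd-indexed samples
    return y


x = [3, -1, 4, 1, -5, 9, 2, -6, 5]
c = Fraction(1, 2)
print(recursive_filter(x, c) == split_filter(x, c))
```

Because neither path reads the other's state, the two paths can run concurrently, which is the property the architecture of Fig. 10 exploits.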

6. Layout

Both chips are implemented in a 0.8 μm, two metal layer CMOS process. Except for a few missing shimming delays, the data-paths of Figs. 6(b) and 7 are mapped directly to silicon. In order to get standalone chips, the following three blocks are added to the filter cores.

I/O unit: Data are fed on and off the chip in a parallel mode. This is handled by a simple I/O unit. The basic building block of the I/O unit is the delay (d) element.

Control unit: The control unit is used to reset the carry registers once every sample. It also keeps track of when a word is to be loaded on and off the chip. It is made of d elements coupled in a closed loop, along which a "1" is stepped forward. The length of the loop is equal to the on-chip word length.

On-chip clock generator: The high frequencies needed make it difficult to feed the clock signal from outside. Instead a trigger signal is fed to the on-chip clock generator [12] once every sample. The clock generator then generates a specified number of clock cycles.

Fig. 9. The computational data-path when α0 and α2 equal zero.

Fig. 10. Architecture that computes two samples concurrently.

In Figs. 11 and 12 photos of the filters are shown. The structure of both filters is the same. All modules are placed in two columns with a data bus in between. The input to the filters is handled at the top. Thereafter data is fed downwards through the data-path. At the bottom the output unit is situated. Control and clock modules are placed in the center of the columns to reduce the influence of clock skew.

7. Results

Both filters have been verified in a digital test bench. This allows parameters like supply voltage, clock frequency and input data to be changed easily. To verify proper operation at the operating frequencies given in Table 3, test vectors obtained from simulation were used. Tests with sine waves of different frequencies as inputs were also performed. The output was then compared with the expected values from the transfer function in Fig. 5.

Fig. 11. Photo of minimum order filter.

It is difficult to obtain a proper value of the power consumption as it depends heavily on the correlation of the input data. However, a worst case estimate can be obtained if the input data is assumed to be uncorrelated, i.e. white noise. This is the case here. All data in the table except power are for the core only.

The interesting figures in the table are the sample rate and the power consumption. Although the sample rate is eight times higher for the optimized filter it consumes only half the power. This is mainly due to the highly parallel structure, which allowed us to reduce the clock frequency to 76.8 MHz, and the much simplified adaptors. A comparison between the α1 adaptor of the minimum order filter in Fig. 8 and the same adaptor in the optimized filter, shown inside the dashed area in Fig. 13, shows that the latter has much lower delay, is much smaller and has lower power consumption.

Fig. 12. Optimized filter. The regular data-path resulted in a simple and compact layout.

The number of clock cycles required for each sample decreased from 14 for the minimum order filter to 1.5 on average for the optimized filter, through concurrent processing of eight samples. Compared to a bit-parallel implementation the speed penalty is only 33%. However, the bit-serial implementation of the optimized filter has at least two big advantages: the area is much smaller, and because it was possible to use a hardware mapped algorithm, the control is very simple.

Fig. 13. The data-path from the surrounded square in Fig. 7 as it is implemented on cell level.

To make a fair comparison it should be noted that due to the decimation operation of the optimized filter, half of the second filter link disappears, see Fig. 10. For a non down sampling filter both the area and the power consumption would be about 30% larger. The reason for the higher transistor density in the optimized filter is the higher transistor density in adder cells compared to d elements. The ratio of adders to d elements is much larger in the optimized filter. Also, since the optimized chip is a second generation design, slight improvements have been made to the cell library.
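The relation between word length, link delay, cycles per sample and clock frequency can be checked with exact arithmetic. The sketch below is one reading of the figures given in the text: four processing elements in series (12-bit word, 3-cycle link delay) times two parallel odd/even paths gives eight samples per 12 cycles. The constant and function names are hypothetical.

```python
from fractions import Fraction

WORD_LENGTH_CYCLES = 12   # internal word length, one bit per clock cycle
LINK_DELAY_CYCLES = 3     # delay of one filter link in the optimized filter
PARALLEL_PATHS = 2        # odd and even samples processed separately

elements_in_series = WORD_LENGTH_CYCLES // LINK_DELAY_CYCLES
samples_per_word = elements_in_series * PARALLEL_PATHS
cycles_per_sample = Fraction(WORD_LENGTH_CYCLES, samples_per_word)


def clock_frequency_mhz(sample_rate_mhz, cycles):
    """Clock frequency = sample rate times clock cycles per sample."""
    return Fraction(sample_rate_mhz) * cycles


minimum_order_clock = clock_frequency_mhz("6.4", 14)
optimized_clock = clock_frequency_mhz("51.2", cycles_per_sample)
print(samples_per_word, float(cycles_per_sample))
print(float(minimum_order_clock), float(optimized_clock))
```

The results, 89.6 MHz for the minimum order filter and 76.8 MHz for the optimized filter, agree with the clock frequencies quoted in the text.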

8. Conclusion

Two wide-band digital filters for mobile communication have been presented. It has been shown that by not limiting the design to minimum order filters, both speed and power can be improved. Due to the short delay of the adaptors of the optimized filter it is possible to parallelize the architecture in eight parallel pipelines. The result is a much faster filter that consumes less power. As shown, an eightfold increase in sample rate can be obtained at almost no area and power cost.

Table 3. Filter data.

Parameter              Min. Order    Optimized    Unit
Transistor count       3707          6876
Area, core only        1.26          1.30         mm²
Sample frequency       6.4           51.2         MHz
Clock cycles/sample    14            1.5
Clock frequency        89.6          76.8         MHz
Supply voltage         5             3.0          V
Power consumption      250           130          mW

References

1. A. R. Omondi, Computer Arithmetic Systems, Prentice Hall, 1994.
2. A. Fettweis, "Wave Digital Filters: Theory and Practice." Proc. of the IEEE 74, pp. 270–327, 1986.
3. L. Gazsi, "Explicit Formulas for Lattice Wave Digital Filters." IEEE Transactions on Circuits and Systems 32, pp. 68–88, 1985.
4. S. Lawson and A. R. Mirzai, Wave Digital Filters, Ellis Horwood, New York, 1990.
5. Radio Broadcasting Systems; Digital Audio Broadcasting (DAB) to mobile, portable and fixed receivers. ETS 300 401, ETSI, European Telecommunications Standards Institute, Valbonne, France, 1995.
6. O. Edfors, M. Sandell, J. van de Beck, D. Lindström, and F. Sjöberg, "An Introduction to Orthogonal Frequency Division Multiplexing." Technical report, Division of Signal Processing, Luleå University of Technology, Sweden, 1996.
7. P. Nilsson and M. Torkelson, "A Custom Digital Intermediate Frequency Filter for the American Mobile Telephone System." IEEE Journal of Solid-State Circuits 32, pp. 806–815, 1997.
8. A. Kristensson, "Design and Analysis of Integrated OTA-C Low-Pass Filters," Licentiate Thesis, KF-Sigma Lund, Sweden, 1997.
9. R. Jain, F. Catthoor, J. Vanhoof, B. J. S. De Loore, G. Goossens, N. F. Goncalvez, L. J. M. Claesen, J. K. J. Van Ginderdeuren, J. Vandewalle, and H. J. De Man, "Custom Design of a VLSI PCM-FDM Transmultiplexer from System Specifications to Circuit Layout Using a Computer-Aided Design System," IEEE Journal of Solid-State Circuits 21, pp. 73–85, 1986.
10. S. Hara, K. Fukui, M. Okada and N. Morinaga, "Multicarrier Modulation Technique for Broadband Indoor Wireless Communication," Proc. Fourth International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC'93), pp. 132–136, Yokohama, Japan, 1993.
11. N. Tan, "A 1.5-V 3-mW 10-bit 50 Ms/s CMOS DAC with Low Distortion and Low Intermodulation in Standard Digital CMOS Process," Proc. of the 1997 IEEE Custom Integrated Circuits Conference (CICC'97), Santa Clara, USA, pp. 599–602, 1997.
12. P. Nilsson and M. Torkelson, "A Monolithic Digital Clock-Generator for On-Chip Clocking of Custom DSP." IEEE Journal of Solid-State Circuits 31, pp. 162–165, 1993.

Pontus Åström received the M.S.E.E. degree in 1995 from Lund University, Lund, Sweden. He joined the Department of Applied Electronics as a Ph.D. student in September 1995. He is currently doing research for the Ph.D. degree in the area of digital filters and channel codecs for mobile applications.

Peter Nilsson was born in Borlänge, Sweden on December 9, 1958 and has grown up in Skåne in the south of Sweden. He received the M.S.E.E. at Lund Institute of Technology, Lund University, Lund, Sweden in 1988. In February 1988 he joined the Department of Applied Electronics, Lund University, Lund, Sweden, where he had made his diploma work and where he, in May 1992, received his Licentiate degree. In May 1996 he received the Ph.D. degree and is currently working as an Assistant Professor in the ASIC/DSP group at the same department. His main interests are in the field of silicon implementation of custom DSPs related to the communication area, especially with parallel-serial architectures.

Mats Torkelson received the M.S.E.E. degree in electrical engineering in 1980 at ETH Zurich/LTH Lund. He received the Licentiate degree and the Ph.D. from Lund University, Lund, Sweden in 1985 and 1990 respectively. He has worked with AD/DA converters for professional audio tape recorders at Willi Studer AG, Switzerland and with maritime X-band radar collision avoidance systems at Lund University. During the period 1984–1986 he was part time at the University of California, Berkeley. Mr. Torkelson heads the digital signal processing group at the Dept. of Applied Electronics, Lund University, which he initiated in 1986. Since 1994 he has worked part time and since 1997 full time with Ericsson Radio Systems, Stockholm, Sweden. His current interests are mobile communication, algorithm implementation, and amplifier design.
