Pseudo-Floating Point MAC Units for Programmable High Performance QMF Banks
A. Benjamin Premkumar, Division of Computing Systems, School of Applied Science, Nanyang Technological University, Nanyang Avenue, Singapore 639798.
[email protected]
Manish Bhardwaj Signal Processing ICs Microelectronics Design Center Siemens Components Pte. Ltd. 168 Kallang Way Singapore 349253.
[email protected]
A. S. Madhukumar Center for Wireless Communication Science Park II Singapore
[email protected]
Abstract
The perfect reconstruction Quadrature Mirror Filter (QMF) bank is the workhorse of modern-day multirate signal processing systems that carry out analysis, coding, compression and synthesis. Due to its ubiquitous use, it must be implemented cost-effectively over a wide range of sampling frequency and SNR requirements. In this paper, we propose a programmable QMF bank design based on an innovative pseudo-floating point (PFP) number representation as a solution to this wide range of performance and cost constraints. The PFP representation is a constrained form of the full-blown floating-point number representation. Due to its uniform resolution property and reduced entropy, the PFP implementation has significant power and area advantages. The derivation of PFP coefficients from a highly optimized coefficient discretization procedure is also presented. We also describe the design of a high performance, low power multiply-accumulate (MAC) unit based on residue arithmetic that implements this PFP technique.
I. Introduction
Numerous design methodologies are available for QMF banks. Designs based on minimization of a weighted sum of the reconstruction ripple energy and the stopband ripple energy of the prototype analysis filter are discussed in [1-3]. QMF designs based on minimax principles, lattice structures and constrained nonlinear programming approaches are discussed in [4-6]. While these methods result in QMF banks with good performance, they use continuous coefficients in their design and hence become costly when realized in hardware. Chen and Lee [7] have proposed a method for reducing this complexity through coefficient discretization. They propose that the coefficients be represented as sums of powers of two instead of simply rounding the designed continuous coefficients. Their method proceeds with the design of a QMF bank with continuous coefficients, followed by an optimal filter gain search and then a discrete optimization procedure based on a weighted least squares algorithm that minimizes the peak reconstruction error and the peak stopband ripple simultaneously. The gain search algorithm eliminates the effects of the power-of-two discretization, and the entire design is completed within a few iterations. Although several hardware realizations for the optimized filters are available, in this paper we propose a residue number based realization of the QMF bank, which offers significantly higher performance and consumes less power. Our paper is organized as follows: in Section II, we discuss the realization of QMF banks and their VLSI implementation options; in Section III, we give a detailed overview of our pseudo-floating point representation scheme, which is suitable for multiply-accumulate (MAC) units; in Section IV, we describe an architecture based on this representation; and we conclude in Section V.
II. Realization of QMF and VLSI Implementation Options
In general, a QMF filter bank with N coefficients will require at least 2(N-1) delay units, 2(N-1) adders and 2N multipliers. An elegant way of reducing the processing complexity is to use power-of-two coefficients instead of continuous coefficients. These coefficients can be computed using a discrete optimization procedure [7] and updated using an error weighting function. The effect of discretizing small coefficient values is compensated by the reoptimized values of the remaining coefficients, using an appropriate order of discretization. The power-of-two representation of any filter coefficient is:
a_k = sum_{b=1}^{B} S_b * 2^(-P_b)
where S_b ∈ {-1, 0, 1}, P_b ∈ {0, 1, 2, ..., M} and B is the number of digits in the power-of-two representation. After the discrete optimization procedure, we must decide on a suitable VLSI implementation for the linear-phase FIR filters. The basic architectural decision is between a hard-wired FIR filter and a programmable one. The decision depends on a variety of factors:
(a) Throughput requirement (higher favors hard-wired).
(b) Area budget (lower favors programmable).
(c) Energy budget (extremely energy-conscious designs, especially in deep sub-micron, might find that hard-wired is the way to go).
(d) Single/multirate system (a multirate system favors a programmable QMF that can be used across several rates).
(e) Design and commissioning time (extreme commissioning-time pressures favor programmable solutions that can be reused across applications and quickly modified).
(f) Testability concerns (high test coverage might be more easily obtained with a programmable structure).
Our intent is not to delve into these and the various other factors that determine the choice, but rather to indicate that programmable QMF architectures are suitable and feasible for multirate systems with 100-300 Msamples/s performance requirements and moderate power constraints. For solutions that are predominantly single-rate and whose performance requirements are above 500 Msamples/s, hard-wired filters might be the only feasible option. For such filters, an excellent implementation using residue numbers has been proposed by Wrzyszcz et al. [8]. A distributed arithmetic based implementation has also been proposed recently by Premkumar et al. [9]. In the current paper, our focus is on programmable QMF banks. Since the basic unit of the QMF bank is the linear-phase FIR filter, the programmable architecture centers around a high performance multiply-accumulate (MAC) unit with a very basic control path and an address generation unit (AGU) that supports modulo addressing.
Again, we focus on the MAC, since the control and AGU design are routine. Besides, the throughput is constrained by the MAC, and it consumes more area and energy than any other part. In a good VLSI implementation, all the constituent units should operate at close to 100% utilization. Unfortunately, most current MAC units do not deliver the same throughput as the memory subsystem, as indicated in the graph below [10-12]:
[Figure: Gap between memory and MAC throughput. Throughput (Moperands/s) of memories, DSP cores and complete systems plotted against technology feature size (nm); memory throughput consistently exceeds DSP-core and system throughput across the range.]
This "reverse von Neumann gap" is explained by the fact that multiplier latencies in the MAC prevent it from attaining a high throughput [13]. Employing the usual high-performance techniques can close this gap:
1. Super- or µ-pipelining
2. Superscalarity
3. Redundant arithmetic (many techniques, including residue arithmetic)
The main advantage of residue arithmetic over the first two schemes is that, for the FIR operation, it offers significantly higher performance and consumes less energy, since only one reverse conversion is carried out for N MAC operations (N being the order of the filter). Note that this advantage might be significantly reduced for other signal processing algorithms, where the computation/conversion ratio is significantly lower than N, or for those that require magnitude comparisons. For these reasons, we focus on a highly parallel residue arithmetic based MAC unit. In the following section, we introduce a novel arithmetic scheme suitable for programmable MAC units with multipliers.
III. The Pseudo Floating Point (PFP) Representation
As mentioned in the introduction, a high quality implementation results only when optimal techniques are employed at all levels of design abstraction and when these techniques reinforce each other. While Chen et al.'s discrete optimization procedure significantly improves the quality of the discrete coefficients in terms of the Peak Passband Ripple (PPR), Peak Stopband Ripple (PSR) and Peak Reconstruction Error (PRE), it suggests the use of a 16-bit coefficient value. We now show that the inherent entropy of these coefficients is significantly lower than 16 bits. To illustrate this point, we consider coefficients that have already been scaled up by 2^16 and hence are 16-bit integers (we consider only positive coefficients to focus on the development of the concept; negative coefficients are easily integrated later). Consider then a coefficient h_i, which can be written thus:

h_i = sum_{j=0}^{B-1} 2^(a_ij)

Let us rewrite h_i thus:

h_i = 2^(a_i0) * ( sum_{j=0}^{B-1} 2^(a_ij - a_i0) ) = 2^(a_i0) * ( sum_{j=0}^{B-1} 2^(a'_ij) )

We call a_i0 the shift, a'_i(B-1) the span, and the bracketed term the normalized value (n-value). Note that the shift and the n-value are the analogues of the exponent and mantissa in true floating point representations. Instead of expressing the given coefficient as a 16-bit integer, we can express it as a (shift, n-value) pair; this is the definition of the pseudo floating point (PFP) number representation. For a given coefficient set, define L and M as the number of bits needed to encode the shift and the n-value respectively. Then,
L = max_{i=0..N-1} shift(h_i)
M = max_{i=0..N-1} span(h_i) (using bit eliding)

For the worst-case coefficient set, L = 4 and M = 15. Hence, 19 bits are needed by the PFP for general coefficient sets. It would seem that the PFP has little merit in the worst case: it increases the number of bits needed by the coefficients. However, it is important to look at:
1. How many times such a worst-case coefficient can occur theoretically.
2. How many times it occurs in actual coefficient sets.
The objective, of course, is to investigate whether actual coefficient sets require fewer than 19 bits. In what follows, we shall focus on the span, since it contributes significantly more to the wordlength requirement and also because shifts are generally well distributed across coefficients. This is shown in the following plot derived from Chen et al.'s first and second design examples:

[Figure: Distribution of shifts across coefficients for two QMF bank filters; percentage of coefficients (0-20%) against shift value (0-16), showing shifts spread fairly evenly over the range.]
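To make the (shift, span, n-value) decomposition concrete, here is a small Python sketch of our own (an illustration, not part of the original design flow) that encodes a coefficient given as a list of CSD digits (S_b, P_b) after the 2^16 scaling. The example input corresponds to the coefficient hd(29) = -2^-4 + 2^-10 - 2^-13 discussed later in the paper:

```python
def pfp_encode(csd_digits):
    """Encode a coefficient, given as CSD digits [(S_b, P_b), ...] with
    S_b in {-1, 0, 1}, into its PFP (shift, span, n-value) form.

    The coefficient value is sum_b S_b * 2^P_b (already scaled to an integer).
    shift   : smallest exponent present (the 'exponent' analogue)
    span    : largest exponent minus smallest exponent
    n_value : the digits re-based so that the smallest exponent becomes 0
    """
    exps = [p for s, p in csd_digits if s != 0]
    shift = min(exps)
    span = max(exps) - shift
    n_value = [(s, p - shift) for s, p in csd_digits if s != 0]
    return shift, span, n_value

# hd(29) = -2^-4 + 2^-10 - 2^-13, scaled by 2^16, is -2^12 + 2^6 - 2^3:
shift, span, n_value = pfp_encode([(-1, 12), (1, 6), (-1, 3)])
# span comes out as 9, which exceeds a 7-bit n-value budget and marks
# this as one of the "offending" coefficients treated later in the paper
```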
For the more interesting case of the span, we first plot the distribution of spans over all possible coefficients that can be formed for a given wordlength (= 16) and value of B (i.e. the number of CSDs):
[Figure: Distribution of spans across all possible coefficients for 16-bit Booth-recoded integers; percentage of possible coefficients against span value (0-16), plotted for B = 2, 3, 4 and 5.]
Note the behavior of the distribution: as the value of B increases (i.e. the coefficient grid gets denser), the span distribution shifts rapidly towards higher values and takes on a markedly exponential form. Observe that for B = 3 (the value used in Chen's examples), the distribution is centered around a span of 9. Another important plot is the number of admissible coefficients under the imposed constraint of B CSDs. Note that in the absence of such a constraint, there are 2^16 permissible coefficients, i.e. an entropy of 16 bits. The entropy variation against B is plotted below:

[Figure: Coefficient entropy (in bits, 0-16) as a function of B (0-9); the entropy rises with B towards the unconstrained 16-bit value.]
Note that there are only 1710 permissible positive integers less than 2^16 that can be expressed using 3 CSDs with the maximum power restricted to 2^15. This corresponds to a coefficient entropy of 10.74 bits. Thus, using 16 bits for 3-CSD coefficients seems wasteful even from a theoretical point of view. However, employing a minimum entropy coding scheme to achieve, or come close to, this lower bound of 10.74 bits for arbitrary coefficient sets is infeasible, because the DSP would then have to carry out an elaborate coefficient decoding prior to the MAC operation. This is clearly impractical. In other words, although the theoretical distribution and entropy indicate that the worst case has a low probability, we are still constrained by, and must consider, the empirical occurrences of such worst-case conditions. The following plot shows the actually observed span values for two design examples taken from Chen et al.'s paper (with the B = 3 distribution repeated for comparison):
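The admissible-coefficient count can be approximated by brute-force enumeration. The sketch below (our own illustration) counts the distinct positive integers below 2^16 formed from at most three signed powers of two with exponents up to 15; the exact figure of 1710 quoted above additionally depends on the canonical CSD constraints (such as no adjacent non-zero digits), so the raw count here differs somewhat, but the resulting entropy is far below 16 bits either way:

```python
from itertools import combinations, product
from math import log2

values = set()
for b in (1, 2, 3):                            # at most B = 3 signed digits
    for exps in combinations(range(16), b):    # distinct powers 2^0 .. 2^15
        for signs in product((1, -1), repeat=b):
            v = sum(s * 2 ** e for s, e in zip(signs, exps))
            if 0 < v < 2 ** 16:                # positive 16-bit integers only
                values.add(v)

entropy = log2(len(values))                    # well below the 16-bit wordlength
```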
[Figure: Distribution of spans across coefficients for the two QMF bank filters (Example 1, 64-tap; Example 2, 80-tap), with the random (all possible coefficients) distribution shown dotted for comparison; percentage of coefficients against span value (0-16).]
Note that the worst-case span occurrence is much better than the arbitrary case predicted by the dotted line. No spans greater than 10 were observed. Only 4 spans out of 32 and 3 spans out of 40 were greater than 7 in the first and second examples respectively (these being symmetric FIR filters, the number of taps is twice the number of distinct coefficients). Hence, for these two coefficient sets, the PFP implementation will require only 10 bits for the span (using bit eliding) and hence a total of 14 bits: significantly lower than the worst case of 19 bits, but not a major improvement over 16 bits. If we could somehow manage the 3 or 4 spans greater than 7, the PFP bit requirement would drop drastically to 11 bits (which, interestingly, is the same as the theoretical minimum possible for arbitrary sets using an optimal entropy coding scheme!). This could be done in several ways:
1. Modifying the coefficients that exceed the span of 7.
2. Developing a new discrete optimization scheme that minimizes span.
3. Developing an architectural work-around for the few coefficients that do exceed the span.
In technique (1) we could, for example, simply throw away the power of two causing the large span. For example, the coefficient with span 9, hd(29) = -2^-4 + 2^-10 - 2^-13, could be modified thus:
h'd(29) = -2^-4 + 2^-10
Clearly, we can expect distortion in the resulting filter when such a "span reduction" is applied to all the "offending" coefficients. Indeed, we observe that:
(a) The stopband performance does not change significantly, but
(b) The passband ripple increases by about 5 times, and the equiripple nature is destroyed.
For long FIR filters, a preferable technique might be to move this -2^-13 to an adjoining coefficient (i.e. hd(28) or hd(30)). A more refined form of this "moving technique" is to introduce powers of two in adjoining coefficients such that the change in frequency response is minimized (say, in a least mean squares sense).
Although we are yet to investigate this technique, it is clear that it will work satisfactorily only for filters with a large number of taps. In technique (2), we propose to modify Chen et al.'s discretization procedure so that it minimizes span and is unconcerned with the number of CSD digits or even the wordlength (only limited flexibility is possible in the latter case, though). Such an effort reflects the fact that most power-of-two coefficient discretization strategies focus on minimizing the number of CSDs, which is the right approach for hard-wired filters but not optimal for programmable implementations. We are currently working on such a discrete optimization scheme. Technique (3) suggests architectural solutions in case techniques (1) and (2) fail to reduce the span while meeting the filtering requirements. A simple solution exists to tackle the few coefficients that have spans greater than 7: split the offending coefficient into two sub-coefficients that are "span-compliant". For instance, hd(29) = -2^-4 + 2^-10 - 2^-13 can be written thus:
hd(29) = hd(29a) + hd(29b)
hd(29a) = -2^-4 + 2^-10
hd(29b) = -2^-13
All this implies, in terms of implementation, is that an extra MAC needs to be carried out for each offending coefficient. In the examples considered above, this means 72 and 86 MACs for the PFP implementation as against 64 and 80 for normal implementations. This is a small cost considering that:
1. The multiplier complexity has been reduced from 8-bit by 16-bit to 8-bit by 7-bit.
2. Only 11 bits per coefficient need to be stored, instead of 16.
3. Due to (2), one of the buses to the MAC unit can be just 11 bits wide instead of 16.
4. The PFP delivers a limited span with a variable shift grid, which is much more uniform than a fixed-point grid.
5. Recently, it has been demonstrated that floating point and similar number representations (like the logarithmic number system, PFP etc.) deliver substantial power savings in sub-band coding applications.
6. While we have focused on the use of PFP to reduce the wordlength requirement, we could instead use the saved bits to improve filter performance. In other words, since 16 fixed-point bits are roughly equivalent to 11 PFP bits, we can expect 16 PFP bits to deliver performance comparable to about 20-24 fixed-point bits. Such precision might be required in certain applications.
The flip side of PFP is that we need a programmable shifter to scale the product by the PFP coefficient's shift value.
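Technique (3) can be sketched as a simple greedy pass over the CSD digits: start a new span-compliant sub-coefficient whenever the running span would exceed the 7 allowed by the n-value width. The function below is our own illustrative sketch, not the paper's implementation:

```python
SPAN_LIMIT = 7   # spans above this would overflow the 7-bit n-value budget

def split_coefficient(csd_digits, limit=SPAN_LIMIT):
    """Split a CSD coefficient [(S_b, P_b), ...] into span-compliant pieces.

    Digits are grouped greedily from the largest exponent down; a new
    sub-coefficient is started whenever the span limit would be exceeded.
    Each piece costs one MAC, so a split coefficient needs an extra MAC.
    """
    digits = sorted((d for d in csd_digits if d[0] != 0),
                    key=lambda d: -d[1])           # largest exponent first
    pieces, current = [], []
    for s, p in digits:
        if current and current[0][1] - p > limit:  # span of piece too large
            pieces.append(current)
            current = []
        current.append((s, p))
    if current:
        pieces.append(current)
    return pieces

# hd(29) = -2^-4 + 2^-10 - 2^-13 (exponents 12, 6, 3 after 2^16 scaling)
pieces = split_coefficient([(-1, 12), (1, 6), (-1, 3)])
# -> two pieces: hd(29a) = -2^-4 + 2^-10 and hd(29b) = -2^-13
```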
IV. Implementation of an RNS PFP MAC Unit
We now describe how the PFP scheme can be efficiently implemented using the residue number system (RNS). We begin by describing the design of a highly parallel RNS MAC for normal fixed-point representations and then illustrate how it can be adapted for PFP.
IV.1 RNS MAC for Conventional Fixed Point Arithmetic
To achieve a 5-bit residue datapath and a 32-bit dynamic range, we employed the moduli set
S = {32, 31, 29, 27, 25, 23, 19, 17}
Note that the residues of all these moduli are 5 bits each, and hence this is a perfectly balanced set. The parallel channels are clear in the figure below. Also note that the reverse conversion is pipelined. We use pipelined, vector-like fetches for the data and coefficients, and hence the figure also shows a vector write-back pipe.
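A quick numeric check of this moduli set can be made in software (a sketch of our own; note that the set as printed contains eight members, although the channel diagram shows seven, so the exact membership may differ from the fabricated design). RNS validity requires pairwise coprime moduli, balance requires every residue to fit in 5 bits, and the product must cover the dynamic range:

```python
from itertools import combinations
from math import gcd, log2, prod

S = [32, 31, 29, 27, 25, 23, 19, 17]          # moduli set as printed

# RNS validity: moduli must be pairwise coprime
assert all(gcd(a, b) == 1 for a, b in combinations(S, 2))

# balance: every modulus lies in [17, 32], so all residues occupy 5 bits
assert all(17 <= m <= 32 for m in S)

dynamic_range_bits = log2(prod(S))            # about 37 bits for this set
assert dynamic_range_bits > 32                # comfortably above 32 bits
```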
[Figure: RNS MAC architecture. Data[7:0] and Coefficient[15:0] feed parallel residue-MAC channels, one per modulus; a vector write-back pipe returns results to memory, and a pipelined CMAC reverse converter with saturation and multiplexing control produces the final 32-bit result.]
[Figure: MAC slice for the modulus 23. Forward converters (FC23) reduce Data[7:0] and Coefficient[15:0] to 5-bit residues, which are multiplied and accumulated modulo 23; the 5-bit channel output feeds the reverse converter.]
There are two main arithmetic pipelines: the forward fetch-compute pipeline and the backward reverse-conversion-store pipeline. The three main sub-blocks of the MAC are the forward converters, the residue multiplier and the residue accumulator. Note that there is no pre-shift before the multiply. This shifter is needed for the scaling operation, and we have dealt with it by increasing the dynamic range. Also note that the saturation unit is placed with the reverse converter. This is an artifact of using residue arithmetic rather than an architectural decision, as residue tuples cannot be subjected to saturation. The MAC slice for the modulus 23 is illustrated above.
An important architectural feature from the VLSI point of view is that the datapath exhibits significant non-locality only in the forward and reverse converters, rather than globally as in conventional implementations. Restricting non-locality allows one to spend design effort on reducing power and delay in smaller regions rather than attacking the whole design. This important advantage of RNS becomes more pronounced as feature sizes decrease. Also, the use of residue architectures leads to power and reliability advantages, as shown recently [14]. The residue multiplier is based on the bit unfolding technique [15,16] and produces a sum and a carry output. The accumulator also stores a sum and a carry output instead of a single accumulated value. The sum and carry of the accumulator are added to those from the multiplier to form a resultant sum and carry that are correct in the modulo sense. The carry propagation is delayed until, and integrated with, the reverse conversion, which is an interesting feature: propagating the carry in the accumulator is not needed unless the result is to be written back to memory (or read by the programmer), in which case reverse conversion is also necessary. Hence we have integrated it into the reverse converter.
Forward and Reverse Conversion
Like the multiplier, the forward converter is also based on bit unfolding. The reverse converter is of the Compressed Multiply ACcumulate (CMAC) type. The converter has extremely low area requirements compared to the core, and it is easily pipelinable for high throughput. Details can be found in [15].
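The fetch-compute / reverse-convert split can be mimicked behaviorally in software: each channel accumulates its sum of products modulo its modulus, and a single CRT-based reverse conversion recovers the result after all N MACs. This is a plain-arithmetic sketch of our own, not the carry-save, bit-unfolded circuits of [15,16]:

```python
from math import prod

MODULI = [32, 31, 29, 27, 25, 23, 19, 17]
M = prod(MODULI)                     # total dynamic range

def forward(x):
    """Forward conversion: integer -> residue tuple (one per channel)."""
    return [x % m for m in MODULI]

def reverse(res):
    """Reverse conversion via the Chinese Remainder Theorem."""
    total = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # pow(Mi, -1, m): inverse of Mi mod m
    return total % M

def rns_fir(data, coeffs):
    """N channel-wise MACs, then one reverse conversion at the end."""
    acc = [0] * len(MODULI)
    for x, c in zip(data, coeffs):
        acc = [(a + rx * rc) % m
               for a, rx, rc, m in zip(acc, forward(x), forward(c), MODULI)]
    return reverse(acc)
```

The key property the paper exploits is visible here: the per-sample work is seven small independent modular MACs, while the expensive reverse conversion happens once per output sample, not once per MAC.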
Handling Negative Numbers
Negative numbers are handled elegantly using what we term the "mod-complement" number system, the residue analogue of the 2's complement number system, defined thus: for any positive X ≡ (r0, r1, ..., rs), -X is given by (m0 - r0, m1 - r1, ..., ms - rs). A highly beneficial feature of this number system is that, in the absence of overflow, the product of any two numbers, positive or negative, is equal to the product of their mod-complement forms. This differs from two's complement multiplication, where a special procedure must be followed. This residue multiplication property further simplifies the datapath, in turn leading to higher throughput.
Scaling and Saturation
Since scaling residues by factors less than one is unwieldy, the approach taken is to increase the dynamic range to compensate for the lack of the right shift. Saturation arithmetic prevents, or drastically reduces, the possibility of overflow limit cycles [17,18]. Hence, it is essential, and it is placed after the reverse converter. This placement works to our benefit because it is quite meaningless, and in fact erroneous, to apply saturation to intermediate results: it would interfere with the inherent ability of the mod-complement number system to deal with intermittent overflows.
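The mod-complement multiplication property can be checked numerically. The sketch below (our own illustration) negates a residue tuple channel-wise and confirms that products of mod-complement forms match the residues of the true products:

```python
MODULI = [32, 31, 29, 27, 25, 23, 19, 17]

def fwd(x):
    """Residues of x; Python's % already yields non-negative residues."""
    return [x % m for m in MODULI]

def mod_complement(res):
    """Residue analogue of 2's complement: the residues of -X from those of X.
    The outer % m maps the r = 0 case (m - 0 = m) back to 0."""
    return [(m - r) % m for r, m in zip(res, MODULI)]

def res_mul(a, b):
    """Channel-wise residue multiplication."""
    return [(x * y) % m for x, y, m in zip(a, b, MODULI)]

# (-7) * (-13) has the same residue tuple as 7 * 13, with no special
# sign handling, unlike two's complement multiplication:
assert res_mul(mod_complement(fwd(7)), mod_complement(fwd(13))) == fwd(7 * 13)
```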
IV.2 PFP RNS MAC
The most important modification to the above design for a PFP MAC unit is to accommodate the shifts. In a conventional number representation, this would need a high-wordlength, high-speed shifter, but in RNS it needs only a 5-bit multiplication, as we show below. Modular exponentiation, i.e. operations of the type |2^shift|_mi, is trivially implemented when m_i is of the form 2^n, 2^n - 1 or 2^n + 1. In such cases, the output of the exponentiation block is simply a power of two (positive or negative), and the mod multiplier that follows the modular exponentiation also degenerates to a very simple circuit. Hence, for the moduli 32, 31 and 17, we have very compact and low latency circuits.
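The trivial cases can be written down directly. This sketch (our own, checked against true modular exponentiation) shows why the three moduli 32 = 2^5, 31 = 2^5 - 1 and 17 = 2^4 + 1 are easy:

```python
def exp2_mod(shift, m):
    """|2^shift|_m for the 'easy' moduli 2^n, 2^n - 1 and 2^n + 1.

    For these forms the result is always a (possibly negated) power of two,
    so the mod multiplier that follows degenerates to a simple circuit.
    `shift` is the 4-bit PFP shift (0..15).
    """
    if m == 32:                      # 2^5: shifts of 5 or more reduce to 0
        return (1 << shift) if shift < 5 else 0
    if m == 31:                      # 2^5 - 1: 2^5 = 1, exponent wraps mod 5
        return 1 << (shift % 5)
    if m == 17:                      # 2^4 + 1: 2^4 = -1, so the sign alternates
        p = 1 << (shift % 4)
        return p if (shift // 4) % 2 == 0 else 17 - p
    return pow(2, shift, m)          # any other modulus: general fallback

# sanity check against Python's built-in modular exponentiation
assert all(exp2_mod(s, m) == pow(2, s, m) for m in (32, 31, 17) for s in range(16))
```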
[Figure: PFP coefficient datapath. The 11-bit Coefficient[10:0] is split into the 7-bit elided n-value, which undergoes forward conversion, and the 4-bit shift, which drives modular exponentiation; the two 5-bit results meet in a mod-m_i multiplier feeding the MAC channel.]

For the other moduli, the following derivation can be applied to develop efficient modular exponentiation circuits. We consider |2^shift|_23.
Writing the 4-bit shift as s3 s2 s1 s0, and using s' to denote the complement of a bit s:

|2^(s3 s2 s1 s0)|_23
  = |2^(8 s3 + 4 s2 + 2 s1 + s0)|_23
  = |256^s3 * 16^s2 * 4^s1 * 2^s0|_23
  = |(255 s3 + 1)(15 s2 + 1) * 4^s1 * 2^s0|_23
  = ||(2 s3 + 1)(15 s2 + 1)|_23 * 4^s1 * 2^s0|_23      (since 255 ≡ 2 mod 23)
  = ||30 s3 s2 + 15 s2 + 2 s3 + 1|_23 * 4^s1 * 2^s0|_23
  = |(7 s3 s2 + 15 s2 + 2 s3 + 1) * 4^s1 * 2^s0|_23
  = |(s3 s2 + 15 s3' s2 + 2 s3 s2' + 1) * 4^s1 * 2^s0|_23
  = |(16 s3' s2 + 2 s3 + s2') * 4^s1 * 2^s0|_23

Multiplexing on s1 s0, with the mod-23 reductions already applied:

  = (16 s3' s2 + 2 s3 + s2') . s1' s0'
    or (8 s3' s2 + 4 s3 + 2 s2' + s3' s2) . s1' s0
    or (16 s3' s2 + 8 s3 + 4 s2' + 2 s3' s2) . s1 s0'
    or (16 s3 s2 + 8 s3' + 4 s3' s2 + (s3 ⊕ s2)) . s1 s0

Writing out the five output bits of each case:

  = (s3' s2, 0, 0, s3, s2') . s1' s0'
    + (0, s3' s2, s3, s2', s3' s2) . s1' s0
    + (s3' s2, s3, s2', s3' s2, 0) . s1 s0'
    + (s3 s2, s3', s3' s2, 0, s3 ⊕ s2) . s1 s0
Note that the "+" in the last step denotes the logical OR operator, unlike the addition it denotes in the previous steps. Also, the bracketed expressions show the individual bits of the 5-bit output word. Ignoring inversions, we need only 3 AND gates, one OR gate and a 4-to-1 multiplexer to implement the expression derived. Alternative derivations that treat this as a 4-input, 5-output logic design problem will yield the same result; however, for larger shifts, the derivation shown above is easier to tackle and yields compact exponentiation circuits. Finally, it is worth noting that the 5-bit output expression derived is correct mod 23. The forward converter and mod multiplier are routine and both employ bit unfolding. Due to the use of bit eliding, we have the additional problem of representing zero coefficients. This is easily handled by adopting the convention that coefficients with shift = 15 and with the MSB of the n-value equal to 1 are considered zeros. Note that such coefficients represent integers beyond 2^16, and hence a valid coefficient will not be accidentally flagged as zero. A 5-input NAND gate that decodes the shift and the MSB of the n-value sets the 5-bit output of the circuitry above to zero when all ones are detected.
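The derived multiplexed expression can be checked exhaustively in software. The complemented terms (s') in the sketch below are a reconstruction, since the complement overbars did not survive typesetting of the derivation, but the expression agrees with true modular exponentiation for all 16 shift values:

```python
def exp2_mod23(shift):
    """Bit-level |2^shift|_23 from the derivation above (n3, n2 are s3', s2')."""
    s3, s2 = (shift >> 3) & 1, (shift >> 2) & 1
    s1, s0 = (shift >> 1) & 1, shift & 1
    n3, n2 = 1 - s3, 1 - s2
    branch = [
        16 * (n3 & s2) + 2 * s3 + n2,                        # case s1' s0'
        8 * (n3 & s2) + 4 * s3 + 2 * n2 + (n3 & s2),         # case s1' s0
        16 * (n3 & s2) + 8 * s3 + 4 * n2 + 2 * (n3 & s2),    # case s1  s0'
        16 * (s3 & s2) + 8 * n3 + 4 * (n3 & s2) + (s3 ^ s2), # case s1  s0
    ]
    return branch[2 * s1 + s0]                               # 4-to-1 mux

# the 4-input, 5-output circuit matches true modular exponentiation:
assert all(exp2_mod23(s) == pow(2, s, 23) for s in range(16))
```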
V. Conclusion
While Chen's discrete optimization significantly improves the quality of discrete coefficients in terms of the ripple characteristics of the QMF bank, it suggests the use of a 16-bit coefficient value. In this paper, we have shown that the inherent entropy of these coefficients is lower than 16 bits. In the process, we have developed a new representation for the coefficients, the pseudo-floating point representation. This representation uses fewer bits for the majority of filters, although in the worst case it uses more. Since the worst case has a low probability of occurrence, we have proposed three modifications that reduce the number of PFP bits with minimal performance degradation. In implementing the PFP MACs, the residue number system has been used, which offers higher performance and consumes less energy. Residue numbers also allow elegant shift operations using modular exponentiation. The moduli set chosen for the architecture is perfectly balanced, while a global approach has been used for the forward and reverse conversions. These conversions use newer techniques such as bit unfolding and compressed multiply-accumulate, which serve to economize real estate on the VLSI.
References
[1] T. Q. Nguyen and P. P. Vaidyanathan, 'Two Channel Perfect Reconstruction FIR-QMF Structures which yield Linear Phase Analysis and Synthesis,' IEEE Transactions on ASSP, vol. ASSP-37, pp. 676-690, May 1989.
[2] J. D. Johnston, 'A Filter Family designed for use in QMF Banks,' Proc. International Conf. ASSP, Apr. 1980, pp. 291-294.
[3] V. K. Jain and R. E. Crochiere, 'QMF Design in the Time Domain,' IEEE Transactions on ASSP, vol. ASSP-32, Apr. 1984, pp. 353-361.
[4] C. K. Chen and J. H. Lee, 'Design of QMF with Linear Phase in the Frequency Domain,' IEEE Trans. CAS II, vol. 39, Sept. 1992, pp. 593-605.
[5] F. Grenez, 'Design of QMF by Linear Programming,' Proc. International Conf. ASSP, Apr. 1986, pp. 2615-2618.
[6] B. R. Horng and A. N. Willson, 'Lagrange Multiplier Approaches to the Design of Two Channel Perfect Reconstruction Linear Phase FIR Filter Banks,' IEEE Trans. CAS II, vol. 41, Feb. 1990, pp. 364-374.
[7] C. K. Chen and J. H. Lee, 'Design of Linear Phase QMF with Power of Two Coefficients,' IEEE Trans. CAS II, vol. 41, Jul. 1994, pp. 445-456.
[8] A. Wrzyszcz, D. Milford and E. L. Dagless, 'A New Approach to Fixed Coefficient Inner Product Computation Over Finite Rings,' IEEE Trans. on Computers, vol. 45, Dec. 1996, pp. 1345-1355.
[9] K. P. Lim and A. B. Premkumar, 'A Modular Approach to the Computation of the Convolution Sum Using DA Principles,' to appear in IEEE Trans. on CAS II.
[10] DSP Group, OakDSPCore Specification, 1997.
[11] Artisan Components, High Speed Synchronous SRAM Generators, http://www.artisan.com/.
[12] DSP Group, DSP Group Announces the New Teak DSPCore, http://www.dspg.com/, Feb. 1998.
[13] T. Thong and Y. C. Jenq, 'Hardware and Architecture,' in Handbook for DSP, S. K. Mitra and J. Kaiser, Eds., New York, NY: Wiley-Interscience, John Wiley & Sons, 1993, pp. 721-782.
[14] M. Bhardwaj and A. Balaram, 'Low Power Signal Processing Architectures Using Residue Arithmetic,' Proc. Intl. Conf. Acoustics, Speech and Signal Processing, ICASSP'98, 1998, pp. 3017-3020.
[15] M. Bhardwaj, T. Srikanthan and C. T. Clarke, 'Area-Time Efficient VLSI Residue to Binary Converters,' to appear in IEE Proceedings - Computers and Digital Techniques.
[16] M. Bhardwaj, T. Srikanthan and C. T. Clarke, 'Implementing Area-Time Efficient VLSI Residue to Binary Converters,' Proc. IEEE Workshop on Signal Processing Systems: Design and Implementation, 1997, pp. 163-172.
[17] P. M. Ebert, J. E. Mazo and M. C. Taylor, 'Overflow Oscillations in Digital Filters,' Bell Syst. Tech. J., vol. 48, 1969, pp. 2999-3020.
[18] S. R. Parker and S. F. Hess, 'Limit Cycle Oscillations in Digital Filters,' IEEE Trans. Circuit Theory, vol. CT-18, 1971, pp. 687-697.