An Efficient Architecture of RNS Based Wallace Tree Multiplier for DSP Applications Partha Pratim Kundu[1], Oishila Bandyopadhyay[2], Amitabha Sinha [3] [1], [2], [3] School of Information Technology West Bengal University of Technology, BF-142, sector-1, Salt lake City, Kolkata-700064, India E-mail:
[email protected] [3],
[email protected] [1],
[email protected] [2]
Abstract: This paper introduces a novel technique for determining the optimal moduli set and presents an efficient RNS multiplier, based on the Wallace tree multiplier, for a 32-bit arithmetic unit in DSP applications. Performance analysis on DSP functions such as FIR and FFT indicates the advantage of the scheme.
I. INTRODUCTION
Digital signal processing algorithms are computationally intensive, and most of them require high-performance arithmetic operations, multiplications and additions in particular. To enhance the performance of computationally intensive DSP algorithms, adders such as the carry look-ahead adder [3], carry select adder [3] and carry save adder [3] [12], and multipliers such as the array multiplier [3] and Wallace tree multiplier [3] [12], have been used extensively. Most of these adders and multipliers, however, suffer from carry propagation delay [2], [4]. In recent times, alternative number systems such as the Double-Base Number System (DBNS) [8], Logarithmic Number System (LNS) [9], Fermat Number System (FNS) [10] and Residue Number System (RNS) [1], [2], [4] have become attractive because they perform addition and multiplication efficiently. Among these, RNS is attracting research interest because of its concurrency and carry-free arithmetic. In RNS, a number is represented by several residue digits, so additions, subtractions and multiplications of wide operands can be decomposed into a set of parallel sub-operations [1], [2], [4]. The efficiency of an RNS multiplier depends largely on the selection of the moduli set. Different architectures have been proposed in this regard for fast RNS modulo adders and modulo multipliers [1], [2], [16]. With this in view, this paper introduces an efficient LUT-based RNS multiplier (32 bit) for DSP applications. The multiplication itself is based on the Wallace tree multiplier. The performance of the proposed multiplier is analysed in detail, and the results indicate its superiority over the array multiplier and the binary Wallace tree multiplier.
978-1-4244-2167-1/08/$25.00 ©2008 IEEE
II. REVIEW OF RESIDUE NUMBER SYSTEM
A residue number system is characterized by an n-tuple of integers $(m_1, m_2, \ldots, m_n)$. Each $m_i$, $i = 1, \ldots, n$, is called a modulus, and the moduli are selected to be pairwise relatively prime, i.e., $\gcd(m_i, m_j) = 1$ for $i \neq j$, $1 \le i, j \le n$. Any integer $X$ is represented in the RNS by an n-tuple of integers
$X \xrightarrow{\mathrm{RNS}} (x_1, x_2, x_3, \ldots, x_n)$, where $x_i = X \bmod m_i$.
If $X < M$ and $M = \prod_{i=1}^{n} m_i$, the set of residues is unique for any $X$, provided the set of moduli contains only relatively prime moduli. $M$ represents the dynamic range of the system. The bit efficiency of a moduli set is computed as the ratio of the dynamic range to the data range of the system. For example, in a 32-bit system the data range is $2^{32}$ and the bit efficiency is $M/2^{32}$. In RNS, residue additions and multiplications can be performed independently. Let $X$ and $Y$ have residue codes $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_n)$, with $X, Y, X \circ Y \in [0, M-1]$. Then $|X \circ Y|_{m_i} = |x_i \circ y_i|_{m_i}$, and it follows that
$(|X \circ Y|_{m_1}, |X \circ Y|_{m_2}, \ldots, |X \circ Y|_{m_n}) = (|x_1 \circ y_1|_{m_1}, |x_2 \circ y_2|_{m_2}, \ldots, |x_n \circ y_n|_{m_n})$,
where $\circ$ can be addition or multiplication.
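As a minimal sketch of the representation just described, the following Python fragment converts integers to residues and applies addition and multiplication channel by channel. The names `to_rns` and `rns_apply` and the operand values are illustrative assumptions, not taken from the paper; the moduli are the set the paper selects in Section II.A.

```python
import operator

# Moduli set selected in Section II.A of the paper.
MODULI = (32, 31, 29, 27, 25, 23, 13)

def to_rns(x, moduli=MODULI):
    """Forward conversion: x -> (x mod m1, ..., x mod mn)."""
    return tuple(x % m for m in moduli)

def rns_apply(op, xs, ys, moduli=MODULI):
    """Apply op (addition or multiplication) independently in each residue channel."""
    return tuple(op(a, b) % m for a, b, m in zip(xs, ys, moduli))

# Arbitrary operands (hypothetical values); their sum and product stay below M.
X, Y = 1234, 5678
x_rns, y_rns = to_rns(X), to_rns(Y)

# Channel-wise results match the residues of the ordinary integer results.
assert rns_apply(operator.add, x_rns, y_rns) == to_rns(X + Y)
assert rns_apply(operator.mul, x_rns, y_rns) == to_rns(X * Y)
print(x_rns, y_rns, rns_apply(operator.mul, x_rns, y_rns))
```

No channel needs information from any other channel, which is the concurrency property the text refers to.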
A. Selection of moduli
The basic criterion for selecting the moduli of an n-bit RNS system is that the moduli m1, m2, …, mn should be pairwise relatively prime. The selection of the moduli set affects the bit efficiency [1] and the circuit complexity [3] (i.e. speed and hardware complexity) of arithmetic algorithms. The magnitude of the largest modulus (the modulus with the maximum number of bits) determines the speed of operations, so it is preferable to make all the moduli comparable in size to the largest one. To select the optimal moduli set for a 32-bit RNS ALU, we computed the bit efficiency of candidate sets and chose the set that is best with respect to both bit efficiency and circuit complexity. Analysing the variation of bit efficiency with the number of bits in the moduli shows that bit efficiency increases with the number of moduli bits (Fig. 1). We have also found that time complexity increases with the number of moduli bits. The moduli set {32, 31, 29, 27, 25, 23, 13} has been selected to represent 32-bit binary data in RNS.
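The properties required of the selected set can be checked with a short Python sketch. The helper names are hypothetical, and `from_rns` uses the standard Chinese Remainder Theorem reconstruction, which is an assumption here: the paper does not describe its reverse converter.

```python
from itertools import combinations
from math import gcd, prod

MODULI = (32, 31, 29, 27, 25, 23, 13)

def pairwise_coprime(moduli):
    """True when every pair of moduli has gcd 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))

def dynamic_range(moduli):
    """M = product of all moduli."""
    return prod(moduli)

def residue_bits(m):
    """Bits needed to hold the residues 0 .. m-1."""
    return max(1, (m - 1).bit_length())

def from_rns(residues, moduli):
    """CRT reconstruction (standard technique, not taken from the paper)."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # modular inverse, Python 3.8+
    return total % M

M = dynamic_range(MODULI)
assert pairwise_coprime(MODULI)
assert M >= 2**32                                   # covers every 32-bit value
assert max(residue_bits(m) for m in MODULI) == 5    # widest channel is 5 bits

x = 0xDEADBEEF                                      # arbitrary 32-bit test value
assert from_rns([x % m for m in MODULI], MODULI) == x
print(M, M / 2**32)
```

Because every residue fits in 5 bits, each channel of the multiplier described next can be built from 5-bit datapaths.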
Fig. 1: Bit efficiency vs no. of bits in moduli for 32 bit data (x-axis: moduli bits; y-axis: bit efficiency in %)
III. ARCHITECTURE OF THE PROPOSED MULTIPLIER
RNS multiplication can be performed using a multistage carry-save adder circuit, called a Wallace tree after its inventor [Wallace 1964] [12]. The proposed scheme takes as inputs Xrns and Yrns, the residues of the binary numbers X and Y with respect to the modulus mi. With $M_k = X_{rns} \cdot Y_{rns}[k]$, the inputs to the adder tree are the terms
$\langle M_k \rangle_{m_i} = (M_k \cdot 2^k) \bmod m_i = (X_{rns} \cdot Y_{rns}[k] \cdot 2^k) \bmod m_i$,
and the desired product is
$P = \left( \sum_{k=0}^{n-1} \langle M_k \rangle_{m_i} \right) \bmod m_i$.
Each $M_k$ is formed by n two-input AND gates, where n is the maximum size (in bits) of a residue. One input of each AND gate is Yrns[k] and the other is one bit of Xrns, so the bit-by-bit multiplication is computed by AND gate networks. Since $M_k \cdot 2^k$ may exceed the permissible residue range of mi, a mapping function, implemented as a LUT, reduces it modulo mi. In total, n LUTs of size $2^n \times n$ are required for n-bit residues. The final summation is computed by the carry-save adder (CSA) tree mentioned above, which produces an n-bit sum word and a $(\lceil \log_2(n-1) \rceil + n)$-bit carry word. The final assimilation is performed by a CLA with normal internal carry propagation. The CLA result (Imul in Fig. 2) has $\lceil \log_2(n-1) \rceil + n$ bits. The most significant $\lceil \log_2(n-1) \rceil$ bits of Imul are mapped into the range of the modulus mi; that is, the mapping function $(Imul[n + \lceil \log_2(n-1) \rceil - 1 : n] \cdot 2^n) \bmod m_i$ is performed using a LUT. The LUT output and the least significant n bits of Imul (5 bits in Fig. 2) are added using a modulo-mi adder. Fig. 2 depicts the proposed RNS Wallace tree multiplier for 5-bit moduli.
Fig. 2: RNS Wallace tree multiplier for 5 bit moduli (blocks: AND gate array forming M0 to M4 from Xrns and Yrns; LUTs of size $2^5 \times 5$ computing $(M_k \cdot 2^k) \bmod m_i$; CSA stages with sum S and carry C outputs; 8-bit CLA producing Imul; LUT of size $2^2 \times 5$ computing $(Imul[6:5] \cdot 2^5) \bmod m_i$; modulo adder producing $(X_{rns} \cdot Y_{rns}) \bmod m_i$)
Example: Let mi = 29, Xrns = 27 = (11011)2 and Yrns = 21 = (10101)2. With $M_k = X_{rns} \cdot Y_{rns}[k]$ for k = 0 to 4: M0 = 11011, M1 = 00000, M2 = 11011, M3 = 00000, M4 = 11011. The modular terms $\langle M_k \rangle_{m_i} = (M_k \cdot 2^k) \bmod m_i$ are then $\langle M_0 \rangle_{m_i}$ = 11011, $\langle M_1 \rangle_{m_i}$ = 00000, $\langle M_2 \rangle_{m_i}$ = 10101, $\langle M_3 \rangle_{m_i}$ = 00000, $\langle M_4 \rangle_{m_i}$ = 11010. These terms are the inputs to the carry-save adder tree, whose final outputs are S = 10100 and C = 0110110. The CLA gives Imul = 1001010. Then $(Imul[6:5] \cdot 2^5) \bmod m_i$ = 00110 and the modulo addition of 00110 and Imul[4:0] = 01010 gives 10000. The final result is the binary representation of the product of the two residues, i.e. $(X_{rns} \cdot Y_{rns}) \bmod m_i = (27 \cdot 21) \bmod 29 = 16 = (10000)_2$.
A. Implementation of Array Multiplier
To perform multiplication using an array multiplier, the product P = X · Y can be represented as
$P = \sum_{i=0}^{n-1} 2^i x_i Y$.
Considering bit-by-bit multiplication, P can be rewritten as
$P = \sum_{i=0}^{n-1} 2^i \left( \sum_{j=0}^{n-1} x_i y_j 2^j \right)$.
The $n^2$ one-bit product terms $x_i y_j$ can be computed by an n × n array of two-input AND gates. The summation can be performed by an array of n(n−1) full adders. Multiplication of two n-bit numbers produces a 2n-bit product. In RNS, the product must lie within the permissible residue range of mi, so a mapping is used, implemented with a LUT. The most significant n bits of the multiplication result (Imul1) are converted into the equivalent number in the range of the modulus mi: the LUT stores the mapping function $(Imul1[2n-1 : n] \cdot 2^n) \bmod m_i$. The converted value and the least significant n bits of Imul1, i.e. Imul1[n−1 : 0], are added in a modulo-mi adder. Fig. 3 shows the scheme of the RNS array multiplier. For multiplication of two 5-bit numbers (for 5-bit moduli), we have used a LUT of size $2^5 \times 5$ and a 5-bit modulo adder.
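The two datapaths of Section III can be mimicked with a small functional model in Python. This is a behavioural sketch under stated assumptions, not the authors' hardware description: the function names are hypothetical, the LUTs are plain Python lists, the CSA stages are chained in a simple order rather than laid out as a balanced tree (so the intermediate S and C words may differ from those in the worked example), and the final modulo adder is modelled with Python's `%` operator.

```python
def csa(a, b, c):
    """3:2 carry-save adder on integers: returns (sum word, carry word)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def csa_tree(terms):
    """Reduce a list of operands to two words without propagating carries."""
    terms = list(terms)
    while len(terms) > 2:
        a, b, c = terms[:3]
        terms = terms[3:] + list(csa(a, b, c))
    while len(terms) < 2:
        terms.append(0)
    return terms[0], terms[1]

def rns_wallace_mul(x, y, m, n):
    """(x * y) mod m for n-bit residues, following the Wallace-tree datapath."""
    mask = (1 << n) - 1
    # One LUT per bit position k, indexed by the n-bit word M_k.
    luts = [[(v << k) % m for v in range(1 << n)] for k in range(n)]
    # AND-gate array: M_k equals x when bit k of y is 1, otherwise 0.
    partial = [luts[k][x if (y >> k) & 1 else 0] for k in range(n)]
    s, c = csa_tree(partial)
    imul = s + c                         # final carry-propagate (CLA) addition
    hi_mapped = ((imul >> n) << n) % m   # LUT applied to the upper bits of Imul
    return (hi_mapped + (imul & mask)) % m

def rns_array_mul(x, y, m, n):
    """(x * y) mod m via a 2n-bit product followed by the LUT mapping."""
    mask = (1 << n) - 1
    imul1 = x * y                        # stands in for the AND array and adder array
    hi_mapped = ((imul1 >> n) << n) % m  # LUT applied to Imul1[2n-1:n]
    return (hi_mapped + (imul1 & mask)) % m

# Worked example from the text: residues 27 and 21 modulo 29.
print(rns_wallace_mul(27, 21, 29, 5), rns_array_mul(27, 21, 29, 5))
```

One detail the model hides: the lower n bits of Imul can be as large as 2^n − 1, which may exceed mi − 1, so a hardware modulo adder that subtracts mi at most once would need its inputs checked against that case.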
B. Implementation of Modulo Adder
The modular addition (xi + yi) mod mi consists of two binary additions. If the result of xi + yi exceeds mi − 1, the modulus mi has to be subtracted [1]. Thus zi = (xi + yi) mod mi, where 0 ≤ xi, yi ≤ mi − 1, can be carried out as
$z_i = \begin{cases} x_i + y_i & \text{if } x_i + y_i < m_i \\ x_i + y_i - m_i & \text{if } x_i + y_i \ge m_i \end{cases}$
The carry bit generated by the second adder indicates whether xi + yi is greater than or equal to mi. A multiplexer controlled by this carry selects the correct output.
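The two-addition scheme with carry-controlled selection can be sketched in Python as follows. The function name and the explicit (n+1)-bit two's-complement arithmetic are illustrative assumptions; the sketch models the behaviour of the adder, not its CSA and CLA gate structure.

```python
def mod_add(x, y, m, n):
    """(x + y) mod m for n-bit residues 0 <= x, y < m, using two additions and a mux."""
    assert 0 <= x < m and 0 <= y < m and m <= (1 << n)
    width = n + 1
    s = x + y                          # first addition: plain sum, fits in n+1 bits
    t = s + ((1 << width) - m)         # second addition: s plus the two's complement of m
    carry = t >> width                 # carry-out is 1 exactly when s >= m
    diff = t & ((1 << width) - 1)      # equals s - m whenever carry is 1
    return diff if carry else s        # multiplexer controlled by the carry

# Final step of the worked example in Section III: 00110 + 01010 modulo 29.
print(mod_add(6, 10, 29, 5))
```

Both additions can run in parallel in hardware, since the second does not need the first to finish when xi, yi and −mi are first compressed by a carry-save stage.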
Fig. 3: RNS Array multiplier for 5 bit moduli
Fig. 4: n bit Modulo adder using CSA (blocks: carry-save adder on xi, yi and −mi; n-bit CLA; (n+1)-bit CLA; multiplexer)
The modulo adder consists of a CSA, CLAs and a multiplexer. The CSA is represented by a full adder, so it contributes a 4-unit delay. For 5-bit residue addition, two 4-bit CLAs are required, so the 8-bit CLA contributes 16 unit delays. The multiplexer contributes 2 units of AND2 delay. The delay of the 5-bit modular adder can therefore be represented as tmodADD = tCSA + tnCLA + tMUX. This modulo adder unit performs the final-stage addition in the proposed Wallace tree multiplier.
IV. COMPUTATION OF DELAY
To implement the Wallace tree multiplier in RNS, the delay and hardware complexity of different adder and multiplier circuits have been compared. Circuit complexity has been analysed using the unit-gate model (gate-equivalent (GE) model) presented in [3]. The assumptions are as follows:
Inverter, buffer: Delay = 0, Area = 0
Simple monotonic two-input gates (AND, NAND, OR, NOR): Delay = 1, Area = 1
Simple non-monotonic two-input gates (XOR, XNOR): Delay = 2, Area = 2
Simple m-input gates: Delay = $\lceil \log_2 m \rceil$, Area = m − 1
The proposed Wallace tree multiplier requires 1 AND2 delay + n-bit LUT access delay + TCSA + TCLA + access time of the $\lceil \log_2(n-1) \rceil$-bit LUT + time for modulo addition. The hardware required is $N \times [(2^n \times n) \times n + 2^{\lceil \log_2(n-1) \rceil} \times n]$ bits of LUT + $n^2$ AND2 gates + the hardware of the CSA + the hardware of $\lceil (n+1)/4 \rceil$ 4-bit CLAs + the hardware of an (n+1)-bit modulo adder. The RNS array multiplier requires 1 AND2 delay + the delay of n(n−2) one-bit full adders + access time of the n-bit LUT + the delay of an n-bit modulo adder. The comparative study of the time complexity of the RNS array multiplier, the inner product (IP) multiplier (modulo multiplier) [4] and the proposed Wallace tree multiplier is shown in Fig. 6. The proposed Wallace tree multiplier has the minimum delay. We have also compared the time complexity of the array multiplier and the proposed Wallace tree multiplier in both binary and RNS. The proposed Wallace tree multiplier using RNS exhibits the lowest time complexity compared to its binary counterpart and the array multiplier (Fig. 7).
Fig. 6: Time complexity of multiplier vs moduli bits for 32 bit data (curves: Wallace delay, IP delay, Array delay; x-axis: moduli bits; y-axis: time complexity)
Fig. 7: Comparison of time complexity between binary and RNS multiplier with 5-bit maximum moduli (curves: BIN Wallace, RNS Wallace, BIN Array, RNS Array; x-axis: bits; y-axis: time complexity)
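The unit-gate assumptions and the stated modulo-adder delay can be written down as a small Python sketch. The function names are hypothetical. The per-channel LUT size assumes n product LUTs of 2^n × n bits plus one final LUT of 2^⌈log2(n−1)⌉ × n bits, as read from the LUT sizes shown in Fig. 2, and should be treated as an assumption rather than the authors' exact accounting.

```python
from math import ceil, log2

def gate_cost(kind, inputs=2):
    """(delay, area) of one gate under the unit-gate model of Section IV."""
    if kind in ("INV", "BUF"):
        return (0, 0)
    if kind in ("XOR", "XNOR") and inputs == 2:
        return (2, 2)
    if kind in ("AND", "NAND", "OR", "NOR") and inputs >= 2:
        return (ceil(log2(inputs)), inputs - 1)
    raise ValueError(f"unsupported gate: {kind} with {inputs} inputs")

# Delays quoted in the text for the 5-bit modulo adder, in unit-gate delays.
T_CSA, T_CLA8, T_MUX = 4, 16, 2

def mod_adder_delay_5bit():
    """tmodADD = tCSA + tnCLA + tMUX with the values given in the text."""
    return T_CSA + T_CLA8 + T_MUX

def lut_bits_per_channel(n):
    """LUT storage for one modulus channel (assumed reading of Fig. 2)."""
    product_luts = n * (2**n * n)              # n LUTs of size 2^n x n
    final_lut = 2**ceil(log2(n - 1)) * n       # LUT on the upper bits of Imul
    return product_luts + final_lut

print(gate_cost("AND"), gate_cost("XOR"), gate_cost("OR", 4))
print(mod_adder_delay_5bit(), lut_bits_per_channel(5))
```

LUT access times are not quantified in the text, so the sketch stops at the quantities the paper states numerically.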
V. PERFORMANCE ANALYSIS
Performance analysis of popular DSP algorithms (FIR and FFT) was carried out in detail using the modulo adder and the proposed multiplier. An N-tap FIR filter needs N multiplications, (N−1) additions, one binary-to-RNS conversion and one RNS-to-binary conversion [1]. The FFT is computed using a butterfly computation unit, where each butterfly involves one complex multiplication and two complex additions. For an N-point FFT, where N is a power of 2, the total number of complex multiplications is (N/2) log2 N and the total number of complex additions is N log2 N. Since each butterfly involves 4 real multiplications and 6 real additions, the total number of real multiplications is 2N log2 N and the total number of real additions is 3N log2 N. In addition, N binary-to-RNS and N RNS-to-binary conversions are required [15]. The comparative studies of the time complexity of the FIR and FFT algorithms are shown in Fig. 8 and Fig. 9 respectively.
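The operation counts above can be tabulated with a short Python sketch. The function names and dictionary keys are illustrative assumptions; the counts follow the formulas in the text, and no per-operation delays are assumed because the paper gives them only graphically.

```python
def fir_ops(taps):
    """Operation counts for an N-tap FIR filter, as stated in the text."""
    return {"mul": taps, "add": taps - 1, "bin_to_rns": 1, "rns_to_bin": 1}

def fft_ops(n):
    """Operation counts for an N-point radix-2 FFT, N a power of 2."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of 2 and at least 2")
    stages = n.bit_length() - 1            # log2(n) for a power of 2
    butterflies = (n // 2) * stages        # one complex multiplication each
    return {
        "complex_mul": butterflies,
        "complex_add": 2 * butterflies,    # N * log2(N)
        "real_mul": 4 * butterflies,       # 2 * N * log2(N)
        "real_add": 6 * butterflies,       # 3 * N * log2(N)
        "bin_to_rns": n,
        "rns_to_bin": n,
    }

print(fir_ops(200))
print(fft_ops(1024))
```

The conversion counts grow only linearly in N while the arithmetic grows as N log2 N, which is why the conversion overhead matters less for large FFTs.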
Fig. 8: Comparison of time complexity of FIR filter between RNS and binary arithmetic unit (curves: RNS AU, Binary AU; x-axis: number of taps of FIR filter; y-axis: time complexity, ×10^4)
Fig. 9: Comparison of time complexity of FFT between RNS and binary arithmetic unit (curves: RNS AU, Binary AU; x-axis: number of points of FFT; y-axis: time complexity, ×10^6)
A. Synthesis report of RNS AU
We have implemented an RNS arithmetic unit (AU) with the proposed RNS Wallace tree multiplier. The AU was simulated using Modelsim and validated on a Xilinx Virtex 4 FPGA. The synthesis report is given below.
Device utilization summary:
Selected Device: 4vlx25ff668-10
Number of Slices: 2358 out of 10752 (21%)
Number of Slice Flip Flops: 35 out of 21504 (0%)
Number of 4 input LUTs: 4144 out of 21504 (19%)
Number of bonded IOBs: 102 out of 450 (22%)
Number of GCLKs: 1 out of 32 (3%)
Timing Summary:
Speed Grade: -10
Minimum period: No path found
Minimum input arrival time before clock: 35.607 ns
Maximum output required time after clock: 25.692 ns
Maximum combinational path delay: No path found
VI. CONCLUSION
In this paper, we have carried out a detailed study and analysis of different RNS arithmetic units, such as adders and multipliers, and proposed a new architecture. The architecture was simulated using Modelsim and validated on a Xilinx Virtex 4 (4vlx25ff668-10) FPGA. We have selected the moduli set {32, 31, 29, 27, 25, 23, 13} as the optimal set, with respect to bit efficiency, hardware complexity and speed, to represent 32-bit binary data in RNS. The adder unit uses a CSA and CLA combination to obtain minimum delay. The RNS multiplier unit uses LUTs, a CSA tree, a CLA and a modulo adder. Finally, we have analysed DSP algorithms such as FIR and FFT using the proposed RNS arithmetic unit. To interface the proposed RNS multiplier unit with a Von Neumann CPU, the conversion times from binary to RNS and vice versa are taken into account. Detailed analysis reveals that the proposed RNS multiplier unit has better time complexity than the binary multiplier (as shown in Fig. 8 and Fig. 9).
REFERENCES
[1] G. C. Cardarilli, A. Del Re, R. Lojacono, A. Nannarelli, M. Re, "RNS Implementation of High Performance Filters for Satellite Demultiplexing," IEEE Aerospace Conference, vol. 3, Mar. 2003.
[2] A. Del Re, A. Nannarelli, and M. Re, "Implementation of Digital Filters in Carry-Save Residue Number System," Proc. of 35th Asilomar Conference on Signals, Systems, and Computers, pp. 1309-1313, Nov. 2001.
[3] Reto Zimmermann, "Lecture Notes on Computer Arithmetic: Principles, Architectures and VLSI Design," Integrated Systems Laboratory, Swiss Federal Institute of Technology (ETH) Zurich, Mar. 16, 1999. URL http://www.iis.ee.ethz.ch/zimmi/publications/comp_arith_notes.ps.gz
[4] A. Drolshagen, C. Chandra Sekhar, and W. Anheier, "A Residue Number Arithmetic based Circuit for Pipelined Computation of Autocorrelation Coefficients of Speech Signal," Eleventh International Conference on VLSI Design, 4-6 Jan. 1998.
[5] H. L. Garner, "The Residue Number System," IRE Trans. Electron. Comput., vol. EC-8, pp. 140-147, June 1959.
[6] Wei Wang et al., "A Study of Residue to Binary Converters for the three modulo sets," IEEE Transactions on Circuits and Systems I, vol. 50, no. 2, Feb. 2003.
[7] L. Maltar, C. B. Felipe, M. G. Franca, V. C. Alves, C. L. Amorim, "Implementation of RNS addition and RNS multiplication into FPGAs," Proc. IEEE Symposium on FPGAs for Custom Computing Machines, 15-17 Apr. 1998, pp. 331-332.
[8] Vassil S. Dimitrov et al., "Theory and Applications of the Double-Base Number System," IEEE Transactions on Computers, vol. 48, no. 10, Oct. 1999.
[9] N. G. Kingsbury and P. J. Rayner, "Digital Filtering Using Logarithmic Arithmetic," Electron. Lett., vol. 7, Jan. 1971.
[10] W. Luo et al., "An array processor for inner product computations using a Fermat number ALU," IEEE International Conference on Application-Specific Array Processors (ASAP'95), p. 270, 1995.
[11] C. Efstathiou et al., "Modified Booth Modulo 2^n-1 Multipliers," IEEE Transactions on Computers, vol. 53, no. 3, pp. 370-374, Mar. 2004.
[12] John P. Hayes, "Computer Architecture and Organization," McGraw-Hill, 2004.
[13] Milos D. Ercegovac and Tomas Lang, "Digital Arithmetic," Elsevier, 2005.
[14] Ramasamy Krishnan et al., "A Core function based residue to binary decoder for RNS Filter Architecture," IEEE MWSCAS 1989, vol. 2, pp. 837-840.
[15] Wei Wang et al., "A Study of Residue to Binary Converters for the three modulo sets," IEEE Transactions on Circuits and Systems I, vol. 50, no. 2, Feb. 2003.
[16] Ahmed A. Hiasat, "New efficient structure for a modular multiplier for RNS," IEEE Transactions on Computers, vol. 49, no. 2, Feb. 2000.