AUTOMATED IMPLEMENTATION OF RNS-TO-BINARY CONVERTERS

H. Henkelmann, A. Drolshagen*, H. Bagherinia, H. Ahrens, W. Anheier

Institute for theoretical electronics and microelectronics, University of Bremen
P.O. Box 33 04 40, D-28334 Bremen, Germany
Tel.: ++49421/218-3961, Fax: ++49421/218-4434, Email: [email protected]

* Siemens AG, HL SP D DF3, St.-Martin-Str. 76, 81541 München, Germany
Email: [email protected]
ABSTRACT

In this paper we present an implementation of an RNS-to-binary converter based on the mixed radix conversion algorithm, using recently proposed cell elements. The proposed architecture can be generated automatically for arbitrary sets of moduli, and the method is straightforward. It is very well suited for standard cell implementations and results in compact layouts.
1. INTRODUCTION

Residue number system (RNS) arithmetic has proved advantageous for the implementation of DSP algorithms, resulting in fast, regular designs that process long input words [1]. RNS are non-weighted, carry-free number systems whose arithmetic takes place on parallel finite rings. Since RNS-based circuits are to be used in a binary, i.e. weighted, number system environment, converter stages are required between the different number systems. While the conversion into the RNS is easy, the conversion back to binary is more computationally intensive. Furthermore, the hardware structure of the converter stages depends on the given set of moduli of the RNS.
2. THE SYNTHESIS STRATEGY

At the Institute for theoretical electronics and microelectronics (ITEM) a hardware compiler has been built, paying special attention to the technical and mathematical foundations of residue number arithmetic [2]. The compiler has been written in object-oriented C++. We provide certain parametrisable basic elements in the form of C-program modules and are able to combine them into more and more complex structures. Thus it is possible to realize an automated generation of RNS-specific structures (such as scalers, base extension, or residue and binary conversion) depending on a given set of moduli. Our hardware compiler is able to start a bit-true simulation of the circuits with stimuli vectors as well as to issue VHDL generic maps. The synthesis of the generated VHDL code is carried out with the commercial synthesis tool SYNOPSYS [3], and the layout is done using Design Framework II (DFWII).

Since we described the functionality of the elements in synthesizable VHDL models, we are able to use gate netlists instead of ROM or PLA macro cells. The silicon area needed by the generated standard cell implementation can be reduced in comparison to conventional realizations with ROM or PLA macro cells, since gate netlists can be integrated into the standard cell rows [2]. Thus no suboptimal placement by hand is necessary and the automated placement and routing algorithms of the design tools can be employed. Note that for the standard cell implementation of an RNS-based DSP algorithm one might use some hundred ROM macro cells that have to be placed by hand [4][5][6]. The strategy used for automated implementation is efficient with regard to shrinking process geometries and the rising complexity of implementable systems.
3. BASIC ELEMENTS

For the construction of the converter structures we use two recently proposed RNS processor elements for inner product computation on a ring mod m [7]. The cells improve on the area requirements of the IPSPm element [8]. The first cell is called the HYBIPm cell since it combines conventional LUT methods for modulo multiplication by a constant with binary adders for modulo addition. The second cell is based on full adders and proved to require less hardware effort than an equivalent IPSPm element for moduli above 7 bit.
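The LUT-plus-adder idea behind the first cell can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the cell described in [7]: the function names `build_const_mul_lut`, `mod_add` and `inner_product_mod` are hypothetical, the look-up table is modelled as a Python list, and the modulo addition is modelled as a binary addition followed by one conditional subtraction.

```python
def build_const_mul_lut(c, m):
    """Hypothetical helper: table of (c * x) mod m for every residue x in 0..m-1."""
    return [(c * x) % m for x in range(m)]


def mod_add(a, b, m):
    """Add two residues a, b < m with a binary add and one conditional subtraction."""
    s = a + b
    return s - m if s >= m else s


def inner_product_mod(constants, xs, m):
    """Behavioural model of z = sum(c_i * x_i) mod m.

    Each multiplication by a constant is a table look-up; the partial
    results are accumulated with the modulo adder above.
    """
    luts = [build_const_mul_lut(c, m) for c in constants]
    z = 0
    for lut, x in zip(luts, xs):
        z = mod_add(z, lut[x], m)
    return z
```

Because every table entry and every partial sum stays below m, the accumulator never needs more than one extra bit beyond the word length of the modulus.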
Furthermore, the adder-based ABIPm cell element in [7] has been extended to compute inner products of the form

$$ z = \sum_i c_i \cdot x_i \bmod m \qquad (1) $$

where z and x_i are variables and c_i are constants on the ring mod m. The multiplications with c_i are realized by a suitable arrangement of the input words, and for the summation mod m the binary representation of the carry term

$$ \left( 2^{\lceil \log_2 m \rceil} \right) \bmod m \qquad (2) $$

has to be fed back into the adder-based cell. It is obvious that the wiring effort will increase with the number of bits to be processed by the carry-save architecture that realizes the summation mod m.

Both cell elements can be generated automatically by the RNSHDL compiler.

4. RNS TO BINARY CONVERTERS

For the conversion of a weighted number X given in a corresponding residue number system representation, mainly two different algorithms are available: the Chinese remainder theorem (CRT) and the mixed radix conversion (MRC). However, it is known from [9] that the algorithms differ in the number of operations, their computational complexity and the hardware effort of the implementation. To obtain a weighted number X with the CRT we have to solve the following equation

$$ X = \left( \sum_{i=0}^{L-1} M_i \cdot \left( \left( x_i \cdot \left| M_i^{-1} \right|_{m_i} \right) \bmod m_i \right) \right) \bmod M \qquad (3) $$

with

$$ M_i = \frac{M}{m_i} $$

where M is the product of all moduli. The value $\left| M_i^{-1} \right|_{m_i}$ is the multiplicative inverse element of M_i modulo m_i, and L is the number of moduli. If we assume a word length of B bit for each modulus of the RNS, the term mod m_i can be solved using small gate netlists of input word length B. It can be seen from equation (3) that we have to multiply a constant value of (L-1)⋅B bit L times with a variable of word length B. For example, all multiplications can be done by ROM look-up tables. The implementation of the modulo operation mod M, however, causes more problems, and different solutions have been proposed in the past [10][11]. Nevertheless, in an implementation those methods required a couple of binary adders and ROM look-up tables of the system's word length log2 M.

The mixed radix conversion algorithm is defined as follows

$$ X = \sum_{i=0}^{L-1} W_i \cdot \gamma_i \qquad (4) $$
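Here the W_i are computed by W_0 = 1; W_1 = m_0; W_2 = m_0⋅m_1; ..., and the values γ_i have a word length of B bit. The computational complexity of equation (4) is lower than that of the CRT, since the word length of the W_i increases step by step with the number of moduli and no modulo operation mod M is required. The values of the γ_i are calculated as follows

$$
\begin{aligned}
\gamma_0 &= x_0 \\
\gamma_1 &= \left( (x_1 - \gamma_0) \cdot \left| m_0^{-1} \right|_{m_1} \right) \bmod m_1 \\
\gamma_2 &= \left( \left( (x_2 - \gamma_0) \cdot \left| m_0^{-1} \right|_{m_2} - \gamma_1 \right) \cdot \left| m_1^{-1} \right|_{m_2} \right) \bmod m_2 \\
&\;\;\vdots \\
\gamma_{L-1} &= \left( \left( \cdots \left( (x_{L-1} - \gamma_0) \cdot \left| m_0^{-1} \right|_{m_{L-1}} - \gamma_1 \right) \cdot \left| m_1^{-1} \right|_{m_{L-1}} - \cdots - \gamma_{L-2} \right) \cdot \left| m_{L-2}^{-1} \right|_{m_{L-1}} \right) \bmod m_{L-1}
\end{aligned}
\qquad (5)
$$

where $\left| m_i^{-1} \right|_{m_j}$ denotes the multiplicative inverse of m_i modulo m_j. Even if this calculation requires more multiplications and additions than the CRT, only modulo operations of B bit word length are used. Thus only look-up tables of 2B address space and B bit output word length are required. The implementation of (5) can be done using the HYBIPm processor elements mentioned above.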
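The two conversion algorithms of this section can be compared with a short software model. The following is a minimal sketch, assuming Python 3.8 or later (for `math.prod` and for `pow` with a negative exponent); the function names `crt_to_binary`, `mrc_digits` and `mrc_to_binary` are hypothetical and the code models only the arithmetic, not the hardware structure. It illustrates that the CRT needs a final reduction mod M on a wide sum, while the MRC uses only reductions modulo the individual m_j.

```python
from math import prod


def crt_to_binary(residues, moduli):
    """CRT reconstruction: sum of M_i * ((x_i * inv(M_i)) mod m_i), reduced mod M."""
    M = prod(moduli)
    total = 0
    for x, m in zip(residues, moduli):
        Mi = M // m
        inv = pow(Mi, -1, m)          # multiplicative inverse of M_i modulo m_i
        total += Mi * ((x * inv) % m)
    return total % M                  # the wide mod M operation


def mrc_digits(residues, moduli):
    """Mixed radix digits gamma_i, computed with the nested form of the gamma equations."""
    gammas = []
    for j, (x, mj) in enumerate(zip(residues, moduli)):
        t = x
        for i in range(j):
            inv = pow(moduli[i], -1, mj)   # inverse of m_i modulo m_j
            t = ((t - gammas[i]) * inv) % mj
        gammas.append(t)
    return gammas


def mrc_to_binary(residues, moduli):
    """MRC reconstruction: X = sum of W_i * gamma_i, with no mod M operation."""
    X, W = 0, 1
    for g, m in zip(mrc_digits(residues, moduli), moduli):
        X += W * g
        W *= m                        # W_0 = 1, W_1 = m_0, W_2 = m_0 * m_1, ...
    return X


if __name__ == "__main__":
    moduli = (32, 31, 29, 23, 21, 19, 17)
    X = 123456789
    residues = [X % m for m in moduli]
    assert crt_to_binary(residues, moduli) == X
    assert mrc_to_binary(residues, moduli) == X
    print("mixed radix digits:", mrc_digits(residues, moduli))
```

In this model every intermediate value inside `mrc_digits` stays below the current modulus m_j, which mirrors the observation that only B bit modulo operations are needed for the γ_i.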
5. AUTOMATED IMPLEMENTATION OF THE CONVERTER STRUCTURES

Since we use the processor elements and the synthesis strategy for standard cell techniques mentioned above, the MRC can be implemented automatically depending on the given set of moduli. As an example we built a converter using the set (32, 31, 29, 23, 21, 19, 17) of 5 bit moduli, achieving a 32 bit dynamic range of the RNS. The hardware structure for the calculation of the gamma factors is very regular and can be realized using 21 HYBIPm elements. According to equation (4), the gamma factors have to be multiplied with the W_i. If we use wiring for this operation, its hardware effort can be reduced by employing the ordering effect of the MRC discussed in [12]. The number of bits to be processed in the carry-save architecture depends on the order in which the moduli are combined to calculate the W_i. This is due to the different values of the W_i and their binary representations. It is possible to reduce the number of bits with the value one that have to be processed in the adder structure. Thus the wiring effort, the number of gates and their fanout can be reduced. Furthermore, in the MRC no mod M operation is required, and thus the summation can be done using a simple CSA architecture.
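The ordering effect can be explored with a small exhaustive search over the orderings of a moduli set. The sketch below is an illustration under a stated assumption: it uses the total number of one-bits in the binary representations of the W_i as a proxy for the wiring effort. The exact cost measure behind the bit counts reported in this paper is not specified here, so the numbers produced by this proxy are not expected to reproduce them. The function names are hypothetical.

```python
from itertools import permutations


def mixed_radix_weights(moduli):
    """Return W_0 = 1, W_1 = m_0, W_2 = m_0 * m_1, ... for the given ordering."""
    weights, w = [], 1
    for m in moduli:
        weights.append(w)
        w *= m
    return weights


def one_bits(moduli):
    """Assumed cost proxy: total number of one-bits over all W_i of this ordering."""
    return sum(bin(w).count("1") for w in mixed_radix_weights(moduli))


def best_ordering(moduli):
    """Exhaustive search for the ordering with the smallest proxy cost."""
    return min(permutations(moduli), key=one_bits)


if __name__ == "__main__":
    original = (32, 31, 29, 23, 21, 19, 17)
    best = best_ordering(original)
    print("original ordering:", original, "one-bits:", one_bits(original))
    print("best ordering:    ", best, "one-bits:", one_bits(best))
```

For seven moduli the search visits 5040 orderings, so a brute-force approach is sufficient; larger sets would need a smarter search strategy.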
For the given moduli set an optimum of 150 bits can be found by computer search, while the number of bits for a CRT-based implementation is a constant 450 bits.

Set   Moduli                        Method
1     32, 31, 29, 23, 21, 19, 17    CRT
2     32, 31, 29, 23, 21, 19, 17    MRC
3     32, 17, 19, 23, 29, 21, 31    MRC

Table 1: Compared moduli sets

Set   Number of Bits   Maximum Fanout
1     450              15
2     190              14
3     150              9

Table 2: Comparison of wiring effort and fanout of CRT and MRC

For a CRT-based implementation we would have to use gate netlists for the computation of the terms within the summation in equation (3), since we try to avoid ROM macros. It is not easy to synthesize the required gate netlists without decreasing the possible clock rate of the complete system, even if the address space of the look-up tables is only 2B. For the summation mod M the ABIPm cell elements would have been used, requiring more full adders than a normal binary CSA architecture. Furthermore, no ordering effect occurs, and thus the number of bits to be processed is relatively high.

Using the design flow established at the University of Bremen, the layout of the MRC-based implementation is done in a 0.7 µm ES2 standard cell process [13]. The converter requires an area of 9 mm², and the clock rate is 100 MHz for worst-case conditions.

Figure 1: Layout of the RNS to Binary Converter
6. SUMMARY

In this paper the hardware complexity of the CRT and MRC algorithms is compared with respect to the use of two recently proposed cell elements for inner product computation that are suitable for automated synthesis. The MRC algorithm is preferable for automated synthesis since it requires less complex modules and no modulo operation mod M occurs. Using the HYBIPm cell, a very efficient implementation of the gamma structure can be obtained. Furthermore, it is pointed out that by choosing an optimized ordering of the moduli in a given moduli set, the wiring effort and thus the hardware effort of the implementation can be reduced. The required look-up tables for the computation of the gamma coefficients can be implemented by gate netlists. Thus no ROM or PLA macro cells are required, and the gate netlists can be integrated into the standard cell rows, resulting in a reduced hardware effort. The proposed MRC architecture can be generated automatically for arbitrary sets of moduli. With regard to further process scaling and the increasing wiring intensity of implementations, the chosen strategy for the synthesis of RNS-based circuits and systems appears favourable.
7. REFERENCES

[1] Soderstrand M.A., Jenkins W.K., Jullien G.A., Taylor F.J. (Eds.): Residue Number System Arithmetic: Modern Applications in Digital Signal Processing. New York: IEEE Press (1986).
[2] Drolshagen A., Anheier W.: "Ein Hardwarecompiler zur restklassenbasierten Schaltungssynthese", Electrical Engineering Archiv für Elektrotechnik, Vol. 79, No. 2, pp. 127-134, April 1996.
[3] SYNOPSYS Ref. Man. 3.0. SYNOPSYS Inc. (1992).
[4] Drolshagen A., Sekhar C. C.: "A Residue Number Arithmetic based Circuit for Pipelined computation of Autocorrelation Coefficients of Speech Signal", Eleventh International Conference on VLSI Design, Madras, India, January 4-7, 1998 (accepted for publication).
[5] Drolshagen A., Birreck D., Laur R., Anheier W.: "DCT/IDCT Architectures using Residue Number Systems", Proceedings of "Workshop on Design Methodologies for Microelectronics and Signal Processing", Krakow, Poland, October 1993.
[6] Klaasen R., Birreck D., Wolter S., Laur R.: "VLSI Architecture for a Convolution-based DCT in Residue Arithmetic", International Symposium on Circuits and Systems, May 1992, San Diego, USA, IEEE Proceedings ISCAS-92.
[7] Drolshagen A., Henkelmann H., Anheier W.: "Processor Elements for the Standard Cell Implementation of Residue Number Systems", IEEE ASAP Conference, 14-16 July 1997, Federal Institute of Technology Zurich, Switzerland.
[8] Jullien G.A., Bird P.D., Carr J.T., Taheri M., Miller W.C.: An Efficient Bit-Level Systolic Cell Design for Finite Ring Digital Signal Processing Applications. Journal of VLSI Processing, 1 (1989), pp. 189-207.
[9] Szabo N.S., Tanaka R.I.: Residue Arithmetic and its Application to Computer Technology. New York: McGraw-Hill 1967.
[10] Cardarilli, M. et al: "Efficient Modulo Extraction for CRT based Residue to Binary Converters", IEEE 97.
[11] Thu Van Vu: "Efficient Implementations of the Chinese Remainder Theorem for Sign Detection and Residue Decoding", Transactions on Computers, Vol. C-34, No. 7, 1985.
[12] Kutuso K.N., Yassine H. M.: "Effect of Moduli ordering on Mixed Radix Conversion Methods in Residue Number Systems", IEEE 35th Midwest Symposium on Circuits and Systems, 1993.
[13] European Silicon Structures, ES2 Library Databook ECPD07, May 1994.