Implementation of RNS addition and RNS multiplication into FPGAs (Extended Abstract)

Luiz Maltar C. B., Felipe M. G. França, Vladmir C. Alves and Cláudio L. Amorim

COPPE - Universidade Federal do Rio de Janeiro
Caixa Postal 68511, Postal Code 21945-970, Rio de Janeiro - RJ - Brazil
Abstract- We investigate whether arithmetic operations based on Residue Number Systems (RNS) are cost-effective solutions for implementing DSP applications in reconfigurable hardware. We simulated several RNS addition and multiplication implementations while varying the RNS parameters. For RNS addition, our results show that it can be implemented as a 3-stage pipeline running at 80.6-92.5 MHz and using about 22 to 33 FPGA logic cells. For RNS multiplication, the attainable speed ranged from 78.1 to 87.7 MHz for operand lengths between 5 and 8 bits. Overall, a hybrid solution that combines logical elements and blocks of RAM is the best option, producing better average performance across the whole range of operand lengths.

1 Introduction

Residue Number Systems (RNS) have been studied for decades because some RNS arithmetic operations decompose into a set of parallel sub-operations. However, only additions, subtractions, and multiplications can be decomposed. Other arithmetic operations, such as division, sign detection, overflow detection, scaling, and magnitude comparison, are non-modular and too complex to allow a fast hardware implementation. Consequently, RNS are not considered suitable for general-purpose computation. Nevertheless, RNS have been used in high-performance implementations of several special-purpose applications, including error detection and correction in fault-tolerant systems and high-speed Digital Signal Processing. We investigate whether RNS-based arithmetic operations are cost-effective solutions for implementing DSP applications in reconfigurable hardware.
This study is supported by two main observations: (i) LUT-FPGAs tend to include dedicated structures that allow high-speed arithmetic operations on operands of a few bits, and (ii) RNS-based reconfigurable systems can also be configured as binary-to-residue and residue-to-binary converters, which reduces the FPGA's cost/benefit ratio while keeping its performance high.

For performance evaluation, we used the ALTERA EPF10K10-3 FPGA as our reference implementation device, but the results presented here can be extended to other similar devices. This FPGA has 576 logical elements (with 4-input lookup tables) and 3 blocks of RAM, each of which can be configured to any of the following four sizes: 256 x 8 bits, 512 x 4 bits, 1024 x 2 bits, and 2048 x 1 bit.

The remainder of this paper is organized as follows. Section 2 presents results for the direct mapping of RNS operations. Section 3 describes our RNS adder algorithm and its implementation. Section 4 treats RNS multiplication and discusses performance results. Finally, Section 5 draws our conclusions.
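The decomposition property mentioned in the introduction can be illustrated with a minimal sketch. The moduli below and the helper names (`to_rns`, `rns_add`, `rns_mul`) are hypothetical choices made only for this example and are not taken from the implementations evaluated in this paper. Each residue channel is processed independently of the others, which is what makes parallel hardware sub-operations possible.

```python
from math import prod

# Hypothetical pairwise-coprime moduli, chosen only for illustration.
MODULI = (17, 31, 61)
DYNAMIC_RANGE = prod(MODULI)  # results are only unique modulo this product


def to_rns(value: int) -> tuple:
    """Represent an integer by its residues with respect to each modulus."""
    return tuple(value % m for m in MODULI)


def rns_add(a: tuple, b: tuple) -> tuple:
    """Channel-wise modular addition; no carries cross between channels."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a: tuple, b: tuple) -> tuple:
    """Channel-wise modular multiplication."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


# The channel-wise results agree with converting the ordinary result.
print(rns_add(to_rns(1234), to_rns(5678)) == to_rns(1234 + 5678))
print(rns_mul(to_rns(123), to_rns(45)) == to_rns(123 * 45))
```

Converting the residues back to a binary number is a separate step that this sketch does not cover.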
2 Direct mapping of RNS operations into LUT FPGAs

Our first experiment was to measure the costs associated with a simple direct mapping of generic RNS operations into a LUT-FPGA. As can be seen in Table 1, direct mapping into 4-input lookup tables is clearly an inefficient approach, as the number of logical elements increases exponentially with the number of input operand bits. For large values of the number of input operand bits (e.g., greater than 10), direct mapping will require more than one FPGA device. In the next sections, we introduce better ways to implement both RNS adders and multipliers efficiently in FPGAs.

Table 1: Mapping of generic RNS operations.

operand length   input bits   output bits   LE      no. blocks of RAM
3                6            3             21      1
4                8            4             124     1
5                10           5             635     3
6                12           6             2466    6
3 Implementing an RNS adder

Ripple-carry adders with few input bits can reach high speeds by exploiting the dedicated structures of the ALTERA EPF10K10 that provide fast carry propagation between logic elements [4]. We have therefore implemented RNS adders built from binary adders, based on the work of Dugdale [3]. Each adder modulo m computes (x + y) mod m, where x, y < m, according to the rule:

$$
(x + y) \bmod m =
\begin{cases}
x + y, & \text{if } x + y < m \\
x + y - m, & \text{if } x + y \ge m
\end{cases}
$$

Two adders modulo $2^n$, where $n = \lceil \log_2 m \rceil$ is the number of operand bits, are used to perform the additions:

$$
sum_a = x + y, \qquad sum_b = sum_a + (2^n - m)
$$

The correct result of (x + y) mod m depends on the carries produced by $sum_a$ and $sum_b$. Dugdale showed that if a carry occurs in either addition, the correct result is $sum_b$; otherwise it is $sum_a$. Table 2 presents some performance figures.

Table 2: Performance of RNS adders.

moduli   no. bits   LE    no. stages   speed (MHz)
17       5          23    3            92.5
37       6          27    3            90.9
61       6          27    3            90.9
127      7          25    2            86.9
251      8          33    3            80.6
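The carry-selection rule of Section 3 can be checked with a short software model. This is only a behavioural sketch of the two-adder scheme, not the pipelined FPGA implementation; the function name `mod_add` and the masking used to imitate an n-bit adder are assumptions made for this example.

```python
def mod_add(x: int, y: int, m: int) -> int:
    """Add x and y modulo m using two n-bit binary additions and their carries."""
    assert m >= 2 and 0 <= x < m and 0 <= y < m
    n = (m - 1).bit_length()   # operand width, n = ceil(log2 m)
    mask = (1 << n) - 1        # keeps the low n bits, like an n-bit adder

    # First adder: sum_a = x + y (modulo 2^n) plus its carry-out.
    raw_a = x + y
    sum_a, carry_a = raw_a & mask, raw_a >> n

    # Second adder: sum_b = sum_a + (2^n - m) (modulo 2^n) plus its carry-out.
    raw_b = sum_a + ((1 << n) - m)
    sum_b, carry_b = raw_b & mask, raw_b >> n

    # A carry in either addition means x + y >= m, so sum_b is selected.
    return sum_b if (carry_a or carry_b) else sum_a


# Exhaustive comparison against the % operator for the moduli of Table 2.
for m in (17, 37, 61, 127, 251):
    assert all(mod_add(x, y, m) == (x + y) % m
               for x in range(m) for y in range(m))
```

For example, `mod_add(16, 16, 17)` returns 15: the first addition overflows five bits, so the carry selects the second sum.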
4 Implementing an RNS multiplier

The index calculus approach to RNS multiplication is based on properties of Galois fields [1,2]. Given a Galois field GF(p), where p is a prime number, all nonzero elements can be generated by using a primitive root of p. Let $\rho$ be a primitive root of p; the nonzero elements {1, 2, ..., p-1} are generated by $|\rho^i|_p$, where the indexes i are {0, 1, ..., p-2}. The nonzero elements form a multiplicative group under multiplication modulo p, and the indexes form an additive group under addition modulo (p-1). The Galois field multiplier is obtained by using the isomorphism between these two groups. Given x and y, nonzero elements of GF(p) with indexes $i_x$ and $i_y$, the result of multiplying x by y, modulo p, is:

$$
|x \cdot y|_p = \left| \rho^{\,|i_x + i_y|_{p-1}} \right|_p
$$
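The isomorphism above can be sketched in software as two table lookups around a modular addition. The names `build_index_tables` and `index_mul` are hypothetical, the choice of 3 as a primitive root of 31 is an assumption made for this example, and the special handling of a zero operand (which has no index) is an assumption as well, since the text only covers nonzero elements.

```python
def build_index_tables(p: int, rho: int):
    """Return (index, inverse) with inverse[i] = rho**i mod p and index[x] = i."""
    inverse = [pow(rho, i, p) for i in range(p - 1)]
    if len(set(inverse)) != p - 1:
        raise ValueError("rho is not a primitive root of p")
    index = {x: i for i, x in enumerate(inverse)}
    return index, inverse


def index_mul(x: int, y: int, p: int, index: dict, inverse: list) -> int:
    """Multiply x by y modulo p through index lookup, addition, inverse lookup."""
    if x == 0 or y == 0:
        return 0                              # assumption: zero is handled apart
    i_sum = (index[x] + index[y]) % (p - 1)   # addition modulo p-1
    return inverse[i_sum]                     # inverse index operation


p, rho = 31, 3                                # 3 generates all nonzero residues mod 31
index, inverse = build_index_tables(p, rho)
assert all(index_mul(x, y, p, index, inverse) == (x * y) % p
           for x in range(p) for y in range(p))
```

In this model the `index` dictionary plays the role of an index table and the `inverse` list the role of an inverse index table, each holding p-1 entries.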
Hence, this approach requires three steps to perform a multiplication: (i) finding the indexes of the operands, (ii) adding them modulo p-1, and (iii) performing the inverse index operation. Index calculus multipliers were implemented according to the diagram of Fig. 1.

[Fig. 1: Block diagram for index calculus multipliers. operand_1 and operand_2 each feed an Index ROM; the index of operand_1 and the index of operand_2 feed an adder modulo p-1; the index of the result feeds the Inverse Index ROM, which outputs the result.]

Given that direct mapping is not suitable for LUT-FPGAs, our implementation of RNS multipliers used the pipelined binary adders modulo m described in Section 3. Both the INDEX and INVERSE INDEX ROM blocks can be implemented in several ways by employing different combinations of the FPGA's resources. For instance, we can use any of three alternatives: (i) only logical elements, (ii) only blocks of RAM, or (iii) any combination of the two. For performance comparison, we simulated various FPGA implementations of INDEX ROMs and INVERSE INDEX ROMs. The results are presented in Tables 3-6.

Table 3: The Index ROMs and Inverse Index ROM mapped only in Logical Elements.

mod   op. bits   INDEX ROMs (LE)   adder mod p-1 (LE)   inverse ROM (LE)   speed (MHz)   no. stages
31    5          32                22                   16                 84.0          7
61    6          90                25                   45                 79.4          9
127   7          222               30                   111                65.5          11
251   8          518               34                   259                -             -

Table 4: The Index ROMs and Inverse Index ROM mapped only in 3 blocks of RAM.

mod   op. bits   Memory bits: %   LE    speed (MHz)   no. stages
31    5          450: 7%          37    78.1          9
61    6          1080: 17%        43    84.7          9
127   7          2646: 43%        51    78.1          9
251   8          6000: 97%        58    78.1          9

Table 5: The Index ROMs mapped in Logical Elements and Inverse Index ROM mapped in one block of RAM.

mod   op. bits   LE    Memory bits, inverse index ROM: %   speed (MHz)   no. stages
31    5          59    150: 7%                             83.3          8
61    6          121   360: 17%                            78.1          9
127   7          259   882: 43%                            68.4          10
251   8          569   2000: 97%                           61.7          11

Table 6: The Index ROMs mapped in 2 blocks of RAM and Inverse Index ROM mapped in Logical Elements.

mod   op. bits   Memory bits, INDEX ROMs: %   LE    speed (MHz)   no. stages
31    5          300: 7%                      48    84.7          8
61    6          720: 17%                     82    87.7          9
127   7          1764: 43%                    155   86.9          10
251   8          4000: 97%                    313   80.6          11

The performance measures shown in the tables reveal that no single implementation wins across the whole RNS parameter range. The results indicate that 5-bit RNS multipliers should be implemented solely with logical elements (84.0 MHz, 70 LE), while 8-bit ones should use only blocks of RAM (78.1 MHz, 3 RAM blocks, and 58 LE); for 6-bit and 7-bit multipliers, hybrid solutions that mix logical elements with blocks of RAM are more efficient.
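The memory figures reported in Tables 4-6 can be reproduced with a small calculation. The sketch assumes that each ROM stores p-1 words of n bits, that each ROM occupies one 2048-bit RAM block (the capacity implied by the four block configurations listed in the introduction), and that percentages are truncated. These assumptions are inferred from the reported numbers rather than stated in the text, and the name `rom_usage` is hypothetical.

```python
BLOCK_BITS = 2048  # one RAM block: 256x8 = 512x4 = 1024x2 = 2048x1 bits


def rom_usage(p: int, n_roms: int):
    """Return (memory bits, truncated percent of the n_roms RAM blocks used)."""
    width = (p - 1).bit_length()        # operand bits for modulus p
    bits = n_roms * (p - 1) * width     # p-1 words of `width` bits per ROM
    percent = 100 * bits // (n_roms * BLOCK_BITS)
    return bits, percent


# Columns: modulus, 3 ROMs (Table 4), 1 ROM (Table 5), 2 ROMs (Table 6).
for p in (31, 61, 127, 251):
    print(p, rom_usage(p, 3), rom_usage(p, 1), rom_usage(p, 2))
```

Under these assumptions the modulus 251 nearly fills each block (2000 of 2048 bits), so an operand length of 8 bits is close to the largest that fits one ROM per block on this device.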
5 Conclusion

The implementation of RNS addition and RNS multiplication in small and fast LUT-FPGAs was investigated. The resulting RNS addition implementation is a 3-stage pipeline that can run at 80.6-92.5 MHz on the ALTERA EPF10K10-3. For RNS multipliers, the best configuration option depends on the application's RNS dynamic range, enabling different combinations of RNS adders and RNS multipliers. The investigation of RNS converters, binary-residue and residue-binary, in FPGAs is naturally our next step.

Acknowledgments

The authors want to acknowledge Altera Corporation for providing the ALTERA MAX plus II software. This work was partially supported by CNPq and FINEP - Brazil.

References

[1] G.A. Jullien, “Implementation of Multiplication, Modulo a Prime Number, with Applications to Number Theoretic Transforms”, IEEE Trans. on Computers, vol. 29, no. 10, pp. 899-905, October 1980.
[2] Damu Radhakrishnan and Yong Yuan, “Novel Approaches to Design of VLSI RNS Multipliers”, IEEE Trans. on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 39, no. 1, pp. 52-57, January 1992.
[3] Melanie Dugdale, “VLSI Implementation of Residue Adders Based on Binary Adders”, IEEE Trans. on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 39, no. 5, pp. 325-329, May 1992.
[4] ALTERA Co., “Ripple-carry Adders in Flex 8000 Devices”, Application Brief 118, May 1994.