2015 IEEE 22nd Symposium on Computer Arithmetic
1063-6889/15 $31.00 © 2015 IEEE DOI 10.1109/ARITH.2015.9

Modulo-$(2^n - 2^q - 1)$ Parallel Prefix Addition via Excess-Modulo Encoding of Residues

Seyed Hamed Fatemi Langroudi
Ghassem Jaberipur¹
Electrical and Computer Engineering Department, Shahid Beheshti University, Tehran, Iran
Computer Science & Engineering Department, Shahid Beheshti University, Tehran, Iran
[email protected]
[email protected]

¹ G. Jaberipur is also affiliated with School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.

Abstract— The residue number system $\tau = \{2^n - 1, 2^n, 2^n + 1\}$ has been extensively studied towards perfection in realization of efficient parallel prefix modular adders, with $(3 + 2\log n)\Delta G$ latency. Many applications, such as digital signal processing, require fast modular operations. However, relying only on $\tau$ limits the dynamic range. Therefore, additional mutually prime moduli are required to accommodate a wider dynamic range. On the other hand, the speed of modular arithmetic operations for the additional moduli should be as close as possible to those in $\tau$. This could be best met by moduli of the form $2^n - (2^q + 1)$, with $1 \le q \le n - 2$, such as $2^n - 3$, $2^n - 5$. However, the fastest parallel prefix realization of modulo-$(2^n - 2^q - 1)$ adders that we have encountered in the relevant literature claims $(7 + 2\log n)\Delta G$ latency. Motivated by the need to reduce the latter, we propose new designs of such adders with $(5 + 2\log n)\Delta G$ latency without any penalty in area consumption or power dissipation. The proposed modular addition algorithm entails supplementary representation of residues in $[0, 2^q]$, as $[2^n - (2^q + 1), 2^n - 1]$. This leads to additional performance efficiency similar to the effect of double zero representation in modulo-$(2^n - 1)$ adders. The aforementioned analytically evaluated speed gain and improvements in other figures of merit are also supported via circuit simulation and synthesis.

Keywords-Residue number system; Parallel prefix modular adder; Excess-modulo encoding;

I. INTRODUCTION

Modular adders are required in many digital applications, such as cryptography [1], and digital signal processing (DSP) via residue number systems (RNS) [2]. Modulo-$m$ adders implement Eqn. 1.

$$|X + Y|_m = \begin{cases} X + Y, & X + Y < m \\ X + Y - m, & X + Y \ge m \end{cases} \tag{1}$$

Performance of these adders depends, by and large, on the value of $m$. For example, in case of $m = 2^n$, conventional unsigned binary adders, with alternative architectures (e.g., ripple carry, parallel prefix), can be used. Modulo-$(2^n - 1)$ adders can also be efficiently realized via direct utilization of conventional one's complement adders. This is based on Eqn. 2, at the cost of imposing two representations for zero (i.e., 0 and $2^n - 1$) [3].

$$|X + Y|_{2^n - 1} = \begin{cases} X + Y, & X + Y < 2^n \\ |X + Y + 1|_{2^n}, & X + Y \ge 2^n \end{cases} \tag{2}$$

These adders end-around the carry-out of $X + Y$ and actually compute $|X + Y + \mathrm{EAC}|_{2^n}$, where EAC stands for end-around carry. A regular parallel prefix (RPP) realization that manipulates EAC via an extra parallel prefix level [3] shows $(5 + 2\lceil\log n\rceil)\Delta G$ latency, where $\Delta G$ denotes the delay of a simple gate. The EAC-fusing technique of [3], however, removes the extra parallel prefix level, thus reducing the delay to $(3 + 2\lceil\log n\rceil)\Delta G$.

Conventional realizations of modulo-$m$ adders, for $m = 2^n - \delta$ and $\delta > 1$, implement Eqn. 3.

$$|X + Y|_{2^n - \delta} = \begin{cases} X + Y, & W < 2^n \\ |W|_{2^n}, & W \ge 2^n \end{cases}, \quad W = X + Y + \delta \tag{3}$$

This scheme, whose various realizations [4-9] require considerable additional logic besides one $n$-bit adder, entails some speed loss. Innovative solutions, which are more latency-compatible with the case of $\delta = 1$, have appeared in the relevant literature; namely a parallel prefix generalized solution for $\delta = 2^q + 1$ $(1 \le q \le n - 2)$ [10], and one specifically for $q = n - 2$ [11]. Note that $q = n - 1$ is excluded since it leads to modulo $2^{n-1} - 1$. These solutions, unlike that of Eqn. 2 (for $\delta = 1$), rely on partial or full computation towards obtaining both $X + Y + \delta$ and $X + Y$. The former accommodates two weighted-1 and weighted-$2^q$ EACs of $X + Y + \delta$ within the parallel prefix network (PPN). The latter, however, does it via some shared logic, where the selective carry signal is that of $X + Y + \delta$. Neither is based on EAC addition of $X + Y$, which is, however, the case in [12-13] for $q = 1$ (i.e., $\delta = 3$). Therefore, as a generalization of the latter EAC-based method, we are motivated to study the design and implementation of modulo-$(2^n - \delta)$ parallel prefix adders that compute $|X + Y + \delta \times \mathrm{EAC}|_{2^n}$, for $\delta = 2^q + 1$. Other encouraging sources for this study are the existence of modulo-$(2^n - \delta)$ residue generators [14-16], and multipliers [17-19]. Moreover, there are cryptosystems [20], and DSP applications [21] that take advantage of such moduli.

The remaining sections of this paper are organized as follows. A background on RNS and modular adders is offered in Section II. Section III provides our general solution for parallel prefix modulo-$(2^n - 2^q - 1)$ adders. Performance evaluation and comparison with previous works, analytically and by synthesis, are taken up in Section IV, and we conclude in Section V.
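As a word-level illustration of Eqns. 1-3, the following minimal Python sketch may help; the function names (`mod_add`, `eac_add`, `conventional_add`) and the choice $n = 4$, $\delta = 3$ are assumptions made for this example only, not part of any cited design.

```python
def mod_add(x, y, m):
    """Eqn. 1: subtract m once if the plain sum reaches m."""
    s = x + y
    return s if s < m else s - m


def eac_add(x, y, n):
    """Eqn. 2: n-bit addition with end-around carry (modulo 2**n - 1).

    The result may be 2**n - 1, the second representation of zero.
    """
    s = x + y
    eac = s >> n                      # carry-out of X + Y
    return (s + eac) & ((1 << n) - 1)


def conventional_add(x, y, n, delta):
    """Eqn. 3: form W = X + Y + delta and keep X + Y unless W overflows n bits."""
    w = x + y + delta
    return x + y if w < (1 << n) else w & ((1 << n) - 1)


if __name__ == "__main__":
    n, delta = 4, 3                   # illustrative values: modulus 2**4 - 3 = 13
    m = (1 << n) - delta
    print(conventional_add(9, 8, n, delta), mod_add(9, 8, m))
    print(eac_add(7, 8, n))           # 15, i.e. the alternative code for zero
```

Note that `conventional_add` needs both $X + Y$ and $X + Y + \delta$, whereas `eac_add` needs only $X + Y$ and its carry-out; this is the contrast the introduction draws.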
II. BACKGROUND

Modular adders are widely used in RNS applications. A typical RNS is defined via a moduli set $\{m_1, \ldots, m_k\}$, where the $k$ moduli are commonly pairwise prime in order to maximize the cardinality (aka dynamic range) of numbers represented by the RNS in hand. As such, integers in $[0, M)$ are supported by the RNS, where $M = \prod_{i=1}^{k} m_i$. Addition of two numbers $U$ and $V$, both in $[0, M - 1]$, is decomposed to $k$ modulo-$m_i$ addition operations $|u_i + v_i|_{m_i}$ for $1 \le i \le k$, in parallel, where $u_i = |U|_{m_i} = U - \lfloor U/m_i \rfloor \times m_i$ is the remainder of integer division $U/m_i$, and similarly $v_i = |V|_{m_i}$. RNS is usually beneficial in applications where several additions and multiplications take place before RNS-to-binary conversion is required (e.g., FIR filters [22-23]). Latency of such multi-channel operation is determined by the slowest channel corresponding to one of the moduli. There is usually one modulo-$2^n$ channel, where the corresponding conventional $n$-bit adder is the fastest with lowest cost. The next best performance is experienced for modulo-$(2^n - 1)$ adders, where a conventional one's complement adder does the job. However, in case of $u + v = 2^n - 1$, which should normally lead to $|u + v|_{2^n - 1} = 0$, the one's complement adder provides $2^n - 1$ as the resulting modular sum. This value is recognized as the second representation for 0 with no problem in correctness of the subsequent operations, since for instance $|u + 2^n - 1|_{2^n - 1} = |u + 0|_{2^n - 1} = u$, $|u \times (2^n - 1)|_{2^n - 1} = u \times 0 = 0$, and $|2^n - 1|_{2^n - 1} = 0$.
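The channel-wise decomposition described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the moduli set $\{7, 8, 9\}$ (i.e., $\{2^n - 1, 2^n, 2^n + 1\}$ for $n = 3$) and the helper names are chosen for the example, and the reverse conversion uses the textbook Chinese remainder reconstruction rather than any converter from the cited works.

```python
from math import prod

MODULI = (7, 8, 9)  # assumed example set {2^n - 1, 2^n, 2^n + 1} with n = 3


def to_rns(value, moduli=MODULI):
    """Residues u_i = |U|_{m_i} of an integer."""
    return tuple(value % m for m in moduli)


def rns_add(u, v, moduli=MODULI):
    """k independent modulo-m_i additions, one per channel."""
    return tuple((a + b) % m for a, b, m in zip(u, v, moduli))


def from_rns(residues, moduli=MODULI):
    """Chinese remainder reconstruction of the integer in [0, M)."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)   # modular inverse (Python 3.8+)
    return total % M


if __name__ == "__main__":
    M = prod(MODULI)                       # dynamic range [0, M)
    U, V = 123, 345
    print(from_rns(rns_add(to_rns(U), to_rns(V))), (U + V) % M)
    # Double zero in the modulo-7 channel: code 7 behaves like 0.
    print(all((u + 7) % 7 == u for u in range(7)))
```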
A. Parallel prefix modulo-$(2^n - 1)$ adders

The modular addition $|X + Y|_{2^n - 1}$ of Eqn. 2, with double zero representation, can also be described via Eqn. 4, where $W = w_n w_{n-1} \ldots w_0 = X + Y$, and $\mathrm{EAC} = w_n$.

$$|X + Y|_{2^n - 1} = \begin{cases} W, & W < 2^n \\ |W + 1|_{2^n}, & W \ge 2^n \end{cases} = w_{n-1} \ldots w_0 + w_n \tag{4}$$

Parallel prefix [24] realization of $X + Y$, where $X = a_{n-1} \ldots a_0$ and $Y = b_{n-1} \ldots b_0$, primarily derives propagate $p_i = a_i \lor b_i$ and generate $g_i = a_i b_i$ signals. These are entered into a network of logical nodes that produce group propagate $P$ and group generate $G$ signals, which lead to positional carries $G_{i-1:0}$ (i.e., the generated carry within positions 0 to $i - 1$). The sum bits are then obtained as $s_i = h_i \oplus G_{i-1:0}$, in parallel, where $h_i = a_i \oplus b_i$. Eqn. 4 can be realized via such a parallel prefix network (PPN), which is augmented by an extra row of simpler logical nodes that enforce EAC within the computation of the actual carries $c_i = G_{i-1:0} \lor P_{i-1:0} w_n$, where $w_n = G_{n-1:0}$. Fig. 1, which is reproduced from [3], depicts such an RPP structure, based on Kogge-Stone (KS) PPN, where the required logical nodes are described in Fig. 2. A similar, but slightly more complex, RPP design for modulo-$(2^n - 3)$ adders, which takes advantage of double representation of residues in $[0, 2]$, is offered in [13].

Figure 1. Modulo-$(2^n - 1)$ RPP adder

Figure 2. Basic cells used in Fig. 1 (preprocessing cell: $x \lor y$, $xy$, $x \oplus y$; prefix cell: $(G_A \lor P_A G_r, P_A P_r)$; carry cell: $G_A \lor P_A G_r$)

B. General modulo-$(2^n - \delta)$ adder

There are several proposals for efficient implementation of modulo-$m$ adders in the relevant literature, where without loss of practical generality, the authors opt for moduli of the form $m = 2^n - \delta$ [4-10]. All these adders are based on Eqn. 3, above. Most of them use $w_n$ (i.e., the most significant bit of $W = X + Y + \delta$) to select one of the two sums, which are obtained in parallel [6, 8], or in sequence [4-5]. Hiasat [7] computes two sets of propagate ($p$) and generate ($g$) signals for $X + Y$ and $W$. The latter set is used by a simplified carry look-ahead logic to solely compute $w_n$, which is used as the selector of one of the ($p$, $g$) sets. The selected set is used in a carry look-ahead architecture to lead to the desired sum. The extreme case of $\delta = 2^{n-2} + 1$ is studied in [11]. In this work, the carry-out of $W$ is recognized as the carry of $X + Y + 1$, or that of $X + Y + 2^{n-2}$, where both cannot be 1 simultaneously, since $W \le 2(2^n - \delta - 1) + \delta = 2^n + 2^{n-1} + 2^{n-2} - 3 < 2^{n+1}$. The PPN for positions 0 to $n - 3$ is shared for the two adders, which leads to significant area saving, where the carries for $X + Y$ are easily obtained from the carries of $X + Y + 1$. The sum bits corresponding to $X + Y$ and $W$ are selected via the carry-out of the latter. Shang Ma et al. [10] have set up a special PPN (with a preprocessing carry save stage), for the case of $\delta = 2^q + 1$ $(1 \le q \le n - 2)$, that primarily computes the positional carry signals of $W$. These carries are then corrected to lead to carries of $X + Y$, via removing the effect of the two 1-valued bits in positions 0 and $q$. A simple logic is employed, per binary position, which combines the two carries to obtain the required actual carry.
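As a bit-level companion to the RPP structure of Section II.A, the sketch below models a Kogge-Stone prefix computation followed by the extra EAC row $c_i = G_{i-1:0} \lor P_{i-1:0} w_n$. It is a behavioral Python model under assumptions (function names and list-based bit handling are illustrative), not a description of the gate-level circuit of [3].

```python
def kogge_stone(g, p):
    """Return lists G, P with G[i] = G_{i:0} and P[i] = P_{i:0}."""
    n = len(g)
    G, P = list(g), list(p)
    d = 1
    while d < n:
        # Each level combines a group with the group ending d positions below it.
        G = [G[i] | (P[i] & G[i - d]) if i >= d else G[i] for i in range(n)]
        P = [P[i] & P[i - d] if i >= d else P[i] for i in range(n)]
        d *= 2
    return G, P


def rpp_mod_2n_minus_1(X, Y, n):
    """Eqn. 4: prefix carries plus one extra row that injects the EAC."""
    a = [(X >> i) & 1 for i in range(n)]
    b = [(Y >> i) & 1 for i in range(n)]
    g = [ai & bi for ai, bi in zip(a, b)]
    p = [ai | bi for ai, bi in zip(a, b)]
    h = [ai ^ bi for ai, bi in zip(a, b)]
    G, P = kogge_stone(g, p)
    w_n = G[n - 1]                                   # end-around carry
    c = [w_n] + [G[i - 1] | (P[i - 1] & w_n) for i in range(1, n)]
    return sum((h[i] ^ c[i]) << i for i in range(n))


if __name__ == "__main__":
    print(rpp_mod_2n_minus_1(200, 100, 8), (200 + 100) % 255)
```

The model returns $2^n - 1$ when the true residue is zero and no carry-out occurs, in line with the double zero representation.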
Finally, Jaberipur et al. [9] compute $W$ and store $w_n$ along $|W|_{2^n}$, which indicates whether the correct sum is $|W|_{2^n}$ ($w_n = 1$) or $W - \delta$ ($w_n = 0$). However, the required subtraction is postponed and fused with the next addition.

None of the aforementioned contributions make use of the EAC of $X + Y$ addition, as is common for the case of $\delta = 1$ [3], and recently used for $\delta = 3$ [13].

III. THE PROPOSED MODULAR ADDITION SCHEME

The proposed modulo-$(2^n - \delta)$ addition scheme is based on the following Eqn. 5, for $\delta = 2^q + 1$ $(1 \le q \le n - 2)$, where $W = X + Y$.

$$|X + Y|_{2^n - \delta} = \begin{cases} W, & W < 2^n \\ |W + \delta|_{2^n}, & W \ge 2^n \end{cases} = w_{n-1} \ldots w_0 + \delta w_n \tag{5}$$

Since in cases of $2^n - \delta \le W \le 2^n - 1$, we have $|X + Y|_{2^n - \delta} = W$, the output residues in $[0, 2^q]$ can also assume excess-$(2^n - \delta)$ codes in $[2^n - \delta, 2^n - 1]$, respectively. These codes, as justified by Eqn. set 6, do not jeopardize the subsequent addition or multiplication operations, since the result of corresponding operations on these codes (i.e., $l + 2^n - \delta$, for $0 \le l < \delta$) is the same as for the normal codes.

$$|X + l + 2^n - \delta|_{2^n - \delta} = |X + l|_{2^n - \delta}, \qquad |X \times (l + 2^n - \delta)|_{2^n - \delta} = |X \times l|_{2^n - \delta} \tag{6}$$

Table I contains the bit variables that are involved in the implementation of Eqn. 5.

TABLE I. BIT ORGANIZATION OF $S = |X + Y + (2^q + 1) w_n|_{2^n}$

| Position | $n-1$ | … | $q$ | … | 1 | 0 |
|---|---|---|---|---|---|---|
| $X$ | $x_{n-1}$ | … | $x_q$ | … | $x_1$ | $x_0$ |
| $Y$ | $y_{n-1}$ | … | $y_q$ | … | $y_1$ | $y_0$ |
| $X + Y$ | $w_{n-1}$ | … | $w_q$ | … | $w_1$ | $w_0$ |
| $\delta\,\mathrm{EAC}$ | | | $w_n$ | | | $w_n$ |
| $S$ | $s_{n-1}$ | … | $s_q$ | … | $s_1$ | $s_0$ |

To avoid the two $n$-bit additions suggested by Table I, the $q$ least significant bits of the final sum can be obtained via a parallel prefix $q$-bit addition $x_{q-1} \ldots x_0 + y_{q-1} \ldots y_0 + w_n$, based on the RPP architecture of Fig. 1, tailored for $q$ bits. However, there are two carry signals entering position $q$, namely $w_n$ and the carry-out of the latter $q$-bit addition. The maximum collective value of these carries plus those of $x_q$ and $y_q$ amounts to 4. Therefore, no conventional parallel prefix architecture for positions $q$ to $n - 1$ can handle the two carry-in signals.

To remedy this problem, we devise a partial carry-save preprocessing stage, whose output bits are shown in Table II, as $h_i = x_i \oplus y_i$, and $g_i = x_i y_i$, for $q \le i \le n - 1$.

TABLE II. STAGES OF $|X + Y|_{2^n - 2^q - 1}$ ADDITION SCHEME

| Position | $n$ | $n-1$ | … | $q+1$ | $q$ | $q-1$ | … | 0 |
|---|---|---|---|---|---|---|---|---|
| $X$ | | $x_{n-1}$ | … | $x_{q+1}$ | $x_q$ | $x_{q-1}$ | … | $x_0$ |
| $Y$ | | $y_{n-1}$ | … | $y_{q+1}$ | $y_q$ | $y_{q-1}$ | … | $y_0$ |
| $X'$ | | $h_{n-1}$ | … | $h_{q+1}$ | $h_q$ | $x_{q-1}$ | … | $x_0$ |
| $Y'$ | $g_{n-1}$ | $g_{n-2}$ | … | $g_q$ | 0 | $y_{q-1}$ | … | $y_0$ |
| $X'$ | | $h_{n-1}$ | … | $h_{q+1}$ | $h_q$ | $x_{q-1}$ | … | $x_0$ |
| $Y''$ | | $g_{n-2}$ | … | $g_q$ | 0 | $y_{q-1}$ | … | $y_0$ |
| $\delta\,\mathrm{EAC}$ | | | | | $e$ | | | $e$ |
| $h'$ | | $h'_{n-1}$ | … | $h'_{q+1}$ | $h'_q$ | $h'_{q-1}$ | … | $h'_0$ |
| $c$ | | $c_{n-1}$ | … | $c_{q+1}$ | $c_q$ | $c_{q-1}$ | … | $c_0$ |
| $S$ | | $s_{n-1}$ | … | $s_{q+1}$ | $s_q$ | $s_{q-1}$ | … | $s_0$ |

This table further contains the bits of conceptual stages of the proposed modulo-$(2^n - 2^q - 1)$ addition $|X + Y|_{2^n - 2^q - 1}$, where the actual EAC for $|X' + Y'|_{2^n - 2^q - 1}$ is $e = G'_{n-1:0} \lor g_{n-1}$, since $G'_{n-1:0}\, g_{n-1} = 0$ due to $X' + Y' = X + Y \le 2 \times 2^n - (2^{q+1} + 4) < 2^{n+1}$. Let $c_q$ denote the carry into position $q$ of the third part of Table II, within the addition $X' + Y'' + \delta e$, where $Y'' = Y' - 2^n g_{n-1}$. In contrast to position $q$ of Table I, the collective value of the present bits (i.e., $h_q$, $e$, and $c_q$) is at most 3. This desirable property helps in devising an augmented PPN for generation of carry signals $c_i$ that are needed to obtain the final sum bits $s_i = h'_i \oplus c_i$ $(0 \le i \le n - 1)$, where $h'_i = h_i$ $(0 \le i < q)$, $h'_q = h_q \oplus e$, and $h'_i = h_i \oplus g_{i-1}$ $(q + 1 \le i \le n - 1)$. Eqn. 7, supported by Eqn. set 8, defines the required $c_i$s, with due justification to follow with the support of Lemma 1.

$$c_i = \begin{cases} G_{n-1:0}, & i = 0 \\ G_{i-1:0} \lor P_{i-1:0}\, G_{n-1:0}, & 1 \le i \le q \\ h_q G_{q-1:0} \lor P''_{q:0}\, G_{n-1:0}, & i = q + 1 \\ h_{i-1} G_{i-2:0} \lor P''_{i-1:0}\, G_{n-1:0}, & i > q + 1 \end{cases} \tag{7}$$

$$P''_{q:0} = h_q \lor P_{q-1:0} \lor G_{q-1:0}, \qquad P''_{i-1:0} = P'_{i-1:q+1}\, P''_{q:0} \tag{8}$$

Lemma 1: $(G'_{i:j+1}, P'_{i:j+1} h_j) = (h_i G_{i-1:j}, H_{i:j})$, $j \ge q$, where $(G_{i:j}, P_{i:j})$ and $(G'_{i:j}, P'_{i:j})$ are the group (generate, propagate) signals of hypothetical PPNs for $X + Y$ and $X' + Y''$, with $Y'' = Y' - 2^n g_{n-1}$, and $H_{i:j} = \bigwedge_{k=j}^{i} h_k$.

Proof (by induction on $i - j$): Let $g'$ and $p'$ refer to positional counterparts of $(G'_{i:j}, P'_{i:j})$.

Basis ($i - j = 1$):
$G'_{j+1:j+1} = g'_{j+1} = h_{j+1} g_j = h_{j+1} G_{j:j}$.
$P'_{j+1:j+1} h_j = p'_{j+1} h_j = (h_{j+1} \lor g_j) h_j = H_{j+1:j}$.

Induction: We show that the propositions $G'_{i:j+1} = h_i G_{i-1:j}$ and $P'_{i:j+1} h_j = H_{i:j}$ imply $G'_{i+1:j+1} = h_{i+1} G_{i:j}$ and $P'_{i+1:j+1} h_j = H_{i+1:j}$, respectively, as follows.
$G'_{i+1:j+1} = g'_{i+1} \lor p'_{i+1} G'_{i:j+1} = h_{i+1} g_i \lor (h_{i+1} \lor g_i) h_i G_{i-1:j} = h_{i+1}(g_i \lor h_i G_{i-1:j}) = h_{i+1} G_{i:j}$.
$P'_{i+1:j+1} h_j = p'_{i+1} P'_{i:j+1} h_j = (h_{i+1} \lor g_i) H_{i:j} = h_{i+1} H_{i:j} \lor g_i h_i H_{i-1:j} = H_{i+1:j}$. ∎

Justification of Eqn. 7:

$i = 0$: Recalling $(G'_{i:j+1}, P'_{i:j+1} h_j) = (h_i G_{i-1:j}, H_{i:j})$ from Lemma 1, we have
$c_0 = e = g_{n-1} \lor G'_{n-1:0} = g_{n-1} \lor G'_{n-1:q+1} \lor P'_{n-1:q+1} G'_{q:0} = g_{n-1} \lor h_{n-1} G_{n-2:q} \lor H_{n-1:q} G_{q-1:0} = g_{n-1} \lor h_{n-1}(G_{n-2:q} \lor H_{n-2:q} G_{q-1:0}) = G_{n-1:0}$.
That is, the EAC is equal to that of the original $X + Y$.

$1 \le i \le q$: $c_i = G_{i-1:0} \lor P_{i-1:0} G_{n-1:0}$, since the attendant bits are those of the original $X$ and $Y$.

$i = q + 1$: $c_{q+1} = (h_q \lor e) c_q \lor h_q e = (h_q \lor G_{n-1:0})(G_{q-1:0} \lor P_{q-1:0} G_{n-1:0}) \lor h_q G_{n-1:0} = h_q G_{q-1:0} \lor P''_{q:0} G_{n-1:0}$.

$i > q + 1$:
$c_i = G'_{i-1:q+1} \lor P'_{i-1:q+1} c_{q+1}$
$= h_{i-1} G_{i-2:q} \lor P'_{i-1:q+1}(h_q G_{q-1:0} \lor P''_{q:0} G_{n-1:0})$
$= h_{i-1} G_{i-2:q} \lor P'_{i-1:q+1} h_q G_{q-1:0} \lor P'_{i-1:q+1} P''_{q:0} G_{n-1:0}$
$= h_{i-1} G_{i-2:q} \lor H_{i-1:q} G_{q-1:0} \lor P''_{i-1:0} G_{n-1:0}$
$= h_{i-1}(G_{i-2:q} \lor H_{i-2:q} G_{q-1:0}) \lor P''_{i-1:0} G_{n-1:0}$
$= h_{i-1} G_{i-2:0} \lor P''_{i-1:0} G_{n-1:0}$. ∎

A. Proposed RPP architectures for $q \le n/2$ and $q > n/2$

In the normal RPP circuit (see Fig. 1) all carry signals are available $2\Delta G$ after the last level $G$ signals. In the current design, the same is true for $i \le q$, as is evident by Eqn. 7. However, for $i > q$, two cases are recognized depending on the value of $q$:

• $q \le n/2$: The signal $P''_{q:0}$ ($= h_q \lor P_{q-1:0} \lor G_{q-1:0}$) is delivered one $\Delta G$ sooner than the final last level $G$ signals, since $G_{q-1:0}$ is produced in the level before last of the PPN. This leads to delivery of $c_{i>q}$ also at the desired time (see the corresponding equations for $i > q$, within Eqn. 7). Fig. 3a depicts the required circuitry, for $n = 8$ and $q = 4$, followed by the description of each incorporated logical cell, in Fig. 3b. Note that no PPN nodes are present in position 6 to lead to $G_{6:0}$, since $G_{n-2:0}$ does not occur in Eqn. 7.

• $q > n/2$: The signal $P''_{q:0}$ is ready one $\Delta G$ later than the last level $G$ signals. However, to achieve the same time delivery, Eqn. set 8 for $q > n/2$ and $i > q$ can be modified, as in Eqn. set 9. This is achieved via $(\lceil\log n\rceil - 1)$ special twin nodes [25] for computing $(G_{n/2-1:0} \lor P_{n/2-1:0})$, and devising a mixed KS/Ladner-Fischer PPN architecture to generate $G_{q-1:n/2}$, as in Fig. 4a (for $n = 8$ and $q = 5$), followed by legends of the utilized cells (Fig. 4b). Note that the first two PPN levels use KS architecture, while the bottom one follows that of Ladner-Fischer. Therefore, some area savings (mainly due to additional buffer nodes), which grow with $n$, are expected in comparison to Fig. 3.

$$\begin{aligned} P''_{q:0} &= h_q \lor P_{q-1:0} \lor G_{q-1:0} = h_q \lor P_{q-1:n/2} P_{n/2-1:0} \lor G_{q-1:n/2} \lor P_{q-1:n/2} G_{n/2-1:0} \\ &= (h_q \lor G_{q-1:n/2}) \lor P_{q-1:n/2}(G_{n/2-1:0} \lor P_{n/2-1:0}) \\ P''_{i-1:0} &= P'_{i-1:q+1} P''_{q:0} = P'_{i-1:q+1}(h_q \lor G_{q-1:n/2}) \lor P'_{i-1:q+1} P_{q-1:n/2}(G_{n/2-1:0} \lor P_{n/2-1:0}) \end{aligned} \tag{9}$$
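To make Eqns. 5, 7 and 8 concrete, the following Python sketch compares a word-level model of Eqn. 5 with a bit-level model that forms the carries $c_i$ as in Eqn. 7. It is a behavioral check under assumptions: the function names are illustrative, the prefix terms $G_{i:0}$ and $P_{i:0}$ are computed serially rather than by a PPN, and operands are taken as normal residues in $[0, 2^n - \delta - 1]$, in line with the bound $X + Y \le 2 \times 2^n - (2^{q+1} + 4)$ used in the text.

```python
def add_word(X, Y, n, q):
    """Eqn. 5 at word level: add delta = 2**q + 1 when X + Y overflows n bits."""
    W = X + Y
    return (W + ((1 << q) + 1) * (W >> n)) & ((1 << n) - 1)


def add_eqn7(X, Y, n, q):
    """Bit-level sum with carries per Eqns. 7 and 8 (prefix terms computed serially)."""
    x = [(X >> i) & 1 for i in range(n)]
    y = [(Y >> i) & 1 for i in range(n)]
    g = [a & b for a, b in zip(x, y)]
    p = [a | b for a, b in zip(x, y)]
    h = [a ^ b for a, b in zip(x, y)]
    # G0[i] = G_{i:0} and P0[i] = P_{i:0} of the plain addition X + Y
    G0, P0 = [g[0]], [p[0]]
    for i in range(1, n):
        G0.append(g[i] | (p[i] & G0[i - 1]))
        P0.append(p[i] & P0[i - 1])
    e = G0[n - 1]                               # EAC, equal to w_n
    P2 = h[q] | P0[q - 1] | G0[q - 1]           # P''_{q:0} of Eqn. 8
    c = [e] + [G0[i - 1] | (P0[i - 1] & e) for i in range(1, q + 1)]
    c.append((h[q] & G0[q - 1]) | (P2 & e))     # c_{q+1}
    for i in range(q + 2, n):
        P2 &= h[i - 1] | g[i - 2]               # extend to P''_{i-1:0} with p'_{i-1}
        c.append((h[i - 1] & G0[i - 2]) | (P2 & e))
    # h'_i: plain half-sum below q, merged with e at q, with g_{i-1} above q
    hp = h[:q] + [h[q] ^ e] + [h[i] ^ g[i - 1] for i in range(q + 1, n)]
    return sum((hp[i] ^ c[i]) << i for i in range(n))


if __name__ == "__main__":
    n, q = 8, 4                                 # modulus 2**8 - 2**4 - 1 = 239
    m = (1 << n) - (1 << q) - 1
    X, Y = 200, 100
    print(add_eqn7(X, Y, n, q), add_word(X, Y, n, q), (X + Y) % m)
```

A result in $[2^n - \delta, 2^n - 1]$ is an excess-$(2^n - \delta)$ code of a residue in $[0, 2^q]$, so a comparison against `(X + Y) % m` should reduce the output modulo $m$ first.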
Figure 3. a: Modulo-$(2^n - 2^q - 1)$ adder for $n = 8$, $q = 4$, b: The utilized cells

Figure 4. a: Modulo-$(2^n - 2^q - 1)$ adder for $n = 8$, $q = 5$, b: The utilized cells
The $c_i$s, in both Figs. 3 and 4, are all delivered $2\Delta G$ after the last level $G$ signals. Therefore, all the $s_i$ signals are available in $(5 + 2\lceil\log n\rceil)\Delta G$, with exactly the same latency as in RPP realization of modulo-$(2^n - 1)$ adders. This implies that the carry-save stage of Table II is effectively off the critical delay path.

IV. PERFORMANCE EVALUATION AND COMPARISONS

The analytical gate level evaluations of several previous works and the proposed ones are reflected in Table III, where $\mathcal{A}_G$ denotes the area of a simple gate ($\mathcal{A}_{XOR} = 2$ [26]). The last entry relates to the modulo-$(2^n - 1)$ RPP adder, as the one with highest performance among modulo-$(2^n - \delta)$ adders. In the works of [4-9], $\delta$ can assume any value in $[1, 2^{n-2}]$; it is of the form $2^q + 1$ $(1 \le q \le n - 2)$ in [10], in the proposed work, and in [13] (for $q = 1$), and equal to $2^{n-2} + 1$ in [11]. The term $\mathcal{A}_{P''}$, in the area formula for the proposed adders, relates to the logic that computes the $P''$ variables. To keep $P''$ computations off the critical delay path, we use low cost or low delay realizations depending on the value of $q$. In case of $q \le n/2$, we implement $P'_{i-1:q+1}$ via an AND array for $q \ge n + 1 - 2\lceil\log n\rceil$ (e.g., $q \ge 3$, for $n = 8$), which leads to $\mathcal{A}_{P''} = 2(n - q) - 2$, and by an AND tree (as in [11]) for $q < n + 1 - 2\lceil\log n\rceil$, with $\mathcal{A}_{P''} = 1.5(n - q - 2)\lceil\log(n - q - 2)\rceil + n - q$. Likewise, for $q > n/2$, $\mathcal{A}_{P''} = 5(n - q) - {?}$ in the low cost case with $q \ge n + 3 - 2\lceil\log n\rceil$, and $\mathcal{A}_{P''} = 1.5(n - q - 2)\lceil\log(n - q - 2)\rceil + 4(n - q) - 5$ for the low delay case with $q < n + 3 - 2\lceil\log n\rceil$.

We have undertaken a similar evaluation for the work of [10], where in the low cost realization $\mathcal{A} = 2(n - q) + q - 2$, for $n + 1 - 2\lceil\log n\rceil \le q \le 2\lceil\log n\rceil$, and otherwise $\mathcal{A} = 1.5(n - q - 2)\lceil\log(n - q - 2)\rceil + 1.5(q - 1)\lceil\log(q - 1)\rceil + n - q + 1$. Fig. 5 shows (for $n = 16$) that as $q$ grows, our $\mathcal{A}_{P''}$ drops considerably faster than $\mathcal{A}$ of [10].
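TABLE III. PERFORMANCE MEASURES OF MODULO-$(2^n - \delta)$ ADDERS

| Design | Delay ($\Delta G$) | Area ($\mathcal{A}_G$) |
|---|---|---|
| [4] | $4\lceil\log n\rceil + 8$ | ${?}n\lceil\log n\rceil + 5n + 1$ |
| [5] | $4\lceil\log n\rceil + 12$ | $3n\lceil\log n\rceil + 10n + 1$ |
| [6] | $2\lceil\log(n-1)\rceil + {?}$ | $3(n-1)\lceil\log n\rceil + 3(n-1)\lceil\log(n-1)\rceil + 5n + 1$ |
| [7] | $2\lceil\log(n-1)\rceil + 2\lceil\log(n-2)\rceil + 8$ | $3(n-2)\lceil\log(n-2)\rceil - \lceil\log(n-1)\rceil + {?}n + 2$ |
| [8] | $2\lceil\log(n-1)\rceil + {?}$ | $3n\lceil\log(n-1)\rceil + 13n - 5$ |
| [9] | $2\lceil\log n\rceil + {?}$ | $(3n-1)\lceil\log n\rceil + 11.5n + 1$ |
| [10] | $2\lceil\log n\rceil + 7$ | $3n\lceil\log n\rceil + {?}n - 3q + 2 + \mathcal{A}$ |
| [11] ($q = n - 2$) | $2\lceil\log n\rceil + 5$ | $3n\lceil\log n\rceil + 1.5n\lceil\log(n-1)\rceil + {?}n + 2^{\lceil\log(n-1)\rceil - 1}$ |
| [13]-RPP ($q = 1$) | $2\lceil\log(n-1)\rceil + {?}$ | $3(n-1)\lceil\log(n-1)\rceil + 8n - 1$ |
| New ($q \le n/2$) | $2\lceil\log n\rceil + 5$ | $3(n-1)\lceil\log n\rceil + {?}n - 3q - 1 + \mathcal{A}_{P''}$ |
| New ($q > n/2$) | $2\lceil\log n\rceil + 5$ | $3(n-1)\lceil\log n\rceil + 5.5n - 3q + 2\lceil\log n\rceil + \mathcal{A}_{P''}$ |
| [3]-RPP | $2\lceil\log n\rceil + 5$ | $3n\lceil\log n\rceil + 4n$ |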
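As a small worked example of the analytical latency figures, the sketch below evaluates the $(5 + 2\lceil\log n\rceil)\Delta G$ latency of the proposed adders against the $(7 + 2\lceil\log n\rceil)\Delta G$ latency stated for [10]. It assumes base-2 logarithms (the usual convention for prefix networks) and uses illustrative function names.

```python
def ceil_log2(n):
    """Ceiling of log2(n) for n >= 1, computed with integers only."""
    return (n - 1).bit_length()


def latency_proposed(n):
    """Latency of the proposed adders in units of Delta-G."""
    return 5 + 2 * ceil_log2(n)


def latency_ref10(n):
    """Latency stated for the adder of [10] in units of Delta-G."""
    return 7 + 2 * ceil_log2(n)


if __name__ == "__main__":
    for n in (8, 16):
        print(n, latency_proposed(n), latency_ref10(n))
```

For $n = 8$ this gives 11 versus 13 gate delays, and for $n = 16$ it gives 13 versus 15.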
Figure 5. $\mathcal{A}_{P''}$ and $\mathcal{A}$ versus $q$

In order to confirm the above analytical evaluations, we have described the works of [6, 8-11, 13] and the proposed ones with HDL code, and synthesized them with TSMC 0.13 μm standard CMOS technology by Synopsys Design Compiler. The results are compiled as follows. Performance measures for the proposed design are shown in Table IV, for $n = 8$ and $1 \le q \le 6$, where the area and power measures decrease as $q$ grows, since more buffer nodes appear (see Fig. 4).

TABLE IV. PERFORMANCE MEASURES OF THE PROPOSED DESIGNS

| $q$ | Least delay (ns) | Area (μm²) | Power (μW) |
|---|---|---|---|
| 1 | 0.71 | 1405 | 449 |
| 2 | 0.71 | 1268 | 408 |
| 3 | 0.72 | 1224 | 383 |
| 4 | 0.71 | 1129 | 353 |
| 5 | 0.70 | 1110 | 326 |
| 6 | 0.70 | 910 | 264 |

Similar performance measures are compiled in Tables V-VIII, for $n = 8$ and $n = 16$, with minimum and maximum $q$ values, where the proposed designs meet the least delay that could be achieved by the synthesis tool, with no area or power penalty with respect to other works. In particular, the AT and PDP measures for the proposed adders are at least 21% and 24% less than those of previous modulo-$(2^n - 2^q - 1)$ designs, respectively. Moreover, at least 60% (75%) and 10% (8%) AT (PDP) improvement is experienced with respect to the previous special case designs for $q = n - 2$ [11] and $q = 1$ [13].

TABLE V. SYNTHESIS RESULTS FOR ($n = 8$, $q = 1$)

| Design | Least delay (ns) | Ratio | Area (μm²) | Ratio | AT | Power (μW) | Ratio | PDP |
|---|---|---|---|---|---|---|---|---|
| [6] | 0.77 | 1.08 | 1828 | 1.30 | 1.41 | 577 | 1.28 | 1.39 |
| [8] | 0.84 | 1.18 | 1501 | 1.07 | 1.26 | 499 | 1.11 | 1.31 |
| [9] | 0.83 | 1.17 | 1524 | 1.08 | 1.27 | 500 | 1.11 | 1.30 |
| [10] | 0.84 | 1.18 | 1504 | 1.07 | 1.27 | 500 | 1.11 | 1.32 |
| [13] | 0.80 | 1.12 | 1400 | 1.00 | 1.12 | 446 | 0.99 | 1.12 |
| New | 0.71 | 1.00 | 1405 | 1.00 | 1.00 | 449 | 1.00 | 1.00 |

TABLE VI. SYNTHESIS RESULTS FOR ($n = 16$, $q = 1$)

| Design | Least delay (ns) | Ratio | Area (μm²) | Ratio | AT | Power (μW) | Ratio | PDP |
|---|---|---|---|---|---|---|---|---|
| [6] | 0.88 | 1.03 | 4589 | 1.41 | 1.46 | 1383 | 1.40 | 1.44 |
| [8] | 0.90 | 1.06 | 3725 | 1.15 | 1.21 | 1190 | 1.20 | 1.27 |
| [9] | 0.92 | 1.08 | 3828 | 1.18 | 1.28 | 1295 | 1.31 | 1.41 |
| [10] | 0.97 | 1.14 | 3394 | 1.04 | 1.19 | 1075 | 1.08 | 1.24 |
| [13] | 0.89 | 1.05 | 3414 | 1.05 | 1.10 | 1018 | 1.03 | 1.08 |
| New | 0.85 | 1.00 | 3248 | 1.00 | 1.00 | 991 | 1.00 | 1.00 |

TABLE VII. SYNTHESIS RESULTS FOR ($n = 8$, $q = 6$)

| Design | Least delay (ns) | Ratio | Area (μm²) | Ratio | AT | Power (μW) | Ratio | PDP |
|---|---|---|---|---|---|---|---|---|
| [6] | 0.77 | 1.10 | 1825 | 2.00 | 2.21 | 550 | 2.08 | 2.29 |
| [8] | 0.84 | 1.20 | 1404 | 1.54 | 1.85 | 434 | 1.64 | 1.97 |
| [9] | 0.88 | 1.26 | 1338 | 1.47 | 1.85 | 454 | 1.72 | 2.16 |
| [10] | 0.75 | 1.07 | 1480 | 1.63 | 1.74 | 452 | 1.71 | 1.83 |
| [11] | 0.73 | 1.04 | 1394 | 1.53 | 1.60 | 442 | 1.67 | 1.75 |
| New | 0.70 | 1.00 | 910 | 1.00 | 1.00 | 264 | 1.00 | 1.00 |

TABLE VIII. SYNTHESIS RESULTS FOR ($n = 16$, $q = 14$)

| Design | Least delay (ns) | Ratio | Area (μm²) | Ratio | AT | Power (μW) | Ratio | PDP |
|---|---|---|---|---|---|---|---|---|
| [6] | 0.90 | 1.06 | 4538 | 2.18 | 2.31 | 1360 | 2.34 | 2.48 |
| [8] | 0.91 | 1.07 | 3988 | 1.91 | 2.05 | 1262 | 2.17 | 2.32 |
| [9] | 0.90 | 1.06 | 3855 | 1.85 | 1.96 | 1341 | 2.31 | 2.44 |
| [10] | 0.89 | 1.05 | 3394 | 1.63 | 1.71 | 940 | 1.62 | 1.69 |
| [11] | 0.87 | 1.02 | 3829 | 1.84 | 1.88 | 1226 | 2.11 | 2.16 |
| New | 0.85 | 1.00 | 2083 | 1.00 | 1.00 | 581 | 1.00 | 1.00 |

The same results are also illustrated by Figs. 6-13, where the superiority of the proposed designs is evident. In particular, Figs. 6-9, for $q = 1$, show the performance advantage of the New design with respect to the RPP adder of [13]. Moreover, extra curves corresponding to the modulo-$(2^n - 1)$ RPP adder are included in order to show the minimal delay overhead of our designs, where the corresponding exact figures are compiled in Table IX.

TABLE IX. OVERHEAD WITH RESPECT TO MODULO-$(2^n - 1)$ RPP ADDER

| Design | $n$ | $q$ | Least delay (ns) | Ratio | Area (μm²) | Ratio | AT | Power (μW) | Ratio | PDP |
|---|---|---|---|---|---|---|---|---|---|---|
| New | 8 | 1 | 0.71 | 1.11 | 1405 | 1.22 | 1.35 | 449 | 1.38 | 1.53 |
| New | 8 | 6 | 0.70 | 1.09 | 910 | 0.79 | 0.86 | 264 | 0.81 | 0.89 |
| [3]-RPP | 8 | - | 0.64 | 1.00 | 1154 | 1.00 | 1.00 | 326 | 1.00 | 1.00 |
| New | 16 | 1 | 0.85 | 1.02 | 3248 | 1.42 | 1.46 | 991 | 1.68 | 1.72 |
| New | 16 | 14 | 0.85 | 1.02 | 2083 | 0.91 | 0.93 | 581 | 0.99 | 1.01 |
| [3]-RPP | 16 | - | 0.83 | 1.00 | 2282 | 1.00 | 1.00 | 589 | 1.00 | 1.00 |

Figure 6. Area comparisons for $n = 8$, $q = 1$
Figure 7. Power comparisons for $n = 8$, $q = 1$
Figure 8. Area comparisons for $n = 16$, $q = 1$
Figure 9. Power comparisons for $n = 16$, $q = 1$
Figure 10. Area comparisons for $n = 8$, $q = 6$
Figure 11. Power comparisons for $n = 8$, $q = 6$
Figure 12. Area comparisons for $n = 16$, $q = 14$
Figure 13. Power comparisons for $n = 16$, $q = 14$
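The AT (area-time) and PDP (power-delay product) ratios in the synthesis tables can be recomputed from the raw delay, area and power columns. The Python sketch below does so for the Table V figures ($n = 8$, $q = 1$), normalizing to the New design; the dictionary layout and the function name are assumptions for illustration, and small differences from the published ratios may arise because the tabulated delays are rounded.

```python
# (least delay in ns, area in um^2, power in uW), copied from Table V
TABLE_V = {
    "[6]": (0.77, 1828, 577),
    "[8]": (0.84, 1501, 499),
    "[9]": (0.83, 1524, 500),
    "[10]": (0.84, 1504, 500),
    "[13]": (0.80, 1400, 446),
    "New": (0.71, 1405, 449),
}


def ratios(design, baseline="New", table=TABLE_V):
    """Area-time and power-delay products of a design relative to the baseline."""
    d, a, p = table[design]
    d0, a0, p0 = table[baseline]
    return {"AT": (a * d) / (a0 * d0), "PDP": (p * d) / (p0 * d0)}


if __name__ == "__main__":
    for name in TABLE_V:
        r = ratios(name)
        print(f"{name:>5}: AT = {r['AT']:.2f}, PDP = {r['PDP']:.2f}")
```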
V. CONCLUSION

Efficient realization of modulo-$(2^n - \delta)$ arithmetic operations can help in devising fast RNS arithmetic systems with several computational channels with balanced latency. An example is the moduli set $\{31, 29, 27, 25, 23, 19, 17\}$, for $n = 5$, and $\delta$ in $\{1, 3, 5, 7, 9, 13, 15\}$, respectively. Parallel prefix realizations of modulo-$(2^n - \delta)$ adders, for odd $\delta > 1$, confront the difficulty of handling multiple power-of-two weighted reentrant carries. Previous solutions for $|X + Y|_{2^n - \delta}$ rely on partial or full computations towards obtaining two interim sums $X + Y$ and $X + Y + \delta$. Furthermore, some recent designs work for restricted values of $\delta$ such as $\delta = 2^{n-2} + 1$ [11], and $\delta = 2^q + 1$ [10] (e.g., $\delta = 3, 5, 9$, for $n = 5$). However, following the case of $q = 1$ in [13], via excess-modulo representation of residues in $[0, 2^q]$ (as well as normal representation), we design the required modular adders based on only one interim sum (i.e., $X + Y$). This leads to significant performance improvement on all figures of merit.

Latency of the proposed adder is $(5 + 2\lceil\log n\rceil)\Delta G$, which is equal to that of the similar realization (i.e., RPP) for modulo-$(2^n - 1)$ adders, and less than that of the similar special case of $q = 1$ in [13]. The latency of the fastest previous general modulo-$(2^n - 2^q - 1)$ adder [10] is $(7 + 2\lceil\log n\rceil)\Delta G$, while it consumes considerable additional area and dissipates more power. For example, in case of $n = 16$ and $q = 14$, it is 5% slower, consumes 63% more area, and dissipates 62% more power. Other cases can be checked in Tables V-VIII, which show the superiority of the proposed designs. Also, the adder of [11], for the single case of $q = n - 2$, experiences the same delay as ours, but at the cost of 84% (111%) more area (power) consumption. Furthermore, the proposed adder (for the highest value of $q$) is more area- and power-efficient (see Table IX) than the RPP realization of the modulo-$(2^n - 1)$ adder [3], but with a slight additional delay penalty (e.g., 9% area and 1% power savings at the cost of 2% more delay, for $n = 16$ and $q = 14$).

As for future relevant research, modulo-$(2^n - 2^q - 1)$ multipliers and the corresponding totally parallel prefix (TPP in the terminology of [3]) adders can be studied.
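The example moduli set in the conclusion can be checked with a few lines of Python. This sketch only confirms that the seven moduli $2^5 - \delta$ are pairwise coprime and reports the resulting dynamic range; the variable names are illustrative.

```python
from itertools import combinations
from math import gcd, prod

n = 5
deltas = (1, 3, 5, 7, 9, 13, 15)
moduli = [2**n - d for d in deltas]        # 31, 29, 27, 25, 23, 19, 17

# Pairwise coprimality is what makes the set usable as an RNS basis.
pairwise_coprime = all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))
M = prod(moduli)                           # dynamic range [0, M)

if __name__ == "__main__":
    print(moduli, pairwise_coprime, M.bit_length())
```

With seven 5-bit channels, the dynamic range $M$ exceeds $2^{32}$, which illustrates how additional moduli of the form $2^n - \delta$ widen the range without widening the channels.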
REFERENCES

[1] D. Schinianakis, T. Stouraitis, “Multifunction Residue Architectures for Cryptography,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 4, pp. 1156-1169, Apr. 2014.
[2] R. Chokshi, K. S. Berezowski, A. Shrivastava, S. J. Piestrak, “Exploiting residue number system for power-efficient digital signal processing in embedded processors,” in Proc. of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pp. 19-28, Oct. 2009.
[3] L. Kalampoukas, D. Nikolos, C. Efstathiou, H. T. Vergos, J. Kalamatianos, “High-Speed Parallel-Prefix Modulo $2^n - 1$ Adders,” IEEE Trans. Comput., vol. 49, no. 7, pp. 673-680, July 2000.
[4] M. A. Bayoumi, G. A. Jullien, W. C. Miller, “A VLSI implementation of residue adders,” IEEE Trans. Circuits Syst., vol. 34, no. 3, pp. 284-288, Mar. 1987.
[5] M. Dugdale, “VLSI implementation of residue adders based on binary adders,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 5, pp. 325-329, May 1992.
[6] S. J. Piestrak, “Design of high-speed residue-to-binary number system converter based on Chinese remainder theorem,” in Proc. of the International Conference on Computer Design, pp. 508-511, Oct. 1994.
[7] A. A. Hiasat, “High-speed and reduced-area modular adder structures for RNS,” IEEE Trans. Comput., vol. 51, no. 1, pp. 84-89, Jan. 2002.
[8] R. A. Patel, M. Benaissa, N. Powell, S. Boussakta, “Novel Power-Delay-Area-Efficient Approach to Generic Modular Addition,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 6, pp. 1279-1292, June 2007.
[9] G. Jaberipur, B. Parhami, S. Nejati, “On building general modular adders from standard binary arithmetic components,” in Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers, pp. 154-159, Nov. 2011.
[10] S. Ma, J. H. Hu, C. H. Wang, “A Novel Modulo-($2^n - 2^k - 1$) Adder for Residue Number System,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 11, pp. 2962-2972, Nov. 2013.
[11] R. A. Patel, M. Benaissa, S. Boussakta, “Fast Modulo $2^n - (2^{n-2} + 1)$ Addition: A New Class of Adder for RNS,” IEEE Trans. Comput., vol. 56, no. 4, pp. 572-576, April 2007.
[12] H. Fatemi, G. Jaberipur, “Double representation Modulo-($2^n - 3$) adders,” in Proc. of the 21st International Conference on Systems, Signals, and Image Processing, pp. 119-122, May 2014.
[13] G. Jaberipur, S. H. F. Langroudi, “$(4 + 2\log n)$ Delta-G Parallel Prefix Modulo-($2^n - 3$) Adder via Double Representation of Residues in [0,2],” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 62, doi: 10.1109/TCSII.2015.2407772.
[14] S. J. Piestrak, “Design of residue generators and multioperand modular adders using carry-save adders,” IEEE Trans. Comput., vol. 43, no. 1, pp. 68-77, Aug. 1994.
[15] S. J. Piestrak, “Design of multi-residue generators using shared logic,” in Proc. of the IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1435-1438, May 2011.
[16] J. Y. S. Low, Chip-Hong Chang, “A New Approach to the Design of Efficient Residue Generators for Arbitrary Moduli,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 9, pp. 2366-2374, Feb. 2013.
[17] A. A. Hiasat, “RNS arithmetic multiplier for medium and large moduli,” IEEE Trans. Circuits Syst. II, Analog and Digit. Signal Process., vol. 47, no. 9, pp. 937-940, Sept. 2000.
[18] A. A. Hiasat, “A Suggestion for a Fast Residue Multiplier for a Family of Moduli of the Form $(2^n - 2^p \pm 1)$,” The Computer J., vol. 47, no. 1, pp. 93-102, 2004.
[19] L. Li, J. Hu, Y. Chen, “An universal architecture for designing modulo $2^n - 2^p - 1$ multiplier,” IEICE Electron. Expr., vol. 9, no. 3, pp. 193-199, Feb. 2012.
[20] J. C. Bajard, M. Kaihara, T. Plantard, “Selected RNS Bases for Modular Multiplication,” in Proc. of the 19th IEEE Symposium on Computer Arithmetic (ARITH), pp. 25-32, June 2009.
[21] A. Nannarelli, M. Re, G. C. Cardarilli, “Tradeoffs between residue number system and traditional FIR filters,” in Proc. of the IEEE International Symposium on Circuits and Systems (ISCAS), pp. 305-308, May 2001.
[22] P. Patronik, K. Berezowski, S. J. Piestrak, J. Biernat, A. Shrivastava, “Fast and energy-efficient constant-coefficient FIR filters using residue number system,” in Proc. of the International Symposium on Low Power Electronics and Design, pp. 385-390, Aug. 2011.
[23] G. C. Cardarilli, A. Nannarelli, M. Re, “Residue Number System for Low-Power DSP Applications,” in Conference Record of the Forty-First Asilomar Conference on Signals, Systems and Computers (ACSSC), pp. 1412-1416, Nov. 2007.
[24] D. Harris, “A taxonomy of parallel prefix networks,” in Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, pp. 2213-2217, Nov. 2003.
[25] R. A. Patel, M. Benaissa, S. Boussakta, “Fast Parallel-Prefix Architectures for Modulo $2^n - 1$ Addition with a Single Representation of Zero,” IEEE Trans. Comput., vol. 56, no. 11, pp. 1484-1492, 2007.
[26] R. Zimmermann, “Efficient VLSI implementation of modulo $(2^n \pm 1)$ addition and multiplication,” in Proc. Symposium on Computer Arithmetic, pp. 158-167, April 1999.