ON IMPLEMENTING EFFICIENT MODULO 2 n ю 1 ... - HT Vergos

0 downloads 0 Views 468KB Size Report
operand. 3-bit Parallel Adder c2 sc. HA sc. HA sc. FA+ b3 a3 d3 c3 sc. FA sc. FA sc. FA sc. FA. |A+B+C+D|9. 4-bit Parallel Adder. (Bias Subtractor) biased result.
Journal of Circuits, Systems, and Computers Vol. 19, No. 5 (2010) 911930 # .c World Scienti¯c Publishing Company DOI: 10.1142/S0218126610006529

ON IMPLEMENTING EFFICIENT MODULO 2 n þ 1 ¤ ARITHMETIC COMPONENTS

HARIDIMOS T. VERGOS Computer Engineering & Informatics Department, University of Patras, Greece [email protected] DIMITRIS BAKALIS Physics Department, University of Patras, Greece [email protected] Received 20 February 2009 Accepted 24 December 2009 It is shown that a diminished-1 adder, with minor modi¯cations, can be also used for the modulo 2 n þ 1 addition of two n-bit operands in the weighted representation, if the sum of its input operands is decreased by one. This modi¯ed diminished-1 adder can perform n-bit modulo 2 n þ 1 addition in less area and time than solutions that are based on the use of binary adders and/or weighted modulo 2 n þ 1 adders. Therefore, it can be applied e®ectively to all weighted modulo 2 n þ 1 arithmetic components that ¯nally derive two n-bit addends. A small number of weighted arithmetic components have in the past adopted such a scheme without presenting this general theory. By applying this idea, we propose novel multi-operand modulo 2 n þ 1 adders (MOMAs) and residue generators (RGs). Experimental results indicate that the resulting arithmetic components o®er signi¯cant savings in delay, implementation area and average power consumption compared to the currently most e±cient solutions. Keywords: Residue arithmetic; modulo 2 n þ 1 arithmetic; multi-operand adders; residue generation; residue number systems.

1. Introduction Arithmetic modulo 2 n þ 1 has found applicability in a variety of ¯elds ranging from pseudorandom number generation and cryptography,1 up to convolution computations without round-o® errors.2 A modulo 2 n þ 1 channel is also commonly included in a residue number system (RNS)35 application. Since the adoption of an RNS may o®er signi¯cant speedup over the binary system, it has been adopted in the design of DSPs,3,6,7 FIR ¯lters8,9 and communication components.1012 It is therefore no * This

paper was recommended by Regional Editor Piero Malcovati.

911

912

H. T. Vergos & D. Bakalis

surprise that a number of architectures has appeared for residue generators,1316 parallel-adders,1725 multi-operand adders,13,2628 multipliers,21,2937 squarers3840 and combined multiplication and sum-of-squares units,41 for the modulo 2 n þ 1 case. Almost all of the above architectures adopt, for both their input and output operands, either the normal weighted representation (we will hereafter refer to these components as weighted components) or the diminished-1 representation42 (we will hereafter refer to these components as diminished-1 components). The need for two representations has emerged because the weighted representation requires n þ 1 bits for representing each operand, while it utilizes only the 2 n þ 1 combinations. In the most common three-moduli RNS that uses channels of the f2 n  1; 2 n ; 2 n þ 1g form, if the weighted representation is adopted for the modulo 2 n þ 1 channel, then this will be the execution-rate bottleneck since it operates on wider operands than the other two channels. On the other hand, in the diminished-1 representation, each operand is represented decreased by one compared to its weighted representation. Zero operands are not used in the computation channel; the results are derived alternatively when any operand or the result is zero. Therefore, only n-bit operands are used in a diminished-1 channel, leading to smaller and faster components that can better match the speed of the rest RNS channels. The diminished-1 representation, however, su®ers from the need of translators from and to the weighted representation. The choice of the best representation thus depends on the number of computations that take place before a new translation is required, since intermediate results do not have to be translated immediately back to the weighted system. Several architectures have been proposed for the design of a weighted modulo 2 n þ 1 adder; some of them result from the general residue adder case, whereas others are dedicated architectures for this particular modulus. For the modulo 2 n þ 1 addition of A and B, hereafter denoted by jA þ Bj2 n þ1 , where A ¼ an an1 an2    a1 a0 and B ¼ bn bn1 bn2    b1 b0 are two ðn þ 1Þ-bit binary numbers in the range ½0; 2 n , we have that:  A þ B  ð2 n þ 1Þ ; if A þ B  2 n þ 1 jA þ Bj2 n þ1 ¼ ð1Þ A þ B; otherwise : For implementing Eq. (1), a generalized modulo adder architecture has been proposed17 which is based on binary adders. It uses two binary adders connected in series and a multiplexer. A þ B is computed by a ðn þ 1Þ-bit adder, while a ð2 n þ 1Þ correction is added to its output by the second ðn þ 2Þ-bit adder. The multiplexer is then used to select between the two adders' results depending on the value of the carry output of the second adder. Dugdale18 reduced the width of the second adder to n þ 1 bits and has shown that the multiplexers can be controlled by the logical OR of the two adders' carry outputs. In the same manuscript, an area e±cient architecture that performs the modular addition using just one adder in two addition cycles was presented.

On Implementing E±cient Modulo 2n þ 1 Arithmetic Components

913

Both above architectures are very slow. An obvious solution for decreasing the delay of the above architectures is to have both cases of Eq. (1) computed in parallel.24,43 This solution however, apart from the two adders, requires the addition of a carry-save adder (CSA) stage. Hiasat proposed a more area e®ective solution19 that is based on the observation that several carry propagate and generate signals for both cases of Eq. (1) can be shared for both additions and therefore an augmented single carry look-ahead unit is su±cient. Parallel-pre¯x weighted adders for the moduli of the 2 n þ 1 form have also been presented20 and have been shown to be the fastest currently available and more e±cient than those proposed by Hiasat.19 An even larger number of architectures is available for the design of a diminished1 parallel adder since its operation has been shown to be equivalent to an inverted end-around carry (EAC) binary adder.21,22 Single and multiple level carry lookahead architectures have been proposed22 by engaging the inverted carry output logic equations within the carry lookahead logic. These equations can also be computed in a parallel-pre¯x manner21,22 by the carry computation unit. The fastest parallel-pre¯x diminished-1 adders,22 o®er the same number of pre¯x levels with those required by the fastest binary adders. In this paper, motivated by the observation that a diminished-1 adder is more e±cient in both area and delay terms than a weighted one, irrespectively of the architecture used for the latter, we propose the replacement of the ¯nal weighted parallel adder commonly used in several arithmetic components with a modi¯ed diminished-1 one. We show that a diminished-1 adder with minor modi¯cations can be used for the modulo 2 n þ 1 addition of two n-bit weighted operands (obviously, a (n þ 1)-bit result is produced), provided that it is driven by operands whose sum has been decreased by one. The required modi¯cations do not increase the execution delay of the adder and can be implemented in very small area. We apply this idea to the design of weighted multi-operand modulo adders (MOMAs) and residue generators (RGs) modulo 2 n þ 1, for both the weighted and the diminished-1 representations. The resulting circuits heavily outperform the currently most e±cient ones in execution time while also being more compact and consuming less power. The partial products addition schemes of the weighted multipliers37 and squarers40 proposed recently, are shown to be straightforward applications of the introduced MOMAs, but these designs37,40 would also bene¯t from the use of the proposed modi¯ed diminished-1 adder since this derives the most signi¯cant bit in less implementation area than the one used in them. Moreover, we show that the design of a weighted parallel adder based on the use of a diminished-1 one,25 can be easily derived as the two-operand MOMA. This paper extends our previous work44 by: (a) providing a more e±cient way of designing the modi¯ed diminished-1 adder that imposes signi¯cantly less area penalty, (b) showing that the architectures that have been proposed for modulo 2 n þ 1 weighted multipliers,37 squarers40 and parallel adders25 are straightforward applications of the proposed MOMA architecture, (c) proposing e±cient diminished-1

914

H. T. Vergos & D. Bakalis

residue generators as well along with the weighted ones, and (d) substantially revising the experimental results using a contemporary CMOS VLSI technology and including average power dissipation estimations. The rest of this paper is organized as follows. In Sec. 2 we present a modi¯ed diminished-1 adder, capable of adding two n-bit weighted operands and producing a ðn þ 1Þ-bit result, provided that the operands sum has previously been decreased by one. In Sec. 3 we apply this modi¯ed diminished-1 adder in the design of weighted MOMAs and RGs and derive new architectures for them. Comparison results against the most e±cient currently known solutions are given in Sec. 4. The last section concludes the paper.

2. A Modi¯ed Diminished-1 Adder for n-bit Weighted Addition Table 1 summarizes the area and delay requirements, in equivalent gates, of the currently most e±cient weighted and diminished-1 architectures (ignore the last part at this point) that adopt the normal carry de¯nition. These estimates are derived using the unit-gate model.45 This model assumes that each monotonic two-input gate counts as one equivalent for both area and delay, while an XOR/XNOR gate counts as two equivalents for area and delay. We assume that both addition operands are in the ½0; 2 n  1 range (a 2 n operand value is not considered at this point). It becomes obvious that, under the normal carry de¯nition, a diminished-1 adder is both a smaller and faster circuit than a weighted modulo 2 n þ 1 adder, considering the current state of the art. Therefore, if we could use a diminished-1 adder with very small modi¯cations to also perform weighted modulo 2 n þ 1 addition, then we would reduce both the area and the delay of the resulting components. Such a modi¯ed diminished-1 adder is presented in the following. Consider that A and B are two n-bit weighted operands in the ½0; 2 n  1 range. Let A H and B H denote two n-bit vectors such that A H þ B H ¼ A þ B  1. From Eq. (1), we can equivalently derive that: ( jA þ Bj2 n þ1 ¼

ðA þ B  1Þ  2 n ;

if A þ B  1  2 n

ðA þ B  1Þ þ 1 ;

otherwise :

Table 1. Area and delay estimations provided by the unit-gate model. Delay Weighted modulo

2n

Diminished-1 modulo

20

þ 1 adders 2n

22

þ 1 adders

Proposed modi¯ed diminished-1 adder for n-bit weighted modulo 2 n þ 1 addition

Area

2 log n þ 7

9 2

n log n þ 4n þ 11

2 log n þ 3

9 2

n log n þ 12 n þ 5

2 log n þ 3

9 2

n log n þ 12 n þ 7

ð2Þ

On Implementing E±cient Modulo 2n þ 1 Arithmetic Components

Taking the modulo 2 n of Eq. (2) and using A H and B H we then get:  H ðA þ B H Þ ; if ðA H þ B H Þ  2 n jjA þ Bj2 n þ1 j2 n ¼ ðA H þ B H Þ þ 1 ; otherwise :

915

ð3Þ

Let cout denote the carry output of the ðA H þ B H Þ n-bit integer addition. Using it in Eq. (3), we can unify the two cases as: jjA þ Bj2 n þ1 j2 n ¼ jA H þ B H j2 n þ cout :

ð4Þ

Equation (4) indicates that we can derive the n least signi¯cant bits of the weighted modulo 2 n þ 1 addition of A and B by using an inverted end-around carry adder (equivalently, a diminished-1 adder) provided that the sum of its input operands is decreased by one, that is, if we use as inputs the A H and B H vectors. As we will show in the next section, in many arithmetic components this is very easy to derive. The most signi¯cant bit of the weighted sum should be 1, only if jA þ Bj2 n þ1 ¼ 2 n , which since 0  A; B  2 n  1, reduces to A H þ B H ¼ 2 n  1. That is, the most signi¯cant bit of the weighted sum should be 1, only when A H and B H are bit-wise complementary. This is the only case that leads to all generate and propagate bits of the adder to be 0 and 1, respectively. Therefore, it can be easily detected as Pðn1Þ : 0  Gðn1Þ : 0 , where Pðn1Þ : 0 and Gðn1Þ : 0 are the group propagate and group generate signals at the most signi¯cant bit position of the diminished-1 adder. The Gðn1Þ : 0 term is readily available, since it is the carry signal for the least signi¯cant bit of the diminished-1 addition. If a parallel-pre¯x structure is adopted for the diminished-1 adder, then the Pðn1Þ : 0 term can be derived by just one AND gate that accepts the group propagate signals of the previous to the last pre¯x level. Therefore, the extra hardware required for the most signi¯cant bit of the weighted addition is very small (just two two-input AND gates) and since it does not contribute to the critical path of the adder, no delay penalty arises. It should be noted that the derivation of the most signi¯cant bit of the weighted sum requires less area than the corresponding derivation used in the design of the weighted multipliers37 and squarers.40 The diminished-1 adder used in the latter manuscripts requires ðn  1Þ two-input AND gates, where n is the width of the operand(s), and may also require n more XOR gates, depending on whether an inclusive or an exclusive-OR diminished-1 adder is used. Therefore, the proposed modi¯ed diminished-1 adder can be used for attaining smaller implementations of the weighted multipliers37 and squarers.40 Figure 1 presents the scheme that results from the previous analysis. A mapping circuit M accepts the n-bit vectors A and B and provides the vectors A H and B H . These are driven to a modi¯ed diminished-1 adder that is capable of providing the ðn þ 1Þ-bit sum of the weighted modulo 2 n þ 1 addition of A and B. The last line of Table 1 provides area and delay estimates for the modi¯ed diminished-1 adder described above assuming that the parallel-pre¯x architecture22 is followed. Since the modi¯ed diminished-1 adder also o®ers implementation area

916

H. T. Vergos & D. Bakalis

Fig. 1.

Using a modi¯ed diminished-1 adder for n-bit weighted addition.

and execution delay savings over a corresponding weighted adder, we conclude that its use is very attractive if the mapping circuit of Fig. 1 can be designed e±ciently. In the following section, we show that this is the case in many weighted arithmetic components. We present novel MOMAs and RGs, in which the cost of the mapping circuit is extremely small and in some cases zero. Both n-bit and ðn þ 1Þ-bit operands are considered in the MOMA case. 3. Novel Weighted Components In this section we show that the modi¯ed diminished-1 adder incuded in Fig. 1 can be easily incorporated in the design of weighted MOMAs and weighted RGs. The notations MOMAðk; 2 n þ 1Þ and RGðk; 2 n þ 1Þ are used hereafter to indicate a weighted multi-operand modulo 2 n þ 1 adder for k ðn þ 1Þ-bit operands and a circuit that produces the residue of a k-bit weighted number, in modulo 2 n þ 1 arithmetic, respectively. The notation n-bit MOMA is used to denote a MOMAðk; 2 n þ 1Þ, with n-bit wide input operands. 3.1. Novel weighted multi-operand modulo 2 n +1 adders One of the arithmetic components that has been heavily researched in residue arithmetic is the multi-operand modulo adder. Hardware support for multi-operand modulo addition is highly appreciated in several multiply-and-add intensive computations, such as digital ¯ltering, convolution estimation and FFTs.

On Implementing E±cient Modulo 2n þ 1 Arithmetic Components

917

The ¯rst e®ort26 for a weighted MOMA required several parallel-adders connected in series. The architecture of Piestrak,13 is currently considered the most e±cient for weighted 2 n þ 1 MOMAs. Taking advantage of the periodic properties of 2 i taken modulo 2 n þ 1, for designing a MOMAðk; 2 n þ 1Þ it uses an n-bit wide inverted EAC CSA tree for deriving the two ¯nal addends. The use of an inverted EAC CSA tree for reducing the operands in two ¯nal summands is justi¯ed by the fact that a carry output at the most signi¯cant bit position, suppose cn , that has a weight of 2 n , can be complemented and added at the least signi¯cant bit position in the next stage, provided that a correction equal to 1 is taken into account, since it holds that: jcn 2 n j2 n þ1 ¼ j  cn j2 n þ1 ¼ j2 n þ cn j2 n þ1 ¼ jcn  1j2 n þ1 :

ð5Þ

The same technique is used for the most signi¯cant bits of the operands. These are complemented and added at the least signi¯cant position, reducing the width of the tree to n-bits. A total correction term needs also to be added with the adder tree to account for the above operations. Two parallel adders connected in series are used for the addition of the two ¯nal addends, since the architecture of Bayoumi17 is considered for the ¯nal weighted adder. The second adder can be signi¯cantly simpli¯ed since it has one constant operand (it acts as a constant subtractor). Figure 2 presents the weighted MOMAð4; 2 3 þ 1Þ for operands A; B; C and D designed according to Piestrak.13 Let a3 ; b3 ; c3 and d3 denote the most signi¯cant bits of A; B; C and D, respectively, and let a2 a1 a0 , b2 b1 b0 , c2 c1 c0 , d2 d1 d0 denote their rest bits. Blocks labeled HA, FA and FA þ represent half adders, full adders and simpli¯ed full adders because one of their inputs has a constant value equal to 1, respectively. It should be noted that the inverted EAC adder tree required is unbalanced, since the number of input bits at the least signi¯cant bit position is twice that of the other columns. This is attributed to the fact that the inverted most signi¯cant bits of the operands are also added to the least signi¯cant column. The ¯nal row of the CSA tree is used for adding the correction term, which for the MOMAð4; 2 3 þ 1Þ of Fig. 2 is equal to 1. Obviously, one can achieve signi¯cant savings in both area and delay if the modi¯ed diminished-1 adder of Fig. 1 can replace the ¯nal weighted adder of Fig. 2. The modi¯cations required for this (equivalently, the required mapping circuit indicated in Fig. 1) are analyzed in detail below and are shown to be small enough. We ¯rst consider the case of n-bit wide input operands. According to the analysis of Sec. 2, if two n-bit weighted operands are driven to a diminished-1 adder, this will output their modulo 2 n þ 1 sum increased by one. Consider now the addition of k n-bit weighted operands. This can obviously be achieved by ðk  1Þ diminished-1 additions, that will provide the modulo 2 n þ 1 sum increased by ðk  1Þ (irrespectively if these additions are carried out in parallel or in series). For deriving the correct sum, we can use one further addition (in fact a subtraction). Since however this will also increase by 1 the expected sum, in this last addition we have to add a correction term of j  kj2 n þ1 . It should be noted that, if j  kj2 n þ1 ¼ j  1j2 n þ1 ¼ 2 n ,

918

H. T. Vergos & D. Bakalis a3

c3

a2

b3

a1

b2

c1

a0

b1

HA

FA

c0 b0

FA

FA

c s

c s

c s

d2

d1

d0

c s

FA

FA

c s

c s

FA

c s

c2

FA

FA

c s

c s

FA

c s

n-bit wide Inverted EAC CSA tree for the multioperand addition Correction term →

HA

HA

c s

c s

First final operand

Fina l weighted adder

FA

a2 a1 a0 b2 b1 b0 c2 c1 c0 d2 d1 d0 a3 b3 c3 0 0 1 +

+

c s

2nd final operand

3-bit Parallel Adder

d3

d3 is used as the carry input of the first parallel adder

biased result

01 11

4-bit Parallel Adder (Bias Subtractor)

1

0 Multiplexer

|A+B+C+D|9

Fig. 2. The MOMAð4; 2 3 þ 1Þ design according to Piestrak.13

then adding the correction term is not required, since in modulo 2 n þ 1 arithmetic, an inverted EAC CSA addition of A, B with 1 results into A þ B þ ð1Þ þ 1 ¼ A þ B. Finally, the ðn þ 1Þ-bit weighted result can be derived by using the modi¯ed diminished-1 adder of Fig. 1 for the last addition. Example 1. Consider the multi-operand modulo 2 3 þ 1 weighted addition of the 3-bit operands A ¼ 710 ¼ 1112 , B ¼ 410 ¼ 1002 , C ¼ 510 ¼ 1012 and D ¼ 010 ¼ 0002 . Adding A with B and C with D in diminished-1 adders, will provide the sums S1 ¼ 310 ¼ 0112 and S2 ¼ 610 ¼ 1102 , respectively. Then, adding these together by a diminished-1 adder will provide R ¼ 110 ¼ 0012 . Finally, a correction term of E ¼ j  kj9 is added. Since k ¼ 4, we have that E ¼ j  kj9 ¼ 510 ¼ 1012 . The addition of R with E, in a modi¯ed diminished-1 adder yields the expected result 01112 ¼ 710 .

On Implementing E±cient Modulo 2n þ 1 Arithmetic Components

919

Example 2. Consider the multi-operand modulo 2 2 þ 1 weighted addition of the 2-bit operands A; B; C; D; F and G, with A ¼ B ¼ D ¼ G ¼ 110 ¼ 012 , C ¼ 310 ¼ 112 and F ¼ 010 ¼ 002 . The additions of A with B, C with D and F with G in diminished-1 adders, will provide the sums S1 ¼ 310 ¼ 112 , S2 ¼ 010 ¼ 002 and S3 ¼ 210 ¼ 102 , respectively. The diminished-1 addition of S2 and S3 will result in S4 ¼ 310 ¼ 112 . Since there are 6 input operands and j  6j2 2 þ1 ¼ j  1j2 2 þ1 ¼ 4, no correction term is required. To obtain the result we need to add S1 and S4 in a modi¯ed diminished-1 adder that will provide the expected result 0102 ¼ 210 . Instead of using in series or parallel diminished-1 adders, the multi-operand diminished-1 addition described above can be achieved by a diminished-1 MOMA. It has been shown34 that a diminished-1 MOMA can be implemented by an inverted EAC CSA tree and a ¯nal diminished-1 adder. Therefore, we conclude that an n-bit weighted MOMAðk; 2 n þ 1Þ, can be derived by using a diminished-1 MOMAðk þ 1; 2 n þ 1Þ using: .

the correction term E which is equal to j  kj2 n þ1 , as the ðk þ 1)-th summand and . the modi¯ed diminished-1 adder as the ¯nal adder. It is again noted that if j  kj2 n þ1 ¼ j  1j2 n þ1 ¼ 2 n , then a diminished-1 MOMAðk; 2 n þ 1Þ is su±cient since a correction is not required. In the following we extend the above scheme to also account for ðn þ 1Þ-bit operands. Since in modulo 2 n þ 1 arithmetic the most signi¯cant bit of an operand is 1 only when the value of the operand is 2 n and given that j2 n j2 n þ1 ¼ 1, we can still use in the previously described MOMA the n least signi¯cant bits of these operands and decrease the correction term E according to the number of operands that have their most signi¯cant bit equal to 1. That is, if only one operand is equal to 2 n , the correction term should be j  k  1j2 n þ1 , if two operands are equal to 2 n , the correction term should be j  k  2j2 n þ1 and so on. Therefore, a combinational circuit must be used which accepts the most signi¯cant bits of the k operands and outputs the correction term that should be used. Example 3. Consider the multi-operand modulo 2 3 þ 1 addition of the 4-bit operands A ¼ 810 ¼ 10002 , B ¼ 410 ¼ 01002 , C ¼ 610 ¼ 01102 and D ¼ 310 ¼ 00112 . Let a3 ; b3 ; c3 and d3 denote the most signi¯cant bits of the input operands, respectively, and let a2 a1 a0 , b2 b1 b0 , c2 c1 c0 , d2 d1 d0 denote their rest bits. Figure 3 presents the proposed implementation for a weighted MOMAð4; 2 3 þ 1Þ. On the left of the ¯gure, a small circuit based on half adder blocks and logic gates is used to derive the correction term that should be used. The circuit is actually a modi¯ed 1s counter. The inverted EAC CSA tree is shown on the right. It reduces the summands in two ¯nal operands that are added in a modi¯ed diminished-1 adder. For the assumed values, the correction term E ¼ e2 e1 e0 computed by the circuit is 4 ¼ j  4  1j2 3 þ1 . When used in the adder tree, the two ¯nal operands that result

920

H. T. Vergos & D. Bakalis

Fig. 3. Proposed weighted MOMAð4; 2 3 þ 1Þ.

are 0002 and 0102 . The modi¯ed diminished-1 adder then produces the expected sum 00112 ¼ 310 . Comparing the implementations presented in Figs. 2 and 3 we can see that, apart from the area and time savings expected by the replacement of the ¯nal weighted adder with the modi¯ed diminished-1 adder, the inverted EAC adder tree in the proposed MOMA is totally balanced and therefore shallower. It should be also noted that: .

since the correction term is derived later than the rest of the summands, it should be driven to the last stage of the inverted EAC CSA tree. . since the correction term can be derived by an 1s counter with a translator at its outputs composed of single gates, it can be derived earlier than needed in the CSA tree and therefore does not add any delay on its critical path. . since the correction term can receive all values in the ½j  2kj2 n þ1 ; j  kj2 n þ1  range, n in MOMAs with k  d 2 2þ1 e, a value of 2 n is possible. In these practically few cases, the combinational circuit must be designed so as to provide a ðn þ 1Þ-bit correction term. The most signi¯cant bit of this correction term, should be used to control a multiplexer that drives the inputs of the modi¯ed diminished-1 adder. If this bit is 0, then the inputs of the modi¯ed diminished-1 adder are driven by the last stage of the CSA tree, that is, the stage used to add the remaining bits of the correction term. If this bit is 1, then the inputs of the modi¯ed diminished-1 adder are driven directly by the inputs of the last stage of the CSA tree, that is, without the addition of any correction term. Let us now consider two architectures that have been recently proposed for multipliers37 and squarers,40 which use a ¯nal diminished-1 parallel adder. In the

On Implementing E±cient Modulo 2n þ 1 Arithmetic Components

921

weighted multipliers case, n weighted partial products are derived. Taking into account that several groups of bits of these partial products cannot simultaneously be at 1,30 each partial product is reduced to n bits width. Then, a diminished-1 MOMAðn þ 1; 2 n þ 1Þ is used for their addition. Obviously, the weighted n-bit MOMA introduced above, followed by the modi¯ed diminished-1 adder of Sec. 2, can be used for the addition of the partial products. Since the correction term required for the partial product formation37 is 2 n ð2 n  n  1Þ and the diminished-1 MOMAðn þ 1; 2 n þ 1Þ requires, according to the above analysis, a further correction of n, a total correction term of j2 n ð2 n  n  1Þ  nj2 n þ1 ¼ 2 is used as the ðn þ 1Þ-th input of the MOMA. The same concept can also be used in the design of weighted squarers, with the only exception being that the number of partial products is approximately halved compared to the multipliers case. Let us then consider the case of a parallel weighted adder or equivalently the case of the weighted MOMAð2; 2 n þ 1Þ. According to the above analysis, the correction term required is j  2j2 n þ1 ¼ 2 n  1 ¼ 111 . . . 1112 , j  3j2 n þ1 ¼ 2 n  2 ¼ 111 . . . 1102 and j  4j2 n þ1 ¼ 2 n  3 ¼ 111 . . . 1012 when none, one and both operands are equal to 2 n , respectively. If we denote the two operands as A ¼ an an1    a0 and B ¼ bn bn1    b0 , then the correction term can be represented by the vector 111 . . . 1ðan NAND bn Þðan XNOR bn Þ2 . The attained implementation is shown in Fig. 4 and is the same with the one derived following a di®erent approach.25 Finally, the scheme of Fig. 4 gives rise to a di®erent weighted MOMA architecture. Given k weighted input operands, we can group them into d k2 e pairs (if k is odd then the last operand is paired with the 0 operand) and use d k2 e mapping circuits

an bn

M an-1 bn-1

an-2 bn-2

a2 b2

FA+

FA+

FA+

a1

an bn

b1

FA

a0

b0

FA

Modified Diminished-1 adder

sn

sn-1

s2

s1

s0

Fig. 4. Weighted parallel adder using a modi¯ed diminished-1 adder.

922

H. T. Vergos & D. Bakalis

similar to M of Fig. 4 to derive 2  d k2 e n-bit vectors. These vectors can then be added by the n-bit weighted MOMAð2  d k2 e; 2 n þ 1Þ introduced earlier. However, this architecture would be area e±cient only when both k and n have small values, since a large value of k would lead to an increased number of mapping circuits, whereas a large value of n would increase the area required by each mapping circuit. This architecture is therefore expected to be less e±cient than the one proposed earlier.

3.2. Novel residue generators In order for an operand to be used as input to any modulo 2 n þ 1 arithmetic component, it has to be converted from the binary representation to its weighted 2 n þ 1 representation. This is done by a residue generator circuit (also known as forward converter). For attaining the residue of A taken modulo 2 n þ 1 a divider circuit can be used. However, the resulting circuits are very slow and unnecessarily complex. Several more e±cient combinational architectures have been proposed.1316 The residue generators proposed by Piestrak13 are more e±cient, for the modulo 2 n þ 1 case, than the rest proposals that do not take advantage of the periodic properties of the powers of 2 taken modulo 2 n þ 1,14 or require a complex converter at the output stage,15 or use modular parallel adders16 that require more area and execution time than the binary ones.13 The RG architecture proposed by Piestrak13 was derived by partitioning the k bits of the operand in n-bit groups, suppose g0 ; g1 ; . . . ; gd nk e , starting from the least signi¯cant bits. It then holds that:   k   ð6Þ jAj2 n þ1 ¼  gd nk e  2 d n en þ    þ g3  2 3n þ g2  2 2n þ g1  2 n þ g0  2 0  n 2 þ1

and since for i  0, j2 i n j2 n þ1 ¼ ð1Þ i , from Eq. (6) we can get that: 8    > > k      g3 þ g2  g1 þ g0  ; þg  > d e < n 2 n þ1 jAj2 n þ1 ¼

  > > > :  g d k e þ     g3 þ g2  g1 þ g0  n

2 n þ1

;

  k if even ; n   k if odd : n

ð7Þ

Taking advantage that for an n-bit vector g it holds that j  gj2 n þ1 ¼ jg þ 2j2 n þ1 , where g denotes the bit-by-bit complement of g, we get: 8    > > k þ    þ ðg þ 2Þ þ g þ ðg þ 2Þ þ g ; g  > dne 3 2 1 0 n < 2 þ1 jAj2 n þ1 ¼

  > > > :  ðg d k e þ 2Þ þ    þ ðg 3 þ 2Þ þ g2 þ ðg 1 þ 2Þ þ g0  n

  k even ; n   k if odd : n if

2 n þ1

;

ð8Þ

On Implementing E±cient Modulo 2n þ 1 Arithmetic Components

923

Based on Eq. (8), the RGðk; 2 n þ 1Þ circuits proposed by Piestrak13 are designed by the following steps: (i) Starting from the least signi¯cant bits, the k bits are partitioned in the g0 ; g1 ; . . . ; gd k e groups, of n bits each. If the last group contains less than n bits, n then 0s are assumed at its most signi¯cant bit positions. (ii) The bits of the odd numbered groups are inverted and a correction term of 2 is accounted for every inverted group. If the last group is an odd numbered one and incomplete at bit positions d; d þ 1; . . . ; n  1, then a correction term of Pn1 i i¼d 2 needs to be also accounted for. (iii) An inverted EAC CSA tree is used to add the numbers of all groups along with a required total correction term, until two n-bit ¯nal operands are derived. The total correction term is equal to E, with 0  E  2 n . E does not only include the correction required due to (ii), but also incorporates the correction due to the inverted EAC CSA tree itself. (iv) A ¯nal weighted adder, consisting of two binary adders connected in series (the second adder is actually a constant subtractor and therefore can be signi¯cantly simpli¯ed) and a multiplexer17 are used to derive the residue. The proposed modi¯ed diminished-1 adder proposed in Sec. 2 can obviously replace the ¯nal weighted adder of (iv), just by using jE  1j2 n þ1 instead of E as the total correction term. When E is 0, then jE  1j2 n þ1 ¼ j  1j2 n þ1 and therefore no correction term is at all required in the proposed architecture. In this case, apart from the area and time savings because of the replacement of the ¯nal weighted adder with the modi¯ed diminished-1 adder, there are also area savings in the inverted EAC CSA tree, since it has one input operand less and maybe further time savings, if the CSA tree depth is reduced due to the reduction of the operands. Such a case is presented in the following example. Example 4. Consider the design of an RGð9; 2 3 þ 1Þ for A ¼ a8 a7    a0 . The input bits are partitioned in three groups: g0 ¼ fa2 ; a1 ; a0 g, g1 ¼ fa5 ; a4 ; a3 g and g2 ¼ fa8 ; a7 ; a6 g. The bits of g1 are then inverted and along with the bits of the other groups and a total correction term are the inputs of an inverted EAC CSA tree. The total correction term that the architecture of Piestrak13 requires is 0, but its addition cannot be omitted since this would alter the number of inverted EACs. The adder tree provides two ¯nal addends that are the inputs of a weighted modulo 2 3 þ 1 adder. The design of the RG according to Piestrak13 is shown in Fig. 5(a). The second row, composed of HAs, is required in the inverted EAC CSA tree in order to take into account the zero total correction term and cannot be omitted. In the proposed architecture we do not need to add any correction and therefore the number of the input operands of the adder tree is reduced. The derived RG is shown in Fig. 5(b). Apart from the savings in both area and time because of the use of the proposed modi¯ed diminished-1 adder, further savings in both area and time are

924

H. T. Vergos & D. Bakalis

(a)

(b)

Fig. 5. An RGð9; 2 3 þ 1Þ designed (a) according to Piestrak13 and (b) according to the proposed method.

achieved due to the simpli¯ed CSA tree. Considering the proposed architecture and as an example the value A ¼ 14310 ¼ 0100011112 , we have that the ¯nal operands at the inputs of the proposed modi¯ed diminished-1 adder are 0 1 1 and 1 0 0. Since these are complementary, the most signi¯cant bit of the residue is 1. The rest bits are computed by the diminished-1 addition of the ¯nal operands, which provides 0 0 0 at the least signi¯cant bit positions. We therefore get j143j2 3 þ1 ¼ 10002 ¼ 810 . The proposed RGs can also be used for deriving the residue of a k-bit binary number A in its diminished-1 representation. The diminished-1 representation A ¦ of a weighted binary number A is given by: ( jA  1j2 n þ1 ; if jAj2 n þ1 6¼ 0 ; ¦ A ¼ ð9Þ 0; if jAj2 n þ1 ¼ 0 : Usually, along with A ¦ , a °ag is also derived to indicate a zero value since this case is considered separately by arithmetic components that accept operands in the diminished-1 representation. Obviously, the RGs proposed earlier can produce the diminished-1 residue if we decrease the correction term used by one more, that is, if

On Implementing E±cient Modulo 2n þ 1 Arithmetic Components

925

we use as correction term the value jE  2j2 n þ1 instead of the value jE  1j2 n þ1 . In such a case, the n least signi¯cant bits of the RG result form the diminished-1 residue, while the most signi¯cant bit is the zero indication °ag. As before, when jE  2j2 n þ1 ¼ j  1j2 n þ1 ¼ 2 n , then the correction does not have to be included in the inverted EAC CSA tree of the RG.

4. Evaluation and Comparisons In this section, we compare the architectures derived in Sec. 3 for multi-operand modulo adders and residue generators against those proposed by Piestrak,13 both qualitatively and quantitatively. For our qualitative comparisons, we use the approximations of the unit-gate model.45 We consider that a multiplexer has a delay of two equivalents and an area of three46 and that all binary adders follow the KoggeStone47 parallel-pre¯x carry computation architecture whereas all diminished-1 adders used follow the fastest parallel-pre¯x architecture.22 The MOMA(k; 2 n þ 1) architecture of Piestrak13 uses an unbalanced inverted EAC CSA tree, followed by an n-bit parallel adder with a carry input, an ðn þ 1Þ-bit constant adder and ðn þ 1Þ two to one multiplexers. The proposed architecture, on the other hand, uses a modi¯ed 1s counter for computing the required correction term, that is inserted at the last stage of the balanced CSA tree. The outputs of the tree are ¯nally driven to the modi¯ed diminished-1 adder. In the RG(k; 2 n þ 1) case, both architectures use the same CSA tree for reducing the d nk e þ 1 inputs in two ¯nal summands. In a few cases (as the one exempli¯ed earlier) the proposed architecture requires one input less. The two summands in the architecture proposed by Piestrak13 for RGs are driven to an n-bit parallel adder and its result is driven to a second ðn þ 1Þ-bit constant adder for bias removal. Finally, the multiplexers choose the correct result between these two. In the proposed architecture only a modi¯ed diminished-1 adder is used after the CSA tree. Table 2 lists the unit-gate area and delay estimations for several components used in the MOMAs and RGs under comparison, as well as the total estimations for each compared architecture. ðaÞ denotes the minimum depth of an a-input adder tree according to the Dadda architecture.48 The estimation for the area of the CSA tree required in the architecture proposed by Piestrak13 for MOMAs should be considered as lower bounds. The same holds for the area estimation of the circuit that is used in the proposed architecture to compute the correction term. This is attributed to the fact that the number of half-adders required in the unbalanced CSA tree of Piestrak13 cannot be easily computed, neither can the gate equivalents of the combinational circuit that produces the correction term in the proposed architecture. We have considered the latter as having an area complexity equal to that of a 1s counter. The closed forms of Table 2 reveal that the proposed MOMA and RG circuits outperform the previous proposal in both area and delay terms.

926

H. T. Vergos & D. Bakalis Table 2. Area and delay estimations provided by the unit-gate model.

13

CSA tree of MOMA n-bit binary adder with carry input (n+1)-bit constant binary adder without carry input Proposed MOMA CSA tree & correction term computation Modi¯ed diminished-1 modulo 2 n þ 1 adder 13

Piestrak

MOMA(k; 2 n þ 1)

Proposed MOMA(k; 2 n þ 1) CSA tree of RG 13

Piestrak

RG(k; 2 n þ 1)

Proposed RG(k; 2 n þ 1)

Delay

Area

4½ð2k þ 1Þ  1

7ðn þ 1Þðk  1Þ

2 log n þ 3

nð3 log n þ 2Þ þ 3

2 logðn þ 1Þ  1

ð3n þ 4Þ logðn þ 1Þ þ n

4ðk þ 1Þ

7½nðk  1Þ þ k  log k

2 log n þ 3

9 2

n log n þ 12 n þ 7

4ð2k þ 1Þ þ 2 log n þ 2 logðn þ 1Þ

nf7k  1 þ 3½log n þ logðn þ 1Þg þ 4 logðn þ 1Þ þ 7k  1

4ðk þ 1Þ þ 2 log n þ 3

nð7k þ 92 log n  13 2Þ þ 7ðk  log k þ 1Þ

4ðd nk e þ 1Þ

7nðd nk e  1Þ

4ðd nk e þ 1Þ þ 2½log n þ logðn þ 1Þ þ 4

nf3½log n þ logðn þ 1Þ þ 7d nk e  1g þ 4 logðn þ 1Þ þ 6

4ðd nk e þ 1Þ þ 2 log n þ 3

nð7d nk e þ 92 log n  13 2Þþ7

Since the unit-gate model is unable to provide estimations for the power consumed by each architecture and also ignores area and delay due to wiring, the above closed forms should be considered only as indicative. To attain realistic area and delay results and to compare the power consumption of each architecture, MOMAs and RGs for several values of k and n were described in HDL. The second adder (bias subtractor) used in the architectures of Piestrak13 was simpli¯ed since one of its operands is constant. After simulating the resulting descriptions, the designs were mapped to a 90 nm power characterized CMOS standard cell library, assuming typical process parameters. A bottom-up approach was followed during mapping. Each basic cell (for example the half and full adder cell) was iteratively optimized until no further delay savings were possible. The cells then underwent successive area recovery steps. Finally, \don't touch" primitives were applied to them. Then, the same optimization procedure was applied to every subcomponent and successively to the entire circuit. In this way, every design was mapped in the target technology as an interconnection of already optimized blocks and the architecture in each description was preserved as much as possible. All constraints, such as maximum fanout, output capacitance, and available input drive strength, were kept constant for all architectures. The derived netlists for each architecture and the layout constraints for achieving the predicted area and delay were then passed to a standard cell place and route tool. The °oorplan initialization information was kept constant for each compared MOMA and RG. After a constraint driven place and route procedure the netlists were back annotated with the extracted information and static

On Implementing E±cient Modulo 2n þ 1 Arithmetic Components

927

timing analysis was performed. For obtaining the power results, we followed a simulation driven approach. We applied 30,000 random input vectors at a 500 MHz frequency at each netlist (the same vectors were applied at the corresponding netlists of the architectures under comparison) and measured the average power dissipation using a commercial power estimator. Tables 3 and 4 present the attained results. All delay, area and average power results are expressed in ns, m 2 and mW, respectively. The results show that in every examined case the proposed designs outperform those of Piestrak13 in delay, area and average power consumption terms. The

Table 3. Experimental results for MOMAs. Piestrak's architecture13 MOMA k

n

Proposed architecture

Savings

Delay

Area

Power

Delay

Area

Power

Delay

Area

Power

(ns)

(m 2 )

(mW)

(ns)

(m 2 )

(mW)

(%)

(%)

(%)

4 8 12 16

4 4 4 4

0.80 1.01 1.14 1.23

3,188 5,847 8,257 10,729

1.93 3.07 4.72 6.30

0.54 0.77 0.96 1.01

2,588 4,887 7,730 10,095

1.29 2.24 3.46 4.57

32.5 23.8 15.8 17.9

18.8 16.4 6.4 5.9

33.0 27.0 26.8 27.5

4 8 12 16

8 8 8 8

0.94 1.17 1.31 1.39

6,465 11,402 15,719 20,391

3.83 5.71 8.19 11.50

0.61 0.83 0.94 1.10

5,323 9,658 13,947 17,801

2.56 5.00 6.56 9.34

35.1 29.1 28.2 20.9

17.7 15.3 11.3 12.7

33.2 12.4 19.9 18.8

4 8 12 16

16 16 16 16

1.11 1.32 1.43 1.54

13,237 23,066 31,653 40,162

6.94 11.30 16.30 24.10

0.68 0.90 1.00 1.10

11,274 19,542 27,614 35,696

5.59 10.70 13.30 18.20

38.7 31.8 30.1 28.6

14.8 15.3 12.8 11.1

19.5 5.3 18.4 24.5

Table 4. Experimental results for RGs. Piestrak's architecture13 RG k

n

Delay

Area

(ns)

(m 2 )

Power

Proposed architecture

Savings

Delay

Area

Power

Delay

Area

Power

(mW)

(ns)

(m 2 )

(mW)

(%)

(%)

(%)

8 16 32 64

4 4 4 4

0.55 0.74 0.88 1.08

1,543 2,556 4,872 9,170

0.88 1.65 3.58 7.08

0.33 0.54 0.67 0.86

1,259 2,347 4,527 8,766

0.67 1.44 3.21 6.69

40.0 27.0 23.9 20.4

18.4 8.2 7.1 4.4

24.0 12.9 10.3 5.5

16 32 64 128

8 8 8 8

0.69 0.89 1.04 1.23

3,631 5,657 10,071 18,744

1.99 3.55 7.26 14.24

0.41 0.61 0.74 0.93

2,984 5,122 9,545 18,095

1.55 3.12 6.72 13.79

40.6 31.5 28.8 24.4

17.8 9.5 5.2 3.5

22.1 12.2 7.4 3.2

32 64 128 256

16 16 16 16

0.84 1.04 1.18 1.38

8,563 12,689 21,646 39,022

4.29 7.47 15.03 28.93

0.49 0.68 0.80 1.01

7,008 11,374 20,317 36,580

3.50 6.78 14.13 28.32

41.7 34.6 32.2 26.8

18.2 10.4 6.1 6.3

18.3 9.3 6.0 2.1

928

H. T. Vergos & D. Bakalis

proposed MOMA circuits are on the average 27.7% faster, 13.2% more compact and consume 22.2% less power while the proposed RG circuits are 31.0% faster, 9.6% more compact and consume 11.1% less power. 5. Conclusions We proposed the replacement of the ¯nal weighted parallel adder, present in every weighted modulo 2 n þ 1 arithmetic component, by a modi¯ed diminished-1 adder. We have shown that the latter is able to output the correct result provided that the sum of its input operands has been decreased by one. Given the current state of the art in the architectures proposed for weighted and diminished-1 adders, this replacement provides smaller and faster arithmetic components if the input operands can be decreased with small or no cost. Two applications have been investigated, namely multi-operand weighted modulo adders and residue generators. In both, the hardware required for the decrement was shown to be su±ciently small or nothing at all. We have also shown that the partial products addition scheme adopted in some fast weighted multipliers and squarers as well as the weighted parallel addition are actually straightforward applications of the introduced MOMAs and that they can bene¯t from the use of the proposed modi¯ed diminished-1 adder. The experimental results, derived by implementing several multi-operand modulo adders and residue generators in static CMOS, indicate that the proposed approach for MOMAs and RGs o®ers signi¯cant savings in the delay, area and consumed power compared to the currently most e±cient proposals.

References 1. H. Nozaki et al., Implementation of RSA algorithm based on RNS Montgomery multiplication, Proc. 3rd Int. Workshop on Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, Vol. 2162 (Springer-Verlag, 2001), pp. 364376. 2. V. K. Zadiraka and E. A. Melekhina, Computer implementation of e±cient discreteconvolution algorithms, Cybernet. Syst. Anal. 30 (1994) 106114. 3. M. A. Soderstrand et al., Residue Number System Arithmetic: Modern Applications in Digital Signal Processing (IEEE Press, 1986). 4. P. V. A. Mohan, Residue Number Systems: Algorithms and Architectures (SpringerVerlag, 2002). 5. A. Omondi and B. Premkumar, Residue Number Systems: Theory and Implementations (Imperial College Press, 2007). 6. J. Ramirez et al., RNS-enabled digital signal processor design, Electron. Lett. 38 (2002) 266268. 7. J. Ramirez et al., Design and implementation of high-performance RNS wavelet proccessors using custom IC technologies, J. VLSI Signal Process. 34 (2003) 227237. 8. G. C. Cardarilli, A. Nannarelli and M. Re, Reducing power dissipation in FIR ¯lters using the residue number system, Proc. IEEE 43rd IEEE Midwest Symp. Circuits and Systems (2000), pp. 320323.

On Implementing E±cient Modulo 2n þ 1 Arithmetic Components

929

9. Y. Liu and E. M.-K. Lai, Moduli set selection and cost estimation for RNS-based FIR ¯lter and ¯lter bank design, Design Automation for Embedded Systems 9 (2004) 123139. 10. U. Meyer-Bäse, A. Garcia and F. Taylor, Implementation of a communications channelizer using FPGAs and RNS arithmetic, J. VLSI Signal Process. 28 (2001) 115128. 11. J. Ramirez et al., Fast RNS FPL-based communications receiver design and implementation, Proc. 12th Int. Conf. Field Programmable Logic, Lecture Notes in Computer Science, Vol. 2438 (Springer-Verlag 2002), pp. 472481. 12. M. Panella and G. Martinelli, An RNS architecture for quasi-chaotic oscillators, J. VLSI Signal Process. 33 (2003) 199220. 13. S. J. Piestrak, Design of residue generators and multioperand modular adders using carrysave adders, IEEE Trans. Comput. 43 (1994) 6877. 14. G. Alia and E. Martineli, Designing multioperand modular adders, Electron. Lett. 32 (1996) 2223. 15. P. V. A. Mohan, E±cient design of binary to RNS converters, J. Circuits Syst. Comput. 9 (1999) 145154. 16. A. B. Premkumar, E. L. Ang and E. M.-K. Lai, Improved memoryless RNS forward converter based on the periodicity of residues, IEEE Trans. Circuits Syst. II 53 (2006) 133137. 17. M. Bayoumi, G. Jullien and W. Miller, A VLSI implementation of residue adders, IEEE Trans. Circuits Syst. 34 (1987) 284288. 18. M. Dugdale, VLSI implementation of residue adders based on binary adders, IEEE Trans. Circuits Syst. II 39 (1992) 325329. 19. A. Hiasat, High-speed and reduced-area modular adder structures for RNS, IEEE Trans. Comput. 51 (2002) 8489. 20. C. Efstathiou, H. T. Vergos and D. Nikolos, Fast parallel-pre¯x modulo 2 n þ 1 adders, IEEE Trans. Comput. 53 (2004) 12111216. 21. R. Zimmerman, E±cient VLSI implementation of modulo ð2 n  1Þ addition and multiplication, Proc. 14th IEEE Symp. Computer Arithmetic (1999), pp. 158167. 22. H. T. Vergos, C. Efstathiou and D. Nikolos, Diminished-one modulo 2 n þ 1 adder design, IEEE Trans. Comput. 51 (2002) 13891399. 23. C. Efstathiou, H. T. Vergos and D. Nikolos, Modulo 2 n  1 adder design using select pre¯x blocks, IEEE Trans. Comput. 52 (2003) 13991406. 24. H. T. Vergos and C. Efstathiou, On the design of e±cient modular adders, J. Circuits Syst. Comput. 14 (2005) 965972. 25. H. T. Vergos and C. Efstathiou, A unifying approach for weighted and diminished-1 modulo 2 n þ 1 addition, IEEE Trans. Circuits Syst. II 55 (2008) 10411045. 26. L. Skavantzos, Design of multi-operand carry-save adders for arithmetic modulo (2 n þ 1), Electron. Lett. 25 (1989) 11521153. 27. C. K. Koc and C. Y. Hung, Multi-operand modulo addition using carry-save adders, Electron. Lett. 26 (1990) 361363. 28. B. Cao, C.-H. Chang and T. Srikanthan, A new formulation of fast diminished-1 multioperand modulo 2 n þ 1 adder, Proc. IEEE Int. Symp. Circuits and Systems (2005), pp. 656659. 29. A. A. Hiasat, A memoryless modð2 n  1Þ residue multiplier, Electron. Lett. 28 (1992) 314315. 30. A. Wrzyszcz and D. Milford, A new modulo 2 a þ 1 multiplier, Proc. Int. Conf. Computer Design (1993), pp. 614617.

930

H. T. Vergos & D. Bakalis

31. Z. Wang, G. A. Jullien and W. C. Miller, An e±cient tree architecture for modulo 2 n þ 1 multiplication, J. VLSI Signal Process. 14 (1996) 241248. 32. Y. Ma, A simpli¯ed architecture for modulo ð2 n þ 1Þ multiplication, IEEE Trans. Comput. 47 (1998) 333337. 33. A. Curiger, VLSI architectures for computations in ¯nite rings and ¯elds, Ph.D. thesis, Swiss Federal Institute of Technology (1993). 34. C. Efstathiou, H. T. Vergos, G. Dimitrakopoulos and D. Nikolos, E±cient diminished-1 modulo 2 n þ 1 multipliers, IEEE Trans. Comput. 54 (2005) 491496. 35. L. Sousa and R. Chaves, A universal architecture for designing e±cient modulo 2 n þ 1 multipliers, IEEE Trans. Circuits Syst. I 52 (2005) 11661178. 36. R. Chaves and L. Sousa, Faster modulo 2 n þ 1 multipliers without Booth recoding, Proc. XX Conf. Design of Circuits and Integrated Systems (2005). 37. H. T. Vergos and C. Efstathiou, Design of e±cient modulo 2 n þ 1 multipliers, IET Proc. Comput. Digit. Techn. 1 (2007) 4957. 38. S. J. Piestrak, Design of squarers modulo A with low-level pipelining, IEEE Trans. Comput. 49 (2002) 3141. 39. H. T. Vergos and C. Efstathiou, Diminished-1 modulo 2 n þ 1 squarer design, IEE Proc. Comput. Digit. Techn. 152 (2005) 561566. 40. H. T. Vergos and C. Efstathiou, E±cient modulo 2 k þ 1 squarers, Proc. XXI Conf. Design of Circuits and Integrated Systems (2006). 41. D. Adamidis and H. T. Vergos, RNS multiplication/sum-of-squares, IET Proc. Comput. Digit. Techn. 1 (2007) 3848. 42. L. M. Leibowitz, A simpli¯ed binary arithmetic for the fermat number transform, IEEE Trans. Acoust. Speech Signal Process. 24 (1976) 356359. 43. S. J. Piestrak, Design of high-speed residue-to-binary number system converter based on Chinese remainder theorem, Proc. Int. Conf. Computer Design (1994), pp. 508511. 44. H. T. Vergos and D. Bakalis, On the use of diminished-1 adders for weighted modulo 2 n þ 1 arithmetic components, Proc. 11th Euromicro Conf. Digital Systems Design (2008), pp. 752759. 45. A. Tyagi, A reduced-area scheme for carry-select adders, IEEE Trans. Comput. 42 (1993) 11631170. 46. R. Zimmermann, Binary adder architectures for cell-based VLSI and their synthesis, Ph.D. thesis, Swiss Federal Institute of Technology (1997). 47. P. M. Kogge and H. S. Stone, A parallel algorithm for the e±cient solution of a general class of recurrence equations, IEEE Trans. Comput. 22 (1973) 786792. 48. L. Dadda, Some schemes for parallel multipliers, Alta Frequenza 34 (1965) 349356.