
CSA-based Design of Feedforward Scalable Montgomery Modular Multiplier

Tao Wu
Department of Microelectronics and Nanoelectronics, Tsinghua University, Beijing 100084
Email: [email protected]

Shuguo Li and Litian Liu
Institute of Microelectronics, Tsinghua University, Beijing 100084
Emails: [email protected] [email protected]

Abstract—A scalable Montgomery modular multiplier is composed of a queue of processing elements, and the total computation time is proportional to the latency between adjacent elements. With the feedforward architecture proposed by Huang et al., this latency can be reduced from 2 clock cycles to 1 clock cycle. This paper presents both radix-2 and radix-4 CSA-based designs of the new architecture. With Booth encoding and the auxiliary encoding, the radix-4 design is faster than the radix-2 design and superior to it in terms of Time×Area.

Keywords—Scalable, Montgomery modular multiplication, CSA, feedforward

I. Introduction

For public-key cryptographic applications such as RSA cryptography [12], Diffie-Hellman key exchange [2], and elliptic curve cryptography [10], [3], the information processing involves many long-precision modular multiplications. In fact, the efficiency of modular multiplication constrains the overall performance of a cryptosystem or coprocessor. To reduce this computational effort, Montgomery modular multiplication has been widely used [11]; it replaces modular reductions by bit shifts in the binary number system. In 1999, Montgomery modular multiplication with a scalable architecture was first proposed [16], which is able to process long-precision modular multiplications with a fixed-precision unit [17]. Such an architecture can be considered a parallel implementation of the Montgomery modular multiplication algorithm. Since then, a unified scalable architecture for both $GF(p)$ and $GF(2^m)$ has been considered in [13], and high-radix designs of such scalable architectures have been implemented in [15], [18], [8], [9], [7], [1].

Besides high-radix implementations, the original radix-2 design has been improved by decreasing the latency between adjacent processing elements. In [4], the latency is decreased from 2 clock cycles to 1 clock cycle by left shifting the multiplicand and the modulus. In [14], a latency of only 1 clock cycle is also obtained by a pipelined quotient, which reorganizes the original operations into two separate parts. Recently, Huang et al. have proposed a feedforward architecture to decrease the latency in such scalable Montgomery modular multipliers [5], which demonstrates an efficiency improvement of 23% over [4] for FPGA implementation.

In this paper, we present both radix-2 and radix-4 CSA-based designs for the feedforward scalable Montgomery modular multiplier in [5], which are superior to the scalable non-redundant designs in [5]. For the radix-4 design, the application of an auxiliary encoding and the modified Booth encoding improves the performance as measured by Time×Area. The experiment also shows that the CSA-based radix-4 design may be more efficient than the radix-2 design in such a feedforward scalable architecture.

The rest of this paper is organized as follows: Section II introduces the scalable Montgomery modular multiplication algorithms; Section III presents our CSA-based designs of radix-2 and radix-4 feedforward scalable Montgomery modular multipliers; Section IV presents the experiment results with both ASIC and FPGA implementations; the last section concludes the paper.

978-1-4673-0753-6/11/$26.00 ©2011 IEEE

II. Scalable Montgomery Modular Multiplication Algorithm

The so-called scalability derives from word-based operations in the pipelined parallel algorithm, by which a long-precision modular multiplication can be carried out with more pipeline cycles. The scalable algorithm differs from the classic Montgomery algorithm mainly in two points:
• It interleaves the modular reduction and the multiplication;
• It performs parallel word-based operations in a pipeline.
In other words, with the help of the FIFO, the scalable Montgomery algorithm distributes the whole computation over a number of processing elements (PEs) in a circular pipeline.

A. Radix-2 Scalable Montgomery Modular Multiplication

Here the scalable Montgomery modular multiplication algorithms are written in carry-save addition (CSA), or redundant, form, which has already been used in scalable Montgomery modular multiplication [13], [7], [14].
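The interleaving of multiplication and reduction described above can be sketched at the integer level, before any word splitting or carry-save representation is introduced. The following Python model is only an illustration under stated assumptions: the function name `montgomery_radix2` and its argument names are hypothetical, and full-width integers stand in for the word-serial hardware datapath.

```python
def montgomery_radix2(a: int, b: int, m: int, n: int) -> int:
    """Return a * b * 2^(-n) mod m for odd m, scanning b one bit at a time.

    Assumes 0 <= a, b < m < 2^n. Illustrative integer-level model only.
    """
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(n):
        b_i = (b >> i) & 1
        s += b_i * a          # multiplication step: add b_i * A
        q = s & 1             # quotient bit: 1 exactly when s is odd
        s = (s + q * m) >> 1  # reduction step: make s even, then shift right
    if s >= m:                # s stays below 2m, so one subtraction suffices
        s -= m
    return s
```

Each loop iteration performs one partial-product addition and one reduction by a right shift, which is the interleaving that the processing elements carry out word by word.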


For convenience, we define the sum of three $k$-bit integers $a$, $b$, $c$ in redundant form as $(TC, TS) := a + b + c$, where $TC$ has $k + 1$ bits and $TS$ has $k$ bits, with the least significant bit of $TC$ being zero.

Algorithm: Radix-2 scalable Montgomery modular multiplication [17], [13]

Input: $A$, $B$ and $M$ are integers with $n$ binary bits. $A$ and $M$ are separated into $e$ words, with $e = \lceil (n + 1)/w \rceil$. $A_{j,k}$ and $M_{j,k}$ denote the $k$-th bit of the $j$-th word, and $TC^{(i)}_{j,k}$ and $TS^{(i)}_{j,k}$ denote the $k$-th bit of the $j$-th word in the $i$-th outer loop. $B = (b_{n-1} \cdots b_1 b_0)_2$.

Output: $S = A B 2^{-n} \pmod{M}$, where $0 \le S < M$.

Method:
$TC^{(0)} := 0$, $TS^{(0)} := 0$;
$C_1 := 1'b0$, $C_2 := 1'b0$;
for $i = 0$ to $n - 1$ do
  $TC'_0 + TS'_0 := b_i A_0 + TC^{(i)}_0 + TS^{(i)}_0$;
  $C_1 := TC'_{0,w}$;
  $TC''_0 := \{TC'_{0,w-1..1}, 1'b0\}$;
  $Q^{(i)} := TS'_{0,0}$;
  $TC'''_0 + TS''_0 := Q^{(i)} M_0 + TC''_0 + TS'_0$;
  $TC^{(i+1)}_0 := TC'''_{0,w..1}$;
  $TS^{(i+1)}_{0,w-2..0} := TS''_{0,w-1..1}$;
  for $j = 1$ to $e - 1$ do
    $TC'_j + TS'_j := b_i A_j + TC^{(i)}_j + TS^{(i)}_j$;
    $C_2 := TC'_{j,w}$;
    $TC''_j := \{TC'_{j,w-1..1}, C_1\}$;
    $C_1 := C_2$;
    $TC'''_j + TS''_j := Q^{(i)} M_j + TC''_j + TS'_j$;
    $TS^{(i+1)}_{j-1,w-1} := TS''_{j,0}$;
    $TS^{(i+1)}_{j,w-2..0} := TS''_{j,w-1..1}$;
    $TC^{(i+1)}_j := TC'''_{j,w..1}$;
  end for
  $TS^{(i+1)}_{e-1,w-1} := 0$;
end for
$S := TC^{(n)} + TS^{(n)}$;
if $S \ge M$ then
  $S := S - M$;
end if
return $S$.

By the use of CSA, the critical path of a carry-ripple addition is reduced to the delay of 2 exclusive-OR gates, which is a universal technique in VLSI design and computer arithmetic.

B. Radix-4 Scalable Montgomery Modular Multiplication

The CSA-based design of a radix-4 scalable Montgomery modular multiplier has been reported in [18], [6], in which the modified Booth encoding and an auxiliary encoding are used.
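The redundant form $(TC, TS)$ used by both the radix-2 and the radix-4 algorithm can be illustrated with a bitwise 3:2 compressor. The sketch below is a full-width Python model, not the word-serial datapath of the paper. The names `csa` and `montgomery_radix2_csa` are hypothetical, and the model assumes $0 \le A, B < M < 2^n$ with $M$ odd.

```python
from typing import Tuple


def csa(a: int, b: int, c: int) -> Tuple[int, int]:
    """3:2 carry-save compression: return (tc, ts) with tc + ts == a + b + c."""
    ts = a ^ b ^ c                            # bitwise sum, no carry propagation
    tc = ((a & b) | (a & c) | (b & c)) << 1   # carries, moved one position up
    return tc, ts


def montgomery_radix2_csa(a: int, b: int, m: int, n: int) -> int:
    """Radix-2 Montgomery product kept in redundant (tc, ts) form until the end."""
    tc, ts = 0, 0
    for i in range(n):
        b_i = (b >> i) & 1
        tc, ts = csa(tc, ts, b_i * a)   # add the partial product b_i * A
        q = ts & 1                      # tc has LSB 0, so the parity sits in ts
        tc, ts = csa(tc, ts, q * m)     # add q * M; both LSBs are now 0
        tc >>= 1                        # exact halving of the redundant sum
        ts >>= 1
    s = tc + ts                         # single carry-propagate addition
    return s - m if s >= m else s
```

Because the least significant bit of `tc` is always zero, the quotient bit can be read from `ts` alone, which corresponds to the step $Q^{(i)} := TS'_{0,0}$ in the algorithm above. Only the final conversion needs a carry-propagate adder.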
Algorithm: Radix-4 scalable Montgomery modular multiplication

Input: $A$, $B$ and $M$ are $n$-bit binary numbers. $A$ and $M$ are separated into $e$ words while $B$ is divided into $k$ digits, with $e = \lceil n/w \rceil + 1$, $k = \lfloor n/2 \rfloor + 2$. Respectively, $A = (A_{e-1} \ldots A_1 A_0)_{2^w}$, $M = (M_{e-1} \ldots M_1 M_0)_{2^w}$, and $B = (B_{k-1} \ldots B_1 B_0)_4 = (b_{n-1} \ldots b_1 b_0)_2$, with $B_{k-1} = 2'b00$, $B_{k-2,1} = 1'b0$. $A_{j,l}$, $M_{j,l}$ and $B_{j,l}$ denote the $l$-th bit of the $j$-th word of $A$, $M$ and $B$, and $TC^{(i)}$ and $TS^{(i)}$ denote their values in the $i$-th outer loop.

Output: $S = A \cdot B \cdot 2^{-2k} \pmod{M}$, $-M < S < M$.

Method:
$TC^{(0)} := 0$, $TS^{(0)} := 0$, $b_{-1} := 1'b0$, $C_{loop} := 1'b0$;
for $i = 0$ to $k - 1$ do
  $TC := TC^{(i)}$, $TS := TS^{(i)}$;
  $\beta_i := \mathrm{BoothEncode}((b_{2i+1} b_{2i} b_{2i-1})_2)$;
  $TC'_0 + TS'_0 := TC_0 + TS_0 + (\beta_i A)_0$;
  $C_a := TC'_{0,w}$;
  $TC''_0 := TC'_{0,w-1..0}$;
  $Q_i := \mathrm{AuxEncode}(M_{0,1}, (TC''_0 + TS'_0 + C_{loop}) \bmod 4)$;
  $TC'''_0 + TS''_0 := TC''_0 + TS'_0 + Q_i \times M_0$;
  $C_b := TC'''_{0,w}$;
  $TC^{(i+1)}_{0,w-3..0} := TC'''_{0,w-1..2}$;
  $TS^{(i+1)}_{0,w-3..0} := TS''_{0,w-1..2}$;
  $C_{loop} := TC'''_{0,1} \,|\, TS''_{0,1}$;
  for $j = 1$ to $e - 1$ do
    $C_a \cdot 2^w + TC'_j + TS'_j := TC_j + TS_j + \beta_i A_j + C_a$;
    $C_b \cdot 2^w + TC''_j + TS''_j := TC'_j + TS'_j + Q_i M_j + C_b$;
    $TC^{(i+1)}_{j-1,w-1..w-2} := TC''_{j,1..0}$;
    $TS^{(i+1)}_{j-1,w-1..w-2} := TS''_{j,1..0}$;
    $TC^{(i+1)}_{j,w-3..0} := TC''_{j,w-1..2}$;
    $TS^{(i+1)}_{j,w-3..0} := TS''_{j,w-1..2}$;
  end for
  $TC^{(i+1)}_{e-1,w-1..0} := \{TC^{(e-1)}_{w-1}, TC^{(e-1)}_{w-1}, TC^{(e-1)}_{w-1..2}\}$;
  $TS^{(i+1)}_{e-1,w-1..0} := \{TS^{(e-1)}_{w-1}, TS^{(e-1)}_{w-1}, TS^{(e-1)}_{w-1..2}\}$;
end for
$S := TC^{(k)} + TS^{(k)} + C_{loop}$;
if $S < 0$ then
  $S := S + M$;
else
  $S := S$;
end if
return $S$.

In the above algorithm, BoothEncode denotes the modified Booth encoding, and AuxEncode denotes the auxiliary encoding. The $i$-th quotient is $Q_i$, and the $i$-th intermediate result $S^{(i)}$ is represented in the redundant form $TC^{(i)} + TS^{(i)}$. Booth encoding reduces the partial products from $B_i A$ to $\beta A$, with $\beta = \pm 1, \pm 2, 0$ (Tab. I). When $\beta = \pm 2$, obtaining the Booth-coded partial product $bcp_j$ requires a word shift of $A_j$. Taking $\beta = 2$ as an example, Fig. 1 shows the selection of $(2A)_{j,m-1..0}$.
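The roles of BoothEncode and AuxEncode in the algorithm above can be checked with an integer-level model. The following Python sketch is an illustration only: it works on full-width integers instead of words in carry-save form, the function names are hypothetical, and `aux_encode` assumes the rule stated with Tab. II, namely $Q_i = s_{1..0} \cdot (4 - M^{-1}_{1..0}) \bmod 4$ with the value 3 replaced by $-1$.

```python
from typing import List


def booth_encode(b_hi: int, b_mid: int, b_lo: int) -> int:
    """Modified Booth digit for the bit triple (b_{2i+1}, b_{2i}, b_{2i-1})."""
    return -2 * b_hi + b_mid + b_lo


def booth_digits(b: int, k: int) -> List[int]:
    """Recode b into k radix-4 digits in [-2, 2], taking b_{-1} = 0."""
    def bit(t: int) -> int:
        return (b >> t) & 1 if t >= 0 else 0
    return [booth_encode(bit(2 * i + 1), bit(2 * i), bit(2 * i - 1))
            for i in range(k)]


def aux_encode(m01: int, s: int) -> int:
    """Quotient digit in [-1, 2] such that s + Q * M is divisible by 4."""
    m_low = 2 * m01 + 1          # M mod 4; for odd M this is its own inverse mod 4
    q = (s * (4 - m_low)) % 4
    return -1 if q == 3 else q   # replace 3 by -1, which is congruent mod 4


def montgomery_radix4(a: int, b: int, m: int, k: int) -> int:
    """Return a * b * 4^(-k) mod m, assuming 0 <= a, b < m, m odd, b < 2^(2k-3)."""
    s = 0
    for beta in booth_digits(b, k):
        s += beta * a                          # Booth-coded partial product
        q = aux_encode((m >> 1) & 1, s % 4)    # look at the lowest two bits
        s = (s + q * m) // 4                   # exact division by the radix
    return s + m if s < 0 else s               # final correction of a negative S
```

The intermediate value `s` may become negative, which is why the final step adds $M$ instead of subtracting it.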
By the definition of the Montgomery algorithm, $M$ is odd and therefore $M_{0,0} = 1$. As a result, the look-up table of $Q_i$ (Tab. II) only refers to $M_{0,1}$, which is the second least significant bit of $M$. In Tab. II, $s_{1..0}$ denotes the least significant two bits of the temporary result at the beginning of the $i$-th loop:

$$s_{1..0} = (TC''_0 + TS'_0 + C_{loop}) \bmod 4, \qquad Q_i = s_{1..0} \times \left(4 - (M_{0,1..0})^{-1}\right) \bmod 4.$$

Furthermore, $TC''_0 + TS'_0 = TS^{(i)}_0 + TC^{(i)}_0 + (\beta A)_0$, so the detection of $Q_i$ refers to 1) the lowest word from the previous outer loop, $TC''_0 + TS'_0$, and 2) the lowest word of the partial product $B_i A$ in the current loop, $(\beta A)_0$.

TABLE I: Booth encoding, $\beta = \mathrm{Booth}(b_{2i+1} b_{2i} b_{2i-1})$

| $b_{2i+1}$ | $b_{2i}$ | $b_{2i-1}$ | $\beta$ | sign bit |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | +1 | 0 |
| 0 | 1 | 0 | +1 | 0 |
| 0 | 1 | 1 | +2 | 0 |
| 1 | 0 | 0 | -2 | 1 |
| 1 | 0 | 1 | -1 | 1 |
| 1 | 1 | 0 | -1 | 1 |
| 1 | 1 | 1 | 0 | 0 |

[Fig. 1 (diagram not recoverable): selection of $(2A)_{j,m-1..0}$ from the bits $A_{j,0}, A_{j,1}, \ldots, A_{j,m-1}$ and $A_{j-1,0}, A_{j-1,1}, \ldots, A_{j-1,m-1}$]

TABLE II: Encoding quotient digit $Q_i$ [18], $Q_i = s_{1..0} \cdot (4 - M^{-1}_{1..0}) \bmod 4$, with $s_{1..0}$ the lowest 2 bits of the first word in the $i$-th loop

| $M_{0,1}$ | $s_1$ | $s_0$ | $Q_i$ |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | -1 |
| 0 | 1 | 0 | 2 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 2 |
| 1 | 1 | 1 | -1 |

For convenience, a sign bit is set separately to account for negative $\beta$, so that only inversion and shift are required in the word-based operations on $A_j$. The loop bit $C_{loop}$ is a carry bit at the first word of an outer loop, which offsets the loss of 1 if $TC^{(i)}_{0,1..0} + TS^{(i)}_{0,1..0}$ equals $(10)_2 + (10)_2$, $(11)_2 + (01)_2$, or $(01)_2 + (11)_2$. In the encoding of Tab. II, $q_i = 3$ has been replaced by $Q_i = -1$ to simplify the generation of the partial products $Q_i M_j$. This is valid as long as $q_i \equiv Q_i \pmod 4$ at every step, which does not disrupt the determination logic. In contrast to the modified Booth encoding, there is no need to set another sign bit for $M$, because the least significant bit of $-M$ can be directly set as 1'b0.

After the application of the aforementioned encoding tables, there are $\beta_i \in [-2, 2]$, $Q_i \in [-1, 2]$, and [18]

$$|S_i| \le \frac{1}{2}(M + A) \times \frac{1 - 4^{-i}}{1 - 4^{-1}} < \frac{2}{3}(M + A),$$

where $S_i$ is the temporary result after $i$ outer loops. Choosing $0 \le B < M < 2^{2n-3}$ and $0 \le A < M$ in the algorithm, then $\beta_{m-1} = \mathrm{Booth}(b_{2n-1} b_{2n-2} b_{2n-3}) = 0$, and the last iteration can be reduced to [18]:

2, we can make sure that enough sign bits remain, and the final result is still correct.

[Fig. 2: Feedforwarding between processing elements]

[Fig. 3: Dataflow in feedforward scalable Montgomery modular multiplier]

B. CSA-based Radix-2 Design

The computation in a processing element of the CSA-based radix-2 design is shown in Fig. 4. At the current clock $j$, the most significant bits $LC_j$ and $LS_j$ are unavailable; at the next clock $j + 1$, however, they are available and are added up to a 2-bit vector $(u_1 u_0)_2 = LS_j + LC_j$ in $\{00, 01, 10\}$. This vector determines the carry-in to the current processing element for the next word as well as the most significant bit of $TS_j$. If $u_0 = 0$, then $z_0$ is kept as the top bit of $TS_j$ while $u_1$ is used as the carry bit to the next, more significant word; if $u_0 = 1$, then there must be $u_1 = 0$, and therefore the top bit of $TS_j$ reads $\bar{z}_0$ and the carry bit equals $z_0$. The logic is shown in Fig. 5.

[Fig. 4: Computation in a CSA-based radix-2 processing element]

[Fig. 5: Radix-2 processing element]

C. CSA-based Radix-4 Design

In the CSA-based radix-4 processing elements, the temporary results have to shift right by 2 bits at every clock, and the determination of the most significant bits and the carries by forwarding becomes more complex. Fig. 6 shows the computation in such a processing element, in which $t_0$ and $c_0$ read

$$t_0 = z_1 \oplus (x \oplus y), \qquad c_0 = z_1 \,\&\, (x \oplus y) + (x \,\&\, y),$$

with $x \oplus y$ and $x \,\&\, y$ computed beforehand, and $(c_1, t_2, t_1) = z_0 + (u_2, u_1, u_0)$, where $(u_2 u_1 u_0)_2 = LS_j + LC_j$, are determined by Tab. III.

[Fig. 6: Computation in a CSA-based radix-4 processing element]

TABLE III: Forwarding logic in radix-4 processing element

| $(u_1, u_0)$ | $(0, 0)$ | $(0, 1)$ | $(1, 0)$ | $(1, 1)$ |
|---|---|---|---|---|
| $(c_1, t_2, t_1)$ | $(u_2, 0, z_0)$ | $(u_2, z_0, \bar{z}_0)$ | $(u_2, 1, z_0)$ | $(z_0, \bar{z}_0, \bar{z}_0)$ |

[Fig. 7: Radix-4 processing element]

In particular, the case $(u_1, u_0) = (1, 1)$ appears only when $u_2 = 0$, or else there would be $(u_2, u_1, u_0) = (1, 1, 1)$, which contradicts the fact that $(u_2 u_1 u_0)_2 \le (11)_2 + (11)_2 = (110)_2$. As a result, there is $(c_1, t_2, t_1) = (z_0, \bar{z}_0, \bar{z}_0)$.

However, at the end of each inner loop, or the last word, the logic for $u_1, u_0$ should be replaced by $\{\tilde{u}_1, \tilde{u}_0\}$, which is decided by the transferred sign-extension bits and $z_0$. For example, if the sign bits for $SC2$ and $SS2$ are respectively $s_1$ and $s_0$, then $\{\tilde{u}'_1, \tilde{u}'_0\} \Leftarrow (\{s_1, s_1\} + \{s_0, s_0\}) \bmod 4$, where the arrow $\Leftarrow$ means that the result passes through a register before assignment. Then $\{\tilde{u}_1, \tilde{u}_0\} = \{\tilde{u}'_1, \tilde{u}'_0\} + z_0$. Obviously, the first addition can be replaced by a 3:1 multiplexer with respect to $\{s_1, s_0\} = 2'b00$, $2'b11$, or others. In addition, at the last word the carry bits are meaningless due to sign extension.

The processing element of the CSA-based radix-4 design is shown in Fig. 7, where $Q_i M_j$ and $B_i A_j$ are determined by two look-up tables [18], [6], derived respectively from the auxiliary encoding table (Tab. II) and the Booth encoding table (Tab. I).
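The forwarding rules of the radix-2 and radix-4 processing elements can be checked exhaustively with a small model. The sketch below reads each output tuple as the bits of the late-arriving vector plus $z_0$, which requires the complement of $z_0$ in the cases where a carry ripples; this reading, and the function names `forward_radix2` and `forward_radix4`, are assumptions made for illustration.

```python
from typing import Tuple


def forward_radix2(u1: int, u0: int, z0: int) -> Tuple[int, int]:
    """Return (carry, top_bit) for (u1 u0)_2 + z0, with (u1 u0)_2 in {00, 01, 10}."""
    if u0 == 0:
        return u1, z0        # z0 passes through, u1 becomes the carry
    return z0, 1 - z0        # u0 = 1 implies u1 = 0: top bit flips, carry is z0


def forward_radix4(u2: int, u1: int, u0: int, z0: int) -> Tuple[int, int, int]:
    """Return (c1, t2, t1) for (u2 u1 u0)_2 + z0, with (u2 u1 u0)_2 <= (110)_2."""
    nz0 = 1 - z0             # complement of z0 (assumed where a carry ripples)
    table = {
        (0, 0): (u2, 0, z0),
        (0, 1): (u2, z0, nz0),
        (1, 0): (u2, 1, z0),
        (1, 1): (z0, nz0, nz0),   # only reachable with u2 == 0
    }
    return table[(u1, u0)]
```

Since every case is a fixed selection among $u_2$, $z_0$, its complement and constants, the forwarding can be built from multiplexers without a carry chain.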

IV. Experiment Result

The CSA-based designs have been described in the Verilog hardware description language and implemented on a Xilinx Virtex2 6000FF1517-4 FPGA. The synthesis tool is Synplify Pro 9.6.2, and the place-and-route tool is Xilinx ISE 10.1. For the sake of comparison, a 1024-bit Montgomery modular multiplication is used as the reference for both the time delay and the area cost, as shown in Tab. IV. When the number of words is much larger than $L$, FIFO registers are needed to buffer the intermediate results; when it is much smaller than $L$, some dummy clock cycles have to be inserted to wait for the earlier results. In other words, the first case is short of processing elements, while the second case has redundant processing elements. A scalable architecture is supposed to work frequently in the first case. In Tab. IV, the parameter $(L, w) = (32, 16)$ is chosen to represent such cases.

In Tab. IV, the experiment results with different word sizes are given. While all of the designs in this work are scalable, the depth of the FIFO registers is fixed at 2 for $L \times w = N$ and increases to a necessary number $2^d$ in other cases. The area of the FIFO is also included in the area cost. Meanwhile, the clock cycles do not include the input and output clock cycles, which are $2(N/w)$ clock cycles with respect to different designs. The radix-4 designs take fewer clock cycles, but their maximum frequencies are lower and their area is larger. However, the total time for a full Montgomery modular multiplication decreases considerably, and the performance in terms of Time×Area is still improved. In this work, the critical path of the radix-2 designs is determined by the logic with shift registers, while that of the radix-4 designs is determined by the processing elements. In particular, for the radix-2 scalable design with $(L, w) = (32, 16)$ the critical path is diverted to reading the FIFO, whose depth is $2^6$.
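The Time and Time×Area figures of merit used in this comparison follow from the cycle count, the maximum frequency and the LUT count. The short Python sketch below recomputes them for two rows of Tab. IV (the radix-2 and radix-4 designs with $(L, w) = (64, 16)$). The helper names are hypothetical, and small differences from the tabulated values can arise from rounding.

```python
def time_us(cycles: int, freq_mhz: float) -> float:
    """Multiplication time in microseconds: cycles divided by frequency in MHz."""
    return cycles / freq_mhz


def time_area(cycles: int, freq_mhz: float, luts: int) -> float:
    """Time x Area product in LUT * microseconds."""
    return time_us(cycles, freq_mhz) * luts


# (cycles, max frequency in MHz, area in LUTs), taken from Tab. IV
radix2_64_16 = (1187, 150.3, 9384)
radix4_64_16 = (662, 120.5, 13172)

for name, row in (("radix-2", radix2_64_16), ("radix-4", radix4_64_16)):
    print(name, round(time_us(row[0], row[1]), 2), round(time_area(*row) / 1e5, 3))
```

Although the radix-4 design runs at a lower clock frequency and occupies more LUTs, its cycle count is low enough that both its time and its Time×Area product come out smaller for this parameter choice.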
In [5], only the radix-2 design is implemented with a scalable architecture, while the radix-2-b, radix-2-c, and radix-4-a designs are non-scalable. By contrast, the radix-2 scalable designs with $w \le 32$ perform better than their counterpart in [5], and the radix-4 scalable designs are even faster than the radix-2 scalable designs, which is due to the higher frequency of the CSA-based design. In fact, the radix-4 scalable design is even faster than the radix-4 non-scalable design in [5]. The performance in terms of Time×Area is also improved, as shown in Tab. IV. Meanwhile, the results show that the parameter $(L, w) = (16, 64)$ is less efficient for hardware implementation, both in speed and in terms of Time×Area. Therefore, it is in general proper to choose the word size $w < \sqrt{N}$.

TABLE IV: Implementation of 1024-bit feedforward scalable Montgomery modular multiplier in Xilinx Virtex2 6000FF1517-4 FPGA

| Reference | Design | $(L, w)$ | Cycles | Max Freq. (MHz) | Time (µs) | Area (LUTs) | Time×Area ($10^5$ LUT·µs) |
|---|---|---|---|---|---|---|---|
| This work | radix-2 | (64, 16) | 1187 | 150.3 | 7.9 | 9384 | 0.75 |
| This work | radix-2 | (32, 32) | 1155 | 143.0 | 8.1 | 9375 | 0.76 |
| This work | radix-2 | (16, 64) | 1187 | 107.0 | 11.0 | 11006 | 1.22 |
| This work | radix-2 | (32, 16) | 2211 | 148.5 | 14.9 | 7239 | 1.08 |
| This work | radix-4 | (64, 16) | 662 | 120.5 | 5.5 | 13172 | 0.72 |
| This work | radix-4 | (32, 32) | 614 | 108.1 | 5.7 | 12336 | 0.70 |
| This work | radix-4 | (16, 64) | 614 | 103.0 | 6.0 | 12661 | 0.75 |
| This work | radix-4 | (32, 16) | 1190 | 118.2 | 10.1 | 8143 | 0.82 |
| [5] | radix-2 | (65, 16) | 1088 | 116.4 | 9.3 | 9319 | 0.87 |
| [5] | radix-2-b | (65, 16) | 1088 | 106.4 | 10.2 | 5356 | 0.55 |
| [5] | radix-2-c | (33, 32) | 1056 | 104.7 | 10.1 | 5310 | 0.54 |
| [5] | radix-4-a | (65, 16) | 576 | 99.3 | 5.8 | 13370 | 0.78 |

V. Conclusion

This work complements the CSA-based design of the feedforward scalable Montgomery modular multiplier, where both radix-2 and radix-4 designs are demonstrated. The CSA-based designs run faster than the comparable designs in [5]. Although they consume extra area, they gain better performance in terms of Time×Area. In particular, after the application of Booth encoding and the auxiliary encoding in feedforward scalable Montgomery modular multiplication, the CSA-based radix-4 design achieves high performance and is superior to the CSA-based radix-2 design and the non-scalable radix-4 design.

Acknowledgement

The authors would like to thank the reviewer for the comments and the editor for the help in processing the paper. This work is supported by the National Natural Science Foundation of China (No. 61073173).

References


[1] P. Amberg, N. Pinckney, and D. Harris. Parallel high-radix Montgomery multipliers. In 42nd Asilomar Conference on Signals, Systems and Computers, pages 772–776, 2008.
[2] W. Diffie and M. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 22:644–654, 1976.
[3] D. Hankerson, A. Menezes, and S. Vanstone. Springer, New York, 2004.
[4] D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, and S. Hsu. An improved unified scalable radix-2 Montgomery multiplier. In 17th IEEE Symposium on Computer Arithmetic, pages 172–178, 2005.
[5] M. Huang, K. Gaj, and T. El-Ghazawi. New hardware architectures for Montgomery modular multiplication algorithm. IEEE Transactions on Computers, 60(7):923–936, 2011.
[6] A. Ibrahim, H. Elsimary, and A. Nassar. Design and implementation of scalable low power radix-4 Montgomery modular multiplier. In International Conference on Computer Engineering & Systems, pages 395–400, 2007.
[7] N. Jiang and D. Harris. Quotient pipelined very high radix scalable Montgomery multipliers. In Fortieth Asilomar Conference on Signals, Systems and Computers, pages 1673–1677, 2006.
[8] K. Kelley and D. Harris. Parallelized very high radix scalable Montgomery multipliers. In 39th Asilomar Conference on Signals, Systems, and Computers, pages 1196–1200, 2005.
[9] K. Kelley and D. Harris. Very high radix scalable Montgomery multipliers. In Fifth International Workshop on System-on-Chip for Real-Time Applications, pages 400–404, 2005.
[10] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 48:203–209, 1987.
[11] P. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519–521, 1985.
[12] R. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21:120–126, 1978.
[13] E. Savaş, A. Tenca, and Ç. Koç. Scalable and unified multiplier architecture for finite fields GF(p) and GF(2^m). In Second International Workshop on Cryptographic Hardware and Embedded Systems, pages 277–292, Worcester, USA, 2000.
[14] M. Shieh and W. Lin. Word-based Montgomery modular multiplication algorithm for low-latency scalable architectures. IEEE Transactions on Computers, 59(8):1145–1151, 2010.
[15] A. Tenca, G. Todorov, and Ç. Koç. High-radix design of a scalable modular multiplier. In Third International Workshop on Cryptographic Hardware and Embedded Systems, volume 2162 of Lecture Notes in Computer Science, pages 189–205, Paris, France, 2001. Springer, Berlin, Germany.
[16] A. Tenca and Ç. Koç. Scalable architecture for Montgomery multiplication. In First International Workshop on Cryptographic Hardware and Embedded Systems, pages 94–108, Worcester, USA, 1999.
[17] A. Tenca and Ç. Koç. A scalable architecture for modular multiplication based on Montgomery's algorithm. IEEE Transactions on Computers, 52:1215–1221, 2003.
[18] A. Tenca and L. Tawalbeh. An efficient and scalable radix-4 modular multiplier design using recoding techniques. In 37th Asilomar Conference on Signals, Systems and Computers, pages 1445–1450, 2003.
