CSA-based Design of Feedforward Scalable Montgomery Modular Multiplier
Tao Wu
Department of Microelectronics and Nanoelectronics, Tsinghua University, Beijing 100084
Email: [email protected]
Shuguo Li and Litian Liu
Institute of Microelectronics, Tsinghua University, Beijing 100084
Emails: [email protected] [email protected]
Abstract—A scalable Montgomery modular multiplier is composed of a queue of processing elements, and the total computation time is proportional to the latency between such elements. With the feedforward architecture proposed by Huang et al., the latency can be brought down from 2 clock cycles to 1 clock cycle. This paper presents both radix-2 and radix-4 CSA-based designs of the new architecture. With Booth coding and the auxiliary coding, the radix-4 design is superior to the radix-2 design in terms of Time×Area.
Keywords—Scalable, Montgomery modular multiplication, CSA, feedforward
I. Introduction
For public-key cryptographic applications such as RSA cryptography [12], Diffie-Hellman key exchange [2], and elliptic curve cryptography [10], [3], information processing involves a large number of long-precision modular multiplications. In fact, the efficiency of such modular multiplication constrains the overall performance of a cryptosystem or a coprocessor. To reduce this computational effort, Montgomery modular multiplication [11] has been widely used; it replaces modular reductions by bit shifts in the binary number system. In 1999, a scalable architecture for Montgomery modular multiplication was first proposed [16], which is able to process very long-precision modular multiplications with a fixed-precision unit [17]. Such an architecture can be considered as a parallel implementation of the Montgomery modular multiplication algorithm. Since then, a unified scalable architecture for both $GF(p)$ and $GF(2^m)$ has been considered in [13], and high-radix designs of such scalable architectures have been implemented in [15], [18], [8], [9], [7], [1].
Besides high-radix implementations, the original radix-2 design has been improved by decreasing the latency between adjacent processing elements. In [4], the latency has been decreased from 2 clock cycles to 1 clock cycle by left shifting the multiplicand and the modulus. In [14], the low latency of only 1 clock cycle is also obtained by a pipelined quotient, which reorganizes the original operations into two separate parts. Recently, Huang et al. have proposed a feedforward architecture to decrease the latency in such scalable Montgomery
978-1-4673-0753-6/11/$26.00 ©2011 IEEE
modular multipliers [5], which demonstrates an efficiency improvement of 23% over that in [4] for FPGA implementation.
In this paper, we present both radix-2 and radix-4 CSA-based designs for the feedforward scalable Montgomery modular multiplier in [5], which are superior to the scalable non-redundant designs in [5]. For the radix-4 design, the application of an auxiliary encoding and the modified Booth encoding results in an improvement of the performance, as measured by Time×Area. Also, the experiment shows that the CSA-based radix-4 design may be more efficient than the radix-2 design in such a feedforward scalable architecture.
The rest of this paper is organized as follows: Section II introduces the scalable Montgomery modular multiplication algorithms; Section III presents our CSA-based designs of radix-2 and radix-4 feedforward scalable Montgomery modular multipliers; Section IV presents the experimental results with both ASIC and FPGA implementations; and the last section concludes the paper.
II. Scalable Montgomery Modular Multiplication Algorithm
The so-called scalability derives from word-based operations in the pipelined parallel algorithm, by which a long-precision modular multiplication can be carried out with more pipeline cycles. In fact, the scalable algorithm mainly differs from the classic Montgomery algorithm in two points:
• It interleaves the modular reduction and the multiplication;
• It performs parallel word-based operations in a pipeline.
In other words, with the help of the FIFO, the scalable Montgomery algorithm distributes the whole computation over a number of processing elements (PEs) in a circular pipeline.
A. Radix-2 Scalable Montgomery Modular Multiplication
Here the scalable Montgomery modular multiplication algorithms are written in carry-save addition (CSA) or redundant form, which has already been used in scalable Montgomery modular multiplication [13], [7], [14].
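As a small illustration of the redundant form used in the algorithms of this section, the sketch below compresses three integers into a carry word and a sum word without propagating carries. It is a software model only. The function name `csa` and the bit-width argument `k` are chosen here for illustration and are not taken from the designs discussed in this paper.

```python
from __future__ import annotations


def csa(a: int, b: int, c: int, k: int) -> tuple[int, int]:
    """3:2 carry-save compression of three k-bit integers.

    Returns (tc, ts) such that tc + ts == a + b + c, where ts holds the
    per-bit sums (k bits) and tc holds the per-bit carries shifted left
    by one position (k + 1 bits, least significant bit zero).
    """
    mask = (1 << k) - 1
    a, b, c = a & mask, b & mask, c & mask
    ts = a ^ b ^ c                            # per-bit sum, no carry chain
    tc = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority = carry
    return tc, ts


if __name__ == "__main__":
    tc, ts = csa(0b1011, 0b0110, 0b1101, k=4)
    print(f"TC={tc:05b} TS={ts:04b} TC+TS={tc + ts}")
```

Each output bit depends only on the three input bits of the same position, which is why the delay of such an adder does not grow with the word size.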
For convenience, we define the sum of three k-bit integers a, b, c in redundant form as $(TC, TS) := a + b + c$, where $TC$ has $k+1$ bits and $TS$ has $k$ bits, with the least significant bit of $TC$ being zero.
Algorithm: Radix-2 scalable Montgomery modular multiplication [17], [13]
Input: A, B and M are integers with n binary bits. A and M are separated into e words, with $e = \lceil (n+1)/w \rceil$. $A_{j,k}$ and $M_{j,k}$ denote the k-th bit of the j-th word; $TC^{(i)}_{j,k}$ and $TS^{(i)}_{j,k}$ denote the k-th bit of the j-th word in the i-th outer loop. $B = (b_{n-1} \cdots b_1 b_0)_2$.
Output: $S = A B 2^{-n} \pmod{M}$, where $0 \le S < M$.
Method:
1: $TC^{(0)} := 0$, $TS^{(0)} := 0$;
2: $C_1 :=$ 1'b0, $C_2 :=$ 1'b0;
3: for $i = 0$ to $n-1$ do
4:   $TC'_0 + TS'_0 := b_i A_0 + TC^{(i)}_0 + TS^{(i)}_0$;
5:   $C_1 := TC'_{0,w}$;
6:   $TC''_0 := \{TC'_{0,w-1..1},$ 1'b0$\}$;
7:   $Q^{(i)} := TS'_{0,0}$;
8:   $TC'''_0 + TS''_0 := Q^{(i)} M_0 + TC''_0 + TS'_0$;
9:   $TC^{(i+1)}_0 := TC'''_{0,w..1}$;
10:   $TS^{(i+1)}_{0,w-2..0} := TS''_{0,w-1..1}$;
11:   for $j = 1$ to $e-1$ do
12:     $TC'_j + TS'_j := b_i A_j + TC^{(i)}_j + TS^{(i)}_j$;
13:     $C_2 := TC'_{j,w}$;
14:     $TC''_j := \{TC'_{j,w-1..1}, C_1\}$;
15:     $C_1 := C_2$;
16:     $TC'''_j + TS''_j := Q^{(i)} M_j + TC''_j + TS'_j$;
17:     $TS^{(i+1)}_{j-1,w-1} := TS''_{j,0}$;
18:     $TS^{(i+1)}_{j,w-2..0} := TS''_{j,w-1..1}$;
19:     $TC^{(i+1)}_j := TC'''_{j,w..1}$;
20:   end for
21:   $TS^{(i+1)}_{e-1,w-1} := 0$;
22: end for
23: $S := TC^{(n)} + TS^{(n)}$;
24: if $S > M$ then
25:   $S := S - M$;
26: end if
27: return $S$.
By the use of CSA, the critical path of a carry-ripple addition is reduced to the delay of 2 exclusive-OR gates, which is a universal technique in VLSI design and computer arithmetic.
B. Radix-4 Scalable Montgomery Modular Multiplication
The CSA-based design of a radix-4 scalable Montgomery modular multiplier has been reported in [18], [6], in which the modified Booth encoding and an auxiliary encoding are used.
Algorithm: Radix-4 scalable Montgomery modular multiplication
Input: A, B and M are n-bit binary numbers; A and M are separated into e words while B is divided into k digits, with $e = \lceil n/w \rceil + 1$ and $k = \lfloor n/2 \rfloor + 2$. Respectively, $A = (A_{e-1} \ldots A_1 A_0)_{2^w}$, $M = (M_{e-1} \ldots M_1 M_0)_{2^w}$, and $B = (B_{k-1} \ldots B_1 B_0)_4 = (b_{n-1} \ldots b_1 b_0)_2$, with $B_{k-1} =$ 2'b00 and $B_{k-2,1} =$ 1'b0. $A_{j,l}$, $M_{j,l}$ and $B_{j,l}$ denote the l-th bit of the j-th word of A, M and B, and $TC^{(i)}$ and $TS^{(i)}$ denote their values in the i-th outer loop.
Output: $S = A \cdot B \cdot 2^{-2k} \pmod{M}$, $-M < S < M$.
Method:
1: $TC^{(0)} := 0$, $TS^{(0)} := 0$, $b_{-1} :=$ 1'b0, $C_{loop} :=$ 1'b0;
2: for $i = 0$ to $k-1$ do
3:   $TC := TC^{(i)}$, $TS := TS^{(i)}$;
4:   $\beta_i := \mathrm{BoothEncode}((b_{2i+1} b_{2i} b_{2i-1})_2)$;
5:   $TC'_0 + TS'_0 := TC_0 + TS_0 + (\beta_i A)_0$;
6:   $C_a := TC'_{0,w}$;
7:   $TC''_0 := TC'_{0,w-1..0}$;
8:   $Q_i := \mathrm{AuxEncode}(M_{0,1}, (TC''_0 + TS'_0 + C_{loop}) \bmod 4)$;
9:   $TC'''_0 + TS''_0 := TC''_0 + TS'_0 + Q_i \times M_0$;
10:   $C_b := TC'''_{0,w}$;
11:   $TC^{(i+1)}_{0,w-3..0} := TC'''_{0,w-1..2}$;
12:   $TS^{(i+1)}_{0,w-3..0} := TS''_{0,w-1..2}$;
13:   $C_{loop} := TC'''_{0,1} \,|\, TS''_{0,1}$;
14:   for $j = 1$ to $e-1$ do
15:     $C_a \cdot 2^w + TC'_j + TS'_j := TC_j + TS_j + \beta_i A_j + C_a$;
16:     $C_b \cdot 2^w + TC''_j + TS''_j := TC'_j + TS'_j + Q_i M_j + C_b$;
17:     $TC^{(i+1)}_{j-1,w-1..w-2} := TC''_{j,1..0}$;
18:     $TS^{(i+1)}_{j-1,w-1..w-2} := TS''_{j,1..0}$;
19:     $TC^{(i+1)}_{j,w-3..0} := TC''_{j,w-1..2}$;
20:     $TS^{(i+1)}_{j,w-3..0} := TS''_{j,w-1..2}$;
21:   end for
22:   $TC^{(i+1)}_{e-1,w-1..0} := \{TC^{(e-1)}_{w-1}, TC^{(e-1)}_{w-1}, TC^{(e-1)}_{w-1..2}\}$;
23:   $TS^{(i+1)}_{e-1,w-1..0} := \{TS^{(e-1)}_{w-1}, TS^{(e-1)}_{w-1}, TS^{(e-1)}_{w-1..2}\}$;
24: end for
25: $S := TC^{(k)} + TS^{(k)} + C_{loop}$;
26: if $S < 0$ then
27:   $S := S + M$;
28: else
29:   $S := S$;
30: end if
31: return $S$.
In the above algorithm, BoothEncode denotes the modified Booth encoding, and AuxEncode denotes the auxiliary encoding. The i-th quotient is $Q_i$, and the i-th intermediate result $S^{(i)}$ is represented in redundant form as $TC^{(i)} + TS^{(i)}$. Booth encoding reduces the partial products from $B_i A$ to $\beta A$, with $\beta \in \{0, \pm 1, \pm 2\}$ (Tab. I). When $\beta = \pm 2$, obtaining the Booth-coded partial product $bcp_j$ requires a shift of the word $A_j$. Taking $\beta = 2$ as an example, Fig. 1 shows the selection of $(2A)_{j,0..m-1}$.
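The modified Booth recoding used in the algorithm can be sketched in software as follows. The table is the standard radix-4 Booth recoding with $b_{-1} = 0$. The names `BOOTH_TABLE`, `booth_encode` and `booth_digits` are illustrative assumptions and are not identifiers from [18].

```python
from __future__ import annotations

# (b_{2i+1}, b_{2i}, b_{2i-1}) -> beta_i, i.e. beta = -2*b_{2i+1} + b_{2i} + b_{2i-1}
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_encode(b_hi: int, b_mid: int, b_lo: int) -> int:
    """Look up one Booth digit from three neighbouring multiplier bits."""
    return BOOTH_TABLE[(b_hi, b_mid, b_lo)]


def booth_digits(b: int, k: int) -> list[int]:
    """Recode a non-negative integer b into k radix-4 digits in {-2, ..., 2}.

    The digits reproduce b exactly only if bit 2k-1 of b is zero, which is
    why the multiplier is padded with leading zero bits.
    """
    digits = []
    prev = 0  # b_{-1} := 0
    for i in range(k):
        lo = (b >> (2 * i)) & 1
        hi = (b >> (2 * i + 1)) & 1
        digits.append(booth_encode(hi, lo, prev))
        prev = hi
    return digits


if __name__ == "__main__":
    d = booth_digits(0b100111, 4)
    print(d, sum(x * 4**i for i, x in enumerate(d)))
```

A hardware implementation would additionally derive a sign bit (set when the digit is negative) so that negative multiples are formed by inversion and a carry-in rather than by a subtractor.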
By the definition of the Montgomery algorithm, M is odd and therefore $M_{0,0} = 1$. As a result, the look-up table of $Q_i$ (Tab. II) only refers to $M_{0,1}$, which is the second least significant bit of M. In Tab. II, $s_{1..0}$ denotes the least significant two bits of the temporary result at the beginning of the i-th loop:
$$s_{1..0} = (TC''_0 + TS'_0 + C_{loop}) \bmod 4, \qquad Q_i \equiv s_{1..0} \times \left(4 - (M_{0,1..0})^{-1}\right) \pmod{4}.$$
Furthermore, $TC''_0 + TS'_0 = TS^{(i)}_0 + TC^{(i)}_0 + (\beta A)_0$, so the detection of $Q_i$ refers to 1) the lowest word from the previous outer loop, $TC^{(i)}_0 + TS^{(i)}_0$, and 2) the lowest word of the partial product $B_i A$ in the current loop, $(\beta A)_0$. For convenience, a sign bit is set separately to account for negative $\beta$, so that only inversion and shifting are required in the word-based operations on $A_j$. The loop bit $C_{loop}$ is a carry bit at the first word of an outer loop, which offsets the loss of 1 if $TC^{(i)}_{0,1..0} + TS^{(i)}_{0,1..0}$ equals $(10)_2 + (10)_2$, $(11)_2 + (01)_2$, or $(01)_2 + (11)_2$. In the encoding of Tab. II, $q_i = 3$ has been replaced by $Q_i = -1$ to simplify the generation of the partial products $Q_i M_j$. This is valid as long as $q_i \equiv Q_i \pmod 4$ at every step, which does not disrupt the determination logic. By contrast with the modified Booth encoding, there is no need to set another sign bit for M, because the least significant bit of $-M$ can be directly set as 1'b0.

TABLE I: Booth encoding, $\beta = \mathrm{Booth}(b_{2i+1} b_{2i} b_{2i-1})$
b_{2i+1}   b_{2i}   b_{2i-1}   β     sign bit
0          0        0          0     0
0          0        1          +1    0
0          1        0          +1    0
0          1        1          +2    0
1          0        0          -2    1
1          0        1          -1    1
1          1        0          -1    1
1          1        1          0     0

[Fig. 1: selection of $(2A)_{j,m-1..0}$ from the bits of $A_j$ and $A_{j-1}$]

TABLE II: Encoding quotient digit $Q_i$ [18]. $Q_i = s_{1..0} \cdot (4 - M^{-1}_{1..0}) \bmod 4$, with $s_{1..0}$ the lowest 2 bits of the first word in the i-th loop
M_{0,1}   s_1   s_0   Q_i
0         0     0     0
0         0     1     -1
0         1     0     2
0         1     1     1
1         0     0     0
1         0     1     1
1         1     0     2
1         1     1     -1

After the application of the aforementioned encoding tables, there are $\beta_i \in [-2, 2]$, $Q_i \in [-1, 2]$, and [18]
$$|S_i| \le \frac{1}{2}(M + A) \times \frac{1 - \frac{1}{4^i}}{1 - \frac{1}{4}} < \frac{2}{3}(M + A),$$
where $S_i$ is the temporary result after the i-th outer loop. Choosing $0 \le B < M < 2^{2n-3}$ and $0 \le A < M$ in the algorithm, then $\beta_{m-1} = \mathrm{Booth}(b_{2n-1} b_{2n-2} b_{2n-3}) = 0$, and the last iteration can be reduced to [18]:
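At the integer level, ignoring the word-serial carry-save datapath, the radix-4 recurrence of this section with Booth-recoded multiplier digits and the quotient digits of Tab. II can be modelled as below. This is a behavioural sketch under the assumptions that M is odd and $0 \le A, B < M$. The function and table names are hypothetical, and the model does not cover timing or the loop bit $C_{loop}$, which only arises in the redundant representation.

```python
from __future__ import annotations

# Modified Booth recoding: (b_{2i+1}, b_{2i}, b_{2i-1}) -> beta_i
BOOTH = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

# Auxiliary encoding: (M_{0,1}, s_1, s_0) -> Q_i, chosen so that
# s + Q_i * M is divisible by 4; the digit 3 is replaced by -1.
AUX = {
    (0, 0, 0): 0, (0, 0, 1): -1, (0, 1, 0): 2, (0, 1, 1): 1,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 2, (1, 1, 1): -1,
}


def mont_mul_radix4(a: int, b: int, m: int) -> tuple[int, int]:
    """Return (S, k) with S = a * b * 4**(-k) mod m and 0 <= S < m."""
    if m % 2 == 0 or not (0 <= a < m and 0 <= b < m):
        raise ValueError("need odd m and 0 <= a, b < m")
    k = m.bit_length() // 2 + 2      # number of radix-4 digits of B
    m1 = (m >> 1) & 1                # second least significant bit of M
    s, prev = 0, 0                   # prev plays the role of b_{2i-1}
    for i in range(k):
        lo = (b >> (2 * i)) & 1
        hi = (b >> (2 * i + 1)) & 1
        beta = BOOTH[(hi, lo, prev)]
        prev = hi
        t = s + beta * a             # add the Booth-coded partial product
        low2 = t % 4                 # Python's % is non-negative for negative t
        q = AUX[(m1, low2 >> 1, low2 & 1)]
        t += q * m                   # clears the two lowest bits
        assert t % 4 == 0
        s = t // 4                   # exact right shift by 2 bits
    if s < 0:                        # final correction for -M < S < 0
        s += m
    return s, k


if __name__ == "__main__":
    s, k = mont_mul_radix4(123, 200, 251)
    print(s, k, (s * 4**k - 123 * 200) % 251 == 0)
```

Because the most significant Booth digit is zero for the chosen k, the last iteration only adds a multiple of M, which is what keeps the unreduced result inside $(-M, M)$ in this model.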
2, we can make sure that there are enough sign bits remaining, and the final result is still correct.

[Fig. 2: Feedforwarding between processing elements]
[Fig. 3: Dataflow in feedforward scalable Montgomery modular multiplier]

B. CSA-based Radix-2 Design
The computation in a processing element with the CSA-based radix-2 design is shown in Fig. 4. At the current clock j, the most significant bits $LC_j$ and $LS_j$ are unavailable; at the next clock j + 1, however, they are available and add up to a 2-bit vector $(u_1 u_0)_2 = LS_j + LC_j$ in {00, 01, 10}. This vector determines the carry-in to the current processing element for the next word as well as the most significant bit of $TS_j$. If $u_0 = 0$, then $z_0$ is kept as the top bit of $TS_j$ while $u_1$ is used as the carry bit to the next word with higher power; if $u_0 = 1$, then $u_1 = 0$ must hold, and therefore the top bit of $TS_j$ reads $\bar{z}_0$ and the carry bit equals $z_0$. The logic is shown in Fig. 5.

[Fig. 4: Computation in a CSA-based radix-2 processing element]
[Fig. 5: Radix-2 processing element]

C. CSA-based Radix-4 Design
In the CSA-based radix-4 processing elements, the temporary results have to be shifted right by 2 bits at every clock, and the determination of the most significant bits and the carries by forwarding gets more complex. Fig. 6 shows the computation in such a processing element, in which $t_0$ and $c_0$ read
$$t_0 = z_1 \oplus (x \oplus y), \qquad c_0 = z_1 \,\&\, (x \oplus y) + (x \,\&\, y),$$
with $x \oplus y$ and $x \,\&\, y$ computed beforehand, and $(c_1, t_2, t_1)_2 = z_0 + (u_2 u_1 u_0)_2$, where $(u_2 u_1 u_0)_2 = LS_j + LC_j$, is determined by Tab. III.

[Fig. 6: Computation in a CSA-based radix-4 processing element]

TABLE III: Forwarding logic in radix-4 processing element
(u_1, u_0)   (c_1, t_2, t_1)
(0, 0)       ($u_2$, 0, $z_0$)
(0, 1)       ($u_2$, $z_0$, $\bar{z}_0$)
(1, 0)       ($u_2$, 1, $z_0$)
(1, 1)       ($z_0$, $\bar{z}_0$, $\bar{z}_0$)

In particular, the case $(u_1, u_0) = (1, 1)$ appears only when $u_2 = 0$, or else there would be $(u_2, u_1, u_0) = (1, 1, 1)$, which contradicts the fact that $(u_2 u_1 u_0)_2 \le (11)_2 + (11)_2 = (110)_2$. As a result, there is $(c_1, t_2, t_1) = (z_0, \bar{z}_0, \bar{z}_0)$. However, at the end of each inner loop, that is, at the last word, the logic for $u_1, u_0$ should be replaced by $\{\tilde{u}_1, \tilde{u}_0\}$, which is decided by the transferred sign-extension bits and $z_0$. For example, if the sign bits for $SC_2$ and $SS_2$ are respectively $s_1$ and $s_0$, then
$$\{\tilde{u}'_1, \tilde{u}'_0\} \Leftarrow (\{s_1, s_1\} + \{s_0, s_0\}) \bmod 4,$$
where the arrow $\Leftarrow$ means that the result passes a register before assignment, and then $\{\tilde{u}_1, \tilde{u}_0\} = \{\tilde{u}'_1, \tilde{u}'_0\} + z_0$. Obviously, the first addition can be replaced by a 3:1 multiplexer with respect to $\{s_1, s_0\} =$ 2'b00, 2'b11 or others. In addition, at the last word the carry bits are meaningless due to sign extension.
The processing element with the CSA-based radix-4 design is shown in Fig. 7, where $Q_i M_j$ and $B_i A_j$ are determined by two look-up tables [18], [6], respectively the auxiliary encoding table (Tab. II) and the Booth encoding table (Tab. I).

[Fig. 7: Radix-4 processing element]

IV. Experiment Result
The CSA-based designs have been described in the Verilog hardware description language and implemented in a Xilinx Virtex2 6000FF1517-4 FPGA. The synthesis tool is Synplify Pro 9.6.2, and the place-and-route tool is Xilinx ISE 10.1. For the sake of comparison, a 1024-bit Montgomery modular multiplication is used as the reference for both the time delay and the area cost, as shown in Tab. IV. When the number of words is much larger than L, FIFO registers are needed to buffer the intermediate results; when it is much smaller than L, some dummy clock cycles have to be inserted to wait for the preceding results. In other words, the first case is short of processing elements, while the second case has redundant processing elements. A scalable architecture is supposed to work mostly in the first case. In Tab. IV, the parameter (L, w) = (32, 16) is chosen to represent such cases.
In Tab. IV, the experimental results with different word sizes are given. While all of the designs in this work are scalable, the depth of the FIFO registers is fixed as 2 for L × w = N and increases to a necessary number $2^d$ in other cases. The area of the FIFO is also included in the area cost. Meanwhile, the clock cycles do not include the input and output clock cycles, which amount to 2(N/w) clock cycles depending on the design. It is obvious that the radix-4 designs need fewer clock cycles, but their maximum frequencies are lower and their areas are larger. However, the total time for a full Montgomery modular multiplication decreases considerably, and the performance in terms of Time×Area is still improved. In this work, the critical path of the radix-2 designs is determined by the logic with shift registers, while that of the radix-4 designs is determined by the processing elements. In particular, for the radix-2 scalable design with (L, w) = (32, 16) the critical path is diverted to reading the FIFO, whose depth is $2^6$.
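The Time and Time×Area columns of Tab. IV appear to follow Time = Cycles / Max Freq. and Time×Area = Time × LUTs, although the paper does not state the formulas explicitly. The sketch below recomputes both under that assumption for the designs of this work; the tuple layout and function names are illustrative. Recomputed values can differ from the printed ones in the last digit because the table entries are rounded.

```python
from __future__ import annotations

# (design, L, w, cycles, max frequency in MHz, area in LUTs), copied from Tab. IV
THIS_WORK = [
    ("radix-2", 64, 16, 1187, 150.3, 9384),
    ("radix-2", 32, 32, 1155, 143.0, 9375),
    ("radix-2", 16, 64, 1187, 107.0, 11006),
    ("radix-2", 32, 16, 2211, 148.5, 7239),
    ("radix-4", 64, 16, 662, 120.5, 13172),
    ("radix-4", 32, 32, 614, 108.1, 12336),
    ("radix-4", 16, 64, 614, 103.0, 12661),
    ("radix-4", 32, 16, 1190, 118.2, 8143),
]


def time_us(cycles: int, fmax_mhz: float) -> float:
    """Computation time in microseconds (cycles divided by MHz)."""
    return cycles / fmax_mhz


def time_area(cycles: int, fmax_mhz: float, luts: int) -> float:
    """Time x Area in units of 10^5 LUT*us."""
    return time_us(cycles, fmax_mhz) * luts / 1e5


def best_design(rows):
    """Row with the smallest Time x Area product."""
    return min(rows, key=lambda r: time_area(r[3], r[4], r[5]))


if __name__ == "__main__":
    for name, L, w, cyc, f, luts in THIS_WORK:
        print(f"{name} ({L},{w}): {time_us(cyc, f):5.1f} us, "
              f"{time_area(cyc, f, luts):.2f} x 1e5 LUT*us")
    print("best:", best_design(THIS_WORK)[:3])
```

Under this reading, the smallest Time×Area among the designs of this work is obtained by the radix-4 design with (L, w) = (32, 32).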
In [5], only the radix-2 design is implemented with a scalable architecture, while the radix-2-b, radix-2-c, and radix-4-a designs are non-scalable. By contrast, the radix-2 scalable designs with $w \le 32$ perform better than their counterpart in [5], and the radix-4 scalable designs are even faster than the radix-2 scalable designs, which is due to the higher frequency of the CSA-based design. In fact, the radix-4 scalable design is even faster than the radix-4 non-scalable design in [5]. The performance in terms of Time×Area is also improved, as shown in Tab. IV. Meanwhile, the result shows that the parameter (L, w) = (16, 64) is less efficient for hardware implementation, both in speed and in terms of Time×Area. Therefore, it is in general proper to choose the word size $w < \sqrt{N}$.

TABLE IV: Implementation of 1024-bit feedforward scalable Montgomery modular multiplier in Xilinx Virtex2 6000FF1517-4 FPGA
Reference   Design      (L, w)     Cycles   Max Freq. (MHz)   Time (µs)   Area (LUTs)   Time×Area (10^5 LUT·µs)
This work   radix-2     (64, 16)   1187     150.3             7.9         9384          0.75
This work   radix-2     (32, 32)   1155     143.0             8.1         9375          0.76
This work   radix-2     (16, 64)   1187     107.0             11.0        11006         1.22
This work   radix-2     (32, 16)   2211     148.5             14.9        7239          1.08
This work   radix-4     (64, 16)   662      120.5             5.5         13172         0.72
This work   radix-4     (32, 32)   614      108.1             5.7         12336         0.70
This work   radix-4     (16, 64)   614      103.0             6.0         12661         0.75
This work   radix-4     (32, 16)   1190     118.2             10.1        8143          0.82
[5]         radix-2     (65, 16)   1088     116.4             9.3         9319          0.87
[5]         radix-2-b   (65, 16)   1088     106.4             10.2        5356          0.55
[5]         radix-2-c   (33, 32)   1056     104.7             10.1        5310          0.54
[5]         radix-4-a   (65, 16)   576      99.3              5.8         13370         0.78

V. Conclusion
This work complemented the CSA-based design of the feedforward scalable Montgomery modular multiplier, where both radix-2 and radix-4 designs are demonstrated. The CSA-based design is faster than the designs in the literature. Although it consumes extra area, it gains better performance in terms of Time×Area. In particular, after the application of Booth encoding and the auxiliary encoding in feedforward scalable Montgomery modular multiplication, the CSA-based radix-4 design is of high performance and superior to the CSA-based radix-2 design and the non-scalable radix-4 design.
Acknowledgement
The authors would like to thank the reviewer for the comments. They also thank the editor for the help in processing the paper. This work is supported by the National Natural Science Foundation of China (No. 61073173).
References
[1] P. Amberg, N. Pinckney, and D. Harris. Parallel high-radix Montgomery multipliers. In 42nd Asilomar Conference on Signals, Systems and Computers, pages 772–776, 2008.
[2] W. Diffie and M. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 22:644–654, 1976.
[3] D. Hankerson, A. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer, New York, 2004.
[4] D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, and S. Hsu. An improved unified scalable radix-2 Montgomery multiplier. In 17th IEEE Symposium on Computer Arithmetic, pages 172–178, 2005.
[5] M. Huang, K. Gaj, and T. El-Ghazawi. New hardware architectures for Montgomery modular multiplication algorithm. IEEE Transactions on Computers, 60(7):923–936, 2011.
[6] A. Ibrahim, H. Elsimary, and A. Nassar. Design and implementation of scalable low power radix-4 Montgomery modular multiplier. In International Conference on Computer Engineering & Systems, pages 395–400, 2007.
[7] N. Jiang and D. Harris. Quotient pipelined very high radix scalable Montgomery multipliers. In Fortieth Asilomar Conference on Signals, Systems and Computers, pages 1673–1677, 2006.
[8] K. Kelley and D. Harris. Parallelized very high radix scalable Montgomery multipliers. In 39th Asilomar Conference on Signals, Systems, and Computers, pages 1196–1200, 2005.
[9] K. Kelley and D. Harris. Very high radix scalable Montgomery multipliers. In Fifth International Workshop on System-on-Chip for Real-Time Applications, pages 400–404, 2005.
[10] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 48:203–209, 1987.
[11] P. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519–521, 1985.
[12] R. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21:120–126, 1978.
[13] E. Savaş, A. Tenca, and Ç. Koç. Scalable and unified multiplier architecture for finite fields GF(p) and GF(2^m). In Second International Workshop on Cryptographic Hardware and Embedded Systems, pages 277–292, Worcester, USA, 2000.
[14] M. Shieh and W. Lin. Word-based Montgomery modular multiplication algorithm for low-latency scalable architectures. IEEE Transactions on Computers, 59(8):1145–1151, 2010.
[15] A. Tenca, G. Todorov, and Ç. Koç. High-radix design of a scalable modular multiplier. In Third International Workshop on Cryptographic Hardware and Embedded Systems, volume 2162 of Lecture Notes in Computer Science, pages 189–205, Paris, France, 2001. Springer, Berlin, Germany.
[16] A. Tenca and Ç. Koç. Scalable architecture for Montgomery multiplication. In First International Workshop on Cryptographic Hardware and Embedded Systems, pages 94–108, Worcester, USA, 1999.
[17] A. Tenca and Ç. Koç. A scalable architecture for modular multiplication based on Montgomery's algorithm. IEEE Transactions on Computers, 52:1215–1221, 2003.
[18] A. Tenca and L. Tawalbeh. An efficient and scalable radix-4 modular multiplier design using recoding techniques. In 37th Asilomar Conference on Signals, Systems and Computers, pages 1445–1450, 2003.