CRT RSA Decryption: Modular Exponentiation based solely on Montgomery Multiplication
João Carlos Néto∗, Alexandre Ferreira Tenca∗∗, and Wilson Vicente Ruggiero∗.
∗ Department of Computer and Digital Systems Engineering, Polytechnic School, University of São Paulo, SP, Brazil. Email:
[email protected],
[email protected] ∗∗ Synopsys, Inc., Oregon, USA. Email:
[email protected]
Abstract—An innovative hardware design is proposed to perform modular exponentiation using only Montgomery Multiplication for CRT RSA decryption. The same hardware used to perform exponentiation is also used to perform conversions. The proposed algorithm is described, and a versatile hardware implementation is provided. When compared to the classical sequential Radix-2 MM architecture from which it was derived, the new RSA architecture shows a 44% average reduction in energy consumption. The efficiency of the proposed design is shown through an experimental synthesis with a 90nm CMOS technology. The results are compared with the state of the art in 1024-bit RSA implementations using non-RNS solutions.
Index Terms—Cryptography, Modular Exponentiation and Modular Multiplication, Chinese Remainder Theorem, RSA cryptosystem, Residue Number System.
I. INTRODUCTION
Public-Key Cryptography (PKC) is used to encrypt data or to validate a digital signature, and to decrypt data or to create a digital signature. PKC provides an authenticated key-exchange scheme over an insecure network between two entities. Public-key cryptosystems are based on hard mathematical problems, such as the integer factorization problem (RSA [2]), the discrete logarithm problem (Diffie-Hellman key exchange scheme [1], ElGamal [3], and DSA [4]), and the elliptic curve discrete logarithm problem (ECC [5]). Modular multiplication is widely used in cryptography, and this arithmetic operation is costly because the operands are extremely large numbers. Hence, computational methods that accelerate these operations, reduce their energy consumption, and simplify their use, especially in hardware, are always of great value for systems that require data security. Currently, one of the most successful modular multiplication methods is Montgomery Multiplication (MM) [6]. Power and performance are key design goals when providing arithmetic functions with low energy consumption. Several methods for low-power design and high performance can be found in [7].
In this paper, we propose an efficient implementation of the method proposed in [8] for the Montgomery Exponentiation algorithm, using only Montgomery Multiplication for CRT RSA decryption. The proposed hardware architecture uses
two important techniques for time and energy efficiency: the MM and CRT algorithms. The MM algorithm is efficient for hardware implementations. With the CRT algorithm, two half-size (simpler) exponentiations can be done in place of one full-size exponentiation. When these two simpler exponentiation operations are performed in parallel, which can be done in hardware, a speedup of four times can be achieved. The major contributions of this work are: (1) the use of the same MM hardware to perform conversions (forward and reverse) and exponentiation, and (2) the design of a Dual-mode MM architecture (DMM), which can perform full-size as well as half-size modular multiplications. As a case study, the scope herein is limited to RSA decryption (or signature generation) operations for the establishment of a shared (symmetric) cryptographic key in an insecure network, where two sets of entities, namely a set of powerful servers and a set of energy-limited mobile devices, employ the key-establishment scheme in the RSA cryptosystem [9].
This paper is organized as follows: Section II provides a brief description of the concepts of Montgomery reduction and the general methods for the Montgomery Multiplication and Exponentiation algorithms. We also review the basic concepts of RNS based on the CRT. In Section III, we derive a method to compute the Montgomery Exponentiation algorithm in RNS, using only Montgomery Multiplication in RNS. In Sections IV and V, we discuss some implementation aspects, focusing on the performance and energy savings of this solution. In Section VI, we present the concluding remarks.
II. CONCEPTS AND DEFINITIONS
A. Montgomery Multiplication
Modular multiplication is the core operation of public-key cryptography. The most important algorithm for modular multiplication is Montgomery Multiplication [6]. The general methods for the Montgomery Multiplication (MM) and Montgomery Exponentiation (MEXP) algorithms are introduced as follows.
1) Montgomery Multiplication Algorithm: This algorithm is based on the residue system suggested by Peter Montgomery in [6]. Algorithm 1 shows the pseudo code of the generalized Radix-2 MM.

Algorithm 1 Radix-2 Montgomery Multiplication
Require: $X = \sum_{i=0}^{n_x-1} x_i 2^i$, $Y = \sum_{i=0}^{n_m-1} y_i 2^i$, odd $M$, and $R = 2^n$, with $X \ge 0$, $0 \le Y < M$, and $n \ge 1 + \max(n_x, n_m)$, where $n_x = \lfloor \log_2 X \rfloor$, $n_m = \lfloor \log_2 M \rfloor$
Ensure: $Z \equiv XYR^{-1} \pmod{M}$, with $0 \le Z < M$
1: S[0] ← 0
2: for i ← 0 to n − 1 step 1 do
3:   a ← S[i] + x_i Y
4:   S[i + 1] ← (a + a_0 M)/2
5: end for
6: if S[n] ≥ M then
7:   S[n] ← S[n] − M
8: end if
9: return Z ← S[n]

We denote the product of X and Y in the Montgomery domain as follows:

$$Z = \mathrm{MM}(X, Y, M, R) \equiv (XYR^{-1}) \bmod M. \quad (1)$$
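As a minimal software sketch of Algorithm 1 (not the hardware data path), the bit-serial loop can be written in Python as below. The function name `mont_mul` and the toy operands are illustrative assumptions; `n` is passed explicitly and is assumed large enough that every bit of `x` is scanned.

```python
def mont_mul(x: int, y: int, m: int, n: int) -> int:
    """Radix-2 Montgomery multiplication sketch: returns x*y*2^(-n) mod m.

    Assumes m is odd, 0 <= y < m, and 0 <= x < 2^n.
    """
    s = 0
    for i in range(n):
        x_i = (x >> i) & 1          # bit i of the multiplier X
        a = s + x_i * y             # line 3: conditionally add Y
        s = (a + (a & 1) * m) >> 1  # line 4: add M if a is odd, then halve
    if s >= m:                      # lines 6-8: single final subtraction
        s -= m
    return s


if __name__ == "__main__":
    m, n = 239, 9                   # toy odd modulus; n chosen as bit length + 1 (assumption)
    x, y = 123, 200
    z = mont_mul(x, y, m, n)
    # Cross-check against the definition Z = X*Y*R^(-1) mod M with R = 2^n.
    assert z == (x * y * pow(2, -n, m)) % m
    print(z)
```

The cross-check relies on Python's built-in `pow` with a negative exponent (Python 3.8 or later) to obtain the inverse of R modulo M.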
2) Montgomery Exponentiation Algorithm: This algorithm computes $z \equiv x^e \pmod{M}$ using the MM [10]. Algorithm 2 describes the MEXP algorithm, where both MM operations (lines 4 and 6) are executed at the same time, so that line 6 uses the value of s from before the squaring in line 4. We denote the exponentiation in the Montgomery domain as follows:

$$z = \mathrm{MEXP}(x, e, R^2, M, R) \equiv x^e \pmod{M}. \quad (2)$$

Algorithm 2 Montgomery Exponentiation
Require: $x = \sum_{i=0}^{n-1} x_i 2^i$, $e = \sum_{i=0}^{t-1} e_i 2^i$, $R^2$, $M$, and $R = 2^n$, with $\gcd(M, R) = 1$, $RR^{-1} \equiv 1 \pmod{M}$, and $R^2 \equiv RR \pmod{M}$, where $e_t = 1$, $1 \le x < M$, $n = 1 + \lfloor \log_2 M \rfloor$
Ensure: $z \equiv x^e \pmod{M}$
1: u ← MM(1, R^2, M, R) {u = R mod M, the Montgomery form of 1}
2: s ← MM(x, R^2, M, R)
3: for i ← 0 to t − 1 step 1 do
4:   s ← MM(s, s, M, R) {Run in parallel with line 6}
5:   if e_i = 1 then
6:     u ← MM(u, s, M, R) {Run in parallel with line 4}
7:   end if
8: end for
9: z ← MM(1, u, M, R)
10: return z
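The right-to-left scan of Algorithm 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names `mont_mul` and `mont_exp` are hypothetical, the accumulator `u` is assumed to start as R mod M (the Montgomery form of 1) so that the closing MM(1, u) returns $x^e \bmod M$, and the hardware parallelism of lines 4 and 6 is modelled by keeping the pre-squaring value of `s`.

```python
def mont_mul(x: int, y: int, m: int, n: int) -> int:
    """x*y*2^(-n) mod m, bit-serial as in Algorithm 1 (m odd, y < m, x < 2^n)."""
    s = 0
    for i in range(n):
        a = s + ((x >> i) & 1) * y
        s = (a + (a & 1) * m) >> 1
    return s - m if s >= m else s


def mont_exp(x: int, e: int, m: int, n: int) -> int:
    """Binary exponentiation built only from mont_mul calls (sketch)."""
    r2 = pow(2, 2 * n, m)            # precomputed R^2 mod M
    u = mont_mul(1, r2, m, n)        # assumption: u = R mod M (Montgomery form of 1)
    s = mont_mul(x, r2, m, n)        # s = x*R mod M
    for i in range(e.bit_length()):  # scan exponent bits from LSB to MSB
        s_old = s                    # value seen by both "parallel" multipliers
        s = mont_mul(s_old, s_old, m, n)      # line 4: squaring
        if (e >> i) & 1:
            u = mont_mul(u, s_old, m, n)      # line 6: multiply by the old s
    return mont_mul(1, u, m, n)      # line 9: leave the Montgomery domain


if __name__ == "__main__":
    m = 3233                         # toy odd modulus (61 * 53)
    n = m.bit_length() + 1           # conservative width, an assumption of this sketch
    assert mont_exp(65, 17, m, n) == pow(65, 17, m)
```

Every modular operation inside `mont_exp` goes through `mont_mul`; only the constant `r2` is precomputed outside the multiplier, mirroring the precomputed $R^2$ in the Require line.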
B. RNS based on the CRT
Residue Number Systems (RNS) are based on the Chinese Remainder Theorem (CRT), which allows for fast parallel arithmetic. In RNS, suppose we have a set of r different moduli, $\{m_1, m_2, \ldots, m_r\}$, that are pairwise relatively prime to each other, i.e., $\gcd(m_i, m_j) = 1$, for $i \ne j$. Let M be the product of the moduli set, which is called the dynamic range of the RNS, because the amount of numbers that can be represented is M [11]. This product is expressed as

$$M = \prod_{i=1}^{r} m_i. \quad (3)$$
An integer x is represented by an ordered set of r residues of positive integers, $\{X_1, X_2, \ldots, X_r\}$, defined within the dynamic range. Consider the following correspondence:

$$x \leftrightarrow \langle X_1, X_2, \ldots, X_r \rangle, \quad (4)$$

where $x \in \mathbb{Z}_M$ and $X_i \in \mathbb{Z}_{m_i}$. The number $X_i$ is said to be the residue of x with respect to $m_i$, and we often use the following notation:

$$X_i \equiv |x|_{m_i} = x \pmod{m_i}, \text{ for } i = 1, 2, \ldots, r. \quad (5)$$
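Recall from Section II-A that the product of X and Y in the Montgomery domain is denoted $Z = \mathrm{MM}(X, Y, M, R) \equiv (XYR^{-1}) \bmod M$. If x and y are given in their RNS forms $x \leftrightarrow \langle X_1, X_2, \ldots, X_r \rangle$ and $y \leftrightarrow \langle Y_1, Y_2, \ldots, Y_r \rangle$, then we may define the operations of addition, subtraction and multiplication with the following equation:

$$|x \,\Delta\, y|_M \leftrightarrow \langle X_1, X_2, \ldots, X_r \rangle \,\Delta\, \langle Y_1, Y_2, \ldots, Y_r \rangle \leftrightarrow \langle Z_1, Z_2, \ldots, Z_r \rangle \leftrightarrow z, \quad (6)$$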
where $Z_i \equiv (X_i \,\Delta\, Y_i) \bmod m_i$, and $\Delta$ denotes the operation +, −, or ×. In this work, we focus on the use of the CRT, which is the basic theorem of RNS and ensures the uniqueness of this representation within the range $0 \le z < M$. The proof of this theorem [12] can be used to recover z from its residues. The relationship between z and its residues is shown in the following equation:

$$z \equiv \left( \sum_{i=1}^{r} Z_i \, \left| M_i^{-1} \right|_{m_i} M_i \right) \bmod M, \quad (7)$$

where $M_i = M/m_i$ and $|M_i^{-1}|_{m_i}$ is the inverse of $M_i$ modulo $m_i$, i.e., $(M_i^{-1} M_i) \bmod m_i \equiv 1$. Garner's Algorithm for CRT is an efficient method for determining z, given an RNS representation $z \leftrightarrow \langle Z_1, Z_2, \ldots, Z_r \rangle$, the residues of z modulo the pairwise coprime moduli $m_1, m_2, \ldots, m_r$. For further details, please refer to Section 14.5.2 of [10].
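To make equations (5) to (7) concrete, the following Python sketch converts integers to residues, applies a component-wise operation, and reconstructs the result with the CRT sum of equation (7). The function names and the toy moduli are illustrative assumptions; the reconstruction uses the direct formula of equation (7) rather than Garner's algorithm.

```python
from math import prod


def to_rns(x, moduli):
    """Equation (5): X_i = x mod m_i for every modulus."""
    return [x % m for m in moduli]


def rns_op(xs, ys, moduli, op):
    """Equation (6): apply '+', '-' or '*' independently in each channel."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    return [ops[op](a, b) % m for a, b, m in zip(xs, ys, moduli)]


def from_rns(residues, moduli):
    """Equation (7): z = (sum of Z_i * |M_i^-1|_{m_i} * M_i) mod M."""
    big_m = prod(moduli)
    z = 0
    for z_i, m_i in zip(residues, moduli):
        m_hat = big_m // m_i              # M_i = M / m_i
        inv = pow(m_hat, -1, m_i)         # inverse of M_i modulo m_i
        z += z_i * inv * m_hat
    return z % big_m


if __name__ == "__main__":
    moduli = [61, 53, 7]                  # pairwise coprime toy moduli
    big_m = prod(moduli)
    x, y = 1234, 567
    z = from_rns(rns_op(to_rns(x, moduli), to_rns(y, moduli), moduli, "*"), moduli)
    assert z == (x * y) % big_m
```

No carries cross between channels in `rns_op`, which is the property that lets the channels run in parallel in hardware.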
III. MONTGOMERY EXPONENTIATION METHOD IN RNS
The use of RNS for the RSA cryptosystem faces a limitation because one cannot choose an arbitrary dynamic range (distinct secret primes) to calculate the modulus M as the RNS product of the moduli set [13]. An efficient RNS modular reduction with no such restriction using base extensions was proposed in [14], where this application is an adaptation in RNS of the Montgomery Multiplication algorithm. In this context, Guillermin's paper shows a good example of the use of RNS in cryptography, which provides protection against side-channel attacks [15]. Our investigations were conducted to provide an application of the Montgomery Multiplication in RNS for computing $z \equiv x^e \pmod{M}$, using Algorithm 3 without base extensions. However, any comparison of this proposed method with [14] and [15] is unfair.
The following proof of concept shows how to perform repeated modular multiplications in parallel in the Montgomery r-domain through the use of RNS, where each domain is based on $R_i = 2^{n_i}$, $n_i = 1 + \lfloor \log_2 m_i \rfloor$, and $R_i^{-1}$ is the multiplicative inverse of $R_i$, such that $\gcd(m_i, R_i) = 1$, $R_i R_i^{-1} \equiv 1 \pmod{m_i}$, and a set of r relatively prime moduli $m_i$, i.e., $\gcd(m_i, m_j) = 1$, for $i \ne j$. Let x be an arbitrary input operand with n bits in its RNS form $x \leftrightarrow \langle X_1, X_2, \ldots, X_r \rangle$, and let e be an exponent, which is given here in the most significant bit form, such that $e = \sum_{i=0}^{t-1} e_i 2^i$, with $e_t = 1$. The proposal is the reconstruction of the r-residue $X_i$ from the RNS representation to the Montgomery r-domain. Then, for each modulus $m_i$, the Montgomery Exponentiation is performed.
Algorithm 3 shows the generalized MM operator being used to perform all the phases required to implement exponentiation in RNS. This algorithm involves the following steps:
1) Forward conversion: Lines 1 to 4 perform the conversion of the input operand x to the RNS representation of the r-residue $X_i$, with respect to $m_i$, as denoted in equation (5).
2) Exponentiation in RNS: Lines 5 to 7 process the r MEXP algorithms in parallel, which compute their outputs $Z_i \equiv (X_i)^e \bmod m_i$.
3) Reverse conversion: Lines 8 to 13 convert the intermediate results $Z_i$, produced by each modular exponentiation, from the RNS representation to the output result $z \equiv x^e \pmod{M}$ in the conventional notation of the input x using Garner's algorithm for CRT [10], as denoted in equation (7).
Algorithm 3 Montgomery Exponentiation in RNS
Require: $x = \sum_{k=0}^{n} x_k 2^k$, $e = \sum_{i=0}^{t-1} e_i 2^i$, and odd $M = \prod_{i=1}^{r} m_i > 1$, with $n = 1 + \lfloor \log_2 M \rfloor$, $e_t = 1$, $1 \le x < M$, for $i, j = 1, 2, \ldots, r$, $\gcd(m_i, m_j) = 1$, for all $i \ne j$, $R = 2^n$, $\gcd(M, R) = 1$, $RR^{-1} \equiv 1 \pmod{M}$, $|R|_{m_i} \equiv R \pmod{m_i}$, $\gcd(m_i, |R|_{m_i}) = 1$, $|R|_{m_i} |R|^{-1}_{m_i} \equiv 1 \pmod{m_i}$, $R_i = 2^{n_i}$, $\gcd(m_i, R_i) = 1$, $R_i R_i^{-1} \equiv 1 \pmod{m_i}$, $n_i = 1 + \lfloor \log_2 m_i \rfloor$, and the precomputed values: $|R^2|_{m_i} \equiv RR \pmod{m_i}$, $R_i^2 \equiv R_i R_i \pmod{m_i}$, $R^2 \equiv RR \pmod{M}$, $M_i = M/m_i$, $M^{-1}_{R_i} \equiv M_i^{-1} R_i \pmod{m_i}$, with $(M_i^{-1} M_i) \bmod m_i \equiv 1$.
Ensure: $z \equiv x^e \pmod{M}$
1: for each modulus m_i, i = 1, 2, ..., r, in parallel do
2:   Wa_i ← MM(x, 1, m_i, R)
3:   X_i ← MM(Wa_i, |R^2|_{m_i}, m_i, R)
4: end for
5: for each modulus m_i, i = 1, 2, ..., r, in parallel do
6:   Z_i ← MEXP(X_i, e, R_i^2, m_i, R_i)
7: end for
8: for each modulus m_i, i = 1, 2, ..., r, in parallel do
9:   Wb_i ← MM(Z_i, M^{-1}_{R_i}, m_i, R_i)
10:   Zr_i ← MM(Wb_i, M_i, M, R)
11: end for
12: Zr ← Zr_1 + Zr_2 + ... + Zr_r
13: z ← MM(Zr, R^2, M, R)
14: return z
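The three phases of Algorithm 3 can be traced end to end with the Python sketch below. It is a software illustration under assumptions, not the proposed hardware: the names `mont_mul`, `mont_exp` and `mexp_rns` are hypothetical, the channels are processed sequentially instead of in parallel, the widths `n` and `n_i` are taken as the bit length plus one so that the sum Zr fits below R, and `mont_exp` starts its accumulator at R mod m.

```python
from math import prod


def mont_mul(x, y, m, n):
    """x*y*2^(-n) mod m, bit-serial (m odd, y < m, x < 2^n)."""
    s = 0
    for i in range(n):
        a = s + ((x >> i) & 1) * y
        s = (a + (a & 1) * m) >> 1
    return s - m if s >= m else s


def mont_exp(x, e, m, n):
    """x^e mod m using only mont_mul (accumulator starts at R mod m)."""
    r2 = pow(2, 2 * n, m)
    u, s = mont_mul(1, r2, m, n), mont_mul(x, r2, m, n)
    for i in range(e.bit_length()):
        s_old = s
        s = mont_mul(s_old, s_old, m, n)
        if (e >> i) & 1:
            u = mont_mul(u, s_old, m, n)
    return mont_mul(1, u, m, n)


def mexp_rns(x, e, moduli):
    """Sketch of Algorithm 3: forward conversion, MEXP per channel, reverse."""
    big_m = prod(moduli)
    n = big_m.bit_length() + 1                # R = 2^n (width is an assumption)
    zr = 0
    for m_i in moduli:                        # sequential stand-in for parallel channels
        n_i = m_i.bit_length() + 1            # R_i = 2^(n_i)
        wa = mont_mul(x, 1, m_i, n)                          # line 2: x * R^-1 mod m_i
        x_i = mont_mul(wa, pow(2, 2 * n, m_i), m_i, n)       # line 3: x mod m_i
        z_i = mont_exp(x_i, e, m_i, n_i)                     # line 6
        m_hat = big_m // m_i                                 # M_i
        mr_inv = (pow(m_hat, -1, m_i) << n_i) % m_i          # M_i^-1 * R_i mod m_i
        wb = mont_mul(z_i, mr_inv, m_i, n_i)                 # line 9
        zr += mont_mul(wb, m_hat, big_m, n)                  # lines 10 and 12
    assert zr < (1 << n)   # holds for two moduli; more channels would need a wider R
    return mont_mul(zr, pow(2, 2 * n, big_m), big_m, n)      # line 13


if __name__ == "__main__":
    assert mexp_rns(65, 17, [61, 53]) == pow(65, 17, 61 * 53)
```

Apart from the precomputed constants, every arithmetic step is either a call to `mont_mul` or the final accumulation of the channel outputs, which is the property the method relies on.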
IV. HARDWARE IMPLEMENTATION FOR CRT RSA DECRYPTION
In this section, we present a case study of the Montgomery Exponentiation in RNS (MEXPRNS) architecture, as described in Algorithm 3, but using only 2 moduli for CRT RSA decryption.
A. Architecture of Montgomery Exponentiation in RNS (MEXPRNS)
The top level of the MEXPRNS architecture is illustrated in Figure 1. A flexible multiplication module (named Dual-mode Montgomery Multiplier, DMM) performs all the MM operations in Algorithm 3. Data routers create data paths from the register set to the DMM blocks. The control block is not detailed in this work, and can be derived from the algorithm descriptions. A set of registers is employed to store the values required for the computation and the intermediate results.
Fig. 1. Montgomery Exponentiation in RNS (MEXPRNS) – Top Level
The notation (a|b) is used to indicate the concatenation of 2 bit-vectors. Since each Modulo Channel has half of the size of the main modulus (M) in this study, we use the notation $X^{(1)}$ and $X^{(2)}$ to indicate the MS and LS halves of a bit-vector with n bits. Observe that in the Forward conversion calculation, two multiplications are performed: one using the full length of the data bus (n bits), and the other using only half of the bus. In both cases, the results are found in the LS half of the bus, and stored in the temporary registers $X_i$. During exponentiation, once the conversion from the integer domain in RNS to the r-domain in RNS is done, two multiplications modulo $m_i$ are done in each DMM, making the best utilization of the hardware resources. Once all the multiplications required for exponentiation are done, the following multiplications are performed to bring the value from RNS to the integer number system.
The AddSub block is responsible for the combination and reduction of the results generated by the DMM modules. It is activated only after the completion of the DMM calculations. Thus, a way to save area and energy is to share this hardware block with the DualAdder blocks (Figure 2), turning it on when required. This way, this block consumes power only when the accumulation of the DMM outputs is needed. The accumulation should be conducted in such a way that it does not compromise the total clock period. There are several ways to perform this task, but here we use a sequential circuit that accumulates the results.
B. Dual-mode MM Architecture (DMM)
The block diagram of the Dual-mode MM is shown in Figure 2. A DMM is capable of performing one multiplication with full-size operands (n bits) or 2 multiplications with half-size operands. The output of the DMM module is represented in carry-save form, and its binary value (Z, the output of Dual Adder), for inputs X, Y, and M, corresponds to the following equations:

$$Z = \mathrm{MM}(X, Y, M, R), \text{ when } mode = 1, \text{ or} \quad (8)$$

$$Z = \left( \mathrm{MM}(X^{(1)}, Y^{(1)}, M^{(1)}, R^{(1)}) \,|\, \mathrm{MM}(X^{(2)}, Y^{(2)}, M^{(2)}, R^{(2)}) \right), \text{ when } mode = 2. \quad (9)$$

Fig. 2. Dual-mode MM Architecture (DMM)

The modules that compose the DMM are responsible for the implementation of the MM algorithm, as previously described. ShiftX is used to identify each bit of X (the multiplier) for each mode, registers are used to store the intermediate values S and C, and a final adder (Dual Adder) is used to convert the CS representation of the final product into the conventional non-redundant binary representation.
The module ShiftX has an input $X = (X^{(1)}, X^{(2)})$, state $X_S$ and outputs $X_L$ and $X_H$. The state is initialized as $X_S = X$ when the multiplication starts. In each clock cycle, the state and outputs are computed as: $X_S = X_S \gg 1$; $X_L = X_S[0]$; $X_H = X_S[n/2]$.
Module DMMK (DMM Kernel) is the central data path of the DMM hardware. It performs the main iterations in the MM algorithm for each mode of operation. Given the inputs $X_L$, $X_H$, $S = (S^{(1)}, S^{(2)})$, $C = (C^{(1)}, C^{(2)})$, $Y = (Y^{(1)}, Y^{(2)})$, $M = (M^{(1)}, M^{(2)})$, it computes:

$$S = \frac{S + Y X_L + M q}{2}, \text{ when } mode = 1, \quad (10)$$

where q is internally calculated as the LS bit of $S + Y X_L$, or

$$S^{(1)} = \frac{S^{(1)} + Y^{(1)} X_H + M^{(1)} q_1}{2}, \text{ and} \quad (11)$$

$$S^{(2)} = \frac{S^{(2)} + Y^{(2)} X_L + M^{(2)} q_2}{2}, \text{ when } mode = 2, \quad (12)$$

where $q_1$ and $q_2$ are the LS bits of the $S^{(1)} + Y^{(1)} X_H$ and $S^{(2)} + Y^{(2)} X_L$ computations, respectively.
Dual Adder (a Dual-mode adder) is responsible for the accumulation of the DMMK results, according to the following equations:

$$Z^{(1)} = \left| (S^{(1)} \gg 1) + (C^{(1)} \gg 1) \right|_{n_1}, \text{ and } Z^{(2)} = \left| (S^{(2)} \gg 1) + (C^{(2)} \gg 1) \right|_{n_2}, \text{ when } mode = 1, \text{ or} \quad (13)$$

$$(Z^{(1)} | Z^{(2)}) = \left( \left| (S^{(1)} \gg 1) + (C^{(1)} \gg 1) \right|_{n_1} \,\Big|\, \left| (S^{(2)} \gg 1) + (C^{(2)} \gg 1) \right|_{n_2} \right), \text{ when } mode = 2. \quad (14)$$
Fig. 3. Dual MM Kernel (DMMK)
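A behavioural Python model may help to read the two operating modes of the DMM in equations (8) to (12). It is only a sketch under assumptions: ordinary integers replace the carry-save pair (S, C), the multiplier bit is read before the ShiftX state is shifted, each result receives one final subtraction, and the names `dmm` and `_kernel_step` are hypothetical.

```python
def _kernel_step(s, x_bit, y, m):
    """One kernel iteration, eqs. (10)-(12): S <- (S + Y*x + M*q) / 2."""
    a = s + y * x_bit
    q = a & 1                        # q is the LS bit of S + Y*x
    return (a + m * q) >> 1


def dmm(x, y, m, n, mode):
    """Mode 1: one n-bit MM. Mode 2: two independent n/2-bit MMs, concatenated."""
    h = n // 2
    low = (1 << h) - 1
    xs = x                           # ShiftX state, initialised to X
    if mode == 1:
        s = 0
        for _ in range(n):
            s = _kernel_step(s, xs & 1, y, m)      # X_L = X_S[0]
            xs >>= 1
        return s - m if s >= m else s
    y1, y2 = y >> h, y & low         # (1) = MS half, (2) = LS half
    m1, m2 = m >> h, m & low
    s1 = s2 = 0
    for _ in range(h):
        x_l = xs & 1                 # current multiplier bit of the LS half
        x_h = (xs >> h) & 1          # current multiplier bit of the MS half
        s1 = _kernel_step(s1, x_h, y1, m1)
        s2 = _kernel_step(s2, x_l, y2, m2)
        xs >>= 1
    s1 = s1 - m1 if s1 >= m1 else s1
    s2 = s2 - m2 if s2 >= m2 else s2
    return (s1 << h) | s2            # (Z1 | Z2)


if __name__ == "__main__":
    # Mode 2 with two 8-bit channels packed into 16-bit words (toy values).
    x, y, m = (100 << 8) | 200, (50 << 8) | 77, (239 << 8) | 251
    z = dmm(x, y, m, 16, mode=2)
    assert z >> 8 == (100 * 50 * pow(2, -8, 239)) % 239
    assert z & 0xFF == (200 * 77 * pow(2, -8, 251)) % 251
```

In mode 2 the same shift register feeds both halves, so the two half-size products finish in n/2 iterations instead of n.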
C. DMM Kernel circuit (DMMK)
Figure 3 shows the Dual MM Kernel block diagram. It represents the basic function performed by the DMM module. The $ShiftSC_1$ and $ShiftSC_2$ blocks operate separately or together, depending on the signal cfg, which controls the propagation of the $S_2[0]$ and $C_2[0]$ bits into the bit positions $S_1[n_i]$ and $C_1[n_i]$, respectively, forming the intermediate result with n bits ($(S_1|S_2)$ and $(C_1|C_2)$). Thus, these blocks shift the intermediate result by 1 bit position to the right in CS form, generating the input for the $CSA_{11}$ and $CSA_{12}$ blocks. The signal cfg is associated with the multiplication mode, such that cfg = 0 or 1 means mode = 1 or 2, respectively.
The $SelOp_{11}$ block receives the signal $X_H$ with the same value as the signal $X_L$. As a result, $SelOp_{11}$ selects the value $Y_1$ or zeros, depending on the $X_L$ value, as the $SelOp_{12}$ block does for the value $Y_2$ or zeros. These blocks work together to join the n bits of the multiples of the Y operand, $(Y_1|Y_2)$, generating the input operand $(Op_{11}|Op_{12})$ for $CSA_{11}$ and $CSA_{12}$. In this way, $CSA_{11}$ and $CSA_{12}$ turn into an n-bit adder.
The $LogicQi_1$ block is disabled in single multiplication mode. In this case, the bit value $Qi_2[0]$ is transmitted from the $LogicQi_2$ block as input to $SelOp_{21}$. According to the $Qi_2[0]$ value, $SelOp_{21}$ selects the value $M_1$ or zeros, as $SelOp_{22}$ does for the value $M_2$ or zeros. In this way, the $SelOp_{21}$ and $SelOp_{22}$ blocks also work together to combine the n bits of the multiples of the M operand, $(M_1|M_2)$, composing the $(Op_{21}|Op_{22})$ operands for $CSA_{21}$ and $CSA_{22}$. The $Cin_{21}[0]$ bit value is zero for an operand size of $n_i$ bits. In this case, the $Cin_{22}[n_i]$ bit value is transferred into this bit position, $Cin_{21}[0]$, to produce the n bits of the intermediate result in CS form, $(Sin_{21}|Sin_{22})$ and $(Cin_{21}|Cin_{22})$, as input for the $CSA_{21}$ and $CSA_{22}$ blocks.
V. EXPERIMENTAL RESULTS
The functionality of the Montgomery Exponentiation in RNS method was verified using simulation. The blocks presented in Section IV were described in VHDL and simulated using VCS (Synopsys simulation tool). The designs were developed using the same design facilities and tools. The hardware description was synthesized with Synopsys Design Compiler using a 90nm CMOS library “saed90nm_typ.db”.
A. Benchmark
We based our baseline architecture on the Exponentiation Radix-2 MM (MExpRadix-2, Algorithm 2) and showed that the Montgomery Exponentiation in RNS architecture (MEXPRNS, Algorithm 3) outperforms it for the different problem sizes that were tested. Table I shows the summary of the experiments used for the comparison of the two architectures, which were implemented for four different values of n (256, 512, 1024, and 2048 bits). For a given operand size (n), the same clock period (the largest critical path of these architectures) must be set to determine the dynamic power, which allows the calculation of the energy consumption of all architectures. Compared to the baseline design, the proposed method has higher performance, consumes more power, and exhibits more parallelism, but uses significantly less energy.
TABLE I
THE ENERGY CONSUMPTION OF THE MONTGOMERY EXPONENTIATION ARCHITECTURES

n (bits)   Architecture   Total Power per bit Exponent (mW)   Energy Consumption for one Montgomery Exponentiation (mW·ms)   Reduction in Energy Consumption
256        MExpRadix-2    2.87                                0.19
256        MEXPRNS        3.04                                0.10                                                           47%
512        MExpRadix-2    5.70                                1.50
512        MEXPRNS        6.17                                0.81                                                           46%
1024       MExpRadix-2    11.25                               11.80
1024       MEXPRNS        13.10                               6.87                                                           42%
2048       MExpRadix-2    22.31                               93.58
2048       MEXPRNS        25.81                               54.14                                                          42%
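The reduction column of Table I can be recomputed from the two energy figures of each operand size, as in the short Python check below. The variable names are illustrative; the numbers are copied from Table I.

```python
# Energy per exponentiation in mW*ms, copied from Table I: (MExpRadix-2, MEXPRNS).
energy = {
    256: (0.19, 0.10),
    512: (1.50, 0.81),
    1024: (11.80, 6.87),
    2048: (93.58, 54.14),
}

# Relative saving of MEXPRNS with respect to the Radix-2 baseline, in percent.
reductions = {n: 100.0 * (base - rns) / base for n, (base, rns) in energy.items()}
average = sum(reductions.values()) / len(reductions)

if __name__ == "__main__":
    for n, r in reductions.items():
        print(f"n = {n:4d}: {r:.1f}% less energy")
    print(f"average: {average:.1f}%")
```

Rounded to whole percentages, the per-size values agree with the last column of Table I, and their mean agrees with the 44% average reduction quoted in the abstract.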
For architectures that are faster and larger than the Radix-2 architecture used in this work, it is expected that the application of the proposed technique would also result in energy gains, but at a different level. The total area and the total area increase columns show the impact of the number of modulo channels on the area of the MM Extended architecture. The MEXPRNS method has an expected O(n) area growth with the number of operand bits and the number of modulo channels. The columns for the energy consumption of one Montgomery Exponentiation and for the reduction in energy consumption show the most significant result. When comparing the baseline architecture (MExpRadix-2) against the experiments on the MEXPRNS architectures, we observed a 44% average reduction in the energy consumption for the MEXPRNS circuits described in Table I.
Table II shows the area-time comparisons between the proposed architecture using 1024-bit RSA operations and other state-of-the-art related work found in the literature. This table indicates that the proposed architecture uses the lowest clock frequency for comparable exponentiation time and circuit area (see, for example, the values of this work with Radix $2^8$ and Miyamoto with Radix $2^{32}$). By using a lower clock frequency, the circuit consumes less power on clock networks, which helps to reduce the energy required by the circuit to perform 1024-bit RSA operations.
B. Analysis of the Energy Consumption
The analysis of the energy consumption was carried out with Power Compiler, which is part of the Synopsys Design Compiler synthesis tools. With Power Compiler, various power reduction techniques, including clock gating, operand isolation, multi-voltage leakage power optimization, and gate-level power optimization, can increase the power savings alongside the area and timing optimization in the front-end synthesis domain. Based on the experiments, each Dual-mode MM block (DMM) consumes 43% of the overall average power.
TABLE II
THE AREA-TIME COMPARISONS FOR RSA 1024-BIT

References                    Platform          Parameters (n = 1024)   Max Freq. (MHz)   Exponentiation time (ms)   Area (LUT/Gates)
This Work CRT-RSA             90-nm CMOS        Radix 2                 60.61             8.74                       9,387
This Work CRT-RSA             90-nm CMOS        Radix 2^2               59.00             4.49                       10,850
This Work CRT-RSA             90-nm CMOS        Radix 2^3               57.14             3.09                       12,479
This Work CRT-RSA             90-nm CMOS        Radix 2^8               55.56             2.38                       14,145
A. Miyamoto CRT-RSA [16]      90-nm CMOS        Radix 2^8               432.90            30.42                      861
A. Miyamoto CRT-RSA [16]      90-nm CMOS        Radix 2^32              471.70            2.03                       11,437
A. Miyamoto CRT-RSA [16]      90-nm CMOS        Radix 2^128             421.94            0.24                       153,862
C. McIvor RSA [17]            Xilinx XC2V6000   Scalable                95.90             21.90                      23,208
M. Sudhakar RSA [18]          Xilinx 2VP100     Scalable                140.00            15.00                      5,892
M.-D. Shieh RSA [19]          Xilinx XC2V6000   Scalable                152.49            13.82                      25,074
The remaining power consumption is associated with data routers and DMM block controllers.
VI. CONCLUSIONS
The proposed method provides an application of Radix-2 MM in RNS for computing $z = x^e \pmod{M}$, using the same hardware to perform exponentiation and conversions (forward and reverse) between the conventional number system and RNS. A proof of concept for the RSA cryptosystem implementation of the proposed method is provided. We investigated the operations required for hardware implementations of the modular exponentiation using a generalized Montgomery multiplier concept in the context of RNS. We created an efficient hardware architecture to reduce the energy consumption without sacrificing performance with the use of arithmetic functions to perform the calculations involved in public-key cryptography.
REFERENCES
[1] W. Diffie, M. E. Hellman, New directions in cryptography, IEEE Transactions on Information Theory IT-22 (6) (1976) 644–654.
[2] R. L. Rivest, A. Shamir, L. M. Adleman, A method for obtaining digital signatures and public-key cryptosystems, Communications of the ACM 21 (2) (1978) 120–126.
[3] T. Elgamal, A public key cryptosystem and a signature scheme based on discrete logarithms, IEEE Transactions on Information Theory 31 (4) (1985) 469–472. doi:10.1109/TIT.1985.1057074.
[4] NIST, Federal information processing standard (FIPS PUB 186-3) Digital Signature Algorithm (DSA) (2009).
[5] D. Hankerson, A. J. Menezes, S. Vanstone, Guide to elliptic curve cryptography, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003.
[6] P. L. Montgomery, Modular multiplication without trial division, Mathematics of Computation 44 (170) (1985) 519–521.
[7] M. Pedram, Power aware design methodologies, Kluwer Academic Publishers, Norwell, MA, USA, 2002.
[8] J.-J. Quisquater, C. Couvreur, Fast decipherment algorithm for RSA public-key cryptosystem, Electronics Letters 18 (21) (1982) 905–907.
[9] F. Zhu, D. Wong, A. Chan, R. Ye, Password authenticated key exchange based on RSA for imbalanced wireless networks, in: A. Chan, V. Gligor (Eds.), Information Security, Vol. 2433, Springer Berlin Heidelberg, 2002, pp. 150–161.
[10] A. Menezes, P. C. v. Oorschot, S. A. Vanstone, Handbook of applied cryptography, CRC Press, 1996.
[11] A. Omondi, B. Premkumar, Residue number systems: theory and implementation, Imperial College Press, London, UK, 2007.
[12] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to algorithms (3rd ed.), The MIT Press, 2009.
[13] S. Kawamura, M. Koike, F. Sano, A. Shimbo, Cox-rower architecture for fast parallel Montgomery multiplication, in: Proceedings of the 19th international conference on Theory and application of cryptographic techniques, EUROCRYPT'00, Springer-Verlag, Berlin, Heidelberg, 2000, pp. 523–538.
[14] J.-C. Bajard, L.-S. Didier, P. Kornerup, Modular multiplication and base extensions in residue number systems, in: Proceedings of the 15th IEEE Symposium on Computer Arithmetic, ARITH '01, IEEE Computer Society, Washington, DC, USA, 2001, p. 59.
[15] N. Guillermin, A high speed coprocessor for elliptic curve scalar multiplications over Fp, in: Proceedings of the 12th International Conference on Cryptographic Hardware and Embedded Systems, CHES'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 48–64.
[16] A. Miyamoto, N. Homma, T. Aoki, A. Satoh, Systematic design of RSA processors based on high-radix Montgomery multipliers, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 19 (7) (2011) 1136–1146.
[17] C. McIvor, M. McLoone, J. McCanny, Modified Montgomery modular multiplication and RSA exponentiation techniques, Computers and Digital Techniques, IEE Proceedings - 151 (6) (2004) 402–408.
[18] M. Sudhakar, R. Kamala, M. Srinivas, A bit-sliced, scalable and unified Montgomery multiplier architecture for RSA and ECC, in: Very Large Scale Integration, 2007. VLSI - SoC 2007. IFIP International Conference on, 2007, pp. 252–257.
[19] M.-D. Shieh, J.-H. Chen, H.-H. Wu, W.-C. Lin, A new modular exponentiation architecture for efficient design of RSA cryptosystem, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 16 (9) (2008) 1151–1161.