2012 IEEE International Conference on Information Science and Technology Wuhan, Hubei, China; March 23-25, 2012

A Two-stage Pipelined Architecture for Parallel Modular Exponentiation

Tao Wu, Shuguo Li, and Litian Liu

Abstract— In 1998, Koç and Hung proposed a modular multiplication algorithm with carry-save additions, in which a sign-detection method in carry-save logic is utilized. In this work, the carry-save additions in their algorithm are divided into two pipelined stages, so that the Montgomery ladder can be used to interleave two adjacent modular multiplications. A full modular exponentiation can be performed efficiently in such an architecture.

I. INTRODUCTION

Efficient implementation of modular exponentiation is of great significance for RSA (Rivest, Shamir and Adleman) cryptography [1] and the Diffie-Hellman key exchange protocol [2]. The Montgomery powering ladder is proposed in [3] to speed up modular exponentiation; it performs the left-to-right binary method with two parallel modular multiplications. In the field of elliptic curve cryptography [4], the Montgomery ladder has also been employed [6], [7]. Besides the acceleration from parallelism, the Montgomery ladder also makes the arithmetic uniform, i.e., the operations for exponent bits 0 and 1 are the same. Based on this idea, an RSA hardware unit resistant to fault and simple power attacks has been reported in [5].

In this work, by combining the Montgomery ladder with Koç and Hung's modular multiplication algorithm [8], a two-stage pipelined hardware unit is designed to perform modular exponentiation. Such a unit enjoys good performance and resistance to fault and simple power attacks.

The remaining parts of this paper are organized as follows: Section II introduces the Montgomery ladder for parallel modular exponentiation; Section III introduces Koç and Hung's modular multiplication algorithm, in which the sign detection of a carry-save number is utilized; Section IV describes the proposed two-stage pipelined modular exponentiation algorithm; Section V gives the experimental result of the hardware implementation; and the last section concludes the paper.

II. MONTGOMERY LADDER

Usually, a modular exponentiation is performed by the binary algorithm from left to right or from right to left, as shown in Algorithm 1 and Algorithm 2, respectively.

This work was supported by the National Natural Science Foundation of China (No. 61073173). Tao Wu is with the Department of Microelectronics and Nanoelectronics, Tsinghua University, Beijing 100084, P.R. China (email: [email protected]). Shuguo Li and Litian Liu are with the Institute of Microelectronics, Tsinghua University, Beijing 100084, P.R. China (emails: [email protected], [email protected]).

978-1-4577-0345-4/12/$26.00 ©2012 IEEE


Algorithm 1 Left-to-right modular exponentiation
Require: $A$, $M$ and $k$ are all $n$-bit binary numbers, with $k = (k_{n-1} \ldots k_1 k_0)_2$.
Ensure: $S = A^k \bmod M$.
1: $S := A$;
2: for $i = n-2$ to $0$ do
3:     $S := S^2 \pmod{M}$;
4:     if $k_i = 1$ then
5:         $S := S \cdot A \pmod{M}$;
6:     end if
7: end for
8: return $S$.

Algorithm 2 Right-to-left modular exponentiation
Require: $A$, $M$ and $k$ are all $n$-bit binary numbers, with $k = (k_{n-1} \ldots k_1 k_0)_2$.
Ensure: $S = A^k \bmod M$.
1: $S := 1$, $T := A$;
2: for $i = 0$ to $n-1$ do
3:     if $k_i = 1$ then
4:         $S := S \cdot T \pmod{M}$;
5:     end if
6:     $T := T^2 \pmod{M}$;
7: end for
8: return $S$.
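As a software reference for Algorithms 1 and 2, the following Python sketch mirrors the two loops. It is an illustration rather than part of the proposed hardware: the function names are hypothetical, and the bit length $n$ is taken from `k.bit_length()`, which matches the algorithms' implicit assumption that the top bit $k_{n-1}$ is 1.

```python
def modexp_left_to_right(a: int, k: int, m: int) -> int:
    """Algorithm 1: scan the exponent bits from the most significant end."""
    if k == 0:
        return 1 % m
    s = a % m  # the leading 1 bit of k is consumed by this initialization
    for i in range(k.bit_length() - 2, -1, -1):
        s = (s * s) % m          # one squaring for every remaining bit
        if (k >> i) & 1:
            s = (s * a) % m      # extra multiplication only when the bit is 1
    return s


def modexp_right_to_left(a: int, k: int, m: int) -> int:
    """Algorithm 2: scan the exponent bits from the least significant end."""
    s, t = 1 % m, a % m
    for i in range(k.bit_length()):
        if (k >> i) & 1:
            s = (s * t) % m      # does not depend on the squaring below
        t = (t * t) % m          # so the two products can run in parallel
    return s
```

In the right-to-left loop the product `s * t` and the squaring `t * t` read the same old value of `t`, which is why two multipliers can work side by side; in the left-to-right loop the multiplication has to wait for the squaring.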

On average, an $n$-bit modular exponentiation requires $1.5n$ modular multiplications for both the left-to-right and the right-to-left algorithms. From the point of view of parallelism, there are about $1.5n$ sequential steps in the left-to-right algorithm, and each step uses only one modular multiplier. In the right-to-left algorithm there are only $n$ sequential steps no matter what the exponent reads, but two parallel modular multipliers are needed to achieve this.

By contrast, the Montgomery ladder also performs modular exponentiation by the binary algorithm from left to right, and it employs two parallel modular multipliers like the right-to-left algorithm. The Montgomery ladder algorithm is shown as Algorithm 3, by which an $n$-bit modular exponentiation requires about $n$ sequential steps and $2n$ modular multiplications. Evidently, the computational effort increases from $1.5n$ to $2n$ modular multiplications (a modular squaring is counted as a modular multiplication). However, compared with the original left-to-right algorithm, it reduces the $1.5n$ sequential steps to $n$ steps; and compared with the right-to-left algorithm, it requires no additional hardware resources and enjoys the same performance. Moreover, the Montgomery ladder processes exponent bits 0 and 1 in the same way, which protects the hardware unit from fault and simple power attacks [5].

Algorithm 3 Montgomery ladder for modular exponentiation
Require: $A$, $M$ and $k$ are all $n$-bit binary numbers, with $k = (k_{n-1} \ldots k_1 k_0)_2$.
Ensure: $R = A^k \bmod M$.
1: $S_0 := 1$, $S_1 := A$;
2: for $i = n-1$ to $0$ do
3:     if $k_i = 1$ then
4:         $S_0 := S_0 \cdot S_1 \pmod{M}$,
5:         $S_1 := S_1 \cdot S_1 \pmod{M}$;
6:     else
7:         $S_1 := S_0 \cdot S_1 \pmod{M}$;
8:         $S_0 := S_0 \cdot S_0 \pmod{M}$;
9:     end if
10: end for
11: return $S_0$.

In Algorithm 3, Lines 4 and 5 are computed in one step, and Lines 7 and 8 are likewise computed in one step. In each loop, $S_0$ is the expected result, while $S_1$ is the modular product of the expected result with $A$. It is beneficial to regard $S_1$ as a pre-computed number for the next loop.

III. KOÇ AND HUNG'S MODULAR MULTIPLICATION ALGORITHM

In [8], a fast modular multiplication based on carry-save addition (CSA) is proposed. It performs modular reduction by successive subtractions of the modulus, and incorporates a sign detection method for numbers in CSA form. In Algorithm 4, the sign detection function 'Estim($S$, $C$)' uses a few most significant bits to detect the sign of $(S, C)$ [8]. On Line 3, the bit 'pos' indicates a number greater than or equal to $2\varepsilon$, and the bit 'neg' indicates a number less than or equal to $-4\varepsilon$. The validity of the sign detection is explained in subsection III-A. On Line 4, the 'Decode' function obtains $q_i = 1$, $-1$, or $0$ when {pos, neg} equals 10, 01 or 00, respectively.

A. Sign Estimation in Modular Multiplication

Define $X$ and $Y$ as $m$-bit numbers with $X = (S, C) := S + C$ and $Y = T(S) + T(C)$, where $T(S)$ and $T(C)$ replace the least significant $t$ bits of $S$ and $C$ by zeros. Let $\varepsilon = 2^t$, $X_U < 2^{m-1} - 2\varepsilon$, $m > t$, and

$$-2^{m-1} + 2^{t+1} < X < 2^{m-1} - 2^{t+1}. \tag{1}$$

Set $X_U = X + 2^m X_{m-1}$ and $Y_U = Y + 2^m Y_{m-1}$, where $X_{m-1}$ and $Y_{m-1}$ are the sign bits; these are the unsigned values of the $m$-bit words $X$ and $Y$. The constraint on $X$ can then be divided into two branches:
1) $0 \le X_U \le 2^{m-1}$, $X \ge 0$;
2) $2^{m-1} + 2\varepsilon \le X_U < 2^m$, $X < 0$.
As is given in [8], the sign of the CSA form can be detected from the corresponding ranges of $Y$:


∙ If $Y \ge 2\varepsilon$, then $X \ge 2\varepsilon$;
∙ If $Y \le -4\varepsilon$, then $X < -2\varepsilon$;
∙ If $-4\varepsilon < Y < 2\varepsilon$, then $-3\varepsilon \le X < 3\varepsilon$.

The above three cases can be explained as follows:
1) If $Y \ge 2\varepsilon > 0$, then $0 < 2\varepsilon \le Y = Y_U < 2^{m-1}$. Because $Y \le X < Y + 2^{t+1}$, we have $2\varepsilon \le X_U < 2^{m-1} + 2\varepsilon$. This does not agree with the second branch; therefore the first branch must hold, and one gets $X > 0$ as expected.
2) If $Y \le -4\varepsilon$, then $Y + 2^{t+1} \le -2\varepsilon$. Thus $X < Y + 2^{t+1} \le -2\varepsilon < 0$.
3) Finally, if $-4\varepsilon < Y < 2\varepsilon$, then $-4\varepsilon < Y \le X < Y + 2^{t+1} < 4\varepsilon$. Because $Y$ is a multiple of $\varepsilon$, we have $-3\varepsilon \le Y \le \varepsilon$, and therefore $-3\varepsilon \le X < 3\varepsilon$.

Algorithm 4 Koç and Hung's modular multiplication algorithm
Require: $A$, $B$ and $M$ are all $n$-bit binary numbers with $0 \le A < M$, $0 \le B < M$. Set $\hat{M} = 8M$ and $\hat{A} = 8A$. Also, the sign-detection parameter is $t = n-1$, with $\varepsilon = 2^t = 2^{n-1}$. All intermediate values have $m$ binary bits with $m = n+4$, and the most significant bit is the sign bit.
Ensure: $R = A \cdot B \pmod{M}$.
1: $S := 0$, $C := 0$;
2: for $i = n+2$ to $0$ do
3:     (pos, neg) = Estim($S$, $C$);
4:     $q_i$ = Decode(pos, neg);
5:     if $q_i = 1$ then
6:         $(S, C) := 2S + 2C + \hat{A}_i B + M$;
7:     else if $q_i = -1$ then
8:         $(S, C) := 2S + 2C + \hat{A}_i B - M$;
9:     else
10:         $(S, C) := 2S + 2C + \hat{A}_i B$;
11:     end if
12: end for
13: $(S', C') := S + C + \hat{M}$;
14: $R_0 := S + C$;
15: $R_1 := S' + C'$;
16: if $R_0 < 0$ then
17:     $R_2 := R_1$;
18: else
19:     $R_2 := R_0$;
20: end if
21: $R = R_2/8$;
22: return $R$.
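The three cases of subsection III-A, on which the function Estim in Algorithm 4 relies, can be checked exhaustively for small word sizes. The Python sketch below is a plain-integer model under stated assumptions: signed Python integers stand in for the $m$-bit words, no wraparound is modelled, and the names `truncate`, `estim` and `check_cases` are illustrative rather than taken from [8].

```python
def truncate(v: int, t: int) -> int:
    """T(v): clear the t least significant bits (floor to a multiple of 2**t)."""
    return v & ~((1 << t) - 1)


def estim(s: int, c: int, t: int) -> tuple:
    """Return (pos, neg) computed from the truncated sum Y = T(S) + T(C)."""
    eps = 1 << t
    y = truncate(s, t) + truncate(c, t)
    return y >= 2 * eps, y <= -4 * eps


def check_cases(m: int, t: int) -> int:
    """Verify the three cases for every signed m-bit S and C whose sum obeys (1)."""
    eps = 1 << t
    lo, hi = -(1 << (m - 1)), 1 << (m - 1)
    checked = 0
    for s in range(lo, hi):
        for c in range(lo, hi):
            x = s + c
            if not (lo + 2 * eps < x < hi - 2 * eps):
                continue  # X violates constraint (1), skip it
            pos, neg = estim(s, c, t)
            if pos:
                assert x >= 2 * eps
            elif neg:
                assert x < -2 * eps
            else:
                assert -3 * eps <= x < 3 * eps
            checked += 1
    return checked
```

Calling `check_cases(8, 3)` walks through all pairs of 8-bit words with $\varepsilon = 8$ and raises an `AssertionError` if any case fails. The check only needs $Y \le X < Y + 2\varepsilon$, which is the property used in the explanation above.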

IV. PROPOSED TWO-STAGE PIPELINED PARALLEL MODULAR EXPONENTIATION

In [9], [10] a common-multiplicand method is proposed for the right-to-left modular exponentiation algorithm; the kernel idea stems from the common computation of the Montgomery modular reduction of $S \cdot \beta^{-i}$ for $i = 1, 2, \ldots, n-2$. This idea shares the intermediate results of the same computation, and can be seen as a dynamic-programming strategy.
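Since this section builds on the Montgomery ladder of Algorithm 3, a compact software rendering of the ladder may serve as a reference point. The Python sketch below is illustrative only: it uses ordinary integer arithmetic instead of the carry-save multiplier, and the function name and the returned multiplication count are assumptions added for the example.

```python
def montgomery_ladder(a: int, k: int, m: int, n: int) -> tuple:
    """Algorithm 3: return (A**k mod M, number of modular multiplications)."""
    s0, s1 = 1 % m, a % m
    mults = 0
    for i in range(n - 1, -1, -1):
        if (k >> i) & 1:
            # Lines 4-5: both products read the old S1
            s0, s1 = (s0 * s1) % m, (s1 * s1) % m
        else:
            # Lines 7-8: both products read the old S0
            s1, s0 = (s0 * s1) % m, (s0 * s0) % m
        mults += 2  # two multiplications per exponent bit, whatever its value
    return s0, mults
```

For example, `montgomery_ladder(7, 13, 33, 4)` performs 8 multiplications for the 4-bit exponent 13. In every iteration the two products share one operand and do not depend on each other, so they can be issued to two multipliers or interleaved on one.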

It is obvious that common multiplicands, $S_0$ or $S_1$, also exist in the Montgomery ladder algorithm, as shown in Algorithm 3. Unlike sharing common results of Montgomery modular reductions as in [9], we can design a pipelined architecture for Koç and Hung's modular multiplication algorithm, which shares the hardware instead. It can be regarded as a time-multiplexing strategy.

A. Two-stage Pipelined Modular Multiplication

We can separate Koç and Hung's algorithm into two pipelined stages, as shown in Algorithm 5. As shown in Fig. 1, the operations at Lines 3∼4 can be performed in one clock cycle, while the operations at Lines 5∼6 can also be performed in the same clock cycle. The control signal 'init' initializes $(S, C)$ to zeros.

Algorithm 5 Pipelined Koç and Hung's modular multiplication
Require: $A$, $B$ and $M$ are all $n$-bit binary numbers with $0 \le A < M$, $0 \le B < M$. Set $\hat{M} = 8M$ and $\hat{A} = 8A$. Also, the sign-detection parameter is $t = n-1$, with $\varepsilon = 2^t = 2^{n-1}$. All intermediate values have $n+4$ binary bits, with the most significant bit as the sign bit.
Ensure: $R = A \cdot B \pmod{M}$.
1: $S := 0$, $C := 0$;
2: for $i = n+2$ to $0$ do
3:     (pos, neg) = Estim($S$, $C$);
4:     $(S, C) := 2S + 2C + \hat{A}_i B$;
5:     $q_i$ = Decode(pos, neg);
6:     $(S, C) := S + C + q_i \cdot M$;
7: end for
8: $(S', C') := S + C + \hat{M}$;
9: $R_0 := S + C$;
10: $R_1 := S' + C'$;
11: if $R_0 < 0$ then
12:     $R_2 := R_1$;
13: else
14:     $R_2 := R_0$;
15: end if
16: $R = R_2/8$;
17: return $R$.

Fig. 1. Pipelined two-stage modular multiplication

B. Interleaved Modular Multiplications for Modular Exponentiation

In Algorithm 5, two clock cycles are required to process 1 bit of the modular multiplication. Therefore, another modular multiplication can be plugged into such a two-stage architecture, as demonstrated in Fig. 2. We can arrange the control logic so that the parallel computations of Lines 4∼5 or Lines 7∼8 of Algorithm 3 are processed by only one modular multiplier. In this situation, the two multiplications are interleaved with each other. Fig. 2 shows the bit-serial inputs of $S_0$ and $S_1$, from $S_{0,n+3}$ and $S_{1,n+3}$ to $S_{0,0}$ and $S_{1,0}$.

Fig. 2. Interleaved modular multiplications in two stages by Montgomery ladder

When Algorithm 5 is used in a full modular exponentiation, during each loop the results must be converted from CSA to the nonredundant representation. For hardware implementation, we have used a carry-select adder to sequentially accumulate the sum. As shown in Fig. 3, a $k$-word carry-select adder (CSLA) can process $k \cdot w$ bits of $X + Y$, with $w$ being the word size and $X^{(i)}$, $Y^{(i)}$ being the $(i+1)$-th words. To reduce the path delay, the two 1024-bit additions are interleaved in the two-clock CSLA over multiple cycles. In this work, with $w = 15$ and $k = 3$, it requires $\lceil 1024/(k \cdot w) \rceil \times 2 = 46$ clock cycles to complete the conversion of two 1024-bit CSA-form integers.

Fig. 3. Carry-select adder in two clock cycles
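The word-serial conversion and the cycle count quoted above can be illustrated with a small behavioural model. This Python sketch is an assumption-laden illustration, not the authors' circuit: `csla_add` is a hypothetical name, each $w$-bit word computes both speculative sums and selects one by the incoming carry, and the pass count simply follows $\lceil \text{bits}/(k \cdot w) \rceil$ from the text.

```python
from math import ceil


def csla_add(x: int, y: int, bits: int, w: int, k: int) -> tuple:
    """Add two non-negative `bits`-wide operands word by word.

    Returns (x + y, number of passes of a k-word adder with w-bit words).
    """
    mask = (1 << w) - 1
    words = ceil(bits / w)
    passes = ceil(bits / (k * w))
    carry, total = 0, 0
    for i in range(words):
        xi = (x >> (i * w)) & mask
        yi = (y >> (i * w)) & mask
        sum0 = xi + yi        # speculative sum for carry-in 0
        sum1 = xi + yi + 1    # speculative sum for carry-in 1
        chosen = sum1 if carry else sum0  # selection by the actual carry
        total |= (chosen & mask) << (i * w)
        carry = chosen >> w
    return total | (carry << (words * w)), passes


x, y = (1 << 1024) - 1, (1 << 1023) + 12345
result, passes = csla_add(x, y, bits=1024, w=15, k=3)
conversion_cycles = 2 * passes  # factor 2 as in the expression in the text
```

With $w = 15$ and $k = 3$, `passes` is 23 and `conversion_cycles` is 46, which agrees with the figure given for converting two 1024-bit CSA-form integers.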

V. EXPERIMENT RESULT

Using the two-stage pipelined architecture for modular multiplications, we have described a 1024-bit modular exponentiation unit in the Verilog hardware description language and implemented it on a Xilinx XC2V6000 FF1517-6 FPGA. The synthesis tool is Synplify Pro 9.6.2, and the place-and-route tool is Xilinx ISE 10.1. The implementation result is compared with other designs in the literature in Tab. I.
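The exponentiation times attributed to [11] and [12] in this comparison are scaled estimates rather than measured values. The arithmetic behind them, using only the numbers quoted in this section, can be reproduced with the short Python sketch below; the variable names are illustrative, and the formulas are copied from the text without further interpretation.

```python
# Estimate for [11]: throughput of 265.44 kbps scaled by the frequency ratio 215.83/550
throughput_kbps = 265.44 * 215.83 / 550
t_1k_ref11_ms = 1000 / throughput_kbps  # same as 1000 * 550 / (265.44 * 215.83)

# Estimate for [12]: encryption time of 0.21 ms scaled by 1024 * 1.5 / 18
t_1k_ref12_ms = 0.21 * 1024 * 1.5 / 18

print(f"[11]: {t_1k_ref11_ms:.2f} ms, [12]: {t_1k_ref12_ms:.2f} ms")
```

Rounded to two decimals, the two values are 9.60 ms and 17.92 ms, matching the entries for [11] and [12] in Table I.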

TABLE I
HARDWARE IMPLEMENTATION OF 1024-BIT FULL MODULAR EXPONENTIATION

Design      Platform   Max Freq. (MHz)   Area (slices)   1024-bit modular exponentiation time (ms)
This work   XC2V6000   155.38            16808           14.24
[11]        XC2V6000   215.83            ⩾ 15826         9.60
[12]        XC2V6000   97.08             ⩾ 23208         17.92

The time for a 1024-bit full exponentiation ($T_{1k}$) in [11] is computed from the throughput and the ratio of frequencies in different technologies. In [11], the throughput for full modular exponentiation on the FPGA platform is estimated as $v = 265.44 \times 215.83/550$ kbps, and

$$T_{1k} = \frac{1000 \times 550}{265.44 \times 215.83} = 9.60\ \text{(ms)}.$$

In [12], $T_{1k}$ is estimated from the encryption time with the public key $E = 2^{16} + 1$. Multiplying the encryption time 0.21 ms by $1024 \times 1.5/18$ yields 17.92 ms. Meanwhile, the area cost is supposed to be larger for full modular exponentiation than for an RSA encryption, so a '⩾' sign is added before the areas of [12] and [11].

The performance of this work is inferior to that in [11] for the following reasons:
∙ Assuming an average Hamming weight of the exponent and taking into account the acceleration of the Montgomery ladder, the latency to process one bit of the multiplicand in Koç and Hung's modular multiplication is about (2/1.5) clock cycles. By contrast, the latency in [11] is only 1 clock cycle.
∙ The critical path of this work is constrained by the sign-estimation logic of an integer in carry-save form, so we have moved some logic from Stage 1 to Stage 2 to optimize it. The critical path is then longer than one 3:2 carry-save compression, which is the critical path in [11].
∙ The critical path in [11] may increase a little after place and route.

Meanwhile, this design for full modular exponentiation still performs well compared with many peer works, such as that in [12].

Besides, the proposed modular exponentiation algorithm enjoys some advantages that [11] does not have:
∙ It can be applied to modular exponentiations with respect to both odd and even moduli, and there is no precomputation. These two characteristics are inherited from Koç and Hung's modular multiplication algorithm.
∙ The interleaved architecture helps make the system immune to fault and simple power attacks, by a mechanism similar to that in [5].

VI. CONCLUSION

In this work, we have proposed a two-stage interleaved architecture to perform full modular exponentiation. This architecture is based on Koç and Hung's modular multiplication algorithm and the Montgomery ladder. The hardware implementation result shows that the proposed idea is efficient for modular exponentiation. Meanwhile, it is immune to fault and simple power attacks due to the interleaved architecture.

REFERENCES

[1] R.L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, pp. 120-126, 1978.
[2] W. Diffie and M.E. Hellman, "New directions in cryptography," IEEE Transactions on Information Theory, vol. 22, pp. 644-654, 1976.
[3] M. Joye and S.M. Yen, "The Montgomery Powering Ladder," Cryptographic Hardware and Embedded Systems (CHES 2002), Lecture Notes in Computer Science, vol. 2523, pp. 291-302, Springer-Verlag, 2003.
[4] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography. Springer-Verlag, New York, 2004.
[5] A.P. Fournaris, "Fault and simple power attack resistant RSA using Montgomery modular multiplication," IEEE International Symposium on Circuits and Systems, 2010, pp. 1875-1878.
[6] J. Bajard, S. Duquenne, and N. Meloni, "Combining Montgomery Ladder for Elliptic Curves Defined Over $F_p$ and RNS Representation," Research Report LIRMM, vol. 6041, 2006.
[7] S. Antão, J. Bajard, and L. Sousa, "Elliptic Curve Point Multiplication on GPUs," 21st IEEE International Conference on Application-specific Systems, Architectures and Processors, 2010, pp. 192-199.
[8] Ç.K. Koç and C.Y. Hung, "A Fast Algorithm for Modular Reduction," IEE Proceedings on Computer and Digital Techniques, vol. 145, no. 4, pp. 265-271, 1998.
[9] J.C. Ha and S.J. Moon, "A common-multiplicand method to the Montgomery algorithm for speeding up exponentiation," Information Processing Letters, vol. 66, no. 2, pp. 105-107, 1998.
[10] C.L. Wu, D.C. Lou, and T.J. Chang, "An Efficient Montgomery Exponentiation Algorithm for Public-Key Cryptosystems," International Conference on Intelligence and Security Informatics, 2008, pp. 284-285.
[11] M.D. Shieh, J.H. Chen, W.C. Lin, and H.H. Wu, "A New Algorithm for High-Speed Modular Multiplication Design," IEEE Transactions on Circuits and Systems-I, vol. 56, no. 9, pp. 2009-2019, 2009.
[12] C. McIvor, M. McLoone and J.V. McCanny, "Modified Montgomery modular multiplication and RSA exponentiation techniques," IEEE Proceedings on Computers and Digital Techniques, vol. 151, no. 6, pp. 402-408, 2004.