VLSI Design of RSA Cryptosystem Based on the Chinese Remainder ...

JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 17, 967-980 (2001)

Short Paper________________________________________________ VLSI Design of RSA Cryptosystem Based on the Chinese Remainder Theorem CHUNG-HSIEN WU, JIN-HUA HONG* AND CHENG-WEN WU Department of Electrical Engineering National Tsing Hua University Hsinchu, 300 Taiwan *Department of Electrical Engineering Da-Yeh University Changhua, 515 Taiwan

This paper presents the design and implementation of a systolic RSA cryptosystem based on a modified Montgomery’s algorithm and the Chinese Remainder Theorem (CRT) technique. The CRT technique improves the throughput rate up to 4 times in the best case. The processing unit of the systolic array has 100% utilization because of the proposed block interleaving technique for multiplication and square operations in the modular exponentiation algorithm. For 512-bit inputs, the number of clock cycles needed for a modular exponentiation is about 0.13 to 0.24 million. The critical path delay is 6.13ns using a 0.6µm CMOS technology. With a 150 MHz clock, we can achieve an encryption/decryption rate of about 328 to 578 Kb/s. Keywords: cellular array, Montgomery algorithm, modular multiplication, Chinese remainder theorem (CRT), RSA, public-key cryptography

1. INTRODUCTION Electronic communication technology has advanced very rapidly over the past few decades, creating new applications and opportunities along the way. Today we can send a multimedia message to, or receive one from virtually anyone around the world in seconds through the internet. To protect transmitted data from eavesdropping by someone other than the desired receiver, we need to disguise the message before sending it into a non-secure communication channel. This is achieved by a cryptosystem. In 1978, Rivest, Shamir, and Adleman invented a method to implement the public-key cryptosystem, which is known as the RSA cryptosystem [1]. Since it provides high security and is easy to implement, so it quickly became the most widely used public-key cryptosystem. However, developing an inexpensive hardware device for real-time RSA encryption and decryption is still a challenge. Finding an efficient hardware implementation for RSA is one of the important tasks that remain to be done in cryptography. In the RSA cryptosystem, both encryption and decryption are modular exponentiation, which can be done by a sequence of modular multiplications. Many modular mulReceived January 31, 2001; accepted July 10, 2001. Communicated by Chi Sung Laih.

967

968

CHUNG-HSIEN WU, JIN-HUA HONG AND CHENG-WEN WU

tipliers have been proposed (see, e.g., [2-4]). One of the most important algorithms was proposed by Montgomery in 1985 [5]. Montgomery’s algorithm needs n iterations in each modular multiplication and two additions per iteration, where n is the word length. Cellular arrays based on Montgomery's algorithm can be found in [6-8]. In this algorithm an odd modulus is assumed. It calculates the modular multiplication without performing division. Some improved algorithms were later proposed, either involving reducing the number of iterations or removing the final modular reduction step. A modified Montgomery's algorithm was reported in [9], where the multiplication and modular reduction steps in Montgomery’s algorithm are separated so that only one addition is required in each iteration. However, the number of iterations in the modified algorithm is two times that of Montgomery’s, and hence the overall computation time is not reduced. In [10-11], an algorithm was further improved to reduce the number of iterations, doubling the speed of modular multiplication. Another way to reduce the computation time is to use the Chinese Remainder Theorem (CRT) technique, since CRT is known to reduce the RSA computation by a divide-and-conquer method. In this paper, we present the design and implementation of a systolic RSA cryptosystem based on a modification of Montgomery’s algorithm and the CRT technique. The systolic array has a 100% utilization ratio because of a novel block interleaving approach. The throughput of the proposed design is much higher than for previous ones. For a 512-bit input, for example, the number of clock cycles needed for a modular exponentiation is only about 0.13 to 0.24 million. The layout and post-layout simulation of the 512-bit RSA chip has been finished, and the critical path delay is 6.13ns using a 0.6µm CMOS technology. Under a 150MHz clock, the decryption rate will be about 328 to 578 Kb/s.

2. MODULAR EXPONENTIATION ALGORITHM The RSA cryptosystem was invented by Rivest, Shamir, and Adleman [1]. It is, so far, the most commonly used public-key cryptographic algorithm, and when sufficiently long keys are used it is considered secure. The security of RSA relies on the difficulty of factoring large integers. There are three keys in RSA—N, E, and D. We first choose two different large prime numbers P and Q, and let their product be N, i.e., N = P×Q. Because it is difficult to factor large numbers, we can safely make N public without worrying that someone will be able to recover our secret numbers P and Q. We now choose a large number E that has no factors in common with the integer (P – 1) × (Q – 1), i.e., gcd ((P – 1)×(Q – 1), E) = 1. Then, we compute and keep secret a number D where E×D is congruent to 1 modulo (P – 1)×(Q – 1), i.e., D is the multiplicative inverse of E, modulo (P – 1)×(Q – 1). Note that we can find D by using the extended Euclidean algorithm. To encrypt a message using the encryption key (E, N), we first partition the message into a sequence of blocks and consider each block M as an integer between 0 and N ﹣1. Then, we encrypt the message by raising M to the Eth power modulo N, i.e., C = ME (mod N).

(1)

Similarly, to decrypt the ciphertext C using the decryption key (D, N), we raise C to

CRT-BASED VLSI DESIGN OF RSA CRYTOSYSTEM

969

the power of D modulo N, i.e., M = CD (mod N).

(2)

Clearly, modular exponentiation is the main operation in the RSA algorithm. For modular exponentiation, a sequence of modular multiplications can be performed instead. If we represent the exponent D in its binary form, i.e., D = dk-1dk-2…d1 d0, then k-1

2

CD = C2 dk-1 + … + 2 d2 + 2d1 +d0 k-1 2 = C2 dk-1 … C2 d2．M2d1．Md0.

(3)

Clearly, this can be done by a sequence of squaring and multiplication operations, by scanning the bits of D either from high-order to low-order, or in the opposite direction. These two methods are known as the H-Algorithm and L-Algorithm, respectively [12]. Both require n iterations for an n-bit exponent. Each iteration contains two modular multiplications. 2.1 Modular Multiplication In 1985, Montgomery proposed a modular multiplication algorithm without division [5]. Let N be an n-bit odd integer. The operands A = an-1…a1a0 and B = bn-1…b1b0 are also n-bit integers, where 0 ≤ A, B < N. Montgomery’s procedure for computing A × B (mod N) is MG (A, B, N) { S[0] = 0; for (i = 0; i < n; i++) { qi = (S[i] + aiB) (mod 2); S[i+1] = (S[i] + aiB + qiN)/2; } return S[n]; } Our approach here is based on the foundation established in [9-11], which is an improved version of Montgomery’s algorithm. Suppose N, A, and B are defined as above. We extend A and B to (n + 1) bits by appending a leading zero to the MSB, so 0 < A, B, < 2n+1. Let X = A × B, where X = (x2n+1…x1 x0) is a (2n + 2)-bit integer. We can repre2 n +1 n −1 n +1 sent it as X = ∑ i = 0 xi2i = 2n+2XM + XL, where XM =∑i=0 xi+n+22i and XL = ∑i =0 xi2i. The modular multiplication algorithm is MM (A, B, N) { A × B = X = XM‧2n+2 + XL; S[0] = 0; for (i = 0; i < n+2; i ++){ qi = (S[i] + xi) (mod 2);

970


S[i+1] = (S[i] + xi + qiN)/2;} S[n+3] = S[n+2]+ XM; return S[n+3]; } The advantage of procedure MM() is that the partial products are always in the range (0, 2n+1), which is the same as the input range. Hence, the modular reduction is removed. Based on MM() and the L - algorithm, we obtain the modular exponentiation algorithm MLA() shown below. Note that by letting R' = 2n+2 and C = R'2 (mod N), the final result Pk+1 will be equal to ME (mod N). That is, we let M0 = M × 2n+2(mod N) and P0 = 2n+2 initially to guarantee that the finial result is ME (mod N). MLA (C, D, N, R) { C0 = MM (R, C, N); P0 = MM (R, 1, N); for (i = 0; i < k; i ++){ Ci+1 = MM (Ci, Ci, N) if (di = 1) Pi+1 = MM (Ci, Pi, N); else Pi+1 = Pi;} Pk+1 = MM (1, Pk, N); return Pk+1; }

//Initialization

//Squaring //Multiplication

//Post-process

The intermediate values of MLA() are listed in Table 1. Note that the finial result Pk+1 is equal to CD (mod N) and falls in the range (0, N). We keep all the intermediate values Mi and Pi in the range (0, N) during each iteration, and only reduce Pn to the range (0, N) in the finial step of the MLA() algorithm. Table 1. Intermediate value of the MLA-Algorithm.

i Ci Pi

0 C．2n+2 1．2n+2

1 C2．2n+2 Cd0．2n+2

… … …

k k C2 ．2n+2 CD．2n+2

k+1 CD

2.2 RSA Computation Using CRT The Chinese Remainder Theorem [13] was announced around A.D. 100 by a Chinese mathematician who solved the problem of finding those integers X that leave remainders 2, 3, and 2 when divided by 3, 5, and 7, respectively. The solutions are of the form 23 + 105k for all non-negative integers k. Mathematically the problems can be stated as finding integers X, given its remainders when divided by several numbers m0, m1, …, mk-1. Theorem 1 (Chinese Remainder Theorem) Let m0, m1, …, mn-1 be pairwise relatively prime positive integers and let x0, x1, …, xn-1 be


971

any integers which satisfy the linear congruence system in one variable given by X ≡ X ≡     X ≡

x0 (mod m0) x1 (mod m1) M xn-1 (mod mn-1)

has a unique solution modulo m0, m1, …, mn-1. The RSA decryption and signature operation can be speeded up by using the CRT [3, 8, 14], where the factors of the modulus N (i.e., P and Q) are assumed to be known. By using the CRT, the computation of M = CD (mod N) can be partitioned into two parts: MP = CPDP (mod P), MQ = CQDQ (mod Q),

(4) (5)

where CP = C (mod P), DP = D (mod P－1), CQ = C (mod Q), DQ = D (mod Q－1). This reduces computation time since DP, DQ < D and CP, CQ < C. In fact, their sizes are about half the original sizes. In the ideal case we can have a speedup of about 4 times. Finally, we compute M by CRT as follows: M = [MP (Q-1 (mod P)) Q + MQ (P-1 (mod Q)) P] (mod N).

(6)

In the CRT, we can precalculate DP, DQ, (Q-1 (mod P)) Q and (P-1 (mod Q)) P. But of course, we cannot compute CP and CQ until we receive C. Also, the computation of M is done after we have obtained MP and MQ. The extra operations are CP = C (mod P), CQ = C (mod Q) and M = [MP (Q-1 (mod P))Q + MQ (P-1 (mod Q))P] (mod N). Compared with the calculation of MP and MQ, the time spent on the extra operations is negligible. Therefore, in most cases, the speedup is close to 4 times.

3. CRT-BASED RSA CHIP IMPLEMENTATION Our design consists of a modular multiplier and some logic to control the data interleaving and scheduling mechanism. We follow the standard systolic array design flow [15] to obtain the systolic RSA cryptosystem according to the MM() and MLA() algorithms described above. We then add a partitioning circuit to our systolic RSA architecture to obtain two distinct modular multipliers which can be used for computing MP D D = C P (mod P) and MQ = C Q (mod Q) concurrently. The details follow. Q

P

3.1 Multiplier The dependence graph (DG), the hyperplanes, and the schedule vector [15] of the

972


Fig. 1. The DG and SFG of a 4- bit multiplier.

systolic multiplier are shown in Fig. 1 [16]. The utilization rate of this multiplier is only 50%. However, this can be doubled by a novel block interleaving technique [17], processing two sets of data in the multiplier during each iteration cycle, i.e., MM (Ci, Ci, N) and MM (Ci, Pi, N). 3.2 Modular Reduction Unit In the modular reduction operation, the value qi is generated by XORing the LSBs of S[i] and xi. We use an n-bit addition and right shifts to accomplish the modular operation. Fig. 2 shows the DG, hyperplanes, schedule vector, cell structures, and SFG of the modular reduction unit [17]. Note that the Z cell is the combination of the X and Y cells, so we need extra control logic to realize the cell function. This circuit can also be interleaved by two sets of data in way similar to as the multiplier discussed above. We combine the systolic array multiplier and modular reduction unit to form a modular multiplier, which is shown in Fig. 3(a) [16]. 3.3 Composite Modular Multiplier By the CRT technique, we partition the modular multiplier into two smaller ones D according to the lengths of P and Q. Then we compute MP = C P (mod P) and MQ = D C Q (mod Q) using two independent modular multipliers concurrently. This apparently reduces the computation time. Note that the partition is done on-line according to the lengths of P and Q, so each cell in the array must be able to read the data from the priP

Q


973

mary inputs directly, and be able to propagate the data to the primary outputs directly to finish the final step of the modular multiplication. Therefore, we use some switches, MUXes, and a control signal to achieve the goals. The dynamic partition modular multiplier is called the composite modular multiplier (CMM), and is shown in Fig. 3(b).

Fig. 2. The DG and SFG of a 4-bit modular reduction unit.

Fig. 3. (a) Systolic array modular multiplier. (b) The composite modular multiplier.

974


The signal split that is shifted into the register formed by chaining the D-FFs in all blocks determines the partition location. The partition circuit is shown in Fig. 4(a). We extend the prime number Q to an n-bit integer with leading zeros and shift it into the n-bit shift register Q_Reg initially, where n is the length of the modulus N, i.e., n = [log2N]. The clock signal CLK is disabled until the MSB of Q_Reg is high. The two shift registers, Q_Reg and Split_Reg, perform the left-shift function. Finally, the information in the Split_Reg indicates the partition location. Note that there is only a single bit that is high in the Split_Reg. Thus, we have two smaller modular multipliers that can operate concurrently to perform the original modular multiplication. This reduces the execution time for the RSA decryption and signature operations. In a typical RSA cryptosystem, once the keys are generated, they are fixed until the new keys are generated, i.e., P and Q are fixed after key generation. Therefore, the partition of N into P and Q needs to be done only once, before the RSA computation starts.

Fig. 4. The partition circuit (a) and the modified register CON_REG (b).

3.4 CRT-Based RSA Cryptosystem We have presented the systolic-array modular multiplier based on the CRT technique. We can add some control logic to accomplish the modular exponentiation operation. The circuit for the entire RSA computation that incorporates the CRT technique is shown in Fig. 5 [16]. The details of the control register CON_REG are shown in Fig. 4 (b), where the partition scheme is the same as CMM. The control signals are supposed to come from the host controller, so they are considered as primary inputs in the core circuit design. The four inputs in1p, in2p, in1q, and in2q are used only for the initialization and post-processing steps. The control logic is responsible for timing management, data interleaving, iteration handling, etc.


975

Fig. 5. The control circuit for modular exponentiation.

As depicted in Fig. 6, the time complexity (i.e., number of clock cycles) of the CRT-based RSA computation is highly dependent on the lengths of P and Q. The best case occurs when |P| = |Q| = |N|/2. Suppose N is an n-bit integer, and the lengths of P and Q are p bits and (n－ p + 1) bits, respectively. The numbers of clock cycles required to finish a modular exponentiation using the original and the CRT-based approaches are listed in Table 2. Table 2. Complexity comparison for modular exponentiation.

Original Modular Exponentiation D Operation C (mod N) Length N:n-bit Clock cycles (2n+8) (n+2) Initial setup Unnecessary

CRT-Based Modular Exponentiation D C Q (mod Q) C (mod P) P : p-bit Q:(n－p+1)-bit (2p+8) (p+2) (2n－2p+10)(n－p+3) n+p Ep

Q

P

From the table, if P and Q have equal length, i.e., n/2, the number of clock cycles required will be the smallest. Although this is not likely to occur in practice (since P and Q are prime and P ≠ Q), their lengths should be as close as possible to ensure security. In the worst case, it is assumed that the maximum length of P or Q is 2n/3 (and the minimum is n/3).

976


Fig. 6. Time complexity of the RSA circuit.

The circuit has been implemented with a 0.6µm CMOS standard-cell library. The post-layout simulation result (see Fig. 7) shows that the critical path delay is about 6.13 ns, i.e., we expect the chip to operate at a 150MHz clock rate. For a 512-bit RSA cryptosystem, e.g., the best case needs 0.13M clock cycles, while the worst case requires 0.24M cycles to finish an RSA operation. With a 150MHz clock, the baud rate is 578K in the best case and 328K in the worst case. Fig. 8 shows the layout of the 512-bit CRT-based RSA cryptosystem. The core area is about 7653µm × 7486µm with about 109K primitive gates.

Fig. 7. Verilog-XL simulation result for Step III of Fig.5.


977

Fig. 8. Layout of the 512-bit CRT-based RSA cryptosystem.

4. CONCLUSIONS In this paper we have presented a systolic RSA cryptosystem design, based on an improved Montgomery’s modular multiplication algorithm [16, 17] and the CRT technique. The design is suitable for decryption and digital signature. The implementation of a 512-bit RSA cryptosystem with a 0.6µm CMOS standard-cell library has been done. The comparison of our implementation with other designs is summarized in Table 3, which shows a clear advantage for our approach in terms of speed. The number of clock cycles is 0.13M in the best case (when |P| = |Q| = n/2) and 0.24M in the worst case (when ||P| ﹣|Q|| = n/3). Therefore, the throughput of our design is the highest among all the designs, but with about 50% higher hardware cost. The circuit is easily extendible for large moduli due to the systolic-array design. Table 3. Performance and area comparison.

Years Gate [18] [9] [10]

1994 1996 1998

75K 77K 74K

[11] 1999 77K Ours 2000 109K

# of Tech. Clk Baud Clk (Hz) Rate 0.125M 1 25M 100K 1.05M 0.8 50M 24.3K 0. 39M 0.6 125M 164K 0.54M 118K 0.53M 0.6 250M 241K 0. 13M 0.6 150M 578K 0.24M 328K

978


REFERENCES 1. R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signatures and public-key cryptosystems,” Communications of the ACM, Vol. 21, No. 2, 1978, pp. 120-126. 2. Ç. K. Koç and C. Y. Hung, “Bit-level systolic arrays for modular multiplication,” Journal of VLSI Signal Processing, Vol. 3, No. 3, 1991, pp. 215-223. 3. Ç. K. Koç, “RSA hardware implementation,” Technical Report, RSA Laboratories, RSA Data Security, Inc., Redwood City, CA, 1995. 4. N. Takagi and S. Yajima, “Modular multiplication hardware algorithms with a redundant representation and their application to RSA cryptosystem,” IEEE Transactions on Computers, Vol. 41, No. 7, 1992, pp. 887-891. 5. P. L. Montgomery, “Modular multiplication without trial division,” Math. Computation, Vol. 44, No. 7, 1985, pp. 519-521. 6. P. Kornerup, “A systolic, linear-array multiplier for a class of right-shift algorithms,” IEEE Transactions on Computers, Vol. 43, No. 8, 1994, pp. 892-898. 7. C. D. Walter, “Systolic modular multiplication,” IEEE Transactions on Computers, Vol. 42, No. 3, 1993, pp. 376-378. 8. M. Shand and J. Vuillemin, “Fast implementation of RSA cryptography,” in Proceedings of 11th IEEE Symposium on Computer Arithmetic, 1993, pp. 252-259. 9. P.-S. Chen, S.-A. Hwang, and C.-W. Wu, “A systolic RSA public key cryptosystem,” in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), 1996, pp. 408-411. 10. C.-C. Yang, T.-S. Chang, and C.-W. Jen, “A new RSA cryptosystem hardware design based on Montgomery’s algorithm,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 45, No. 7, 1998, pp. 908-913. 11. C.-Y. Su, S.-A. Hwang, P.-S. Chen, and C.-W. Wu, “An improved Montgomery algorithm for high-speed RSA public-key cryptosystem,” IEEE Transactions on VLSI Systems, Vol. 7, No. 2, 1999, pp. 280-284. 12. D. E. Knuth, Seminumerical Algorithms, Vol. 2 of The Art of Computer Programming, Addison-Wesley, Reading, Massachusetts, second ed., 1981. 13. I. Koren, Computer Arithmetic Algorithms, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 07632, 1993. 14. J. Groβschädl, “The Chinese remainder theorem and its application in a high-speed RSA crypto chip,” in Proceedings of 16th Computer Security Applications Conference, 2000, pp. 384-393. 15. S. Y. Kung, VLSI Array Processors, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1988. 16. C.-H. Wu, “An RSA cryptosystem core based on the Chinese remainder theorem,” Master Thesis, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, 2000. 17. J.-H. Hong, P.-Y. Tsai, and C.-W. Wu, “Interleaving schemes for a systolic RSA public-key cryptosystem based on an improved Montgomery’s algorithm,” in Proceedings of 11th VLSI Design/CAD Symposium, 2000, pp. 163-166. 18. H. Orup, “A 100Kbits/s single chip modular exponentiation,” in Proceedings of Hot Chips VI, 1994, pp. 53-59.


979

19. J.-H. Guo, C.-L. Wang, and H. C. Hu, “Design and implementation of an RSA public-key cryptosystem,” in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), 1999, pp. 504-507.

Chung-Hsien Wu (吳宗憲) received the BS degree in electrical engineering in 1995 from National Central University, Chungli, Taiwan, and the MS degree in electrical engineering in 2000 from National Tsing Hua University, Hsinchu, Taiwan. His research interests include VLSI testing and design of high-performance application-specific VLSI circuits and systems. He is currently with ELAN Microelectronics Corp., Hsinchu, Taiwan, where he is a design engineer in the Product Development Division.

Jin-Hua Hong (洪進華) received the BS degree in Physical Sciences in 1988 from National Central University, Chungli, Taiwan, and the PhD degree in electrical engineering in 2000 from National Tsing Hua University, Hsinchu, Taiwan. His research interests include VLSI design and testing, with emphasis on design for testability and built-in self-test. He is currently with the Department of Electrical Engineering, Da-Yeh University, Changhua, Taiwan, where he is an Assistant Professor.

Cheng-Wen Wu (吳誠文) received the BSEE degree in 1981 from National Taiwan University, Taipei, Taiwan, and the MS and PhD degrees, both in electrical and computer engineering, in 1985 and 1987, respectively, from the University of California, Santa Barbara. Since 1988 he has been with the Department of Electrical Engineering, National Tsing Hua University (NTHU), Hsinchu, Taiwan, where he is currently a professor. He also has served as the director of the university’s Computer and Communications Center from 1996 to 1998, and the director of the university’s Technology Service Center from 1998 to 1999. From August 1999 to February 2000, he was a visiting faculty of the ECE Department, UCSB. Since August 2000, he has been the Chair of the EE Department of NTHU. He also is the Director of the IC Design Technology Center of the university. Dr. Wu was the Technical Program Chair of the IEEE Fifth Asian Test Symposium (ATS’96), and the General Chair of ATS’00. He is an Associate Editor for the Journal of the Chinese Institute of Electrical Engineers (JCIEE) and a Guest Editor of the JCIEE Special Issue on Design and Test of System-on-Chip, and was a Guest Editor of the Journal of Information Science and Engineering (JISE), Special Issue on VLSI Testing. He received the Distinguished Teaching Award from NTHU in 1996, the Outstanding Electrical Engineering Professor Award from the Chinese Institute of Electrical Engineers (CIEE) in 1997, and the Distinguished Research Award from National Science Council in 2001. He is interested in design and testing of high performance VLSI circuits and systems. Dr. Wu is a life member of CIEE and a senior member of IEEE.