A New RSA Encryption Architecture and Hardware Implementation based on Optimized Montgomery Multiplication
A. P. Fournaris and O. Koufopavlou
Electrical and Computer Engineering Department, University of Patras, Patras, GREECE
[email protected]
Abstract— RSA is a widely accepted and widely used algorithm in many security applications. Its core mathematical operation, modular exponentiation, is demanding in terms of speed. In this paper a systolic, scalable, redundant Carry – Save Modular Multiplier and an RSA encryption architecture are proposed using the Montgomery Modular Multiplication algorithm. By completely avoiding the transformations from redundant to non redundant numbers at the intermediate stages of the architectures, the need for addition is eliminated and competitive results in terms of Clock Frequency, Throughput and Chip Covered Area are achieved.
I. INTRODUCTION
The ever increasing need for privacy in the growing network connected world of recent years has led to a significant increase in security related applications. To satisfy that need, many different cryptographic algorithms have been proposed and tested for their security strength. RSA is one of the most popular public key cryptographic algorithms [1]. It is used in digital signature applications and generally in secure transactions. It offers good cryptographic security but, due to the complexity of its mathematical calculations, it is slow when compared to symmetric key algorithms. That fact leads to a well founded need for speeding up the calculations of the RSA cryptosystem. The mathematics behind the RSA algorithm are summarized in two operations, modular multiplication and modular exponentiation. In the RSA cryptosystem, the arithmetic operation A^C mod N is used, where N is the product of two prime numbers, A is the message and C the key exponent. In order to create an efficient implementation of RSA one has to design the modular multiplication of two numbers efficiently. However, modular multiplication has a major drawback: trial division has to be employed to obtain the necessary remainder value.
Many attempts have been made to overcome the trial division obstacle [2]. The most popular solution is the Montgomery Modular Multiplication algorithm (MMM) and the Montgomery Modular Exponentiation algorithm (MME), first proposed by P. Montgomery in [3]. By normalizing the numbers to be multiplied, this algorithm manages to completely avoid trial division. Many works have been published on modular multiplication using the MMM algorithm. Some researchers propose systolic arrays and scalability as a solution in designing MMM architectures [4], while others use redundancy [5]. Many attempts have also been made to improve the MMM algorithm, leading to hardware oriented optimized versions of MMM [6]. Although such MMM designs are very efficient, not all of them lead to similarly efficient modular exponentiation architectures in terms of Clock Frequency, Throughput and Chip Covered Area. Since the RSA cryptosystem is based on modular exponentiation, an inefficient MME architecture would result in an impractical RSA encryption application.
In this paper an RSA encryption architecture is proposed using a systolic, scalable, redundant Carry – Save MMM architecture based on an optimized version of the MMM algorithm. By keeping all the inputs, outputs and intermediate signals in C – S format, the proposed RSA encryption and MMM architectures completely avoid the transformations from redundant to non redundant numbers at the intermediate stages of the architectures and therefore achieve competitive results in terms of Clock Frequency, Throughput and Chip Covered Area.
The paper is organized as follows. In Section II the optimized MMM algorithm is presented. In Section III the proposed MMM architecture is described. In Section IV the MME algorithm is presented and the proposed MME and RSA architectures are described and analyzed. In Section V measurements and implementation results are given and Section VI concludes the paper.
II. MONTGOMERY MODULAR MULTIPLICATION ALGORITHM
A. The Original MMM algorithm
The MMM algorithm [3] calculates the value $A = X \cdot Y \cdot R^{-1} \bmod N$, where R is a constant, usually $R = 2^n$. The n-bit value N has to be an integer fulfilling the condition gcd(R, N) = 1. In RSA, N is odd as the product of two primes, therefore the above constraint is always satisfied. Because the algorithm is used for exponentiation, the output A is never greater than N so the final subtraction stated in [3] can be omitted [7]. Therefore the algorithm becomes

Function MMM (X, Y, N)
1. A = 0
2. For k = 0 to n-1 do begin
3.     q = (a_0 + x_k·y_0) mod 2
4.     A = A + x_k·Y + q·N
5.     A = A/2
   End
6. Return A

B. The Optimized MMM algorithm
Observing the original MMM algorithm, the function MMM consists basically of a loop of additions of A with one of the values Y, N, Y+N or zero (0). The choice of which value is added to A depends on the values of x_k and q. So by knowing x_k and q one can add only the needed value to A. Using that observation and Carry – Save redundant logic, a modified version of the MMM algorithm can be presented [6]

Function MMMop (X, Y, N)
1. Cin = 0, Sin = 0
2. For k = 0 to n-1 do begin
3.     q = (Sin_0 + Cin_0 + x_k·y_0) mod 2
4.     if (x_k = 0) then
       a. if (q = 0) then
       b.     I = 0
       c. Else
       d.     I = N
       e. End
5.     End
6.     if (x_k = 1) then
       a. if (q = 0) then
       b.     I = Y
       c. Else
       d.     I = Y+N
       e. End
7.     End
8.     C + S = Cin + Sin + I
9.     Cin = C/2, Sin = S/2
   End
10. Return Cin and Sin

As shown in MMMop, the additions needed in one loop cycle of the algorithm are reduced to a single three-operand addition. That is done at the expense of an if statement. The value Y+N can be precomputed and it does not affect the efficiency of the algorithm. Overall, the gain in additions outweighs the cost of the if statement, and MMMop was shown in [6] to be an optimization of the MMM algorithm.
III. MMM ARCHITECTURE
To design an architecture for the optimized MMMop algorithm using systolic array logic, two different kinds of Processing Elements are needed: the simple PE and the PE that calculates the q value (qPE). The difference between them, as shown in Fig. 1, is that the qPE has some extra gates. Those gates are used for the calculation of q. All calculations are done using Carry – Save binary arithmetic. One of the basic principles of Carry – Save logic is that at the end of the computations the outcome has to be transformed into non redundant format using an adder. This is very efficient if done only once. However, because the proposed MMM architecture is designed for RSA encryption, where repeated multiplications are needed, one extra addition in every loop is not affordable. That addition step can be avoided by making the proposed MMM architecture fully functional with C – S numbers. The PE elements of the proposed systolic, scalable, fully redundant MMM architecture are shown in Fig. 1.
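The behaviour of the MMMop loop that these PEs implement can be checked with a small software model. The following Python sketch is illustrative only and is not the hardware design: the names `csa` and `mmm_op` are hypothetical, the operands X and Y are taken as ordinary integers (the paper keeps them in Carry – Save form), and X is assumed to fit in `n_bits` bits.

```python
def csa(a, b, c):
    """Carry-save adder model: returns (carry, sum) with carry + sum == a + b + c."""
    s = a ^ b ^ c                                # bitwise sum without carry propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1   # majority bits, shifted to their weight
    return carry, s


def mmm_op(x, y, n_mod, n_bits):
    """Software model of MMMop: returns (Cin, Sin) with
    Cin + Sin congruent to x * y * 2^(-n_bits) modulo n_mod (n_mod odd)."""
    c_in, s_in = 0, 0
    y_plus_n = y + n_mod                         # precomputed once, as in the text
    for k in range(n_bits):
        xk = (x >> k) & 1
        q = ((s_in & 1) + (c_in & 1) + xk * (y & 1)) % 2
        if xk == 0:
            i_val = n_mod if q else 0
        else:
            i_val = y_plus_n if q else y
        c, s = csa(c_in, s_in, i_val)            # step 8: one three-operand addition
        c_in, s_in = c >> 1, s >> 1              # step 9: both words are even here
    return c_in, s_in


# Small example: N = 13, n = 4 (R = 16), X = 5, Y = 7
c_out, s_out = mmm_op(5, 7, 13, 4)
print((c_out + s_out) % 13)
```

Only at this final check are the two words added together; inside the loop no carry-propagating addition is performed, which is the property the proposed architecture relies on.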
Figure 1. The proposed MMM architecture PE (a) and qPE (b).
The proposed PEs of Fig. 1 seem more expensive than those of a conventional design of the MMMop algorithm [6]. However, if the cost of one extra addition in every cycle is considered, the overall Chip Covered Area of the conventional architecture would be bigger than that of the proposed design, where the addition is needed only once at the end. Using the PEs of Fig. 1, all the intermediate calculations of the MMM algorithm are done in C – S format without any transformations. If a fully systolic pattern were used for the MMM architecture, n × n PEs would be needed, resulting in an impractical design in terms of Chip Covered Area. Therefore, only n PEs are used in the proposed MMM architecture, operating through feedback logic, as shown in Fig. 2.
Figure 2. The proposed MMM architecture.
All input, output and intermediate signals are in Carry – Save format. The C (Carry) output signal is fed back to the previous PE, thus achieving the shifting (division by 2 operation) of step 9 of the MMMop algorithm. The proposed MMM architecture produces a result after n clock cycles.
IV. THE PROPOSED MME AND RSA ARCHITECTURE
A. The MME algorithm
The distinctive feature of the MMM algorithm is that it uses the R value and calculates $A = X \cdot Y \cdot R^{-1} \bmod N$. The well known square and multiply method [2] has to be modified to transform the input value into the Montgomery format, which includes the R value, and to transform the output from the Montgomery format into a plain number. Therefore the algorithm for Montgomery Modular Exponentiation (MME) [2] is

Function MME (X, e, N)
1. A = R mod N
2. G = R^2 mod N
3. X = MMMop(X, G)
4. For i = t to 0 do begin
   a. A = MMMop(A, A)
   b. If e_i = 1 then A = MMMop(A, X)
5. End
6. A = MMMop(A, 1)

All values are in Carry – Save format, e is the exponent and $R = 2^n$. Steps 1 and 2 are the same regardless of the X or e value. They can be precomputed. As for the value in step 3, it has to be computed only once for each X and then it can be stored in a memory module. Step 4b is executed only when the i-th bit of e is set.
B. The proposed MME architecture
Using the proposed MMM architecture of Fig. 2 and algorithm MME, the Montgomery Modular Exponentiation architecture (MME) can be designed, as shown in Fig. 3. The basic goal is to achieve a constant flow of data, so that a high throughput is maintained. For that purpose, two MMM architectures are needed: MMM1, which manages the incoming X values, and MMM2, which performs the main loop of the MME algorithm. The X values are normalized in MMM1 (step 3 of the MME algorithm) and stored in a Register Set. The data are pushed into MMM2 when e_t = 1, else the output of MMM2 is reinserted in the input. A precomputation unit is needed for the calculation of Y+N; it performs one computation every n clock cycles.
Figure 3. The proposed MME architecture.
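The MME algorithm of Section IV.A can be cross-checked against ordinary modular exponentiation with a short software model. This Python sketch makes several assumptions that are not in the paper: it uses a plain (non redundant) Montgomery multiplication instead of the Carry – Save MMMop, the function names `mmm` and `mme` are hypothetical, and the word length `n_bits` is chosen so that R = 2^n_bits > 4N, which is the condition under which the final subtraction can safely be skipped [7].

```python
def mmm(x, y, n_mod, n_bits):
    """Montgomery product without final subtraction.
    Returns a value congruent to x * y * 2^(-n_bits) modulo n_mod (n_mod odd).
    Assumes x < 2^n_bits, since only n_bits bits of x are scanned."""
    a = 0
    for k in range(n_bits):
        xk = (x >> k) & 1
        q = ((a & 1) + xk * (y & 1)) % 2     # makes the next sum even
        a = (a + xk * y + q * n_mod) >> 1    # exact division by 2
    return a


def mme(x, e, n_mod, n_bits):
    """Square-and-multiply exponentiation following steps 1-6 of MME."""
    r = 1 << n_bits
    a = r % n_mod                            # step 1: Montgomery form of 1
    g = (r * r) % n_mod                      # step 2
    x_m = mmm(x, g, n_mod, n_bits)           # step 3: X into Montgomery form
    for i in range(e.bit_length() - 1, -1, -1):
        a = mmm(a, a, n_mod, n_bits)         # step 4a: square
        if (e >> i) & 1:
            a = mmm(a, x_m, n_mod, n_bits)   # step 4b: multiply
    a = mmm(a, 1, n_mod, n_bits)             # step 6: back to a plain number
    return a % n_mod                         # defensive reduction for this model


# Toy modulus N = 61 * 53 = 3233 (12 bits), so n_bits = 14 gives R > 4N
n_mod = 3233
n_bits = n_mod.bit_length() + 2
print(mme(65, 17, n_mod, n_bits) == pow(65, 17, n_mod))
```

The toy modulus is far too small for real use; it only serves to compare the model against Python's built-in `pow`.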
The X value enters the MMM1 unit and is normalized in parallel with the first square operation done in MMM2. After n clock cycles, the output of MMM1 is stored in the Register Set. If e_t = 1, the stored output of MMM1 is pushed into MMM2 and a multiply operation begins. Else the square operation is repeated. MMM2 uses feedback and renews its input with the output of the previous calculation. When the loop of the MME algorithm is completed, the value 1 is used as input instead of X and the output of MMM2 is transformed from the Montgomery format back to a plain number. That does not result in extra delay because the next X value can be normalized and stored in the Register Set during the previous n clock cycle MMM operation. Exponentiation is completed in [t + HW(e)]·n clock cycles and the Throughput of the proposed MME architecture is:
$$T_{MME} = \frac{n \cdot F_{clock}}{[t + HW(e)] \cdot n} = \frac{F_{clock}}{t + HW(e)}$$
where HW(e) is the Hamming Weight of e.
C. RSA Architecture
RSA encryption can be considered a special case of MME because the exponent e can be a relatively small number without changing the security of the algorithm. In practice, a small e is chosen, usually 3 or $2^{16}+1$, that has few non zero bits. One RSA encryption is done in 18n clock cycles for $e = 2^{16}+1$ and the Throughput is
$$T_{MME} = \frac{F_{clock}}{16 + 2} = \frac{F_{clock}}{18}$$
V. MEASUREMENTS AND COMPARISONS
The proposed MMM and MME architectures were captured in VHDL and implemented in FPGA. The results in Chip Covered Area, Clock Frequency and Throughput for the proposed MMM architecture are given in Table I in comparison with other well known MMM architectures.

TABLE I. AREA, CLOCK FREQUENCY AND THROUGHPUT COMPARISONS FOR 1024 BIT MONTGOMERY MODULAR MULTIPLICATION.

MMM Architecture             Chip Area (CLBs)   Clock Frequency (MHz)   Throughput (bit/sec)
Proposed MMM architecture    3611               129.1                   129M
[5]                          11617              76.23                   76.08M
[6]                          65408              168.7                   168G
[8]                          5458               54.61                   54.4M
[9]                          5706               95.62                   31.83M
The proposed MMM architecture fares very well against the other compared designs. The Chip Covered Area of the proposed MMM architecture is smaller than that of the architectures of [5], [8] and [9]. Also, the proposed MMM architecture has higher Clock Frequency and Throughput than [5], [8] and [9]. The fully systolic architecture of [6] is by far the fastest, but its Chip Covered Area is so high that it makes this architecture unusable for Modular Exponentiation or RSA encryption.
Results of the proposed RSA encryption architecture concerning Chip Covered Area, Clock Frequency and Throughput for $e = 2^{16}+1$ in comparison with other similar architectures are presented in Table II. The proposed RSA encryption architecture has the highest Throughput and Clock Frequency of the compared designs. It has a relatively small Chip Covered Area, smaller than the designs in [5] and [8] but larger than the design in [10].

TABLE II. AREA, CLOCK FREQUENCY AND THROUGHPUT COMPARISONS FOR 1024 BIT RSA ENCRYPTION WITH E=2^16+1.

RSA Architecture             Chip Area (CLBs)   Clock Frequency (MHz)   Throughput (bit/sec)
Proposed RSA architecture    7873               129                     7.17M
[5]                          22076              93.34                   4.66M
[8]                          10369              49.63                   45.8K
[10]                         2902               32.36                   342.4K
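The Throughput reported for the proposed RSA architecture in Table II can be reproduced from the formula of Section IV. The following Python sketch is only a numerical check; the function names are hypothetical, and t is assumed to be the index of the most significant set bit of e (t = 16 for e = 2^16+1), which matches the [16 + 2] term used in the paper.

```python
def hamming_weight(e: int) -> int:
    """Number of non zero bits of the exponent e."""
    return bin(e).count("1")


def mme_throughput(f_clock_hz: float, e: int) -> float:
    """Throughput in bit/sec: F_clock / (t + HW(e)), with t assumed to be
    the position of the most significant set bit of e."""
    t = e.bit_length() - 1
    return f_clock_hz / (t + hamming_weight(e))


e = 2**16 + 1
throughput = mme_throughput(129e6, e)       # 129 MHz clock, as in Table II
print(f"{throughput / 1e6:.2f} Mbit/sec")   # prints "7.17 Mbit/sec"
```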
VI. CONCLUSIONS
In this paper a systolic, scalable, Carry – Save Montgomery Modular Multiplier architecture and an RSA encryption architecture were proposed. The proposed architectures use an optimized version of the Montgomery Modular Multiplication algorithm. By making all the input, output and intermediate signals carry free, the proposed architectures proved very efficient in terms of Clock Frequency, Chip Covered Area and Throughput.
REFERENCES
[1] B. Schneier, Applied Cryptography – Protocols, Algorithms and Source Code in C, second ed., John Wiley & Sons, New York, 1996.
[2] A. J. Menezes, P. C. Van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography, CRC Press, 1997.
[3] P. L. Montgomery, “Modular multiplication without trial division,” Mathematics of Computation, vol. 44, no. 170, pp. 519-521, 1985.
[4] L. Batina and G. Muurling, “Montgomery in Practice: How to do it more efficiently in Hardware,” The Cryptographer's Track at the RSA Conference on Topics in Cryptology, 2002.
[5] C. McIvor, M. McLoone, J. V. McCanny, A. Daly, and W. Marnane, “Fast Montgomery Modular Multiplication and RSA Cryptographic Processor Architectures,” 37th Annual Asilomar Conference on Signals, Systems and Computers, California, Nov. 2003.
[6] A. P. Fournaris and O. Koufopavlou, “Montgomery Modular Multiplier Architectures and Hardware Implementations for an RSA cryptosystem,” 46th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS 2003), Cairo, Egypt, December 2003.
[7] C. D. Walter, “Montgomery Exponentiation Needs No Final Subtractions,” Electronics Letters, vol. 35, no. 21, pp. 1831-1832, October 1999.
[8] A. Daly and W. Marnane, “Efficient Architectures for implementing Montgomery Modular Multiplication and RSA Modular Exponentiation on Reconfigurable Logic,” in Proc. of 10th International Symposium on Field-programmable Gate Arrays, February 2002.
[9] S. B. Ors, L. Batina, B. Preneel, and J. Vandewalle, “Hardware Implementation of a Montgomery Modular Multiplier in a Systolic Array,” International Parallel and Distributed Processing Symposium (IPDPS’03), 2003.
[10] A. Mazzeo, L. Romano, and G. P. Saggese, “FPGA-based Implementation of a serial RSA processor,” Design, Automation and Test in Europe Conference and Exhibition (DATE'03), Munich, Germany, 2003.