Towards an Efficient Implementation of Sequential Montgomery ...

Towards an Efficient Implementation of Sequential Montgomery Multiplication

João Carlos Néto (1), Alexandre Ferreira Tenca (2) and Wilson Vicente Ruggiero (1)

(1) Department of Computer and Digital Systems Engineering, Polytechnic School, University of Sao Paulo, Sao Paulo, Brazil
(2) Synopsys, Inc., Oregon, USA

Abstract: A method to generate efficient implementations of sequential Montgomery Multiplication (MM) is proposed. It is applied to radix-2 MM, but could be used for other radices. An efficient solution is obtained when adders that are inactive in a cycle are re-assigned to perform useful computation. The resulting hardware algorithm and architecture accelerate the modular multiplication by looking ahead at the input data of two iterations and, in some cases, compressing two iterations into one, without increasing the iteration time too much. Experiments show a 33.6% average reduction in clock cycles when the proposed multiplier is used to implement modular exponentiation in the 2048-bit RSA cryptosystem.

Keywords: Cryptography; high-speed arithmetic; modular exponentiation and multiplication

I. INTRODUCTION

Public key cryptosystems are based on arithmetic functions such as modular exponentiation and multiplication (e.g., the RSA algorithm [1] and the Diffie-Hellman key exchange scheme [2]). Therefore, computational methods that accelerate these operations, reduce their energy consumption, and simplify their use, particularly in hardware, will always be of great value for systems that require data security. Nowadays, one of the most successful modular multiplication methods is Montgomery Multiplication (MM) [3]. Efforts to improve this method are of prime importance to designers of dedicated cryptographic hardware and of security in embedded systems [4].

When arithmetic algorithms are implemented in hardware, there are always situations in which the arithmetic operators are not performing useful computation. Addition of zero is a typical case for 2-input adders. If the operators were used for actual computation, we would reach a more efficient implementation, with higher utilization of operators and a reduction in the number of clock cycles required to complete the task. The total computation time will depend on the impact of the extra complexity introduced in the control logic for this more efficient use of resources.

We propose a method to better exploit the hardware resources available to implement a sequential radix-2 MM. An efficient solution is obtained when we optimize the use of its adders. Adders that are not performing useful computation in a given step are re-assigned to execute other steps of the algorithm. The result of this approach is the acceleration of modular multiplication by compressing two iterations of the original radix-2 MM algorithm into one. We emphasize that this approach shows even better results than a radix-4 implementation of MM, and it is not the same

978-1-4244-9721-8/10/$26.00 ©2010 IEEE


approach used in a traditional radix-4 design, where multiples must be recoded using Booth encoding and special hardware is required to inject bits that generate the two’s complement form of negative multiples [9]. The scope of this work is limited to the application of the proposed technique to the original sequential radix-2 MM algorithm, but the same strategy may be combined with other architectures to improve the implementation of MM, such as high-radix [10], scalable architecture [6] and the bipartite method [7].

In the next section, we describe the original Montgomery modular multiplication. In Section III, we propose the new accelerated Montgomery Multiplication method based on “skipping rounds”. In Sections IV and V, we discuss several implementation aspects and experimental results of the proposed optimization method. In Section VI, we present the concluding comments.

II. MONTGOMERY MULTIPLICATION

The Montgomery Multiplication algorithm is defined in [3]. Several versions of the MM algorithm, ranging from software-based implementations on general-purpose processors to dedicated arithmetic coprocessors, were analyzed and compared in [5]. The MM algorithm is used to speed up the modular multiplications and squarings required during modular exponentiation in public key cryptosystems, without using division [3]. The radix-2 MM algorithm is the most common starting point for a fast and simple hardware implementation. Let us describe it before we introduce the new accelerated Montgomery Multiplication proposal using the skip rounds method.

The MM algorithm computes $Z \equiv (X \cdot Y \cdot R^{-1}) \bmod M$, with $n = 1 + \lfloor \log_2 M \rfloor$, where the variables are defined below.

$X$: n-bit multiplier,
$Y$: n-bit multiplicand,
$M$: n-bit odd modulus,
$R$: $2^n$,
$R^{-1}$: modular multiplicative inverse of $R$, i.e., $1 \equiv (R \cdot R^{-1}) \bmod M$.

Algorithm 1 shows the pseudocode of the radix-2 MM for m-bit operands $X = (x_{m-1}, \ldots, x_1, x_0)$, $Y$, and $M$ [5]. The arithmetic operations at bit level are logical XOR ($\oplus$), logical AND ($\bullet$), 1-bit right shift ($/2$), addition ($+$), subtraction ($-$), multiplication ($\cdot$), and comparison for the condition “greater than or equal to” ($\geq$). At the $i$-th iteration the intermediate result is given by $S[i]$. Subscripts are used to

Asilomar 2010

indicate a bit in the bit-vector that represents the value of a variable. For example, $S[i]_0$ denotes bit 0 of the binary representation of variable $S[i]$.

Algorithm 1. Radix-2 Montgomery Multiplication (R2MM)
Require: odd $M$, $n = 1 + \lfloor \log_2 M \rfloor$, $X = \sum_{i=0}^{n-1} x_i \cdot 2^i$, with $0 \leq X, Y < M$
Ensure: $Z = R2MM(X, Y, M) \equiv (X \cdot Y \cdot R^{-1}) \bmod M$, $0 \leq Z < M$
1: $S[0] = 0$
2: for $i = 0$ to $n - 1$ step 1 do
3:   $q_i = S[i]_0 \oplus (x_i \bullet Y_0)$
4:   $S[i+1] = (S[i] + x_i \cdot Y + q_i \cdot M) / 2$
5: end for
6: if $S[n] \geq M$ then
7:   $S[n] = S[n] - M$
8: end if
9: return $Z = S[n]$

The inner loop (lines 2 to 5) of the radix-2 MM algorithm uses two 2-input adders. The first adder adds $Y$ to the intermediate result $S[i]$ if the current bit of $X$ (i.e., $x_i$) is 1. The second adder adds $M$ to the intermediate result, depending on $q_i$. Basically, $q_i$ has value 1 when the result of the first addition is odd. The intermediate result of each iteration, $S[i+1]$, is then obtained by dividing the output of the second adder by 2, which reduces the intermediate result to $n + 1$ bits. It can therefore be observed that the two adders do not perform useful arithmetic on every iteration. Sometimes an adder is used only to pass one of its inputs to the output. This is the key concept to take advantage of in a hardware implementation.
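To make the role of the two adders concrete, the following is a minimal software sketch of Algorithm 1. It is an illustration only, not the authors' hardware description: the function name `r2mm` is hypothetical, and Python integers stand in for the n-bit registers.

```python
def r2mm(x: int, y: int, m: int) -> int:
    """Radix-2 Montgomery multiplication: returns x*y*2^(-n) mod m.

    Assumes m is odd and 0 <= x, y < m, with n the bit length of m.
    """
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    n = m.bit_length()            # n = 1 + floor(log2(m))
    s = 0
    for i in range(n):
        xi = (x >> i) & 1
        s += xi * y               # first adder: idle (adds zero) when x_i = 0
        qi = s & 1                # q_i is the parity of the first adder's output
        s += qi * m               # second adder: idle (adds zero) when q_i = 0
        s >>= 1                   # exact division by 2, since s is now even
    if s >= m:                    # final conditional subtraction
        s -= m
    return s


# Usage: the result times R = 2^n must be congruent to x*y modulo m.
m, x, y = 239, 123, 200
z = r2mm(x, y, m)
assert (z << m.bit_length()) % m == (x * y) % m
```

The two `+=` lines correspond to the two adders; whenever `xi` or `qi` is zero, the corresponding adder only passes its input through.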

Algorithm 2 shows the pseudocode for the radix-2 MM CSA algorithm, which has the same core procedure as Algorithm 1 but uses carry-save adders (CSA) to implement the iterations in hardware. Where Algorithm 1 reads the intermediate result from the variable $S[i]$, we now read this result from the CSA adders, where $S[i]$ is manipulated in carry-save form, i.e., $S[i] = S_{out}[i] + 2C_{out}[i]$. Furthermore, the input variables $sin$ and $cin$ denote the sum and carry inputs of the CSA adders. Note the equivalence between $S[i]_0$ and $sin_0 \oplus cin_0$, in line 3 (Algorithm 1) and line 4 (Algorithm 2), respectively. At step 11 the final result of the multiplier, $S[n]$, is obtained in binary form by the addition of $S_{out}[n]$ and $C_{out}[n]$.

Algorithm 2. Radix-2 Montgomery Multiplication CSA (R2MM_CSA)
Require: odd $M$, $n = 1 + \lfloor \log_2 M \rfloor$, $X = \sum_{i=0}^{n-1} x_i \cdot 2^i$, with $0 \leq X, Y < M$
Ensure: $Z = R2MM\_CSA(X, Y, M) \equiv (X \cdot Y \cdot R^{-1}) \bmod M$, $0 \leq Z < M$
1: $S_{out}[0] = 0$; $C_{out}[0] = 0$
2: for $i = 0$ to $n - 1$ step 1 do
3:   $sin = S_{out}[i]$; $cin = C_{out}[i]$
4:   $q_i = sin_0 \oplus cin_0 \oplus (x_i \bullet Y_0)$
5:   $(S_{out}[i+1], C_{out}[i+1]) = CSA(sin, cin, x_i \cdot Y)$
6:   $sin = S_{out}[i+1]$; $cin = 2C_{out}[i+1]$
7:   $(S_{out}[i+1], C_{out}[i+1]) = CSA(sin, cin, q_i \cdot M)$
8:   $sin = S_{out}[i+1]$; $cin = 2C_{out}[i+1]$
9:   $S_{out}[i+1] = sin / 2$; $C_{out}[i+1] = 2cin / 2$
10: end for
11: $S[n] = S_{out}[n] + C_{out}[n]$
12: if ($S[n] \geq M$) then
13:   $S[n] = S[n] - M$
14: end if
15: return $Z = S[n]$

III. OVERVIEW OF SKIP ROUNDS METHOD

The proposed accelerated Montgomery Multiplication is called the Skip Rounds (SR) method. In the traditional radix-2 MM algorithm, when the current bit $x_i$ is zero, the first adder is unused. Similarly, when the least significant bit of the intermediate result is zero, the second adder is also unused. Our proposal to accelerate the computation is to reallocate the unused adders to perform other meaningful algorithm steps, while keeping the main computational structures of the radix-2 Montgomery method. Loop unrolling is performed on two consecutive algorithm iterations. As a consequence, we look ahead at two consecutive bits of $X$, i.e., $x_i$ and $x_{i+1}$, and at the parity of $S[i] + x_i \cdot Y$ and $S[i+1] + x_{i+1} \cdot Y$, to decide how to reassign inputs to the adders. The loop invariant for the intermediate result at the $i$-th round is

$$S[i+1] = \frac{S[i] + x_i \cdot Y + q_i \cdot M}{2}.$$

When two bits of $X$ from two consecutive rounds $i$ and $i+1$ are taken into account, the loop invariant becomes

$$S[i+2] = \frac{S[i] + x_{i+1} \cdot 2Y + x_i \cdot Y + q_{i+1} \cdot 2M + q_i \cdot M}{4}.$$
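The two-round invariant follows from applying the one-round invariant twice. The short derivation below is added as a worked check; it uses only the symbols already defined in the text.

```latex
\begin{aligned}
S[i+2] &= \frac{S[i+1] + x_{i+1} \cdot Y + q_{i+1} \cdot M}{2} \\
       &= \frac{\dfrac{S[i] + x_i \cdot Y + q_i \cdot M}{2} + x_{i+1} \cdot Y + q_{i+1} \cdot M}{2} \\
       &= \frac{S[i] + x_i \cdot Y + q_i \cdot M + x_{i+1} \cdot 2Y + q_{i+1} \cdot 2M}{4}
\end{aligned}
```

Both divisions are exact: $q_i$ makes the inner numerator even, and $q_{i+1}$, which is the parity of $S[i+1] + x_{i+1} \cdot Y$, makes the outer numerator even. Hence the combined numerator is a multiple of 4.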

We know in advance that there will be additions of $2Y$, $Y$, $2M$ or $M$ to the intermediate result, depending on the bits $x_{i+1}$, $x_i$, $q_{i+1}$ and $q_i$. Thus, we can assign to the adders inputs that are very easily computed ($2Y$, $Y$, $2M$, $M$) and may consume two bits of $X$ on some iterations. When the multiples of $Y$ or $M$ to be added to the partial sum are more complex, only one bit of $X$ is consumed in a cycle, and the hardware operates as the original radix-2 MM hardware. With this modification, the additions needed in two consecutive iteration rounds can be done in one cycle of the new algorithm.

Table I lists the input bit combinations of $x_{i+1}$, $x_i$, $q_{i+1}$ and $q_i$, as well as the possible inputs to the main adders of the multiplier and the control signal that indicates the occurrence of skip rounds. There are five input combinations for which it is not possible to perform two rounds in one. Those cases are indicated by $Skr = 0$ and involve the calculation of more complicated multiples ($3Y$ or $3M$), which would require more hardware resources than are available (either more registers to store pre-calculated multiples, or more adders to compute them when needed). In this situation, for efficiency, it is better to slow down the computation and execute an iteration in the same way as the original radix-2 MM algorithm does.

TABLE I. THE MULTIPLES OF Y AND M APPLIED TO EACH ADDER DURING SKIP ROUNDS
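The count of five blocked combinations can be checked by enumeration. The sketch below is not a reproduction of Table I; it assumes only what the text states, namely that there are two adders and that each accepts one easily computed operand (0, Y, 2Y, M or 2M), so a multiple of 3 costs two adder inputs. The helper names are hypothetical.

```python
from itertools import product


def operands_needed(hi: int, lo: int) -> int:
    """Adder inputs needed to add (2*hi + lo) times a value using only 1x and 2x.

    0 needs none, 1x or 2x needs one, and 3x needs two (1x plus 2x).
    """
    return hi + lo


def can_skip(x1: int, x0: int, q1: int, q0: int, adders: int = 2) -> bool:
    """True when both rounds fit into the available adders in one cycle."""
    return operands_needed(x1, x0) + operands_needed(q1, q0) <= adders


# Enumerate all (x_{i+1}, x_i, q_{i+1}, q_i) combinations that cannot be merged.
blocked = [bits for bits in product((0, 1), repeat=4) if not can_skip(*bits)]
for x1, x0, q1, q0 in blocked:
    print(f"x={x1}{x0} q={q1}{q0}: needs {2 * x1 + x0}Y and {2 * q1 + q0}M")

assert len(blocked) == 5
```

Under this assumption, every blocked case pairs 3Y with a non-zero multiple of M, or 3M with a non-zero multiple of Y, which agrees with the explanation given for the Skr = 0 rows.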

In order to take advantage of the available adders, we initially considered the options shown in Table I with $Skr = 1$. These are the cases in which multiples of $Y$ and $M$ are added to $S[i]$ in order to correctly compute the value of $S[i+2]$ (two iterations in one cycle). When two iterations are processed, a division by 4 is executed at the end of the iteration. The information in the table, with some modifications to avoid an excessive increase in hardware or cycle time, defines the main steps to be executed in the new Radix-2 MM Skip Rounds algorithm described in the next section. Notice that it would be possible to recode the multiples and still have enough adders to perform the calculation (as done in the radix-4 MM algorithm), but such a solution would require more hardware for recoding, selection of values for the adders, and conditional inversion, and would increase the complexity of the overall hardware. The technique we apply here to the radix-2 MM algorithm could also be applied to the radix-4 MM algorithm, but that is not the subject of this paper.

IV. HARDWARE ALGORITHM AND ARCHITECTURE

Algorithm 3 shows the pseudocode for the radix-2 MM CSA Skip Rounds algorithm. In addition to the bit-level arithmetic notation presented for Algorithm 2, we use the bit concatenation operator (&) in the description (line 8). It is worth noting the equivalence between the descriptions of these algorithms when we use the notation $Xi_{1..0} = (x_{i+1} \,\&\, x_i)$ and $Qi_{1..0} = (q_{i+1} \,\&\, q_i)$. In this section we describe the logic used in the implementation of the functions SelOp1 (selection of the operand for the first adder), LogicQi (determination of the $q_{i+1}$ and $q_i$ values), SelOp2 (selection of the operand for the second adder), the control logic SkrCtrl, and the shift operation on $X$ at the end of each cycle (ShiftX). Algorithm 3 does not directly use the operations suggested in Table I. In order to get the best hardware implementation it is necessary to perform some transformations on the table. These transformations may lead to cases in which we give up possible double iterations per cycle in order to optimize the hardware. All the transformations are shown in Table II.

Algorithm 3. Radix-2 Montgomery Multiplication CSA Skip Rounds (R2MM_CSA_SR)
Require: odd $M$, $n = 1 + \lfloor \log_2 M \rfloor$, $X = \sum_{i=0}^{n-1} x_i \cdot 2^i$, with $0 \leq X, Y < M$
Ensure: $Z = R2MM\_CSA\_SR(X, Y, M) \equiv (X \cdot Y \cdot R^{-1}) \bmod M$, $0 \leq Z < M$
1: $Xi_{n..0} = X_{n-1..0}$; $i = 0$; $Sout_{n-1..0} = 0$; $Cout_{n..0} = 0$; $Skr = 0$; $Cext_i = 0$
2: while ($i < n$) do
3:   $(Sin1, Cin1) = ShiftSC(Sout, Cout, Skr)$
4:   $Op1 = SelOp1(Xi_{1..0}, Y)$
5:   $Qi_{1..0} = LogicQi(Xi_{1..0}, Y_{1..0}, Sin1_{1..0}, Cin1_{1..0}, Cext_i, M_{1..1})$
6:   $Op2 = SelOp2(Qi_{1..0}, M)$
7:   $(Sout, Cout) = CSA(Sin1, Cin1, Op1)$
8:   $Sin2 = Sout$; $Cin2 = Cout \,\&\, Cext_i$
9:   $(Sout, Cout) = CSA(Sin2, Cin2, Op2)$
10:   $Cext_i = Sout_1 \bullet Cout_0 \bullet Skr$
11:   $(Skr, ClrXi) = SkrCtrl(Xi_{1..0}, Qi_{1..0})$
12:   $Xi = ShiftX(Xi, Skr, ClrXi)$
13:   $i = loopCtrl(i, Skr)$
14: end while
15: $S = Sout + 2Cout$
16: if $S[n] \geq M$ then
17:   $S[n] = S[n] - M$
18: end if
19: return $Z = S[n]$
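The carry-save arithmetic that underlies Algorithms 2 and 3 can be illustrated in software. The sketch below models the single-round carry-save loop of Algorithm 2 only; it does not model the skip-rounds datapath. The names `csa` and `r2mm_csa` are hypothetical, and, as a simplifying assumption that differs from the register-level notation of the paper, `csa` returns the carry word already shifted to its arithmetic weight, so the represented value is always `s + c`.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save adder on non-negative ints.

    Returns (sum_word, carry_word) with sum_word + carry_word == a + b + c.
    No carry propagates between bit positions, as in hardware.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # assumption: pre-shifted carry
    return sum_word, carry_word


def r2mm_csa(x: int, y: int, m: int) -> int:
    """Radix-2 Montgomery multiplication with the running value kept as (s, c)."""
    n = m.bit_length()
    s, c = 0, 0
    for i in range(n):
        xi = (x >> i) & 1
        # Parity of s + c + x_i*Y is the XOR of the three least significant bits.
        qi = (s ^ c ^ (xi & y)) & 1
        s, c = csa(s, c, xi * y)      # first CSA
        s, c = csa(s, c, qi * m)      # second CSA; s + c is now even
        s >>= 1                       # bit 0 of c is 0 after the shift inside csa,
        c >>= 1                       # so bit 0 of s is 0 too: both halvings are exact
    total = s + c                     # single carry-propagate addition at the end
    return total - m if total >= m else total


m, x, y = 239, 123, 200
assert (r2mm_csa(x, y, m) << m.bit_length()) % m == (x * y) % m
```

The point of the representation is that the loop body contains no carry-propagating addition; the only full addition is the final conversion to binary.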

Table I shows that the input of the first adder gets mostly multiples of $Y$, besides one case in which $M$ is needed (line 3). Because of this line, the selection logic for the operands going to the first adder is more complex; therefore, relaxing the condition for this line results in hardware savings, with a small impact on performance. Another transformation is based on the concept that a regular pattern in a column generates highly simplified logic. We observed that the values on lines 11 and 12 could be modified to allow for smaller logic. If we replace the value 0 by $2Y$ on line 11 and the value $2Y$ by $Y$ on line 12, we get a better logic optimization for the selection control, which then depends only on the $x_{i+1}$ and $x_i$ values. The circuit for the $Y$ multiples used as operands to the first adder is shown in Fig. 1. Note, however, that this circuit sends $2Y$ to CSA 1 in the case of line 11, which is equivalent to performing the sum of $x_{i+1} \cdot 2Y$ one cycle earlier (observe that $Skr = 0$ in this case). Thus, on the next cycle, when $x_{i+1}$ becomes $x_i$, we make $x_i = 0$ to avoid adding the same product again. The transformation applied to line 12 also simplifies the logic of SelOp2 by eliminating the only case in which a multiple of $Y$ is used on the second adder. This way, all its multiples are derived from $M$.

Following the same idea used for SelOp1, the formation of a regular pattern is important. For the second adder we have the pattern (0, M, 2M, M) repeating for most lines. Only line 14 breaks the pattern. If we were to introduce the value $2M$ on that line, we would have exactly the same circuit shown in Fig. 1, with the inputs Xi and Y replaced by Qi and M, and the output Op1 replaced by Op2. The addition of $2M$ in the case of line 14 only advances the addition of the product term $q_{i+1} \cdot 2M$. Thus, that sum does not alter the expected result. Table II shows the new table with those transformations.

Figure 1. Operand Selection: The Multiples of Y

TABLE II. THE MULTIPLES OF Y AND M (RELAXATION PHASE)

The bit $Qi_0$ is computed by the LogicQi block in the same way as $q_i$ is computed in Algorithm 2 (line 4). Similarly, to compute the bit $Qi_1$ ($q_{i+1}$) it is necessary to anticipate some sum bits using the following equations. As shown in Fig. 5, the LogicQi block runs in parallel with the SelOp1 and CSA 1 blocks.

$$wS[i]_{1..0} = Sin1_{1..0} + Cin1_{1..0} + Cext_i$$

$$wS[i+1]_{1..0} = wS[i]_{1..0} + (Xi_0 \cdot Y_1 \,\&\, Xi_0 \cdot Y_0) + (Qi_0 \cdot M_1 \,\&\, Qi_0)$$

$$Qi_1 = wS[i+1]_1 \oplus Xi_1 \cdot Y_0$$

The ShiftX function is reasonably expensive, because each register bit needs hardware to perform four functions: the initial load of $X$, holding the value in the register, shifting by 1 bit and shifting by 2 bits (division by 2 or 4). To reduce the complexity of this circuit, we can leave aside the 1-bit right shift function, making a circuit with only three functions and carefully selecting the correct $x_i$ values from the least significant 3 bits of the register (X_Reg). The 2-bit right shift function is controlled by the signal enShiftX, which is produced in the LogicSD (Shift right and Displacement) block. Two out of the three least significant bits of XR are selected to be the bits $x_{i+1}$ and $x_i$, depending on the signal Skr (Skip Rounds) from the previous iteration, as well as the signal Shd (Shift displacement). The block diagram of ShiftX is shown in Fig. 2.

TABLE III. THE LOGIC TO CONTROL SHIFTX AND SELECT $x_{i+1..i}$ FROM XR REGISTER

The skip rounds signal (Skr) is generated by simple logic that combines the bits $x_{i+1}$, $x_i$, $q_{i+1}$ and $q_i$. According to Table II, it is enough to check the conditions $x_{i+1} = x_i = 1$ or $q_{i+1} = q_i = 1$ to assert $Skr = 0$ (no skip rounds in that iteration). Furthermore, the SkrCtrl block detects when line 11 occurs and sets $ClrXi = 1$ so that the SelXi block cancels the bit $x_{i+1}$ in the next iteration. When the multiplication is completed, the signal done is set by the loop control logic (LoopCtrl), which counts the bits of $X$ that have been processed, taking the skip rounds signal into account. The block diagrams of the R2MM_CSA_SR overview and of the core algorithm in hardware are shown in Fig. 3 and 4.
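The Skr condition and the look-ahead of $q_{i+1}$ can be combined into a small behavioural model that counts cycles. This is a sketch under stated assumptions, not the authors' datapath: it works on plain integers rather than carry-save words, it omits the early-addition and ClrXi mechanism used for lines 11 and 14 of Table II, and the names `lookahead_q` and `r2mm_skip` are hypothetical.

```python
def lookahead_q(s: int, x0: int, x1: int, y: int, m: int) -> tuple[int, int]:
    """Return (q_i, q_{i+1}) using only the two low bits of s, y and m."""
    q0 = (s ^ (x0 & y)) & 1
    # Low two bits of S[i] + x_i*Y + q_i*M; bit 1 becomes bit 0 of S[i+1].
    w = ((s & 3) + x0 * (y & 3) + q0 * (m & 3)) & 3
    q1 = ((w >> 1) ^ (x1 & y)) & 1
    return q0, q1


def r2mm_skip(x: int, y: int, m: int) -> tuple[int, int]:
    """Montgomery product x*y*2^(-n) mod m and the number of cycles used."""
    n = m.bit_length()
    s, i, cycles = 0, 0, 0
    while i < n:
        x0 = (x >> i) & 1
        x1 = (x >> (i + 1)) & 1 if i + 1 < n else 0
        q0, q1 = lookahead_q(s, x0, x1, y, m)
        # Skr = 0 when x_{i+1} = x_i = 1 or q_{i+1} = q_i = 1 (or one bit is left).
        skr = i + 1 < n and not (x0 and x1) and not (q0 and q1)
        if skr:
            s = (s + x0 * y + 2 * x1 * y + q0 * m + 2 * q1 * m) >> 2
            i += 2                    # two rounds compressed into one cycle
        else:
            s = (s + x0 * y + q0 * m) >> 1
            i += 1                    # ordinary radix-2 round
        cycles += 1
    return (s - m if s >= m else s), cycles


m, x, y = 239, 123, 200
z, cycles = r2mm_skip(x, y, m)
assert (z << m.bit_length()) % m == (x * y) % m
assert cycles <= m.bit_length()
```

Because a merged cycle computes exactly what two ordinary rounds would compute, the result is unchanged and only the cycle count differs. The cycle counts of this simplified model should not be expected to match the figures reported for the hardware.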

Figure 2. The ShiftX Function

The LogicSD block is a two-state machine. The state bit Shd indicates the position at which the bits $x_{i+1}$ and $x_i$ can be found in XR ($XR_{1..0} = x_{i+1..i}$ when $Shd = 0$, and $XR_{2..1} = x_{i+1..i}$ when $Shd = 1$); the selection is done by the SelXi block. Table III shows the transitions of the FSM (ShdNext is the next state). Based on this table we get: enShiftX = (Skr or Shd) and ShdNext = not (Skr or Shd).


Figure 3. Montgomery Multiplication CSA Skip Rounds Architecture (Top Level – R2MM_CSA_SR)

Lines 15 to 18 of Algorithm 3 were not implemented. Line 15 converts the final carry-save result of the multiplier to binary representation. Lines 16 to 18 correspond to the modular reduction of the final result. These steps are not part of the inner loop of the MM and should be done in a separate hardware block.

Figure 4. Montgomery Multiplication CSA Skip Rounds Architecture (Core – R2MM_CSA_SR)

V. IMPLEMENTATION RESULTS

We compared four architectures for Montgomery Multiplication in the present work: Radix-2 MM (R2MM), Radix-2 MM CSA (R2MM_CSA), Radix-2 MM CSA Skip Rounds (R2MM_CSA_SR) and Radix-4 MM Booth CSA (R4MM_BOOTH_CSA). They were compared for different bit precisions in terms of critical path length, total area (combinational, noncombinational and net interconnect area) and total computational time (critical path × number of clock cycles) needed to perform a complete modular multiplication. The designs were described in VHDL and synthesized with the Synopsys Design Compiler using the CMOS technology library “tc6a_cbacore.db”. The actual technology used is not important, since it serves only as a reference for comparison between the designs.

In order to assess the effective reduction in clock cycles of the R2MM_CSA_SR algorithm, we simulated the modular exponentiation required for encryption and decryption in the 2048-bit RSA public-key cryptosystem, using the Xilinx ISE Design Suite. The results were checked against a software implementation using the multi-precision integer library from PolarSSL.

Table IV shows the experimental results, where the total computational time for the R2MM_CSA_SR algorithm takes into account the 33.6% reduction in clock cycles with respect to the R2MM_CSA algorithm. There was an area increase of between 10% and 24% for the circuits described in the table, which cover moduli with precisions of 1024 and 2048 bits.

The gain obtained with the method based on “skipping rounds” depends on the data processed by the multiplier. A statistical model based on a Markov chain was built to estimate a priori the percentage of skip-round events. According to this model, R2MM_CSA_SR produces a 27.8% reduction in clock cycles. The average reduction in the number of cycles found in our experiment is expected to hold for other input data, because the input data for each iteration has high entropy [8], as a result of the modular multiplications used for cryptographic strength. We performed the 1024-bit RSA decryption experiments with thousands of calls to the R2MM_CSA_SR algorithm.
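For context, the way a Montgomery multiplier is typically called during modular exponentiation can be sketched as follows. This is a generic left-to-right square-and-multiply loop in the Montgomery domain, given as an illustration; it is not the authors' test bench, and the names `mont_mul` and `mont_pow` are hypothetical. Every call to `mont_mul` corresponds to one use of the multiplier.

```python
def mont_mul(a: int, b: int, m: int, n: int) -> int:
    """Radix-2 Montgomery product a*b*2^(-n) mod m, for odd m and 0 <= a, b < m."""
    s = 0
    for i in range(n):
        s += ((a >> i) & 1) * b
        s += (s & 1) * m
        s >>= 1
    return s - m if s >= m else s


def mont_pow(base: int, exp: int, m: int) -> int:
    """Compute base**exp mod m using only Montgomery multiplications."""
    n = m.bit_length()
    r2 = pow(2, 2 * n, m)                    # R^2 mod m, precomputed once per modulus
    x = mont_mul(base % m, r2, m, n)         # enter the Montgomery domain: base*R mod m
    acc = mont_mul(1, r2, m, n)              # 1*R mod m
    for bit in bin(exp)[2:]:                 # scan exponent bits, most significant first
        acc = mont_mul(acc, acc, m, n)       # square
        if bit == "1":
            acc = mont_mul(acc, x, m, n)     # multiply
    return mont_mul(acc, 1, m, n)            # leave the Montgomery domain


# Usage with a toy RSA key (far too small for real use): n = 61 * 53, e = 17, d = 2753.
modulus, e, d = 3233, 17, 2753
message = 65
cipher = mont_pow(message, e, modulus)
assert cipher == pow(message, e, modulus)
assert mont_pow(cipher, d, modulus) == message
```

An exponentiation with a k-bit exponent issues between k and 2k multiplier calls in this scheme, which is why a per-multiplication cycle reduction carries over to the whole RSA operation.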


TABLE IV. THE SUMMARY REPORT: TIMING AND AREA OF MM ALGORITHMS

VI. CONCLUSIONS

This work presented a method to make more efficient use of the hardware resources required to compute the Montgomery Multiplication algorithm, optimizing the use of its adders when they are inactive due to particular data values. By looking ahead at the data to be processed by the adders, we were able to compress two clock cycles into one for some combinations of the input data. The incorporation of the CS representation for intermediate results produced the R2MM_CSA_SR algorithm and a hardware implementation. The experiments demonstrate that this solution uses significantly less total computational time to perform a complete modular multiplication than the original radix-2 MM CSA algorithm, without a significant increase in area. It is an efficient implementation of MM derived from a radix-2 algorithm.

REFERENCES

[1] R. L. Rivest, A. Shamir, L. Adleman, A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21(2):120-126, February 1978.
[2] W. Diffie, M. E. Hellman, New directions in cryptography. IEEE Transactions on Information Theory, 22:644-654, November 1976.
[3] P. L. Montgomery, Modular multiplication without trial division. Mathematics of Computation, 44(170):519-521, April 1985.
[4] N. Nedjah (Editor), L. M. Mourelle (Editor), Embedded Cryptographic Hardware - Methodologies & Architectures. Nova Science Publishers, 2004.
[5] Ç. K. Koç, T. Acar, B. S. Kaliski, Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, vol. 16, no. 3, June 1996, pp. 26-33.
[6] A. F. Tenca, Ç. K. Koç, A Scalable Architecture for Montgomery Multiplication. Lecture Notes in Computer Science, Springer, 1999, 94-108.
[7] M. E. Kaihara, N. Takagi, Bipartite Modular Multiplication Method. IEEE Trans. Computers (TC) 57(2):157-164 (2008).
[8] A. J. Menezes, P. C. van Oorschot, S. A. Vanstone, Handbook of Applied Cryptography. CRC Press, 1996.
[9] J. Leu, A. Wu, Design methodology for Booth-encoded Montgomery module design for RSA cryptosystem, ISCAS 2000 Geneva, IEEE International Symposium on, vol. 5, pp. 357-360, 2000.
[10] P. Amberg, N. Pinckney, D. M. Harris, Parallel High-Radix Montgomery Multipliers, Signals, Systems and Computers, 2008 42nd Asilomar Conference on, pp. 772-776, 26-29 Oct. 2008.
