Implementation of Scalable Montgomery Multiplication Coprocessor in Altera Reconfigurable Hardware

Miloš Drutarovský* and Viktor Fischer**

Abstract – The paper describes the implementation of a scalable Montgomery Multiplication (MM) coprocessor in Altera Field Programmable Devices (FPD). The proposed coprocessor performs modular MM with large numbers and can be used as a scalable building block of a cryptographic RSA processor. All blocks of the MM coprocessor are implemented in VHDL as parameterized modules and optimized for Altera FPD using large dual port embedded memory blocks. There is no limitation on the maximum size of operands, and the actual word size can be selected according to the available FPD capacity and/or the desired performance.

Index terms – Montgomery multiplication, RSA, Altera FPD, scalable cryptographic coprocessor
1 Introduction
The Montgomery Multiplication (MM) algorithm [1] is an efficient method for modular multiplication with an arbitrary modulus, particularly suitable for implementation on general-purpose computers. The method is based on a representation of the residue class modulo M and replaces division by M with division by a power of 2. This operation is easily accomplished on a computer since the numbers are typically represented in binary form. Various high-radix MM algorithms modify the original method in order to obtain a more efficient software implementation on specific computer architectures [2]. For hardware implementation, low-radix designs are usually more attractive, since high-radix algorithms are more complex, typically consume a significant amount of chip area and require longer clock cycle times [3]. The MM is the basic building block for the modular exponentiation operation that is required in the Diffie-Hellman and RSA public-key (PK) cryptosystems [1]. In this paper we describe the implementation of a scalable MM coprocessor, one of the main building blocks of a complete RSA PK cryptographic processor implemented in Altera FPD [4]. The main novelty of our implementation is an optimization of the Multiple Word Radix-2 Montgomery Multiplication (MWR2MM) algorithm [5] for Altera FPD using large dual port embedded memory.
2 Scalability
An arithmetic (cryptographic) unit is scalable if the unit can be reused or replicated in order to generate long-precision results independently of the data path precision for which the unit was designed [5]. In the following, we propose an Altera FPD hardware implementation of the MM coprocessor that is attractive in terms of performance and scalability and uses the dual port Embedded Memory Blocks (EMB) of current Altera FPD [6].
3 MM based Modular Exponentiation
The basic mathematical operation used by popular PK exchange schemes is modular exponentiation [1]

$$Z = X^E \bmod M \qquad (1)$$

which binary or general m-ary methods break into a series of modular multiplications. All of these computations have to be performed with large k-bit integers (typically $k \in \{512, 768, 1024, \ldots, 2048, \ldots\}$). The well-known MM algorithm [2] speeds up the modular multiplication and squaring required for exponentiation. It computes the MM product for k-bit integers X, Y

$$MM(X, Y) = X Y R^{-1} \bmod M \qquad (2)$$

where $R = 2^k$ and M is an integer in the range $2^{k-1} < M < 2^k$ such that the Greatest Common Divisor $GCD(R, M) = 1$. The basic MM (2) can be used for efficient computation of (1) by the standard Montgomery exponentiation algorithm [1] ($E = (e_{t-1}, \ldots, e_0)_2$ with $e_{t-1} = 1$; all other variables are k-bit integers):

    $\bar{X} = MM(X, R^2 \bmod M) = XR \bmod M$
    $A = R \bmod M$
    for $i = t - 1$ down to 0
        $A = MM(A, A)$
        if $e_i = 1$ then $A = MM(A, \bar{X})$
    $A = MM(A, 1)$                                        (3)
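The exponentiation scheme (3) can be checked in software. The following Python sketch is only an illustration under stated assumptions: `mm` is a plain bit-serial radix-2 Montgomery product on Python integers (not the word-serial hardware algorithm of the coprocessor), and the function names `mm` and `mont_exp` are hypothetical.

```python
def mm(x, y, m, k):
    """Bit-serial radix-2 Montgomery product: x * y * 2^(-k) mod m.

    Assumes m is odd, x < 2**k and y < m (illustrative model only).
    """
    s = 0
    for i in range(k):
        s += ((x >> i) & 1) * y   # add y if bit i of x is set
        if s & 1:                 # make the sum even by adding the modulus
            s += m
        s >>= 1                   # exact division by 2
    return s - m if s >= m else s # one final subtraction suffices (s < 2m)


def mont_exp(x, e, m, k):
    """Montgomery exponentiation following scheme (3), with R = 2**k."""
    r = 1 << k
    x_bar = mm(x, (r * r) % m, m, k)          # X_bar = X * R mod M
    a = r % m                                 # A = R mod M
    for i in reversed(range(e.bit_length())): # scan exponent bits, MSB first
        a = mm(a, a, m, k)
        if (e >> i) & 1:
            a = mm(a, x_bar, m, k)
    return mm(a, 1, m, k)                     # leave the Montgomery domain
```

For example, with the 16-bit odd modulus `0xC001`, `mont_exp(12345, 65537, 0xC001, 16)` should agree with Python's built-in `pow(12345, 65537, 0xC001)`.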
* Technical University of Košice, Department of Electronics and Multimedia Communications, Park Komenského 13, 04120 Košice, Slovak Republic, E-mail: [email protected], Tel: ++421-55-6024169, Fax: ++421-55-6323989
** Laboratoire Traitement du Signal et Instrumentation, Unité Mixte de Recherche CNRS 5516, Université Jean Monnet, Saint-Etienne, France, E-mail: [email protected]

4 MWR2MM Algorithm
The implemented MM coprocessor uses the MWR2MM algorithm [5] with word length w. MWR2MM performs bit-level computations, produces word-level outputs and provides direct support for scalable MM coprocessor design. For operands with a k-bit precision, $e = k/w$ words are required. The MWR2MM algorithm scans operand Y (multiplicand) word-wise and operand X (multiplier) bit-wise, so it uses the vectors
$$M = (M^{(e-1)}, \ldots, M^{(1)}, M^{(0)}), \quad Y = (Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)}), \quad X = (x_{k-1}, \ldots, x_1, x_0) \qquad (4)$$
where words are marked with superscripts and bits are marked with subscripts. The concatenation of vectors A and B is represented as $(A, B)$. A particular range of bits in a vector A from position i to position j is represented as $A_{j..i}$. Bit position i of the j-th word of A is represented as $A_i^{(j)}$. The MWR2MM algorithm can be described by the following pseudocode:

    $S = 0$
    for $i = 0$ to $k - 1$
        $C = 0$
        $(C, S^{(0)}) = x_i Y^{(0)} + S^{(0)}$
        if $S_0^{(0)} = 1$ then
            $(C, S^{(0)}) = (C, S^{(0)}) + M^{(0)}$
            for $j = 1$ to $e - 1$
                $(C, S^{(j)}) = C + x_i Y^{(j)} + M^{(j)} + S^{(j)}$
                $S^{(j-1)} = (S_0^{(j)}, S_{w-1..1}^{(j-1)})$
            $S^{(e-1)} = (C, S_{w-1..1}^{(e-1)})$
        else
            for $j = 1$ to $e - 1$
                $(C, S^{(j)}) = C + x_i Y^{(j)} + S^{(j)}$
                $S^{(j-1)} = (S_0^{(j)}, S_{w-1..1}^{(j-1)})$
            $S^{(e-1)} = (C, S_{w-1..1}^{(e-1)})$                    (5)

The algorithm computes a partial sum S for each bit of X, scanning the words of Y and M. Once the precision is exhausted, another bit of X is taken, and the scan is repeated. Thus, the algorithm imposes no constraints on the precision of operands. What varies is the number of loop iterations e required to accomplish the MM operation. The total number of cycles is

$$N_{MWR2MM} = k e = \frac{k^2}{w} \qquad (6)$$

5 Radix 2 MM coprocessor
The implemented radix-2 MM coprocessor is a unit realizing the inner part of the main loop in (5). We suppose that some embedded processor (e.g. an Altera software NIOS CPU) prepares the input values (X, Y, and M) and reads the result (S) obtained by the coprocessor. To implement the coprocessor, we have adapted the Processing Unit (PU) from [5]. Since input and output values can have a high bit count (2048 bits or more per object), we have decided to store inputs and intermediate results in EMB. Therefore the coprocessor has been split into two blocks: a radix-2 Montgomery multiplier and a dual port data memory. The multiplier realizes the arithmetic operations (multiplication of a word by one bit, addition of two or more words) following equations (5). The data memory is used to store input values, intermediate and final results. The CPU can access the data memory to store input values and to read the multiplication result.

In order to reduce storage and arithmetic hardware complexity, the data path of the MM coprocessor uses M, X and Y in a standard non-redundant form. The internal sum S is received and generated in the redundant Carry-Save form [3]. Therefore the bit resolution of the sum S is effectively doubled. In this case, 2w bits per word of S have to be transferred between EMB and data path in each clock cycle. The design of the data path is based on the structure presented in [5]. It consists of two layers of carry-save adders and is shown for w = 3 in Fig.1.

[Figure 1: two layers of full adders (FA) followed by registers; inputs $x_i$, $c$, $Y_{2..0}^{(j)}$, $M_{2..0}^{(j)}$, ${}_1S_{2..0}^{(j)}$, ${}_2S_{2..0}^{(j)}$; outputs $t$, ${}_1S_{2..0}^{(j-1)}$, ${}_2S_{2..0}^{(j-1)}$]
Fig. 1 Structure of MM coprocessor data path (FA represents Full Adder) for word-length w = 3

Input c represents the latched value t, which is the least significant bit of the value $S^{(0)} + x_i Y^{(0)}$ ($c = t = S_0^{(0)}$). This value is computed at the beginning of the main loop (when $j = 0$). While computing word j (step j in the internal loop in (5)), the circuit generates $2(w - 1)$ bits of $S^{(j)}$ and the two most significant bits of $S^{(j-1)}$. The bits of $S^{(j-1)}$ computed at step $j - 1$ must be delayed and concatenated with the most significant bits generated at step j. The block diagram of the MM coprocessor and the critical timing data path are shown in Fig.2. It can be seen that the values of operand X do not pass through the critical path. The most important parameter influencing the overall multiplier speed is the memory access time. Since during one cycle the previous result S has to be read and the current result has to be written to the same memory, we have chosen to configure the memory block as a dual port RAM using the Altera-specific function lpm_ram_dp() from the Library of Parameterized Modules (LPM).
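The word-serial scan of pseudocode (5) in Section 4 can be modelled in software. The Python sketch below is an illustration only, not the VHDL design: it keeps S in ordinary non-redundant words rather than carry-save form, folds the `if`/`else` branches of (5) into a single flag `q`, and, as an assumption made here, uses one guard bit ($e = \lceil (k+1)/w \rceil$ words) so that the unreduced sum always fits. The name `mwr2mm` is hypothetical.

```python
def mwr2mm(x, y, m, k, w):
    """Word-serial radix-2 Montgomery product x * y * 2^(-k) mod m.

    Assumes m odd, x < 2**k, y < m. Software model of pseudocode (5).
    """
    e = -(-(k + 1) // w)                 # assumed word count: ceil((k+1)/w)
    mask = (1 << w) - 1
    Y = [(y >> (w * j)) & mask for j in range(e)]
    M = [(m >> (w * j)) & mask for j in range(e)]
    S = [0] * e
    for i in range(k):                   # bit-wise scan of X
        xi = (x >> i) & 1
        t = xi * Y[0] + S[0]
        q = t & 1                        # S_0^(0): decides whether M is added
        t += q * M[0]
        c, S[0] = t >> w, t & mask
        for j in range(1, e):            # word-wise scan of Y and M
            t = c + xi * Y[j] + q * M[j] + S[j]
            c, S[j] = t >> w, t & mask
            # right shift by one bit: bit 0 of word j becomes MSB of word j-1
            S[j - 1] = ((S[j] & 1) << (w - 1)) | (S[j - 1] >> 1)
        S[e - 1] = (c << (w - 1)) | (S[e - 1] >> 1)
    s = sum(word << (w * j) for j, word in enumerate(S))
    return s - m if s >= m else s        # final correction to the range [0, m)
```

Changing `w` for a fixed `k` changes only the number of inner-loop iterations, which is the scalability property discussed in Section 2.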
[Figure 2: the Radix 2 Montgomery Modular Multiplier reads the w-bit words $Y^{(j)}$, $M^{(j)}$, ${}_1S^{(j)}$, ${}_2S^{(j)}$ from the Embedded Memory Block (4×w×e) and writes the w-bit words ${}_1S^{(j-1)}$, ${}_2S^{(j-1)}$ back; further inputs are $x_i$ and S-value reset; the critical timing path is marked]
Fig. 2 Block diagram of MM coprocessor

Using this option we could increase the final speed by a factor of two. A further speed-up of the implemented MM coprocessor is based on introducing pipeline registers into the critical timing data path, as shown for w = 3 in Fig.3.
[Figure 3: the data path of Fig. 1 with pipeline registers at the inputs (${}_1S_{2..0}^{(j)}$, ${}_2S_{2..0}^{(j)}$, $Y_{2..0}^{(j)}$, $M_{2..0}^{(j)}$, $c$, $x_i$, S-value reset) and at the outputs (${}_1S_{2..0}^{(j-1)}$, ${}_2S_{2..0}^{(j-1)}$)]
Fig. 3 Pipelined structure of implemented MM coprocessor data path for word-length w = 3

Special attention was also paid to the S-value reset (first line of the pseudocode). Initialization of the S-value memory by the CPU is not acceptable, because in that case an additional multiplexer inserted into the critical timing data path (in front of the S-value memory block) would decrease the performance of the coprocessor. This problem has been solved using an additional reset signal added to the input pipeline register (dashed line in Fig.3). Using this signal, all input S-values can be reset during the first e computation cycles.
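The redundant carry-save form of S used by the data path in this section can be illustrated with a small software model. The Python sketch below is an assumption-laden illustration, not the implemented circuit: it works on full-width integers instead of w-bit slices, and the order in which $x_i Y$ and $M$ enter the two full-adder layers is assumed. The names `csa` and `mm_carry_save` are hypothetical.

```python
def csa(a, b, c):
    """One layer of full adders (3:2 compressor): a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def mm_carry_save(x, y, m, k):
    """Radix-2 Montgomery product with S held as a carry-save pair (s1, s2).

    Assumes m odd, x < 2**k, y < m. No carry propagates inside the loop.
    """
    s1 = s2 = 0
    for i in range(k):
        xi = (x >> i) & 1
        u, v = csa(s1, s2, xi * y)   # first FA layer: add x_i * Y
        t = u & 1                    # LSB of S + x_i*Y (v is always even)
        p, q = csa(u, v, t * m)      # second FA layer: add M if t = 1
        s1, s2 = p >> 1, q >> 1      # p and q are both even here
    s = s1 + s2                      # single carry-propagate addition at the end
    return s - m if s >= m else s
```

Because each loop iteration passes through only two layers of full adders, the delay per step does not grow with the operand width, which is the property the carry-save data path relies on.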
6 EMB Size/Speed Tradeoffs and Limitations
The total computation time of the unpipelined MM operation $T_{MM}$ is

$$T_{MM} \approx \frac{k^2}{w}\, T_{MM\_loop} = \frac{k^2}{w}\left(T_{PU}(w) + T_{R/W}(w)\right) \qquad (7)$$

where $T_{PU}(w)$ is the propagation delay of the multiplier and $T_{R/W}(w)$ is the access time for reading the w-bit words $Y^{(j)}$, $M^{(j)}$, ${}_1S^{(j)}$, ${}_2S^{(j)}$ from EMB and writing the w-bit words ${}_1S^{(j-1)}$, ${}_2S^{(j-1)}$ to EMB. In order to decrease $T_{MM}$ for given k-bit operands, it is necessary to increase the word length w and to decrease the $T_{PU}(w)$ and $T_{R/W}(w)$ times. Thanks to the Carry-Save adder chain structure, the $T_{PU}(w)$ cycle time is almost independent of the word length w, so $T_{PU}(w) \approx 2T_{FA}$. This is a direct consequence of the absence of a carry chain in the FA array of the data path structure. $T_{R/W}(w)$ can be minimized by reading 4w bits from EMB in parallel and writing 2w bits in parallel at the same time, using the dual-port RAM feature of EMB in Altera ACEX and APEX FPD [6]. Memory blocks of these FPD families can be configured so that data writing and reading can be executed in parallel on two independent ports controlled by two clocks. Under these conditions $T_{R/W}(w) \approx T_{EMB}$, where $T_{EMB}$ is the access time for parallel read/write access to EMB, and

$$T_{MM} \approx \frac{k^2}{w}\left(2T_{FA} + T_{EMB}\right) = \frac{k^2}{w}\,\frac{1}{F_{clk}} \approx \mathrm{const}\cdot\frac{k^2}{w} \qquad (8)$$

where $F_{clk}$ is the maximal possible FPD clock frequency (the maximal value of $F_{clk}$ depends significantly on the FPD family actually used but only slightly on the word length w). This is demonstrated in Table 1 for selected ACEX and APEX devices. The presented results have been obtained using the Altera QUARTUS II, v. 1.1 development system. For the pipelined implementation, the total computation time of the MM operation $T_{MM\_pip}$ is

$$T_{MM\_pip} \approx \frac{k^2}{w}\,\mathrm{Max}\left(T_{PU}(w), T_{EMB}(w)\right) \approx \frac{k^2}{w}\,\frac{1}{F_{clk\_pip}} \qquad (9)$$

since $T_{PU}$ and $T_{EMB}$ are overlapped. This is demonstrated in Table 2. The actual FPD device/family imposes a limitation on the actual word length w. EAB (ESB) blocks of ACEX (APEX) FPD can provide up to 16-bit access per EAB (ESB). Fully parallel access to the $Y^{(j)}$, $M^{(j)}$, ${}_1S^{(j)}$ and ${}_2S^{(j)}$ variables therefore requires

$$N_{EAB/ESB} = \frac{4w}{16} \qquad (10)$$

EAB (ESB) blocks. The number of blocks in low cost ACEX FPD and low capacity high performance APEX FPD for k = 2048 is given in Table 3. Following equation (10) and the number of available EAB/ESB in Table 3, we have decided to use w = 16 for the ACEX and w = 32 for the APEX family.
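Equations (6), (9) and (10) are simple enough to evaluate directly. The Python sketch below is a rough estimator under the assumptions of those equations only (exactly $k^2/w$ cycles, one cycle per clock period, 16-bit access per memory block); the function names are hypothetical and any control or CPU overhead is not modelled.

```python
def mm_cycles(k, w):
    """Cycle count of one MM operation, equation (6): k * e with e = k / w."""
    return k * (k // w)


def mm_time_us(k, w, fclk_mhz):
    """Estimated MM time in microseconds, equations (8)/(9): cycles / Fclk."""
    return mm_cycles(k, w) / fclk_mhz


def emb_blocks(w, port_width=16):
    """Memory blocks needed for parallel access to 4 w-bit words, equation (10)."""
    return -(-4 * w // port_width)   # ceiling division
```

For k = 1024, w = 16 and a 130 MHz clock, `mm_time_us(1024, 16, 130.0)` gives about 504 µs, somewhat below the 531 µs reported for the EP1K100 in Table 6; the estimate should therefore be read as a lower bound rather than a prediction.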
Table 1 Max $F_{clk}$ for unpipelined implementation
          EP1K100-1   EP20K100-1   EP20K160-1
w=16      89 MHz      127 MHz      128 MHz
w=32      80 MHz      126 MHz      127 MHz
w=64      N/A         112 MHz      119 MHz

Table 2 Max $F_{clk\_pip}$ for pipelined implementation
          EP1K100-1   EP20K100-1   EP20K160-1
w=16      130 MHz     162 MHz      158 MHz
w=32      104 MHz     148 MHz      142 MHz
w=64      N/A         145 MHz      138 MHz

Table 3 Number of EMB in ACEX and APEX FPD
Device      EMB type   Number of EMB
EP1K100     EAB        12
EP20K60     ESB        16
EP20K100    ESB        26
EP20K160    ESB        40
7 Results of Mapping to the Altera FPD
To map the presented MM coprocessor into Altera FPD, a VHDL-based design methodology has been used. All presented results have been obtained using timing analysis and implementation reports generated by Altera development tools. The use of memory resources together with the area occupation expressed in Logic Elements (LE) for selected Altera FPD is given in Table 4 and Table 5.

Table 4 Area occupation of MM coprocessor for unpipelined implementation
              EP1K100   EP20K100   EP20K160
LE #          160       206        206
LE %          3         5          3
EMB blocks    4         8          8
EMB %         33        31         20

Table 5 Area occupation of MM coprocessor for pipelined implementation
              EP1K100   EP20K100   EP20K160
LE #          258       400        206
LE %          5         10         6
EMB blocks    4         8          8
EMB %         33        31         20
The performance of the implemented scalable MM coprocessor for some typical operand sizes k is presented in Table 6.

Table 6 Speed of MM operation for pipelined implementation
MM size (bits)        1024      2048
EP1K100, 130 MHz      531 µs    2100 µs
EP20K100, 145 MHz     235 µs    932 µs
8 Conclusions
In the paper we have described the MM coprocessor hardware implemented in FPD with large dual port EMB. Design tradeoffs to identify an optimal hardware configuration for Altera FPD have been discussed. Finally, speed/area results of the proposed MM coprocessor implementation were presented for some typical RSA operand sizes. It was shown that the pipelined implementation provides a significant speed improvement at the expense of only a small increase in the number of LE used. The proposed MM coprocessor is developed as a parameterized VHDL building block of a more complex cryptographic chip based on FPD [4]. In real applications it has to be controlled by a control unit (e.g. the VHDL synthesized PIC microcontroller used in [4] or the NIOS CPU [7]) in order to perform the modular exponentiation necessary for PK systems. The advantage of the FPD implementation lies in the fact that the proposed MM coprocessor together with a high performance symmetrical algorithm (e.g. [8]) can provide a unique cryptographic system-on-chip solution fitted into one low-cost FPD. This solution is currently in development and will be presented in a future paper.

References
[1] J.A. Menezes, P.C. Oorschot, S.A. Vanstone: “Applied Cryptography”, CRC Press, New York, 1997.
[2] C.K. Koc, T. Acar: “Analyzing and Comparing Montgomery Multiplication Algorithms”, IEEE Micro, (16) 3, pp.26-33, June 1996.
[3] C.K. Koc: “RSA Hardware Implementation”, www.rsa.com, pp.1-28, August 1995.
[4] V. Fischer, M. Drutarovský: “Scalable RSA Processor in Reconfigurable Hardware - a SoC Building Block”, Proceedings of XVI Conference on Design of Circuits and Integrated Systems DCIS 2001, November 2001, Porto, Portugal.
[5] A.F. Tenca, C.K. Koc: “A Scalable Architecture for Montgomery Multiplication”, In C.K. Koc and C. Paar, editors: Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science No.1717, pp.94-108. Springer, Berlin, Germany 1999.
[6] “ACEX 1K and APEX 20K Programmable Logic Family”, www.altera.com.
[7] “NIOS Soft Core Embedded processor”, www.altera.com.
[8] V. Fischer, M. Drutarovský: “Two Methods of Rijndael Implementation in Reconfigurable Hardware”, In C.K. Koc, D. Naccache and C. Paar, editors: Cryptographic Hardware and Embedded Systems – CHES 2001, Lecture Notes in Computer Science No.2162, pp.77-92. Springer, Berlin, Germany 2001.