Scalable RSA Processor in Reconfigurable Hardware - a SoC Building Block
Viktor Fischer (1) and Miloš Drutarovský (2)
(1) Laboratoire Traitement du Signal et Instrumentation, Unité Mixte de Recherche CNRS 5516, Université Jean Monnet, Saint-Etienne, France, [email protected]
(2) Department of Electronics and Multimedial Communications, Technical University of Košice, Park Komenského 13, 041 20 Košice, Slovak Republic, [email protected]
Abstract
The paper introduces a scalable, programmable RSA cryptographic processor implemented as an IP core in Field Programmable Devices (FPD). The processor is built from three main blocks: an embedded standard microcontroller, a scalable Montgomery Multiplication unit and a simple vector adder unit. All blocks are implemented in VHDL as parameterized modules. The IP core is developed as a System on a Chip (SoC) building block for public-key exchange schemes, to be used in a more complex cryptographic chip that combines symmetrical and asymmetrical algorithms. There is no limitation on the maximum size of RSA operands, and the actual word size can be selected according to the available FPD capacity and/or the desired performance.
1. Introduction
In recent years, public key (PK) cryptography has gained increasing attention from both chip vendors and end users. One consequence of this trend has been the growing importance of PK hardware that stores the user's private key and provides a secure computing environment for private key operations. Typical examples are dedicated smart card crypto-coprocessors [1]. These programmable chips provide very good support for PK cryptography but have only limited support for symmetrical block ciphers (e.g. they typically support only the low-throughput DES algorithm). Applications that require high throughput (say several MBytes per second or more) need a symmetrical block cipher implementation combined with a PK exchange scheme in a multi-chip solution. Moreover, chips for the new Advanced Encryption Standard (AES) [2] are not yet commercially available. Current high-density FPD provide an alternative hardware platform even for system-level integration.
This paper describes the implementation of a programmable, scalable RSA cryptographic processor implemented in FPD and optimized for Altera FPD with large embedded memory blocks (EMB). The processor is developed as an Intellectual Property (IP) building block for PK exchange schemes, to be used in a complete SoC solution of a symmetrical/asymmetrical cryptographic chip. The main aim of our work was to implement the whole RSA algorithm in a low-cost FPD with practical computational latency. The paper starts with a general overview of the basic RSA algorithm operations; then we describe hardware building blocks in FPD with large EMB. Next we discuss design tradeoffs to identify the best hardware configuration for Altera FPD. Finally, speed/area results of the proposed RSA processor implementation in the Altera ACEX and APEX FPD families are presented for some RSA operand sizes and compared with the performance of standard smart card crypto-coprocessor chips.
2. Basic RSA Algorithm Operations
The basic mathematical operation used by the RSA algorithm (and also by other popular key exchange schemes, e.g. the Diffie-Hellman protocol) is modular exponentiation [4]
$$Z = X^E \bmod M \tag{1}$$
that binary or general $m$-ary methods can break into a series of modular multiplications. All of these computations have to be performed with large $k$-bit integers ($k \in \{512, 768, 1024, \ldots, 2048, \ldots\}$). The well-known Montgomery Multiplication (MM) algorithm [4], [5] speeds up the modular multiplication and squaring required for exponentiation (1). It computes the Montgomery product for $k$-bit integers $X$, $Y$
$$\mathrm{MM}(X, Y) = X Y R^{-1} \bmod M \tag{2}$$
where $R = 2^k$ and $M$ is an integer in the range $2^{k-1} < M < 2^k$ such that the Greatest Common Divisor $\mathrm{GCD}(R, M) = 1$. Since $R = 2^k$, it is sufficient that the modulus $M$ is an odd integer. Although efficient software implementations use high-radix MM algorithms [5], low-radix (radix-2) methods are usually preferred for hardware implementations [6]. The radix-2 MM algorithm for $k$-bit operands $X = [x_{k-1}, \ldots, x_1, x_0]$, $Y$ can be computed by the following pseudocode
    $S_0 = 0$
    for $i = 0$ to $k - 1$
        if $(S_i + x_i Y)$ is even
            then $S_{i+1} = (S_i + x_i Y) / 2$
            else $S_{i+1} = (S_i + x_i Y + M) / 2$                (3)
    if $S_k \ge M$ then $S_k = S_k - M$    (final correction)
    $\mathrm{MM}(X, Y) = S_k$

The basic MM algorithm (3) can be used for efficient computation of (1) by the standard Montgomery exponentiation algorithm [4] ($E = (e_{t-1}, \ldots, e_0)_2$, with $e_{t-1} = 1$ (footnote 1); all other variables are $k$-bit integers)

    $X = \mathrm{MM}(X, R^2 \bmod M) = XR \bmod M$
    $A = R \bmod M$
    for $i = t - 1$ to $0$
        $A = \mathrm{MM}(A, A)$
        if $e_i = 1$ then $A = \mathrm{MM}(A, X)$
    $A = \mathrm{MM}(A, 1)$                                        (4)

Standard RSA $k$-bit exponentiation (1) can be sped up for decryption using the Chinese Remainder Theorem (CRT). CRT-based RSA decryption uses the fact that $M = pq$, where $p$, $q$ are large $k/2$-bit prime numbers available in the decryption hardware, so (1) can be represented in a residue system as
$$Z_p = Z \bmod p = (X \bmod p)^{d_p} \bmod p, \qquad d_p = E_d \bmod (p - 1) \tag{5}$$
$$Z_q = Z \bmod q = (X \bmod q)^{d_q} \bmod q, \qquad d_q = E_d \bmod (q - 1) \tag{6}$$
where $E_d$ is the decryption exponent. The final result (1) can be computed by combining the coordinates $Z_p$, $Z_q$ according to Garner's CRT algorithm [4]
$$P_q^{-1} = p^{-1} \bmod q, \qquad K = (Z_q - Z_p) P_q^{-1} \bmod q, \qquad Z = Z_p + K p \tag{7}$$
which can be up to 4 times faster than standard decryption [4].

3. Blocks for RSA Algorithm Implementation
The speed and FPD resource requirements of the proposed IP core depend on the method used for mapping it onto the available FPD resources. Based on an analysis of the necessary basic operations (2)-(7) and the target FPD features, we have split the IP core into 4 blocks:
a) Montgomery Multiplication Unit (MMU)
b) Vector Adder (VAU)
c) Embedded Microcontroller (MCU)
d) Vector Memory (VMU)
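As a software reference for the operations (3)-(7) that these blocks have to support, the following minimal Python sketch may be useful. It is not the firmware or hardware of the IP core; the function names `mont_mul`, `mont_exp` and `crt_decrypt` are hypothetical, and the sketch assumes an odd modulus below $2^k$ and Python 3.8 or later (for the modular inverse via `pow`).

```python
def mont_mul(x, y, m, k):
    """Radix-2 Montgomery product as in (3): x * y * 2^(-k) mod m.

    Assumes m is odd, m < 2^k, x < 2^k and y < m.
    """
    s = 0
    for i in range(k):
        s += ((x >> i) & 1) * y   # S_i + x_i * Y
        if s & 1:                 # odd sum: add M so that it becomes even
            s += m
        s >>= 1                   # exact division by 2
    if s >= m:                    # final correction
        s -= m
    return s


def mont_exp(x, e, m, k):
    """Montgomery exponentiation as in (4): x^e mod m, scanning e from the top bit."""
    r = 1 << k
    x_r = mont_mul(x, (r * r) % m, m, k)   # X * R mod M
    a = r % m                              # R mod M
    for i in range(e.bit_length() - 1, -1, -1):
        a = mont_mul(a, a, m, k)
        if (e >> i) & 1:
            a = mont_mul(a, x_r, m, k)
    return mont_mul(a, 1, m, k)            # remove the factor R


def crt_decrypt(x, d, p, q, k):
    """CRT decryption as in (5)-(7) with two k/2-bit exponentiations."""
    half = k // 2
    z_p = mont_exp(x % p, d % (p - 1), p, half)   # (5)
    z_q = mont_exp(x % q, d % (q - 1), q, half)   # (6)
    p_inv = pow(p, -1, q)                         # p^(-1) mod q, Python 3.8+
    big_k = ((z_q - z_p) * p_inv) % q
    return z_p + big_k * p                        # (7)


if __name__ == "__main__":
    # Toy 12-bit key, for illustration only: M = 61 * 53, E_e = 17, E_d = 2753.
    p, q, k = 61, 53, 12
    cipher = mont_exp(65, 17, p * q, k)
    print(cipher, crt_decrypt(cipher, 2753, p, q, k))
```

In this model the precomputations `(r * r) % m`, `r % m`, `x % p` and the final multiplication by `p` use ordinary integer arithmetic; in the IP core these are the operations delegated to the vector adder.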
3.1 Montgomery Multiplication Unit
MMU is the most critical block of the IP core. It realizes the MM operation (2). MM operations form the body of the loop in (4), so the speed of the MM operation largely determines the speed of the complete IP core. MMU as well as VAU are designed as scalable units. This means that the units can be reused to generate long-precision results independently of the data-path precision (word length $w$) for which they were originally designed. MMU uses the Multiple Word Radix-2 Montgomery Multiplication algorithm (MWR2MM) [7], which performs bit-level computations, produces word-level outputs and provides direct support for a scalable MMU design. MWR2MM can be summarized as follows [7]: for operands with $k$-bit precision, $e = k/w$ words are required. MWR2MM scans operand $Y$ (multiplicand) word-by-word and operand $X$ (multiplier) bit-by-bit, so it uses vectors
$$M = (M^{(e-1)}, \ldots, M^{(1)}, M^{(0)}),$$
$$Y = (Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)}),$$
$$X = (x_{k-1}, \ldots, x_1, x_0), \tag{8}$$
where words are marked with superscripts and bits are marked with subscripts. The concatenation of vectors $A$ and $B$ is represented as $(A, B)$. A particular range of bits in a vector $A$ from position $i$ to position $j$ is represented as $A_{j..i}$. Bit $i$ of the $k$-th word of $A$ is represented as $A_i^{(k)}$. The MWR2MM algorithm can be described by the following pseudocode:

    $S = 0$
    for $i = 0$ to $k - 1$
        $C = 0$
        $(C, S^{(0)}) = x_i Y^{(0)} + S^{(0)}$
        if $S_0^{(0)} = 1$ then
            $(C, S^{(0)}) = (C, S^{(0)}) + M^{(0)}$
            for $j = 1$ to $e - 1$
                $(C, S^{(j)}) = C + x_i Y^{(j)} + M^{(j)} + S^{(j)}$
                $S^{(j-1)} = (S_0^{(j)}, S_{w-1..1}^{(j-1)})$
            $S^{(e-1)} = (C, S_{w-1..1}^{(e-1)})$
        else
            for $j = 1$ to $e - 1$
                $(C, S^{(j)}) = C + x_i Y^{(j)} + S^{(j)}$
                $S^{(j-1)} = (S_0^{(j)}, S_{w-1..1}^{(j-1)})$
            $S^{(e-1)} = (C, S_{w-1..1}^{(e-1)})$                    (9)

[Fig. 1. Block diagram of MMU processing unit: the radix-2 Montgomery modular multiplier receives $x_i$ and the $w$-bit words $Y^{(j)}$, $M^{(j)}$, $1S^{(j)}$, $2S^{(j)}$ from the embedded data memory ($4 \times w \times e$ bits) and writes back $1S^{(j-1)}$, $2S^{(j-1)}$; this memory-multiplier loop is the critical timing path.]

Footnote 1: RSA typically uses small encryption exponents, e.g. $E_e = (3)_{10} = (11)_2$ ($t = 2$) or $E_e = (65537)_{10} = (10001)_{16}$ ($t = 17$). Standard RSA decryption uses full-size $t = k$ bit exponents.
The algorithm computes a partial sum $S$ for each bit of $X$, scanning the words of $Y$ and $M$. Once the precision is exhausted, another bit of $X$ is taken and the scan is repeated. Thus, the algorithm imposes no constraints on the precision of the operands. What varies is the number of loop iterations required to accomplish the MM operation. The total number of cycles required by the algorithm is
$$N_{MWR2MM} = k e = \frac{k^2}{w} \tag{10}$$
We have adapted the Processing Unit from [7] to an FPD with large EMB. The corresponding MMU and the critical data path are shown in Fig. 1. The data path of the MMU receives the inputs $S^{(j)}$, $Y^{(j)}$, $M^{(j)}$ from EMB, computes the new intermediate value $S^{(j)}$ and stores it into the internal data-path registers. Results from the previous computation, combined with the actual results, form the $w$-bit word $S^{(j-1)}$ that is stored back to EMB.
In order to reduce storage and arithmetic hardware complexity, the data path of the MMU uses $M$, $X$ and $Y$ in a standard non-redundant form. The internal sum $S$ is received and generated in the redundant Carry-Save form [6]. In this case, $2w$ bits per word of $S$ are transferred between EMB and the data path in each clock cycle. The data path also makes available the information on the least significant bit ($c = t^{(0)} = S_0^{(0)}$) of the computation $S^{(0)} + x_i Y^{(0)}$, which is the first computation step performed by the data path before the inner cycle (for $j = 1$ to $e - 1$ in (9)) starts. The design of the data path is based on the structure presented in [7]. It consists of two layers of carry-save adders and is shown for $w = 3$ in Fig. 2. When computing the bits of word $j$ (step $j$ in the internal loop in (9)), the circuit generates $2(w - 1)$ bits of $S^{(j)}$ and the two most significant bits of $S^{(j-1)}$. The bits of $S^{(j-1)}$ computed at step $j - 1$ must be delayed and concatenated with the most significant bits generated at step $j$.

[Fig. 2. Structure of MMU data path for $w = 3$ (FA represents Full Adder)]
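The word-serial scan of (9) can be checked against the definition (2) with a small software model. The sketch below is only an illustration under stated assumptions: the name `mwr2mm` is hypothetical, ordinary integers stand in for the carry-save vectors $1S$, $2S$, and the word count is taken as $e = \lceil (k+1)/w \rceil$ (one bit of headroom because the intermediate sum can reach $2M - 1$), which differs from the $e = k/w$ used in the text. The two branches of (9) are merged by multiplying $M^{(j)}$ with the parity bit.

```python
def mwr2mm(x, y, m, k, w):
    """Software model of the word-serial radix-2 Montgomery product (9).

    Returns x * y * 2^(-k) mod m for odd m < 2^k, x < 2^k, y < m.
    Assumption of this sketch: e = ceil((k + 1) / w) words.
    """
    e = (k + w) // w                       # equals ceil((k + 1) / w)
    mask = (1 << w) - 1
    y_words = [(y >> (w * j)) & mask for j in range(e)]
    m_words = [(m >> (w * j)) & mask for j in range(e)]
    s = [0] * e
    for i in range(k):                     # scan X bit by bit
        x_i = (x >> i) & 1
        t = x_i * y_words[0] + s[0]
        add_m = t & 1                      # parity decides whether M is added
        t += add_m * m_words[0]
        c, s[0] = t >> w, t & mask
        for j in range(1, e):              # scan Y and M word by word
            t = c + x_i * y_words[j] + add_m * m_words[j] + s[j]
            c, s[j] = t >> w, t & mask
            # one-bit right shift across the word boundary
            s[j - 1] = ((s[j] & 1) << (w - 1)) | (s[j - 1] >> 1)
        s[e - 1] = (c << (w - 1)) | (s[e - 1] >> 1)
    result = sum(word << (w * j) for j, word in enumerate(s))
    if result >= m:                        # final correction as in (3)
        result -= m
    return result


if __name__ == "__main__":
    m, k = 3233, 12
    expected = (1234 * 2790 * pow(1 << k, -1, m)) % m
    for w in (4, 8, 16):
        print(w, mwr2mm(1234, 2790, m, k, w) == expected)
```

Because the result does not depend on `w`, the same routine serves any operand size, which is the scalability property described above; only the number of inner-loop iterations changes.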
3.2 Vector Adder
This unit provides support for the necessary vector-oriented computations (with $k$-bit or $k/2$-bit integer numbers). Typical operations are
- final correction in (3),
- precomputation of $X \bmod p$ and $X \bmod q$ in (5) and (6),
- computation of $R \bmod M$ and $XR \bmod M$ in (4),
- standard multiplication of long numbers in (7).
Although these computations are not time-critical, they are too complex for the MCU used. VAU provides basic vector primitives for the execution of these computations under MCU control. All of the necessary computations can be done with the following $k$-bit vector primitives
$$T = U \pm V \tag{11}$$
$$T = 2U \tag{12}$$
$$U > V \tag{13}$$
Note: The last two operations can be realized using addition and subtraction, respectively.

3.3 Embedded Microcontroller
The programmable MCU implements the sequential automata that control the operation of MMU and VAU. The MCU can also perform other supporting functions of the IP core, such as key management, hardware control, etc. Its function is fully programmable and the MCU architecture is parameterizable (e.g. the size of the actual program memory, the stack size, etc. can be changed), so its functionality can be tailored to a specific SoC target application. The MCU is a well-tested core synthesized from VHDL and based on the standard Microchip PIC16C55, an 8-bit EPROM-based RISC CMOS microcontroller [8]. It has only 33 single-word (12-bit), single-cycle instructions (except for program branches, which take two cycles). The program is stored in the internal EPROM organized as 512 words of 12 bits. The MCU VHDL core has the following basic features and extensions [9]:
- low FPD resource requirements,
- ROM size can be tailored to the application requirements,
- stack levels and register bank can be easily extended or reduced,
- new functions can be added to the MCU.
Another advantage of the MCU used is that powerful and easy-to-use development tools (assembler, simulator [10], C-compiler) are available for this microcontroller.

3.4 Vector Memory
VMU consists of the memory itself and of control and multiplexing circuits. The memory stores all working vector variables of MMU, provides input data to MMU and stores output data from MMU. The critical path of the IP core depicted in Fig. 1 includes a part of VMU. In order to simplify this critical path, the memory for the $1S$ and $2S$ vectors is not directly accessible from MCU. The only necessary initialization, clearing of $1S$ and $2S$, is done directly by the control unit of MMU. The rest of the VMU memory blocks can be read as well as written from MCU through multiplexing circuits.
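To illustrate how the three vector primitives (11)-(13) of Section 3.2 can cover the constant initialization needed by (4), the following sketch computes $R \bmod M$ and $XR \bmod M$ by repeated doubling and conditional subtraction. The text does not say how the firmware actually sequences the primitives, so this is one possible scheme, and all function names are hypothetical.

```python
def vadd(u, v):
    """Primitive (11): T = U + V."""
    return u + v


def vsub(u, v):
    """Primitive (11): T = U - V."""
    return u - v


def vdbl(u):
    """Primitive (12): T = 2U, realized with an addition."""
    return vadd(u, u)


def vgt(u, v):
    """Primitive (13): U > V, realized through the sign of V - U."""
    return vsub(v, u) < 0


def scaled_mod(x, m, k):
    """Return x * 2^k mod m for 0 <= x < m using only the primitives above."""
    t = x
    for _ in range(k):
        t = vdbl(t)            # t < 2m after doubling
        if not vgt(m, t):      # t >= m: one subtraction restores t < m
            t = vsub(t, m)
    return t


if __name__ == "__main__":
    m, k = 3233, 12
    print(scaled_mod(1, m, k))     # R mod M
    print(scaled_mod(65, m, k))    # X * R mod M for X = 65
```

Each call needs $k$ doublings and at most $k$ subtractions, which is small compared with the roughly $1.5k$ Montgomery multiplications of a full-size exponentiation.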
4. Core Design and Implementation Aspects
4.1 EMB Size/Speed Tradeoffs and Limitations
The total computation time of the MM operation $T_{MM}$ is
$$T_{MM} \approx \frac{k^2}{w} T_{MM\_loop} = \frac{k^2}{w} \left( T_{PU}(w) + T_{R/W}(w) \right) \tag{14}$$
where $T_{PU}(w)$ is the propagation delay of the MMU and $T_{R/W}(w)$ is the access time for reading the $w$-bit words $Y^{(j)}$, $M^{(j)}$, $1S^{(j)}$, $2S^{(j)}$ from EMB and writing the $w$-bit words $1S^{(j-1)}$, $2S^{(j-1)}$ to EMB. In order to decrease $T_{MM}$ for given $k$-bit operands, it is necessary to increase the word length $w$ and to decrease the times $T_{PU}(w)$ and $T_{R/W}(w)$. Thanks to the Carry-Save adder chain structure, the cycle time $T_{PU}(w)$ is approximately independent of the word length $w$, so $T_{PU}(w) \approx 2T_{FA}$. This is a direct consequence of the absence of a carry chain in the FA array of the data path structure. $T_{R/W}(w)$ can be minimized by reading $4w$ bits from EMB in parallel while writing $2w$ bits at the same time, using the dual-port feature of EMB in Altera ACEX [11] and APEX [12] FPD. Under these conditions $T_{R/W}(w) \approx T_{EMB}$, where $T_{EMB}$ is the access time for a parallel read/write access to EMB, and
$$T_{MM} \approx \frac{k^2}{w} \left( 2T_{FA} + T_{EMB} \right) = \frac{k^2}{w} \cdot \mathrm{const} = \frac{1}{F_{clk}} \frac{k^2}{w} \tag{15}$$
where $F_{clk}$ is the maximal possible FPD clock frequency (the maximal values of $F_{clk}$ depend significantly on the FPD family actually used but only slightly on the word length $w$). This is demonstrated in Table 1 for selected ACEX and APEX devices. The presented results have been obtained using the Altera MaxPlus II v. 10.0 and QUARTUS II v. 1.0 development systems.
Device        w=16      w=32      w=64
EP1K100-1     54 MHz    45 MHz    N/A
EP20K100-1    120 MHz   116 MHz   113 MHz
EP20K160-1    116 MHz   110 MHz   103 MHz
Table 1. Maximal Fclk for different values of w in ACEX and APEX FPD
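The effect of the word length on a single MM operation can be estimated by combining the cycle count (10) with the clock rates of Table 1, as in (15). The following sketch does this for $k = 1024$; the dictionary layout and the function names are hypothetical, and the frequencies are simply the table entries, so the printed times are estimates rather than measurements.

```python
# Maximal clock frequencies from Table 1, in MHz, keyed by (device, w).
FCLK_MHZ = {
    ("EP1K100-1", 16): 54, ("EP1K100-1", 32): 45,
    ("EP20K100-1", 16): 120, ("EP20K100-1", 32): 116, ("EP20K100-1", 64): 113,
    ("EP20K160-1", 16): 116, ("EP20K160-1", 32): 110, ("EP20K160-1", 64): 103,
}


def mm_cycles(k, w):
    """Cycle count (10): N = k * e = k^2 / w, for w dividing k."""
    return k * (k // w)


def mm_time_us(k, w, fclk_mhz):
    """Estimate (15): T_MM = N / F_clk, in microseconds."""
    return mm_cycles(k, w) / fclk_mhz


if __name__ == "__main__":
    k = 1024
    for (device, w), fclk in sorted(FCLK_MHZ.items()):
        print(f"{device:11s} w={w:2d}  {mm_cycles(k, w):6d} cycles  "
              f"{mm_time_us(k, w, fclk):8.1f} us")
```

Since $F_{clk}$ drops only slightly as $w$ grows, doubling the word length roughly halves the estimated time, which is why the largest $w$ that the available EMB blocks allow is attractive.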
The actual FPD device/family imposes a limitation on the actual word length $w$. EAB (ESB) blocks of ACEX (APEX) FPD can provide up to 16-bit access per EAB (ESB). Fully parallel access to the $Y^{(j)}$, $M^{(j)}$, $1S^{(j)}$, $2S^{(j)}$ variables therefore requires
$$N_{EAB/ESB} = \frac{4w}{16} \tag{16}$$
EAB (ESB) blocks. The number of EAB (ESB) blocks in currently available low-cost ACEX FPD (and low-capacity high-performance APEX FPD) is given in Table 2.

Device        EP1K100   EP20K60   EP20K100   EP20K160
EMB type      EAB       ESB       ESB        ESB
EMB blocks    12        16        26         40
Table 2. Number of EMB in some ACEX and APEX FPD
Following equation (16) and the number of available EAB/ESB in Table 2, we have decided to use $w = 16$ for the ACEX and $w = 32$ for the APEX family.

4.2 MCU Interface to MMU and VAU
Data, address, control and status registers of MMU are mapped directly to the MCU register area by modifying the VHDL code of the MCU used. This approach, shown in Fig. 3, allows these units to be controlled as memory-mapped MCU peripherals.
5. Implementation Results in Altera FPD
5.1 Firmware Programming
The program for the MCU was developed in the C language, compiled by a standard PIC C compiler into HEX format and converted to Altera MIF files that define the contents of EMB. After “Smart recompilation” of the precompiled IP core, the configuration file can be downloaded into the Altera FPD via standard FPD configuration techniques.

5.2 Results of Mapping to the Target Altera FPD
To map the presented IP core into Altera FPD, a VHDL-based design methodology has been used. It should be stressed that all presented results have been obtained from the timing analysis and implementation reports generated by Altera development tools. Hardware tests were carried out on a PCI development board based on the PLX 9052 PCI interface and the Altera ACEX 1K100 QC208-3 FPD. Following the timing analysis results, we have selected a 33 MHz global clock frequency for this device (the PCI bus frequency). The fastest device version could run at 54 MHz (see Table 1). The resource usage for this device, together with area estimations for some APEX devices, is given in the following table:

Device      LE (#)   LE (%)   EMB (blocks)   EMB (%)
EP1K100     1650     33       10             83
EP20K100    2165     52       20             77
EP20K160    2165     33       20             50
Table 3. Area occupation for selected FPD
The vector-oriented MMU and VAU can be efficiently controlled by a simple control-oriented MCU. Since the EMB capacity of low-cost FPD is a typical limitation of a complete SoC, it is expected that EMB (or at least part of it) will be shared with other modules of the chip. A typical example is the use of EMB by the presented IP core for public-key exchange of encrypted symmetrical keys and its use by another IP core for symmetrical encryption/decryption.
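[Fig. 3. Structure of complete RSA core: the embedded PIC16C55 microcontroller (Ports A, B, C and internal registers) drives the memory, multiplier and adder control units through the ADDR, MEM_SEL, CTRL, STATE, D_IN and D_OUT signals; the radix-2 Montgomery multiplier and the vector adder operate on five EMB blocks (Y, M, 1S, 2S, X) of w × e bits each.]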
5.3 Comparison with Smart Card Crypto-Coprocessors
The main task of the proposed IP core is to provide support for RSA-based PK cryptography. It is instructive to compare the performance of the proposed IP core with optimized smart card crypto-coprocessors for PK cryptography [1]. The common parameters used in [1] are:

Sign with CRT (implemented in our IP core as $k$-bit decryption based on Garner's CRT algorithm (5)-(7)). In our IP core, such an implementation has an execution time of approximately
$$T_{CRT} \approx \frac{1}{F_{clk}} \left( \frac{1.5 k^3}{4w} + \frac{16.25 k^2}{16} \right) \tag{17}$$
Sign without CRT (standard $k$-bit decryption), with an execution time of approximately
$$T_{non\_CRT} \approx \frac{1}{F_{clk}} \left( \frac{1.5 k^3}{w} + \frac{9.5 k^2}{16} \right) \tag{18}$$
Verify for $E_e = F_4$ ($k$-bit encryption with the standard $t = 17$ bit exponent), with an execution time of approximately
$$T_{F4} \approx \frac{1}{F_{clk}} \left( \frac{20 k^2}{w} + \frac{5 k^2}{16} + \frac{60 k}{16} \right) \tag{19}$$
where the second-order terms with $k^2/16$ represent the execution time necessary for the initialization of constants (e.g. $XR \bmod M$ in (4)) and for the final correction in (3), performed by VAU with a fixed 16-bit word length. The performance of the presented IP core (the EP1K100 and EP20K100 columns) for these common parameters, as well as some reference values from [1], is presented in Table 4.
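The estimates (17)-(19) are easy to evaluate numerically. The sketch below does so for the two configurations that appear in Table 4, assuming $w = 16$ at 50 MHz for EP1K100 and $w = 32$ at 100 MHz for EP20K100 (the word lengths chosen in Section 4.1 and the clock rates in the table header); the function names are hypothetical. Since the formulas are approximations, the printed values are not expected to reproduce the table entries exactly.

```python
def t_crt(k, w, fclk_hz):
    """Estimate (17): sign with CRT, in seconds."""
    return (1.5 * k**3 / (4 * w) + 16.25 * k**2 / 16) / fclk_hz


def t_non_crt(k, w, fclk_hz):
    """Estimate (18): sign without CRT, in seconds."""
    return (1.5 * k**3 / w + 9.5 * k**2 / 16) / fclk_hz


def t_f4(k, w, fclk_hz):
    """Estimate (19): verify with the 17-bit exponent F4, in seconds."""
    return (20 * k**2 / w + 5 * k**2 / 16 + 60 * k / 16) / fclk_hz


if __name__ == "__main__":
    # (device, w, F_clk in Hz); an assumed pairing, see the text above.
    configs = [("EP1K100", 16, 50e6), ("EP20K100", 32, 100e6)]
    for k in (1024, 2048):
        for device, w, fclk in configs:
            print(f"k={k} {device:9s} "
                  f"CRT {1e3 * t_crt(k, w, fclk):8.1f} ms  "
                  f"non-CRT {1e3 * t_non_crt(k, w, fclk):9.1f} ms  "
                  f"F4 {1e3 * t_f4(k, w, fclk):6.1f} ms")
```

The leading terms show where the CRT gain comes from: two exponentiations with $k/2$-bit operands and exponents cost $2 \cdot 1.5 (k/2)^3 / w = 1.5 k^3 / (4w)$ cycles, a quarter of the $1.5 k^3 / w$ cycles of the full-size exponentiation.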
6. Conclusions
In today's changing marketplace, time-to-market is the key to success. Current high-density FPD provide a hardware platform even for system-level integration, i.e. a SoC solution. IP building blocks, if available, allow electronic systems manufacturers to build, easily and quickly, on a single chip the same functionality that previously consumed several chips or even an entire printed circuit board. The proposed IP block is an implementation of a programmable, scalable RSA cryptographic processor in FPD, optimized for Altera FPD with large EMB. Although the proposed RSA processor is not as flexible as dedicated smart-card chips, it provides comparable support for efficient PK exchange schemes with practical operand sizes and it fits into commercially available low-cost FPD. The advantage of our solution lies in the fact that the proposed IP block, together with a high-performance symmetrical algorithm (e.g. [3]), can provide a unique cryptographic SoC solution fitted into one low-cost FPD. This solution is currently in development and will be presented in a future paper.

References
[1] H. Handschuh, P. Paillier, “Smart Card Crypto-Coprocessors for Public-Key Cryptography”, RSA Laboratories' CryptoBytes, Vol. 4, No. 1, pp. 6-10, Summer 1998.
[2] Advanced Encryption Standard, http://www.nist.gov/aes/
[3] V. Fischer, M. Drutarovský, “Two Methods of Rijndael Implementation in Reconfigurable Hardware”, Proceedings of the Workshop on Cryptographic Hardware and Embedded Systems – CHES'2001, Paris, pp. 81-96, May 2001.
[4] A.J. Menezes, P.C. van Oorschot, S.A. Vanstone, “Handbook of Applied Cryptography”, CRC Press, New York, 1997.
[5] C.K. Koc, T. Acar, “Analyzing and Comparing Montgomery Multiplication Algorithms”, IEEE Micro, (16) 3, pp. 26-33, June 1996.
[6] C.K. Koc, “RSA Hardware Implementation”, www.rsa.com, pp. 1-28, August 1995.
[7] A.F. Tenca, C.K. Koc, “A Scalable Architecture for Montgomery Multiplication”, Proceedings of the Workshop on Cryptographic Hardware and Embedded Systems – CHES'99, August 1999, Worcester, Massachusetts, USA.
[8] PIC16C5X EPROM/ROM Based 8-bit CMOS Series Microcontroller, DS30453, Microchip 2000, www.microchip.com.
[9] V. Fischer, J. Dubois, “Flexible Didactic Card With Embedded Processor”, Proceedings of DCIS 2000, Montpellier, pp. 392-396, Nov. 2000.
[10] MPLAB-IDE, Integrated Development Environment, www.microchip.com.
[11] ACEX 1K Programmable Logic Family, www.altera.com.
[12] APEX 20K Programmable Logic Family, www.altera.com.
RSA (bits)  Algorithm           EP1K100   EP20K100  P83W8516  SLE66CX160S  µPD789828
                                50 MHz    100 MHz   5 MHz     5 MHz        40 MHz
1024        Sign with CRT       551 ms    143 ms    160 ms    230 ms       100 ms
1024        Sign without CRT    2127 ms   535 ms    400 ms    880 ms       360 ms
1024        Verify (Ee = F4)    34.5 ms   10.4 ms   25 ms     24 ms        7 ms
2048        Sign with CRT       4317 ms   1101 ms   1100 ms   1475 ms      750 ms
2048        Sign without CRT    16.9 s    4.2 s     6.4 s     44 s         N/A
2048        Verify (Ee = F4)    138 ms    41 ms     54 ms     268 ms       45 ms
Table 4. Performance comparison of presented RSA IP block and some standard smart card crypto-coprocessors