Montgomery Multiplication Coprocessor for Altera NIOS Embedded Processor
Martin Šimka, Viktor Fischer
Laboratoire Traitement du Signal et Instrumentation, Unité Mixte de Recherche CNRS 5516, Université Jean Monnet, Saint-Etienne, France
[email protected],
[email protected]
Abstract This paper describes a scalable Montgomery Multiplication (MM) coprocessor optimized for the Altera NIOS embedded processor implemented in reconfigurable hardware. Features of the Avalon Bus of the NIOS soft processor are used to connect the coprocessor as a memory-mapped peripheral so that the overall performance is improved. The implemented coprocessor performs MM with large numbers (up to 4096 bits), while the NIOS processor controls data preparation and communication with the MM unit and executes the remaining operations of the RSA algorithm. Various configurations of both units are presented, together with timing analysis results and area estimates for Altera devices.
1. Introduction
Several applications, such as the RSA algorithm, the Diffie-Hellman key exchange algorithm [1], the Digital Signature Standard, and elliptic curve cryptography, use modular multiplication and modular exponentiation. The Montgomery Multiplication (MM) algorithm provides certain advantages in the implementation of modular multiplication. A characteristic aspect of cryptographic applications is that very large numbers are used. The precision varies from 128 and 256 bits for elliptic curve cryptography to 1024 and 2048 bits for applications based on exponentiation. Most hardware designs for modular multiplication are fixed-precision solutions. The main features of the multiplier implemented in the presented MM coprocessor are (1) the ability to work with several operand precisions at the kernel level, (2) the ability to be adjusted to devices of different capacity, and (3) the use of a pipelined organization that reduces the impact of signal loads resulting from the high precision of the operands. Long-precision numbers have previously been handled with small-precision operations by using conventional multipliers together with a control algorithm that uses these multipliers.
The second feature comes from the flexibility of the algorithm and hardware, which can be adjusted in both word size and number of processing elements. The high load on signals broadcast to several hardware components is an important factor that slows down high-precision Montgomery multiplier designs. For this reason, the use of systolic structures has been considered by other researchers. The organization presented in this paper is not purely systolic and has the flavor of a serial-parallel implementation of the multiplication algorithm.
2. RSA Algorithm
The basic mathematical operation used by RSA is modular exponentiation [1]

$$Z = X^E \bmod M \qquad (1)$$
which binary or general m-ary methods break into a series of modular multiplications. All of these computations have to be performed with large k-bit integers (typically $k \in \{512, 768, 1024, \ldots, 2048, \ldots\}$). The well-known MM algorithm [2] speeds up the modular multiplication and squaring required for the exponentiation (1). It computes the MM product for k-bit integers X, Y

$$MM(X, Y) = X Y R^{-1} \bmod M \qquad (2)$$
where $R = 2^k$ and M is an integer in the range $2^{k-1} < M < 2^k$ such that the greatest common divisor GCD(R, M) = 1. The basic MM (2) can be used for efficient computation of (1) by the standard Montgomery exponentiation algorithm [1] ($E = (e_{t-1}, \ldots, e_0)_2$ with $e_{t-1} = 1$; all other variables are k-bit integers):

Algorithm 1 Montgomery exponentiation
    $\tilde{X} = MM(X, R^2 \bmod M) = XR \bmod M$
    $A = R \bmod M$
    for $i = t-1$ down to $0$ do
        $A = MM(A, A)$
        if $e_i = 1$ then
            $A = MM(A, \tilde{X})$
    $A = MM(A, 1)$

The basic building block of this algorithm is MM. The faster the MM is performed, the faster the exponentiation is accomplished.
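As an illustration of Algorithm 1, the following minimal Python sketch evaluates the exponentiation with a reference Montgomery product built directly from equation (2). The function names `mm` and `mont_exp` are hypothetical, and the use of Python's built-in modular inverse (Python 3.8 or later) is an assumption made only for checking results; it says nothing about how the coprocessor computes the product.

```python
def mm(x, y, m, k):
    """Reference Montgomery product X*Y*R^-1 mod M with R = 2^k, equation (2)."""
    # The inverse of R modulo M exists because GCD(R, M) = 1 (M is odd).
    r_inv = pow(1 << k, -1, m)
    return (x * y * r_inv) % m


def mont_exp(x, e, m):
    """Compute X^E mod M by following the steps of Algorithm 1."""
    k = m.bit_length()                    # so that 2^(k-1) < M < 2^k
    r = 1 << k
    x_tilde = mm(x, (r * r) % m, m, k)    # X~ = X*R mod M
    a = r % m                             # Montgomery form of 1
    for i in range(e.bit_length() - 1, -1, -1):
        a = mm(a, a, m, k)                # squaring for every exponent bit
        if (e >> i) & 1:
            a = mm(a, x_tilde, m, k)      # multiplication when e_i = 1
    return mm(a, 1, m, k)                 # leave the Montgomery domain
```

A call such as `mont_exp(7, 65537, 1000003)` can be compared with Python's `pow(7, 65537, 1000003)` to confirm that both give the same residue.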
3. MM-based modular exponentiation algorithms
3.1 Radix-2 algorithm
The implemented MM coprocessor uses the Multiple Word Radix-2 Montgomery Multiplication (MWR2MM) algorithm [6] with word length w. MWR2MM performs bit-level computations, produces word-level outputs and provides direct support for a scalable MM coprocessor design. For operands with k-bit precision, $e = \lceil k/w \rceil$ words are required. The MWR2MM algorithm scans the operands Y (multiplicand) and M word-wise and the operand X (multiplier) bit-wise, so it uses the vectors

$$M = (M^{(e-1)}, \ldots, M^{(1)}, M^{(0)}), \quad Y = (Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)}), \quad X = (x_{k-1}, \ldots, x_1, x_0) \qquad (3)$$

where words are marked with superscripts and bits are marked with subscripts. The concatenation of vectors A and B is represented as (A, B). A particular range of bits in a vector A from position i to position j is represented as $A_{j..i}$. Bit i of the k-th word of A is represented as $A_i^{(k)}$. The MWR2MM algorithm can be described by the following pseudocode:

Algorithm 2 Multiple Word Radix-2 Montgomery Multiplication
    $S = 0$
    for $i = 0$ to $k-1$ do
        $C = 0$
        $(C, S^{(0)}) = x_i Y^{(0)} + S^{(0)}$
        if $S_0^{(0)} = 1$ then
            $(C, S^{(0)}) = C + S^{(0)} + M^{(0)}$
            for $j = 1$ to $e-1$ do
                $(C, S^{(j)}) = C + x_i Y^{(j)} + M^{(j)} + S^{(j)}$
                $S^{(j-1)} = (S_0^{(j)}, S_{w-1..1}^{(j-1)})$
            $S^{(e-1)} = (C, S_{w-1..1}^{(e-1)})$
        else
            for $j = 1$ to $e-1$ do
                $(C, S^{(j)}) = C + x_i Y^{(j)} + S^{(j)}$
                $S^{(j-1)} = (S_0^{(j)}, S_{w-1..1}^{(j-1)})$
            $S^{(e-1)} = (C, S_{w-1..1}^{(e-1)})$

The algorithm computes a partial sum S for each bit of X, scanning the words of Y and M. Once the precision is exhausted, another bit of X is taken and the scan is repeated. Thus, the algorithm imposes no constraints on the precision of the operands. What varies is the number of loop iterations e required to accomplish the MM operation.

3.2 High-radix algorithm
Algorithm 3 shows the Multiple Word High-Radix ($2^m$) Montgomery Multiplication algorithm (MWR$2^m$MM) [7], a generalization of the MWR2MM algorithm (Algorithm 2) presented in subsection 3.1. The parameter m depends on how many bits of the multiplier X are scanned during each loop, that is, on the radix of the computation ($r = 2^m$).

Algorithm 3 Multiple Word High-Radix (Radix-$2^m$) Montgomery Multiplication
    $S = 0$
    $x_{-1} = 0$
    for $i = 0$ to $k-1$ step $m$ do
        $q_{Y_i} = \mathrm{Booth}(x_{i+m..i-1})$
        $(C_a, S^{(0)}) = S^{(0)} + (q_{Y_i} Y)^{(0)}$
        $q_{M_i} = S_{m-1..0}^{(0)} \, (2^k - (M_{m-1..0}^{(0)})^{-1}) \bmod 2^m$
        $(C_b, S^{(0)}) = S^{(0)} + (q_{M_i} M)^{(0)}$
        for $j = 1$ to $e-1$ do
            $(C_a, S^{(j)}) = C_a + S^{(j)} + (q_{Y_i} Y)^{(j)}$
            $(C_b, S^{(j)}) = C_b + S^{(j)} + (q_{M_i} M)^{(j)}$
            $S^{(j-1)} = (S_{m-1..0}^{(j)}, S_{w-1..m}^{(j-1)})$
        $C_a = C_a$ or $C_b$
        $S^{(e-1)} = \mathrm{sign\ ext}(C_a, S_{w-1..m}^{(e-1)})$

Each loop iteration (computational loop) scans m bits of X (a radix-r digit $X_i$) and determines the value $q_Y$ according to Booth encoding. Booth encoding is applied to a bit vector to reduce the complexity of multiple generation in the hardware. For radix-2 computation, m = 1 and $q_{Y_i} = x_i$ are used, making Algorithm 3 equivalent to Algorithm 2 presented in subsection 3.1. With the high-radix approach we expect to obtain a faster implementation than the radix-2 design, as presented in [7].
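Returning to the radix-2 case that the coprocessor implements, the word-by-word scan of Algorithm 2 can be modelled in software. The sketch below is a plain Python model under stated assumptions and is not the hardware data path: the names `to_words`, `from_words` and `mwr2mm` are hypothetical; the two branches of the algorithm are merged by multiplying M by the parity bit q; the first-word step is read as adding $M^{(0)}$ to the concatenated value $(C, S^{(0)})$; and one extra bit of headroom is assumed, $e = \lceil (k+1)/w \rceil$, so that an intermediate sum larger than $2^k$ still fits in e words.

```python
def to_words(v, w, e):
    """Split integer v into e words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(v >> (w * j)) & mask for j in range(e)]


def from_words(words, w):
    """Reassemble an integer from w-bit words, least significant word first."""
    return sum(word << (w * j) for j, word in enumerate(words))


def mwr2mm(x, y, m, k, w):
    """Word-level model of Algorithm 2; returns S congruent to X*Y*2^-k mod M.

    Assumes x, y < 2^k and odd m < 2^k. The result is not fully reduced:
    it lies below Y + M, so a final subtraction of M may still be needed.
    """
    e = -(-(k + 1) // w)              # assumption: ceil((k+1)/w) words
    mask = (1 << w) - 1
    Y, M = to_words(y, w, e), to_words(m, w, e)
    S = [0] * e
    for i in range(k):
        xi = (x >> i) & 1
        t = xi * Y[0] + S[0]
        C, S[0] = t >> w, t & mask
        q = S[0] & 1                  # parity decides whether M is added
        if q:
            t = ((C << w) | S[0]) + M[0]
            C, S[0] = t >> w, t & mask
        for j in range(1, e):
            t = C + xi * Y[j] + q * M[j] + S[j]
            C, S[j] = t >> w, t & mask
            # one-bit right shift across the word boundary
            S[j - 1] = ((S[j] & 1) << (w - 1)) | (S[j - 1] >> 1)
        S[e - 1] = (C << (w - 1)) | (S[e - 1] >> 1)
    return from_words(S, w)
```

Because only the word length w and the word count e change, the same routine serves any precision k, which is the scalability property described in subsection 3.1. Reducing the returned value modulo M gives the product defined by equation (2).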
4. Radix-2 coprocessor implementation
The implemented radix-2 MM coprocessor is a unit realizing the inner part of the main loop in Algorithm 2. The input values (X, Y and M) are prepared by the NIOS processor, which also reads the result (S) obtained by the coprocessor. After the data are stored in the coprocessor's memory, the multiplication process is started through a control register and the status of the coprocessor's activity is checked. When the results are obtained, new data are sent for processing. Since the input and output values can have a high bit count (1024 bits or more per object), we have decided to store the inputs and intermediate results in Embedded System Blocks (ESBs). Therefore the coprocessor has been split into two blocks: a radix-2 Montgomery multiplier and a dual-port data memory. The multiplier realizes the arithmetic operations (multiplication of a word by one bit, addition of two or more words) following the equations in Algorithm 2. The data memory is used to store the input values and the intermediate and final results. In order to reduce storage and arithmetic hardware complexity, the data path of the MM coprocessor uses X, Y and M in the standard non-redundant form. The internal sum S is received and generated in the redundant carry-save form [3]. Therefore the number of bits needed to represent the sum S is effectively doubled. The design of the data path is based on the structure presented in [6]. The MM unit consists of two layers of carry-save adders and is shown for w = 3 in Fig. 1.
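The redundant carry-save form mentioned above can be illustrated with a short Python sketch. It models a layer of full adders acting on whole integers rather than on w-bit hardware words, and the names `carry_save_add` and `accumulate` are hypothetical. The point it shows is that a sum is kept as two vectors whose ordinary sum equals the true value, so no carry has to ripple through the word when a new operand is added.

```python
def carry_save_add(a, b, c):
    """One layer of full adders (3:2 compression) applied bitwise.

    Returns (sum_bits, carry_bits) such that a + b + c == sum_bits + carry_bits.
    """
    sum_bits = a ^ b ^ c                           # per-bit sum without carries
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1  # per-bit carries, one position up
    return sum_bits, carry_bits


def accumulate(values):
    """Accumulate values while keeping the running sum in carry-save form."""
    s1, s2 = 0, 0          # the two vectors that together represent S
    for v in values:
        s1, s2 = carry_save_add(s1, s2, v)
    return s1, s2
```

In this model each added operand costs one layer of adders, so adding the two terms $x_i Y^{(j)}$ and $M^{(j)}$ to a sum held as two vectors takes two layers, which is consistent with the two-layer structure described for the MM unit.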
Fig. 1: Structure of MM unit for w = 3 (FA – Full Adder)

Input c represents the latched value t, which is the least significant bit of the value $S^{(0)} + x_i Y^{(0)}$ ($c = t(0) = S_0^{(0)}$). This value is computed at the beginning of the main loop (when j = 0). While computing word j (step j of the internal loop in Algorithm 2), the circuit generates 2(w − 1) bits of $S^{(j)}$ and the two most significant bits of $S^{(j-1)}$. The bits of $S^{(j-1)}$ computed at step j − 1 must be delayed and concatenated with the most significant bits generated at step j. The most important parameter influencing the overall multiplier speed is the memory access time. Since during one cycle the previous result S has to be read and the current result from the last stage has to be written to the same memory, we have chosen to configure the memory block as a dual-port RAM using the Altera-specific function lpm_ram_dp from the Library of Parameterized Modules (LPM). In previous work [5] only one MM unit was used; our aim was to implement a solution in which the parallelism of the algorithm is exploited. A short analysis of data dependencies [6] shows that the degree of pipelining and parallelism can be very high. The dependency on the carry C restricts parallel execution of the operations within the loop over j. However, parallelism is possible among instructions in different i loops. Results from one MM unit are not stored in the memory but are sent to the next unit. The data path is organized as a pipeline of MM units separated by registers (Fig. 2). A stage consists of an MM unit and a register. The MM unit implements one iteration of the inner loop of the MWR2MM algorithm. Each stage receives as inputs one word of Y, M, ${}^{1}S$ and ${}^{2}S$ (the two components of the carry-save form of S) each clock cycle. Depending on the progress of the computation, one bit of X is loaded into a different stage every 3 clock cycles. Each stage needs this bit at a different time. Loading the bit into the right stage at the right time is controlled by a dedicated signal.
Fig. 2: Block diagram of MM coprocessor data path

Each MM unit propagates the words of Y and M and the newly computed words of ${}^{1}S$ and ${}^{2}S$ to the next MM unit, which performs another computational loop of the MM algorithm and in its turn propagates the words of Y and M and the newly computed words of ${}^{1}S$ and ${}^{2}S$, with a latency of 3 cycles. The maximum degree of parallelism that can be attained with this organization is

$$p_{max} = \left\lceil \frac{e}{3} \right\rceil \qquad (4)$$

To keep the control structure of the coprocessor simple, only numbers of stages n that divide e are possible (n | e, e.g. for e = 32, n ∈ {1, 2, 4, ...}). When fewer than $p_{max}$ stages are available, the total execution time increases, but it is still possible to perform the full-precision computation with the smaller circuit. The total computation time T (in clock cycles) when $n \le p_{max}$ modules (stages) are used in the pipeline is

$$T = e \, \frac{we}{n} + 3n = \frac{k^2}{wn} + 3n \qquad (5)$$

5. NIOS processor and MM coprocessor interfacing
NIOS is a soft-core embedded processor from Altera that includes a CPU and peripherals optimized for programmable logic and system-on-a-programmable-chip (SOPC) integration. This configurable, general-purpose RISC processor can be combined with user logic and programmed into an Altera programmable logic device (PLD). The NIOS CPU can be configured for a wide range of applications. Two basic configurations are available: 16-bit and 32-bit. Using the SOPC Builder MegaWizard Plug-In function in the Altera development tools, the parameters of the CPU, the peripherals and the whole block are set to suit the required solution. The Avalon Bus included in the NIOS is a parameterized interface bus used to connect the master (NIOS core) and multiple slaves (peripherals). Besides several vendor-defined peripherals (e.g. UART, timer, ...), user-defined peripherals are accepted. The SOPC Builder uses a PTF (Peripheral Template File) as a database to store information about an Avalon system: the master, the set of peripherals and the bus. Each PTF contains a part that describes all of the module's I/O signals. For every signal wired to the Avalon Bus the avalon_role parameter has to be defined; if it is not, the port is exposed as an I/O pin in the final module. We used this feature for the coprocessor's clock signal, because it must be possible to connect the NIOS CPU and the MM coprocessor to different clock signal sources. The Avalon Bus offers a variety of options to tailor the bus signals and timing for different types of peripherals. In addition, two different approaches to accessing narrow peripherals are possible. Dynamic address alignment has been chosen to permit the use of the same program code for different settings of the MM coprocessor. If the 32-bit NIOS is used as the CPU, all operands in the NIOS are 32-bit regardless of the word length w inside the coprocessor.
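The effect of dynamic address alignment on operand loading can be pictured with a small Python model. It is only an illustration under assumptions that the text does not state: a 32-bit write by the master is taken to be split by the bus into 32/w consecutive w-bit writes to the slave, with the least significant slice going to the lowest slave address. The function names, the list used as slave memory and the addressing in units of 32-bit words are all hypothetical.

```python
def write_word32(slave_mem, word_index, value, w):
    """Model one 32-bit master write as 32 // w consecutive w-bit slave writes."""
    per_word = 32 // w
    mask = (1 << w) - 1
    for t in range(per_word):
        # Assumption: least significant w-bit slice goes to the lowest address.
        slave_mem[word_index * per_word + t] = (value >> (w * t)) & mask


def load_operand(slave_mem, base, operand, k, w):
    """Write a k-bit operand as k // 32 master words, independent of w."""
    for i in range(k // 32):
        word = (operand >> (32 * i)) & 0xFFFFFFFF
        write_word32(slave_mem, base + i, word, w)
```

In this model the loop in `load_operand` is identical for w = 8, 16 or 32; only the bus-side splitting differs, which is the sense in which the same program code can serve different coprocessor settings.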
6. Implementation results
To map the presented MM coprocessor into an Altera PLD, a VHDL-based methodology has been used. All blocks are fully parameterized to obtain a solution suitable for the custom application and the available resources. The Mentor Graphics application package was used during development. The MM coprocessor, both as a separate design and as a part of the NIOS system, was simulated in ModelSim for functional correctness. The technique of self-testing testbenches was applied to make simulation quick and comfortable.

Maximum system gates    525,824
LEs                     8320
ESBs                    52
Maximum RAM bits        106,496

Table 1: APEX20K200E device features

During development, a NIOS development board with the Altera device APEX 20K200EFC484-2X has been used for implementation. The elementary device features are listed in Table 1. The presented results were generated after synthesis in LeonardoSpectrum and the subsequent Place&Route procedure in the Altera Quartus II v. 1.1 development system. The area occupation expressed in Logic Elements (LEs) and the maximum possible PLD clock frequency ($f_{clk}$) for the Altera PLD APEX 20K200E are given in Tables 2-4 for precisions from 1024 up to 4096 bits.

         w = 8        w = 16        w = 32
n = 1    281/69.98    429/70.01     776/67.14
n = 2    439/65.73    743/69.01     1394/62.56
n = 4    757/62.80    1365/70.69    2618/67.2
n = 8    N/A          2609/65.63    5125/66.35

Table 2: Area occupation (LEs) / max $f_{clk}$ (MHz) of MM coprocessor (1024 bits)
         w = 8        w = 16        w = 32
n = 1    292/70.18    446/55.91     786/62.24
n = 2    446/65.93    752/64.64     1404/67.08
n = 4    765/52.44    1378/56.31    2635/57.52
n = 8    N/A          2619/65.45    5137/64.57

Table 3: Area occupation (LEs) / max $f_{clk}$ (MHz) of MM coprocessor (2048 bits)
         w = 8        w = 16        w = 32
n = 1    302/68.32    455/61.64     832/59.44
n = 2    461/64.3     763/63.61     1413/54.92
n = 4    779/63.51    1386/65.41    2695/60.61
n = 8    N/A          2632/61.49    5151/60.41

Table 4: Area occupation (LEs) / max $f_{clk}$ (MHz) of MM coprocessor (4096 bits)

Only selected combinations of word length w and number of stages n are presented. The performance of the implemented MM coprocessor for word length w = 32 is presented in Table 5 (the clock frequency is identical for the MM coprocessor and the NIOS processor and equal to 33.333 MHz).
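As a rough cross-check, equations (4) and (5) can be evaluated for the configurations considered here and converted to time at the quoted clock frequency. The Python sketch below is a minimal estimate that counts only the coprocessor clock cycles given by equation (5); the function names are hypothetical, and any time spent by the NIOS processor on data transfer is not modelled.

```python
F_CLK_HZ = 33.333e6  # clock frequency quoted in the text for both units


def max_stages(e):
    """Maximum degree of parallelism, equation (4): ceil(e / 3)."""
    return -(-e // 3)


def mm_cycles(k, w, n):
    """Clock cycles of one MM operation, equation (5): e*(w*e/n) + 3n."""
    e = -(-k // w)                      # e = ceil(k / w)
    if e % n != 0:
        raise ValueError("the number of stages n must divide e")
    if n > max_stages(e):
        raise ValueError("n exceeds the maximum degree of parallelism")
    return (e * w * e) // n + 3 * n


def mm_time_ms(k, w, n, f_clk_hz=F_CLK_HZ):
    """Estimated duration of one MM operation in milliseconds."""
    return mm_cycles(k, w, n) / f_clk_hz * 1e3
```

For example, k = 1024, w = 32 and n = 2 give 16390 cycles, or about 0.492 ms at 33.333 MHz. Doubling n roughly halves the estimate as long as the 3n latency term remains small compared with $k^2/(wn)$.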
         1024 bits (ms)    2048 bits (ms)    4096 bits (ms)
n = 1    1.013             3.932             15.729
n = 2    0.492             1.966             7.864
n = 4    0.246             0.983             3.933
n = 8    0.124             0.492             1.967
Table 5: Speed of MM operation for w = 32

The number of ESBs required for different configurations of the MM coprocessor is shown in Table 6.

precision / word length    w = 8    w = 16    w = 32
1024 bits    2048 bits    4096 bits
4 5 10 5 10 10

Table 6: Number of used ESBs

The minimal NIOS embedded processor configuration includes a 16-bit CPU (128-register window, 8 kB address range), 4k on-chip memory and a UART peripheral. The results of mapping are shown in Table 7.

LEs (%)      1700 (20%)
ESBs (%)     26 (50%)
f_clk        40.06 MHz

Table 7: Resources occupied by the NIOS processor

7. Conclusion

This paper presented the implementation of the MM coprocessor and its connection to the NIOS embedded processor. The architecture is highly flexible: the number-of-stages parameter makes it possible to prepare a design suited to the target device, giving priority either to speed or to area occupation. In future work we plan to implement the high-radix MM coprocessor in order to compare the radix-2 and high-radix implementations.

References
[1] J.A. Menezes, P.C. Oorschot, S.A. Vanstone: Applied Cryptography, CRC Press, New York, 1997.
[2] C.K. Koc, T. Acar: Analyzing and Comparing Montgomery Multiplication Algorithms, IEEE Micro, (16) 3, pp. 26-33, June 1996.
[3] C.K. Koc: RSA Hardware Implementation, www.rsa.com, pp. 1-28, August 1995.
[4] V. Fischer, M. Drutarovský: Scalable RSA Processor in Reconfigurable Hardware - a SoC Building Block, DCIS 2001 Conference, pp. 327-332, Porto, Nov. 2001.
[5] M. Drutarovský, V. Fischer: Implementation of Scalable Montgomery Multiplication Coprocessor in Altera Reconfigurable Hardware, International Conference on Signal Processing and Telecommunications, Kosice, Slovakia, pp. 132-135, Nov. 2001.
[6] A.F. Tenca, C.K. Koc: A Scalable Architecture for Montgomery Multiplication, in C.K. Koc and C. Paar, editors: Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science No. 1717, pp. 94-108, Springer, Berlin, Germany, 1999.
[7] A.F. Tenca, G. Todorov, C.K. Koc: High-radix design of a scalable modular multiplier, in C.K. Koc, D. Naccache and C. Paar, editors: Cryptographic Hardware and Embedded Systems - CHES 2001, Lecture Notes in Computer Science No. 2162, pp. 189-205, Springer Verlag, Berlin, Germany, May 13-16, 2001.
[8] ACEX 1K and APEX 20K Programmable Logic Family, www.altera.com.
[9] NIOS Soft Core Embedded processor, www.altera.com.