FPGA montgomery modular multiplication architectures ... - IEEE Xplore

FPGA MONTGOMERY MODULAR MULTIPLICATION ARCHITECTURES SUITABLE FOR ECCs OVER GF(p)

Ciaran McIvor, Máire McLoone, John V McCanny
The Institute of Electronics, Communications and Information Technology, The Queen’s University of Belfast, Northern Ireland
[email protected], [email protected], [email protected]

Abstract

New FPGA architectures for the ordinary Montgomery multiplication algorithm and the FIOS modular multiplication algorithm are presented. The embedded 18×18-bit multipliers and fast carry look-ahead logic located on the Xilinx Virtex2 Pro family of FPGAs are used to perform the ordinary multiplications and additions/subtractions required by these two algorithms. The architectures are developed for use in Elliptic Curve Cryptosystems over GF(p), which require modular field multiplication to perform elliptic curve point addition and doubling. Field sizes of 128 bits and 256 bits are chosen, but other field sizes can easily be accommodated by rapidly reprogramming the FPGA. Overall, the larger the word size of the multiplier, the more efficiently it performs in terms of area/time product. Also, the FIOS algorithm is flexible in that one can tailor the multiplier architecture to be area efficient, time efficient or a mixture of both by choosing a particular word size. It is estimated that the computation of a 256-bit scalar point multiplication over GF(p) would take about 4.8 ms.

1. Introduction

Modular multiplication is important in modern cryptography, as it is the basic operation that needs to be performed in many public-key cryptosystems such as RSA [1] and elliptic curve cryptosystems (ECCs) [2,3]. In particular, the Montgomery multiplication algorithm [4] is the most efficient modular multiplication algorithm available. It replaces trial division by the modulus with a series of additions and divisions by a power of 2. Thus, it is well suited to hardware implementations. Koc et al. [5] developed this Montgomery multiplication technique and described a new class of algorithms, namely, the Separated Operand Scanning (SOS) method, the Coarsely Integrated Operand Scanning (CIOS) method, the Finely Integrated Operand Scanning (FIOS) method, the Finely Integrated Product Scanning (FIPS) method, and the Coarsely Integrated Hybrid Scanning (CIHS) method. These algorithms break up the modular multiplications into iterative series of ordinary word additions and multiplications. The word size can be chosen arbitrarily but should ideally reflect the properties of the target technology into which the algorithms are to be implemented. For instance, Koc et al. [5] presented software implementations of these algorithms operating on an Intel Pentium-60 Linux system. In this case, they chose a word size of 32 bits.

To the authors’ knowledge, out of the five multiplication methods listed above, only one hardware implementation [6] has been reported in the literature to date. It comprises FIOS-based modular multipliers, which were used in dual-field elliptic curve cryptographic processors. These architectures have been implemented in a 0.13 µm CMOS technology, using word sizes of 8, 16, 32 and 64 bits respectively. Although ASIC implementations offer the advantage of high clock speeds, they are typically less flexible than similar FPGA implementations, which are easily reprogrammable and offer a fast time-to-market solution.


The motivation, therefore, of the work presented in this paper has been to develop novel FPGA-based modular multiplier hardware architectures, which are optimised for use in ECCs defined over a prime field GF(p). For the purposes of this research, modern Xilinx Virtex2 Pro FPGAs have been used as the target technology, as these devices offer several key features, which we believe are well suited to the implementation of both the ordinary Montgomery modular multiplication and the FIOS multiplication algorithms. For example, the fast carry chains and embedded 18×18-bit multiplier blocks located on these devices can be used to perform the ordinary word additions and multiplications required. These issues will be discussed further in Sections 2 and 3. The architectures presented can be used to perform the modular multiplications required in ECCs. ECCs offer similar security to RSA cryptosystems but are able to use a much smaller key size. For instance, a 256-bit ECC key size is equivalent to a 6000-bit RSA key size in terms of security [7]. Thus, potentially higher data throughput rates and lower silicon area usage are achievable using ECCs rather than a traditional RSA public-key scheme. The application of the modular multiplier architectures to ECCs is discussed in greater detail in Section 4.

2. The Xilinx Virtex2 Pro Family of FPGAs

This section provides relevant information about some of the features of the Xilinx Virtex2 Pro family of FPGAs [8], which have been used to develop the ordinary Montgomery and FIOS modular multiplier architectures described in Section 3. Figure 1 provides an overview of the Virtex2 Pro generic architecture. The Virtex2 Pro family has been developed using a 0.13 µm CMOS nine-layer copper process. As shown in Figure 1, the main processing elements in these FPGAs are the Configurable Logic Blocks (CLBs), which are arranged in a rectangular array across the face of the device. A CLB comprises four slices each of which contains two Look-Up Tables (LUTs), two flip-flops and fast carry look-ahead logic. Several columns of dedicated 18×18-bit embedded multiplier blocks and Block SelectRAM are also distributed across the device. Input/Output pins surround the array of CLBs. Some Virtex2 Pro devices also contain up to four PowerPC processor blocks.

[Figure 1: Virtex2 Pro Generic Architecture Overview. Labelled elements: Multipliers and Block SelectRAM, CLB, Processor Block, Input / Output Pins.]

For the modular multiplier architectures, we are mainly interested in the arithmetic functions on the Virtex2 Pro devices, namely, the fast look-ahead carry logic chains and the dedicated 18×18-bit multiplier blocks. Each CLB contains two separate carry chains and the height of these chains is two bits per slice. The carry chains run from the bottom to the top of a CLB column. Therefore, their length is limited to the height of a single CLB column. Each multiplier block can be reconfigured to support up to 18×18-bit unsigned multiplication or 17×17-bit signed multiplication. They can also be cascaded to create larger bit length multipliers. For the modular multiplier architectures described, 16×16-bit unsigned multipliers are cascaded to create larger multipliers, as will be discussed in the next section.
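The cascading idea can be illustrated with a small arithmetic model. The sketch below is not the pipelined hardware design; it is a purely functional Python model, and the names `mul16`, `mul_cascade` and `blocks_needed` are hypothetical helpers introduced here. It assumes each wider multiplier is formed from four half-width partial products that are shifted and summed, which is one straightforward way to cascade 16×16-bit blocks.

```python
MASK16 = 0xFFFF


def mul16(x: int, y: int) -> int:
    """Stand-in for one embedded multiplier block used in 16x16 unsigned mode."""
    assert 0 <= x <= MASK16 and 0 <= y <= MASK16
    return x * y


def mul_cascade(x: int, y: int, bits: int) -> int:
    """Multiply two unsigned `bits`-wide operands using only mul16 and additions.

    `bits` is assumed to be 16 times a power of two (16, 32, 64, 128, 256).
    """
    if bits == 16:
        return mul16(x, y)
    half = bits // 2
    mask = (1 << half) - 1
    x_lo, x_hi = x & mask, x >> half
    y_lo, y_hi = y & mask, y >> half
    # Four half-width partial products.
    ll = mul_cascade(x_lo, y_lo, half)
    lh = mul_cascade(x_lo, y_hi, half)
    hl = mul_cascade(x_hi, y_lo, half)
    hh = mul_cascade(x_hi, y_hi, half)
    # Align by shifting and sum; in hardware this is the job of the carry chains.
    return ll + ((lh + hl) << half) + (hh << bits)


def blocks_needed(bits: int) -> int:
    """Number of 16x16 blocks this model uses for a bits x bits product."""
    return (bits // 16) ** 2
```

Under this model a 128×128-bit product uses 64 blocks and a 256×256-bit product uses 256, which agrees with the multiplier counts reported in Table 1.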

[Figure 2 diagram: 16-bit operands enter × and + stages that build, in turn, the 32-bit Multiplier, 64-bit Multiplier, 128-bit Multiplier and 256-bit Multiplier.]

3. Montgomery Multiplication Architectures

3.1 Ordinary Montgomery Multiplication Architectures

This section describes the ordinary full-word Montgomery multiplication algorithm [5] and corresponding FPGA hardware architectures. Given an integer $a < n$, where $n$ is the $k$-bit modulus, $A$ is said to be its $n$-residue with respect to $r$ if

$$A = a \cdot r \pmod{n} \tag{1}$$

where $r = 2^k$. Likewise, given an integer $b < n$, $B$ is said to be its $n$-residue with respect to $r$ if

$$B = b \cdot r \pmod{n} \tag{2}$$

The Montgomery product of $A$ and $B$ can then be defined as

$$MP = A \cdot B \cdot r^{-1} \pmod{n} \tag{3}$$

where $r^{-1}$ is the inverse of $r$ modulo $n$. The full-word version of Montgomery’s multiplication algorithm [5], which calculates the Montgomery product of $A$ and $B$, is summarised in the pseudo code below, where $r \cdot r^{-1} - n \cdot n' = 1$.

Algorithm 1: Montgomery Multiplication (A, B, n, n')
    t = A*B;
    u = (t + (t*n' mod r)*n) div r;
    if u ≥ n then return u - n else return u;

The main arithmetic functions to be performed in Algorithm 1 include three full-word multiplications, one full-word addition, and a conditional full-word subtraction. The full-word addition and subtraction are computed using the fast carry chains and two’s-complement addition. To perform the multiplications for k=128 and k=256 bits, we need to develop 128×128-bit and 256×256-bit multipliers respectively, by cascading numerous 16×16-bit unsigned multipliers, as shown in Figure 2. The larger multipliers are developed in a systematic fashion. For instance, the partial products of the 32×32-bit multiplier are calculated using the 16×16-bit unsigned multiplier blocks. These partial products are then added together using the fast look-ahead carry chains to obtain the 64-bit final product. This process is continued until the desired multiplier size is attained, as shown in Figure 2. The calculation and addition of the partial products is a fully pipelined process, needed to take full advantage of the small critical path delay of the 16×16-bit multiplier blocks and the fast carry chains. Therefore it takes 3, 5, 7, and 9 clock cycles to complete a 32, 64, 128, and 256 bit multiplication respectively.

Figure 2: Cascading 16×16-Bit Unsigned Multipliers

By using the 128×128-bit and 256×256-bit multipliers, shown in Figure 2, to carry out the full-word multiplications, we have been able to develop 128-bit and 256-bit Montgomery multiplier architectures, respectively. Figure 3 shows a block diagram of the 128-bit Montgomery multiplier architecture. The inputs are registered into the FPGA 32 bits per clock cycle over 4 cycles. The control unit then determines the order in which the multiplications, addition and conditional subtraction are performed, according to Algorithm 1. The t-REG/UPDATE REG/CONTROL component stores t=A*B and the results of the other multiplications and addition, which are then fed back into the control unit to be re-used as inputs to the 128×128-bit multiplier or Addition/Subtraction component. The t-REG/UPDATE REG/CONTROL component also performs the trivial mod and div operations in Algorithm 1. Once the conditional subtraction is performed, u is output from the chip 32 bits at a time over 4 clock cycles.
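Algorithm 1 maps almost directly onto arbitrary-precision integers, which makes it easy to check the three multiplications, the addition and the conditional subtraction in software. The following is a minimal Python sketch, not the VHDL design; the function names `montgomery_constants` and `mont_mul` are hypothetical, and the modulus is assumed to be odd so that r = 2^k is invertible modulo n. It requires Python 3.8 or later for `pow(x, -1, m)`.

```python
def montgomery_constants(n: int, k: int):
    """Return (r, r_inv, n_prime) with r = 2**k and r*r_inv - n*n_prime == 1."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that r = 2**k is invertible modulo n")
    r = 1 << k
    r_inv = pow(r, -1, n)            # modular inverse of r (Python 3.8+)
    n_prime = (r * r_inv - 1) // n   # exact division by construction
    return r, r_inv, n_prime


def mont_mul(A: int, B: int, n: int, n_prime: int, k: int) -> int:
    """Full-word Montgomery product A*B*r^-1 mod n, following Algorithm 1."""
    mask = (1 << k) - 1
    t = A * B                              # first full-word multiplication
    m = ((t & mask) * n_prime) & mask      # (t * n') mod r, second multiplication
    u = (t + m * n) >> k                   # third multiplication, add, div r
    return u - n if u >= n else u          # conditional subtraction


# Usage: multiply a and b modulo n via their n-residues.
k = 128
n = (1 << 128) - 159                       # an odd 128-bit modulus chosen for illustration
r, r_inv, n_prime = montgomery_constants(n, k)
a, b = 0x1234567890ABCDEF, 0xFEDCBA0987654321
A, B = a * r % n, b * r % n                # equations (1) and (2)
MP = mont_mul(A, B, n, n_prime, k)         # equation (3)
product = mont_mul(MP, 1, n, n_prime, k)   # leave the residue domain
```

Because mod r and div r reduce to a mask and a shift when r is a power of two, no trial division by n is needed, which is the property the hardware exploits.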

[Figure 3: 128-Bit Montgomery Multiplier Architecture. Blocks shown: CONTROL UNIT, INPUT REGISTER, Addition / Subtraction, 128×128-Bit Multiplier, t-REG / UPDATE REG / CONTROL, OUTPUT REGISTER. Signals shown: 32-bit inputs A, B, n and n', CLK, RST, and the 32-bit output u.]

The architectures described have been captured in VHDL and then used to create implementations using the Xilinx Virtex2 Pro family of FPGAs. Table 1 provides performance results for these architectures. The 128-bit and 256-bit Montgomery multipliers have been implemented in the XC2VP50-7-ff1517 and XC2VP125-7-ff1696 FPGAs respectively. The ordinary Montgomery multiplier architectures perform well in terms of data throughput rate when compared with the FIOS architectures, described in Section 3.2. However, as will be seen, the FIOS algorithm offers more flexibility in that the designer can choose a particular word size, which determines if the resulting architecture is area efficient, time efficient, or a mixture of both.

Table 1: Performance Results for Full-Word Montgomery Architectures

Multiplier         Clock (MHz)   No. Slices   No. Mult 18×18   No. Clock Cycles   Data Rate (Mb/s)
MontMult_128Bit    75.63         3,468        64               26                 372.33
MontMult_256Bit    45.68         11,992       256              32                 365.44

Table 2: Performance Results for 128-bit FIOS Multiplier Architectures

Words×Wordsize (s×w)   Clock (MHz)   No. Slices   No. Mult 18×18   No. Clock Cycles   Throughput Rate (Mb/s)
2×64 Bits              87.55         1,727        16               70                 160.09
4×32 Bits              105.35        1,054        4                166                81.24
8×16 Bits              111.67        836          1                318                44.95

3.2 FIOS Montgomery Multiplication Architectures

The FIOS method [5] was proposed by Koc et al. as one of a number of flexible alternatives to the ordinary Montgomery multiplication algorithm. The algorithm is summarised below. Here w is the word size, s is the number of words, and $W = 2^w$.

Algorithm 2: FIOS Montgomery Multiplication (A, B, n, n'(0))
    t = 0;
    for i = 0 to s-1 loop
        (C, S) = t(0) + A(0)*B(i);
        ADD [t(1), C];
        m = S*n'(0) mod W;
        (C, S) = S + m*n(0);
        for j = 1 to s-1 loop
            (C, S) = t(j) + A(j)*B(i) + C;
            ADD [t(j+1), C];
            (C, S) = S + m*n(j);
            t(j-1) = S;
        end loop;
        (C, S) = t(s) + C;
        t(s-1) = S;
        t(s) = t(s+1) + C;
        t(s+1) = 0;
    end loop;
    if t ≥ n then return t - n else return t;

Here, the operands A, B, n and n' take the form of s w-bit words. Only the least significant word of n' is needed in Algorithm 2. The variable (C, S) consists of two w-bit words of which C is the most significant word and S is the least significant word. The variable t, which is used to accumulate the partial products of A*B and m*n (m = t*n' (mod r) in Algorithm 1), comprises s+2 w-bit words. A subtraction is required after the i-loop has terminated if t is greater than or equal to n. The word size w can be chosen arbitrarily. For our 128-bit and 256-bit FIOS multiplier architectures, we have chosen w as 16, 32, 64 and 32, 64, 128 respectively. For small w, relatively smaller multipliers and carry chains can be used to calculate the multiplications and additions in Algorithm 2, implying a smaller silicon area usage. For instance, for a 128-bit FIOS multiplier using a 32-bit word size, we use the 32×32-bit multiplier, shown in Figure 2, in place of the 128×128-bit multiplier in Figure 3. However, for small w more clock cycles are required to compute a FIOS multiplication, as the number of words s will be larger. Therefore, more iterations of the for loops in Algorithm 2 are needed, implying a lower data throughput rate. For larger w, fewer clock cycles are needed and a higher throughput rate can be attained, albeit at the expense of a larger silicon area. Tables 2 and 3 provide performance results for the FIOS architectures for varying word sizes. Again, the designs have been captured in VHDL and implemented using the Xilinx XC2VP50-7-ff1517 and XC2VP125-7-ff1696 FPGA devices for the 128-bit and 256-bit FIOS multipliers respectively.

Table 3: Performance Results for 256-bit FIOS Multiplier Architectures

Words×Wordsize (s×w)   Clock (MHz)   No. Slices   No. Mult 18×18   No. Clock Cycles   Throughput Rate (Mb/s)
2×128 Bits             55.66         4,770        64               90                 158.32
4×64 Bits              65.22         2,491        16               238                70.15
8×32 Bits              66.97         1,709        4                590                29.06

As shown in Tables 2 and 3, as the word size decreases so too does the silicon area needed for a FIOS multiplier, albeit at the expense of a drop in the throughput rate. In order to analyse the time-area trade-offs we have calculated the Area/Throughput Rate for a range of word sizes. Table 4 provides the results of this analysis, with smaller values indicating better performance.

Table 4: Area / Throughput Rate Comparisons

Multiplier Type (s×w)   No. of Slices / Throughput Rate
MontMult_128Bit         9.31
2×64 Bits FIOS          10.79
4×32 Bits FIOS          12.97
8×16 Bits FIOS          18.60
MontMult_256Bit         32.82
2×128 Bits FIOS         30.12
4×64 Bits FIOS          35.51
8×32 Bits FIOS          58.81
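To make the word-level data flow of Algorithm 2 concrete, the following Python sketch follows the pseudo code step by step on lists of w-bit words. It is a software model under stated assumptions rather than the hardware architecture: the helper names `to_words` and `fios_mont_mul` are hypothetical, n is assumed to be odd, and n'(0) is computed as the negated inverse of n(0) modulo W, which is the least significant word of n'. Python 3.8 or later is needed for `pow(x, -1, m)`.

```python
def to_words(x: int, s: int, w: int) -> list:
    """Split x into s little-endian words of w bits each."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def fios_mont_mul(A: int, B: int, n: int, s: int, w: int) -> int:
    """Return A*B*2^(-s*w) mod n using the FIOS word-serial method (Algorithm 2)."""
    W = 1 << w
    mask = W - 1
    a, b, nw = to_words(A, s, w), to_words(B, s, w), to_words(n, s, w)
    n0_prime = (-pow(nw[0], -1, W)) % W   # least significant word of n'
    t = [0] * (s + 2)                     # accumulator of s+2 words

    def add(idx: int, carry: int) -> None:
        # ADD [t(idx), C]: add the carry at word idx and ripple it upwards.
        while carry:
            total = t[idx] + carry
            t[idx] = total & mask
            carry = total >> w
            idx += 1

    for i in range(s):
        cs = t[0] + a[0] * b[i]
        C, S = cs >> w, cs & mask
        add(1, C)
        m = (S * n0_prime) & mask         # m = S*n'(0) mod W
        cs = S + m * nw[0]                # low word becomes zero by construction
        C, S = cs >> w, cs & mask
        for j in range(1, s):
            cs = t[j] + a[j] * b[i] + C
            C, S = cs >> w, cs & mask
            add(j + 1, C)
            cs = S + m * nw[j]
            C, S = cs >> w, cs & mask
            t[j - 1] = S                  # store shifted down by one word
        cs = t[s] + C
        C, S = cs >> w, cs & mask
        t[s - 1] = S
        t[s] = t[s + 1] + C
        t[s + 1] = 0

    result = sum(word << (w * i) for i, word in enumerate(t))
    return result - n if result >= n else result
```

For a fixed field size, halving w doubles s, so the inner loop body executes roughly four times as often; this is the software analogue of the cycle-count growth described above.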

Table 4 shows that the larger the word size of a FIOS multiplier, the more efficiently it performs in terms of area per unit throughput rate. The ordinary Montgomery multiplier architecture described in Section 3.1 performs best overall for a 128-bit field size (9.31, against 10.79 for the best FIOS design). However, the 2×128-bit FIOS architecture is better for a 256-bit field size (30.12, against 32.82). Also, as mentioned, the FIOS algorithm offers the advantage of greater flexibility over Algorithm 1. Thus, it can be used in a broader range of applications. One such application is elliptic curve cryptography, which is discussed further in Section 4.
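The tabulated rates can be cross-checked with a few lines of Python. The relation used below (throughput = field size × clock frequency / clock cycles) is not stated explicitly in the text; it is inferred from the published numbers, which it reproduces to within about 0.01. The function names are hypothetical.

```python
def throughput_mbps(field_bits: int, clock_mhz: float, cycles: int) -> float:
    """Bits produced per microsecond, i.e. Mb/s, for one multiplication."""
    return field_bits * clock_mhz / cycles


def area_per_throughput(slices: int, mbps: float) -> float:
    """Slices needed per Mb/s of throughput; smaller is better."""
    return slices / mbps


# (field bits, clock in MHz, clock cycles, slices), copied from Tables 1 to 3.
designs = {
    "MontMult_128Bit": (128, 75.63, 26, 3468),
    "2x64 FIOS": (128, 87.55, 70, 1727),
    "4x32 FIOS": (128, 105.35, 166, 1054),
    "8x16 FIOS": (128, 111.67, 318, 836),
    "MontMult_256Bit": (256, 45.68, 32, 11992),
    "2x128 FIOS": (256, 55.66, 90, 4770),
    "4x64 FIOS": (256, 65.22, 238, 2491),
    "8x32 FIOS": (256, 66.97, 590, 1709),
}

for name, (bits, mhz, cycles, slices) in designs.items():
    rate = throughput_mbps(bits, mhz, cycles)
    ratio = area_per_throughput(slices, rate)
    print(f"{name:16s} {rate:8.2f} Mb/s {ratio:7.2f} slices per Mb/s")
```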

4. Elliptic Curve Cryptography

This section explains how the ordinary Montgomery multipliers and FIOS multipliers described in Section 3 can be used to perform the modular multiplication operation needed in ECCs defined over GF(p). More information on elliptic curve cryptographic primitives and detailed mathematical background can be found in the IEEE standard for Public-Key Cryptography [9]. For completeness, a brief definition of what an elliptic curve over GF(p) is, and a description of elliptic curve scalar multiplication, is given below. The Weierstrass equation defining an elliptic curve over GF(p) for p > 3 is as follows:

$$y^2 = x^3 + ax + b \tag{4}$$

where x and y are elements of GF(p) and a and b are integers modulo p satisfying:

$$4a^3 + 27b^2 \neq 0 \pmod{p} \tag{5}$$

An elliptic curve E over GF(p) consists of the solutions (x, y) as defined by equations (4) and (5) along with an additional element called the point at infinity, denoted O. The set of points (x, y) are in the so-called affine coordinate point representation. Elliptic curve cryptographic primitives [9] require scalar point multiplication. That is, given a point P on an elliptic curve, we need to compute eP, where e is a positive integer. This is achieved by a series of additions and doublings of P, dependent on the value of the integer e. There are distinct formulae to calculate elliptic curve point addition and point doubling. When affine coordinate representation is used, these operations require the relatively expensive operation of modular inversion. By using projective coordinates, as defined in [9], the need for modular inversion is eliminated, except when converting back from projective to affine co-ordinates. Conversion formulae and point addition and point doubling formulae using projective coordinates are given below.

Conversion from affine co-ordinates to projective co-ordinates (trivial):

$$X \leftarrow x, \quad Y \leftarrow y, \quad Z \leftarrow 1 \tag{6}$$

Conversion from projective co-ordinates to affine co-ordinates:

$$x = X / Z^2, \quad y = Y / Z^3 \tag{7}$$

Elliptic curve point addition using projective co-ordinates computes:

$$(X_0, Y_0, Z_0) + (X_1, Y_1, Z_1) = (X_2, Y_2, Z_2) \tag{8}$$

where

$$\begin{aligned} U_0 &= X_0 Z_1^2; \quad S_0 = Y_0 Z_1^3; \quad U_1 = X_1 Z_0^2; \quad S_1 = Y_1 Z_0^3; \\ W &= U_0 - U_1; \quad R = S_0 - S_1; \quad T = U_0 + U_1; \quad M = S_0 + S_1; \\ Z_2 &= Z_0 Z_1 W; \quad X_2 = R^2 - T W^2; \quad V = T W^2 - 2 X_2; \quad 2 Y_2 = V R - M W^3 \end{aligned} \tag{9}$$

Elliptic curve point doubling using projective co-ordinates computes:

$$2 (X_1, Y_1, Z_1) = (X_2, Y_2, Z_2) \tag{10}$$

where

$$\begin{aligned} M &= 3 X_1^2 + a Z_1^4; \quad Z_2 = 2 Y_1 Z_1; \quad S = 4 X_1 Y_1^2; \\ X_2 &= M^2 - 2 S; \quad T = 8 Y_1^4; \quad Y_2 = M (S - X_2) - T \end{aligned} \tag{11}$$

As can be seen from equations (9) and (11), point addition and point doubling require 16 and 10 prime field multiplications respectively.
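Equations (6) to (11) can be exercised directly in software. The sketch below is a plain Python transcription that uses ordinary modular arithmetic in place of the Montgomery multipliers; the function names are hypothetical, and the special cases (the point at infinity, adding a point to itself or to its negative, where W = 0) are deliberately not handled. The toy curve y² = x³ + 2x + 3 over GF(97) with the point (3, 6) is an assumption chosen only so the numbers stay small.

```python
def jac_double(P, a, p):
    """Point doubling in projective co-ordinates, following equation (11)."""
    X1, Y1, Z1 = P
    M = (3 * X1 * X1 + a * pow(Z1, 4, p)) % p
    Z2 = (2 * Y1 * Z1) % p
    S = (4 * X1 * Y1 * Y1) % p
    X2 = (M * M - 2 * S) % p
    T = (8 * pow(Y1, 4, p)) % p
    Y2 = (M * (S - X2) - T) % p
    return X2, Y2, Z2


def jac_add(P0, P1, p):
    """Point addition in projective co-ordinates, following equation (9).

    Assumes P0 != +/-P1 and that neither input is the point at infinity.
    """
    X0, Y0, Z0 = P0
    X1, Y1, Z1 = P1
    U0 = X0 * Z1 * Z1 % p
    S0 = Y0 * pow(Z1, 3, p) % p
    U1 = X1 * Z0 * Z0 % p
    S1 = Y1 * pow(Z0, 3, p) % p
    W = (U0 - U1) % p
    R = (S0 - S1) % p
    T = (U0 + U1) % p
    M = (S0 + S1) % p
    Z2 = Z0 * Z1 * W % p
    X2 = (R * R - T * W * W) % p
    V = (T * W * W - 2 * X2) % p
    # Equation (9) gives 2*Y2, so halve it modulo p.
    Y2 = (V * R - M * pow(W, 3, p)) * pow(2, -1, p) % p
    return X2, Y2, Z2


def to_affine(P, p):
    """Equation (7): the only step that needs a modular inversion."""
    X, Y, Z = P
    z_inv = pow(Z, -1, p)
    return X * z_inv ** 2 % p, Y * z_inv ** 3 % p


# Usage on the toy curve: compute 2P and 3P = 2P + P.
p, a, b = 97, 2, 3
P = (3, 6, 1)                       # equation (6): affine (3, 6) with Z = 1
two_P = jac_double(P, a, p)
three_P = jac_add(two_P, P, p)
```

Only `to_affine` calls for an inversion, which matches the remark that projective co-ordinates defer inversion to the final conversion.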
These multiplications can be performed using the modular multipliers described. Also, the additions and subtractions required in equations (9) and (11) can be performed using the Virtex2 Pro carry chains. There are 7 and 4 field additions or subtractions to be performed for point addition and doubling respectively. Therefore, we can provide accurate estimates of the execution times for a 256-bit elliptic curve point addition and point doubling using the different modular multipliers described in Section 3, as shown in Table 5.
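The estimates can be reproduced with a simple cost model. The model below is an inference from the published figures rather than something the text states: it assumes each field addition or subtraction costs one clock cycle, and that a scalar multiplication costs one point doubling plus one point addition per scalar bit (the worst case for double-and-add). With these assumptions the sketch matches the cycle counts in Table 5 and the quoted figure of about 4.8 ms. All names are hypothetical.

```python
FIELD_MULTS = {"add": 16, "double": 10}   # from equations (9) and (11)
FIELD_ADDS = {"add": 7, "double": 4}      # field additions or subtractions


def point_op_cycles(op: str, mult_cycles: int, add_cycles: int = 1) -> int:
    """Clock cycles for one point operation (assumes 1 cycle per field add)."""
    return FIELD_MULTS[op] * mult_cycles + FIELD_ADDS[op] * add_cycles


def data_rate_mbps(field_bits: int, clock_mhz: float, cycles: int) -> float:
    """Field bits processed per microsecond, i.e. Mb/s."""
    return field_bits * clock_mhz / cycles


def scalar_mult_ms(scalar_bits: int, clock_mhz: float, mult_cycles: int) -> float:
    """Worst-case double-and-add time in ms: one double and one add per bit."""
    per_bit = point_op_cycles("add", mult_cycles) + point_op_cycles("double", mult_cycles)
    return scalar_bits * per_bit / (clock_mhz * 1e3)


# Usage: the 256-bit full-word multiplier takes 32 cycles per multiplication.
add_cycles = point_op_cycles("add", 32)
double_cycles = point_op_cycles("double", 32)
estimate_ms = scalar_mult_ms(256, 45, 32)
```

An average-case scalar with roughly half its bits set would need fewer additions, so this figure should be read as an upper estimate under the stated model.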

Table 5: Estimated Performance Results for 256-Bit GF(p) EC Point Addition and Doubling

Multiplier Type (s×w)   Estimated Clock (MHz)   No. Clock Cycles for Addition   No. Clock Cycles for Doubling   Addition Data Rate (Mb/s)   Doubling Data Rate (Mb/s)
MontMult_256Bit         45                      519                             324                             22.2                        35.6
2×128 Bits FIOS         55                      1,447                           904                             9.7                         15.6
4×64 Bits FIOS          65                      3,815                           2,384                           4.4                         7.0
8×32 Bits FIOS          65                      9,447                           5,904                           1.8                         2.8

Again, the ordinary Montgomery multiplier offers the best throughput rate when used in elliptic curve point adder or doubler architectures. It is estimated that, by using the ordinary Montgomery multiplier architecture in an elliptic curve processor, the computation of a 256-bit scalar point multiplication over GF(p) would take about 4.8 ms, not including the final coordinate conversion. This is based on the use of the double-and-add algorithm as defined in [9]. Also, the point addition and doubling architectures based on the FIOS multipliers are more flexible and can offer a lower silicon area solution by choosing a smaller word size. Moreover, as discussed in Section 1, the FPGA architectures used are highly adaptable and easily reprogrammable for varying prime field sizes. These are very desirable characteristics when using elliptic curve cryptography due to the large number of different secure curves and prime fields currently available.

5. Conclusion

In this paper we presented new FPGA architectures for the ordinary Montgomery multiplication algorithm and the FIOS multiplication algorithm. The embedded 18×18-bit multipliers and fast carry look-ahead logic located on the Xilinx Virtex2 Pro family of FPGAs were used to perform the ordinary multiplications and additions/subtractions required by these two algorithms. Also, the architectures can be migrated to other FPGA families such as Altera’s Stratix devices, which possess similar arithmetic functions. The architectures were developed for use in Elliptic Curve Cryptosystems over GF(p), which require modular field multiplication to perform elliptic curve point addition and doubling. Field sizes of 128 bits and 256 bits were chosen, but other field sizes can easily be accommodated by reprogramming the FPGA. Overall, the larger the word size of the multiplier, the more efficiently it performs in terms of area/time product. Also, the FIOS algorithm is flexible in that one can tailor the multiplier architecture to be area efficient, time efficient or a mixture of both by choosing a particular word size. Elliptic curve point addition and doubling performance estimations were provided for a 256-bit prime field size. It was estimated that the computation of a 256-bit scalar point multiplication over GF(p) would take about 4.8 ms.

Acknowledgment

Amphion Semiconductor Ltd. and a Northern Ireland Department of Learning postgraduate studentship in the form of a CAST award have funded this research.

References

[1] Rivest, R.L., Shamir, A., Adleman, L.: “A Method for Obtaining Digital Signatures and Public-Key Cryptosystems”. Communications of the ACM, 21(2): 120-126, February 1978.
[2] Miller, V.S.: “Use of Elliptic Curves in Cryptography”. Proc. Advances in Cryptology (Crypto ’85), pp. 417-426, 1986.
[3] Koblitz, N.: “Elliptic Curve Cryptosystems”. Math. Computation, Vol. 48, pp. 203-209, 1987.
[4] Montgomery, P.L.: “Modular Multiplication without Trial Division”. Math. Computation, Vol. 44, pp. 519-521, 1985.
[5] Koc, C.K., Acar, T., Kaliski, B.S.: “Analyzing and Comparing Montgomery Multiplication Algorithms”. IEEE Micro, Vol. 16, No. 3, pp. 26-33, June 1996.
[6] Satoh, A., Takano, K.: “A Scalable Dual-Field Elliptic Curve Cryptographic Processor”. IEEE Trans. Computers, Vol. 52, No. 4, pp. 449-460, April 2003.
[7] National Institute of Standards and Technology: http://www.nist.gov
[8] Xilinx, Inc.: “Xilinx Virtex2 Pro Data Sheets”. http://www.xilinx.com
[9] IEEE P1363 Draft Version D13, “Standard for Public-Key Cryptography”, Draft Standard, Nov. 1999.
