A Reconfigurable Coprocessor for Finite Field Multiplication in GF($2^n$)

M. Jung, F. Madlener, M. Ernst and S. A. Huss
{mjung | madlener | ernst | huss}@iss.tu-darmstadt.de
Integrated Circuits and Systems Lab, Computer Science Department
Darmstadt University of Technology, Germany

Abstract

The performance of elliptic curve based public key cryptosystems is mainly determined by the efficiency of the underlying finite field arithmetic. This work describes a reconfigurable finite field multiplier, which is implemented within the latest family of Field Programmable System Level Integrated Circuits (FPSLIC) from Atmel, Inc. The architecture of the coprocessor is adapted from Karatsuba's divide and conquer algorithm and allows for a reasonable speedup of the top-level public key algorithms. The VHDL hardware models are automatically generated for a selectable operand size, which permits the optimal utilization of a particular FPSLIC device.

1 Introduction

Today there exists a wide range of distributed systems that use communication resources which cannot be safeguarded against eavesdropping or unauthorized data alteration. Thus cryptographic protocols are applied to these systems in order to prevent information extraction or to detect data manipulation by unauthorized parties.

Each application has different demands on the utilized cryptosystem (e.g., in terms of required bandwidth, level of security, incurred cost per node and number of communicating partners). The major market share is probably occupied by low-bandwidth, low-cost and high-volume applications, most of which are based on SmartCards or similar low complexity systems. Examples are mobile phone SIM cards, the German GeldKarte and various access control or electronic payment systems.

Depending on the underlying algorithms, today's cryptosystems fall into one of two categories: symmetric or asymmetric cryptosystems.

In symmetric cryptosystems a single key is used for both the encryption and the decryption process. This implies that the key has to be known by both communicating parties and thus must have been transmitted through a secure channel beforehand. Spontaneous secure communication, which is necessary in many applications (e.g., in electronic payment systems), is not possible.

In asymmetric (also known as public key) cryptosystems, as first proposed in [1], each participant has two keys, one of which is the public key and the other the private key. While the private key is only known by its owner, the corresponding public key is published in a dictionary and thus publicly accessible. For instance, when Alice wants to send sensitive data to Bob, she encrypts the data with Bob's public key, which can be found in the dictionary. The data can only be decrypted with Bob's private key, which is only known to him. Since it is computationally intractable to calculate the private key from the public key, the information is transmitted securely.

Besides the widely-used RSA method [2], public-key schemes based on elliptic curves (EC) have gained more and more importance. Elliptic curve cryptography (ECC) was first proposed in 1985 by V. Miller [3] and N. Koblitz [4]. Since then a lot of research has been done, and nowadays ECC is widely known and accepted. Because EC methods in general are believed to give a higher security per key bit in comparison to RSA, one can work with shorter keys in order to achieve the same level of security (1024 RSA-bits are equivalent to 160 EC-bits [5]). The smaller key size permits more cost-efficient implementations, which is of special interest for low-cost and high-volume systems.

Recently, Atmel, Inc. introduced their new AT94K family of FPSLIC devices (Field Programmable System Level Integrated Circuits). This architecture integrates FPGA resources, an AVR microcontroller core, several peripherals and SRAM within a single chip. This platform is appealing for the implementation of prototypes and systems with a low production volume. Based on HW/SW co-design and co-verification methodologies aiming at FPSLIC, this architecture is well suited for System on Chip (SoC) implementations.

This paper focuses on an FPSLIC based HW implementation of a Finite Field coprocessor for the acceleration of ECC. A variety of fast EC cryptosystems can be built by using the proposed system partitioning. Running the EC level algorithms in SW on the AVR microcontroller allows for algorithmic flexibility, while the HW accelerated finite field arithmetic contributes the required performance.

The mathematical background of elliptic curves and finite fields is explained in the following section. In Section 3 the FPSLIC platform is introduced, and Section 4 deals with the architecture and implementation of the Finite Field coprocessor. Finally, we give some performance numbers and conclusions.

2 Mathematical Background

The theory of elliptic curves can be applied in different ways in order to obtain a cryptosystem. In this section we consider the example shown in Fig. 1, but we do not detail the ElGamal cryptosystem [6]. It is essentially a public key cryptosystem based on the results reported by W. Diffie and M. Hellman in [1]. We just briefly review the Diffie-Hellman Key Exchange scheme and show how elliptic curve theory can be applied to it. Further on, we give an introduction to elliptic curves and finite fields. The goal is to arrive at the algorithms which can be accelerated using the proposed coprocessor. A complete description of how these algorithms are derived is far beyond the scope of this paper. Most equations stated in this section are taken from [7] and [8], the latter giving a good introduction from the implementor's point of view. A detailed and more theoretical description can be found in [9].

[Figure 1: Layers of an EC based cryptosystem. The ElGamal Cryptosystem is a Public Key Cryptosystem and is based on the Diffie-Hellman Key Exchange (Sec. 2.1). The Diffie-Hellman Key Exchange is a Public Key Distribution System and is based on an Abelian Group with some special Property (Sec. 2.2). The Group of Points on an Elliptic Curve with an Operation is an Abelian Group with that special Property and is based on a Field (Sec. 2.3). The Galois Field GF($2^n$) is a (finite) Field.]

2.1 Diffie-Hellman Key Exchange

With the Diffie-Hellman Key Exchange scheme [1] two users can securely exchange a key over an insecure channel. The scheme operates in an abelian group $(G, \circ)$ with the following property: There is an efficient algorithm to compute

$$Q = f(n, P) : \mathbb{N} \times G \to G := \underbrace{P \circ P \circ \ldots \circ P}_{n\text{-times}}, \tag{1}$$

but applying the inverse function

$$f^{-1}(P, Q) = \min \{ n \mid Q = f(n, P) \} \tag{2}$$

is computationally intractable.

Given such a function $f$, two persons (Alice and Bob) can negotiate a secret key over a public channel in the following way (see Fig. 2): Alice chooses a random natural number $X_A$, calculates $Y_A = f(X_A, P_0)$ and transmits $Y_A$ ($P_0 \in G$ is a public value). Bob on the other hand randomly chooses some number $X_B$ and sends over $Y_B$, which is computed as $f(X_B, P_0)$. Both Alice and Bob can now easily compute a common secret key

$$K_{AB} = f(X_A, Y_B) = f(X_A, f(X_B, P_0)) = f(X_A X_B, P_0) = f(X_B, f(X_A, P_0)) = f(X_B, Y_A).$$

Eve, the eavesdropper, has listened to the communication and thus knows $P_0$, $Y_A$ and $Y_B$. There is no known way for Eve to compute $K_{AB}$ without applying $f^{-1}$. Since it is computationally intractable to determine $f^{-1}$, $K_{AB}$ is not compromised.

[Figure 2: The Diffie-Hellman Key Exchange. Known to the public: $P_0$. Alice's secret: $X_A$; she computes $Y_A = X_A P_0$ and, after $Y_A$ and $Y_B$ have been exchanged, $K_{AB} = X_A Y_B$. Bob's secret: $X_B$; he computes $Y_B = X_B P_0$ and $K_{AB} = X_B Y_A$. Eve knows $P_0$ and eavesdropped $Y_A$ and $Y_B$.]

The abelian group applied in the original version of the Diffie-Hellman Key Exchange scheme is the finite set of whole numbers modulo a large prime number $p$ together with the corresponding multiply operation. The function $f$ is the modular exponentiation $Q = f(n, P) = P^n$ and can be computed efficiently with the fast exponentiation algorithm. Computing the inverse function $n = f^{-1}(P, Q) = \log_P Q$ is known as the discrete logarithm problem, for which no efficient algorithm is known so far. As detailed in Sec. 2.2, the set of points on an elliptic curve together with a well-defined operation also forms an abelian group, which can alternatively be applied, e.g., to the Diffie-Hellman Key Exchange scheme.

2.2 Elliptic Curve Arithmetic

An elliptic curve $E$ is defined as the cubic equation

$$E : y^2 + a_1 x y + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6 \tag{3}$$

with $a_1, \ldots, a_6$, $x$ and $y$ ranging over any given algebra that meets the field axioms. Fig. 3 illustrates an example where the underlying field is the set of real numbers $\mathbb{R}$ with $a_1 = a_2 = a_3 = a_6 = 0$ and $a_4 = -7$.

[Figure 3: $E : y^2 = x^3 - 7x$, plotted over the real numbers with the points $P$, $Q$, $-R$ and $R$.]

In order to construct an abelian group based on the set of points on an elliptic curve, some group operation (historically, but also arbitrarily, called addition) has to be defined. The addition of two points $P, Q \in E$ is defined as follows: First a straight line is drawn through $P$ and $Q$. If there is a third intersection point with the curve ($-R$), this point is mirrored at the x axis and $R = P + Q$ is the result. Otherwise the result is $O$, some imaginary point at infinity, which is the group's neutral element. The set of points and $O$, together with the described point addition operation, meets the axioms of an abelian group.

For cryptographic applications a finite field is used as the underlying algebra of the elliptic curve¹. The proposed Finite Field coprocessor (detailed in Sec. 4) provides algebraic operations for finite fields of characteristic 2, the so-called Galois Fields GF($2^n$).

¹ The infinite field $\mathbb{R}$ is used in Fig. 3 in order to ease the illustration.

2.3 Finite Field Arithmetic

The smallest imaginable finite field is the set $B = \{0, 1\}$ with $\oplus$ (XOR) as the additive and $\odot$ (AND) as the multiplicative operation, and 0 and 1 as the corresponding neutral elements, respectively. Its elements can be stored in a single bit.

The set $P_B(x) = \{ \sum_{i=0}^{\infty} a_i x^i \mid a_i \in B \}$ of polynomials with coefficients in $B$, together with the additive neutral element $0x^0$, the multiplicative neutral element $1x^0$ and the polynomial addition and multiplication operations, constitutes a commutative ring with an infinite number of elements. Its elements can be stored in bit strings, but it is not possible to give an upper bound on their length.

A Galois Field GF($2^n$) can be constructed out of $P_B(x)$ by modular arithmetic. The modulus in this case is a prime polynomial of degree $n$, i.e., a polynomial that is not the product of two others of lower degree. The set underlying the Galois Field is the finite set of residue classes of polynomials modulo the prime polynomial. Each residue class has a representative polynomial with a degree less than $n$. Therefore, each element of the Galois Field can be stored in a bit string of length $n$.

Let us look at some examples. Suppose the polynomials $P$, $A$ and $B$ are defined as follows:

$$\begin{array}{lll} \text{prime } P: & x^4 + x + 1 & = 10011_2 \\ A: & x^3 + x & = 1010_2 \\ B: & x^2 + x + 1 & = 0111_2 \end{array}$$

The sum of two polynomials corresponds to a bitwise addition modulo 2 of their binary representations, which is identical to a bitwise XOR:

$$\begin{array}{rcl} A \oplus B &=& x^3 + x^2 + (1 \oplus 1)x + 1 \\ &=& x^3 + x^2 + 1 \\ &=& 1010_2 \oplus 0111_2 \\ &=& 1101_2 \end{array}$$

Multiplication of two polynomials can be calculated with the school method, where the shifted partial products are again added as shown above:

$$\begin{array}{rcl} A \cdot B &=& 1010_2 \cdot 0111_2 \\ &=& 1010_2 \oplus 10100_2 \oplus 101000_2 \\ &=& 110110_2 \end{array}$$

Finally, the result is divided by the prime polynomial $P$. The remainder is a representative of the residue class, which can be stored in a bit string of length $n$:

$$\begin{array}{rcl} 110110_2 \div 10011_2 &=& 11_2 \text{ (quotient)} \\ 110110_2 \oplus 100110_2 &=& 10000_2 \\ 10000_2 \oplus 10011_2 &=& 11_2 \text{ (remainder)} \end{array}$$

Thus $1010_2 \cdot 0111_2 \equiv 0011_2 \bmod 10011_2$.

2.4 Karatsuba Multiplication

To perform an n-bit multiplication we need an algorithm that divides the n-bit multiplication into several one bit multiplications, which are the only multiplications that can be computed directly (i.e., by an AND gate). This can be done by the classical school method, which divides an n-bit multiplication into several n/2-bit multiplications by the following equation:

$$\begin{array}{rcl} (A_1 x^{n/2} \oplus A_0) \cdot (B_1 x^{n/2} \oplus B_0) &=& A_1 B_1 x^n \oplus A_1 B_0 x^{n/2} \oplus A_0 B_1 x^{n/2} \oplus A_0 B_0 \\ &=& A_1 B_1 x^n \oplus (A_1 B_0 \oplus A_0 B_1) x^{n/2} \oplus A_0 B_0 \end{array}$$

As one can see, this method needs 4 n/2-bit multiplications and 3 n-bit additions.

In 1963 A. Karatsuba and Y. Ofman described a divide and conquer multiplication algorithm [12], which is quite useful in our case. Using their algorithm, n-bit multiplications are divided into n/2-bit multiplications by the following equation:

$$(A_1 x^{n/2} \oplus A_0) \cdot (B_1 x^{n/2} \oplus B_0) = A_1 B_1 x^n \oplus (A_1 B_0 \oplus A_0 B_1) x^{n/2} \oplus A_0 B_0$$

By defining some additional polynomials

$$\begin{array}{rcl} T_1 &=& A_1 B_1 \\ T_2 &=& (A_1 \oplus A_0)(B_1 \oplus B_0) = (A_1 B_0 \oplus A_0 B_1) \oplus A_1 B_1 \oplus A_0 B_0 \\ T_3 &=& A_0 B_0 \end{array}$$

one gets

$$(A_1 x^{n/2} \oplus A_0) \cdot (B_1 x^{n/2} \oplus B_0) = T_1 x^n \oplus (T_2 \ominus T_1 \ominus T_3) x^{n/2} \oplus T_3$$

and, since $\ominus$ and $\oplus$ are equal in GF($2^n$),

$$(A_1 x^{n/2} \oplus A_0) \cdot (B_1 x^{n/2} \oplus B_0) = T_1 x^n \oplus (T_1 \oplus T_2 \oplus T_3) x^{n/2} \oplus T_3.$$
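The arithmetic of Sec. 2.3 and the Karatsuba recursion of Sec. 2.4 can be checked with a few lines of software. The following Python sketch is only an illustration and not part of the presented design: polynomials over $B$ are stored as integers (bit $i$ holds the coefficient of $x^i$), the function names `clmul_school`, `poly_mod` and `karatsuba` are hypothetical, and the operand width `n` is assumed to be a power of two.

```python
def clmul_school(a: int, b: int) -> int:
    """School-method product of two polynomials over B (bit i = coefficient of x^i)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # polynomial addition is a bitwise XOR
        a <<= 1              # next partial product is shifted by one position
        b >>= 1
    return result


def poly_mod(a: int, p: int) -> int:
    """Remainder of a modulo the nonzero polynomial p (long division with XOR)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)   # cancel the leading term of a
    return a


def karatsuba(a: int, b: int, n: int) -> int:
    """Product of two n-bit polynomials; n is assumed to be a power of two."""
    if n == 1:
        return a & b         # a one bit multiplication is a single AND
    h = n // 2
    mask = (1 << h) - 1
    a1, a0 = a >> h, a & mask
    b1, b0 = b >> h, b & mask
    t1 = karatsuba(a1, b1, h)
    t2 = karatsuba(a1 ^ a0, b1 ^ b0, h)
    t3 = karatsuba(a0, b0, h)
    # T1 x^n + (T1 + T2 + T3) x^(n/2) + T3, with + being XOR
    return (t1 << n) ^ ((t1 ^ t2 ^ t3) << h) ^ t3


# Worked example of Sec. 2.3: P = x^4 + x + 1, A = x^3 + x, B = x^2 + x + 1
P, A, B = 0b10011, 0b1010, 0b0111
product = karatsuba(A, B, 4)
print(bin(product))               # 0b110110
print(bin(poly_mod(product, P)))  # 0b11
```

Both multiplication routines return the unreduced product; the reduction modulo the prime polynomial is a separate step, as in the worked example.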

Karatsuba's method needs only 3 n/2-bit multiplications, but 3 n-bit additions and 2 n/2-bit additions. In GF($2^n$), where multiplication is a quite expensive operation and an addition can be performed at nearly no cost (since an XOR is very small on an FPGA and no carry bits exist), the Karatsuba algorithm is the best choice for our implementation. With this algorithm, we can build an n-bit multiplier by just using 3 n/2-bit multipliers and some XORs. These n/2-bit multipliers can each be built from 3 n/4-bit multipliers and so on, until one reaches a one-bit multiplier (which is an AND). Alternatively, one can stop at some small bit width and perform the multiplication by a lookup table.

3 Hardware Platform

The AT94K FPSLIC product family (see [10] and Fig. 4) from Atmel, Inc. integrates up to 40K gates of FPGA resources, an AVR 8-bit RISC microcontroller core, several peripherals and up to 36K Bytes SRAM within a single chip. The AVR microcontroller core is a common embedded processor, e.g., on SmartCards, and is also available as a standalone device. The AVR is capable of 129 instructions, most of which can be performed within a single clock cycle. This results in a 20+ MIPS throughput at 25 MHz clock rate.

[Figure 4: Atmel AT94K FPSLIC architecture]

The FPGA resources within the AT94K devices are based on Atmel's AT40K FPGA architecture. A special feature of this architecture are the FreeRAM(TM) cells, which are located at the corners of each 4x4 cell sector. Using these cells results in minimal impact on bus resources and thereby in fast and compact FPGA designs.

The FPGA part is connected to the AVR over an 8-bit data bus. Both the AVR microcontroller core and the FPGA part are connected to the embedded memory separately. The up to 36K Bytes SRAM are organized as 20K Bytes program memory, 4K Bytes data memory and 12K Bytes that can be dynamically allocated as data or program memory.

4 Architecture and Implementation

An ideal HW/SW partitioning targeting the FPSLIC platform for an EC based cryptosystem depends on several parameters. As stated before, the finite field arithmetic is the most time critical part of an EC cryptosystem. Depending on the utilized key size and the amount of available FPGA resources, a larger or smaller share of the finite field operations can be run in HW. Therefore, flexibility within the HW design flow is essential. In order to ensure this flexibility, a VHDL generator approach (similar to the one documented in [11]) was used to derive VHDL models of the finite field arithmetic which are adapted to the key size n.

4.1 Coprocessor Architecture

One of the requirements for the proposed coprocessor design was scalability over the whole FPSLIC family. Particularly the low cost µFPSLIC, which has only 5k gate equivalents, is quite interesting regarding economic criteria. On this device it is not feasible to implement a complete multiplier for any bit width that could be regarded as adequate even for low security applications. Thus, we decided on a design which can speed up the multiplication operation for any bit width, but leaves a fair amount of work to be done by software on the microcontroller.

[Figure 5: Generic Coprocessor architecture. Two operand shift registers fed from the 8-bit DIN bus, the combinatorial Karatsuba multiplier and a result register driving the 8-bit DOUT bus; signals shown: RE, WE, IOSEL0, IOSEL4, IOSEL8, IOSEL15, GCLK5, EN, LOAD, RESET, CLK.]

Fig. 5 shows the structure of the coprocessor, which is built around a combinatorial Karatsuba multiplier. The two operand registers on the left are shift registers, which shift in 8 bits of data from the AVR data bus at every write access. The result register at the right of Fig. 5 has to be twice as wide as the operand registers; it can be loaded in parallel with the product and shifts out 8 bits of the result to the data bus at every read access.

The signals IOSEL0 ... IOSEL15 are sixteen I/O selector lines, which can be raised by the AVR and can be interpreted in an arbitrary way by the peripheral device implemented in the FPGA. Our design uses them as follows:

IOSEL15 serves as a reset: a read or write access to the FPGA while I/O selector line 15 is raised results in a reset.
IOSEL4 is used to write to operand register B.
IOSEL0 is used to write to operand register A or to read from the result register.
IOSEL8 is used in an unconventional way: on a read or write access with arbitrary data while this selector line is raised, the result register latches in the output of the combinatorial Karatsuba multiplier.

In order to attain scalability we adopted a generator based design flow. The generator is given the bit width of the operand registers as a parameter (which, due to the 8 bit wide data bus, has to be a multiple of eight) and computes the VHDL code for the coprocessor, which can then be fed to commercial synthesis and placement and routing tools. This allows optimal resource usage for all instances of the FPSLIC family. In Sec. 5 we report on the achievable bit widths and performance gains for some of them. The next section illustrates the generator process for the combinatorial Karatsuba multiplier.
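The register protocol described above can be mimicked by a small behavioral model, which also shows what kind of work may remain for the software when the operands are wider than the hardware multiplier. The following Python sketch is an illustration under several assumptions that the text does not fix: the class `CoprocessorModel` and the functions `hw_multiply` and `wide_multiply` are hypothetical names, bytes are assumed to be shifted in and out most significant byte first, and the word-level school method in `wide_multiply` is just one possible software strategy, not necessarily the one used on the AVR.

```python
def clmul(a: int, b: int) -> int:
    """Reference product of two polynomials over B (bit i = coefficient of x^i)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


class CoprocessorModel:
    """Behavioral model of the register interface (hypothetical, MSB-first assumed)."""

    def __init__(self, width: int) -> None:
        if width <= 0 or width % 8:
            raise ValueError("operand width must be a positive multiple of 8")
        self.width = width
        self.reset()

    def reset(self) -> None:                 # access with IOSEL15 raised
        self.a = self.b = self.result = 0

    def write_a(self, byte: int) -> None:    # write access with IOSEL0 raised
        self.a = ((self.a << 8) | (byte & 0xFF)) & ((1 << self.width) - 1)

    def write_b(self, byte: int) -> None:    # write access with IOSEL4 raised
        self.b = ((self.b << 8) | (byte & 0xFF)) & ((1 << self.width) - 1)

    def latch(self) -> None:                 # any access with IOSEL8 raised
        self.result = clmul(self.a, self.b)  # stands in for the combinatorial multiplier

    def read_result(self) -> int:            # read access with IOSEL0 raised
        top = (self.result >> (2 * self.width - 8)) & 0xFF
        self.result = (self.result << 8) & ((1 << (2 * self.width)) - 1)
        return top


def hw_multiply(cop: CoprocessorModel, a: int, b: int) -> int:
    """One hardware multiplication: load both operands, latch, read the product."""
    nbytes = cop.width // 8
    for byte in a.to_bytes(nbytes, "big"):
        cop.write_a(byte)
    for byte in b.to_bytes(nbytes, "big"):
        cop.write_b(byte)
    cop.latch()
    return int.from_bytes(bytes(cop.read_result() for _ in range(2 * nbytes)), "big")


def wide_multiply(cop: CoprocessorModel, a: int, b: int, n: int) -> int:
    """n-bit product assembled in software from width-bit hardware products."""
    w = cop.width
    mask = (1 << w) - 1
    product = 0
    for i in range(0, n, w):
        for j in range(0, n, w):
            partial = hw_multiply(cop, (a >> i) & mask, (b >> j) & mask)
            product ^= partial << (i + j)    # accumulate with XOR, no carries
    return product
```

With a 32 bit model, `wide_multiply(CoprocessorModel(32), a, b, 128)` issues 16 hardware multiplications; applying Karatsuba's scheme at the word level would reduce that number at the price of additional XORs in software.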

[Figure 6: Recursive construction process for polynomial Karatsuba multipliers. a) one bit polynomial Karatsuba multiplier; b) two bit polynomial Karatsuba multiplier (KM2); c) four bit polynomial Karatsuba multiplier, composed of three KM2 blocks.]

4.2 Combinatorial Karatsuba Multiplier

As stated in Sec. 2.4 and shown in Fig. 6a, the product of two one bit polynomials is computed by a single AND operation. With Karatsuba's divide and conquer multiplication algorithm, a multiplication of two n-bit polynomials can be computed with three n/2-bit multiplications and some additions (which are XORs in our case) to determine interim results and accumulate the final result. This leads immediately to a recursive construction process, which builds combinatorial Karatsuba multipliers of width $n = 2^m$ for arbitrary $m \in \mathbb{N}$ (see Fig. 6). With slight modifications, this scheme can be generalized to support arbitrary bit widths.

[Figure 7: Karatsuba Multiplication. For $A = A_1 x^{n/2} + A_0$ and $B = B_1 x^{n/2} + B_0$, the partial products $T_1 = A_1 B_1$, $T_2 = (A_1 + A_0)(B_1 + B_0)$ and $T_3 = A_0 B_0$ are added up to the product $A \cdot B$ of $2n - 1$ bits.]

To determine the number of gates that constitute an n-bit multiplier we take a look at Fig. 7. In addition to the resources of the three included n/2-bit multipliers, $2(n/2) = n$ 2-input XORs are needed to compute the sub-terms $(A_1 \oplus A_0)$ and $(B_1 \oplus B_0)$ of $T_2$. As can be seen from the figure, a further $2(n/2 - 1) = n - 2$ 4-input XORs (light gray) and one 3-input XOR (dark gray) are necessary to add up the product. Thus, we can calculate the number of gates of an n-bit Karatsuba multiplier as follows:

$$\mathrm{AND}_2(n) = \begin{cases} 1 & n = 1 \\ 3 \cdot \mathrm{AND}_2(n/2) & n > 1 \end{cases}$$

$$\mathrm{XOR}_2(n) = \begin{cases} 0 & n = 1 \\ n + 3 \cdot \mathrm{XOR}_2(n/2) & n > 1 \end{cases}$$

$$\mathrm{XOR}_3(n) = \begin{cases} 0 & n = 1 \\ 1 + 3 \cdot \mathrm{XOR}_3(n/2) & n > 1 \end{cases}$$

$$\mathrm{XOR}_4(n) = \begin{cases} 0 & n = 1 \\ n - 2 + 3 \cdot \mathrm{XOR}_4(n/2) & n > 1 \end{cases}$$

Some gate counts for multipliers of various operand bit widths are summarized in the following table and in Fig. 8.

Bit Width   AND2   XOR2   XOR3   XOR4    SUM
        1      1      0      0      0      1
        2      3      2      1      0      6
        4      9     10      4      2     25
        8     27     38     13     12     90
       16     81    130     40     50    301
       32    243    422    121    180    966
       64    729   1330    364    602   3025

[Figure 8: Karatsuba multiplier gate count. Gate count (0 to 3500) over bit width (4 to 64) for the gate types AND2, XOR2, XOR3, XOR4 and their SUM.]

5 Results and Conclusion

Speeding up the most time critical part of public-key crypto schemes enables the use of these methods within systems with relatively low computing power. By exploiting the presented Finite Field coprocessor, a reasonable speedup of elliptic curve cryptosystems can be achieved.

Two different hardware accelerated multipliers are compared to a pure software implementation, all multiplying two 128 bit polynomials. The software, a manually optimized assembler implementation, takes 5409 cycles. Using a 32 bit hardware multiplier results in a speedup of about factor 3, since a multiplication then takes 1791 cycles. The 32 bit hardware multiplier fits on the AT94K40 using 53% of the available resources. A 48 bit multiplier was also implemented on the AT94K40, but since this version would not be comparable to the other implementations in an obvious way, we do not report on its performance. With the µFPSLIC an 8 bit multiplier is possible. This leads to 3384 cycles for the 128 bit multiplication, which corresponds to a speedup factor of 1.6.

Device     Clock Cycles   Speedup   FPGA Utilization
AVR MCU            5409       1.0                N/A
AT94K5             3384       1.6                43%
AT94K40            1791       3.0                53%

Currently we are working on a multiplier version which will supplement the presented combinatorial multiplier by sequential logic and wider operand and result registers. With this approach, it should be possible to compute a 113 bit wide multiplication plus the necessary reduction in approximately 20 clock cycles. However, this design will probably only fit on the AT94K40, Atmel's largest FPSLIC device.

Acknowledgment

This work was sponsored by and has been done in cooperation with cv cryptovision GmbH, Gelsenkirchen, Germany.

References

[1] W. Diffie and M. Hellman, "New Directions in Cryptography," IEEE Transactions on Information Theory, vol. IT-22, no. 6, 1976.

[2] R. L. Rivest, A. Shamir and L. M. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems," Communications of the ACM, Feb 1978.

[3] V. Miller, "Use of elliptic curves in cryptography," Advances in Cryptology, Proc. CRYPTO'85, LNCS 218, H. C. Williams, Ed., Springer-Verlag, pp. 417-426, 1986.

[4] N. Koblitz, "Elliptic Curve Cryptosystems," Mathematics of Computation, vol. 48, pp. 203-209, 1987.

[5] A. Lenstra and E. Verheul, "Selecting Cryptographic Key Sizes," Proc. Workshop on Practice and Theory in Public Key Cryptography, Springer-Verlag, ISBN 3540669671, pp. 446-465, 2000.

[6] T. ElGamal, "A Public Key Cryptosystem and a Signature Scheme Based on Discrete Logarithms," IEEE Transactions on Information Theory, vol. IT-31, no. 4, 1985.

[7] O. Hauck, A. Katoch and S. A. Huss, "VLSI System Design Using Asynchronous Wave Pipelines: A 0.35 µm CMOS 1.5 GHz Elliptic Curve Public Key Cryptosystem Chip," Proc. IEEE ASYNC 2000, Eilat, April 2000.

[8] M. Rosing, "Implementing Elliptic Curve Cryptography," Manning Publications Co., ISBN 1-884777-69-4, Greenwich, 1999.

[9] A. J. Menezes, "Elliptic Curve Public Key Cryptosystems," Kluwer Academic Publishers, 1993.

[10] Atmel, Inc., "Configurable Logic Data Book," 2001.

[11] M. Ernst, S. Klupsch, O. Hauck and S. A. Huss, "Rapid Prototyping for Hardware Accelerated Elliptic Curve Public-Key Cryptosystems," Proc. 12th IEEE Workshop on Rapid System Prototyping (RSP01), Monterey, CA, June 2001.

[12] A. Karatsuba and Y. Ofman, "Multiplication of multidigit numbers on automata," Sov. Phys.-Dokl. (Engl. transl.), vol. 7, no. 7, pp. 595-596, 1963.