
Front. Comput. Sci., 2013, 7(3): 307–316 DOI 10.1007/s11704-013-2187-2

FPGA based unified architecture for public key and private key cryptosystems

Yi WANG (1,2), Renfa LI (1)

(1) Embedded Systems and Networking Laboratory, Hunan Province Key Laboratory of Network and Information Security, Hunan University, Changsha 410082, China
(2) Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576, Singapore

© Higher Education Press and Springer-Verlag Berlin Heidelberg 2013

Abstract Recently, security in embedded systems has attracted attention because modern electronic devices must exchange and communicate sensitive data with caution. Although security is a classical research topic in worldwide communication, researchers still face the problem of how to deal with resource-constrained devices while enhancing the features of assurance and certification. Therefore, some computations of cryptographic algorithms are built on hardware platforms, such as field programmable gate arrays (FPGAs). The cryptographic algorithms commonly used for digital signatures are Rivest-Shamir-Adleman (RSA) and elliptic curve cryptosystems (ECC), which are based on the presumed difficulty of factoring large integers and on the algebraic structure of elliptic curves over finite fields, respectively. Usually, RSA is computed over GF(p), and ECC is computed over GF(p) or GF(2^p). Moreover, embedded applications need the advanced encryption standard (AES) algorithm to process encryption and decryption procedures. In order to reuse the hardware resources and meet the trade-off between area and performance, we propose a new triple functional arithmetic unit for computing high radix RSA and ECC operations over GF(p) and GF(2^p), which can also be extended to support AES operations. A new high radix signed-digit (SD) adder is proposed to eliminate carry propagation over GF(p). The proposed unified design takes up 28.7% less hardware resources than implementing RSA, ECC, and AES individually, and the experimental results show that our proposed architecture can achieve 141.8 MHz using approximately 5.5k CLBs on a Virtex-5 FPGA.

Keywords AES, RSA, ECC, signed-digit number, FPGA, cryptographic algorithms, high radix, arithmetic unit

Received June 1, 2012; accepted December 5, 2012
E-mail: [email protected]

1 Introduction

In recent years, embedded devices have become pervasive in everyday life, ranging from washing machines and cars to mobile phones and smart cards for conditional access. The design requirements of such devices include not only low power, real-time performance, and reconfigurability, but also strong security features in security-sensitive applications. To fulfill this demand, it is desirable to integrate the necessary cryptographic primitives (public key and private key cryptosystems) into one platform. Public key cryptosystems are usually used for key exchange, digital signatures, and certification authorities, while private key cryptosystems are usually used for data encryption and decryption.

Considerable research effort has already been expended on the efficient unification of modular multiplication and modular addition/subtraction over both GF(p) and GF(2^p) [1–4]. The difference between modular multiplication over GF(p) and GF(2^p) lies in the addition arithmetic, in particular in whether the carry of the summand is forwarded or not. A modified full adder supporting both carry-propagation and carry-less operations is commonly used in the existing designs. However, this results in a critical path delay difference between the two operations, which affects the performance. For instance, in a conventional logic implementation of a full adder, the critical path has three simple gate delays for the carry and two XOR gate delays to produce the result. This contrasts with a critical path of just a single XOR gate delay for the carry-less mode of operation. Using signed-digit (SD) number systems can speed up high radix modular multiplication, as in [5], but only over GF(p). Niimura and Fuwa achieved 90 MHz using an arithmetic unit that had been optimized for the FPGA's configurable logic block (CLB) architecture [5].

There are also some relatively new methods that unify the modular multiplicative operations over GF(p) and the polynomial multiplicative operations over GF(2^p) [6, 7]. Chen et al. proposed a unified reconfigurable architecture for a cryptographic processor that supports Rivest-Shamir-Adleman (RSA) and elliptic curve cryptosystem (ECC) operations over GF(p) and GF(2^p) [6]. They achieved 415 MHz on 0.13 µm standard cell CMOS technology using 331.7k gates. Lai and Huang proposed a dual-field processor for high performance ECC applications [7] using 179k gates on TSMC 0.13 µm standard cell CMOS technology. In our earlier works [8, 9], we proposed an SD adder to support the computations of RSA and ECC over GF(p) and GF(2^p), which shortened the computational critical path to just two look-up table (LUT) delays. We also pointed out that this work can be extended and modified for modular exponentiation with a higher radix (radix = 4) [10]. However, none of the above methods considered applications using the advanced encryption standard (AES) algorithm [11].

Significant effort has also been devoted to implementing the AES within a limited area. Notably, Feldhofer et al. [12] proposed an AES design using approximately 3 400 gates. A different approach was proposed by Grabher et al. [13], who concentrated on application specific instruction set processors (ASIPs) and proposed an optimized instruction set extension that accelerates the bit-sliced implementation of the AES algorithm using a minimal area. Implementations combining both public and private key algorithms were proposed by Tillich and Großschädl [14], who proposed a functional unit (FU) to accelerate ECC and AES on the 32-bit LEON2 SPARC V8 processor.

The aim of this paper is to propose a unified design for the public key cryptosystems RSA and ECC, and to extend the available operations to also support the private key cryptosystem AES without much overhead in hardware resources. Modern FPGAs have a 6-input LUT structure, which opens up possibilities for using higher radix SD arithmetic. Therefore, a new general arithmetic unit is proposed based on the radix-4 SD number system. The radix-4 SD adder is extended to also support the MixColumns and InvMixColumns operations of the AES algorithm, with the two goals of minimizing the area and not affecting the performance of the original architecture. Unfortunately, SubBytes could not be integrated into the proposed triple functional arithmetic unit, so we implemented SubBytes and InvSubBytes on-the-fly. In order to reduce the hardware area, these two operations are implemented by the substitution formation over GF(2^4).

The remainder of this paper is organized as follows: Section 2 summarizes the principles of the high radix SD number system, RSA, ECC, and AES. The modified algorithms are introduced in Section 3. Our unified architecture is proposed in Section 4. The experimental results are reported in Section 5. Section 6 draws the conclusion.

2 Algorithms overview

In this section, we summarize the high radix SD number system and detail the cryptographic algorithms supported in this paper.

2.1 High-radix SD number system

Carry propagation becomes a major issue in high radix modular multiplication, because modular multiplication is implemented in hardware as repeated additions. The SD number representation can accelerate the addition of integers because it eliminates carry propagation, and a high radix SD number can likewise speed up high radix addition. An integer X is represented in the radix-4 SD number system as:

$$X = 4^{n-1} x_{n-1} + 4^{n-2} x_{n-2} + \cdots + 4^{1} x_{1} + 4^{0} x_{0}, \tag{1}$$

where each digit x_i belongs to the digit set {3̄, 2̄, 1̄, 0, 1, 2, 3} and a bar denotes a negative digit. The addition of two radix-4 SD numbers has been defined in [15] as follows:

$$u_i = (x_i + y_i) - 4c_{i+1}, \qquad c_{i+1} = \begin{cases} 1, & (x_i + y_i) > 2, \\ \bar{1}, & (x_i + y_i) < \bar{2}, \\ 0, & \text{otherwise}, \end{cases} \tag{2a}$$

$$s_i = u_i + c_i, \tag{2b}$$

where x_i, y_i, s_i, u_i and c_{i+1} ∈ {3̄, 2̄, 1̄, 0, 1, 2, 3}. This results in a two-stage operation to produce the result. The first operation (Eq. (2a)) calculates the intermediate result, u_i, and the carry that propagates to the next stage, c_{i+1}. Eq. (2b) then generates the final result, s_i, by adding u_i and c_i (the carry from the previous stage). The proposed SD adder performs two functions: high radix SD addition and high radix SD XOR.

2.1.1 High radix SD addition

The computational steps of high radix SD addition are introduced in the following. The computation procedures for the possible digit patterns are shown in Table 1 and Table 2. These two steps map onto the two six-input LUTs (Virtex-5 FPGA) of the high radix SD adder shown in Fig. 1 and will be referred to as LUT1 and LUT2, respectively. As a result, the carry propagation delay depends on only two LUTs.

Table 1 Rules of generating carry (c_{i+1}) and sum (u_i) for two inputs (x_i, y_i) (LUT1); each cell lists c_{i+1}, u_i

| y_i \ x_i | 3̄ | 2̄ | 1̄ | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|
| 3̄ | 1̄, 2̄ | 1̄, 1̄ | 1̄, 0 | 1̄, 1 | 0, 2̄ | 0, 1̄ | 0, 0 |
| 2̄ | 1̄, 1̄ | 1̄, 0 | 1̄, 1 | 0, 2̄ | 0, 1̄ | 0, 0 | 0, 1 |
| 1̄ | 1̄, 0 | 1̄, 1 | 0, 2̄ | 0, 1̄ | 0, 0 | 0, 1 | 0, 2 |
| 0 | 1̄, 1 | 0, 2̄ | 0, 1̄ | 0, 0 | 0, 1 | 0, 2 | 1, 1̄ |
| 1 | 0, 2̄ | 0, 1̄ | 0, 0 | 0, 1 | 0, 2 | 1, 1̄ | 1, 0 |
| 2 | 0, 1̄ | 0, 0 | 0, 1 | 0, 2 | 1, 1̄ | 1, 0 | 1, 1 |
| 3 | 0, 0 | 0, 1 | 0, 2 | 1, 1̄ | 1, 0 | 1, 1 | 1, 2 |

Table 2 Rules of computing the final result (s_i) with two inputs (u_i, c_i) (LUT2)

| c_i \ u_i | 2̄ | 1̄ | 0 | 1 | 2 |
|---|---|---|---|---|---|
| 1̄ | 3̄ | 2̄ | 1̄ | 0 | 1 |
| 0 | 2̄ | 1̄ | 0 | 1 | 2 |
| 1 | 1̄ | 0 | 1 | 2 | 3 |

Table 3 Rules of computing the result (s_i) with two inputs (x_i, y_i)

| y_i \ x_i | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 1 | 2 | 3 | 0 |
| 2 | 2 | 3 | 0 | 1 |
| 3 | 3 | 0 | 1 | 2 |
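The two-step addition of Eqs. (2a) and (2b) can be sketched in software as follows. This is only an illustrative model, not the hardware design: digits are plain Python integers in the range −3..3 stored least significant digit first, the carry thresholds are an assumption read from Table 1, and the names `lut1`, `lut2`, `hs_add`, and `sd_value` are hypothetical.

```python
def lut1(x, y):
    """First step (Eq. (2a)): return (carry c_{i+1}, intermediate sum u_i)."""
    t = x + y
    # Assumption taken from Table 1: a carry is emitted when |x + y| >= 3.
    c = 1 if t > 2 else -1 if t < -2 else 0
    return c, t - 4 * c


def lut2(u, c):
    """Second step (Eq. (2b)): final digit s_i = u_i + c_i."""
    return u + c


def hs_add(xs, ys):
    """Add two radix-4 SD numbers given as digit lists (least significant first)."""
    n = max(len(xs), len(ys))
    xs = list(xs) + [0] * (n - len(xs))
    ys = list(ys) + [0] * (n - len(ys))
    stage1 = [lut1(x, y) for x, y in zip(xs, ys)]  # every position is independent
    carries = [0] + [c for c, _ in stage1]         # c_0 = 0, then c_1 .. c_n
    sums = [u for _, u in stage1] + [0]            # u_n = 0 for the extra top digit
    return [lut2(u, c) for u, c in zip(sums, carries)]


def sd_value(digits):
    """Integer value of a radix-4 SD digit list (least significant first)."""
    return sum(d * 4 ** i for i, d in enumerate(digits))
```

Because u_i stays within −2..2 and the incoming carry within −1..1, every output digit remains in the digit set and a carry never travels further than one position, which is the property the two-LUT structure relies on.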

Fig. 1 The proposed high radix signed adder

2.1.2 High radix SD XOR

The difference between modular multiplication over GF(p) and over GF(2^p) lies in whether the carry is propagated or not. In order to support both algorithms, the proposed arithmetic unit must be able to handle both carry-propagation and carry-less operations. The carry-less addition over GF(2^p) is just a simple XOR operation. This XOR operation is given in Table 3.

2.2 RSA cryptosystem

The RSA [16] cryptosystem is a public-key cryptosystem that offers both encryption and digital signatures (authentication). Public-key cryptosystems use two different keys. One is public (the public key) while the other is kept secret (the private key). Clearly, computing the private key from the public key has to be intractable [17, 18]. The key generation for RSA is shown in Algorithm 1 [19].

Algorithm 1 Key generation for RSA public-key encryption
1: Each entity creates an RSA public key and a corresponding private key. Each entity A should do the following:
2: Generate two large random (and distinct) primes p and q, each roughly the same size.
3: Compute M = pq and φ = (p − 1)(q − 1).
4: Select a random integer b, 1 < b < φ, such that gcd(b, φ) = 1.
5: Use the extended Euclidean algorithm to compute the unique integer e, 1 < e < φ, such that be ≡ 1 (mod φ).
6: A's public key is (M, b); A's private key is e.

Message transformation between parties A and B uses these RSA keys. B obtains the authentic public key (M, b) from A and computes d = C^b mod M, where C is the message to be signed or encrypted. B then sends d to A, and A uses the private key e to compute d^e mod M and recover the message C. RSA is mainly concerned with modular exponentiations in the prime field. The speed of modular multiplication is critical for the performance of modular exponentiation.

2.3 Elliptic curve cryptosystem

ECC is a public key cryptosystem based on an elliptic curve, that is, the set of pairs (x, y) that fulfill a given equation. In the context of ECC, the considered curve is defined over a prime field, GF(p), or a binary field, GF(2^p). Two basic operations on the curve are defined: Point Addition and Point Doubling. Point Addition calculates a third point on the curve taking two different points as the inputs, while Point Doubling calculates a third point on the curve when the two inputs are the same point. Point Addition and Point Doubling are calculated using modular arithmetic, namely modular addition, modular subtraction, modular multiplication, and modular inversion. Since modular inversion is the most costly computation among them, we avoid it by using the projective coordinate transformation for ECC over GF(p) and the Lopez-Dahab coordinate transformation for ECC over GF(2^p) [20].

2.3.1 Projective coordinate

If the prime field GF(p) is used, then the elliptic curve E(GF(p)) is defined as:

$$y^2 = x^3 + ax + b, \tag{3}$$

where $4a^3 + 27b^2 \neq 0$. Let x = X/Z and y = Y/Z, and Eq. (3) becomes:

$$Y^2 Z = X^3 + aXZ^2 + bZ^3. \tag{4}$$

Point Addition and Point Doubling using the projective coordinate transformation are defined in the Appendix (Algorithm 4 and Algorithm 5).

2.3.2 Lopez-Dahab coordinate

If the binary field GF(2^p) is used, then the elliptic curve E(GF(2^p)) is defined as:

$$y^2 + xy = x^3 + ax^2 + b, \tag{5}$$

where a ∈ GF(2^p) and b ∈ GF(2^p) are constants, and b ≠ 0. Let x = X/Z and y = Y/Z^2, and Eq. (5) becomes:

$$Y^2 + XYZ = X^3 Z + aX^2 Z^2 + bZ^4. \tag{6}$$

Point Addition and Point Doubling using the Lopez-Dahab coordinate transformation are defined in the Appendix (Algorithm 6 and Algorithm 7).

2.4 Private key cryptosystem

AES is a symmetric encryption algorithm. The AES cipher has a 128-bit block size and a key size that can be specified as 128, 192, or 256 bits. Figure 2 shows the encryption of the AES algorithm where the key size is 128 bits. The encryption starts with the first key addition, followed by nine round functions. Each round function is composed of four transformations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The final round is composed only of SubBytes, ShiftRows, and AddRoundKey. The key schedule takes the secret key, expands it as specified in the standard, and feeds a round subkey into each round.
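The round structure described above for a 128-bit key can be written down as an ordered list of transformations. The sketch below only enumerates the steps; the function name `aes128_encryption_plan` is illustrative, and the transformations themselves are not implemented here.

```python
def aes128_encryption_plan():
    """Return the ordered transformation names of AES-128 encryption."""
    plan = ["AddRoundKey"]                      # first key addition
    for _ in range(9):                          # nine full round functions
        plan += ["SubBytes", "ShiftRows", "MixColumns", "AddRoundKey"]
    plan += ["SubBytes", "ShiftRows", "AddRoundKey"]  # final round, no MixColumns
    return plan
```

Counting the entries shows that MixColumns is needed in nine rounds while AddRoundKey is applied eleven times, which is why the key schedule must supply eleven round subkeys for a 128-bit key.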

Fig. 2 The procedure of computing AES

3 Proposed arithmetic unit

In this section, we present our unified RSA and ECC co-processor over GF(p) and GF(2^p), which is also extended to support AES operations. The core of our architecture is the new triple functional arithmetic unit, which extends an accelerator designed for computing the modular multiplication of RSA and ECC so that it also computes the MixColumns and InvMixColumns states of the AES algorithm. The performance depends mostly on the speed of the modular multiplication; therefore, the proposed arithmetic unit balances the carry-propagation and carry-less operations, which optimizes the basic computation of modular multiplication. In order to avoid modular inversion, the most costly computation for ECC, we use the projective coordinate and Lopez-Dahab coordinate formations to compute ECC operations over GF(p) and GF(2^p), respectively.

The modular multiplication, which is one of the basic modular operations for computing the Point Addition and the Point Doubling of ECC, is usually implemented by efficient algorithms such as the one proposed by Montgomery [21]. Based on the work of [21], a modified high radix Montgomery modular multiplication was proposed by Orup [22]. Montgomery's modular multiplication is used to implement the modular multiplication of RSA and ECC over GF(p), while Koc and Acar's algorithm is used to compute the polynomial multiplication of ECC over GF(2^p) [23]. In order to raise the security level, the high radix modular multiplication and polynomial multiplication algorithms are chosen as the basic computational blocks of RSA and ECC. Note that the radix is set to 4 in this paper. The proposed unified algorithms are listed in Algorithm 2 and Algorithm 3, where HSADD and HSXOR represent radix-4 addition and radix-4 XOR, respectively.

Algorithm 2 The proposed radix-4 SD implementation of Montgomery's modular multiplication
Input: $M = \sum_{i=0}^{n-1} (2^2)^i m_i$, $m_i \in \{0, 1\}$; $B = \sum_{i=0}^{n-1} (2^2)^i b_i$, $b_i \in \{0, 1\}$; $A = \sum_{i=0}^{n-1} (2^2)^i a_i$, $a_i \in \{0, 1\}$; $A, B < 2M$, $4M < R = 2^{2n}$, $M' = -M^{-1} \bmod 2^2$, $\gcd(2^2, M) = 1$
Output: $S = A \cdot B \cdot R^{-1} \bmod M$
1: $\bar{A}, \bar{B}, \bar{M} = \mathrm{convert}(A, B, M)$, $\bar{S}_0 = 0$;
2: for i = 0 to n − 1 do
3:   $\bar{S}_i = \mathrm{HSADD}(\bar{S}_i, \bar{a}_i \bar{B})$;
4:   $\bar{q}_i = \bar{S}_i \bmod 4$;
5:   $\bar{P}_i = \mathrm{HSADD}(\bar{S}_i, \bar{q}_i \bar{M})$;
6:   $\bar{S}_{i+1} = \bar{P}_i / 4$;
7: end for
8: $S_n = \mathrm{Invconvert}(\bar{S}_n)$;
9: if $S_n > M$ then
10:   $S_n = S_n - M$;
11: else
12:   $S_n = S_n$;
13: end if

Algorithm 3 The proposed radix-4 SD implementation of modular multiplication over GF(2^p)
Input: $a(x) = \sum_{i=0}^{n-1} a_i x^{2i}$, $a_i \in GF(2)$; $b(x) = \sum_{i=0}^{n-1} b_i x^{2i}$, $b_i \in GF(2)$; $m(x) = x^n + \sum_{i=0}^{n-1} m_i x^{2i}$, $m_i \in GF(2)$; m(x) is an irreducible polynomial of degree n over the field GF(2)
Output: $a(x) \cdot b(x) \cdot x^{-n} \bmod m(x)$
1: $s(x)_0 = 0$;
2: for i = 0 to n − 1 do
3:   $s(x)_i = \mathrm{HSXOR}(s(x)_i, a_i b(x)_i)$;
4:   $q_i = s(x)_i \bmod x^2$;
5:   $p(x)_i = \mathrm{HSXOR}(s(x)_i, q_i m(x)_i)$;
6:   $s(x)_{i+1} = p(x)_i / x^2$;
7: end for

The above two algorithms, which are used for performing RSA and ECC over GF(p) and GF(2^p), can also be adapted to compute AES. Some multiplexers have been added to the proposed design in order to support both the modular multiplication of RSA and ECC and the MixColumns and InvMixColumns states of AES. The MixColumns and InvMixColumns states of AES operate column by column. Each column is considered as a polynomial over GF(2^8) and multiplied modulo x^4 + 1 with a fixed polynomial. The fixed polynomials used in the MixColumns and InvMixColumns states are defined by Eq. (7a) and Eq. (7b), respectively.

$$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}, \tag{7a}$$

$$b(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\}. \tag{7b}$$

The two fixed polynomials above use only seven distinct coefficients (0x01, 0x02, 0x03, 0x0b, 0x0d, 0x09, and 0x0e), which are all the coefficients needed for AES encryption and decryption. The polynomial multiplication over GF(2^8) can be calculated by the Xtime function, which multiplies a byte by x. The reduction is performed only when the most significant bit (MSB) of the byte equals 1, in which case the shifted result is bitwise XORed with 0x1b. For example:

$$\text{byte} \cdot \{0e\} = \text{byte} \cdot (\{02\} \oplus \{04\} \oplus \{08\}) = (\text{byte} \cdot \{02\}) \oplus (\text{byte} \cdot \{04\}) \oplus (\text{byte} \cdot \{08\}). \tag{8}$$

Based on the above discussion, we propose a new triple functional arithmetic unit based on the SD number system, which supports: (1) RSA operations over GF(p); (2) ECC operations over GF(p) (radix-4); (3) ECC operations over GF(2^p) (radix-4); (4) AES encryption; (5) AES decryption.

Figure 3 shows the structure of the proposed triple functional arithmetic unit. HSADDER represents the radix-4 signed adder. It is used for the modular multiplication over GF(p), the polynomial multiplication over GF(2^p), and the MixColumns and InvMixColumns states. Each HSADDER has a selector signal Tri, which selects among the AES operations (encryption and decryption), RSA over GF(p), and ECC over GF(p) or GF(2^p). Each result from HSADDER passes through a left shift (in Fig. 3) to realize the Xtime function. The parameters with a bar in Fig. 3 are represented as radix-4 SD numbers. Signal Sel selects the combination of the results for {1}, {2}, {4}, and {8} when computing each byte of the MixColumns and InvMixColumns states. The basic proposed arithmetic unit therefore has a 24-bit datapath and is called AU_j in this paper (each radix-4 SD digit, which corresponds to 2 normal binary bits, is represented by 3 bits).

Fig. 3 The proposed triple functional arithmetic unit
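Algorithms 2 and 3 share the same loop and differ only in whether the addition propagates carries. The following Python sketch illustrates that shared structure under several simplifying assumptions that are not taken from the paper: operands are ordinary Python integers rather than SD digits (so HSADD becomes `+` and HSXOR becomes `^`), the quotient digit is computed with the standard Montgomery constant (M′ = −M^{-1} mod 4 over GF(p), m(x)^{-1} mod x^2 over GF(2^p)), and the final conditional subtraction is written as a loop. The names `clmul`, `polymod`, and `mont_mul_radix4` are hypothetical.

```python
def clmul(a, b):
    """Carry-less (polynomial over GF(2)) product of two bit vectors."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a, m):
    """Remainder of polynomial a modulo polynomial m over GF(2)."""
    dm = m.bit_length()
    while a.bit_length() >= dm:
        a ^= m << (a.bit_length() - dm)
    return a


def mont_mul_radix4(a, b, m, n, binary_field):
    """Radix-4 Montgomery product of a and b with n two-bit digits.

    GF(p) mode returns a*b*4^(-n) mod m; GF(2^p) mode returns
    a(x)*b(x)*x^(-2n) mod m(x), with polynomials stored as bit vectors.
    """
    if binary_field:
        add, mul = (lambda u, v: u ^ v), clmul
        m_prime = m & 3                      # (1 + m1*x)^-1 = 1 + m1*x mod x^2
    else:
        add, mul = (lambda u, v: u + v), (lambda u, v: u * v)
        m_prime = (-pow(m, -1, 4)) % 4       # needs odd m and Python 3.8+
    s = 0
    for i in range(n):
        a_i = (a >> (2 * i)) & 3             # next radix-4 digit of a
        s = add(s, mul(a_i, b))              # step 3: accumulate a_i * b
        q = mul(s & 3, m_prime) & 3          # step 4: quotient digit
        s = add(s, mul(q, m)) >> 2           # steps 5-6: clear and drop two bits
    if not binary_field:
        while s >= m:                        # final reduction into [0, m)
            s -= m
    return s
```

The only difference between the two modes is the pair of `add`/`mul` primitives, which mirrors the role of the Tri selector that forces the carry to zero for the GF(2^p) case.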

4 Hardware implementation

This architecture can support operations of up to 96l bits (l = ⌈(3n/2)/96⌉), where each Unit_m consists of four AU_j s and each AU_j has a 24-bit datapath for data in the radix-4 SD number system. In this system, a 2-bit normal binary number is represented by a 3-bit SD number; therefore, the width of Unit_m is 96 bits. The structure of the proposed Unit_m is shown in Fig. 4.

Fig. 4 The proposed structure of Unit_m

The overall structure of the proposed architecture is described in Fig. 5. The core arithmetic Unit_m is shared among the RSA, ECC, and AES branches, where Unit_m supports the operations of the dual-field modular multiplication and the MixColumns and InvMixColumns states. All branches are controlled by the same control unit and share the I/O interface. The RSA part is computed by the modular multiplication over GF(p). The ECC part is composed of a unit that computes dual-field modular addition/subtraction and dual-field modular multiplication over GF(p) and GF(2^p). The AES encryption and decryption procedures share the key expansion module and the unified MixColumns/InvMixColumns module. Besides, there exist extra computational modules for the SubBytes and ShiftRows states of AES encryption and for the InvSubBytes and InvShiftRows states of AES decryption. Note that, although most of the area of the AES implementation is dominated by the SubBytes and InvSubBytes states, reusing the arithmetic unit to support the MixColumns and InvMixColumns states provides an alternative function core that combines RSA, ECC, and AES.

Fig. 5 The proposed overall structure

The critical path of our proposed design lies in the modular multiplication operation. Therefore, it is important to speed up the triple functional modular multiplier. Using this l-stage pipelined architecture, 2(n/2 − 1) + l cycles are needed to accomplish one modular multiplication. For an n-bit modular exponentiation, the total computation time is 2n × (n/2 − 1) + l cycles. The above computations can also be applied to polynomial multiplication over GF(2^p). In this case the carry propagation between any two units is forced to zero, so that the proposed design can be reused.

The latest suggested RSA parameter is 1 024 bits, but 2 048-bit RSA provides higher security than 1 024-bit RSA. The latest suggested ECC parameters are 192 bits over GF(p) and 163 bits over GF(2^p). Our architecture also has the scalability to support other ECC parameters, as they are much smaller than the RSA parameters. Besides the RSA and ECC operations, one Unit_m can compute one byte of the MixColumns or InvMixColumns states in a clock cycle, so four Unit_m s are needed to accomplish the MixColumns or InvMixColumns states. Therefore, 44 cycles are needed for a full AES encryption or decryption.
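The Xtime-based computation of the MixColumns and InvMixColumns states described in Section 3 can be sketched in software as follows. This is a plain byte-level model, not the SD datapath of Unit_m; the helper names `xtime`, `gf_mul`, and `mix_column` are illustrative. Each coefficient is decomposed into the terms {1}, {2}, {4}, and {8}, in the manner of Eq. (8).

```python
MIX = [0x02, 0x01, 0x01, 0x03]       # a_0..a_3 of Eq. (7a): constant term first
INV_MIX = [0x0E, 0x09, 0x0D, 0x0B]   # b_0..b_3 of Eq. (7b): constant term first


def xtime(byte):
    """Multiply a byte by x in GF(2^8); reduce with 0x1b when the MSB is set."""
    shifted = (byte << 1) & 0xFF
    return shifted ^ 0x1B if byte & 0x80 else shifted


def gf_mul(byte, coeff):
    """Multiply by a 4-bit coefficient as an XOR of byte*{1}, {2}, {4}, {8}."""
    result = 0
    term = byte                       # byte * {1}
    for bit in range(4):
        if (coeff >> bit) & 1:
            result ^= term
        term = xtime(term)            # next power of x
    return result


def mix_column(column, poly):
    """Multiply one 4-byte column by the fixed polynomial modulo x^4 + 1."""
    out = []
    for r in range(4):
        acc = 0
        for c in range(4):
            acc ^= gf_mul(column[c], poly[(r - c) % 4])
        out.append(acc)
    return out
```

Applying `mix_column` with `MIX` and then with `INV_MIX` returns the original column, since the two fixed polynomials are inverses of each other modulo x^4 + 1.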

5 Experimental results

In this section, we present the area and performance of the proposed co-processor. To evaluate the proposed unified architecture, we implemented it in a hardware description language (VHDL) and synthesized it on FPGA platforms. We synthesized our design using Xilinx ISE 12.1 and evaluated it on Xilinx Virtex-IV as well as Xilinx Virtex-5 FPGAs. Table 4 shows the experimental results for the proposed architecture when ported to the Xilinx Virtex-IV and Virtex-5 platforms. We compared the area and performance of our architecture with separate AES, RSA, and ECC architectures by instantiating three different IP cores, one for AES, one for RSA, and one for ECC. The results show that our proposed unified architecture takes up less area than the sum of these three IPs, by 27.8% and 28.7% on the Virtex-IV and Virtex-5 FPGA platforms, respectively. This area optimization is achieved by unifying the operations.

Table 4 The experimental results for unified architecture on FPGA

| Platform | XC4VLX100 Area/CLBs | XC4VLX100 Speed/MHz | XC5VLX60 Area/CLBs | XC5VLX60 Speed/MHz |
|---|---|---|---|---|
| AES | 409 | 133.85 | 387 | 153.06 |
| RSA (radix-4) | 2 175 | 109.7 | 2 280 | 144.9 |
| ECC (radix-4) | 1 395 | 109.7 | 1 457 | 144.9 |
| This work | 2 873 | 106.7 | 2 939 | 141.8 |
| Area saving | 1 106 | | 1 185 | |
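The area savings quoted for Table 4 follow directly from the CLB counts. The short sketch below reproduces the arithmetic; the dictionary layout and the function name `area_saving` are only illustrative.

```python
# CLB counts copied from Table 4.
AREA_CLBS = {
    "XC4VLX100": {"AES": 409, "RSA": 2175, "ECC": 1395, "unified": 2873},
    "XC5VLX60": {"AES": 387, "RSA": 2280, "ECC": 1457, "unified": 2939},
}


def area_saving(entry):
    """Return (saved CLBs, saving in percent) relative to three separate cores."""
    separate = entry["AES"] + entry["RSA"] + entry["ECC"]
    saved = separate - entry["unified"]
    return saved, 100.0 * saved / separate
```

For the Virtex-IV device the three separate cores sum to 3 979 CLBs, so the saving of 1 106 CLBs corresponds to about 27.8%; for the Virtex-5 device the sum is 4 124 CLBs and the saving of 1 185 CLBs corresponds to about 28.7%.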

A comparison of our unified architecture with previously proposed unified architectures is not straightforward, since they target different applications and platforms. In order to have a fair comparison, we estimate the gate equivalence of our design on the Virtex-5 platform. Table 5 shows the comparison between the existing unified designs and our proposed design. Compared with the work proposed by Chen et al. [6], our occupied area is only 35.8% of theirs, and the total computational time is somewhat shorter than theirs with comparable parameters (1 024-bit RSA and 160-bit ECC). The proposed design also achieves a shorter total computational time than the design in [7]. Overall, our proposed design achieves the shortest computational time among the existing methods with an acceptable hardware area. This achievement results from applying the radix-4 SD adder to the proposed design: we not only shorten the overall critical path but also make full use of the six-input LUTs on the Virtex-5 FPGA. Although our design takes up a larger hardware area than AES-only methods, it achieves better throughput than the others while also supporting RSA and ECC operations.


Table 5 Comparison with previous works

| Design | Algorithm | Key size/bit | Freq/MHz | ME/ms | PM/ms | Area | Platform |
|---|---|---|---|---|---|---|---|
| Satoh [2] | ECC over GF(p) | 192 | 137.7 | – | 1.44 | 11.3k gates | 0.13 µm CMOS |
| Satoh [2] | ECC over GF(2^p) | 160 | 510.2 | – | 0.19 | 10.8k gates | 0.13 µm CMOS |
| Batina [24] | RSA | 1 024 | 53 | 22.8 | – | 4 547 CLBs | Xilinx Virtex 800 |
| Batina [24] | ECC over GF(2^p) | 275 | 53 | – | 11.4 | 4 547 CLBs | Xilinx Virtex 800 |
| Cilardo [4] | RSA | 1 024 | 77.3 | 27.36 | – | 6 709 CLBs | Xilinx Virtex-E2000 |
| Cilardo [4] | ECC over GF(2^p) | 163 | 77.3 | – | 1.18* | 6 709 CLBs | Xilinx Virtex-E2000 |
| Chen [6] | RSA | 1 024 | 415 | 3.84 | – | 331.7k gates | 0.13 µm CMOS |
| Chen [6] | ECC over GF(2^p) | 163 | 415 | – | 0.44 | 331.7k gates | 0.13 µm CMOS |
| Lai [7] | ECC over GF(p) | 160 | 141.3 | 0.385 | – | 179k gates | TSMC 0.13 µm CMOS |
| Lai [7] | ECC over GF(2^p) | 160 | 158.1 | – | 0.272 | 179k gates | TSMC 0.13 µm CMOS |
| This work (63.6k gates*) | RSA | 1 024 (2^2) | 141.8 | 3.701 | – | 2 939 CLBs | Xilinx XC5VFX100T |
| This work (63.6k gates*) | ECC over GF(p) | 192 (2^2) | 141.8 | 0.130 | – | 2 939 CLBs | Xilinx XC5VFX100T |
| This work (63.6k gates*) | ECC over GF(2^p) | 163 (x^2) | 141.8 | – | 0.094 | 2 939 CLBs | Xilinx XC5VFX100T |
| This work (63.6k gates*) | AES | 128 | 141.8 | – | – | 2 939 CLBs | Xilinx XC5VFX100T |
| This work (118.8k gates*) | RSA | 2 048 (2^2) | 132.5 | 15.83 | – | 5 490 CLBs | Xilinx XC5VFX100T |
| This work (118.8k gates*) | ECC over GF(p) | 192 (2^2) | 132.5 | 0.139 | – | 5 490 CLBs | Xilinx XC5VFX100T |
| This work (118.8k gates*) | ECC over GF(2^p) | 163 (x^2) | 132.5 | – | 0.100 | 5 490 CLBs | Xilinx XC5VFX100T |
| This work (118.8k gates*) | AES | 128 | 132.5 | – | – | 5 490 CLBs | Xilinx XC5VFX100T |

Note: *: estimated by the author; ME: modular multiplication; PM: polynomial multiplication.

6 Conclusion

We have proposed an extension of a high radix (radix = 4) public key accelerator that also supports AES encryption and decryption. The core of our design is the triple functional arithmetic unit, which not only supports modular multiplication and polynomial multiplication over GF(p) and GF(2^p), but also supports the MixColumns and InvMixColumns states of the AES. To accomplish this, we inserted several multiplexers to switch among these three functions. The whole system is based on the SD number system in order to eliminate the carry propagation difference between GF(p) and GF(2^p). We evaluated our design on modern FPGA platforms. Our experimental results show that the proposed architecture requires less area than implementing one RSA, one ECC, and one AES core separately. The proposed design thus provides an alternative implementation of an integrated RSA, ECC, and AES core.

Acknowledgements This work was supported by National Natural Science Foundation of China (Grant No. 61173036) and the Fundamental Research Funds for Chinese Central Universities.

Appendixes

Point Addition and Point Doubling using projective coordinates are defined in Algorithm 4 and Algorithm 5. Point Addition and Point Doubling using Lopez-Dahab coordinates are defined in Algorithm 6 and Algorithm 7.

Algorithm 4 Point Addition in projective coordinate
Input: In prime field GF(p); the field elements a and b defining a curve E over GF(p); let P1 = (X1, Y1, Z1) and P2 = (X2, Y2, Z2) on E, where P1 ≠ P2
Output: P3 = (X3, Y3, Z3) for the point P1 + P2
1: T1 ← Y2 × Z1
2: T2 ← Y1 × Z2
3: T3 ← X2 × Z1
4: T4 ← X1 × Z2
5: U ← T1 − T2
6: Z3 ← Z1 × Z2
7: T1 ← U^2
8: T1 ← T1 × Z3
9: X3 ← T3 − T4
10: V ← X3^2
11: T3 ← X3 × V
12: T3 ← 2V × T4
13: T1 ← T1 − T3 − V
14: X3 ← X3 × T1
15: Z3 ← Z3 × T3
16: Y3 ← T3 × T2
17: T3 ← T3/2 − V
18: T3 ← U × T3
19: Y3 ← T3 − Y3

Algorithm 5 Point Doubling in projective coordinate
Input: In prime field GF(p); the field elements a and b defining a curve E over GF(p); let P4 = (X4, Y4, Z4).
Output: P5 = (X5, Y5, Z5) for the point 2P4
1: S ← Y4 × Z4
2: T1 ← S^2
3: Z5 ← 8T1 × S
4: T2 ← X4 × Y4
5: T2 ← T2 × S
6: T3 ← a . . . Z4^2
7: T4 ← sX4^2
8: T3 ← T3 + T4
9: X5 ← T3^2
10: T4 ← X5 − 8T2
11: X5 ← 2T4 × S
12: Y5 ← Y4^2
13: Y5 ← 8Y5 × T1
14: T2 ← 4T2 − T4
15: T2 ← T2 × T3
16: Y5 ← T2 − Y5

Algorithm 6 Point Addition in Lopez-Dahab coordinate
Input: In finite field GF(2^p); the field elements a and b defining a curve E over GF(2^p); the x coordinate of the point P; the x coordinates X1/Z1 and X2/Z2 for the points P1 and P2 on E, where P1 ≠ P2
Output: x-coordinate X3/Z3 for the point P1 + P2
1: T1 ← x
2: X3 ← X1 × Z2
3: Z3 ← Z1 × X2
4: T2 ← X3 × Z3
5: Z3 ← Z3 + X3
6: Z3 ← Z3^2
7: X3 ← Z3 × T1
8: X3 ← X3 + T2

Algorithm 7 Point Doubling in Lopez-Dahab coordinate
Input: In finite field GF(2^p); the field elements a and c = b^(2^(m−1)) (c^2 = b) defining a curve E over GF(2^p); the x coordinate X4/Z4 for a point P4.
Output: x-coordinate X5/Z5 for the point 2P4
1: T1 ← c
2: X5 ← X4^2
3: Z5 ← Z4^2
4: T1 ← Z5 × T1
5: Z5 ← Z5 × X5
6: T1 ← T1^2
7: X5 ← X5^2
8: X5 ← X5 + T1

References

1. Großschädl J. A bit-serial unified multiplier architecture for finite fields GF(p) and GF(2^m). In: Proceedings of the 3rd International Workshop on Cryptographic Hardware and Embedded Systems. 2001, 202–219
2. Satoh A, Takano K. A scalable dual-field elliptic curve cryptographic processor. IEEE Transactions on Computers, 2003, 52(4): 449–460
3. Batina L, Bruin-muurling G, Örs S. Flexible hardware design for RSA and elliptic curve cryptosystems. In: Proceedings of 2004 Topics in Cryptology-CT-RSA. 2004
4. Cilardo A, Mazzeo A, Mazzocca N, Romano L. A novel unified architecture for public-key cryptography. In: Proceedings of the 2005 Design, Automation and Test in Europe. 2005, 52–57
5. Niimura M, Fuwa Y. High speed adder algorithm with radix-2^k sub signed-digit number. Journal of Formalized Mathematics, 2003
6. Chen J, Shieh M, Lin W. A high-performance unified-field reconfigurable cryptographic processor. IEEE Transactions on Very Large Scale Integration Systems, 2010, 18(8): 1145–1158
7. Lai J, Huang C. Energy-adaptive dual-field processor for high-performance elliptic curve cryptographic applications. IEEE Transactions on Very Large Scale Integration Systems, 2011, 19(8): 1512–1517
8. Wang Y, Maskell D, Leiwo J, Srikanthan T. Unified signed-digit number adder for RSA and ECC public-key cryptosystems. In: IEEE Asia Pacific Conference on Circuits and Systems. 2006, 1655–1658
9. Wang Y, Maskell D, Leiwo J. A unified architecture for a public key cryptographic coprocessor. Journal of Systems Architecture, 2008, 54(10): 1004–1016
10. Wang Y, Maskell D. A unified signed-digit adder for high-radix modular exponentiation on GF(p) and GF(2^p). In: Proceedings of the 2009 12th International Symposium on Integrated Circuits. 2009, 687–690
11. FIPS N. Announcing the advanced encryption standard (AES). Federal Information Processing Standards Publication 197. National Institute of Standards and Technology, 2001
12. Feldhofer M, Wolkerstorfer J, Rijmen V. AES implementation on a grain of sand. Information Security. 2005, 13–20
13. Grabher P, Großschädl J, Page D. Light-weight instruction set extensions for bit-sliced cryptography. In: Proceedings of the 10th International Workshop on Cryptographic Hardware and Embedded Systems. 2008, 331–345
14. Tillich S, Großschädl J. VLSI implementation of a functional unit to accelerate ECC and AES on 32-bit processors. In: Proceedings of the 1st International Workshop on Arithmetic of Finite Fields. 2007, 40–54
15. Natick K I P A. Computer arithmetic algorithms. Prentice Hall, 2002
16. Rivest R, Shamir A, Adleman L. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 1978, 21(2): 120–126
17. Pieprzyk J, Seberry J, Hardjono T. Fundamentals of computer security. Computing Reviews, 2004, 45(10): 621–622
18. Stinson D. Cryptography: theory and practice. Chapman & Hall/CRC, 2005
19. Menezes A, Van Oorschot P, Vanstone S. Handbook of Applied Cryptography. CRC Press, 1996
20. Cohen H, Frey G, Avanzi R, Doche C, Lange T, Nguyen K, Vercauteren F. Handbook of Elliptic and Hyperelliptic Curve Cryptography. Chapman & Hall/CRC, 2005
21. Montgomery P. Modular multiplication without trial division. Mathematics of Computation, 1985, 44(170): 519–521
22. Orup H. Simplifying quotient determination in high-radix modular multiplication. In: Proceedings of the 12th Symposium on Computer Arithmetic. 1995, 193–199
23. Koc C, Acar T. Montgomery multiplication in GF(2^k). Designs, Codes and Cryptography, 1998, 14(1): 57–69
24. Batina L, Guajardo J, Kerins T, Mentens N, Tuyls P, Verbauwhede I. An elliptic curve processor suitable for RFID-tags. In: Proceedings of the Benelux Workshop Information and System Security. 2006

Yi Wang received the BEng and MEng degrees from Northwestern Polytechnical University, China in 2000 and 2003, and the PhD degree from the School of Computer Engineering, Nanyang Technological University, Singapore in 2008. She worked as a postdoc in the crypto group at Université Catholique de Louvain, Belgium from 2009 to 2010, and she was a lecturer at the College of Information Technology and Engineering, Hunan University from 2010 to 2011. Since December 2011, she has worked as a research fellow in Electrical & Computer Engineering at National University of Singapore. Her research interests are in the general area of embedded security, with the focus on high performance cryptographic bricks and side-channel resistant algorithms.

Renfa Li is a professor in the College of Information Science and Engineering at Hunan University. He received the BEng and MEng degrees from Tianjin University, China in 1982 and 1987, and the PhD degree from Huazhong University of Sciences and Technology, China in 2003. He was a professor at Hunan Technology University from 1987 to 1999. From 2000, he became the dean at the College of Computer and Communication, Hunan University. His research interests are in the areas of embedded system architecture, cyber-physical system, and wireless networks. He is the founder of Embedded Systems & Networking Laboratory of Hunan University, and the leader of Hunan Provincial Key Laboratory of Network and Information Security of Hunan University.