
IEICE Electronics Express, Vol.4, No.14, 461–466

An efficient scalable and hybrid arithmetic unit for public key cryptographic applications

SuJung Yu 1a), MoonGyung Kim 2, SeokWon Heo 2, JooSeok Song 1, and YongSurk Lee 3

1 Department of Computer Science, Yonsei University, Seoul, Korea 120–749
2 Samsung Electronics Co., Ltd., Gyeonggi-Do, Korea 449–711
3 Department of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea 120–749
a) [email protected]

Abstract: In this paper, we propose an efficient, scalable, and flexible arithmetic unit that executes word-based multiplication and squaring for RSA arithmetic, and addition, multiplication, and inversion in GF($2^m$) for Elliptic Curve Cryptography (ECC) arithmetic. The Scalable Word-based Montgomery Modular arithmetic unit (SWMM) processes RSA with a modified Finely Integrated Operand Scanning (FIOS) Montgomery algorithm. The proposed flexible hybrid Galois Field Arithmetic Unit (GFAU) is beneficial for cost-effectively implementing ECC operation extensions. The proposed processor is synthesized with Synopsys Design Analyzer using a Hynix 0.25 µm standard cell library under worst-case conditions of 2.3 V and 100 °C. The synthesis results show that the gate count of the processor is below 60,000.

Keywords: RSA, ECC, public key cryptographic processor
Classification: Science and engineering for electronics

© IEICE 2007
DOI: 10.1587/elex.4.461
Received June 18, 2007
Accepted June 27, 2007
Published July 25, 2007

References

[1] S. Wu and Y. Zhu, “A Resource Efficient Architecture for RSA and Elliptic Curve Cryptosystems,” Commun. Circuits Syst. Proc., vol. 4, pp. 2356–2360, June 2006.
[2] C. Huang, J. Lai, J. Ren, and Q. Zhang, “Scalable Elliptic Curve Encryption Processor for Portable Application,” Proc. 5th Int. Conf. on ASIC, vol. 2, pp. 1312–1316, Oct. 2003.
[3] J. Großschädl and G.-A. Kamendje, “Architectural Enhancements for Montgomery Multiplication on Embedded RISC Processors,” Applied Cryptography and Network Security, vol. 2846, pp. 418–434, 2003.
[4] M.G. Kim, S.J. Yu, Y.S. Lee, and J.S. Song, “A Fast Hybrid Arithmetic Unit for Elliptic Curve Cryptosystem in Galois Fields with prime and composite Exponents,” IEICE Electron. Express, vol. 1, no. 1, pp. 13–18, 2004.
[5] F. Crowe, A. Daly, and W. Marnane, “A Scalable Dual Mode Arithmetic Unit for Public Key Cryptosystems,” ITCC 2005, vol. 1, pp. 568–573, 4–6 April 2005.

1 Introduction

Fast modular arithmetic is the key to real-time encryption and decryption, since data communication requires high throughput [1]. This contribution focuses on the arithmetic of two public key algorithms, RSA and ECC [2]. In this paper, we demonstrate the viability of cryptography with variable key sizes on a 32-bit processor core for smart card applications. We have chosen the ARM processor core for our implementation. The proposed RSA cryptographic block implements Montgomery modular exponentiation with a modified Finely Integrated Operand Scanning (FIOS) method [3]. In addition, the hybrid GFAU architecture proposed in [4] is reinforced by slightly modifying its implementation, so that it works not only on composite exponents but also on non-composite exponents of GF($2^m$). The proposed hybrid GFAU can operate flexibly with keys 113 to 571 bits long.

2 Arithmetic for Public Key Cryptosystems

In 1985, Peter L. Montgomery invented an efficient algorithm for computing the modular multiplication and squaring required during exponentiation. A number of publications in recent years have reported on modular exponentiation [3]. The Montgomery product is defined as

$$\bar{C} = \mathrm{MonProd}(\bar{A}, \bar{B}) = \bar{A} \cdot \bar{B} \cdot R^{-1} \bmod N,$$

where $\bar{A}$, $\bar{B}$, and $\bar{C}$ are in Montgomery representation. Conventional reduction modulo $N$ requires a division, which is a costly operation. An integer $R$ is chosen under two restrictions: $R > N$ and $\gcd(R, N) = 1$. To compute the Montgomery product we need two additional values, $N'$ and $R^{-1}$, where $R^{-1}$ is the modular inverse of $R$ modulo $N$, i.e., $R \cdot R^{-1} \equiv 1 \pmod{N}$, and $N'$ is an integer with the property $R \cdot R^{-1} - N \cdot N' = 1$. There are a variety of ways to perform word-based Montgomery multiplication, such as FIOS [3]. FIOS computes the products $a \cdot b$ and $m \cdot n$ in a single loop.

Finite field division in GF($2^m$) has the form $A(x)/B(x) \bmod P(x)$, where the degrees of $A(x)$ and $B(x)$ are smaller than $m$, and $P(x) = x^m + \sum_{i=0}^{m-1} p_i x^i$ is an irreducible polynomial of degree $m$ with $p_i \in \mathrm{GF}(2)$. Efficient divider hardware can also be used for the finite field inversion operation, which is the case $A(x) = 1$. The basic architecture of a GF divider is based on a 2-bit shift architecture [2].
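The definitions above can be checked with a short software sketch. The following Python fragment is only an illustration, not the hardware algorithm of this paper: it assumes $R = 2^k$ and an odd modulus $N$ (so that $\gcd(R, N) = 1$ holds), and the function names `montgomery_setup` and `mon_prod` are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
def montgomery_setup(n: int, k: int):
    """Return (r_inv, n_prime) for an odd modulus n and R = 2**k > n.

    Assumption for this sketch: R is a power of two, so gcd(R, N) = 1
    is guaranteed whenever N is odd.
    """
    r = 1 << k
    assert n % 2 == 1 and r > n
    r_inv = pow(r, -1, n)              # R * R^-1 = 1 (mod N)
    n_prime = (r * r_inv - 1) // n     # R * R^-1 - N * N' = 1
    return r_inv, n_prime


def mon_prod(a_bar: int, b_bar: int, n: int, k: int, n_prime: int) -> int:
    """Compute a_bar * b_bar * R^-1 mod n using only shifts and masks for R."""
    mask = (1 << k) - 1
    t = a_bar * b_bar
    m = (t * n_prime) & mask           # m = t * N' mod R
    u = (t + m * n) >> k               # t + m*N is an exact multiple of R
    return u - n if u >= n else u      # single conditional subtraction


# Usage: multiply 45 * 76 modulo 97 through the Montgomery domain.
n, k = 97, 8
r_inv, n_prime = montgomery_setup(n, k)
a_bar = (45 << k) % n                  # convert operands to Montgomery form
b_bar = (76 << k) % n
c_bar = mon_prod(a_bar, b_bar, n, k, n_prime)
c = mon_prod(c_bar, 1, n, k, n_prime)  # convert the result back
```

The only reduction steps are a mask and a shift by $k$ bits, which is why the division by $N$ is avoided.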

3 The Proposed Processor Architecture

The proposed processor operates two arithmetic units individually, a Scalable Word-based Montgomery Modular arithmetic unit (SWMM) for RSA and a hybrid Galois Field Arithmetic Unit (hybrid GFAU) for ECC, with a built-in interface that transfers data through the AMBA bus, as shown in (a) of Fig. 1. The proposed processor is designed as a co-processor of the ARM™ core. This core architecture provides the scalability and performance required to support a wide range of RSA and ECC based applications. We define the instruction set of the proposed cryptographic processor, which is described in (d) of Fig. 1.

Fig. 1. The structure diagram and instruction sets of proposed cryptographic processor

3.1 Proposed Register File

The register file is designed with two SRAM (Static Random Access Memory) cells in a 64-bit × 2 parallel structure. SRAM has higher power consumption and power dissipation than DRAM (Dynamic Random Access Memory). A register file can be implemented with a decoder for each read or write port and an array of registers built from D flip-flops. For a read, we only need to supply a register number as input; the output is the data contained in that register, because reading a register does not change any state. Parts (b) and (c) of Fig. 1 compare the register file structures based on the standard cell library and on SRAM cells.


3.2 Fake Mode Operation

When the running time of a cryptographic algorithm is not constant, timing measurements can leak secret key information [1]. Countermeasures usually incur additional hardware cost, which a practical system designer must consider. The proposed processor adopts a fake operation mode as a countermeasure against timing attacks that saves hardware cost. According to the state of the input command at the register file, the cryptography block sends a 2-bit signal to the register file: one bit is the register port enable signal, and the other is the read/write selection signal. The register file sets a fake flag and takes this flag and the register port enable signal as inputs. The flag instruction code is shown in (d) of Fig. 1. Opcodes 1110 and 1111 are used for RSA fake mode, and 0111 is used for ECC mode.
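To illustrate why fake operations help against timing attacks, the following Python sketch shows one well-known scheme, square-and-multiply-always exponentiation, in which a multiplication is performed for every exponent bit and its result is discarded when the bit is zero. This is an assumption made for illustration: the text above does not state how the processor schedules its fake operations, and the function name `modexp_with_fake_ops` is hypothetical. Python integers are not constant time, so the sketch only demonstrates that the operation count no longer depends on the key bits.

```python
def modexp_with_fake_ops(base: int, exponent: int, modulus: int, bits: int):
    """Left-to-right exponentiation with one multiply per bit, real or fake.

    Returns (result, number_of_modular_multiplications).
    Illustrative assumption: the fake multiply is computed and then dropped.
    """
    result = 1
    mul_count = 0
    for i in reversed(range(bits)):
        result = (result * result) % modulus     # squaring, always real
        product = (result * base) % modulus      # multiplication, always executed
        mul_count += 2
        if (exponent >> i) & 1:
            result = product                     # real operation: keep the product
        # otherwise the product was a fake operation and is discarded
    return result, mul_count


# Usage: two 8-bit exponents with very different Hamming weights
# need the same number of modular multiplications.
r_sparse, ops_sparse = modexp_with_fake_ops(7, 0b10000001, 1009, 8)
r_dense, ops_dense = modexp_with_fake_ops(7, 0b11111111, 1009, 8)
```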

4 The Proposed SWMM Arithmetic Unit for RSA Mode

We propose the SWMM for WMM (Word-based Montgomery Multiplication) with modified FIOS, which is based on the presented MM (Montgomery Multiplication) and MS (Montgomery Squaring) algorithms. The proposed SWMM structure allows the core to be dynamically scalable in key size, as delay is converted to latency. The RSA module consists of a Smart MAC (Multiply-Accumulator) and a control block, as shown in (a) of Fig. 1. The Smart MAC consists of a control block, a 256-bit adder, and a 64 × 32-bit multiplier, which carry out addition and multiplication in 128-bit units. We employ the modified radix-4 Booth algorithm and a 3:2 Wallace tree CSA (Carry Save Adder). The improved adder operates at a clock frequency of 144.3 MHz and occupies 21,279 gates.

The WMM procedure with the modified FIOS method contains three processes, product, reduction, and division, as shown in (a) of Fig. 2. Each step generates its control signals in the inner loop. Here, $n'$ is pre-computed from $n$ at the encryption initialization step and saved in a register, which improves overall performance. Operations are generated by the control block and transmitted to the Smart MAC.

We propose an efficient MS method, as shown in (b) of Fig. 2. The proposed method reduces the overall operation overhead: the Wallace tree is reduced by merging the repeated partial products that occur in squaring. With this approach, the proposed squaring method reduces the number of multiplications by 25% or more and improves speed by about 10%, despite the extra processing.
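The product, reduction, and division steps of a word-based FIOS loop can be sketched in software as follows. This Python fragment follows the textbook FIOS structure, in which the partial product and the reduction term are accumulated in the same inner loop; it is not the modified FIOS of the SWMM, whose details are given only in Fig. 2. The word size `w = 32`, the helper `to_words`, and the function name `fios_mon_prod` are assumptions for illustration, and the carry is kept as an unbounded Python integer rather than as separate carry words.

```python
def to_words(x: int, s: int, w: int = 32):
    """Split x into s little-endian words of w bits each."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def fios_mon_prod(a: int, b: int, n: int, s: int, w: int = 32) -> int:
    """Return a * b * 2**(-s*w) mod n for odd n < 2**(s*w) and a, b < n."""
    base = 1 << w
    mask = base - 1
    aw, bw, nw = to_words(a, s, w), to_words(b, s, w), to_words(n, s, w)
    # n0' = -n[0]^-1 mod 2^w, pre-computed once per modulus
    n0_prime = (-pow(nw[0], -1, base)) % base
    t = [0] * (s + 1)
    for i in range(s):
        # Product step for word 0, then choose m so the low word cancels.
        acc = t[0] + aw[0] * bw[i]
        m = ((acc & mask) * n0_prime) & mask
        acc += m * nw[0]                 # reduction step; low word is now zero
        carry = acc >> w                 # division step: drop the zero word
        for j in range(1, s):
            # a[j]*b[i] and m*n[j] are accumulated in the same inner loop.
            acc = t[j] + aw[j] * bw[i] + m * nw[j] + carry
            t[j - 1] = acc & mask
            carry = acc >> w
        acc = t[s] + carry
        t[s - 1] = acc & mask
        t[s] = acc >> w
    result = sum(word << (w * k) for k, word in enumerate(t))
    return result - n if result >= n else result
```

Because only the lowest word of $n'$ is needed in each outer iteration, a single pre-computed register value suffices for the whole exponentiation.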

5 The Proposed Hybrid GFAU for ECC Mode

Initially, a common serial multiplier works as

$$Z_i(x) = b_{m-1-i} A(x) + x Z_{i-1}(x),$$

and we reform the serial multiplier into a hybrid multiplier as

$$Z_i(x) = b_{m-1-2i} \sum_{j=0}^{m-2} a_j x^{j+1} + b_{m-1-2i}\, a_{m-1} x^{m} + b_{m-1-2i-1} A(x) + \sum_{j=0}^{m-3} Z_j x^{j+2} + Z_{m-2} x^{m} + Z_{m-1} x^{m+1},$$

where $Z_j$ are the coefficients of $Z_{i-1}(x)$. Part of this computation is independent of the bit location $j$ and can be shared among the bit-functional blocks. These shared blocks significantly reduce the overall area of the circuit.

We also implemented the new hybrid GF divider on the basis of the above standard hybrid architecture. To support this, we derive general expressions for $U/x$ and $U/x^2$. First, we calculate $U/x$ using the fact that the coefficient $p_0$ in $P(x)$ is always one, and obtain

$$U(x)/x = \sum_{i=1}^{m-1} u_i x^{i-1} + u_0/x = \sum_{i=0}^{m-2} (u_{i+1} + u_0 p_{i+1}) x^{i} + u_0 x^{m-1}.$$

Next, we calculate $U/x^2$ by dividing $U/x$ by $x$, and obtain

$$U(x)/x^2 = (U(x)/x)/x = \sum_{i=0}^{m-3} \bigl(u_{i+2} + u_0 p_{i+2} + (u_1 + u_0 p_1) p_{i+1}\bigr) x^{i} + \bigl(u_0 + (u_1 + u_0 p_1) p_{m-1}\bigr) x^{m-2} + (u_1 + u_0 p_1) x^{m-1}.$$

We merged the multiplier and the divider implemented as presented above. The hybrid divider can calculate the $xU$, $xV$, $x^2U$, and $x^2V$ operations, which are the essential operations of our hybrid multiplier. The control signal in the divider is generated from $r_m r_{m-1}$, not $r_{m-1} r_{m-2}$, so the processing sequence is changed to begin with a single 1-bit shift and then continue with 2-bit shifts. The resulting modification is shown in (c) of Fig. 2. Multiplication is calculated when the V register is initialized to zero and the multiplicand and multiplier values are loaded into the U and R registers, respectively. If the V register is not initialized to zero, the multiplication process produces a MAC result, which aids in reducing the overall number of calculations. The hybrid GFAU can execute operations with flexible word lengths from 113 to 571 bits with any reduction polynomial.

Fig. 2. Improved MM and MS with modified FIOS and the hybrid GFAU architecture
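The serial and 2-bit multiplier recurrences and the division by $x$ can be cross-checked with a bit-level software model. The Python sketch below represents a polynomial over GF(2) as an integer whose bit $i$ is the coefficient of $x^i$. It models only the arithmetic, not the shared-block circuit or the register scheme of Fig. 2, and the function names are hypothetical. For odd $m$ the sketch pads the multiplier with a zero top bit, which is an illustrative choice rather than the sequencing used in the hardware.

```python
def gf2m_reduce(z: int, m: int, p: int) -> int:
    """Reduce z (degree at most m+1) modulo p (degree exactly m)."""
    if (z >> (m + 1)) & 1:
        z ^= p << 1                      # cancel the x^(m+1) term
    if (z >> m) & 1:
        z ^= p                           # cancel the x^m term
    return z


def gf2m_mul_serial(a: int, b: int, m: int, p: int) -> int:
    """Z_i(x) = b_(m-1-i) A(x) + x Z_(i-1)(x), one multiplier bit per step."""
    z = 0
    for i in range(m):
        z = gf2m_reduce(z << 1, m, p)
        if (b >> (m - 1 - i)) & 1:
            z ^= a
    return z


def gf2m_mul_hybrid(a: int, b: int, m: int, p: int) -> int:
    """Two multiplier bits per step: x*A, A, and x^2 * Z_(i-1) are combined."""
    steps = (m + 1) // 2                 # odd m: top padded bit of b is zero
    z = 0
    for i in range(steps):
        top = 2 * steps - 1 - 2 * i
        acc = z << 2                     # x^2 * Z_(i-1)(x)
        if (b >> top) & 1:
            acc ^= a << 1                # upper bit selects x * A(x)
        if (b >> (top - 1)) & 1:
            acc ^= a                     # lower bit selects A(x)
        z = gf2m_reduce(acc, m, p)
    return z


def gf2m_div_x(u: int, p: int) -> int:
    """U(x)/x mod P(x), assuming p_0 = 1 as in the text."""
    if u & 1:
        u ^= p                           # add P(x) so the constant term vanishes
    return u >> 1


def gf2m_div_x2(u: int, p: int) -> int:
    """U(x)/x^2 mod P(x), obtained by dividing by x twice."""
    return gf2m_div_x(gf2m_div_x(u, p), p)
```

Adding $P(x)$ before the shift reproduces the terms $u_0 p_{i+1}$ and $u_0 x^{m-1}$ of the $U/x$ expression, since $p_0 = 1$ and the leading coefficient of $P(x)$ is one.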


6 Measurements and Comparisons

The proposed SWMM processor block, hybrid GFAU, and register files are synthesized with Design Analyzer (Synopsys) using a standard cell library of the Hynix 0.25 µm process, under the worst-case condition of 100 °C and 2.30 V. In (a) of Fig. 3, we compare the proposed register file with the existing register file in terms of area, delay time, and power consumption. The delay time increases, but area and power consumption decrease because of the SRAM cells. The synthesis results of the SWMM, the hybrid GFAU, and the register file are shown in (b) of Fig. 3. In (c) and (d) of Fig. 3, we compare the proposed SWMM and hybrid GFAU with existing processors. The design in [5] supports variable key sizes (256 to 1024 bits), runs at 44.91 MHz, occupies 5,267 slices, and performs RSA and ECC operations dependently. The proposed processor supports a wider range of key sizes (128 to 2048 bits for RSA and 113 to 571 bits for ECC) and performs efficient RSA and ECC operations independently.

Fig. 3. Synthesis results and comparison

7 Conclusion

In this paper, we implemented scalable logic for RSA and ECC arithmetic that operates independently with variable key sizes. We proposed an efficient design for an optimized RSA processor that uses the Montgomery algorithm with modified FIOS. The hybrid GFAU is based on a hybrid divider and can execute addition, multiplication, and division. With its high performance and small area, the proposed processor is a good solution for implementing various security applications in smart cards or other resource-restricted systems.

Acknowledgments


This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MOST) (No. R01-2006000-10614-0).
