Document not found! Please try again

Elliptic Curve Cryptography support for ARM based Embedded systems

13 downloads 9351 Views 58KB Size Report
ARM instruction set extension for Elliptic Curve Cryptography over binary finite ... Gamal [1]), digital signature schemes (like Digital Signature Algorithm, DSA [2]), ...
Elliptic Curve Cryptography support for ARM based Embedded systems Sandro Bartolini, Paolo Bennati, Roberto Giorgi, Enrico Martinelli Dept. Ingegneria dell’Informazione, University of Siena, Via Roma 56, 53100 Siena, Italy

ABSTRACT Elliptic Curve Cryptography (ECC) is emerging as an attractive approach to public-key cryptography for constrained environments, because of the small key sizes and computational efficiency, while preserving the same security level as the standard methods. The performance of public-key cryptography methods is critical in embedded environments such as applications for wireless, handheld internet devices and smart cards with small memory and strict CPU-latency constraints. Power control is also important for embedded systems as well as security against Differential Power Analysis (DPA). We examined the performance of a set of ECC kernel benchmarks and proposed ISA extensions to support secure and efficient execution, on a ARM processor, which is a very common platform for embedded system applications. An evaluation of possible ARM instruction set extension for Elliptic Curve Cryptography over binary finite fields GF(2m) is presented. With almost no cost at hardware level, we found an average 33% reduction of the total number of dynamically executed instructions. Finally, we analyzed the power requirement to achieve an efficient and secure execution from a power standpoint. Some preliminary results showing the power consumption of the benchmarks are presented. KEYWORDS: ACACES; poster session; Elliptic Curves Cryptography; ECC; Differential Power Analisys; DPA; Diffie-Hellman; El-Gamal.

1 Introduction Cryptography algorithms can be divided into two categories: private-key (symmetric) and public-key (asymmetric). Private-key systems use a common private key shared between the communicating parties, while public-key do not require any key exchange. Mainly public-key algorithms are currently used into encryption schemes (like RSA or ElGamal [1]), digital signature schemes (like Digital Signature Algorithm, DSA [2]), and key agreement methods (like Diffie-Hellman [3]). Both private and public key sizes must grow over time to offer acceptable security [2, 4]. The major advantage of ECC is the use of shorter keys with a security level equivalent to finite fields public-key algorithms; as a consequence faster implementations, reduced energy and bandwidth consumption can be achieved. Because of these characteristics, ECC is already incorporated into two important public-key cryptography standards, FIPS 186-2 [2]and IEEE-P1363 [5].

We analyzed the major bottlenecks at function level in ARM processors (90% of mobile phones use it [6]) and evaluated the performance improvement [7, 8], when we introduce some simple architectural support in the ARM ISA. We also address the problem of the power requirements of the benchmark; DPA techniques [9-11] are a serious threat for the security of embedded devices and power analysis could be useful to achieve a more secure execution.

2 Methodology We used a modified version of sim-profile Energy Configuration D ata L1 cache 32 KB, 32-way and sim-outorder Technology TECH_070 L1 data cache ports 2 simulators of the L1 data cache temperature 353 K Fetch & Decode 1 L1 cache hit latency 1 widt h ( cycles) SimpleScalar toolset [12] L1 data cache supply voltage 0.9 V Issue width 1 (in order ) L1 cache block size 32 bytes for the ARM target [13] L1 data cache threshold voltage N 0.1902 V ITLB 32 en try, fully L2 cache None Type Cell as sociative to perform performance L1 data cache threshold voltage P- 0.2130 V DTLB 32 en try, fully M emory latency 24 Type Cell as sociative ( cycles) simulation and Fun ctional units 1 ALU, 1 int M emory bus width 4 L1 data cache threshold voltage N 0.1902 V Type Bitline MUL/DIV ( bytes) Hotleakage [14] to Table 1: S imula ted architectur e perform power simulation. The parameters used represent an architecture configuration modelled after Intel XScale architecture [15] (Table 1). The sim-outorder tool was modified to profile to application at a functioning level and we also added all the unimplemented system calls for ARM target, although they were not critical for the execution of the benchmark. The compilation is performed with a version gcc compiler that was modified to Benchmark acronym Benchmark n ame Description recognize newly added Generates a prime suitable for Diffie- Hellman algorithm ecdh EC Diffie- Hellman key exchange and calculates a shared key. instructions. Calculates the message diges t of a file using s ha algorithm, signs the message using the private key and The description of our benchmark ecdsign EC digital segnature generation writes the signature into a file. Calculates the message diges t of a file using s ha suite, implemented using MIRACL ecdsver EC digital segnature verification algorithm, then verifies the signature using public-key. ecelgenc EC El-Gamal encryption Encrypts a point on the curve using El-Gamal algorithm. C library [16] is given in Table 2. ecelgdec EC El-Gamal decryption Decrypts a point on a curve using El-Gamal algorithm. The binary finite fields and elliptic Ta ble 2 : ECC benchmar k set curves used in tests were chosen according to NIST standard [2]. Micro proces s or C onf iguratio n

Fetch queue 4 (ins tructions) Branch prediction 8K bimodal, 2K 4-way BTB

I nstruction L1 cache 32 KB, 32-way

3 The proposed ISA extension A very large percentage (approximately 40%) of integer instructions as well as load and store operations (approximately 50%) are present in the instruction mix; the finite field operations translate into a large number of logical operations (XOR, shift etc.), and result in a large number of register-memory transfers to operate on m-bit data (e.g. m ranges from 163 to 283 and is much larger than 32 bit register width in ARM).

Results (Figure 1) show that in particular, the mr_mul2 procedure, which multiplies two 32-bit binary finite field polynomials and produces a 64-bit product consumes 34% of the total execution time in average for all benchmarks; mr_mul2 procedure is translated into about 400 dynamic instructions (roughly 500 cycles), which correspond to about 12 instructions per bit to perform 32bit polynomial multiplication. Based on the previous analysis, we decided to measure the impact of Figure 1: Br eakdown of ex ecution time in ter ms of progra m functions. extending ARM instruction set with an instruction, called MULGF [17], for polynomial word-level multiplication in binary finite fields. The appropriate calls of C pr ocedure for 32-bit polynomial multiplication in software were substituted with a single MULGF instruction. The MULGF instruction was modelled to have a delay of one cycles, as the integer multiplier unit of ARM processor. The impact of adding the MULGF instruction for word-level polynomial multiplication in finite field is shown in Figure 2 Number of dynamically executed instructions, as well as the execution time, is lower by approximately one-third in average, with projective coordinates. The improvement in execution time is more significant for Diffie-Hellman (54% in number of instructions and 55% for execution time in GF(2233 )) and ElGamal algorithms (48% for encryption and 37% for decryption in number of instructions, i.e. 39% and 35% in execution time in GF(2233 )), where 32-bit polynomial multiplication is more used. The improvement for digital signature algorithm is more modest (19% in 2 : Impact of MULGF instr uction. Number of dyna mic instruction number and 17% in execution Figure instr uctions for pr ojective coor dina tes befor e a nd a fter a dding the time for the same key length), but still MULGF instr uction for wor d-level polynomia l multiplica tion. significant. e celgd ec.b163

key lengt h coordinates kP_method mult _method

other 23%

ecdsig n.b163

163 bit lopezdahab mont gom ery karat suba

other 12%

getc 24%

reduc e2 3%

mr_mul2 30%

mr_mul2 7%

get_mip 3%

_IO_funloc k file 8%

e cdh.b163

c opy 5%

square2 6% mr_bottom3 7%

_IO_floc k file 8%

reduc e2

other 18%

18%

k armul2

s hs _proc es s 20%

s hs _t rans for m 18%

8%

mr _lzero 3%

mr _mul2 35%

copy 5%

s quare2 5%

e celge nc.b163

e cdsve r.b 163

m r_bott om3 8%

other 12%

r educe2

other 17%

karmul2 9%

mr_lz ero 3%

mr_mul2 35%

c opy 5%

17 %

getc 24%

reduc e2 3%

mr_mul2 7%

_IO_funloc kfile

s quare2 5%

8%

mr_bottom3 8%

_IO_flo c k fil e

s hs _proces s 20%

8%

k armul2 9%

reduc e2 18%

s hs _trans form 18%

4 Power consumption characterization The use of ECC public-key system is particularly useful for embedded systems in which computation efficiency is required. In addition, embedded systems require also low-power consumption, as the market requests long battery life and satisfying performance. We will address the problem of power consumption characterization of our ECC benchmarks workload because this aspect is particularly interesting for the physical

02 4 ds ve rify .1 0 24

16 3

ds sig n .1

ec -v er .b

gn .b 16 3

ec - si

24

dh .1 0

ec -d h. b1 63

W

implementation of embedded secure devices (Figure 3). Our recent (and future) work focuses on the characterization of the benchmark workload based on ECC P ower Consumption from a power consumption and DPA clock 4 resultbus point of view; in particular currently 4 falu 3 we are investigating how the power alu 3 consumption of the benchmarks can dtlb2 2 dtlb 2 be reduced. itlb 1 Reducing the power consumption is dcache 1 icache 0 fundamental in embedded system not regfile only to increase battery life but also lsq window because it allows the use of techniques bpred Benchmark to prevent power attack. ECC rename cryptography is itself lighter than Figure 3 : Power ra tio in the va r ious pa r t of the system. standard methods and we expect a better performance also from this point of view. A power characterization of the benchmarks can be done through HotLeakage [14]. Preliminary results are presented in Figure 3. After a characterization of the power requirements of ECC benchmarks, adaptive techniques will be investigated to improve the benchmark low-power behaviour.

5 Acknowledgment The power characterization of the benchmarks has been performed with a modified version of HotLeakage retargeted for ARM ISA realized at the Cornell University, Ithaca (USA). We are particularly grateful to Sally McKee for allowing us to use this version.

References [1 ]

T. ElGa ma l, " A public key cr yptosystem and a signa ture scheme ba sed on discrete logar ithms" , in Proceed ing s of CRYPTO 84 on Advances in cryp tology . Sa nta Barbara , Ca lifor nia , United S ta tes: S pr inger -Ver la g New York, Inc., 1985 . [2 ] N. I. o. S . a . Technology, " Digita l S igna tur e S ta ndar d (DS S )" , vol. pub 186 -2 , U. S . D. o. Commer ce, Ed.: Federa l Infor ma tion Processing S tandar ds Publica tion (FIPS ), 2000 . [3 ] W. Diffie a nd M. E. Hellma n, " New Directions in Cryptogra phy" , IEEE Trans actions on Infor mation Theory , vol. IT-2 2 , pp. 644 654 , 1976 . [4 ] V. Gupta , S . Blake-Wilson, and B. M. C. Ha wk, " ECC Cipher S uites for TLS" , in IETF inter ne t draft, 2006 . [5 ] IEEE P1363 S ta ndar d S pecifica tions For Public-Key Cr yptogra phy, http://grouper .ieee.or g/groups/1363 / [6 ] ARM Web S ite, http://www.ar m.com [7 ] I. Br anovic, R. Giorgi, and E. Mar tinelli, " Memory Per for ma nce of Public-Key cr yptogra phy Methods in Mobile Environments" , pr esented a t Wor kshop on MEmory per for mance: DEa ling with Applica tions, systems a nd architectur e (MEDEA-0 3 ), New Or leans, LA, US A, 2003 . [8 ] S . Bar tolini, I. Br anovic, R. Giorgi, a nd E. Mar tinelli, " A Per for ma nce Eva lua tion of ARM IS A Ex tension for Elliptic Curve Cryptography over Binary Finite Fields" , pr esented a t 16 th S ymposium on Computer Architectur e a nd High Per for ma nce Computing (S BAC-PAD-0 4 ), Foz do Iguacu, Bra sil, 2004 . [9 ] P. Kocher , J. Ja ffe, a nd B. Jun, " Differ entia l Power Ana lysis" , Lectur e Notes in Computer Scie nce , vol. 1666 , pp. 388 -397 , 1999 . [1 0 ] D. Mesquita , J.-D. Techer , L. Torr es, G. Sassa telli, G. Ca mbon, M. Rober t, and F. Moraes, "Curr ent ma sk genera tion: a tr ansistor level secur ity aga inst DPA a ttacks" , in Pr oceed ings of th e 18 th annual sy mp os ium on Integr ated circuits and sys te m des ig n. Flor ianolpolis, Braz il: ACM Pr ess, 2005 . [1 1 ] A. S ha mir , " Protecting S mar t Car ds from Passive Power Ana lysis with Deta ched Power S upplies." pr esented a t CHES Cryptographic Har dwa re a nd Embedded Systems, S pr inger , 2000 . [1 2 ] D. Bur ger and T. M. Austin, "The S impleSca lar tool set, version 2 .0" , SIGARCH Comp ut. Arch it. Ne ws , vol. 25 , pp. 13 -25 , 1997 . [1 3 ] S impleSca lar Version 4 .0 Test Relea ses, http://www.simplesca la r .com/v4 test.html [1 4 ] Y. Zhang, D. Par ikh, K. Sa nkaranara ya nan, K. S ka dron, a nd M. S ta n, " HotLea kage: A Tempera ture-Awar e Model of S ubthr eshold a nd Ga te Lea kage for Ar chitects," University of Virginia , Char lottesville 2003 . [1 5 ] Intel, " Intel XS ca le Microa rchitectur e" , in Te chnical Summary , 2000 . [1 6 ] MIRACL - Multipr ecision Integer a nd Ra tiona l Ar ithmetic C/C++ Library, http://indigo.ie/~mscott/ [1 7 ] T. Acar , " High-S peed Algor ithms & Ar chitectur es For Number -Theor etic Cryptosystems." in Depar tme nt of Electr ical & Comp uter Eng ineer ing , vol. Ph.D. Thesis: Or egon S ta te University, 1997 .