Efficient Binary Field Arithmetic and Applications to Curve-based Cryptography
Diego F. Aranha
Department of Computer Science, University of Brasília
CHES 2012 Tutorial
Diego F. Aranha
Efficient Binary Field Arithmetic
Part I: RELIC
Numbers
RELIC is an Efficient LIbrary for Cryptography (http://code.google.com/p/relic-toolkit):
- Research framework
- Licensed as free software (LGPL)
- 11 source code releases
- 78,000 lines of code
- 1300 visitors from 74 countries
- 1500 downloads
Introduction
Limitations of other libraries:
- Restricted portability
- Uninteresting licensing model
- Emphasis on standards and commercial algorithms
Why a new cryptographic library?
- Organization oriented for portability
- Complete control of licensing model
- Code sharing and reproducibility of results
- Focus on research
Organization
Basic organization:
- Meta-library
- Compile-time configuration
- Inspired by the GNU Multiple Precision Arithmetic Library (GMP)
[Figure: library layers, with Protocols above the Arithmetic backend]
Breakdown
Arithmetic backend:
- Architecture-dependent
- Rigid interface with upper layers
- Generic modules available in C and with GMP support
- 21 functions for multiple precision integer arithmetic, 26 functions for binary fields, 32 functions for prime fields
Why this organization? It is currently possible to obtain competitive timings with the same library on an 8-bit processor with 4 KB of RAM and on an 8-core Intel desktop processor.
Breakdown
Binary field arithmetic:
- Field size specified at compile time
- 3 different strategies for squaring, 5 for multiplication, 2 for square root extraction, 2 for half-trace and 6 for inversion
- Modular reduction by trinomials and pentanomials
Binary curve arithmetic:
- Supersingular, Koblitz and ordinary (standardized or not)
- Affine, projective and mixed coordinate systems
- 4 different strategies for random point scalar multiplication, 6 for fixed point and 4 for multiple point
- Symmetric pairings over genus-1 or genus-2 curves
Breakdown
Miscellaneous:
- Support for words of 8, 16, 32 and 64 bits
- Static, stack, automatic and dynamic memory allocators
- Helper macros for testing and benchmarking
- Support for debugging, profiling, tracing and multithreading
- Abundant Doxygen documentation
- Deactivation of modules and automatic elimination of algorithms to reduce code size
- Standard PRNG with configurable seed source
- Support for FreeBSD, Linux, Mac OS X, Windows
- Management of configuration and build system with CMake
- Open collaboration with academia and industry
Part II: Binary fields
Introduction
A finite field $\mathbb{F}_{p^m}$ consists of all polynomials of degree less than $m$ with coefficients in $\mathbb{Z}_p$, for a prime $p$, with arithmetic performed modulo an irreducible polynomial $f(z)$ of degree $m$. The prime $p$ is the characteristic of the field and $m$ is the extension degree.
A binary field $\mathbb{F}_{2^m}$ is the special case $p = 2$ and is formed by polynomials with binary coefficients.
Introduction
Example: Field $\mathbb{F}_{2^8}$
Irreducible polynomial: $f(z) = z^8 + z^4 + z^3 + z + 1$ = 1 0001 1011
Representation:
$a(z) = z^7 + z^3 + 1$ = 1000 1001 = 0x89
$b(z) = z^6 + z^5 + z^2$ = 0110 0100 = 0x64
Addition: $a(z) + b(z) = z^7 + z^6 + z^5 + z^3 + z^2 + 1$ = 1110 1101 = 0xED
Note: $a(z) + a(z) = 2 \cdot a(z) = 0$ for all $a \in \mathbb{F}_{2^m}$.
Multiplication: $a(z) \times b(z) = z^{13} + z^{12} + z^8 + z^6 + z^2 \bmod f(z) = z^7 + z^5 + z^4 + z^3 + 1$ = 1011 1001 = 0xB9
Multiplication by $z$: $z \times b(z) = z^7 + z^6 + z^3$ = 1100 1000 = 0xC8 = $b \ll 1$
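The example field can be exercised with a few lines of portable C. This is a minimal sketch, not RELIC code: the names gf256_add and gf256_mul are hypothetical, and the bit-serial loop is chosen for clarity rather than speed.

```c
#include <stdint.h>

/* f(z) = z^8 + z^4 + z^3 + z + 1 from the example, as a bit vector. */
#define GF256_POLY 0x11Bu

/* Addition in F_2^8 is coefficient-wise XOR. */
static uint8_t gf256_add(uint8_t a, uint8_t b) {
    return (uint8_t)(a ^ b);
}

/* Shift-and-add multiplication, reducing after every multiplication by z. */
static uint8_t gf256_mul(uint8_t a, uint8_t b) {
    unsigned acc = 0;   /* running sum of the selected multiples of a */
    unsigned cur = a;   /* a(z) * z^i, always kept below degree 8 */
    for (int i = 0; i < 8; i++) {
        if ((b >> i) & 1u) {
            acc ^= cur;
        }
        cur <<= 1;                 /* multiply by z */
        if (cur & 0x100u) {
            cur ^= GF256_POLY;     /* subtract f(z) once degree 8 is reached */
        }
    }
    return (uint8_t)acc;
}
```

With the operands of the example, gf256_add(0x89, 0x64) gives 0xED and gf256_mul(0x02, 0x64) gives 0xC8, matching the addition and the multiplication by z.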
Introduction
Binary fields ($\mathbb{F}_{2^m}$) are omnipresent in cryptography:
- Efficient curve-based cryptography (ECC, PBC)
- Post-quantum cryptography
- Block ciphers
Many algorithms and optimizations are already described in the literature:
- Is it possible to unify the fastest ones in a simple formulation?
- Can such a formulation reflect the state of the art and provide new ideas?
Objective
Contributions
- Formulation of state-of-the-art binary field arithmetic using vector instructions
- New strategy for the implementation of multiplication
- Time-memory trade-offs to compensate for the absence of a native multiplier
- Experimental results
Arsenal
Intel Core architecture:
- 128-bit Streaming SIMD Extensions instructions (65/45 nm)
- Super shuffle engine introduced in 45 nm series
- Carry-less multiplier introduced in Nehalem family
- 256-bit Advanced Vector Extensions instructions (32 nm)
Relevant vector instructions:
Instruction     | Description               | Cost   | Mnemonic
MOVDQA          | Memory load/store         | 3/2    | ←
PSLLQ, PSRLQ    | 64-bit bitwise shifts     | 1      | ≪_∤8, ≫_∤8
PXOR, PAND, POR | Bitwise XOR, AND, OR      | 1      | ⊕, ∧, ∨
PUNPCKLBW/HBW   | Byte interleaving         | 3      | interlo/hi
PSLLDQ, PSRLDQ  | 128-bit bytewise shift    | 2 (1)  | ≪_8, ≫_8
PSHUFB          | Byte shuffling            | 3 (1)  | shuffle, lookup
PALIGNR         | Memory alignment          | 2 (1)  | ⊲
PCLMULQDQ       | Carry-less multiplication | 10 (8) | ⊗
New SSSE3 instructions
PSHUFB instruction (_mm_shuffle_epi8): [figure not recovered]
Real power: we can implement any function from 4-bit inputs to 8-bit outputs in parallel, as 16 simultaneous table lookups.
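The lookup behaviour of PSHUFB can be modelled in portable C, which is handy for checking table constants without SSSE3 hardware. This is a sketch of the instruction's semantics on byte arrays; the function name pshufb_model is hypothetical.

```c
#include <stdint.h>

/* Portable model of PSHUFB / _mm_shuffle_epi8 on 16-byte vectors. */
static void pshufb_model(uint8_t out[16], const uint8_t table[16],
                         const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++) {
        if (idx[i] & 0x80u) {
            /* High bit of the index byte set: the output byte is zeroed. */
            out[i] = 0;
        } else {
            /* Otherwise the low 4 bits select one of the 16 table bytes. */
            out[i] = table[idx[i] & 0x0Fu];
        }
    }
}
```

Loading the table operand with the 16 values of a function on 4-bit inputs therefore evaluates that function on 16 nibbles with a single instruction.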
New SSSE3 instructions Example: Bit manipulation
New SSSE3 instructions
PALIGNR instruction (_mm_alignr_epi8): [figure not recovered]
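A portable model of PALIGNR makes the byte alignment explicit. This is a sketch of the instruction's semantics on byte arrays, with byte 0 as the least significant byte; the name palignr_model is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Portable model of PALIGNR / _mm_alignr_epi8(hi, lo, n):
 * concatenate hi:lo into 32 bytes, shift right by n bytes,
 * and keep the low 16 bytes. */
static void palignr_model(uint8_t out[16], const uint8_t hi[16],
                          const uint8_t lo[16], unsigned n) {
    uint8_t cat[32];
    memcpy(cat, lo, 16);        /* low half of the concatenation */
    memcpy(cat + 16, hi, 16);   /* high half of the concatenation */
    for (unsigned i = 0; i < 16; i++) {
        unsigned src = i + n;
        out[i] = (src < 32) ? cat[src] : 0;   /* zeros enter from the top */
    }
}
```

With n = 8 the result holds the high 64 bits of lo followed by the low 64 bits of hi, which is the operation needed to move data across 64-bit lane boundaries.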
Binary field $\mathbb{F}_{2^m}$
Irreducible polynomial: $f(z)$ (trinomial or pentanomial)
Polynomial basis: $a(z) \in \mathbb{F}_{2^m}$, with $a(z) = \sum_{i=0}^{m-1} a_i z^i$.
Software representation: vector of $n = \lceil m/64 \rceil$ words (even).
Graphical representation: [figure not recovered]
Data types
#if WORD == 8
typedef uint8_t dig_t;
#elif WORD == 16
typedef uint16_t dig_t;
#elif WORD == 32
typedef uint32_t dig_t;
#elif WORD == 64
typedef uint64_t dig_t;
#endif
typedef __m128i vec_t;
Useful macros
#define LOAD      _mm_load_si128
#define STORE     _mm_store_si128
#define PSHUFB    _mm_shuffle_epi8
#define XOR       _mm_xor_si128
#define AND       _mm_and_si128
#define SHL       _mm_slli_epi64
#define SHR       _mm_srli_epi64
#define SHL8      _mm_slli_si128
#define SHR8      _mm_srli_si128
#define UNPACKLO  _mm_unpacklo_epi8
#define UNPACKHI  _mm_unpackhi_epi8
#define CLMUL     _mm_clmulepi64_si128
Proposed representation
To employ 4-bit granular arithmetic, convert to split form:
$a_L = \sum_{0 \le i < m,\ i \bmod 8 < 4} a_i z^i, \quad a_H = \sum_{0 \le i < m,\ i \bmod 8 \ge 4} a_i z^{i-4}$
Easy to convert back: $a(z) = a_H(z) z^4 + a_L(z)$.
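On a single 64-bit word the conversion is a mask and a shift. The sketch below uses hypothetical helper names (split_lo, split_hi, split_join) and plain C instead of vector registers; the same masks apply lane by lane in the 128-bit case.

```c
#include <stdint.h>

#define NIBBLE_MASK 0x0F0F0F0F0F0F0F0FULL

/* Low nibble of every byte, left in place: coefficients with i mod 8 < 4. */
static uint64_t split_lo(uint64_t a) {
    return a & NIBBLE_MASK;
}

/* High nibble of every byte, moved down by 4 bits: coefficients with i mod 8 >= 4. */
static uint64_t split_hi(uint64_t a) {
    return (a >> 4) & NIBBLE_MASK;
}

/* Convert back: a = aH * z^4 + aL. The halves never overlap, so XOR suffices. */
static uint64_t split_join(uint64_t lo, uint64_t hi) {
    return (hi << 4) ^ lo;
}
```

After the split, every byte holds a value below 16, which is exactly the index format expected by a byte-shuffle table lookup.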
Addition/subtraction in $\mathbb{F}_{2^m}$
$c(z) = a(z) + b(z) = \sum_{i=0}^{m-1} (a_i \oplus b_i) z^i$
[Figure: words A_{n-1}, ..., A_0 and B_{n-1}, ..., B_0 added word by word into C_{n-1}, ..., C_0]
Guidelines:
- Use XOR instruction with largest operand size.
- Verify impact of higher throughput.
Addition/subtraction in $\mathbb{F}_{2^m}$
void fb_addn_low(dig_t *c, dig_t *a, dig_t *b) {
    int i;
    /* Process two 64-bit digits (one 128-bit vector) per iteration. */
    for (i = 0; i < FB_DIGS; i += 2, c += 2, a += 2, b += 2) {
        vec_t t0 = LOAD((vec_t *)a);
        vec_t t1 = LOAD((vec_t *)b);
        t0 = XOR(t0, t1);
        STORE((vec_t *)c, t0);
    }
}
Squaring in $\mathbb{F}_{2^m}$
$a(z) = \sum_{i=0}^{m-1} a_i z^i = a_{m-1} z^{m-1} + \cdots + a_2 z^2 + a_1 z + a_0$
$a(z)^2 = \sum_{i=0}^{m-1} a_i z^{2i} = a_{m-1} z^{2m-2} + \cdots + a_2 z^4 + a_1 z^2 + a_0$
Example:
$a(z) = (a_{m-1}, a_{m-2}, \ldots, a_2, a_1, a_0)$
$a(z)^2 = (a_{m-1}, 0, a_{m-2}, 0, \ldots, 0, a_2, 0, a_1, 0, a_0)$
Squaring in $\mathbb{F}_{2^m}$
Since squaring is a linear operation: $a(z)^2 = a_H(z)^2 \cdot z^8 + a_L(z)^2$.
We can compute $a_L(z)^2$ and $a_H(z)^2$ with a lookup table.
For $u = (u_3, u_2, u_1, u_0)$, use $\mathrm{table}(u) = (0, u_3, 0, u_2, 0, u_1, 0, u_0)$.
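The table-based expansion can be sketched in portable C, one byte at a time, before moving to the vectorized form. The names SQR_TABLE, sqr_byte and sqr_word are hypothetical; the result is the unreduced square, so a modular reduction would still have to follow.

```c
#include <stdint.h>

/* table(u) = (0,u3,0,u2,0,u1,0,u0): the square of each 4-bit polynomial. */
static const uint8_t SQR_TABLE[16] = {
    0x00, 0x01, 0x04, 0x05, 0x10, 0x11, 0x14, 0x15,
    0x40, 0x41, 0x44, 0x45, 0x50, 0x51, 0x54, 0x55
};

/* Square one byte of coefficients: aL^2 fills the low byte, aH^2 the high byte. */
static uint16_t sqr_byte(uint8_t a) {
    uint16_t lo = SQR_TABLE[a & 0x0F];
    uint16_t hi = SQR_TABLE[a >> 4];
    return (uint16_t)((hi << 8) | lo);
}

/* Square a 64-bit word without reduction; the 128-bit result goes to c[0..1]. */
static void sqr_word(uint64_t c[2], uint64_t a) {
    c[0] = c[1] = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t s = sqr_byte((uint8_t)(a >> (8 * i)));
        c[i / 4] |= s << (16 * (i % 4));   /* byte i expands to bits 16i..16i+15 */
    }
}
```

The vector version performs the same two lookups for 16 bytes at once and replaces the shift-and-OR of sqr_byte with byte interleaving.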
Proposed squaring in $\mathbb{F}_{2^m}$
[Figure: each vector A_i is split into A_L and A_H, both halves are looked up in the table of squares (00000000, 00000001, 00000100, 00000101, 00010000, 00010001, ..., 01010101), and the results are interleaved (interlo, interhi) into T_{2i} and T_{2i+1}]
$a(z)^2 = a_L(z)^2 + a_H(z)^2 \cdot z^8$.
Squaring in $\mathbb{F}_{2^m}$
Algorithm 1 Proposed optimization for squaring in F_{2^m}.
Input: a(z) = a[0..n−1].
Output: c(z) = c[0..n−1] = a(z)^2 mod f(z).
1: Store in table the squares u(z)^2 of all 4-bit polynomials u(z).
2: table ← (0x5554515045444140, 0x1514111005040100)
3: mask ← (0x0F0F0F0F0F0F0F0F, 0x0F0F0F0F0F0F0F0F)
4: for i ← 0 to n/2 − 1 do
5:   a_0 ← load(a[2i])
6:   Convert to split representation.
7:   a_L ← a_0 ∧ mask
8:   a_H ← a_0 ≫_∤8 4, a_H ← a_H ∧ mask
9:   Perform parallel table lookups.
10:   a_L ← lookup(table, a_L), a_H ← lookup(table, a_H)
11:   Simulate addition with 8-bit offset.
12:   t[2i] ← interlo(a_L, a_H)
13:   t[2i + 1] ← interhi(a_L, a_H)
14: end for
15: return c = t mod f(z)
Squaring in $\mathbb{F}_{2^m}$
void fb_sqrm_low(dig_t *c, dig_t *a) {
    vec_t t0, t1, t2, t3, mask;
    /* t0 holds the table of squares of all 4-bit polynomials. */
    t0 = _mm_set_epi32(0x55545150, 0x45444140, 0x15141110, 0x05040100);
    mask = _mm_set_epi32(0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F, 0x0F0F0F0F);
    for (int i = 0; i < FB_DIGS; i += 2) {
        t1 = LOAD((vec_t *)(a + i));
        /* Split representation followed by parallel table lookups. */
        t2 = AND(t1, mask);
        t2 = PSHUFB(t0, t2);
        t3 = SHR(t1, 4);
        t3 = AND(t3, mask);
        t3 = PSHUFB(t0, t3);
        /* Interleave the expanded low and high nibbles. */
        t1 = UNPACKLO(t2, t3);
        t2 = UNPACKHI(t2, t3);
        STORE((vec_t *)(c + 2 * i), t1);
        STORE((vec_t *)(c + 2 * (i + 1)), t2);
    }
    REDUCE(c, c);
}
Square root extraction in $\mathbb{F}_{2^m}$
Algorithm by Fong et al.:
$\sqrt{a} = a^{2^{m-1}} = \left(\sum_{i=0}^{m-1} a_i z^i\right)^{2^{m-1}} = \sum_{i=0}^{m-1} a_i \left(z^{2^{m-1}}\right)^i = \sum_{i \text{ even}} a_i z^{i/2} + \sqrt{z} \sum_{i \text{ odd}} a_i z^{(i-1)/2} = a_{even} + \sqrt{z} \cdot a_{odd}$
Since square root is also a linear operation:
$\sqrt{a(z)} = \sqrt{a_H(z) z^4 + a_L(z)} = \sqrt{a_H(z)}\, z^2 + \sqrt{a_L(z)} = \sqrt{z} \cdot (a_{L_{odd}}(z) + a_{H_{odd}}(z) z^2) + a_{L_{even}}(z) + a_{H_{even}}(z) z^2$
Note: Multiplication by $\sqrt{z}$ ideally requires shifted additions only. If that is not possible, precompute the product by $\sqrt{z}$.
Proposed square root in $\mathbb{F}_{2^m}$
[Figure: each vector A_i is shuffled and split into A_L and A_H; a table (00000000, 00000001, ..., 00110011) and a second table multiplied by z^2 (00000000, 00000100, ..., 11001100) are used in parallel lookups to assemble A_even and A_odd]
$\sqrt{a(z)} = \sqrt{z} \cdot (a_{L_{odd}}(z) + a_{H_{odd}}(z) z^2) + a_{L_{even}}(z) + a_{H_{even}}(z) z^2$
Multiplication in $\mathbb{F}_{2^m}$
Three strategies:
1. López-Dahab comb method
2. Shuffle-based multiplication
3. Native multiplication
López-Dahab multiplication in $\mathbb{F}_{2^m}$
We can compute $u \cdot b(z)$ using shifts and additions.
If $a(z)$ is divided into 4-bit polynomials, compute $a(z) \cdot b(z)$ by: [figure not recovered]
If the multiplier is represented in split form:
$a(z) \cdot b(z) = b(z) \cdot (a_H(z) z^4 + a_L(z)) = b(z) z^4 a_H(z) + b(z) a_L(z)$
This is a well-known technique for removing expensive 4-bit shifts!
Note: The core operation is accumulating $u \times$ dense $b(z)$.
López-Dahab multiplication in $\mathbb{F}_{2^m}$
Algorithm 2 LD multiplication implemented with n 128-bit registers.
Input: a(z) = a[0..n−1], b(z) = b[0..n−1].
Output: c(z) = c[0..n−1].
Note: m_i denotes the vector of n/2 128-bit registers (r_{i−1+n/2}, …, r_i).
1: Compute T0(u) = u(z)·b(z), T1(u) = u(z)·(b(z) z^4) for all u(z) of degree < 4.
2: (r_{n−1}, …, r_0) ← 0
3: for k ← 56 downto 0 by 8 do
4:   for j ← 1 to n−1 by 2 do
5:     Let u = (u_3, u_2, u_1, u_0), where u_t is bit (k + t) of a[j].
6:     Let v = (v_3, v_2, v_1, v_0), where v_t is bit (k + t + 4) of a[j].
7:     m_{(j−1)/2} ← m_{(j−1)/2} ⊕ T0(u), m_{(j−1)/2} ← m_{(j−1)/2} ⊕ T1(v)
8:   end for
9:   (r_{n−1}, …, r_0) ← (r_{n−1}, …, r_0) ⊲ 8
10: end for
11: for k ← 56 downto 0 by 8 do
12:   for j ← 0 to n−2 by 2 do
13:     Let u = (u_3, u_2, u_1, u_0), where u_t is bit (k + t) of a[j].
14:     Let v = (v_3, v_2, v_1, v_0), where v_t is bit (k + t + 4) of a[j].
15:     m_{j/2} ← m_{j/2} ⊕ T0(u), m_{j/2} ← m_{j/2} ⊕ T1(v)
16:   end for
17:   if k > 0 then (r_{n−1}, …, r_0) ← (r_{n−1}, …, r_0) ⊲ 8
18: end for
19: return c = (r_{n−1}, …, r_0) mod f(z)
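The comb structure is easier to see in a scalar version. The sketch below is the plain left-to-right comb with 4-bit windows on 64-bit words, not the 128-bit, two-table variant of Algorithm 2; the function name ld_comb_mul and the operand size N = 2 are assumptions made for illustration, and no modular reduction is performed.

```c
#include <stdint.h>

#define N 2   /* 64-bit words per operand (illustrative choice) */

/* c[0..2N-1] = a * b over F_2[z], left-to-right comb with 4-bit windows. */
static void ld_comb_mul(uint64_t c[2 * N], const uint64_t a[N],
                        const uint64_t b[N]) {
    uint64_t T[16][N + 1];   /* T[u] = u(z) * b(z) for all u of degree < 4 */

    for (int j = 0; j <= N; j++) T[0][j] = 0;
    for (int j = 0; j < N; j++) T[1][j] = b[j];
    T[1][N] = 0;
    for (int u = 2; u < 16; u += 2) {
        uint64_t carry = 0;
        for (int j = 0; j <= N; j++) {          /* T[u] = T[u/2] * z */
            T[u][j] = (T[u / 2][j] << 1) | carry;
            carry = T[u / 2][j] >> 63;
        }
        for (int j = 0; j <= N; j++) {          /* T[u+1] = T[u] + b */
            T[u + 1][j] = T[u][j] ^ T[1][j];
        }
    }

    for (int j = 0; j < 2 * N; j++) c[j] = 0;
    for (int k = 60; k >= 0; k -= 4) {
        for (int j = 0; j < N; j++) {
            unsigned u = (unsigned)(a[j] >> k) & 0xFu;
            for (int i = 0; i <= N; i++) {
                c[j + i] ^= T[u][i];            /* add u*b at word offset j */
            }
        }
        if (k > 0) {                            /* accumulator times z^4 */
            for (int j = 2 * N - 1; j > 0; j--) {
                c[j] = (c[j] << 4) | (c[j - 1] >> 60);
            }
            c[0] <<= 4;
        }
    }
}
```

Every window position k is handled for all words of a before the accumulator is shifted, so the multi-word shift runs only 15 times regardless of the operand length. Algorithm 2 goes further by keeping a second table for b(z) z^4, which lets it shift by whole bytes.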
Shuffle-based multiplication in $\mathbb{F}_{2^m}$
If both multiplicand and multiplier are represented in split form:
$a(z) \cdot b(z) = (b_H(z) z^4 + b_L(z)) \cdot (a_H(z) z^4 + a_L(z))$
Using the Karatsuba formula, we can reduce it to 3 multiplications:
$a(z) \cdot b(z) = a_H b_H z^8 + [(a_H + a_L)(b_H + b_L) + a_H b_H + a_L b_L] z^4 + a_L b_L$
Note: The core operation is accumulating $u \times$ sparse $b_{L,H}(z)$.
[Figure: u multiplied by the vector B_{n-1}, ..., B_0]
Shuffle-based multiplication in $\mathbb{F}_{2^m}$
Algorithm 3 Multiplication in split form.
Input: Operands a, b in split representation.
Output: Result a · b stored in registers (r_{n−1}, …, r_0).
1: table stores all products of 4-bit × 4-bit polynomials.
2: (r_{n−1}, …, r_0) ← 0
3: for k ← 56 downto 0 by 8 do
4:   for j ← 1 to n−1 by 2 do
5:     Let u = (u_3, u_2, u_1, u_0), where u_t is bit (k + t) of a[j].
6:     for i ← 0 to n/2 − 1 do r_i ← r_i ⊕ shuffle(table[u], b[i])
7:   end for
8:   (r_{n−1}, …, r_0) ← (r_{n−1}, …, r_0) ⊲ 8
9: end for
10: for k ← 56 downto 0 by 8 do
11:   for j ← 0 to n−2 by 2 do
12:     Let u = (u_3, u_2, u_1, u_0), where u_t is bit (k + t) of a[j].
13:     for i ← 0 to n/2 − 1 do r_i ← r_i ⊕ shuffle(table[u], b[i])
14:   end for
15:   if k > 0 then (r_{n−1}, …, r_0) ← (r_{n−1}, …, r_0) ⊲ 8
16: end for
Native multiplication
Two organizations:
1. 128-bit granularity:
   - Reduce number of required registers
   - Use Karatsuba for each 128 × 128-bit multiplication
   - Use maximum number of Karatsuba levels for n/2 digits
2. 64-bit granularity:
   - Interleave storage to reduce additions and number of registers
   - Use the formula with the lower number of multiplications
Native multiplication
Algorithm 4 Proposed implementation for multiplication in F_{2^283}.
Input: a(z) = a[0..4], b(z) = b[0..4].
Output: c(z) = c[0..4] = a(z) · b(z).
Note: Pairs a_i, b_i, c_i, m_i of 64-bit words represent vector registers.
1: for i ← 0 to 4 do c_i ← (a[i], b[i])
2: c5 ← c0 ⊕ c1, c6 ← c0 ⊕ c2, c7 ← c2 ⊕ c4, c8 ← c3 ⊕ c4
3: c9 ← c3 ⊕ c6, c10 ← c1 ⊕ c7, c11 ← c5 ⊕ c8, c12 ← c2 ⊕ c11
4: for i ← 0 to 12 do m_i ← c_i[0] ⊗ c_i[1]
5: c0 ← m0, c8 ← m4
6: c1 ← c0 ⊕ m1, c2 ← c1 ⊕ m6
7: c1 ← c1 ⊕ m5, c2 ← c2 ⊕ m2
8: c7 ← c8 ⊕ m3, c6 ← c7 ⊕ m7
9: c7 ← c7 ⊕ m8, c6 ← c6 ⊕ m2
10: c5 ← m11 ⊕ m12, c3 ← c5 ⊕ m9
11: c3 ← c3 ⊕ c0 ⊕ c10
12: c4 ← c1 ⊕ c7 ⊕ m9 ⊕ m10 ⊕ m12
13: c5 ← c5 ⊕ c2 ⊕ c8 ⊕ m10
14: c9 ← c7 ≫_8 64
15: (c7, c5, c3, c1) ← (c7, c5, c3, c1) ⊲ 8
16: c0 ← c0 ⊕ c1, c1 ← c2 ⊕ c3, c2 ← c4 ⊕ c5
17: c3 ← c6 ⊕ c7, c4 ← c8 ⊕ c9
18: return c = (c4, c3, c2, c1, c0) mod f(z)
Comparison
López-Dahab multiplication:
- Exploits the highest-granularity XOR operation
- Consumes memory space proportional to field size
Shuffle-based multiplication:
- Relies on sparser core operation
- Consumes constant memory space (apart from Karatsuba)
- Depends on constants stored in memory
Native multiplication:
- Faster and with constant memory consumption
- No widespread support
Modular reduction
We need to reduce squaring/multiplication results in $\mathbb{F}_{2^m}$ modulo $f(z) = z^m + r(z)$. We can process multiple bits at a time based on the observation:
$c(z) = c_{2m-2} z^{2m-2} + \cdots + c_m z^m + c_{m-1} z^{m-1} + \cdots + c_1 z + c_0$
$\equiv (c_{2m-2} z^{m-2} + \cdots + c_m)\, r(z) + c_{m-1} z^{m-1} + \cdots + c_1 z + c_0 \pmod{f(z)}$.
Important: We need to multiply by r (z).
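The observation above can be tried on the small field $\mathbb{F}_{2^8}$ with $r(z) = z^4 + z^3 + z + 1$. The sketch below is illustrative only: the names mul_by_r and reduce_fold are hypothetical, and the high part is folded as a whole instead of word by word.

```c
#include <stdint.h>

#define M 8   /* f(z) = z^8 + r(z), with r(z) = z^4 + z^3 + z + 1 */

/* Multiply a small polynomial by r(z) using shifted additions only. */
static uint32_t mul_by_r(uint32_t h) {
    /* h * r = h*z^4 + h*z^3 + h*z + h */
    return (h << 4) ^ (h << 3) ^ (h << 1) ^ h;
}

/* Reduce a polynomial of degree at most 2m-2 = 14 modulo f(z). */
static uint8_t reduce_fold(uint16_t c) {
    uint32_t x = c;
    while (x >> M) {                 /* a coefficient of degree >= m remains */
        uint32_t high = x >> M;      /* c_{2m-2} z^{m-2} + ... + c_m */
        uint32_t low = x & 0xFFu;    /* c_{m-1} z^{m-1} + ... + c_0 */
        x = mul_by_r(high) ^ low;    /* uses z^m = r(z) (mod f) */
    }
    return (uint8_t)x;
}
```

For example, reduce_fold(0x100) returns 0x1B, which is z^8 replaced by r(z). Because r(z) has low degree, the loop needs at most two folds here; sparse trinomials and pentanomials keep the multiplication by r(z) down to a handful of shifts and XORs.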
Modular reduction
Requires heavy shifting, so split representation does not help.
Guidelines:
- If $f(z)$ is a trinomial, implement with vector digits
- If $f(z)$ is a pentanomial, process pairs of digits in parallel or 64-bit mode
- Write $f(z)$ in a format that minimizes shifting: $f(z) = z^{283} + z^{12} + z^7 + z^5 + 1 = z^{283} + (z^7 + 1)(z^5 + 1)$
- Accumulate writes into registers before writing to memory
- Reduce squaring/multiplication results in registers
Modular reduction (64-bit mode)
Algorithm 5 Fast modular reduction by f(z) = z^1223 + z^255 + 1.
Input: c(z) = c[0..2n−1].
Output: c(z) mod f(z) = c[0..n−1].
1: for i ← 2n−1 downto n do
2:   t ← c[i]
3:   c[i − 15] ← c[i − 15] ⊕ (t ≫ 8)
4:   c[i − 16] ← c[i − 16] ⊕ (t ≪ 56)
5:   c[i − 19] ← c[i − 19] ⊕ (t ≫ 7)
6:   c[i − 20] ← c[i − 20] ⊕ (t ≪ 57)
7: end for
8: t ← c[19] ≫ 7, c[0] ← c[0] ⊕ t, t ← t ≪ 7
9: c[3] ← c[3] ⊕ (t ≪ 56)
10: c[4] ← c[4] ⊕ (t ≫ 8)
11: c[19] ← (c[19] ⊕ t) ∧ 0x7F
12: return c
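Algorithm 5 translates almost line by line into C. The sketch below assumes 64-bit words and n = 20; the names FB_WORDS and fb_rdc1223 are hypothetical and not taken from RELIC. The shift amounts follow from 1223 = 19·64 + 7 and 1223 − 255 = 968 = 15·64 + 8.

```c
#include <stdint.h>

#define FB_WORDS 20   /* n = ceil(1223 / 64) 64-bit words per field element */

/* In-place reduction of c[0..2n-1] modulo f(z) = z^1223 + z^255 + 1.
 * The result is left in c[0..n-1]; the higher words are not cleared. */
static void fb_rdc1223(uint64_t c[2 * FB_WORDS]) {
    uint64_t t;
    for (int i = 2 * FB_WORDS - 1; i >= FB_WORDS; i--) {
        t = c[i];
        /* z^(64i) = z^(64i - 968) + z^(64i - 1223) (mod f) */
        c[i - 15] ^= t >> 8;
        c[i - 16] ^= t << 56;
        c[i - 19] ^= t >> 7;
        c[i - 20] ^= t << 57;
    }
    /* Fold bits 7..63 of word 19, which have degree >= 1223. */
    t = c[19] >> 7;
    c[0] ^= t;
    t <<= 7;
    c[3] ^= t << 56;
    c[4] ^= t >> 8;
    c[19] = (c[19] ^ t) & 0x7F;
}
```

Each high word touches only four lower words, which is why a trinomial with a well-placed middle term reduces so cheaply.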
Modular reduction (128-bit mode)
Algorithm 6 Proposed fast reduction by f(z) = z^1223 + z^255 + 1.
Input: t(z) = t[0..n−1] (vector of 128-bit elements).
Output: c(z) mod f(z) = c[0..n−1].
Note: The accumulate function R(r3, r2, r1, r0, t) executes:
  s ← t ≫_∤8 7, r3 ← t ≪_∤8 57
  r3 ← r3 ⊕ (s ≪_8 64)
  r2 ← r2 ⊕ (s ≫_8 64)
  r1 ← r1 ⊕ (t ≪_8 56)
  r0 ← r0 ⊕ (t ≫_8 72)
1: r0, r1, r2, r3 ← 0
2: for i ← 19 downto 15 by 4 do
3:   R(r3, r2, r1, r0, t[i]), t[i − 7] ← t[i − 7] ⊕ r0
4:   R(r0, r3, r2, r1, t[i − 1]), t[i − 8] ← t[i − 8] ⊕ r1
5:   R(r1, r0, r3, r2, t[i − 2]), t[i − 9] ← t[i − 9] ⊕ r2
6:   R(r2, r1, r0, r3, t[i − 3]), t[i − 10] ← t[i − 10] ⊕ r3
7: end for
8: R(r3, r2, r1, r0, t[11]), t[4] ← t[4] ⊕ r0
9: R(r0, r3, r2, r1, t[10]), t[3] ← t[3] ⊕ r1
10: t[2] ← t[2] ⊕ r2, t[1] ← t[1] ⊕ r3, t[0] ← t[0] ⊕ r0
11: r0 ← t[9] ≫_8 64, r0 ← r0 ≫_∤8 7, t[0] ← t[0] ⊕ r0
12: r1 ← r0 ≪_8 64, r1 ← r1 ≪_∤8 63, t[1] ← t[1] ⊕ r1
13: r1 ← r0 ≫_∤8 1, t[2] ← t[2] ⊕ r1
14: for i ← 0 to 9 do c[2i] ← store(t[i]), c[19] ← c[19] ∧ 0x7F
15: return c
Modular reduction (128-bit mode)
Algorithm 7 Proposed fast reduction by f(z) = z^283 + (z^7 + 1)(z^5 + 1).
Input: Double-precision polynomial stored into 128-bit registers c = (c4, c3, c2, c1, c0).
Output: Field element c mod f(z) stored into 128-bit registers (c2, c1, c0).
1: t2 ← c2, t0 ← (c3, c2) ⊳ 64, t1 ← (c4, c3) ⊳ 64
2: c4 ← c4 ≫_∤8 27, c3 ← c3 ≫_∤8 27, c3 ← c3 ⊕ (t1 ≪_∤8 37)
3: c2 ← c2 ≫_∤8 27, c2 ← c2 ⊕ (t0 ≪_∤8 37)
4: t0 ← (c4, c3) ⊳ 120, c4 ← c4 ⊕ (t0 ≫_∤8 1)
5: t1 ← (c3, c2) ⊳ 64, c3 ← c3 ⊕ (c3 ≪_∤8 7) ⊕ (t1 ≫_∤8 57)
6: t0 ← c2 ≪_8 64, c2 ← c2 ⊕ (c2 ≪_∤8 7) ⊕ (t0 ≫_∤8 57)
7: t0 ← (c4, c3) ⊳ 120, c4 ← c4 ⊕ (t0 ≫_∤8 3)
8: t1 ← (c3, c2) ⊳ 64, c3 ← c3 ⊕ (c3 ≪_∤8 5) ⊕ (t1 ≫_∤8 59)
9: t0 ← c2 ≪_8 64, c2 ← c2 ⊕ (c2 ≪_∤8 5) ⊕ (t0 ≫_∤8 59)
10: c0 ← c0 ⊕ c2, c1 ← c1 ⊕ c3, c2 ← t2 ⊕ c4
11: t0 ← c4 ≫_∤8 27
12: t1 ← t0 ⊕ (t0 ≪_∤8 5)
13: t0 ← t1 ⊕ (t1 ≪_∤8 7)
14: c0 ← c0 ⊕ t0, c2 ← c2 ∧ (0x0000000000000000, 0x0000000007FFFFFF)
15: return c = (c2, c1, c0)
Half-trace
We want to compute $H(c) = \sum_{i=0}^{(m-1)/2} c^{2^{2i}}$.
Important: For even $i$, $H(z^i) = H(z^{i/2}) + z^{i/2} + \mathrm{Tr}(z^i)$.
Algorithm 8 Solve x^2 + x = c
Input: c = Σ_{i=0}^{m−1} c_i z^i ∈ F_{2^m} where m is odd and Tr(c) = 0.
Output: solution s of x^2 + x = c.
1: Compute H(l_0 z^{8i+1} + l_1 z^{8i+3} + l_2 z^{8i+5} + l_3 z^{8i+7}) for 0 ≤ i ≤ ⌊(m−3)/8⌋ and l_j ∈ F_2.
2: s ← 0
3: for i = (m − 1)/2 downto 1 do
4:   if c_{2i} = 1 then
5:     c ← c + z^i, s ← s + z^i
6:   end if
7: end for
8: return s + Σ_{i∈I} c_{8i+1} H(z^{8i+1}) + c_{8i+3} H(z^{8i+3}) + c_{8i+5} H(z^{8i+5}) + c_{8i+7} H(z^{8i+7})
Multi-squaring [Bos et al.]
Precompute a table $T$ of $16 \lceil m/4 \rceil$ field elements such that
$T[j, i_0 + 2 i_1 + 4 i_2 + 8 i_3] = (i_0 z^{4j} + i_1 z^{4j+1} + i_2 z^{4j+2} + i_3 z^{4j+3})^{2^k}$
Then we can compute $a^{2^k}$ as:
$a^{2^k} = \sum_{j=0}^{\lceil m/4 \rceil - 1} T[j, \lfloor a / 2^{4j} \rfloor \bmod 2^4]$.
Inversion
Guidelines:
- If memory is not available, implement the Extended Euclidean Algorithm in 64-bit mode.
- If memory is available, implement Itoh-Tsujii with precomputed $2^i$-th powers: $a^{-1} = a^{(2^{m-1} - 1) \cdot 2}$
Inversion
Algorithm 9 Inversion in F_{2^m} using EEA.
Input: a = Σ_{i=0}^{m−1} a_i z^i ∈ F_{2^m}, a ≠ 0.
Output: a^{−1} mod f.
1: u ← a, v ← f
2: g1 ← 1, g2 ← 0
3: while u ≠ 1 do
4:   j ← deg(u) − deg(v)
5:   if j < 0 then
6:     u ↔ v, g1 ↔ g2, j ← −j
7:   end if
8:   u ← u + z^j v
9:   g1 ← g1 + z^j g2
10: end while
11: return g1
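The EEA loop can be checked on the small field $\mathbb{F}_{2^8}$ with $f(z) = z^8 + z^4 + z^3 + z + 1$. This is a minimal sketch with hypothetical names (poly_deg, gf256_inv); it recomputes degrees bit by bit and is not constant-time.

```c
#include <stdint.h>

/* Degree of a nonzero binary polynomial stored in the bits of x. */
static int poly_deg(uint32_t x) {
    int d = -1;
    while (x) {
        x >>= 1;
        d++;
    }
    return d;
}

/* Inverse of a nonzero a in F_2^8 modulo f(z) = z^8 + z^4 + z^3 + z + 1.
 * The input must not be zero, otherwise the loop does not terminate. */
static uint8_t gf256_inv(uint8_t a) {
    uint32_t u = a, v = 0x11Bu;   /* u <- a, v <- f */
    uint32_t g1 = 1, g2 = 0;
    while (u != 1) {
        int j = poly_deg(u) - poly_deg(v);
        if (j < 0) {
            uint32_t tmp;
            tmp = u; u = v; v = tmp;        /* u <-> v */
            tmp = g1; g1 = g2; g2 = tmp;    /* g1 <-> g2 */
            j = -j;
        }
        u ^= v << j;      /* u <- u + z^j v */
        g1 ^= g2 << j;    /* g1 <- g1 + z^j g2 */
    }
    return (uint8_t)g1;
}
```

The loop keeps the congruences g1·a ≡ u and g2·a ≡ v (mod f), so g1 is the inverse as soon as u reaches 1. For instance, gf256_inv(0x02) returns 0x8D.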
Implementation
Material:
- GCC 4.1.2 (fastest SSE intrinsics, GCC 4.5.0 is good again)
- RELIC cryptographic library [1]
- Intel Core 2 65 nm and 45 nm processors
- OpenMP constructs for parallelism
Parameters:
- 16 different binary fields ranging from 113 to 1223 bits
- Choices of square-root friendly and standard f(z)
Comparison:
- Only vector implementations (mpFq, Beuchat et al. 2009)
- Only on the entry-level Intel Core 2 65 nm
[1] http://code.google.com/p/relic-toolkit/
Experimental results – Squaring
[Plot: cycles on Intel Core 2 65 nm (0 to 500) versus field size (200 to 1200); series: Related work, This work]
Experimental results – Square-root with friendly f(z)
[Plot: cycles on Intel Core 2 65 nm (0 to 800) versus field size (200 to 1200); series: Related work, This work]
Experimental results – Square-root with standard f(z)
[Plot: cycles on Intel Core 2 65 nm (0 to 900) versus field size (200 to 1200); series: Related work, This work]
Experimental results – López-Dahab multiplication
[Plot: cycles on Intel Core 2 65 nm (0 to 6000) versus field size (200 to 1200); series: Related work, This work (López-Dahab)]
Experimental results – Shuffle-based multiplication
[Plot: cycles on Intel Core 2 45 nm (0 to 8000) versus field size (200 to 1200); series: This work (López-Dahab), This work (Shuffling)]
Note: The native multiplier on newer machines is twice as fast as LD.
Observations
Squaring and square-root are:
- Efficiently formulated with M/S ratio up to 34
- Faster when shuffling throughput is higher
- Heavily dependent on the choice of f(z)
Shuffle-based multiplication:
- Has a bottleneck with constants stored in memory
- Requires faster table addressing scheme
- Is only 50%-90% slower than López-Dahab!
Other operations:
- Restore the ratio to native multiplication (H ≈ M, I ≈ 25M).
Part III: Applications
Introduction
Elliptic Curve Cryptography (ECC):
- Underlying problem harder than integer factoring (RSA)
- Same security level with smaller parameters
- Efficiency in storage and execution time
Pairing-Based Cryptography (PBC):
- Initially destructive
- Allows innovative protocols
- Makes curve-based cryptography more flexible
Introduction
Point multiplication is the most expensive operation in Elliptic Curve Cryptography. Pairing computation is the most expensive operation in Pairing-Based Cryptography. Parallelism is being increasingly introduced in modern architectures.
Objective
Explore two types of parallelism in software to reduce computation latency:
- Vector instructions
- Multiprocessing
Applications: desktop-class computers, real-time services, embedded devices.
Contributions
- State-of-the-art timings for ECC and PBC
- Parallelization of Miller's Algorithm with load balancing
- Experimental results
Elliptic curves
(a) Point addition R = P + Q;
(b) Point doubling R = 2P;
Figure : Elliptic curve arithmetic. [Picture: Hankerson et al. 2003]
Binary elliptic curves
A binary elliptic curve is the set of solutions $(x, y) \in \mathbb{F}_{2^m} \times \mathbb{F}_{2^m}$ satisfying the equation $y^2 + xy = x^3 + ax^2 + b$, where $a, b \in \mathbb{F}_{2^m}$ with $b \ne 0$, together with a point at infinity $\infty$. When $a \in \{0, 1\}$ and $b = 1$, the curve is called a Koblitz curve.
Elliptic curves
The set of points $\{(x, y) \in E(\mathbb{F}_{2^m})\} \cup \{\infty\}$ under the addition operation $+$ (chord-and-tangent rule) forms an additive group.
Given an elliptic point $P$ and an integer $k$, the operation $kP$, called scalar multiplication, is defined by
$kP = \underbrace{P + P + \cdots + P}_{k \text{ times}}$
This is the fundamental operation employed by protocols based on elliptic curves.
Important: Underlying problem: ECDLP: Recover $k$ from $\langle P, kP \rangle$.
Elliptic curve arithmetic
Algorithm 10 Double-and-add scalar multiplication
Input: k = Σ_{i=0}^{t−1} k_i 2^i, P ∈ E(F_{2^m}) of order r.
Output: kP.
1: Q ← ∞
2: for i = t − 1 downto 0 do
3:   Q ← 2Q
4:   if k_i = 1 then
5:     Q ← Q + P
6:   end if
7: end for
8: return Q
Elliptic curve arithmetic
Algorithm 11 Left-to-right wNAF scalar multiplication
Input: w, k ∈ Z, P ∈ E(F_{2^m}) of order r.
Output: kP.
1: Obtain the representation NAF_w(k) = Σ_{i=0}^{t−1} k_i 2^i
2: Compute P_i = iP for i ∈ {1, 3, …, 2^{w−1} − 1}
3: Q ← ∞
4: for i = t − 1 downto 0 do
5:   Q ← 2Q
6:   if k_i > 0 then
7:     Q ← Q + P_{k_i}
8:   else if k_i < 0 then
9:     Q ← Q − P_{−k_i}
10:   end if
11: end for
12: return Q
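Step 1 of Algorithm 11 relies on the width-w NAF recoding of the scalar. The sketch below shows the usual right-to-left recoding for scalars that fit in a machine word; the names wnaf_recode and wnaf_value are hypothetical, and a real implementation would operate on multi-precision integers.

```c
#include <stdint.h>

/* Width-w NAF recoding: writes digits k_0, k_1, ... (least significant first)
 * into naf[] and returns their count t. Each nonzero digit is odd with
 * |k_i| < 2^(w-1). Assumes 2 <= w <= 8, k < 2^63 and room for 65 digits. */
static int wnaf_recode(int8_t naf[], uint64_t k, int w) {
    int t = 0;
    int64_t mod = (int64_t)1 << w;
    while (k > 0) {
        int64_t d = 0;
        if (k & 1) {
            d = (int64_t)(k & (uint64_t)(mod - 1));   /* k mod 2^w */
            if (d >= mod / 2) {
                d -= mod;                             /* signed residue */
            }
            /* Subtract the digit so that the next w-1 digits are zero. */
            k = (d < 0) ? k + (uint64_t)(-d) : k - (uint64_t)d;
        }
        naf[t++] = (int8_t)d;
        k >>= 1;
    }
    return t;
}

/* Evaluate the sum of k_i 2^i to check a recoding. */
static int64_t wnaf_value(const int8_t naf[], int t) {
    int64_t v = 0;
    for (int i = t - 1; i >= 0; i--) {
        v = 2 * v + naf[i];
    }
    return v;
}
```

For k = 7 and w = 2 the digits, least significant first, are (−1, 0, 0, 1), that is 7 = 8 − 1. A larger w makes nonzero digits rarer at the cost of a bigger table of odd multiples.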
Elliptic curve arithmetic
Algorithm 12 Left-to-right multiple scalar multiplication
Input: w, k ∈ Z, l ∈ Z, P, Q ∈ E(F_{2^m}) of order r.
Output: kP + lQ.
1: Obtain the representations NAF_w(k) = Σ_{i=0}^{t−1} k_i 2^i, NAF_w(l) = Σ_{i=0}^{t−1} l_i 2^i
2: Compute P_i = iP, Q_i = iQ for i ∈ {1, 3, …, 2^{w−1} − 1}
3: R ← ∞
4: for i = t − 1 downto 0 do
5:   R ← 2R
6:   if k_i > 0 then
7:     R ← R + P_{k_i}
8:   else if k_i < 0 then
9:     R ← R − P_{−k_i}
10:   end if
11:   if l_i > 0 then
12:     R ← R + Q_{l_i}
13:   else if l_i < 0 then
14:     R ← R − Q_{−l_i}
15:   end if
16: end for
17: return R
Important: With endomorphism ψ, compute kP = k_1 P + k_2 ψ(P).
Elliptic curve arithmetic
Koblitz curves have the Frobenius automorphism $\tau$ on $E(\mathbb{F}_{2^m})$ given by $\tau(x, y) = (x^2, y^2)$. If it is possible to recode $k$ in another basis related to $\tau$, point doublings can be replaced by applications of $\tau$.
Important: Many other approaches exist for scalar multiplication and for other scenarios (fixed point, multiple points).
Elliptic curve arithmetic
Algorithm 13 Left-to-right τ-and-add scalar multiplication
Input: k ∈ Z, P ∈ E(F_{2^m}) of order r.
Output: kP.
1: Obtain the representation TNAF(k) = Σ_{i=0}^{t−1} u_i τ^i
2: Q ← ∞
3: for i = t − 1 downto 0 do
4:   Q ← τQ
5:   if u_i = 1 then
6:     Q ← Q + P
7:   else if u_i = −1 then
8:     Q ← Q − P
9:   end if
10: end for
11: return Q
Elliptic curve arithmetic
Algorithm 14 Left-to-right wτNAF scalar multiplication
Input: w, k ∈ Z, P ∈ E(F_{2^m}) of order r.
Output: kP.
1: Obtain the representation TNAF_w(k) = Σ_{i=0}^{t−1} u_i τ^i
2: Compute P_u = α_u P for u ∈ {1, 3, 5, …, 2^{w−1} − 1}, where α_u = u mod τ^w
3: Q ← ∞
4: for i = t − 1 downto 0 do
5:   Q ← τQ
6:   if u_i = α_j, for some j then
7:     Q ← Q + P_j
8:   else if u_i = −α_j, for some j then
9:     Q ← Q − P_j
10:   end if
11: end for
12: return Q
Elliptic curve arithmetic
Algorithm 15 Constant-time point multiplication.
Input: k = Σ_{i=0}^{t−1} k_i 2^i ∈ Z, P = (x, y) ∈ E(F_{2^m}), b-coefficient.
Output: kP ∈ E(F_{2^m}).
1: x1 ← x, z1 ← 1, z2 ← x^2, x2 ← z2^2 + b
2: for i ← t − 2 downto 0 do
3:   r1 ← x1 z2, r2 ← x2 z1, r3 ← r1 + r2, r4 ← r1 r2
4:   if k_i ≠ 0 then
5:     z1 ← r3^2, r1 ← x z1, x1 ← r1 + r4, r1 ← z2^2, r2 ← x2^2
6:     z2 ← r1 r2, x2 ← r2^2, r1 ← r1^2, r2 ← b r1, x2 ← x2 + r2
7:   else
8:     z2 ← r3^2, r1 ← x z2, x2 ← r1 + r4, r1 ← z1^2, r2 ← x1^2
9:     z1 ← r1 r2, x1 ← r2^2, r1 ← r1^2, r2 ← b r1, x1 ← x1 + r2
10:   end if
11: end for
12: return Q = (x3, y3) from (x1/z1, x2/z2)
Experimental results – Elliptic curve arithmetic
Table: Timings given in 10^3 cycles for side-channel resistant scalar multiplication.
Curve                              | 10^3 cycles
This work:
CURVE2251 - Core 2                 | 594
CURVE2251 - CLMUL                  | 282
CURVE2251 - CLMUL + AVX            | 225
Related work:
BBE (Bernstein) - Core 2           | 314
eBACS (mpFq) - Core 2              | 855
4-GLV-GLS TED (Longa) - Core i7    | 137
Elliptic curve arithmetic
Briefly recall that Koblitz curves have the Frobenius automorphism $\tau$ on $E(\mathbb{F}_{2^m})$ given by $\tau(x, y) = (x^2, y^2)$.
Computing fixed powers $2^k$ in constant time (independent of $k$) provides an endomorphism in the context of the GLV method.
Let the map $\psi \equiv \tau^{\lfloor m/2 \rfloor}$, giving $kP = k_1 P + 2^{\lfloor m/2 \rfloor} k_2 P = k_1 P + k_2 \psi(P)$.
Interleaving saves $\lfloor m/2 \rfloor$ applications of the Frobenius.
In general, exploiting the $\lfloor m/s \rfloor$-th power of $\tau$ is the analogue of an $s$-dimensional GLV decomposition and saves $(s - 1) \lfloor m/s \rfloor$ applications of the Frobenius.
Note: Tables can be reused for fast Itoh-Tsujii inversion.
Algorithm 16 Interleaved width-w τNAF scalar multiplication.
Input: k ∈ Z, P ∈ E(F_{2^m}), integer s denoting the interleaving factor.
Output: kP ∈ E(F_{2^m}).
1: Compute width-w τ-NAF(k) = Σ_{i=0}^{l−1} u_i τ^i
2: Compute P_{0,u} = α_u P, for u ∈ {1, 3, 5, …, 2^{w−1} − 1}
3: for i ← 1 to (s − 1) do Compute P_{i,u} = τ^{⌊m/s⌋} P_{i−1,u}
4: Q ← ∞
5: for i ← l − 1 downto s⌊m/s⌋ do
6:   Q ← τQ
7:   if u_i ≠ 0 then
8:     Let u be such that α_u = u_i or α_{−u} = −u_i
9:     if u_i > 0 then Q ← Q + P_{0,u}; else Q ← Q − P_{0,u}
10:   end if
11: end for
12: for i ← (⌊m/s⌋ − 1) downto 0 do
13:   Q ← τQ
14:   for j ← 0 to (s − 1) do
15:     if u_{i+j⌊m/s⌋} ≠ 0 then
16:       Let u be such that α_u = u_{i+j⌊m/s⌋} or α_{−u} = −u_{i+j⌊m/s⌋}
17:       if u_{i+j⌊m/s⌋} > 0 then Q ← Q + P_{j,u}; else Q ← Q − P_{j,u}
18:     end if
19:   end for
20: end for
21: return Q = (x, y)
Elliptic curve arithmetic
Table: Timings given in 10^3 cycles for unprotected scalar multiplication.
Curve                                   | 10^3 cycles
This work:
NISTK283 - CLMUL (w = 5, s = 2)         | 128
NISTK283 - CLMUL + AVX (w = 5, s = 2)   | 99
Related work:
4-GLV-GLS TED (Longa) - Sandy Bridge    | 91
Bilinear pairings
Let $G_1 = \langle P \rangle$ and $G_2 = \langle Q \rangle$ be additive groups and $G_T$ be a multiplicative group such that $|G_1| = |G_2| = |G_T|$ = prime $n$. An efficiently computable map $e : G_1 \times G_2 \to G_T$ is an admissible bilinear map if the following properties are satisfied:
1. Bilinearity: given $(V, W) \in G_1 \times G_2$ and $(a, b) \in \mathbb{Z}_q^*$: $e(aV, bW) = e(V, W)^{ab} = e(abV, W) = e(V, abW)$.
2. Non-degeneracy: $e(P, Q) \ne 1_{G_T}$, where $1_{G_T}$ is the identity of the group $G_T$.
Bilinear pairings
[Picture: Avanzi, Cesena 2009]
Bilinear pairings
If G1 = G2 , the pairing is symmetric. [Picture: Avanzi, Cesena 2009]
Example of protocol
Non-interactive ID-based key distribution protocol [SOK 2000]:
- A trusted authority generates master key $s$;
- A user $i$ receives $ID_i$, $\langle P_i = h(ID_i), S_i = sP_i \rangle$;
- Users A and B can derive the same key $e(S_A, P_B) = e(S_B, P_A) = e(P_A, P_B)^s$.
Important: Underlying problem: BCDHP: Compute $e(P, Q)^{abc}$ from $\langle P, aP, bP, cP, Q, aQ, bQ, cQ \rangle$.
Pairing computation
Let $P, Q$ be $r$-torsion points. The pairing $e(P, Q)$ is defined by the evaluation of $f_{r,P}$ at a divisor related to $Q$.
[Miller 1986] constructed $f_{r,P}$ in stages combining Miller functions evaluated at divisors.
[Barreto et al. 2002] showed how to evaluate $f_{r,P}$ at $Q$ using the final exponentiation employed by the Tate pairing.
Pairing computation
Let $g_{U,V}$ be the line equation through points $U, V \in E(\mathbb{F}_{q^k})$ and $g_U$ the shorthand for $g_{U,-U}$. For any integers $a$ and $b$, we have:
1. $f_{a+b,P}(D) = f_{a,P}(D) \cdot f_{b,P}(D) \cdot \dfrac{g_{aP,bP}(D)}{g_{(a+b)P}(D)}$;
2. $f_{2a,P}(D) = f_{a,P}(D)^2 \cdot \dfrac{g_{aP,aP}(D)}{g_{2aP}(D)}$;
3. $f_{a+1,P}(D) = f_{a,P}(D) \cdot \dfrac{g_{aP,P}(D)}{g_{(a+1)P}(D)}$.
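As a short worked step (a sketch, assuming the usual normalization $f_{1,P} = 1$, which the slide does not state explicitly), properties 2 and 3 are the special cases $b = a$ and $b = 1$ of property 1:

```latex
\begin{align*}
b = a:\quad f_{2a,P}(D) &= f_{a,P}(D)\cdot f_{a,P}(D)\cdot\frac{g_{aP,aP}(D)}{g_{2aP}(D)}
                         = f_{a,P}(D)^2\cdot\frac{g_{aP,aP}(D)}{g_{2aP}(D)},\\
b = 1:\quad f_{a+1,P}(D) &= f_{a,P}(D)\cdot f_{1,P}(D)\cdot\frac{g_{aP,P}(D)}{g_{(a+1)P}(D)}
                         = f_{a,P}(D)\cdot\frac{g_{aP,P}(D)}{g_{(a+1)P}(D)}.
\end{align*}
```

These two cases are the doubling and addition steps of a double-and-add evaluation of $f_{r,P}$.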
Pairing computation
Algorithm 17 Miller's Algorithm [Miller 1986, Barreto et al. 2002].
Input: $r = \sum_{i=0}^{\log_2 r} r_i 2^i$, $P$, $Q$.
Output: $e_r(P, Q)$.
 1: $T \leftarrow P$
 2: $f \leftarrow 1$
 3: $r \leftarrow r - 1$
 4: for $i = \lfloor \log_2(r) \rfloor - 1$ downto $0$ do
 5:   $f \leftarrow f^2 \cdot l_{T,T}(Q)$
 6:   $T \leftarrow 2T$
 7:   if $r_i = 1$ then
 8:     $f \leftarrow f \cdot l_{T,P}(Q)$
 9:     $T \leftarrow T + P$
10:   end if
11: end for
12: return $f^{(q^k - 1)/r}$
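The loop structure can be tried on a toy example. The sketch below is not the binary-field setting of these slides: as an assumption made for brevity it swaps in the prime-field supersingular curve $y^2 = x^3 + x$ over $\mathbb{F}_{59}$ (embedding degree 2, subgroup order $r = 5$) with the distortion map $\psi(x, y) = (-x, iy)$, $i^2 = -1$. All names and parameters are hypothetical, and vertical-line denominators are omitted because the final exponentiation removes them.

```python
p = 59   # toy prime with p % 4 == 3, so y^2 = x^3 + x has p + 1 = 60 points
r = 5    # prime subgroup order dividing p + 1; embedding degree k = 2


def slope(T, S):
    """Slope of the tangent at T (if T == S) or of the chord through T and S."""
    (x1, y1), (x2, y2) = T, S
    if T == S:
        return (3 * x1 * x1 + 1) * pow(2 * y1 % p, p - 2, p) % p
    return (y2 - y1) * pow((x2 - x1) % p, p - 2, p) % p


def ec_add(T, S):
    """Affine addition on the curve over F_p (None is the point at infinity)."""
    if T is None:
        return S
    if S is None:
        return T
    (x1, y1), (x2, y2) = T, S
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    lam = slope(T, S)
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


def ec_mul(k, T):
    """Double-and-add scalar multiplication."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, T)
        T = ec_add(T, T)
        k >>= 1
    return R


def fp2_mul(u, v):
    """Multiply a + b*i and c + d*i in F_{p^2} = F_p[i]/(i^2 + 1)."""
    return ((u[0] * v[0] - u[1] * v[1]) % p, (u[0] * v[1] + u[1] * v[0]) % p)


def fp2_pow(u, e):
    """Square-and-multiply exponentiation in F_{p^2}."""
    result = (1, 0)
    while e:
        if e & 1:
            result = fp2_mul(result, u)
        u = fp2_mul(u, u)
        e >>= 1
    return result


def line(T, S, Q):
    """Evaluate the line through T and S at psi(Q) = (-xQ, i*yQ)."""
    lam = slope(T, S)
    (x1, y1), (xq, yq) = T, Q
    return ((lam * (xq + x1) - y1) % p, yq % p)


def tate(P, Q):
    """Miller loop over the bits of r - 1, then the final exponentiation."""
    f, T = (1, 0), P
    for bit in bin(r - 1)[3:]:            # skip '0b' and the leading 1 bit
        f = fp2_mul(fp2_mul(f, f), line(T, T, Q))
        T = ec_add(T, T)
        if bit == "1":
            f = fp2_mul(f, line(T, P, Q))
            T = ec_add(T, P)
    return fp2_pow(f, (p * p - 1) // r)


def find_point():
    """Search for a point of order r by clearing the cofactor (p + 1) / r."""
    for x in range(1, p):
        rhs = (x ** 3 + x) % p
        y = pow(rhs, (p + 1) // 4, p)
        if y * y % p == rhs:
            P = ec_mul((p + 1) // r, (x, y))
            if P is not None:
                return P


P = find_point()
Q = ec_mul(2, P)
e = tate(P, Q)
```

Subtracting 1 from $r$ before the loop, as in line 3 of the algorithm, avoids the last addition step, whose line through $(r-1)P$ and $P$ is vertical. Bilinearity can then be checked numerically, for example by comparing `tate(ec_mul(2, P), Q)` with `fp2_pow(e, 2)`.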
Related work
Scalable approaches: [Mitsunari 2009] and [Beuchat et al. 2009] precompute pairs $(T_i, \text{part of } l_{T_i,T_i}(Q))$ in the symmetric case and divide loop iterations among processors. Problem: High storage costs (large precomputation).
New approach
Property of Miller functions: $f_{a \cdot b,P}(D) = f_{b,P}(D)^a \cdot f_{a,bP}(D)$.
We can write $r = 2^w r_1 + r_0$ and compute $f_{r,P}(D)$:
$$f_{r,P}(D) = f_{2^w r_1 + r_0,P}(D) = f_{r_1,P}(D)^{2^w} \cdot f_{2^w,r_1P}(D) \cdot f_{r_0,P}(D) \cdot \frac{g_{(2^w r_1)P,\,r_0P}(D)}{g_{rP}(D)}.$$
If $r$ has low Hamming weight, $w$ can be chosen so that $r_0$ is small.
For many processors, we can apply the formula recursively: write $r$ as $r = 2^{w_i} r_i + \cdots + 2^{w_2} r_2 + 2^{w_1} r_1 + r_0$.
If $P$ is fixed (private key), $r_i P$ can also be precomputed.
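The decomposition can be checked in two steps. This is a sketch that uses only the product property stated on this slide and the additive property $f_{a+b,P} = f_{a,P} \cdot f_{b,P} \cdot g_{aP,bP} / g_{(a+b)P}$ of Miller functions:

```latex
\begin{align*}
f_{2^w r_1 + r_0,P}(D) &= f_{2^w r_1,P}(D)\cdot f_{r_0,P}(D)\cdot
    \frac{g_{(2^w r_1)P,\,r_0P}(D)}{g_{rP}(D)}
    && \text{(additive property, } a = 2^w r_1,\ b = r_0\text{)}\\
f_{2^w r_1,P}(D) &= f_{r_1,P}(D)^{2^w}\cdot f_{2^w,\,r_1P}(D)
    && \text{(product property, } a = 2^w,\ b = r_1\text{)}
\end{align*}
```

Substituting the second line into the first gives the formula above. Once the point $r_1 P$ is available, the factors $f_{r_1,P}$ and $f_{2^w,r_1P}$ no longer depend on each other, which is what makes it possible to assign them to different processors.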
Load balancing
Problem: We must determine an optimal partition $w_i$. Let $c_1(1)$ be the cost of a serial loop and $c_\pi(i)$ the cost of a parallel loop for processor $1 \le i \le \pi$. We can count the operations executed by each processor and solve the system $c_\pi(1) = c_\pi(i)$ to obtain $w_i$. The speedup is:
$$s(\pi) = \frac{c_1(1) + exp}{c_\pi(1) + par + exp},$$
where $par$ is the cost of parallelization and $exp$ is the cost of the final exponentiation.
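To make the balancing step concrete, here is a minimal sketch under a hypothetical linear cost model that is not taken from the slides: processor $i$ pays `jump_cost` per skipped iteration to reach its starting point $w_i$, plus `iter_cost` per loop iteration in $[w_i, w_{i+1})$. Equating all processor costs then yields a linear recurrence for the boundaries. Function names, parameters and the numeric values are assumptions.

```python
def balance(total_iters, processors, jump_cost, iter_cost):
    """Return fractional boundaries w_1..w_{pi+1} and the common per-processor cost.

    Assumed cost model: cost_i = jump_cost * w_i + iter_cost * (w_{i+1} - w_i).
    Setting cost_i equal to a common value C gives
    w_{i+1} = w_i * (1 - jump_cost / iter_cost) + C / iter_cost,
    and C follows from requiring w_{pi+1} = total_iters.
    """
    q = 1.0 - jump_cost / iter_cost
    common = total_iters * iter_cost / sum(q ** k for k in range(processors))
    bounds = [0.0]
    for _ in range(processors):
        bounds.append(bounds[-1] * q + common / iter_cost)
    return bounds, common


def speedup(c_serial, c_parallel, par, exp):
    """Speedup s(pi) = (c_1(1) + exp) / (c_pi(1) + par + exp)."""
    return (c_serial + exp) / (c_parallel + par + exp)


# Hypothetical numbers: 600 iterations, 4 processors, cheap jumps.
bounds, cost = balance(total_iters=600, processors=4, jump_cost=0.05, iter_cost=1.0)
estimate = speedup(c_serial=600 * 1.0, c_parallel=cost, par=5.0, exp=40.0)
```

In this model later processors receive slightly fewer iterations, because they spend more on reaching their starting point. Real partitions must be rounded to integers and the costs measured on the target machine.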
Symmetric case – Elliptic curves
A pairing-friendly supersingular binary elliptic curve is the set of solutions $(x, y) \in \mathbb{F}_{2^m} \times \mathbb{F}_{2^m}$ satisfying the equation $y^2 + y = x^3 + x + b$, where $b \in \{0, 1\}$, and a point at infinity $\infty$.
The order of this curve is $N = 2^m + 1 \pm 2^{(m+1)/2}$ and the embedding degree is $k = 4$ (the least integer such that $N$ divides $2^{km} - 1$).
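The embedding degree claim is easy to check numerically. The sketch below assumes an odd extension degree; the value $m = 239$ and the function names are hypothetical choices for illustration. It relies on the identity $(2^m + 1 + 2^{(m+1)/2})(2^m + 1 - 2^{(m+1)/2}) = 2^{2m} + 1$, which divides $2^{4m} - 1$.

```python
def curve_orders(m):
    """Both candidate orders N = 2^m + 1 +/- 2^((m+1)/2) for odd m."""
    t = 1 << ((m + 1) // 2)
    return ((1 << m) + 1 + t, (1 << m) + 1 - t)


def embedding_degree(order, m, limit=8):
    """Least k <= limit such that order divides 2^(k*m) - 1, or None."""
    for k in range(1, limit + 1):
        if ((1 << (k * m)) - 1) % order == 0:
            return k
    return None


m = 239  # hypothetical odd extension degree
n_plus, n_minus = curve_orders(m)
degrees = (embedding_degree(n_plus, m), embedding_degree(n_minus, m))
```

For this choice both orders give embedding degree 4, so pairing values live in $\mathbb{F}_{2^{4m}}$.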
Symmetric case – Pairing definition
Choosing $T = 2^m - N$ and a prime $r$ dividing $N$, [Barreto et al. 2004] defined the reduced $\eta_T$ pairing:
$$\eta_T : E(\mathbb{F}_{2^m})[r] \times E(\mathbb{F}_{2^m})[r] \to \mathbb{F}^*_{2^{4m}}, \qquad \eta_T(P, Q) = f_{T',P'}(\psi(Q))^{(2^{4m} - 1)/N},$$
where $T' = \pm T$ and $P' = \pm P$. The function $f$ is a Miller function and $\psi$ is the distortion map $\psi(x, y) = (x + s^2, y + sx + t)$.
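Symmetric case – Pairing algorithm
Algorithm 18 $\eta_T$ pairing [Barreto et al. 2004], [Beuchat et al. 2008].
Input: $P = (x_P, y_P)$, $Q = (x_Q, y_Q) \in E(\mathbb{F}_{2^m})[r]$.
Output: $\eta_T(P, Q) \in \mathbb{F}^*_{2^{4m}}$.
 1: $y_P \leftarrow y_P + 1 - \delta$
 2: $u \leftarrow x_P + \alpha$, $v \leftarrow x_Q + \alpha$
 3: $g_0 \leftarrow u \cdot v + y_P + y_Q + \beta$
 4: $g_1 \leftarrow u + x_Q$, $g_2 \leftarrow v + x_P^2$
 5: $G \leftarrow g_0 + g_1 s + t$
 6: $L \leftarrow (g_0 + g_2) + (g_1 + 1)s + t$
 7: $F \leftarrow L \cdot G$
 8: for $i \leftarrow 1$ to $\frac{m-1}{2}$ do
 9:   $x_P \leftarrow \sqrt{x_P}$, $y_P \leftarrow \sqrt{y_P}$, $x_Q \leftarrow x_Q^2$, $y_Q \leftarrow y_Q^2$
10:   $u \leftarrow x_P + \alpha$, $v \leftarrow x_Q + \alpha$
11:   $g_0 \leftarrow u \cdot v + y_P + y_Q + \beta$
12:   $g_1 \leftarrow u + x_Q$
13:   $G \leftarrow g_0 + g_1 s + t$
14:   $F \leftarrow F \cdot G$
15: end for
16: return $F^{(2^{2m} - 1)(2^m + 1 \pm 2^{(m+1)/2})}$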
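Symmetric case – Parallel pairing
Algorithm 19 Proposed parallel $\eta_T$ pairing.
Input: $P = (x_P, y_P)$, $Q = (x_Q, y_Q) \in E(\mathbb{F}_{2^m})[r]$.
Output: $\eta_T(P, Q) \in \mathbb{F}^*_{2^{4m}}$.
 1: parallel section (processor $i$)
 2:   if $i = 1$ then Initialize $F_1$ as in lines 1-7 of the previous algorithm;
 3:   else $F_i \leftarrow 1$
 4:   $x_{P_i} \leftarrow (x_P)^{1/2^{w_i}}$, $y_{P_i} \leftarrow (y_P)^{1/2^{w_i}}$, $x_{Q_i} \leftarrow (x_Q)^{2^{w_i}}$, $y_{Q_i} \leftarrow (y_Q)^{2^{w_i}}$
 5:   for $j \leftarrow w_i$ to $w_{i+1} - 1$ do
 6:     $x_{P_i} \leftarrow \sqrt{x_{P_i}}$, $y_{P_i} \leftarrow \sqrt{y_{P_i}}$, $x_{Q_i} \leftarrow x_{Q_i}^2$, $y_{Q_i} \leftarrow y_{Q_i}^2$
 7:     $u_i \leftarrow x_{P_i} + \alpha$, $v_i \leftarrow x_{Q_i} + \alpha$
 8:     $g_{0_i} \leftarrow u_i \cdot v_i + y_{P_i} + y_{Q_i} + \beta$
 9:     $g_{1_i} \leftarrow u_i + x_{Q_i}$
10:     $G_i \leftarrow g_{0_i} + g_{1_i} s + t$
11:     $F_i \leftarrow F_i \cdot G_i$
12:   end for
13:   $F \leftarrow \prod_{i=1}^{\pi} F_i$
14: end parallel
15: return $F^M$
Note: If memory is available, use partitions with the same size.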
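The partitioning idea can be sketched with a toy loop. As an assumption made for illustration, the sketch replaces the binary-field operations with arithmetic modulo a prime: the state is advanced by repeated squaring, each iteration contributes one factor, and a worker jumps directly to its starting iteration with one exponentiation, in the spirit of line 4. The modulus, the constant `ALPHA`, the bounds and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

P_MOD = 2 ** 61 - 1   # toy prime modulus (assumption, not a binary field)
ALPHA = 7             # stand-in for the curve-dependent constant


def partial_product(x, start, stop):
    """Product of the loop factors for iterations start..stop-1."""
    x = pow(x, 1 << start, P_MOD)      # jump to iteration `start`: x^(2^start)
    f = 1
    for _ in range(start, stop):
        f = f * ((x + ALPHA) % P_MOD) % P_MOD
        x = x * x % P_MOD              # advance the state by one squaring
    return f


def serial(x, iterations):
    """Reference result computed by a single worker."""
    return partial_product(x, 0, iterations)


def parallel(x, bounds):
    """Split the loop at `bounds` and multiply the partial products together."""
    ranges = list(zip(bounds, bounds[1:]))
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        parts = pool.map(lambda rng: partial_product(x, rng[0], rng[1]), ranges)
        result = 1
        for part in parts:
            result = result * part % P_MOD
    return result


combined = parallel(3, [0, 25, 50, 75, 100])
reference = serial(3, 100)
```

Each worker needs only the original input and its own boundaries, so no per-iteration precomputed table is required. Python threads illustrate the structure only and will not show the speedups reported in these slides.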
Experimental results – Speedup (45nm)
[Figure: speedup as a function of the number of processors (up to 60), comparing Beuchat et al. 2009 and Aranha et al. 2010.]
Experimental results – Latency (45nm)
[Figure: latency in millions of cycles as a function of the number of threads. Beuchat et al. 2009: 23.03, 13.14, 9.08 and 8.93 with 1, 2, 4 and 8 threads. Aranha et al. 2010: 17.40, 9.34, 5.08 and 3.02 with 1, 2, 4 and 8 threads.]
Conclusions
New formulation and implementation of binary field arithmetic:
  Follows trend of faster shuffle instructions
  Improves results from related work by 8%-84%
  Induces a new implementation strategy for multiplication
  Still requires architectural features to be optimal
  May be cheaper to support than a full native multiplier
Timings for non-batched arithmetic on binary elliptic curves:
  Provide new speed record for side-channel resistant scalar multiplication on binary curves
  Improve results for kP on eBACS by at least 27%-30%
Conclusions
New state-of-the-art for parallel implementation of pairings:
  No significant storage costs, smaller precomputation;
  In comparison with our serial implementation, speedups of 46%, 70% and 83% with 2, 4 and 8 cores;
  In comparison with previous state-of-the-art, improvements in latency of 24%, 29%, 44% and 66% with 1, 2, 4 and 8 cores.
Parallelization scales:
  In the covered case, point doublings and extension field squarings are efficient;
  Our finite field implementation makes these exceptionally fast.
Detailed results
                                   Number of threads
Platform 1 – Intel Core 2 65nm     1        2        4        8*
Hankerson et al. – latency         39       –        –        –
Beuchat et al. – latency           26.86    16.13    10.13    –
Beuchat et al. – speedup           1        1.67     2.65     –
This work – latency                18.76    10.08    5.72     3.55
This work – speedup                1        1.86     3.28     5.28
Improvement                        30.2%    32.9%    39.9%    –

Platform 2 – Intel Core 2 45nm     1        2        4        8
Beuchat et al. – latency           23.03    13.14    9.08     8.93
Beuchat et al. – speedup           1        1.77     2.54     2.58
This work – latency                17.40    9.34     5.08     3.02
This work – speedup                1        1.86     3.42     5.76
Improvement                        24.4%    28.9%    44.0%    66.2%

Platform 3: Intel Core i7 32nm     1        2        4        8*
This work – latency                6.46     3.37     1.79     1.03
This work – speedup                1.00     1.92     3.60     6.24
Improvement                        62.3%    63.9%    64.8%    65.9%

Table: Timings are reported in millions of cycles.
Sources
[1] D. F. Aranha, J. López, D. Hankerson, "High-speed parallel software implementation of the ηT pairing", In CT-RSA 2010, Springer LNCS 5985, pp. 89–105, San Francisco, USA, 2010.
[2] D. F. Aranha, J. López, D. Hankerson, "Efficient Software Implementation of Binary Field Arithmetic Using Vector Instruction Sets", In LATINCRYPT 2010, Springer LNCS 6212, pp. 144–161, Puebla, Mexico, 2010.
[3] J. Taverne, A. Faz-Hernández, D. F. Aranha, F. Rodríguez-Henríquez, D. Hankerson, J. López, "Software implementation of binary elliptic curves: impact of the carry-less multiplier on scalar multiplication", In CHES 2011, Springer LNCS 6917, pp. 108–123, Nara, Japan, 2011.
[4] J. Taverne, A. Faz-Hernández, D. F. Aranha, F. Rodríguez-Henríquez, D. Hankerson, J. López, "Speeding scalar multiplication over binary elliptic curves using the new carry-less multiplication instruction", Journal of Cryptographic Engineering, Vol. 1, Number 3, pp. 187–199, 2011.
[5] D. F. Aranha, E. Knapp, A. Menezes, F. Rodríguez-Henríquez, "Parallelizing the Weil and Tate pairings", In IMACC 2011, Springer LNCS 7089, pp. 275–295, Oxford, UK, 2011.
[6] D. F. Aranha, A. Faz-Hernández, J. López, F. Rodríguez-Henríquez, "Faster Implementation of Scalar Multiplication on Koblitz Curves", In LATINCRYPT 2012, Springer LNCS 7533, pp. 177–193, Santiago, Chile, 2012.