High-speed Parallel Software Implementation of the ηT Pairing

Diego F. Aranha¹*, Julio López¹, and Darrel Hankerson²
¹ University of Campinas, {dfaranha,jlopez}@ic.unicamp.br
² Auburn University, [email protected]

Abstract. We describe a high-speed software implementation of the ηT pairing over binary supersingular curves at the 128-bit security level. This implementation explores two types of parallelism found in modern multi-core platforms: vector instructions and multiprocessing. We first introduce novel techniques for implementing arithmetic in binary fields with vector instructions. We then devise a new parallelization of Miller's Algorithm to compute pairings. This parallelization provides an algorithm for pairing computation without increasing storage costs significantly. The combination of these acceleration techniques produces serial timings at least 24% faster and parallel timings 66% faster than the best previous result on an Intel Core platform, establishing a new state-of-the-art implementation of this pairing instantiation on this platform.

Key words: Efficient software implementation, vector instructions, multi-core architectures, bilinear pairings, parallelization.

1 Introduction

The computation of bilinear pairings is the most expensive operation in Pairing-Based Cryptography, especially at high security levels. For this reason, implementations must employ all the resources found in the target platform to obtain maximum efficiency. A resource being increasingly introduced in computing platforms is parallelism, in the form of vector instructions (data parallelism) and multiprocessing (task parallelism). This trend is observed even in the embedded space, with proposals of resource-constrained multi-core architectures and vector instruction sets for multimedia processing in portable devices. This work describes a high-performance implementation of the ηT pairing [1] over binary supersingular curves at the 128-bit security level which employs these two forms of parallelism in a very efficient way. The target platform is the Intel Core architecture [2], the most popular 64-bit computing platform. Our main contributions are:

* Supported by FAPESP under grant no. 2007/06950-0.

– Novel techniques for implementing arithmetic in binary fields: we explore powerful SIMD instructions to accelerate arithmetic in binary fields. We focus on the SSE family of vector instructions, but the same techniques can be employed with other SIMD instruction sets such as Altivec and the upcoming AMD SSE5.
– Parallelization of Miller's Algorithm to compute pairings: we develop a simple algorithm for parallel pairing computation which does not increase storage costs. Our parallelization is independent of the underlying pairing instantiation, allowing a parallel implementation to reach scalability in a variable number of processors unrelated to the pairing's mathematical definition. This parallelization provides good scalability in fields of small characteristic.
– Static load balancing technique: we present a simple technique to balance the costs of parallel pairing computation between the available processing units. The technique is successfully applied to latency minimization, but its flexibility allows the implementation to determine controlled non-optimal partitions of the algorithm.
– Experimental results: speedups of parallel implementations over serial implementations are estimated and experimentally verified on platforms with up to 8 processors. We also obtain an approximation of the performance with up to 32 processing units and compare our serial and parallel execution times with the current state-of-the-art implementations with the same parameters.

The results of this work can improve serial and parallel implementations of pairings. The parallelization may be important to reduce the latency of pairing computation in two scenarios: (i) desktop-class processors running real-time applications with strict response-time requirements; (ii) embedded multiprocessor architectures with weak processing units. The availability of parallel algorithms for application in these scenarios is suggested as an open problem by [3] and [4]. Our features of flexible load balancing and small storage overhead are critical for the second scenario, because they can support static scheduling schemes for compromises between pairing computation time and power consumption, and because memory capacity is commonly restricted in embedded devices.

2 Finite Field Arithmetic

In this section we represent the elements of F_{2^m} using a polynomial basis. Let f(z) be an irreducible binary polynomial of degree m. The elements of F_{2^m} are the binary polynomials of degree at most m − 1. A field element a(z) = Σ_{i=0}^{m−1} a_i·z^i is associated with the binary vector a = (a_{m−1}, ..., a_1, a_0) of length m. In a software implementation, these bit coefficients are packed and stored in an array (a[0], ..., a[n−1]) of n W-bit words, where W is the word size of the processor. For simplicity, we assume that n is always even.

2.1 Vector Instruction Sets

Vector instructions, also called SIMD (Single Instruction, Multiple Data) instructions because they operate on several data objects simultaneously, are widely supported

in recent families of processor architectures. The number, functionality and efficiency of these instructions have been improved with each new generation of processors, and natural applications include multimedia processing, scientific applications or any software with high arithmetic density. Some well-known SIMD instruction sets are the Intel MMX and SSE [5] families, the Altivec extensions introduced by Apple and IBM in the Power architecture specification, and AMD 3DNow. Instruction sets supported by current technology are restricted to 128-bit registers and provide simple orthogonal operations across 8, 16, 32 or 64-bit data units stored inside these registers, but future extensions such as Intel AVX and AMD SSE5 will support 256-bit registers with the added inclusion of a heavily-anticipated carry-less multiplier [6].

The Intel Core microarchitecture is equipped with several vector instruction sets which operate on 16 architectural 128-bit registers. A small subset of these instructions can be used to implement binary field arithmetic, some found in the Streaming SIMD Extensions 2 (SSE2) and others in the Supplementary SSE3 instructions (SSSE3). The SSE2 instruction set is also supported by the recent VIA Nano processors, AMD processors since the K8 family and Intel processors since the Pentium 4. A non-exhaustive list of SSE2 instructions relevant for our work is given below. Each instruction described will be referred to in the algorithms by the short mnemonic which follows the instruction opcode:

– MOVDQU/MOVDQA (load/store): implements load/store between unaligned/aligned memory addresses and registers. In our implementation, all allocated memory is stored at 128-bit aligned base addresses so that the faster MOVDQA instruction can always be used.
– PSLLQ/PSRLQ (≪, ≫): implements bitwise left/right shifts of a pair of 64-bit integers while shifting in zero bits. This instruction does not propagate bits from the lower 64-bit integer to the higher 64-bit integer; thus additional shifts and additions are required to implement bitwise shifts of 128-bit values (a sketch of this emulation follows the list below).
– PSLLDQ/PSRLDQ (≪₈, ≫₈): implements byte-wise left/right shifts of a 128-bit register. Since this instruction propagates bytes from the lower half to the higher half of a 128-bit register, it is preferred over the previous one when the shift amount is a multiple of 8; thus shifts by multiples of 8 bits should be used whenever possible. The latency of this instruction is 2 cycles in the first generation of Core 2 Conroe/Merom (65nm) processors and 1 cycle in the more recent Penryn/Wolfdale (45nm) microarchitecture.
– PXOR/PAND/POR (⊕, ∧, ∨): implements bitwise XOR/AND/OR of two 128-bit registers. These instructions have a high throughput, reaching 3 instructions per cycle when the operands are registers and there are no dependencies between consecutive operations.
– PUNPCKLBW/PUNPCKHBW (interlo/interhi): interleaves the lower/higher bytes in a register with the lower/higher bytes of another register.
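As an illustration of the extra work PSLLQ/PSRLQ require, the following minimal sketch (our own helper, written with compiler intrinsics, not code from the implementation) emulates a full 128-bit left shift by s bits, 0 < s < 64; the bits crossing the 64-bit lane boundary are recovered with an opposite shift and a byte-wise shift:

#include <emmintrin.h>

/* Bitwise left shift of a 128-bit value by s (0 < s < 64) using SSE2 only.
 * PSLLQ shifts the two 64-bit lanes independently, so the bits that would
 * cross the lane boundary are recovered with PSRLQ and moved into the
 * upper lane with an 8-byte PSLLDQ. */
static inline __m128i shl128(__m128i a, int s) {
    __m128i hi = _mm_slli_epi64(a, s);          /* each lane << s      */
    __m128i cross = _mm_srli_epi64(a, 64 - s);  /* bits crossing lanes */
    cross = _mm_slli_si128(cross, 8);           /* move to upper lane  */
    return _mm_or_si128(hi, cross);
}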

We also find application for powerful but often-missed SSSE3 instructions:

– PALIGNR (◁): takes registers ra and rb, concatenates their values, and extracts a 128-bit section starting at an offset given by a constant immediate; in other words, it implements a right byte-wise shift with propagation of shifted-out bytes from ra into rb. This instruction can be used to implement a left shift by s bytes with the immediate (16 − s).
– PSHUFB (lookup or shuffle, depending on functionality): takes registers of bytes ra = a0, a1, ..., a15 and rb = b0, b1, ..., b15 and replaces ra with the permutation ab0, ab1, ..., ab15, except that it replaces ai with zero if the most significant bit of bi is set. A powerful use of this instruction is to perform 16 simultaneous lookups in a 16-byte lookup table. This can easily be done by storing the lookup table in ra and the lookup indexes in rb (a one-line intrinsics sketch follows this list). Intel introduced a dedicated Super Shuffle Engine in the latest microarchitecture to reduce the latency of this instruction from 3 cycles to 1 cycle.
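As a minimal illustration (ours, not the paper's code), the 16-way table lookup maps directly to a single intrinsic:

#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 = PSHUFB */

/* 16 simultaneous lookups in a 16-byte table: each byte of idx (values
 * 0..15) selects a byte of table; any byte of idx with its most
 * significant bit set yields zero instead. */
static inline __m128i lookup16(__m128i table, __m128i idx) {
    return _mm_shuffle_epi8(table, idx);
}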

Alternate vector instruction sets provide functional analogues of these instructions. In particular, the PSHUFB permutation instruction is implemented as VPERM in Altivec and as PPERM in SSE5, although the PPERM instruction is reportedly more powerful, as it can also operate at the bit level. SIMD instructions are critical for the performance of binary field arithmetic and can be easily accessed with compiler intrinsics. In the remainder of this section, the optimization techniques applied during the implementation of each field operation are detailed. We describe algorithms in terms of vector operations using the mnemonics defined above, so that the algorithms can be easily transcribed to other target platforms. Specific instruction choices based on latency or functionality will be focused on the SSE family.

2.2 Squaring

Since the square of a finite field element a(z) ∈ F_{2^m} is given by a(z)^2 = Σ_{i=0}^{m−1} a_i·z^{2i} = a_{m−1}·z^{2m−2} + · · · + a_2·z^4 + a_1·z^2 + a_0, the binary representation of a(z)^2 can be computed by inserting a zero bit between each pair of consecutive bits of the binary representation of a(z). This operation can be accelerated by introducing a lookup table, as discussed in [7]. The method can be improved further if the table lookups are executed simultaneously: for an implementation which processes 4 bits per iteration, squaring can be implemented mainly in terms of permutation instructions which convert groups of 4 bits (nibbles) to the corresponding expanded bytes. The proposed optimization is shown in Algorithm 1. The algorithm receives a field element a stored in a vector of n 64-bit words (or n/2 128-bit values) and expands the input into a double-precision vector t which can then be reduced modulo f(z). At each iteration of this algorithm, a 128-bit value a[2i] is loaded from memory and separated by a bit mask into two registers containing the low nibbles (aL) and the high nibbles (aH). Each group of nibbles is then expanded from 4 bits to 8 bits by a parallel table lookup. The proper order of bytes is restored by interleaving instructions which pick alternately the lower or higher bytes of aL and aH to form the two consecutive 128-bit values (t[2i], t[2i+1]) produced as the result.

Algorithm 1 Proposed implementation of squaring in F_{2^m}.
Input: a(z) = a[0..n−1].
Output: c(z) = c[0..n−1] = a(z)^2 mod f(z).
1: ▷ Store in table the squares u(z)^2 of all 4-bit polynomials u(z).
2: table ← 0x5554515045444140,0x1514111005040100
3: mask ← 0x0F0F0F0F0F0F0F0F,0x0F0F0F0F0F0F0F0F
4: for i ← 0 to n/2 − 1 do
5:   a_0 ← load(a[2i])
6:   a_L ← a_0 ∧ mask, a_L ← lookup(table, a_L)
7:   a_H ← a_0 ≫ 4, a_H ← a_H ∧ mask, a_H ← lookup(table, a_H)
8:   t[2i] ← store(interlo(a_L, a_H)), t[2i+1] ← store(interhi(a_L, a_H))
9: end for
10: return c = t mod f(z)
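A possible rendering of Algorithm 1's loop in C with SSE intrinsics is sketched below (our transcription for illustration only: function and variable names are ours, and the final reduction modulo f(z) is omitted):

#include <emmintrin.h>
#include <tmmintrin.h>

/* Expand n128 128-bit digits of a(z) into 2*n128 digits of a(z)^2 by
 * inserting a zero bit after every coefficient. The table maps a 4-bit
 * nibble u to the byte whose bits are those of u interleaved with zeros. */
void square_expand(__m128i *t, const __m128i *a, int n128) {
    const __m128i table = _mm_set_epi8(
        0x55, 0x54, 0x51, 0x50, 0x45, 0x44, 0x41, 0x40,
        0x15, 0x14, 0x11, 0x10, 0x05, 0x04, 0x01, 0x00);
    const __m128i mask = _mm_set1_epi8(0x0F);
    for (int i = 0; i < n128; i++) {
        __m128i a0 = _mm_load_si128(&a[i]);
        __m128i aL = _mm_and_si128(a0, mask);
        __m128i aH = _mm_and_si128(_mm_srli_epi64(a0, 4), mask);
        aL = _mm_shuffle_epi8(table, aL);    /* expand low nibbles  */
        aH = _mm_shuffle_epi8(table, aH);    /* expand high nibbles */
        _mm_store_si128(&t[2 * i], _mm_unpacklo_epi8(aL, aH));
        _mm_store_si128(&t[2 * i + 1], _mm_unpackhi_epi8(aL, aH));
    }
}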

2.3 Square Root

Given an element a(z) ∈ F_{2^m}, the field element c(z) such that c(z)^2 = a(z) mod f(z) can be computed as c(z) = a_even + √z · a_odd mod f(z), where a_even represents the concatenation of the even coefficients of a(z), a_odd represents the concatenation of the odd coefficients of a(z), and √z is a constant depending on the irreducible polynomial f(z) [8]. When f(z) is a suitable trinomial f(z) = z^m + z^t + 1 with odd exponents m, t, the constant √z has the sparse form √z = z^{(m+1)/2} + z^{(t+1)/2}, and multiplication by this constant can be computed with shifts and additions only. This algorithm can also be implemented with simultaneous table lookups. Algorithm 2 presents our implementation of this method with vector instructions. The algorithm processes 128 bits of a in each iteration and progressively separates the coefficients of a[2i] into even and odd coefficients. First, a permutation mask is used to divide a[2i] into bytes of odd index and bytes of even index. The bytes with even indexes are stored in the lower 64-bit part of a_0, and the bytes with odd indexes are stored in the higher 64-bit part of a_0. The high and low nibbles of a_0 are then divided into a_L and a_H, and additional lookup tables are applied to further separate the bits of a_L and a_H into bits with odd and even indexes. At the end of the 128-bit section, a_0 stores the interleaving of odd and even coefficients of a packed into groups of 4 bits. The remaining instructions in the 128-bit section separate the even and odd coefficients into u and v, which can be reordered and multiplied by √z inside the 64-bit section. We implement these final steps in 64-bit mode to avoid expensive shifts in 128-bit registers.
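For the trinomial adopted later in this paper, f(z) = z^1223 + z^255 + 1, substituting m = 1223 and t = 255 gives the concrete constant (a worked instance of the formula above):

    √z = z^{(m+1)/2} + z^{(t+1)/2} = z^612 + z^128,

which can be checked directly: (z^612 + z^128)^2 = z^1224 + z^256 = z·(z^1223 + z^255) ≡ z (mod f(z)). Multiplication by √z therefore amounts to two shifted additions.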

2.4 Multiplication

Two different strategies are commonly considered for the implementation of multiplication in F_{2^m}. The first one consists in applying the Karatsuba algorithm [9] to divide the multiplication into sub-problems and solve each one independently [7] (for a(z) = A_1·z^{⌈m/2⌉} + A_0 and b(z) = B_1·z^{⌈m/2⌉} + B_0):

    c(z) = a(z)·b(z) = A_1 B_1 z^m + [(A_1 + A_0)(B_1 + B_0) + A_1 B_1 + A_0 B_0] z^{⌈m/2⌉} + A_0 B_0.

Algorithm 2 Proposed implementation of square root in F_{2^m}.
Input: a(z) = a[0..n−1], exponents m and t of trinomial f(z).
Output: c(z) = c[0..n−1] = a(z)^{1/2} mod f(z).
1: ▷ Permutation mask to divide a 128-bit value in bytes with odd and even indexes.
2: perm ← 0x0F0D0B0907050301,0x0E0C0A0806040200
3: ▷ Table to divide a low nibble in bits with odd and even indexes.
4: sqrtL ← 0x3332232231302120,0x1312030211100100
5: ▷ Table to divide a high nibble in bits with odd and even indexes (sqrtL ≪ 2).
6: sqrtH ← 0xCCC88C88C4C08480,0x4C480C0844400400
7: ▷ Bit masks to isolate bytes in lower or higher nibbles.
8: maskL ← 0x0F0F0F0F0F0F0F0F,0x0F0F0F0F0F0F0F0F
9: maskH ← 0xF0F0F0F0F0F0F0F0,0xF0F0F0F0F0F0F0F0
10: c[0..n−1] ← 0, h ← ⌊(n+1)/2⌋, l ← ⌊(t+1)/128⌋, s_1 ← ((m+1)/2) mod 64, s_2 ← ((t+1)/2) mod 64
11: for i ← 0 to n/2 − 1 do
12:   a_0 ← load(a[2i]), a_0 ← shuffle(a_0, perm)
13:   a_L ← a_0 ∧ maskL, a_L ← lookup(sqrtL, a_L)
14:   a_H ← a_0 ∧ maskH, a_H ← a_H ≫ 4, a_H ← lookup(sqrtH, a_H)
15:   a_0 ← a_L ∨ a_H, a_L ← a_0 ∧ maskL, a_H ← a_0 ∧ maskH
16:   u ← store(a_L), v ← store(a_H)
17:   ▷ From now on, operate in 64-bit registers.
18:   a_even ← u[0] ∨ (u[1] ≪ 4), a_odd ← v[1] ∨ (v[0] ≫ 4)
19:   c[i] ← c[i] ⊕ a_even
20:   c[i + h − 1] ← c[i + h − 1] ⊕ (a_odd ≪ s_1)
21:   c[i + h] ← c[i + h] ⊕ (a_odd ≫ (64 − s_1))
22:   c[i + l] ← c[i + l] ⊕ (a_odd ≪ s_2)
23:   c[i + l + 1] ← c[i + l + 1] ⊕ (a_odd ≫ (64 − s_2))
24: end for
25: return c

The second one consists in applying a direct algorithm like the comb method proposed by López and Dahab in [10]. Conventionally, the series of additions involved in this method is implemented through additions over sub-parts of a double-precision vector. In order to reduce the number of memory accesses during these additions, we employ n registers. These registers simulate the series of memory additions by accumulating consecutive writes, allowing the implementation to reach maximum XOR throughput. We also employ an additional table T1, analogous to T0, which stores u(z) · (b(z) ≪ 4) to eliminate shifts by 4, as discussed in [10]; recall that shifts by multiples of 8 bits are faster in the target platform. We assume that the length of operand b[0..n−1] is at most 64n − 7 bits; if necessary, terms of higher degree can be processed separately at relatively low cost. The implemented LD multiplication algorithm is shown as Algorithm 3. The element a(z) is processed in groups of 8 bits separated by intervals of 128 bits. This avoids shifts of the register vector, since a 128-bit shift can be emulated by referencing m_{i+1} instead of m_i. The multiple-precision shift by 8 bits of the register vector (◁ 8) is implemented with 15-byte shifts with carry propagation (◁) of register pairs.

Algorithm 3 LD multiplication implemented with n 128-bit registers.
Input: a(z) = a[0..n−1], b(z) = b[0..n−1].
Output: c(z) = c[0..n−1].
Note: m_i denotes the vector of n/2 128-bit registers (r_{i−1+n/2}, ..., r_i).
1: Compute T0(u) = u(z)·b(z), T1(u) = u(z)·(b(z) ≪ 4) for all u(z) of degree < 4.
2: (r_{n−1}, ..., r_0) ← 0
3: for k ← 56 downto 0 by 8 do
4:   for j ← 1 to n − 1 by 2 do
5:     Let u = (u_3, u_2, u_1, u_0), where u_t is bit (k + t) of a[j].
6:     Let v = (v_3, v_2, v_1, v_0), where v_t is bit (k + t + 4) of a[j].
7:     m_{(j−1)/2} ← m_{(j−1)/2} ⊕ T0(u)
8:     m_{(j−1)/2} ← m_{(j−1)/2} ⊕ T1(v)
9:   end for
10:  (r_{n−1}, ..., r_0) ← (r_{n−1}, ..., r_0) ◁ 8
11: end for
12: for k ← 56 downto 0 by 8 do
13:   for j ← 0 to n − 2 by 2 do
14:     Let u = (u_3, u_2, u_1, u_0), where u_t is bit (k + t) of a[j].
15:     Let v = (v_3, v_2, v_1, v_0), where v_t is bit (k + t + 4) of a[j].
16:     m_{j/2} ← m_{j/2} ⊕ T0(u)
17:     m_{j/2} ← m_{j/2} ⊕ T1(v)
18:   end for
19:   if k > 0 then (r_{n−1}, ..., r_0) ← (r_{n−1}, ..., r_0) ◁ 8
20: end for
21: return c = (r_{n−1}, ..., r_0) mod f(z)
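For illustration, a plain word-level rendering of the comb method follows (a didactic sketch under our own conventions: 64-bit words instead of 128-bit registers, a single table instead of T0/T1, and no modular reduction):

#include <stdint.h>
#include <string.h>

/* c[0..2n-1] = a[0..n-1] * b[0..n-1] over F2[z]. Algorithm 3 follows the
 * same structure but keeps the accumulator in 128-bit registers and uses
 * the second table T1 = u(z)*(b(z) << 4) to avoid the shifts by 4. */
void ld_mul(uint64_t *c, const uint64_t *a, const uint64_t *b, int n) {
    uint64_t T[16][n + 1];                 /* T[u] = u(z)*b(z), deg(u) < 4 */
    memset(T, 0, sizeof(T));
    memcpy(T[1], b, n * sizeof(uint64_t));
    for (int u = 2; u < 16; u++) {
        if (u & 1) {                       /* T[u] = T[u-1] + b */
            for (int j = 0; j <= n; j++) T[u][j] = T[u - 1][j];
            for (int j = 0; j < n; j++)  T[u][j] ^= b[j];
        } else {                           /* T[u] = T[u/2] * z */
            T[u][n] = T[u / 2][n] << 1 | T[u / 2][n - 1] >> 63;
            for (int j = n - 1; j > 0; j--)
                T[u][j] = T[u / 2][j] << 1 | T[u / 2][j - 1] >> 63;
            T[u][0] = T[u / 2][0] << 1;
        }
    }
    memset(c, 0, 2 * n * sizeof(uint64_t));
    for (int k = 60; k >= 0; k -= 4) {     /* comb over 4-bit groups */
        for (int j = 0; j < n; j++) {      /* accumulate T[u] at word j */
            uint64_t u = (a[j] >> k) & 0xF;
            for (int i = 0; i <= n; i++) c[j + i] ^= T[u][i];
        }
        if (k > 0) {                       /* shift accumulator left by 4 */
            for (int i = 2 * n - 1; i > 0; i--)
                c[i] = c[i] << 4 | c[i - 1] >> 60;
            c[0] <<= 4;
        }
    }
}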

2.5 Modular Reduction

Efficient modular reduction depends on the format of the trinomial or pentanomial f(z). In general, it is better to choose f(z) such that the bitwise shift amounts are multiples of 8 bits. If the non-null coefficients of f(z) are located in the lower words of the array representation of f(z), consecutive writes into memory can also be accumulated into registers to avoid redundant memory writes. We illustrate these optimizations with modular reduction by f(z) = z^1223 + z^255 + 1 in Algorithm 4. The algorithm receives as input a vector of n 128-bit elements and reduces this vector by accumulating four memory writes at a time in registers. Note also that shifts by multiples of 8 bits are used whenever possible.

2.6 Inversion

For inversion in F_{2^m} we implemented a variant of the Extended Euclidean Algorithm for polynomials [7] in which the length of each temporary vector is tracked. Since this algorithm requires flexible left shifts by arbitrary amounts, we implemented the full algorithm in 64-bit mode. Some assembly, in the form of a compiler intrinsic, was used to efficiently count the number of leading 0 bits when determining the highest set bit.
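For example, the degree of an intermediate polynomial can be found with GCC's __builtin_clzll, which stands in here for the compiler intrinsic mentioned above (a minimal sketch of our own):

#include <stdint.h>

/* Degree of an n-word binary polynomial: scan for the top non-zero word
 * and locate its highest set bit with a count-leading-zeros intrinsic. */
static int degree(const uint64_t *a, int n) {
    for (int i = n - 1; i >= 0; i--)
        if (a[i] != 0)
            return 64 * i + 63 - __builtin_clzll(a[i]);
    return -1;  /* a(z) = 0 */
}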

Algorithm 4 Proposed modular reduction by f(z) = z^1223 + z^255 + 1.
Input: t(z) = t[0..n−1] (vector of 128-bit elements).
Output: c(z) = t(z) mod f(z) = c[0..n−1] (vector of 64-bit words).
Note: The accumulate function R(r3, r2, r1, r0, t) executes:
  s ← t ≫ 7, r3 ← t ≪ 57
  r3 ← r3 ⊕ (s ≪₈ 64)
  r2 ← r2 ⊕ (s ≫₈ 64)
  r1 ← r1 ⊕ (t ≪₈ 56)
  r0 ← r0 ⊕ (t ≫₈ 72)
1: r0, r1, r2, r3 ← 0
2: for i ← 19 downto 15 by 4 do
3:   R(r3, r2, r1, r0, t[i]),     t[i − 7] ← t[i − 7] ⊕ r0
4:   R(r0, r3, r2, r1, t[i − 1]), t[i − 8] ← t[i − 8] ⊕ r1
5:   R(r1, r0, r3, r2, t[i − 2]), t[i − 9] ← t[i − 9] ⊕ r2
6:   R(r2, r1, r0, r3, t[i − 3]), t[i − 10] ← t[i − 10] ⊕ r3
7: end for
8: R(r3, r2, r1, r0, t[11]), t[4] ← t[4] ⊕ r0
9: R(r0, r3, r2, r1, t[10]), t[3] ← t[3] ⊕ r1
10: t[2] ← t[2] ⊕ r2, t[1] ← t[1] ⊕ r3, t[0] ← t[0] ⊕ r0
11: r0 ← t[9] ≫₈ 64, r0 ← r0 ≫ 7, t[0] ← t[0] ⊕ r0
12: r1 ← r0 ≪₈ 64, r1 ← r1 ≪ 63, t[1] ← t[1] ⊕ r1
13: r1 ← r0 ≫ 1, t[2] ← t[2] ⊕ r1
14: for i ← 0 to 9 do c[2i], c[2i + 1] ← store(t[i]) end for
15: c[19] ← c[19] ∧ 0x7F
16: return c
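A scalar sketch of the same reduction on 64-bit digits may make the shift amounts easier to follow (our own rendering, under the assumption that both operands were reduced; Algorithm 4 performs the equivalent folding on 128-bit digits with accumulated writes):

#include <stdint.h>

/* Reduce the 39-word (2445-bit) product t(z) modulo z^1223 + z^255 + 1.
 * Bit 1223+k folds onto bits k and k+255; the shift amounts come from
 * 64*20 - 1223 = 57 and 64*16 - (1223 - 255) = 56. Result in t[0..19]. */
void reduce1223(uint64_t t[39]) {
    for (int i = 38; i >= 20; i--) {        /* fold digit i downwards */
        uint64_t d = t[i];
        t[i - 20] ^= d << 57;  t[i - 19] ^= d >> 7;   /* offset -1223 */
        t[i - 16] ^= d << 56;  t[i - 15] ^= d >> 8;   /* offset -968  */
    }
    uint64_t d = t[19] >> 7;                /* remaining bits 1223..1279 */
    t[0] ^= d;                              /* fold onto position 0      */
    t[3] ^= d << 63;  t[4] ^= d >> 1;       /* fold onto position 255    */
    t[19] &= 0x7F;                          /* keep bits 1216..1222      */
}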

2.7 Implementation Timings

In this section, we present our timings for finite field arithmetic. We implemented arithmetic in F_{2^1223} with the irreducible trinomial f(z) = z^1223 + z^255 + 1. This field is suitable for instantiations of the ηT pairing over supersingular binary curves at the 128-bit security level [4]. The C programming language was used in conjunction with compiler intrinsics for accessing vector instructions. The chosen compiler was GCC version 4.1.2, because it generated the fastest code from vector intrinsics, as already observed in [4]. The differences between our implementations on the 65nm and 45nm processors can be explained by the lower cost of the PSLLDQ and PSHUFB instructions in the newer generation, after the introduction of the Super Shuffle Engine by Intel. Field multiplication was implemented by a combination of one instance of Karatsuba with the LD method depicted as Algorithm 3. Karatsuba's splitting point was at 632 bits, and the divide-and-conquer steps were also implemented with vector instructions. Note that our binary field multiplier precomputes two tables of 16 rows, while the multiplier implemented in [4] precomputes a single table; this increase in memory consumption is negligible compared to the total memory capacity of the target platform.
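Structurally, the single Karatsuba level combines with the comb multiplier as in the following sketch (ours, with an even word-count split for simplicity; the actual implementation splits at 632 bits and vectorizes the divide-and-conquer steps). Here ld_mul is the comb sketch from Section 2.4:

/* One Karatsuba level over 64-bit word vectors: operands of 2k words are
 * split into k-word halves; three half-size products are combined as
 * c = A1*B1*z^(128k) + [(A0+A1)(B0+B1) + A0*B0 + A1*B1]*z^(64k) + A0*B0. */
void kara_mul(uint64_t *c, const uint64_t *a, const uint64_t *b, int k) {
    uint64_t sa[k], sb[k], mid[2 * k];
    ld_mul(c, a, b, k);                   /* A0*B0 -> c[0..2k-1]  */
    ld_mul(c + 2 * k, a + k, b + k, k);   /* A1*B1 -> c[2k..4k-1] */
    for (int i = 0; i < k; i++) {
        sa[i] = a[i] ^ a[i + k];          /* A0 + A1 */
        sb[i] = b[i] ^ b[i + k];          /* B0 + B1 */
    }
    ld_mul(mid, sa, sb, k);               /* (A0+A1)*(B0+B1) */
    for (int i = 0; i < 2 * k; i++)
        mid[i] ^= c[i] ^ c[i + 2 * k];    /* + A0*B0 + A1*B1 */
    for (int i = 0; i < 2 * k; i++)
        c[i + k] ^= mid[i];               /* add middle term at z^(64k) */
}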

Table 1. Comparison of different software implementations of finite field arithmetic in two Intel Core 2 platforms. All timings are reported in cycles. Improvements are computed in comparison with the previous fastest result in a 65nm platform, since the related works do not present timings for field operations in a 45nm platform.

                                           Operation
Implementation             a^2 mod f   a^{1/2} mod f   a·b mod f   a^{−1} mod f
Hankerson et al. [4]          600            500          8200        162000
Beuchat et al. [11]           480            749          5438           –
This work (Core 2 65nm)       160            166          4030        149763
Improvement                  66.7%          66.8%        25.9%         7.6%
This work (Core 2 45nm)       108            140          3785        149589

3 Pairing Computation

Miller's Algorithm for pairing computation requires a rich mathematical framework. We briefly present some definitions and point the reader to more complete treatments of the subject in [12,13].

3.1 Preliminary Definitions

An admissible bilinear pairing is an efficiently computable map e : G_1 × G_2 → G_T, where G_1 and G_2 are additive groups of points on an elliptic curve E and G_T is a related multiplicative group. Let P, Q be r-torsion points. The computation of a bilinear pairing e(P, Q) requires the construction and evaluation of a function f_{r,P} such that div(f_{r,P}) = r(P) − r(O) at a divisor D which is equivalent to (Q) − (O). Miller constructs f_{r,P} in stages by using a double-and-add method [14]. Let g_{U,V} : E(F_{q^k}) → F_{q^k} be the line equation through points U and V. If U = V, the line g_{U,V} is the tangent to the curve at U. If V = −U, the line g_U is shorthand for g_{U,−U}. A Miller function is any function f_{c,P} with divisor div(f_{c,P}) = c(P) − (cP) − (c − 1)(O), c ∈ Z. The following property holds for all integers a, b ∈ Z [13, Theorem 2]:

    f_{a+b,P}(D) = f_{a,P}(D) · f_{b,P}(D) · g_{aP,bP}(D) / g_{(a+b)P}(D).    (1)

Direct corollaries are:
(i) f_{1,P}(D) = 1.
(ii) f_{a,P}(D) = f_{a−1,P}(D) · g_{(a−1)P,P}(D) / g_{aP}(D).
(iii) f_{2a,P}(D) = f_{a,P}(D)^2 · g_{aP,aP}(D) / g_{2aP}(D).

Miller's Algorithm is depicted in Algorithm 5. The work by Barreto et al. [13] later showed how to use the final exponentiation of the Tate pairing to eliminate the denominators involved in the algorithm and to evaluate f_{r,P} at Q instead of the divisor D. Additional optimizations published in the literature focus on minimizing the latency of the Miller loop, that is, reducing the length of r while keeping its Hamming weight low [1,15,16].

Algorithm 5 Miller's Algorithm [14].
Input: r = Σ_{i=0}^{⌊log_2(r)⌋} r_i·2^i, P, D = (Q + R) − (R)
Output: f_{r,P}(D).
1: T ← P, f ← 1
2: for i = ⌊log_2(r)⌋ − 1 downto 0 do
3:   f ← f^2 · g_{T,T}(Q + R) g_{2T}(R) / (g_{2T}(Q + R) g_{T,T}(R))
4:   T ← 2T
5:   if r_i = 1 then
6:     f ← f · g_{T,P}(Q + R) g_{T+P}(R) / (g_{T+P}(Q + R) g_{T,P}(R))
7:     T ← T + P
8:   end if
9: end for
10: return f

3.2 Related Work

In this work, we are interested in parallel algorithms for pairing computation with no static limits on scalability, or more precisely, algorithms in which the scalability is not restricted by the mathematical definition of the pairing. Practical limits will always exist when: (i) the communication cost is dominant; (ii) the cost of parallelization is higher than the cost of computation. Several works have already developed parallel strategies for the computation of pairings, achieving mixed results. Grabher et al. [3] analyze two approaches: parallel extension field arithmetic, which gives good results but has a clear limit on scalability; and a parallel Miller loop strategy for two processors, where lines 3-4 of all iterations of Miller's Algorithm are precomputed by one processor and both processors compute in parallel the iterations where r_i = 1. Because r frequently has a low Hamming weight, this strategy results in performance losses due to unbalanced computational costs between the processors. Mitsunari [17] observes that the different iterations of the algorithm can be computed in parallel if the points T of different iterations are available, and proposes a specialized version of the ηT pairing over F_{3^m} for parallel execution on 2 processors. In this version, all the values (x_P^{1/3^i}, y_P^{1/3^i}, x_Q^{3^i}, y_Q^{3^i}) used for line evaluation in the i-th iteration of the algorithm are precomputed, and the Miller loop iterations are divided into sets of the same size; hence load balancing is trivially achieved. Since the cost of cubing and cube root computation is small, this approach achieves good speedups, ranging from 1.61 to 1.76 at two different security levels. However, it requires significant storage overhead, since 4·((m+1)/2) field elements must be precomputed and stored. This approach is generalized and extended in the work by Beuchat et al. [11], where results are presented for fields of characteristic 2 and 3 at the 128-bit security level. For characteristic 2, the speedups achieved by parallel execution reach 1.75, 2.53 and 2.57 for 2, 4 and 8 processors, respectively. For characteristic 3, the speedups reach 1.65, 2.26 and 2.79, respectively. This parallelization represents the current state-of-the-art in parallel implementations of cryptographic pairings.

Cesena and Avanzi [18,19] propose a technique to compute pairings over trace zero varieties constructed from supersingular elliptic curves and extensions of degree a = 3 or a = 5. This approach allows a pairing computation to be packed into a short parallel Miller loops by the action of the a-th power of Frobenius. The problem with this approach is again the scalability limit (restricted by the extension degree a). The speedup achieved with parallel execution on 3 processors is 1.11 over a serial implementation of the ηT pairing at the same security level [19].

3.3 Parallelization

In this section, a parallelization of Miller's Algorithm is derived. This parallelization can be used to accelerate serial pairing implementations or to improve the scalability of parallel approaches restricted by the pairing definition. The formulation is similar to the parallelizations presented in [17] and [11], but our method focuses on minimizing the number of points needed for parallel executions of different iterations of the algorithm. This allows us to eliminate the overhead of storing 4·((m+1)/2) precomputed field elements.

Miller's Algorithm computes f_{r,P} in log_2(r) iterations. For a parallel algorithm, we must divide these log_2(r) iterations between some number π of processors. To achieve this, we first need a simple property of Miller functions [16,20].

Lemma 1. Let P, Q be points on E(F_q), D ∼ (Q) − (∞) and let f_{c,P} denote a Miller function. For all integers a, b ∈ Z, f_{a·b,P}(D) = f_{b,P}(D)^a · f_{a,bP}(D).

We need this property because Equation (1) merely divides a Miller's Algorithm instance computed in log_2(r) iterations into two instances computed in at least log_2(r) − 1 iterations. If we could represent r as a product r_0 · r_1, it would be possible to compute f_{r,P} in two instances of log_2(r)/2 iterations. Since for some pairing instantiations r is a prime group order, we instead write r in the simple and flexible form 2^w·r_1 + r_0, with w ≈ log_2(r)/2. This way, applying Equation (1) with a = 2^w·r_1 and b = r_0, we can compute:

    f_{r,P}(D) = f_{2^w r_1 + r_0, P}(D) = f_{2^w r_1, P}(D) · f_{r_0,P}(D) · g_{(2^w r_1)P, r_0 P}(D) / g_{rP}(D).    (2)

Lemma 1 provides two choices to further develop f_{2^w r_1, P}(D):
(i) f_{2^w r_1, P}(D) = f_{r_1,P}(D)^{2^w} · f_{2^w, r_1 P}(D).
(ii) f_{2^w r_1, P}(D) = f_{2^w,P}(D)^{r_1} · f_{r_1, 2^w P}(D).

The choice can be made based on efficiency: (i) requires w squarings in the extension field F*_{q^k} and a point multiplication by r_1; (ii) requires an exponentiation by r_1 in the extension field and a point multiplication by 2^w (or w repeated point doublings). In the general case, the most efficient strategy depends on the curve and the embedding degree: the higher the embedding degree, the higher the cost of exponentiation in the extension field in comparison with point multiplication on the elliptic curve. If r has low Hamming weight, the two strategies should have similar costs. We adopt the first strategy:

    f_{r,P}(D) = f_{r_1,P}(D)^{2^w} · f_{2^w, r_1 P}(D) · f_{r_0,P}(D) · g_{(2^w r_1)P, r_0 P}(D) / g_{rP}(D).    (3)

This formula is clearly suitable for parallel execution by π = 3 processors, since each Miller function can be computed in log_2(r)/2 iterations. For our purposes, however, r will have low Hamming weight and r_0 will be very small. In this case, f_{r,P} can be computed by two Miller functions of approximately log_2(r)/2 iterations each. The parameter w can be adjusted to balance the costs between both processors (w extension field squarings against a point multiplication by r_1). The formula can also be applied recursively to f_{r_1,P} and f_{2^w, r_1 P} to develop a parallelization suitable for any number of processors. Observe that π also does not have to be a power of 2, because of the flexible way we write r to exploit parallelism. An important detail is that a parallel implementation will only show significant speedups if the cost of the Miller loop dominates the communication and parallelization overheads. It is also important to note that the higher the number of processors, the higher the number of squarings and the smaller the constants r_i involved in point multiplication; however, applying the formula recursively can increase the size of the integers which multiply P, because they will be products of the r_i constants. Thus, the scalability of this algorithm on π processors depends on the cost of squarings in the extension field, the cost of point multiplications by r_i on the elliptic curve, and the actual length of the Miller loop. Fortunately, these parameters are constant and can be statically determined. If P is fixed (a private key, for example), the multiples r_i·P can also be precomputed and stored with low storage overhead.

3.4 Parallel ηT Pairing

In this section, the performance gain of a parallel implementation of the ηT pairing over a serial implementation is investigated, following the analysis by [4]. Let E be a supersingular curve with embedding degree k = 4 defined over F_{2^m} with equation E/F_{2^m}: y^2 + y = x^3 + x + b. The order of E is 2^m + 1 ± 2^{(m+1)/2}. A quartic extension is built over F_{2^m} with basis {1, s, t, st}, where s^2 = s + 1 and t^2 = t + s. Let P, Q ∈ E(F_{2^m}) be r-torsion points. An associated distortion map ψ from E(F_{2^m})[r] to E(F_{2^{4m}}) is defined by ψ : (x, y) → (x + s^2, y + sx + t). For this family of curves, Barreto et al. [1] defined the optimized ηT pairing:

    η_T : E(F_{2^m})[r] × E(F_{2^{4m}})[r] → F*_{2^{4m}},   η_T(P, Q) = f_{T',P'}(Q')^M,    (4)

with Q' = ψ(Q), T' = (−v)(2^m − #E(F_{2^m})), P' = (−v)P, and M = (2^{2m} − 1)(2^m + 1 ± 2^{(m+1)/2}) for a curve-dependent parameter v ∈ {−1, 1}.

At the 128-bit security level, the base field must have m = 1223 bits [4]. Let E1 be the supersingular curve with embedding degree k = 4 defined over F_{2^1223} with equation E1(F_{2^1223}): y^2 + y = x^3 + x. The order of E1 is 5r = 2^1223 + 2^612 + 1, where r is a 1221-bit prime number. Applying the parallel form developed in Section 3.3, the pairing computation can be decomposed as:

    f_{T',P'}(Q')^M = ( f_{2^{612−w},P'}(Q')^{2^w} · f_{2^w, 2^{612−w}P'}(Q') · g_{2^{612−w}P', P'}(Q') / g_{T'P'}(Q') )^M.

Since squarings in F_{2^{4m}} and point duplication on supersingular curves require only binary field squarings, and these can be efficiently computed, the cost of parallelization is low, but further improvements are possible. Barreto et al. [1] proposed a closed formula for this pairing based on a reversed-loop approach with square roots which eliminates the extension field squarings in Miller's Algorithm. Beuchat et al. [21] found further algorithmic improvements and proposed a slightly faster formula for the ηT pairing computation. We can obtain a parallel algorithm directly from the parallel formula derived above by excluding the involved extension field squarings and simply dividing the loop iterations between the processors. This algorithm is shown as Algorithm 6. In this algorithm, each processor i starts the loop from the w_i counter, paying w_i squarings/square roots of overhead. Without extension field squarings to offset these operations, it makes sense to assign processor 1 the first line evaluation and to increase the loop parts executed by processors with small w_i. The total overhead is smaller because extension field squarings are not needed, and point arithmetic on binary supersingular curves can be computed with inexpensive squarings and square roots.

Observe that the combining step can be implemented in at least two different ways: (i) serial combining of results, with (π − 1) serial extension field multiplications executed in one processor; (ii) parallel logarithmic combining of results, with a latency of ⌈log_2(π)⌉ extension field multiplications. We adopt the parallel strategy for efficiency.

3.5 Performance Analysis

Now we proceed with the performance analysis of Algorithm 6. Processor 1 has an initialization cost of 3 multiplications and 2 squarings. Processor i has a parallelization cost of 2w_i squarings and 2w_i square roots. Additional parallelization overhead is ⌈log_2(π)⌉ extension field multiplications to combine the results; a full extension field multiplication costs 9 field multiplications. Each iteration of the algorithm executes 2 square roots, 2 squarings, 1 field multiplication and 1 extension field multiplication. Exploiting the sparsity of G_i, this extension field multiplication costs 6 field multiplications. The final exponentiation has a cost of 26 multiplications, 7 finite field squarings, 612 extension field squarings and 1 inversion. Each extension field squaring costs 4 finite field squarings [21].

Let m̃, s̃, r̃, ĩ be the costs of the finite field operations multiplication, squaring, square root and inversion, respectively. For our efficient implementation of the finite field F_{2^1223} on an Intel Core 2 65nm processor, we have r̃ ≈ s̃, m̃ ≈ 25s̃ and ĩ ≈ 37m̃. From these ratios, we illustrate how to compute the optimal w_i values which balance the computational cost between processors. Let c_π(i) be the computational cost of a processor 0 < i ≤ π while executing its portion of the parallel algorithm. For π = 2 processors:

    c_2(1) = (3m̃ + 2s̃) + (7m̃ + 4s̃)·w_2 = 80s̃ + (186s̃)·w_2
    c_2(2) = (4s̃)·w_2 + (7m̃ + 4s̃)·(611 − w_2).

Naturally, we always have w_1 = 0 and w_{π+1} = 611. Solving c_2(1) = c_2(2) for w_2, we obtain the optimal w_2 = 309.
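Expressing both costs in units of s̃ (a minimal worked instance of the balancing step; the general case was solved with a script):

    80 + 186·w_2 = 4·w_2 + 186·(611 − w_2)  ⟹  368·w_2 = 113566  ⟹  w_2 ≈ 308.6,

which rounds to the optimal w_2 = 309 quoted above.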

Algorithm 6 Proposed parallelization of the ηT pairing (π processors).
Input: P = (x_P, y_P), Q = (x_Q, y_Q) ∈ E(F_{2^m})[r], starting point w_i for processor i.
Output: η_T(P, Q) ∈ F*_{2^{4m}}.
1: y_P ← y_P + 1 − δ
2: parallel section (processor i)
3:   if i = 0 then
4:     u_i ← x_P + α, v_i ← x_Q + α
5:     g0_i ← u_i · v_i + y_P + y_Q + β
6:     g1_i ← u_i + x_Q, g2_i ← v_i + x_P^2
7:     G_i ← g0_i + g1_i·s + t
8:     L_i ← (g0_i + g2_i) + (g1_i + 1)·s + t
9:     F_i ← L_i · G_i
10:  else
11:    F_i ← 1
12:  end if
13:  x_{Q,i} ← (x_Q)^{2^{w_i}}, y_{Q,i} ← (y_Q)^{2^{w_i}}
14:  x_{P,i} ← (x_P)^{1/2^{w_i}}, y_{P,i} ← (y_P)^{1/2^{w_i}}
15:  for j ← w_i to w_{i+1} − 1 do
16:    x_{P,i} ← √(x_{P,i}), y_{P,i} ← √(y_{P,i}), x_{Q,i} ← (x_{Q,i})^2, y_{Q,i} ← (y_{Q,i})^2
17:    u_i ← x_{P,i} + α, v_i ← x_{Q,i} + α
18:    g0_i ← u_i · v_i + y_{P,i} + y_{Q,i} + β
19:    g1_i ← u_i + x_{Q,i}
20:    G_i ← g0_i + g1_i·s + t
21:    F_i ← F_i · G_i
22:  end for
23: end parallel
24: F ← ∏_i F_i
25: return F^M

For π = 4 processors, we solve c_4(1) = c_4(2) = c_4(3) = c_4(4) to obtain w_2 = 158, w_3 = 312, w_4 = 463. Observe that by solving a simple system of equations it is always possible to balance the computational cost between the processors. Furthermore, the latency of the Miller loop will always be equal to c_π(1). Let c_1(1) be the cost of a serial implementation of the main loop, par be the parallelization overhead and exp be the cost of the final exponentiation. Counting the additional ⌈log_2(π)⌉ extension field multiplications as parallelization overhead and 26m̃ + (7 + 2446)s̃ + ĩ as the cost of the final exponentiation, the speedup for π processors is the ratio of the cost of the serial implementation to the cost of the parallel implementation:

    s(π) = (c_1(1) + exp) / (c_π(1) + par + exp) = (77 + 179·611 + 3978) / (c_π(1) + 225·⌈log_2(π)⌉ + 3978).

Table 2 presents the speedups estimated by our performance analysis. Note that our efficient implementation of binary field arithmetic on the 45nm processor has a bigger multiplication-to-squaring ratio, concentrating higher computational costs in the main loop of the algorithm. This explains why the speedups should be higher on the 45nm processor.

Table 2. Estimated speedups for our parallelization of the ηT pairing over supersingular binary curves at the 128-bit security level. The optimal partitions were computed by a Sage¹ script.

                 Estimated speedup s(π) for π processors
                  1      2      4      8      16     32
Core 2 65nm      1.00   1.90   3.45   5.83   8.69  11.48
Core 2 45nm      1.00   1.92   3.54   6.11   9.34  12.66

4 Experimental Results

We implemented the parallel algorithm for the ηT pairing over our efficient binary field arithmetic on two Intel Core platforms: an Intel Core 2 Quad 65nm platform running at 2.4GHz (Platform 1) and a dual quad-core Intel Xeon 45nm platform running at 2.0GHz (Platform 2). The parallel sections were implemented with OpenMP² constructs; OpenMP is an application programming interface that supports multi-platform shared-memory multiprocessing programming in C, C++ and Fortran. We used a special version of the GCC 4.1.2 compiler included in Fedora Linux 8, with OpenMP support backported from GCC 4.2 and SSSE3 support backported from GCC 4.3. This way, we could use both multiprocessing support and fast code generation for SSE intrinsics.

The timings and speedups presented in Table 3 were measured over 10^4 executions of each algorithm. We present timings in millions of cycles to ignore differences in clock frequency between the target platforms. From the table, we can observe that real implementations can obtain speedups close to the estimated speedups derived in the previous section. We verified that thread creation and synchronization overhead stayed in the order of microseconds, negligible compared to the pairing computation time. Timings for π > 4 processors in Platform 1 and π > 8 processors in Platform 2 were measured through a high-precision per-thread counter read by the main thread. These timings should be an accurate approximation of future real implementations, but memory effects (such as cache locality) or scheduling influence may impose penalties.

Table 3 shows that the proposed parallelization presents good scalability. We improve the state-of-the-art serial and parallel execution times significantly. The fastest timing for computing the ηT pairing obtained by our implementation was 1.51 milliseconds using all 8 cores of Platform 2. The work by Beuchat et al. [11] reports a timing of 3.08 milliseconds on an Intel Core i7 45nm processor clocked at 2.9GHz. Note that we obtain a much faster timing at a lower clock frequency and without requiring the storage overhead of 4·((m+1)/2) field elements present in [11], which may reach 365KB for these parameters and be prohibitive in resource-constrained embedded devices.

¹ SAGE: Software for Algebra and Geometry Experimentation. http://www.sagemath.org
² Open Multi-Processing. http://www.openmp.org
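The structure of the parallel section can be sketched as follows (our own illustration, not the implementation's code; miller_part, fb4_mul and final_exp are hypothetical helpers standing in for the steps of Algorithm 6, and a serial combine replaces the logarithmic one for brevity):

#include <stdint.h>
#include <omp.h>

#define MAX_CPUS 32
typedef struct { uint64_t c[4][20]; } fb4_t;        /* element of F_{2^{4m}} */
typedef struct { uint64_t x[20], y[20]; } point_t;  /* point over F_{2^m}   */

/* Hypothetical helpers standing in for Algorithm 6's body. */
void miller_part(fb4_t *f, const point_t *P, const point_t *Q, int lo, int hi);
void fb4_mul(fb4_t *c, const fb4_t *a, const fb4_t *b);
void final_exp(fb4_t *f);

/* Thread i runs its slice [w[i], w[i+1]) of the Miller loop on private
 * state; partial results are multiplied together and exponentiated.
 * w has pi+1 entries, with w[0] = 0 and w[pi] = 611. */
void eta_t_parallel(fb4_t *F, const point_t *P, const point_t *Q,
                    const int *w, int pi) {
    fb4_t part[MAX_CPUS];
    #pragma omp parallel num_threads(pi)
    {
        int i = omp_get_thread_num();
        miller_part(&part[i], P, Q, w[i], w[i + 1]);
    }
    *F = part[0];
    for (int i = 1; i < pi; i++)
        fb4_mul(F, F, &part[i]);
    final_exp(F);
}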

Table 3. Experimental results for serial/parallel executions of the ηT pairing. Times are presented in millions of cycles, and the speedups are computed as the ratio between execution times of serial implementations and execution times of parallel implementations. The columns marked with (*) present estimates based on per-thread data.

                                        Number of threads
Platform 1 – Intel Core 2 65nm      1      2      4      8*    16*    32*
Hankerson et al. [4] – latency     39      –      –      –      –      –
Beuchat et al. [11] – latency    26.86  16.13  10.13     –      –      –
Beuchat et al. [11] – speedup     1.00   1.67   2.65     –      –      –
This work – latency              18.76  10.08   5.72   3.55   2.51   2.14
This work – speedup               1.00   1.86   3.28   5.28   7.47   8.76
Improvement                      30.2%  37.5%  43.5%    –      –      –

Platform 2 – Intel Core 2 45nm      1      2      4      8     16*    32*
Beuchat et al. [11] – latency    23.03  13.14   9.08   8.93     –      –
Beuchat et al. [11] – speedup     1.00   1.77   2.54   2.58     –      –
This work – latency              17.40   9.34   5.08   3.02   2.03   1.62
This work – speedup               1.00   1.86   3.42   5.76   8.57  10.74
Improvement                      24.4%  28.9%  44.0%  66.2%    –      –

5 Conclusion and Future Work

In this work, we proposed novel techniques for exploiting parallelism in the implementation of the ηT pairing over supersingular binary curves on modern multi-core computers. Powerful vector instructions of the SSE family were shown to accelerate considerably the arithmetic in binary fields. We obtained significant performance in computing the ηT pairing, using an efficient implementation of field multiplication, squaring and square root computation. The optimizations improved the state-of-the-art timings of this pairing instantiation at the 128-bit security level by 24% and 30% on two different Intel Core processors.

We also derived a parallelization of Miller's Algorithm to compute pairings. This parallelization is generic and can be applied to any pairing algorithm or instantiation. The construction achieves good scalability in the symmetric case, and this scalability is not restricted by the definition of the pairing. We illustrated the formulation applied to the ηT pairing over supersingular binary curves and validated our performance analysis with a real implementation. The experimental results show that the actual implementation sustains performance gains close to the estimated speedups. Parallel execution of the ηT pairing improved the state-of-the-art timings by at least 28%, 44% and 66% on 2, 4 and 8 cores, respectively. This parallelization is suitable for embedded platforms and can be applied to reduce computation latency when response time is critical.

Future work can adapt the introduced techniques to the case F_{3^m}. Improvements to the parallelization should focus on minimizing the serial region and the parallelization cost. The proposed parallelization should also be applied to an optimal asymmetric pairing setting, where parallelization costs are clearly higher; preliminary data for the R-ate pairing [16] over Barreto-Naehrig curves at the 128-bit security level points to a 10% speedup using 2 processor cores.

References

1. Barreto, P.S.L.M., Galbraith, S., Ó hÉigeartaigh, C., Scott, M.: Efficient Pairing Computation on Supersingular Abelian Varieties. Designs, Codes and Cryptography 42(3), 239–271 (2007)
2. Wechsler, O.: Inside Intel Core Microarchitecture: Setting new standards for energy-efficient performance. Technology@Intel Magazine (2006)
3. Grabher, P., Groszschaedl, J., Page, D.: On Software Parallel Implementation of Cryptographic Pairings. In: Avanzi, R., Keliher, L., Sica, F. (eds.) SAC 2008. LNCS, vol. 5381, pp. 34–49. Springer (2008)
4. Hankerson, D., Menezes, A., Scott, M.: Identity-Based Cryptography, chapter 12, pp. 188–206. IOS Press, Amsterdam (2008)
5. Intel: Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2: Instruction Set Reference. http://www.intel.com/Assets/PDF/manual/253666.pdf
6. Gueron, S., Kounavis, M.E.: Carry-Less Multiplication and Its Usage for Computing the GCM Mode. White paper, http://software.intel.com/
7. Hankerson, D., Menezes, A., Vanstone, S.: Guide to Elliptic Curve Cryptography. Springer-Verlag, Secaucus (2003)
8. Fong, K., Hankerson, D., López, J., Menezes, A.: Field inversion and point halving revisited. IEEE Transactions on Computers 53(8), 1047–1059 (2004)
9. Karatsuba, A., Ofman, Y.: Multiplication of many-digital numbers by automatic computers (in Russian). Doklady Akad. Nauk SSSR 145, 293–294 (1962)
10. López, J., Dahab, R.: High-speed software multiplication in GF(2^m). In: Roy, B.K., Okamoto, E. (eds.) INDOCRYPT 2000. LNCS, vol. 1977, pp. 203–212 (2000)
11. Beuchat, J., López-Trejo, E., Martínez-Ramos, L., Mitsunari, S., Rodríguez-Henríquez, F.: Multi-core implementation of the Tate pairing over supersingular elliptic curves. In: Garay, J.A., Miyaji, A., Otsuka, A. (eds.) CANS 2009. LNCS, vol. 5888, pp. 413–432. Springer (2009)
12. Barreto, P.S.L.M., Lynn, B., Scott, M.: Efficient Implementation of Pairing-Based Cryptosystems. Journal of Cryptology 17(4), 321–334 (2004)
13. Barreto, P.S.L.M., Kim, H.Y., Lynn, B., Scott, M.: Efficient algorithms for pairing-based cryptosystems. In: Yung, M. (ed.) CRYPTO 2002. LNCS, vol. 2442, pp. 354–368. Springer (2002)
14. Miller, V.S.: The Weil Pairing, and Its Efficient Calculation. Journal of Cryptology 17(4), 235–261 (2004)
15. Hess, F., Smart, N.P., Vercauteren, F.: The Eta Pairing Revisited. IEEE Transactions on Information Theory 52, 4595–4602 (2006)
16. Lee, H., Lee, E., Park, C.: Efficient and Generalized Pairing Computation on Abelian Varieties. IEEE Transactions on Information Theory 55(4), 1793–1803 (2009)
17. Mitsunari, S.: A Fast Implementation of ηT Pairing in Characteristic Three on Intel Core 2 Duo Processor. Cryptology ePrint Archive, Report 2009/032 (2009)
18. Cesena, E.: Pairing with Supersingular Trace Zero Varieties Revisited. Cryptology ePrint Archive, Report 2008/404 (2008)
19. Cesena, E., Avanzi, R.: Trace Zero Varieties in Pairing-based Cryptography. In: Conference on Hyperelliptic Curves, Discrete Logarithms, Encryption, etc. http://inst-mat.utalca.cl/chile2009/Slides/Roberto_Avanzi_2.pdf (2009)
20. Vercauteren, F.: Optimal Pairings. Cryptology ePrint Archive, Report 2008/096 (2008)
21. Beuchat, J., Brisebarre, N., Detrey, J., Okamoto, E., Rodríguez-Henríquez, F.: A Comparison Between Hardware Accelerators for the Modified Tate Pairing over F_{2^m} and F_{3^m}. In: Pairing 2008. LNCS, vol. 5209, pp. 297–315. Springer (2008)