Feb 29, 2012 - support for binary field arithmetic is available in the target platform and if resistance to some ... of the field F2m represented in canonical basis consumes n64 = â m. 64 ..... tions accessed through their intrinsics interface.
Faster implementation of scalar multiplication on Koblitz curves We design a state-of-the-art software implementation of eld and elliptic curve arithmetic in standard Koblitz curves at the 128-bit security level. Field arithmetic is carefully crafted by using the best formulae and implementation strategies available; and the increasingly common native support to binary eld arithmetic in modern desktop computing platforms. The i-th power of the Frobenius automorphism on Koblitz curves is exploited to obtain new and faster interleaved versions of the well-known τ -NAF scalar multiplication algorithm. The eciency of these versions is illustrated by timings for scalar multiplication of xed, random and multiple points. In particular, the random point case is computed in just below 105 cycles, setting a new speed record across all curves with or without endomorphisms dened over binary or prime elds. Our focus is on pure eciency, thus our results suggest a tradeo between speed, compliance with the published standards and sidechannel protection. Finally, we estimate the performance of curve-based cryptographic protocols instantiated using the proposed techniques and compare our results to related work. Abstract.
1
Introduction
Since its introduction in 1985, elliptic curve cryptography has become one of the most important and ecient public key cryptosystems still in use. Its security is based on the computational intractability of solving discrete logarithm problems over the group formed by the rational points on an elliptic curve. Anomalous binary curves also known as
Koblitz Elliptic Curves, were intro-
Fq for q = 2m , a Koblitz curve Ea (Fq ), (x, y) ∈ Fq × Fq that satisfy the equation
duced in [18]. Given a nite eld dened as the set of points
Ea : y 2 + xy = x3 + ax2 + 1, together with a point at innity denoted by
O.
a ∈ {0, 1},
is
(1)
It is known that
Ea (Fq )
forms
an additive Abelian group with respect to the elliptic point addition operation, In
Ea is a Koblitz curve with order #Ea (F2m ) = 22−a r, r is an odd prime. Let hP i be an additively written subgroup in Ea of order r , and let k be a positive integer such that k ∈ [0, r − 1]. Then,
this paper it is assumed that where prime
the elliptic curve scalar multiplication operation computes the multiple Q = kP, which corresponds to the point resulting of adding
r, P
and
Q ∈ hP i,
P
to itself,
k − 1 times. Given
the Elliptic Curve Discrete Logarithm Problem (ECDLP)
consists of nding the unique integer
k
such that
Q = kP holds. F2 , the Frobenius
Since Koblitz curves are dened over the binary eld
map
and its inverse naturally extend to an automorphism of the curve denoted by
τ . The τ map takes (x, y) to (x2 , y 2 ) and O to O. It can been shown that (x4 , y 4 ) + 2(x, y) = µ(x2 , y 2 ) for every (x, y) on Ea , where µ = (−1)1−a . In 2 other words, τ satises τ + 2 = µτ . By solving the quadratic equation, we can √ −1+ −7 associate τ with the complex number τ = . 2 Elliptic curve scalar multiplication is the most expensive operation in cryptographic protocols whose security guarantees are based on the ECDLP. Improving the computational eciency of this operation is a widely studied problem. Across the years, a number of algorithms and techniques providing ecient implementations with higher performance have been proposed [15]. Many research works have focused their eorts on the unknown point scenario, where the base point
P
is not known in advance and when only one single scalar multiplication is
required, as in the case of the Die-Hellman key exchange protocol [26,19,13]. However, there are situations where a single scalar multiplication must be performed on xed base points such as in the case of the key and signature generation procedures of the Elliptic Curve Digital Signature Algorithm (ECDSA) standard. In other scenarios, such as in the ECDSA signature verication, the simultaneous computation of two scalar multiplications (one with unknown point and the other with xed point) of the form
R = kG + lQ,
is required. Compar-
atively less research works have studied the latter cases [6,10,4]. In [25,26], authors evaluated the achievable performance of binary elliptic curve arithmetic in the latest 64-bit micro-architectures, presenting a comprehensive analysis of unknown-point scalar multiplication computations on random and Koblitz NIST elliptic curves at the 112-bit and 192-bit security levels. However, for the 128-bit security level authors in [26] only considered a random curve with side-channel resistant scalar multiplication.
1
In this work we revisit the single-core software computation of the scalar multiplication on Koblitz curves dened over binary elds. This study includes the computation of the scalar multiplication using unknown and xed points and single and simultaneous scalar multiplication computations as required in the digital signature generation and verication ECDSA procedures. We extend the analysis given in [25,26] and further investigate the remaining curve choices to provide a complete picture of the performance scenario, while also showing through operation counting and experimental results that Koblitz curves are still the fastest choice for deploying curve-based cryptography if sucient native support for binary eld arithmetic is available in the target platform and if resistance to some software side-channel attacks can be disregarded.
1
2
2
Scalar multiplication on the Edwards curve CURVE2251 was implementing in [26] using the Montgomery laddering approach that is naturally protected against rstorder side-channel attacks. Since our library does not read secret data from RAM we believe that our software is basically immune to cache-timing attacks. Furthermore, since we compute the scalar multiplication using an interleaved approach that simultaneously processes the scalars associated to two or more points, we believe that our library oers some rst-order resistance against branch-prediction side-channel attacks.
To this end, we adopted several techniques previously proposed by dierent authors: (i) formulation of binary eld arithmetic using vector instruc-
2k -powers in multiplication over F2 and
tions [2]; (ii) time-memory trade-os for the evaluation of xed binary elds [5]; (iii) new formulas for polynomial
its extensions [7]; (iv) ecient support for the recently introduced carry-less multiplier [26]. Besides of building on these nite eld arithmetic advancements, this paper presents several novel techniques including: (i) improved implementation of
ωτ -
NAF integer recoding; (ii) a new precomputation scheme for small multiples of a random point in a Koblitz curve; (iii) lazy-reduction formulae for mixed addition in binary elliptic curves; (iv) novel interleaving strategies of the
τ -NAF
algorithm for scalar multiplication in Koblitz curves via powers of the Frobenius automorphism. Further, in this work only the tried and tested Koblitz curve NIST-K283 was considered, providing immediate compatibility with standards and interoperability with existing implementations. Note however that most of our techniques are not in any sense restricted to this curve choice, and can therefore be used to accelerate scalar multiplication in other Koblitz curves at dierent security levels. Our main implementation result is a speed record for the unknown-point single-core scalar multiplication computation over the NIST-K283 curve in slightly less than
105
clock cycles. Running on an Intel Core i7-2600K processor clocked
at 3.4 GHz we were able to compute a random point scalar multiplication in just 29.18µS. To our knowledge, this is the fastest software implementation of this operation across all curves ever proposed for cryptographic applications at the 128-bit security level. This document is structured as follows: Section 2 discusses the low-level techniques used for the implementation of eld arithmetic and integer recoding. Section 3 presents higher-level techniques for arithmetic in the elliptic curve, comprising improved formulas for mixed addition by means of lazy reduction and strategies for speeding up the scalar multiplication computation by using powers of the Frobenius automorphism. Section 4 illustrates the eciency of the proposed techniques reporting timings for scalar multiplication in the xed, unknown and multiple point scenarios; and extensively compares the results with related work. Additionally in this Section we estimate the performance of signature and key agreement protocols when they are instantiated through the usage of Koblitz curves. The nal section concludes the paper with perspectives for further performance improvement based on upcoming instruction sets.
2
Low-level techniques
f (z) be a monic irreducible polynomial of degree m over F2 . Then, the binary F2m is isomorphic to F2m ∼ = F2 [z]/ (f (z)), i.e., F2m is a nite eld of characteristic 2, whose elements are the set of all the binary polynomials modulo f (z). In order to achieve a security level equivalent to 128-bit AES
Let
extension eld
when working with binary elliptic curves, NIST recommends to choose the eld
m = 283, along with the irreducible pentanomial f (z) = z 283 + z 12 + z + z + 1. It is noticed that in a modern 64-bit computing platform, an element m of the eld F2m represented in canonical basis consumes n64 = d 64 e processor words, or n64 = 5 when m = 283. In the rest of this Section, algorithm and
extension
7
5
formula descriptions will refer to either generic or xed versions of the binary eld, depending on whether or not the optimization is restricted to the choice of
m = 283. As it was mentioned before, in this work we made an extensive use of vector
instruction sets present in contemporary desktop processors. The platform model given in Table 1 extends the notation reported in [2]. Notice that vectorized multiple-precision or intra-digit shifts can always be made faster when the shift amount is a multiple of 8 by means of the memory alignment instruction or the bytewise shift instruction, respectively, and that a simultaneous table lookup mapping 4-bit indexes to bytes can be implemented through the byte shuing instruction called
Table 1.
PSHUFB
in the SSE instruction set.
Relevant vector instructions for the implementation of binary eld arithmetic.
Mnemonic
⊗ -8 , -8 8 , 8 ⊕, ∧, ∨ C, B
Description
SSE
Carry-less multiplication PCLMULQDQ 64-bit bitwise shifts PSLLQ,PSRLQ 128-bit bytewise shift PSLLDQ,PSRLDQ Bitwise XOR,AND,OR PXOR,PAND,POR Memory alignment/Multi-precision shifts PALIGNR
In the following we will briey discuss how the most relevant eld arithmetic operations, namely, addition, multiplication, squaring and multi-squaring, modular reduction, inversion and integer to
ωτ N AF
recoding were implemented.
Addition. It is the simplest operation on a binary eld and can employ the exclusive-or instruction with the largest operand size in the target platform. This is particularly benecial for vector instructions, but according to our experiments, the 128-bit SSE [16] integer instruction proved to be faster than the 256-bit AVX [8] oating-point instruction due to a higher reciprocal throughput [9] when operands are stored into registers.
Multiplication. Field multiplication is the performance-critical arithmetic op-
a, b ∈ F2283 we c = a · b mod f (z). This
eration for elliptic curve arithmetic. Given two eld elements want to compute a third eld element
c ∈ F2m ,
as
can be accomplished by performing two separate steps: rst the polynomial multiplication of the two operands
a, b
is computed and then the resulting
double length polynomial is modular reduced by
f (z). From our eld element
representation, the polynomial multiplication step can be seen as the compu-
n64 − 1 degree polynomials each with n64 64-bit d n264 e−1 den64 gree polynomials each with d 2 e 128-bit coecients. In the latter case, each tation of the product of two
coecients. Alternatively, the two operands may also be seen as
term-by-term multiplication can be solved carry-less multiplications. When
n64 = 5,
à la
Karatsuba by performing 3
the above approaches require 13
(see [21,7]) and 14 invocations of the carry-less multiplier instruction, respectively. Algorithm 1 below presents our implementation of eld multiplication over the eld
F2283
with 64-bit granularity using the formula given in [7]. The
computational complexity of Algorithm 1 is of thirteen and sixty-two 64-bit
3 plus one modu-
carry less multiplications and 64-bit additions, respectively,
lar reduction (Alg. 1, Step 20) that will be discussed later. The most salient feature of Algorithm 1 is that all the thirteen carry-less multiplications have been grouped in the three for loops of steps 1-4, 6-8 and 10-12. This is an attractive feature from the implementation point of view, as it is important to potentially reduce the cost of the carry-less multiplication instruction from 14 to 8 clock cycles in the Intel Sandy Bridge micro-architecture; and from 12 to 7 clock cycles in an AMD Bulldozer [9]. The rationale behind this cost reduction is that the batch execution of independent multiplications directly benet the micro-architecture pipeline occupancy level. It is worth mentioning that in [26], authors concluded that the 64-bit granular approach tends to consume more resources and complicate register allocation, limiting the natural throughput exhibited by the carry-less multiplication instruction. However, if the digits are stored in an interleaved form [12], these side eects are mitigated and a throughput can again be achieved.
Squaring and multi-squaring. Squaring is a cheap operation in a binary eld because due to the action of the Frobenius map, it consists of a linear expansion of coecients. Vectorized approachs using simultaneous table lookups through byte shuing instructions allow a particularly ecient formulation of the coecient expansion stage [2]. The modular reduction stage usually is the most expensive step when computing a squaring, especially when is an
ordinary pentanomial
f (z)
(see [23]) for the given word size. Dealing with
ordinary pentanomials requires exible and often shifting instructions that are unsupported by the platform model. Multi-squaring is a time-memory trade-o in which a table of
k
2
16d m 4 e eld elements allows computing any xed
power with the cost equivalent of just a few squarings. It is usually the
case that the multi-squaring approach becomes faster whenever
k≥6
[26].
Contrary to addition, the availability of 256-bit instructions here contributes signicantly to a performance increase. This is because this operation basically consists of a sequence of additions with eld elements obtained through a precomputed table stored in main memory.
Modular reduction. Ecient modular reduction of a
2 · n64
double-precision
value coming as the result of a squaring or multiplication operation to a proper eld element of word-size
n64 , involves expressing the required shifted
additions in terms of the best shifting instructions possible. For the instruc-
3
which yields the same number of multiplications but with ten less additions that the computational complexity reported in [21] for multiplication of degree-four binary polynomials.
Algorithm 1 Proposed implementation of multiplication in
F2283 .
a(z) = a[0..4], b(z) = b[0..4]. Output: c(z) = c[0..4] = a(z) · b(z). Note: Pairs of 64-bit words represent vector registers. 1: for i ← 0 to 4 do 2: ri ← (a[i], b[i]) 3: mi ← ri [0] ⊗ ri [1] 4: end for 5: t0 ← r0 ⊕ r1 , t1 ← r0 ⊕ r2 , t2 ← r2 ⊕ r4 , t3 ← r3 ⊕ r4 6: for i ← 0 to 3 do 7: m5+i ← ti [0] ⊗ ti [1] 8: end for 9: t1 ← t1 ⊕ r3 , t2 ← t2 ⊕ r1 , t3 ← t3 ⊕ t0 , t0 ← t3 ⊕ r2 10: for i ← 0 to 3 do 11: m9+i ← ti [0] ⊗ ti [1] 12: end for 13: c0 ← m0 , c4 ← m4 , t0 ← m0 ⊕ m1 , t1 ← m3 ⊕ m4 14: c1 ← t0 ⊕ m2 ⊕ m6 , c3 ← t1 ⊕ m2 ⊕ m7 15: t0 ← t0 ⊕ m5 , t3 ← t1 ⊕ m8 16: c2 ← t0 ⊕ t3 ⊕ m9 ⊕ m10 ⊕ m11 , t1 ← m9 ⊕ m12 17: t2 ← t1 ⊕ c1 ⊕ m4 ⊕ m11 18: t1 ← t1 ⊕ c3 ⊕ m0 ⊕ m10 19: (c4 , c3 , c2 , c1 , c0 ) ← (c4 , c3 , c2 , c1 , c0 ) ⊕ ((0, t3 , t2 , t1 , t0 ) 64) 20: return c = (c4 , c3 , c2 , c1 , c0 ) mod f (z) Input:
tion sets available in our target platform, this amounts to converting the highest possible number of shifts to memory alignment instructions or bytewise shifts. Curve NIST-K283 is dened over an ordinary pentanomial, a particularly inecient choice for our vector register size. However, by observing that
f (z) = z 283 + z 12 + z 7 + z 5 + 1 = z 283 + (z 7 + 1)(z 5 + 1),
one can take advantage of this factorization to formulate faster shifted additions. Algorithm 2 presents our explicit scheduling of shift instructions to perform modular reduction in written as of
c.
c = p1 ||p0
F2283 .
Suppose that the polynomial
p0
where the polynomial
The computation of
c
mod
c
is
represent the 283 lower bits
f (z) in Algorithm 2 is performing p1 is computed by shifting the
lows: in lines 1 to x, the polynomial
as folvector
(c4,c3,c2) to the left exactly 27 bits. Then, in lines xx to yx, the operation
c + p1 ∗ (z 7 + 1) ∗ (z 5 + 1)
is performed, thus getting the vector
(c2 , c1 , c0 ) . c2 are
Finally, in lines xxx to yyy, the remaining 101 most signicant bits of
reduced, a process that again involves a multiplication by the polynomial
(z 7 + 1) ∗ (z 5 + 1). Inversion. The eld inversion approach that probably is the friendliest for vector instruction sets, is the Itoh-Tsuji inversion [17] that computes the eld inverse of
a
using the identity
a−1 =
m−1
a2
−1
2
. The term
m−1
a2
−1
is
f (z) = z 283 + (z 7 + 1)(z 5 + 1).
Algorithm 2 Implementation of reduction by
Double-precision polynomial stored into 128-bit registers c = (c4 , c3 , c2 , c1 , c0 ). Field element c mod f (z) stored into 128-bit registers (c2 , c1 , c0 ). t2 ← c2 , t0 ← (c3 , c2 ) B 64, t1 ← (c4 , c3 ) B 64 c4 ← c4 -8 27, c3 ← c3 -8 27, c3 ← c3 ⊕ (t1 -8 37) c2 ← c2 -8 27, c2 ← c2 ⊕ (t0 -8 37) t0 ← (c4 , c3 ) B 120, c4 ← c4 ⊕ (t0 -8 1) t1 ← (c3 , c2 ) B 64, c3 ← c3 ⊕ (c3 -8 7), c3 ← c3 ⊕ (t1 -8 57) t0 ← c2 8 64, c2 ← c2 ⊕ (c2 -8 7), c2 ← c2 ⊕ (t0 -8 57) t0 ← (c4 , c3 ) B 120, c4 ← c4 ⊕ (t0 -8 3) t1 ← (c3 , c2 ) B 64, c3 ← c3 ⊕ (c3 -8 5), c3 ← c3 ⊕ (t1 -8 59) t0 ← c2 8 64, c2 ← c2 ⊕ (c2 -8 5), c2 ← c2 ⊕ (t0 -8 59) c0 ← c0 ⊕ c2 , c1 ← c1 ⊕ c3 , c2 ← t2 ⊕ c4 t0 ← c4 -8 27 t1 ← t0 ⊕ (t0 -8 5) t0 ← t1 ⊕ (t1 -8 7) c0 ← c0 ⊕ t0 , c2 ← c2 ∧ (0x0000000000000000,0x0000000007FFFFFF) return c = (c2 , c1 , c0 )
Input:
Output:
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15:
obtained by sequentially computing intermediate terms of the form
i
a2
−1
2j
j
· a2
−1
.
(2)
0 ≤ i, j ≤ m−1, are elements of the addition chain assoe = m−1 [14,22]. In F2283 , the shortest eleven-step ade = 282 is, 1→2→3→4→8→16→17→35→70→140→141→282.
where the exponents
ciated to the exponent dition chain for
The computation of the above outlined procedure introduces an important memory cost of storing 6 multi-squaring tables, each table contains
16d m 4e
eld elements. However, several of those tables can be reused in the interleaving approach for scalar multiplications by exploiting powers of the Frobenius automorphism as will be explained in the next section.
Integer
τ -NAF
recoding Solinas [24] presented a
τ -adic
analogue of the cus-
tomary non-adjacent form (NAF) recoding as we briey explained next. An
ρ ∈ Z[τ ] is found with ρ ≡ k (mod δ) of as small norm as possible, δ = (τ m − 1)/(τ − 1), where for the subgroup of interest, kP = ρP and a width-ω τ NAF representation for ρ can be obtained in a way that ω mimics the usual ω NAF recoding. As in [24], let us dene αi = i mod τ for ω−1 i ∈ {1, 3, 5, . . . , 2 − 1}. A width-ω τ NAF of a nonzero element k is an Pl−1 i expression k = u i=0 i τ where each ui ∈ {0, ±α1 , ±α3 , . . . , ±α2ω−1 −1 } and ul−1 6= 0, and at most one of any consecutive ω coecients is nonzero. Un-
element where
der reasonable assumptions, Solinas procedures produce an expansion with length
l ≤ m + 1.
Although the cost of
ω NAF
recoding is usually negligible when compared
with the overall cost of scalar multiplication, this is not generally the case with Koblitz curves, where integer to
ωτ N AF
recoding can reach more
than 10% of the scalar multiplication computational time [26]. In this work,
ωτ -NAF
recoding [24] was implemented by employing as much as possible
branchless techniques: the branches inside the recoding operation essentially depend on random values, presenting a worst-case scenario for branch prediction and causing severe timing penalties. In addition to that, the code was also completely unrolled to handle only the precision required in the current iteration. Since the magnitude of the involved scalars gets reduced with each iteration, it is suboptimal to perform operations considering the initial full precision. The deterministic nature of the algorithm allows one to know exactly what is the iteration of the main recoding loop when the intermediate values lose a full processor word of precision.
3
High-level techniques
In the last section, several notes detailed our algorithmic and implementation choices for eld arithmetic. This section describes the higher-level strategies used in the elliptic curve arithmetic layer for increasing the performance of scalar multiplication.
3.1
Exploiting powers of the Frobenius automorphism
Scalar multiplication algorithms on Koblitz curves are always tailored to exploit
τ on E(F2m ) given by τ (x, y) = (x2 , y 2 ). One such τ -NAF scalar multiplication algorithm [24] and its widthw window variants. Given k ∈ Z and P ∈ E(F2m ), these methods work by rst P writing k = ki τ i for ki ∈ {0, ±α1 , ±α3 , . . . , ±α2ω−1 −1 }, with αi = i mod τ ω ω−1 for i ∈ {1, 3, 5, . . . , 2 − 1}. Then the scalar multiplication is computed as P kP = ki τ i (P ). i While powers τ of the automophism can be automatically considered en-
the Frobenius automorphism example is the classic
domorphisms in the context of the GLV method [11], this does not bring any performance improvement, since applying these powers to a point has exactly the
τ2i -th
same cost of iterating the automorphism during a standard execution of the NAF algorithm. By employing time-memory trade-os for computing xed powers with cost signicantly smaller than form
τ bm/ic
i consecutive squarings, a map of the
can now be seen as an endomorphism useful for accelerating scalar
multiplication through interleaving strategies. For example, the map
ψ ≡ τ bm/2c
allows an interleaved scalar multiplication of two points from the expression
kP = k1 P + 2bm/2c k2 P =
k2,i τ i ψ(P ), saving the computational m cost of 2 eld squarings. This might be seen as a modest saving as eld squaring in a binary eld is often considered a free of cost operation. However, this P
k1,i τ i P +
P
is not entirely true when working with inecient irreducible polynomials that lead to relatively expensive modular reductions. This is exactly the case studied in this work and, to be more precise, it can be said instead that interleaving via the
ψ
endomorphism saves the computational cost associated to
reductions.
bm 2c
modular
As explained above, With the map
ψ,
an analogue of a bidimensional GLV
decomposition for a Koblitz curve can be achieved. Similarly, the usage of the
τ bm/3c
and
τ bm/4c
maps can be seen as analogues to
decompositions or, more generally, the phism as an analogue of an case where
m = 283,
i-th
i-dimensional
3-
and
4-dimensional
GLV
power of the Frobenius automor-
GLV decomposition. In our working
note that the addition chain for Itoh-Tsuji inversion was
already chosen to include
bm/2c
and
bm/4c.
Thus, exploiting these powers of
the automorphism does not implies additional storage costs. Observe that [1,25] already explored this concept to obtain parallel formulations of scalar multiplication in Koblitz curves.
3.2
Lazy-reduced mixed point addition
3.3
Precomputation scheme
3.4
Scalar multiplication algorithm
3.5
Operation count
4
Experimental Results and Discussion
In order to illustrate the performance obtained by the proposed techniques, we implemented a library targeted to the Westmere and Sandy Bridge microarchitectures, focusing our eorts on benetting from the SSE and AVX instruction sets with the corresponding availability of 128-bit and 256-bit registers. The library was implemented in the C programming language, with vector instructions accessed through their intrinsics interface. Both version 4.7.0 of the GNU C Compiler Suite (GCC) and version 12.1 of the Intel C Compiler (ICC) were used to build the library in a GNU/Linux environment. Benchmarking was conducted on an Intel Core i7-2600K processor clocked at 3.4 GHz., following the guidelines provided in the EBACS website [3]. Namely, automatic overclocking, frequency scaling and HyperThreading technologies were disable to reduce randomness in the results. Table 2 presents timings and ratios related to the cost of multiplication for the low-level eld arithmetic layer of the library, which computes basic operations on the eld
F2283 .
Table 2.
Base eld operation GCC ICC op/M Multiplication 143 149 1.00 Squaring 28 29 0.18 Multi-Squaring 236 237 1.59 Inversion 3,292 3,308 22.20 Timings given in clock cycles for basic operations in F2283 .
Table 3 shows the number of clock cycles for elliptic curve operations, such as point addition, Frobenius endomorphism, and point doubling. The later is shown
just to reect the improvement of using doubling-free scalar multiplication as in the case of Koblitz curves.
Elliptic curve operation GCC ICC op/M Frobenius (Ane) 55 56 0.37 Frobenius (LD) 85 83 0.55 Doubling (LD) 741 764 5.12 Addition (LD Mixed) 1,300 1,336 8.96 Addition (LD General) 2,086 2,145 14.39 Table 3. Elliptic curve operations on NIST-K283 when points are represented in ane or López-Dahab coordinates [20].
Timings reported for scalar multiplication are divided into three scenarios: (i) known point, where the point to be multiplied is already known before the execution of scalar multiplication; (ii) unknown point, the general case, where the input point is known until scalar multiplication is processed; (iii) double multiplication of xed and random points, a case usually needed for verifying curve-based digital signatures. For the three scenarios, we used interleaved versions of the width-5 window
τ -NAF
scalar multiplication algorithm. We present
timings in Table 4.
Scalar multiplication GCC ICC Fixed point (kG) 74.9 76.1 Random point (kP ) 99.2 101.9 Fixed/random point (kG + lQ) 166.5 171.1 Table 4. Scalar multiplication in three dierent scenarios: xed, random and double point. Timings are given in 103 processing cycles.
5
Conclusion
References
1. Omran Ahmadi, Darrel Hankerson, and Francisco Rodríguez-Henríquez. Parallel formulations of scalar multiplication on Koblitz curves. Journal of Universal Computer Science, 14(3):481504, 2008. 2. D. F. Aranha, J. López, and D. Hankerson. Ecient Software Implementation of Binary Field Arithmetic Using Vector Instruction Sets. In M. Abdalla and P. S. L. M. Barreto, editors, 1st International Conference on Cryptology and Information Security (LATINCRYPT 2010), volume 6212 of LNCS, pages 144161, 2010. 3. D. J. Bernstein and T. Lange (editors). eBACS: ECRYPT Benchmarking of Cryptographic Systems. http://bench.cr.yp.to, accessed 25 May 2010.
4. Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin Yang. High-speed high-security signatures. In Bart Preneel and Tsuyoshi Takagi, editors, Cryptographic Hardware and Embedded Systems - CHES 2011, volume 6917 of Lecture Notes in Computer Science, pages 124142. Springer, 2011. 5. J. W. Bos, T. Kleinjung, R. Niederhagen, and P. Schwabe. Ecc2k-130 on cell cpus. In J. Bernstein D and T. Lange, editors, 3rd International Conference on Cryptology in Africa (AFRICACRYPT 2010), volume 6055 of LNCS, pages 225 242. Springer, 2010. 6. M. Brown, D. Hankerson, J. López, and A. Menezes. Software Implementation of the NIST Elliptic Curves Over Prime Fields. In D. Naccache, editor, Cryptographers' Track at RSA Conference (CT-RSA 2001), volume 2020 of LNCS, pages 250265. Springer, 2001. 7. Murat Cenk and Ferruh Özbudak. Improved polynomial multiplication formulas over $if2 $ using chinese remainder theorem. IEEE Trans. Computers, 58(4):572 576, 2009. 8. N. Firasta, M. Buxton, P. Jinbo, K. Nasri, and S. Kuo. Intel AVX: New frontiers in performance improvement and energy eciency. White paper. http://software. intel.com/. 9. A. Fog. Instruction tables: List of instruction latencies, throughputs and microoperation breakdowns for Intel, AMD and VIA CPUs. http://www.agner.org/ optimize/instruction_tables.pdf, version published on 29 Feb 2012. 10. Steven D. Galbraith, Xibin Lin, and Michael Scott. Endomorphisms for faster elliptic curve cryptography on a large class of curves. In Antoine Joux, editor, Advances in Cryptology - EUROCRYPT 2009, 28th Annual International, volume 5479 of Lecture Notes in Computer Science, pages 518535. Springer, 2009. 11. R. Gallant, R. Lambert, and S. Vanstone. Faster Point Multiplication on Elliptic Curves with Ecient Endomorphisms. In J. Kilian, editor, 21st Annual International Cryptology Conference (CRYPTO 2001), volume 2139 of LNCS, pages 190200. Springer, 2001. 12. P. Gaudry, R. Brent, P. Zimmermann, and E. Thomï¾ 21 . The gf2x binary eld multiplication library. https://gforge.inria.fr/projects/gf2x/. 13. P. Gaudry and E. Thomé. The mpFq library and implementing curve-based key exchanges. In Software Performance Enhancement of Encryption and Decryption (SPEED 2007), pages 4964, 2009. http://www.hyperelliptic.org/SPEED/ record.pdf. 14. J. Guajardo and C. Paar. Itoh-Tsujii inversion in standard basis and its application in cryptography and codes. Designs, Codes and Cryptography, 25(2):207216, 2002. 15. D. Hankerson, A. J. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer-Verlag, Secaucus, NJ, USA, 2003. 16. Intel. Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference. http://www.intel.com, 2002. 17. Toshiya Itoh and Shigeo Tsujii. A fast algorithm for computing multiplicative inverses in GF(2m ) using normal bases. Inf. Comput., 78(3):171177, 1988. 18. N. Koblitz. CM-Curves with Good Cryptographic Properties. In 11th Annual International Cryptology Conference (CRYPTO 1991), volume 576 of LNCS, pages 279287. Springer, 1991. 19. P. Longa and C. H. Gebotys. Ecient techniques for high-speed elliptic curve cryptography. In S. Mangard and F.-X. Standaert, editors, 12th International Workshop on Cryptographic Hardware and Embedded Systems (CHES 2010), volume 6225 of LNCS, pages 8094. Springer, 2010.
20. J. López and R. Dahab. Improved Algorithms for Elliptic Curve Arithmetic in GF(2n ). In S. E. Tavares and H. Meijer, editors, 5th Annual International Workshop Selected Areas in Cryptography (SAC 98), volume 1556 of LNCS, pages 201 212. Springer, 1998. 21. P.L. Montgomery. Five, six, and seven-term Karatsuba-like formulae. IEEE Transactions on Computers, 54(3):362369, 2005. 22. Francisco Rodríguez-Henríquez, Guillermo Morales-Luna, Nazar A. Saqib, and Nareli Cruz-Cortés. Parallel itohtsujii multiplicative inversion algorithm for a special class of trinomials. Des. Codes Cryptography, 45(1):1937, October 2007. 23. Michael Scott. Optimal Irreducible Polynomials for GF (2m ) Arithmetic. Cryptology ePrint Archive, Report 2007/192, 2007. http://eprint.iacr.org/. 24. J. A. Solinas. Ecient Arithmetic on Koblitz Curves. Designs, Codes and Cryptography, 19(2-3):195249, 2000. 25. J. Taverne, A. Faz-Hernández, D. F. Aranha, F. Rodríguez-Henríquez, D. Hankerson, and J. López. Software implementation of binary elliptic curves: Impact of the carry-less multiplier on scalar multiplication. In B. Preneel and T. Takagi, editors, 13th International Workshop on Cryptographic Hardware and Embedded Systems (CHES 2011), volume 6917 of LNCS, pages 108123. Springer, 2011. 26. J. Taverne, A. Faz-Hernández, D. F. Aranha, F. Rodríguez-Henríquez, D. Hankerson, and J. López. Speeding scalar multiplication over binary elliptic curves using the new carry-less multiplication instruction. Journal of Cryptographic Engineering, 1(3):187199, 2011.