Faster implementation of scalar multiplication on

Faster implementation of scalar multiplication on Koblitz curves We design a state-of-the-art software implementation of eld and elliptic curve arithmetic in standard Koblitz curves at the 128-bit security level. Field arithmetic is carefully crafted by using the best formulae and implementation strategies available; and the increasingly common native support to binary eld arithmetic in modern desktop computing platforms. The i-th power of the Frobenius automorphism on Koblitz curves is exploited to obtain new and faster interleaved versions of the well-known τ -NAF scalar multiplication algorithm. The eciency of these versions is illustrated by timings for scalar multiplication of xed, random and multiple points. In particular, the random point case is computed in just below 105 cycles, setting a new speed record across all curves with or without endomorphisms dened over binary or prime elds. Our focus is on pure eciency, thus our results suggest a tradeo between speed, compliance with the published standards and sidechannel protection. Finally, we estimate the performance of curve-based cryptographic protocols instantiated using the proposed techniques and compare our results to related work. Abstract.

1

Introduction

Since its introduction in 1985, elliptic curve cryptography has become one of the most important and ecient public key cryptosystems still in use. Its security is based on the computational intractability of solving discrete logarithm problems over the group formed by the rational points on an elliptic curve. Anomalous binary curves also known as

Koblitz Elliptic Curves, were intro-

Fq for q = 2m , a Koblitz curve Ea (Fq ), (x, y) ∈ Fq × Fq that satisfy the equation

duced in [18]. Given a nite eld dened as the set of points

Ea : y 2 + xy = x3 + ax2 + 1, together with a point at innity denoted by

O.

a ∈ {0, 1},

is

(1)

It is known that

Ea (Fq )

forms

an additive Abelian group with respect to the elliptic point addition operation, In

Ea is a Koblitz curve with order #Ea (F2m ) = 22−a r, r is an odd prime. Let hP i be an additively written subgroup in Ea of order r , and let k be a positive integer such that k ∈ [0, r − 1]. Then,

this paper it is assumed that where prime

the elliptic curve scalar multiplication operation computes the multiple Q = kP, which corresponds to the point resulting of adding

r, P

and

Q ∈ hP i,

P

to itself,

k − 1 times. Given

the Elliptic Curve Discrete Logarithm Problem (ECDLP)

consists of nding the unique integer

k

such that

Q = kP holds. F2 , the Frobenius

Since Koblitz curves are dened over the binary eld

map

and its inverse naturally extend to an automorphism of the curve denoted by

τ . The τ map takes (x, y) to (x2 , y 2 ) and O to O. It can been shown that (x4 , y 4 ) + 2(x, y) = µ(x2 , y 2 ) for every (x, y) on Ea , where µ = (−1)1−a . In 2 other words, τ satises τ + 2 = µτ . By solving the quadratic equation, we can √ −1+ −7 associate τ with the complex number τ = . 2 Elliptic curve scalar multiplication is the most expensive operation in cryptographic protocols whose security guarantees are based on the ECDLP. Improving the computational eciency of this operation is a widely studied problem. Across the years, a number of algorithms and techniques providing ecient implementations with higher performance have been proposed [15]. Many research works have focused their eorts on the unknown point scenario, where the base point

P

is not known in advance and when only one single scalar multiplication is

required, as in the case of the Die-Hellman key exchange protocol [26,19,13]. However, there are situations where a single scalar multiplication must be performed on xed base points such as in the case of the key and signature generation procedures of the Elliptic Curve Digital Signature Algorithm (ECDSA) standard. In other scenarios, such as in the ECDSA signature verication, the simultaneous computation of two scalar multiplications (one with unknown point and the other with xed point) of the form

R = kG + lQ,

is required. Compar-

atively less research works have studied the latter cases [6,10,4]. In [25,26], authors evaluated the achievable performance of binary elliptic curve arithmetic in the latest 64-bit micro-architectures, presenting a comprehensive analysis of unknown-point scalar multiplication computations on random and Koblitz NIST elliptic curves at the 112-bit and 192-bit security levels. However, for the 128-bit security level authors in [26] only considered a random curve with side-channel resistant scalar multiplication.

1

In this work we revisit the single-core software computation of the scalar multiplication on Koblitz curves dened over binary elds. This study includes the computation of the scalar multiplication using unknown and xed points and single and simultaneous scalar multiplication computations as required in the digital signature generation and verication ECDSA procedures. We extend the analysis given in [25,26] and further investigate the remaining curve choices to provide a complete picture of the performance scenario, while also showing through operation counting and experimental results that Koblitz curves are still the fastest choice for deploying curve-based cryptography if sucient native support for binary eld arithmetic is available in the target platform and if resistance to some software side-channel attacks can be disregarded.

1

2

2

Scalar multiplication on the Edwards curve CURVE2251 was implementing in [26] using the Montgomery laddering approach that is naturally protected against rstorder side-channel attacks. Since our library does not read secret data from RAM we believe that our software is basically immune to cache-timing attacks. Furthermore, since we compute the scalar multiplication using an interleaved approach that simultaneously processes the scalars associated to two or more points, we believe that our library oers some rst-order resistance against branch-prediction side-channel attacks.

To this end, we adopted several techniques previously proposed by dierent authors: (i) formulation of binary eld arithmetic using vector instruc-

2k -powers in multiplication over F2 and

tions [2]; (ii) time-memory trade-os for the evaluation of xed binary elds [5]; (iii) new formulas for polynomial

its extensions [7]; (iv) ecient support for the recently introduced carry-less multiplier [26]. Besides of building on these nite eld arithmetic advancements, this paper presents several novel techniques including: (i) improved implementation of

ωτ -

NAF integer recoding; (ii) a new precomputation scheme for small multiples of a random point in a Koblitz curve; (iii) lazy-reduction formulae for mixed addition in binary elliptic curves; (iv) novel interleaving strategies of the

τ -NAF

algorithm for scalar multiplication in Koblitz curves via powers of the Frobenius automorphism. Further, in this work only the tried and tested Koblitz curve NIST-K283 was considered, providing immediate compatibility with standards and interoperability with existing implementations. Note however that most of our techniques are not in any sense restricted to this curve choice, and can therefore be used to accelerate scalar multiplication in other Koblitz curves at dierent security levels. Our main implementation result is a speed record for the unknown-point single-core scalar multiplication computation over the NIST-K283 curve in slightly less than

105

clock cycles. Running on an Intel Core i7-2600K processor clocked

at 3.4 GHz we were able to compute a random point scalar multiplication in just 29.18µS. To our knowledge, this is the fastest software implementation of this operation across all curves ever proposed for cryptographic applications at the 128-bit security level. This document is structured as follows: Section 2 discusses the low-level techniques used for the implementation of eld arithmetic and integer recoding. Section 3 presents higher-level techniques for arithmetic in the elliptic curve, comprising improved formulas for mixed addition by means of lazy reduction and strategies for speeding up the scalar multiplication computation by using powers of the Frobenius automorphism. Section 4 illustrates the eciency of the proposed techniques reporting timings for scalar multiplication in the xed, unknown and multiple point scenarios; and extensively compares the results with related work. Additionally in this Section we estimate the performance of signature and key agreement protocols when they are instantiated through the usage of Koblitz curves. The nal section concludes the paper with perspectives for further performance improvement based on upcoming instruction sets.

2

Low-level techniques

f (z) be a monic irreducible polynomial of degree m over F2 . Then, the binary F2m is isomorphic to F2m ∼ = F2 [z]/ (f (z)), i.e., F2m is a nite eld of characteristic 2, whose elements are the set of all the binary polynomials modulo f (z). In order to achieve a security level equivalent to 128-bit AES

Let

extension eld

when working with binary elliptic curves, NIST recommends to choose the eld

m = 283, along with the irreducible pentanomial f (z) = z 283 + z 12 + z + z + 1. It is noticed that in a modern 64-bit computing platform, an element m of the eld F2m represented in canonical basis consumes n64 = d 64 e processor words, or n64 = 5 when m = 283. In the rest of this Section, algorithm and

extension

7

5

formula descriptions will refer to either generic or xed versions of the binary eld, depending on whether or not the optimization is restricted to the choice of

m = 283. As it was mentioned before, in this work we made an extensive use of vector

instruction sets present in contemporary desktop processors. The platform model given in Table 1 extends the notation reported in [2]. Notice that vectorized multiple-precision or intra-digit shifts can always be made faster when the shift amount is a multiple of 8 by means of the memory alignment instruction or the bytewise shift instruction, respectively, and that a simultaneous table lookup mapping 4-bit indexes to bytes can be implemented through the byte shuing instruction called

Table 1.

PSHUFB

in the SSE instruction set.

Relevant vector instructions for the implementation of binary eld arithmetic.

Mnemonic

⊗ -8 , -8 8 , 8 ⊕, ∧, ∨ C, B

Description

SSE

Carry-less multiplication PCLMULQDQ 64-bit bitwise shifts PSLLQ,PSRLQ 128-bit bytewise shift PSLLDQ,PSRLDQ Bitwise XOR,AND,OR PXOR,PAND,POR Memory alignment/Multi-precision shifts PALIGNR

In the following we will briey discuss how the most relevant eld arithmetic operations, namely, addition, multiplication, squaring and multi-squaring, modular reduction, inversion and integer to

ωτ N AF

recoding were implemented.

Addition. It is the simplest operation on a binary eld and can employ the exclusive-or instruction with the largest operand size in the target platform. This is particularly benecial for vector instructions, but according to our experiments, the 128-bit SSE [16] integer instruction proved to be faster than the 256-bit AVX [8] oating-point instruction due to a higher reciprocal throughput [9] when operands are stored into registers.

Multiplication. Field multiplication is the performance-critical arithmetic op-

a, b ∈ F2283 we c = a · b mod f (z). This

eration for elliptic curve arithmetic. Given two eld elements want to compute a third eld element

c ∈ F2m ,

as

can be accomplished by performing two separate steps: rst the polynomial multiplication of the two operands

a, b

is computed and then the resulting

double length polynomial is modular reduced by

f (z). From our eld element

representation, the polynomial multiplication step can be seen as the compu-

n64 − 1 degree polynomials each with n64 64-bit d n264 e−1 den64 gree polynomials each with d 2 e 128-bit coecients. In the latter case, each tation of the product of two

coecients. Alternatively, the two operands may also be seen as

term-by-term multiplication can be solved carry-less multiplications. When

n64 = 5,

à la

Karatsuba by performing 3

the above approaches require 13

(see [21,7]) and 14 invocations of the carry-less multiplier instruction, respectively. Algorithm 1 below presents our implementation of eld multiplication over the eld

F2283

with 64-bit granularity using the formula given in [7]. The

computational complexity of Algorithm 1 is of thirteen and sixty-two 64-bit

3 plus one modu-

carry less multiplications and 64-bit additions, respectively,

lar reduction (Alg. 1, Step 20) that will be discussed later. The most salient feature of Algorithm 1 is that all the thirteen carry-less multiplications have been grouped in the three for loops of steps 1-4, 6-8 and 10-12. This is an attractive feature from the implementation point of view, as it is important to potentially reduce the cost of the carry-less multiplication instruction from 14 to 8 clock cycles in the Intel Sandy Bridge micro-architecture; and from 12 to 7 clock cycles in an AMD Bulldozer [9]. The rationale behind this cost reduction is that the batch execution of independent multiplications directly benet the micro-architecture pipeline occupancy level. It is worth mentioning that in [26], authors concluded that the 64-bit granular approach tends to consume more resources and complicate register allocation, limiting the natural throughput exhibited by the carry-less multiplication instruction. However, if the digits are stored in an interleaved form [12], these side eects are mitigated and a throughput can again be achieved.

Squaring and multi-squaring. Squaring is a cheap operation in a binary eld because due to the action of the Frobenius map, it consists of a linear expansion of coecients. Vectorized approachs using simultaneous table lookups through byte shuing instructions allow a particularly ecient formulation of the coecient expansion stage [2]. The modular reduction stage usually is the most expensive step when computing a squaring, especially when is an

ordinary pentanomial

f (z)

(see [23]) for the given word size. Dealing with

ordinary pentanomials requires exible and often shifting instructions that are unsupported by the platform model. Multi-squaring is a time-memory trade-o in which a table of

k

2

16d m 4 e eld elements allows computing any xed

power with the cost equivalent of just a few squarings. It is usually the

case that the multi-squaring approach becomes faster whenever

k≥6

[26].

Contrary to addition, the availability of 256-bit instructions here contributes signicantly to a performance increase. This is because this operation basically consists of a sequence of additions with eld elements obtained through a precomputed table stored in main memory.

Modular reduction. Ecient modular reduction of a

2 · n64

double-precision

value coming as the result of a squaring or multiplication operation to a proper eld element of word-size

n64 , involves expressing the required shifted

additions in terms of the best shifting instructions possible. For the instruc-

3

which yields the same number of multiplications but with ten less additions that the computational complexity reported in [21] for multiplication of degree-four binary polynomials.

Algorithm 1 Proposed implementation of multiplication in

F2283 .

a(z) = a[0..4], b(z) = b[0..4]. Output: c(z) = c[0..4] = a(z) · b(z). Note: Pairs of 64-bit words represent vector registers. 1: for i ← 0 to 4 do 2: ri ← (a[i], b[i]) 3: mi ← ri [0] ⊗ ri [1] 4: end for 5: t0 ← r0 ⊕ r1 , t1 ← r0 ⊕ r2 , t2 ← r2 ⊕ r4 , t3 ← r3 ⊕ r4 6: for i ← 0 to 3 do 7: m5+i ← ti [0] ⊗ ti [1] 8: end for 9: t1 ← t1 ⊕ r3 , t2 ← t2 ⊕ r1 , t3 ← t3 ⊕ t0 , t0 ← t3 ⊕ r2 10: for i ← 0 to 3 do 11: m9+i ← ti [0] ⊗ ti [1] 12: end for 13: c0 ← m0 , c4 ← m4 , t0 ← m0 ⊕ m1 , t1 ← m3 ⊕ m4 14: c1 ← t0 ⊕ m2 ⊕ m6 , c3 ← t1 ⊕ m2 ⊕ m7 15: t0 ← t0 ⊕ m5 , t3 ← t1 ⊕ m8 16: c2 ← t0 ⊕ t3 ⊕ m9 ⊕ m10 ⊕ m11 , t1 ← m9 ⊕ m12 17: t2 ← t1 ⊕ c1 ⊕ m4 ⊕ m11 18: t1 ← t1 ⊕ c3 ⊕ m0 ⊕ m10 19: (c4 , c3 , c2 , c1 , c0 ) ← (c4 , c3 , c2 , c1 , c0 ) ⊕ ((0, t3 , t2 , t1 , t0 ) 64) 20: return c = (c4 , c3 , c2 , c1 , c0 ) mod f (z) Input:

tion sets available in our target platform, this amounts to converting the highest possible number of shifts to memory alignment instructions or bytewise shifts. Curve NIST-K283 is dened over an ordinary pentanomial, a particularly inecient choice for our vector register size. However, by observing that

f (z) = z 283 + z 12 + z 7 + z 5 + 1 = z 283 + (z 7 + 1)(z 5 + 1),

one can take advantage of this factorization to formulate faster shifted additions. Algorithm 2 presents our explicit scheduling of shift instructions to perform modular reduction in written as of

c.

c = p1 ||p0

F2283 .

Suppose that the polynomial

p0

where the polynomial

The computation of

c

mod

c

is

represent the 283 lower bits

f (z) in Algorithm 2 is performing p1 is computed by shifting the

lows: in lines 1 to x, the polynomial

as folvector

(c4,c3,c2) to the left exactly 27 bits. Then, in lines xx to yx, the operation

c + p1 ∗ (z 7 + 1) ∗ (z 5 + 1)

is performed, thus getting the vector

(c2 , c1 , c0 ) . c2 are

Finally, in lines xxx to yyy, the remaining 101 most signicant bits of

reduced, a process that again involves a multiplication by the polynomial

(z 7 + 1) ∗ (z 5 + 1). Inversion. The eld inversion approach that probably is the friendliest for vector instruction sets, is the Itoh-Tsuji inversion [17] that computes the eld inverse of

a

using the identity

a−1 =

m−1

a2

−1

2

. The term

m−1

a2

−1

is

f (z) = z 283 + (z 7 + 1)(z 5 + 1).

Algorithm 2 Implementation of reduction by

Double-precision polynomial stored into 128-bit registers c = (c4 , c3 , c2 , c1 , c0 ). Field element c mod f (z) stored into 128-bit registers (c2 , c1 , c0 ). t2 ← c2 , t0 ← (c3 , c2 ) B 64, t1 ← (c4 , c3 ) B 64 c4 ← c4 -8 27, c3 ← c3 -8 27, c3 ← c3 ⊕ (t1 -8 37) c2 ← c2 -8 27, c2 ← c2 ⊕ (t0 -8 37) t0 ← (c4 , c3 ) B 120, c4 ← c4 ⊕ (t0 -8 1) t1 ← (c3 , c2 ) B 64, c3 ← c3 ⊕ (c3 -8 7), c3 ← c3 ⊕ (t1 -8 57) t0 ← c2 8 64, c2 ← c2 ⊕ (c2 -8 7), c2 ← c2 ⊕ (t0 -8 57) t0 ← (c4 , c3 ) B 120, c4 ← c4 ⊕ (t0 -8 3) t1 ← (c3 , c2 ) B 64, c3 ← c3 ⊕ (c3 -8 5), c3 ← c3 ⊕ (t1 -8 59) t0 ← c2 8 64, c2 ← c2 ⊕ (c2 -8 5), c2 ← c2 ⊕ (t0 -8 59) c0 ← c0 ⊕ c2 , c1 ← c1 ⊕ c3 , c2 ← t2 ⊕ c4 t0 ← c4 -8 27 t1 ← t0 ⊕ (t0 -8 5) t0 ← t1 ⊕ (t1 -8 7) c0 ← c0 ⊕ t0 , c2 ← c2 ∧ (0x0000000000000000,0x0000000007FFFFFF) return c = (c2 , c1 , c0 )

Input:

Output:

1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15:

obtained by sequentially computing intermediate terms of the form

i

a2

−1

2j

j

· a2

−1

.

(2)

0 ≤ i, j ≤ m−1, are elements of the addition chain assoe = m−1 [14,22]. In F2283 , the shortest eleven-step ade = 282 is, 1→2→3→4→8→16→17→35→70→140→141→282.

where the exponents

ciated to the exponent dition chain for

The computation of the above outlined procedure introduces an important memory cost of storing 6 multi-squaring tables, each table contains

16d m 4e

eld elements. However, several of those tables can be reused in the interleaving approach for scalar multiplications by exploiting powers of the Frobenius automorphism as will be explained in the next section.

Integer

τ -NAF

recoding Solinas [24] presented a

τ -adic

analogue of the cus-

tomary non-adjacent form (NAF) recoding as we briey explained next. An

ρ ∈ Z[τ ] is found with ρ ≡ k (mod δ) of as small norm as possible, δ = (τ m − 1)/(τ − 1), where for the subgroup of interest, kP = ρP and a width-ω τ NAF representation for ρ can be obtained in a way that ω mimics the usual ω NAF recoding. As in [24], let us dene αi = i mod τ for ω−1 i ∈ {1, 3, 5, . . . , 2 − 1}. A width-ω τ NAF of a nonzero element k is an Pl−1 i expression k = u i=0 i τ where each ui ∈ {0, ±α1 , ±α3 , . . . , ±α2ω−1 −1 } and ul−1 6= 0, and at most one of any consecutive ω coecients is nonzero. Un-

element where

der reasonable assumptions, Solinas procedures produce an expansion with length

l ≤ m + 1.

Although the cost of

ω NAF

recoding is usually negligible when compared

with the overall cost of scalar multiplication, this is not generally the case with Koblitz curves, where integer to

ωτ N AF

recoding can reach more

than 10% of the scalar multiplication computational time [26]. In this work,

ωτ -NAF

recoding [24] was implemented by employing as much as possible

branchless techniques: the branches inside the recoding operation essentially depend on random values, presenting a worst-case scenario for branch prediction and causing severe timing penalties. In addition to that, the code was also completely unrolled to handle only the precision required in the current iteration. Since the magnitude of the involved scalars gets reduced with each iteration, it is suboptimal to perform operations considering the initial full precision. The deterministic nature of the algorithm allows one to know exactly what is the iteration of the main recoding loop when the intermediate values lose a full processor word of precision.

3

High-level techniques

In the last section, several notes detailed our algorithmic and implementation choices for eld arithmetic. This section describes the higher-level strategies used in the elliptic curve arithmetic layer for increasing the performance of scalar multiplication.

3.1

Exploiting powers of the Frobenius automorphism

Scalar multiplication algorithms on Koblitz curves are always tailored to exploit

τ on E(F2m ) given by τ (x, y) = (x2 , y 2 ). One such τ -NAF scalar multiplication algorithm [24] and its widthw window variants. Given k ∈ Z and P ∈ E(F2m ), these methods work by rst P writing k = ki τ i for ki ∈ {0, ±α1 , ±α3 , . . . , ±α2ω−1 −1 }, with αi = i mod τ ω ω−1 for i ∈ {1, 3, 5, . . . , 2 − 1}. Then the scalar multiplication is computed as P kP = ki τ i (P ). i While powers τ of the automophism can be automatically considered en-

the Frobenius automorphism example is the classic

domorphisms in the context of the GLV method [11], this does not bring any performance improvement, since applying these powers to a point has exactly the

τ2i -th

same cost of iterating the automorphism during a standard execution of the NAF algorithm. By employing time-memory trade-os for computing xed powers with cost signicantly smaller than form

τ bm/ic

i consecutive squarings, a map of the

can now be seen as an endomorphism useful for accelerating scalar

multiplication through interleaving strategies. For example, the map

ψ ≡ τ bm/2c

allows an interleaved scalar multiplication of two points from the expression

kP = k1 P + 2bm/2c k2 P =

k2,i τ i ψ(P ), saving the computational m cost of 2 eld squarings. This might be seen as a modest saving as eld squaring in a binary eld is often considered a free of cost operation. However, this P

k1,i τ i P +

P

is not entirely true when working with inecient irreducible polynomials that lead to relatively expensive modular reductions. This is exactly the case studied in this work and, to be more precise, it can be said instead that interleaving via the

ψ

endomorphism saves the computational cost associated to

reductions.

bm 2c

modular

As explained above, With the map

ψ,

an analogue of a bidimensional GLV

decomposition for a Koblitz curve can be achieved. Similarly, the usage of the

τ bm/3c

and

τ bm/4c

maps can be seen as analogues to

decompositions or, more generally, the phism as an analogue of an case where

m = 283,

i-th

i-dimensional

3-

and

4-dimensional

GLV

power of the Frobenius automor-

GLV decomposition. In our working

note that the addition chain for Itoh-Tsuji inversion was

already chosen to include

bm/2c

and

bm/4c.

Thus, exploiting these powers of

the automorphism does not implies additional storage costs. Observe that [1,25] already explored this concept to obtain parallel formulations of scalar multiplication in Koblitz curves.

3.2

Lazy-reduced mixed point addition

3.3

Precomputation scheme

3.4

Scalar multiplication algorithm

3.5

Operation count

4

Experimental Results and Discussion

In order to illustrate the performance obtained by the proposed techniques, we implemented a library targeted to the Westmere and Sandy Bridge microarchitectures, focusing our eorts on benetting from the SSE and AVX instruction sets with the corresponding availability of 128-bit and 256-bit registers. The library was implemented in the C programming language, with vector instructions accessed through their intrinsics interface. Both version 4.7.0 of the GNU C Compiler Suite (GCC) and version 12.1 of the Intel C Compiler (ICC) were used to build the library in a GNU/Linux environment. Benchmarking was conducted on an Intel Core i7-2600K processor clocked at 3.4 GHz., following the guidelines provided in the EBACS website [3]. Namely, automatic overclocking, frequency scaling and HyperThreading technologies were disable to reduce randomness in the results. Table 2 presents timings and ratios related to the cost of multiplication for the low-level eld arithmetic layer of the library, which computes basic operations on the eld

F2283 .

Table 2.

Base eld operation GCC ICC op/M Multiplication 143 149 1.00 Squaring 28 29 0.18 Multi-Squaring 236 237 1.59 Inversion 3,292 3,308 22.20 Timings given in clock cycles for basic operations in F2283 .

Table 3 shows the number of clock cycles for elliptic curve operations, such as point addition, Frobenius endomorphism, and point doubling. The later is shown

just to reect the improvement of using doubling-free scalar multiplication as in the case of Koblitz curves.

Elliptic curve operation GCC ICC op/M Frobenius (Ane) 55 56 0.37 Frobenius (LD) 85 83 0.55 Doubling (LD) 741 764 5.12 Addition (LD Mixed) 1,300 1,336 8.96 Addition (LD General) 2,086 2,145 14.39 Table 3. Elliptic curve operations on NIST-K283 when points are represented in ane or López-Dahab coordinates [20].

Timings reported for scalar multiplication are divided into three scenarios: (i) known point, where the point to be multiplied is already known before the execution of scalar multiplication; (ii) unknown point, the general case, where the input point is known until scalar multiplication is processed; (iii) double multiplication of xed and random points, a case usually needed for verifying curve-based digital signatures. For the three scenarios, we used interleaved versions of the width-5 window

τ -NAF

scalar multiplication algorithm. We present

timings in Table 4.

Scalar multiplication GCC ICC Fixed point (kG) 74.9 76.1 Random point (kP ) 99.2 101.9 Fixed/random point (kG + lQ) 166.5 171.1 Table 4. Scalar multiplication in three dierent scenarios: xed, random and double point. Timings are given in 103 processing cycles.

5

Conclusion

References

1. Omran Ahmadi, Darrel Hankerson, and Francisco Rodríguez-Henríquez. Parallel formulations of scalar multiplication on Koblitz curves. Journal of Universal Computer Science, 14(3):481504, 2008. 2. D. F. Aranha, J. López, and D. Hankerson. Ecient Software Implementation of Binary Field Arithmetic Using Vector Instruction Sets. In M. Abdalla and P. S. L. M. Barreto, editors, 1st International Conference on Cryptology and Information Security (LATINCRYPT 2010), volume 6212 of LNCS, pages 144161, 2010. 3. D. J. Bernstein and T. Lange (editors). eBACS: ECRYPT Benchmarking of Cryptographic Systems. http://bench.cr.yp.to, accessed 25 May 2010.

4. Daniel J. Bernstein, Niels Duif, Tanja Lange, Peter Schwabe, and Bo-Yin Yang. High-speed high-security signatures. In Bart Preneel and Tsuyoshi Takagi, editors, Cryptographic Hardware and Embedded Systems - CHES 2011, volume 6917 of Lecture Notes in Computer Science, pages 124142. Springer, 2011. 5. J. W. Bos, T. Kleinjung, R. Niederhagen, and P. Schwabe. Ecc2k-130 on cell cpus. In J. Bernstein D and T. Lange, editors, 3rd International Conference on Cryptology in Africa (AFRICACRYPT 2010), volume 6055 of LNCS, pages 225 242. Springer, 2010. 6. M. Brown, D. Hankerson, J. López, and A. Menezes. Software Implementation of the NIST Elliptic Curves Over Prime Fields. In D. Naccache, editor, Cryptographers' Track at RSA Conference (CT-RSA 2001), volume 2020 of LNCS, pages 250265. Springer, 2001. 7. Murat Cenk and Ferruh Özbudak. Improved polynomial multiplication formulas over $if2 $ using chinese remainder theorem. IEEE Trans. Computers, 58(4):572 576, 2009. 8. N. Firasta, M. Buxton, P. Jinbo, K. Nasri, and S. Kuo. Intel AVX: New frontiers in performance improvement and energy eciency. White paper. http://software. intel.com/. 9. A. Fog. Instruction tables: List of instruction latencies, throughputs and microoperation breakdowns for Intel, AMD and VIA CPUs. http://www.agner.org/ optimize/instruction_tables.pdf, version published on 29 Feb 2012. 10. Steven D. Galbraith, Xibin Lin, and Michael Scott. Endomorphisms for faster elliptic curve cryptography on a large class of curves. In Antoine Joux, editor, Advances in Cryptology - EUROCRYPT 2009, 28th Annual International, volume 5479 of Lecture Notes in Computer Science, pages 518535. Springer, 2009. 11. R. Gallant, R. Lambert, and S. Vanstone. Faster Point Multiplication on Elliptic Curves with Ecient Endomorphisms. In J. Kilian, editor, 21st Annual International Cryptology Conference (CRYPTO 2001), volume 2139 of LNCS, pages 190200. Springer, 2001. 12. P. Gaudry, R. Brent, P. Zimmermann, and E. Thomï¾ 21 . The gf2x binary eld multiplication library. https://gforge.inria.fr/projects/gf2x/. 13. P. Gaudry and E. Thomé. The mpFq library and implementing curve-based key exchanges. In Software Performance Enhancement of Encryption and Decryption (SPEED 2007), pages 4964, 2009. http://www.hyperelliptic.org/SPEED/ record.pdf. 14. J. Guajardo and C. Paar. Itoh-Tsujii inversion in standard basis and its application in cryptography and codes. Designs, Codes and Cryptography, 25(2):207216, 2002. 15. D. Hankerson, A. J. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer-Verlag, Secaucus, NJ, USA, 2003. 16. Intel. Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference. http://www.intel.com, 2002. 17. Toshiya Itoh and Shigeo Tsujii. A fast algorithm for computing multiplicative inverses in GF(2m ) using normal bases. Inf. Comput., 78(3):171177, 1988. 18. N. Koblitz. CM-Curves with Good Cryptographic Properties. In 11th Annual International Cryptology Conference (CRYPTO 1991), volume 576 of LNCS, pages 279287. Springer, 1991. 19. P. Longa and C. H. Gebotys. Ecient techniques for high-speed elliptic curve cryptography. In S. Mangard and F.-X. Standaert, editors, 12th International Workshop on Cryptographic Hardware and Embedded Systems (CHES 2010), volume 6225 of LNCS, pages 8094. Springer, 2010.

20. J. López and R. Dahab. Improved Algorithms for Elliptic Curve Arithmetic in GF(2n ). In S. E. Tavares and H. Meijer, editors, 5th Annual International Workshop Selected Areas in Cryptography (SAC 98), volume 1556 of LNCS, pages 201 212. Springer, 1998. 21. P.L. Montgomery. Five, six, and seven-term Karatsuba-like formulae. IEEE Transactions on Computers, 54(3):362369, 2005. 22. Francisco Rodríguez-Henríquez, Guillermo Morales-Luna, Nazar A. Saqib, and Nareli Cruz-Cortés. Parallel itohtsujii multiplicative inversion algorithm for a special class of trinomials. Des. Codes Cryptography, 45(1):1937, October 2007. 23. Michael Scott. Optimal Irreducible Polynomials for GF (2m ) Arithmetic. Cryptology ePrint Archive, Report 2007/192, 2007. http://eprint.iacr.org/. 24. J. A. Solinas. Ecient Arithmetic on Koblitz Curves. Designs, Codes and Cryptography, 19(2-3):195249, 2000. 25. J. Taverne, A. Faz-Hernández, D. F. Aranha, F. Rodríguez-Henríquez, D. Hankerson, and J. López. Software implementation of binary elliptic curves: Impact of the carry-less multiplier on scalar multiplication. In B. Preneel and T. Takagi, editors, 13th International Workshop on Cryptographic Hardware and Embedded Systems (CHES 2011), volume 6917 of LNCS, pages 108123. Springer, 2011. 26. J. Taverne, A. Faz-Hernández, D. F. Aranha, F. Rodríguez-Henríquez, D. Hankerson, and J. López. Speeding scalar multiplication over binary elliptic curves using the new carry-less multiplication instruction. Journal of Cryptographic Engineering, 1(3):187199, 2011.

Faster implementation of scalar multiplication on

Faster implementation of scalar multiplication on

Suggest Documents

Efficient Implementation of Scalar Multiplication for ...

Faster relaxed multiplication - Hal

Euclidean addition chains scalar multiplication on ...

Vectors Multiplication by a Scalar

Fast Scalar Multiplication on the Jacobian of a

Secure Outsourcing of Scalar Multiplication on Elliptic ... - IEEE Xplore

Faster Double-Size Modular Multiplication From ... - CiteSeerX

optimal elliptic curve scalar multiplication using ...

FASTER INTEGER MULTIPLICATION USING PLAIN VANILLA FFT ...

Parallel scalar multiplication on general elliptic curves ... - CiteSeerX

Accelerating the Scalar Multiplication on Elliptic ... - Semantic Scholar

Parallel scalar multiplication on general elliptic ... - Semantic Scholar

Faster Interleaved Modular Multiplication Based on ... - KU Leuven

An Efficient Scalar Multiplication Algorithm for ... - onlinepresent.org

Montgomery Scalar Multiplication for Genus 2 Curves

Scalar Multiplication on Koblitz Curves using Ï2âNAF

Scalar Multiplication on Koblitz Curves using Double Bases

mbNAF for Scalar Multiplication - Springer Link

Parallel Scalar Multiplication on Elliptic Curves in ... - Semantic Scholar

Scalar Multiplication on WeierstraÃ Elliptic Curves ... - Matthieu Rivain

Faster Elliptic Curve Scalar Multiplications on ARM Processors

Implementation of Multiple-precision Modular Multiplication on GPU

implementation of karatsuba algorithm using polynomial multiplication

Implementation of Multi-Precision Multiplication ... - Semantic Scholar

Faster implementation of scalar multiplication on