
NUMS Elliptic Curves and their Efficient Implementation - Webs

NUMS Elliptic Curves and their Implementation
Patrick Longa (Microsoft Research)
Joppe Bos (NXP Semiconductors)
Craig Costello (Microsoft Research)
Michael Naehrig (Microsoft Research)

Elliptic Curve Cryptography (ECC)
(Independently) discovered by V. Miller and N. Koblitz in the mid-1980s. Elliptic curves enable secure and efficient cryptographic systems, currently competing with RSA to become the next-generation public-key system.

NIST Guidelines for Public Key Sizes

ECC key size (bits)   RSA key size (bits)   Key size ratio   AES key size (bits)
160                   1024                  1:6              -
256                   3072                  1:12             128
384                   7680                  1:20             192
512                   15360                 1:30             256

Elliptic Curve Cryptography (ECC): basics (Focusing on the prime case)
A (short Weierstrass) elliptic curve over a prime field $\mathbb{F}_p$ is given by:

$$E/\mathbb{F}_p : y^2 = x^3 + ax + b,$$

where $a, b \in \mathbb{F}_p$ and $\Delta = -16(4a^3 + 27b^2) \neq 0$.


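The set $\{(x, y) \in \mathbb{F}_p \times \mathbb{F}_p : y^2 - x^3 - ax - b = 0\} \cup \{\infty\}$ is an abelian group with group operation (+).
Given points $P$ and $Q$, any of the following cases can arise for $R = P + Q$:
• $P = \infty \Rightarrow R = Q$
• $Q = \infty \Rightarrow R = P$
• $P = -Q \Rightarrow R = \infty$
• $P = Q \Rightarrow R = 2P$ (point doubling)
• otherwise $R = P + Q$ (point addition)

Point addition: for $P = (x_1, y_1)$ and $Q = (x_2, y_2)$, the sum $P + Q = (x_3, y_3)$ is given by

$$\lambda = \frac{y_2 - y_1}{x_2 - x_1}, \qquad x_3 = \lambda^2 - x_1 - x_2, \qquad y_3 = \lambda(x_1 - x_3) - y_1.$$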


Point doubling: for $P = (x_1, y_1)$, the double $2P = (x_3, y_3)$ is given by

$$\lambda = \frac{3x_1^2 + a}{2y_1}, \qquad x_3 = \lambda^2 - 2x_1, \qquad y_3 = \lambda(x_1 - x_3) - y_1.$$

Elliptic Curve Cryptography (ECC): basics (Focusing on the prime case)
Let $\#E = h \cdot n$, where $n$ is a large prime and $h$ is a small cofactor (typically 1, 2 or 4). ECC computations are performed on the subgroup of prime order $n$, which we will denote $E(\mathbb{F}_p)$.
Scalar multiplication: given a point $P \in E(\mathbb{F}_p)$ of order $n$ and an integer $k \in [1, n)$, compute

$$Q = [k]P = P + P + \cdots + P \quad (k \text{ times}).$$

This is the central and most time-consuming operation in ECC.
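To make the group law and scalar multiplication concrete, here is a minimal Python sketch. The toy curve $y^2 = x^3 - 3x + 7$ over $\mathbb{F}_{1019}$ and the base point $G = (2, 3)$ are assumptions chosen for illustration only (they do not come from the slides and are far too small for real use), and the plain double-and-add loop is not constant time.

```python
# Toy parameters (assumptions for illustration, not from the slides):
# curve y^2 = x^3 - 3x + 7 over F_1019, base point G = (2, 3).
P_MOD, A, B = 1019, -3, 7
INF = None          # point at infinity
G = (2, 3)

def on_curve(P):
    """Check the curve equation for an affine point (or infinity)."""
    if P is INF:
        return True
    x, y = P
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0

def add(P, Q):
    """Affine group law covering the five cases of R = P + Q."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                                          # P = -Q
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD)    # doubling slope
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD)           # addition slope
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def scalar_mul(k, P):
    """Left-to-right double-and-add; a teaching sketch, NOT constant time."""
    R = INF
    for bit in bin(k)[2:]:
        R = add(R, R)
        if bit == "1":
            R = add(R, P)
    return R
```

For instance, `scalar_mul(2, G)` applies the doubling formulas once, and `add(scalar_mul(3, G), scalar_mul(4, G))` agrees with `scalar_mul(7, G)` because the points form a group.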

ECC arithmetic layers
• SCALAR ARITHMETIC: scalar multiplication operations, e.g., [k]P, [u]P+[v]Q, etc.
• POINT ARITHMETIC: doubling and addition of points: 2P, P+Q
• FIELD ARITHMETIC: field addition, subtraction, multiplication, squaring, inversion

TLS Handshake with ECDHE_ECDSA: an example

Client → Server: ClientHello
Server → Client: ServerHello, Certificate, ServerKeyExchange, CertificateRequest*, ServerHelloDone
    Server computes: ephemeral public key $B = b * G$; Sign($B$, …): $r * H$
Client → Server: Certificate*, ClientKeyExchange, CertificateVerify*, [ChangeCipherSpec], Finished
    Client computes: Verify($B$, …): $u * P + v * Q$; ephemeral public key $A = a * G$; premaster secret $M = a * B$ ($\equiv a * b * G$)
Server → Client: [ChangeCipherSpec], Finished
    Server computes: premaster secret $M = b * A$ ($\equiv a * b * G$)
Client ↔ Server: ApplicationData
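The key-agreement part of the handshake can be sketched in a few lines of Python. This is a toy illustration only: the curve $y^2 = x^3 - 3x + 7$ over $\mathbb{F}_{1019}$, the base point $G = (2, 3)$ and the helper names are assumptions, the signature steps (Sign/Verify) are omitted, and none of this reflects how a TLS stack is actually implemented.

```python
import secrets

P_MOD, CURVE_A = 1019, -3   # toy curve y^2 = x^3 - 3x + 7 over F_1019 (assumed)
G = (2, 3)                  # assumed base point; its order is not checked here

def add(P, Q):
    """Affine addition; None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + CURVE_A) * pow(2 * y1, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD)
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def mul(k, P):
    """Double-and-add scalar multiplication (not constant time)."""
    R = None
    for bit in bin(k)[2:]:
        R = add(R, R)
        if bit == "1":
            R = add(R, P)
    return R

def ephemeral_keypair():
    """Fresh secret scalar and the matching public point (fixed-base mul)."""
    sk = 1 + secrets.randbelow(1000)
    return sk, mul(sk, G)

a, A_pub = ephemeral_keypair()   # client: A = a * G
b, B_pub = ephemeral_keypair()   # server: B = b * G
M_client = mul(a, B_pub)         # variable-base mul: a * B
M_server = mul(b, A_pub)         # variable-base mul: b * A
```

Both sides end up with the same premaster secret because $a * (b * G) = b * (a * G)$. Each side performs one fixed-base and one variable-base scalar multiplication.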

(Standard) NIST elliptic curves
NIST-recommended prime elliptic curves (part of NSA’s Suite B) cover three security levels (matching AES):
• 128-bit: P-256 ($p = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$)
• 192-bit: P-384 ($p = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$)
• 256-bit: P-521 ($p = 2^{521} - 1$)
These curves have short Weierstrass form (with $a = -3$ for efficiency) and cofactor $h = 1$.

June 2013 – the Snowden leaks “… the NSA had written the [crypto] standard and could break it.”

NIST’s Curve P-256: one-in-a-million?
Base field prime: $p = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$
Elliptic curve: $E/\mathbb{F}_p : y^2 = x^3 - 3x + b$
Curve constant: $b = \sqrt{-27/\mathrm{SHA1}(s)}$
Seed: $s$ = c49d360886e704936a6678e1139d26b7819f7e90

Scott ‘99: “Consider now the possibility that one in a million of all curves have an exploitable structure that "they" know about, but we don't.. Then "they" simply generate a million random seeds until they find one that generates one of "their" curves…”

Post-Snowden responses • Bruce Schneier: “I no longer trust the constants. I believe the NSA has manipulated them…”

• TLS Working Group: Formal request to CFRG for new elliptic curves for usage in TLS (and other Internet standards)
• NIST: Plans to host a workshop to discuss new elliptic curves are announced: http://crypto.2014.rump.cr.yp.to/487f98c1a1a031283925d7affdbdef1c.pdf

Our motivations
1. Curves that regain confidence and acceptance from the public
   - rigid generation / “nothing up my sleeves”
2. Improved performance and security for standard ECC algorithms and protocols
   - new curve models
   - faster finite fields
   - side-channel resistance
Industry moving to Perfect Forward Secrecy (PFS) modes (e.g., ECDHE)

Our ECC research
Comprehensive analysis:
• Curve forms and their arithmetic
• Prime forms
• Constant-time and exception-free implementation
• Performance in protocols (ECDHE and ECDSA)

Curve forms and their arithmetic

Weierstrass curves: $y^2 = x^3 + ax + b$
• Most general form
• Prime order possible
• Exceptions in group law
• NIST and Brainpool curves

Montgomery curves: $By^2 = x^3 + Ax^2 + x$
• Subset of curves
• Not prime order
• Montgomery ladder

Twisted Edwards curves: $ax^2 + y^2 = 1 + dx^2y^2$
• Subset of curves
• Not prime order
• Fastest arithmetic
• Some have complete group law

Prime forms
Assume $s \in \{128, 192, 256\}$. Two candidates:

Pseudo-Mersenne primes: $p = 2^m - c$, with small $c$
Two options: $m \in \{2s, 2s - 1\}$

Montgomery-friendly primes: $p = 2^a(2^b - c) - 1$
Two options: $a = 8t$ and $b \in \{2s - a, 2s - 2 - a\}$, where $t$ and $c$ are the smallest positive integers s.t. $p$ is prime

• Deterministically given by smallest $c$ such that $p \equiv 3 \bmod 4$.
• Support efficient arithmetic on a wide range of devices: 8-, 32-, 64-bit (and beyond).
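The deterministic search for a pseudo-Mersenne prime can be sketched as follows. This is an illustrative Python sketch, not the authors' tooling: the function names are hypothetical, and primality is checked with a probabilistic Miller-Rabin test (with a fixed seed so the run is reproducible).

```python
import random

def is_probable_prime(n, rounds=40):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for q in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % q == 0:
            return n == q
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    rng = random.Random(0)              # fixed seed: reproducible sketch
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                # a witnesses that n is composite
    return True

def smallest_c(m):
    """Smallest c > 0 such that p = 2^m - c is prime and p = 3 (mod 4)."""
    c = 1                               # for m >= 2, p = 3 (mod 4) forces c = 1 (mod 4)
    while not is_probable_prime((1 << m) - c):
        c += 4
    return c
```

Calling `smallest_c(2 * s)` for $s \in \{128, 192, 256\}$ is expected to reproduce the constants of the NUMS primes $2^{256} - 189$, $2^{384} - 317$ and $2^{512} - 569$. The point of the rule is that nobody gets to choose $c$.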

Constant-time and exception-free
• Regular algorithms for scalar multiplication: resilience against simple side-channel attacks
• No secret-data conditional branches, no secret memory accesses: resilience against timing and cache attacks
• No algorithmic failures during scalar multiplications: resilience against exceptional point attacks

E.g., fixed-window method for variable-base scalar multiplication ($w = 4$):
Signed binary digits: [ …, 0, -1, 0, 1, 1, 1, 0, -1, -1, 0, 1, 1, … ]
Window digits:        [ …, -3, 11, -5, … ]
Operations:           … 4 DBLs, ADD(-3P), 4 DBLs, ADD(11P), 4 DBLs, ADD(-5P), …
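One known way to obtain such a regular digit sequence is a signed, odd-digit fixed-window recoding. The Python sketch below is an assumption about how this can be done (the slides do not give the recoding itself), and the function name is hypothetical. It turns an odd scalar into exactly $t$ odd digits with $|d| < 2^w$, so every window costs $w$ doublings plus one addition regardless of the scalar.

```python
def regular_recode(k, w, t):
    """Recode odd k, 0 < k < 2^(w*t), into t odd signed digits (least significant first)."""
    if k % 2 == 0 or not (0 < k < (1 << (w * t))):
        raise ValueError("k must be odd and satisfy 0 < k < 2^(w*t)")
    digits = []
    for _ in range(t - 1):
        d = (k % (1 << (w + 1))) - (1 << w)   # odd digit in [-(2^w - 1), 2^w - 1]
        digits.append(d)
        k = (k - d) >> w                      # quotient stays odd and positive
    digits.append(k)                          # top digit, odd and below 2^w
    return digits

def recompose(digits, w):
    """Inverse map: sum of d_i * 2^(w*i)."""
    return sum(d * (1 << (w * i)) for i, d in enumerate(digits))
```

With $w = 4$ the digits are odd values between $-15$ and $15$, which is consistent with the digits $-3$, $11$, $-5$ in the example above. Python integers are not constant time, so this only illustrates the digit structure. Even scalars need a separate adjustment that is not shown here.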

Constant-time and exception-free
E.g., point extraction through linear passes over precomputed tables (assume that input $x$ contains the secret index):

    Q = table[0]
    for i = 1 to (2^(w-1) - 1)
        x = x - 1
        mask = ((x OR -x) >> (bitlength - 1)) - 1
        Q = (mask AND (Q XOR table[i])) XOR Q
    end for

Precomputed table (four entries shown): table[0] = P, table[1] = 3P, table[2] = 5P, table[3] = 7P
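The masked table scan can be sketched in Python by emulating unsigned 64-bit words. This is a sketch under stated assumptions: the word size, the function name and the representation of table entries as tuples of integers are illustrative choices, and Python itself gives no constant-time guarantees. It only shows why every entry is read and why no branch depends on the secret index.

```python
BITLEN = 64
WORD = (1 << BITLEN) - 1          # emulate unsigned 64-bit arithmetic

def ct_select(table, x):
    """Return table[x] while reading every entry; entries are tuples of words."""
    q = list(table[0])
    for i in range(1, len(table)):
        x = (x - 1) & WORD
        # (x | -x) has its top bit set exactly when x != 0, so the shift gives
        # 1 for x != 0 and 0 for x == 0; subtracting 1 yields 0 or all ones.
        mask = (((x | (-x & WORD)) >> (BITLEN - 1)) - 1) & WORD
        # Copy table[i] into q only when mask is all ones (x reached zero).
        q = [(mask & (qj ^ tj)) ^ qj for qj, tj in zip(q, table[i])]
    return tuple(q)
```

For example, with `table = [(1, 10), (3, 30), (5, 50), (7, 70)]`, `ct_select(table, 2)` walks all four entries and keeps the third one.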

Constant-time and exception-free
Solutions in the paper:
- Proved that scalar multiplications work for any possible pair {k, P} using the fastest, dedicated formulas (whenever possible)
- Clear small torsion in twisted Edwards curves (applying point doublings before execution)
- New “pseudo-complete” addition formulas for Weierstrass curves

The problem of Weierstrass completeness
• Bosma-Lenstra specialized to $y^2 = x^3 + ax + b$ in homogeneous space: the sum $(X_1 : Y_1 : Z_1) + (X_2 : Y_2 : Z_2)$ will be at least one of $(X_3 : Y_3 : Z_3)$ or $(X_3' : Y_3' : Z_3')$.
• For our $a = -3$ Weierstrass curves, after a reasonable level of optimization the above gave $22M + 4M_b$ (compared to $\approx 14M$ for dedicated projective additions)
• Roughly, this is expected to be twice as slow
There’s got to be a better way…

Weierstrass “pseudo-completeness”
We give a “pseudo-complete” addition for general Weierstrass curves that works for any possible inputs when computing $P + Q$.
Approach:
• Find “common” computations between addition and doubling formulas
• Create a mask after evaluating if the computation is addition or doubling
• Mask inputs to “common” computations using the mask
• A table lookup is used to select other possible results ($\infty$, $P$ and $Q$)

Weierstrass “pseudo-completeness”
Example: computation of the Y-coordinate

For doubling $2P$:
$$Y_{2P} = (3X_P^2 - 3Z_P^4)(X_P Y_P^2 - X_{2P}) - Y_P^4$$

For addition $P + Q$:
$$Y_{P+Q} = (Z_P^3 Y_Q - Z_Q^3 Y_P)\left(X_P (Z_P^2 X_Q - Z_Q^2 X_P)^2 - X_{P+Q}\right) - Y_P (Z_P^2 X_Q - Z_Q^2 X_P)^3$$

Both are instances of the common expression
$$Y = \alpha (X_P \beta^2 - X) - Y_P \beta^3$$

If (DBL): $\alpha = 3X_P^2 - 3Z_P^4$, $\beta = Y_P$, $X = X_{2P}$
else: $\alpha = Z_P^3 Y_Q - Z_Q^3 Y_P$, $\beta = Z_P^2 X_Q - Z_Q^2 X_P$, $X = X_{P+Q}$

One only needs to choose the right values using a predefined mask for computing the fully common expression.

• Jac+aff (original, dedicated) = 8M+3S
• Jac+aff (complete-masking) = 8M+3S+$\epsilon$ ($\epsilon \approx 20\%$)
~10% overhead in the total cost of scalar multiplication
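The masking idea can be illustrated with a small Python sketch for Jacobian + affine inputs on an $a = -3$ curve. It is a sketch under stated assumptions, not the formulas from the paper: the toy prime 1019, the helper names and the use of the standard Jacobian formulas (with their constant factors, so $\beta = 2Y_P$ for doubling) are choices made here, and the cases $\infty$ and $P = -Q$ are left to the table lookup mentioned above.

```python
P_MOD = 1019   # toy prime (assumption); example curve y^2 = x^3 - 3x + 7

def select(mask, u, v):
    """Return u if mask == 1 else v, without branching on mask."""
    return (mask * u + (1 - mask) * v) % P_MOD

def add_or_double(P, Q):
    """P = (X, Y, Z) Jacobian, Q = (x, y) affine; requires P, Q != infinity, P != -Q."""
    X, Y, Z = P
    xq, yq = Q
    z2 = Z * Z % P_MOD
    beta_add = (xq * z2 - X) % P_MOD                 # Z_P^2 X_Q - X_P
    alpha_add = (yq * z2 * Z - Y) % P_MOD            # Z_P^3 Y_Q - Y_P
    alpha_dbl = 3 * (X * X - z2 * z2) % P_MOD        # 3 X_P^2 - 3 Z_P^4
    beta_dbl = 2 * Y % P_MOD
    # Mask is 1 when both differences vanish, i.e. P and Q are the same point.
    # Real code would derive it with constant-time word operations.
    is_dbl = int(beta_add == 0 and alpha_add == 0)
    alpha = select(is_dbl, alpha_dbl, alpha_add)
    beta = select(is_dbl, beta_dbl, beta_add)
    b2 = beta * beta % P_MOD
    b3 = b2 * beta % P_MOD
    x3 = (alpha * alpha - (1 - is_dbl) * b3 - 2 * X * b2) % P_MOD
    y3 = (alpha * (X * b2 - x3) - Y * b3) % P_MOD    # the shared Y expression
    z3 = Z * beta % P_MOD
    return (x3, y3, z3)

def to_affine(P):
    """Map Jacobian (X, Y, Z) to affine (X/Z^2, Y/Z^3)."""
    X, Y, Z = P
    zi = pow(Z, -1, P_MOD)
    return (X * zi * zi % P_MOD, Y * zi * zi * zi % P_MOD)
```

As a usage example, with the assumed point $G = (2, 3)$, `add_or_double((2, 3, 1), (2, 3))` takes the doubling values of $\alpha$ and $\beta$, while feeding the result back in together with $G$ takes the addition values. Both go through the same final expressions for the Y and Z coordinates.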

Curve selection
Requirements (in order of importance):
1. Security and rigidity: simple/deterministic/unquestionable curve generation, ECDLP bit-security, conservative curve features, side-channel resiliency
2. Compatibility: CPU alignment and bitlength compatibility, design consistency across security levels, cofactor issues in existing protocols
3. Performance
Principle: never trade a small gain in one requirement for a (small) loss in a requirement ranked above it in the list.

Curve selection
• Consistency through all the security levels
• Maximal ECDLP security
• Standard prime bitlengths: favors compatibility with existing {NIST, Brainpool} implementations
• Little room for manipulation
Weierstrass curve
• Backwards compatible with existing {NIST, Brainpool} implementations
Twisted Edwards curve
• Fastest curve arithmetic
• No Montgomery form necessary: no conversions required

[Figure: curve selection overview. Protocol layer: constant-time, exception-free algorithms to do crypto. Security levels: 128-bit, 192-bit, 256-bit. Curve forms: Weierstrass curves, twisted Edwards curves, Montgomery curves. Fields: pseudo-Mersenne primes with maximal bitlength, $2^{256} - c$, $2^{384} - c$, $2^{512} - c$.]

“Nothing-Up-My-Sleeve” curve generation

Given the short Weierstrass curve $E_b : y^2 = x^3 - 3x + b$ with quadratic twist $E'_b : y^2 = x^3 - 3x - b$:
1. Pick a bit-security $s$ from the standard set $\{128, 192, 256\}$
2. Fix $p = 2^{2s} - c$, where $c$ is the smallest positive integer s.t. $p$ is prime and $p \equiv 3 \bmod 4$ (standard choice)
3. Find smallest $|b|$ s.t. $\#E_b$ and $\#E'_b$ are prime. Choose $\pm b$ based on smallest order

Given the twisted Edwards curve $E_d : -x^2 + y^2 = 1 + dx^2y^2$ with quadratic twist $E'_d : -x^2 + y^2 = 1 + (1/d)x^2y^2$, birationally equivalent to the Montgomery curve $E_A : y^2 = x^3 + Ax^2 + x$:
1. and 2. As above
3. Find smallest $d > 0$ such that $\#E_d$ and $\#E'_d$ are $4 \times$ prime, and $t > 0$

Note: a search that minimizes $d$ also minimizes the Montgomery constant $(A + 2)/4$.
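Step 3 of the Weierstrass procedure can be imitated at toy scale. The Python sketch below is an illustration under stated assumptions: it uses a small prime such as 1019 (which is $\equiv 3 \bmod 4$), counts points naively with Legendre symbols instead of a real point-counting algorithm, and reads "choose $\pm b$ based on smallest order" as picking the sign whose curve has the smaller order. The function names are hypothetical.

```python
def legendre(v, p):
    """Legendre symbol of v modulo an odd prime p: 1, -1 or 0."""
    s = pow(v % p, (p - 1) // 2, p)
    return -1 if s == p - 1 else s

def curve_order(b, p):
    """#E_b(F_p) for y^2 = x^3 - 3x + b by naive counting (toy sizes only)."""
    return p + 1 + sum(legendre(x * x * x - 3 * x + b, p) for x in range(p))

def is_prime(n):
    """Trial division, adequate for the tiny orders in this sketch."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def search_b(p, limit=200):
    """Smallest |b| with #E_b and #E'_b both prime; assumes p = 3 (mod 4)."""
    for b in range(1, limit):
        if (27 * b * b - 108) % p == 0:      # 4a^3 + 27b^2 = 0 with a = -3: singular
            continue
        n = curve_order(b, p)
        n_twist = 2 * p + 2 - n              # order of E_{-b}, the quadratic twist
        if is_prime(n) and is_prime(n_twist):
            # Assumed reading of "smallest order": keep the sign with the smaller order.
            return (b, n, n_twist) if n < n_twist else (-b, n_twist, n)
    return None
```

Calling `search_b(1019)` returns a triple (chosen $b$, its order, twist order) or `None` if nothing is found below the limit. At cryptographic sizes the same deterministic loop is run with an efficient point-counting algorithm, which is what leaves so little room for manipulation.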

The NUMS curves

Security s   Prime p           Weierstrass b   Twisted Edwards d   Montgomery A
128          $2^{256} - 189$   152961          15342               −61370
192          $2^{384} - 317$   −34568          333194              −1332778
256          $2^{512} - 569$   121243          637608              −2550434


Other candidates (in the CFRG curve selection):
- Curve25519: Montgomery curve $y^2 = x^3 + Ax^2 + x$ defined over GF($2^{255} - 19$)
- Ed448-Goldilocks: Edwards curve $x^2 + y^2 = 1 - 39081x^2y^2$ defined over GF($2^{448} - 2^{224} - 1$)
- E-521: Edwards curve $x^2 + y^2 = 1 - 376014x^2y^2$ defined over GF($2^{521} - 1$)

MSR ECCLib: a library for the NUMS curves
Main features:
• Wide variety of scalar multiplications, multiple security levels, different curve forms
• Regular algorithms for scalar multiplication: basic protection against simple side-channel attacks
• No secret-data conditional branches, no secret memory accesses: resilience against timing and cache attacks
• No algorithmic failures during scalar multiplications: resilience against exceptional point attacks

MSR ECCLib: overview
[Figure: library layers. Protocol API (ECDH, ECDSA, etc.) and ECC API (point formulas, scalar mul, etc.), each provided for tEd and W curves; base field API; complementary API (e.g., modular operations, etc.). Base field implementations for 256, 384 and 512 bits: fully portable C; arch-based C (C + intrinsics) for 32-bit and 64-bit; asm for x64, ARM32 and ARM64.]
tEd: twisted Edwards
W: Weierstrass

MSR ECCLib: a library for the NUMS curves
• Design prioritizes security, maintainability and scalability
• Modularized design keeps separate APIs for field, curve and protocol layers
  (Among many other benefits) this helps to reduce the possibility of implementation flaws
Cons: Expected (relatively minor) loss in efficiency

Comparison of x64 assembly implementations
Running on a 3.4GHz Intel Core i7 (Sandy Bridge) processor:
• MSR ECCLib computes one 256-bit scalar multiplication in 216,000 cycles or 0.000064 seconds
• (Full x64 assembly, standalone, one security level, one curve) Curve25519-based amd64-51 computes one scalar multiplication in 194,000 cycles or 0.000057 seconds
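The seconds figures follow directly from dividing the cycle counts by the clock rate. A minimal Python check of that arithmetic (the function name is just for illustration):

```python
def cycles_to_seconds(cycles, clock_hz):
    """Convert a cycle count into seconds at a fixed clock rate."""
    return cycles / clock_hz

CLOCK_HZ = 3.4e9   # 3.4GHz, as stated above

for name, cycles in [("MSR ECCLib", 216_000), ("amd64-51", 194_000)]:
    print(f"{name}: {cycles_to_seconds(cycles, CLOCK_HZ):.6f} s")
```

Rounded to six decimal places this gives the two timings quoted above, i.e. roughly 64 and 57 microseconds per scalar multiplication.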

Security level   Prime                 Curve             Variable-base   Fixed-base   ECDHE
128              $p = 2^{256} - 189$   Weierstrass       270             107          377
                                       twisted Edwards   216             82           298
192              $p = 2^{384} - 317$   Weierstrass       714             252          966
                                       twisted Edwards   588             201          789
256              $p = 2^{512} - 569$   Weierstrass       1,504           488          1,992
                                       twisted Edwards   1,242           391          1,633

Table: results in 10^3 cycles. Platform: Intel Core i7, Sandy Bridge architecture (@3.4GHz)

@128-bit security level
• Fastest reported NIST P-256 (Gueron & Krasnov ‘13): ≈ 490k cycles for ECDHE using a 150KB table (timing-attack resistant?)
• Compare Curve25519 ≈ 398k cycles to twisted Edwards ≈ 298k
• (Not currently implemented) Compare Curve25519+Ed25519 ≈ 263k cycles using a 30KB table to twisted Edwards ≈ 298k using a 9KB table


@mid sec-level (192 - 224) • Compare Goldilocks ≈ 664𝑘 cycles variable-base to twisted Edwards ≈ 588𝑘


@256-bit security level
• Compare E-521 ≈ 1,120k cycles variable-base to twisted Edwards ≈ 1,242k
• MSR ECCLib uses basic Comba multiplication
• There is room for optimization in the NUMS high-security curves.

Relevant links
• Report: http://eprint.iacr.org/2014/130.pdf
• MSR ECC Library: http://research.microsoft.com/en-us/projects/nums/
• Specification of curve selection: http://research.microsoft.com/apps/pubs/default.aspx?id=219966
• IETF Internet Draft (authored with Benjamin Black): http://tools.ietf.org/html/draft-black-numscurves-02

NUMS Elliptic Curves and their Implementation Q&A Patrick Longa Microsoft Research http://research.microsoft.com/en-us/people/plonga/