Efficient implementation of elliptic curves on sensor nodes
Diego F. Aranha, Julio López, Leonardo Oliveira, Ricardo Dahab
Institute of Computing - UNICAMP. Supported by FAPESP, Grant No. 2007/06950-0.
Aranha, López, Oliveira, Dahab
Efficient implementation of ECC on sensor nodes
Wireless Sensor Networks
A wireless sensor network (WSN) is an ad hoc network composed of sensing devices employed for cooperative monitoring tasks.
[Figure: network diagram with the labels Sensor Node, Gateway, Sensor Node.]
The problem
Challenge: Since the nodes must be cheap and disposable, protecting the communication between resource-constrained nodes is hard.
Contributions:
- Efficient implementation of arithmetic in $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$;
- Efficient implementation of elliptic curve cryptography.
The platform
MICAz Mote:
- ATMega128 processor, 7.3728 MHz of clock frequency;
- 4 KB of RAM memory, 128 KB of ROM memory;
- Simple two-stage pipeline;
- Limited shift instructions;
- High cost of memory instructions (addressing, reads, writes).
Programmer's arsenal
ATMega128 is a typical RISC processor:
- 32 registers, but 6 of them are special for pointers;
- 1 register for memory/arithmetic temporary values;
- Only 32 - 6 - 1 = 25 useful registers.
Relevant instructions:

Instruction   Description                    Cost
lsr, lsl      Right/left shift by 1 bit      1 cycle
swap          Swap high and low nibbles      1 cycle
bld, bst      Bit load/store from/to flag    1 cycle
eor           Bitwise exclusive OR           1 cycle
ld, st        Memory load/store              2 cycles
adiw, sbiw    Pointer arithmetic             2 cycles
Related work
Table: Published timings for a point multiplication in a MICAz Mote for the 160-bit security level.

Finite field   Work                          Execution time (s)
Binary         [Malan et al. 2004]           34
Binary         [Yan and Shi 2006]            13.9
Binary         [Eberle et al. 2005]          4.14
Binary         [Szczechowiak et al. 2008]    2.16
Binary         [Seo et al. 2008]             1.14
Binary         [Kargl et al. 2008]           0.83
Prime          [Wang and Li 2006]            1.35
Prime          [Szczechowiak et al. 2008]    1.27
Prime          [Gura et al. 2004]            0.87
Prime          [Uhsadel et al. 2007]         0.76
Prime          [Großschädl 2006]             0.745
According to previous works:
- Binary fields are insufficiently supported;
- Binary curves would lead to lower performance;
- Architectural extensions are heavily needed.
Binary elliptic curves
A binary elliptic curve is the set of solutions $(x, y) \in \mathbb{F}_{2^m} \times \mathbb{F}_{2^m}$ satisfying the equation $y^2 + xy = x^3 + ax^2 + b$, where $a, b \in \mathbb{F}_{2^m}$ with $b \neq 0$, together with a point at infinity $\infty$. When $a \in \{0, 1\}$ and $b = 1$, the curve is called a Koblitz curve.
Elliptic curves
The set of points $\{(x, y) \in E(\mathbb{F}_{2^m})\} \cup \{\infty\}$ under the addition operation $+$ (chord-and-tangent rule) forms an additive group. Given an elliptic point $P$ and an integer $k$, the operation $kP$, called scalar multiplication, is defined by $kP = \underbrace{P + P + \cdots + P}_{k \text{ times}}$.
This is the fundamental operation employed by protocols based on elliptic curves.
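The definition above suggests k - 1 additions, but kP is normally computed with far fewer group operations. The sketch below uses the generic binary double-and-add method, which is not one of the point multiplication algorithms selected later in these slides. The names `scalar_mult`, `add` and `identity` are illustrative placeholders for the curve's group law and the point at infinity.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def scalar_mult(k: int, p: T, add: Callable[[T, T], T], identity: T) -> T:
    """Compute kP = P + ... + P (k times) by left-to-right double-and-add."""
    if k < 0:
        raise ValueError("k must be non-negative in this sketch")
    q = identity
    for bit in bin(k)[2:]:      # scan the bits of k, most significant first
        q = add(q, q)           # doubling step
        if bit == "1":
            q = add(q, p)       # addition step
    return q


# Toy check: the integers under addition stand in for the curve group.
print(scalar_mult(13, 5, lambda a, b: a + b, 0))
```

Any associative operation with an identity can be plugged in, so the same loop applies once a real point addition is available.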
Binary field $\mathbb{F}_{2^m}$
Irreducible polynomial: $f(z)$ (trinomial or pentanomial).
Polynomial basis: $a(z) \in \mathbb{F}_{2^m}$, $a(z) = \sum_{i=0}^{m-1} a_i z^i$.
Software representation: vector of $n = \lceil m/8 \rceil$ bytes.
[Figure: graphical representation.]
Addition in $\mathbb{F}_{2^m}$
$c(z) = a(z) + b(z) = \sum_{i=0}^{n-1} (A_i \oplus B_i)\, z^{8i}$, where $A_i$ and $B_i$ are the $i$-th bytes of $a$ and $b$.
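A minimal sketch of addition on the byte-vector representation, assuming both operands are stored with the same number of bytes. The names `N_BYTES` and `gf2m_add` are illustrative and do not come from the slides.

```python
M = 163
N_BYTES = (M + 7) // 8   # n = ceil(m / 8) bytes per field element


def gf2m_add(a: bytes, b: bytes) -> bytes:
    """Add two field elements: a byte-wise XOR, with no carries and no reduction."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of bytes")
    return bytes(x ^ y for x, y in zip(a, b))


# Adding an element to itself gives zero, since the field has characteristic 2.
a = bytes(range(N_BYTES))
print(gf2m_add(a, a) == bytes(N_BYTES))
```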
Squaring in $\mathbb{F}_{2^m}$
$a(z)^2 = \sum_{i=0}^{m-1} a_i z^{2i} = a_{m-1} z^{2m-2} + \cdots + a_2 z^4 + a_1 z^2 + a_0$
Squaring is a simple expansion of the coefficients of $a$.
Example: $a(z) = z^4 + z^3 + 1 = 11001$, $a(z)^2 = z^8 + z^6 + 1 = 101000001$.
Squaring in $\mathbb{F}_{2^m}$
We can accelerate this algorithm with a lookup table. For each 4-bit $u = (u_3, u_2, u_1, u_0)$, compute $T(u) = (0, u_3, 0, u_2, 0, u_1, 0, u_0)$.
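The table-based expansion can be sketched as follows, assuming a little-endian byte vector (byte 0 holds the lowest coefficients) and leaving modular reduction out. `SQR_TABLE` and `gf2_square` are illustrative names.

```python
# T(u) spreads the 4 bits of u over one byte: (0, u3, 0, u2, 0, u1, 0, u0).
SQR_TABLE = [sum(((u >> i) & 1) << (2 * i) for i in range(4)) for u in range(16)]


def gf2_square(a: bytes) -> bytes:
    """Square a binary polynomial without reduction; each input byte yields two output bytes."""
    out = bytearray(2 * len(a))
    for i, byte in enumerate(a):
        out[2 * i] = SQR_TABLE[byte & 0x0F]   # low nibble expands into the even byte
        out[2 * i + 1] = SQR_TABLE[byte >> 4]  # high nibble expands into the odd byte
    return bytes(out)


# The slide's example: a(z) = 11001, a(z)^2 = 101000001.
print(bin(int.from_bytes(gf2_square(bytes([0b11001])), "little")))
```

The table has only 16 entries, so each output byte costs a single lookup.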
$c(z) = a(z)^2$

Modular squaring in $\mathbb{F}_{2^m}$
$c(z) = a(z)^2 \bmod f(z)$
Problem: Redundant memory accesses between squaring and modular reduction.
Our solution: Integrate squaring and modular reduction.
Integrated modular squaring in $\mathbb{F}_{2^m}$
Problem: Too many additions.
Proposed optimization for modular squaring
Our solution: Precompute sparse contributions.
Multiplication in $\mathbb{F}_{2^m}$
Two strategies: Karatsuba multiplication; López-Dahab multiplication.
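Karatsuba multiplication in $\mathbb{F}_{2^m}$
Writing $a(z) = a_1 z^{m/2} + a_0$ and $b(z) = b_1 z^{m/2} + b_0$:
$c(z) = a(z) \cdot b(z) = a_1 b_1 z^m + [(a_1 + a_0)(b_1 + b_0) + a_1 b_1 + a_0 b_0] z^{m/2} + a_0 b_0$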
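A sketch of one Karatsuba step, assuming polynomials over GF(2) are held as Python integers (bit i is the coefficient of z^i) and that the half-size products are computed by a plain shift-and-add routine. `clmul` and `karatsuba_step` are illustrative names.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less multiplication of two binary polynomials."""
    r = 0
    while a:
        if a & 1:
            r ^= b
        a >>= 1
        b <<= 1
    return r


def karatsuba_step(a: int, b: int, bits: int) -> int:
    """Multiply operands of at most `bits` bits (bits even) using three half-size products."""
    h = bits // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = clmul(a0, b0)                       # a0 * b0
    hi = clmul(a1, b1)                       # a1 * b1
    mid = clmul(a0 ^ a1, b0 ^ b1) ^ lo ^ hi  # (a1 + a0)(b1 + b0) + a1 b1 + a0 b0
    return (hi << (2 * h)) ^ (mid << h) ^ lo


print(bin(karatsuba_step(0b1001, 0b1011, 4)))
```

Addition and subtraction coincide in characteristic 2, which is why the middle term only needs XORs.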
Multiplication in $\mathbb{F}_{2^m}$
$c(z) = a(z) \cdot b(z) = (\cdots(a_{m-1} b(z) z + a_{m-2} b(z)) z + \cdots + a_1 b(z)) z + a_0 b(z)$
Example: $(z^3 + 1) \cdot (z^3 + z + 1) = 1001 \cdot 1011 = ((1011 \cdot z + 0) z + 0) z + 1011$

       1011
     × 1001
    -------
       1011
      0000
     0000
    1011
    -------
    1010011
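López-Dahab multiplication in $\mathbb{F}_{2^m}$
We can use this formula to multiply $b(z)$ by a 4-bit polynomial.
If $a(z)$ is divided into 4-bit polynomials $u_j(z)$, so that $a(z) = \sum_j u_j(z) z^{4j}$, compute $a(z) \cdot b(z)$ by $a(z) \cdot b(z) = \sum_j \left(u_j(z) \cdot b(z)\right) z^{4j}$.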
We can accelerate this method with a precomputation table. For each 4-bit $u$, compute $T(u) = u \cdot b$, giving 16 entries $T(0), T(1), \ldots, T(15)$.
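A sketch of the table-based method with 4-bit windows, assuming `a` is a little-endian byte vector and `b` is held as a Python integer. It follows the usual left-to-right comb arrangement (high nibbles first, one shift by 4, then low nibbles). It does not model the register allocation discussed in the slides, and `ld_multiply` is an illustrative name.

```python
def ld_multiply(a: bytes, b: int) -> int:
    """Unreduced product a(z) * b(z) using a 16-entry table T(u) = u * b."""
    table = [0] * 16
    for u in range(16):
        for bit in range(4):
            if (u >> bit) & 1:
                table[u] ^= b << bit

    c = 0
    for k in (1, 0):                       # high nibbles first, then low nibbles
        for j, byte in enumerate(a):
            u = (byte >> (4 * k)) & 0x0F
            c ^= table[u] << (8 * j)       # add T(u) at byte offset j
        if k:
            c <<= 4                        # move the high-nibble sum into place
    return c


# The earlier example: 1001 * 1011 = 1010011.
print(bin(ld_multiply(bytes([0b1001]), 0b1011)))
```

Processing all high nibbles before all low nibbles means the whole accumulator is shifted by 4 bits only once, which matters on a processor with 1-bit shift instructions.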
Problem: Lots of memory operations and not enough registers!
Proposed optimization for López-Dahab multiplication
Our solution: Use a rotating register window of length $n + 1$.
Problem: Available registers might be insufficient (e.g. $\mathbb{F}_{2^{233}}$, where $n = 30$ and a window of $n + 1 = 31$ registers exceeds the 25 useful registers).
Multi-step implementation of López-Dahab multiplication
Our solution: Break the series of summations into blocks.
Analysis of multiplication algorithms
Table: Costs in number of executed instructions for the multiplication of two n-byte vectors (number of instructions in terms of n words).

Algorithm      Reads                Writes               XOR
López-Dahab    4n^2 + 9n            |T| + 2n^2 + 6n      2n^2 + 13n
Proposed       2n^2 + 4n            |T| + 5n             2n^2 + 11n
Karatsuba      11n + 3M(⌈n/2⌉)      7n + 3M(⌈n/2⌉)       4n + 3M(⌈n/2⌉)
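To get a feel for these formulas, the sketch below evaluates the read and XOR counts for n = 21 and n = 30, the byte lengths of the two fields. The write counts are left out because they depend on the table size |T|, which the slide does not specify. The function names are illustrative.

```python
def ld_counts(n: int) -> dict:
    """Read and XOR counts from the López-Dahab row of the table."""
    return {"reads": 4 * n * n + 9 * n, "xor": 2 * n * n + 13 * n}


def proposed_counts(n: int) -> dict:
    """Read and XOR counts from the 'Proposed' row of the table."""
    return {"reads": 2 * n * n + 4 * n, "xor": 2 * n * n + 11 * n}


for n in (21, 30):   # n = ceil(163 / 8) and n = ceil(233 / 8)
    print(n, ld_counts(n), proposed_counts(n))
```

The XOR counts stay close while the memory reads are roughly halved.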
Table: Costs in number of cycles for multiplication in $\mathbb{F}_{2^{163}}$ and $\mathbb{F}_{2^{233}}$.

Algorithm             n = 21 (total cycles)   n = 30 (total cycles)
López-Dahab           7743                    14844
Proposed              3923                    7226
Karatsuba+LD          8379                    13530
Karatsuba+Proposed    5019                    7748
Modular reduction
Algorithm 1 Fast reduction for $f(z) = z^{163} + z^7 + z^6 + z^3 + 1$.
Input: c(z) = c[0..2n − 1].
Output: c(z) = c(z) mod f(z).
  for i ← 41 downto 21 do
    t ← c[i]
    c[i − 21] ← c[i − 21] ⊕ (t ≪ 5)
    c[i − 20] ← c[i − 20] ⊕ (t ≪ 4) ⊕ (t ≪ 3) ⊕ t ⊕ (t ≫ 3)
    c[i − 19] ← c[i − 19] ⊕ (t ≫ 4) ⊕ (t ≫ 5)
  end for
  t ← c[20] ≫ 3
  c[0] ← c[0] ⊕ (t ≪ 7) ⊕ (t ≪ 6) ⊕ (t ≪ 3) ⊕ t
  c[1] ← c[1] ⊕ (t ≫ 1) ⊕ (t ≫ 2)
  c[20] ← c[20] & 0x07
  return c
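The byte-oriented reduction can be sketched in Python as below, assuming a 42-byte little-endian input and 8-bit truncation on every left shift. The shift directions are derived from $z^{163} \equiv z^7 + z^6 + z^3 + 1$, and a slow bit-by-bit reference is included for cross-checking. Both function names are illustrative.

```python
F_163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def reduce_163(c: bytearray) -> bytearray:
    """Reduce c (42 bytes, little-endian) in place; the result is left in c[0..20]."""
    for i in range(41, 20, -1):
        t = c[i]
        c[i - 21] ^= (t << 5) & 0xFF
        c[i - 20] ^= ((t << 4) ^ (t << 3) ^ t ^ (t >> 3)) & 0xFF
        c[i - 19] ^= (t >> 4) ^ (t >> 5)
    t = c[20] >> 3                      # bits 163..167 still need folding
    c[0] ^= ((t << 7) ^ (t << 6) ^ (t << 3) ^ t) & 0xFF
    c[1] ^= (t >> 1) ^ (t >> 2)
    c[20] &= 0x07                       # keep only bits 160..162
    return c


def reduce_reference(x: int) -> int:
    """Slow reference: cancel the leading term with a shifted copy of f(z)."""
    while x.bit_length() > 163:
        x ^= F_163 << (x.bit_length() - 164)
    return x


x = (1 << 324) | (1 << 200) | 0x1234
fast = int.from_bytes(reduce_163(bytearray(x.to_bytes(42, "little")))[:21], "little")
print(fast == reduce_reference(x))
```

Bytes c[21..41] are not cleared by this routine, so only the first 21 bytes are meaningful afterwards.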
Problems: Redundant memory operations and expensive shifts!
Proposed optimization for modular reduction
Our solutions: Small register window and lookup tables.
Algorithm 4 Proposed optimization for faster modular reduction.
Input: c(z) = c[0..2n − 1], T0, T1.
Output: c(z) = c(z) mod f(z).
Note: R(r0, r1, r2, t) ≡ r0 ← r0 ⊕ T0[t], r1 ← r1 ⊕ T1[t], r2 ← t ≪ 5
  rb ← 0, rc ← 0
  i ← 21, j ← 40
  while i > 3 do
    R(rb, rc, ra, c[j]), c[i] ← c[i] ⊕ rb
    R(rc, ra, rb, c[j − 1]), c[i − 1] ← c[i − 1] ⊕ rc
    R(ra, rb, rc, c[j − 2]), c[i − 2] ← c[i − 2] ⊕ ra
    i ← i − 3, j ← j − 3
  end while
  R(rb, rc, ra, c[22]), c[3] ← c[3] ⊕ rb
  R(rc, ra, rb, c[21]), c[2] ← c[2] ⊕ rc
  c[1] ← c[1] ⊕ ra
  c[0] ← c[0] ⊕ rb
  return c
Analysis of modular reduction
Modular reduction in $\mathbb{F}_{2^{163}}$:
- Uses a rotating window of 3 registers;
- Needs two 256-byte lookup tables.
Modular reduction in $\mathbb{F}_{2^{233}}$:
- Cannot use register windows;
- Does not need lookup tables;
- Unrolling and elimination of redundant memory operations.
Table: Costs in number of executed instructions for modular reduction.

Algorithm   F_{2^163} reads   F_{2^163} writes   F_{2^233} reads   F_{2^233} writes
Original    88                66                 122               92
Proposed    43                23                 92                62
Observations
Additional technicalities:
- Lookup tables and precomputed tables are aligned at 256-byte addresses;
- Inversion implemented by the extended Euclidean algorithm with dedicated shifting functions ($I/M \approx 16$).
Elliptic curve arithmetic
We selected two fast algorithms for point multiplication:
- 4-TNAF in Koblitz curves [Solinas 2000];
- López-Dahab method in generic curves [López et al. 1999].
Elliptic curve arithmetic
Algorithm 5 w-TNAF method for point multiplication.
Input: $k \in \mathbb{Z}$, $P \in E(\mathbb{F}_{2^m})$.
Output: $kP \in E(\mathbb{F}_{2^m})$.
  Compute the representation $\mathrm{TNAF}_w(k) = \sum_{i=0}^{t-1} u_i \tau^i$
  Compute $P_u = \alpha_u P$, for $u \in \{1, 3, 5, \ldots, 2^{w-1} - 1\}$
  $Q \leftarrow \infty$
  for $i \leftarrow t - 1$ downto 0 do
    $Q \leftarrow \tau Q$
    if $u_i \neq 0$ then
      Let $u$ be such that $\alpha_u = u_i$ or $\alpha_{-u} = -u_i$
      if $u > 0$ then $Q \leftarrow Q + P_u$; else $Q \leftarrow Q - P_{-u}$
    end if
  end for
  return $Q$
Important: Point addition/subtraction costs 8 multiplications!
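Elliptic curve arithmetic
Algorithm 6 LD method for point multiplication.
Input: $k = \sum_{i=0}^{t-1} k_i 2^i \in \mathbb{Z}$, $P = (x, y) \in E(\mathbb{F}_{2^m})$, curve coefficient $b$.
Output: $kP \in E(\mathbb{F}_{2^m})$.
  $x_1 \leftarrow x$, $z_1 \leftarrow 1$, $z_2 \leftarrow x^2$, $x_2 \leftarrow z_2^2 + b$
  for $i \leftarrow t - 2$ downto 0 do
    $r_1 \leftarrow x_1 \cdot z_2$, $r_2 \leftarrow x_2 \cdot z_1$, $r_3 \leftarrow r_1 + r_2$, $r_4 \leftarrow r_1 \cdot r_2$
    if $k_i \neq 0$ then
      $z_1 \leftarrow r_3^2$, $r_1 \leftarrow x \cdot z_1$, $x_1 \leftarrow r_1 + r_4$, $r_1 \leftarrow z_2^2$, $r_2 \leftarrow x_2^2$
      $z_2 \leftarrow r_1 \cdot r_2$, $x_2 \leftarrow r_2^2$, $r_1 \leftarrow r_1^2$, $r_2 \leftarrow b \cdot r_1$, $x_2 \leftarrow x_2 + r_2$
    else
      $z_2 \leftarrow r_3^2$, $r_1 \leftarrow x \cdot z_2$, $x_2 \leftarrow r_1 + r_4$, $r_1 \leftarrow z_1^2$, $r_2 \leftarrow x_1^2$
      $z_1 \leftarrow r_1 \cdot r_2$, $x_1 \leftarrow r_2^2$, $r_1 \leftarrow r_1^2$, $r_2 \leftarrow b \cdot r_1$, $x_1 \leftarrow x_1 + r_2$
    end if
  end for
  return $Q = (x_3, y_3)$ computed from $(x_1/z_1, x_2/z_2)$
Important: Resistant to simple timing attacks.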
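As a hedged illustration of the ladder, the sketch below runs the x-only López-Dahab formulas over a toy field $\mathbb{F}_{2^4}$ with $f(z) = z^4 + z + 1$, an assumption chosen only to keep the example small (it offers no security). It uses the standard projective doubling $X \leftarrow X^4 + bZ^4$, $Z \leftarrow X^2 Z^2$ and returns only the affine x-coordinate of $kP$, skipping the recovery of $y$. All names are illustrative.

```python
M = 4                # toy field F_{2^4}; illustration only
F_POLY = 0b10011     # f(z) = z^4 + z + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply in F_{2^M} by shift-and-add, reducing by F_POLY as we go."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F_POLY
    return r


def gf_inv(a: int) -> int:
    """Invert a nonzero element as a^(2^M - 2); fine for a toy field."""
    r = 1
    for _ in range(2 ** M - 2):
        r = gf_mul(r, a)
    return r


def ld_ladder_x(k: int, x: int, b: int):
    """x-coordinate of kP for k >= 1 and x = x(P) != 0; None means the point at infinity."""
    def sq(v):
        return gf_mul(v, v)

    x1, z1 = x, 1                # P1 = P
    z2 = sq(x)
    x2 = sq(z2) ^ b              # P2 = 2P
    for i in range(k.bit_length() - 2, -1, -1):
        r1, r2 = gf_mul(x1, z2), gf_mul(x2, z1)
        r3, r4 = r1 ^ r2, gf_mul(r1, r2)
        if (k >> i) & 1:
            z1 = sq(r3)                              # P1 <- P1 + P2
            x1 = gf_mul(x, z1) ^ r4
            r1, r2 = sq(z2), sq(x2)                  # P2 <- 2 * P2
            z2 = gf_mul(r1, r2)
            x2 = sq(r2) ^ gf_mul(b, sq(r1))
        else:
            z2 = sq(r3)                              # P2 <- P1 + P2
            x2 = gf_mul(x, z2) ^ r4
            r1, r2 = sq(z1), sq(x1)                  # P1 <- 2 * P1
            z1 = gf_mul(r1, r2)
            x1 = sq(r2) ^ gf_mul(b, sq(r1))
    return None if z1 == 0 else gf_mul(x1, gf_inv(z1))
```

Both branches perform the same sequence of field operations, which is the property behind the timing-attack resistance noted above. The Python sketch itself makes no constant-time claims.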
Implementation
Material:
- GCC 4.1.2 for ATMega128;
- Software library implemented from scratch;
- AVR Simulator 4.14.
Programming languages: C; Assembly.
Curve parameters:
- Koblitz curves NIST-K163 and NIST-K233;
- Binary curves NIST-B163 and NIST-B233.
Implementation results
Table: Timings for arithmetic algorithms in $\mathbb{F}_{2^{163}}$.

Algorithm                      C language (cycles)   Assembly (cycles)
Squaring                       629                   430
Modular squaring               1154                  570
LD mult. with registers        9738*                 4508
Karatsuba+LD with registers    12246                 6968
Modular reduction              606                   430
Inversion                      243790                81365

Observations:
- Reduction by pentanomial costs the same as reduction by trinomial in [Kargl et al. 2008];
- (*) This timing is for a new variant of the algorithm.
Implementation results
Table: Timings for arithmetic algorithms in $\mathbb{F}_{2^{233}}$.

Algorithm                                C language (cycles)   Assembly (cycles)
Squaring                                 908                   463
Modular squaring                         1340                  956
LD mult. with registers (multi-step)     18028*                8314
Karatsuba+LD with registers              25850                 9261
Modular reduction                        911                   620
Inversion                                473618                142986
Implementation results
Table: Timings for point multiplication.

Algorithm                      C language, time (s)   Assembly, time (s)
4-TNAF on curve NIST-K163      0.67                   0.32
LD on curve NIST-K163          1.30                   0.62
LD on curve NIST-B163          1.55                   0.74
4-TNAF on curve NIST-K233      1.48                   0.73
LD on curve NIST-K233          3.25                   1.57
LD on curve NIST-B233          3.90                   1.89
Comparison - Execution time
Implementation in C language at the 160-bit security level: Improvement of 41% over previous fastest implementation;
Implementation in Assembly at the 160-bit security level: Improvement of 61% over previous fastest implementation; Improvement of 25% over previous fastest implementation while satisfying resistance to simple timing attacks.
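As a rough consistency check, the quoted improvements can be reproduced from the point multiplication timings on NIST-K163 (0.67 s in C, 0.32 s and 0.62 s in Assembly). This assumes the baselines are the fastest earlier software timings on binary curves in the related-work table: 1.14 s for the C comparison and 0.83 s for the Assembly comparisons. That pairing is an inference and is not stated on the slide.

```latex
\begin{align*}
\frac{1.14 - 0.67}{1.14} &\approx 0.41, &
\frac{0.83 - 0.32}{0.83} &\approx 0.61, &
\frac{0.83 - 0.62}{0.83} &\approx 0.25.
\end{align*}
```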
Comparison - Storage
Table: Cost in bytes of memory for implementations of point multiplication on the 160-bit security level.

Implementation                ROM memory   RAM memory
TinyECCK (C-only)             5.6 KB       0.6 KB
Kargl et al. (C+ASM)          11 KB        –
4-TNAF method (C version)     20 KB        1 KB
4-TNAF method (C+ASM)         24 KB        1.6 KB
LD method (C version)         12 KB        1 KB
LD method (C+ASM)             16 KB        1.6 KB
Conclusions
New state-of-the-art implementation of ECC on the MICAz Mote sensor node.
Efficient implementation of binary field arithmetic:
- Most efficient implementation of squaring, multiplication, modular reduction and inversion for this platform (improvements ranging from 11% to 68%);
- Binary fields can be efficient on wireless sensors;
- Optimizations can be applied to similar platforms.
Efficient implementation of elliptic curve cryptography:
- Point multiplication under 1/3 second at the 163-bit security level and under 3/4 second at the 233-bit level;
- Binary curves are suitable for ECC on WSNs.
Curve             Work                                 Execution time (s)
Binary, Generic   [Malan et al. 2004]                  34
Binary, Generic   [Yan and Shi 2006]                   13.9
Binary, Generic   [Eberle et al. 2005]                 4.14
Binary, Generic   [Eberle et al. 2005] (extensions)    0.50
Binary, Generic   [Kargl et al. 2008]                  0.83
Binary, Generic   Proposed (timing attacks)            0.74
Binary, Koblitz   [Szczechowiak et al. 2008]           2.16
Binary, Koblitz   [Seo et al. 2008]                    1.14
Binary, Koblitz   Proposed                             0.32
Binary, Koblitz   Proposed (timing attacks)            0.62
Binary, Koblitz   Proposed (233-bit security)          0.73
Prime             [Wang and Li 2006]                   1.35
Prime             [Szczechowiak et al. 2008]           1.27
Prime             [Gura et al. 2004]                   0.87
Prime             [Uhsadel et al. 2007]                0.76
Prime             [Großschädl 2006]                    0.745
Questions?