Low Latency Elliptic Curve Cryptography Accelerators for NIST Curves over Binary Fields

Chang Shu, Kris Gaj
ECE Department, George Mason University, 4400 University Drive, Fairfax, VA 22030-4444, USA
{cshu, kgaj}@gmu.edu

Tarek El-Ghazawi
ECE Department, The George Washington University, 801 22nd Street NW, Washington DC, 20052 USA
[email protected]

Abstract
We designed hardware accelerators, targeting the Xilinx XCV2000E FPGA, to speed up scalar multiplication on the elliptic curves recommended by NIST over $GF(2^{163})$ and $GF(2^{233})$ in polynomial basis representation. Linear-feedback shift registers (LFSRs) are exploited in the most-significant-digit-serial (MSD) multipliers in order to improve design efficiency. We adopt the scalar multiplication algorithm devised by López and Dahab [4]. We demonstrate how this algorithm can be implemented using multiple multipliers working in parallel, and we select the optimal parameters for these multipliers. The accelerators run around 3 times faster than the best hardware implementation reported previously, by Gura et al. [1] at CHES 2002, when ported to the same device, the Xilinx Virtex XCV2000E.

1. Introduction
Over the last 20 years, Elliptic Curve Cryptography (ECC) has evolved from a mere curiosity into a mature and secure family of public key cryptosystems used in practical applications. Several implementations of ECC over $GF(2^n)$ have been developed and reported in the literature [1, 5]. Most of these ECC accelerators are composed of an arithmetic unit (AU) and a controller. In such an architecture, a single field multiplier in the AU performs all field multiplications. In our approach, all ECC operations are implemented as independent arithmetic units, with no resource sharing. As a result, multiple $GF(2^n)$ multipliers can work in parallel, and a substantial improvement in speed can be achieved. Additionally, the complicated data path containing many levels of bit-wide multiplexers has been simplified in order to reduce the minimum clock period. The design has been ported to the same device as Gura et al.'s design, namely the Xilinx XCV2000E-FG680-7, and performance comparisons are presented.
2. López-Dahab algorithm
Let $P$, $P_1$, and $P_2$ be points on the curve $E$ such that $P_2 = P_1 + P$, and let $x$ denote the affine x-coordinate of $P$. Let the affine x-coordinate of $P_i$ be represented by $X_i/Z_i$, for $i \in \{1, 2\}$. The projective X- and Z-coordinates¹ of $2P_i$ and $P_3 = P_1 + P_2$ can be computed as follows:

$$
\begin{aligned}
X(2P_i) &= X_i^4 + b \cdot Z_i^4 \\
Z(2P_i) &= Z_i^2 \cdot X_i^2 \\
X(P_1 + P_2) &= x \cdot Z_3 + (X_1 \cdot Z_2) \cdot (X_2 \cdot Z_1) \\
Z(P_1 + P_2) &= (X_1 \cdot Z_2 + X_2 \cdot Z_1)^2
\end{aligned}
\tag{1}
$$

where $Z_3 = Z(P_1 + P_2)$ and $b$ is the constant coefficient of the curve equation.

Due to the page limit, we list only the formulae for point addition and point doubling. More details of the algorithm can be found in Reference [4]. According to Equation (1), point addition and point doubling are independent of each other and can be performed in parallel. The coordinate transformation back to affine coordinates needs to be performed only in the last step.
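As an illustration of how Equation (1) drives a scalar multiplication, the following Python sketch runs the x-only ladder in software. It is a minimal model under stated assumptions, not the hardware design: the function names (`gf_mul`, `gf_inv`, `ladder_x`) are hypothetical, field elements are stored as Python integers, the final inversion uses a plain Fermat exponentiation, and the input point is assumed to have a nonzero x-coordinate.

```python
# Software model of the x-only ladder built on Equation (1).
# Field elements are ints whose bits are polynomial coefficients over GF(2).

# Reduction polynomial of the NIST degree-163 binary field (polynomial basis).
F163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf_mul(a, b, f, m):
    """Bit-serial a*b in GF(2^m); f is the reduction polynomial (bit m set)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:      # degree reached m: reduce immediately
            a ^= f
    return r

def gf_inv(a, f, m):
    """a^(2^m - 2) by repeated squaring (illustrative, not optimized)."""
    r = 1
    for _ in range(m - 1):
        a = gf_mul(a, a, f, m)   # a^(2^i) for i = 1 .. m-1
        r = gf_mul(r, a, f, m)
    return r

def ladder_x(k, x, b, f, m):
    """Affine x-coordinate of k*P, or None for the point at infinity.

    Assumes k >= 1 and that x = x(P) is nonzero.
    """
    mul = lambda u, v: gf_mul(u, v, f, m)
    sqr = lambda u: gf_mul(u, u, f, m)

    def double(X, Z):                      # first two lines of Equation (1)
        X2, Z2 = sqr(X), sqr(Z)
        return sqr(X2) ^ mul(b, sqr(Z2)), mul(X2, Z2)

    def add(X1, Z1, X2, Z2):               # last two lines of Equation (1)
        t1, t2 = mul(X1, Z2), mul(X2, Z1)
        Z3 = sqr(t1 ^ t2)
        return mul(x, Z3) ^ mul(t1, t2), Z3

    X1, Z1 = x, 1                          # P1 = P
    X2, Z2 = double(x, 1)                  # P2 = 2P, so P2 = P1 + P holds
    for bit in bin(k)[3:]:                 # bits of k after the leading 1
        if bit == '1':
            X1, Z1 = add(X1, Z1, X2, Z2)
            X2, Z2 = double(X2, Z2)
        else:
            X2, Z2 = add(X1, Z1, X2, Z2)
            X1, Z1 = double(X1, Z1)
    # Single coordinate transformation at the very end.
    return None if Z1 == 0 else mul(X1, gf_inv(Z1, f, m))
```

In each loop iteration the `add` and `double` calls read the same inputs and do not depend on each other's results, which is the property that allows a hardware point adder and point doubler to work concurrently.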
3. Field arithmetic
Addition in a binary Galois field is trivial. If a trinomial or pentanomial can be chosen as the field polynomial, squaring can be implemented very efficiently using only XOR gates. Multiplication is the most important field operation and must be implemented with high efficiency. We present a new architecture aimed at low wire density via hardwired XORs. In our MSD serial multipliers (see Figure 1), reduction is performed at each iterative step to keep the size of the partial product at n instead of n + d, where d is the digit size, so that it is easier for EDA tools to place and route the design. For multiplicative inversion, we adopted the method of Itoh and Tsujii [3].

¹ The projective Y-coordinate does not need to be computed at the intermediate stages; it can be recovered from the X and Z coordinates of the final results.
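The inversion method of Itoh and Tsujii [3] can be sketched in Python as follows. This is a software model under stated assumptions, not the inverter of the accelerator: it uses the usual polynomial-basis formulation $a^{-1} = (a^{2^{m-1}-1})^2$ with an addition chain over the bits of $m - 1$, and the names `gf_mul` and `itoh_tsujii_inv` are hypothetical.

```python
def gf_mul(a, b, f, m):
    """Bit-serial a*b in GF(2^m); f is the reduction polynomial (bit m set)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r

def itoh_tsujii_inv(a, f, m):
    """Inverse of a nonzero element a of GF(2^m), for m >= 2.

    Maintains beta = a^(2^k - 1) and grows k along the binary expansion
    of m - 1, so that only a few full multiplications are needed.
    """
    sqr = lambda u: gf_mul(u, u, f, m)
    beta, k = a, 1                        # a^(2^1 - 1) = a
    for bit in bin(m - 1)[3:]:            # bits of m-1 after the leading 1
        t = beta
        for _ in range(k):                # t = beta^(2^k): k squarings
            t = sqr(t)
        beta = gf_mul(t, beta, f, m)      # a^(2^(2k) - 1)
        k *= 2
        if bit == '1':
            beta = gf_mul(sqr(beta), a, f, m)   # a^(2^(k+1) - 1)
            k += 1
    return sqr(beta)                      # a^(2^m - 2) = a^(-1)
```

With this chain, the number of full multiplications is $\lfloor \log_2(m-1) \rfloor + HW(m-1) - 1$, where $HW$ is the Hamming weight, that is, 9 multiplications for $m = 163$ and 10 for $m = 233$; the remaining $m - 1$ operations are squarings, which are cheap in polynomial basis.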
[Figure 1 (diagram not reproduced): the accumulator bits c0 ... c162 and the bits of operand a(x) are each shifted by 4 per clock cycle; the 163-bit terms b(x), x·b(x) mod f(x), x^2·b(x) mod f(x), and x^3·b(x) mod f(x) are combined under control of the digit bits d0(x) ... d3(x).]
Figure 1. Digit-serial multiplier over $GF(2^{163})$
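The behaviour of an MSD digit-serial multiplier with reduction at every step can be modelled in a few lines of Python. This is a behavioural sketch only, with hypothetical names (`xtime`, `msd_mul`); it assumes that each cycle multiplies the accumulator by $x^d$ with immediate reduction and then XORs in the precomputed terms $x^j \cdot b(x) \bmod f(x)$ selected by the current digit of $a(x)$. It says nothing about the actual gate-level structure.

```python
# Reduction polynomial of the NIST degree-163 binary field.
F163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def xtime(v, f, n):
    """One LFSR step: v * x mod f, keeping the result below degree n."""
    v <<= 1
    if (v >> n) & 1:
        v ^= f
    return v

def msd_mul(a, b, f, n, d):
    """Return (a*b mod f, number of cycles) for digit size d.

    The accumulator c never exceeds n bits because every shift is
    reduced immediately, instead of growing to n + d bits.
    """
    # Precompute x^j * b mod f for j = 0 .. d-1.
    rows = [b]
    for _ in range(d - 1):
        rows.append(xtime(rows[-1], f, n))

    cycles = -(-n // d)                    # ceil(n / d) digits of a
    mask = (1 << d) - 1
    c = 0
    for i in reversed(range(cycles)):      # most significant digit first
        digit = (a >> (i * d)) & mask
        for _ in range(d):                 # c <- c * x^d mod f
            c = xtime(c, f, n)
        for j in range(d):                 # AND-XOR array: add digit * b
            if (digit >> j) & 1:
                c ^= rows[j]
    return c, cycles
```

For $n = 163$, this model takes 6 iterations with $d = 32$, 21 with $d = 8$, and 41 with $d = 4$, which shows the trade-off between digit size and latency.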
4. FPGA implementation and results
Field multipliers are not shared among the point adder, the point doubler, and the coordinate transformer, which avoids a complicated data path that would have a negative effect on timing and routing. Performance comparisons with the accelerators developed by Gura et al. [1] are provided in Table 2; both designs are ported to the same FPGA device, the Xilinx XCV2000E-FG680-7.

[Figure 2 (diagram not reproduced): the blocks shown are the Point Adder, Point Doubler, Coordinate Transformer, Inverter, and Multiple Squarer, together with the field multipliers mul_1 to mul_6 and the squarers sqr_1 to sqr_7.]
Figure 2. The diagram of the ECC accelerator

Table 1. Digit size of multipliers

Multiplier   $GF(2^{163})$   $GF(2^{233})$
mul_1        32              32
mul_2        32              32
mul_3        32              32
mul_4        8               8
mul_5        8               8
mul_6        8               8

Table 2. Performance comparisons with Gura et al. results

                 Gura et al. [1]        Our design (d = 32)
Field size n     163       233          163       233
FFs              6,442     NA           7,467     10,632
LUTs             19,508    NA           25,763    35,800
f (MHz)          66.5      66.5         68.9      67.9
Latency (µs)     143       225          48        89

5. Conclusions
Our accelerator runs roughly three times as fast as the accelerator designed by Gura et al. [1] with comparable resource utilization. This speed is achieved through efficient field arithmetic, the choice of top-level algorithm, and a rational partitioning of the design. In particular, LFSRs are exploited together with AND-XOR arrays in order to further optimize the design. The choice of the best digit size for each multiplier also contributes significantly to the efficiency of the design.
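The speed-up claimed above can be checked directly against the figures reported in Table 2. The short Python calculation below uses only those published numbers; the variable names are illustrative.

```python
# Figures copied from Table 2: latency in microseconds per field size n,
# given as (Gura et al., our design), plus LUT counts for n = 163.
latency_us = {163: (143, 48), 233: (225, 89)}
speedup = {n: ref / ours for n, (ref, ours) in latency_us.items()}
lut_ratio_163 = 25763 / 19508

for n, s in speedup.items():
    print(f"n = {n}: speed-up {s:.2f}x")
print(f"n = 163: LUT ratio {lut_ratio_163:.2f}")
```

This gives a speed-up of about 2.98 for n = 163 and about 2.53 for n = 233, with roughly 1.32 times as many LUTs for n = 163.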
6. References
[1] N. Gura et al. An end-to-end systems approach to elliptic curve cryptography. In CHES ’02, pages 349–365, 2003.
[2] FIPS-186-2, Digital Signature Standard.
[3] T. Itoh and S. Tsujii. A fast algorithm for computing multiplicative inverses in $GF(2^m)$ using normal bases. Inf. Comput., 78(3):171–177, 1988.
[4] J. López and R. Dahab. Fast multiplication on elliptic curves over $GF(2^m)$ without precomputation. In CHES ’99, pages 316–327, 1999.
[5] G. Orlando and C. Paar. A high performance reconfigurable elliptic curve processor for $GF(2^m)$. In CHES ’00, pages 41–56, 2000.