FPGA based Hardware Acceleration for Elliptic Curve Public Key Cryptosystems

M. Ernst, B. Henhapl, S. Klupsch and S. A. Huss
{ernst klupsch huss }@iss.tu-darmstadt.de
[email protected]
Integrated Circuits and Systems Lab.
Cryptography and Computer Algebra
Computer Science Department
Darmstadt University of Technology, Germany

Keywords: Public Key cryptography, VHDL model generator, co-processor synthesis, FPGA-based hardware acceleration, ECDSA, Java and JCA

Abstract

E-Commerce applications and the emerging communication of private persons with administrations (E-Government) give rise to the important question of how to reliably exchange confidential data via public communication networks such as the Internet. Any data transfer must be protected from fraudulent access by third parties in the sense that exchanged documents are neither read nor modified during the data transfer. Furthermore, the author-document relationships have to be known and unique at any point in time. The fundamental technology for document protection during public data transfer is known as public key cryptography. Digital signature schemes are probably the most common occurrence of public key cryptosystems.

This paper addresses public key cryptosystems based on elliptic curves, which are aimed at high-performance digital signature schemes. Elliptic curve (EC) algorithms are characterized by the fact that one can work with considerably shorter keys compared to the RSA approach at the same level of security. A general and highly efficient method for mapping the most time-critical operations to a configurable co-processor is proposed. By means of real-time measurements the resulting performance values are compared to previously published state-of-the-art hardware implementations. A generator-based approach is advocated for that purpose, which supports application-specific co-processor configurations in a flexible and straightforward way. Such a configurable CryptoProcessor has been integrated into a Java-based digital signature environment, resulting in a considerable increase of its performance.

The outlined approach combines in a unique way the advantages of mapping functionality to either hardware or software, and it results in high-speed cryptosystems which are both portable and easy to update according to future security requirements.
1 Introduction

Many E-Commerce applications are characterized, for instance, by their demand for confidential data exchange via public communication networks (e.g., the Internet). These data exchanges must be protected from fraudulent access by third parties. The basic technology that can warrant this kind of protection is known as public key cryptography. Digital signature schemes are probably the most common occurrence of public key cryptosystems. Digital signatures as defined in the Digital Signature Standard (DSS) are used to detect unauthorized modifications in documents and to authenticate the signatory's identity [21]. Furthermore, digital signatures guarantee non-repudiation, i.e., the recipient of a signed document can prove to a third party that the signature was in fact created by the signatory.

Besides the widely used RSA method [23], public-key schemes based on elliptic curves (EC) have gained more and more importance. Elliptic curve cryptography (ECC) was first proposed in 1985 by Victor Miller [20] and Neal Koblitz [13]. Since then a lot of research has been done, and nowadays ECC is widely known and accepted. The digital signature algorithm for elliptic curves (ECDSA), which is the EC-based counterpart of DSA [21], is officially approved as ANSI standard X9.62 [2] and as IEEE standard P1363 [9]. Because EC methods in general are believed to give a higher security per key bit in comparison to RSA, one can work with shorter keys in order to achieve the same level of security (1024 RSA bits are equivalent to 160 EC bits [14]). The smaller key size permits more cost-efficient implementations and a higher throughput.

The operation of point multiplication ($k \cdot P$) is the basic arithmetic in the area of ECC. This is a complex operation, and its computation is very time consuming. Basically, the time required for the computation of $k \cdot P$ determines the overall performance of EC algorithms.
Due to the immense computational effort for the $k \cdot P$ computation, high-performance hardware implementations, like the proposed CryptoProcessor, are necessary in order to enable the use of EC methods in server-based cryptosystems (e.g., online banking servers). The performance of software implementations is not sufficient for this kind of application because $n$-bit operations (where $n$ is the utilized key size) have to be mapped to a processor with fixed word length (e.g., Intel Pentium, 32 bit), which introduces an immense computational overhead. For detailed information about leading software implementations we refer to [7], [16], [5] and [24].

The Elliptic Curve CryptoProcessor described in Sec. 4 implements the $k \cdot P$ operation completely within hardware. This hardware implementation is based on a reconfigurable logic device (FPGA) mounted on a PCI card, so that the system integration can be done easily via the PCI interface. It is demonstrated that the performance of an ECDSA implementation can be enhanced by orders of magnitude by using this processor.

The mathematical background for the CryptoProcessor is explained in the following section. Section 3 deals with specification and validation for this crypto application. In Section 4 the FPGA-based hardware implementation is detailed. Section 5 focuses on the hardware-accelerated ECDSA application, and Section 6 summarizes the conclusions.
2 Mathematical Background

There are several cryptographic schemes based on elliptic curves, which work on a subgroup of points of an EC over a finite field. Arbitrary finite fields are approved to be suitable for ECC. In this paper we concentrate on the finite field $GF(2^n)$, the elliptic curves over this field, and their arithmetic only. For further in-depth information on EC arithmetic we refer to [26].
2.1 Elliptic Curve Arithmetic

In the sequel we consider so-called non-supersingular elliptic curves only, as they provide the highest security. Please keep in mind that we talk of elliptic curves over $GF(2^n)$ only. The properties of this special field are detailed in Sec. 2.2. An elliptic curve $E$ over $GF(2^n)$ is defined as the cubic equation

$$ y^2 + xy = x^3 + a x^2 + b \qquad (1) $$

with $a, b \in GF(2^n)$ and $b \neq 0$. The solutions $(x, y)$ of Eqn. 1 with $x, y \in GF(2^n)$ are called the points of the elliptic curve $E$. By defining an appropriate addition operation and an extra point $\mathcal{O}$, called the point at infinity, these points become an additive, abelian group with $\mathcal{O}$ as the neutral element. Fig. 1 depicts an example of an elliptic curve over the reals. Here, a geometric interpretation of the addition $R = P + Q$ can be given: find the third intersection point ($-R$) of a straight line through $P$ and $Q$ with the elliptic curve. The result $R$ is found by mirroring $-R$ at the x-axis.
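Figure 1: Example of an elliptic curve over the real numbers visualizing the point addition.

For a non-supersingular elliptic curve $E$ defined over a finite field $GF(2^n)$ the basic operation of adding points $P + Q = R$, with $P = (x_1, y_1)$, $Q = (x_2, y_2)$ and $R = (x_3, y_3)$, is as follows:

If $P \neq Q$ (addition), then

$$ \lambda = \frac{y_1 + y_2}{x_1 + x_2}, \qquad x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \qquad y_3 = \lambda (x_1 + x_3) + x_3 + y_1 $$

If $P = Q$ (doubling), then

$$ \lambda = x_1 + \frac{y_1}{x_1}, \qquad x_3 = \lambda^2 + \lambda + a, \qquad y_3 = x_1^2 + (\lambda + 1) x_3 $$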
Thus, adding two elliptic curve points (EC-Add) as well as doubling an elliptic curve point (EC-Double) requires one inversion and two multiplications each over the underlying finite field $GF(2^n)$. Computing inverses is relatively expensive in comparison to multiplication in $GF(2^n)$. In order to avoid computing inverses, we switch to projective coordinates, of which several types have been suggested in the literature. The kind of projective coordinates published in [17] results in the least number of multiplications and is therefore exploited in our implementation. Replacing $x$ by $X/Z$ and $y$ by $Y/Z^2$ in Eqn. 1 leads to the EC equation

$$ Y^2 + XYZ = X^3 Z + a X^2 Z^2 + b Z^4 \qquad (2) $$

An affine point $P = (x, y)$ is converted into its projective representation $P = (X, Y, Z)$ by setting $X = x$, $Y = y$ and $Z = 1$. The conversion from projective to affine is done as stated before by computing $x = X/Z$ and $y = Y/Z^2$.
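If $P = (X_1, Y_1, Z_1)$ and $Q = (X_2, Y_2, Z_2)$ are given with $Z_2 = 1$ (footnote 2), where $P \neq \pm Q$ and $P, Q \neq \mathcal{O}$, then $R = P + Q = (X_3, Y_3, Z_3)$ is obtained by computing the sequence:

$$
\begin{aligned}
A &= Y_2 Z_1^2 + Y_1, & B &= X_2 Z_1 + X_1, \\
C &= Z_1 B, & D &= B^2 (C + a Z_1^2), \\
Z_3 &= C^2, & E &= A C, \\
X_3 &= A^2 + D + E, & F &= X_3 + X_2 Z_3, \\
G &= X_3 + Y_2 Z_3, & Y_3 &= E F + Z_3 G
\end{aligned}
$$

If $P = Q$, then $R = 2P = (X_3, Y_3, Z_3)$ is given by:

$$
\begin{aligned}
Z_3 &= X_1^2 Z_1^2, \\
X_3 &= X_1^4 + b Z_1^4, \\
Y_3 &= b Z_1^4 Z_3 + X_3 (a Z_3 + Y_1^2 + b Z_1^4)
\end{aligned}
$$

Thus, computing $P + Q$ (EC-Add) requires 10 finite field multiplications, 8 additions and 4 square operations. The computation of $2P$ (EC-Double) requires 5 multiplications, 4 additions and 5 squares. This is illustrated in the lower part of Fig. 2. Ultimately, multiplication, addition and squaring have to be done in the underlying finite field.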
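The projective formulas can be cross-checked against the affine ones on a toy example. The following Python sketch is illustrative only and rests on several assumptions that are not taken from the paper: a tiny field $GF(2^4)$ in polynomial basis with reduction polynomial $x^4 + x + 1$ (the CryptoProcessor itself works in an optimal normal basis), hypothetical curve coefficients `CURVE_A` and `CURVE_B`, and hypothetical helper names throughout. It assumes the mixed-coordinate addition and doubling sequences of [17] in the form given above.

```python
# Toy field GF(2^4), polynomial basis (assumption, chosen only to keep the check small).
M = 4
POLY = 0b10011                      # x^4 + x + 1, irreducible over GF(2)
CURVE_A, CURVE_B = 0b0001, 0b0110   # hypothetical coefficients, CURVE_B != 0

def gf_mul(a, b):
    """Shift-and-add multiplication with reduction modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_sq(a):
    return gf_mul(a, a)

def gf_inv(a):
    """a^(2^M - 2) by repeated multiplication (fine for a toy field)."""
    r = 1
    for _ in range(2**M - 2):
        r = gf_mul(r, a)
    return r

def on_curve(x, y):
    lhs = gf_sq(y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_sq(x), x) ^ gf_mul(CURVE_A, gf_sq(x)) ^ CURVE_B
    return lhs == rhs

def affine_add(P, Q):
    """Affine addition, valid only for x1 != x2 (i.e. P != +-Q)."""
    (x1, y1), (x2, y2) = P, Q
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_sq(lam) ^ lam ^ x1 ^ x2 ^ CURVE_A
    return x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1

def affine_double(P):
    """Affine doubling, valid only for x1 != 0."""
    x1, y1 = P
    lam = x1 ^ gf_mul(y1, gf_inv(x1))
    x3 = gf_sq(lam) ^ lam ^ CURVE_A
    return x3, gf_sq(x1) ^ gf_mul(lam ^ 1, x3)

def proj_add(P, Q):
    """Mixed addition: P projective, Q affine (Z2 = 1); 10 mult, 4 squares."""
    (X1, Y1, Z1), (X2, Y2) = P, Q
    z1s = gf_sq(Z1)
    a = gf_mul(Y2, z1s) ^ Y1
    b = gf_mul(X2, Z1) ^ X1
    c = gf_mul(Z1, b)
    d = gf_mul(gf_sq(b), c ^ gf_mul(CURVE_A, z1s))
    Z3 = gf_sq(c)
    e = gf_mul(a, c)
    X3 = gf_sq(a) ^ d ^ e
    f = X3 ^ gf_mul(X2, Z3)
    g = X3 ^ gf_mul(Y2, Z3)
    return X3, gf_mul(e, f) ^ gf_mul(Z3, g), Z3

def proj_double(P):
    """Projective doubling; 5 mult, 5 squares."""
    X1, Y1, Z1 = P
    x1s, z1s = gf_sq(X1), gf_sq(Z1)
    bz4 = gf_mul(CURVE_B, gf_sq(z1s))
    Z3 = gf_mul(x1s, z1s)
    X3 = gf_sq(x1s) ^ bz4
    return X3, gf_mul(bz4, Z3) ^ gf_mul(X3, gf_mul(CURVE_A, Z3) ^ gf_sq(Y1) ^ bz4), Z3

def to_affine(P):
    X, Y, Z = P
    zi = gf_inv(Z)
    return gf_mul(X, zi), gf_mul(Y, gf_sq(zi))

def check_all():
    """Compare projective and affine results on every usable point pair."""
    pts = [(x, y) for x in range(2**M) for y in range(2**M) if on_curve(x, y)]
    checked = 0
    for P in pts:
        if P[0] == 0:                 # 2P is the point at infinity
            continue
        D = proj_double((P[0], P[1], 1))
        D_aff = to_affine(D)
        assert D_aff == affine_double(P) and on_curve(*D_aff)
        for Q in pts:
            if Q[0] == D_aff[0]:      # Q = +-2P is excluded by the formulas
                continue
            assert to_affine(proj_add(D, Q)) == affine_add(D_aff, Q)
            checked += 1
    return checked

if __name__ == "__main__":
    print("point pairs checked:", check_all())
```

The exceptional cases (point at infinity, $P = \pm Q$) are deliberately skipped rather than handled, mirroring the preconditions stated for the formulas.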
2.2 Finite Field Arithmetic

As mentioned before, we are using $GF(2^n)$ as the underlying finite field (FF). $GF(2^n)$ is a FF of characteristic 2 and extension degree $n$, and can be viewed as a vector space of dimension $n$ over the field $GF(2)$, which consists of the two elements $\{0, 1\}$. By exploiting a field of characteristic 2, the addition is reduced to XOR-ing the corresponding bits. The multiplication rule depends on the basis chosen for representing the field elements.

Footnote 2: We can fix $Z_2 = 1$ because the base point $P$ will always be added during the computation of the point multiplication $k \cdot P$.

Figure 2: Arithmetic hierarchy. [The point multiplication $k \cdot P$ (Double-and-Add algorithm; $n$ = extension degree, $h$ = Hamming weight) uses $n$ EC-Double and $h$ EC-Add operations. Elliptic-curve arithmetic: EC-Double uses 5 FF-Mult, 4 FF-Add and 5 FF-Square; EC-Add uses 10 FF-Mult, 8 FF-Add and 4 FF-Square. Finite-field arithmetic: FF-Mult is annotated O(n log n), FF-Add and FF-Square O(1).]

There are several bases known for $GF(2^n)$. The most common bases, which are also permitted by the leading standards concerning ECC (IEEE P1363 and ANSI X9.62), are standard bases (also called polynomial bases) and normal bases. In standard bases, field elements are represented by binary polynomials modulo an irreducible binary polynomial (called reduction polynomial) of degree $n$. Standard basis representation is a good choice for software implementations. They especially benefit from reduction polynomials with low Hamming weight and with terms of low degree, such as trinomials or pentanomials. For each value of $n$ at least one of these exists.

Normal bases are generally better suited for hardware implementations. Squaring a field element can be easily performed by a cyclic shift. In hardware this can be done very efficiently within one clock cycle. In case of optimal normal bases (ONB; footnote 3) very efficient hardware multipliers can also be implemented. The proposed CryptoProcessor uses the Massey-Omura architecture [18] for the field multiplication in ONB. In this architecture all bits of the result vector can be computed independently from each other, so that very area-efficient bit-serial implementations as well as high-performance parallel architectures may be generated. If the following conditions are satisfied, an ONB of type II exists for the field $GF(2^n)$:

1. $2n + 1$ is a prime
2. 2 is primitive in $\mathbb{Z}_{2n+1}$ (footnote 4) or $2n + 1 \equiv 3 \pmod 4$.

Footnote 3: A special class of normal bases, which are also called Gaussian normal bases.

Footnote 4: i.e., if we take $2^k \bmod (2n+1)$ for $k = 0, 1, \dots, 2n-1$ then we get every value in the range $[1, 2n]$ back.
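The existence conditions can be tested mechanically for a given extension degree. The Python sketch below is a minimal illustration with hypothetical function names. One assumption goes beyond the wording above: for the second alternative the code additionally requires that 2 has multiplicative order $n$ modulo $2n + 1$ (i.e., that 2 generates the quadratic residues), which is how the type II condition is commonly stated; without it, degrees such as $n = 15$ would be accepted although the powers of 2 do not reach enough residues.

```python
def is_prime(p):
    """Trial division; sufficient for the small moduli 2n + 1 considered here."""
    if p < 2:
        return False
    i = 2
    while i * i <= p:
        if p % i == 0:
            return False
        i += 1
    return True

def order_of_two(p):
    """Multiplicative order of 2 modulo an odd prime p."""
    k, x = 1, 2 % p
    while x != 1:
        x = (2 * x) % p
        k += 1
    return k

def has_type2_onb(n):
    """Check the type II ONB conditions for GF(2^n) (see assumptions in the text)."""
    p = 2 * n + 1
    if not is_prime(p):
        return False
    k = order_of_two(p)
    # Either 2 is primitive (order 2n), or p = 3 mod 4 and 2 has order n.
    return k == 2 * n or (p % 4 == 3 and k == n)

if __name__ == "__main__":
    print([n for n in range(2, 40) if has_type2_onb(n)])
```

For $n = 191$, the vector width that appears in the generated VHDL entity, the modulus is $2n + 1 = 383$, and the check succeeds through the second alternative.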
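Let $\beta$ be a field element in $GF(2^n)$, then an ONB can be formed as:

$$ \{ \beta, \beta^{2}, \beta^{2^2}, \dots, \beta^{2^{n-1}} \} $$

In ONB, any element $A \in GF(2^n)$ is represented by a bit string $(a_0, a_1, \dots, a_{n-1})$, so that:

$$ A = \sum_{i=0}^{n-1} a_i \beta^{2^i} \quad \text{or} \quad A = a_0 \beta + a_1 \beta^{2} + \dots + a_{n-1} \beta^{2^{n-1}} $$

Squaring of an element is just a rotation of bits:

$$ A^2 = \sum_{i=0}^{n-1} a_i \beta^{2^{i+1}} = \sum_{i=0}^{n-1} a_{i-1} \beta^{2^i} $$

with indices taken modulo $n$, and from Fermat's little theorem:

$$ \beta^{2^n} = \beta. $$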
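Multiplication of two elements $A, B \in GF(2^n)$ is done as follows:

$$ C = A \cdot B = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j \, \beta^{2^i} \beta^{2^j} $$

which can be modified to

$$ c_k = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \lambda_{ij}^{(0)} \, a_{i+k} \, b_{j+k} $$

where

$$ \beta^{2^i} \beta^{2^j} = \sum_{k=0}^{n-1} \lambda_{ij}^{(k)} \beta^{2^k} $$

and all indices are taken modulo $n$. This requires the calculation of only $\lambda_{ij}^{(0)}$, which is called $\lambda$-matrix or multiplication matrix. An ONB has the minimum number of $2n - 1$ non-zero terms in its $\lambda$-matrix. With respect to a hardware implementation this is also the number of the wiring connections from the registers to the multiplier circuit. We have

$$
\beta^{2^i} \beta^{2^j} =
\begin{cases}
\beta^{2^k} + \beta^{2^{k'}} & \text{if } 2^i \not\equiv \pm 2^j \pmod{2n+1} \\
\beta^{2^k} & \text{if } 2^i \equiv \pm 2^j \pmod{2n+1}
\end{cases}
$$

In the case $2^i \not\equiv \pm 2^j \pmod{2n+1}$, at least one of the equations

$$ 2^i + 2^j \equiv 2^k, \qquad 2^i + 2^j \equiv -2^k \pmod{2n+1} $$

will have a solution $k$, and at least one of the equations

$$ 2^i - 2^j \equiv 2^{k'}, \qquad 2^i - 2^j \equiv -2^{k'} \pmod{2n+1} $$

will have a solution $k'$.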
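For the calculation of the $\lambda$-matrix we set $k = 0$ and then find the solution to the following equations:

$$
\begin{aligned}
2^i + 2^j &\equiv 1, & 2^i + 2^j &\equiv -1, \\
2^i - 2^j &\equiv 1, & 2^i - 2^j &\equiv -1 \pmod{2n+1}
\end{aligned}
$$

For each $i$ we get two (footnote 5) values of $j$, represented as $j_1(i)$ and $j_2(i)$. For example, the partial multiplication matrix for $n = 191$ is shown subsequently.

[Table: partial multiplication matrix for $n = 191$ (columns $i$, $j_1$, $j_2$)]

Now, the $c_k$'s can be calculated as

$$ c_k = a_k \, b_{1+k} + \sum_{i=1}^{n-1} a_{i+k} \left( b_{j_1(i)+k} + b_{j_2(i)+k} \right) \qquad (3) $$

where $k = 0, \dots, n-1$ and all indices are taken modulo $n$. For example, referring to the table for $n = 191$ above, we get:

[Expansion of the first $c_k$ terms for $n = 191$]

Footnote 5: except for $i = 0$, where only one value is obtained.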
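The construction of the $\lambda$-matrix and the bit-wise product of Eqn. 3 can be sketched in a few lines of Python. This is a software illustration under stated assumptions, not the hardware multiplier: the function names are hypothetical, field elements are plain lists of bits $(a_0, \dots, a_{n-1})$, and the rows are obtained by solving $2^j \equiv \pm(2^i \pm 1) \pmod{2n+1}$ through a lookup table of the residues $\pm 2^i$.

```python
from itertools import product

def lambda_rows(n):
    """For each row i, list the column indices j with a non-zero lambda entry (k = 0)."""
    p = 2 * n + 1
    idx = {}                       # residue r -> i with 2^i = +-r (mod p)
    for i in range(n):
        r = pow(2, i, p)
        idx[r] = i
        idx[p - r] = i
    if len(idx) != 2 * n:
        raise ValueError("no type II ONB for this n")
    rows = []
    for i in range(n):
        t = pow(2, i, p)
        # 2^i + 2^j = +-1 or 2^i - 2^j = +-1  <=>  2^j = +-(2^i - 1) or +-(2^i + 1)
        rows.append(sorted({idx[r % p] for r in (t - 1, t + 1) if r % p}))
    return rows

def onb_mul(a, b, rows):
    """Bit k of the product: sum over i of a[i+k] * (sum of b[j+k] for j in rows[i])."""
    n = len(a)
    c = []
    for k in range(n):
        bit = 0
        for i, js in enumerate(rows):
            if a[(i + k) % n]:
                for j in js:
                    bit ^= b[(j + k) % n]
        c.append(bit)
    return c

def onb_square(a):
    """Squaring is a cyclic shift: new bit i is old bit i-1."""
    return [a[-1]] + a[:-1]

if __name__ == "__main__":
    rows = lambda_rows(5)
    print(rows)
    # Sanity checks on GF(2^5): A*A equals the rotation, and the all-ones vector is 1.
    for bits in product((0, 1), repeat=5):
        a = list(bits)
        assert onb_mul(a, a, rows) == onb_square(a)
        assert onb_mul(a, [1] * 5, rows) == a
```

Each output bit uses the same wiring pattern applied to rotated inputs, which is what makes the structure attractive for a bit-serial or parallel XOR-tree realization.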
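This formula for calculating the $c_k$'s is quite regular and can be easily mapped to the Massey-Omura architecture. Such a multiplier takes two $n$-bit inputs and reduces them within a huge XOR-tree according to Eqn. 3 to one single bit.

The method published in [11] is utilized for FF inversion. This algorithm, which is highlighted in the following, is based on Fermat's little theorem. Given $A \in GF(2^n)$, $A \neq 0$, then:

$$ A^{-1} = A^{2^n - 2} = \left( A^{2^{n-1} - 1} \right)^2 $$

Supposing $m = n - 1$, we get further on:

$$
A^{2^m - 1} =
\begin{cases}
\left( A^{2^{m/2} - 1} \right)^{2^{m/2}} \cdot A^{2^{m/2} - 1} & \text{if } m \text{ is even} \\
A \cdot \left( A^{2^{m-1} - 1} \right)^{2} & \text{if } m \text{ is odd}
\end{cases}
$$

Alg. 1 in particular benefits from the fact that squaring in ONB representation is much cheaper than multiplication. The total number of multiplications $M$ required for one FF inversion is given by

$$ M = \lfloor \log_2 (n-1) \rfloor + h(n-1) - 1 $$

where $h(\cdot)$ denotes the Hamming weight in the binary representation of its argument [19].

Algorithm 1 Finite Field Inversion
INPUT: $A \in GF(2^n)$, $A \neq 0$
OUTPUT: result $= A^{-1}$
    s ← ⌊log2(n − 1)⌋ − 1
    result ← A
    while s ≥ 0 do
        m ← (n − 1) >> s    {right shift by s bits}
        t ← result
        for i from 1 to m >> 1 do
            t ← FF_Square(t)    {perform m >> 1 square operations}
        end for
        u ← FF_Mult(result, t)
        if m is odd then
            u ← FF_Square(u)
            result ← FF_Mult(u, A)
        else
            result ← u
        end if
        s ← s − 1
    end while
    result ← FF_Square(result)
    return result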
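The addition chain behind the inversion can be checked without any field arithmetic by tracking only the exponent of $A$. The Python sketch below is such a check and makes these assumptions explicit: an element $A^e$ is represented by the integer $e$ modulo $2^n - 1$, a multiplication adds exponents, a squaring doubles them, and the loop structure follows the usual formulation of this inversion method; the function name is hypothetical.

```python
def inversion_chain(n):
    """Return (final exponent of A, number of multiplications) for GF(2^n), n >= 2."""
    order = (1 << n) - 1            # A^(2^n - 1) = 1 for every non-zero A
    target = n - 1
    result = 1                      # exponent of A, i.e. result = A^(2^1 - 1)
    mults = 0
    s = target.bit_length() - 2     # process the bits of n-1 below the leading one
    while s >= 0:
        m = target >> s             # prefix of n-1 seen so far
        t = result
        for _ in range(m >> 1):     # m >> 1 squarings
            t = (2 * t) % order
        u = (result + t) % order    # one multiplication
        mults += 1
        if m & 1:
            u = (2 * u) % order     # one squaring
            result = (u + 1) % order  # multiply by A
            mults += 1
        else:
            result = u
        s -= 1
    result = (2 * result) % order   # final squaring: A^(2^n - 2) = A^-1
    return result, mults

if __name__ == "__main__":
    exponent, mults = inversion_chain(191)
    print(exponent == (1 << 191) - 2, mults)
```

For $n = 191$ the multiplication count agrees with the formula above: $\lfloor \log_2 190 \rfloor = 7$ and $h(190) = 6$, giving $7 + 6 - 1 = 12$.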
2.3 EC Point Multiplication

Since the points on an elliptic curve form an additive group, there is no inner group operation like multiplication. Even so, repeated point additions such as

$$ k \cdot P = \underbrace{P + P + \dots + P}_{k \text{ times}} $$

are usually considered as the operation called point multiplication. With this operation we obtain a parallel problem to the discrete logarithm problem (DLP) over finite fields: $k$ is called the discrete logarithm of $Q = k \cdot P$ to the base $P$. Given only $P$ and $Q$ it is hardly possible to find $k$, since there is no sub-exponential algorithm known for this problem.

The operation $k \cdot P$ is performed with the so-called Double-and-Add algorithm, which is given below and is also depicted in Fig. 2. The computation of $k \cdot P$ is done by repeated doublings and additions of the base point $P$ using the underlying EC arithmetic as described in Sec. 2.1. Each $k \cdot P$ multiplication requires $n$ EC-Double and $h$ EC-Add operations. As EC-Double is cheaper in terms of FF multiplications than EC-Add, the performance of the algorithm benefits from a key with low Hamming weight $h$. The EC arithmetic in turn is based on the underlying finite field arithmetic. As detailed in Sec. 2.2, the FF operations FF-Add and FF-Square can be implemented very efficiently, so that the performance of the EC arithmetic is mainly determined by the time required for the computation of FF-Mult. For one $k \cdot P$ computation $5n + 10h$ FF-Mult operations have to be executed. Using the previously described multiplier architecture, each FF-Mult is carried out as a sequence of elementary Massey-Omura operations.

Example: with a pure bit-serial implementation of the Massey-Omura multiplier, which produces one result bit per iteration, one computation of $k \cdot P$ requires $(5n + 10h) \cdot n$ multiplier iterations.
3 Hardware Specification

In this paper we focus on an EC crypto algorithm which is realized in a single integrated circuit. This application, like many other digital circuits, is based on models specified in a hardware description language. The hardware description language VHDL is the de-facto standard for abstract modeling of digital circuits.
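Algorithm 2 EC Point Multiplication (Double-and-Add)
INPUT: $k$ and $P$
OUTPUT: $Q = k \cdot P$
    i ← n − 1
    while i ≥ 0 and k_i = 0 do
        i ← i − 1
    end while
    if i ≥ 0 then
        Q ← P
        i ← i − 1
        while i ≥ 0 do
            Q ← EC_Double(Q)
            if k_i = 1 then
                Q ← EC_Add(Q, P)
            end if
            i ← i − 1
        end while
    else
        Q ← O
    end if
    return Q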
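The control flow of the Double-and-Add algorithm is independent of the group it runs on, so it can be illustrated with any additive group. The Python sketch below uses the integers as a stand-in group (an assumption made purely for testing; in the CryptoProcessor the two callbacks would be the projective EC-Double and EC-Add operations). All names are hypothetical, and the operation counters are added for illustration.

```python
def double_and_add(k, P, ec_double, ec_add, zero):
    """Left-to-right binary method; returns (k*P, number of doublings, number of additions)."""
    nbits = k.bit_length()
    if nbits == 0:
        return zero, 0, 0             # k = 0 yields the neutral element
    Q = P                             # the leading 1 bit of k
    doubles = adds = 0
    for i in range(nbits - 2, -1, -1):
        Q = ec_double(Q)
        doubles += 1
        if (k >> i) & 1:
            Q = ec_add(Q, P)          # the base point P is always the second operand
            adds += 1
    return Q, doubles, adds

def ff_mult_estimate(doubles, adds):
    """FF-Mult count using 5 per EC-Double and 10 per EC-Add, as stated in Sec. 2.1."""
    return 5 * doubles + 10 * adds

if __name__ == "__main__":
    # Integers under addition as a stand-in group: doubling is 2*q, addition is q + p.
    Q, d, a = double_and_add(0b1011, 7, lambda q: 2 * q, lambda q, p: q + p, 0)
    print(Q, d, a, ff_mult_estimate(d, a))
```

Because the leading key bit only initializes $Q$, this variant performs one doubling and one addition fewer than the bit length and Hamming weight of $k$; the figures $n$ and $h$ quoted in Sec. 2.3 are therefore slight upper bounds for it.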
3.1 Modeling Levels

Abstract VHDL model descriptions can be processed by synthesis tools in order to derive a netlist of basic logic elements, which can then be fed into place and route tools. Based on this design flow, Register Transfer Level (RTL) descriptions have proven to be well suited to efficiently design integrated circuits. In addition to using commercial synthesis tools, there is a lot of potential for application-specific model generators, which produce an RTL description from a more abstract rule set.

The RTL abstraction level is a good choice when aiming at synthesizable models which have to be independent from the utilized synthesis tool. In 1999 an IEEE Standard [10] was drafted which defines a rule set for RTL synthesis. RTL is characterized by functional blocks, such as registers, memory units or ALUs, and control logic. The latter is based on clocked state transitions. This abstraction level is well suited for synchronous designs with a clear functionality and hard chip-size or performance constraints. Designers have a good chance to add manual optimizations, and complexity can be managed by a hierarchical design.

The Algorithmic Level is defined with multi-cycle operations in mind. While each control step
Key size n
Radix r

entity FF_MULT is
  port (
    A, B : in  std_logic_vector (190 downto 0);
    C    : out std_logic );
end FF_MULT;

architecture STRUCTURE of FF_MULT is
begin