FPGA based Hardware Acceleration for Elliptic Curve Public Key Cryptosystems

M. Ernst, B. Henhapl, S. Klupsch and S. A. Huss
{ernst, klupsch, huss}@iss.tu-darmstadt.de, [email protected]
Integrated Circuits and Systems Lab. / Cryptography and Computer Algebra
Computer Science Department, Darmstadt University of Technology, Germany

Keywords: Public Key cryptography, VHDL model generator, co-processor synthesis, FPGA-based hardware acceleration, ECDSA, Java and JCA

Abstract

E-Commerce applications and the emerging communication of private persons with administrations (E-Government) give rise to the important question of how to reliably exchange confidential data via public communication networks such as the Internet. Any data transfer must be protected from fraudulent access by third parties in the sense that exchanged documents are neither read nor modified during the data transfer. Furthermore, the author-document relationships have to be known and unique at any point in time. The fundamental technology for document protection during public data transfer is known as public key cryptography. Digital signature schemes are probably the most common occurrence of public key cryptosystems.

This paper addresses public key cryptosystems based on elliptic curves, which are aimed at high-performance digital signature schemes. Elliptic curve (EC) algorithms are characterized by the fact that one can work with considerably shorter keys compared to the RSA approach at the same level of security. A general and highly efficient method for mapping the most time-critical operations to a configurable co-processor is proposed. A generator based approach is advocated for that purpose, which supports application specific co-processor configurations in a flexible and straightforward way. Such a configurable CryptoProcessor has been integrated into a Java-based digital signature environment, resulting in a considerable increase of its performance. By means of real-time measurements the resulting performance values are compared to previously published state of the art hardware implementations. The outlined approach combines in a unique way the advantages of mapping functionality to either hardware or software, and it results in high-speed cryptosystems which are both portable and easy to update according to future security requirements.

1 Introduction

Many E-Commerce applications are characterized by their demand for confidential data exchange via public communication networks (e.g., the Internet). These data exchanges must be protected from fraudulent access by third parties. The basic technology that can warrant this kind of protection is known as public key cryptography. Digital signature schemes are probably the most common occurrence of public key cryptosystems. Digital signatures as defined in the Digital Signature Standard (DSS) are used to detect unauthorized modifications in documents and to authenticate the signatory's identity [21]. Furthermore, digital signatures guarantee non-repudiation, i.e., the recipient of a signed document can prove to a third party that the signature was in fact created by the signatory.

Besides the widely used RSA method [23], public-key schemes based on elliptic curves (EC) have gained more and more importance. Elliptic curve cryptography (ECC) was first proposed in 1985 by Victor Miller [20] and Neal Koblitz [13]. Since then a lot of research has been done, and nowadays ECC is widely known and accepted. The digital signature algorithm for elliptic curves (ECDSA), which is the EC based counterpart of DSA [21], is officially approved as ANSI standard X9.62 [2] and as IEEE standard P1363 [9]. Because EC methods in general are believed to give a higher security per key bit in comparison to RSA, one can work with shorter keys in order to achieve the same level of security (1024 RSA bits are equivalent to 160 EC bits [14]). The smaller key size permits more cost-efficient implementations and a higher throughput.

The operation of point multiplication ($k \cdot P$) is the basic arithmetic in the area of ECC. This is a complex operation, and its computation is very time consuming. Basically, the time required for the computation of $k \cdot P$ determines the overall performance of EC algorithms. Due to the immense computational effort for the $k \cdot P$ computation, high performance hardware implementations, like the proposed CryptoProcessor, are necessary in order to enable the use of EC methods in server-based cryptosystems (e.g., online banking servers). The performance of software implementations is not sufficient for this kind of application because $n$-bit operations (footnote 1) have to be mapped to a processor with fixed word length (e.g., Intel Pentium, 32 bit), which introduces an immense computational overhead. For detailed information about leading software implementations we refer to [7], [16], [5] and [24].

The Elliptic Curve CryptoProcessor described in Sec. 4 implements the $k \cdot P$ operation completely within hardware. This hardware implementation is based on a reconfigurable logic device (FPGA) mounted on a PCI card, so that the system integration can be done easily via the PCI interface. It is demonstrated that the performance of an ECDSA implementation can be enhanced by orders of magnitude by using this processor.

The mathematical background for the CryptoProcessor is explained in the following section. Section 3 deals with specification and validation for this crypto application. In Section 4 the FPGA based hardware implementation is detailed. Section 5 focuses on the hardware accelerated ECDSA application, and Section 6 summarizes the conclusions.

Footnote 1: Supposing $n$ is the utilized key size.

2 Mathematical Background

There are several cryptographic schemes based on elliptic curves, which work on a subgroup of points of an EC over a finite field. Arbitrary finite fields have been shown to be suitable for ECC. In this paper we concentrate on the finite field $GF(2^n)$, the elliptic curves over this field, and their arithmetic only. For further in-depth information on EC arithmetic we refer to [26].

2.1 Elliptic Curve Arithmetic

In the sequel we consider only so-called non-supersingular elliptic curves, as they provide the highest security. Please keep in mind that we talk of elliptic curves over $GF(2^n)$ only. The properties of this special field are detailed in Sec. 2.2. An elliptic curve over $GF(2^n)$ is defined as the cubic equation

$$E:\; y^2 + xy = x^3 + ax^2 + b \qquad (1)$$

with $a, b \in GF(2^n)$ and $b \neq 0$. The solutions $(x, y)$ of Eqn. 1 with $x, y \in GF(2^n)$ are called the points of the elliptic curve $E$. By defining an appropriate addition operation and an extra point $\mathcal{O}$, called the point at infinity, these points become an additive, abelian group with $\mathcal{O}$ as the neutral element. Fig. 1 depicts an example of an elliptic curve over the reals. Here, a geometric interpretation of the addition can be given: find the third intersection point ($-R$) of a straight line through $P$ and $Q$ with the elliptic curve. The result $R = P + Q$ is found by mirroring $-R$ at the x-axis.



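For a non-supersingular elliptic curve $E$ defined over a finite field $GF(2^n)$, the basic operation of adding points $R = P + Q$, with $P = (x_1, y_1)$, $Q = (x_2, y_2)$ and $R = (x_3, y_3)$, is as follows:

If $P \neq Q$ (addition), then

$$\lambda = \frac{y_1 + y_2}{x_1 + x_2}, \qquad x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \qquad y_3 = \lambda (x_1 + x_3) + x_3 + y_1$$

If $P = Q$ (doubling), then

$$\lambda = x_1 + \frac{y_1}{x_1}, \qquad x_3 = \lambda^2 + \lambda + a, \qquad y_3 = x_1^2 + (\lambda + 1)\, x_3$$

Figure 1: Example of an elliptic curve over the real numbers visualizing the point addition.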

Thus, adding two elliptic curve points (EC-Add) as well as doubling an elliptic curve point (EC-Double) requires one inversion and two multiplications each over the underlying finite field $GF(2^n)$. Computing inverses is relatively expensive in comparison to multiplication in $GF(2^n)$. In order to avoid computing inverses, we switch to projective coordinates, of which several types have been suggested in the literature. The kind of projective coordinates published in [17] results in the least number of multiplications and is therefore exploited in our implementation. Replacing $x = X/Z$ and $y = Y/Z^2$ in Eqn. 1 leads to the EC equation

$$Y^2 + XYZ = X^3 Z + a X^2 Z^2 + b Z^4 \qquad (2)$$

An affine point $(x, y)$ is converted into its projective representation by setting $X = x$, $Y = y$ and $Z = 1$. The conversion from projective to affine is done as stated before by computing $x = X/Z$ and $y = Y/Z^2$.
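The affine addition and doubling rules of this section can be tried out on a very small field. The following is only a sketch under stated assumptions: it uses a toy field $GF(2^4)$ in polynomial basis (not the normal basis used by the CryptoProcessor), the reduction polynomial $x^4 + x + 1$, and hypothetical curve constants `A` and `B`; the names `ff_mul`, `ff_inv`, `on_curve` and `ec_add` are illustrative and do not come from the paper.

```python
# Toy parameters (assumptions for illustration only): GF(2^4) in polynomial
# basis with reduction polynomial x^4 + x + 1 and hypothetical curve constants.
N = 4
POLY = 0b10011
A, B = 0b1000, 0b1001  # curve y^2 + xy = x^3 + A*x^2 + B, with B != 0


def ff_mul(u, v):
    """Multiply two field elements (shift-and-add with modular reduction)."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        v >>= 1
        u <<= 1
        if u >> N:
            u ^= POLY
    return r


def ff_inv(u):
    """Invert a non-zero element via Fermat's little theorem: u^(2^N - 2)."""
    r = 1
    for _ in range(2**N - 2):
        r = ff_mul(r, u)
    return r


def on_curve(P):
    """Check Eqn. 1; None stands for the point at infinity."""
    if P is None:
        return True
    x, y = P
    lhs = ff_mul(y, y) ^ ff_mul(x, y)
    rhs = ff_mul(ff_mul(x, x), x) ^ ff_mul(A, ff_mul(x, x)) ^ B
    return lhs == rhs


def ec_add(P, Q):
    """Affine group law: addition for P != Q, doubling for P == Q."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 ^ y2 == x1:  # Q = -P = (x1, x1 + y1), so the sum is infinity
            return None
        lam = x1 ^ ff_mul(y1, ff_inv(x1))  # doubling
        x3 = ff_mul(lam, lam) ^ lam ^ A
        y3 = ff_mul(x1, x1) ^ ff_mul(lam ^ 1, x3)
        return (x3, y3)
    lam = ff_mul(y1 ^ y2, ff_inv(x1 ^ x2))  # addition
    x3 = ff_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = ff_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


# All points of the toy curve, found by exhaustive search.
POINTS = [None] + [(x, y) for x in range(2**N) for y in range(2**N)
                   if on_curve((x, y))]
```

Each call to `ec_add` performs one field inversion, which is the cost that the projective representation is meant to avoid.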

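If two points in projective representation are given, where the $Z$-coordinate of the base point is fixed to 1 (footnote 2), then their sum (EC-Add) is obtained by computing the sequence:

[...]

If both points are identical, then the doubled point (EC-Double) is given by:

[...]

Thus, computing the sum of two points (EC-Add) requires 10 finite field multiplications, 8 additions and 4 square operations. The computation of the doubled point (EC-Double) requires 5 multiplications, 4 additions and 5 squares. This is illustrated in the lower part of Fig. 2. Ultimately, multiplication, addition and squaring have to be done in the underlying finite field.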
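To make the operation count for EC-Double concrete, the sketch below uses the commonly cited López-Dahab doubling formulas for the coordinates $x = X/Z$, $y = Y/Z^2$. This is an assumption: the count of this sequence (5 multiplications, 5 squarings, 4 additions) agrees with the figures quoted above, but whether it is the exact sequence used in the paper cannot be confirmed from the text. The toy field $GF(2^4)$ in polynomial basis, the constants `A` and `B`, and all function names are hypothetical.

```python
# Toy field GF(2^4), polynomial basis, x^4 + x + 1 (assumption for illustration).
N = 4
POLY = 0b10011
A, B = 0b1000, 0b1001  # hypothetical curve constants, B != 0
COUNT = {"mult": 0, "square": 0, "add": 0}


def raw_mul(u, v):
    """Uncounted field multiplication used by the helpers."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        v >>= 1
        u <<= 1
        if u >> N:
            u ^= POLY
    return r


def raw_inv(u):
    """Uncounted inversion of a non-zero element: u^(2^N - 2)."""
    r = 1
    for _ in range(2**N - 2):
        r = raw_mul(r, u)
    return r


def ff_mult(u, v):
    COUNT["mult"] += 1
    return raw_mul(u, v)


def ff_square(u):
    COUNT["square"] += 1
    return raw_mul(u, u)


def ff_add(u, v):
    COUNT["add"] += 1
    return u ^ v


def ec_double_projective(X1, Y1, Z1):
    """Inversion-free doubling (assumed Lopez-Dahab formulas)."""
    x2 = ff_square(X1)
    z2 = ff_square(Z1)
    bz4 = ff_mult(B, ff_square(z2))          # b * Z1^4
    Z3 = ff_mult(x2, z2)                     # X1^2 * Z1^2
    X3 = ff_add(ff_square(x2), bz4)          # X1^4 + b * Z1^4
    t = ff_add(ff_add(ff_mult(A, Z3), ff_square(Y1)), bz4)
    Y3 = ff_add(ff_mult(bz4, Z3), ff_mult(X3, t))
    return X3, Y3, Z3


def to_affine(X, Y, Z):
    """x = X/Z, y = Y/Z^2."""
    zi = raw_inv(Z)
    return raw_mul(X, zi), raw_mul(Y, raw_mul(zi, zi))


def affine_double(x, y):
    """Reference: affine doubling rule for x != 0."""
    lam = x ^ raw_mul(y, raw_inv(x))
    x3 = raw_mul(lam, lam) ^ lam ^ A
    y3 = raw_mul(x, x) ^ raw_mul(lam ^ 1, x3)
    return x3, y3
```

Comparing `to_affine(*ec_double_projective(x, y, 1))` with `affine_double(x, y)` for a point on the toy curve is a quick consistency check, and `COUNT` shows how many field operations one doubling consumed.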

2.2 Finite Field Arithmetic

As mentioned before, we are using $GF(2^n)$ as the underlying finite field (FF). $GF(2^n)$ is a FF of characteristic 2 and extension degree $n$, and can be viewed as a vector space of dimension $n$ over the field $GF(2)$, which consists of the two elements $\{0, 1\}$. By exploiting a field of characteristic 2, the addition is reduced to XOR-ing the corresponding bits. The multiplication rule depends on the basis chosen for representing the field elements.

Figure 2: Arithmetic hierarchy. [The figure shows three layers: the Double-and-Add algorithm computing $k \cdot P$ from $n$ EC-Double and $h$ EC-Add operations ($n$ = extension degree, $h$ = Hamming weight); the elliptic-curve arithmetic, where EC-Double uses 5 FF-Mult, 4 FF-Add and 5 FF-Square and EC-Add uses 10 FF-Mult, 8 FF-Add and 4 FF-Square; and the finite-field arithmetic with FF-Mult in $O(n \log n)$, FF-Add in $O(1)$ and FF-Square in $O(1)$.]

There are several bases known for $GF(2^n)$. The most common bases, which are also permitted by the leading standards concerning ECC (IEEE P1363 and ANSI X9.62), are standard bases (also called polynomial bases) and normal bases. In standard bases, field elements are represented by binary polynomials modulo an irreducible binary polynomial (called reduction polynomial) of degree $n$. Standard basis representation is a good choice for software implementations. They especially benefit from reduction polynomials with low Hamming weight and also from terms of low degree, such as trinomials or pentanomials. For each value of $n$ at least one of these exists.

Normal bases are generally better suited for hardware implementations. Squaring a field element can be easily performed by a cyclic shift. In hardware this can be done very efficiently within one clock cycle. In case of optimal normal bases (ONB, footnote 3) very efficient hardware multipliers can also be implemented. The proposed CryptoProcessor uses the Massey-Omura architecture [18] for the field multiplication in ONB. In this architecture all bits of the result vector can be computed independently from each other, so that very area-efficient bit-serial implementations as well as high-performance parallel architectures may be generated. If the following conditions are satisfied, an ONB of type II exists for the field $GF(2^n)$:

1. $2n + 1$ is a prime
2. 2 is primitive in $\mathbb{Z}_{2n+1}$ (footnote 4), or $2n + 1 \equiv 3 \pmod 4$ and 2 generates the quadratic residues in $\mathbb{Z}_{2n+1}$.

Footnote 2: We can fix $Z = 1$ because the base point $(X, Y, Z)$ will always be added during the computation of the point multiplication.

Footnote 3: A special class of normal bases, which are also called Gaussian normal bases.

Footnote 4: i.e., if we take $2^k \bmod (2n + 1)$ for $k = 0, \ldots, 2n - 1$, then we get every value in the range $[1, 2n]$ back.
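The existence conditions for a type II ONB can be checked mechanically for a given extension degree. The sketch below assumes the standard formulation of the condition (2 is primitive modulo $2n+1$, or $2n+1 \equiv 3 \pmod 4$ and 2 generates the quadratic residues, which is equivalent to 2 having multiplicative order $n$); the helper names are illustrative only.

```python
def is_prime(m):
    """Trial division, sufficient for the small moduli considered here."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True


def multiplicative_order(a, m):
    """Smallest k >= 1 with a^k = 1 (mod m); assumes gcd(a, m) = 1."""
    k, x = 1, a % m
    while x != 1:
        x = (x * a) % m
        k += 1
    return k


def has_type2_onb(n):
    """True if GF(2^n) admits a type II optimal normal basis."""
    p = 2 * n + 1
    if not is_prime(p):
        return False
    order = multiplicative_order(2, p)
    # order == 2n: 2 is primitive in Z_p.
    # order == n and p = 3 (mod 4): 2 generates the quadratic residues.
    return order == 2 * n or (p % 4 == 3 and order == n)


if __name__ == "__main__":
    print([n for n in range(2, 31) if has_type2_onb(n)])
```

For example, `has_type2_onb(191)` evaluates the condition for a 191-bit field, the operand width that appears in the VHDL fragment later in the paper.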

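Let $\beta$ be a field element in $GF(2^n)$, then an ONB can be formed as:

$$\{\beta, \beta^{2}, \beta^{2^2}, \ldots, \beta^{2^{n-1}}\}$$

In ONB, any element $a \in GF(2^n)$ is represented by a bit string $(a_0 a_1 \ldots a_{n-1})$, so that:

$$a = \sum_{i=0}^{n-1} a_i \beta^{2^i}$$

or

$$a = a_0 \beta + a_1 \beta^{2} + a_2 \beta^{2^2} + \cdots + a_{n-1} \beta^{2^{n-1}}$$

Squaring of an element is just a rotation of bits: $a^2 = (a_{n-1} a_0 a_1 \ldots a_{n-2})$, and from Fermat's little theorem: $\beta^{2^n} = \beta$.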

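Multiplication of two elements $a, b \in GF(2^n)$, expressed with respect to the normal basis elements $\beta^{2^i}$, is done as follows:

$$c = a \cdot b = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j \, \beta^{2^i} \beta^{2^j}$$

which can be modified to

$$c_k = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \lambda_{ij}^{(k)} a_i b_j$$

where

$$\beta^{2^i} \beta^{2^j} = \sum_{k=0}^{n-1} \lambda_{ij}^{(k)} \beta^{2^k}, \qquad \lambda_{ij}^{(k)} \in \{0, 1\}$$

Because squaring is a cyclic shift, $\lambda_{ij}^{(k)} = \lambda_{i-k,\,j-k}^{(0)}$ with indices taken modulo $n$. This requires the calculation of only $\lambda_{ij}^{(0)}$, which is called $\lambda$-matrix or multiplication matrix. An ONB has the minimum number of $2n - 1$ non-zero terms in its $\lambda$-matrix. With respect to a hardware implementation this is also the number of the wiring connections from the registers to the multiplier circuit. We have

[...]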
if [...] or [...]. In case [...], at least one of the equations will have a solution [...], and at least one of the equations will have a solution [...].

For the calculation of the $\lambda$-matrix we set [...] and then find the solution to the following equations:

[...]

For each [...] we get two (footnote 5) values of [...], represented as [...] and [...]. For example, the partial multiplication matrix for [...] is shown below.

[Table: partial multiplication matrix]

Now, the result bits $c_k$ can be calculated as

[...] (3)

where [...]. For example, referring to the table above, we get:

[...]

Footnote 5: except for [...], where only one value is obtained.
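The structure of the $\lambda$-matrix and the bit-wise multiplication rule can be illustrated in software. The sketch below is not the paper's derivation: it assumes the standard construction of a type II ONB from $\beta = \gamma + \gamma^{-1}$, with $\gamma$ a primitive $(2n+1)$-th root of unity, which gives $\lambda_{ij} = 1$ exactly when $2^i \pm 2^j \equiv \pm 1 \pmod{2n+1}$. The function names and the small degree $n = 5$ in the usage are hypothetical choices.

```python
def lambda_matrix(n):
    """lambda-matrix of a type II ONB for GF(2^n); assumes one exists for n."""
    p = 2 * n + 1
    lam = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = (pow(2, i, p) + pow(2, j, p)) % p
            d = (pow(2, i, p) - pow(2, j, p)) % p
            if s in (1, p - 1) or d in (1, p - 1):
                lam[i][j] = 1
    return lam


def onb_mult(a, b, lam):
    """c_k = sum_i sum_j lam[i][j] * a[i+k] * b[j+k], indices mod n, over GF(2)."""
    n = len(a)
    c = []
    for k in range(n):
        bit = 0
        for i in range(n):
            for j in range(n):
                if lam[i][j]:
                    bit ^= a[(i + k) % n] & b[(j + k) % n]
        c.append(bit)
    return c


def onb_square(a):
    """Squaring in a normal basis: cyclic rotation (a_{n-1}, a_0, ..., a_{n-2})."""
    return [a[-1]] + a[:-1]


if __name__ == "__main__":
    lam = lambda_matrix(5)
    a = [1, 0, 1, 1, 0]          # bit string (a_0 ... a_4)
    print(sum(map(sum, lam)))    # number of non-zero lambda entries
    print(onb_mult(a, a, lam) == onb_square(a))
```

Every output bit $c_k$ uses the same $\lambda$-matrix with rotated inputs, which is why a bit-serial multiplier can reuse one XOR-tree and simply rotate its operand registers between iterations.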

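This formula for calculating the $c_k$'s is quite regular and can be easily mapped to the Massey-Omura architecture. Such a multiplier takes two $n$-bit inputs and reduces them within a huge XOR-tree according to Eqn. 3 to one single bit.

The method published in [11] is utilized for FF inversion. This algorithm, which is highlighted in the following, is based on Fermat's little theorem. Given $a \in GF(2^n)$, $a \neq 0$, then:

$$a^{-1} = a^{2^n - 2} = \left(a^{2^{n-1} - 1}\right)^2$$

Supposing an exponent of the form $2^r - 1$, we get further on:

$$a^{2^r - 1} = \begin{cases} \left(a^{2^{r/2} - 1}\right)^{2^{r/2}} \cdot a^{2^{r/2} - 1} & \text{if } r \text{ is even} \\ a \cdot \left(a^{2^{r-1} - 1}\right)^{2} & \text{if } r \text{ is odd} \end{cases}$$

Algorithm 1 Finite Field Inversion
INPUT: $a \in GF(2^n)$, $a \neq 0$
OUTPUT: $a^{-1}$
  s ← ⌊log2(n − 1)⌋ − 1
  result ← a
  while s ≥ 0 do
    r ← (n − 1) >> s    {right shift by s bits}
    q ← result
    for i from 1 to r >> 1 do
      q ← FF_Square(q)    {perform ⌊r/2⌋ square operations}
    end for
    t ← FF_Mult(result, q)
    if r is odd then
      t ← FF_Square(t)
      result ← FF_Mult(t, a)
    else
      result ← t
    end if
    s ← s − 1
  end while
  result ← FF_Square(result)
  return result

Alg. 1 in particular benefits from the fact that squaring in ONB representation is much cheaper than multiplication. The total number of multiplications $M(n)$ required for one FF inversion is given by

$$M(n) = \lfloor \log_2 (n - 1) \rfloor + h(n - 1) - 1$$

where $h(\cdot)$ denotes the Hamming weight in the binary representation of its argument [19].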
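The inversion scheme and its multiplication count can be checked with a small software model. The sketch below is an illustration under assumptions: it works in polynomial basis rather than ONB (so squaring is an ordinary multiplication here, not a rotation), and the field $GF(2^8)$ with reduction polynomial `0x11B` in the usage is an arbitrary choice; the function names are hypothetical.

```python
def ff_mul(u, v, n, poly):
    """Multiply in GF(2^n), polynomial basis; poly includes the x^n term."""
    r = 0
    while v:
        if v & 1:
            r ^= u
        v >>= 1
        u <<= 1
        if u >> n:
            u ^= poly
    return r


def ff_inverse(a, n, poly):
    """Addition-chain inversion a^(2^n - 2); returns (inverse, multiplications).

    Squarings are not counted, because they are nearly free in ONB hardware.
    """
    mults = 0
    s = (n - 1).bit_length() - 2        # floor(log2(n - 1)) - 1
    result = a                          # invariant: result = a^(2^r - 1)
    while s >= 0:
        r = (n - 1) >> s
        q = result
        for _ in range(r >> 1):         # floor(r / 2) squarings
            q = ff_mul(q, q, n, poly)
        t = ff_mul(result, q, n, poly)
        mults += 1
        if r & 1:
            t = ff_mul(t, t, n, poly)
            result = ff_mul(t, a, n, poly)
            mults += 1
        else:
            result = t
        s -= 1
    return ff_mul(result, result, n, poly), mults


def inversion_mult_count(n):
    """M(n) = floor(log2(n - 1)) + h(n - 1) - 1."""
    return (n - 1).bit_length() - 1 + bin(n - 1).count("1") - 1


if __name__ == "__main__":
    inv, mults = ff_inverse(0x53, 8, 0x11B)
    print(ff_mul(0x53, inv, 8, 0x11B), mults, inversion_mult_count(8))
```

The loop processes the bits of $n - 1$ from the most significant end, which is why the count depends only on the bit length and the Hamming weight of $n - 1$.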

2.3 EC Point Multiplication

Since the points on an elliptic curve $E$ form an additive group, there is no inner group operation like multiplication. Even so, repeated point additions such as

$$Q = k \cdot P = \underbrace{P + P + \cdots + P}_{k \text{ times}}$$

with $k \in \mathbb{N}$ and $P \in E$, are usually considered as the operation $k \cdot P$, called point multiplication. With this operation we obtain a problem analogous to the discrete logarithm problem (DLP) over finite fields: $k$ is called the discrete logarithm of $Q$ to the base $P$. Given only $Q$ and $P$ it is hardly possible to find $k$, since no sub-exponential algorithm is known for this problem.

The operation $k \cdot P$ is performed with the so-called Double-and-Add algorithm, which is given in Algorithm 2 and is also depicted in Fig. 2. The computation of $k \cdot P$ is done by repeated doublings and additions of the base point $P$ using the underlying EC arithmetic as described in Sec. 2.1. Each $k \cdot P$ multiplication requires $n$ EC-Double and $h$ EC-Add operations. As EC-Double is cheaper in terms of FF multiplications than EC-Add, the performance of the algorithm benefits from a key $k$ with low Hamming weight $h$.

The EC arithmetic in turn is based on the underlying finite field arithmetic. As detailed in Sec. 2.2, the FF operations FF-Add and FF-Square can be implemented very efficiently, so that the performance of the EC arithmetic is mainly determined by the time required for the computation of FF-Mult. For one $k \cdot P$ computation, $5n + 10h$ FF-Mult operations have to be executed. Using the previously described multiplier architecture this results in $n \cdot (5n + 10h)$ elementary Massey-Omura operations taking about $O(\log n)$ time each.

Example: For $n$ = [...] and $h$ = [...], with a pure bit-serial implementation of the Massey-Omura multiplier, one computation of $k \cdot P$ requires [...] multiplier iterations.
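The cost estimate above is simple arithmetic and can be written down directly. In the sketch below the function names are illustrative, and the values $n = 191$ and $h = 96$ in the usage are hypothetical inputs chosen for demonstration, not the figures of the paper's own example.

```python
def ff_mult_count(n, h):
    """FF-Mult operations for one k*P: n doublings at 5 each, h additions at 10 each."""
    return 5 * n + 10 * h


def bit_serial_iterations(n, h):
    """Multiplier iterations when every FF-Mult takes n bit-serial steps."""
    return n * ff_mult_count(n, h)


def hamming_weight(k):
    """Number of 1 bits in the key k."""
    return bin(k).count("1")


if __name__ == "__main__":
    n, h = 191, 96  # hypothetical example values
    print(ff_mult_count(n, h), bit_serial_iterations(n, h))
```

Since the iteration count grows with $n^2$, parallelizing the multiplier (computing several result bits per clock cycle) reduces the dominant term directly.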

3 Hardware Specification

In this paper we focus on an EC crypto algorithm, which is realized in a single integrated circuit. This application, like many other digital circuits, is based on models specified in a hardware description language. The hardware description language VHDL is the de-facto standard for abstract modeling of digital circuits.

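Algorithm 2 EC Point Multiplication (Double-and-Add)
INPUT: $P \in E$ and $k = (k_{n-1} \ldots k_1 k_0)_2$
OUTPUT: $Q = k \cdot P$
  i ← n − 1
  while i ≥ 0 and k_i = 0 do
    i ← i − 1
  end while
  if i ≥ 0 then
    Q ← P
    i ← i − 1
    while i ≥ 0 do
      Q ← EC_Double(Q)
      if k_i = 1 then
        Q ← EC_Add(Q, P)
      end if
      i ← i − 1
    end while
  else
    Q ← $\mathcal{O}$
  end if
  return Q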
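The control flow of the Double-and-Add algorithm does not depend on the group it runs on, so it can be sketched generically. In the following illustration the group operations are passed in as functions; the parameter names `ec_double`, `ec_add` and `identity` are hypothetical, and the usage substitutes ordinary integer addition for the EC arithmetic purely to keep the example self-contained.

```python
def double_and_add(k, P, ec_double, ec_add, identity):
    """Compute k*P by scanning the bits of k from the most significant end."""
    i = k.bit_length() - 1          # index of the leading 1 bit, -1 if k == 0
    if i < 0:
        return identity
    Q = P                           # the leading 1 bit is consumed here
    i -= 1
    while i >= 0:
        Q = ec_double(Q)
        if (k >> i) & 1:
            Q = ec_add(Q, P)
        i -= 1
    return Q


if __name__ == "__main__":
    # Stand-in group: integers under addition, so k*P is an ordinary product.
    result = double_and_add(0b1011, 7, lambda x: 2 * x, lambda x, y: x + y, 0)
    print(result == 11 * 7)
```

In this variant the leading 1 bit is handled by the initialization, so the loop performs one doubling per remaining key bit and one addition per remaining 1 bit.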

3.1 Modeling Levels

Abstract VHDL model descriptions can be processed by synthesis tools in order to derive a netlist of basic logic elements, which can then be fed into place and route tools. Based on this design flow, Register Transfer Level (RTL) descriptions have proven to be well suited to efficiently design integrated circuits. In addition to using commercial synthesis tools, there is a lot of potential for application specific model generators, which produce an RTL description from a more abstract rule set. The RTL abstraction level is a good choice when aiming at synthesizable models which have to be independent of the utilized synthesis tool. In 1999 an IEEE Standard [10] was drafted which defines a rule set for RTL synthesis.

RTL is characterized by functional blocks, such as registers, memory units or ALUs, and control logic. The latter is based on clocked state transitions. This abstraction level is well suited for synchronous designs with a clear functionality and hard chip-size or performance constraints. Designers have a good chance to add manual optimizations, and complexity can be managed by a hierarchical design.

The Algorithmic Level is defined with multi-cycle operations in mind. While each control step

[Figure: generator parameters key size $n$ and radix $r$, shown together with the following VHDL fragment]

    entity FF_MULT is
      port (
        A, B : in  std_logic_vector (190 downto 0);
        C    : out std_logic );
    end FF_MULT;

    architecture STRUCTURE of FF_MULT is
    begin
