A High-Throughput Area Efficient FPGA Implementation of AES-128 Encryption

Andreas Brokalakis
Computer Engineering & Informatics Department, University of Patras, Greece
[email protected]

Athanasios P. Kakarountas, Costas E. Goutis
Electrical & Computer Engineering Department, University of Patras, Greece
{kakaruda, goutis}@ee.upatras.gr

Abstract—The Advanced Encryption Standard (AES) is nowadays used extensively in network and multimedia applications to address security issues. This paper proposes a high-throughput, area-efficient FPGA implementation of this cryptographic primitive. It achieves the highest performance (in terms of throughput) among competing academic and commercial implementations. On a Virtex-II device, a throughput of 1.94 Gbps is achieved, while memory usage remains low (8 BlockRAMs) and CLB coverage moderate.

I. INTRODUCTION

Transmission and storage of sensitive data in open networked environments is rapidly growing, and with it grows the need for efficient and fast data encryption. Software implementations of cryptographic algorithms cannot provide the necessary performance when large amounts of data have to be moved over high-speed network links that reach the Gbps range. Therefore, hardware implementations have to be considered for these applications, either as ASIC or as FPGA designs.

FPGAs are very well suited for high-speed cryptography, as they can provide the required performance without the excessive cost of setting up an ASIC manufacturing process. Furthermore, when multiple protocols have to be implemented on the same platform, the ability of FPGA devices to be reconfigured in a very short time provides an extra advantage.

AES (Advanced Encryption Standard) [1] has become the algorithm of choice among cryptographic protocols that are based on symmetric ciphers. Since its standardization by NIST in May 2002, numerous FPGA implementations of the algorithm have been reported, both academic and commercial. Explorations of the design space, such as those presented in [7], [9] and [10], indicate that loop (iterative) architectures should be considered in order to keep area requirements low. These architectures implement a small number of rounds in hardware (typically just one) and achieve throughputs on the order of a few Gbps (up to 2 Gbps as reported in [10]). At the cost of increased area, higher throughputs may be accomplished by unrolling the encryption algorithm and independently implementing a large number (typically all) of rounds in hardware. Compared to loop architectures, such designs offer throughputs up to an order of magnitude higher, in a range of 7 ([18]) to 18 Gbps ([8], [10], [11]), while area and memory requirements can be more than 10 times greater.

This paper describes an iterative implementation of the AES encryption algorithm with a 128-bit cipher key (AES-128) on a Xilinx Virtex-II FPGA. The design has moderate area demands in terms of combinational logic and memory when compared to other iterative designs ([4], [5], [10], [12], [14]), while the offered throughput is the highest in its class. The low resource usage and high throughput position this implementation as a good candidate for integration in SoC or ASIP designs that target applications requiring the AES algorithm. Two representative examples of such applications are IPSec [15] and JPSec [16].

The next section provides a brief overview of the AES encryption algorithm. Section 3 details the architecture and overall design of the implementation, while area and timing results are presented in Section 4, in comparison with other published FPGA implementations. Conclusions are drawn in Section 5.

0-7803-9333-3/05/$20.00 ©2005 IEEE

II. AES ENCRYPTION ALGORITHM

The AES algorithm is a round-based, symmetric block cipher. It processes data blocks of fixed size (128 bits) using cipher keys of length 128, 192 or 256 bits. Depending on the key length, it is usually abbreviated as AES-128, AES-192 or AES-256 respectively. In this paper only AES-128 is considered, as it is the most popular variant of the algorithm. Nonetheless, all principles described can be applied to the other variants as well.

SIPS 2005

TABLE I. I/O INTERFACE SIGNAL DEFINITIONS

Signal          I/O   Width (bits)   Description
--------------  ----  -------------  ---------------------------------------------------------------
LoadKey         I     1              Indicates that a key is placed on the Key/Plaintext lines
LoadData        I     1              Indicates that plaintext is placed on the Key/Plaintext lines
Key/Plaintext   I     128            Key or plaintext data
clk             I     1              System clock
rst             I     1              System reset (synchronous)
ReadyForKey     O     1              Indicates that a new key may be accepted at the next cycle
ReadyForData    O     1              Indicates that new plaintext may be accepted at the next cycle
Ciphertext      O     128            Ciphertext data
CTValid         O     1              Indicates that the value present on the Ciphertext lines is valid

A. AES-128 Key Expansion Process

The initial 128-bit cipher key has to be expanded to eleven round keys of the same length. The first round key (RoundKey_0) is the cipher key itself, and every subsequent round key is produced by applying a function to the previously generated round key. If this function is called KeyExpand, then the process may be modeled as:

    RoundKey_i = KeyExpand(RoundKey_{i-1}), for all 0 < i < 11

Fig. 1 gives pseudocode for the key expansion process. Note that while cipher (round) keys are 128 bits long, the pseudocode references them as 4 words of 32 bits each. The RotWord function rotates the input word by one byte to the left and the SubWord function applies the S-Box transformation to each byte of the input word. The S-Box transformation is explained further in the next subsection. More details concerning the key expansion process may be found in [1].

    RC[1:10] = ('01','02','04','08','10','20','40','80','1B','36');
    Rcon[i]  = (RC[i],'00','00','00');

    W[0:3] = (Key[0],Key[1],Key[2],Key[3]);

    for(i = 4; i < 44; i++){
        temp = W[i-1];                      // previous word
        if ((i%4) == 0)                     // first word of each round key
            temp = SubWord(RotWord(temp)) XOR Rcon[i/4];
        W[i] = W[i-4] XOR temp;
    }

Figure 1. Pseudocode of the AES-128 key expansion process.
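As an illustration, the key expansion of Fig. 1 can be sketched in Python. This is a minimal software model and not the hardware design of the paper: the helper names (gf_mul, build_sbox, rot_word, sub_word, expand_key) are chosen here for illustration only, and the S-Box is assumed to be computed from its definition (multiplicative inverse in GF(2^8) followed by the affine transformation) instead of being read from a ROM.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by the AES polynomial
        b >>= 1
    return result


def build_sbox():
    """Compute the 256-entry S-Box: inverse in GF(2^8), then affine map."""
    sbox = []
    for x in range(256):
        inv = 0  # by convention, 0 is mapped to 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 is the multiplicative inverse of x
                inv = gf_mul(inv, x)
        s = inv
        for k in range(1, 5):  # XOR with the byte rotated left by 1..4 bits
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
RC = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def rot_word(w):
    """Rotate a 32-bit word one byte to the left."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def sub_word(w):
    """Apply the S-Box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def expand_key(key):
    """Return the 44 words W[0..43] for a 16-byte AES-128 cipher key."""
    if len(key) != 16:
        raise ValueError("AES-128 needs a 16-byte key")
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            # Rcon[i/4] = (RC[i/4], 00, 00, 00); RC is 0-based in this list
            temp = sub_word(rot_word(temp)) ^ (RC[i // 4 - 1] << 24)
        w.append(w[i - 4] ^ temp)
    return w


words = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(" ".join(f"{x:08x}" for x in words[4:8]))  # words of RoundKey_1
```

With the example cipher key of the AES standard (FIPS-197, Appendix A.1), the four words of RoundKey_1 come out as a0fafe17 88542cb1 23a33939 2a6c7605.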

B. AES Encryption Block Processing

A 128-bit data block, called plaintext, is provided as input to the AES encryption algorithm. AES performs a number of transformations on the plaintext and produces a 128-bit output block, called ciphertext. All intermediate results are called states. The states, as data structures, are organized as 4x4 matrices of bytes. The plaintext is mapped to a state matrix at the beginning of the encryption process and the final state is mapped to the ciphertext at the end.

AES performs 4 discrete transformations - SubBytes, ShiftRows, MixColumns and AddRoundKey - in that specific order. Each transformation maps a 128-bit input state to a 128-bit output state. One application of all 4 transformations comprises an encryption round. The number of rounds that have to be performed in order to produce the ciphertext depends on the size of the cipher key; for AES-128 this number is 10. The last round is slightly different, as the MixColumns transformation is not performed. Also, an initial round (where only AddRoundKey is performed) is carried out in advance.

The SubBytes transformation is a non-linear substitution that operates independently on each byte of a state using a substitution table (S-Box). The S-Box is defined as the multiplicative inverse in the finite field GF(2^8) with the irreducible polynomial m(x) = x^8 + x^4 + x^3 + x + 1, followed by an affine transformation. In the ShiftRows transformation, the bytes of the last three rows of the state are cyclically shifted by different offsets (the offset equals the index of the row it is applied to, i.e. the offset of row 1 is 1, the offset of row 2 is 2, etc.).

The MixColumns transformation is applied to the state column by column. Each column is treated as a four-term polynomial over GF(2^8) and is multiplied modulo x^4 + 1 with the constant polynomial:

    a(x) = {03}x^3 + {01}x^2 + {01}x + {02}

The AddRoundKey transformation is a simple bitwise XOR operation: the 128-bit state is XORed with a 128-bit round key.

III. PROPOSED AES-128 ARCHITECTURE

The high-level architectural organization of the AES encryption core is presented in Fig. 2, and the core I/O is detailed in Table I. As may be seen from Fig. 2, five logic blocks compose the overall system. The Input Interface unit is responsible for feeding the Key Logic and the Processing Core with the correct data from the input lines. Key Logic handles any cipher key related operation. Processing Core is the unit that performs the main AES encryption process. SBox is a ROM that is used for the SubBytes (SubWord) transformation, and finally the System Control Unit is responsible for the system's overall synchronization and communication with external logic. The following subsections explain the functionality of each logic block further.

A. Input Interface

When new plaintext is loaded, the Input Interface latches the result of a bitwise XOR between the value of the input lines and the value of the cipher key that is stored in the local register. The Processing Core unit is then notified that a new state is available for processing. This way, the initial round of the encryption process is carried out together with the loading of data, and processing time is saved.
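The four transformations of Section II.B, together with the initial AddRoundKey that the Input Interface folds into the loading of the plaintext, can be sketched as a small software model. This is only an illustrative Python sketch under stated assumptions, not the datapath of the proposed core: all function names are hypothetical, the state is assumed to be held as a list of 16 bytes in column-major order (byte at row r, column c has index r + 4c), and the S-Box is computed from its definition instead of being stored in a ROM.

```python
def xtime(a):
    """Multiply a byte by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox():
    """S-Box: multiplicative inverse (x^254) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, x)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key(key):
    """Return the 11 round keys (16 bytes each) of an AES-128 cipher key."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rc = 1
    for i in range(4, 44):
        temp = list(w[i - 1])
        if i % 4 == 0:
            temp = [SBOX[b] for b in temp[1:] + temp[:1]]  # SubWord(RotWord)
            temp[0] ^= rc
            rc = xtime(rc)
        w.append([a ^ b for a, b in zip(w[i - 4], temp)])
    return [bytes(sum(w[4 * r:4 * r + 4], [])) for r in range(11)]


def sub_bytes(s):
    return [SBOX[b] for b in s]


def shift_rows(s):
    # Output byte (r, c) is input byte (r, (c + r) mod 4): row r rotates left by r.
    return [s[(i % 4) + 4 * ((i // 4 + i % 4) % 4)] for i in range(16)]


def mix_columns(s):
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        for r in range(4):
            # {02}*a[r] + {03}*a[r+1] + a[r+2] + a[r+3], indices modulo 4
            out.append(xtime(a[r]) ^ xtime(a[(r + 1) % 4]) ^ a[(r + 1) % 4]
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out


def add_round_key(s, k):
    return [a ^ b for a, b in zip(s, k)]


def encrypt_block(plaintext, key):
    rk = expand_key(key)
    # Initial round: in the core this XOR is done while the plaintext is loaded.
    state = add_round_key(plaintext, rk[0])
    for rnd in range(1, 11):
        state = shift_rows(sub_bytes(state))
        if rnd != 10:  # the last round skips MixColumns
            state = mix_columns(state)
        state = add_round_key(state, rk[rnd])
    return bytes(state)


ct = encrypt_block(bytes.fromhex("3243f6a8885a308d313198a2e0370734"),
                   bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(ct.hex())
```

For the example block and key of the AES standard (FIPS-197, Appendix B), the model prints 3925841d02dc09fbdc118597196a0b32.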

[Figure: block diagram showing the Input Interface, Key Logic, Processing Core, S-Box ROM and System Control Unit, with the external signals Key/Plaintext, LoadKey, LoadData, Clk, Rst, ReadyForKey, ReadyForData, CipherText and CTValid.]

Figure 2. Block diagram of the AES encryption core

B. Key Logic

As already mentioned, the AES algorithm requires a new round key in every encryption round. There are three different approaches to when and how these round keys are calculated. The online (or key-agile, as it is called in [8]) approach calculates a new round key at every encryption round from the previous round key. This principle is followed by the majority of the referenced papers. Another approach (offline or stored-key) generates all round keys upon reception of the initial cipher key and stores them in a local memory. This memory is accessed at every encryption round in order to provide the necessary round key (the Amphion AES cores [5] seem to use this approach, yet insufficient information is provided). The final method for round key generation is the use of an external source (which may be an external key generator or a program running on a processor). This external source calculates the round keys and feeds them sequentially to the AES processing core. The Helion Technologies tiny and standard AES cores [6] are examples of this approach.

The implementation presented in this paper is based on the second approach for a number of reasons. The calculation of the round keys is totally independent of the plaintext, and the round keys produced from a certain cipher key remain the same as long as the initial cipher key does not change. Since a single cipher key is used for the encryption of data collections (such as files or packets) of typical sizes significantly larger than 128 bits, the same round keys are going to be used for a great number of plaintext blocks. Therefore the online approach wastes both resources and power. The cost of not using online key expansion is paid in latency but, as will be shown, this cost is not high. Also, power is exchanged for area, since a memory is needed for key storage.

The usage of an external source for key generation, on the other hand, is better suited to systems that are resource-limited and do not target high performance. Also, if internal storage of keys is not provided, the power saved by the absence of specific hardware to generate the round keys is invested in increased bus activity in order to load the keys to the core. If the bus is external (i.e. the encryption core does not reside inside a greater design on the same IC), power dissipation actually grows significantly.

From the previous analysis, it follows that the Key Logic unit has two main functions. The first is to perform the key expansion process whenever a new cipher key is inserted into the system, and the second is to access the local memory in order to provide the right round keys during an encryption process. The operation mode of the Key Logic unit is controlled by a signal from the system's Control unit. As shown in Fig. 1, in order to produce a new round key, two transformations have to be performed, RotWord and SubWord. The first one cyclically shifts the bytes of the last 32-bit word of the previous round key (W[i-1] in Fig. 1) by one position to the left. SubWord, on the other hand, applies the SubBytes transformation (see subsection II.B) to each byte of the rotated word. Simple bitwise XORs are then needed in order to produce the final round key. Fig. 3 demonstrates this process.

The SubWord (SubBytes) transformation is implemented with a ROM, called S-Box. Details of this ROM are provided later on, but it should be mentioned here that it is a synchronous memory and as a result it requires one clock cycle to produce an output. Therefore, the process of generating a round key is logically divided into two processing cycles. During the first one, the RotWord transformation is performed and an address is generated for the S-Box ROM. In the second cycle, the ROM's output is XORed with the proper Rcon value, and subsequent XORs of this result produce the remaining 3 words of the round key. All four words are then stored in the unit's register file. The whole key expansion process requires 20 cycles to complete (10 round keys generated in 2 cycles each).
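The two-cycle generation of each round key and the stored-key organization can be illustrated with a small behavioural model. The Python sketch below is an assumption-laden illustration, not the RTL of the Key Logic unit: the SyncRom class and the load_key function are hypothetical names, the ROM is assumed to look up four bytes per clock, and the S-Box contents are computed from their definition.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """S-Box contents: inverse in GF(2^8) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):
                inv = gf_mul(inv, x)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


class SyncRom:
    """Synchronous ROM model: data is available one clock after the address."""

    def __init__(self, contents):
        self.contents = contents
        self.out = None  # registered output

    def clock(self, addr_bytes):
        # Assumption: four byte-wide lookups are served in the same clock.
        self.out = [self.contents[a] for a in addr_bytes]


RC = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def load_key(key, rom):
    """Fill an 11-entry register file with round keys; count clock cycles."""
    regfile = [list(key)]  # RoundKey_0 is the cipher key itself
    cycles = 0
    for rnd in range(1, 11):
        prev = regfile[-1]
        # Cycle 1: RotWord on the last word, presented as the ROM address.
        last = prev[12:16]
        rom.clock(last[1:] + last[:1])
        cycles += 1
        # Cycle 2: ROM output XOR Rcon, then chained XORs give all 4 words.
        temp = list(rom.out)
        temp[0] ^= RC[rnd - 1]
        new_key = []
        for w in range(4):
            temp = [p ^ t for p, t in zip(prev[4 * w:4 * w + 4], temp)]
            new_key += temp
        regfile.append(new_key)  # one register file write per round key
        cycles += 1
    return regfile, cycles


regfile, cycles = load_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"),
                           SyncRom(build_sbox()))
print(cycles, bytes(regfile[10]).hex())
```

For the example key of FIPS-197 (Appendix A.1), the model reports 20 cycles and the last round key d014f9a8c9ee2589e13f0cc8b6630ca6. Because the round keys depend only on the cipher key, this cost is paid once per key and not once per plaintext block.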

An 11x128 register file is needed in order to store all round keys. Since only one access (read or write) is required in a cycle, the register file is organized as a single-ported
