An FPGA Based Reconfigurable IPSec AH Core with Efficient Implementation of SHA-3 for High Speed IoT Applications
Muzaffar Rao, Thomas Newe*, Ian Grout and Avijit Mathur
University of Limerick, Ireland
* [email protected]

Abstract: The need for securing data across the internet has become a fundamental issue over the last decade. The Internet Protocol Security (IPSec) standard has been developed as one solution to the problem of end-to-end secure communications. IPSec implementation is computationally intensive and can significantly limit the performance of high speed networks. Hardware implementations of IPSec offer the best way to overcome this speed issue. This work presents an FPGA (Field Programmable Gate Array) based reconfigurable IPSec AH core. IPSec has two main protocols, Authentication Header (AH) and Encapsulating Security Payload (ESP), and AH supports both transport and tunnel modes of operation. For the AH protocol, the recently selected cryptographic hash function Secure Hash Algorithm-3 (SHA-3) is implemented and used in this work. SHA-3 is implemented using a unique two phase implementation approach that combines all the steps of SHA-3. The resultant equations, after combining the SHA-3 steps, are implemented as a proposed high speed architecture, which results in data throughput in the Gbps range. The AH core proposed here outperforms other published techniques, supports IPv4 datagrams in both modes of operation (transport and tunnel) and can also be used to provide security services for IoT (Internet of Things) applications that require high data throughput.

Index Terms: FPGA, SHA-3, IPSec, AH, IoT, VPN

1. Introduction
The Internet Protocol (IP) [1][2][3] is the primary communication protocol for transferring data across a network. It defines the IP datagram structures that encapsulate the data to be delivered. The IPSec protocol [4], developed by the IETF (Internet Engineering Task Force) in 1998, is a popular solution for protecting data transferred at the IP layer. IPSec provides security services such as access control, connectionless integrity, data origin authentication, protection against replay attacks, confidentiality and limited traffic flow confidentiality. Two protocols are used to provide these services: a) Authentication Header (AH) [5] and b) Encapsulating Security Payload (ESP) [6]. The AH protocol provides: (1) data integrity of an IP datagram, so that modification of a datagram in transit can be detected; (2) authentication of an IP datagram, so that an end system can verify the sender and prevent address spoofing attacks; and (3) replay protection, which guards against replay attacks through the sequence numbers carried in the AH header. The ESP protocol defines mechanisms for data confidentiality and, optionally, integrity.
Both IPSec protocols (AH and ESP) support two modes of operation: (a) transport mode and (b) tunnel mode. In transport mode, protection is applied primarily to the upper-layer protocol data of the IP datagram, and this mode is typically used for end-to-end protection of IP datagrams between two hosts. In tunnel mode, the entire original IP datagram is protected within a new outer IP header. Tunnel mode can be used between security gateways to create a VPN (virtual private network).
The IPSec protocol is almost always embedded into the TCP/IP protocol stack via software in the OS (operating system), such as in Linux and NetBSD. However, IPSec has proven to be computationally intensive [7], which greatly affects the performance of the network it is implemented on. Data throughput in core routers has already reached terabits per second, and line card interface speeds exceed 10 Gbps, yet the speeds of high performance internet security devices are far behind these figures. The main reason is that the data processing required by security protocols is often complex and time consuming, so it is difficult for security devices to match the performance of other internet devices. Given that software solutions to complex problems like IPSec generally suffer from low performance when compared to hardware, hardware implementations of IPSec are necessary where high data throughput is required. Hardware solutions for security algorithms provide high speed and real time performance for applications such as data confidentiality, authentication and integrity, and they outperform software solutions because of their dedicated operation.
Hardware implementations must be both efficient and cost effective, and security algorithms are generally considered more physically secure if they are located physically separate from the main system processor. The FPGA platform is selected for hardware implementation because it is considered the leading representative of modern reconfigurable hardware devices. The flexible architecture and high performance of FPGAs make them particularly suitable for applications involving complex cryptographic algorithms (e.g. [8][9]). FPGAs are parallel in nature (unlike processors) and combine the best parts of Application Specific Integrated Circuits (ASICs) and processor-based systems. The advantage of using a software programmed processor is that software is very flexible to change, while a disadvantage is that performance can suffer if the clock is not fast. The advantage of an ASIC is that it can provide very high performance because of its dedicated type of operation; its disadvantages are: 1) high cost to volume ratio, 2) extended delay between design and end product, 3) inability to include new changes after the system is fabricated and 4) difficulties in debugging errors. FPGAs fill the gap between hardware and software and offer numerous advantages, such as: 1) flexibility, 2) reliability, 3) low cost, 4) fast time-to-market and 5) long-term maintenance. For these reasons the authors consider an FPGA to be the best reconfigurable hardware platform for the implementation of cryptographic algorithms. In addition, a successful FPGA implementation can easily be transferred into a full-custom ASIC to reduce cost for large scale device production.
A possible implementation of an FPGA based IPSec core is suggested in Fig. 1. This is a BITW (Bump In The Wire) architecture for IPSec. In Fig. 1 two networks that previously communicated with each other over an insecure IP link can now communicate securely by layering IPSec underneath regular IP using an FPGA based IPSec hardware solution. The FPGA based IPSec device takes IP datagrams from the gateway/router, applies IPSec security and then forwards the secured IP datagrams to the internet. The process is reversed on the receiving end before the regular IP packet is passed to the receiving gateway/router device. This technique allows legacy IPv4 hardware to use IPSec without having to replace expensive networking devices.
Fig. 1: FPGA based IPSec core implementation
To demonstrate the need for a BITW solution, consider the Internet of Things (IoT) applications shown in Fig. 2, where existing network interfaces do not support security checks. These applications are susceptible to various security attacks [10]. Providing security for many of these applications may require considerable hardware changes, in particular to the network interface devices that attach to the Internet. The simplest and most cost effective solution is BITW technology that supports integrity and data security for the provision of VPNs. BITW technology is an implementation approach that places a network security mechanism outside of the system that is to be protected.
Fig. 2: IoT applications attack possibilities [11]
The work presented here is, to the best of the authors’ knowledge, the first contribution which combines both modes (transport and tunnel) of the IPSec AH protocol on a single FPGA chip with a unique high speed implementation of the SHA-3 hash algorithm. The presented IPSec AH core is capable of performing AH protocol processing for an IPv4 datagram. The HMAC process is used to calculate the ICV (Integrity Check Value) for the AH header by using a cryptographic hash function. Here the recent cryptographic hash function SHA-3 [12] is used. SHA-3 is selected because it is the most recently standardised cryptographic hash function and is regarded as the most secure one available [13][14]. SHA-3 is implemented using a unique two phase implementation approach that combines all the steps of SHA-3. The resultant equations, after combining the SHA-3 steps, are implemented as a proposed high speed architecture, which results in data throughput in the Gbps range.
The remainder of this article is organized as follows. A brief overview of the IPSec AH protocol is given in Section 2. Section 3 provides an overview of SHA-3. The literature review of FPGA based IPSec implementations and FPGA based SHA-3 implementations is given in Section 4. Section 5 describes the proposed implementation of SHA-3 and in Section 6 the IPSec AH protocol implementation is discussed. Implementation performance results are given in Section 7, while Section 8 concludes the paper. All acronyms used here are given at the end of this paper in Table I.

2. IPSec AH protocol overview
One of the weaknesses of the Internet Protocol (IPv4) is that it lacks any general purpose mechanism for ensuring the authenticity and privacy of the data that is encapsulated into the IP datagram and passed over the internet. Due to this lack of security, any information in IP datagrams is subject to interception and even possible alteration when these datagrams are routed between two devices over unknown or public networks. IPSec technology was introduced to address this lack of security. The AH protocol provides the security services mentioned in Section 1 using a special header, called the AH header. The AH header consists of six fields as shown in Table II. The ‘Next header’ field is used to link the headers together and identifies the type of header that immediately follows the AH header. The ‘Payload length’ field represents the length of the AH header in 32-bit words minus 2 (see Section 6.2), while the ‘Reserved’ field is reserved for future use. The ‘SPI’ (Security Parameter Index) field identifies a security association (SA) that specifies shared security attributes. According to RFC-4302 [5], an SPI value of zero is reserved for local, implementation-specific use in the absence of a security association. The ‘Sequence number’ field holds a monotonically increasing counter value, which is used to provide protection against replay attacks. The last field of the AH header is ‘Authentication data’, which contains the ICV value. The length of the ICV value depends on the cryptographic hash function selected to generate it. The IPv4 datagram format before AH header insertion is given in Fig. 3. The AH header insertion scheme in the IP datagram depends on the selected mode of operation, transport or tunnel.
Fig. 3: IP datagram (before applying AH protocol)

2.1 AH transport mode
In AH transport mode authentication covers the entire IP datagram except the mutable fields (those modified during transit) [5] of the IP header. The mutable fields of the IP header are set to zero during HMAC calculation. In IPv4 the AH header is inserted after the original IP header. The insertion of the AH header in transport mode for an IPv4 datagram is shown in Fig. 4.
Fig. 4: AH header insertion scheme for IPv4 (transport mode)

2.2 AH tunnel mode
In AH tunnel mode the entire original IP datagram is authenticated. A new IP header is generated and the AH header is inserted between the new IP header and the original IP header. The new IP header contains the IP addresses of the security gateways to be traversed, while the original IP header carries the addresses of the end systems. As in transport mode, the mutable fields of the IPv4 header are replaced by zeros during HMAC calculation. The insertion of the AH header in tunnel mode for IPv4 is shown in Fig. 5. The next section discusses SHA-3, its background and its compression function.
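The two insertion schemes of Sections 2.1 and 2.2 can be sketched in a few lines of Python. This is only an illustration of the byte ordering, assuming the headers and payload are already available as byte strings; the function name `insert_ah` and its parameters are hypothetical and not part of the proposed core.

```python
def insert_ah(mode: str, ip_header: bytes, payload: bytes,
              ah_header: bytes, new_ip_header: bytes = b"") -> bytes:
    """Place the AH header according to the selected mode (illustrative only)."""
    if mode == "transport":
        # original IP header | AH header | upper-layer payload
        return ip_header + ah_header + payload
    if mode == "tunnel":
        # new IP header | AH header | original IP header | payload
        return new_ip_header + ah_header + ip_header + payload
    raise ValueError("mode must be 'transport' or 'tunnel'")


# Usage with placeholder byte strings standing in for real headers.
print(insert_ah("transport", b"[IP]", b"[DATA]", b"[AH]"))
print(insert_ah("tunnel", b"[IP]", b"[DATA]", b"[AH]", b"[NEW-IP]"))
```

In tunnel mode the original header travels as data behind the AH header, which is why the whole original datagram ends up authenticated.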
Fig. 5: AH header insertion scheme for IPv4 (tunnel mode)

3. SHA-3 overview
The National Institute of Standards and Technology (NIST) selected the ‘Keccak’ algorithm as the new hash function ‘SHA-3’ in October 2012 [15]. NIST issued a call for a new algorithm after vulnerabilities were detected in previously used hash functions (SHA-0, SHA-1, RIPEMD and MD5) [16][17][18]. Although no attack on SHA-2 has been reported yet, its algorithmic similarity to SHA-1 raises concerns about its security as well. The new algorithm, SHA-3, is considered the most secure cryptographic hash function publicly available to date. SHA-3 is a family of sponge functions characterised by two parameters, the bitrate ‘r’ and the capacity ‘c’. The sum (r + c) determines the width of the SHA-3 permutation and is restricted to a maximum value of 1600. The selection of ‘r’ and ‘c’ depends on the desired length of the hash output value (for a 256-bit hash: r = 1088, c = 512; for a 512-bit hash: r = 576, c = 1024). In this work the SHA-3 core has been designed for a 256-bit hash output. The 1600-bit state of SHA-3 consists of a 5x5 state matrix of 64-bit words, see Fig. 6.
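The relation between the hash length and the sponge parameters can be checked with a short calculation. The sketch below assumes the rule used by the fixed-length SHA-3 variants, namely that the capacity is twice the digest length; the function name `sponge_parameters` is illustrative only.

```python
STATE_WIDTH = 1600  # r + c in bits (25 words of 64 bits)


def sponge_parameters(digest_bits: int) -> tuple[int, int]:
    """Return (bitrate r, capacity c) in bits for a given digest length."""
    capacity = 2 * digest_bits          # assumption: c = 2 x digest length
    bitrate = STATE_WIDTH - capacity
    return bitrate, capacity


for n in (256, 512):
    r, c = sponge_parameters(n)
    print(f"{n}-bit hash: r = {r}, c = {c}, message block = {r // 8} bytes")
```

For the 256-bit variant used in this work, each absorbed message block is therefore 1088 bits (136 bytes, or seventeen 64-bit words).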
Fig. 6: State matrix A with 25 A[x, y] states (0 ≤ x, y ≤ 4)
There are 24 rounds in the compression function of SHA-3 and each round consists of five steps/operations, Theta (θ), Rho (ρ), Pi (π), Chi (χ) and Iota (ι), as shown in Eq. (1) to Eq. (6), where 0 ≤ x, y ≤ 4.

Theta (θ) step:
C[x] = A[x, 0] ⊕ A[x, 1] ⊕ A[x, 2] ⊕ A[x, 3] ⊕ A[x, 4];    (1)
D[x] = C[x-1] ⊕ ROT(C[x+1], 1);    (2)
A[x, y] = A[x, y] ⊕ D[x];    (3)

Rho (ρ) and Pi (π) step:
B[y, 2x+3y] = ROT(A[x, y], r[x, y]);    (4)

Chi (χ) step:
A[x, y] = B[x, y] ⊕ ((NOT B[x+1, y]) AND B[x+2, y]);    (5)

Iota (ι) step:
A[0, 0] = A[0, 0] ⊕ RC;    (6)

In the above equations all operations within indices are performed modulo 5. The complete permutation state array is denoted by ‘A’, while ‘A[x, y]’ denotes a particular 64-bit word in that state. All logical operations from Eq. (1) to (6) are bit-wise operations. In Eqs. (2) and (4) ‘ROT’ represents a bit rotation operation. The constant ‘r[x, y]’ gives the number of bit positions by which the updated word A[x, y] is rotated, while the ‘RC’ (Round Constant) is a 64-bit word that is unique for each round of the compression function. The details of ‘r[x, y]’ and ‘RC’ are given in [12].
The SHA-3 hash function operation consists of three phases namely initialization, absorbing and squeezing. Initialization is simply the initialization of the state matrix (A) with all zeros. In the absorbing phase each r-bit wide block of the message is XORed with the current matrix state and 24 rounds of the SHA-3 compression function are performed. The initialization and absorbing phases are represented in Fig. 7 and Fig. 8 respectively. After absorbing all blocks of the input message, the squeezing phase is utilized. In this phase, the state matrix is simply truncated to the desired length of the output hash.
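As a software reference for the round steps of Eqs. (1) to (6) and the three sponge phases, the following Python sketch computes a 256-bit hash and compares it with the standard library. It is not the hardware design described later; it assumes the FIPS 202 padding (suffix byte 0x06) so that the result is comparable with `hashlib.sha3_256`, and it generates ‘r[x, y]’ and ‘RC’ from their defining recurrences. All function names are illustrative.

```python
import hashlib

MASK = (1 << 64) - 1
RATE_BYTES = 1088 // 8  # r = 1088 bits for the 256-bit variant


def rot(word, n):
    """Rotate a 64-bit word left by n bit positions."""
    n %= 64
    return ((word << n) | (word >> (64 - n))) & MASK


def rotation_offsets():
    """Build the r[x][y] table from its defining recurrence."""
    r = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        r[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return r


def round_constants():
    """Generate the 24 RC words with the 8-bit LFSR of the specification."""
    constants, lfsr = [], 1
    for _ in range(24):
        rc = 0
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) & 0xFF
            if lfsr & 2:
                rc |= 1 << ((1 << j) - 1)
        constants.append(rc)
    return constants


R, RC = rotation_offsets(), round_constants()


def keccak_round(A, rc):
    """One round: Eqs. (1)-(6) on a 5x5 array of 64-bit words A[x][y]."""
    C = [A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4] for x in range(5)]
    D = [C[(x - 1) % 5] ^ rot(C[(x + 1) % 5], 1) for x in range(5)]
    A = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rot(A[x][y], R[x][y])
    A = [[B[x][y] ^ (~B[(x + 1) % 5][y] & B[(x + 2) % 5][y] & MASK)
          for y in range(5)] for x in range(5)]
    A[0][0] ^= rc
    return A


def sha3_256(message: bytes) -> bytes:
    # Padding (FIPS 202 assumption): make the length a multiple of 1088 bits.
    padded = bytearray(message) + b"\x06"
    padded += b"\x00" * (-len(padded) % RATE_BYTES)
    padded[-1] |= 0x80
    A = [[0] * 5 for _ in range(5)]                    # initialization
    for off in range(0, len(padded), RATE_BYTES):      # absorbing
        block = padded[off:off + RATE_BYTES]
        for i in range(RATE_BYTES // 8):
            A[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        for rc in RC:
            A = keccak_round(A, rc)
    # squeezing: truncate the state to 256 bits (four 64-bit words)
    return b"".join(A[i % 5][i // 5].to_bytes(8, "little") for i in range(4))


print(sha3_256(b"abc") == hashlib.sha3_256(b"abc").digest())
```

The word with coordinates (x, y) is taken from byte offset 8(x + 5y) of the block, so the first seventeen words of the state receive the 1088-bit message block while the capacity part is left untouched by the XOR.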
Fig. 7: Initialization phase of SHA-3
Fig. 8: Absorbing phase of SHA-3
The next section summarises the latest published work related to SHA-3 and IPSec implementations.

4. Related Work
There is limited literature available regarding hardware implementation of the IPSec protocol, while a lot of work has been done on FPGA based implementation of SHA-3.

4.1 IPSec implementations
In [19] Nui et al. deal with an in-line Network Security Processor (NSP) design for a 10 Gbps Ethernet link. The IPSec protocol is implemented by adopting a configurable multi-core architecture. This work allows the type as well as the number of cores used in the IPSec processor to be chosen flexibly, a feature enabled by a flexible bus architecture connecting these cores.
In [20], Driessen et al. proposed a reconfigurable lightweight IPSec core. To provide the lightweight IPSec core the authors evaluated different lightweight algorithms such as PRESENT, GRØSTL, PHOTON and a very compact ECC core. The functionality of the IPSec protocol is handled using a soft-core processor.
In [21] the authors proposed an architecture for implementing IPSec on a Xilinx Virtex-4 using partial reconfiguration. The cryptographic primitives, AES and SHA-256, are realized as a reconfigurable co-processor, which is attached to a MicroBlaze soft-core. The MicroBlaze is responsible for handling the protocol layer and reconfiguring the co-processor according to the type of primitive required. The supported primitives are programmed into the co-processor on demand, i.e., the co-processor supports only one primitive at a time, which allows for a lower overall resource consumption. Since partial reconfiguration comes with a time penalty for switching the crypto core, this approach does not allow for extremely high throughput in a typical setting.
An implementation on a Xilinx Virtex-II Pro FPGA was presented in [22]. The protocol layer is handled by a soft-core and reconfigurable hardware is used to implement AES and HMAC (SHA-1 and MD5). These algorithms were implemented on a single device and all hardware cores achieved over 1 Gb/s throughput. In [23] a hardware co-processor was proposed as an IPSec cryptographic core. It contained AES and HMAC-SHA1 cores and was implemented on a Xilinx Virtex XCV1000E device.

4.2 SHA-3 implementations
All previously published work related to FPGA implementations of the SHA-3 core mainly focused on one of: 1) access to the dedicated resources of the FPGA; 2) the use of different architectural approaches such as iterative, pipelined or parallel designs; or 3) simple hardware description language (HDL) implementations. For example, in [24] LUT primitives are used for implementation of the Keccak core, which helped to improve the throughput. In [25] different data path architecture implementations were proposed and the best throughput per area (TPA) was achieved on a Virtex-6 using basic iteration with 2 pipeline stages.
In [26], the authors proposed a common platform for the evaluation of all SHA-3 final candidates for a fair comparison of results. They showed that Keccak offered the best throughput of all the SHA-3 candidates. In [27], the authors proposed a slice-oriented architecture of Keccak with different data path widths and concluded that its throughput scaled almost linearly with increasing data path width. In [28], the authors presented hardware designs for all the SHA-3 candidates that participated in the 2nd round of the SHA-3 contest and tested the implementations for different message digest sizes. That work also includes padding as part of the hardware and presents a simple implementation without using any efficiency techniques. Similarly, in [29] the authors presented an implementation of all 2nd round SHA-3 candidates, again without using any efficiency techniques.
The IPSec AH work presented in this paper differs from all previously published related work (Section 4.1) in that it is a complete hardware implementation (without using any soft-core processor) of the AH protocol that supports both transport and tunnel mode operations for IPv4 datagrams. This IPSec AH implementation is detailed in Section 6. The IPSec AH work presented here uses a unique two phase implementation approach of SHA-3, which results in a high speed implementation compared to previously reported related work (Section 4.2). This SHA-3 implementation is detailed in Section 5.

5. Proposed implementation of SHA-3 core
The SHA-3 algorithm is implemented using a novel high speed FPGA architecture [30]. This high speed architecture is achieved using the two phase implementation technique outlined below.
In the first phase, all steps of the SHA-3 core (Theta, Rho, Pi, Chi and Iota) are logically combined. This step is significant because all previous FPGA implementations of the SHA-3 core (mentioned in Section 4.2) either involved a number of clock cycles and/or needed some dedicated area (in terms of registers, Block RAMs (BRAMs) etc.) for the storage of intermediate states. Here, these intermediate states are eliminated by logically combining all steps of the SHA-3 core, which not only improves the data throughput but also reduces the area utilization of the FPGA.
At the end of this phase, there are 25 equations (A[0,0] - A[4,4]), each one 64-bit word in length. These 25 equations form the 1600-bit state of SHA-3 and have the same structure but different sets of inputs. Because these equations share the same structure, a single generalized equation is proposed here.
In the second phase, a hardware architecture is proposed which realizes the generalized equation for the implementation of the first phase equations. This hardware architecture is implemented using Xilinx LUT primitives. Using this architecture, the generalized equation is instantiated 25 times to implement the 25 equations of the first phase. Each instantiated equation provides a 64-bit output.
A single round of the compression function (SHA-3 core) is implemented using these two phases within a single clock cycle and then an iterative approach is used to complete the 24 rounds of the compression function.
LUT primitives are used because conventional coding techniques and synthesis tools generally do not utilize FPGA resources efficiently [31]. The conventional design approach is to code the design logic in a Hardware Description Language (HDL) and then let the synthesis tool generate the FPGA level design. The drawback of this approach is that synthesis tools tend to map the logic onto LUTs without regard to the structure of the design, which results in a larger chip area and longer input to output path delays. Hence, the design becomes bigger and runs at slower clock rates, making it inefficient in terms of both area and speed. In contrast, manual use of the LUT primitives helps to utilize the LUT resources of an FPGA efficiently, which makes the proposed design more efficient, with minimum area and high speed. Details of each implementation phase of SHA-3 are explained below.

5.1 Logically combining all steps of SHA-3 core
Here, the SHA-3 core function steps are logically combined to get a SHA-3 core output of 1600 bits without involving any intermediate states. To combine the SHA-3 core steps, the SHA-3 core function is first unfolded manually for all possible combinations of ‘x’ and ‘y’ to determine: (1) which input bits of the SHA-3 core function are involved in bit rotation operations and by how many positions these bits are rotated; and (2) which input bits of the SHA-3 core function are involved in the bit-wise logical operations XOR, NOT and AND. Once the bit-rotation and logical operation requirements are known, the 25 output equations are expressed in terms of the input bits to the SHA-3 core.
The scheme given in Fig. 9 is used to logically combine all steps of the SHA-3 core. In this scheme, only five outputs of the SHA-3 core (A[0,0], A[0,1], A[0,2], A[0,3] and A[0,4]) are shown out of the 25 possible outputs (A[0,0] - A[4,4]). In Fig. 9 each input, output and individual line represents a 64-bit word. All bit rotation operations of Eqs. (2) and (4) are performed through rewiring without using any extra hardware. Since no additional hardware is needed for the bit-rotation operations, the architecture is optimized with respect to area as well as speed.
Initially, the core inputs A[0,0] to A[4,4] are applied to 5-input XOR functions, as shown in Fig. 9, to get the outputs of Eq. (1) (C[0], C[1], C[2], C[3] and C[4]). The rotated and non-rotated bits of C[x] are applied to a 2-input XOR function, taking the neighbouring columns one position either side of x; these are represented in Eq. (2) by C[x+1] and C[x-1] respectively. These 2-input XOR functions provide the outputs of Eq. (2) (D[0], D[1], D[2], D[3] and D[4]). The outputs of Eq. (2) are XORed with the input bits of A[x, y], as given in Eq. (3), using another set of 2-input XOR functions. The output bits of these XOR functions are rotated according to the r[x, y] scheme of Eq. (4). The modulo-5 operation of Eq. (4) is done manually to provide the bits that are applied to the next stage. In Fig. 9, the updated A[x, y] states (after the modulo 5 operation) are also shown after the ‘bit-rotation according to the r[x, y]’ block. The output of Eq. (4) is applied to the logical operations NOT, AND and XOR of Eq. (5) as shown. The last step (Eq. (6)) involves an XOR operation between the ‘RC’ input and the updated A[0,0] output of Eq. (5).
In this way all the steps of the SHA-3 core are logically combined using the scheme of Fig. 9, which results in the A[0,0] to A[4,4] outputs of the SHA-3 core function. Out of these 25 outputs (A[0,0] - A[4,4]), the output expression for A[0,0] is given in Eq. (7). All other outputs have the same structure as Eq. (7) but different inputs. In Eq. (7) the ‘ROT’ expression has two parts, where the first part represents the particular 64-bit word of A[x, y] and the second part represents the number of positions by which the bits of A[x, y] are rotated.
Fig. 9: Scheme used to combine steps of SHA-3 core
A[0,0] = { RC ⊕ A[0,0] ⊕
  { A[4,0] ⊕ A[4,1] ⊕ A[4,2] ⊕ A[4,3] ⊕ A[4,4] } ⊕
  { ROT(A[1,0], 1) ⊕ ROT(A[1,1], 1) ⊕ ROT(A[1,2], 1) ⊕ ROT(A[1,3], 1) ⊕ ROT(A[1,4], 1) } ⊕
  { ~( ROT(A[1,1], 44) ⊕
       { ROT(A[0,0], 44) ⊕ ROT(A[0,1], 44) ⊕ ROT(A[0,2], 44) ⊕ ROT(A[0,3], 44) ⊕ ROT(A[0,4], 44) } ⊕
       { ROT(A[2,0], 45) ⊕ ROT(A[2,1], 45) ⊕ ROT(A[2,2], 45) ⊕ ROT(A[2,3], 45) ⊕ ROT(A[2,4], 45) } ) &
     ( ROT(A[2,2], 43) ⊕
       { ROT(A[1,0], 43) ⊕ ROT(A[1,1], 43) ⊕ ROT(A[1,2], 43) ⊕ ROT(A[1,3], 43) ⊕ ROT(A[1,4], 43) } ⊕
       { ROT(A[3,0], 44) ⊕ ROT(A[3,1], 44) ⊕ ROT(A[3,2], 44) ⊕ ROT(A[3,3], 44) ⊕ ROT(A[3,4], 44) } ) } };    (7)

A[x, y] = { RC ⊕ I0 ⊕
  { I1 ⊕ I2 ⊕ I3 ⊕ I4 ⊕ I5 } ⊕
  { I6 ⊕ I7 ⊕ I8 ⊕ I9 ⊕ I10 } ⊕
  { ~( I11 ⊕
       { I12 ⊕ I13 ⊕ I14 ⊕ I15 ⊕ I16 } ⊕
       { I17 ⊕ I18 ⊕ I19 ⊕ I20 ⊕ I21 } ) &
     ( I22 ⊕
       { I23 ⊕ I24 ⊕ I25 ⊕ I26 ⊕ I27 } ⊕
       { I28 ⊕ I29 ⊕ I30 ⊕ I31 ⊕ I32 } ) } };    (8)

5.2 Hardware implementation of first phase equations
The resultant 25 equations of the first phase, from A[0,0] to A[4,4], have the same structure. Therefore, it is possible to write a general equation, as given in Eq. (8), which can represent each output equation (A[0,0] - A[4,4]) with its respective set of inputs I0 to I32, where each input is a 64-bit word. The ‘RC’ in Eq. (8) is applied in each round only to update A[0,0]; it is zero for the remaining 64-bit words of A[x, y]. The hardware architecture of Eq. (8) is shown in Fig. 10. In Eq. (8) the 64-bit inputs I1 to I5, I6 to I10, I12 to I16, I17 to I21, I23 to I27 and I28 to I32 are XORed with each other within each group; these six XOR operations are represented by the blocks of 5-input XOR functions on the L.H.S. of Fig. 10. The XOR outputs of I12 to I16 and I17 to I21 are further XORed with I11, and this logical operation is represented by a 3-input XOR function with output (A). In the same way the XOR outputs of I23 to I27 and I28 to I32 are XORed with I22, and this is represented by another 3-input XOR function with output (B). The output (A) is inverted using a NOT logical operation and an AND logical operation is performed between this inverted (A) and the output (B). These NOT and AND logical operations are implemented using the block (~A&B). The last 5-input XOR block in Fig. 10 performs an XOR operation between RC, I0, the XOR output of I1 to I5, the XOR output of I6 to I10 and the output of the (~A&B) block. In this way Eq. (8) is represented by the hardware architecture of Fig. 10.
The architecture of Fig. 10 therefore consists of seven 5-input XOR functions, two 3-input XOR functions and one (~A&B) function. These functions are implemented using LUT_5, LUT_3 and LUT_2 primitives respectively. LUTs are the basic logic building blocks of an FPGA and are used to implement most logic functions of a design. The ‘INIT’ parameter of the LUT primitive specifies the logic function that the LUT implements.
The architecture presented in Fig. 10 is instantiated 25 times to obtain the 25 64-bit word equations from A[0,0] to A[4,4]. All 24 rounds of the compression function are performed using the architecture of Fig. 10 and each round requires only a single clock cycle. This SHA-3 core implementation is used in the ‘HMAC calculation’ step of the AH protocol. The next section provides details of the AH protocol implementation.
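The equivalence between the combined form of Eq. (8) and the step-by-step round of Eqs. (1) to (6) can be checked in software. The Python sketch below is a behavioural model only, not the LUT-level design: the helper `inputs_for`, which derives the 33 inputs I0 to I32 for any output word, is an assumption reconstructed from Eqs. (1) to (5) (for A[0,0] it reproduces the inputs of Eq. (7)), and all names are illustrative.

```python
import random

MASK = (1 << 64) - 1


def rot(word, n):
    """Rotate a 64-bit word left by n positions (pure rewiring in hardware)."""
    n %= 64
    return ((word << n) | (word >> (64 - n))) & MASK


def rotation_offsets():
    """Build the r[x][y] table from its defining recurrence."""
    r = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        r[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return r


R = rotation_offsets()


def xor_all(words):
    out = 0
    for w in words:
        out ^= w
    return out


def stepwise_round(A, rc):
    """Reference round with explicit intermediate states, Eqs. (1)-(6)."""
    C = [xor_all(A[x]) for x in range(5)]
    D = [C[(x - 1) % 5] ^ rot(C[(x + 1) % 5], 1) for x in range(5)]
    T = [[A[x][y] ^ D[x] for y in range(5)] for x in range(5)]
    B = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            B[y][(2 * x + 3 * y) % 5] = rot(T[x][y], R[x][y])
    out = [[B[x][y] ^ (~B[(x + 1) % 5][y] & B[(x + 2) % 5][y] & MASK)
            for y in range(5)] for x in range(5)]
    out[0][0] ^= rc
    return out


def inputs_for(A, x, y):
    """Return the 33 words I0..I32 that feed Eq. (8) for output word A[x, y]."""
    inputs = []
    for k in range(3):                        # terms B[x,y], B[x+1,y], B[x+2,y]
        bx = (x + k) % 5
        sx, sy = (bx + 3 * y) % 5, bx         # word moved to B[bx, y] by Eq. (4)
        n = R[sx][sy]
        inputs.append(rot(A[sx][sy], n))
        inputs += [rot(A[(sx - 1) % 5][j], n) for j in range(5)]
        inputs += [rot(A[(sx + 1) % 5][j], n + 1) for j in range(5)]
    return inputs


def combined_word(I, rc=0):
    """Eq. (8): one output word computed directly from the 33 inputs."""
    a = I[11] ^ xor_all(I[12:17]) ^ xor_all(I[17:22])   # output (A) of Fig. 10
    b = I[22] ^ xor_all(I[23:28]) ^ xor_all(I[28:33])   # output (B) of Fig. 10
    return rc ^ I[0] ^ xor_all(I[1:6]) ^ xor_all(I[6:11]) ^ (~a & b & MASK)


rng = random.Random(0)
state = [[rng.getrandbits(64) for _ in range(5)] for _ in range(5)]
reference = stepwise_round(state, rc=1)
print(combined_word(inputs_for(state, 0, 0), rc=1) == reference[0][0])
```

Because rotation distributes over XOR, the rotation of Eq. (4) can be pushed down onto each individual input word, which is what allows the intermediate states C, D and B to disappear from the combined expression.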
Fig. 10: Proposed architecture to get 64-bit output words of A[x, y] (0 ≤ x, y ≤ 4)

6. IPSec AH protocol implementation
The AH core implementation presented here supports both transport and tunnel modes of operation, and this reconfigurable core is capable of securing IPv4 datagrams, as shown in Fig. 11. The reconfigurable AH core can be configured manually for the desired mode of operation. Before AH protocol processing is applied, the input IP datagram is passed through three stages: IP version verification, packet filter and selection of mode, as given in Fig. 12. The initial four bits of the IP datagram are used to check the IP version. A decimal value of ‘4’ in these bits indicates an IPv4 datagram. This check is necessary to determine the structure of the incoming IP datagram so that the datagram’s fields can be extracted. Once the IP version is verified, the IP datagram is passed through the packet filter, which decides whether or not AH protocol processing is required for that particular datagram. An IP datagram belonging to a specific network can be recognized by checking its source or destination IP address. Using this packet filter, an IP datagram of a particular network can be dropped, or forwarded either with or without applying the AH protocol. Additionally, this filter can be used to check whether or not the incoming IP datagram is already secured using the AH protocol, in which case the IP datagram is forwarded as is.
Fig. 11: Reconfigurable AH core
The implemented filter checks the ‘Protocol’ and ‘Next header’ fields of the IPv4 datagram. If these fields are equal to ‘51’, the IP datagram is already secured using the AH protocol and is forwarded without applying the AH protocol again. The next function shown in Fig. 12 is the selection of the mode of operation. As mentioned earlier, this selection is performed manually by the installation engineer. The last step is AH protocol processing, which is explained in detail below. The AH protocol processing is divided into five blocks as given in Fig. 13.
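The version check and the ‘already secured’ check described above can be sketched in Python as follows. This is a simplified illustration: it inspects only the version nibble and the ‘Protocol’ byte of an IPv4 header, omits the address-based filtering, and the function name `classify` and its return labels are hypothetical. How the core treats non-IPv4 input is not stated above, so the "unsupported" outcome is an assumption.

```python
AH_PROTOCOL = 51  # protocol number of the AH protocol


def classify(datagram: bytes) -> str:
    """Decide what to do with an incoming datagram (illustrative labels)."""
    version = datagram[0] >> 4          # initial four bits of the datagram
    if version != 4:
        return "unsupported"            # assumption: only IPv4 is processed
    protocol = datagram[9]              # 'Protocol' field of the IPv4 header
    if protocol == AH_PROTOCOL:
        return "forward"                # already secured with AH
    return "apply_ah"


# Usage with a minimal 20-byte IPv4 header carrying TCP (protocol 6).
header = bytes([0x45, 0, 0, 20, 0, 0, 0, 0, 64, 6, 0, 0,
                10, 0, 0, 1, 10, 0, 0, 2])
print(classify(header))
```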
Fig. 12: Main blocks of proposed implementation
Fig. 13: AH protocol processing

6.1 Protocol Pre-Processing
The protocol pre-processing depends on the selected mode of operation and is used to prepare the IP header for HMAC calculation. In transport mode the original IP header is updated, while in tunnel mode a new IP header is generated.
In transport mode, the original IP header is updated by replacing the mutable fields with zero and by updating some other fields: the ‘Protocol’ field of the IPv4 header is replaced by ‘51’, the protocol number of the AH protocol, and the ‘Total length’ field of the IPv4 header is updated by adding the AH header length to its original value.
In tunnel mode, the fields of the new IP header are copied from the fields of the original IP header and, as in transport mode, the mutable fields of the new header are replaced by zero. The ‘Protocol’ and ‘Next header’ fields of the new IPv4 header are replaced by ‘51’, as in transport mode, and the ‘Total length’ field of the new IPv4 header is updated by adding the AH header length. This pre-processed IP datagram is used later for the calculation of the HMAC together with the AH header.
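The transport mode pre-processing can be illustrated with a short Python sketch. It assumes a 20-byte IPv4 header without options and takes the mutable fields to be those listed for IPv4 in RFC 4302 (type of service, flags, fragment offset, TTL and header checksum); the function name `preprocess_transport` is hypothetical.

```python
import struct

AH_PROTOCOL = 51
AH_HEADER_BYTES = 12 + 32   # fixed AH fields + 256-bit ICV


def preprocess_transport(header: bytes) -> tuple[bytes, int]:
    """Return (header prepared for HMAC, original 'Protocol' value)."""
    h = bytearray(header)
    original_protocol = h[9]
    h[1] = 0                       # type of service (mutable)
    h[6:8] = b"\x00\x00"           # flags and fragment offset (mutable)
    h[8] = 0                       # TTL (mutable)
    h[10:12] = b"\x00\x00"         # header checksum (mutable)
    h[9] = AH_PROTOCOL             # 'Protocol' now announces the AH header
    total_length = struct.unpack_from("!H", h, 2)[0] + AH_HEADER_BYTES
    struct.pack_into("!H", h, 2, total_length)
    return bytes(h), original_protocol


header = bytes([0x45, 0x10, 0x00, 0x28, 0x12, 0x34, 0x40, 0x00,
                0x40, 0x06, 0xAB, 0xCD, 10, 0, 0, 1, 10, 0, 0, 2])
prepared, proto = preprocess_transport(header)
print(prepared.hex(), proto)
```

The original ‘Protocol’ value is returned because it is needed later for the ‘Next header’ field of the AH header in transport mode.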
6.2 Formation of the AH Header
The AH header formation involves the generation of six fields, which are listed in Table II. The ‘Next header’ field of the AH header depends on the selected mode of operation. In transport mode, this field is set to the value of the ‘Protocol’ field of the original IPv4 header, while in tunnel mode it identifies the encapsulated IP datagram and is therefore set to ‘4’, the protocol number of IPv4. The ‘AH len’ field, i.e. the total length of the AH header, depends on the length of the ICV, which in turn depends on the selected cryptographic hash function. In this work the 256-bit variant of SHA-3 is used to calculate the ICV value. The ‘AH len’ field is expressed in 32-bit words minus 2, as shown in Eq. 9. The subtraction of two arises because AH is basically an extension header of IPv6 [32], whose length is usually measured in 64-bit words (not in 32-bit words). Therefore, for IPv4:
AH len = [AH header (96 bits) + ICV (256 bits)] in 32-bit words - 2    (9)
= [3 (32-bit words for the fixed fields of the AH header) + 8 (32-bit words for the ICV value)] - 2
= ‘9’
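The arithmetic of Eq. 9 can be checked with a few lines of Python. The function name ah_len_field is hypothetical; the 96-bit fixed part and the 256-bit ICV are the values given above.

```python
def ah_len_field(icv_bits: int, fixed_bits: int = 96) -> int:
    """'AH len' value: header length in 32-bit words, minus 2 (Eq. 9)."""
    total_bits = fixed_bits + icv_bits
    if total_bits % 32 != 0:
        raise ValueError("AH header must be a whole number of 32-bit words")
    return total_bits // 32 - 2


# 256-bit ICV from SHA-3: 3 + 8 - 2
print(ah_len_field(256))  # 9
```

The same function gives 4 for a 96-bit ICV, which illustrates how the field value follows the choice of hash output length.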
The next field of the AH header is ‘reserved’, which is set to zero irrespective of the selected mode. The ‘SPI’ field is also set to zero, which indicates that no security association exists. The ‘sequence number’ field is generated by a 32-bit unsigned counter that is incremented by one whenever an AH header is generated, so the ‘sequence number’ field is different for each secured IP datagram. The sequence number of the first secured IP datagram is 1. The last field of the AH header is the ‘Authentication Data’ field, which carries the ICV value; for the HMAC calculation this field is first set to zero. In the next step, the generated AH header is combined with the pre-processed IP datagram.
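The six fields described above can be packed into a 44-byte header as in the following Python sketch. The class name AHHeaderBuilder is hypothetical, and the wrap-around of the counter at 2^32 is an assumption, since the text only states that a 32-bit unsigned counter is used.

```python
import struct

ICV_BYTES = 256 // 8   # 32-byte ICV field, zero during HMAC calculation
AH_LEN = 9             # value derived in Eq. 9


class AHHeaderBuilder:
    """Generates AH headers with an incrementing sequence number."""

    def __init__(self) -> None:
        self._sequence = 0

    def build(self, next_header: int) -> bytes:
        # First generated header carries sequence number 1.
        # Assumption: the counter wraps modulo 2**32.
        self._sequence = (self._sequence + 1) & 0xFFFFFFFF
        fixed = struct.pack(
            "!BBHII",
            next_header,     # original 'Protocol' (transport) or 4 (tunnel)
            AH_LEN,          # 'AH len'
            0,               # reserved
            0,               # SPI = 0: no security association
            self._sequence,  # sequence number
        )
        return fixed + bytes(ICV_BYTES)


builder = AHHeaderBuilder()
tunnel_header = builder.build(next_header=4)
print(len(tunnel_header))  # 44
```

The fixed part occupies 12 bytes (96 bits) and the zeroed ICV 32 bytes (256 bits), which agrees with the lengths used in Eq. 9.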
6.3 Insertion of AH header into pre-processed IP datagram
Before computation of the ICV value, the generated AH header is inserted into the pre-processed IP datagram. This insertion scheme depends on the selected mode of operation, as shown in Figs. 4 and 5. This step results in an updated IP datagram that is used as input to the HMAC calculation.
6.4 HMAC calculation
In order to provide data integrity and authenticity, the AH protocol defines a symmetric HMAC that is used to generate the ICV. The biggest advantage of an HMAC is its relatively high speed compared to digital signatures. The AH protocol requires the use of the HMAC construction, which provides a fixed-size authentication tag for arbitrary messages. The ICV value of a ‘message’ is computed as
H(K XOR opad, H(K XOR ipad, message))
where opad and ipad are padding constants, H is the cryptographic hash function and K is the secret key. To make the construction secure, the key length must be equal to or greater than the output length of the employed hash function (H). The implementation of an HMAC wrapper, given a hash function (H), is shown in Fig. 14. Here, ‘message’ is the updated IP datagram after the insertion of the AH header. In this work, a SHA-3 implementation working on a block size of 1088 bits is used. The 1088-bit blocks are applied to SHA-3 one by one until the end of the message. If the message length is not a multiple of 1088 bits, padding is used to make the length of the last block equal to 1088 bits. After all blocks of the message have been applied to the SHA-3 module, the resultant hash output is truncated to 256 bits to obtain the desired length of the ICV value.
Fig. 14: Architecture for HMAC calculation
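The HMAC construction of Section 6.4 can be modelled in software as a cross-check for a hardware implementation. The sketch below uses SHA3-256 from Python's hashlib, whose 136-byte block size corresponds to the 1088-bit block mentioned above. The pad bytes 0x36 and 0x5C and the handling of over-long keys follow the standard HMAC definition (RFC 2104) and are assumptions here, since the text does not state them; the function name hmac_sha3_256 is hypothetical, and the comma in the formula is read as concatenation.

```python
import hashlib
import hmac

BLOCK_BYTES = 1088 // 8  # 136 bytes, the SHA3-256 block size


def hmac_sha3_256(key: bytes, message: bytes) -> bytes:
    """H(K XOR opad, H(K XOR ipad, message)) with H = SHA3-256."""
    # Standard HMAC key preparation: hash long keys, then zero-pad.
    if len(key) > BLOCK_BYTES:
        key = hashlib.sha3_256(key).digest()
    key = key.ljust(BLOCK_BYTES, b"\x00")

    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)

    # hashlib performs the SHA-3 message padding internally.
    inner = hashlib.sha3_256(ipad + message).digest()
    return hashlib.sha3_256(opad + inner).digest()


key = bytes(range(32))              # 256-bit key, as long as the hash output
icv = hmac_sha3_256(key, b"updated IP datagram")
# Cross-check against the standard library implementation.
print(icv == hmac.new(key, b"updated IP datagram", hashlib.sha3_256).digest())
```

Such a software model is mainly useful for producing reference ICV values against which simulation output of a hardware design can be compared.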
Now, the last step of the AH protocol processing, protocol post-processing, updates some fields of the pre-processed IP header and AH header (both headers were already combined in Section 6.3) to prepare the final secure datagram.
6.5 Protocol Post-Processing
Protocol post-processing involves three steps:
(1) The ‘Authentication data’ field of the AH header is set to the calculated ICV value.
(2) The mutable fields of the IP header, which were replaced by zero for the HMAC calculation, are set to their correct values.
(3) The header checksum is re-calculated and the ‘header checksum’ field is updated.
The secure IP datagram is now ready to be forwarded. The proposed AH protocol implementation scheme with the SHA-3 implementation is summarised in Fig. 15.
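Steps (2) and (3) of the post-processing can be sketched in Python as follows; step (1) is a plain copy of the ICV into the AH header and is omitted. The function names and the parameters carrying the saved field values are hypothetical, a 20-byte IPv4 header without options is assumed, and the checksum is the standard IPv4 one's-complement sum over 16-bit words.

```python
import struct


def ipv4_header_checksum(header: bytes) -> int:
    """One's-complement checksum over a header whose checksum field is zero."""
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def finalize_header(header: bytes, dscp_ecn: int, flags_frag: int, ttl: int) -> bytes:
    """Restore the saved mutable fields, then refresh the header checksum."""
    hdr = bytearray(header)
    hdr[1] = dscp_ecn                        # step (2): restore mutable fields
    struct.pack_into("!H", hdr, 6, flags_frag)
    hdr[8] = ttl
    hdr[10:12] = b"\x00\x00"                 # step (3): recompute checksum
    struct.pack_into("!H", hdr, 10, ipv4_header_checksum(bytes(hdr)))
    return bytes(hdr)
```

A header produced this way sums to zero when the checksum function is applied to it again, which is the check a receiver performs.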
Fig. 15: Proposed IPSec-AH protocol implementation scheme
7. Performance Results
The performance results of the AH protocol implementation are presented below. To facilitate comparison with previous implementations of the algorithm, the SHA-3 implementation results are reported separately in Table III. Implementation results of the AH protocol for transport and tunnel mode operations are given in Tables IV and V respectively. The performance results were obtained on the Virtex series of Xilinx FPGAs using the Xilinx ISE 14.2 tool, with Verilog as the HDL. The Virtex series was selected because of its high performance features. The target devices were xc5vtx240t-2ff1759 for Virtex-5 and xc6vcx75t-2ff484 for Virtex-6. The SHA-3 implementation results in Table III show that the proposed implementation technique provides much improved throughput and TPA (Throughput/Area) when compared to previously published work. This is because of the unique two phase implementation approach presented in Section 5. Throughput and TPA are calculated using Eqs. (10) and (11) respectively.
TP = (Block Size) / (T * Nclk)    (10)
where the block size is 1088 bits for the 256-bit variant, T is the time period of the system clock and Nclk is the number of clock cycles required for a valid hash output.
TPA = TP / Area    (11)
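Eqs. (10) and (11) translate directly into code. In the Python sketch below the function names are hypothetical, and the clock period, cycle count and area used in the example call are illustrative placeholders only; they are not measurements from this work.

```python
def throughput_bps(block_bits: int, clock_period_s: float, n_clk: int) -> float:
    """Eq. (10): TP = Block Size / (T * Nclk), in bits per second."""
    if clock_period_s <= 0 or n_clk <= 0:
        raise ValueError("clock period and cycle count must be positive")
    return block_bits / (clock_period_s * n_clk)


def tpa(tp_bps: float, area: float) -> float:
    """Eq. (11): TPA = TP / Area (area in whatever unit the tables use)."""
    if area <= 0:
        raise ValueError("area must be positive")
    return tp_bps / area


# Placeholder inputs for illustration: 5 ns clock, 10 cycles, 1000 area units.
tp = throughput_bps(1088, 5e-9, 10)
print(tp / 1e9, "Gbps;", tpa(tp, 1000) / 1e6, "Mbps per area unit")
```

Because TPA divides by area, two designs with equal throughput are ranked by how few resources they occupy, which is why the measure is used alongside raw throughput.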
The term ‘TPA’ measures the efficiency of the implementation, as it combines both area and speed into a single performance related value. The results presented in Table III were obtained using timing constraints of 2.50 ns and 2.25 ns for Virtex-5 and Virtex-6 respectively. The AH core is designed to support a default length of 576 bytes for an IPv4 datagram. Tables IV and V present the placed and routed results of the core implementation. These results were obtained using a 64-bit wrapper. The results in Tables IV and V show that the proposed design of the IPSec AH core can provide a throughput of approximately 2 Gbps. Here also, throughput (TP) and TPA are calculated using Eqs. (10) and (11) respectively. The complete implementation of the core requires a total of 320 clock cycles. As discussed in Section 4.1, most of the previously published work on implementations of the IPSec protocol used soft-core processors, except for Niu et al. [19]. They presented synthesized results of an AH core implementation for the SHA-1 algorithm on a Virtex-5, which resulted in a frequency of 100 MHz and a utilization of 27832 LUTs. The authors did not provide any information about throughput or TPA. In [20], [21] and [22] the functionality of IPSec is handled with soft-core processors. In [23] a cryptographic design technique is discussed without actual implementation of the IPSec protocol functionality. The work presented here is the first contribution of a hardware implementation of the IPSec AH protocol in which an FPGA is used without a soft-core processor for both transport and tunnel modes of operation.
8. Conclusion
This work describes an FPGA implementation of a reconfigurable IPSec AH core. The presented AH core supports both transport and tunnel modes of operation and is capable of handling IPv4 datagrams. Here, a newly selected cryptographic hash function, SHA-3, is used to generate the HMAC. SHA-3 was implemented using a unique two phase technique, which results in higher throughput compared to previously published work on SHA-3. This SHA-3 implementation is used in the IPSec AH core. To the best of the authors’ knowledge, this is the first work that presents a hardware implementation of the IPSec AH protocol without involving any soft-core processors. Additionally, this work is the first contribution that utilizes the SHA-3 algorithm to handle IPv4 datagrams using transport and tunnel modes of operation. The IPSec ESP protocol will be implemented in the same manner in the future. The proposed reconfigurable AH core can be used to provide security services for IoT applications that require high data throughput speeds, in addition to allowing legacy IPv4 hardware to implement IPSec using BITW without having to replace expensive networking devices.
Acknowledgment
The authors would like to thank the Erasmus Mundus STRoNGTiES (Strengthening Training and Research through Networking and Globalization of Teaching in Engineering Studies) program and the Irish Research Council (IRC-GOIPG/2013/1132) for providing funding that has facilitated the completion of this work. In addition, this work was supported in part by SFI-Science Foundation Ireland under Grant No. SFI/12/RC/2302.
References
[1] “RFC-791”, http://www.ietf.org/rfc/rfc791.txt.
[2] “RFC-1349”, http://www.ietf.org/rfc/rfc1349.txt.
[3] “RFC-2474”, http://www.ietf.org/rfc/rfc2474.txt.
[4] S. Kent and R. Atkinson, “Security architecture for the internet protocol”, IETF network working group, RFC2401, 1998.
[5] “RFC-4302”, https://www.ietf.org/rfc/rfc4302.txt.
[6] “RFC-4303”, https://www.ietf.org/rfc/rfc4303.txt.
[7] A. Ferrante, V. Piuri and J. Owen, “IPSec Hardware Resource Requirements Evaluation”, Next Generation Internet Networks (NGI 2005), April 2005, pp. 240-246. DOI: 10.1109/NGI.2005.1431672.
[8] K. Rahimunnisa, P. Karthigaikumar, Soumiya Rasheed, J. Jayakumar and S. SureshKumar, “FPGA implementation of AES algorithm for high throughput using folded parallel architecture”, Security and Communication Networks, Volume 7, Issue 11, pages 2225–2236, November 2014.
[9] M. Rao, T. Newe and I. Grout, “Secure Hash Algorithm-3 (SHA-3) implementation on Xilinx FPGAs, Suitable for IoT Applications”, 8th International Conference on Sensing Technology (ICST 2014), Liverpool John Moores University, Liverpool, United Kingdom, 2nd-4th September, 2014.
[10] M. Healy, T. Newe and E. Lewis, “Security for Wireless Sensor Networks: A Review”, IEEE Sensors Applications Symposium (IEEE SAS 2009), February 17th–19th, 2009, New Orleans, LA, USA, pp. 80-85. ISBN: 978-1-4244-2787-1.
[11] A. Grau, “How to Build a Safer Internet of Things, Today's IoT is full of security flaws. We must do better”, http://spectrum.ieee.org/telecom/security/how-to-build-a-safer-internet-of-things (Accessed 23rd October 2015).
[12] G. Bertoni, J. Daemen, M. Peeters and G. V. Assche, “The Keccak SHA-3 Submission”, Submission to NIST (Round-3), 2011. http://keccak.noekeon.org/Keccak-submission-3.pdf.
[13] P. Morawiecki and M. Srebrny, “A SAT-based preimage analysis of reduced Keccak hash functions”, Elsevier’s Information Processing Letters, Volume 113, Issues 10–11, May–June 2013, Pages 392–397.
[14] M. Taha and P. Schaumont, “Side-Channel Analysis of MAC-Keccak”, IEEE International Symposium on Hardware-Oriented Security and Trust (HOST), June 2013, Pages 125–130.
[15] National Institute of Standards and Technology (NIST): SHA-3 Winner announcement, http://www.nist.gov/itl/csd/sha-100212.cfm.
[16] X. Wang, X. L. Feng, D. Yu, “Collisions for hash functions MD4, MD5, HAVAL-128 and RIPEMD”, Cryptology ePrint Archive, Report 2004/199, pp. 1–4 (2004), http://eprint.iacr.org/2004/199.
[17] M. Szydlo, “SHA-1 collisions can be found in 2^63 operations”, Crypto Bytes Technical Newsletter (2005).
[18] M. Stevens, “Fast collision attack on MD5”, ePrint-2006-104, pp. 1–13 (2006), http://eprint.iacr.org/2006/104.pdf.
[19] Y. Niu, L. Wu and X. Zhang, “An IPSec Accelerator Design for a 10Gbps In-Line Security Network Processor”, Journal of Computers, Vol. 8, No. 2, pp. 319–325, Feb. 2013.
[20] B. Driessen, T. Güneysu, E. Bilge Kavun, O. Mischke, C. Paar, T. Pöppelmann, “IPSECCO: A Lightweight and Reconfigurable IPSec Core”, International Conference on Reconfigurable Computing and FPGAs (ReConFig), 2012. DOI: 10.1109/ReConFig.2012.6416757.
[21] A. Salman, M. Rogawski and J. Kaps, “Efficient Hardware Accelerator for IPSec based on Partial Reconfiguration on Xilinx FPGAs”, International Conference on Reconfigurable Computing and FPGAs, 2011. DOI: 10.1109/ReConFig.2011.33.
[22] J. Lu and J. Lockwood, “IPSec Implementation on Xilinx Virtex-II Pro FPGA and Its Application”, 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05), 2005. DOI: 10.1109/IPDPS.2005.262.
[23] M. McLoone, J. V. McCanny, “A Single-Chip IPSec Cryptographic Processor”, IEEE Workshop on Signal Processing Systems (SIPS'02), 2002. DOI: 10.1109/SIPS.2002.1049698.
[24] K. Latif, A. Aziz and A. Mahboob, “Look-Up Table Based Implementations of SHA-3 Finalists: JH, Keccak and Skein”, KSII Transactions on Internet and Information Systems, Volume 6, Number 9, September 2012, pp. 2388-2404.
[25] K. Gaj, E. Homsirikamol, M. Rogawski, R. Shahid and M. U. Sharif, “Comprehensive Evaluation of High-Speed and Medium-Speed Implementations of Five SHA-3 Finalists Using Xilinx and Altera FPGAs”, Cryptology ePrint Archive: Report 2012/368, 30 Oct 2012, available at http://eprint.iacr.org/2012/368.
[26] M. Knezevic, K. Kobayashi, J. Ikegami, S. Matsuo, A. Satoh, U. Koc, J. Fan, T. Katashita, T. Sugawara, K. Sakiyama, I. Verbauwhede, K. Ohta, N. Homma and T. Aoki, “Fair and Consistent Hardware Evaluation of Fourteen Round Two SHA-3 Candidates”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 5, pp. 827–840, 2012.
[27] B. Jungk and M. Stöttinger, “Among slow dwarfs and fast giants: A systematic design space exploration of KECCAK”, 8th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), Darmstadt, Germany, July 10-12, 2013. IEEE 2013, ISBN 978-1-4673-6180-4.
[28] B. Baldwin, N. Hanley, M. Hamilton, L. Lu, A. Byrne, M. Neill and W. P. Marnane, “FPGA Implementations of the Round Two SHA-3 Candidates”, 2nd SHA-3 Candidate Conference, pp. 1-18, Aug. 2010.
[29] S. Matsuo, M. Knezevic, P. Schaumont, I. Verbauwhede, A. Satoh, K. Sakiyama and K. Ota, “How Can We Conduct Fair and Consistent Hardware Evaluation for SHA-3 Candidate?”, 2nd SHA-3 Candidate Conference, pp. 1-15, Aug. 2010.
[30] M. Rao, T. Newe and I. Grout, “Efficient High Speed Implementation of Secure Hash Algorithm-3 on Virtex-5 FPGA”, 17th Euromicro Conference on Digital System Design (DSD), pp. 643-646, 27-29 Aug. 2014. DOI: 10.1109/DSD.2014.24.
[31] K. Latif, A. Aziz and A. Mahboob, “Optimal utilization of available reconfigurable hardware resources”, Elsevier’s Computer & Electrical Engineering, Volume 37, Issue 6, November 2011, Pages 1043-1057.
[32] “RFC-2460”, https://www.ietf.org/rfc/rfc2460.txt.
Table I: Acronyms used in this paper
Term     Description
IPSec    Internet Protocol Security
AH       Authentication Header
ESP      Encapsulating Security Payload
SHA-3    Secure Hash Algorithm - 3
HMAC     Hash Message Authentication Code
ICV      Integrity Check Value
BITW     Bump In The Wire
ROT      Bit Rotation operation
RC       Round Constant
VPN      Virtual Private Network
Table II: Authentication Header (AH)
Table III: SHA-3 implementation results
Table IV: AH protocol implementation results (Transport Mode)
Table V: AH protocol implementation results (Tunnel Mode)