Hardware-Accelerated Twofish Core for FPGA

David Smekal, Jan Hajny and Zdenek Martinasek
Department of Telecommunications
Brno University of Technology
Brno, Czech Republic
Email: [email protected]

Abstract—This article describes a hardware-accelerated implementation of the Twofish encryption algorithm on Field Programmable Gate Array (FPGA) network cards. The encryption core was implemented on a Virtex-7 network card to achieve real-time encryption and decryption. The algorithm was implemented for 128-bit blocks and 128-bit keys. The article demonstrates that the Twofish encryption core can operate at a maximum clock frequency of 315 MHz and achieves a throughput of 40.32 Gbps, which is faster than most currently implemented systems.

Keywords—Twofish; Encryption; Decryption; Hardware-Accelerated; FPGA; Component; VHDL; Core; Virtex-7

I. INTRODUCTION

This article describes the design and implementation of an encryption algorithm for use on a hardware device. The Twofish algorithm, one of the finalists of the AES competition, was chosen as an alternative to the current standard. Twofish did not win the competition, but many encryption systems offer it in their solutions. It still provides a high level of security and has not been broken or shown to be vulnerable, unlike some older encryption algorithms.

The design is aimed at hardware implementation on a real network card. For the final implementation, we selected the FPGA network card NFB-100G2Q. The card has an FPGA chip that enables powerful hardware acceleration; therefore, VHDL was chosen to describe the whole architecture.

Many researchers are currently trying to achieve high speeds through optimization. However, few of these designs can work on a real device, either because they rely on very high chip frequencies or because they consume all the logic of the chip. The authors of [1] and [2] deal with performance analysis of Twofish encryption. Hardware implementations were presented in [3], [4], [5] and [6], but the encryption rate is either too low (12.1 – 16.8 Gbps) or unrealistic because of the high frequency required. Many articles deal with the implementation of the AES cipher on FPGA rather than with Twofish. Our implementation of the AES encryption algorithm was described in [7].

The article is divided into the following sections: Section II deals with the algorithm and its functionality in general and describes its individual parts. Section III describes the actual implementation for use on an FPGA chip. Section IV evaluates the results of the simulation, synthesis and final implementation of the whole system design.

Research described in this paper was financed by the Ministry of Interior program “Program bezpecnostniho vyzkumu Ceske republiky 2015-2020” under grant VI20162018036. For the research, infrastructure of the SIX centre was used.

II. TWOFISH ALGORITHM

The Twofish [8] cipher was created as one of the AES contest designs. The algorithm met the specified criteria: it is a symmetric block cipher with a block size of 128 bits, it has a variable key length of 128, 192 or 256 bits, and it can be implemented on different platforms. The Twofish cipher uses a 16-round Feistel network built around the function F. The input of the cipher is 128 bits of plaintext; when decrypting, the input is ciphertext.

The main part of the Twofish algorithm is the F function, which is part of every round of both encryption and decryption. The F function consists of other functions: G, MDS, S-box and PHT. The schema of the F function is depicted in Fig. 1. We characterize it as a key-dependent permutation of 64-bit values. Vectors $R_0$ and $R_1$ enter it together with the round number $r$. It consists of two parallel G functions, the PHT function and the addition of the expanded keys $K$. Vector $R_0$ enters the G function and creates vector $T_0$. Vector $R_1$ is first rotated to the left by 8 bits, then enters the G function and creates vector $T_1$. Both vectors $T_0$ and $T_1$ enter the PHT function. Finally, they are combined with the expanded keys by addition modulo $2^{32}$, and the output is the pair of vectors $F_0$ and $F_1$:

\[
\begin{aligned}
T_0 &= G(R_0) \\
T_1 &= G(\mathrm{ROL}(R_1, 8)) \\
F_0 &= (T_0 + T_1 + K_{2r+8}) \bmod 2^{32} \\
F_1 &= (T_0 + 2T_1 + K_{2r+9}) \bmod 2^{32}
\end{aligned}
\tag{1}
\]
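Equation (1) can be checked with a small software model. The following is a minimal Python sketch, not the VHDL core described in this article. The names `rol32`, `f_function` and `round_keys` are hypothetical, and the G function is passed in as the parameter `g`, so any stand-in can be used while the real S-box and MDS logic is developed separately.

```python
MASK32 = 0xFFFFFFFF  # all arithmetic in equation (1) is modulo 2^32


def rol32(x, n):
    """Rotate a 32-bit word left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def f_function(r0, r1, rnd, round_keys, g):
    """Model of equation (1).

    r0, r1     -- 32-bit input words R0 and R1
    rnd        -- round number r
    round_keys -- list of expanded keys K (assumed to hold at least 2r+10 words)
    g          -- stand-in for the G function (assumption: any 32-bit -> 32-bit map)
    """
    t0 = g(r0)
    t1 = g(rol32(r1, 8))
    # PHT followed by addition of the expanded keys, all modulo 2^32
    f0 = (t0 + t1 + round_keys[2 * rnd + 8]) & MASK32
    f1 = (t0 + 2 * t1 + round_keys[2 * rnd + 9]) & MASK32
    return f0, f1
```

With the identity function as a stand-in for G and all-zero keys, the inputs `r0 = 1` and `r1 = 0x01000000` give `T0 = T1 = 1`, which makes the PHT step easy to follow by hand.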

The G function consists of four S-boxes followed by the MDS function; its output is a 32-bit word. These 32-bit words then enter the PHT function. The remaining blocks of data are XORed with the key-dependent outputs of the F function and rotated. At the end of each round, the left and right sides of the network are swapped. In the last round, the sides are not swapped and only the last key is added. In our VHDL implementation the bit rotation uses the built-in ROL and ROR functions. We also used a conversion of the bit vector into the std_logic_vector type. The input of this function is the input S f port.
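The data flow of one round (XOR with the F outputs, one-bit rotations and the swap of the two halves, as given in equations (8) and (12)) can be modelled in a few lines. This is a hedged Python sketch: `rol32`, `ror32`, `encrypt_round`, `decrypt_round` and `toy_f` are hypothetical names, and `toy_f` is only a placeholder for the real F function, chosen to show that the decryption round undoes the encryption round for any F.

```python
MASK32 = 0xFFFFFFFF


def rol32(x, n):
    """Rotate a 32-bit word left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def ror32(x, n):
    """Rotate a 32-bit word right by n bits."""
    return rol32(x, (32 - n) % 32)


def encrypt_round(state, rnd, f):
    """One encryption round on a state (R0, R1, R2, R3), following equation (8)."""
    r0, r1, r2, r3 = state
    f0, f1 = f(r0, r1, rnd)
    return (ror32(r2 ^ f0, 1), rol32(r3, 1) ^ f1, r0, r1)


def decrypt_round(state, rnd, f):
    """One decryption round on a state (R0, R1, R2, R3), following equation (12)."""
    r0, r1, r2, r3 = state
    f0, f1 = f(r0, r1, rnd)
    return (rol32(r2, 1) ^ f0, ror32(r3 ^ f1, 1), r0, r1)


def swap_halves(state):
    """Exchange the left and right 64-bit halves of the state."""
    return (state[2], state[3], state[0], state[1])


def toy_f(r0, r1, rnd):
    """Placeholder for F (assumption): any function of (R0, R1, r) works here."""
    return ((r0 + r1 + rnd) & MASK32, (r0 ^ r1 ^ 0x9E3779B9) & MASK32)
```

Because the round only XORs and rotates the right half with values derived from the left half, F itself never has to be inverted; this is why the same F function component can serve both the encryption and the decryption core.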

Fig. 1. Scheme of F function

The expanded keys are not generated in advance; instead, each round (F function) generates the keys it needs. The key expansion starts from the main key, from which three subkeys are derived: $M_e$, $M_o$ and $S$. These are then used in whitening, in the rounds and in the S-boxes. For a key of length $N$ bits we define $k = N/64$; for a 128-bit key, $k = 2$. The key $M$ consists of $8k$ bytes $m$. At the beginning of the algorithm, the key is converted into $2k$ words of 32 bits each:

\[
M_i = \sum_{j=0}^{3} m_{(4i+j)} \cdot 2^{8j}, \qquad i = 0, \dots, 2k-1.
\tag{2}
\]

In the next step, the words are split into the subvectors $M_e$ and $M_o$:

\[
\begin{aligned}
M_e &= (M_0, M_2, \dots, M_{2k-2}) \\
M_o &= (M_1, M_3, \dots, M_{2k-1}).
\end{aligned}
\tag{3}
\]

The third subkey $S$ is derived in the following way (4):

\[
\begin{pmatrix} s_{i,0} \\ s_{i,1} \\ s_{i,2} \\ s_{i,3} \end{pmatrix}
= RS \cdot
\begin{pmatrix} m_{8i} \\ m_{8i+1} \\ m_{8i+2} \\ m_{8i+3} \\ m_{8i+4} \\ m_{8i+5} \\ m_{8i+6} \\ m_{8i+7} \end{pmatrix},
\]

\[
S_i = \sum_{j=0}^{3} s_{i,j} \cdot 2^{8j}, \qquad i = 0, \dots, k-1,
\]

\[
S = (S_{k-1}, S_{k-2}, \dots, S_0),
\]

where the RS matrix is:

\[
RS =
\begin{pmatrix}
\mathtt{01} & \mathtt{A4} & \mathtt{55} & \mathtt{87} & \mathtt{5A} & \mathtt{58} & \mathtt{DB} & \mathtt{9E} \\
\mathtt{A4} & \mathtt{56} & \mathtt{82} & \mathtt{F3} & \mathtt{1E} & \mathtt{C6} & \mathtt{68} & \mathtt{E5} \\
\mathtt{02} & \mathtt{A1} & \mathtt{FC} & \mathtt{C1} & \mathtt{47} & \mathtt{AE} & \mathtt{3D} & \mathtt{19} \\
\mathtt{A4} & \mathtt{55} & \mathtt{87} & \mathtt{5A} & \mathtt{58} & \mathtt{DB} & \mathtt{9E} & \mathtt{03}
\end{pmatrix}.
\tag{4}
\]

Vectors $M_e$, $M_o$ and $S$ are the basis for key expansion. In the expansion, the keys $K_{2r}$ and $K_{2r+1}$ are generated for each round, where $r$ is the round number.

The expanded keys are defined according to the following algorithm (5):

\[
\begin{aligned}
p &= 2^{24} + 2^{16} + 2^{8} + 2^{0} \\
A_i &= h(2ip, M_e) \\
B_i &= \mathrm{ROL}(h((2i+1)p, M_o), 8) \\
K_{2i} &= (A_i + B_i) \bmod 2^{32} \\
K_{2i+1} &= \mathrm{ROL}((A_i + 2B_i) \bmod 2^{32}, 9),
\end{aligned}
\tag{5}
\]

where $p$ is the constant that replicates the key index into every byte of the 32-bit word, $A_i$ and $B_i$ are outputs of the function $h$, and $M_e$ and $M_o$ are subvectors of the cipher key.

A. Encryption

Fig. 2. Schema of Encryption Algorithm

The basic schema of the encryption algorithm is depicted in Fig. 2. The encryption consists of the following steps:

1) The input of the cipher is 128 bits of plaintext and 128 bits of cipher key. Before entering the rounds, the plaintext bytes $p_0, \dots, p_{15}$ are divided into four 32-bit blocks $P_0, P_1, P_2, P_3$:

\[
P_i = \sum_{j=0}^{3} p_{4i+j} \cdot 2^{8j}, \qquad i = 0, \dots, 3.
\tag{6}
\]

2) The first operation is input whitening with the expanded keys $K_0, \dots, K_3$ according to formula (7):

\[
R_{0,i} = P_i \oplus K_i, \qquad i = 0, \dots, 3.
\tag{7}
\]

3) Subsequently, there are 16 rounds. In each round, the two left 32-bit vectors $R_{r,0}$ and $R_{r,1}$ enter the G functions in parallel. The second vector is first rotated to the left by 8 bits before entering this function.

\[
\begin{aligned}
(F_{r,0}, F_{r,1}) &= F(R_{r,0}, R_{r,1}, r), \\
R_{r+1,0} &= \mathrm{ROR}(R_{r,2} \oplus F_{r,0}, 1), \\
R_{r+1,1} &= \mathrm{ROL}(R_{r,3}, 1) \oplus F_{r,1}, \\
R_{r+1,2} &= R_{r,0}, \\
R_{r+1,3} &= R_{r,1},
\end{aligned}
\tag{8}
\]

where ROR is bit rotation to the right and ROL is bit rotation to the left.

4) The output whitening applies the expanded keys $K_4, \dots, K_7$ in the last step (9):

\[
C_i = R_{16,(i+2) \bmod 4} \oplus K_{i+4}, \qquad i = 0, \dots, 3.
\tag{9}
\]

5) When all the individual 32-bit blocks $C_0, C_1, C_2, C_3$ are concatenated, we obtain a 128-bit data block called the ciphertext.

B. Decryption

When decrypting, we use the main key and a block size of 128 bits. As with encryption, decryption uses 16 rounds of the pseudo-Feistel structure with the F function. The procedure is almost identical; only the order of the keys in the rounds changes. The order of the whitening keys also changes, as does the order of the bit rotation and XOR operations applied to the output of the F function. The basic schema of the decryption algorithm is shown in Fig. 3. The decryption consists of the following steps:

1) The ciphertext is composed of the 16 bytes $c_0, \dots, c_{15}$, which form four subvectors $C_0, \dots, C_3$ as follows (10):

\[
C_i = \sum_{j=0}^{3} c_{4i+j} \cdot 2^{8j}, \qquad i = 0, \dots, 3.
\tag{10}
\]

2) Input whitening with the keys $K_4, \dots, K_7$:

\[
R_{0,i} = C_i \oplus K_{i+4}, \qquad i = 0, \dots, 3.
\tag{11}
\]

3) The main part of the algorithm consists of sixteen rounds, performed in a similar way to encryption:

\[
\begin{aligned}
(F_{r,0}, F_{r,1}) &= F(R_{r,0}, R_{r,1}, r) \\
R_{r+1,0} &= \mathrm{ROL}(R_{r,2}, 1) \oplus F_{r,0} \\
R_{r+1,1} &= \mathrm{ROR}(R_{r,3} \oplus F_{r,1}, 1) \\
R_{r+1,2} &= R_{r,0} \\
R_{r+1,3} &= R_{r,1}
\end{aligned}
\tag{12}
\]

4) After the sixteenth round, the whitening operation adds the last keys $K_0, \dots, K_3$ to the output according to (13):

\[
P_i = R_{16,(i+2) \bmod 4} \oplus K_i, \qquad i = 0, \dots, 3.
\tag{13}
\]

The individual bytes of the whole data block can be written as $p_0, \dots, p_{15}$, where the bytes are created as follows (14):

\[
p_i = \left\lfloor \frac{P_{\lfloor i/4 \rfloor}}{2^{8(i \bmod 4)}} \right\rfloor \bmod 2^{8}, \qquad i = 0, \dots, 15.
\tag{14}
\]

In the case of decryption, we obtain the plaintext.

Fig. 3. Scheme of Decryption Algorithm

III. IMPLEMENTATION

As already mentioned, for the practical implementation we used the network card NFB-100G2Q [9]. The board is equipped with a powerful Xilinx Virtex-7 HT FPGA chip and 2× QSFP28 cages supporting 1×100G / 2×40G Ethernet ports. The device is connected via PCI Express Gen 3 x16 (128 Gbps). The chip has 580 480 logic cells, 362 800 Look-Up Tables (LUTs) and 725 600 Flip-Flop (FF) registers. The network card NFB-100G2Q is shown in Fig. 4.

Fig. 4. NFB-100G2Q [9]

For our purposes, we used the Netcope Development Kit (NDK) [9] as a development platform to simplify the implementation. The NDK is a toolset for the development of hardware-accelerated network applications. The platform implements the input and output blocks (packet receiving and transmitting) and the interconnection system, including communication over the network (Ethernet), the PCI Express interface and the interface to memories. Working with the NDK consists of modifying the prepared application module according to the requirements of the current application (here, the encryption system).

IV. TESTS AND RESULTS

This section deals with functionality verification using simulation, with synthesis of the design to obtain the resource usage on the Virtex chip, and with implementation of the final architecture on a network card. We implemented six individual components in the application core: the encryption core, decryption core, key expansion component, F function component, G function component and control component. The final implementation was realized in VHDL. Functionality was verified using a simulator tool. Our implementation of Twofish was verified with the test vectors from [8]. These tests check whether the Twofish core module gives the valid encrypted or decrypted output for the given valid input data. Vivado was used to synthesize the final firmware of the whole system application. The results for the Twofish core and the NDK platform are shown in Table I.

TABLE I. UTILIZATION DESIGN – COMPONENT

Component    Frequency (MHz)    LUT [-]    FF [-]    Throughput (Gbps)
Twofish      315                60 212     2 945     40.32
NDK [9]      245                120 674    85 437    31.36
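The throughput column of Table I follows from the clock frequency and the block width: throughput equals frequency times block size in bits. The short Python sketch below reproduces the figures. It assumes that one 128-bit block is processed per clock cycle, which is what this relation implies; the function name `throughput_gbps` is hypothetical.

```python
def throughput_gbps(freq_mhz, block_bits=128):
    """Throughput in Gbps for a core that accepts one block per clock cycle.

    freq_mhz   -- clock frequency in MHz
    block_bits -- block size in bits (128 for Twofish)
    """
    # MHz * bits = Mbit/s; divide by 1000 to obtain Gbit/s
    return freq_mhz * block_bits / 1000


if __name__ == "__main__":
    for name, freq in (("Twofish", 315), ("NDK", 245)):
        print(name, throughput_gbps(freq), "Gbps")
```

For the frequencies in Table I this gives 40.32 Gbps at 315 MHz and 31.36 Gbps at 245 MHz.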

Frequency denotes the maximum frequency at which the unit is able to run. LUT indicates the number of Look-Up Tables and FF denotes the number of Flip-Flop registers on the chip. Since the encryption core can work at a maximum frequency of 315 MHz, the maximum encryption throughput is 40.32 Gbps according to the formula t = f · Nb, where t is the throughput, f is the frequency and Nb is the size of the data block in bits (in our case 128 bits).

The primary goal of this project was to design an architecture and implement an encryption core that can encrypt data at a maximum speed of about 30 – 50 Gbps on real FPGA hardware using a Virtex chip and a real 240 MHz frequency. The chosen frequency of 240 MHz is the maximum with respect to the speed of each component of the NDK [9]. At the real frequency of 240 MHz, the real maximum throughput is 30.72 Gbps.

Table II summarizes the hardware utilization of all components implemented on the Virtex-7. The proposed design containing the Twofish core and the NDK platform takes 45 % of the available logic LUT and 6 % of the memory LUT resources. It also uses 13 % of the FF registers and 28 % of the available RAM.

TABLE II. HARDWARE UTILIZATION

Component        LUT as Logics        LUT as Memory    Register as Flip Flop    Memory
Available        362 800              362 800          725 600                  940
G function       1 971 (0.54 %)
F function       416 (0.11 %)
control          1 616 (0.45 %)
key expansion    1 115 (0.31 %)
encryption       30 351 (8.37 %)
decryption       29 414 (8.11 %)
NDK [9]          99 713 (27.48 %)
Total            164 596 (45.37 %)

V. CONCLUSION

From a security point of view, the Twofish cipher is a suitable alternative to AES, currently the most common standard. The security of the cipher underwent a detailed analysis, which did not find any problems that would prevent its implementation. Therefore, the basic design of the algorithm was created and then verified using simulation. After a minor optimization, the design was implemented and subsequently tested. For detailed results see Section IV, from which we can highlight the clock frequency and the throughput that the cipher achieves on a real device. The designed cryptographic core can be used on any device that supports hardware acceleration described in VHDL. In our case, the core is included in existing firmware and tested in a test network.

REFERENCES