FPGA-Based 40.9-Gbits/s Masked AES With Area Optimization for ...

11 downloads 4564 Views 907KB Size Report
Area Optimization for Storage Area Network. Yi Wang and Yajun Ha, Senior ... vanced encryption standard (AES) engine is proposed. However, this engine ...
36

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 60, NO. 1, JANUARY 2013

FPGA-Based 40.9-Gbits/s Masked AES With Area Optimization for Storage Area Network Yi Wang and Yajun Ha, Senior Member, IEEE

Abstract—In order to protect “data-at-rest” in storage area networks from the risk of differential power analysis attacks without degrading performance, a high-throughput masked advanced encryption standard (AES) engine is proposed. However, this engine usually adopts the unrolling technique which requires extremely large field programmable gate array (FPGA) resources. In this brief, we aim to optimize the area for a masked AES with an unrolled structure. We achieve this by mapping its operations from GF (28 ) to GF (24 ) as much as possible. We reduce the number of mapping [GF (28 ) to GF (24 )] and inverse mapping [GF (24 ) to GF (28 )] operations of the masked SubBytes step from ten to one. In order to be compatible, the masked MixColumns, masked AddRoundKey, and masked ShiftRows including the redundant masking values are carried over GF (24 ). We also use FPGA block RAM (BRAM) to further reduce hardware resources. Compared with a state-of-the-art design, our implementation reduces the overall area by 36.2% (20.5% is contributed by the main method, and 15.7% is contributed by the BRAM optimization). It achieves 40.9-Gbits/s at 4.5-Mbits/s/slice on the Xilinx XC6VLX240T platform. We have attacked the iterative version of this masked AES in hardware. Results show that none of the bytes can be guessed from the masked AES with the collected 10 000 power traces, but 14 out of 16 bytes can be guessed from the unprotected AES with the same number of traces. Index Terms—Advanced encryption standard (AES), differential power analysis (DPA), field programmable gate array (FPGA), masking, storage area network (SAN).

I. I NTRODUCTION

T

HE SENSITIVE data stored in storage area network (SAN) are under the risk of information leakage in embedded applications. These applications need not only the protection at both the protocol level and the physical level but also high-throughput implementations. For example, for a SAN application, it needs up to 40-Gbits/s throughput for a fourport host bus adapter connected by fiber cables. The information leakage includes power consumption, timing, and fault detection. In 1999, Kocher et al. first broke the normal advanced encryption standard (AES) [1] by means of power analysis attacks [2]. Later, the differential power analysis (DPA) attack was further developed as one of the most promising power analysis attacks. From then on, numerous efforts have been devoted Manuscript received September 15, 2012; accepted November 7, 2012. Date of current version March 11, 2013. This work was supported by the Agency for Science Technology and Research, Singapore, under Project 1122804006. This brief was recommended by Associate Editor T. Zhang. The authors are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576 (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this brief are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSII.2012.2234891

to the development of efficient countermeasures for the AES implementations against DPA attacks. Two representatives are the multiplicative masking and the Boolean masking. They both try to remove the correlation between the power consumption and the secret keys. The multiplicative masking can be realized by using either standard CMOS cells at the gate level (which has been proved to be insecure in terms of glitch attacks) or nonstandard CMOS cells (which has been proved to be DPA resistant and glitch free but requires a semiautomatic design flow). On the other hand, the Boolean masking can be easily realized at the algorithmic level and is immune to DPA and glitch attacks. The Boolean masking has the advantage of easy implementation because it does not need extra specific hardware as in [3] and [4]. The Boolean masking is a good candidate to be applied to the AES in SANs, but if we directly apply it to the AES, one masked AES’s S-box over GF (28 ) with two 8-bit input and output masks needs to store 28 × 28 × 256 bytes (16.8 Mbytes). Therefore, for a whole 128-bit masked AES with an unrolled architecture, it needs to store around 2952.8 Mbytes. This is too big to be fit into any field programmable gate array (FPGA). To have a feasible FPGA implementation, one possible way is to transform the S-box computation of a masked AES from GF (28 ) to GF (24 ). In this brief, we develop techniques to optimize the area of a masked AES for SANs. We perform the masked AES mainly over GF (24 ), and the related operations like the masked MixColumns, masked AddRoundKey, and masked ShiftRows including redundant masking values are all calculated over GF (24 ). Therefore, we only need to transform the input values from GF (28 ) to GF (24 ) and transform the output values back from GF (24 ) to GF (28 ) once which reduces around 20.5% hardware resources. In addition, we map half resources of the masked S-box onto block RAM (BRAM) which reduces 15.7% hardware resources. We insert pipelined registers into the round calculation and its masked S-box calculation in order to meet the requirement of high throughput for SANs. We show that the area-optimized design can be fit into Xilinx Virtex-6 platform, which provides a feasible alternative protection of the AES on reconfigurable devices. We also perform DPA attack to the iterative version of the masked AES, and we prove that no correct bytes of the last round key can be guessed from our masked design. The rest of this brief is organized as follows. Section II presents the existing works. Section III proposes the optimized AES with an unrolled architecture, presents the detailed design methodologies about the masked S-box and masked MixColumns, and further optimizes the proposed design by

1549-7747/$31.00 © 2013 IEEE

WANG AND HA: FPGA-BASED 40.9-Gbits/s MASKED AES WITH AREA OPTIMIZATION FOR SAN

inserting pipelined registers. Section IV shows the experimental results. Section V draws the conclusion. II. P REVIOUS W ORK Data transformations in a SAN system usually need real-time high-throughput processing regardless of area overheads. In addition, the security issues of “data-at-rest” in a SAN system require an add-in masking to be DPA and glitch attack resistant. Most existing works only concern high throughputs but not the ability to defend DPA and glitch attacks. Gaj and Chodowiec [5] proposed a pipelined structure for the AES on Virtex XCV1000 FPGA and achieved 12 Gbits/s. Standaert et al. [6] presented the design tradeoff for the further optimization of the AES implementation on FPGA platforms. Unrolling, tiling, and pipelining structures for the AES were discussed in [7]. McLoone and McCanny’s method achieved a throughput of 12 Gbits/s using lookup table (LUT)-based SubBytes [8]. Another approach [9] aimed at the on-the-fly generation of SubBytes was first proposed by Rijmen, one of the creators of the AES. Hodjat and Verbauwhede presented a fully pipelined SubBytes architecture achieving a throughput of 21.54 Gbits/s [10]. However, all the aforementioned methods are vulnerable to DPA and glitch attacks. Mangard et al. [11] successfully broke the AES by using the DPA attack at the algorithmic level. Oswald et al. proposed a masked SubBytes over GF (24 ) at the algorithm level, but they only focused on software implementation [12]. Higher order masking schemes [13], [14] have been proposed. They are based on software implementations of the masked AES. The countermeasures used in the work of Goli´c [15] and Canright and Batina [16] can be attacked successfully by the glitch attack [17] at the gate level. To the best of our knowledge, no previous work has been done on the high-throughput masked AES that has the ability to defend against DPA and glitch attacks. This is due to the required huge hardware area when applying masking to AES at the algorithm level. Most existing works are inefficient to implement masked AES with an unrolled architecture on FPGAs. III. P ROPOSED M ASKED AES FOR U NROLLED S TRUCTURE In the Boolean masking implementation, the intermediate value x is concealed by exclusive-ORing it with the random mask m. In the round function of the AES, ShiftRows, MixColumns, and AddRoundKey are linear transformations, while SubBytes is the only nonlinear transformation of the AES. We define the linear transformations as Oper; then, the masked Oper can be written as Oper(x ⊕ m) = Oper(x) ⊕ Oper(m). However, the masked nonlinear transformation SubBytes has the characteristic as S-box(x ⊕ m) = S-box(x) ⊕ S-box(m). In order to mask the nonlinear transformation, a new S-box, denoted as S-box , is recomputed as S-box (x ⊕ m) = S-box(x) ⊕ m , where m and m are the input and output masks of SubBytes. To mask a 128-bit AES, it usually needs 6-byte random values. These 6 values are defined as m, m , m1 , m2 , m3 , and m4 . For simplicity, we define m1234 = {m1 , m2 , m3 , m4 } as

37

the mask for one 32-bit MixColumns transformation, and it also holds that m1234 = MixColumns(m1234 ). The field GF (28 ) is an extension of the field GF (24 ), over which to perform a modular reduction needs an irreducible polynomial of degree 2, x2 + {1}x + {e}, and another irreducible polynomial of degree 4, x4 + x + 1. In order to reduce the hardware resources, we calculate the masked AES engine mainly over GF (24 ). Fig. 1(a) shows the proposed masked AES, which moves the mapping and inverse mapping outside the AES’s round functions. The plaintext and the masking values are mapped once from GF (28 ) to GF (24 ), and all the intermediate operations are computed over GF (24 ). Finally, the ciphertext is mapped back from GF (24 ) to the original field GF (28 ). In this brief, all the masking values need to be mapped from GF (28 ) to GF (24 ), and we denote m84 = map(m), m84 = map(m ), m1234,84 = map(m1234 ), and m1234,84 = map(m1234 ). The adjustment of the masked SubBytes and masked MixColumns is discussed in the following, and the masked ShiftRows and masked AddRoundKey remain the same. A. Optimized Masked S-Box Over GF (24 ) In order to move the mapping and inverse mapping outside AES’s round operation, we exchange the computational sequence of masked affine and inverse mapping functions within masked S-box. The masked affine function needs to be adjusted with new scaling factors. In Fig. 1(b), map operation is the mapping transformation of 8 × 8 matrix, and map−1 is constructed by the inverse map operation. We denote that the input values of the map function are (z + m) and m, and the output values of the map function are (z + m) and m , where {(z + m), m} ∈ GF (28 ) and {(z + m) , m } ∈ GF (24 ). It holds that (z + m + m) = map(z + m + m)

(1)

where (z + m) = {a∗h + mh , a∗l + ml } and m = {mh , ml }. As discussed before, maffine and maffine are needed for scaling the output values and the output masking values. The following steps introduce the procedure to obtain the scaling values. The normal affine function (Ax + b) can be applied to the left and the right sides of (1) as A(z + m + m) + b = Amap−1 (z + m + m) + b.

(2)

When mapping Equation (2) from GF (28 ) to GF (24 ), we can get map (A(z + m + m) + b)  (3) = map Amap−1 (z + m + m) + b map (A(z + m) + b) + mapAm = mapAmap−1 (z + m) + mapb + mapAmap−1 m . (4) Therefore, we deduce that maffine = mapAmap−1 + mapb and maffine = mapAmap−1 . Fig. 2 shows the new affine function when transforming affine function from GF (28 ) to GF (24 ). The four tables in Fig. 1(b) remain the same in our previous work [18]. These four tables are the following: 1) Td1 : ((x + m), m) → x2 × e + m; 2) Td2 : ((x + m), (y + m )) → ((x + m) + (y + m )) × (y + m ); 3) Tdm : ((x + m), (y +

38

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 60, NO. 1, JANUARY 2013

Fig. 1. Proposed masked AES with an unrolled architecture. (a) Masked AES. (b) Masked S-box.

Fig. 2. Affine transform function from GF (28 ) to GF (24 ).  m )) → (x + m) × (y + m ); and 4) Tinv : ((x + m), m) → Tinv (x) + m.

B. Proposed Masked MixColumns Over GF (24 ) Masked MixColumns can be scaled to adjust the operations over GF (24 ), and it needs to deduce the scaling factor of a modular multiplication with the fixed coefficients 0X02 and 0X03. If S is 1 byte of MixColumns, it holds that S = map(Sh , Sl ) ∼ = Sh x + Sl , where S ∈ GF (28 ) and Sh , Sl ∈ 4 GF (2 ). Therefore, scaling factors 2x + 6 and 2x + 7 of S equal to (4Sh + 2Sl )x + (f Sh + 6Sl ) and (5Sh + 2Sl )x + (f Sh + 7Sl ). Fig. 3 shows the scaling computation for the masked MixColumns. C. Optimization for Proposed Architecture Usually, throughputs can be significantly improved by inserting pipeline registers for latency careless designs. For each

Fig. 3.

Scaling computation of the masked MixColumns.

masked AES’s round, we insert six-stage pipelines to enhance the throughputs. We insert three pipelines to each round of the masked AES, called outer three pipelines, as shown in Fig. 1(a). The pipeline registers are inserted at the output of each transformation. We insert three pipelines to the masked S-box, called inner three pipelines, as shown in Fig. 1(b). Note that the maximum pipelined stages for our proposed design is six. In order to be compatible with the encryption procedure, we also insert six-stage pipelines to the key expansion in order not to affect the critical path of the main encryption. IV. R ESULTS In this section, we have implemented the proposed design with a very high speed integrated circuit hardware description

WANG AND HA: FPGA-BASED 40.9-Gbits/s MASKED AES WITH AREA OPTIMIZATION FOR SAN

39

TABLE I R ESULT C OMPARISONS OF THE P ROPOSED M ASKED S-B OX W ITH OTHER W ORKS

TABLE II R ESULTS OF THE M ASKED AES E NGINE W ITH AN U NROLLED A RCHITECTURE ON XC6VLX240T

language, synthesized our design using Xilinx ISE 13.3, and ported the design to a Virtex-6 XC6VLX240T platform. Table I shows the comparisons between the proposed masked S-box and the existing masking methods. The multiplicative masking in [19] and [20] had been proved to be vulnerable to glitch attacks. Oswald’s masking scheme was a safer and smaller solution at algorithmic level [12]. When moving the mapping and the inverse mapping outside the AES’s round function, our proposed masked S-box is 200 gates smaller than theirs. The masked S-box is only 2.2 times or 1.89 times larger than the unprotected baseline design on an FPGA or an ASIC platform, respectively. There have been some existing works on the unprotected AES for high-throughput applications. The work of Mathew et al. was the fastest and smallest AES implementation on 45-nm standard cell CMOS library [21]. The most optimized design was the work of Standaert et al., in which they achieved the best throughput/slices(4.2) on FPGA among the existing designs [6]. To our best knowledge, there have been no other works on the high-throughput masked AES on FPGA platforms. Table II shows our experimental results of the proposed masked AES on a Virtex-6 FPGA platform. In order to have a fair comparison, we also implement the masked AES design using Oswald’s masked S-box [12] with nonpipelined and sixstage pipelined unrolled structures. The area-optimized design with the pipelined structure proposed in this brief achieves 36.2% area reduction compared with Oswald’s pipelined design. It occupies just 53% larger area than that of the unprotected baseline design with the pipelined structure. Furthermore, in order to compare the capability of defending DPA attacks by the unprotected AES and the masked AES,

we set up an evaluation platform using Agilent MSO6104A oscilloscope and SASEBO-GII (Virtex-5 XCV5VLX30). Both the unprotected AES and the masked AES are implemented with the iterative architecture running at 2 MHz due to the resource limitation of SASEBO-GII. We construct a power model in order to retrieve the last round’s key. Suppose that a secret key for the AES encryption is K = 0x0102030405060708090A0B0C0D0E0F and the last round’s key is k10 = 0x13111D7FE3944A17F307A78B4D2B30C5. Fig. 4(a) shows a successful DPA attack at the first byte (0x13) of k10 of the unprotected AES. The correlation curve of the first byte value equaling to 0x13 is highlighted by black line, and it has the highest correlation than other values (not equaling to 0x13). On the other hand, Fig. 4(b) shows an unsuccessful DPA attack to the masked AES, where the black line (0x13) is covered by the other incorrect values’ correlations. Consequently, we can obtain all guessed 16 bytes of the last round’s key for the unprotected AES and the masked AES as listed in Table III. We can successfully achieve 14 correct bytes of k10 from the unprotected AES, while we cannot obtain any correct bytes of k10 from the masked AES for 107 000 traces (998 traces can break the unprotected AES, while up to 90 000 traces cannot break the masked AES). V. C ONCLUSION High throughput is an important factor for large data transformation systems in SANs. In order to secure “data-at-rest” and enhance the throughput, modern systems shift the encryption procedure from a software platform to a hardware platform. Hardware-based encryption still opens the possibility of DPA

40

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 60, NO. 1, JANUARY 2013

ACKNOWLEDGMENT The authors would like to thank S. Luo (Research Associate at the Department of Electrical and Computer Engineering, National University of Singapore) for his valuable help on collecting the power traces used in our experiments. R EFERENCES

Fig. 4. DPA attacks to the unprotected and the masked AES. (a) Correlation of the correct first byte (0x13) of k10 of the unprotected AES implementation. (b) Correlation of the correct first byte (0x13) of k10 of the masked AES implementation. TABLE III G UESSED K EY VALUES W ITH C ORRESPONDING C ORRELATION VALUES

and glitch attacks. In this brief, an LUT-based masked S-box has been proposed to construct the DPA-resistant design with acceptable area on FPGA. The proposed masked AES only needs to map the plaintext and masking values from GF (28 ) to GF (24 ) once at the beginning of the operation and map the ciphertext back from GF (24 ) to GF (28 ) once at the end of the operation. Therefore, by moving the mapping and inverse mapping outside the masked AES’s round function, we can reduce 20.5% area resources. We also map some parts of the masked S-box onto BRAM which further reduces area resources by 15.7%. Therefore, the overall area reduction is 36.2%. We achieve 4.5Mbits/s/slice for the proposed masked AES with the ability to defend against DPA and glitch attacks.

[1] Advanced Encryption Standard (AES), FIPS-197, Nat. Inst. of Standards and Technol., 2001. [2] P. Kocher, J. Jaffe, and B. Jun, “Differential power analysis,” in Proc. CRYPTO, 1999, vol. LNCS 1666, pp. 388–397. [3] L. Goubin and J. Patarin, “DES and differential power analysis (the ‘duplication’ method),” in Proc. CHES LNCS, 1999, vol. 1717, pp. 158–172. [4] S. Messerges, “Securing the AES finalists against power analysis attacks,” in Proc. FSE LNCS, 2000, vol. 1978, pp. 150–164. [5] K. Gaj and P. Chodowiec, “Fast implementation and fair comparison of the final candidates for advanced encryption standard using field programmable gate arrays,” in Proc. CT-RSA LNCS, 2001, vol. 2020, pp. 84–99. [6] F. X. Standaert, G. Rouvroy, J. J. Quisquater, and J. D. Legat, “Efficient implementation of Rijndael encryption in reconfigurable hardware: Improvements and design tradeoffs,” (in German), in Proc. CHES LNCS, 2003, vol. 2779, pp. 334–350. [7] G. P. Saggese, A. Mazzeo, N. Mazzocca, and A. G. M. Strollo, “An FPGAbased performance analysis of the unrolling, tiling, pipelining of the AES algorithm,” in Proc. FPL LNCS, Aveiro, Portugal, 2003, vol. 2778, pp. 292–302. [8] M. McLoone and J. V. McCanny, “Rijndael FPGA implementations utilizing look-up tables,” in Proc. IEEE Workshop Signal Process. Syst., Antwerp, Belgium, 2001, pp. 349–360. [9] V. Rijmen, “Efficient Implementation of the Rijndael S-Box,” Dept. ESAT., Katholieke Universiteit Leuven, Leuven, Belgium, 2006. [Online]. Available: http://www.networkdls.com/Articles/sbox.pdf [10] A. Hodjat and I. Verbauwhede, “A 21.54 Gbits/s fully pipelined processor on FPGA,” in Proc. IEEE 12th Annu. Symp. Field-Programm. Custom Comput. Mach., 2004, pp. 308–309. [11] S. Mangard, N. Pramstaller, and E. Oswald, “Successfully attacking masked AES hardware implementations,” in Proc. CHES LNCS, 2005, vol. 3659, pp. 157–171. [12] E. Oswald, S. Mangard, N. Pramstaller, and V. Rijmen, “A side-channel analysis resistant description of the AES S-box,” in Proc. FSE LNCS, Setubal, Potugal, 2005, vol. 3557, pp. 413–423. [13] H. Kim, S. Hong, and J. Lim, “A fast and provably secure higher-order masking of AES S-box,” in Proc. CHES LNCS, Nara, Japan, 2011, vol. 6917, pp. 95–107. [14] C. Carlet, L. Goubin, E. Prouff, M. Quisquater, and M. Rivain, “Higherorder masking schemes for S-boxes,” in Proc. FSE LNCS, 2012, vol. 7549, pp. 366–384. [15] J. D. Goli´c, “Techniques for random masking in hardware,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 2, pp. 291–300, Feb. 2007. [16] D. Canright and L. Batina, “A very compact ‘perfectly masked’ S-box for AES,” in Proc. ACNS LNCS, 2008, vol. 5037, pp. 446–459. [17] S. Mangard, E. Oswald, and T. Popp, Power Analysis Attacks: Revealing the Secrets of Smart Cards. New York: Spinger-Verlag, 2007. [18] Z. Yuan, Y. Wang, J. Li, R. Li, and W. Zhao, “FPGA based optimization for masked AES implementation,” in Proc. IEEE 54th Int. MWSCAS, Seoul, Korea, 2011, pp. 1–4. [19] M. Alam, S. Ghosh, M. J. Mohan, D. Mukhopadhyay, D. R. Chowdhury, and I. S. Gupta, “Effect of glitches against masked AES S-box implementation and countermeasure,” IET Inf. Security, vol. 3, no. 1, pp. 34–44, Feb. 2009. [20] E. Trichina, T. Korkishko, and K. H. Lee, “Small size, low power, side channel-immune AES coprocessor: Design and synthesis results,” in Proc. AES LNCS, 2005, vol. 3373, pp. 113–127. [21] S. K. Mathew, F. Sheikh, M. Kounavis, S. Gueron, A. Agarwal, S. K. Hsu, H. Kaul, M. A. Anders, and R. K. Krishnamurthy, “53 Gbps 2 native GF (24 ) composite-field AES-encrypt/decrypt accelerator for content-protection in 45 nm high-performance microprocessors,” IEEE J. Solid-State Circuits, vol. 46, no. 4, pp. 767–776, Feb. 2011.