
FPGA based Optimized SHA-3 Finalist in Reconfigurable Hardware

Qian Song, Yi Wang, Zhican Li, Quan Zhou, Wufei Wu, Demin Han, Wenlong Xu, Zuo Chen and Renfa Li
Embedded Systems & Networking Laboratory, Hunan University
Hunan Provincial Key Laboratory of Network and Information Security, Hunan University
e-mail: [email protected]

Abstract—A hash function is a well-defined procedure that converts a large message of arbitrary length into a small, fixed-size integer. The Secure Hash Algorithm (SHA) is a one-way message digest algorithm commonly used in cryptographic applications such as authentication, digital signatures and data integrity. In this paper, we propose reconfigurable structures for each of the SHA-3 finalists BLAKE, Grøstl, JH, Keccak and Skein. The proposed reconfigurable Grøstl, JH and Keccak support different digest sizes, while the proposed Skein and BLAKE each support three different modes using a single hardware core. Experimental results on a Xilinx Virtex-5 FPGA platform show that the proposed structures support different parameters of the SHA-3 finalists with performance comparable to existing works.

Keywords-SHA-3, FPGA, reconfigurable

I. INTRODUCTION

The Secure Hash Algorithm (SHA) is a family of cryptographic hash algorithms widely used in Automatic Teller Machines (ATM), Radio-Frequency Identification (RFID) and Virtual Private Networks (VPN) [1]. However, it has been reported that SHA-0, SHA-1 and SHA-2 may be vulnerable to attack [2-5]. Therefore, the National Institute of Standards and Technology (NIST) officially announced a call for SHA-3, a new secure hash algorithm. The five SHA-3 finalists are BLAKE, Grøstl, JH, Keccak and Skein [6].

BLAKE [7] uses the Hash Iterative Framework (HAIFA) iteration mode [8], and its compression function is built on the core of ChaCha, one of the fastest stream ciphers [9]. Aumasson et al. implemented the 1G, 4G and 8G modes of BLAKE separately (G is the core function of the compression function), with a maximum throughput of 3103 Mb/s [7]. Kobayashi et al. [10] implemented the 4G mode of BLAKE-32 with a throughput of 2676 Mb/s using 1660 slices on the Side-channel Attack Standard Evaluation Board II (SASEBO-GII) platform [11]. A similar design proposed by Homsirikamol et al. achieved 119 MHz when ported to the Altera Stratix III FPGA platform [12].

Grøstl [13] is mainly composed of Merkle-Damgård (MD) iteration and an AES-based compression function. Baldwin et al. implemented Grøstl-256 and Grøstl-512 separately, achieving throughputs of 3242 Mb/s and 3619 Mb/s on the Xilinx xc5vlx330. They also proposed a reconfigurable structure for Grøstl-256 and Grøstl-512 that processes the P and Q permutations in parallel and stores the S-box in Block Random-Access Memory (BRAM), achieving a throughput of 7310 Mb/s [14]. Jungk and Reith proposed reconfigurable structures for Grøstl-224 and Grøstl-256 [15] and for Grøstl-384 and Grøstl-512 [16], in which the P and Q permutations are shared and the S-box is generated on the fly. Their designs took up 6136 slices and 8308 slices, respectively. Homsirikamol et al. proposed a pipelined structure for Grøstl-256 that achieved a throughput of 7885 Mb/s, the best among the existing designs [12].

JH [17] introduced a new compression function structure that uses a large block cipher with a constant key. JH-256 was implemented by Baldwin et al. and by Matsuo et al., achieving throughputs of 1941 Mb/s and 2639 Mb/s using 1291 slices and 2661 slices, respectively, on the Xilinx Virtex-5 FPGA platform [14][18]. Homsirikamol et al. implemented JH-256 with a throughput of 5516 Mb/s using 1018 slices on the Xilinx Virtex-5 FPGA platform [12].

Keccak [19] is based on the sponge construction [20], whose short critical path allows higher clock frequencies. Homsirikamol et al. optimized Keccak-256 on the Virtex-5 platform and Keccak-512 on the Altera Stratix III FPGA platform [12]. Their Keccak-256 took up 1217 slices with a throughput of 12817 Mb/s, and their Keccak-512 took up 4213 ALUTs with a throughput of 12393 Mb/s. Baldwin et al. implemented Keccak-256 with a throughput of 8518 Mb/s using 1117 slices on the Xilinx xc5vlx330 [14].

Skein is mainly composed of Unique Block Iteration (UBI) and the Threefish function. UBI transforms an input of arbitrary length into an output of fixed length. Threefish is the compression function, and it determines the size of Skein's state space [21]. Baldwin et al. proposed a four-rolled structure for Skein-512 that achieved a throughput of 1945 Mb/s on the Xilinx xc5vlx330 [14]. Tillich et al. proposed an eight-rolled architecture for Skein-256 and Skein-512 on the Xilinx xc5vlx110, achieving throughputs of 1751 Mb/s and 3535 Mb/s, respectively [22].
Tillich et al. also proposed similar structures for Skein-256 and Skein-512, whose maximum throughput reached 1762 Mb/s and 2501 Mb/s in a UMC 0.18 µm CMOS standard cell technology [23].

In this paper, we first detail the characteristics of the SHA-3 finalists. We then propose reconfigurable structures for the SHA-3 finalists that support different digest lengths. A detailed comparison with existing designs shows that the proposed reconfigurable designs take up less area than implementing each variant separately.

This work is supported by the "Chinese National Science Foundation" (No. 60873074 and No. 60673061), the "Changsha Science Technology Scheme" (No. K1003028-11) and "the Fundamental Research Funds for Chinese Central Universities".

978-1-61284-865-5 ©2011 IEEE

II. PRELIMINARIES OF SHA-3 FINALISTS

The SHA-3 finalists have different structures and different compression functions. NIST summarized the characteristics of these algorithms as shown in Table I [6]:


TABLE I. STRUCTURE AND CHARACTERISTICS

| Algorithm | Designer | Structure | Compression Function |
|---|---|---|---|
| BLAKE [7] | Aumasson | HAIFA | LAKE, ChaCha |
| Grøstl [13] | Knudsen | Wide-pipe MD | AES permutations |
| JH [17] | Hongjun Wu | Wide-pipe MD | AES |
| Keccak [19] | Daemen | Sponge | Iterated permutation |
| Skein [21] | Schneier | MD, UBI | Threefish |

As Table I shows, the candidate algorithms are designed with a variety of structures, including HAIFA, wide-pipe MD, MD, UBI and sponge. The compression functions include LAKE (a hash function with a wide-pipe structure [24]), ChaCha, AES, Threefish and an iterated permutation.

Aumasson et al. [7] proposed BLAKE with the HAIFA structure, which mixes a salt and a counter into the compression function to encourage randomized hashing and to overcome the weaknesses of the plain iterative structure. Grøstl [13] employs the wide-pipe MD structure, in which the internal state is significantly larger than the output, and its compression function mainly consists of the P and Q permutations. JH [17] also has the wide-pipe MD structure, and its compression function, designed for efficient differential propagation, is a byte-oriented Substitution-Permutation Network (SPN) like AES. Keccak [19] applies the Hermetic Sponge Strategy (HSS) in three steps: absorbing the input data, squeezing data from the state and outputting the data [20]. Keccak-1600 is chosen from a set of seven permutations as the candidate for the SHA-3 competition. Skein [21] uses a new structure built from UBI and Threefish. The UBI structure has three types: configuration UBI, message processing UBI and output UBI. Each UBI has a corresponding Threefish processing unit.

III. PROPOSED RECONFIGURABLE ARCHITECTURE FOR GRØSTL, SKEIN, JH, KECCAK AND BLAKE

A. Grøstl

According to the output length, Grøstl is divided into Grøstl-224, Grøstl-256, Grøstl-384 and Grøstl-512. We propose a reconfigurable structure for Grøstl, as shown in Fig. 1.

Figure 1. The proposed reconfigurable structure of Grøstl

In Fig. 1, M represents the message, and the signal ver selects the parameter among 224, 256, 384 and 512.

B. Skein

According to the size of the state space, Skein has three versions: Skein-256, Skein-512 and Skein-1024. MIX is the non-linear mixing function of Threefish, and it can be implemented in iteration, four-rolled and eight-rolled modes. We propose a scalable architecture for Skein-512 that supports all three modes. Fig. 2 shows the proposed structure.

Figure 2. The proposed scalable structure of Skein

Each MPn (n = 0, 1, 2, …, 7) represents a set of MIX transformations, and each MUX unit is a multiplexer controlled by the Sel and Counter signals.

C. JH

According to the output length, JH is divided into JH-224, JH-256, JH-384 and JH-512. We propose a reconfigurable structure for JH, as shown in Fig. 3.

Figure 3. The proposed reconfigurable structure of JH

In Fig. 3, MSG represents the input message block, Cr,0 is the initial vector of the round constant, and the sel signal selects the initial values of JH-224, JH-256, JH-384 and JH-512.

D. Keccak

We propose a reconfigurable structure for Keccak-224, Keccak-256, Keccak-384 and Keccak-512. Fig. 4 shows the proposed structure. In Fig. 4, M represents the input message. The module R is a round function that iterates 18 times for each permutation. The sel signal selects the length of the output message.

2011 International Symposium on Integrated Circuits


[Figure 4 diagram labels: M, Padding, SIPO, MUX, state_register (1600 bits), round function R, DATA_Buffer, PISO, DATA_OUT; control signals sel, hash_ready and Buffer_ready; output widths 512, 384, 256 and 224. Parameters as labeled in the figure: keccak_224: r=1024, c=576; keccak_256: r=1024, c=576; keccak_384: r=512, c=1088; keccak_512: r=512, c=1088.]

Figure 4. The proposed reconfigurable structure of Keccak
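The mode selection described for Fig. 4 can be illustrated in software. The Python sketch below is only a model under stated assumptions: the rate r and capacity c per mode are copied from the labels in Fig. 4, and the 18 iterations come from the text, but the encoding of `sel` as 0 to 3, the byte-level padding in `pad` and the function `toy_round` are hypothetical choices made for this example. In particular, `toy_round` is a simple stand-in and not Keccak's round function, so the outputs are not real Keccak digests.

```python
# Hypothetical mode table: sel -> (name, rate r, capacity c, digest bits).
# The r and c values are the ones labeled in Fig. 4.
MODES = {
    0: ("keccak_224", 1024, 576, 224),
    1: ("keccak_256", 1024, 576, 256),
    2: ("keccak_384", 512, 1088, 384),
    3: ("keccak_512", 512, 1088, 512),
}
STATE_BITS = 1600
ROUNDS = 18  # iterations of module R, as stated in the text
MASK = (1 << STATE_BITS) - 1


def toy_round(state, rnd):
    """Stand-in for module R: NOT Keccak's round, just a bijection on 1600 bits."""
    rotated = ((state << 1) | (state >> (STATE_BITS - 1))) & MASK
    return rotated ^ (rnd + 1)


def permute(state):
    """Apply the round function ROUNDS times."""
    for rnd in range(ROUNDS):
        state = toy_round(state, rnd)
    return state


def pad(message, rate_bytes):
    """Assumed byte-level 10*1 padding up to a multiple of the rate."""
    padded = bytearray(message) + b"\x01"
    padded += b"\x00" * (-len(padded) % rate_bytes)
    padded[-1] |= 0x80
    return bytes(padded)


def sponge(message, sel):
    """Absorb the message r bits at a time, then squeeze the selected digest."""
    _name, rate, capacity, digest_bits = MODES[sel]
    assert rate + capacity == STATE_BITS
    rate_bytes = rate // 8
    state = 0
    data = pad(message, rate_bytes)
    # Absorbing: XOR each r-bit block into the state, then permute.
    for off in range(0, len(data), rate_bytes):
        block = int.from_bytes(data[off:off + rate_bytes], "little")
        state = permute(state ^ block)
    # Squeezing: every digest here is at most r bits, so one read suffices.
    out = state & ((1 << digest_bits) - 1)
    return out.to_bytes(digest_bits // 8, "little")
```

Calling `sponge(b"abc", sel)` with `sel` from 0 to 3 returns 28, 32, 48 or 64 bytes, which mirrors how a single datapath with a selectable rate and output width can serve all four digest sizes.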

E. BLAKE

BLAKE has four versions: BLAKE-28, BLAKE-32, BLAKE-48 and BLAKE-64. There are three G function modes: 1G, 4G and 8G. We propose a scalable structure for BLAKE-32 that supports all three G modes. Fig. 5 shows the proposed structure of BLAKE.

Figure 5. The proposed scalable structure of BLAKE
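To make the three G modes concrete, the following Python sketch models one BLAKE-32 round. It is a software illustration, not the hardware design in Fig. 5. The G function uses the BLAKE-32 rotation distances 16, 12, 8 and 7, and the column and diagonal index sets follow the BLAKE specification [7]. The names `schedule` and `blake_round`, and the reading of 1G, 4G and 8G as one, four or eight G evaluations per clock cycle, are assumptions of this sketch. The inputs `x` and `y` stand for the two message-XOR-constant words that the caller has already selected, so the message permutation and constants are left out.

```python
MASK32 = 0xFFFFFFFF


def rotr32(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def g(v, a, b, c, d, x, y):
    """One G function applied in place to state words v[a], v[b], v[c], v[d]."""
    v[a] = (v[a] + v[b] + x) & MASK32
    v[d] = rotr32(v[d] ^ v[a], 16)
    v[c] = (v[c] + v[d]) & MASK32
    v[b] = rotr32(v[b] ^ v[c], 12)
    v[a] = (v[a] + v[b] + y) & MASK32
    v[d] = rotr32(v[d] ^ v[a], 8)
    v[c] = (v[c] + v[d]) & MASK32
    v[b] = rotr32(v[b] ^ v[c], 7)


# The eight G calls of one round: four columns, then four diagonals.
COLUMNS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONALS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]
G_CALLS = COLUMNS + DIAGONALS


def schedule(mode):
    """Group the eight G calls into clock cycles (assumed meaning of the modes)."""
    per_cycle = {"1G": 1, "4G": 4, "8G": 8}[mode]
    return [G_CALLS[i:i + per_cycle] for i in range(0, len(G_CALLS), per_cycle)]


def blake_round(state, words, mode):
    """Run one round; words holds eight (x, y) pairs, one per G call."""
    v = list(state)
    index = 0
    for cycle in schedule(mode):
        for (a, b, c, d) in cycle:
            x, y = words[index]
            g(v, a, b, c, d, x, y)
            index += 1
    return v
```

Under this model a round takes 8, 2 or 1 cycles in the 1G, 4G and 8G modes, while the resulting state is identical in all three, which is why one core with a mode select can trade area against latency.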

In Fig. 5, Msg represents the input message, and the signal sel selects the mode among 1G, 4G and 8G.

IV. COMPARISONS

We implemented Grøstl, Skein, JH, Keccak and BLAKE in the Verilog hardware description language and synthesized them with the Xilinx ISE 12.1 tools. Table II shows the experimental results on the Xilinx Virtex-5 platform for the proposed reconfigurable structures and for the individually implemented designs. From Table II, the proposed reconfigurable Grøstl takes up 1802 fewer slices than the total area of Grøstl-256 and Grøstl-512 implemented individually. The proposed Skein takes up 1728 fewer slices than the total area of the three individual modes. The proposed JH takes up 1800 fewer slices than the total area of JH-256 and JH-512 implemented individually, and the proposed Keccak takes up 1423 fewer slices than the total area of Keccak-256 and Keccak-512 implemented individually. The proposed BLAKE-32 takes up 2947 fewer slices than the total area of the three individual modes.
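The area difference reported above is the sum of the individual designs minus the reconfigurable design. The Python sketch below works this out for Grøstl and JH using the slice counts from Table II. The function name `area_saving` and the relative-saving figure are additions of this example, not something the paper reports.

```python
def area_saving(individual_slices, reconfigurable_slices):
    """Return (slices saved, fraction of the combined individual area saved)."""
    total = sum(individual_slices)
    saved = total - reconfigurable_slices
    return saved, saved / total


# Slice counts taken from Table II.
grostl_saved, grostl_fraction = area_saving([2145, 3936], 4279)
jh_saved, jh_fraction = area_saving([1452, 1823], 1475)

print(f"Grøstl: {grostl_saved} slices saved ({grostl_fraction:.1%})")
print(f"JH: {jh_saved} slices saved ({jh_fraction:.1%})")
```

For Grøstl this gives 2145 + 3936 - 4279 = 1802 slices, and for JH 1452 + 1823 - 1475 = 1800 slices, matching the D rows of Table II.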


TABLE II. THE RESULTS OF THE PROPOSED STRUCTURES

| Algorithm | Version | Platform | Area (slices) | Throughput (Mb/s) | Frequency (MHz) |
|---|---|---|---|---|---|
| Grøstl | 256 | xc5vlx220 | 2145 | 5548 | 238.4 |
| Grøstl | 512 | xc5vlx220 | 3936 | 8137 | 238.4 |
| Grøstl | R | xc5vlx220 | 4279 | 7623 | 223.32 |
| Grøstl | D | | 1802 | | |
| Skein | 512-1 | xc5vlx30 | 1284 | 1214 | 175.5 |
| Skein | 512-4 | xc5vlx30 | 1458 | 3074 | 120.1 |
| Skein | 512-8 | xc5vlx30 | 1561 | 3645 | 71.2 |
| Skein | R | xc5vlx30 | 2539 | 3139 | 61.3 |
| Skein | D | | 1728 | | |
| JH | 256 | xc5vlx220 | 1452 | 4962 | 378 |
| JH | 512 | xc5vlx220 | 1823 | 4962 | 378 |
| JH | R | xc5vlx220 | 1475 | 4319 | 329 |
| JH | D | | 1800 | | |
| Keccak | 256 | xc5vlx220 | 1375 | 14438 | 282 |
| Keccak | 512 | xc5vlx220 | 1446 | 7066 | 276 |
| Keccak | R | xc5vlx220 | 1698 | 13414 | 262 |
| Keccak | D | | 1423 | | |
| BLAKE | 32-1 | xc5vlx220 | 1624 | 849 | 136 |
| BLAKE | 32-4 | xc5vlx220 | 1742 | 2828 | 122 |
| BLAKE | 32-8 | xc5vlx220 | 1730 | 1805 | 78 |
| BLAKE | R | xc5vlx220 | 2149 | 2927 | 126 |
| BLAKE | D | | 2947 | | |

R: the reconfigurable architecture of Grøstl, Skein, JH, Keccak and BLAKE.
D: the area difference between the reconfigurable architecture and the individual designs.

Table III compares our reconfigurable designs with the existing methods. The Throughput/Area of the proposed Grøstl is 1.4 times that of Kobayashi's design and 2.4 times that of Baldwin's design. Although Matsuo's and Homsirikamol's designs achieve a Throughput/Area 1.7 and 1.8 times that of our design, each of them supports only one parameter. The throughput of the proposed Skein is 1.6 times that of Baldwin's design, although Tillich's design is 1.1 times faster. However, our scalable structure of Skein supports the iteration, four-rolled and eight-rolled modes. The throughput of the proposed JH is 1.6 times that of Matsuo's design, but only 0.8 times that of Homsirikamol's design. The throughput of the proposed Keccak is 1.6 times that of Baldwin's and Matsuo's designs, and 1.1 times that of Homsirikamol's design. The proposed BLAKE is 6.8% faster than Baldwin's design in clock frequency.
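The Throughput/Area figures used in the comparison are simply throughput in Mb/s divided by area in slices. As a check on how the ratios are formed, the Python sketch below recomputes them for the five proposed designs from the area and throughput values in Table II, and reproduces the two Grøstl ratios quoted against Kobayashi's and Baldwin's designs using their values from Table III. The dictionary and function names are chosen for this example only.

```python
# (area in slices, throughput in Mb/s) for the reconfigurable designs.
THIS_PAPER = {
    "Grøstl": (4279, 7623),
    "Skein": (2539, 3139),
    "JH": (1475, 4319),
    "Keccak": (1698, 13414),
    "BLAKE": (2149, 2927),
}


def throughput_per_area(area_slices, throughput_mbps):
    """Throughput/Area in Mb/s per slice."""
    return throughput_mbps / area_slices


ratios = {
    name: round(throughput_per_area(area, tp), 2)
    for name, (area, tp) in THIS_PAPER.items()
}

# Existing Grøstl designs, values as listed in Table III.
kobayashi = throughput_per_area(4057, 5171)
baldwin_1024 = throughput_per_area(4845, 3619)
ours = throughput_per_area(*THIS_PAPER["Grøstl"])

print(ratios)
print(round(ours / kobayashi, 1), round(ours / baldwin_1024, 1))
```

The rounded ratios come out as 1.78, 1.24, 2.93, 7.9 and 1.36 for Grøstl, Skein, JH, Keccak and BLAKE, and the Grøstl comparison gives factors of 1.4 and 2.4.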

V. CONCLUSION

In this paper, we detailed the design features and the existing implementations of the SHA-3 finalists. To improve the flexibility of hardware implementation, we proposed new reconfigurable structures for Grøstl, JH and Keccak, each of which supports the four digest sizes 224, 256, 384 and 512. We proposed a scalable structure for Skein that supports the iteration, four-rolled and eight-rolled modes. Similarly, we proposed a scalable structure for BLAKE that supports the 1G, 4G and 8G modes. The experimental results show that the proposed designs take up less area than the individual implementations combined. Moreover, they provide flexibility for area-constrained applications.


TABLE III. COMPARISON WITH THE EXISTING DESIGNS

| Algorithm | Design | Platform | Data path | Area (slices) | Throughput (Mb/s) | Clock Frequency (MHz) | Throughput/Area |
|---|---|---|---|---|---|---|---|
| Grøstl | Homsirikamol [12] | Xilinx Virtex-5 | 512-bit | 1597 | 7885 | 323.4 | 4.94 |
| Grøstl | Homsirikamol [12] | Xilinx Virtex-5 | 1024-bit | 3188 | 10314 | 292.1 | 3.24 |
| Grøstl | Kobayashi [10] | Virtex-5 xc5vlx30 | 512-bit | 4057 | 5171 | 101 | 1.27 |
| Grøstl | Baldwin [14] | Virtex-5 xc5vlx330 | 512-bit | 2391 | 3242 | 101.3 | 1.36 |
| Grøstl | Baldwin [14] | Virtex-5 xc5vlx330 | 1024-bit | 4845 | 3619 | 123.4 | 0.75 |
| Grøstl | Matsuo [18] | Virtex-5 xc5vlx30 | 512-bit | 2616 | 7885 | 154 | 3.01 |
| Skein | Matsuo [18] | Virtex-5 xc5vlx30 | 256-bit-4 | 854 | 1402 | 115 | 1.64 |
| Skein | Kobayashi [10] | Virtex-5 xc5vlx30 | 256-bit-4 | 854 | 1482 | 115 | 1.74 |
| Skein | Tillich [22] | Virtex-5 xc5vlx110 | 256-bit-8 | 937 | 1751 | 68.4 | 1.87 |
| Skein | Tillich [22] | Virtex-5 xc5vlx110 | 512-bit-8 | 1632 | 3535 | 69 | 2.17 |
| Skein | Baldwin [14] | Virtex-5 xc5vlx330 | 512-bit-4 | 1786 | 1945 | 83.65 | 1.09 |
| Skein | Homsirikamol [12] | Xilinx Virtex-5 | 512-bit-4 | 1716 | 3209 | 119.1 | 1.87 |
| JH | Homsirikamol [12] | Xilinx Virtex-5 | 256-bit | 1018 | 5416 | 380.8 | 5.32 |
| JH | Homsirikamol [12] | Xilinx Virtex-5 | 512-bit | 1104 | 5610 | 394.5 | 5.08 |
| JH | Matsuo [18] | Virtex-5 xc5vlx30 | 256-bit | 2661 | 2639 | 201 | 0.99 |
| JH | Baldwin [14] | Virtex-5 xc5vlx330 | 256-bit | 1291 | 1941 | 250.13 | 1.50 |
| Keccak | Homsirikamol [12] | Xilinx Virtex-5 | 1024-bit | 1272 | 12817 | 282.7 | 10.08 |
| Keccak | Baldwin [14] | Virtex-5 xc5vlx330 | 1024-bit | 1117 | 8518 | 189 | 7.63 |
| Keccak | Matsuo [18] | Virtex-5 xc5vlx30 | 512-bit | 1117 | 8190 | 189 | 7.33 |
| BLAKE | Aumasson [7] | Virtex-5 xc5vlx220 | 512-bit-1 | 390 | 575 | 91 | 1.47 |
| BLAKE | Aumasson [7] | Virtex-5 xc5vlx220 | 512-bit-4 | 1217 | 2438 | 100 | 2.00 |
| BLAKE | Aumasson [7] | Virtex-5 xc5vlx220 | 512-bit-8 | 1694 | 3103 | 67 | 1.83 |
| BLAKE | Kobayashi [10] | Virtex-5 xc5vlx220 | 512-bit-4 | 1660 | 2676 | 115 | 1.61 |
| BLAKE | Homsirikamol [12] | Virtex-5 xc5vlx220 | 512-bit-4 | 1851 | 2611 | 117 | 1.41 |
| BLAKE | Baldwin [14] | Virtex-5 xc5vlx220 | 512-bit-4 | 1118 | 1079 | 118 | 0.97 |
| Grøstl | This paper | Virtex-5 xc5vlx220 | 1024-bit | 4279 | 7623 | 223.32 | 1.78 |
| Skein | This paper | Virtex-5 xc5vlx30 | 512-bit | 2539 | 3139 | 61.3 | 1.24 |
| JH | This paper | Virtex-5 xc5vlx220 | 512-bit | 1475 | 4319 | 329 | 2.93 |
| Keccak | This paper | Virtex-5 xc5vlx220 | 1024-bit | 1698 | 13414 | 262 | 7.9 |
| BLAKE | This paper | Virtex-5 xc5vlx220 | 512-bit | 2149 | 2927 | 126 | 1.36 |

REFERENCES

[1] National Institute of Standards and Technology, FIPS 180: “Secure Hash Standard,” FIPS, 1993.
[2] X. Y. Wang and H. B. Yu, “How to Break MD5 and Other Hash Functions,” EUROCRYPT’05, Lecture Notes in Computer Science Vol. 3494, Springer, 2007, pp. 19-35.
[3] F. Chabaud and A. Joux, “Differential Collisions in SHA-0,” CRYPTO’98, vol. 1462, Springer-Verlag, 1998, pp. 56-71.
[4] X. Y. Wang, Y. L. Yin, and H. B. Yu, “Finding Collisions in the Full SHA-1,” CRYPTO 2005, Springer-Verlag, 2005, pp. 17-36.
[5] S. K. Sanadhya and P. Sarkar, “New Local Collisions for the SHA-2 Hash Family,” ICISC 2007, vol. 4817, Springer-Verlag, 2007, pp. 193-205.
[6] National Institute of Standards and Technology: “CRYPTOGRAPHIC HASH ALGORITHM COMPETITION,” Gaithersburg: 2007[2011], http://csrc.nist.gov/groups/ST/hash/sha-3/index.html
[7] J. P. Aumasson, L. Henzen, and W. Meier, “SHA-3 proposal BLAKE,” version 1.3, 2008, Available online at http://131002.net/blake/blake.pdf.
[8] E. Biham and O. Dunkelman, “A framework for iterative hash functions - HAIFA,” Cryptology ePrint Archive, Report 2007/278, 2007, http://eprint.iacr.org/
[9] D. J. Bernstein, “ChaCha, a variant of Salsa20,” January 2008, http://cr.yp.to/chacha.html.
[10] K. Kobayashi, J. Ikegami, and S. Matsuo, “Evaluation of hardware performance for the SHA-3 candidates using SASEBO-GII,” Cryptology ePrint Archive, Report 2010/010, 2010, http://eprint.iacr.org/.
[11] National Institute of Advanced Industrial Science and Technology (AIST), Research Center for Information Security (RCIS), “Side-channel Attack Standard Evaluation Board (SASEBO),” http://www.rcis.aist.go.jp/special/SASEBO/SASEBO-GII-ja.html.
[12] E. Homsirikamol, M. Rogawski and K. Gaj, “Comparing Hardware Performance of Fourteen Round Two SHA-3 Candidates Using FPGAs,” Cryptology ePrint Archive, Report 2010/445, 2010.
[13] P. Gauravaram, L. R. Knudsen, and K. Matusiewicz, “Grøstl–a SHA-3 candidate,” October 2008, http://www.grøstl1.info/Grøstl1.pdf.
[14] B. Baldwin, N. Hanley, and M. Hamilton, “FPGA implementations of the round two SHA-3 candidates,” FPL 2010, 2010, pp. 400-407.
[15] B. Jungk, S. Reith, and J. Apfelbeck, “On Optimized FPGA Implementations of the SHA-3 Candidate Grøstl,” IACR Eprint report 2009/206, Available online at http://eprint.iacr.org/2009/206.pdf.
[16] B. Jungk and S. Reith, “On FPGA-based implementation of Grøstl,” IACR Eprint report 2010/260, http://eprint.iacr.org/2010/260.pdf.
[17] H. J. Wu, “SHA-3 proposed JH,” 2008, Available online at http://icsd.i2r.a-star.edu.sg/staff/hongjun/jh/index.html.
[18] S. Matsuo, M. Knezevic, and P. Schaumont, “How Can We Conduct ‘Fair and Consistent’ Hardware Evaluation for SHA-3 Candidate?” Second SHA-3 Candidate Conference, 2010, Available online at http://csrc.nist.gov/groups/ST/hash/sha-3/Round2/Aug2010/documents/papers/MATSUO_SHA-3_Criteria_Hardware_revised.pdf.
[19] G. Bertoni, J. Daemen, and M. Peeters, “Keccak sponge function family main document,” 2008, http://keccak.noekeon.org/
[20] G. Bertoni, J. Daemen, and M. Peeters, “Sponge Functions,” ECRYPT 2007, http://www.csrc.nist.gov/pki/HashWorkshop/Public_Comments/2007_May.html.
[21] N. Ferguson, S. Lucks, and B. Schneier, “The Skein hash function family,” 2009, http://eprint.iacr.org/.
[22] S. Tillich, “Hardware Implementation of the SHA-3 Candidate Skein,” IACR Eprint report 2009/159, http://eprint.iacr.org/2009/159.pdf.
[23] S. Tillich, M. Feldhofer, and M. Kirschbaum, “High-Speed Hardware Implementations of BLAKE, Blue Midnight Wish, CubeHash, ECHO, Fugue, Grøstl, Hamsi, JH, Keccak, Luffa, Shabal, SHAvite-3, SIMD, and Skein,” IACR Eprint report 2009/510, http://eprint.iacr.org/2009/510.pdf.
[24] J. P. Aumasson, W. Meier, and R. C. W. Phan, “The hash function family LAKE,” in Fast Software Encryption 2008, vol. 5086, Springer-Verlag, 2008, pp. 36-53.
