A Reconfigurable Implementation of the New Secure ... - IEEE Xplore

A Reconfigurable Implementation of the New Secure Hash Algorithm

M. Zeghid (a,b), B. Bouallegue (a), A. Baganne (b), M. Machhout (a) and R. Tourki (a)

(a) Electronics and Micro-Electronic Laboratory (LEME), Monastir, Tunisia.
(b) LESTER-University of South Brittany, Lorient, France.
[email protected], [email protected], [email protected] [email protected], [email protected]

Abstract Hash functions are mainly applied in communication integrity and signature authentication. Many hash algorithms have been investigated and developed in recent years. This work concerns the FPGA implementation of hash functions. Field programmable gate arrays (FPGAs), being reconfigurable, flexible and physically secure, are a natural choice for implementing hash functions in a broad range of applications with different area-performance requirements. We propose a configurable Secure Hash Algorithm (SHA) processor for extended signature authentication. This paper investigates several optimization techniques that have recently been proposed in the literature. In our implementation, based on Xilinx Virtex FPGAs, the throughput of the SHA processor reaches 1296 Mbit/s. Speed/area results for these processors are analyzed and shown to compare favorably with other FPGA-based implementations. A high data throughput is achieved by our optimized design.

1. Introduction Cryptography serves many purposes and ensures different types of security through alternative encryption schemes. Among them we can cite, for instance, bulk encryption, message authentication and data integrity. Symmetric ciphers, asymmetric encryption algorithms and hash functions support each of these types, respectively [3]. Hash functions operate at the root of many popular cryptographic methods in current use, such as the Digital Signature Standard (DSS), the Transport Layer Security (TLS) and Internet Protocol Security (IPSec) protocols, numerous random number generation algorithms, encryption algorithms, all-or-nothing transforms, and password storage mechanisms [1, 2]. The purpose of a hash function is to produce a "fingerprint" of a file, message, or other block of data. A hash value h is generated by a function H of the form

Second International Conference on Availability, Reliability and Security (ARES'07) 0-7695-2775-2/07 $20.00 © 2007

h = H(M), where M is a variable-length message and H(M) is the fixed-length hash value. In a cryptographic hash function, a message of arbitrary length is padded, broken into blocks, and fed sequentially to a compression function, which converts a fixed-length input (the current message block) into a fixed-length output (the hash value). The hash values of individual blocks are used iteratively by the compression function to find the final hash value, referred to as the message digest. A hash function provides a practically unique relationship between the input message and the hash value and hence represents a longer message in a concise way. Currently, the SHA-1 algorithm is the National Institute of Standards and Technology (NIST) secure hash standard. Anticipating the increase in security offered by the new Advanced Encryption Standard (AES), NIST proposed an expansion of its standard. In 2002, NIST published the new Secure Hash Standard [4], which detailed three new Secure Hash Algorithms: SHA-256, SHA-384, and SHA-512. Since then, SHA-224 has been added to the standard, forming the 'SHA-2' family of hash functions. Today, reconfigurable computing is a very attractive method for the hardware implementation of systems and algorithms [15, 16]. Reconfigurable systems can change their "true" hardware configuration and can support multiple operation modes. In this paper, we present an ultra-high-speed architecture that integrates the SHA-256 and SHA-512 hash functions. The proposed system is reconfigurable in the sense that it performs efficiently for both hash functions. The architecture design is based on a 2-stage pipelined methodology. The proposed system is compared with other related works in terms of frequency, allocated resources and throughput [6, 7, 8, 9, 10, 11]. In addition, comparisons with other conventional works are shown using the Performance/Area ratio.
Design approaches that meet the constraints of both high performance and small size were presented in [7] and [10], where SHA-2 was implemented using the re-use and pipeline techniques simultaneously. The paper is organized as follows. Section 2 briefly describes the SHA-256 and SHA-512 hash functions. Section 3 presents the proposed system and describes the internal components of the architecture in detail. Section 4 gives the synthesis results of the FPGA implementation and comparisons with other related works. Finally, conclusions and observations are discussed in Section 5.

2. SHA-256/SHA-512 Algorithm All descriptions of the SHA-256, SHA-384 and SHA-512 algorithms can be found in the official NIST standard [4, 5]. Table 1 compares the functional characteristics of the three hash functions. The security of these hash functions is controlled by the size of their outputs, referred to as hash values, n. The definition of SHA-384 is almost identical to that of SHA-512, with the exception of a different choice of initialization vector and a truncation of the final 512-bit result to 384 bits.

Table 1. Functional characteristics of three hash functions

SHA standard                     SHA-256   SHA-384   SHA-512
Size of hash value (bits)        256       384       512
Complexity of the best attack    2^128     2^192     2^256
Message size (bits)              < 2^64    < 2^128   < 2^128
Message block size (bits)        512       1024      1024
Word size (bits)                 32        64        64
Number of words                  8         8         8
Digest rounds number             64        80        80
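The hash-value and block sizes in Table 1 can be cross-checked against a software implementation. The following is a minimal sketch using Python's standard hashlib module; the helper name hash_parameters and the TABLE_1 dictionary are illustrative and not part of the original work.

```python
import hashlib

# (digest bits, block bits) per algorithm, copied from Table 1.
TABLE_1 = {"sha256": (256, 512), "sha384": (384, 1024), "sha512": (512, 1024)}


def hash_parameters(name):
    """Return (digest size, block size) in bits as reported by hashlib."""
    h = hashlib.new(name)
    return h.digest_size * 8, h.block_size * 8


for name, expected in TABLE_1.items():
    # Each line prints the hashlib values next to the Table 1 values.
    print(name, hash_parameters(name), expected)
```

hashlib reports sizes in bytes, so the sketch multiplies by 8 to match the bit counts used in the table.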

All functions have a very similar internal structure and process each message block using multiple rounds. The number of rounds is the same for SHA-384 and SHA-512 (80) and is 20% smaller for SHA-256 (64). These hash functions make it possible to check a message's integrity: any change to the message will, with very high probability, result in a different message digest. Each hash function operation can be divided into two stages: (1) pre-processing and (2) hash computation. Pre-processing involves padding the input message, parsing the padded data into a number of m-bit blocks (m = 512 or 1024 bits), and setting the appropriate initial values, which are used in the hash computation. The hash computation uses the padded data along with functions, constants, and word logical and algebraic operations to iteratively generate a series of hash values. After a specified number of transformation rounds, the produced hash value is equal to the message digest. The digest ranges in length from 256 to 512 bits, depending on the selected hash function. For SHA-256, during the pre-processing phase, the message is padded and parsed into 512-bit message blocks, which are used to generate the message schedule words Wt. SHA-256 requires 64 rounds to produce the 256-bit message digest. Each round requires the previous round's results, a word Wt and a constant value Kt. Its input and output are each formed by eight working words A-H: eight 32-bit words (256 bits) for SHA-256 and eight 64-bit words (512 bits) for SHA-512. The expressions to calculate each round's outputs are given in Eqs. (1)-(3), following the guidelines of [4, 5]:

$A_{t+1} = Tmp + \Sigma_0(A_t) + Maj(A_t, B_t, C_t)$   (1)
$B_{t+1} = A_t, \quad C_{t+1} = B_t, \quad E_{t+1} = Tmp + D_t$   (2)
$D_{t+1} = C_t, \quad F_{t+1} = E_t, \quad G_{t+1} = F_t, \quad H_{t+1} = G_t$   (3)

where $Tmp = W_{t+1} + K_{t+1} + H_t + \Sigma_1(E_t) + Ch(E_t, F_t, G_t)$ and all additions are taken modulo $2^w$ for word size w. The processing unit uses four logical functions: Ch, Maj, Σ0, and Σ1. The result of each function is a new 32-bit or 64-bit word. The logic functions Ch and Maj are identical for SHA-256 and SHA-512. The rotation amounts shown for Σ0 and Σ1 are those of SHA-512:

$Ch(x, y, z) = (x \wedge y) \oplus (\neg x \wedge z)$   (4)
$Maj(x, y, z) = (x \wedge y) \oplus (x \wedge z) \oplus (y \wedge z)$   (5)
$\Sigma_0(x) = ROTR^{28}(x) \oplus ROTR^{34}(x) \oplus ROTR^{39}(x)$   (6)
$\Sigma_1(x) = ROTR^{14}(x) \oplus ROTR^{18}(x) \oplus ROTR^{41}(x)$   (7)

$ROTR^{y}(x)$ stands for rotation of x by y positions to the right.
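To make Eqs. (1)-(7) concrete, the following is a minimal software sketch of one SHA-512 round on 64-bit words. It is a behavioural model only, not the hardware described in this paper. The function names (rotr, ch, maj, big_sigma0, big_sigma1, round_step) are illustrative assumptions, and the word and constant for the round are passed in directly as w_t and k_t rather than indexed as in the Tmp expression.

```python
MASK64 = (1 << 64) - 1  # SHA-512 word size is 64 bits


def rotr(x, n):
    """Rotate the 64-bit word x right by n positions."""
    return ((x >> n) | (x << (64 - n))) & MASK64


def ch(x, y, z):
    """Eq. (4): choose bits of y where x is 1, bits of z where x is 0."""
    return (x & y) ^ ((~x & MASK64) & z)


def maj(x, y, z):
    """Eq. (5): bitwise majority of the three inputs."""
    return (x & y) ^ (x & z) ^ (y & z)


def big_sigma0(x):
    """Eq. (6) with the SHA-512 rotation amounts."""
    return rotr(x, 28) ^ rotr(x, 34) ^ rotr(x, 39)


def big_sigma1(x):
    """Eq. (7) with the SHA-512 rotation amounts."""
    return rotr(x, 14) ^ rotr(x, 18) ^ rotr(x, 41)


def round_step(state, w_t, k_t):
    """Apply Eqs. (1)-(3) once; state is the tuple (A, B, C, D, E, F, G, H)."""
    a, b, c, d, e, f, g, h = state
    tmp = (w_t + k_t + h + big_sigma1(e) + ch(e, f, g)) & MASK64
    new_a = (tmp + big_sigma0(a) + maj(a, b, c)) & MASK64
    new_e = (tmp + d) & MASK64
    # Six words are only shifted down; A and E receive the computed sums.
    return (new_a, a, b, c, new_e, e, f, g)


state = round_step((1, 2, 3, 4, 5, 6, 7, 8), w_t=0, k_t=0)
print([hex(word) for word in state])
```

Only two of the eight output words require arithmetic; the other six are register-to-register moves, which is what makes the round cheap to unroll.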

3. Proposed SHA-2 Family Architecture Since the SHA-256 and SHA-512 algorithms are very similar, they can easily be implemented on a single chip. The proposed system architecture is illustrated in Figure 1 and is presented in this section. The proposed system supports three operation modes for the reconfigurable SHA processor. The controller is designed to control the flow of data in the design, as well as the movement of data between the Padded Process Unit and the Hash Function Unit. An FSM is used for this purpose. The Padded Process Unit pads the input data messages and converts them to 512- or 1024-bit blocks (padded data). This operation is simple and is well defined by the SHA specifications. The hash value generation is mainly centred in the Hash Function Core (Figure 2) for both operations (SHA-256 and SHA-512). Since the SHA processors are designed for implementation on a reconfigurable platform, a twice-unrolled ('2x') core could be investigated at relatively low complexity [8, 11].
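The padding rule that the Padded Process Unit applies is the one defined by the SHA specification: append a single 1 bit, then zero bits, then the message length as a big-endian integer (64 bits for 512-bit blocks, 128 bits for 1024-bit blocks) so that the total is a multiple of the block size. The sketch below is a software model under the assumption of byte-aligned messages; the function name pad_message is hypothetical and the code says nothing about how the hardware unit is built.

```python
def pad_message(message: bytes, block_bits: int = 512) -> bytes:
    """Pad a byte-aligned message to a multiple of block_bits (512 or 1024)."""
    if block_bits not in (512, 1024):
        raise ValueError("block_bits must be 512 or 1024")
    block_bytes = block_bits // 8
    length_bytes = block_bytes // 8  # 8-byte length for SHA-256, 16 for SHA-512
    bit_length = len(message) * 8

    padded = message + b"\x80"  # the mandatory 1 bit followed by seven 0 bits
    # Zero bytes needed so that the length field ends exactly on a block boundary.
    zero_count = (-(len(padded) + length_bytes)) % block_bytes
    padded += b"\x00" * zero_count
    padded += bit_length.to_bytes(length_bytes, "big")
    return padded


for msg in (b"abc", b"a" * 55, b"a" * 56):
    print(len(msg), len(pad_message(msg)), len(pad_message(msg, 1024)))
```

A 55-byte message still fits in one 512-bit block, while a 56-byte message spills into a second block because the length field no longer fits.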

Figure 1. Proposed system architecture (block labels: Control Unit, ROM blocks for Kt and constants, Padded Unit, Bus interface unit, Hash Function Unit).

Specifically, the goal is to let the next round's available inputs be used immediately to calculate an intermediate result. The aim is a resource-efficient implementation with increased achievable data throughput. The proposed optimization modifies the basic hardware structure of the SHA processor as follows:
1. One look-up table (LUT) is used to store the constants and the number of shifts.
2. The SHA processor takes Nb-bit data as input (Nb = 16 or 32). The input data are not stored.
3. A 1-stage shift register design approach is employed to implement the Padded Unit. The register is loaded with the Nb-bit input message every clock cycle.
4. There are four different nonlinear functions: Eq. (4), Eq. (5), Eq. (6) and Eq. (7). Each is used in both hashing modes, namely SHA-256 and SHA-512. Rather than duplicating them, we use a common architecture for these functions, which reduces the area. The left and right shift constants are selected by the mode signal.

In the proposed architecture, memory elements configured as ROM store the constants and the number of shifts. For the shift-number lookup, since there are 24 steps in SHA, a ROM with a 5-bit address is used, with a 6-bit output for the SHA-2 processor. For the constants lookup, a ROM with a 7-bit input address and a 64- or 32-bit output is used. This arrangement allows simple decoding logic to select the appropriate constants for the two algorithms. Because the LUTs are used in this manner, a mode signal indexes the appropriate values during each step, which simplifies the control logic design and results in a compact implementation. However, our aim is to produce fast real-time implementations of the hashing processor. Therefore, a Padded Unit block was also included to implement the padding described in Section 2. This was realised as a synchronous finite state machine (FSM). In our architecture, message blocks are not stored in the Padded Unit. The Padded Unit passes input data to the Hash Function Unit one word (Nb bits) at a time, during the first N' clock cycles used to process each message block (N' = 32 clock cycles for SHA-256 or 64 for SHA-512).


Figure 2. Hash Function Unit architecture

The Hash Function Unit performs the actual hashing. It uses one clock cycle per two digest rounds (2x-unrolled) [10], so the process runs for N/2 clock cycles (N = 80 or 64). In both functions, the input registers are initialized with the constant initialization vector and are updated with a new value in each round. In SHA-2, six of the eight working words (A, B, C, D, E, F, G, and H) remain almost unchanged by a single round: they are only shifted down by one position. The first word, A, and the fifth word, E, undergo a more complicated transformation, equivalent to multi-operand addition modulo 2^w, where w is the word size. The operands depend on the input words, the round-dependent constant Kt (the Kt constants are stored in ROM to save area in the hash function unit), and a message-dependent word Wt. The 2x-unrolled core requires two Wt words to be available simultaneously. Both units generate N message-dependent words Wt, t = 0 ... N-1 (t = 0 ... 79 for SHA-512). The first 16 of these words, W0 ... W15, are simply the first 16 words of the input message block; the remaining words are computed using a simple feedback function built from rotations, shifts, XOR operations and modular additions. After all N rounds, the values in registers A to H are added to the previous hash values to obtain the new hash values. Both circuits use the carry-save representation of numbers to speed up the multi-operand addition and to minimize delays associated with carry propagation. The straightforward use of carry-save adders in the case of five-operand addition would lead to three levels of 3-to-2 carry-save adders, followed by a carry-propagate adder. Finally, when the final message block has been processed, the hash value outputs are concatenated to produce the 256- or 512-bit message digest. A mode signal selects the limit of a counter, which counts to N/2; it is used to address the ROM and to select between the 512-bit and the shorter 256-bit message digests.
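The effect of 2x unrolling on the cycle count can be illustrated with a small software model of the SHA-256 mode, in which each loop iteration stands for one clock cycle and applies two rounds. This is a sketch under several assumptions and not the authors' hardware: all names (integer_root, compress_2x, sha256_single_block) are illustrative, the round constants and initial hash value are derived from their standard definitions (fractional bits of cube and square roots of the first primes) instead of being read from a ROM, and only messages that fit in a single block are handled.

```python
import hashlib

MASK = 0xFFFFFFFF  # SHA-256 works on 32-bit words


def integer_root(n, k):
    """Largest integer r such that r**k <= n, found by binary search."""
    lo, hi = 0, 1
    while hi ** k <= n:
        hi *= 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid ** k <= n:
            lo = mid
        else:
            hi = mid
    return lo


def first_primes(count):
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# Kt: first 32 fractional bits of the cube roots of the first 64 primes.
K = [integer_root(p << 96, 3) & MASK for p in first_primes(64)]
# Initial hash value: first 32 fractional bits of the square roots of the first 8 primes.
H0 = [integer_root(p << 64, 2) & MASK for p in first_primes(8)]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def round_step(state, w, k):
    """One SHA-256 round on the working words (A..H)."""
    a, b, c, d, e, f, g, h = state
    ch = (e & f) ^ (~e & g)
    maj = (a & b) ^ (a & c) ^ (b & c)
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    tmp = (h + s1 + ch + k + w) & MASK
    return ((tmp + s0 + maj) & MASK, a, b, c, (d + tmp) & MASK, e, f, g)


def compress_2x(hash_in, block):
    """Process one 64-byte block, two rounds per modelled clock cycle."""
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):  # message schedule feedback
        x, y = w[t - 15], w[t - 2]
        sig0 = rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)
        sig1 = rotr(y, 17) ^ rotr(y, 19) ^ (y >> 10)
        w.append((sig1 + w[t - 7] + sig0 + w[t - 16]) & MASK)
    state, cycles = tuple(hash_in), 0
    for t in range(0, 64, 2):  # both Wt and Wt+1 are consumed in one cycle
        state = round_step(state, w[t], K[t])
        state = round_step(state, w[t + 1], K[t + 1])
        cycles += 1
    return [(x + y) & MASK for x, y in zip(hash_in, state)], cycles


def sha256_single_block(message):
    if len(message) > 55:
        raise ValueError("sketch handles messages that fit in one block")
    block = (message + b"\x80" + b"\x00" * (55 - len(message))
             + (8 * len(message)).to_bytes(8, "big"))
    words, cycles = compress_2x(H0, block)
    return b"".join(x.to_bytes(4, "big") for x in words), cycles


digest, cycles = sha256_single_block(b"abc")
print(digest.hex(), cycles)
print(digest == hashlib.sha256(b"abc").digest())
```

The model reproduces the reference digest while counting N/2 = 32 iterations for N = 64 rounds. In hardware the two chained rounds lengthen the critical path, which a software loop does not show.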

4. Experimental Results and Comparisons

The proposed system architecture (Figure 1) was captured in VHDL using a structural description. The circuits were simulated with Model Technology's ModelSim simulator and synthesized using Xilinx ISE tools v6.1 for implementation on a Xilinx Virtex-II xc2v2000-bf957 FPGA. The synthesis results for the proposed SHA-2 implementation are given in Table 2. Three performance metrics are used: area (a), clock frequency (f) and throughput (d). The latency is defined as the number of clock cycles (loop iterations) needed to process one message block, and the clock period as the minimum operating clock period. The throughput d is computed as

$d = \dfrac{\text{message block size}}{\text{clock period} \times \text{latency}}$

The results show that unrolling the quasi-pipelined SHA-2 processor design provides a data throughput advantage. Our circuit has a 2x-unrolled core. The critical path inside the core of the SHA-2 processor increases with the degree of unrolling. Although the unrolled designs process messages in fewer clock cycles than the basic designs, their longer critical path means that the maximum clock frequency decreases. Going from a basic quasi-pipelined SHA-512 or SHA-256 design to the 2x-unrolled design reduces the number of clock cycles by a factor of 2. Therefore, an overall increase in throughput is obtained, at the cost of a 51% area increase. The operating frequency of the proposed system is 81 MHz. We effectively pushed the pipelining approach to the limit, in the sense that it is not possible to create more pipeline sections while increasing the total number of clock cycles by only a small factor. This can be understood by considering that the Maj and Ch functions simultaneously access the values stored in three different positions in the corresponding shift registers, and their output value is immediately inserted back into the shift register.

For further assessment, the proposed SHA processor implementations are compared in terms of operating frequency, throughput and area-delay product with other related works (see Table 2). Further comparisons with implementations of the previous hash function standard (SHA-1) [11, 12, 13] are also given.

Table 2. Implementation comparison results

Reference   Device               Frequency (MHz)   Area          Throughput (Mbps)
Our work    Virtex2 xc2v2000     81                1938 slices   1296 (256), 2073.6 (512)
[6]         Virtex v200pq240     74                2384 CLBs     291 (256), 467 (512)
[8]         VirtexE xcv600E8     38                2914          479
[7]         Virtex2 xc2v2000     73.975            2032 slices   996.7 (256), 1466 (512)
[9]         Virtex v200pq240     75 (512)          2237 CLBs     480 (512)
[10]        Virtex E             58                2710 slices   1485
[11]        Virtex 2v500fg45     55                2245 CLBs     1339 (SHA-1)
[12]        Virtex 2v500fg45     38                1550 CLBs     900 (SHA-1)
[13]        Virtex 2v500fg45     55                -             2816 (SHA-1)

The proposed system requires only 32 clock cycles in the SHA-2(256) operation mode. In the SHA-2(512) operation mode, 40 clock cycles are required. The clock frequency for all operation modes of the proposed architecture is 81 MHz. Its throughput is therefore 4 to 5 times higher than that of the implementations in [6, 8]. The system introduced in [8] supports two hash functions (384 and 512) of the SHA-2 standard, while the proposed system operates efficiently in the 256 and 512 modes. The throughput achieved in [8] is 479 Mbps, which is much lower than that of our proposed system for SHA-2(256) (1296 Mbps) and SHA-2(512) (2073.6 Mbps), as reported in Table 2. The proposed system requires about 9% less silicon area than the design in [8]. Using the Performance/Area (Mbps/CLBs) ratio, the proposed system is about 55% better than [6]. By the same ratio, the proposed system outperforms the system of [8] for SHA-2(512) operation by about 40%. In [7, 9, 10], SHA-2(256) and SHA-2(512) were implemented separately. The proposed system achieves better throughput values than both FPGA implementations of [7]. The proposed system also compares favorably with previous hardware implementations of the SHA-1 standard [11, 12, 13]. However, a detailed "fair" comparison with the previous standard is not possible, since the two standards (SHA-1 and SHA-2) have major differences in their specifications.
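The throughput expression d = message block size / (clock period × latency) can be evaluated directly from the cycle counts and clock frequency quoted above. The short sketch below does this; the function name throughput_mbps is illustrative, and it uses the fact that the clock period is the reciprocal of the frequency, so with the frequency in MHz the result is in Mbit/s.

```python
def throughput_mbps(block_bits, clock_mhz, cycles_per_block):
    """Throughput in Mbit/s: block size / (clock period * latency)."""
    # clock period = 1 / frequency, so dividing by the period multiplies by f.
    return block_bits * clock_mhz / cycles_per_block


sha256_mode = throughput_mbps(block_bits=512, clock_mhz=81, cycles_per_block=32)
sha512_mode = throughput_mbps(block_bits=1024, clock_mhz=81, cycles_per_block=40)
print(sha256_mode, sha512_mode)
```

With 81 MHz, 32 cycles per 512-bit block and 40 cycles per 1024-bit block, the formula gives 1296 Mbit/s and 2073.6 Mbit/s, consistent with the figures reported for this work.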

5. Conclusion In this paper, a reconfigurable implementation with multi-mode operation is proposed for the SHA-2 hash family. The SHA-2 secure hash algorithm is the newest hash function standard. This family of hash functions is deployed in a broad range of applications with different area-performance requirements. The proposed implementations are compared in terms of operating frequency, throughput and area-delay product. Compared with other well-known related works, our system performs much better. The proposed implementations can efficiently replace previous SHA-1 implementations in communication protocols and network integrity units, with a higher supported security level and better performance. This work can be applied efficiently in the implementation of digital signature algorithms, keyed-hash message authentication codes and random number generator architectures. The introduced system performs efficiently for the two SHA-2 standard functions (256, 512). The resources allocated to the proposed system are almost the same as the area of a separate SHA-2(512) implementation. The achieved performance is almost equal to that of the separate implementations.

6. Acknowledgments

This work was sponsored and supported by the French-Tunisian CMCU Program Grant N° 04G1102.

7. References
[1] W. Stallings, “Network and Internetwork Security: Principles and Practice”, Prentice Hall International, 1995.
[2] US NIST, “Digital Signature Standard”, FIPS PUB 186-2, http://csrc.nist.gov/publications/fips/fip1s8 6-2.htm.
[3] A. Menezes, P. van Oorschot, S. Vanstone, “Handbook of Applied Cryptography”, CRC Press, 1997.
[4] US NIST, “Secure Hash Standard”, Draft FIPS PUB 180-2, 2002.
[5] US NIST, “Descriptions of SHA-256, SHA-384 and SHA-512”, http://csrc.nist.gov/encryptionishs/sha2S6-3X4SI2.pdf, 2001.
[6] N. Sklavos and O. Koufopavlou, “Implementation of the SHA-2 Hash Family Standard Using FPGAs”, The Journal of Supercomputing, 31, 227–248, 2005.
[7] R.P. McEvoy, F.M. Crowe, C.C. Murphy, W.P. Marnane, “Optimisation of the SHA-2 Family of Hash Functions on FPGAs”, IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, 6 pp., 2-3 March 2006.
[8] M. McLoone and J.V. McCanny, “Efficient single-chip implementation of SHA-384 & SHA-512”, in Proc. IEEE International Conference on Field-Programmable Technology (FPT), pp. 311–314, 2002.
[9] N. Sklavos and O. Koufopavlou, “On the hardware implementations of the SHA-2 (256, 384, 512) hash functions”, in Proc. IEEE International Symposium on Circuits & Systems (ISCAS), vol. V, pp. 153–156, 2003.
[10] K. Aisopos, A.P. Kakarountas, H. Michail, and C.E. Goutis, “High throughput implementation of the new Secure Hash Algorithm through partial unrolling”, IEEE Workshop on Signal Processing Systems Design and Implementation, pp. 99–103, 2-4 Nov. 2005.
[11] N. Sklavos, G. Dimitroulakos, and O. Koufopavlou, “An Ultra High Speed Architecture for VLSI Implementation of Hash Functions”, in Proc. of ICECS, pp. 990–993, 2003.
[12] J.M. Diez, S. Bojanic, C. Carreras, and O. Nieto-Taladriz, “Hash Algorithms for Cryptographic Protocols: FPGA Implementations”, in Proc. of TELEFOR, 2002.
[13] H. Michail, A.P. Kakarountas, O. Koufopavlou, and C.E. Goutis, “A Low-Power and High-Throughput Implementation of the SHA-1 Hash Function”, IEEE International Symposium on Circuits and Systems (ISCAS), vol. 4, pp. 4086–4089, 23-26 May 2005.
[14] SHA-1 Standard, National Institute of Standards and Technology (NIST), Secure Hash Standard, FIPS PUB 180-1, www.itl.nist.gov/tipspuhs/fiplXO-.lh tm
[15] N. Shirazi, W. Luk, and P.Y.K. Cheung, “Framework and tools for run-time reconfigurable designs”, IEE Proc. Comput. Digit. Tech., 147(3):147–152, 2000.
[16] P. James-Roxby, E. Cerro-Prada, and S. Charlwood, “Core-based design methodology for reconfigurable computing applications”, IEE Proc. Comput. Digit. Tech., 147(3):142–146, 2000.
