ASIC Implementation of a Unified Hardware Architecture for Non-Key Based Cryptographic Hash Primitives

Ganesh T S
Electrical & Computer Engineering
Iowa State University
Ames, Iowa, USA 50011
[email protected]

Abstract
Hash algorithms are a class of cryptographic primitives used for fulfilling the requirements of integrity and authentication in cryptography. In this paper, we propose and present the ASIC implementation of ‘HashChip’, a hardware architecture aimed at providing a unified solution for three different commercial MDC (Manipulation Detection Code) hash primitives, namely MD5, SHA1 and RIPEMD160. The novelty of the work lies in the exploitation of the similarities in the structure of the three algorithms to obtain an optimized architecture. A performance analysis of a 0.18µ ASIC implementation of the architecture is also presented.

1. Introduction
In this age of information technology, the important concern of security is handled by cryptographic algorithms. They ensure that the authentication, confidentiality, integrity and non-repudiation aspects of communication are not compromised. Hash algorithms are a class of cryptographic primitives which ensure integrity and authentication. They are further classified as MDCs (Manipulation Detection Codes) and MACs (Message Authentication Codes). MDCs ensure integrity only, while MACs can ensure both integrity and authentication: MDCs do not use a key, whereas MACs use a keyed hash function. A fixed-length hash value is computed from the input message in such a way that neither the contents nor the length of the message can be recovered from the hash (the one-way property). Furthermore, the probability that two different messages yield the same hash value is negligible (collision resistance). Popular MDCs used at present are MD5 (Message Digest Version 5), SHA1 (Secure Hash Algorithm 1)

T S B Sudarshan
Computer Science & Information Systems
Birla Institute of Technology & Science
Pilani, Rajasthan, India 333031
[email protected]

and RIPEMD160 (RACE Integrity Primitives Evaluation Message Digest 160) [1]. Hardware implementations of cryptographic primitives offer a multitude of advantages over software implementations, including lower power consumption and higher throughput. Hence, embedded hardware implementation of cryptographic primitives has become an actively explored topic. An FPGA implementation of MD5 is presented in [2], and unified architectures for hash primitives have also gained prominence [3][4][5]. This paper presents the design and implementation of a unified architecture for three commercially important MDCs on an ASIC. Despite the structural similarities among the algorithms, some fundamental differences make their implementation on a common architecture a challenging process. The highlights of the proposed design include an integrated message padding block and similar performance for all three algorithms. The designed core betters other implementations in terms of performance [3][4], resource utilization [5] or offered functionality [3][6]. The remainder of this paper is structured as follows. An analysis of the structure and working of the one-way hash algorithms is presented in Section 2, and Section 3 details the work done by elucidating the design approach for ‘HashChip’. The datapath and performance analysis of the ASIC implementation of the proposed architecture are also presented.
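As a quick illustration of the digest sizes involved, Python's hashlib can be used to compute MD5 and SHA1 digests; RIPEMD160 is omitted here because many stock OpenSSL builds no longer expose it:

```python
import hashlib

# Digest sizes of two of the MDCs discussed in this paper, checked via
# Python's hashlib.
msg = b"The quick brown fox jumps over the lazy dog"
md5_digest = hashlib.md5(msg).digest()    # 16 bytes = 128 bits
sha1_digest = hashlib.sha1(msg).digest()  # 20 bytes = 160 bits
```

Any change to the input message yields a completely different digest, which is the behavior the one-way and collision-resistance properties rely on.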

2. Message Digest Generation using Manipulation Detection Codes

Manipulation Detection Codes are usually non-keyed hash functions designed as iterative processes, hashing inputs of arbitrary length by processing successive fixed-size blocks of the input [7].
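This iterative structure can be sketched generically as follows; the `compress`, `pad` and `iv` arguments are placeholders for the algorithm-specific pieces, not the actual MD5/SHA1/RIPEMD160 functions:

```python
def iterated_hash(message: bytes, compress, pad, iv: bytes,
                  block_size: int = 64) -> bytes:
    # Generic iterated hashing: pad the input, then chain the compression
    # function over successive fixed-size blocks of the padded message.
    data = pad(message)
    cv = iv  # chaining variable, initialized with the initializing vector
    for i in range(0, len(data), block_size):
        cv = compress(cv, data[i:i + block_size])
    return cv  # the final chaining variable is the digest
```

The three algorithms considered here differ in their choice of `compress`, the Endianness used by `pad`, and the width of `iv` (128 or 160 bits), but all follow this skeleton.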

Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05) 0-7695-2315-3/05 $ 20.00 IEEE

Fig 01. Message Digest Generation using MD5 / SHA1 / RIPEMD160

Popular commercial MDCs include MD5 [8], SHA1 [9][10] and RIPEMD160 [11]. In all three algorithms, the fixed length of each processed block is 512 bits. The nature of the preprocessing in these algorithms is brought out in Fig. 01. Minor differences amongst the considered algorithms are detailed below. The message of ‘K’ bits is padded to ensure that its length in bits is 64 short of an integral multiple of 512. Padding is always performed, even if the length of the message is already congruent to 448 modulo 512. Padding consists of a single 1-bit followed by the necessary number of 0-bits, as shown in Fig. 01. A 64-bit representation of the message size (in bits) is then appended to the message. MD5 and RIPEMD160 use the Little-Endian format for this length field, while SHA1 uses the Big-Endian format. After the completion of the padding, the message length is an integral multiple of 512 bits. As shown in Fig. 01, each of the 512-bit blocks is used once in the complete processing by the chain of compression functions. The message digest size is 128 bits for MD5 and 160 bits for SHA1 and RIPEMD160 [1]. The structures of the compression functions of the three algorithms are brought out in Fig. 02. Each 512
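The padding scheme described above can be sketched as follows; this is a minimal illustration of the MD5/RIPEMD160 (Little-Endian) variant, with the SHA1 (Big-Endian) difference noted in a comment:

```python
import struct

def md5_style_pad(message: bytes) -> bytes:
    # Append a single 1-bit (the 0x80 byte), then 0-bits until the length
    # is congruent to 448 mod 512 bits (56 mod 64 bytes), then the original
    # message length in bits as a 64-bit integer. MD5 and RIPEMD160 store
    # the length Little-Endian; SHA1 would use ">Q" (Big-Endian) instead.
    bit_len = len(message) * 8
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    return padded + struct.pack("<Q", bit_len)
```

Note that a 56-byte message, already congruent to 448 modulo 512 bits, still gains a full extra block of padding, exactly as the text above requires.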

Fig 02. Compression Functions (Left to Right - MD5, SHA1 & RIPEMD160)

bit block of the message is subjected to four rounds of iterations in MD5 and SHA1, while there are two parallel instances of five rounds in RIPEMD160. To complicate matters further, each round consists of 16 elementary operations in MD5 and RIPEMD160, while there are 20 in SHA1. To obtain the chaining variable at the end of the compression function, the final result of the processing of the rounds is added to the chaining variable input at the beginning. This addition is performed on corresponding words in both MD5 and SHA1, but RIPEMD160 is much more complex, since there are five sets of three 32-bit words to add. The addition is not done on corresponding words, but according to the ordering in Fig. 02. At the heart of each of these hash algorithms is the elementary operation (also termed a single step) shown in Fig. 03. The most important fact to observe in the architecture of these elementary operations is that A, B, C, D and E are words of 32 bits each, and that E is missing in MD5, since the message digest it produces is only 128 bits long. All the additions involved

Fig 03. Elementary Operations (Left to Right - MD5, SHA1 & RIPEMD160)

are modulo 2^32, meaning that any carry generated beyond the 32 bits can be safely ignored. In all three algorithms, a 32-bit part of the message (X[k], Wt, Xi) is added to a constant (T[i], Kt, Kj) during each iteration. Another similarity is that B, C and D are input to a primitive function block in each step. The functions implemented on the three 32-bit inputs, however, vary from round to round and algorithm to algorithm. While the elementary operations of MD5 and RIPEMD160 directly use a 32-bit word from the 16 message words available, SHA1 expands the initially available 16 words to 80 words on the basis of the following transformation and uses each word once in the compression function iterations. Mt represents the t-th of the 16 32-bit words available from the current 512-bit block, and S1 represents a circular left shift of its argument by one place.

Wt = Mt                                        [0 ≤ t ≤ 15]
Wt = S1(Wt-16 ⊕ Wt-14 ⊕ Wt-8 ⊕ Wt-3)           [16 ≤ t ≤ 79]
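The SHA1 message expansion above can be sketched in a few lines; `block_words` is assumed to hold the sixteen 32-bit words of the current 512-bit block:

```python
def rotl32(x: int, n: int) -> int:
    # 32-bit circular left shift (the S1 operation when n == 1).
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def sha1_expand(block_words):
    # Expand the 16 words M0..M15 of a 512-bit block into the 80 words
    # W0..W79 consumed by the SHA1 compression function iterations.
    w = list(block_words)
    for t in range(16, 80):
        w.append(rotl32(w[t - 16] ^ w[t - 14] ^ w[t - 8] ^ w[t - 3], 1))
    return w
```

In the hardware described in Section 3, this recurrence is realized by the ‘SHA Reg File’ and ‘SHA Processing Block’ rather than by precomputing all 80 words up front.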

3. Hardware architecture design for the ‘HashChip’
The message to be hashed is assumed to be already available in a byte-organized memory bank, which supplies data synchronously depending on the validity of a data request signal given as input to it. The end of the message is signified by a sentinel byte, though the architecture can easily be reconfigured to respond instead to a special signal indicating that no further message bytes are available. Another input selects the algorithm to be applied to the data. The outputs of the system include the 160-bit final digest, a signal indicating whether the data available at the final digest output is valid, and a signal indicating either an internal error or an invalid algorithm request. Global clock and reset signals synchronize the operation of the datapath components. A padding block ensures that the algorithm starts processing as soon as the minimum requirement of 64 bytes has been input to the system. A micro-programmed control unit executes the iterations of the required compression function as soon as data is available for it. A holistic view of the designed datapath is presented in the following subsection.

3.1. The datapath for the ‘HashChip’
A high-level view of the architecture of the ‘HashChip’ is presented in Fig. 04. The components of the datapath can be broadly divided into six major categories. The ‘Padder & Memory Block’ handles the interfacing of the system to the external memory bank. It handles the storage of the message words as well as the various constants required during the different iterations. Even though MD5 and SHA1 have only one chain of rounds in their compression function, RIPEMD160 has two concurrently running chains.

Fig 04. Block Diagram of the ‘HashChip’ Datapath

Therefore it is necessary to duplicate some of the components from the ‘Main Compressor Block’ to the ‘Parallel Compressor Block’. The ‘Chaining Variable Updation Block’ handles the updating of the chaining variable in a generic manner at the end of the processing of each 512 bit block’s compression function iteration. A ‘Digest Generation Block’ ensures that the chaining variable value at the end of all the iterations of the compression functions is transferred in the proper Endian format as the value of the final digest. Also, the status of the chaining variable (as to whether it is valid or not) is generated in this block. Finally, a ‘Microcode Control Unit’ handles the issuing of the different control signals needed at various points in the architecture during the different iterations. The following subsection details the datapath components.

3.2. Datapath components of the ‘HashChip’
The ‘HashChip’ interfaces to a memory bank which contains the data to be hashed. The input from the memory bank is given to the padding block, which takes care of the preprocessing of the message. This block not only supplies 512 bits as soon as they are received, but also takes care of the message padding as outlined in Fig 01. The padding block has an internal buffer of size 64x8 and an FSM controller to take care of the padding [12]. Fig 05 shows the interfacing of the padding block to the other related components in the datapath. Once the buffer in the padding block fills up, the data gets transferred in groups of 32 bits (Little or Big-Endian) to the 16x32 ‘Register File’. This RAM block is designed for dual-port access since RIPEMD160 processing requires two message words during each iteration step (one for each of the parallel rounds). The complex message word selection methodology of SHA1 (outlined towards the end of Section 2) is implemented by the ‘SHA Reg File’ and ‘SHA Processing Block’ components. A dual-port ROM stores the various constants required (64 for MD5, 4 for SHA1 and 10 for RIPEMD160). In all iteration steps, a representation of the message word is added to a constant and the result is used in the

Fig 05. The Padder & Memory Block

Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05) 0-7695-2315-3/05 $ 20.00 IEEE

compressor blocks. This is performed in the datapath of Fig 05 too. The selection of message words for the first sixteen iterations of SHA1 is similar to the selection process for MD5 and RIPEMD160. The output of the ‘SHA Processing Block’ is used only for SHA1 iterations numbered seventeen to eighty. Also, the output of the processing block is written back to the ‘SHA Reg File’ for usage in future iteration steps. The elementary operations outlined in Fig 03 are implemented by the ‘Main Compressor Block’ (Fig 06) and the ‘Parallel Compressor Block’ (Fig 07). The registers A, B, C, D and E can be written with either the value of the corresponding chaining variables or the results from the previous step. Denoting the result of the adder and shifter processing as ‘FRT’, it is determined that the ‘A’ register needs to be updated with the value of D, or ‘FRT’ or E, the ‘B’ register needs to be updated with the value of ‘FRT’ or ‘A’ or ‘B’ and so on for all the registers, depending on the algorithm to implement. This is taken care of by the multiplexers placed before the registers. Since these registers need not be written during every clock cycle, a write enable signal controls their updating. The multiplexer block before the registers in the ‘Parallel Compressor Block’ is fortunately not as complex as their ‘Main Compressor Block’ counterpart since this segment of the architecture is devoted entirely to the processing of RIPEMD160 only. A common feature in all the three elementary operations is that a primitive function generator generates a combinational function of the data from the B, C and D registers. For all the three algorithms put together, there are seven distinct combinational functions. Depending on the algorithm, step and round, a particular combinational function is chosen. The output of this primitive function block is

Fig 06. The Main Compressor Block

Fig 07. The Parallel Compressor Block

Fig 08. The CV Updation Block (Left) Fig 09. The Digest Generation Block (Right) added to either ‘A’ (for MD5 and RIPEMD160) or ‘E’ (for SHA1). The result of this step is added to the output of the ‘Padder & Memory Block’ in all the three algorithms. For generating ‘FRT’, the output from the previous step is subjected to a variable circular left shift and added to ‘B’ in MD5 and ‘E’ in RIPEMD160, while for SHA1, the output is simply added to the contents of ‘A’ circular left shifted by five bits. Fig 08 shows the structure of the ‘Chaining Variable (CV) Updation Block’. It implements the adders shown in the ‘Compression Functions’ (Fig 02). The A, B, C, D and E registers need to be loaded with either the initializing vector or the chaining variable, depending on whether it is the first iteration of the first 512 bit block or of an intermediate 512 bit block (Fig 01). Thus, the chaining variable registers which supply the values to the intermediate registers need to be loaded with the initializing vector or the results of the end of the previous compression function iteration. During most of the iterations, however, the chaining variable registers need not be written to. For MD5 and SHA1, the updating needs to be done with the addition of the current value of the intermediate registers to the previous value of the chaining variable. In RIPEMD160, the addition is much more complex, and the situation is brought out pictorially in Fig. 08. The ‘Digest Generation Block’ is shown in Fig 09. The final digest must contain the value of the chaining variable after all compression function iterations. The multiplexer arrangement performs the conversion of the data in the chaining variable registers to the appropriate Endian format (Little Endian for MD5 and RIPEMD160, Big Endian for SHA1). The ‘DigestValid’ signal indicates the completion of all the compression function iterations for all the message blocks. 
The lower 32 bits of the final digest output are set to zero for MD5, since it generates a 128-bit message digest only. The micro-programmed control unit provides the control signals needed at various points in the datapath. It also generates the ‘Error’ signal, which indicates whether there is an internal error or an invalid algorithm request.
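For concreteness, one MD5 elementary operation (producing the quantity the text above calls ‘FRT’) can be sketched as follows; `f` stands for whichever primitive function the current round selects, and this is an illustrative software model, not the RTL itself:

```python
def rotl32(x: int, n: int) -> int:
    # 32-bit circular left shift, as performed by the shifter in the datapath.
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def md5_step(a: int, b: int, c: int, d: int, f, x: int, t: int, s: int) -> int:
    # One MD5 elementary operation: FRT = B + ROTL32(A + f(B,C,D) + X[k] + T[i], s),
    # with every addition taken modulo 2^32 so that any carry beyond 32 bits
    # is discarded, matching the description in Section 2.
    inner = (a + f(b, c, d) + x + t) & 0xFFFFFFFF
    return (b + rotl32(inner, s)) & 0xFFFFFFFF
```

The SHA1 and RIPEMD160 steps follow the same adder-and-shifter pattern, differing only in which register receives the rotation and which register is added last, which is exactly what the multiplexers before the A to E registers select.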

3.3. Performance analysis of the ‘HashChip’ architecture
The architecture of the datapath described above was coded at the RTL level using Verilog. The testbench was simulated extensively with the help of a GUI drawn up in Tcl/Tk, using Model Technology’s ModelSim 5.7G HDL simulator. The performance in terms of the number of cycles taken for hashing messages of different sizes was analyzed, and a graphical summary of the results obtained is presented in Fig 10. The number of clock cycles includes those spent in transferring the data from the memory bank into the internal buffers. Since the steps of the compression function finish well before the next set of 64 bytes is transferred into the padding block’s buffer, the difference in the number of clock cycles taken for MD5 and SHA1 or RIPEMD160 remains constant at 16. The first compression function iteration of MD5 takes 64 cycles, while SHA1 and RIPEMD160 take 80 cycles each, after which each block is processed in a constant amount of time, which equals the time taken for 64 bytes to get transferred from the memory bank to the internal register file through the buffer.

Fig 10. Number of Clock Cycles for Hashing Messages of Different Sizes

3.4. ASIC implementation details
Pre-silicon prototyping was initially carried out on a variety of FPGAs [13]. Cadence BuildGates and Encounter were used in the sign-off process. The 0.18µ GSCLib V2.0 (Generic Standard Cell Library) from Cadence was used as the target library. The generated placement floorplan and layout are shown in Figs 11 and 12 respectively. The performance analysis of the placed and routed design is brought out in Table 01. As compared to previously known implementations of unified hardware architectures for hash primitives, the highlights of our design are as enumerated below:
• Unification with a more commercially important MDC, namely, SHA1 [3]
• Integration of message padding into the core design [3][4][5]
• Almost four times better throughput in spite of padding latency considerations [4]
• Approximately equal processing times for all algorithms implemented [3][4][5][6]

Table 01. HashChip ASIC Performance Details (Post Place & Route)
Operating Frequency: 116 MHz
Throughput: 401.5 Mbps (Including Padding Latency); 824.9 Mbps (512-bit Block Throughput)
Gate Count: 70170 (Includes RAM Blocks)
Core Cell Area: 1.645 mm2
Average Power Dissipation: 1.98 mW (3V, 25°C)
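As a rough sanity check on the throughput figures, the raw compression rate follows from the block size, the clock frequency and the cycles per block. The per-algorithm step counts below (64 for MD5, 80 for SHA1/RIPEMD160) are taken from Section 3.3; the resulting per-algorithm peaks are back-of-envelope estimates, and the chip's reported 824.9 Mbps block throughput sits between them:

```python
def throughput_mbps(freq_mhz: float, cycles_per_block: int,
                    block_bits: int = 512) -> float:
    # Raw compression throughput: bits processed per block divided by the
    # time one block occupies the core (cycles_per_block / frequency).
    return block_bits * freq_mhz / cycles_per_block

# Hypothetical peak rates at the reported 116 MHz clock, assuming one
# elementary step per clock cycle:
md5_peak = throughput_mbps(116, 64)    # 928.0 Mbps
sha1_peak = throughput_mbps(116, 80)   # 742.4 Mbps
```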

4. Conclusion
A unified architecture, ‘HashChip’, has been drawn up for implementing three commercially important MDCs on a hardware platform, after analyzing the similarities in their structures. The results of the performance analysis of a 0.18µ ASIC implementation of the architecture are presented above. The proposed architecture presents a multitude of advantages over similar previous works. The throughput can be further improved by increasing the bus width of the communication between the memory bank and the hash core. Other possible future extensions include handling power dissipation concerns and mapping other cryptographic algorithms onto this architecture.


Fig 11. Placement Floorplan Generated by Cadence SOC Encounter (Amoeba Place)

5. Acknowledgements
The authors wish to thank the VLSI CAD support group at Iowa State University and Birla Institute of Technology & Science, Pilani. Further, Ganesh T S thanks Michael Hassel and Ulrich Zaus at Infineon Technologies AG, Munich, and Natarajan Viswanathan at Iowa State University for their invaluable help. Thanks are also due to the anonymous reviewers who helped in improving this paper.

Fig 12. Placed & Routed HashChip ASIC Generated by Cadence SOC Encounter

6. References
[1] William Stallings, Cryptography and Network Security – Principles and Practices, 3rd Edition. Pearson Education (Singapore), 2003.
[2] J. Deepakumara, H. M. Heys, and R. Venkatesan, “FPGA Implementation of MD5 Hash Algorithm,” Canadian Conference on Electrical and Computer Engineering, vol. 2, pp. 919-924, 2001.
[3] Chiu-Wah Ng, Tung-Sang Ng and Kun-Wah Yip, “A Unified Architecture of MD5 and RIPEMD-160 Hash Algorithms,” IEEE International Symposium on Circuits and Systems, May 2004.
[4] S. Dominikus, “A Hardware Implementation of MD4-Family Hash Algorithms,” Proc. 9th Int. Conf. on Electronics, Circuits and Systems, vol. 3, pp. 1143-1146, 15-18 Sep. 2002.
[5] Y. K. Kang, D. W. Kim, T. W. Kwon and J. R. Choi, “An Efficient Implementation of Hash Function Processor for IPSEC,” Proc. 2002 IEEE Asia-Pacific Conf. on ASIC, pp. 93-96, 6-8 Aug. 2002.
[6] Helion Technology, Datasheet – High Performance Dual MD5 and SHA-1 Hash Core for ASIC, available at http://www.heliontech.com/core7.htm
[7] A. Menezes, P. van Oorschot, and S. Vanstone, Handbook of Applied Cryptography. CRC Press, 1996.
[8] R. Rivest, RFC 1321 (The MD5 Message-Digest Algorithm). MIT Laboratory for Computer Science and RSA Data Security, Inc., 1992.
[9] D. Eastlake and P. Jones, RFC 3174 (US Secure Hash Algorithm 1 (SHA1)). Cisco Systems Inc., 2001.
[10] Secure Hash Standard, National Institute of Standards and Technology, FIPS PUB 180-2, Aug. 2002.
[11] H. Dobbertin, A. Bosselaers, B. Preneel, “RIPEMD-160: A Strengthened Version of RIPEMD,” Fast Software Encryption, LNCS 1039, D. Gollmann, Ed., Springer-Verlag, 1996, pp. 71-82.
[12] T S B Sudarshan and Ganesh T S, “Hardware Architecture for Message Padding in Cryptographic Hash Primitives,” 8th International IEEE VLSI Design & Test Workshop, Mysore, India, August 2004.
[13] Ganesh T S, T S B Sudarshan, Naveen Kumar Sreenivasan and Karthick Jayapal, “Pre-Silicon Prototyping of a Unified Hardware Architecture for Cryptographic Manipulation Detection Codes,” 3rd IEEE International Conference on Field Programmable Technology, Brisbane, Australia, December 2004.
