Novel Hardware Architecture for Implementing the ... - Semantic Scholar

1 downloads 0 Views 276KB Size Report
popular algorithms as MD5 and SHA-0 [2,3] (already broken). Currently, several hash function algorithms are being evaluated to select SHA-3 [4,5]. There have ...
2011 14th Euromicro Conference on Digital System Design

Novel Hardware Architecture for implementing the inner loop of the SHA-2 Algorithms Ignacio Algredo-Badillo∗ , Claudia Feregrino-Uribe† , Ren´e Cumplido† , Miguel Morales-Sandoval‡ ∗ Computer Engineering, University of Istmo, UNISTMO Tehuantepec, Oaxaca, Mexico 70760 Email: [email protected] † National Institute for Astrophysics, Optics and Electronics, INAOE Tonantzintla, Puebla, Mexico 72840 Email: cferegrino,[email protected] ‡ Polytechnic University of Victoria Cd. Victoria, Tamaulipas, Mexico 87138 Email: [email protected]

Abstract—Cryptographic algorithms are used to enable security services that are the core of modern communication systems. In particular, Hash functions algorithms are widely used to provide services of data integrity and authentication. These algorithms are based on performing a number of complex operations on the input data, thus it is important to count with novel designs that can be efficiently mapped to hardware architectures. Hash functions perform internal operations in an iterative fashion, which open the possibility of exploring several implementation strategies. In the paper, two different schemes to improve the performance of the hardware implementation of the SHA-2 family of algorithms are proposed. The main focus of the proposed schemes is to reduce the critical path by reordering the operations required at each iteration of the algorithm. Implementation results on an FPGA device show an improvement on the performance on the SHA256 algorithm when compared against similar previously proposed approaches.

I. I NTRODUCTION Digital communication systems are widely used for executing a wide variety of electronic operations such as: electronic transfers, mobile communications, multimedia, electronic commerce, document transfers, and videoconferences, among others. The information being transmitted by these systems can potentially be at risk if no security measures are taken. The most common risks found in these systems include: non authorized accesses, denial of service, data corruption, leakage, monitoring attacks, authentication, and trashing. Several measures can be taken to reduce such risks; among the most common are the use of firewalls, traffic filtering, and detection

978-0-7695-4494-6/11 $26.00 © 2011 IEEE DOI 10.1109/DSD.2011.75

systems. Additionally, security can also be provided by using tools based on cryptographic algorithms. There is a great variety of cryptographic algorithms grouped into symmetric and asymmetric as well as hash functions. These algorithms are used to provide a number of security functions that in turn are used to build secure communication systems. It is expected that these systems treat information in a trusted way and without allowing information to be manipulated by third parties. Hash functions are widely used to provide services of data integrity and authentication issues. These functions allow to maintain a high level of security by performing complex operations, usually in an iterative fashion, which require a significant amount of computing resources. There are several algorithms for performing hash functions: MD5, SHA-1, SHA-2, Whirlpool, Haval, and RipeMD-160, being the SHA family the most widely used, in particular SHA-2 that offers higher security and solved the insecurity problem of SHA-1 [1] and other popular algorithms as MD5 and SHA-0 [2,3] (already broken). Currently, several hash function algorithms are being evaluated to select SHA-3 [4,5]. There have been several works in the literature, both research papers and commercial products that offer hardware or software implementations of one or more algorithms from the SHA-2 family. In general, the hardware architectures that report better performance are based in the customization of hardware elements that efficiently compute specific functions. In this work, two ways for computing the main operation of the SHA-2 algorithm are proposed with the aim of improving the performance

543

TABLE I S ECURE H ASH A LGORITHMS

Message Block

Algorithm

SHA-1

SHA-256

SHA-384

SHA-512

Message size*

< 264

< 264

< 2128

< 2128

Block size*

512

512

1024

1024

Word size*

32

32

64

64

Message digest size*

160

256

384

512

Security*

80

128

192

256

256 H0t

512 Bt A

B

C

D

E

F

G

H

32

64 Iterations 256 32 32 A’

*Sizes and Security are specified in bits

A

B’

B

C’

C

D’

D

E’

E

F’

F

G’

G

H’

H

+ + + + + + + + 32

of the hardware implementation. Their main focus is to reduce the critical path by means of predicting the result of some computations and also by reducing latency of the calculations. This paper is organized as follows: section 2 describes the SHA-2 family, section 3 shows the details of data dependency in SHA-256 algorithm, section 4 presents the proposed computing for the main operations of SHA2 and architectures, section 5 details implementation details and discusses the results, and finally section 6 concludes.

32

32

32

32

32

32

32

H0 t+1 256

Message Digest

Fig. 1.

Block diagram of the SHA-256 algorithm

For i = 1 to N: Prepare the message Schedule Wt Initialize the eight working variables A,…,H For t = 0 to 63: Compute Temporal T1 Compute Temporal T2 Compute new H, G, F Compute new E = D + T1 Compute new D, C, B Compute new A = T1 + T2 Compute the ith intermediate hash value H(i) Generate the message digest with H(N)

II. S ECURE H ASH A LGORITHMS Hash functions are used to provide data integrity and when they are used in combination with digital signature algorithms and MACs, they also provide authentication. Among the most important hash function are the SHA family, which share the same functional structure with some variation in the internal operations, message size, message block size, word size, number of security bits and message hash size, see Table 1 [6]. These algorithms are iterative and one-way functions that input a message and output a message digest. They process the input data block in two stages: preprocessing and digest computing. The preprocessing warranties that the message has a size that is multiple of a particular value, allowing to divide the message into predefined block sizes and to provide an initial hash value. In the second stage, each message block is utilized during a fixed number of iterations, where each iteration defines functions, constants and word operations to generate a series of hash values. After all blocks are processed, the value of the final hash is used as the message digest. In particular, the second stage of the SHA-256 algorithm performs 64 iterations over blocks of 512-bit messages and hash values of 256 bits described as 8 32-bit words (A, B, . . . , H). The hash message is 256 bits long, see Figure 1.

Initial Hash Value

Fig. 2.

Pseudocode for the SHA-256 algorithm

III. R ELATED WORKS Implementations of hash functions require great amounts of spatial and temporal resources that may result in low performance and efficiency. This has motivated the research on several approaches for speeding up the computation or for improving the efficiency in terms of throughput/area ratio. The pseudocode shown in Figure 2 describes the SHA256 algorithm; more details are given in [6]. The general structure of the pseudocode is very similar for the other algorithms of the SHA-2 family. The need to efficiently implement the computations of the inner loop, that represents the iterations performed by the algorithm, has resulted in a number of implementation approaches. In general, depending of the system organization, hardware implementations report better performance than software only implementations [7] An important factor that limits the performance in both hardware and software implementations is the high

544

Round k AkBkCk

Dk EkFkGk X

X Y Z Maj Function

X Y Z Ch Function

0

+

Hk WkKk

Round k+1

X

+

1

+

+

T2

+ T1

+ + Ak Bk

Ck

Ek Fk Gk

Ak+1 Bk+1Ck+1 Dk+1Ek+1Fk+1Gk+1Hk+1 W

k+1

X

X Y Z Maj Function

0

X Y Z Ch Function

+

+

X 1

Kk+1

+

performance [9, 10]. These architectures will be analyzed in Section 6. IV. 4. P ROPOSED A RCHITECTURES In a straightforward hardware implementation of the SHA-256 algorithm, there are several ways of designing the inner part of the loop because of the number of additions that are required. As far as data dependencies are taken into account, it is possible to rearrange the order of the computation with the aim of increasing the performance. The operations in the inner loop of the algorithm compute new values for the variables A and E taking into account the eight previous values of the same variables, see equations (1) and (2). Additional processes are used for computing the remaining variables (B, C, D, F, G, H), which are updated by setting values of the previous value, more details in [6].

+

Ek+1 = Dk +Σ1 (Ek )+CH(Ek , Fk , Gk )+Hk +Kk +Wk

T2

(1)

+ +

Ak+1 = Σ0 (Ak ) + M AJ(Ak , Bk , Ck ) + Σ1 (Ek )+ + CH(Ek , Fk , Gk ) + Hk + Kk + Wk

+

(2) Ak+1Bk+1 Ck+1

Ek+1Fk+1 Gk+1

Ak+2 Bk+2Ck+2 Dk+2 Ek+2Fk+2Gk+2Hk+2 Fig. 3. Data dependency in the iterative process of the SHA-256 algorithm

In [9], authors propose the pre-computing of δk to save the calculation of a sum during the run time of the iteration, that is, to calculate δk in a previous iteration k-1 by using Gk−1 as described by equation (3). δk = Hk + Kk + Wk = Gk−1 + Kk + Wk

data dependency of the computations within the inner loop. Because of this, fully parallel implementations are not possible, however partial loop unrolling has been explored by several authors. When partial unrolling is used, data dependence is present when the iteration k+1 requires the results obtained for the 8 variables by iteration k, see Figure 3. In order to take advantage of the use of specialized cores or embedded systems, several architectures have been reported to iteratively operate without using parallelization or unrolling of the inner loop [8]. Other approaches consist on introducing a pre-computing stage in order to reduce the critical path. By adding some operations to the initialization process, pre-computing allows to reduce the number of adders on the critical path (reducing the data dependence), and increases the

(3) In equation (3), the values Hk , Kk and Wk must be pre-computed or must be present to obtain δk at time k and to perform the calculation of iteration k+1. In this last iteration, the computing of the new values A and E is based on equations (4) and (5). Ek+1 = Dk + Σ1 (Ek ) + CH(Ek , Fk , Gk ) + δk

(4)

Ak+1 = Σ0 (Ak ) + M AJ(Ak , Bk , Ck ) + Σ1 (Ek )+ + CH(Ek , Fk , Gk ) + δk

(5)

545

In this work, two new forms of reordering the operations of the inner loop are proposed: one based on the subtraction of Dk and the other one employs two previous calculations. In the first proposal of this work, the data flow is rearranged to obtain the operations of the sum described by equations (1) and (2). It calculates a variable τk , according to equation (6). τk = Hk + Kk + Wk + Dk

(6) The calculation at run time k is based on equations (7) and (8), where no pre-computing is executed.

in forward form derived from equations (1) and (2). The implementation proposed in [9] saves the calculation of two adders because they are pre-computed in a previous time k, (Figure 4b). This implementation calculates the inner loop operation according to equations (3), (4) and (5). The two proposals of this work are shown in Figures 4c and 4 d. Figure 4c shows the first proposed architecture that is based on rearranging the dataflow as described by equations (6), (7) and (8), whereas the second proposed architecture based on equations (9), (10) y (11) is shown in figure 4d. The calculations of the hardware architectures shown in the Figures 4b and 4d require an additional clock cycle to initialize the system, with the advantage of decreasing the data dependency.

Ek+1 = Σ1 (Ek ) + CH(Ek , Fk , Gk ) + τk

V. I MPLEMENTATION RESULTS AND DISCUSSION (7)

Ak+1 = Σ0 (Ak ) + M AJ(Ak , Bk , Ck )+ + Σ1 (Ek ) + CH(Ek , Fk , Gk ) + τk − Dk

(8) The second proposal of this work is based on the pre-computing of equations (3) and (9) at time k using a previous value at time k-1. This last one thing is the difference between τk and δk , equations (3) and (9), respectively. The main idea is to calculate another addition in a previous time k in order to reduce the critical path of the inner loop operations at the time k+1. δk0 = δk + Dk

(9)

In order to evaluate the proposed architectures in fair conditions, the four architectures described in section 4 were implemented on a common platform, see Figure 5. The aim is to compare them in terms of hardware resources, throughput and efficiency. This common platform is a straightforward implementation of the SHA256 algorithm that can be easily modified to accommodate the four architectures to perform the inner loop operations. The white blocks of each architecture in Figure 4 were implemented in the white block named Main Function of the basic platform, see Figure 5. These four architectures were implemented in VHDL language and synthesized for a Virtex-2 XC2VP-7 FPGA device using Xilinx’s ISE 10.1 tool. The implementation results are shown in Table 2. The 4 architectures require 65 cycles to process a 512-bit input block. The throughput was calculated using equation (12), whereas the efficiency with equation (13).

The calculation at run time k+1 is based on the equations (10) and (11).

T hroughput =

Data block size Clock time ∗ Clock cycles

(12)

Ek+1 = Σ1 (Ek ) + CH(Ek , Fk , Gk ) + δk0

(10) Ef f iciency = Ak+1 = Σ0 (Ak ) + M AJ(Ak , Bk , Ck ) + Σ1 (Ek )+

T hroughput N umber of Slices

(13)

+ CH(Ek , Fk , Gk ) + δk

(11) Figure 4 shows the architectures for processing the inner loop operations. Figure 4a shows the architecture

The forward architecture (Figure 4a) requires less area and reports the best efficiency, although it achieved the worst throughput. The first proposed architecture named SHA-256c (Figure 4c) reports an improved throughput,

546

At Bt Ct

Ht Wt Kt

Dt Et Ft Gt X

X Y Z Maj Function

X

X Y Z Ch Function

S0

+

At Bt Ct

+

+

T2

T1

Ct

Et Ft

+ + +

At Bt

Gt

X

+

Et Ft

Gt

(b)

Dt Et Ft Gt X Y Z Ch Function

S0

Ct

At+1 Bt+1 Ct+1 Dt+1 Et+1 Ft+1 Gt+1 Ht+1

(a)

X Y Z Maj Function

Wt Kt G W K Delta Function

S1

+

At+1 Bt+1 Ct+1 Dt+1 Et+1 Ft+1 Gt+1 Ht+1

At Bt Ct

X

X Y Z Ch Function

T2

+

At Bt

S0

+

+

+

X

X Y Z Maj Function

S1 +

Ht

Dt Et Ft Gt

X

+

+1

S0

D G W K DeltaP Function DT DP

T2

+

+

Ht X

X Y Z Ch Function

+

+

+

X

X Y Z Maj Function

S1 + +

T2

Dt Wt Kt Et Ft Gt

At Bt Ct

Wt Kt Ht

S1

+ + +

+

At Bt

Ct

Et Ft

At Bt

Gt

At+1 Bt+1 Ct+1 Dt+1 Et+1 Ft+1 Gt+1 Ht+1

Ct

Et Ft

Gt

At+1 Bt+1 Ct+1 Dt+1 Et+1 Ft+1 Gt+1 Ht+1

(c)

(d)

Fig. 4. Block diagrams for computing the inner loop operations of the algorithm SHA-256: (a) straightforward architecture, (b) architecture with basic pre-computing, (c) proposed architecture with reordering of the data flow and (d) proposed architecture with two pre-computations SHA256_Architecture

ADDRESS, SEL1, SEL2, SELH0, WR

START Control

+ + +

SEL2

SEL1 MESSAGE BLOCK E,F,G E A,B,C A CLK RST

Memory64x32 Message_ Schedule

WT

Ch_Function

+

HASH

+ +

MESSAGE DIGEST

+

Main_ Function

SELH0

A’,E’

BIGSIGMA1

REG Maj_Function

A,B,C,E,F,G

H0

MUX

ADDRESS

KT

+

SEL2 WR

Buffer8x32

BIGSIGMA0 X

Fig. 5.

Architecture of the SHA-256 algorithm used to evaluate the proposed approaches

requiring more hardware resources, which in turn decreases the efficiency. The two architectures that execute

pre-computations report poorer efficiencies, but with the highest throughput. The second proposed architecture,

547

TABLE II I MPLEMENTATION RESULTS OF THE PROPOSED ARCHITECTURES Architecture

Hardware Resources (Slices)

Clock Frequency (MHz)

Throughput (Mbps)

Efficiency (Mbps/Slice)

SHA-256a

966 S, 1335 LUTs, 65 cycles

97.02

764.23

0.791

SHA-256b

1097 S, 1365 LUTs, 65 cycles

101.32

798.15

0.727

SHA-256c

1004 S, 1412 LUTs, 65 cycles

99.14

780.98

0.777

SHA-256d

1117 S, 1396 LUTs, 65 cycles

104.06

819.74

0.734

named SHA-256d (Figure 4d), requires more hardware resources because of the pre-computations that it performs, however it achieves the highest throughput. The proposed architectures can be seen as single unrolled designs thus can be used as basis to explore architectures with several unrolled rounds. There are several commercial and academic implementations of the SHA-256 algorithm reported in the literature [8-22]. For the commercial implementations reported in [11] and [12], internal details of the architectures are not reported, thus they will not be included in the discussion. Academic implementations are based on several approaches such as straightforward design, unrolled/pipelined designs, hardware/software implementations and designs using pre-computations. Because this work proposes some modifications to the way the operations of the inner loop are performed, only architectures that were implemented using similar approaches are briefly described [8,10,16,17,18]. Some features of these works are listed in Table 3. Even though these architectures were implemented using a similar approach to the one used in this work, a fair comparison is still not feasible because of variations in the device used as well as the software tools. Thus the information in Table 3 is provided only as a reference. In [16, 21], forward architectures that calculate three hash functions, including the SHA-256, are reported; the implementation in [16] only computes the first stage of the algorithm. [8, 10, 17, 18] report architectures with basic iterative architectures, where optimizations are done to decrease the critical path. In [10], authors propose to use three operands adders. In [17], the reported architecture calculates three different hash functions: MD5, SHA-1 and SHA 224/256, whereas [18] reports an architecture to perform the ciphering algorithm AES with different key sizes and the hash functions SHA1 and SHA-256. Some other works are implemented using different approaches such as hardware/software codesign [13, 14, 15], or use of parallelization techniques, pipelining and loop unrolling [9,19,20,22].

VI. C ONCLUSION Two different schemes to improve the performance of the hardware implementation of the SHA-2 family were proposed. These approaches are based on rearranging the computation of the inner loop of the algorithms. Both architectures proposed in this paper exhibit high performance, throughput and efficiency. These architectures can be further improved by exploring some improvements such as: balancing of data paths, introducing adders with 3 operands, loop unrolling, and pipelining. These modifications should results in an improved efficiency. Although the results were focused on the SHA256 algorithm, these approaches can be extended to the other algorithms of the SHA-2 family.

548

R EFERENCES [1] Xiaoyun Wang, Yiqun Lisa, Hongbo Yu, ”Finding Collisions in the Full SHA-1”, In Proceedings of CRYPTO, LNCS 3621, pp. 17-36, ISBN: 3-540-28114-2, Springer, 2005. [2] Praveen Gauravaram, Adrian McCullagh, Ed Dawson, ”Collision Attacks on MD5 and SHA-1: Is this the ’Sword of Damocles’ for Electronic Commerce”, In Proceedings of the AusCERT Asia Pacific Information Technology Security Conference (AusCERT2006), pp. 73-88, 2006. [3] Eli Biham, Rafi Chen, Antoine Joux, Patrick Carribault, Christophe Lemuet, William Jalby, ”Collisions of SHA-0 and Reduced SHA-1”, Eurocrypt 2005, LNCS 3494, pp. 36-57, 2005. [4] Andrew Regenscheid, Ray Perlner, Shu-jen Chang, John Kelsey, Mridul Nandi, Souradyuti Paul, ”Status Report on the First Round of the SHA-3 Cryptographic Hash Algorithm Competition”, Technical Report, NIST, September 2009. [5] Elena Andreeva, Bart Mennink, Bar Preneel, ”Security Reductions of the Second Round SHA-3 Candidates”, Lecture Notes in Computer Science, 2011, Volume 6531, pp: 39-53, 2011. [6] Federal Information Processing Standards Publication 180-2, ”Announcing the Secure Hash Standard”, U. S. DoC/NIST, August 2002. [7] Thomas Wollinger, Jorge Guajardo, Christof Paar, ”Cryptography on FPGAs: State of the Art Implementations and Attacks”, ACM Special Issue Security and Embedded System, March 2003. [8] Ryan Glabb, Laurent Imbert, Graham Jullien, Arnaud Tisserand, Nicolas Veyrat-Charvillon, ”Multi-mode operator for SHA-2 hash functions”, Journal of Systems Architecture: the EUROMICRO Journal, Vol. 53, Issue 2-3, February 2007.

TABLE III R ELEVANT HARDWARE ARCHITECTURES FOR THE ALGORITHM SHA-256. Work

Device

Hardware Resources and clock frequency

Throughput (Mbps)

Efficiency (Mbps/Slice)

[8]

V200/400XCV

1306 Slices, 77 MHz, 66 cycles

308

0.236

[10]

XCV-1000

1038 LUTs, 39.5 MHz

316

-

[16]

XV200-6

2384 CLBs, 1 BRAM, 74 MHz

291

-

[17]

Stratix II EP2S60

1380 Slices

772

0.516

[18]

Virtex-5

80 MHz, 65 cycles

630

-

[9] Ricardo Chaves, Georgi Kuzmanov, Leonel Sousa, Stamatis Vassiliadis, ”Improving SHA-2 Hardware Implementations”, In Proceedings of Cryptographic Hardware and Embedded Systems (CHES 2006), LNCS 4249, pp: 298-310, 2006. [10] Damian B. Fedoryka, ”Fast Implementation of the Secure Hash Algorithm SHA-256”, Thesis, George Mason University, Department of Electrical and Computer Engineering, 2004. [11] Helion Technology Limited, ”Fast SHA-256 Hash Core for Xilinx FPGA”, Datasheet, June 2010. [12] Helion Technology Limited, ”Tiny Hash Core Family for Xilinx FPGA”, Datasheet, June 2010. [13] Murat As¸kar, Tu˘ g ba S¸ıltu C¸elebi, ”Design and FPGA Implementation of Hash Processor”, In Proceedings of the Information Security & Cryptology Conference with International Participation, pp. 85-89, December 2007. [14] Kurt K. Ting, Steve C. L. Yuen, K. H. Lee, Philip H. W. Leong, ”An FPGA based SHA-256 Processor”, Proceedings of the Nineth International Workshop on Field Programmable Logic and Applications (FPL’02), pp. 577-585, 2002. [15] Marcio Juliato, Catherine Gebotys, ”Tailoring a Reconfigurable Platform to SHA-256 and HMAC through Custom Instructions and Peripherals”, 2009 International Conference on Reconfigurable Computing and FPGAs, pp. 195-200, December 2009. [16] N. Sklavos, O. Koufopavlou, ”Implementation of the SHA-2 Hash Family Standard Using FPGAs”, The Journal of Supercomputing, Vol. 31, Issue 3, pp 227-248, March 2005. [17] Sylvain Ducloyer, Romain Vaslin, Guy Gogniat, Eduardo Wanderley, ”Hardware implementation of a multi-mode hash architecture for MD5, SHA-1 and SHA-2”, Workshop on Design and Architectures for Signal and Image Processing, November 2007. [18] Mihai Togan, Adrian Floarea, Gigi Budariu, ”Design and Implementation of Cryptographic Modules on FPGA”, In Proceedings of the Applied Mathematics and Informatics, pp. 149-154, December 2010. [19] Medien Zeghid, Belgacem Bouallegue, Mohsen Machhout, Adel Baganne, Rachel Tourki, ”Architectural design features of a programmable high throughput reconfigurable SHA-2 Processor”, Journal of Information Assurance and Security2 (2008), pp. 147-158, November 2007. [20] H. E. Michail, G. S. Athanasiou, A. A. Gregoriades, Ch. L. Panagiotou, C. E. Goutis, ”High Throughput Hardware/Software Co-Design Approach for SHA-256 Hashing Cryptographic Module in IPSec/IPv6”, Global Journal of Computer Science and Technology, vol. 10, issue 4, pp 54-59, June 2010. [21] N. Sklavos, O. Koufopavlou, ”On the Hardware Implementations of the SHA-2 (256, 384, 512) Hash Functions”, Proceedings of IEEE International Symposium on Circuits & Systems (IEEE ISCAS’03), Vol. V, pp. 153-156, 2003.

[22] Robert P. McEvoy, Francis M. Crowe, Colin C. Murphy, William P. Marnane, ”Optimisation of the SHA-2 Family of Hash Functions on FPGAs”, IEEE Computer Society Annual Symposium on VLSI: Emerging VLSI Technologies and Architectures (ISVLSI’06), pp.317-322, 2006.

549