Parallel & Cloud Computing
Jan. 2013, Vol. 2 Iss. 1, PP. 1-6
Exploiting Parallelism on Keccak: FPGA and GPU Comparison
Fábio Dacêncio Pereira¹, Edward David Moreno Ordonez², Ivan Daun Sakai¹, Allan Mariano de Souza¹
¹University Center Euripides of Marília, Marília, Brazil
²Federal University of Sergipe, Aracaju, Brazil
[email protected]; [email protected]; [email protected]; [email protected]

Abstract- One of the methods to ensure information integrity is the use of hash functions, which generate a stream of bytes (the hash) that must be unique. However, most current functions can no longer prevent malicious attacks or guarantee that a piece of information has a single, collision-free hash. To address this problem, the National Institute of Standards and Technology (NIST) invited the scientific community, through an open competition, to create a new hash function standard, called SHA-3. This work explores one of the finalist algorithms of the competition, Keccak, and proposes a pipeline architecture implemented on an FPGA in order to obtain performance data. Finally, the pipeline implementation of Keccak is compared with implementations on GPUs.

Keywords- Information Integrity; Hash Functions; Pipeline Architecture; SHA-3 Keccak; GPU and FPGA Implementation

I. INTRODUCTION

Information integrity is a goal of information security that stands out in the current scenario. One of the techniques used to ensure information integrity is the use of hash functions.

Among current hash algorithms, the Message-Digest Algorithm 5 (MD5) and the Secure Hash Algorithm family (SHA-1 and SHA-2) stand out. MD5, developed by RSA Data Security, is now commonly used to check file integrity. The SHA family was developed by the National Security Agency (NSA) and published by the National Institute of Standards and Technology (NIST), which standardized the function in the U.S. SHA-2 is the most widely used in applications requiring high-security integrity.

However, successful attacks have been reported against the MD5 algorithm [3] as well as against SHA-0 and SHA-1 [4]. These attacks generate collisions, violating the principle of hash functions, which is to ensure information integrity. SHA-2 is currently still considered safe, but since it shares a structure similar to its predecessor SHA-1, doubts have been raised about its long-term security.

In response, NIST opened a competition in 2007 with the objective of choosing a new hash function: the Cryptographic Hash Algorithm Competition, which will select a successor to the SHA family. Technology centers, companies and the scientific community were invited to submit proposals for the new standard hash function, SHA-3. Once submitted, the candidates were exposed and evaluated in several aspects.

NIST received sixty-four entries by October 31, 2008; fifty-one candidate algorithms were selected to advance to the first round on December 10, 2008, and fourteen advanced to the second round on July 24, 2009. A year was allocated for the public review of the fourteen second-round candidates [10].

NIST received significant feedback from the cryptographic community. Based on the public feedback and internal reviews of the second-round candidates, NIST selected five SHA-3 finalists - BLAKE, Grøstl, JH, Keccak, and Skein - to advance to the third (and final) round of the competition on December 9, 2010, which ended the second round [10].

Submitters of the finalist algorithms were allowed to make minor modifications to their algorithms and submit the final packages to NIST by January 16, 2011. A one-year public comment period was planned for the finalists. NIST also planned to host a final SHA-3 Candidate Conference in the spring of 2012 to discuss the public feedback on these candidates, and to select the SHA-3 winner later in 2012 [10].

NIST maintains a forum about hash functions. The extensive documentation and reference models of Keccak were decisive in choosing this finalist algorithm for this research project.

In this context, this work studies one finalist algorithm for SHA-3 (Keccak) and proposes a pipeline architecture to exploit parallelism in hardware (FPGA), in order to obtain performance data for comparison with related work. This paper presents the proposed pipeline architecture, as well as structural improvements to increase its performance. The results are analyzed and compared with a reference architecture provided by the authors of Keccak. Finally, the pipeline implementation of Keccak is compared with implementations on GPUs.

II. KECCAK ALGORITHM

The design philosophy of Keccak is the hermetic sponge strategy [6]. It uses the sponge construction to obtain provable security against all generic attacks. It calls a permutation that should have no structural properties, with the exception of a compact description [1].

Keccak is a family of hash functions based on the sponge construction, and hence is a sponge function family. In Keccak, the underlying function is a permutation chosen from a set of seven Keccak-f permutations, denoted Keccak-f
[b], where b ∈ {25, 50, 100, 200, 400, 800, 1600} is the width of the permutation. The width of the permutation is also the width of the state in the sponge construction [2].

The state is organized as an array of 5×5 lanes, each of length w ∈ {1, 2, 4, 8, 16, 32, 64} (b = 25w). When implemented on a 64-bit processor, a lane of Keccak-f[1600] can be represented as a 64-bit CPU word. The Keccak[r, c] sponge function, with capacity c and bitrate r, is obtained by applying the sponge construction to Keccak-f[r+c] together with a specific padding of the message input.

Keccak-f is described by the pseudo-code below. The number of rounds nr depends on the permutation width and is given by nr = 12 + 2l, where 2^l = w. This gives 24 rounds for Keccak-f[1600].

Round[b](A,RC) {
  θ step
    C[x] = A[x,0] xor A[x,1] xor A[x,2] xor A[x,3] xor A[x,4]
    D[x] = C[x-1] xor rot(C[x+1],1)
    A[x,y] = A[x,y] xor D[x]
  ρ and π steps
    B[y,2*x+3*y] = rot(A[x,y], r[x,y])
  χ step
    A[x,y] = B[x,y] xor ((not B[x+1,y]) and B[x+2,y])
  ι step
    A[0,0] = A[0,0] xor RC
  return A
}

All operations on the indices are done modulo 5. A denotes the complete permutation state array, and A[x,y] denotes a particular lane in that state. B[x,y], C[x] and D[x] are intermediate variables. Rot(W, r) is the usual bitwise cyclic shift operation, moving the bit at position i into position i+r (modulo the lane size). The constants r[x,y] are the cyclic rotation offsets and are specified in Table I.

TABLE I CONSTANTS r[x,y] – KECCAK ALGORITHM

The constants RC[i] (see Table II) are the round constants. Table II specifies their values in hexadecimal notation for lane size 64; for smaller lane sizes they must be truncated.

TABLE II CONSTANTS RC[i] – KECCAK ALGORITHM

The four steps (θ, ρπ, χ, ι) of the Keccak hash function have a first-level data dependency, i.e., the current step depends only on the outcome of the previous step. This feature allows exploring parallelism techniques in hardware. In this context, this paper presents a proposed architecture that exploits this parallelism using the pipeline technique.

III. RELATED WORK

In this article we used the reference architectures made available by the authors on the official Keccak site [5]. Keccak allows trading off area for speed and vice versa, and different architectures reflect different trade-offs. The two architectures investigated and implemented there reflect the two ends of the spectrum: a high-speed core and a low-area coprocessor. In this work we used the reference architecture that emphasizes high performance.

The architecture of the high-speed core design is depicted in Fig. 1. It is based on the plain instantiation of the combinational logic for computing one Keccak-f round, used iteratively.

Fig. 1 Top-level combinational architecture (high performance) [5]
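For concreteness, the round steps and the sponge construction described in Section II can be sketched in software. The following is a hedged Python sketch (not the VHDL architecture discussed in this paper): it implements Keccak[r=1088, c=512] with the 0x01 pad10*1 padding of the SHA-3 competition submission, deriving the rotation offsets and round constants algorithmically instead of reading them from Tables I and II.

```python
# Compact software sketch of Keccak[r=1088, c=512] (competition-era
# 0x01 padding, not the later FIPS 202 SHA3-256 variant, which uses 0x06).

def rol64(a, n):
    """Cyclic left rotation of a 64-bit lane."""
    n %= 64
    return ((a << n) | (a >> (64 - n))) & (2**64 - 1)

def keccak_f1600(lanes):
    """Apply the 24 rounds (theta, rho+pi, chi, iota) to a 5x5 lane array."""
    R = 1  # LFSR state used to derive the round constants RC[i]
    for _ in range(24):
        # theta step
        C = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        D = [C[(x - 1) % 5] ^ rol64(C[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ D[x] for y in range(5)] for x in range(5)]
        # rho and pi steps: B[y, 2x+3y] = rot(A[x, y], r[x, y])
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi step
        for y in range(5):
            T = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = T[x] ^ ((~T[(x + 1) % 5]) & T[(x + 2) % 5])
        # iota step: XOR the round constant into lane A[0, 0]
        for j in range(7):
            R = ((R << 1) ^ ((R >> 7) * 0x71)) % 256
            if R & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes

def keccak256(data, rate=136):
    """Sponge construction: absorb r-bit blocks, then squeeze a 256-bit hash."""
    state = bytearray(200)                      # b = 1600 bits
    padded = bytearray(data) + b"\x01"          # pad10*1, domain byte 0x01
    padded += b"\x00" * (-len(padded) % rate)
    padded[-1] ^= 0x80
    for off in range(0, len(padded), rate):     # absorbing phase
        for i in range(rate):
            state[i] ^= padded[off + i]
        lanes = [[int.from_bytes(state[8 * (x + 5 * y):][:8], "little")
                  for y in range(5)] for x in range(5)]
        lanes = keccak_f1600(lanes)
        for x in range(5):
            for y in range(5):
                state[8 * (x + 5 * y):8 * (x + 5 * y) + 8] = \
                    lanes[x][y].to_bytes(8, "little")
    return bytes(state[:32])                    # squeezing phase (one block)
```

With these parameters, keccak256(b"") should reproduce the widely published Keccak-256 digest of the empty message; swapping the padding byte 0x01 for 0x06 would instead give FIPS 202's SHA3-256.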
The core is composed of three main components: the round function, the state register and the input/output buffer. The use of the input/output buffer decouples the core from a typical bus used in a system-on-chip (SoC).

In the absorbing phase, the I/O buffer allows the simultaneous transfer of the input through the bus and the computation of Keccak-f for the previous input block. Similarly, in the squeezing phase it allows the simultaneous transfer of the output through the bus and the computation of Keccak-f for the next output block. The high-speed core can be modified to optimize for different aspects. In many systems the clock frequency is fixed for the entire chip, so even if the hash core can reach a high frequency it has to be clocked at a lower one. In such a scenario Keccak allows instantiating two, three, four or even six rounds in combinational logic and computing them in one clock cycle [5].

The replication of the high-speed core has a direct impact on the area consumed by the dedicated combinational logic. Our proposal achieves significant results in the area × performance relationship, as can be seen in Sections IV and V of this paper.

Other works highlight techniques for performance gains in the throughput of hash functions. Hash functions such as SHA-1, SHA-2 and Whirlpool have been fully explored [7][8][9]. Many papers use the pipeline as a solution to improve throughput; some related works are highlighted below.

In their work, Kotturi and Yoo [7] exploited the Whirlpool hash function, which should be capable of processing input data streams at high speeds. They proposed a fully synchronous parallel pipelined architecture with ten stages of pipelining between rounds, and internal pipelining within each round stage. The proposed architecture can greatly improve the performance of the Whirlpool hash function: their implementation can process continuous bit streams seamlessly, achieving a throughput of 56.89 Gbps, which is considerably high compared to existing implementations in the literature.

According to Michail [8], hash functions are widely used in applications that call for data integrity and signature authentication in electronic transactions; a hash function is utilized in the security layer of every communication protocol. As time passes, more sophisticated applications arise that address more users-clients and thus demand higher throughput. Furthermore, due to the market tendency to minimize devices' size and increase their autonomy to make them portable, power issues must also be considered. Existing SHA-1 hash function implementations (SHA-1 is common in many protocols, e.g. IPSec) limit throughput to a maximum of 2 Gbps. In that work, a new implementation exceeds this limit, improving the throughput by 53%. Furthermore, power dissipation is kept low compared to previous works, so the proposed implementation can be characterized as low-power.

Finally, Chaves [9] proposes a set of new techniques to improve the implementation of the SHA-2 hashing algorithm. These techniques consist mostly of operation rescheduling and hardware reutilization, allowing a significant reduction of the critical path while also decreasing the required area. Both the SHA256 and SHA512 hash functions were implemented and tested on VIRTEX II Pro prototyping technology. Experimental results suggest improvements over related SHA256 art above 50% when compared with commercial cores and 100% with respect to academic art, and above 70% for the SHA512 hash function. The resulting cores are capable of achieving the same throughput as the fastest unrolled architectures with 25% less area occupation than the smallest proposed architectures. The proposed cores achieve a throughput of 1.4 Gbit/s and 1.8 Gbit/s with a slice requirement of 755 and 1667 for SHA256 and SHA512 respectively, on a XC2VP30-7 FPGA.

IV. PIPELINE ARCHITECTURE

A pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.

In this context, the Keccak algorithm has characteristics that enable the use of pipeline parallelism as a solution, since there is only a first-level data dependency between the data processing elements connected in series.

Figure 2 shows how the pipeline is explored in the proposed architecture for the Keccak algorithm. Each pipeline stage is represented by the elements step | block | round; e.g., T | 0 | 2 means step = Θ, block = 0, round = 2.

In the pipeline architecture, four different data blocks are operated on simultaneously. As can be seen in Table III, the first data block starts processing in cycle 1 [T|1|0] and finishes its first round in cycle 4 [I|1|0]. The first-level data dependency allows the pipeline to stay full until the end of execution, eliminating the structural, data and control hazards commonly found in pipeline implementations. However, there is an initial cost of four cycles to fill all stages of the pipeline, and a final cost of four cycles to empty them.

TABLE III PIPELINE STAGES DEMONSTRATION
(cycles 1-4: round 0; cycles 5-8: round 1)

Cycle    1      2      3      4      5      6      7      8
Θ      T|1|0  T|2|0  T|3|0  T|4|0  T|1|1  T|2|1  T|3|1  T|4|1
ρπ      -     P|1|0  P|2|0  P|3|0  P|4|0  P|1|1  P|2|1  P|3|1
χ       -      -     X|1|0  X|2|0  X|3|0  X|4|0  X|1|1  X|2|1
ι       -      -      -     I|1|0  I|2|0  I|3|0  I|4|0  I|1|1

The initial and final cost becomes insignificant when the algorithm runs over a large body of data, where new data blocks are added to the pipeline without the cost of filling and emptying.
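The schedule of Table III can be reproduced with a small simulator. The sketch below is illustrative only: the stage names and the (step | block | round) notation follow the paper, while the function names and the dictionary representation are choices made here, not part of the proposed hardware.

```python
# Illustrative simulator of the 4-stage pipeline schedule of Table III.
# Steps: T (theta), P (rho+pi), X (chi), I (iota).

STEPS = ["T", "P", "X", "I"]

def pipeline_schedule(n_blocks=4, n_rounds=2):
    """Map (cycle, step) -> 'step|block|round' for an always-full pipeline."""
    # Blocks 1..n_blocks enter stage T one per cycle; after a round's last
    # block, block 1 re-enters for the next round with no bubbles, because
    # each step depends only on the previous step's output.
    jobs = [(b, r) for r in range(n_rounds) for b in range(1, n_blocks + 1)]
    sched = {}
    for i, (b, r) in enumerate(jobs):
        for s, step in enumerate(STEPS):
            sched[(i + s + 1, step)] = f"{step}|{b}|{r}"  # cycles start at 1
    return sched

def total_cycles(n_blocks, n_rounds):
    """One result per cycle once full, plus the cycles needed to drain."""
    return n_blocks * n_rounds + len(STEPS) - 1
```

In this model, the first block's first round ends at cycle 4 (I|1|0) and cycle 8 holds I|1|1, exactly as in Table III; hashing 4 blocks through the 24 rounds of Keccak-f[1600] would take total_cycles(4, 24) = 99 cycles.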
Figure 2 represents the top-level architecture of the Keccak pipeline. At each stage of the pipeline it is necessary to create a buffer capable of storing the information processed in the previous iteration. This buffer stores the partial state array processed in the previous cycle.

In the first round the pipeline is filled with the original blocks of data; after this round, the processed blocks are reused in the new round. Thus, it was necessary to create an asynchronous multiplexer to determine the origin of the data to be processed in a specific iteration.

The stage barriers add a significant area to the final hardware; however, they allow separating simultaneous processing lines. This can improve the performance of the Keccak algorithm when considering the throughput × area relationship (see Section V).

Fig. 2 Pipeline architecture top-level

V. IMPLEMENTATION AND RESULTS

The architecture presented in Section IV was described in VHDL and implemented on a Virtex 5 (XC5VLX50FF324-3) FPGA. After implementation it was possible to detect improvements that could still be made. To analyze the pipeline, each operation (Θ, ρπ, χ, ι) was synthesized individually.

It was then possible to associate weights with the propagation times of these operations. The results can be seen in Table IV.

TABLE IV STATISTICS OF THE ROUND OPERATIONS

Operation   Slices   Frequency
Θ           1040     323 MHz
ρπ           721     467 MHz
χ            648     435 MHz
ι            589     456 MHz

We can observe that the operation theta (Θ) has a propagation time significantly different from the other operations. As the control of the pipeline is synchronous, the stage that consumes the largest propagation time determines the frequency of transitions between pipeline stages. In this sense, it was proposed to subdivide the theta (Θ) stage into two phases, transforming the proposed four-stage pipeline into a five-stage pipeline with similar propagation times per stage, improving the throughput of the system. This in fact occurred, as can be seen in the new data presented in Table V. As expected, the total area of the module increased, since a new pipeline stage barrier was included.

TABLE V STATISTICS OF THE ROUND OPERATIONS (THETA SUBDIVISION)

Operation   Slices   Frequency
Θ1           621     424 MHz
Θ2           523     452 MHz
ρπ           721     467 MHz
χ            648     435 MHz
ι            589     456 MHz

With the uniform division of the processing load among the pipeline stages, an improvement in the total performance of the Keccak module was obtained.

Finally, for a fair comparison of the proposed pipeline architecture with the reference architecture, both source codes were synthesized using the same settings and constraints in Xilinx ISE 13.0. The results can be seen in Table VI.

TABLE VI FINAL IMPLEMENTATION STATISTICS

Architecture   Slices         Frequency        Throughput
Reference      2640           122 MHz          5.2 Gbit/s
Pipeline       3117 (+18%)    452 MHz (×3.7)   7.7 Gbit/s (+48%)

As we can observe, the area of the proposed pipeline architecture is 18% greater than that of the reference architecture [5]; however, we obtained a higher circuit frequency, which has a direct impact on the final throughput. Without a large increase in area it was possible to obtain a significant performance gain of 48% over the reference architecture.

The ratio of slices consumed per Gbit/s demonstrates the gain obtained more precisely. The reference architecture requires 2640 slices / 5.2 Gbit/s = 507 slices per Gbit/s of throughput, while the pipeline architecture requires 3117 / 7.7 = 404 slices per Gbit/s. In conclusion, the pipeline architecture consumes 20% less area per Gbit/s of throughput.
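The arithmetic behind Section V's results can be checked directly. The values below are taken from Tables IV-VI of this paper; the variable names are illustrative.

```python
# Worked check of Section V's figures (values from Tables IV-VI).
# A synchronous pipeline is clocked by its slowest stage:
four_stage = {"theta": 323, "rho_pi": 467, "chi": 435, "iota": 456}             # MHz
five_stage = {"theta1": 424, "theta2": 452, "rho_pi": 467, "chi": 435, "iota": 456}

print(min(four_stage.values()))   # the 323 MHz theta stage limits the 4-stage design
print(min(five_stage.values()))   # 424 MHz after subdividing theta into two phases

# Area efficiency (Table VI): slices consumed per Gbit/s of throughput.
ref_slices, ref_tput = 2640, 5.2      # reference architecture
pipe_slices, pipe_tput = 3117, 7.7    # proposed pipeline
print(int(ref_slices / ref_tput))     # slices per Gbit/s, reference
print(int(pipe_slices / pipe_tput))   # slices per Gbit/s, pipeline (about 20% less)
```

This makes explicit why subdividing theta pays off: the minimum stage frequency, which sets the pipeline clock, rises from 323 MHz to 424 MHz.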
VI. COMPARISONS WITH GPU IMPLEMENTATIONS (CUDA)

CUDA (Compute Unified Device Architecture) is a GPGPU (General-Purpose Computing on Graphics Processing Units) technology. Instead of executing an application exclusively on the central processor (CPU), some computationally intensive parts of the application can be transferred to the graphics processor (GPU). Using the GPU for high-performance computing has been practiced for years, but the lack of a suitable API made it a painstaking experience for the programmer, who had to formulate his ideas in an API designed for pure graphics programming. Even so, to program the GPU efficiently, a good knowledge of its internal workings is still necessary [11].

Due to the high processing power of GPUs, it becomes interesting to carry out performance comparisons between implementations of Keccak on CUDA architectures, where it is implemented in software, and on FPGA, where it is implemented directly in hardware and processing is fully dedicated. Thus, we surveyed related work with Keccak implementations on GPUs; the works cited below were used to perform the comparison.

In their work, Pierre-Louis Cayrel et al. [11] present an implementation of the Keccak hash function family on graphics cards, using NVIDIA's CUDA framework. That implementation allows choosing one function out of the hash function family and hashing arbitrary documents. In addition, they present the first ready-to-use implementation of the tree mode of Keccak, which is even more suitable for parallelization.

The GPU supports thousands of threads in hardware, so one possibility for using all of these threads is hashing a single document in tree mode. In this case, parts of the document are first loaded into the leaves of a tree, and these leaves are then hashed using the basic kernel. The idea is to distribute the input data into the leaves of a tree and to hash those leaves independently and in parallel.

Cayrel et al. [11] used an NVIDIA GTX 295 GPU with tree mode hashing to run all of the tests, using different tree parameters and document file sizes.

Guillaume Sevestre [12] presents a Graphics Processing Unit implementation of the Keccak cryptographic hash function in a parallel tree hash mode, exploiting the parallel compute capacity of graphics cards. He used a Core i5-750 2.6 GHz with an Nvidia GTS 250 and implemented Keccak tree mode hashing to run the tests.

Table VII shows the results of the studies found, compared with the results obtained in this project.

TABLE VII RESULTS OF KECCAK'S IMPLEMENTATIONS

Authors                     Title                                                   Implementation                        Throughput
Pierre-Louis Cayrel [11]    GPU Implementation of the Keccak Hash Function Family   NVIDIA GTX 295 GPU                    250 Mbit/s
Guillaume Sevestre [12]     Implementation of Keccak hash function in Tree mode     Core i5-750 2.6 GHz, Nvidia GTS 250   1 Gbit/s
                            on Nvidia GPU
Pereira [13]                Pipeline architecture                                   Virtex 5                              7.7 Gbit/s

Table VII shows that even with the high processing power of GPUs, their throughput was lower than that of the pipeline implementation. Still, the GPU results show that a software implementation can perform well when compared to a dedicated hardware implementation.

Importantly, the test environments are different (FPGA vs. GPU). The GPU throughput was calculated after optimizations of the cooperation between GPU and CPU; a top speed of more than 1 GB/s (including data transfers) was reached using an entry-level GTS 250 card [12]. The FPGA architecture does not consider auxiliary storage and data transfer in its calculated throughput of 7.7 Gbit/s [13]; since data transfers may generate significant latency in the designed architecture, the maximum calculated throughput would be reduced in that case.

In this context, it is concluded that the difference between the FPGA and GPU implementations may be smaller than indicated in Table VII, which highlights the processing capacity of GPU applications.

VII. CONCLUSIONS

The SHA-3 competition is a focus of the information security area. Successive attacks exposing fragilities in hash functions such as MD5, SHA-0, SHA-1 and SHA-2 stimulated the scientific community to search for a more robust and secure successor.

This work had as its objective to explore one of the finalist algorithms of SHA-3, Keccak. It was described in VHDL and implemented on an FPGA Virtex 5 (XC5VLX50FF324-3), applying parallel processing techniques.

As a result, a pipeline architecture was developed that, in its final version, has five processing stages and operates simultaneously on five blocks of data. The principal contribution of this work is to demonstrate that the Keccak algorithm allows exploring parallelism techniques and obtaining significant performance results. As a reference for analysis and comparison, a version provided by the authors of Keccak was used. This reference version pursues high performance by proposing a combinational architecture for the main stream of Keccak; theoretically, the combinational version explores the hardware limit for executing the Keccak algorithm.

In contrast, the proposed pipeline architecture, in addition to exploring the native parallelism of the Keccak algorithm, also allows the creation of simultaneous processing lines.

The results of the proposed architecture can be found in Section V. It is important to highlight that the cost in slices consumed per unit of throughput improved in the proposed pipeline architecture; that is, a higher throughput was obtained with less hardware.

The comparison between the pipeline architecture (FPGA) and the GPU implementations shows that a software implementation can perform well when compared to a dedicated hardware implementation, if a non-conventional architecture such as a GPU is used. It is possible to conclude that GPU implementations can be a good solution for many applications, because they are efficient and have high integration capability.

REFERENCES

[1] G. Bertoni, J. Daemen, M. Peeters and G. Van Assche, The Keccak reference, 2011.
[2] G. Bertoni, J. Daemen, M. Peeters and G. Van Assche, The Keccak SHA-3 submission, 2011.
[3] Wang, X. et al., "Collisions for Hash Functions MD4, MD5, HAVAL-128 and RIPEMD", Proceedings of EUROCRYPT, 2004.
[4] Wang, X., Yin, Y. L., Yu, H., "Finding Collisions in the Full SHA-1", Proceedings of CRYPTO, 2005.
[5] Strömbergson, J., Implementation of the Keccak Hash Function in FPGA Devices, 2008.
[6] Daemen, J. et al., "Sponge Functions", 2011, available from http://sponge.noekeon.org/SpongeFunctions.pdf.
[7] Kotturi, D., Yoo, S., High-Speed Parallel Architecture of the Whirlpool Hash Function, International Journal of Advanced Science and Technology, Volume 7, June 2009.
[8] Michail, H. E., et al., Optimizing SHA-1 Hash Function for High Throughput with a Partial Unrolling Study, Lecture Notes in Computer Science, Volume 3728/2005, 2005.
[9] Chaves, R., Kuzmanov, G., Sousa, L., Vassiliadis, S., Improving SHA-2 Hardware Implementations, Workshop on Cryptographic Hardware and Embedded Systems, 2006.
[10] FIPS 180-3, Secure Hash Standard, Cryptographic Hash Algorithm Competition, available from http://csrc.nist.gov/groups/ST/hash/sha-3/index.html, 2011.
[11] Pierre-Louis Cayrel, Gerhard Hoffmann, Michael Schneider, GPU Implementation of the Keccak Hash Function Family, SERSC International Journal of Security and Its Applications, Vol. 5, No. 4, October 2011.
[12] Guillaume Sevestre, Implementation of Keccak hash function in Tree mode on Nvidia GPU, 2011.
[13] Pereira, F. D., Ordonez, E. D. M., Sakai, I. D., Hash function keccak: exploring parallelism with pipeline, in PDCS - Parallel and Distributed Computing and Systems, 2011.
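The tree hashing mode summarized in Section VI can be sketched in a few lines. The sketch below is purely illustrative: it uses Python's standard hashlib.sha3_256 as a stand-in compression kernel and an arbitrary leaf size, not the actual Keccak tree parameters of [11] and [12], to show how hashing independent leaves exposes parallelism.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Illustrative tree-mode hashing (Section VI): split the input into
# leaves, hash the leaves independently (hence in parallel), then hash
# the concatenation of the leaf digests as the root node.

LEAF_SIZE = 1024  # bytes per leaf (illustrative choice)

def _leaf(chunk):
    # Distinct domain byte for leaves vs. the root node.
    return hashlib.sha3_256(b"\x00" + chunk).digest()

def tree_hash(data):
    leaves = [data[i:i + LEAF_SIZE] for i in range(0, len(data), LEAF_SIZE)] or [b""]
    with ThreadPoolExecutor() as pool:      # leaves have no mutual dependency
        digests = list(pool.map(_leaf, leaves))
    return hashlib.sha3_256(b"\x01" + b"".join(digests)).digest()
```

On a GPU, each leaf would be mapped to a thread or thread block instead of a pool worker; the structure (independent leaves, one final node hash) is the same idea exploited in [11] and [12].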