On Solving RC5 Challenges with FPGAs - IEEE Computer Society

2007 International Symposium on Field-Programmable Custom Computing Machines

On Solving RC5 Challenges with FPGAs

Guerric Meurice de Dormale†∗, John Bass‡, Jean-Jacques Quisquater†
†UCL/DICE Crypto Group, Place du Levant, 3, B-1348 Louvain-La-Neuve, Belgium
‡DMS Design, PO Box 38, Masonville, CO 80541, USA
E-mail: {gmeurice,quisquater}@dice.ucl.ac.be, [email protected]

Abstract

most expensive part. The main operations of RC5-72 are a bitwise XOR, an unsigned addition and a left rotation implemented as a barrel shift (ROTL). The reader is referred to [3] for details.

This work explores a hardware design alternative and a cost assessment of an FPGA-based brute force attack against the RC5-72 challenge. The aim is to develop an alternative to software-based solutions for distributed.net. Hardware platforms, particularly reconfigurable hardware, can offer significant cost, flexibility and performance advantages, while significantly reducing energy costs and environmental impact. Implementation results show that a US$80 FPGA can yield a throughput of 145 Mkeys/s with a power consumption of 10 W. This is roughly an order of magnitude faster, cheaper and lower in power than fully dedicated general-purpose computers.

Several kinds of high level strategies can be applied in order to implement the RC5 algorithm in hardware. They all have their own area and throughput trade-offs. This work focuses on a parallel and fully unrolled implementation. This requires bigger FPGAs but has several advantages. First, the circuit can be data-driven, removing the need for control logic. Second, the architecture can be fully specialized for each operation of the algorithm.


3.1 Parallel Adder


1 Introduction

A carry ripple adder has the most compact implementation, as it is the simplest architecture. Typically, 16 slices are required for a 32-bit adder. Its drawback is the need for a full carry propagation, although this is mitigated by the dedicated carry chain available in FPGAs.
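As an illustration of why full carry propagation is needed, the following minimal Python sketch models a 32-bit carry ripple adder one bit at a time. It is a behavioural model only, not the slice-level mapping used in this work; the function name `ripple_carry_add32` is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def ripple_carry_add32(a: int, b: int) -> int:
    """Add two 32-bit words with a chain of 32 full adders (LSB to MSB)."""
    carry = 0
    result = 0
    for i in range(32):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry                     # sum bit of one full adder
        carry = (x & y) | (carry & (x ^ y))   # carry passed to the next bit
        result |= s << i
    # The final carry is dropped: RC5 uses addition modulo 2**32.
    return result
```

Bit i cannot settle before the carry from bit i-1 is known, which is the propagation path that the dedicated FPGA carry chain shortens.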

This work describes a hardware brute force attack on the RC5 cipher, proposed by Rivest in 1994 [3]. In order to quantify the security of RC5, RSA Laboratories launched several contests in 1997 [4]. Prizes were offered for solving RC5 challenges with key sizes ranging from 40 to 128 bits. The RSA Challenges are constructed to show the strength of key size choices using the RC5 algorithm, by providing a real-life benchmark of brute force effort. This provides users of RC5 with an estimate of the time their data will remain secure as technology continuously improves. The brute force search is easily parallelized using a straightforward divide-and-conquer approach on the key space. The RC5 projects at distributed.net (d.net) exploit this approach using tens of thousands of Personal Computers (PCs), first to solve the RC5-64 challenge and currently to attack the RC5-72 (RC5-32/12/9) challenge.


3.2 Parallel Barrel Shifter ROTL

The ROTL operation can be divided into several stages, leading to three main approaches: a full 32:1 mux in a single stage, a column of 2:1 muxes stacked in log2(32) stages, or an intermediate approach using bigger muxes with fewer stages. The single-stage approach has the minimum latency but necessitates an exponential number of slices (8 per bit). The fully logarithmic approach has the advantage of reducing the area to log2(32) 2:1 muxes (2.5 slices per bit). Its drawback is the delay, as general routing resources must be used instead of the dedicated MuxFi. This delay can nevertheless be traded for latency by pipelining the circuit. As some operands must be retimed, this latency could also be an issue. To mitigate the delay/latency problem of the fully logarithmic approach, 4:1 muxes could be used instead of 2:1 muxes in each stage. Taking advantage of MuxF5, this solution reduces the delay/latency without increasing the area.
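The staged decompositions discussed above can be sketched in Python as follows. This is a behavioural model under stated assumptions: the function names are hypothetical, and the radix-4 grouping (two bits of the rotation amount per stage, hence three stages) is one plausible reading of the 4:1 mux variant, not the exact netlist of this work.

```python
MASK32 = 0xFFFFFFFF

def rotl_fixed(x: int, k: int) -> int:
    """Rotate a 32-bit word left by a constant k (pure wiring in hardware)."""
    k %= 32
    return ((x << k) | (x >> (32 - k))) & MASK32

def rotl_log2(x: int, amount: int) -> int:
    """Fully logarithmic ROTL: five stages of 2:1 muxes.

    Stage i either passes the word through or rotates it by 2**i,
    selected by bit i of the rotation amount.
    """
    amount &= 31
    for i in range(5):
        if (amount >> i) & 1:
            x = rotl_fixed(x, 1 << i)
    return x

def rotl_radix4(x: int, amount: int) -> int:
    """ROTL with 4:1 muxes: three stages, two amount bits per stage.

    The last stage only sees bit 4 of the amount, so it selects
    between a rotation by 0 and a rotation by 16.
    """
    amount &= 31
    for stage in range(3):
        sel = (amount >> (2 * stage)) & 3
        x = rotl_fixed(x, sel << (2 * stage))
    return x
```

Both staged versions return the same word as a direct rotation for every amount from 0 to 31; they differ only in the number of stages (five versus three) that a signal must cross.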

2 RC5 Algorithm

The RC5 algorithm is composed of two parts. First, the key schedule mixes the secret key and generates a set of sub-keys. Second, the encryption/decryption uses the sub-keys to transform between clear text and cipher text. For a brute force attack, the sub-key generation is the most expensive part.

∗ Supported by the Belgian fund for industrial and agricultural research
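The two parts of RC5 described above can be sketched in software as follows. This is a minimal reference-style model of RC5-32/12/b written from the description in [3], not the hardware datapath of this work; the function names are hypothetical, and `P32` and `Q32` are the magic constants of the RC5 specification for 32-bit words. With a 9-byte (72-bit) key, the key array `L` has 3 words and the sub-key array `S` has 26 words.

```python
MASK32 = 0xFFFFFFFF
P32, Q32 = 0xB7E15163, 0x9E3779B9  # RC5 constants for 32-bit words [3]

def rotl(x: int, k: int) -> int:
    k &= 31
    return ((x << k) | (x >> (32 - k))) & MASK32

def rotr(x: int, k: int) -> int:
    k &= 31
    return ((x >> k) | (x << (32 - k))) & MASK32

def key_schedule(key: bytes, rounds: int = 12) -> list:
    """Expand the secret key into 2*(rounds+1) sub-keys."""
    c = max(1, (len(key) + 3) // 4)           # key length in 32-bit words
    L = [0] * c
    for i in range(len(key) - 1, -1, -1):     # little-endian key loading
        L[i // 4] = ((L[i // 4] << 8) + key[i]) & MASK32
    t = 2 * (rounds + 1)
    S = [(P32 + i * Q32) & MASK32 for i in range(t)]
    A = B = i = j = 0
    for _ in range(3 * max(t, c)):            # mixing: 78 steps for RC5-72
        A = S[i] = rotl((S[i] + A + B) & MASK32, 3)
        B = L[j] = rotl((L[j] + A + B) & MASK32, (A + B) & 31)
        i, j = (i + 1) % t, (j + 1) % c
    return S

def encrypt_block(S: list, A: int, B: int, rounds: int = 12) -> tuple:
    A = (A + S[0]) & MASK32
    B = (B + S[1]) & MASK32
    for i in range(1, rounds + 1):
        A = (rotl(A ^ B, B) + S[2 * i]) & MASK32
        B = (rotl(B ^ A, A) + S[2 * i + 1]) & MASK32
    return A, B

def decrypt_block(S: list, A: int, B: int, rounds: int = 12) -> tuple:
    for i in range(rounds, 0, -1):
        B = rotr((B - S[2 * i + 1]) & MASK32, A) ^ A
        A = rotr((A - S[2 * i]) & MASK32, B) ^ B
    return (A - S[0]) & MASK32, (B - S[1]) & MASK32
```

A brute force attack repeats `key_schedule` for every candidate key, which is why the 78 mixing steps dominate the cost compared with the 12 encryption rounds.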

0-7695-2940-2/07 $25.00 © 2007 IEEE DOI 10.1109/FCCM.2007.13

3 Implementation Options


3.3 Shift Register for Retiming

Compared with other works, the improvement is clear: the AT product is 4.5 times [1] or 8 times [2] better. This is mainly due to a thorough analysis of the RC5 operations and a highly optimized FPGA packing.

During the computation of RC5, some data is not consumed as soon as it is produced and must be stored. For instance, RC5-72 needs to compute some terms in one iteration and use them 26 and 3 iterations later. Three options are available: SRL16 LUTs, distributed RAM LUTs and block RAMs (bRAMs). The use of SRL16s yields the most compact implementation. It is efficient for shift registers of low depth. The drawback is its high dynamic power consumption. Distributed RAMs should therefore be preferred. Indeed, a shift register can be implemented with a RAM and a counter: data is written at an incremented address and read after a number of cycles equal to the shift register depth. When a high depth is needed, bRAMs can be a good solution. They are not as fast as general-purpose logic, but they already exhibit acceptable performance. Two 32-bit shift registers of depth up to 256 can be implemented in a single bRAM.
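The RAM-plus-counter scheme described above can be modelled with a short Python sketch. The class name `RamShiftRegister` and its `clock` method are hypothetical, and the model assumes a read-before-write memory with one access per clock cycle.

```python
class RamShiftRegister:
    """Shift register of a given depth built from a RAM and an address counter."""

    def __init__(self, depth: int):
        self.depth = depth
        self.ram = [0] * depth   # memory contents, initially zero
        self.addr = 0            # free-running address counter

    def clock(self, data_in: int) -> int:
        """One clock cycle: read the oldest word, then overwrite it."""
        data_out = self.ram[self.addr]      # word written `depth` cycles ago
        self.ram[self.addr] = data_in
        self.addr = (self.addr + 1) % self.depth
        return data_out
```

With `depth` set to 26 or 3, the same structure delays a term by the number of iterations mentioned above; only the counter wraps around, so no data actually moves between storage cells on each cycle.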


For the cost assessment of an FPGA brute force attack, the AT product must be weighted by the hardware cost (in US$). Table 2 presents the number of devices needed to search the whole key space of RC5-72 in one year and the power consumption (using Web Power Tool V8.1). With a price of $50 (10^4) and $160 (2.5 × 10^3) for a single S3 and V4 FPGA respectively, this corresponds to a cost of $96M for V4 and $51M for S3 FPGAs. These results show that low-cost Spartan 3 devices are more cost-effective for breaking RC5. They also show that a software-based solution would burn 45 times (S3) or 80 times (V4) more energy than hardware!

Platform clock frequencies [GHz]: V4LX40 0.25, S3 2000 0.15, Opteron 285 2.6, Core2 QX6700 2.66.

4 Implementation: Area-first Approach

A fully unrolled design with a parallel datapath exhibits a good AT product. This section investigates such an approach with a minimal-area goal. Less area can potentially result in less logic, shorter routing delays and a cheaper device. Following the area-first approach, a first option is to save LUTs by using bRAMs for the retiming of the S terms. This choice has a strong impact on the whole strategy: as the S shift registers can be almost as deep as needed, extensive pipelining can be applied. Nevertheless, the retiming of the L terms should not exceed 17 pipeline stages, so that it does not require more than a single 4-input LUT and FF per bit. To save area, simple carry ripple adders are used. The last mux stage of the partially logarithmic ROTL of the previous iteration is also merged with the adder that follows it.


Table 1. Implementation results of RC5 cores

Design        Freq. [MHz]  Area [Slices]  Through. [Mkeys/s]  AT [S./MHz]
V4LX40 Area   250          17,736         250                 71
S3 2000 Area  147          19,812         147                 135
V2 6000 [1]   100          33,790         100                 338
V4LX25 [2]    100          9,388          15.3                614

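Table 2. HW and SW estimates ($0.1 per kWh)

Platform      Freq. [GHz]  Thr. [Mkeys/s]  Power [W]  # Plat. 1 year  Power 1 year
V4LX40        0.25         250             10         6 × 10^5        $5.3M
S3 2000       0.15         147             11         10^6            $9.8M
Opteron 285   2.6          22              95         7 × 10^6        $570M
Core2 QX6700  2.66         40              130        4 × 10^6        $430M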
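The Table 2 estimates can be re-derived from the throughput and power of each platform. The sketch below is a back-of-the-envelope check under stated assumptions: a year of 365.25 days, the $0.1 per kWh tariff of Table 2, the unit prices quoted in the text, and hypothetical function names.

```python
KEYSPACE = 2 ** 72                      # number of RC5-72 keys
HOURS_PER_YEAR = 365.25 * 24            # assumption: average year length
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600

def platforms_for_one_year(mkeys_per_s: float) -> float:
    """Devices needed to sweep the whole key space in one year."""
    return KEYSPACE / (mkeys_per_s * 1e6 * SECONDS_PER_YEAR)

def energy_cost_usd(mkeys_per_s: float, watts: float,
                    usd_per_kwh: float = 0.1) -> float:
    """Electricity bill for running all those devices for one year."""
    total_kw = platforms_for_one_year(mkeys_per_s) * watts / 1000
    return total_kw * HOURS_PER_YEAR * usd_per_kwh

def hardware_cost_usd(mkeys_per_s: float, unit_price_usd: float) -> float:
    """Purchase cost of the devices alone."""
    return platforms_for_one_year(mkeys_per_s) * unit_price_usd

if __name__ == "__main__":
    for name, thr, power in [("V4LX40", 250, 10), ("S3 2000", 147, 11),
                             ("Opteron 285", 22, 95), ("Core2 QX6700", 40, 130)]:
        n = platforms_for_one_year(thr)
        print(f"{name}: {n:.2e} devices, ${energy_cost_usd(thr, power) / 1e6:.1f}M energy")
```

Under these assumptions the results agree with the rounded figures of Table 2 and with the $96M (V4) and $51M (S3) hardware costs quoted in the text.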

7 Conclusion and Further Work

A parallel, fully unrolled implementation of RC5 on an FPGA was studied. This thorough analysis resulted in implementations superior to the prior state of the art. Newer generations of FPGAs should yield even better packing and power consumption. Based on those figures, a cost assessment of an FPGA-based attack against RC5-72 was provided. It should be understood as an upper bound for attacking RC5. This solution is not only more efficient from a throughput/hardware cost point of view, it is also much more environmentally friendly, with a significantly lower total energy cost compared to software-based solutions.

5 Implementation Results

6 Cost Estimates and Power Consumption

AT [S./MHz]: V4LX40 Area 71, S3 2000 Area 135, V2 6000 [1] 338, V4LX25 [2] 614.

References

[1] H. Diab, M. Huang et al., An automated pipeline balancing in the SRC Reconfigurable Computer and its application to the RC5 cipher breaking, MAPLD conference, 2004.
[2] D. Koch, M. Koerber, J. Teich, Searching RC5-Keys with Distributed Reconfigurable Computing, in Proc. of ERSA’06, CSREA Press, pp. 42-48, 2006.

Implementation results were obtained for Virtex-4 LX40-10 (V4) and Spartan-3 2000-4 (S3) devices. Table 1 presents the implementation results using ISE 8.1. Each area-first design uses 26 bRAMs (representing 27% and 65% of the bRAMs of the V4 and S3 devices, respectively) with an extra output register.
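The AT column of Table 1 can be checked directly from the area and throughput columns. The sketch below assumes that the AT product is the slice count divided by the throughput in Mkeys/s, which is the reading that reproduces all four printed values; the dictionary and function names are hypothetical.

```python
# (area in slices, throughput in Mkeys/s), taken from Table 1
DESIGNS = {
    "V4LX40 Area":  (17736, 250.0),
    "S3 2000 Area": (19812, 147.0),
    "V2 6000 [1]":  (33790, 100.0),
    "V4LX25 [2]":   (9388, 15.3),
}

def at_product(slices: int, mkeys_per_s: float) -> float:
    """Area-time product: slices spent per Mkeys/s of throughput (lower is better)."""
    return slices / mkeys_per_s

if __name__ == "__main__":
    for name, (slices, thr) in DESIGNS.items():
        print(f"{name}: AT = {at_product(slices, thr):.0f}")
```

Since the area-first designs test one key per clock cycle, their throughput in Mkeys/s equals their clock frequency in MHz.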

[3] R.L. Rivest, The RC5 Encryption Algorithm, in Proc. of FSE’94, LNCS 1008, Springer-Verlag, pp. 86-96, 1994.
[4] RSA Laboratories, The RSA Laboratories Secret-Key Challenge, www.rsasecurity.com/rsalabs/
