Searching RC5 Keyspaces with Distributed Reconfigurable Hardware
John P. Morrison, Padraig J. O'Dowd and Philip D. Healy
Computer Science Dept., University College Cork, Ireland
{j.morrison, p.odowd, p.healy}@cs.ucc.ie
Abstract— Topic Area: Software. Implementation details and performance measurements for a brute-force RC5 keycrack application executing on a cluster containing commodity reconfigurable hardware are presented. The purpose of the application is to gauge the maximum real-world speedups attainable on a metacomputer that forms the underlying execution platform. The operation of the metacomputer and its associated tools, designed to target cluster-based distributed reconfigurable hardware in a high-level manner, is discussed in detail.

Index Terms— Reconfigurable computing, high performance computing, distributed reconfigurable computing.
I. INTRODUCTION

Although high-level development and execution environments for reconfigurable hardware contained in a single machine have been developed, e.g., MATLAB [1] and ANSI-C [2] compilers, such tools have not yet become available for clusters containing reconfigurable hardware. APIs analogous to standard parallel processing libraries such as PVM and MPI have been developed that greatly simplify the task of controlling distributed systems of adaptive computing boards [3]. Although such libraries are unbeatable in terms of pure performance, developing with them places a significant burden on the developer, who must explicitly schedule all communications with the network and the reconfigurable hardware. The Distributed Reconfigurable Metacomputer (DRMC) project attempts to shift this burden to the runtime execution platform. Computations are expressed as sets of graphs that generate instructions. These instructions are then distributed, scheduled and executed by the metacomputer. While a system of this type is inevitably unsuited to some applications, full utilization of available computational resources should be possible for many classes of applications.

The RC5 keycrack application described in this paper was developed in order to gauge the upper bound of speedups attainable for real-world applications executing on the metacomputer. Cryptographic keycrack applications are good examples of embarrassingly parallel computations [4], i.e., they can be divided into completely independent parallel tasks that require no intercommunication. As a result, they are often executed on clusters of commodity machines, or even large-scale distributed computing projects such as distributed.net [5].
Cryptographic applications in general are also ideal candidates for acceleration using reconfigurable hardware because of their high degree of intrinsic parallelism and data locality [6][7]. Significant speed increases should therefore be attainable through a combination of both distributed computing and reconfigurable computing techniques, i.e., via High Performance Reconfigurable Computing [8]. The RC5 application was chosen because the amount of data communicated during execution is minuscule and the computational granularity can be arbitrarily coarse. These characteristics allow the capabilities of both types of hardware to be fully exploited for the duration of the computation, since the bottlenecks traditionally associated with each (Ethernet latency/bandwidth in the case of clusters and PCI bus latency/bandwidth in the case of reconfigurable hardware) are not limiting factors to performance. The speedups attained should therefore scale up with faster processors and faster/denser FPGAs without requiring a corresponding improvement in the network or system bus.

The remainder of this paper is organized as follows: some background information on the RC5 algorithm is given in Section II. The metacomputer that is the target platform for the application is described in Section III. Implementation details on the application are provided in Section IV. The hardware upon which the application was executed is described in Section V. The resulting performance measurements are given in Section VI. Finally, Section VII presents our conclusions and discussion.

II. THE RC5 CRYPTOGRAPHIC ALGORITHM

RC5 is a simple and fast symmetric block cipher first published in 1994 [9]. The algorithm requires only three operations (addition, XOR and rotation), allowing for easy implementation in hardware and software. Data-dependent rotations are used to make differential and linear cryptanalysis difficult, and hence provide cryptographic strength. The algorithm takes three parameters: the word size (w) in bits, the number of rounds (r) and the number of bytes (b) in the secret key. A particular (parameterized) RC5 algorithm is denoted RC5-w/r/b, with RC5-32/12/16 being the most common. As 64-bit chip architectures become the norm, it is likely that 64-bit word sizes will increase in popularity. In that case,
it is suggested that the number of rounds be increased to 16. Variable-length keys are accommodated by expanding the secret key to fill an expanded key table of t = 2(r + 1) words. RC5 is extremely resistant to linear cryptanalysis, and is widely accepted as being secure (notwithstanding certain pathological examples that could yield to differential cryptanalysis and timing attacks) [10].

A brute-force attack, the focus of this paper, works by testing all possible keys in turn against an encrypted piece of known plaintext. This type of attack is feasible when the key lengths concerned are small, and such attacks have been successfully mounted on a number of occasions using networks of workstations and distributed computing projects. For longer key lengths (128 bits or greater), the brute-force approach is totally inadequate, requiring millions of years to yield the key. Despite this, brute-force RC5 keycracking is an application worthy of interest because it provides a simple, easily parallelizable real-world application amenable to acceleration with reconfigurable hardware.
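For reference, the following C sketch renders the RC5-32/12/16 key expansion and block encryption as specified in RFC 2040 [13]. It is an unoptimized reference rendering, and it assumes a little-endian host when loading the key bytes.

    #include <stdint.h>
    #include <string.h>

    /* RC5-32/12/16 parameters: w = 32, r = 12, b = 16. */
    #define R 12
    #define T (2 * (R + 1))   /* expanded key table size: 26 words */
    #define C 4               /* secret key length in words: b/4 */

    /* Left rotation of a 32-bit word; the rotation amount is
       data-dependent and taken modulo the word size. */
    static uint32_t rotl(uint32_t x, uint32_t y)
    {
        y &= 31;
        return (x << y) | (x >> ((32 - y) & 31));
    }

    /* Key expansion: the b-byte secret key is mixed into a table S of
       t = 2(r + 1) words using the constants P32 and Q32. */
    static void rc5_key_schedule(const uint8_t key[16], uint32_t S[T])
    {
        uint32_t L[C];
        memcpy(L, key, 16);               /* little-endian host assumed */

        S[0] = 0xB7E15163;                /* P32 */
        for (int i = 1; i < T; i++)
            S[i] = S[i - 1] + 0x9E3779B9; /* Q32 */

        uint32_t A = 0, B = 0;
        for (int s = 0, i = 0, j = 0; s < 3 * T; s++) {
            A = S[i] = rotl(S[i] + A + B, 3);
            B = L[j] = rotl(L[j] + A + B, A + B);
            i = (i + 1) % T;
            j = (j + 1) % C;
        }
    }

    /* Encryption of one 64-bit block using only addition, XOR and
       rotation. */
    static void rc5_encrypt(const uint32_t S[T], const uint32_t pt[2],
                            uint32_t ct[2])
    {
        uint32_t A = pt[0] + S[0];
        uint32_t B = pt[1] + S[1];
        for (int i = 1; i <= R; i++) {
            A = rotl(A ^ B, B) + S[2 * i];
            B = rotl(B ^ A, A) + S[2 * i + 1];
        }
        ct[0] = A;
        ct[1] = B;
    }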
III. THE DISTRIBUTED RECONFIGURABLE METACOMPUTER

The Distributed Reconfigurable Metacomputer (DRMC) project provides an environment in which computations can be constructed in a high-level manner and executed on clusters containing reconfigurable hardware. DRMC is unique in that applications are executed on clusters using the Condensed Graphs [11] model of computation. The DRMC system comprises several components: a metacomputer containing a Condensed Graphs engine capable of executing applications expressed as graphs, a Condensed Graphs compiler, a control program for initiating and monitoring computations, and a set of libraries containing components that simplify application development.

A. Application Development

A DRMC application consists of a set of graph definitions (expressed in XML, following a scheme similar to the one outlined in [12]) and a set of executable instructions. Instructions are implemented either in C or as FPGA configurations, as sketched below.
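As a purely hypothetical illustration (DRMC's actual instruction interface is not given in this paper, so all names below are assumptions), a native instruction is simply an ordinary C function compiled to object code and made known to the metacomputer when the shared object is loaded:

    #include <stdint.h>

    /* Hypothetical native instruction: a C function operating on
       operands supplied by the Condensed Graphs engine. */
    uint64_t add_instruction(uint64_t a, uint64_t b)
    {
        return a + b;
    }

    /* Hypothetical registration hook, called when the .so is loaded,
       that informs the computation process of the instructions the
       application provides. */
    void drmc_register(void)
    {
        /* e.g., register "add" with its arity and function pointer;
           the real registration API is not documented in this paper. */
    }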
Instructions are represented by object code (contained in .o files) or FPGA configurations (contained in .bit files). The Condensed Graphs Compiler compiles the set of graph definitions and links them with the set of executable instructions to produce a shared object (.so) file ready for dynamic linking and execution by the metacomputer. Any FPGA configurations required by the computation are loaded separately by the metacomputer as needed. Currently, application components are assembled manually; however, tools to automate this process are under development.

B. Metacomputer Overview

The metacomputer is a peer-to-peer UNIX application composed of a daemon and, when an application is being executed, a multithreaded computation process. The daemon is lightweight and runs on each cluster node, listening for incoming messages. At an appropriate signal from the control program, the daemon spawns a computation process.

The computation process essentially consists of a number of threads that exchange instructions and results (see Fig. 1). At its core is the scheduler, responsible for routing instructions and results between the various modules. Instructions may arrive either from the Condensed Graphs Engine or from the communications module. The scheduler sends native and Condensed Graph instructions to the Native Instruction Execution Thread. Likewise, FPGA instructions are sent to the FPGA Instruction Execution Thread. Some instructions may have implementations in both software and hardware, in which case the scheduler is free to decide which thread is most appropriate. Instructions accumulate in the scheduler while awaiting execution. The scheduler will delegate instructions to other cluster nodes if this is deemed more expedient than waiting for an execution thread to become available. Results arrive from the execution threads or, in the case of instructions executed remotely, the communications module. Results for instructions that originated on the local machine are sent to the Condensed Graphs Engine, progressing the computation. Results for instructions that originated remotely are sent to the appropriate node. A simplified sketch of this routing logic is given after Fig. 1.
Fig. 1. An overview of the various components comprising a DRMC computation process, along with the resources managed by each. Arrows indicate the flow of instructions (I) and results (R).
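The routing decision described above can be pictured with the following sketch; the names and data structures are illustrative assumptions, not taken from the DRMC sources.

    /* Hypothetical sketch of the scheduler's routing decision. */
    typedef enum { IMPL_NATIVE = 1 << 0, IMPL_FPGA = 1 << 1 } impl_mask_t;

    typedef struct {
        int id;           /* instruction identifier */
        int impls;        /* bitmask of available implementations */
        int origin_node;  /* node on which the instruction originated */
    } instruction_t;

    /* Decide where an instruction should be routed, given the state of
       the two local execution threads. */
    const char *route(const instruction_t *ins, int native_busy,
                      int fpga_busy)
    {
        if ((ins->impls & IMPL_FPGA) && !fpga_busy)
            return "fpga-thread";    /* prefer reconfigurable hardware */
        if ((ins->impls & IMPL_NATIVE) && !native_busy)
            return "native-thread";
        return "remote-node";        /* delegate to a lightly loaded peer */
    }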
C. Operation

The execution of an application is initiated by sending the appropriate command from the control program to an arbitrary node in the cluster. This initiator node then spawns a computation process and broadcasts a message instructing the other cluster nodes to do likewise, specifying a shared directory containing the application code. Once the shared object containing the application code is loaded, a special registration function is called that informs the computation process of the instructions available and the libraries that the application depends on. The initiator node's computation process then commences execution of the application's top-level graph, which is equivalent to a C main function.

As instructions become available for execution, they form a queue that is managed by the scheduler. Some instructions are executed locally by sending them to the computation process's native instruction execution thread or FPGA instruction execution thread. If these local resources are occupied, some instructions will be sent for execution to other cluster nodes. Instructions corresponding to condensed graphs may also be delegated to other cluster nodes, allowing parallelism to be exposed on remote machines. Each cluster node regularly advertises its load to the others, allowing the schedulers to favour lightly loaded nodes when delegating instructions. If all the nodes are heavily loaded with long instruction queues, the computation is throttled, i.e., no new instructions are generated until the backlog has eased.

At present, DRMC assumes that only one type of FPGA instruction will be issued by each computation. As development progresses, support will be added for heterogeneous configurations. The metacomputer would then initiate FPGA reconfigurations based on the set of instructions awaiting execution.

The control program (CP) monitors the execution of a computation, providing the user with real-time information on the state of each machine. The CP is also responsible for the display of log messages as well as user interaction with executing applications. In the event that a computation process exits prematurely (e.g., a badly written instruction causes a segmentation fault), the DRMC daemon executing on the affected node sends an appropriate error message to the CP before broadcasting a message that halts the computation.

IV. APPLICATION DESIGN

The RC5 application was implemented as a graph definition file and a single checkKeys function implemented both as a native and an FPGA instruction, yielding a hardware and a software implementation. Other instructions required by the application were invoked directly from the DRMC libraries. The graph definition file was created using an XML editor. The computation graph is quite simple: it divides the keyspace into partitions (each containing 10 million keys) that are then passed to instances of the checkKeys instruction. This instruction is responsible for encrypting the known plaintext with all the keys in the supplied keyspace partition, and comparing the encrypted plaintext with the known ciphertext. If a match is found, the key is returned, as outlined in the sketch below.
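The inner loop of checkKeys might look like the following; the calling convention and the mapping of a partition onto candidate keys are assumptions, not taken from the actual sources. It reuses rotl, rc5_key_schedule and rc5_encrypt from the Section II sketch.

    /* Test `count` consecutive candidate keys, starting at `first_key`,
       against a known plaintext/ciphertext pair. */
    int check_keys(uint64_t first_key, uint64_t count,
                   const uint32_t pt[2], const uint32_t ct[2],
                   uint64_t *found)
    {
        for (uint64_t k = 0; k < count; k++) {
            uint8_t key[16] = {0};              /* 128-bit key buffer */
            uint64_t candidate = first_key + k;
            memcpy(key, &candidate, sizeof candidate); /* vary low bytes */

            uint32_t S[T], out[2];
            rc5_key_schedule(key, S);
            rc5_encrypt(S, pt, out);

            if (out[0] == ct[0] && out[1] == ct[1]) {
                *found = candidate;   /* match against known ciphertext */
                return 1;
            }
        }
        return 0;                     /* no match in this partition */
    }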
The software and hardware implementations of checkKeys are based on the RC5 implementation contained in RFC 2040 [13]. To create the native implementation, this code was augmented with an extra function interfacing with DRMC to perform type conversions. The compiled object code and the graph definition file were passed to the Condensed Graphs Compiler to create a shared object capable of execution by the metacomputer.

The hardware implementation of checkKeys was created with Handel-C [14], a derivative of ANSI-C specifically designed for translation to hardware. The language contains a number of extensions required for hardware development, including variable data widths and constructs for specifying parallelism and communications at the hardware level. The process of converting an ANSI-C program to efficient Handel-C is relatively straightforward in comparison to traditional hardware design languages such as Verilog and VHDL. First, the code must be parallelized with par statements and the widths of all data types expressed explicitly. Analysis with place and route tools reveals the longest paths in the resulting hardware design. Through a process of iterative refinement, various optimizations can be performed until an acceptable level of speed/efficiency is reached. Small changes in the Handel-C source can result in major changes in the resulting logic, allowing different design strategies to be evaluated quickly.

The first attempt at porting the checkKeys code to Handel-C resulted in a hardware design capable of checking 12 keys in parallel, but at a very low clock speed. The application of a variety of optimizations (removing division/modulo operators, mediating access to RAMs, inlining functions, replacing for loops with while loops) allowed eight keys to be checked in parallel at a clock speed of 50 MHz. The most significant performance increases were obtained by dividing the design into an eight-stage pipeline, with three pipelines operating in parallel. Although the pipelined version of the design operates at a lower clock speed (41 MHz), its throughput is seven times that of the unpipelined version.
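To give a flavour of one such optimization (shown here as illustrative C rather than the actual Handel-C source): because the rotation amount in RC5 is taken modulo the word size, and the word size is a power of two, the modulo operator can be replaced by a bitwise AND, eliminating division/modulo logic from the generated hardware.

    #include <stdint.h>

    /* Rotation amount modulo the 32-bit word size, written two ways.
       The first synthesizes modulo (division) logic; the second reduces
       to a trivial bit-slice in hardware. */
    uint32_t rot_amount_mod(uint32_t b)  { return b % 32; }
    uint32_t rot_amount_mask(uint32_t b) { return b & 31; }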
V. HARDWARE SETUP

The application was executed on a standard Beowulf-type cluster [15], consisting of eight nodes, each a commodity desktop machine running the Linux operating system. Each machine was equipped with a single 350 MHz Pentium II processor and 256 MB of RAM. The nodes were connected by a 100 Mb Ethernet switch (see Fig. 2). A single Celoxica RC1000 reconfigurable computing board [16] was fitted to each cluster node. These boards are PCI-based and incorporate a single Xilinx Virtex XCV2000E FPGA [17] as well as 2 MB of on-board memory. This model of FPGA contains over 2.5 million gates.

Fig. 2. The topology of the cluster used to execute the application, composed of commodity workstations and reconfigurable computing boards.

VI. RESULTS

Execution of the application on a single machine using native instructions resulted in an average of 40,000 keys checked per second. Allowing checkKeys instructions to be delegated to the reconfigurable computing boards increases the number of keys checked to over 1.7 million keys/second, a 42-fold speed increase. Even compared to a modern 2.4 GHz Pentium 4 processor (168,000 keys/second), a greater than order-of-magnitude speed increase is attained through the utilization of reconfigurable hardware. Using all eight cluster machines results in a throughput of over 13 million keys/second, almost 350 times faster than a single machine. The cluster containing reconfigurable hardware provides enough computing power to search the entire 40-bit RC5 keyspace in under 22 hours.
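As a sanity check on these figures (our arithmetic, not a reported measurement): the 40-bit keyspace contains 2^40 ≈ 1.1 × 10^12 keys, so searching it in 22 hours requires a sustained cluster throughput of roughly 1.1 × 10^12 / (22 × 3600 s) ≈ 1.4 × 10^7 keys/second, consistent with the measured rate of over 13 million keys/second.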
VII. CONCLUSIONS AND FUTURE WORK
The RC5 keycrack application demonstrates that significant speedups can be realized by utilizing cluster-based distributed reconfigurable hardware without the application designer having to explicitly schedule communications and FPGA reconfigurations. This application represents the best-case scenario, where the communications overhead is minimal and only one type of instruction is used heavily. As a result, one FPGA configuration can be used homogeneously throughout the cluster.

Future work will focus on allowing the metacomputer to maintain a heterogeneous set of FPGA configurations, with reconfigurations determined by the set of instructions awaiting execution. Instructions of a type different from the host machine's current configuration could then be delegated to another node with a matching configuration. Various scheduling strategies will be evaluated in order to make efficient use of the reconfigurable hardware available. Other work will involve the development of tools, or possibly a dedicated language, to generate DRMC application components. Such a tool would allow the more advanced features of the metacomputer's execution model (e.g., eager/lazy evaluation) to be leveraged without an intimate knowledge of the Condensed Graphs model.

ACKNOWLEDGMENT

The authors gratefully acknowledge the assistance of Roger Gook at Celoxica Ltd. The support of IRCSET (through the Embark Initiative) and Enterprise Ireland (through grant no. IF/2001/318) is also gratefully acknowledged.

REFERENCES

[1] M. Haldar, A. Nayak, A. Kanhere, P. G. Joisha, N. Shenoy, A. N. Choudhary, and P. Banerjee, "Match virtual machine: An adaptive runtime system to execute MATLAB in parallel," in International Conference on Parallel Processing, 2000, pp. 145–152.
[2] J. B. Peterson, R. B. O'Connor, and P. M. Athanas, "Scheduling and partitioning ANSI-C programs onto multi-FPGA CCM architectures," in IEEE Symposium on FPGAs for Custom Computing Machines, K. L. Pocek and J. Arnold, Eds. Los Alamitos, CA: IEEE Computer Society Press, 1996, pp. 178–187.
[3] M. Jones, L. Scharf, J. Scott, C. Twaddle, M. Yaconis, K. Yao, P. Athanas, and B. Schott, "Implementing an API for distributed adaptive computing systems," in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, Apr. 1999.
[4] G. V. Wilson, Practical Parallel Programming (Scientific and Engineering Computation). The MIT Press, Jan. 1996.
[5] distributed.net. [Online]. Available: http://www.distributed.net/
[6] J.-P. Kaps and C. Paar, "Fast DES implementation for FPGAs and its application to a universal key-search machine," in Selected Areas in Cryptography, 1998, pp. 234–247.
[7] P. D. Kundarewich, S. J. Wilton, and A. J. Hu, "A CPLD-based RC4 cracking system," in 1999 Canadian Conference on Electrical and Computer Engineering, May 1999.
[8] M. C. Smith, S. L. Drager, L. L. Pochet, and G. D. Peterson, "High performance reconfigurable computing systems," in Proceedings of the 2001 IEEE Midwest Symposium on Circuits and Systems, Fairborn, OH, Aug. 2001.
[9] R. L. Rivest, "The RC5 encryption algorithm," in Practical Cryptography for Data Internetworks, W. Stallings, Ed. IEEE Computer Society Press, Jan. 1996.
[10] B. Kaliski and Y. Yin, "On the security of the RC5 encryption algorithm," CryptoBytes, vol. 1, no. 2, pp. 13–14, 1995.
[11] J. P. Morrison, "Condensed graphs: Unifying availability-driven, coercion-driven and control-driven computing," Ph.D. dissertation, Technische Universiteit Eindhoven, 1996.
[12] J. P. Morrison and P. Healy, "Implementing the WebCom 2 distributed computing platform with XML," in Proceedings of the International Symposium on Parallel and Distributed Computing, Iasi, Romania, July 2002, pp. 171–179.
[13] R. Baldwin and R. Rivest, "RFC 2040: The RC5, RC5-CBC, RC5-CBC-Pad, and RC5-CTS algorithms," Oct. 1996. [Online]. Available: ftp://ftp.internic.net/rfc/rfc2040.txt
[14] Handel-C Language Reference Manual, Version 3.1, Celoxica Ltd., 2002.
[15] T. Sterling, D. Savarese, D. J. Becker, J. E. Dorband, U. A. Ranawake, and C. V. Packer, "BEOWULF: A parallel workstation for scientific computation," in Proceedings of the 24th International Conference on Parallel Processing, Oconomowoc, WI, 1995, pp. I:11–14.
[16] RC1000 Hardware Reference Manual, Celoxica Ltd., 2001.
[17] Virtex-E 1.8V Field Programmable Gate Arrays, Production Product Specification, Xilinx, Inc., July 2002.