Architecture and Implementation of a Distributed Reconfigurable Metacomputer

John P. Morrison, Philip D. Healy and Padraig J. O'Dowd
Computer Science Dept., University College Cork, Ireland.
{j.morrison, p.healy, p.odowd}@cs.ucc.ie

Abstract— The use of application-specific co-processors created using reconfigurable hardware (FPGAs) has been shown to realize significant speed increases for many computationally intensive applications. The addition of reconfigurable hardware to clusters composed of commodity machines in order to improve the execution times of parallel applications would, therefore, appear to be a logical step. However, the extra complications introduced by this technique may make the real-world application of such technology appear to be prohibitively difficult. In this paper the design and implementation of a metacomputer designed to simplify the development of applications for clusters containing reconfigurable hardware are presented. The operation of the metacomputer is also discussed in some detail, including the process of implementing applications for execution on the metacomputer.

(The support of Enterprise Ireland (through grant no. IF/2001/318) and IRCSET (through the Embark Initiative) is gratefully acknowledged.)

I. INTRODUCTION

The falling price and ever-increasing capability of commodity hardware in recent years has led to the popularization of Beowulf-style clusters [1]. These allow many large-scale and grand-challenge applications that were previously in the realm of supercomputers to be tackled on commodity hardware purchased at a fraction of the cost of a dedicated parallel processing machine. Similarly, the notion of creating application-specific co-processors using reconfigurable hardware, or Reconfigurable Computing [2], has gained popularity in recent years as FPGAs have increased in speed and density. Using this technique, applications may be accelerated by delegating some of their most commonly used functionality (usually the innermost loop) to an FPGA co-processor configured with a suitable hardware implementation. Not all applications are amenable to acceleration with reconfigurable hardware; the part of the application to be accelerated must typically exhibit a high degree of intrinsic parallelism and data locality in order to make the conversion to hardware worthwhile. For some classes of applications, however, the benefits are impressive, with orders-of-magnitude performance increases common in fields such as cryptography and image processing.

Given the performance increases attainable through Cluster Computing and Reconfigurable Computing techniques, a combination of both approaches, known as High Performance Reconfigurable Computing (the field is referred to by a number of synonymous terms in the literature, including Distributed Reconfigurable Computing and Distributed Adaptive Computing), seems logical. A variety of schemes have been proposed for the siting of reconfigurable hardware within the cluster topology, such as embedding the reconfigurable hardware within cluster nodes, connecting the reconfigurable hardware directly to the network, and using the reconfigurable hardware to allow different interconnection networks to be created. A number of schemes have also been proposed for embedding the reconfigurable hardware within cluster nodes. Typically, PCI boards are used, but FPGAs may also be fitted to DIMM sockets [3] or integrated with the Network Interface Card [4]. For the remainder of this paper, we will assume that the reconfigurable hardware is embedded within nodes on PCI boards with no direct access to the network.

Although high-level development and execution environments for distributed reconfigurable hardware contained within a single machine have been developed, e.g., MATLAB [5] and ANSI-C [6] compilers, such tools have not yet become available for clusters containing reconfigurable hardware. APIs analogous to standard parallel processing libraries such as PVM and MPI have been developed that greatly simplify the task of controlling distributed systems of adaptive computing boards [7]. Although unbeatable in terms of pure performance, developing with low-level libraries such as these places a significant burden on the developer, who must explicitly schedule all communications with the network and the reconfigurable hardware. The Distributed Reconfigurable Metacomputer (DRMC) project attempts to shift this burden to the runtime execution platform. Metacomputing has been defined as "the use of powerful computing resources transparently available to the user via a networked environment" [8]. In this instance, the network is made transparent to the application developer by expressing computations as sets of graphs that generate instructions. These instructions are then distributed, scheduled and executed by the metacomputer. While a system of this type is inevitably unsuited to some applications, full utilization of available computational resources should be possible for many classes of applications. The goal of the project is to make cluster-based reconfigurable computing techniques more accessible to developers with little or no background in reconfigurable computing, the only prerequisites being a

knowledge of the Condensed Graphs Model (this requirement may become unnecessary in the future; see Section VI) and a willingness to learn a high-level hardware description language.

The remainder of this paper is organized as follows: an overview of the Distributed Reconfigurable Metacomputer system architecture is given in Section II. The operation of the metacomputer is described in Section III. The process of developing applications for the metacomputer is outlined in Section IV. An example application is presented in Section V. Finally, our conclusions and future work are presented in Section VI.

II. DRMC OVERVIEW

The Distributed Reconfigurable Metacomputer project (hereafter referred to as "DRMC") provides an environment in which computations can be constructed in a high-level manner and executed on clusters containing reconfigurable hardware. DRMC is unique in that applications are executed on clusters using the Condensed Graphs model of computation [9]. This model allows the parallelism inherent in applications to be exposed by representing them as sets of graphs. At runtime, the flow of entities on arcs triggers nodes to fire and hence generate instructions. The power of the model results from its ability to exploit different evaluation strategies (availability-driven, coercion-driven and demand-driven) in the same computation, and dynamically move between them using a single, uniform formalism.

The metacomputing environment is comprised of several components: a metacomputer containing a Condensed Graphs engine capable of executing applications expressed as graphs, a Condensed Graphs compiler, a control program for initiating and monitoring computations, and a set of libraries containing components that simplify application development. The metacomputer itself is implemented as a peer-to-peer UNIX application composed of a daemon and, when an application is being executed, a multithreaded computation process. The daemon is lightweight and runs on each cluster node, listening for incoming messages.
At an appropriate signal from the control program, the daemon spawns a computation process. The computation process essentially consists of a
number of threads that exchange instructions and results (see Fig. 1). At its core is the scheduler, responsible for routing instructions and results between the various modules. Instructions may arrive either from the Condensed Graphs Engine or from the communications module. The scheduler sends native and Condensed Graph instructions to the Native Instruction Execution Thread. Likewise, FPGA instructions are sent to the FPGA Instruction Execution Thread. Some instructions may have implementations in both software and hardware, in which case the scheduler is free to decide which thread is most appropriate. Instructions accumulate in the scheduler while awaiting execution. The scheduler will delegate instructions to other cluster nodes if this is deemed to be more expedient than waiting for an execution thread to become available. Results arrive from the execution threads or, in the case of instructions executed remotely, from the communications module. Results for instructions that originated on the local machine are sent to the Condensed Graphs Engine, progressing the computation. Results for instructions that originated remotely are returned to the appropriate node.

III. DRMC OPERATION

The Control Program (CP) is an application used for initiating computations, displaying log messages generated by applications, and allowing user interaction with executing applications. The CP is also responsible for monitoring the execution of computations and providing the user with real-time information on the state of each machine. A user initiates the execution of an application by sending the appropriate command from the control program to an arbitrary node in the cluster. This initiator node then spawns a computation process and broadcasts a message instructing the other cluster nodes to do likewise, specifying a shared directory containing the application code.
Once the shared object containing the application code is loaded, a special registration function is called that informs the computation process of the instructions available and the libraries that the application depends on. The initiator node's computation process then commences execution of the application's top-level graph, which is equivalent to a C main function.

Fig. 1. An overview of the various components comprising a DRMC computation process, along with the resources managed by each. Arrows indicate the flow of instructions (I) and results (R).
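The routing policy described above amounts to a small decision procedure over the instruction's implementation kind and the availability of local resources. A minimal sketch in C follows; the enum names and the busy/cost flags are illustrative assumptions, not DRMC's actual interfaces:

```c
/* Kinds of instruction the scheduler may encounter. */
enum instr_kind {
    INSTR_NATIVE,            /* software implementation only */
    INSTR_CONDENSED_GRAPH,   /* graph instruction, executed natively */
    INSTR_FPGA,              /* hardware implementation only */
    INSTR_BOTH               /* implementations in software and hardware */
};

/* Possible destinations for an instruction awaiting execution. */
enum destination { DEST_NATIVE_THREAD, DEST_FPGA_THREAD, DEST_REMOTE_NODE };

/* Decide where an instruction should be executed. Delegation to a
 * remote node happens only when the local resource is occupied and
 * remote execution is deemed more expedient than waiting. */
enum destination route_instruction(enum instr_kind kind,
                                   int native_busy, int fpga_busy,
                                   int remote_cheaper)
{
    switch (kind) {
    case INSTR_NATIVE:
    case INSTR_CONDENSED_GRAPH:
        /* Native and Condensed Graph instructions go to the native
         * instruction execution thread. */
        if (native_busy && remote_cheaper)
            return DEST_REMOTE_NODE;
        return DEST_NATIVE_THREAD;
    case INSTR_FPGA:
        if (fpga_busy && remote_cheaper)
            return DEST_REMOTE_NODE;
        return DEST_FPGA_THREAD;
    case INSTR_BOTH:
        /* Implementations exist in both software and hardware, so the
         * scheduler is free to pick whichever thread is available. */
        if (!fpga_busy)
            return DEST_FPGA_THREAD;
        if (!native_busy)
            return DEST_NATIVE_THREAD;
        return remote_cheaper ? DEST_REMOTE_NODE : DEST_FPGA_THREAD;
    }
    return DEST_NATIVE_THREAD;
}
```

The real scheduler would base the "remote is cheaper" judgment on the load advertisements described in Section III rather than on a boolean flag.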

As instructions become available for execution, they form a queue that is managed by the scheduler. Some instructions are executed locally by sending them to the computation process's native instruction execution thread or FPGA instruction execution thread. If these local resources are occupied, some instructions will be sent for execution to other cluster nodes. Instructions corresponding to condensed graphs may also be delegated to other cluster nodes, allowing parallelism to be exposed on remote machines. Each cluster node regularly advertises its load to the others, allowing the schedulers to favour lightly loaded nodes when delegating instructions. If all the nodes are heavily loaded with long instruction queues, the computation is throttled, i.e., no more new instructions are generated until the backlog has eased.

At present, DRMC assumes that only one type of FPGA instruction will be issued by each computation. As development progresses, support will be added for heterogeneous configurations. The metacomputer would then initiate FPGA reconfigurations based on the set of instructions awaiting execution. In the event that a computation process exits prematurely (e.g., a badly written instruction causes a segmentation fault), the DRMC daemon executing on the affected node sends an appropriate error message to the CP before broadcasting a message that halts the computation.

IV. APPLICATION DEVELOPMENT

A DRMC application consists of a set of graph definitions, a set of executable instructions, and (optionally) a set of user-defined types, all of which are specified in an XML definition file. Instructions are implemented either as object code (contained in .o files) or FPGA configurations (contained in .bit files), or both.
The Condensed Graphs Compiler (CGC) is an application that compiles the set of definition files and links them with the set of executable instructions needed by the application to produce a shared object (.so) file ready for dynamic linking by the metacomputer. Once the shared object has been loaded, special registration functions created automatically by the compiler are used to register the graph, instruction and type definitions with the metacomputer. Any FPGA configurations required by the computation are loaded separately by the metacomputer as needed.
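The registration step can be pictured as the loaded shared object calling back into the metacomputer to announce each instruction it provides. A minimal sketch, assuming a simple table-based registry; the function and type names here are hypothetical illustrations, not the interface actually generated by the CGC:

```c
#include <string.h>

/* Where implementations of an instruction exist. */
enum impl_kind { IMPL_NATIVE, IMPL_FPGA, IMPL_BOTH };

#define MAX_INSTRUCTIONS 64

struct instruction_entry {
    const char *name;     /* instruction name, e.g. "add" */
    enum impl_kind kind;  /* which implementations are registered */
};

static struct instruction_entry table[MAX_INSTRUCTIONS];
static int table_len = 0;

/* Called from the shared object's registration function once per
 * instruction. Returns 0 on success, -1 if the table is full. */
int drmc_register_instruction(const char *name, enum impl_kind kind)
{
    if (table_len >= MAX_INSTRUCTIONS)
        return -1;
    table[table_len].name = name;
    table[table_len].kind = kind;
    table_len++;
    return 0;
}

/* Query used by the scheduler: does a named instruction have an
 * FPGA implementation available? */
int drmc_has_fpga_impl(const char *name)
{
    for (int i = 0; i < table_len; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].kind == IMPL_FPGA || table[i].kind == IMPL_BOTH;
    return 0;
}
```

A registry of this sort is what allows the scheduler to know, per instruction, whether the native thread, the FPGA thread, or both are candidates for execution.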

A. XML Definition Files

Definition files are used to specify the graphs, instructions and user-defined types from which an application is composed. Graph definitions follow a scheme similar to the one outlined in [10]. The compiler automatically converts each XML graph definition to a corresponding function that, when called, returns an instance of the definition in the internal representation used by the metacomputer. Instructions are specified using the name of the C functions or .bit files that implement them, along with a type signature. The type information is used for type-checking purposes by the compiler, as well as for communications purposes by the metacomputer. Although the full set of C primitive types, as well as extra types for representing references, lists and block data, are provided by the metacomputer, many applications require the use of more complex data structures. In this instance, a user-defined type may be specified. A type definition is comprised of a name, an alias type (user-defined types are typically aliases of the built-in reference type), a string conversion function, a serialization function and a deserialization function. The serialization and deserialization functions are necessary for communications over the network and to/from FPGAs.

B. Implementing Native Instructions

Native instructions are implemented as C functions that operate on and return instances of a predefined DRMC Operand datatype. An Operand instance is a structure containing two fields: an integer value representing the operand type, and an encapsulation of the actual operand value, which may be a primitive value such as an int or a reference to another structure. In many cases, these functions are merely wrappers that interface with existing libraries such as LAPACK [11] by unpacking the values contained in the function operands, calling the equivalent library function and returning a new Operand structure created from the result. Given that the Condensed Graphs Compiler automates the generation of code that creates Condensed Graph instances, an automatic interface generation tool similar to SWIG [12] could be constructed that would generate the tedious boilerplate code required for interfacing with libraries. However, such a tool would be unlikely to eliminate the need for developers to create instructions manually due to the resulting loss of control over instruction granularity.

C. Implementing FPGA Instructions


Fig. 2. The DRMC application development process. A graph definition file and object code are passed to the Condensed Graphs Compiler to produce a shared object capable of execution by the metacomputer. FPGA configurations are created and distributed separately.

FPGA instructions communicate with the metacomputer using a simple protocol. In order to write data to an FPGA, the host first sends a header byte over the PCI bus that indicates how many other bytes are to follow. Larger data transfers are performed using DMA. The host notifies the FPGA instruction that a DMA transfer is about to commence by sending a value of zero in the header byte. An identical scheme is used for communications in the opposite direction (i.e., between the FPGA and the host). Future versions of the metacomputer may integrate Celoxica's Data Stream Manager (DSM) [13], a library providing a more flexible, higher-level means of communicating with FPGAs.

All the FPGA DRMC instructions created to date have been implemented using Handel-C [14], a derivative of ANSI-C specifically designed for translation to hardware. The language contains a number of extensions required for hardware development, including variable data widths and constructs for specifying parallelism and communications at the hardware level. Handel-C is not a superset of ANSI-C, so when an ANSI-C program is used as the basis of a hardware design, changes (such as the elimination of pointers) must typically be made before the program can be converted to a hardware design. Next, the code must be parallelized with par statements and the data widths expressed explicitly. Analysis with place and route tools reveals the longest paths in the resulting hardware design. Through a process of iterative refinement, various optimizations can be performed until an acceptable level of speed/efficiency is reached. Small changes in the Handel-C source can result in major changes in the resulting logic, allowing different design strategies to be evaluated quickly.

Although high-level hardware design languages such as Handel-C and SystemC [15] offer perhaps the easiest method of implementing FPGA instructions, particularly for those without hardware design experience, numerous alternative approaches are possible.
Traditional hardware design languages such as VHDL [16] and Verilog [17] can be used to produce more efficient, higher-performing designs but are significantly more difficult to learn and use than their higher-level counterparts. APIs for creating configurations using conventional programming languages such as Java [18] and C++ [19] have also been developed. A more unusual approach is the evolution of hardware designs using genetic algorithms [20].
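The header-byte framing used by DRMC's FPGA protocol can be sketched from the host's side. The function below models only the framing rule described in Section IV (one length byte followed by the payload, with a zero header announcing a DMA transfer); the actual PCI driver calls that move the bytes are not shown, and the buffer-based interface is an illustrative assumption:

```c
#include <stddef.h>
#include <string.h>

/* Header value of zero signals that a DMA transfer follows rather
 * than an inline payload. */
#define DMA_HEADER 0

/* Frame a small payload (1-255 bytes) into out[]: one header byte
 * giving the payload length, followed by the payload itself. Returns
 * the total number of framed bytes, or 0 if the payload is empty or
 * too large for inline transfer (in which case DMA should be used).
 * out[] must have room for len + 1 bytes. */
size_t frame_small_write(const unsigned char *payload, size_t len,
                         unsigned char *out)
{
    if (len == 0 || len > 255)
        return 0;                /* caller should fall back to DMA */
    out[0] = (unsigned char)len; /* header: byte count to follow */
    memcpy(out + 1, payload, len);
    return len + 1;
}

/* Header byte announcing that a DMA transfer is about to commence. */
unsigned char dma_header(void)
{
    return DMA_HEADER;
}
```

The same framing applies in the opposite direction, with the FPGA emitting the header byte, as in the DrmcWriteByte/DrmcWriteInt pair of the Handel-C listing in Fig. 4.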

V. EXAMPLE APPLICATION

In order to illustrate the DRMC application development process, a simple recursive divide-and-conquer application is presented. For simplicity without loss of generality, the function used will be a simple addition operator, yielding a divide-and-conquer summation. It must be stressed that this implementation is terribly inefficient, given the very small grain sizes involved and the latency issues inherent in both commodity networking equipment (used for inter-node communication) and PCI buses (used for communicating with the FPGAs). Nevertheless, the application's concise nature should serve as a valid proof of concept. The application sums eight integer inputs using a simple binary reduction tree. This is easily expressed as the Condensed Graph illustrated in Fig. 3. The XML document describing the graph is too long to reproduce here, but is relatively straightforward (see [10] for an example of an XML description of a Condensed Graph). The same graph could be defined more concisely by creating a special instruction for constructing graphs consisting of binary reduction trees, although the implementation details of such an instruction are beyond the scope of this paper.

Fig. 3. The sample application graph definition. The eight operands flow from the Enter (E) node to four addition nodes, causing them to fire. The resulting outputs then form the operands of other addition nodes. The final result is returned by the Exit (X) node.

Once the graph definition is complete, the instructions it requires must be implemented. Implementing the addition instruction natively using C is trivial; a single function is required that unpacks the two integer operands it receives before creating a new Operand instance from their sum (see Fig. 4). The C file containing the add function is compiled (in this case using GCC) to produce an object code (.o) file. This file and the graph definition file described above are passed to the Condensed Graphs Compiler, resulting in a shared object (.so) file that can be dynamically loaded and executed by computation processes running on cluster nodes. In order for the metacomputer to delegate work to the reconfigurable hardware contained in the cluster, an alternative hardware implementation of the add instruction must be provided. A Handel-C implementation of the instruction is shown in Fig. 4. At this point, the application developer's work is finished, as the metacomputer will assume responsibility for distributing, scheduling and executing the instructions generated by the application graph.

    #include "drmc.h"

    /* Sum two integer operands */
    Operand add(Operand o1, Operand o2)
    {
        int a = OperandToInt(o1);
        int b = OperandToInt(o2);

        return OperandInt(a + b);
    }

    #include "rc1000.h"
    #include "drmc.h"

    void main(void)
    {
        int 32 a, b;

        while (1) {
            /* Read operands */
            DrmcReadByte();
            a = DrmcReadInt();
            b = DrmcReadInt();

            /* Write result */
            DrmcWriteByte(width(a)/8);
            DrmcWriteInt(a + b);
        }
    }

Fig. 4. The ANSI-C and Handel-C implementations of a simple DRMC instruction that adds two integers.
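The native listing in Fig. 4 depends on drmc.h. For experimentation outside the metacomputer, the small subset it uses can be mocked; the Operand layout below is an illustrative guess (the paper states only that the structure holds a type tag and an encapsulated value), not the actual DRMC definition:

```c
#include <assert.h>

/* Mock of the DRMC Operand subset used by Fig. 4. The real structure
 * contains an integer type tag and an encapsulation of the operand
 * value; its exact layout is not given in the paper. */
typedef struct {
    int type;                      /* operand type tag; 0 = integer here */
    union { int i; void *ref; } value;
} Operand;

static Operand OperandInt(int i)
{
    Operand o;
    o.type = 0;
    o.value.i = i;
    return o;
}

static int OperandToInt(Operand o)
{
    assert(o.type == 0);           /* only integer operands in this mock */
    return o.value.i;
}

/* The native instruction, exactly as in Fig. 4. */
Operand add(Operand o1, Operand o2)
{
    int a = OperandToInt(o1);
    int b = OperandToInt(o2);

    return OperandInt(a + b);
}
```

With a mock of this sort, a native instruction can be unit-tested as an ordinary C function before it is compiled and linked by the CGC.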

VI. CONCLUSIONS AND FUTURE WORK

Modern high-level hardware design languages have made the field of reconfigurable computing more accessible than ever before to software developers without hardware design experience. It is hoped that the ability to write applications for a metacomputer will make the notion of developing for distributed reconfigurable hardware a less daunting prospect than at present. The metacomputer is not without its faults, however, chief amongst these being the knowledge of the Condensed Graphs computing model required to construct the application graphs. This requirement creates an additional barrier to entry, a situation that should be avoided when one considers that the goal of the project is to make high performance reconfigurable computing more accessible to the average developer. Several methods of simplifying the creation of graph definitions are currently under investigation. A GUI Condensed Graphs design application has been developed as part of the WebCom [21] project that provides a more natural and intuitive method of specifying graphs but does not eliminate the need for knowledge of the underlying execution model. The automatic generation of graph definitions from other computational representations, such as Java source code or activity set specifications, is a topic under active investigation.

The development of more advanced scheduling algorithms that would enable run-time reconfiguration is also a priority. The scheduler in its current form assumes that only one type of FPGA instruction will be generated by the application graph and configures all the FPGAs in the cluster homogeneously with the single expected FPGA instruction. Future work will focus on allowing the metacomputer to maintain a heterogeneous set of FPGA configurations, with reconfigurations determined by the set of instructions awaiting execution. Instructions of a different type to the host machine's current configuration could then be delegated to another node with a matching configuration. Various scheduling strategies will be evaluated in order to make efficient use of the reconfigurable hardware available.
REFERENCES

[1] T. Sterling, D. Savarese, D. J. Becker, J. E. Dorband, U. A. Ranawake, and C. V. Packer, "BEOWULF: A parallel workstation for scientific computation," in Proceedings of the 24th International Conference on Parallel Processing, Oconomowoc, WI, 1995, pp. I:11-14.
[2] G. Milne, "Reconfigurable custom computing as a supercomputer replacement," in 4th International Conference on High Performance Computing, Bangalore, India, Dec. 1997.
[3] P. Leong, M. Leong, O. Cheung, T. Tung, C. Kwok, M. Wong, and K. Lee, "Pilchard - a reconfigurable computing platform with memory slot interface," in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, Apr. 2001.
[4] K. Underwood, R. Sass, and W. Ligon, "A reconfigurable extension to the network interface of Beowulf clusters," in Proceedings of IEEE Cluster Computing, Newport Beach, CA, Oct. 2001.
[5] M. Haldar, A. Nayak, A. Kanhere, P. G. Joisha, N. Shenoy, A. N. Choudhary, and P. Banerjee, "Match virtual machine: An adaptive runtime system to execute MATLAB in parallel," in International Conference on Parallel Processing, 2000, pp. 145-152.
[6] J. B. Peterson, R. B. O'Connor, and P. M. Athanas, "Scheduling and partitioning ANSI-C programs onto multi-FPGA CCM architectures," in IEEE Symposium on FPGAs for Custom Computing Machines, K. L. Pocek and J. Arnold, Eds. Los Alamitos, CA: IEEE Computer Society Press, 1996, pp. 178-187.
[7] M. Jones, L. Scharf, J. Scott, C. Twaddle, M. Yaconis, K. Yao, P. Athanas, and B. Schott, "Implementing an API for distributed adaptive computing systems," in Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, Apr. 1999.

[8] L. Smarr and C. E. Catlett, "Metacomputing," Communications of the ACM, vol. 35, pp. 44-52, June 1992.
[9] J. P. Morrison, "Condensed graphs: Unifying availability-driven, coercion-driven and control-driven computing," Ph.D. dissertation, Technische Universiteit Eindhoven, 1996.
[10] J. P. Morrison and P. Healy, "Implementing the WebCom 2 distributed computing platform with XML," in Proceedings of the International Symposium on Parallel and Distributed Computing, Iasi, Romania, July 2002, pp. 171-179.
[11] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics, 1999.
[12] D. M. Beazley, SWIG Users Manual, June 1997.
[13] Data Stream Manager, Celoxica Ltd., Sept. 2002.
[14] Handel-C Language Reference Manual Version 3.1, Celoxica Ltd., 2002.
[15] T. Grotker, S. Liao, G. Martin, and S. Swan, System Design with SystemC. Kluwer Academic Publishers, May 2002.
[16] P. J. Ashenden, The Designer's Guide to VHDL, 2nd Edition. Morgan Kaufmann, May 2001.
[17] P. R. Moorby and D. E. Thomas, The Verilog Hardware Description Language. Kluwer Academic Publishers, May 1998.
[18] S. A. Guccione, D. Levi, and P. Sundararajan, "JBits: A Java-based interface for reconfigurable computing," in 2nd Annual Military and Aerospace Applications of Programmable Devices and Technologies Conference (MAPLD), Sept. 1999.
[19] O. Mencer, M. Morf, and M. J. Flynn, "PAM-Blox: High performance FPGA design for adaptive computing," in IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), Napa Valley, California, 1998.
[20] A. Thompson, Hardware Evolution: Automatic design of electronic circuits in reconfigurable hardware by artificial evolution, ser. Distinguished dissertation series. Springer-Verlag, 1998.
[21] J. Morrison, J. Kennedy, and D. Power, "WebCom: A Web based volunteer computer," The Journal of Supercomputing, vol. 18, pp. 47-61, Jan. 2001.