Design of a high-level language for Custom Computing Machines

C. van Reeuwijk
Delft University of Technology
Mekelweg 4, 2628 CD Delft, The Netherlands
email: [email protected]

November 20, 2002

Abstract

Modern Field Programmable Gate Arrays (FPGAs) can implement digital circuits with millions of gates. This makes them very attractive for implementing Custom Computing Machines (CCMs): dedicated processors for specific computations. For some applications, FPGA CCMs can provide a speedup of a factor 10 to 1000 or more relative to an implementation on a conventional processor. Particularly fruitful areas are signal processing, cryptography, and some classes of routing and planning problems.

Unfortunately, implementing a CCM on an FPGA is currently an elaborate process requiring specialised knowledge. An important reason is the lack of a suitable programming language. Translation from programming languages such as C or Java is difficult and inefficient. Existing hardware description languages such as VHDL and Verilog are too low-level and too complicated for casual use; others use an unsuitable abstraction (SystemC, SpecC) or are too limited.

In this paper we will analyse the problems with existing approaches, and investigate a number of alternative models of computation. One of these is a straightforward formalisation of synchronous digital circuits; the other is more abstract and is not restricted to the synchronous model. As we will show, by choosing a suitable model of computation it is possible to define a simple but powerful programming language for CCMs from which hardware configurations can easily be generated. We have implemented a prototype version of such a programming language. The results are promising, but further research is required to arrive at the best choice of programming language, and to evaluate the potential of this language.

1 Introduction

Some algorithms can be accelerated significantly by constructing a dedicated digital circuit for them, called a Custom Computing Machine (CCM). For example, Figure 1 shows a CCM for DNA sequence matching. The CCM compares DNA sequences to a number of reference sequences (stored in registers 0 to 3).


Figure 1: An example CCM: a DNA sequence matcher.

At every clock cycle a new DNA base is shifted into the pattern register. When the load signal is asserted, the DNA sequence in the pattern register is loaded into the reference pattern register indicated by the regno input. All reference registers compare their pattern with the pattern register, and indicate whether their pattern matches. From the match signals of the reference patterns, a match flag and a match register number are constructed that form the output of the CCM. As shown here, only four reference patterns are matched, but this is easily extended to a much larger number. Since all these comparisons can be done in parallel, a very high degree of parallelism can be achieved.

It is usually not economical to implement a CCM as a special-purpose chip, but implementation on a Field Programmable Gate Array (FPGA) is much cheaper. An FPGA contains an array of logic cells. Each cell can be programmed to implement a simple digital circuit using a number of configuration bits stored in RAM. The individual cells can be interconnected through configurable wiring channels into larger digital circuits. The cells and wiring have been designed so that arbitrary logic circuits can be implemented. More details can be found in Section 2.

Modern FPGAs can accommodate very large circuits. The largest FPGAs available today have nearly 100,000 logic cells [3, 41] (e.g. the Altera Stratix EP1S80 or the Xilinx Virtex II-Pro 2VP125), and can implement digital circuits with an estimated 5 × 10^6 logic gates. In addition, these devices provide dedicated memory blocks to implement register sets, stacks, FIFOs, etc., and often other dedicated blocks such as multipliers. As Lignon et al. [24] report, such an FPGA can accommodate, for example, roughly 150 single precision floating point adders or multipliers, 500 16-bit integer multipliers, or several thousand 16-bit integer adders.

Since the configuration bits are stored in RAM cells, the FPGA can be reconfigured at any moment. This means that the FPGA can be treated as a rather unusual processor, and an FPGA configuration as a program for that processor. The processor is ‘rather unusual’ because it deviates from traditional processors with separate memory banks and a central processing unit: it is not a ‘von Neumann’ architecture.


Using an FPGA CCM is called reconfigurable computing. Reconfigurable computing can lead to spectacular speedups: compared to an implementation on a standard processor, a speedup of a factor 10 to 1000 or more is quite feasible for some algorithms, particularly in signal processing, cryptography, and some classes of routing and planning problems. In Section 3 we describe some existing applications of reconfigurable computing.

A good way to exploit FPGAs for reconfigurable computing is to use them as a coprocessor in a traditional computer system. That way, only the kernel of the algorithm needs to be implemented on an FPGA, and all the supporting software for file input and output, data set preparation, and other supporting computations can be implemented on a standard computer system. For this reason, several vendors sell accelerator cards that can be plugged into a standard PCI slot. One vendor [7] has even implemented a parallel computer system with an FPGA in each node, and in the MOLEN project [39] a processor is proposed that uses reconfigurable hardware to implement instructions that are specific to a program.

Unfortunately, FPGAs are currently far more difficult to program than standard processors. Considerable knowledge of digital circuit design is required, as is thorough training in the particular circuit synthesis tool for the target FPGA. Furthermore, designs are much less portable than standard software. Finally, even experienced developers require considerably more time to implement an algorithm on an FPGA than on a standard computer. These disadvantages are not inherent in the use of FPGAs, but reflect the fact that current development tools were developed for experienced developers constructing highly efficient circuit designs. For reconfigurable computing it is usually acceptable to trade some efficiency for a simpler design process; a similar tradeoff is made when using high-level programming languages instead of assembly language.

Instead of using a dedicated circuit description tool, it is possible to compile from a standard imperative programming language such as C or Java to FPGAs. However, these languages are not designed for FPGA programming, and contain constructs that are difficult to translate, such as recursion and dynamic memory allocation. Also, these languages are sequential, so all parallelism must be discovered by the compiler. This leads to a difficult compilation process. Despite this problem, the overwhelming dominance of languages like C and Java makes it tempting to develop a compiler to FPGAs. This has been tried in a large number of projects; for details see Section 4.

Instead of investing a large amount of effort in trying to compile an imperative programming language to FPGAs, we propose to use another model of computation. Although this means that the programmer must learn a new language, it also means that programs can be shorter and clearer, and that a simple compiler is able to generate efficient circuits. In Section 5 we experiment with a number of alternative models of computation by defining toy programming languages. As we will show, by choosing a suitable model of computation it is possible to define a simple but powerful programming language that is easily translated to efficient hardware. Based on these experiments we conclude that using a different model of computation is indeed a very powerful approach to reconfigurable computing. More detailed conclusions are drawn in Section 6.
We assume that once a circuit is described as a netlist (a list of the used components and their connections), it can be automatically translated to an FPGA configuration.

Figure 2: A simple field-programmable gate array, consisting of a grid of logical elements and interconnection hardware.

Figure 3: A simple FPGA logical element.

This is somewhat optimistic, since in practice the efficiency of an implementation is often improved significantly by manual intervention. Since we want to investigate high-level languages for FPGAs, we ignore this aspect in this report. We expect that a full high-level language will allow the user to give hints to the compiler, similar to the register declaration in C.

2 FPGA architecture

In its simplest form, a Field Programmable Gate Array (FPGA) consists of a grid of logical elements (LEs) surrounded by interconnection hardware (Fig. 2). The interconnection hardware provides connections between the inputs and outputs of the logical elements. It contains switches that allow the interconnections to be configured to implement a specific wiring of the logical elements. To communicate with the outside world, it also contains I/O cells that read a signal from a pin, or transmit a signal to a pin of the FPGA chip.


A simple logical element (Fig. 3) consists of a lookup table that implements a simple logical function, and a register to store a single bit; a small software model of such an element is sketched after the list below. The combination of programmable LEs and programmable interconnection allows an FPGA to implement arbitrary digital circuits. Moreover, since the logical elements and the interconnection hardware are programmable with configuration bits that are stored in RAM cells, the implemented digital circuit can be rapidly replaced.

In practice, the basic FPGA architecture is augmented with hardware that provides a more efficient implementation of frequently required functionality. For example:

- Efficient interconnections between neighbouring LEs.
- Specialised carry signals between neighbouring LEs.
- Specialised circuitry for distributing clock signals.
- Specialised blocks containing RAM, multipliers, or other complex logical functions.
- Hierarchical routing.
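To make the logical element concrete, the following is a minimal software model in C of the element of Fig. 3. The 16-bit truth-table encoding and all names are our illustrative assumptions, not any vendor's actual configuration format.

    #include <stdbool.h>
    #include <stdint.h>

    /* Software model of a 4-input logical element: a 16-bit lookup table
     * holds the truth table of the configured function, and a register
     * stores a single bit. */
    typedef struct {
        uint16_t lut;   /* configuration bits: one output bit per input combination */
        bool     reg;   /* the single-bit register */
    } logical_element;

    /* Combinational output: the four data inputs form an index into the table. */
    bool le_lut_out( const logical_element *le,
                     bool d1, bool d2, bool d3, bool d4 ){
        unsigned index = (d1 ? 1 : 0) | (d2 ? 2 : 0) | (d3 ? 4 : 0) | (d4 ? 8 : 0);
        return (le->lut >> index) & 1;
    }

    /* Simulated clock edge: when enabled, latch the LUT output into the register. */
    void le_clock( logical_element *le, bool enable,
                   bool d1, bool d2, bool d3, bool d4 ){
        if( enable )
            le->reg = le_lut_out( le, d1, d2, d3, d4 );
    }

For instance, lut = 0x8000 configures the element as a four-input AND (only the all-ones input combination yields 1), while lut = 0xFFFE configures it as a four-input OR.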

3 Example applications of FPGA CCMs

Before we consider new approaches to programming FPGAs, it is useful to investigate what has been achieved with existing tools. A good illustration of potential achievements is a study done by Graham and Nelson [16]. They report a speedup of roughly a factor 70 for an FPGA implementation of a genetic algorithm for the travelling salesman problem. As is typical for FPGA implementations, there are two reasons for the success. The first is the sheer speed of the CCM they implement: in this case it is 9 times faster than the software implementation they compare to. The second is that multiple engines can be used in parallel: in this case they were able to put 8 engines on the FPGA board they used.

Table 1 lists this result together with a number of other examples of successful applications of FPGA CCMs. The speedups shown are approximate, since the different projects use different software platforms as reference, and we have often summarised a range of results in one characteristic speedup figure. Nevertheless, the results indicate that for all these problems highly significant speedups can be achieved by using an FPGA CCM. Moreover, improvements in integrated circuit fabrication, as predicted by Moore's law, can be expected to favour CCMs more than standard processors: both gain equally from improved operating speeds, but CCMs also benefit linearly from increased gate counts (as long as adding more replicas of the CCM helps), whereas for standard processors the benefit is less. In standard processor design it is becoming more and more difficult to use the ever increasing number of transistors for something useful; larger caches, more ALUs, or deeper pipelines are often not very effective in increasing processing power.

Algorithms that benefit most from a CCM implementation tend to be integer operations with much fine-grain parallelism. Good examples can be found in image, signal, and string processing, and in encryption and decryption. NP-hard problems also tend to be good candidates, since next to subtle algorithms they often require sheer processing power.


Application                               Speedup     Reference
Sub-graph isomorphism (NP-hard)           10          [21]
Nesting problem (NP-hard)                 13          [4]
Encryption using IDEA algorithm           14          [28]
Infrared automatic target recognition     20          [25]
VLSI design rule checking                 25          [30]
Lempel-Ziv compression                    30          [19]
Image processing operations               30          [13]
Parsing (speech recognition)              10 to 70    [9]
Travelling Salesman problem (NP-hard)     70          [16]
Fast Fourier Transform                    102         [27]
Enigma cipher machine emulation           400         [33]
Satisfiability problem (NP-hard)          1 to 7000   [2]

Table 1: Example results of using FPGAs for reconfigurable computing.

FPGAs have recently become large enough [24] to accommodate substantial numbers of floating point operations, which makes reconfigurable computing feasible for a large new range of computationally intensive problems.

Since FPGA programming is currently hard, applications tend to be algorithms for which the benefits are overwhelmingly clear. Once FPGA programming is more accessible, we expect that FPGA implementations for which the benefits are not so clear will also be tried, thereby widening the field of applications considerably.

Another important advantage of an FPGA implementation is that it typically consumes only a fraction of the power of an implementation on a standard processor system. This makes reconfigurable computing very attractive for embedded systems.

4 Existing hardware description languages

Generally, a hardware description language consists of three parts:

- A structural language, describing the components in the circuit and the connections between them.
- A simulation language, describing the behaviour of components or larger aggregates of components.
- A circuit generator language. It is used to generate repetitious circuits, define generators for families of similar circuits, and improve platform independence.

The distinction between the structural and simulation languages is made for pragmatic reasons: since constructing the actual hardware is an expensive and lengthy process, hardware design relies heavily on circuit simulation. Because circuit simulation is often slow, it is attractive to replace parts of the circuit with ‘stand-in’ components that are easily simulated but cannot be translated to hardware.

From a slightly different point of view, one can say that although hardware description languages are intended to describe the behaviour of a circuit, often only a subset can be translated to hardware. Circuit design then consists of first describing the circuit in the full language, verifying the behaviour by simulation, and then refining the description until only the synthesizable subset is used. In the hardware design context, the more abstract design is called the behavioural model, while the synthesizable design is called the structural model. No matter how you look at it, it is desirable to make the synthesizable subset as large as possible; ideally, the entire language should be efficiently synthesizable.

4.1 Translating imperative programs to hardware

Considering the overwhelming dominance of imperative programming languages such as C, C++ and Java in software development, it is not surprising that using this model of computation for reconfigurable computing has been studied.

Translating standard programming languages fully is often difficult. Two major obstacles are dynamic memory allocation and recursion, both of which are virtually impossible to support in a CCM. A sophisticated compiler is able to eliminate these constructs in some cases, but there will always be cases where elimination is not possible. There are also smaller obstacles, such as data structures of varying size and exceptions. Consequently, reconfigurable computing requires a dedicated imperative programming language. It can be obtained by defining a subset of an existing language, defining a new one, or a mixture of the two (i.e. extending a subset with new constructs).

The process of compiling such a dedicated language is described, for example, by Budiu et al. [6] and Edwards et al. [14]. It roughly consists of the following phases:

1. Partition the code into independent ‘engines’.
2. For each engine, aggressively inline function calls to produce large sections of straight-line code.
3. Transform each section to predicated Static Single Assignment (SSA) form.
4. Group independent operations into parallel sections.
5. Generate hardware to sequentially execute each of the parallel sections.
6. Apply simplifications and optimisations to the circuit.

In predicated SSA form, every variable is assigned exactly once. Compared to the original program, predicated SSA form contains additional variables, and choice expressions that select between possible values based on a choice predicate.

The compilation process contains a number of heuristics, in particular in the partitioning and optimisation phases. Consequently, the generated circuits vary considerably in quality, making the compiler more difficult to use.
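To make phase 3 concrete, the following C fragment shows a small piece of code and an equivalent in the style of predicated SSA form; the variable names and the use of the C conditional operator as the choice expression are our illustrative assumptions.

    #include <stdbool.h>

    /* Original imperative form: the variable y is assigned on two paths. */
    int original( int x, int a, int b ){
        int y;
        if( x>0 ) y = a+b;
        else      y = a-b;
        return y;
    }

    /* The same computation in the style of predicated SSA form: every name
     * is assigned exactly once, and a choice expression selects between the
     * candidate values using the choice predicate p. In hardware, each
     * single-assignment value becomes a wire and the choice expression a
     * multiplexer. */
    int predicated_ssa( int x, int a, int b ){
        bool p  = x>0;          /* choice predicate */
        int  y1 = a+b;          /* value on the 'then' path */
        int  y2 = a-b;          /* value on the 'else' path */
        int  y3 = p ? y1 : y2;  /* choice expression */
        return y3;
    }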


Figure 4: An example of a pipeline of engines (compression, encryption, and error correction stages). Such a pipeline is not simple to express in an imperative programming language.

For compilation to a standard processor, programmers have a fairly accurate idea of what the performance of a section of code will be. When the same section of code is translated to hardware, the cost is no longer so simple to estimate, because some language constructs incur a very high cost in terms of required chip area or time. To achieve high performance in these circumstances, the programmer must know the translation process in detail, and must carefully choose language constructs that are efficiently translated.

For example, assume that we have an array a of 16 numbers, and that we want to find the index of an array element that has value 1. The natural way to do this in C would be:

    int ix = -1;
    for( int i=0; i<16; i++ ){
        if( a[i]==1 ) ix = i;
    }
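To suggest why such a loop is a good candidate for unrolling, the sketch below (our illustration; the function name is invented) makes the 16 comparisons explicit. All comparisons are independent, so a hardware translation can evaluate them in parallel; the chain of conditional assignments, in which the last match wins, corresponds to a priority encoder.

    /* Unrolled form of the search above: the 16 comparisons are independent
     * and can be evaluated in parallel in hardware. */
    int find_one_unrolled( const int a[16] ){
        int ix = -1;
        if( a[0]==1 )  ix = 0;
        if( a[1]==1 )  ix = 1;
        if( a[2]==1 )  ix = 2;
        if( a[3]==1 )  ix = 3;
        if( a[4]==1 )  ix = 4;
        if( a[5]==1 )  ix = 5;
        if( a[6]==1 )  ix = 6;
        if( a[7]==1 )  ix = 7;
        if( a[8]==1 )  ix = 8;
        if( a[9]==1 )  ix = 9;
        if( a[10]==1 ) ix = 10;
        if( a[11]==1 ) ix = 11;
        if( a[12]==1 ) ix = 12;
        if( a[13]==1 ) ix = 13;
        if( a[14]==1 ) ix = 14;
        if( a[15]==1 ) ix = 15;
        return ix;
    }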

    out:int
    out = counter [reset,enable];
    reset = enable & out>=9;

As a larger example, the following circuit calculates the greatest common divisor (GCD) of two numbers:

    def gcd [load:bool,aval:int,bval:int] => [found:bool,val:int]
        register a:int = 1;
        register b:int = 1;
        register found:bool = false;
        register val:int = 12;
        swap = a

    def dnaregister [load:bool,in:dnaPat] => match:bool
        register reg:dnaPat;
        reg' = load?in:reg;
        [r0,r1,r2,r3] = reg;
        [i0,i1,i2,i3] = in;
        match = (r0==i0) & (r1==i1) & (r2==i2) & (r3==i3);

    def dnamatcher [load:bool,val:int,regno:int] => [match:bool,reg:int]
        register v0,v1,v2,v3:int;
        v = [v0,v1,v2,v3];
        l0 = load & regno == 0;
        l1 = load & regno == 1;
        l2 = load & regno == 2;
        l3 = load & regno == 3;
        m0 = dnaregister[l0,v];
        m1 = dnaregister[l1,v];
        m2 = dnaregister[l2,v];
        m3 = dnaregister[l3,v];
        v0' = val;
        v1' = v0;
        v2' = v1;
        v3' = v2;
        match = m0 | m1 | m2 | m3;
        reg = m0?0 : m1?1 : m2?2 : m3?3 : 0;
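For reference, the following C sketch shows the classical subtract-and-swap formulation of GCD that such a circuit typically implements, one subtraction step per clock cycle. The loop structure is our assumption about the intended circuit behaviour, not part of the SL code, and positive inputs are assumed.

    /* Subtract-and-swap GCD: a hardware-friendly formulation that needs
     * only a comparison, a subtraction, and a swap per step.
     * Assumes a > 0 and b > 0. */
    int gcd_model( int a, int b ){
        while( a != b ){
            if( a < b ){            /* corresponds to the 'swap' signal */
                int t = a; a = b; b = t;
            }
            a = a - b;              /* a was the larger value */
        }
        return a;                   /* here the circuit would assert 'found' */
    }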

The dnaregister component stores a single reference DNA pattern. It is loaded when the load signal is asserted. The value of in is then stored in the internal register reg. The input pattern in is compared against the reference pattern in reg. If it matches, the match value is asserted.

The dnamatcher component uses four dnaregisters. On every clock cycle, dnamatcher receives a new DNA base that is put in the shift register formed by v0 to v3. The value in the shift register is compared by all dnaregister components. If one of the patterns matches, the output match is asserted, and the output reg is set to the number of the register that contains the match.

SL is trivially translated to hardware: ‘inlining’ all composite components results in a circuit description that exactly specifies the occurring instances and their interconnections. Since recursive use of components is not allowed, inlining is always possible.
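The synchronous register semantics of SL, in which a primed name such as reg' appears to denote the value the register takes on at the next clock edge, can be mimicked in software. The sketch below is a C model of dnaregister, under the assumption that a dnaPat is a tuple of four small integers coding the bases; the struct layout and function names are ours.

    #include <stdbool.h>

    /* A DNA pattern, assumed here to be four integers coding the bases. */
    typedef struct { int b[4]; } dnaPat;

    typedef struct { dnaPat reg; } dnaregister;

    /* One clock cycle: compute the match output from the current state,
     * then commit the next-state value reg' = load?in:reg. */
    bool dnaregister_step( dnaregister *r, bool load, dnaPat in ){
        bool match = true;
        for( int i=0; i<4; i++ )
            match = match && (r->reg.b[i] == in.b[i]);
        r->reg = load ? in : r->reg;   /* the simulated clock edge */
        return match;
    }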

5.2 The EL language

One noticeable pattern that emerges from the SL examples is that the components often require handshaking: one component must wait for input values to be ready, or must wait until output values can be transmitted. Since this construct occurs very often, it is tempting to abstract away from it by leaving it implicit. One solution is to let components change state in reaction to an event instead of a clock pulse.

To illustrate this approach, let us define another small language, called EL (for Event Language). Components have events as input and output. An event can occur on its own, or can have data associated with it. As a simple example, the following is a component that counts the number of events on its input event. For each increment it sends an event value with the new count:

    def counter event => value:int
    event: value
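The event-driven model can be mimicked in software with callbacks. The following C sketch models one plausible reading of the counter component, under the assumption that each incoming event increments an internal count and fires the value output event with the new count; the callback wiring is our stand-in for EL's event connections.

    #include <stdio.h>

    /* Software model of the EL counter component: the input event carries
     * no data; the output event 'value' carries the new count. */
    typedef struct {
        int count;
        void (*value)( int );   /* output event with associated data */
    } counter;

    void counter_event( counter *c ){   /* input event */
        c->count += 1;
        if( c->value )
            c->value( c->count );
    }

    static void print_value( int v ){ printf( "value: %d\n", v ); }

    int main( void ){
        counter c = { 0, print_value };
        counter_event( &c );   /* prints: value: 1 */
        counter_event( &c );   /* prints: value: 2 */
        return 0;
    }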
