CREC: A Novel Reconfigurable Computing Design Methodology

Octavian Creţ, Kalman Pusztai, Cristian Vancea, Balint Szente
Technical University of Cluj-Napoca, Computer Science Department
Bariţiu 26, 3400 Cluj-Napoca, Romania
{Octavian.Cret, Kalman.Pusztai}@cs.utcluj.ro, [email protected], [email protected]

Abstract

Most research in the field of Reconfigurable Computing has been oriented towards applications involving low-granularity operations and high intrinsic parallelism. CREC is an original, low-cost, general-purpose Reconfigurable Computer whose architecture is generated through a Hardware/Software CoDesign process. The main idea of the CREC system is to generate the best-suited hardware architecture for the execution of each software application. The CREC Parallel Compiler parses the source code and generates the hardware architecture, based on multiple Execution Units. The hardware architecture is described in VHDL code generated by a program. Finally, CREC is implemented in an FPGA device. The great flexibility offered by the general-purpose CREC system makes it interesting for a wide class of applications, mainly those involving high intrinsic parallelism, but also other kinds of computations.
1. Introduction and related work

Programmable architectures are those that heavily and rapidly reuse a single piece of active circuitry for many different functions. Configurable architectures are those in which the active circuitry can perform any of a number of different operations, but the function cannot be changed from cycle to cycle. The most common examples are processors that can perform different instructions on their ALUs on every cycle (for programmable architectures) and FPGA devices (for configurable architectures) [1]. The efficiency of Reconfigurable Computing has been proven for several types of applications, such as wireless communications, DSP, image processing, pattern recognition, artificial life, evolvable hardware, etc. In most projects, the main idea was to integrate a small processor together with a reconfigurable computing unit inside a single chip ([2], [3]), thus achieving considerable
performance improvements for specific applications, sometimes surpassing even specialised DSP processors.

One of the first proposals for a reconfigurable architecture using commercial FPGAs was the Programmable Active Memory (PAM) from the DEC Paris Research Laboratory [12]. Based on FPGA technology, a PAM is a virtual machine controlled by a standard microprocessor. The system can be dynamically configured into a large number of application-specific circuits. PAM introduced the concept of "active" memory: attached to a high-speed bus of the host computer, a PAM acts like any RAM module, but it processes data between write and read instructions.

Programmable datapaths were proposed by RaPiD [13], a linear array of function units composed of 16-bit adders, multipliers and registers connected through a reconfigurable bus. The logic blocks are optimized for large computations: they perform the operations much more quickly and consume less chip area than a set of smaller cells connected to form the same type of structure.

The Garp project [3] introduced the idea of a configuration cache, which holds recently stored configurations for rapid reloading. Run-time reconfiguration, which allows the cell program to be modified during execution, is a concept explored by many reconfigurable computing research projects, such as Chimaera [14].

The RAW microprocessor chip [15] comprises a set of replicated tiles. Each tile contains a simple RISC processor, a small amount of configurable logic and a portion of memory for instructions and data. Each tile has an associated programmable switch that connects it to a wide-channel point-to-point interconnect. This architecture is estimated to achieve performance levels 10x-100x higher than workstations for many applications. Another important concept introduced was systolic flow execution: the ability to flow data to the function units from an interconnection network, in addition to the traditional mode of fetching operands from memory and executing in the function units.
General-purpose reconfigurable computers (GPRCs) are a long-standing research topic. A very important GPRC implementation was the DPGA [1]. The DPGA basically consists of an array of reconfigurable computing elements (RCEs, similar to the reconfigurable cells present in a classical FPGA device), but the context memory of each RCE can store not just one configuration context (as in FPGA devices) but several. The local storage of several configuration contexts is a distinctive feature of this class of RC devices, and it was adopted as one of the main characteristics of the CREC computer. However, implementations of general-purpose reconfigurable computing systems remain rare. In this paper we introduce CREC (Calculator REConfigurabil): a low-cost general-purpose reconfigurable computer with a dynamically generated architecture, built in a Hardware/Software CoDesign manner. CREC is based only on classical FPGA devices, the VHDL hardware description language and a high-level programming language (such as JAVA), without integration into a dedicated VLSI chip.
2. Main concepts

The main idea of the CREC design is to exploit the intrinsic parallelism present in many low-level applications by generating the best-suited hardware for implementing a specific application. Thus, a different hardware architecture implements each application. The link between software and hardware is very tight: for each program written in CREC assembly language, the CREC environment generates the best-suited architecture. This architecture is then downloaded into an FPGA device (which constitutes the physical support of CREC), and the application is executed by it. The CREC architecture is thus built without creating a specialised VLSI chip (as in most RC implementations), but only by using existing tools and hardware: a VHDL compiler and synthesizer, a C++ or JAVA compiler, FPGA download programs and FPGA device(s). Even though the compilation time increases, for applications with long execution times (image processing, pattern recognition, genetic algorithms, etc.) this overhead is largely repaid at run time by parallel execution instead of the classical sequential one. This approach opens the way for a new family of computing systems with no a priori restriction on the number of execution units (of course, there is always a limit given by the FPGA device capacity, but with technological progress this capacity will continuously increase, leading to more complex CREC architectures).
3. Design flow

The CREC architecture combines several architectural styles (RISC, DPGA, ILP-like multiple execution units) into a scalable structure. CREC is the result of a Hardware/Software CoDesign process, in which the hardware part is dynamically and automatically generated during compilation. The resulting architecture is well suited to the application because it exploits the application's intrinsic parallelism. The main steps in program execution are:
1. The application source code is written in a specific CREC assembly language, in RISC style. All instructions are encoded on the same number of bits.
2. The source code is compiled by a parallel compiler, which implements instruction-level parallelism (ILP). The compiler detects and analyses data dependencies, then determines which instructions can be executed in parallel. A collection of instructions that can be executed in parallel constitutes a program slice; the whole program is thus divided into slices. The slice size depends on the number of execution units chosen for program execution.
3. According to the slice size, the hardware structure that will run the program is generated and materialized in an FPGA device. The hardware architecture already exists as a VHDL source file, so at this point it is only adjusted automatically according to the slice size and the other options discussed below.
4. The memory is divided into three parts: Data Memory (off-chip), Instruction Memory (on-chip, in DPGA style) and Operand Memory (on-chip or off-chip, depending on the FPGA device capacity).
5. The VHDL file is compiled and the FPGA device is configured.
4. The software design process

Two implementation variants exist so far. The first variant works under both the Windows and Linux operating systems (it is written in ANSI C and uses a Bison-generated parser). Its interface is less user-friendly, but it is more portable than the second one.
Figure 1. The CREC Design Flow (Integrated CREC Development System: application source code written in CREC Assembly Language → Parallel Compiler, which determines the number of slices and schedules the instructions → VHDL source code Generator, written in JAVA → VHDL file compilation → FPGA configuration process → application execution)
In the second variant, the software components of the CREC computer are written in JAVA and embedded in an integrated development system consisting of: the RISC Editor, the Parallel Compiler, the Instruction Expander and the Test Bench.
4.1 The RISC Editor

This editor is a simple tool for writing RISC code to be compiled by the Parallel Compiler. Its main features are syntax help, syntax checking, and saving and loading of RISC source files. The parser is implemented with JavaCC (released jointly by Metamata and Sun Microsystems).
4.2 The Parallel Compiler

This program parses the CREC-RISC source code in order to make several important decisions about the execution system that will be generated. The Parallel Compiler determines the minimal number of program slices and decides which instructions will be executed in parallel in each slice. To do this, the Parallel Compiler must respect a set of rules, not detailed here for lack of space; one example of such a rule is that each slice can contain at most one Load, Store or Jump statement. The compiler reads the application source code (written in CREC assembly language) and generates a file in a standard format that describes the tailored CREC computer: the size of the various functional parts, the subset of the instruction set that is involved, the number of execution units, etc., together with the sequence of instructions that makes up the program.
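The scheduling step can be illustrated by the following simplified Java sketch. It is not the actual CREC compiler code; the class and field names are hypothetical, and only two constraints are modelled: an instruction may not be placed before the slice following the one in which its last register dependency was produced, and each slice may contain at most one Load, Store or Jump.

```java
import java.util.*;

// Simplified sketch of slice formation (hypothetical types, not the real CREC compiler).
class Instruction {
    List<String> reads;      // registers read, e.g. ["R2"]
    String writes;           // register written, or null
    boolean isLoadStoreJump; // at most one such instruction per slice
    Instruction(List<String> r, String w, boolean lsj) { reads = r; writes = w; isLoadStoreJump = lsj; }
}

class Slicer {
    static List<List<Instruction>> buildSlices(List<Instruction> program) {
        List<List<Instruction>> slices = new ArrayList<>();
        Map<String, Integer> lastWrite = new HashMap<>();    // register -> slice that produced it
        for (Instruction ins : program) {
            int earliest = 0;
            for (String reg : ins.reads)                     // respect data dependencies
                earliest = Math.max(earliest, lastWrite.getOrDefault(reg, -1) + 1);
            int s = earliest;
            while (s < slices.size() && ins.isLoadStoreJump
                   && slices.get(s).stream().anyMatch(i -> i.isLoadStoreJump))
                s++;                                         // at most one Load/Store/Jump per slice
            while (slices.size() <= s) slices.add(new ArrayList<>());
            slices.get(s).add(ins);
            if (ins.writes != null) lastWrite.put(ins.writes, s);
        }
        return slices;
    }
}
```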
Figure 2. The CREC Development System

The Instruction Set respects the design rules of RISC computers. The number of CREC instructions is relatively large; the arithmetic is performed only on unsigned integers,
the only instructions accessing memory being Load and Store; all operations are performed on data stored in the internal registers. The usual arithmetic, logic, data-movement and branching instructions are present, and the resulting CREC architecture contains only the hardware needed to execute the subset of instructions used in the program.

What makes CREC particular with respect to classic processors, besides its reconfigurable nature, is its parallel nature: each register has an associated Execution Unit (EU). Thus, at each moment during program execution, there are N distinct EUs, each one executing the instruction assigned to it. The compiler accomplishes the assignment of instructions to EUs according to their nature: some instructions specify the precise EU that must execute them, while others can be executed by any EU. For instance, an instruction like "mov R1, 7" will be executed by EU1, since it works with register R1, while the generic instruction "jmp 3" has no specific EU and will be assigned one by the compiler, depending on availability. Each EU is potentially able to execute any instruction from the CREC Instruction Set. This means that the accumulator registers of all EUs have equal capacities, but the internal structure of each EU will differ according to the subset of instructions (from the CREC Instruction Set) that it will actually execute.

Having a greater number of EUs has the obvious advantage of introducing Instruction-Level Parallelism, i.e. the execution of several instructions from the same program in parallel. Dividing a program written in a sequential manner into portions of code to be executed at the same time is the task of the compiler. It analyses the data dependencies and divides the program into slices. Slices are a very important concept in the CREC compiler design: when the compiler parses a file, it checks the data dependencies and determines which instructions can be executed at the same time by their associated EUs. A slice is made up of the instructions that are assigned to the EUs for simultaneous execution. The following example shows the execution of a simple CREC program: a classical, non-optimal multiplication of two integers using three EUs and no overflow check:

[1] MOV R1, 2
[2] MOV R2, 3
[3] MOV R3, 3
[4] ADD R1, R2
[5] DEC R3
[6] JNZ R3, 4
[7] MOV STORE BUFFER, R1
[8] STORE 200

This sequential code is transformed into slices by taking into consideration the data dependencies that occur (Table 1). Remark: the argument of the conditional jump has changed; it is translated from the sequential jump address
to a slice number. The number of system cycles is also reduced by 37.5%.

Table 1. Instruction scheduling in slices

Slice   EU1                   EU2          EU3
1       mov R1, 2             mov R2, 3    mov R3, 3
2       add R1, R2                         dec R3
3                                          jnz R3, 2
4       mov STORE BUFF, R1
5       store 200
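The assignment of instructions to EUs in this example follows the rule quoted above: an instruction that names a register is bound to that register's EU, while generic instructions can be placed on any EU. A minimal Java sketch of that rule is given below (hypothetical names; the real compiler also balances load between EUs and handles the Load/Store/Jump restriction):

```java
// Minimal sketch of the EU-assignment rule (hypothetical, not the actual CREC compiler).
public class EuAssigner {
    /**
     * Returns the 1-based index of the EU that must execute the instruction,
     * or 0 if any EU may execute it (e.g. "jmp 3", "store 200").
     */
    public static int requiredEu(String mnemonicAndOperands) {
        // Look for a register of the form Rn in the operand list.
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("\\bR(\\d+)\\b")
                .matcher(mnemonicAndOperands);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    public static void main(String[] args) {
        System.out.println(requiredEu("mov R1, 7"));   // 1 -> EU1
        System.out.println(requiredEu("dec R3"));      // 3 -> EU3
        System.out.println(requiredEu("jmp 3"));       // 0 -> any EU
    }
}
```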
4.3 The Instruction Expander and the Test Bench

The Parallel Compiler works with the explicit form of the instructions, generated by the Instruction Expander, which receives the CREC source file and writes the expanded forms to a new file. The GUI of the Instruction Expander allows the user to modify the expanded form of any instruction (e.g. to assign an instruction to another EU). File operations such as Save and Load are also provided. The Test Bench is a simulator that allows the user to follow the execution steps, i.e. it shows the instructions executed on each clock cycle by each Execution Unit.
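As an illustration of what the Test Bench displays, the following Java sketch (hypothetical, not the actual tool) prints, cycle by cycle, the instruction executed by each EU for a sliced program such as the one in Table 1:

```java
// Hypothetical sketch of a Test-Bench-style trace: one line per slice (clock cycle),
// one column per EU. Empty strings mark EUs that are idle in that slice.
public class SliceTrace {
    public static void print(String[][] slices) {
        for (int cycle = 0; cycle < slices.length; cycle++) {
            StringBuilder line = new StringBuilder("cycle " + (cycle + 1) + ":");
            for (int eu = 0; eu < slices[cycle].length; eu++)
                line.append("  EU").append(eu + 1).append("=[")
                    .append(slices[cycle][eu]).append("]");
            System.out.println(line);
        }
    }

    public static void main(String[] args) {
        print(new String[][] {
            {"mov R1, 2",          "mov R2, 3", "mov R3, 3"},
            {"add R1, R2",         "",          "dec R3"},
            {"",                   "",          "jnz R3, 2"},
            {"mov STORE BUFF, R1", "",          ""},
            {"store 200",          "",          ""}
        });
    }
}
```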
5. The VHDL source code generator

The information produced by the compiler is transmitted to the hardware structure by means of a VHDL source file; after compiling this VHDL file, the different components of the hardware architecture are configured. The VHDL file contains already-written source code in which the main architectural parameters are given as generics and constants, so the hardware structure can be resized according to the needs of the application. The following components can be tailored:
- The number of EUs: the source code for an EU already exists and is instantiated and tailored as many times as necessary for the application;
- The register width in each EU (all registers, stacks and buffers have the same word length);
- The size of the Instruction Memory;
- The slice-mapping block.
This method increases the operating speed and provides considerable flexibility: the hardware architecture is resized according to the application requirements. If a bus or control signal is not used, it will not appear, because no VHDL code is created for it. Each EU is able to execute the most common arithmetic and logic operations on unsigned integer values; operations that are useless for a specific configuration generate no VHDL code. To be used, this tool needs the binary files provided by the Parallel Compiler, in the special format already established, containing the information presented above. The result is a set of VHDL files describing the structure of the basic hardware elements. Figure 3 shows the linkage between these elements.
[Figure 3: block diagram showing the Slice Memory, Slice Counter and Slice Stack Memory, the per-EU Instructions Memories and Operand Memories, the EUs, the Data Stack Memory, the Store Buffer and Load Buffer, and the Data Memory, interconnected through address and data lines.]
Figure 3. The general CREC architecture

The links between EUs are point-to-point (these links are used, for example, when executing a "mov R1, R2" instruction), while the Data Memory, the Slice Counter and the Slice Stack Memory are accessed via Address, Data and Control busses. Each Operand Memory can be accessed only by its corresponding EU. The VHDL source code produced by the VHDL Generator program can be compiled without any further adaptation; the whole design flow is intended to be fully automatic, i.e. user involvement in the process is kept to a minimum.
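To make the tailoring step concrete, the sketch below shows how a generator written in Java might emit the top-level VHDL generic map for a given configuration (the generic names and the template are assumptions of ours; the actual CREC VHDL source and its generics are not reproduced here):

```java
// Hypothetical sketch of the VHDL Generator: it only fills in top-level generics
// (number of EUs, register width, instruction-memory depth). The real generator
// also emits the slice-mapping block and prunes unused units, buses and signals.
public class VhdlTopGenerator {
    public static String generate(int numEus, int registerWidth, int instrMemDepth) {
        StringBuilder vhdl = new StringBuilder();
        vhdl.append("architecture generated of crec_top is\n");
        vhdl.append("begin\n");
        for (int i = 1; i <= numEus; i++) {
            vhdl.append("  eu").append(i).append(": entity work.execution_unit\n");
            vhdl.append("    generic map (REG_WIDTH => ").append(registerWidth)
                .append(", INSTR_MEM_DEPTH => ").append(instrMemDepth).append(")\n");
            vhdl.append("    port map (clk => clk, reset => reset);\n"); // ports simplified
        }
        vhdl.append("end architecture;\n");
        return vhdl.toString();
    }

    public static void main(String[] args) {
        // e.g. a CREC-3 configuration with 8-bit registers
        System.out.println(generate(3, 8, 256));
    }
}
```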
6. The hardware architecture

The hardware configuration is described by VHDL code, generated and optimised by a package of programs. The optimisation is done by the VHDL Source Code Generator and consists of eliminating from the final structure every element, signal or bus that is not needed. As can be seen in Figure 3, the hardware architecture is composed of:
1. The N EUs;
2. N local configuration memories for the N EUs, in DPGA style;
3. A Data Stack Memory;
4. A Slice Stack Memory, used to store the current slice address;
5. A Slice Program Counter;
6. An associative memory (called the slice-mapping block) that maps instructions to the slices to be executed by each EU;
7. A Store Buffer and a Load Buffer (temporary data buffers used to move information to/from the Data Memory);
8. A Data Memory;
9. Operand Memories, containing the operands for the EUs.
For lack of space, only the structure of the main component of the CREC architecture, the EU, is detailed in the following sections.
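For readers who prefer code to block diagrams, the component list above can be summarised by the following Java sketch of a system-level model (field and type names are hypothetical, and the memory sizes are placeholders rather than the actual CREC parameters):

```java
// Hypothetical structural model mirroring the component list above.
public class CrecSystemModel {
    static class ExecutionUnit { int accumulator; boolean carryFlag, zeroFlag; }

    final int n;                       // number of EUs
    final ExecutionUnit[] eus;         // 1. the N EUs
    final int[][] configMemories;      // 2. N local configuration memories (DPGA style)
    final int[] dataStack;             // 3. Data Stack Memory
    final int[] sliceStack;            // 4. Slice Stack Memory (slice return addresses)
    int sliceProgramCounter;           // 5. Slice Program Counter
    final int[][] sliceMap;            // 6. slice-mapping block: slice -> instruction per EU
    int storeBuffer, loadBuffer;       // 7. temporary buffers to/from the Data Memory
    final int[] dataMemory;            // 8. Data Memory
    final int[][] operandMemories;     // 9. one Operand Memory per EU

    CrecSystemModel(int n, int slices) {
        this.n = n;
        eus = new ExecutionUnit[n];
        for (int i = 0; i < n; i++) eus[i] = new ExecutionUnit();
        configMemories  = new int[n][256];
        dataStack       = new int[64];
        sliceStack      = new int[16];
        sliceMap        = new int[slices][n];
        dataMemory      = new int[1024];
        operandMemories = new int[n][64];
    }
}
```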
6.1 The Instruction Set

The instruction set of the CREC Execution Unit is divided into two groups: the Data Manipulation instructions and the Program Control instructions.

The Data Manipulation group contains the instructions that manipulate the value of the EU's accumulator. The general format and some examples are shown in Table 2.

Table 2. The Data Manipulation Instruction Group
Bit fields (0 … N): G | Code | Type | Load (R, C, Z) | D | Selection
Mnemonic     Description      Encoding
ADD Rn, Rm   Rn ← Rn + Rm     1 0 0 0 0 1 1 1 0 x … x
DEC Rn       Rn ← Rn – 1      1 r 1 0 1 r r … r

The G field defines the Instruction Group (1 = Data Manipulation), Code is the operation code, and Type is the operation type (e.g. with Carry, or Left, etc.). The Load field contains the load signals for the Register and for the Carry and Zero flags. D is the Register/Data selection bit (for the tri-state buffers) and Selection drives the first level of multiplexers.

The EU can perform all the necessary instructions; it supports more instructions than the usual microcontrollers. Note that every instruction operates on unsigned numbers. These instructions are:
- Addition with or without carry;
- Subtraction with or without borrow, and compare;
- Logical functions: And, Or, Xor, Not and Bit Test;
- Shift, arithmetic and logic, left or right, by 1;
- Rotate and rotate through carry, left or right, by 1;
- Increment or decrement of the accumulator by 1, and 2's complement.

The Program Control group contains the instructions that alter the program execution; its format is similar to that of the Data Manipulation group. The G field defines the Instruction Group (0 = Program Control) and Code is the operation code. The Condition field contains the condition code, which is compared with the CONDITION BUS to validate the instruction. R is the load signal for the Register, D is the Register/Data selection bit and Selection drives the first level of multiplexers. Every instruction also exists in a conditioned form (denoted by the cc field in the instruction mnemonic), which gives great flexibility.

Table 3. The Program Control Instruction Group
Bit fields (0 … N): G | Code | Conditions | R | D | Selection
Mnemonic     Description        Encoding
JMP [Rn]     Slice Cnt ← Rn     0 0 0 0 0 0 0 0 x … x
Rcc          If true, RET       0 0 1 1 q 0 q q r r … r
The source code can be optimised because most of the Compare-and-Jump statement pairs, which are typical and frequently used, can be eliminated thanks to the conditional forms of the instructions (conditional moves, for example). The instruction categories are the following:
- Slice counter manipulation: Jmp, Call, Ret;
- Data movement: Mov;
- Stack manipulation: Push and Pop;
- Input and output to ports: In, Out;
- Load and Store from/to external memory.
The conditions are: C (Carry), Z (Zero), E (Equal), A (Above), AE (Above or Equal), B (Below), BE (Below or Equal), and the negations of all these with the N prefix.
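The condition codes above can all be derived from the two flags of the Flag Unit. A plausible mapping, assuming the usual unsigned-comparison convention (Equal = Zero, Below = Carry after a compare/subtract) rather than a documented CREC encoding, is sketched below:

```java
// Hedged sketch: maps CREC condition mnemonics to the CF/ZF flags,
// assuming conventional unsigned-compare semantics (not taken from the paper).
public class ConditionEval {
    public static boolean holds(String cond, boolean cf, boolean zf) {
        switch (cond) {
            case "C":  return cf;            // Carry
            case "Z":  return zf;            // Zero
            case "E":  return zf;            // Equal          (after compare)
            case "A":  return !cf && !zf;    // Above          (unsigned >)
            case "AE": return !cf;           // Above or Equal (unsigned >=)
            case "B":  return cf;            // Below          (unsigned <)
            case "BE": return cf || zf;      // Below or Equal (unsigned <=)
            default:
                if (cond.startsWith("N"))    // N prefix negates the base condition
                    return !holds(cond.substring(1), cf, zf);
                throw new IllegalArgumentException("unknown condition: " + cond);
        }
    }
}
```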
6.2 The Execution Unit

The main part of the CREC processor is the scalable EU. The word length of the EU is n*4 bits; in the current implementation the parameter n is limited to 4, so the word length can be up to 16 bits. The complete structure of the EU is presented in Figure 4. The EU consists of six major parts:
1. Decoding Unit – decodes the instruction code;
2. Control Unit – generates the control signals for the Program Control instruction group;
3. Multiplexer Unit – multiplexes the second operand of the binary instructions;
4. Operating Unit – performs the data manipulation operations;
5. Accumulator Unit – stores the instruction result;
6. Flag Unit – contains the two flag bits: the Carry Flag (CF) and the Zero Flag (ZF).
The Operating Unit has a symmetrical organization: on the right side are the binary instruction blocks, and on the left side are the unary operation blocks (performing operations only on the accumulator). Every operation works on unsigned operands; this is the reason why there is no Sign Flag or Overflow Flag. The signed version of CREC is still under development.

There are two variants of the CREC EU implementation. In the first one, each subunit is optimised for the Xilinx VirtexE FPGA family, occupying the same number of Slices (2*n Slices, which is equivalent to n CLBs per subunit), and the dedicated Fast Carry Logic is used. This leads to a platform-dependent solution, but in the case of a processor there is a need to increase the performance of the EU and to obtain nearly equal propagation times. The second variant uses generic VHDL code, which is not optimised for any particular FPGA device family. This increases CREC's portability, but the architectural optimisations become the VHDL compiler's task, reducing the designer's control over the generated architecture. In the following we present the architecture optimised for Virtex FPGAs.
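Before looking at the Virtex-optimised structure, the six parts listed above can be summarised behaviourally. The Java sketch below models one EU executing a data-manipulation step; the operation names and the reduced operation set are our assumptions, and the Control Unit and slice machinery are ignored:

```java
// Hedged behavioural sketch of one EU: operand multiplexing, operating unit,
// accumulator and flags. Reduced to a few operations for illustration.
public class ExecutionUnitModel {
    private int acc;                 // Accumulator Unit
    private boolean cf, zf;          // Flag Unit: Carry Flag, Zero Flag
    private final int mask;          // word length: n*4 bits, up to 16

    public ExecutionUnitModel(int wordLengthBits) { mask = (1 << wordLengthBits) - 1; }

    /** Executes one data-manipulation instruction on the accumulator. */
    public void execute(String op, int operand) {   // operand comes from the Multiplexer Unit
        int result;
        switch (op) {                               // Operating Unit
            case "ADD": result = acc + operand;  break;
            case "SUB": result = acc - operand;  break;
            case "AND": result = acc & operand;  break;
            case "SHL": result = acc << 1;       break;
            case "DEC": result = acc - 1;        break;
            default: throw new IllegalArgumentException("unsupported op: " + op);
        }
        cf  = (result & ~mask) != 0;                // carry/borrow out of the word
        acc = result & mask;                        // Accumulator Unit stores the result
        zf  = (acc == 0);                           // zero detection feeds the Condition Bus
    }

    public int accumulator() { return acc; }
    public boolean carry()   { return cf; }
    public boolean zero()    { return zf; }
}
```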
[Figure 4: block diagram of the Execution Unit, showing the Instruction Code feeding the Decoding Unit (Instruction Decoder) and the Control Unit (Control Signal Generator producing the JMP, CALL, RET, PUSH, POP, LOAD, STORE, MOV, STB and OUT signals, validated by the CONDITION BUS); the Multiplexer Unit (Register MUX over R1 … RN, Data MUX over Input Port, Stack, Immediate Operand and Load Buffer, and the Reg/Data MUX selecting the Register Value or the Operand Value); the Operating Unit (Shift Left, Shift Right, Logic and Arithmetic Units with the Carry Generator); the Accumulator; and the Flag Unit (CF, ZF and the Condition Generator driving the CONDITION BUS).]

Figure 4. The detailed structure of CREC's EU

The four subunits of the Operating Unit are the following:
- Logic Unit – for the logical And, Or and Xor functions, composed of simple LUTs (2 bits per slice).
- Arithmetic Unit – for addition and subtraction with or without carry. Its basic part is the two-bit adder, which can be implemented in a single slice:

Figure 5. The ADD/SUB basic cell

- Shift Left Unit – for the Shift and Rotate Left instructions, paired with increment and decrement by one. The basic element of this shifter is shown in Figure 6.
- Shift Right Unit – for the Shift and Rotate Right instructions, paired with the logical Not. The structure of a shifter part is shown in Figure 7.

The Xilinx Virtex FPGA slices are ideal for implementing a two-bit shifter paired with the increment, decrement and negation (2's complement) operations.

Figure 6. The SHL/ROL/INC/DEC/NEG basic cell

Figure 7. The SHR/ROR/NOT basic cell

The advantages of these structures are that they use only one level of slices, they are cascadable, and the number of slices used increases linearly with the number of bits. All four units use the same number of slices, corresponding to 2 bits per slice. For this reason, the size of the Operating Unit grows linearly with the word length, while the operating time does not increase significantly as the word length grows. There are also two multiplexers on the Carry In lines of the two shifting units; they select the appropriate inputs for the different shift operations. The four subunits generate different Carry Out signals, which are routed to the Carry Flag. For this reason, the following structure is used (Figure 8):

Figure 8. The Carry Out Multiplexer
The two XOR gates are used for the subtraction and negation operations, when the Carry Out bits must be inverted. This structure uses only one CLB. The Execution Unit is customisable: for example, if an EU will not execute logical instructions, this part is simply cut out, resulting in a gain of space. Another important aspect is that the Execution Unit can easily be pipelined for higher-frequency operation, using the Data Flip-Flops embedded in the CLBs. For example, a complete Execution Unit (with all the subunits generated) with an 8-bit accumulator consumes 17 CLBs, while the same Execution Unit with a 16-bit accumulator consumes 33 CLBs.

The output of the Flag Unit is a 6-bit-wide Condition Bus covering the six possible condition cases: Zero, Not Zero, Carry, Not Carry, Above, and Below or Equal. This bus validates the conditioned Program Control instructions. The most critical part of this unit is the detection of the value "zero" on the Data Bus. At word lengths greater than 4 bits it is done on two levels of slices, which may lengthen the execution time. Above 16 bits, the Zero Detector would have to be implemented on three levels of slices, which can cause serious performance losses; this is the reason for limiting the word length of the Execution Units to a maximum of 16 bits. The Flag Unit uses 2 CLBs for an 8-bit Data Bus and 2.5 CLBs for a 16-bit one.

The Multiplexer Unit is built on two levels for optimal instruction encoding. At the first level, the Register and Data MUXs have the same selector, so they share the same field in the instruction format. A second 2:1 MUX selects the input operand of the instruction by means of one bit (Figure 4). The Multiplexer Unit is also customisable: only the input lines used by the EU are implemented, so it occupies only the necessary space. In the VirtexE FPGAs, only 8:1 multiplexers can be implemented on a single level of CLBs, so the Multiplexer is optimised for up to 8 inputs; if the CREC architecture includes more than 8 EUs, these multiplexers are implemented on two levels of logic. The 2:1 bus multiplexer, which selects the output of the Data or Register Multiplexer, is implemented with tri-state buffers. The most important aspect is that this unit's size increases linearly with the word length and with the number of EUs. For example, a CREC with 8 EUs uses 96 CLBs at 8 bits and around 160 CLBs at 16 bits. This is the biggest part of the Execution Unit; the Operating Unit is less than half the size of the Execution Unit.

The Accumulator Unit is built from simple synchronous D Flip-Flops; it stores the primary operand of each Data Manipulation instruction as well as its result. The Control Unit generates the validation signals for the Program Control instructions, taking into account the Condition Bus: the condition code is compared with the flags set by the previous non-program-control instruction. The Decoding Unit generates the appropriate signals for the four functional parts of the Operating Unit and for the Carry In and Carry Out multiplexers; it is a fast combinational logic circuit (not look-up-table based), thus achieving greater speed.

In conclusion, one Execution Unit uses only a fraction of the FPGA device: an EU with 8-bit-wide registers occupies approximately 4%, and an EU with 16-bit-wide registers approximately 6%, of the available CLBs in a Virtex600E FPGA chip.
7. Experimental results

At this point of the project, the software part of the CREC system is finished, though it may still be modified as the Instruction Set evolves and new features are added (for example, new and more complex macros). The parallel compiler is now fully functional in two variants, working under both the Windows and Linux operating systems. The first variant is written in ANSI C and uses a Bison-generated parser. The second variant is written in JAVA using the Java Compiler Compiler (JavaCC) and is integrated in a complex development system including the RISC Editor, the Parallel Compiler, the Instruction Expander and the Test Bench.

The development system has been tested and verified on some classical benchmark applications; the tests were performed on classical general-purpose algorithms, and some of the results are shown in Table 4. The algorithms were executed on CREC architectures with different numbers of EUs; the reference architecture is the classical one, in which one instruction is executed per cycle. The performance indexes listed in Table 4 show how many times faster a given algorithm executes on an optimised CREC system than under the classical execution flow (one instruction per cycle).

Table 4. The performance evaluation of the CREC system
Algorithm             Optimized CREC (Worst)   Optimized CREC (Best)   CREC architecture
Bresenham's line      3.00                     3.00                    CREC-10
Bresenham's circle    4.08                     4.60                    CREC-12
Bubble sort           1.12                     1.67                    CREC-3
Quick sort            2.66                     3.00                    CREC-5
Map colouring         1.53                     1.70                    CREC-8
Integer square root   1.61                     1.64                    CREC-4

The efficiency of the CREC system is evident for all kinds of general-purpose (or non-specific) algorithms.
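In other words, for each benchmark the performance index is the ratio of the two cycle counts; written as a formula (the notation is ours, not the paper's):

```latex
% Performance index (speedup) relative to classical one-instruction-per-cycle execution.
% A cycle-count reduction of 37.5% (the multiplication example of Table 1)
% corresponds to a speedup of 1/(1-0.375) = 1.6.
\[
  \text{speedup} \;=\; \frac{N_{\text{cycles}}^{\text{sequential}}}{N_{\text{cycles}}^{\text{CREC}}}
  \;=\; \frac{1}{1-\text{reduction}}
\]
```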
Higher improvements are obtained for DSP-like applications. The hardware part of the CREC system has been implemented in some earlier versions (with a smaller, less complex instruction set and fewer architectural optimisations). The targeted FPGA device was a Xilinx VirtexE 600. The physical implementation was made on a Nallatech Strathnuey + Ballyderl board, containing a Virtex 600E and a Spartan2-150 FPGA device. Several CREC architectures were simulated and downloaded onto this development system, the simulations yielding very good results from both the area and the timing points of view. A preliminary version of the CREC system with 4 EUs (CREC-4) and 4-bit-wide registers occupies 4% of the CLBs and 5% of the BlockRAMs in the VirtexE600 device; a CREC architecture with 4 EUs and 16-bit-wide registers occupies 18% of the CLBs and 20% of the BlockRAMs. The operating clock frequency is 100 MHz (for simulation purposes this frequency is lower, because tests must be done by routing signals outside the FPGA chip, which significantly increases the propagation delays).
8. Conclusions and further work

In this paper we presented CREC, a general-purpose, low-cost reconfigurable computer built through Hardware/Software CoDesign. CREC exploits the intrinsic parallelism present in many low-level applications by generating the best-suited hardware for implementing a specific application; in this way, a different hardware architecture implements each application. This architecture is downloaded into an FPGA device (which constitutes the physical support of CREC), and the application is then executed by it. The CREC structure is basically composed of two main parts: the parallel compiler and the hardware structure, implemented in a Xilinx Virtex FPGA device. The physical implementation is made on a Nallatech Strathnuey + Ballyderl board containing Xilinx FPGAs. The CREC computing system has proven to be a very effective, low-cost reconfigurable architecture, allowing the execution of parallel applications with considerable performance improvements, not only for DSP-like algorithms but for all kinds of applications.

Further research will consist of making it possible to write high-level programs for CREC. In a first phase, this implies extending the functionality of the parallel compiler, and then creating a C compiler for CREC applications (the scheduling of instructions into slices will be performed from the C source code). In the future, CREC will be used for further research on hardware distributed computing, using FPGA configuration over the Internet (an application that is already implemented and tested). Another use of CREC systems will be the development and testing of new parallel algorithms for hardware distributed computing.
9. References

[1] DeHon, A. "Reconfigurable Architectures for General-Purpose Computing". PhD Thesis, Massachusetts Institute of Technology (1996).
[2] Singh, H., Lee, M.-H., Lu, G., Kurdahi, F., Baghrzadeh, N., Chaves Filho, E. "MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications". IEEE Transactions on Computers, Vol. 49, No. 5 (May 2000).
[3] Hauser, J., Wawrzynek, J. "Garp: A MIPS Processor with a Reconfigurable Coprocessor". Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '97), April 16-18 (1997).
[4] Lattice Logic. "Technology for FPGAs and PLDs". Embedded Systems Magazine, Vol. 6, No. 44, p. 11 (September 2002).
[5] Graham, P., Nelson, B. "Reconfigurable Processors for High-Performance, Embedded Digital Signal Processing". Proceedings of the Ninth International Workshop on Field Programmable Logic and Applications (August 1999).
[6] Alhquist, C.G., Nelson, B., Rice, M. "Optimal Finite Field Multipliers for FPGAs". Proceedings of the Ninth International Workshop on Field Programmable Logic and Applications (August 1999).
[7] Wolinski, C., Gokhale, M., McCabe, K. "A Reconfigurable Computing Fabric". Technical Report (LAUR), Los Alamos National Laboratory (2002).
[8] Park, J., Diniz, P. "An External Memory Interface for FPGA-based Computing Engines". Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '01), IEEE Computer Society Press, Los Alamitos, Calif. (October 2001).
[10] DeHon, A., Wawrzynek, J. "Reconfigurable Computing: What, Why, and Design Automation Requirements?". Proceedings of the 1999 Design Automation Conference (DAC '99), June 21-25 (1999).
[11] DeHon, A. "Very Large Scale Spatial Computing". Proceedings of the Third International Conference on Unconventional Models of Computation (UMC '02), October 15-19 (2002).
[12] Vuillemin, J., Bertin, P. et al. "Programmable Active Memories: Reconfigurable Systems Come of Age". IEEE Transactions on VLSI Systems, 4(1): 56-69 (March 1996).
[13] Green, C., Franklin, P. "RaPiD - Reconfigurable Pipelined Datapath". In R. W. Hartenstein and M. Glesner (eds.), Field-Programmable Logic: Smart Applications, New Paradigms, and Compilers, 6th International Workshop on Field-Programmable Logic and Applications, Darmstadt, Germany, pp. 126-135 (September 1996).
[14] Hauck, S., Fry, T., Hosler, M., Kao, J. "The Chimaera Reconfigurable Function Unit". IEEE Symposium on FPGAs for Custom Computing Machines (April 1997).
[15] Waingold, E., Taylor, M. et al. "Baring It All to Software: Raw Machines". IEEE Computer, pp. 86-93 (September 1997).