Optimized Generation of Data-path from C Codes for FPGAs

Zhi Guo, Betul Buyukkurt, Walid Najjar
University of California Riverside
{zguo, abuyukku, najjar}@cs.ucr.edu

Kees Vissers
Xilinx Corp.
[email protected]

Abstract
FPGAs, as computing devices, offer significant speedup over microprocessors. Furthermore, their configurability offers an advantage over traditional ASICs. However, they do not yet enjoy high-level language programmability, as microprocessors do. This has become the main obstacle to their wider acceptance by application designers. ROCCC is a compiler designed to generate circuits from C source code to execute on FPGAs, more specifically on CSoCs. It generates RTL-level HDL code from frequently executing kernels in an application. In this paper, we describe ROCCC’s system overview and focus on its data path generation. We compare the performance of ROCCC-generated VHDL code with that of Xilinx IPs. The synthesis results show that ROCCC-generated circuits take around 2x to 3x the area and run at a comparable clock rate.

1. Introduction
Continued increases in integrated circuit chip capacity have led to the recent introduction of the Configurable System-on-a-Chip (CSoC), which has one or more microprocessors integrated with a field-programmable gate array (FPGA) as well as memory blocks on a single chip. In these platforms both the FPGA fabric and the embedded microprocessors are essentially programmed using software. The earliest example is the Triscend E5, followed by the Triscend A7 [1], the Altera Excalibur [2], and the Xilinx Virtex II Pro [3]. The capabilities of these platforms span a wide range, with the Triscend A7 at the low end and the Xilinx Virtex II Pro 2VP125 at the high end. These computing devices have the flexibility of software and have been shown to achieve very large speedups, ranging from 10x to 100x, over microprocessors for a variety of applications including image and signal processing [4][5][6]. Such speedups come from large-scale parallelism made possible by high-capacity FPGAs, as well as from customized circuit design.

The main problem standing in the way of wider acceptance of CSoC platforms is their programmability. Application developers must have extensive hardware expertise, in addition to their application area expertise, to develop efficient designs. Presently, most CSoCs are programmed manually. The main drawback of this approach is that it is very labor intensive and requires long design times. Some commercial efforts in programming FPGAs have been made by companies such as Synopsys [7] and Tensilica [8]. Their focus is on moving simple loops to hardware or on instruction-set extension.

Optimizing compilers for traditional processors have benefited from several decades of extensive research that has led to extremely powerful tools. Similarly, electronic design automation (EDA) tools have also benefited from several decades of research and development, leading to powerful tools that can translate VHDL and Verilog code, and recently SystemC [9] code, into relatively efficient circuits. However, little work has been done to combine these two approaches. In other words, work is still needed to compile a high-level language program, written in C/C++/Java, with software-level optimizations and with the intent of generating a hardware circuit. Obviously, it is neither practical nor desirable to translate the whole program into hardware. It is therefore imperative to provide the programmer with tools that help in identifying which code segments ought to be mapped to hardware, as well as the cost and benefit tradeoffs implied.

Compiling to CSoCs, and to FPGAs in general, is challenging. Traditional CPUs, including VLIW, have a fixed hardware platform. Their architectural features may or may not be exposed to the compiler. FPGAs, on the other hand, are completely amorphous. The task of an FPGA compiler is to generate both the hardware (data path) and the sequence of operations (control flow). This lack of architectural structure, however, presents a number of advantages.
(1) The parallelism is very high and limited only by the size of the FPGA device or by the data memory bandwidth.
(2) On-chip storage can be configured at will: registers are created by the compiler and distributed throughout the data path where needed, thereby increasing data reuse and reducing re-computations or accesses to memory.
(3) Circuit customization: the data path and sequence controller are tailored to the specific computation being mapped to hardware. Examples include customized data bit-width and pipelining.

The objective of the ROCCC (Riverside Optimizing Configurable Computing Compiler) project is to design a high-level language compiler targeting CSoCs. It takes high-level code, such as C or FORTRAN, as input and generates RTL VHDL code for the FPGA and C code for the CPU.

In this paper we describe the overall structure of the compiler and emphasize the data path generation component. We compare the clock speed and area of automatically generated circuits to a number of IP codes available on the Xilinx web site. The results show that the speed is within 10% while the area is larger by a factor of 2 to 3. The work in [25][26] has compared generated code with hand-written VHDL. Both report that the generated code is worse by about a factor of 2 in area and clock rate. ROCCC is built upon the knowledge acquired from SA-C and Streams-C. We experimentally show that the resultant VHDL is much closer to the handwritten one.

The rest of this paper is organized as follows. The ROCCC compiler is introduced in section 2. Related work is discussed in section 3. Section 4 presents ROCCC compiler RTL code generation for the controller, the buffer and the data path. Experimental results are reported in section 5. Section 6 concludes the paper.

Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE’05) 1530-1591/05 $ 20.00 IEEE
2. ROCCC System Overview
Figure 1 shows the overview of the ROCCC compiler. The profiling tool set has been described in a prior publication [10]. It identifies the frequently executing code kernels in a given application. ROCCC’s objective is to compile these kernels to HDL code, which is synthesized using commercial tools. The ROCCC system is built using the SUIF [11] and Machine-SUIF [12] platforms. SUIF IRs (intermediate representations) provide abundant information about loop statements and array accesses. ROCCC performs loop-level optimizations on SUIF IRs. Loop unrolling for FPGAs requires compile-time area estimation. The work reported in [13] shows that compile-time area estimation can be achieved in less than one millisecond and within 5% accuracy. Information to generate high-level units, such as controllers and buffers, is also extracted from SUIF IRs. Machine-SUIF analysis and optimization passes, such as the Control Flow Graph (CFG) library [14], the Data Flow Analysis library [15] and the Static Single Assignment library [16], are used to generate the data path.

[Figure 1 - ROCCC System Overview. Block labels: C/C++, Fortran, Java ... Code; Profiling; SUIF2; Loop Optimization; Estimation (Area, Delay, Power); ROCCC; Machine SUIF; System VHDL Code Generator; Controller Generation; Data Path Generation; CAD tools; Bit Stream; General Compiler; Host Executable; Graph Editor + Annotation.]

ROCCC’s conventional optimizations include constant folding, loop unrolling, etc. Full loop unrolling converts a for-loop with constant bounds into a non-iterative block of code and therefore eliminates the loop controller. In addition to these conventional optimizations, at loop level ROCCC performs FPGA-specific optimizations, such as loop strip-mining, loop fusion, etc. At storage level and circuit level, ROCCC’s optimizations are closely related to HDL code generation and are discussed in section 4.

The restrictions on the C code that can be accepted by the ROCCC compiler, for mapping onto an FPGA fabric, include no recursion and no use of pointers that cannot be statically unaliased. Function calls will either be inlined or, whenever feasible, made into a lookup table.
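To make the effect of full loop unrolling concrete, the following is a minimal C sketch of the source-level view of the transformation. It is illustrative only: the function names `dot4_rolled` and `dot4_unrolled` and the bound `TAPS` are assumptions made for this example, not ROCCC identifiers, and the sketch shows what the transformation achieves rather than how ROCCC implements it on SUIF IRs.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 4 /* constant loop bound known at compile time (illustrative) */

/* Rolled form: the loop needs an induction variable, a compare and a
 * branch, which in hardware corresponds to a loop controller. */
int dot4_rolled(const int *x, const int *c)
{
    assert(x != NULL && c != NULL);
    int sum = 0;
    for (int i = 0; i < TAPS; i++)
        sum += x[i] * c[i];
    return sum;
}

/* Fully unrolled form: because the bound is constant, the body can be
 * replicated TAPS times. No loop variable remains, so the block is
 * non-iterative and the four multiplications are independent. */
int dot4_unrolled(const int *x, const int *c)
{
    assert(x != NULL && c != NULL);
    return x[0] * c[0] + x[1] * c[1] + x[2] * c[2] + x[3] * c[3];
}
```

Both functions compute the same value; the unrolled form simply exposes all operations at once, which is what allows a data path to evaluate them in parallel.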
3. Related Works
Many projects, employing various approaches, have worked on translating high-level languages into hardware. SystemC [20] is designed to provide roughly the same expressive functionality as VHDL or Verilog and is suitable for designing software-hardware synchronized systems. Handel-C [21], a low-level hardware/software construction language with C syntax, supports behavioral descriptions and uses a CSP-style (Communicating Sequential Processes) communication model.

SA-C [22] is a single-assignment high-level synthesizable language. Because of special constructs specific to SA-C (such as window constructs) and its functional nature, its compiler can easily exploit data reuse for window operations. SA-C uses pre-existing parameterized VHDL library routines to perform code generation in a way that requires a number of control signals between components, and thereby involves extra clock cycles and delay. Our compiler avoids spending clock cycles on handshaking by focusing more on compile-time analysis. It takes a subset of C as input and does not involve any non-C syntax.

Streams-C [23] relies on the CSP model for communication between processes, both hardware and software. Streams-C can meet relatively high-density control requirements. However, it does not support accesses to two-dimensional arrays, and therefore image processing applications, including video processing, must be mapped manually. This makes it very awkward to efficiently support algorithms that rely on sliding windows. For a one-dimensional input data vector, such as that of a one-dimensional FIR filter, Streams-C programmers need to write the data reuse manually in the input C code in order to make sure that a data value is retrieved only once from external memory.

SPARK [24] is another C to VHDL compiler. Its transformations include loop unrolling, common sub-expression elimination, copy propagation, dead code elimination, loop-invariant code motion, etc. SPARK does not support multi-dimensional array accesses.
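The manual data reuse mentioned above for a one-dimensional FIR filter can be sketched in plain C as follows. This is a hedged illustration, not code from Streams-C, SA-C or ROCCC: the names `fir_naive`, `fir_reuse` and the 3-tap size are assumptions, and the returned load counts exist only to make the difference in memory traffic visible.

```c
#include <assert.h>

#define TAPS 3 /* illustrative filter length; fir_reuse is written for 3 taps */

/* Naive form: every output re-reads TAPS inputs from memory.
 * Returns the number of input loads performed. */
int fir_naive(const int *in, int n, const int *coef, int *out)
{
    assert(n >= TAPS);
    int loads = 0;
    for (int i = 0; i + TAPS <= n; i++) {
        int acc = 0;
        for (int k = 0; k < TAPS; k++) {
            acc += coef[k] * in[i + k];
            loads++;
        }
        out[i] = acc;
    }
    return loads;
}

/* Reuse form: the sliding window lives in scalars (registers in hardware),
 * so each input value is loaded from memory exactly once. */
int fir_reuse(const int *in, int n, const int *coef, int *out)
{
    assert(n >= TAPS);
    int loads = 0;
    int w0, w1, w2;
    w1 = in[0];                 /* prime the window with the first two values */
    w2 = in[1];
    loads += 2;
    for (int i = 0; i + TAPS <= n; i++) {
        w0 = w1;                /* shift the window by one element */
        w1 = w2;
        w2 = in[i + TAPS - 1];  /* the only new load in this iteration */
        loads++;
        out[i] = coef[0] * w0 + coef[1] * w1 + coef[2] * w2;
    }
    return loads;
}
```

For an input of n values, the naive form performs 3(n - 2) loads while the reuse form performs n. Writing the second form by hand is the burden that a compiler with automatic data reuse analysis is meant to remove.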
4. The ROCCC Compiler
ROCCC targets high computational density, low control density applications. Figure 2 shows the execution model. An engine moves the data from off-chip to a BRAM storage. The compiler-generated circuit accesses the arrays in BRAM and stores the output data into another BRAM, from which an engine retrieves data into the off-chip memory. Inside the compiler-generated circuit, the data path is fully pipelined. The controllers and buffers are in charge of feeding input data and retrieving output data to and from the data path.
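The data movement in this execution model can be mimicked in software with a small C sketch. Everything here is an assumption made for illustration: the names `engine_load`, `kernel`, `engine_store` and `run_block`, the size `BRAM_WORDS`, and the placeholder computation `2*x + 1` are hypothetical and do not come from ROCCC. The point is only that the kernel touches the two on-chip buffers and never the off-chip arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BRAM_WORDS 8 /* illustrative on-chip buffer size */

static int bram_in[BRAM_WORDS];  /* stands in for the input BRAM */
static int bram_out[BRAM_WORDS]; /* stands in for the output BRAM */

/* Engine: moves a block of data from off-chip memory into the input BRAM. */
void engine_load(const int *off_chip, int n)
{
    assert(n >= 0 && n <= BRAM_WORDS);
    memcpy(bram_in, off_chip, (size_t)n * sizeof(int));
}

/* Stand-in for the compiler-generated circuit: it reads only the input
 * BRAM and writes only the output BRAM. The computation is a placeholder. */
void kernel(int n)
{
    assert(n >= 0 && n <= BRAM_WORDS);
    for (int i = 0; i < n; i++)
        bram_out[i] = 2 * bram_in[i] + 1;
}

/* Engine: moves results from the output BRAM back to off-chip memory. */
void engine_store(int *off_chip, int n)
{
    assert(n >= 0 && n <= BRAM_WORDS);
    memcpy(off_chip, bram_out, (size_t)n * sizeof(int));
}

/* One block of work: load, compute, store. */
void run_block(const int *src, int *dst, int n)
{
    engine_load(src, n);
    kernel(n);
    engine_store(dst, n);
}
```

In hardware the three stages would overlap and the kernel would be pipelined; the sequential calls above only show which component is allowed to touch which memory.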
[Figure 2 - The Execution Model. Block labels: Off-chip MEM; Block RAM; smart buffer; controllers; Pipelined data path; smart buffer; Block RAM; Off-chip MEM.]

4.1 Controller and Buffers
ROCCC’s scalar replacement transformation converts, for instance, the segment in Figure 3 (a) into the segment in Figure 3 (b). We can see that scalar replacement isolates memory access from calculation. The highlighted region of code is exported in the form of Figure 3 (c) and goes to the data path generator. At the same time, the loop statement and memory load/store code are used to generate the controllers and buffers. The controllers include address generators, which export a series of memory addresses according to the memory access pattern, and a higher-level controller, which controls the address generators. They are all implemented as pre-existing parameterized FSMs (finite state machines) in a VHDL library. One of the major reasons that account for FPGA’s
for (i=0; i