Survey of C-based Application Mapping Tools for Reconfigurable Computing
Brian Holland, Mauricio Vacas, Vikas Aggarwal, Ryan DeVille, Ian Troxel, and Alan D. George High-performance Computing and Simulation (HCS) Research Lab Department of Electrical and Computer Engineering University of Florida
Holland
#215 MAPLD 2005
Outline
Introduction
General Survey
Ten C-based Application Mappers: Carte, Catapult C, DIME-C, Handel C, Impulse C, Mitrion C, Napa C, SA-C, Streams C, SystemC
Benchmarking & Results
Finite-Impulse Response (FIR)
N-Queens
Radix Sort
Lessons Learned
Conclusions
Acknowledgements
References
Motivation for Application Mappers
HDL programming has shortcomings
Limited applicability to application developers
More involved development process (vs. software)
Requires training beyond application level
Instead, can we find and exploit an environment that allows a measure of hardware control along with increased productivity?
Can we bring RC performance benefits to application developers?
Would this be practical/possible in traditional HDL?
HDL is well below the level of traditional application programming
Consequently, we need to move to a higher level of abstraction
Introduction
Selecting a Higher Level of Abstraction
CAD tools: Visually appealing, but tedious for large projects
New language: Optimal, but requires complete retraining
Traditional or Object-Oriented languages: Which? How?
[Figure: compilation flow with stages C Code, COMPILER, HDL, Netlist, Configuration File]
Ideally, use pure ANSI-C, “The Universal Language”
Requires no additional knowledge or special training
Port existing C programs into hardware implementations (HDL)
Translation can be handled by a hardware compiler
Programmer concentrates on algorithmic functionality
Commonalities
General characteristics of C-based application mappers:
Companies create proprietary ANSI C-based languages
Languages do not have all ANSI C features
Extra pragmas are included for corresponding compilers
Additional libraries of functions/macros for further extensions
Must adhere to specific programming “style” for maximum optimization
Emphasis on both hardware generation and I/O interfaces
[Figure: an ANSI-C function passes through the COMPILER and becomes a VHDL entity]
ANSI-C:
void FIR(int INPUTA, int OUTPUTB) {
    /*user source*/
}
VHDL:
Entity FIR is
    Port ( rst, clock: in std_logic;
           INPUTA_en: in std_logic;
           INPUTA_data: in std_logic_vector(31 downto 0);
           OUTPUTB_en: in std_logic;
           OUTPUTB_data: out std_logic_vector(31 downto 0));
end;
/*user source in VHDL*/
Spectrum of C-based Application Mappers
[Figure, survey portion: Carte, Catapult C, DIME-C, Handel C, Impulse C, Mitrion C, Napa C, SA-C, Streams C, and SystemC grouped under the categories Open Standard; Generic HDL, Multiple Platforms; Generic HDL (Optimize for Manufacturer’s Hardware); Targets a Specific Platform/Configuration; RISC/FPGA Hybrid Only]
[Figure, benchmark section: VHDL, Handel C, DIME-C, Impulse C, and ANSI-C compared on Control and Effort, between HDL and Software; labels: Cycle Accurate, Deterministic, Not Cycle Accurate, Limited Predictability, Many HW Pragmas, Some HW Pragmas]
THE LAW OF CONSERVATION OF PAIN
Carte
SRC Computers [1]
C/Fortran FPGA environment
Direct mapping of C/Fortran code to configuration level
Software emulation and simulation of compiled code for debugging
Capable of multiprocessor and multi-FPGA computational definitions
Allows explicit data flow control within memory hierarchy
Targets SRC’s MAP processor
Produces “Unified Executables” for HW or SW processor execution
Runtime libraries handle required interfacing and management

Catapult C
Mentor Graphics [2-3]
Algorithmic synthesis tool for RTL generation
RTL from “pure” untimed C++
No extensions, pragmas, etc.
Compiler uses “wrappers” around algorithmic code
External: manages I/O interface
Internal: constrains synthesis to optimize for chosen interface
Explicit architectural constraints and optimization
Output: RTL netlists in VHDL, Verilog, and SystemC
DIME-C
Nallatech [4]
FPGA prototyping tool
Designs are not cycle-accurate
Allows application synthesis for a higher clock speed
Compilation/Optimization
Pipeline/parallelize where possible
Included IEEE-754 FP cores
Dedicated (integer) multipliers
Currently in beta, expected release: 4Q05
Output: synthesizable VHDL and DIMEtalk components

Handel C
Celoxica [5]
Environment for cycle-accurate application development
All operations occur in one deterministic clock cycle
Makes it cycle-accurate, but clock freq reduced to slowest operation
Decisions/Loops are “penalty-free” but can significantly impact timing
Language has pragmas for explicitly defined parallelism
Compiler can analyze, optimize, and rewrite code
Output: VHDL/Verilog, SystemC, or targeted EDIFs
Impulse C
Impulse Accelerated Technologies [6]
Language/compiler for modeling sequential apps.
Processes - independent, potentially concurrent, computing blocks
Streams - communicate and synchronize processes
Uses Streams-C methodology
However, focuses on compatibility with C development environments
Compilation
Each process implemented as separate state machine
Output: Generic or FPGA-specific VHDL

Mitrion C
Mitrion [7]
“Softcore” processor tactic
“Processor” creates abstraction layer between C code and FPGA
Compilation
C code is mapped to a generic “API” of possible functions
Processor instantiated on FPGA, tailored to specific application
Custom instruction bit-widths, specific cache and buffer sizes
Currently in beta, expected release: 4Q05
Output: a VHDL IP core for target architectures
Napa C
National Semiconductor [8]
Language/compiler for RISC/FPGA hybrid processor
Capitalize on single-cycle interconnect instead of I/O bus
Datapath Synthesis Technique
Hand-optimized pre-placed, pre-routed module generators
Compiler generates hardware pipelines from C loops
Targets NS NAPA1000 hybrid processor
Fixed-Instruction Processor (FIP), Adaptive Logic Processor (ALP)
ALP also compiles to RTL VHDL, structural VHDL, structural Verilog

SA-C
Colorado State University [9-12]
High-level, expression-oriented, machine-independent, single-assignment language
Designed to implicitly express data-parallel operations
Image and signal processing
Compiler (UC-Irvine, UC-Riverside, Colorado State Univ.)
Loop optimizations
Structural transforms
Execution block placement
Target Platforms
UC Irvine Morphosys; Annapolis WildForce, StarFire, WildFire
Streams C
Los Alamos National Laboratory [12-14]
Stream-oriented sequential process modeling
Essentially, data elements moving through discrete functional blocks
Compiler
Generates multi-threaded processor executables and multiple FPGA bitstreams
Allows parallel C program translation into a parallel arch.
Includes functional-level simulation environment
Output: synthesizable RTL

SystemC
Open SystemC Initiative (OSCI) [15-16]
Open-source extension of C++ for HW/SW modeling
Core language, modules & ports for defining structure, and interfaces & channels
Supports functional modeling
Hierarchical decomposition of a system into modules
Structural connectivity between modules using ports/exports
Scheduling and synchronization of concurrent processes using events
Event-driven simulator
Events are basic dynamic/static process synchronization objects
About the Benchmarks
Three classic algorithms used for benchmarking
Finite-Impulse Response (FIR)
Simple 51-tap FIR filter for standard DSP applications
Compare compiler solutions and analyze their usage metrics
N-Queens
Classic embarrassingly parallel HPC backtracking search problem
Showcases the potential of optimized implementations
Radix Sort
Sorts using ‘binary bins’, minimizing resources
Illustrates resource metrics in RAM-intensive applications
Implementation Details
DIME-C, Handel C, Impulse C, VHDL, and ANSI-C (for baseline timing)
Experiments performed on Nallatech BenNUEY-PCI card with VirtexII-6000 FPGA
Resource utilization based on post place-and-route data
Runtime represents communication time (setup and verification I/O is negated)
Handel C and Impulse C require VHDL wrappers which can increase resource usage
Finite-Impulse Response
[Figure: FIR Resource Utilization Statistics, % usage of Slices, Multipliers, Block RAMs, and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL]
[Figure: Speedup over 2.4GHz Xeon for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]
FIR filter containing 51 taps, each 16 bits wide (based on algorithms in [4,6])
Various application-mapper languages do not have a consistent I/O interface
Could not create a consistent streaming channel with requisite blocking in every tool
Instead, FIR algorithm operates on values stored in a block RAM
Obtains speedup through parallel multiplication, efficient memory accesses
The 51 coefficients and variables are stored in local variables
Additional performance boosts are possible in multi-channel DSP processing
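A minimal plain-C sketch of a 51-tap FIR filter operating on values held in an array, standing in for the block RAM described above. This is not the benchmark source: the name fir_filter, the zero-padding of samples before the start of the input, and the 64-bit accumulator are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TAPS 51  /* tap count taken from the slide; each tap is 16 bits wide */

/* Hypothetical helper: y[n] = sum over k of coeff[k] * x[n - k],
 * treating samples before the start of the input as zero.
 * A 64-bit accumulator is assumed so that 51 products of 16-bit
 * values cannot overflow. */
void fir_filter(const int16_t *input, int64_t *output, size_t length,
                const int16_t coeff[NUM_TAPS])
{
    assert(length == 0 || (input != NULL && output != NULL && coeff != NULL));

    for (size_t n = 0; n < length; n++) {
        int64_t acc = 0;
        /* The NUM_TAPS multiplications below are independent of each other,
         * which is what a hardware compiler can unroll and run in parallel. */
        for (size_t k = 0; k < NUM_TAPS && k <= n; k++) {
            acc += (int64_t)coeff[k] * (int64_t)input[n - k];
        }
        output[n] = acc;
    }
}
```

Only the final accumulation depends on the other products, so the inner loop is the natural place for the parallel multiplication mentioned above.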
N-Queens
[Figure: N-Queens Resource Utilization Statistics, % usage of Slices and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL]
[Figure: N-Queens Speedup over 2.4GHz Xeon for N = 13 to 17, for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]
Represents a purely computational algorithm; virtually no communication overhead
Algorithm contains several parallelizable code segments, exploitable for speedup
Implementations are based upon same baseline C code
Every available technique and compiler optimization is employed to boost performance
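A minimal plain-C sketch of the backtracking search behind N-Queens, counting all solutions for a given N. It is not the baseline code used in the benchmarks: the bitmask formulation and the names place_row and count_n_queens are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: place one queen per row. Each mask has one bit per
 * column: cols marks occupied columns, diag_l and diag_r mark columns
 * attacked along the two diagonal directions in the current row. */
static uint64_t place_row(uint32_t all, uint32_t cols,
                          uint32_t diag_l, uint32_t diag_r)
{
    if (cols == all) {
        return 1;  /* every column holds a queen, so every row does too */
    }
    uint64_t count = 0;
    uint32_t free_cols = all & ~(cols | diag_l | diag_r);
    while (free_cols != 0) {
        uint32_t bit = free_cols & (~free_cols + 1u);  /* lowest free column */
        free_cols ^= bit;
        count += place_row(all, cols | bit,
                           ((diag_l | bit) << 1) & all,
                           (diag_r | bit) >> 1);
    }
    return count;
}

/* Count all placements of n non-attacking queens on an n x n board. */
uint64_t count_n_queens(unsigned n)
{
    assert(n >= 1 && n <= 31);
    uint32_t all = (1u << n) - 1u;
    return place_row(all, 0, 0, 0);
}
```

Each choice of column in the first row starts an independent subtree of the search, which is one of the parallelizable segments a hardware implementation can exploit.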
Notes:
Handel C N-Queens is a benchmark from our MAPLD’04 paper with additional refinements
VHDL N-Queens is culmination of a semester-long endeavor into algorithm’s parallelism
DIME-C and Impulse C N-Queens are results of experimentation with beta compilers
Radix Sort
[Figure: Radix Sort Resource Utilization Statistics, % usage of Slices, Block RAMs, and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL]
[Figure: Speedup over 2.4GHz Xeon for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]
Sorts values one bit at a time (saving significant resources vs. sorting one digit at a time)
Represents a “worst-case” legacy algorithm, containing no functional-level parallelism
Every element in every iteration depends on every previous element in every iteration
Ideal for software processor with fast cache, challenging in FPGA hardware
Speedup comes through efficient RAM usage and compiler optimizations/pipelining
Reduce quantity and addressing complexity of RAM accesses whenever possible
Metrics are based on sorting 600 32-bit integers contained within a block RAM
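A minimal plain-C sketch of a radix sort that processes one bit per pass with two "binary bins", as described above. It is not the benchmark source: the name radix_sort_binary, the use of unsigned values, and the caller-supplied scratch buffer are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: least-significant-bit-first radix sort with two bins.
 * scratch must hold at least count elements. Values are assumed unsigned;
 * signed input would need the sign bit handled separately. */
void radix_sort_binary(uint32_t *data, uint32_t *scratch, size_t count)
{
    assert(count == 0 || (data != NULL && scratch != NULL));

    for (unsigned bit = 0; bit < 32; bit++) {
        /* Count the elements that fall in the 0 bin for this bit. */
        size_t zeros = 0;
        for (size_t i = 0; i < count; i++) {
            if (((data[i] >> bit) & 1u) == 0) {
                zeros++;
            }
        }
        /* Stable scatter: 0 bin first, then 1 bin, preserving input order. */
        size_t zero_pos = 0;
        size_t one_pos = zeros;
        for (size_t i = 0; i < count; i++) {
            if ((data[i] >> bit) & 1u) {
                scratch[one_pos++] = data[i];
            } else {
                scratch[zero_pos++] = data[i];
            }
        }
        memcpy(data, scratch, count * sizeof *data);
    }
}
```

Every pass reads and writes the whole array and depends on the previous pass, which illustrates why the algorithm is dominated by RAM accesses and offers no functional-level parallelism.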
Some Optimization Techniques
Keep expensive computational operations to a minimum
Multiplication, division, modulo, greater/less than, and floating point are *slow*
BAD
temp = a[0]; for(i=0;i