Survey of C-based Application Mapping Tools for Reconfigurable Computing

Brian Holland, Mauricio Vacas, Vikas Aggarwal, Ryan DeVille, Ian Troxel, and Alan D. George
High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida

Holland

#215 MAPLD 2005

Outline

Introduction
General Survey
  Ten C-based Application Mappers
    Carte, Catapult C, DIME-C, Handel C, Impulse C, Mitrion C, Napa C, SA-C, Streams C, SystemC
Benchmarking & Results
  Finite-Impulse Response (FIR)
  N-Queens
  Radix Sort
Lessons Learned
Conclusions
Acknowledgements
References

Motivation for Application Mappers

HDL programming has shortcomings
  Limited applicability to application developers
  More involved development process (vs. software)
  Requires training beyond application level
Instead, can we find and exploit an environment that allows a measure of hardware control along with increased productivity?
  Can we bring RC performance benefits to application developers?
  Would this be practical/possible in traditional HDL?
    HDL is well below the level of traditional application programming
    Consequently, we need to move to a higher level of abstraction

Introduction

Selecting a Higher Level of Abstraction
  CAD tools: Visually appealing, but tedious for large projects
  New language: Optimal, but requires complete retraining
  Traditional or Object-Oriented languages: Which? How?
Ideally, use pure ANSI-C, “The Universal Language”
  Requires no additional knowledge or special training
  Port existing C programs into hardware implementations (HDL)
  Translation can be handled by a hardware compiler
  Programmer concentrates on algorithmic functionality

[Figure: C Code -> COMPILER -> HDL -> Netlist -> Configuration File]

Commonalities

General characteristics of C-based application mappers:
  Companies create proprietary ANSI C-based language
  Languages do not have all ANSI C features
  Extra pragmas are included for corresponding compilers
  Additional libraries of functions/macros for further extensions
  Must adhere to specific programming “style” for maximum optimization
  Emphasis on both hardware generation and I/O interfaces

ANSI-C:

    void FIR(int INPUTA, int OUTPUTB) {
        /*user source*/
    }

COMPILER

VHDL:

    Entity FIR is
      Port ( rst, clock: in std_logic;
             INPUTA_en: in std_logic;
             INPUTA_data: in std_logic_vector(31 downto 0);
             OUTPUTB_en: in std_logic;
             OUTPUTB_data: out std_logic_vector(31 downto 0));
    end;
    /*user source in VHDL*/
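The restricted programming "style" named on the Commonalities slide can be illustrated with a small plain-C sketch. It is not taken from any of the surveyed tools: the function name `sad_block`, the constant `BLOCK_LEN`, and the choice of kernel (sum of absolute differences) are hypothetical. The assumption is that hardware compilers generally favor static array bounds, loops with fixed trip counts, and no dynamic allocation or recursion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN 8  /* hypothetical fixed block size, known at compile time */

/* Sum of absolute differences over two fixed-size blocks.
 * Restricted style: static bounds, fixed trip count, no recursion,
 * no dynamic allocation. */
uint32_t sad_block(const int16_t a[BLOCK_LEN], const int16_t b[BLOCK_LEN])
{
    uint32_t acc = 0;
    int i;

    /* Software-only sanity check; a hardware flow would not keep this. */
    assert(a != NULL && b != NULL);

    /* A tool-specific pragma (an unroll or pipeline hint, for example)
     * would typically go here. None is shown because the spelling
     * differs per tool. */
    for (i = 0; i < BLOCK_LEN; i++) {
        int32_t d = (int32_t)a[i] - (int32_t)b[i];  /* widen before subtracting */
        acc += (uint32_t)(d < 0 ? -d : d);
    }
    return acc;
}
```

Because the loop bound is a constant and each iteration touches only `a[i]` and `b[i]`, a compiler is free to unroll or pipeline the loop, which is the kind of optimization the slide says depends on style.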

Spectrum of C-based Application Mappers

[Figure, survey portion: Catapult C, Carte, Impulse C, DIME-C, SA-C, SystemC, Mitrion C, Handel C, Streams C, and Napa C arranged along a spectrum with the categories Open Standard; Generic HDL, Multiple Platforms; Generic HDL (Optimize for Manufacturer’s Hardware); Targets a Specific Platform/Configuration; RISC/FPGA Hybrid Only]

[Figure, benchmark section: VHDL, Handel C, DIME-C, Impulse C, and ANSI-C (Software) placed along Control and Effort axes; labels: Cycle Accurate, Deterministic, Not Cycle Accurate, Limited Predictability, Many HW Pragmas, Some HW Pragmas]

THE LAW OF CONSERVATION OF PAIN

Carte [1]
SRC Computers

C/Fortran FPGA environment
  Direct mapping of C/Fortran code to configuration level
  Software emulation and simulation of compiled code for debugging
  Capable of multiprocessor and multi-FPGA computational definitions
  Allows explicit data flow control within memory hierarchy
Targets SRC’s MAP processor
  Produces “Unified Executables” for HW or SW processor execution
  Runtime libraries handle required interfacing and management

Catapult C [2-3]
Mentor Graphics

Algorithmic synthesis tool for RTL generation
  RTL from “pure” untimed C++
  No extensions, pragmas, etc.
Compiler uses “wrappers” around algorithmic code
  External: manages I/O interface
  Internal: constrains synthesis to optimize for chosen interface
Explicit architectural constraints and optimization
Output: RTL netlists in VHDL, Verilog, and SystemC

DIME-C [4]
Nallatech

FPGA prototyping tool
  Designs are not cycle-accurate
    Allows application synthesis for a higher clock speed
Compilation/Optimization
  Pipeline/parallelize where possible
  Included IEEE-754 FP cores
  Dedicated (integer) multipliers
Currently in beta, expected release: 4Q05
Output: synthesizable VHDL and DIMEtalk components

Handel C [5]
Celoxica

Environment for cycle-accurate application development
  All operations occur in one deterministic clock cycle
    Makes it cycle-accurate, but clock freq reduced to slowest operation
    Decisions/Loops are “penalty-free” but can significantly impact timing
Language has pragmas for explicitly defined parallelism
Compiler can analyze, optimize, and rewrite code
Output: VHDL/Verilog, SystemC, or targeted EDIFs

Impulse C [6]
Impulse Accelerated Technologies

Language/compiler for modeling sequential apps.
  Processes - independent, potentially concurrent, computing blocks
  Streams - communicate and synchronize processes
Uses Streams-C methodology
  However, focuses on compatibility with C development environments
Compilation
  Each process implemented as separate state machine
Output: Generic or FPGA-specific VHDL

Mitrion C [7]
Mitrion

“Softcore” processor tactic
  “Processor” creates abstraction layer between C code and FPGA
Compilation
  C code is mapped to a generic “API” of possible functions
  Processor instantiated on FPGA, tailored to specific application
  Custom instruction bit-widths, specific cache and buffer sizes
Currently in beta, expected release: 4Q05
Output: a VHDL IP core for target architectures

Napa C [8]
National Semiconductor

Language/compiler for RISC/FPGA hybrid processor
  Capitalize on single-cycle interconnect instead of I/O bus
Datapath Synthesis Technique
  Hand-optimized pre-placed, pre-routed module generators
  Compiler generates hardware pipelines from C loops
Targets NS NAPA1000 hybrid processor
  Fixed-Instruction Processor (FIP), Adaptive Logic Processor (ALP)
  ALP also compiles to RTL VHDL, structural VHDL, structural Verilog

SA-C [9-12]
Colorado State University

High-level, expression-oriented, machine-independent, single-assignment language
  Designed to implicitly express data-parallel operations
  Image and signal processing
Compiler (UC-Irvine, UC-Riverside, Colorado State Univ.)
  Loop optimizations
  Structural transforms
  Execution block placement
Target Platforms
  UC Irvine Morphosys; Annapolis WildForce, StarFire, WildFire

Streams C [12-14]
Los Alamos National Laboratory

Stream-oriented sequential process modeling
  Essentially, data elements moving through discrete functional blocks
Compiler
  Generates multi-threaded processor executables and multiple FPGA bitstreams
  Allows parallel C program translation into a parallel arch.
Includes functional-level simulation environment
Output: synthesizable RTL

SystemC [15-16]
Open SystemC Initiative (OSCI)

Open-source extension of C++ for HW/SW modeling
  Core language, modules & ports for defining structure, and interfaces & channels
Supports functional modeling
  Hierarchical decomposition of a system into modules
  Structural connectivity between modules using ports/exports
  Scheduling and synchronization of concurrent processes using events
Event-driven simulator
  Events are basic dynamic/static process synchronization objects

About the Benchmarks

Three classic algorithms used for benchmarking
  Finite-Impulse Response (FIR)
    Simple 51-tap FIR filter for standard DSP applications
    Compare compiler solutions and analyze their usage metrics
  N-Queens
    Classic embarrassingly parallel HPC backtracking search problem
    Showcases the potential of optimized implementations
  Radix Sort
    Sorts using ‘binary bins’, minimizing resources
    Illustrates resource metrics in RAM-intensive applications
Implementation Details
  DIME-C, Handel C, Impulse C, VHDL, and ANSI-C (for baseline timing)
  Experiments performed on Nallatech BenNUEY-PCI card with VirtexII-6000 FPGA
  Resource utilization based on post place-and-route data
  Runtime represents communication time (setup and verification I/O is negated)
  Handel C and Impulse C require VHDL wrappers which can increase resource usage

Finite-Impulse Response

[Figure: FIR Resource Utilization Statistics; % Usage of Slices, Multipliers, Block RAMs, and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL]
[Figure: Speedup over 2.4GHz Xeon for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]

FIR filter containing 51 taps, each 16-bits wide (based on algorithms in [4,6])
Various application-mapper languages do not have a consistent I/O interface
  Could not create a consistent streaming channel with requisite blocking in every tool
  Instead, FIR algorithm operates on values stored in a block RAM
Obtains speedup through parallel multiplication, efficient memory accesses
  The 51 coefficients and variables are stored in local variables
Additional performance boosts are possible in multi-channel DSP processing
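For readers unfamiliar with the FIR benchmark, a minimal software sketch of a 51-tap filter over a buffer is given here. It is not the authors' benchmark code: the function name `fir_filter`, the zero-padding of samples before the start of the buffer, and the 64-bit accumulator are assumptions made for illustration. The input array stands in for the block RAM mentioned on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TAPS 51  /* tap count stated on the slide */

/* Direct-form FIR over a buffer:
 *   out[n] = sum over k of coeff[k] * in[n - k],
 * treating samples before in[0] as zero (an assumption of this sketch).
 * Samples and coefficients are 16 bits wide; the accumulator is 64 bits
 * so that 51 products of 16-bit values cannot overflow. */
void fir_filter(const int16_t coeff[NUM_TAPS],
                const int16_t *in, int64_t *out, size_t len)
{
    size_t n, k;

    assert(coeff != NULL && in != NULL && out != NULL);

    for (n = 0; n < len; n++) {
        int64_t acc = 0;
        /* The 51 multiplies are independent of each other, which is
         * where a hardware mapping can multiply in parallel. */
        for (k = 0; k < NUM_TAPS && k <= n; k++) {
            acc += (int32_t)coeff[k] * (int32_t)in[n - k];
        }
        out[n] = acc;
    }
}
```

Feeding a unit impulse through the filter returns the coefficients in order, which is a quick way to check an implementation against this reference.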

N-Queens

[Figure: N-Queens Resource Utilization Statistics; % Usage of Slices and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL]
[Figure: Speedup over 2.4GHz Xeon versus N (axis ticks 13 to 17) for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]

Represents a purely computational algorithm; virtually no communication overhead
Algorithm contains several parallelizable code segments, exploitable for speedup
Implementations are based upon same baseline C code
  Every available technique and compiler optimization is employed to boost performance
Notes:
  Handel C N-Queens is a benchmark from our MAPLD’04 paper with additional refinements
  VHDL N-Queens is culmination of a semester-long endeavor into algorithm’s parallelism
  DIME-C and Impulse C N-Queens are results of experimentation with beta compilers
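As a point of reference for the N-Queens benchmark, the following is a minimal software sketch of a backtracking solution counter. It is not the baseline C code used in the study, which is not shown in the slides. The bitmask formulation and the names `place` and `nqueens_count` are assumptions for illustration, and the recursion would likely need restructuring for tools that do not support it.

```c
#include <assert.h>
#include <stdint.h>

/* Place queens one row at a time. cols, d1, and d2 are bitmasks of the
 * columns in the current row attacked by earlier queens along columns
 * and the two diagonal directions; all has the low n bits set. */
static uint64_t place(uint32_t all, uint32_t cols, uint32_t d1, uint32_t d2)
{
    uint64_t count = 0;
    uint32_t free_cols;

    if (cols == all) {
        return 1;  /* every column holds a queen: one complete solution */
    }
    free_cols = all & ~(cols | d1 | d2);
    while (free_cols != 0) {
        uint32_t bit = free_cols & (~free_cols + 1u);  /* lowest free column */
        free_cols ^= bit;
        /* Diagonal attacks shift by one column per row. */
        count += place(all, cols | bit, (d1 | bit) << 1, (d2 | bit) >> 1);
    }
    return count;
}

/* Count all solutions for an n x n board, 1 <= n <= 31. */
uint64_t nqueens_count(int n)
{
    assert(n >= 1 && n <= 31);
    return place((1u << n) - 1u, 0u, 0u, 0u);
}
```

Each choice of column in the first row starts an independent subtree of the search, which is one way to read the slide's remark that the algorithm contains parallelizable segments.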

Radix Sort

[Figure: Radix Sort Resource Utilization Statistics; % Usage of Slices, Block RAMs, and Clock Freq for DIME-C, Handel C, Impulse C, and VHDL]
[Figure: Speedup over 2.4GHz Xeon for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]

Sorts values one bit at a time (saving significant resources vs. sorting one digit at a time)
Represents a “worst-case” legacy algorithm, containing no functional-level parallelism
  Every element in every iteration depends on every previous element in every iteration
  Ideal for software processor with fast cache, challenging in FPGA hardware
Speedup comes through efficient RAM usage and compiler optimizations/pipelining
  Reduce quantity and addressing complexity of RAM accesses whenever possible
Metrics are based on sorting 600 32-bit integers contained within a block RAM
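The one-bit-at-a-time radix sort with "binary bins" can be sketched in plain C as follows. This is an illustration under stated assumptions, not the benchmark source: the name `radix_sort_1bit` is hypothetical, the keys are treated as unsigned (the slide says only "32-bit integers"), and a scratch array stands in for the second RAM buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SORT_LEN 600  /* element count stated on the slide */

/* Least-significant-bit-first radix sort on unsigned 32-bit keys.
 * Each of the 32 passes stably partitions the data into a "zero" bin
 * followed by a "one" bin for the current bit. */
void radix_sort_1bit(uint32_t *data, size_t len)
{
    uint32_t scratch[SORT_LEN];
    unsigned bit;
    size_t i;

    assert(data != NULL && len <= SORT_LEN);

    for (bit = 0; bit < 32; bit++) {
        size_t zeros = 0, z, o;

        /* First scan: size of the zero bin fixes where the one bin starts. */
        for (i = 0; i < len; i++) {
            if (((data[i] >> bit) & 1u) == 0) {
                zeros++;
            }
        }
        z = 0;
        o = zeros;
        /* Second scan: copy each key into its bin, preserving order. */
        for (i = 0; i < len; i++) {
            if ((data[i] >> bit) & 1u) {
                scratch[o++] = data[i];
            } else {
                scratch[z++] = data[i];
            }
        }
        memcpy(data, scratch, len * sizeof data[0]);
    }
}
```

Every pass reads and writes the whole array and depends on the result of the previous pass, which matches the slide's point that the algorithm is dominated by RAM accesses and offers little functional-level parallelism.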

Some Optimization Techniques

Keep expensive computational operations to a minimum
  Multiplication, division, modulo, greater/less than, and floating point are *slow*

BAD

temp = a[0]; for(i=0;i