Performance and Programming Environment of a Combined GPU/FPGA Desktop
Presented at the 21st ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 13, 2013
Contact Details: Bruno Da Silva,
[email protected]
Bruno Da Silva 1, An Braeken 1, Erik H. D’Hollander 2, Abdellah Touhafi 1, Jan G. Cornelis 3 and Jan Lemeire 3
1 Erasmus University College, Department IWT, Brussels, 2 Ghent University, Department ELIS, Ghent, 3 Vrije Universiteit Brussel, Department ETRO, Brussels
Introduction
The performance of today's PCs exceeds that of the supercomputers of the 90s many times over, but it is still not enough for many computationally hungry applications.

Performance estimation

The roofline model expresses the maximum attainable performance as a function of the algorithm's computational intensity (CI), taking into account the peak computational power (CP) and the peak I/O bandwidth (BW) of the accelerator.
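In formula form, this is the standard roofline bound (with CI in flops/byte, BW in bytes/s and CP in flops/s):

\[
P_{\text{attainable}} = \min\bigl(\mathrm{CP},\; \mathrm{CI} \times \mathrm{BW}\bigr)
\]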
Comparison of GPU/FPGA

The C-to-VHDL compilers are highly productive and, for algorithms such as erosion, outperform handwritten code, although they commonly use more resources.
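To illustrate the kind of input such compilers accept, the sketch below shows a plain-C 3x3 grayscale erosion kernel in the loop-nest style an HLS/C-to-VHDL compiler can pipeline. The function name and border handling are illustrative assumptions, not the code used in the study.

/* Illustrative 3x3 grayscale erosion in plain C; border pixels are not written. */
void erode3x3(const unsigned char *in, unsigned char *out, int width, int height)
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            unsigned char m = 255;                     /* running minimum */
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    unsigned char p = in[(y + dy) * width + (x + dx)];
                    if (p < m) m = p;                  /* erosion = min over the 3x3 window */
                }
            out[y * width + x] = m;
        }
    }
}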
[Chart: Comparison of GPUs and FPGAs for image erosion; handwritten code vs. C-to-VHDL compiler.]
[Chart: Moore's law, 1965-2015: computational power (flops, log10) with a linear fit (R² = 0.96) and relative memory speed increase.]
The measurements depicted on the superimposed roofline models of GPU (dashed lines) and FPGA (continuous lines) show that both GPU and FPGA excel for image processing algorithms. However, the PCIe bandwidth (x16 continuous lines and x8 dashed lines) limits the overall performance of both devices.
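As an illustrative example with assumed, not measured, values: take an algorithm with CI = 5 flops/byte and the nominal PCIe 2.0 x16 bandwidth of 8 GB/s. The attainable performance is then

\[
P \le \min(\mathrm{CP},\; 5 \times 8\ \text{Gflop/s}) = 40\ \text{Gflop/s},
\]

well below the peak computational power of either accelerator, so the PCIe roofline rather than the device roofline sets the bound.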
To leverage the strengths of different technologies, a hybrid solution is presented, combining the power of:

Graphics Processing Units (GPUs):
• Massive SIMD parallelism
• Well-known software tool chain

Field-Programmable Gate Arrays (FPGAs):
• Massive fine-grain parallelism and pipelining
• Algorithm in hardware
• Optimizing C-to-VHDL compilers

Superimposing the rooflines of GPU and FPGA shows the relative performance of both accelerators.
Tool chain

Programming steps:
1. Identify the parts of the application to be accelerated by the GPU and/or the FPGA.
Objectives
• Build a GPU/FPGA desktop
• Develop a combined tool chain
• Accelerate industrial applications
Hybrid architecture
2. Create a C/C++ program to be executed by the CPU with GPU and FPGA function calls (see the sketch after these steps).
• GPU code → GPU compiler
• FPGA code → High-Level Synthesis (HLS) (ROCCC, Vivado HLS, ...)
3. Compile the programs, synthesize the FPGA design and generate an executable linking the CPU, GPU and FPGA binaries.
Research platform:
4. Load the GPU code, CPU binaries and FPGA configuration.
CPU: Quad-core Intel Xeon E5506
5. Execute the program.
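A minimal host-side sketch of what such a mixed program could look like (step 2 above). The gpu_process() and fpga_process() wrappers are hypothetical placeholders; in the real tool chain they would call the GPU runtime and the HLS-generated FPGA core over PCIe.

/* Minimal sketch: a CPU program with GPU and FPGA function calls. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void gpu_process(const unsigned char *in, unsigned char *out, size_t n)
{
    memcpy(out, in, n);              /* stub: the real version runs on the GPU */
}

static void fpga_process(const unsigned char *in, unsigned char *out, size_t n)
{
    memcpy(out, in, n);              /* stub: the real version runs on the FPGA */
}

int main(void)
{
    const size_t n = 640 * 480;      /* one grayscale frame (illustrative size) */
    unsigned char *img = calloc(n, 1);
    unsigned char *tmp = malloc(n);
    unsigned char *res = malloc(n);
    if (!img || !tmp || !res) return 1;

    gpu_process(img, tmp, n);        /* part accelerated on the GPU  */
    fpga_process(tmp, res, n);       /* part accelerated on the FPGA */

    printf("first output byte: %d\n", res[0]);
    free(img); free(tmp); free(res);
    return 0;
}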
Combination of GPU/FPGA

Pedestrian Recognition Application (fastHOG)
The pedestrian recognition application fastHOG, originally designed for GPUs, is composed of the Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM) components, which are executed several times on the downscaled images.
GPU: NVIDIA Tesla C2050
FPGA: Pico Computing board with 2 Virtex-6 LX240
Communication link: PCIe 2.0 x16 lanes (GPU and Pico board)
Figure 3. The application is adapted to be partially executed on the FPGA. The histogram computation and the normalization are ideal candidates for FPGAs.
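To make the loop structure concrete, here is a minimal compilable sketch of the multi-scale HOG/SVM flow described above. The 1.05 scale step, the 64x128 detection window, the function names and the stub bodies are illustrative assumptions, not fastHOG source code; the histogram-and-normalization stage is the part that would be offloaded to the FPGA.

/* Schematic multi-scale pedestrian detection loop (illustrative only). */
#include <stdio.h>

typedef struct { int width, height; } image_t;

static image_t downscale(image_t img, float s)
{
    image_t r = { (int)(img.width / s), (int)(img.height / s) };
    return r;                                   /* stub: real resampling omitted */
}

static void hog_histograms_and_normalize(image_t img)   /* FPGA candidate */
{
    (void)img;                                  /* stub for the HLS-generated core */
}

static void svm_evaluate(image_t img)                    /* GPU part */
{
    (void)img;                                  /* stub for the GPU kernel */
}

int main(void)
{
    image_t img = { 640, 480 };
    /* 64x128 is the usual HOG pedestrian detection window */
    for (float s = 1.0f; img.width / s >= 64 && img.height / s >= 128; s *= 1.05f) {
        image_t small = downscale(img, s);
        hog_histograms_and_normalize(small);
        svm_evaluate(small);
        printf("scale %.2f: %dx%d\n", s, small.width, small.height);
    }
    return 0;
}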
Figure 2. An algorithm is converted into a C/C ++ program with mixed code fragments for the three platforms, CPU, GPU and FPGA. The executable communicates with the GPUs and FPGAs using API libraries.
The HOG computation on the FPGA is faster than on the GPU. However, the speedup combining GPU/FPGA is bounded by the PCIe bandwidth due to the data transfer.
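One way to make this bound explicit (an illustrative formulation, not the measured figures): with t_base the baseline execution time, t_comp the accelerated computation time and t_xfer the PCIe transfer time, the observed speedup is

\[
S = \frac{t_{\text{base}}}{t_{\text{comp}} + t_{\text{xfer}}} \;\le\; \frac{t_{\text{base}}}{t_{\text{xfer}}},
\]

so no matter how fast the FPGA computes the HOG kernel, the speedup including data transfer cannot exceed t_base / t_xfer, which is why the "Computation + Data Transfer" results fall below the "Computation" ones.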
Conclusions
Figure 1. Detailed architecture combining GPU and FPGA accelerators to create a high-performance computing super desktop platform.
Acknowledgements
This research has been made possible thanks to Tetra grant 100132, "A combined GP-GPU/FPGA desktop system for accelerating image processing applications (GUDI)", of the Flanders agency for Innovation by Science and Technology.
• A C/C++ based tool chain is available for both platforms, FPGAs and GPUs.
• The I/O bandwidth has a significant impact on the final performance.
• High-level synthesis cuts down development time, making FPGAs an alternative for market solutions.
• Combined High-Performance Computing platform.

[Chart: Speedup of the 1xFPGA and 2xFPGA configurations, for computation only and for computation + data transfer (scale 0.00-3.00).]