niche product, but it has the potential to turn the Layer 4-7 network services industryâ¦â¦â¦. Maxeler Delivers New F
Accelerated Computing Using FPGAs and OpenCL Nick Finamore – Altera MDM
© 2012 Altera Corporation
Agenda
Emerging FPGA acceleration examples Performance scaling with FPGAs Deep pipelined parallelism Compiling OpenCL kernels to FPGAs OpenCL kernel porting example Mapping multi-threaded kernels to FPGAs Pipeline Vector Add Example
OpenCL roadmap Summary
© 2012 Altera Corporation
Emerging FPGA based solutions Arista FPGA switch revolutionize Layer 4-7 network services Many network engineers might dismiss Arista Networks' new field programmable gate arrays (FPGA) switch as a cool, niche product, but it has the potential to turn the Layer 4-7 network services industry……….
IBM buys Netezza for $1.7B -FPGA hosts disks
Violin Accelerates Oracle Maxeler Delivers New FPGA-Powered Connectivity Platform for HPC Users Maxeler, a company specialized in developing accelerator cards for the high-performance computing sector, has recently announced the introduction of the MaxNode10G, a platform designed for wire-speed processing of multiple 10 gigabit network data
© 2012 Altera Corporation
flash Memory Arrays provide a simple, reliable solution for scalable high performance storage. Data intensive applications like Databases, Business Intelligence/Analytics, and Data Warehousing often suffer from storage I/O bottlenecks that cripple performance. Flash Memory Arrays ensure the lowest latency data access…..
FPGA Computing Acceleration Example (circa 2008)
~10X the quad core cpu performance For Mersenne Twister + Monte Carlo Black Scholes
In double precision floating point, versus single!
FPGA implementation: Six pipelines with one result per clock for each pipeline Resources available (768) DSP blocks in the 3SL260 (65nm) XtremeData XD2000i www.altera.com/literature/wp/wp-01031-coprocessing-evolution.pdf
Performance scales with resources:
65nm (768 DSP) 40nm (1288 DSP) 28nm (4096 DSP) !! (2) Quad-core nVidia 8800 Monte Carlo cpu's Precision Single Single Paths per sec. ~240M ~900M RNG Type Halton Halton Source RapidMind RapidMind Note: Cpu and gpu results @1M paths per option.
© 2012 Altera Corporation
XD2000i w/(2) 3S260's Double 1.8G Mersenne XtremeData
Growing FPGA Design Complexity 40-/100-Gbps Ethernet
~16x
Density/Performance
Density Performance
DDR2, PCI Express
~4x ~3.5x ~2.25x PCI
1x
1x
180 nm
130 nm
90 nm
Trends © 2012 Altera Corporation 5
65 nm
40 nm and beyond
Modern Altera FPGA: Massively Parallel Build extremely deep pipelines with wide parallelism Load
Load
Load
Load
Load
Load
PCIe Store
Load
Load
Store
Load
Load
•Map coarse grained thread parallelism to fine-grained FPGA parallelism
Store
Load
Load
DDRx Store
© 2012 Altera Corporation 6
Store
Deep Pipelined Parallelism
Store
•Create deeply pipelined representation of a kernel replicated across FPGA •On each clock cycle, send in input data for a new thread
The FPGA programming challenge Typical design flow based on HDL Project definition
Project creation
Design entry/RTL***Verilog coding and early pin planning Design creation
• Behavioral or structural description of design • Early pin planning allows board development in parallel
Functional verification Synthesis (mapping)
• Verify design behavior Functional verification
Logic
Memory
I/O Design compilation
Functional verification
• Translate design into device-specific primitives • Optimization to meet required area and performance constraints
Placement and routing (fitting) • Place design in specific device resources with reference to area and performance constraints • Connect resources with routing lines
Timing analysis
• Verify design will work in target technology Functional verification
• Verify performance specifications were met • Static timing analysis
PC board simulation and test
In-system debug
© 2012 Altera Corporation 7
• Simulate board design • Program and test device on board • On-chip tools for debugging
OpenCL Overview (Open Computing Language)
OpenCL is a SW programming model: Standard C Language (C99) OpenCL C extensions (adds parallelism to C) API (Open standard for different devices)
Provides increased performance CPU offload Performance via HW acceleration
The consortium (Short List):
© 2012 Altera Corporation 8
Host CPU
Hardware Acceleration
OpenCL Enables Portability OpenCL code base
True portability across heterogeneous platforms •Performance optimized for each
Multicore CPU CPU
CPU
CPU
FPGA with Integrated CPU
GPU
© 2012 Altera Corporation
FPGA
GPU
FPGA
FPGA
OpenCL Programming Model Accelerator
Host
Typical challenges: Global/local memory bandwidth Limited floating point cores Thread occupancy
© 2012 Altera Corporation 10
Processor Accelerator Accelerator Accelerator
Local Mem Local Mem Local Local Mem Mem
Global Mem
main() { read_data( … ); maninpulate( … ); clEnqueueWriteBuffer( … ); clEnqueueNDRange(…,sum,…); clEnqueueReadBuffer( … ); display_result( … ); }
__kernel void sum(__global float *a, __global float *b, __global float *y) { int gid = get_global_id(0); y[gid] = a[gid] + b[gid]; }
OpenCL Host Program
Pure software written in standard „C‟ Communicates with the Accelerator Device via a set of library routines which abstract the communication between the host processor and the kernels main() { Copy data from Host to FPGA
read_data_from_file( … ); maninpulate_data( … );
Ask the FPGA to run a particular kernel Copy data from FPGA to Host
© 2012 Altera Corporation 11
clEnqueueWriteBuffer( … ); clEnqueueTask(…, my_kernel, …); clEnqueueReadBuffer( … ); display_result_to_user( … ); }
OpenCL Kernels
Data-parallel function
__kernel void sum(__global const float *a, __global const float *b, __global float *answer) { int xid = get_global_id(0); answer[xid] = a[xid] + b[xid]; }
Defines many parallel threads
of execution Each thread has an identifier specified by “get_global_id” Contains keyword extensions to specify parallelism and memory hierarchy
Executed by compute float object CPU GPU
*a =
0
1
2
3
4
5
6
7
float *b =
7
6
5
4
3
2
1
0
Accelerator
__kernel void sum( … ); float *answer =
© 2012 Altera Corporation 12
7
7
7
7
7
7
7
7
Mapping OpenCL Programs OpenCL Host Program + Kernels
ACL Compiler
Standard C Compiler
SOF
X86 binary
PCIe
x86 © 2012 Altera Corporation 13
Compiling OpenCL to Altera FPGAs
Host Program
__kernel void sum(__global const float *a, __global const float *b, __global float *answer) __kernel void { sum(__global const float *a, int xid = get_global_id(0); __global const float + *b, answer[xid] = a[xid] b[xid]; __global float *answer) } { int xid = get_global_id(0); answer[xid] = a[xid] + b[xid]; }
main() { read_data_from_file( … ); maninpulate_data( … );
Kernel Programs
OpenCL Load Load Load Host Program + Kernels Load
ACL Compiler
Load
PCIe
Standard C Compiler
display_result_to_user( … );
} Store SOF
Load
Load
Store X86 binary Load
Load
Load
clEnqueueWriteBuffer( … ); clEnqueueKernel(…, sum, …); clEnqueueReadBuffer( … );
Store
Load
Load
PCIe
DDRx Store
© 2012 Altera Corporation 14
Store
x86
Store
FPGA OpenCL Architecture Example FPGA PCIe
External Memory Controller & PHY
DDR*
x86 / External Processor
External Memory Controller & PHY
Global Memory Interconnect M9K
M9K M9K
M9K Kernel Pipeline
Kernel Pipeline
Kernel Pipeline
M9K M9K
Local Memory Interconnect
Limited external memory bandwidth Extremely high internal memory bandwidth Highly customizable compute cores © 2012 Altera Corporation 15
Traditional “C to Gates” Approach Logic Only
For ASIC & FPGA Designers:
Logic to RTL Solution Map C algorithms to ip For ASIC or FPGA
Algorithms maintained in C
Host CPU
FPGA
User program
Development, Enhancements, Testing
Logic to RTL Examples (2010):
Driver
Logic
e.g. Mentor Catapult C
Designer adds a few missing pieces:
Data Management in/out of the ip core Data Management to external memory Also called “Platform Support Package”
16
OpenCL is a different approach
© 2012 Altera Corporation
Mapping Multithreaded Kernels to FPGAs
The simplest way of mapping kernel functions to FPGAs is to replicate hardware for each thread
Load
Load
Inefficient and wasteful
Better method deep pipeline parallelism Map coarse grained thread parallelism
to fine-grained FPGA parallelism Create a deeply pipelined representation of a kernel Send in new data and retire a new result on each clock cycle
© 2012 Altera Corporation 17
Store
Load
Load
Store
Example Pipeline for Vector Add 8 threads for vector add example 0
Load
1
2
3
4
5
6
7
Load
Thread IDs +
Store
© 2012 Altera Corporation 18
On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
Example Pipeline for Vector Add 8 threads for vector add example 1
2
3
4
5
6
7
0
Load
Load
Thread IDs +
Store
© 2012 Altera Corporation 19
On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
Example Pipeline for Vector Add 8 threads for vector add example 2
3
4
5
6
7
1
Load
Load
Thread IDs
0 +
Store
© 2012 Altera Corporation 20
On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
Example Pipeline for Vector Add 8 threads for vector add example 3
4
5
6
7
2
Load
Load
Thread IDs
1 + 0
Store
© 2012 Altera Corporation 21
On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
Example Pipeline for Vector Add 8 threads for vector add example 4
5
6
7
3
Load
Load
Thread IDs
2 + 1
Store 0
© 2012 Altera Corporation 22
On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored
High Definition Rendering CODEC: An example of OpenCL kernel porting Up and Running in a day! Optimized within a few days
23 © 2012 Altera Corporation
Summary
FPGA‟s are able to accelerate high performance computing applications Deep pipeline parallelism allow highly efficient algorithm acceleration OpenCL enables standard based high level programming of FPGA accelerators OpenCL will enable even more efficient implementation of acceleration on FPGAs in the future
© 2012 Altera Corporation