Accelerated Signal Processing on Embedded Platforms: Paths Forward
September 15, 2016
Dr. Rajib Bhattacharjea
Georgia Tech Research Institute
Information and Communications Lab
Atlanta, GA 30318, USA
GTRI
Overview
• Embedded SDR
  - SBCs, stick computers, mini PCs, etc.
  - Embedded ARM landscape
  - GNURadio
  - SDR Hardware
• Real-time signal processing on CPU
  - Ways to leverage SIMD
  - Paths forward in GNURadio
• Real-time signal processing on GPU
  - Why embedded GPU?
  - What is an embedded GPU?
  - Embedded GPU landscape
• GPU Programming for computations
  - Languages and APIs
  - Paths forward in GNURadio
What We’re Talking About
Single Board Computers!
http://hackerboards.com/ringing-in-2016-with-64-open-spec-hacker-friendly-sbcs/
http://hackerboards.com/catalog-of-81-open-spec-hacker-friendly-sbcs/
Embedded Computers from the Living Room!
Embedded ARM Landscape
Manufacturer | Brand Name | End Devices
Texas Instruments | OMAP | Kindle Fire HD, Droid X, Droid 2
Broadcom | - | Raspberry Pi
Allwinner | - | Tablets
Rockchip | - | ASUS Chromebook, TV Boxes/Sticks
MediaTek | - | Tablets, Phones
Qualcomm | Snapdragon | Google Nexus One, HTC Incredible
Marvell | ARMADA | Panasonic Toughpad A1, Google Chromecast
Amlogic | - | TV Boxes
Samsung | Exynos | HP Chromebook, Samsung Galaxy S6
NVIDIA | Tegra | NVIDIA Shield
Apple | AX | iPhone, iPad, Apple TV, Apple Watch
Actions Semiconductor | - | Tablets, TV Boxes
NXP | i.MX, QorIQ | Amazon Kindle
HiSilicon | - | Huawei Devices
Atmel | SMART SAMA5 | Dev Boards
Renesas | - | Samsung Galaxy Core LTE
Signal Processing with GNURadio!
http://wiki.opendigitalradio.org/FM_RDS_Stereo_transmitter_using_gnuradio
Software Defined Radio Hardware!
Put it all together!
Real-time signal processing on CPU is your foe
Must understand both signal processing operations and the modern CPU
Real-time signal processing on CPU is your foe

The Modern CPU
• Optimized for managing the flow of instructions and keeping the pipeline full, not for repeatedly and quickly executing a small group of instructions
  - Branch prediction, out of order execution, speculative execution, prefetching, etc.
• CPU manufacturers see this gap, have created special CPU instructions for common computations in DSP, multimedia, encryption, etc.

Signal Processing Operations
• Modulation
  - Complex Multiplication
  - Look-up tables (LUTs)
• Entropy coding, error coding, encryption
  - Bitwise operations
  - LUTs
• Multiply/add heavy operations
  - Convolutions / FIR filters
  - FFTs
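
To make the multiply/add-heavy case concrete, here is a minimal sketch of a direct-form FIR filter in plain C. The function name `fir_direct`, the parameter names, and the choice to treat samples before the start of the buffer as zero are assumptions made for illustration; they do not come from the slides or from any particular library. The inner loop is the small, repeated group of instructions that a general-purpose pipeline is not specifically built for.

```c
#include <stddef.h>

/*
 * Direct-form FIR filter: out[n] = sum over m of in[n - m] * taps[m].
 * Input samples before index 0 are treated as zero (an assumption of this
 * sketch, not something the slides specify).
 */
void fir_direct(const float *in, size_t n_in,
                const float *taps, size_t n_taps,
                float *out)
{
    for (size_t n = 0; n < n_in; n++) {
        float acc = 0.0f;
        /* Only visit taps whose matching input sample exists (n - m >= 0). */
        size_t m_max = (n + 1 < n_taps) ? n + 1 : n_taps;
        for (size_t m = 0; m < m_max; m++) {
            acc += in[n - m] * taps[m];  /* one multiply and one add per tap */
        }
        out[n] = acc;
    }
}
```

Each output sample costs up to `n_taps` multiplies and adds, so the work per second scales with the sample rate times the filter length, which is why this loop is the usual first target for SIMD or GPU offload.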
Path 1: SIMD CPU Extensions
• Write assembly
• Use compiler features directly
• ORC
• OpenCL
• Performance-tuned libraries

In GNURadio, you’re already using VOLK, so we’ll assume it is not meeting your application’s requirements.
If you end up with SIMD code, it doesn’t matter how you got there.
There are no guarantees of optimality other than writing or modifying assembly with the help of a profiler.
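
As one illustration of the "use compiler features directly" option, the sketch below writes a complex multiply in plain C in a form that compilers can often auto-vectorize at higher optimization levels such as -O3. The function name `cmul_interleaved` and the interleaved I/Q buffer layout are assumptions for this example. Whether SIMD instructions are actually emitted depends on the compiler, flags, and target, so the generated assembly or a profiler still has the final word.

```c
#include <stddef.h>

/*
 * Element-wise complex multiply on interleaved I/Q samples:
 * (a + jb)(c + jd) = (ac - bd) + j(ad + bc).
 * `restrict` promises the compiler that the buffers do not overlap, which
 * removes one common obstacle to auto-vectorization.
 */
void cmul_interleaved(const float *restrict x,
                      const float *restrict y,
                      float *restrict out,
                      size_t n_complex)
{
    for (size_t i = 0; i < n_complex; i++) {
        float a = x[2 * i], b = x[2 * i + 1];
        float c = y[2 * i], d = y[2 * i + 1];
        out[2 * i]     = a * c - b * d;  /* real part */
        out[2 * i + 1] = a * d + b * c;  /* imaginary part */
    }
}
```

Because the caller promises non-overlapping buffers, passing the same array as both an input and the output would break that promise and is outside what this sketch supports.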
SIMD Paths Forward in GNU Radio
• VOLK uses standard C++ (Release builds use -O3), inline ASM, compiler intrinsics, and ORC
• In the best case, the end result of any of the above is SIMD machine code
• Picks best out of what’s in VOLK, but what about things that aren’t in VOLK?

Recommendations:
- Profile kernels and publish results
- Offer SoC project to compare against other similar projects (http://libvolk.org/comparisons.html)
- Learn from or merge in implementations from other packages
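
The idea of profiling several implementations of a kernel and keeping the fastest can be sketched in a few lines of C. This is not VOLK's API or its profiler: `dot_kernel_fn`, `dot_generic`, `dot_unrolled4`, and `pick_fastest` are hypothetical names invented for this example, and `clock()` is a deliberately simple, coarse timer.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature: dot product of two float buffers. */
typedef float (*dot_kernel_fn)(const float *, const float *, size_t);

/* Straightforward reference implementation. */
float dot_generic(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Variant with four independent accumulators, which can expose more
 * instruction-level parallelism to the compiler and CPU. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)  /* leftover samples */
        acc0 += a[i] * b[i];
    return (acc0 + acc1) + (acc2 + acc3);
}

/* Time each candidate on the caller's data and return the index of the
 * fastest one. clock() measures CPU time and is coarse, so `reps` should be
 * large enough for the differences to be measurable. */
size_t pick_fastest(const dot_kernel_fn *candidates, size_t n_candidates,
                    const float *a, const float *b, size_t n, int reps)
{
    size_t best = 0;
    clock_t best_time = 0;
    volatile float sink = 0.0f;  /* keeps the calls from being optimized away */

    for (size_t k = 0; k < n_candidates; k++) {
        clock_t start = clock();
        for (int r = 0; r < reps; r++)
            sink += candidates[k](a, b, n);
        clock_t elapsed = clock() - start;
        if (k == 0 || elapsed < best_time) {
            best = k;
            best_time = elapsed;
        }
    }
    (void)sink;
    return best;
}
```

Which candidate wins depends on the machine, compiler, and buffer size, which is the argument for profiling on the target device and publishing the results rather than assuming one implementation is best everywhere.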
Path 2: Embedded GPU
Logos are the trademarks of their respective companies
Embedded GPUs: Why are they there?
• Vendors want to support “multimedia applications”, the killer app
  - Streaming, decoding, and playback of media
  - Encoding of media (Vine, YouTube, Periscope, etc.)
  - Mobile gaming
    • Projected to generate 36.9 billion USD in revenue in 2016*
• “Mobile” is the marketing term for “embedded”

* https://newzoo.com/insights/articles/global-games-market-reaches-99-6-billion-2016-mobile-generating-37/
What are these GPUs?
http://www.arm.com/products/multimedia/mali-gpu/high-performance/mali-t628.php
https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core
Embedded GPU Landscape
IP Holder | Brand Name | Licensees / Manufacturers | End Devices
Broadcom | VideoCore | Broadcom | Raspberry Pi, Samsung Galaxy (various)
ARM Holdings | Mali | Allwinner, Amlogic, MediaTek, Rockchip, Samsung | Samsung Gear/Galaxy/Chromebook (various), Google Nexus 10, ODROID (various)
NVIDIA | Tegra GPU | NVIDIA | NVIDIA Shield, Acer Chromebook, Google Pixel C, HTC Nexus 9
Intel | HD Graphics, GMA, Iris, Iris Pro | Intel | Dell Pro 11 2in1, Lenovo ThinkPad (various), Dell Venue 11 Pro
Qualcomm | Adreno | Qualcomm | Samsung Galaxy (various), Sony Xperia X, HTC 10
Imagination Technologies | PowerVR | TI, Apple, Intel, Broadcom, Allwinner, Samsung, Rockchip | iPhone (various), iPad (various), Apple Watch, PS Vita, Asus Zenfone 4, ODROID-XU
Vivante Corporation | Vega | Marvell, Freescale, Ingenic Semiconductor, Rockchip | Samsung Galaxy Tab 4, Chuwi V90, Hummingboard (various)
GPU Programming for Compute
• Write GPU assembly
• Write shading language code
• Use higher level compute languages
• GPU-accelerated libraries
GPU Programming for Compute: Shading Languages, Compute Languages, APIs
https://community.arm.com/groups/arm-mali-graphics/blog/2016/07/06/the-vulkan-validation-layers
GPU Shading Language
• OpenGL ES Shading Language (ESSL)
  - C-like language; essentially the OpenGL Shading Language (GLSL), but for embedded
  - Compute shaders supported in OpenGL ES 3.1 and up
  - Compliant vendors provide compilers that generate GPU machine code

```glsl
#version 310 es

// The uniform parameters which get passed for every frame.
uniform float radius;

// Four-component vector; represents either a vertex or a color.
struct Vector3f
{
    float x;
    float y;
    float z;
    float w;
};

// Colored point.
struct AttribData
{
    Vector3f v; // vertex
    Vector3f c; // color
};

// Machine-aligned output buffer.
layout(std140, binding = 0) buffer destBuffer
{
    AttribData data[];
} outBuffer;

layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

void main()
{
    // Position for this thread. Kept unsigned because ESSL has no implicit
    // int/uint conversions.
    uvec2 storePos = gl_GlobalInvocationID.xy;

    // Calculate the global number of threads (size) for this dispatch.
    uint gWidth = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
    uint gHeight = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
    uint gSize = gWidth * gHeight;
    uint offset = storePos.y * gWidth + storePos.x; // linear offset

    // Calculate an angle for the current thread.
    float alpha = 2.0 * 3.14159265359 * (float(offset) / float(gSize));

    // Vertex position from the calculated angle and radius.
    outBuffer.data[offset].v.x = sin(alpha) * radius;
    outBuffer.data[offset].v.y = cos(alpha) * radius;
    outBuffer.data[offset].v.z = 0.0;
    outBuffer.data[offset].v.w = 1.0;

    // Assign color for the vertex.
    outBuffer.data[offset].c.x = float(storePos.x) / float(gWidth);
    outBuffer.data[offset].c.y = 0.0;
    outBuffer.data[offset].c.z = 1.0;
    outBuffer.data[offset].c.w = 1.0;
}
```

https://community.arm.com/groups/arm-mali-graphics/blog/2014/04/17/get-started-with-compute-shaders
GPU Compute Languages: OpenCL
• Two available languages based on C and C++ (OpenCL C, OpenCL C++)
• Operating modes
  - “Offline” compile to machine code (not portable, no runtime cost, no source)
    • Intel*, ARM Mali¤, or use the API itself to dump the machine code†
  - SPIR (portable, JIT cross-assembler at runtime, no source)
    • Intel*
    • Vulkan is a way to get SPIR, more on this later
  - Pure JIT (portable, JIT compile at runtime, source ships to user)
• Requires hardware vendor support
  - At time of writing, broadly supported on embedded GPUs (except NVIDIA)

* https://software.intel.com/en-us/node/539388
¤ http://malideveloper.arm.com/resources/tools/mali-offline-compiler/
† http://www.cs.bris.ac.uk/home/simonm/montblanc/AdvancedOpenCL_full.pdf
GPU Compute Languages: CUDA
• NVIDIA specific GPGPU programming framework
  - Not ubiquitous (like other ARM GPUs)
• “CUDA C/C++” or “CUDA Fortran” languages + compilers
  - nvcc
  - PGI CUDA Fortran Compiler

Aside
• OpenACC is a set of compiler directives you add to your C/C++ or Fortran code, then compile with a special compiler (PGI’s pgcc)
  - NVIDIA is one of the only companies to support this technology on their GPUs
  - Claims you don’t have to touch much of your original code, just give hints about parallelization
• OpenCL can run on NVIDIA desktop and laptop GPUs, but NVIDIA doesn’t ship OpenCL for Tegra embedded
GPU Compute Capable API: Vulkan
• Spiritual successor to OpenGL
• Supports compute shaders
• Supported by all the major vendors except Broadcom
  - Including NVIDIA Tegra
• Shaders are compiled to portable, machine-independent SPIR-V
  - JIT cross-assembled for the host GPU by the Vulkan driver/runtime
• Verbose, lots of boilerplate
  - 600+ lines of code to allocate two buffers in the GPU and memcpy one to the other
  - (https://gist.github.com/sheredom/523f02bbad2ae397d7ed255f3f3b5a7f)
GPU Accelerated APIs
• NVIDIA Performance Primitives
• cuBLAS
• clBLAS
• cuFFT
• clFFT
Embedded GPU Compute Paths Forward
Product | Path of Least Resistance
Broadcom VideoCore | Assembly
ARM Holdings Mali | OpenCL (accelerated libraries)
NVIDIA Tegra GPU | CUDA* (accelerated libraries)
Intel HD Graphics, GMA, Iris, Iris Pro | OpenCL (accelerated libraries)
Qualcomm Adreno | OpenCL (accelerated libraries)
Imagination Technologies PowerVR | OpenCL¤ (accelerated libraries)
Vivante Corporation Vega | OpenCL (accelerated libraries)

* OpenCL could easily be supported by NVIDIA, but they haven’t released it for Tegra
¤ Have to apply for the PowerVR early access program; still not clear where downloads are
GPU Processing on Embedded: State of the Art
• OpenCL is the best option right now due to broad vendor support
• Can justify CUDA development only if you’re doing SDR on a Tegra
  - Compute shaders (ESSL or Vulkan) are more of a pain but supported on Tegra
  - There are automatic ways to convert your CUDA code to OpenCL (Swan, Vtsynergy CU2CL)
• Some work has been done on Broadcom VideoCore
  - Assemblers for the VideoCore, including one where you write assembly in Python (PyVideoCore)
  - PyVideoCore has a single example of a single BLAS/LAPACK function
Final Thoughts
• You can’t pick up an embedded device and get optimal signal processing performance
  - There is software work to be done to leverage the processing power of the GPUs and SIMD technologies
• Should the VOLK project support GPU accelerated implementations?
  - Should everyone use OpenCL?
• It’s an exciting time in embedded SDR!
• The unprecedented popularity of mobile, wireless, and TV markets has effectively subsidized the SBC and embedded computer market.
• You can realistically learn SDR and signal processing with practical, hands-on experience for the price of a drink at a nice restaurant.
Acknowledgements
• Ben Riley of GTRI for funding IRAD
Thank You
Extra Material
Special Coprocessors
• Xeon Phi*
  - Not embedded yet; power draw is same as thirty SBCs
• Adapteva Epiphany Coprocessor
  - Has not really caught on
  - Used in one design, the Adapteva Parallella-16 SBC
  - http://www.adapteva.com/epiphanyiii/
• Movidius Myriad VPUs
  - Vision processing units
  - Google uses for machine learning
  - Intel is acquiring them
  - http://www.movidius.com/solutions/vision-processing-unit

* http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html
Other Future Directions
• AMD has an initiative called GPUOpen that has a project called HIP
  - http://gpuopen.com/compute-product/hip-convert-cuda-to-portable-c-code/
• Device agnostic C++ variant that can be compiled for NVIDIA GPUs, to AMD machine code, and to a device independent HSAIL machine code (JIT cross-assembled at runtime on your platform of choice)
• No vendor support yet, but here are some of the founders of the HSA consortium
  - ARM
  - Imagination Technologies
  - Qualcomm
Embedded Processing: Operations and Speed
Convolution (FIR filter)
$$(x * y)[n] = \sum_{m=0}^{M-1} x[n-m]\, y[m]$$
#define M (1L