Accelerated Signal Processing on Embedded Platforms

Accelerated Signal Processing on Embedded Platforms: Paths Forward
September 15, 2016
Dr. Rajib Bhattacharjea
Georgia Tech Research Institute, Information and Communications Lab
Atlanta, GA 30318, USA

GTRI

Overview
• Embedded SDR
  - SBCs, stick computers, mini PCs, etc.
  - Embedded ARM landscape
  - GNU Radio
  - SDR hardware
• Real-time signal processing on CPU
  - Ways to leverage SIMD
  - Paths forward in GNU Radio
• Real-time signal processing on GPU
  - Why embedded GPU?
  - What is an embedded GPU?
  - Embedded GPU landscape
• GPU programming for computations
  - Languages and APIs
  - Paths forward in GNU Radio

What We’re Talking About


Single Board Computers!

http://hackerboards.com/ringing-in-2016-with-64-open-spec-hacker-friendly-sbcs/
http://hackerboards.com/catalog-of-81-open-spec-hacker-friendly-sbcs/

Embedded Computers from the Living Room!


Embedded ARM Landscape

Manufacturer | Brand Name | End Devices
Texas Instruments | OMAP | Kindle Fire HD, Droid X, Droid 2
Broadcom | - | Raspberry Pi
Allwinner | - | Tablets
Rockchip | - | ASUS Chromebook, TV Boxes/Sticks
MediaTek | - | Tablets, Phones
Qualcomm | Snapdragon | Google Nexus One, HTC Incredible
Marvell | ARMADA | Panasonic Toughpad A1, Google Chromecast
Amlogic | - | TV Boxes
Samsung | Exynos | HP Chromebook, Samsung Galaxy S6
NVIDIA | Tegra | NVIDIA Shield
Apple | AX | iPhone, iPad, Apple TV, Apple Watch
Actions Semiconductor | - | Tablets, TV Boxes
NXP | i.MX, QorIQ | Amazon Kindle
HiSilicon | - | Huawei Devices
Atmel | SMART SAMA5 | Dev Boards
Renesas | - | Samsung Galaxy Core LTE

Signal Processing with GNURadio!

http://wiki.opendigitalradio.org/FM_RDS_Stereo_transmitter_using_gnuradio

Software Defined Radio Hardware!


Put it all together!


Real-time signal processing on CPU is your foe

Must understand both signal processing operations and the modern CPU


Real-time signal processing on CPU is your foe

The Modern CPU
• Optimized for managing the flow of instructions and keeping the pipeline full, not for repeatedly and quickly executing a small group of instructions
  - Branch prediction, out of order execution, speculative execution, prefetching, etc.
• CPU manufacturers see this gap, have created special CPU instructions for common computations in DSP, multimedia, encryption, etc.

Signal Processing Operations
• Modulation
  - Complex multiplication
  - Look-up tables (LUTs)
• Entropy coding, error coding, encryption
  - Bitwise operations
  - LUTs
• Multiply/add heavy operations
  - Convolutions / FIR filters
  - FFTs
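To make the multiply/add pattern concrete, here is a minimal C sketch of a direct-form FIR filter. The function name `fir_filter`, its argument order, and the choice to treat samples before the start of the buffer as zero are assumptions made for illustration; none of them come from GNU Radio or VOLK. The inner loop is the kind of small, repeated group of instructions that SIMD extensions are designed to speed up.

```c
#include <assert.h>
#include <stddef.h>

/* Direct-form FIR filter (hypothetical helper for illustration):
 *   y[n] = sum over m of taps[m] * x[n - m]
 * Samples before x[0] are treated as zero (an assumption of this sketch). */
void fir_filter(const float *x, size_t num_samples,
                const float *taps, size_t num_taps, float *y)
{
    assert(x != NULL && taps != NULL && y != NULL);
    for (size_t n = 0; n < num_samples; n++) {
        float acc = 0.0f;
        /* Only taps whose input index n - m is non-negative contribute. */
        size_t usable = (n + 1 < num_taps) ? n + 1 : num_taps;
        for (size_t m = 0; m < usable; m++) {
            acc += taps[m] * x[n - m];  /* one multiply-add per tap */
        }
        y[n] = acc;
    }
}
```

With taps {0.5, 0.5} the filter is a two-point moving average, so the input {1, 2, 3, 4} produces {0.5, 1.5, 2.5, 3.5}.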


Path 1: SIMD CPU Extensions
• Write assembly
• Use compiler features directly
• ORC
• OpenCL
• Performance-tuned libraries

In GNU Radio, you’re already using VOLK, so we’ll assume it is not meeting your application’s requirements.
If you end up with SIMD code, it doesn’t matter how you got there.
There are no guarantees of optimality short of writing or modifying assembly with the help of a profiler.


SIMD Paths Forward in GNU Radio
• VOLK uses standard C++ (Release builds use -O3), inline ASM, compiler intrinsics, and ORC
• In the best case, the end result of any of the above is SIMD machine code
• VOLK picks the best implementation out of what it contains, but what about operations that aren’t in VOLK?

Recommendations:
- Profile kernels and publish results
- Offer SoC project to compare against other similar projects (http://libvolk.org/comparisons.html)
- Learn from or merge in implementations from other packages
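As an illustration of the first route (plain compiled code built with -O3), the sketch below multiplies two buffers of interleaved I/Q samples element by element. The name `complex_multiply` and the interleaved {re, im} layout are assumptions for this example and are not VOLK's API. The `restrict` qualifiers promise the compiler that the buffers do not overlap, which generally makes auto-vectorization easier; whether SIMD instructions are actually emitted depends on the compiler, flags, and target, so it is worth checking the disassembly or a profiler.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise complex multiply on interleaved I/Q samples
 * (hypothetical kernel for illustration):
 *   out[k] = a[k] * b[k], each complex value stored as {re, im}.
 * The buffers must not overlap, as promised by restrict. */
void complex_multiply(const float *restrict a, const float *restrict b,
                      float *restrict out, size_t num_complex)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t k = 0; k < num_complex; k++) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        out[2 * k]     = ar * br - ai * bi;  /* real part */
        out[2 * k + 1] = ar * bi + ai * br;  /* imaginary part */
    }
}
```

For example, (1 + 2i)(3 + 4i) gives -5 + 10i. A loop like this is a natural candidate for the "profile kernels and publish results" recommendation, since the same source can be timed across compilers and targets.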


Path 2: Embedded GPU

Logos are the trademarks of their respective companies

Embedded GPUs: Why are they there?
• Vendors want to support “multimedia applications”, the killer app
  - Streaming, decoding, and playback of media
  - Encoding of media (Vine, YouTube, Periscope, etc.)
  - Mobile gaming
• Mobile gaming is projected to generate 36.9 billion USD in revenue in 2016*
• “Mobile” is the marketing term for “embedded”

* https://newzoo.com/insights/articles/global-games-market-reaches-99-6-billion-2016-mobile-generating-37/

What are these GPUs?

http://www.arm.com/products/multimedia/mali-gpu/high-performance/mali-t628.php
https://community.arm.com/groups/arm-mali-graphics/blog/2014/03/12/the-mali-gpu-an-abstract-machine-part-3--the-shader-core

Embedded GPU Landscape

IP Holder | Brand Name | Licensees / Manufacturers | End Devices
Broadcom | VideoCore | Broadcom | Raspberry Pi, Samsung Galaxy (various)
ARM Holdings | Mali | Allwinner, Amlogic, MediaTek, Rockchip, Samsung | Samsung Gear/Galaxy/Chromebook (various), Google Nexus 10, ODROID (various)
NVIDIA | Tegra GPU | NVIDIA | NVIDIA Shield, Acer Chromebook, Google Pixel C, HTC Nexus 9
Intel | HD Graphics, GMA, Iris, Iris Pro | Intel | Dell Pro 11 2in1, Lenovo ThinkPad (various), Dell Venue 11 Pro
Qualcomm | Adreno | Qualcomm | Samsung Galaxy (various), Sony Xperia X, HTC 10
Imagination Technologies | PowerVR | TI, Apple, Intel, Broadcom, Allwinner, Samsung, Rockchip | iPhone (various), iPad (various), Apple Watch, PS Vita, Asus Zenfone 4, ODROID-XU
Vivante Corporation | Vega | Marvell, Freescale, Ingenic Semiconductor, Rockchip | Samsung Galaxy Tab 4, Chuwi V90, Hummingboard (various)

GPU Programming for Compute
• Write GPU assembly
• Write shading language code
• Use higher level compute languages
• GPU-accelerated libraries

GPU Programming for Compute: Shading Languages, Compute Languages, APIs

https://community.arm.com/groups/arm-mali-graphics/blog/2016/07/06/the-vulkan-validation-layers


GPU Shading Language
• OpenGL ES Shading Language (ESSL)
  - C-like language; the embedded variant of the OpenGL Shading Language (GLSL)
  - Compute shaders supported in OpenGL ES 3.1 and up
  - Compliant vendors provide compilers that generate GPU machine code

    #version 310 es

    // The uniform parameters which get passed for every frame.
    uniform float radius;

    // Represents either a vertex or a color.
    struct Vector3f
    {
        float x;
        float y;
        float z;
        float w;
    };

    // Colored point
    struct AttribData
    {
        Vector3f v; // vertex
        Vector3f c; // color
    };

    // Machine-aligned output buffer
    layout(std140, binding = 0) buffer destBuffer
    {
        AttribData data[];
    } outBuffer;

    layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

    void main()
    {
        // Position for this thread
        ivec2 storePos = ivec2(gl_GlobalInvocationID.xy);

        // Calculate the global number of threads (size) for this
        uint gWidth = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
        uint gHeight = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
        uint gSize = gWidth * gHeight;

        // Linear offset of this thread in the output buffer.
        // ESSL has no implicit int/uint/float conversions, so cast explicitly.
        uint offset = uint(storePos.y) * gWidth + uint(storePos.x);

        // Calculate an angle for the current thread
        float alpha = 2.0 * 3.14159265359 * (float(offset) / float(gSize));

        // Vertex position from the calculated angle and the radius
        outBuffer.data[offset].v.x = sin(alpha) * radius;
        outBuffer.data[offset].v.y = cos(alpha) * radius;
        outBuffer.data[offset].v.z = 0.0;
        outBuffer.data[offset].v.w = 1.0;

        // Assign color for the vertex
        outBuffer.data[offset].c.x = float(storePos.x) / float(gWidth);
        outBuffer.data[offset].c.y = 0.0;
        outBuffer.data[offset].c.z = 1.0;
        outBuffer.data[offset].c.w = 1.0;
    }

https://community.arm.com/groups/arm-mali-graphics/blog/2014/04/17/get-started-with-compute-shaders

GPU Compute Languages: OpenCL
• Two available languages based on C and C++ (OpenCL C, OpenCL C++)
• Operating modes
  - “Offline” compile to machine code (not portable, no runtime cost, no source)
    • Intel*, ARM Mali¤, or use the API itself to dump the machine code†
  - SPIR (portable, JIT cross-assembler at runtime, no source)
    • Intel*
    • Vulkan is a way to get SPIR, more on this later
  - Pure JIT (portable, JIT compile at runtime, source ships to user)
• Requires hardware vendor support
  - At time of writing, broadly supported on embedded GPUs (except NVIDIA)

* https://software.intel.com/en-us/node/539388
¤ http://malideveloper.arm.com/resources/tools/mali-offline-compiler/
† http://www.cs.bris.ac.uk/home/simonm/montblanc/AdvancedOpenCL_full.pdf

GPU Compute Languages: CUDA
• NVIDIA-specific GPGPU programming framework
  - Not ubiquitous (unlike the GPUs in other ARM SoCs)
• “CUDA C/C++” or “CUDA Fortran” languages + compilers
  - nvcc
  - PGI CUDA Fortran Compiler

Aside
• OpenACC is a set of compiler directives that you add to your C/C++ or Fortran code, then compile with a special compiler (PGI’s pgcc)
  - NVIDIA is one of the only companies to support this technology on their GPUs
  - Claims you don’t have to touch much of your original code, just give hints about parallelization
• OpenCL can run on NVIDIA desktop and laptop GPUs, but NVIDIA doesn’t ship OpenCL for embedded Tegra

GPU Compute Capable API: Vulkan
• Spiritual successor to OpenGL
• Supports compute shaders
• Supported by all the major vendors except Broadcom
  - Including NVIDIA Tegra
• Uses portable shader code in the machine-independent SPIR-V format
  - JIT cross-assembled for the host GPU by the Vulkan driver/runtime
• Verbose, lots of boilerplate
  - 600+ lines of code to allocate two buffers in the GPU and memcpy one to the other
  - (https://gist.github.com/sheredom/523f02bbad2ae397d7ed255f3f3b5a7f)

GPU Accelerated APIs
• NVIDIA Performance Primitives
• cuBLAS
• clBLAS
• cuFFT
• clFFT

Embedded GPU Compute Paths Forward

Product | Path of Least Resistance
Broadcom VideoCore | Assembly
ARM Holdings Mali | OpenCL (accelerated libraries)
NVIDIA Tegra GPU | CUDA* (accelerated libraries)
Intel HD Graphics, GMA, Iris, Iris Pro | OpenCL (accelerated libraries)
Qualcomm Adreno | OpenCL (accelerated libraries)
Imagination Technologies PowerVR | OpenCL¤ (accelerated libraries)
Vivante Corporation Vega | OpenCL (accelerated libraries)

* OpenCL could easily be supported by NVIDIA, but they haven’t released it for Tegra
¤ Have to apply for the PowerVR early access program; still not clear where the downloads are

GPU Processing on Embedded: State of the Art
• OpenCL is the best option right now due to broad vendor support
• Can justify CUDA development only if you’re doing SDR on a Tegra
  - Compute shaders (ESSL or Vulkan) are more of a pain but supported on Tegra
  - There are automatic ways to convert your CUDA code to OpenCL (Swan, Vtsynergy CU2CL)
• Some work has been done on Broadcom VideoCore
  - Assemblers for the VideoCore, including one where you write assembly in Python (PyVideoCore)
  - PyVideoCore has a single example of a single BLAS/LAPACK function

Final Thoughts
• You can’t pick up an embedded device and get optimal signal processing performance
  - There is software work to be done to leverage the processing power of the GPUs and SIMD technologies
• Should the VOLK project support GPU accelerated implementations?
  - Should everyone use OpenCL?
• It’s an exciting time in embedded SDR!
• The unprecedented popularity of mobile, wireless, and TV markets has effectively subsidized the SBC and embedded computer market.
• You can realistically learn SDR and signal processing with practical, hands-on experience for the price of a drink at a nice restaurant.

Acknowledgements
• Ben Riley of GTRI for funding IRAD

Thank You


Extra Material


Special Coprocessors
• Xeon Phi*
  - Not embedded yet; power draw is same as thirty SBCs
• Adapteva Epiphany Coprocessor
  - Has not really caught on
  - Used in one design, the Adapteva Parallella-16 SBC
• Movidius Myriad VPUs
  - Vision processing units
  - Google uses for machine learning
  - Intel is acquiring them

http://www.movidius.com/solutions/vision-processing-unit
http://www.adapteva.com/epiphanyiii/
* http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html

Other Future Directions
• AMD has an initiative called GPUOpen that has a project called HIP
  - http://gpuopen.com/compute-product/hip-convert-cuda-to-portable-c-code/
• Device agnostic C++ variant that can be compiled for NVIDIA GPUs, to AMD machine code, and to a device independent HSAIL machine code (JIT cross-assembled at runtime on your platform of choice)
• No vendor support yet, but here are some of the founders of the HSA consortium
  - ARM
  - Imagination Technologies
  - Qualcomm

Embedded Processing: Operations and Speed

Convolution (FIR filter)

$$(x * y)[n] = \sum_{m=0}^{M-1} x[n-m]\, y[m]$$

#define M (1L