OpenCLIPP: OpenCL Integrated Performance ...

M. Akhloufi, A. Campagna, "OpenCLIPP: OpenCL Integrated Performance Primitives library for computer vision applications", Proc. SPIE Electronic Imaging, Intelligent Robots and Computer Vision XXXI: Algorithms and Techniques, 9025-31, San Francisco, CA, USA, February 2014

OpenCLIPP: OpenCL Integrated Performance Primitives library for computer vision applications

Moulay Akhloufi, [email protected] ; Antoine Campagna

In recent years, interest in GPGPU computing (General-Purpose computation on Graphics Processing Units) has grown. This field aims to use the processing power of the GPU (Graphics Processing Unit) to accelerate general processing such as mathematics, 3D visualization and image processing. Until recently, CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model invented by NVIDIA, was the main driver of this interest and the most widely used architecture for GPGPU computing. With the recent advent of the Open Computing Language (OpenCL), more and more work is being conducted using this new platform. OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group. It has been adopted by multiple companies, including NVIDIA (the inventor of CUDA). With this growing interest, the availability of a set of performance primitives for general-purpose applications can help accelerate the work of the research and industrial communities. Intel, for example, develops Intel Integrated Performance Primitives (Intel IPP), a multi-threaded software library of functions for multimedia and data processing applications. NVIDIA, on the other hand, offers the NVIDIA Performance Primitives library (NPP), a collection of GPU-accelerated image, video, and signal processing functions that deliver faster performance than comparable CPU-only implementations. In this work, we present the architecture and development of an open source OpenCL integrated performance primitives library called OpenCLIPP. This library aims to provide a free and open source set of OpenCL functions with a simple interface similar to Intel IPP and NVIDIA NPP. The first release mainly includes image processing and computer vision algorithms: convolution filters, thresholding, blobs, etc. The developed functions are introduced, and benchmarks against equivalent Intel IPP and NVIDIA NPP functions are presented. This library will be made available to the open source community.

by Moulay Akhloufi, ([email protected]) Antoine W. Campagna


Introduction

Computer vision is used more and more in today's applications. With ever higher resolutions and more demanding algorithms, applications are often limited by the processing power of CPUs. An alternative is the use of GPUs. We present a new library based on OpenCL to perform high speed image processing on GPUs: OpenCLIPP.

[Figure: diagram with the elements RAM, CPU, VRAM and GPU]

The library is open source, LGPL licensed and free for commercial use. You can download it on GitHub. Website:

http://openclipp.wix.com/openclipp

The library supports images with :
• signed and unsigned integer of 8, 16 or 32 bits, or floating point 32 bits
• 1, 2, 3 or 4 channels
• almost any size (maximum image size depends on hardware)
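To give a sense of the memory footprint implied by the supported formats, here is a minimal sketch that computes the size in bytes of an image. The `ImageFormat` struct, the `ChannelType` enum and both function names are hypothetical and are not part of the OpenCLIPP interface; the fields of the library's own `SImage` structure are not shown on this poster.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of an image layout; NOT the library's SImage struct.
enum class ChannelType { U8, S8, U16, S16, U32, S32, F32 };

struct ImageFormat {
    std::size_t width;
    std::size_t height;
    std::size_t channels;  // 1, 2, 3 or 4
    ChannelType type;
};

// Size in bytes of a single channel value.
inline std::size_t BytesPerChannel(ChannelType t) {
    switch (t) {
        case ChannelType::U8:
        case ChannelType::S8:
            return 1;
        case ChannelType::U16:
        case ChannelType::S16:
            return 2;
        case ChannelType::U32:
        case ChannelType::S32:
        case ChannelType::F32:
            return 4;
    }
    return 0;  // not reached for valid enum values
}

// Total size in bytes of a tightly packed image (no row padding assumed).
inline std::size_t ImageBytes(const ImageFormat& f) {
    assert(f.channels >= 1 && f.channels <= 4);
    return f.width * f.height * f.channels * BytesPerChannel(f.type);
}
```

Under these assumptions, a 4096x4096 single-channel 8-bit image and a 1024x1024 four-channel 32-bit float image occupy the same number of bytes.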

Why OpenCL ?

Right now, there are two major frameworks for GPU computing : OpenCL and CUDA.

CUDA has its advantages but CUDA works only on NVIDIA devices, while OpenCL works on all major high performance devices.

In our experiments, we found that OpenCL is as fast as CUDA on NVIDIA hardware. OpenCL may also become prevalent on mobile devices (where GPUs are increasingly powerful). This will increase the range of OpenCL applications.

How to use in C

The library provides an interface in C, allowing many programming languages to use its capabilities.

// Variables
ocipContext Context = NULL;
ocipImage SourceImage, ResultImage;
SImage ImageInfo = {...}; // Fill with size, type, channels of image

Supported primitives in version 1

Arithmetic

Add AddSquare Sub AbsDiff Mul Div Min Max AddC SubC AbsDiffC MulC DivC RevDivC MinC MaxC Abs Exp Log Sqr Sqrt Sin Cos

Logic

And Or Xor AndC OrC XorC Not

LUT

LUT, Linear LUT, Scale LUT

Morphology

Erode Dilate Open Close Gradient TopHat BlackHat

Transform

MirrorX MirrorY Flip Transpose Resize SetAll

Conversions

Convert Scale Copy ToGray SelectChannel ToColor

Thresholding

TresholdGT TresholdLT TresholdGTLT Compare

Filters

Gauss Sharpen Smooth Median Sobel Prewitt Scharr HiPass Laplace

Reductions

Min Max MinAbs MaxAbs Sum Mean MeanSqr

More functions

Histogram, Integral scan, Blob labeling and FFT (soon)

// Initialize OpenCL
ocipInitialize(&Context, NULL, CL_DEVICE_TYPE_ALL);
ocipSetCLFilesPath("/path/to/cl files/");

// Create images in OpenCL device
ocipCreateImage(&SourceImage, ImageInfo, SourceImageData, CL_MEM_READ_ONLY);
ocipCreateImage(&ResultImage, ImageInfo, ResultImageData, CL_MEM_WRITE_ONLY);

// Prepare the Filters - compiles the OpenCL C program
// optional (would otherwise be done upon the first filter call)
ocipPrepareFilters(SourceImage);

The library comes with a test and benchmarking program. The results below were obtained on a PC with the following specifications:
• Intel Core i7-3770, 8GB RAM
• NVIDIA GeForce GTX 680
• Windows 7 64-bit

[Figure: AbsDiff U8. Series: CPU (IPP), OpenCLIPP, NPP, OpenCV OCL. X axis: image size.]

We can see that OpenCLIPP has a 2X lead over IPP here and a slight lead over NPP.

And here is a statistical reduction, presented in GB/s.

// Prepare the Filters - compiles the OpenCL C program
// It is optional (would otherwise be done upon the first filter call)
filters.PrepareFor(SourceImage);

[Figure: AbsDiff U8 - log scale. Series: OpenCLIPP, NPP, OpenCV OCL. X axis: image size.]

How to use in C++

// Initialize OpenCL
COpenCL CL;
CL.SetClFilesPath("/path/to/cl files/");
Filters filters(CL);

SImage ImageInfo = {...}; // Fill with size, type, channels of image

// Transfer image to host (synchronous)
ocipReadImage(ResultImage);

Here are the same results on a logarithmic scale, to better show the performance on small images.

using namespace OpenCLIPP;


// Apply filter (asynchronous)
ocipSobel(SourceImage, ResultImage);

The library itself is implemented in C++, and C++ programs can use the C++ interface directly.


Here we see the performance advantage of GPUs, with OpenCLIPP performing up to 8 times faster than IPP when calculating the absolute difference between two images. We can also see that OpenCLIPP beats NPP by a small margin.
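For readers unfamiliar with the primitive being measured, the following is a minimal CPU reference of an absolute difference on 8-bit unsigned data. It is only an illustration of the operation, not the OpenCLIPP kernel; the function name `AbsDiffU8` and the flat `std::vector` storage are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for AbsDiff on 8-bit unsigned pixels stored as flat arrays.
// Illustrative only: this is not how the library implements the primitive.
inline std::vector<std::uint8_t> AbsDiffU8(const std::vector<std::uint8_t>& a,
                                           const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size());
    std::vector<std::uint8_t> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Subtract the smaller value from the larger one to avoid wrap-around.
        out[i] = static_cast<std::uint8_t>(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    }
    return out;
}
```

Each output pixel needs two reads and one write, so the operation is limited mostly by memory bandwidth rather than arithmetic.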

[Figure: TopHat U8, including a CPU (IPP) series]

Each primitive was run 30 times; the average of all runs is given. Image transfer and program compilation times are not included in the results.
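The averaging described above can be sketched as follows. This is not the benchmarking program that ships with the library; the helper name `AverageMs` is hypothetical, and for asynchronous GPU work the callable would have to wait for completion before returning, otherwise only the launch cost is measured.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Average wall-clock time, in milliseconds, of `runs` calls to `work`.
// Setup such as image transfer or program compilation should happen
// before calling this helper so that it is excluded from the result.
inline double AverageMs(const std::function<void()>& work, int runs = 30) {
    assert(runs > 0);
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < runs; ++i) {
        work();
    }
    const std::chrono::duration<double, std::milli> total = Clock::now() - start;
    return total.count() / runs;
}
```

A monotonic clock (`std::chrono::steady_clock`) is used here so that system clock adjustments cannot distort the measurement.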


// Create images in OpenCL device
ColorImage SourceImage(CL, ImageInfo, SourceData);
ColorImage ResultImage(CL, ImageInfo, ResultData);

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

AbsDiff is a very simple algorithm. Below, we show a more complex algorithm, the TopHat morphological operation, which performs many memory accesses for each pixel.
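To illustrate why TopHat touches memory so much more than AbsDiff, here is a minimal CPU sketch of a top-hat transform, computed as the source image minus its morphological opening (erosion followed by dilation). The 3x3 square neighborhood, the handling of borders (out-of-image neighbors are skipped) and the names `GrayImage`, `Morph3x3` and `TopHat3x3` are assumptions of this sketch; the library's own choices may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical minimal single-channel 8-bit image, row-major.
struct GrayImage {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;
};

// 3x3 minimum (erode) or maximum (dilate); neighbors outside the image are skipped.
inline GrayImage Morph3x3(const GrayImage& src, bool dilate) {
    assert(src.pixels.size() == static_cast<std::size_t>(src.width) * src.height);
    GrayImage dst{src.width, src.height, std::vector<std::uint8_t>(src.pixels.size())};
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            std::uint8_t v = dilate ? 0 : 255;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int nx = x + dx, ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= src.width || ny >= src.height) continue;
                    const std::uint8_t p =
                        src.pixels[static_cast<std::size_t>(ny) * src.width + nx];
                    v = dilate ? std::max(v, p) : std::min(v, p);
                }
            }
            dst.pixels[static_cast<std::size_t>(y) * src.width + x] = v;
        }
    }
    return dst;
}

// Top-hat: source minus its opening (erode, then dilate).
inline GrayImage TopHat3x3(const GrayImage& src) {
    const GrayImage opened = Morph3x3(Morph3x3(src, false), true);
    GrayImage out = src;
    for (std::size_t i = 0; i < out.pixels.size(); ++i) {
        // The opening never exceeds the source, so this cannot underflow.
        out.pixels[i] = static_cast<std::uint8_t>(src.pixels[i] - opened.pixels[i]);
    }
    return out;
}
```

In this sketch every output pixel reads up to 9 neighbors in the erosion pass and 9 more in the dilation pass, before the final subtraction, compared with two reads per pixel for AbsDiff.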

Time in ms - lower is better

OpenCLIPP provides a C interface inspired by those of Intel IPP and NVIDIA NPP, but simplified. OpenCLIPP also provides a C++ interface.

Performance results


Processing bandwidth in GB/s - higher is better

What is OpenCL ?

OpenCL is a framework that allows using the computing resources present in specialized computing devices like GPUs.

How it works :
1. A program is written in a language similar to C
2. The program gets compiled for the computing device used
3. The compiled program runs in parallel over all the computing resources of the device

Library interface

There are two existing and popular image processing primitives libraries :
• Intel IPP optimized for CPUs
• NVIDIA NPP, which provides a similar interface to Intel IPP but allows computing on NVIDIA CUDA GPUs

[Figure: Processing bandwidth for Mean Reduction - F32. Series: CPU (IPP), OpenCLIPP, NPP.]

Here we see a good 40 GB/s for the CPU when the image fits inside the cache, and 15 GB/s for images too big for the cache.


Performance of OpenCLIPP increases with the size of the image, reaching 135 GB/s, 9X faster than IPP and 50% faster than NPP. OpenCV OCL failed to calculate the mean in the current version.
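As a rough illustration of how a bandwidth figure in GB/s relates to a measured time, the sketch below computes the mean of 32-bit float pixels on the CPU and converts a byte count and a duration into GB/s. The function names are hypothetical, 1 GB is taken as 10^9 bytes (an assumption; the poster does not say which definition was used), and this is not the library's reduction kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference mean of 32-bit float pixels, accumulated in double precision.
inline double MeanF32(const std::vector<float>& pixels) {
    assert(!pixels.empty());
    double sum = 0.0;
    for (float p : pixels) {
        sum += p;
    }
    return sum / static_cast<double>(pixels.size());
}

// Processing bandwidth in GB/s for `bytes` processed in `ms` milliseconds.
// Assumes 1 GB = 1e9 bytes.
inline double BandwidthGBs(std::size_t bytes, double ms) {
    assert(ms > 0.0);
    const double seconds = ms / 1000.0;
    return static_cast<double>(bytes) / seconds / 1e9;
}
```

With this definition, a reduction reads every pixel once, so the byte count is simply the size of the source image.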

Conclusion


OpenCLIPP can provide a significant performance improvement to all image processing applications, regardless of the platform used (AMD or NVIDIA, Windows or Linux).

// Apply filter (asynchronous)
filters.Sobel(SourceImage, ResultImage);

We can see that GPU operations have an overhead. The overhead for NPP is 0.01 ms, and the overhead for OpenCLIPP is higher at 0.03 ms. OpenCV OCL has an even higher overhead at 0.11 ms.

// Transfer image to host (synchronous)
ResultImage.Read(true);

The CPU has no such overhead, so IPP beats the GPU libraries for small images.
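One simple way to reason about this crossover is to model the processing time as a fixed overhead plus the image size divided by a sustained bandwidth. The sketch below implements that model and the resulting break-even size. The model itself, the function names and the idea that bandwidth is constant across sizes are all assumptions made for illustration; the measurements on this poster show that real bandwidth varies with image size.

```cpp
#include <cassert>

// Estimated time in ms under a "fixed overhead + size / bandwidth" model.
// This is a simplifying assumption, not a measured relationship.
inline double EstimatedMs(double overhead_ms, double bytes, double gb_per_s) {
    assert(gb_per_s > 0.0);
    const double bytes_per_ms = gb_per_s * 1e6;  // 1 GB/s = 1e6 bytes per ms
    return overhead_ms + bytes / bytes_per_ms;
}

// Image size in bytes above which device B (higher overhead, higher bandwidth)
// becomes faster than device A under the model above.
inline double BreakEvenBytes(double overhead_a_ms, double gb_per_s_a,
                             double overhead_b_ms, double gb_per_s_b) {
    assert(gb_per_s_a > 0.0 && gb_per_s_b > gb_per_s_a);
    assert(overhead_b_ms >= overhead_a_ms);
    const double a = gb_per_s_a * 1e6;  // bytes per ms
    const double b = gb_per_s_b * 1e6;  // bytes per ms
    // Solve overhead_a + s / a == overhead_b + s / b for s.
    return (overhead_b_ms - overhead_a_ms) / (1.0 / a - 1.0 / b);
}
```

Below the break-even size the fixed overhead dominates and the lower-overhead device wins; above it, the higher bandwidth takes over.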

Image size names: HXGA = 4096x3072, HSXGA = 5120x4096, HUXGA = 6400x4800, WHUXGA = 7680x4800

Performance gain is substantial when compared to even the most optimized CPU libraries when processing large (>10 MPixels) images on high end GPUs. GPU processing is not a good choice for small images (