M. Akhloufi, A. Campagna, "OpenCLIPP: OpenCL Integrated Performance Primitives library for computer vision applications", Proc. SPIE Electronic Imaging, Intelligent Robots and Computer Vision XXXI: Algorithms and Techniques, 9025-31, San Francisco, CA, USA, February 2014
OpenCLIPP: OpenCL Integrated Performance Primitives library for computer vision applications
Moulay Akhloufi, Antoine Campagna
In recent years, interest in GPGPU computing (General-Purpose computation on Graphics Processing Units) has increased. This field aims to use the processing power of the GPU (Graphics Processing Unit) to accelerate general processing such as mathematics, 3D visualization, and image processing. In past years, CUDA (Compute Unified Device Architecture), a parallel computing platform and programming model invented by NVIDIA, was the main driver of this interest and the most used architecture for GPGPU computing. With the recent advent of the Open Computing Language (OpenCL), more and more work is being conducted using this new platform. OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group. It has been adopted by multiple companies, including NVIDIA (the inventor of CUDA). With this increase in interest, the availability of a set of performance primitives for general-purpose applications can help accelerate the work of the research and industrial communities. Intel, for example, develops Intel Integrated Performance Primitives (Intel IPP), a multi-threaded software library of functions for multimedia and data processing applications. NVIDIA, on the other hand, offers the NVIDIA Performance Primitives library (NPP), a collection of GPU-accelerated image, video, and signal processing functions that deliver faster performance than comparable CPU-only implementations. In this work, we present the architecture and development of an open source OpenCL integrated performance primitives library called OpenCLIPP. This library aims to provide a free and open source set of OpenCL functions with a simple interface similar to Intel IPP and NVIDIA NPP. The first release mainly includes image processing and computer vision algorithms: convolution filters, thresholding, blobs, etc. The developed functions are introduced, and benchmarks against equivalent Intel IPP and NVIDIA NPP functions are presented.
This library will be made available to the open source community.
by Moulay Akhloufi, Antoine W. Campagna
Introduction
Computer vision is increasingly used in today's applications. With ever higher resolutions and more demanding algorithms, applications are often limited by the processing power of CPUs. An alternative is to use GPUs. We present a new library based on OpenCL to perform high-speed image processing on GPUs: OpenCLIPP.
Figure: diagram with RAM, CPU, VRAM and GPU.
The library is open source, LGPL licensed, and free for commercial use. It can be downloaded from GitHub:
http://openclipp.wix.com/openclipp
The library supports images with:
• signed and unsigned integers of 8, 16 or 32 bits, or 32-bit floating point
• 1, 2, 3 or 4 channels
• almost any size (maximum image size depends on hardware)
Why OpenCL ?
Right now, there are two major frameworks for GPU computing: OpenCL and CUDA.
CUDA has its advantages, but it works only on NVIDIA devices, while OpenCL works on all major high performance devices.
In our experiments, we found that OpenCL is as fast as CUDA on NVIDIA hardware. OpenCL may also become prevalent on mobile devices (where GPUs are increasingly powerful). This will increase the range of OpenCL applications.
How to use in C
The library provides an interface in C, allowing many programming languages to use its capabilities.
// Variables
ocipContext Context = NULL;
ocipImage SourceImage, ResultImage;
SImage ImageInfo = {...}; // Fill with size, type, channels of image
Supported primitives in version 1
Arithmetic
Add AddSquare Sub AbsDiff Mul Div Min Max AddC SubC AbsDiffC MulC DivC RevDivC MinC MaxC Abs Exp Log Sqr Sqrt Sin Cos
Logic
And Or Xor AndC OrC XorC Not
LUT
LUT, Linear LUT, Scale LUT
Morphology
Erode Dilate Open Close Gradient TopHat BlackHat
Transform
MirrorX MirrorY Flip Transpose Resize SetAll
Conversions
Convert Scale Copy ToGray SelectChannel ToColor
Thresholding
TresholdGT TresholdLT TresholdGTLT Compare
Filters
Gauss Sharpen Smooth Median Sobel Prewitt Scharr HiPass Laplace
Reductions
Min Max MinAbs MaxAbs Sum Mean MeanSqr
More functions
Histogram, Integral scan, Blob labeling and FFT (soon)
// Initialize OpenCL
ocipInitialize(&Context, NULL, CL_DEVICE_TYPE_ALL);
ocipSetCLFilesPath("/path/to/cl files/");

// Create images in OpenCL device
ocipCreateImage(&SourceImage, ImageInfo, SourceImageData, CL_MEM_READ_ONLY);
ocipCreateImage(&ResultImage, ImageInfo, ResultImageData, CL_MEM_WRITE_ONLY);
// Prepare the Filters - compiles the OpenCL C program
// optional (would otherwise be done upon the first filter call)
ocipPrepareFilters(SourceImage);
The library comes with a test and benchmarking program. The results below were obtained on a PC with the following specifications:
• Intel Core i7-3770, 8 GB RAM
• NVIDIA GeForce GTX 680
• Windows 7 64-bit
Figure: AbsDiff U8. Series: CPU (IPP), OpenCLIPP, NPP, OpenCV OCL. X-axis: image size.
We can see that OpenCLIPP has a 2x lead over IPP here and a slight lead over NPP.
And here is a statistical reduction, presented in GB/s.
// Prepare the Filters - compiles the OpenCL C program
// It is optional (would otherwise be done upon the first filter call)
filters.PrepareFor(SourceImage);
Figure: AbsDiff U8 - log scale. Series: OpenCLIPP, NPP, OpenCV OCL. X-axis: image size.
How to use in C++
// Initialize OpenCL
COpenCL CL;
CL.SetClFilesPath("/path/to/cl files/");
Filters filters(CL);
// Fill with size, type, channels of image
// Transfer image to host (synchronous)
ocipReadImage(ResultImage);
SImage ImageInfo = {...};
Here are the same results on a logarithmic scale, to better show the performance on small images.
using namespace OpenCLIPP;
// Apply filter (asynchronous)
ocipSobel(SourceImage, ResultImage);
The library itself is implemented in C++ and C++ programs can use the C++ interface directly.
Here we see the performance advantage of GPUs, with OpenCLIPP performing up to 8 times faster than IPP when calculating the absolute difference between two images. OpenCLIPP also beats NPP by a small margin.
Figure: TopHat U8. Series include CPU (IPP).
Each primitive was run 30 times and the average of all runs is given. Image transfer and program compilation times are not included in the results.
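The averaging described above can be sketched as a small timing helper. This is a minimal illustration, not the benchmarking program shipped with the library: the name `AverageMilliseconds` is hypothetical, and the sketch assumes the timed callable blocks until its work is finished (for an asynchronous GPU call, it would have to wait for completion before returning). Transfers and compilation would be done outside the callable so they are not counted.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `work` `runs` times and returns the mean
// wall-clock time of one run, in milliseconds.
// Assumption: `work` returns only once the processing is complete.
inline double AverageMilliseconds(const std::function<void()>& work, int runs = 30)
{
    assert(runs > 0);
    using Clock = std::chrono::steady_clock;

    double total_ms = 0.0;
    for (int i = 0; i < runs; ++i) {
        const auto start = Clock::now();
        work();
        const auto stop = Clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return total_ms / runs;
}
```

A caller would pass a lambda wrapping a single primitive call, keeping image creation and any warm-up call outside the lambda.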
// Create images in OpenCL device
ColorImage SourceImage(CL, ImageInfo, SourceData);
ColorImage ResultImage(CL, ImageInfo, ResultData);
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
AbsDiff is a very simple algorithm. Below, we show a more complex algorithm, the TopHat morphological operation, which performs many memory accesses for each pixel.
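To make the memory access pattern concrete, here is a minimal CPU reference sketch of a top-hat (source minus its morphological opening). It is not OpenCLIPP's implementation: the 3x3 square structuring element, the clamped border handling, and the names `GrayImage`, `Morph3x3` and `TopHat3x3` are all assumptions chosen for illustration. Each erode or dilate pass reads nine source pixels per output pixel, which is why the operation is far more memory-bound than AbsDiff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-channel 8-bit image, row-major.
struct GrayImage {
    int width;
    int height;
    std::vector<std::uint8_t> pixels;

    // Reads a pixel, clamping coordinates to the image (assumed border policy).
    std::uint8_t at(int x, int y) const {
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return pixels[static_cast<std::size_t>(y) * width + x];
    }
};

// 3x3 erosion (minimum) or dilation (maximum): 9 reads per output pixel.
inline GrayImage Morph3x3(const GrayImage& src, bool dilate)
{
    GrayImage dst = src;
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            std::uint8_t v = src.at(x, y);
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    const std::uint8_t n = src.at(x + dx, y + dy);
                    v = dilate ? std::max(v, n) : std::min(v, n);
                }
            dst.pixels[static_cast<std::size_t>(y) * src.width + x] = v;
        }
    }
    return dst;
}

// Top-hat: source minus its opening (erosion followed by dilation).
// The opening never exceeds the source, so the subtraction cannot underflow.
inline GrayImage TopHat3x3(const GrayImage& src)
{
    const GrayImage opened = Morph3x3(Morph3x3(src, false), true);
    GrayImage dst = src;
    for (std::size_t i = 0; i < dst.pixels.size(); ++i)
        dst.pixels[i] = static_cast<std::uint8_t>(src.pixels[i] - opened.pixels[i]);
    return dst;
}
```

On a flat image with one isolated bright pixel, the opening removes the bright pixel, so the top-hat keeps only that pixel's excess over the background.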
Time in ms - lower is better
OpenCLIPP provides a C interface inspired by the interfaces of Intel IPP and NVIDIA NPP, but simplified. OpenCLIPP also provides a C++ interface.
Performance results
Processing bandwidth in GB/s - higher is better
How it works:
1. A program is written in a language similar to C
2. The program gets compiled for the computing device used
3. The compiled program runs in parallel over all the computing resources of the device
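The three steps above can be illustrated with a small sketch. The kernel source below is a hypothetical OpenCL C program for an absolute difference, not one taken from OpenCLIPP; it uses the standard `get_global_id` and `abs_diff` built-ins. Because compiling and launching it requires an OpenCL runtime, the sketch emulates step 3 on the CPU: the function `AbsDiffWorkItem` stands in for what one work-item computes, and a plain loop stands in for the parallel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1: an illustrative kernel written in OpenCL C (hypothetical example).
// Step 2 would hand this string to the OpenCL runtime for compilation.
const char* const kAbsDiffKernelSource = R"CLC(
__kernel void abs_diff_u8(__global const uchar* a,
                          __global const uchar* b,
                          __global uchar* out)
{
    size_t i = get_global_id(0);   // index of this work-item
    out[i] = abs_diff(a[i], b[i]);
}
)CLC";

// CPU stand-in for the body executed by a single work-item.
inline void AbsDiffWorkItem(std::size_t i, const std::uint8_t* a,
                            const std::uint8_t* b, std::uint8_t* out)
{
    out[i] = static_cast<std::uint8_t>(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
}

// Step 3, emulated: a device would run every work-item in parallel;
// here they are simply executed one after another.
inline std::vector<std::uint8_t> RunAbsDiff(const std::vector<std::uint8_t>& a,
                                            const std::vector<std::uint8_t>& b)
{
    assert(a.size() == b.size());
    std::vector<std::uint8_t> out(a.size());
    for (std::size_t i = 0; i < out.size(); ++i)
        AbsDiffWorkItem(i, a.data(), b.data(), out.data());
    return out;
}
```

Since no work-item depends on another's result, the loop iterations can run in any order, which is what lets a GPU spread them over all of its computing resources.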
There are two existing and popular image processing primitives libraries:
• Intel IPP, optimized for CPUs
• NVIDIA NPP, which provides a similar interface to Intel IPP but allows computing on NVIDIA CUDA GPUs
OpenCL is a framework that allows using the computing resources present in specialized computing devices like GPUs.
Library interface
What is OpenCL ?
Figure: Processing bandwidth for Mean Reduction - F32. Series: CPU (IPP), OpenCLIPP, NPP. X-axis: image size.
Here we see a good 40 GB/s for the CPU when the image fits inside the cache and 15 GB/s for images too big for the cache.
Performance of OpenCLIPP increases with the size of the image, reaching 135 GB/s, 9X faster than IPP and 50% faster than NPP. OpenCV OCL failed to calculate the mean in the current version.
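The bandwidth figures quoted above relate image size to processing time. The sketch below shows one plausible way to compute such a figure; the poster does not define its formula, so this assumes that only the bytes of the source image are counted and that 1 GB means 10^9 bytes. The function name `BandwidthGBps` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: processing bandwidth in GB/s for one pass over an image.
// Assumptions: only source bytes are counted, and 1 GB = 1e9 bytes.
inline double BandwidthGBps(std::size_t width, std::size_t height,
                            std::size_t channels, std::size_t bytes_per_channel,
                            double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    const double bytes =
        static_cast<double>(width * height * channels * bytes_per_channel);
    const double seconds = elapsed_ms * 1e-3;
    return bytes / seconds / 1e9;
}
```

Under these assumptions, a single-channel 4096x4096 F32 image holds 67,108,864 bytes, so processing it in 1 ms corresponds to about 67.1 GB/s.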
Conclusion
OpenCLIPP can provide a significant performance improvement to all image processing applications, regardless of the platform used (AMD or NVIDIA, Windows or Linux).
// Apply filter (asynchronous)
filters.Sobel(SourceImage, ResultImage);
We can see that GPU operations have an overhead. The overhead for NPP is 0.01 ms and the overhead for OpenCLIPP is higher at 0.03 ms. OpenCV OCL has an even higher overhead at 0.11 ms.
// Transfer image to host (synchronous)
ResultImage.Read(true);
CPU has no such overhead so IPP beats GPU for small images.
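The effect of a fixed overhead can be sketched with a simple model in which processing time is the overhead plus the image size divided by a sustained throughput. This model is an assumption made for illustration, not something measured or stated by the benchmark; the names `ModelTimeMs` and `BreakEvenBytes` are hypothetical, and real throughput varies with image size and cache behavior.

```cpp
#include <cassert>

// Modeled time in ms to process `bytes` at `gbps` (1 GB = 1e9 bytes),
// plus a fixed per-call overhead. Illustrative model only.
inline double ModelTimeMs(double bytes, double gbps, double overhead_ms)
{
    assert(gbps > 0.0);
    return overhead_ms + bytes / (gbps * 1e9) * 1e3;
}

// Image size in bytes above which the GPU model beats the CPU model,
// assuming the CPU has no overhead and the GPU has higher throughput.
inline double BreakEvenBytes(double gpu_overhead_ms, double cpu_gbps, double gpu_gbps)
{
    assert(cpu_gbps > 0.0 && gpu_gbps > cpu_gbps);
    const double seconds_per_byte_saved =
        1.0 / (cpu_gbps * 1e9) - 1.0 / (gpu_gbps * 1e9);
    return (gpu_overhead_ms * 1e-3) / seconds_per_byte_saved;
}
```

For example, with a 0.03 ms overhead and hypothetical throughputs of 10 GB/s for the CPU and 40 GB/s for the GPU, the model places the break-even point at 400,000 bytes, roughly a 632x632 single-channel 8-bit image. Below that size the overhead dominates and the CPU wins.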
HXGA-4096x3072, HSXGA-5120x4096, HUXGA-6400x4800, WHUXGA-7680x4800
Performance gain is substantial when compared to even the most optimized CPU libraries when processing large (>10 MPixels) images on high end GPUs. GPU processing is not a good choice for small images (