2013 10th International Multi-Conference on Systems, Signals & Devices (SSD) Hammamet, Tunisia, March 18-21, 2013


Integral Image computation on GPU

Marwa Chouchene

Fatma Ezahra Sayadi

Mohamed Atri

Rached Tourki

Laboratory of Electronics and Microelectronics (EμE)
Faculty of Sciences Monastir
Monastir, Tunisia

Abstract—In this paper we present an integral image algorithm that can run in real time on a Graphics Processing Unit (GPU). Our system exploits the parallelism of the computation via the NVIDIA CUDA programming model, a software platform for solving non-graphics problems in a massively parallel, high-performance fashion. We compare the performance of the parallel approach running on the GPU with the sequential CPU implementation across a range of image sizes.

Index Terms—Integral image, GPU, CPU, NVIDIA CUDA.

I. INTRODUCTION

The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart. The growing speed of GPUs extends what can be programmed on them, and hence the complexity of the problems they can solve. This evolution has positioned the GPU as an alternative to traditional microprocessors in high-performance computing systems.

CUDA (Compute Unified Device Architecture) is a general architecture for parallel computing introduced by NVidia in November 2007 [1]. It includes a new programming model, a new architecture and a different instruction set.

With the increasing power of computing machines, many scientists have become interested in improving computer vision algorithms. Among this research we find the work of Viola and Jones, which introduced the integral image, a new representation of the image used to rapidly compute the Haar descriptors.

In this work, in order to evaluate our parallel integral image implementation, we executed the algorithm on images of different sizes on both the CPU and the GPU. In this context, we start with a short presentation of the architecture of NVIDIA GPUs under CUDA. Then we give details of the integral image. Finally, we conclude this work by presenting the implementation results.

II. GPU NVIDIA BY CUDA

In this section we present the basic concepts of GPU architecture and programming with CUDA.

A. Basic concepts

CUDA (Compute Unified Device Architecture) is a parallel programming model, together with software, that harnesses the power of the graphics card. Three key concepts are the basis of CUDA: a hierarchy of thread groups, shared memories, and barrier synchronization. These concepts give the programmer the ability to precisely manage the parallelism offered by the graphics card. A CUDA program can then run on any number of processors, without the programmer needing knowledge of the architecture of the graphics card.

The C language extensions provided by CUDA allow the programmer to define functions called kernels, which are executed on the graphics card; such a function is executed N times by N different CUDA threads. The built-in thread index is a vector of three components: threads can be identified in one, two or three dimensions, thus forming blocks of one, two or three dimensions. The threads of the same block can communicate with each other by sharing data through shared memory and can synchronize to coordinate their memory accesses. Moreover, a kernel can be executed by multiple blocks of the same dimensions; in this case the total number of threads is the number of threads per block multiplied by the number of blocks. These blocks are arranged within a grid of at most two dimensions, as shown in Figure 1.

Fig. 1. Illustration of a grid.
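As an illustration of these concepts, the following minimal sketch (our own example, not code from the implementation described later; all names are illustrative) defines a kernel executed by N threads, each identifying itself through the built-in index variables:

// Minimal CUDA sketch: N threads each process one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Global index of this thread within the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // the grid may contain a few extra threads
        c[i] = a[i] + b[i];
}

// Host-side launch: a one-dimensional grid of blocks of 256 threads,
// so the total number of threads is (number of blocks) x 256.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);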




Threads have access to data located in several distinct memories. Each thread has its own private memory, called local memory. Each block has a shared memory accessible only by the threads of the block in question. Finally, all threads have access to the data in the global memory. There are also two other memories, namely the constant memory and the texture memory.

As shown in Figure 2, the threads run on a separate physical machine (the device), while the rest of the C code runs on the host. The two entities have their own DRAM (Dynamic Random Access Memory), respectively called device memory and host memory. A CUDA program must therefore manage the data so that they are copied to the device for processing before being copied back to the host.

Fig. 2. Exchange Host/Device.
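This copy-in, compute, copy-out pattern can be sketched as follows (an illustrative fragment with made-up names, not the paper's code):

// Host code illustrating the host/device transfer pattern.
int n = 512 * 512;
size_t bytes = n * sizeof(float);
float *h_img = (float *)malloc(bytes);       // host memory
float *d_img = NULL;
cudaMalloc((void **)&d_img, bytes);          // device global memory

cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
// ... kernel launches operating on d_img ...
cudaMemcpy(h_img, d_img, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU

cudaFree(d_img);
free(h_img);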


B. GPU Architecture

When a CUDA program invokes a kernel from the host, the blocks of the grid are enumerated and distributed to the multiprocessors with sufficient resources to execute them. All threads of a block are executed simultaneously on the same multiprocessor; once all threads of a block have completed their instructions, a new block is started on the multiprocessor. To handle hundreds of threads, the multiprocessor employs an architecture called SIMT (Single-Instruction, Multiple-Thread). It is the SIMT unit of a multiprocessor that creates, schedules and executes the threads in groups of 32; these groups of threads are called warps. The threads of a warp start their execution at the same instruction in the program but are then free to execute independently. When a multiprocessor is given one or more blocks of threads to run, it divides them into warps whose execution is scheduled by the SIMT unit. The next instruction is issued to all active threads of the warp; since one instruction at a time is executed per warp, performance is maximal when the 32 threads of the warp agree on the path to take. As shown in Figure 3, the on-chip memory of a multiprocessor is organized as follows [2]:
• A bank of 32-bit registers per processor
• A memory block common to all processors (the shared memory)
• A read-only cache designed to accelerate access to the data in the constant memory (which is also read-only), shared between the processors
• A read-only cache designed to accelerate access to the data in the texture memory (which is also read-only), shared between the processors.

Fig. 3. SIMT multiprocessor with shared memory on board.
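The shared memory and the synchronization barrier mentioned above can be illustrated by the classic block-level reversal sketch below (our own example; it assumes the block size equals n and does not exceed 256):

// Threads of one block cooperate through shared memory; __syncthreads()
// is the barrier separating the write phase from the read phase.
__global__ void reverseBlock(float *d, int n)  // assumes n == blockDim.x <= 256
{
    __shared__ float s[256];   // visible to all threads of the block
    int t = threadIdx.x;
    s[t] = d[t];               // each thread writes one element
    __syncthreads();           // wait until every thread has written
    d[t] = s[n - 1 - t];       // read an element written by another thread
}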




Note: it is possible to read from and write to the device's global and local memories, but there is no cache to speed up access to the data they contain.

C. CUDA Programming

The stated goal of the CUDA programming interface is to enable a programmer familiar with C to start programming on graphics cards. Four key elements form the basis of the extensions to the C language:
• Qualifiers defining the entity on which a function must execute
• Qualifiers defining the memory in which a variable resides
• A special syntax defining the grid used when launching a kernel from the host
• Built-in variables referring to the indices and dimensions of the grid
Each source file using these extensions must be compiled with the nvcc compiler.
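The four elements appear together in the short sketch below (illustrative names; the qualifiers, launch syntax and compiler invocation are standard CUDA C):

__global__ void scale(float *data, float k, int n)  // function qualifier
{
    // Built-in variables locate the thread within the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= k;
}

__device__ float bias;   // variable qualifier: resides in device memory

// Execution configuration <<<blocks, threadsPerBlock>>> defines the grid:
// scale<<<64, 128>>>(d_data, 2.0f, n);

// Compilation: nvcc -o scale scale.cu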

III. INTEGRAL IMAGE

The integral image was proposed by Paul Viola and Michael Jones in 2001 and has been used for real-time object detection [3]. The integral image is constructed from a grayscale original image. In [4], the integral image has been extended; it is then used to compute the Haar-like features and center-surround characteristics. In the SURF algorithm [5][6], the integral image accelerates the computation of the first- and second-order Gaussian derivatives. The CenSurE algorithm [7] and its improved version SUSurE [8] use two levels of filters to approximate the Laplace operator by means of the integral image.

Using the graphics hardware of computers for image processing and computer vision is a recent research area. Image processing aims to study digital images in order to extract information and interpret their content. The method presented here is used to calculate the integral image.

Fig. 4. A schematic of the integral image.

An integral image is an intermediate representation of the input image. The value of the integral image at a point (x, y) is equal to the sum of all the pixels above and to the left of (x, y), as shown in Figure 4 and according to the formula below:

ii(x, y) = \sum_{x' \le x,\, y' \le y} i(x', y')     (1)

where i(x', y') is the value of the pixel (x', y') in the original image.
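Formula (1) can be evaluated in a single pass thanks to the recurrence ii(x, y) = i(x, y) + ii(x−1, y) + ii(x, y−1) − ii(x−1, y−1); a sequential sketch follows (illustrative code, not the paper's CPU implementation):

// Single-pass CPU integral image (illustrative sketch).
// ii has (w+1) x (h+1) entries; its zero first row and first column
// absorb the boundary cases, so ii[y+1][x+1] equals formula (1) at (x, y).
void integral_cpu(const unsigned char *img, unsigned int *ii, int w, int h)
{
    for (int x = 0; x <= w; ++x) ii[x] = 0;           // first row = 0
    for (int y = 1; y <= h; ++y) {
        ii[y * (w + 1)] = 0;                          // first column = 0
        for (int x = 1; x <= w; ++x)
            ii[y * (w + 1) + x] = img[(y - 1) * w + (x - 1)]
                                + ii[(y - 1) * (w + 1) + x]       // above
                                + ii[y * (w + 1) + x - 1]         // left
                                - ii[(y - 1) * (w + 1) + x - 1];  // overlap
    }
}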

Thus the sum of any rectangular region can be evaluated from four references to the integral image, which itself is computed in a single scan of the original image [3] (Figure 5).

Fig. 5. Calculate the sum of the rectangle D with the integral image [7]
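With the zero-padded layout of the previous sketch, these four references take the following form (again an illustrative helper, not the authors' code):

// Sum of the pixels in the rectangle [x0, x1) x [y0, y1) of the image,
// using four references to the (w+1)-stride integral image from above.
unsigned int rect_sum(const unsigned int *ii, int w,
                      int x0, int y0, int x1, int y1)
{
    return ii[y1 * (w + 1) + x1]    // bottom-right corner
         - ii[y0 * (w + 1) + x1]    // top-right corner
         - ii[y1 * (w + 1) + x0]    // bottom-left corner
         + ii[y0 * (w + 1) + x0];   // top-left corner
}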

Therefore the sum of a two-rectangle descriptor is calculated from the difference between two adjacent rectangles using 6 references to the integral image. Similarly, we need 8 references for a three-rectangle descriptor, and 9 for a four-rectangle descriptor [9, 10, 11, 12].

IV. EXPERIMENTAL RESULTS

In this section, we present our algorithm to calculate the integral image. The implementation steps are shown in Figure 6 (a sketch of the GPU kernels follows the figure):

1) Read the input image (CPU).
2) Transform the image data into a linear array (CPU).
3) Transfer the data from the CPU to the GPU.
4) Calculate the integral image on the GPU.
5) Transfer the data from the GPU back to the CPU.
6) Display the results (CPU).

Fig. 6. Algorithm as implemented on hardware.
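The paper does not list the kernels themselves; a common way to realize step 4, shown below as a naive sketch with our own names, is a row-wise prefix sum followed by a column-wise prefix sum (a real implementation would use a parallel scan within each row or column):

// Naive GPU integral image: two passes of running sums.
__global__ void row_scan(float *img, int w, int h)   // one thread per row
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= h) return;
    for (int x = 1; x < w; ++x)
        img[y * w + x] += img[y * w + x - 1];        // sum along the row
}

__global__ void col_scan(float *img, int w, int h)   // one thread per column
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= w) return;
    for (int y = 1; y < h; ++y)
        img[y * w + x] += img[(y - 1) * w + x];      // sum down the column
}

// Host side, after transferring the image to d_img:
// row_scan<<<(h + 127) / 128, 128>>>(d_img, w, h);
// col_scan<<<(w + 127) / 128, 128>>>(d_img, w, h);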




We test our parallel algorithm on the Windows platform. The software environment includes Microsoft Visual Studio 2008 and CUDA 2.3. The hardware environment includes a consumer-level PC with an Intel Core i5 M560 2.6 GHz CPU, 4 GB of RAM and an NVIDIA GeForce 310M video card. In order to evaluate our parallel integral image implementation, we executed our algorithm for different image sizes on both the CPU and the GPU (Table I).

TABLE I. TIME-CONSUMING COMPARISON BETWEEN CPU-BASED AND GPU-BASED ALGORITHMS

Image size   CPU time (ms)   GPU time (ms)   Speedup
128×128      0.0099          0.0041          2.41
256×256      0.0129          0.0049          2.63
512×512      0.0178          0.0054          3.29



As can be seen from Table I, compared to the corresponding CPU-based serial algorithm, our algorithm achieves a notable reduction in computation time, and the speedup grows with the size of the image used in the experiment. However, due to the memory limitations of the video chip, the image size processed at one time cannot be increased without bound. Figure 7 gives the runtime comparison for the various kernels implemented on the GPU; the comparison is relative, each kernel being executed once on the GPU.

Fig. 7. GPU Kernels runtime comparison

V. CONCLUSION

In this paper, a parallel integral image algorithm is presented, implemented on a GPU, and compared with a sequential CPU-based implementation. Performance results indicate that a significant speedup can be achieved: the integral image computation reaches a speedup of up to 3× over the CPU-based implementation. The GPU thus provides a novel and efficient acceleration technique for image processing, and is cheaper than a dedicated hardware implementation.

REFERENCES

[1] NVIDIA Corporation, "NVIDIA CUDA Compute Unified Device Architecture Programming Guide," Version 1.1, 2007.
[2] J. D. Owens, M. Houston, and D. Luebke, "GPU Computing," Proceedings of the IEEE, vol. 96, no. 5, May 2008.
[3] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[4] R. Lienhart, A. Kuranov, and V. Pisarevsky, "Empirical analysis of detection cascades of boosted classifiers for rapid object detection," Pattern Recognition, vol. 2781, pp. 297-304, 2003.
[5] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.
[6] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," Proc. European Conference on Computer Vision, Springer LNCS vol. 3951, part 1, pp. 404-417, 2006.
[7] M. Agrawal, K. Konolige, and M. R. Blas, "CenSurE: Center surround extremas for realtime feature detection and matching," ECCV (4), Lecture Notes in Computer Science vol. 5305, pp. 102-115, Springer, 2008.
[8] M. Ebrahimi and W. W. Mayol-Cuevas, "SUSurE: Speeded Up Surround Extrema feature detector and descriptor for realtime applications," IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 9-14, 2009.
[9] P. A. Negri, "Détection et reconnaissance d'objets structurés : application aux transports intelligents," thesis, University Pierre and Marie Curie - Paris VI, Institute of Intelligent Systems and Robotics, September 2008.
[10] P. Lemaire, "Etude de la pertinence topologique des descripteurs d'images utilisés dans les algorithmes de détection de visages par apprentissage," Master's report, 2008.
[11] S. Zhao, "Apprentissage et recherche par le contenu visuel de catégories sémantiques d'objets vidéo," Master's thesis, University Paris Descartes, Laboratory of Image Processing and Signal CNRS France, July 2007.
[12] P. A. Negri, L. Prévost, and X. Clady, "Cascade generative and discriminative classifiers for vehicle detection," 16th French Congress AFRIF-AFIA on Pattern Recognition and Artificial Intelligence, Amiens, France, January 22-25, 2008.