Acceleration of the Retinex algorithm for image restoration by GPGPU/CUDA

Yuan-Kai Wang* and Wen-Bin Huang
Department of Electrical Engineering, Fu Jen Catholic University, 510, Chung-cheng Rd., Hsinchuang, Taipei County 24205, Taiwan (R.O.C.)

ABSTRACT
Retinex is an image restoration method that can restore an image's original appearance. The Retinex algorithm first computes center/surround information by convolving the image with a Gaussian blur of large kernel size. A log-domain operation between the original image and the center/surround information is then performed pixel-wise, and the results are finally normalized to an appropriate dynamic range. This paper presents GPURetinex, a data parallel algorithm devised by parallelizing the Retinex on GPGPU/CUDA. GPURetinex exploits the GPGPU's massively parallel architecture and hierarchical memory through hierarchical threading and careful data distribution, and its implementation is optimized to take full advantage of the GPGPU/CUDA computing model. In our experiments, a GT200 GPU and CUDA 3.0 are employed. The experimental results show that GPURetinex gains a 30x speedup over a CPU-based implementation on images of 2048 x 2048 resolution, indicating that CUDA acceleration can achieve real-time performance.
Keywords: GPU computing, CUDA, parallel computing, Retinex, image restoration, image enhancement
1. INTRODUCTION
The Retinex algorithm is a popular and effective method for removing environmental lighting interference and is used as a preprocessing step in many computer vision algorithms. Land1,2 first conceived the Retinex as a model of the lightness and color perception of human vision, and it has since spawned a variety of extensions. Retinex algorithms usually fall into three classes: path-based, recursive, and center/surround Retinex algorithms, of which the center/surround class is the most suitable for parallelization. Within that class, the Single Scale Retinex (SSR)3,4 can achieve either dynamic range compression or color/lightness rendition, but not both simultaneously. The Multi-Scale Retinex (MSR)4-6 and the Multi-Scale Retinex with Color Restoration (MSRCR)5,6 combine the dynamic range compression of small center/surround scales with the color/lightness rendition of large scales, the latter adding a universally applied color restoration. The Retinex algorithm comprises the computation of center/surround information, log-domain processing, and normalization, and its time complexity is very high. How to reduce this complexity and exploit the massively parallel architecture and hierarchical memory of the GPU is therefore the focus of this study. The computations of most image processing algorithms are intensive and time-consuming. Traditional sequential implementations often cannot achieve real-time performance, whereas parallel processing can. GPGPU/CUDA7,8 has many properties that suit real-time image/video processing, so GPU parallelization of the Retinex algorithm should greatly improve its performance.
The purpose of this paper is to accelerate the Retinex algorithm with a data parallel algorithm, called GPURetinex, based on GPGPU/CUDA. GPUs were traditionally used only to execute graphics applications, and developing parallel processing algorithms on that platform was very difficult. In recent years, however, GPUs have become sources of massive computing power usable for general-purpose computation (hence GPGPUs). GPGPUs have evolved into multithreaded, many-core processors that are especially well suited to data parallel computation. Another important capability is programmability, which enables users to develop GPU-based parallel algorithms with relative ease. *
[email protected]
Parallel Processing for Imaging Applications, edited by John D. Owens, I-Jong Lin, Yu-Jin Zhang, Giordano B. Beretta, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 7872, 78720E · © 2011 SPIE-IS&T · CCC code: 0277-786X/10/$18 doi: 10.1117/12.876640 SPIE-IS&T/ Vol. 7872 78720E-1 Downloaded from SPIE Digital Library on 27 May 2011 to 211.76.254.14. Terms of Use: http://spiedl.org/terms
The CUDA (Compute Unified Device Architecture) programming model was developed by NVIDIA for this platform. CUDA brings a C-based development environment: programs are written in ANSI C extended with a few keywords and libraries and are compiled with a C compiler8. This easy programmability makes the development of GPGPU programs more flexible and efficient. However, several challenges remain to be studied, such as managing the organized hierarchy of threads, using registers and on-chip shared memory, the characteristics of off-chip memory access, data transfer between device memory and host memory, and synchronization among threads on the GPGPU8. In this paper, a data parallel algorithm called GPURetinex is proposed to parallelize the Retinex algorithm on the GPGPU. The Gaussian blur in GPURetinex adopts separable convolution kernels to reduce both computation and internal data transfer. The data distribution of the parallel Gaussian blur convolution follows a horizontal stripe method: each thread reads the pixels of a horizontal stripe of the image to perform the convolution. The Gaussian blur uses texture memory and constant memory to improve efficiency. The data distribution in the log-domain processing and normalization steps uses a square subimage method: each thread computes the two operations for all pixels within its square subimage. A parallel reduction method is devised to find the maximum and minimum values of the log-domain processing image. Threads within a grid communicate via global memory and shared memory. The rest of this paper is organized as follows: Section 2 gives a brief introduction to Retinex algorithms, including SSR, MSR, and MSRCR. Section 3 presents details of the proposed GPURetinex algorithm, including its architecture, hierarchical parallel threading, data distribution, and use of hierarchical memory.
Experimental results are reported and discussed in Section 4. Finally conclusions are given in Section 5.
2. BACKGROUND
In this section, we review the computational model of Retinex algorithms and the parallelization of vision algorithms on GPGPU multicores.
2.1 Computational model of Retinex algorithm
The Retinex theory of color vision was first developed by Land2 as a model of the lightness and color perception of human vision. The main goal of Retinex is to remove the illumination effect by decomposing an image into a reflectance image and an illuminant image. The many interpretations, implementations, and improvements of the Retinex algorithm can usually be classified into three classes: path-based, recursive, and center/surround Retinex algorithms. The path-based Retinex algorithms rely on the multiplication of ratios between pixel values along a set of random paths in an image. The original theory and works of Land2,9,10 belong to this class. Horn extended Land's Retinex and showed that the illuminant can be estimated using a two-dimensional Laplacian11. Brainard and Wandell12 studied the convergence properties for a large number of long paths and found that the result converges to a simple normalization as the number of paths and their lengths increase. A further development using Brownian motion13 introduced randomly distributed paths. The primary drawbacks of path-based algorithms are high computational complexity, dependency on the path geometry, and parameters that are difficult to determine, such as the number of paths, their trajectories, and their lengths. Provenzi et al.14 presented a detailed mathematical analysis of the qualitative behavior of Retinex in relation to parameter variations. The recursive Retinex algorithm was first developed by Frankle and McCann15. It is a two-dimensional extension of the path-based version and replaces the path computation by a recursive matrix comparison15-17.
The recursive Retinex algorithm first computes ratios and products with long-distance interactions between pixels and then progressively moves to short-distance interactions. This algorithm is computationally more efficient than the path-based class; its primary drawback is that the number of iterations is difficult to determine. A method that automatically determines the number of iterations with an early stopping technique has been developed18. Sobol19 introduced a ratio modification operator to better compress large ratios while enhancing small ones, improving the original iterative Retinex algorithm. The center/surround Retinex algorithm was first proposed by Land20. This technique treats each pixel only once, selecting pixels sequentially; a new pixel value is obtained by computing the ratio between the treated pixel and a weighted average of its surrounding area. This implementation suggests that the center/surround information can be computed from a blurred version of the input image. Rahman et al.3 thus used a Gaussian blur to compute the center/surround information, a method called the single-scale Retinex. The multi-scale Retinex4 extends SSR and combines dynamic range compression and color/lightness rendition by averaging
three SSRs of different spatial scales. The Multi-Scale Retinex with Color Restoration5,6 was proposed to compensate for the inherent loss of color saturation with a color restoration factor. The center/surround Retinex algorithms are faster than the path-based ones, and the number of parameters is greatly reduced; they also have lower computational cost and overcome the deficiency of the recursive Retinex algorithm. In addition, the path-based and recursive Retinex classes cannot be parallelized effectively, while the center/surround class can. Therefore, the center/surround Retinex algorithm is suitable for GPGPU/CUDA implementation. Next, we introduce the MSRCR, MSR, and SSR in a unified view. The MSRCR algorithm combines small and large center/surround information and adopts a color adaptation mechanism to improve the color/lightness rendition of images that contain gray-world violations. The basic form of MSRCR is given by

R_i(x, y) = r_i(x, y) \sum_{k=1}^{n} W_k \left( \log I_i(x, y) - \log\left[ F_k(x, y) \otimes I_i(x, y) \right] \right), \quad i \in \{R, G, B\},   (1)
where R_i(x,y) is the MSRCR output at coordinates (x,y), I_i(x,y) the original image, F_k the kth center/surround function, W_k the weight of F_k, n the number of center/surround functions, and \otimes denotes the 2D convolution operation. The center/surround function F_k(x,y) is given by

F_k(x, y) = K e^{-(x^2 + y^2)/c_k^2},   (2)
where c_k is the kth Gaussian center/surround scale (a smaller c_k yields a narrower surround, a larger c_k a wider one), and K is a constant chosen so that \iint F_k(x, y)\,dx\,dy = 1. A small scale provides dynamic range compression, while a large scale contributes to color/lightness rendition, so multiple center/surrounds with different weights can be used to strike a graceful balance between the two. For the MSRCR, a combination of three scales representing narrow, medium, and wide center/surrounds is usually sufficient to provide both dynamic range compression and tonal rendition. A color restoration factor r_i(x,y) is then introduced to offer color constancy. It is given by

r_i(x, y) = \beta \times \log \left( \alpha \times \frac{I_i(x, y)}{\sum_{i=1}^{N} I_i(x, y)} \right),   (3)

where r_i(x,y) is the color restoration coefficient in the ith spectral band, N the number of spectral bands (N = 3 for typical RGB color images), \beta a gain constant, and \alpha a parameter controlling the strength of the nonlinearity. The MSRCR provides synthesized color constancy, dynamic range compression, enhancement of contrast and lightness, and good color rendition. If the color restoration factor r_i(x,y) is omitted, R_i(x,y) becomes the MSR output. The MSR combines dynamic range compression and color rendition, but it still has some drawbacks for color images21. If, in addition, the number of center/surround functions n is 1, R_i(x,y) becomes the SSR output. With a small scale the SSR provides dynamic range compression that makes edge information prominent; with a large scale it provides color rendition that increases color information. The SSR cannot retain edge and color information simultaneously. The MSRCR is computationally the most intensive of the three.
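To make Eqs. (1)-(3) concrete, the following is a minimal Python sketch of the computation for one spectral band. It is illustrative only: a clamped box blur stands in for the Gaussian surround F_k, and the names (box_blur, msrcr_band) and default values for alpha and beta are our assumptions, not the paper's implementation.

```python
import math

def box_blur(img, radius):
    """Stand-in for the surround F_k (x) I: a border-clamped box blur.
    A real implementation uses a Gaussian kernel of scale c_k."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, n = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp at borders
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx]
                    n += 1
            out[y][x] = acc / n
    return out

def msrcr_band(band, band_sum, radii, weights, alpha=125.0, beta=46.0):
    """Eqs. (1) and (3) for one band: weighted log differences against
    n surrounds, scaled by the color restoration factor r_i."""
    h, w = len(band), len(band[0])
    surrounds = [box_blur(band, r) for r in radii]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            r_i = beta * math.log(alpha * band[y][x] / band_sum[y][x])
            s = sum(wk * (math.log(band[y][x]) - math.log(g[y][x]))
                    for wk, g in zip(weights, surrounds))
            out[y][x] = r_i * s
    return out
```

With weights = [1.0] and the r_i factor dropped, msrcr_band reduces to the SSR of Eq. (1); summing several scales without r_i gives the MSR.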
It usually adopts three Gaussian blur convolutions with large kernel sizes to compute the center/surround information in each spectral band. Three log-domain operations between the original image and the different center/surround images are then performed pixel-wise. In the final step, the algorithm finds the maximum and minimum of the intermediate results in order to normalize them to the range 0 to 255.
2.2 Parallel processing by GPGPU's multicore
A GPGPU is a general-purpose GPU whose many cores allow parallel processing algorithms to be developed for improved performance, and CUDA is a technology that enables users to develop such GPU-based parallel algorithms easily. GPGPUs have recently evolved into programmable multicore architectures and become the focus of much research in parallel and high performance computing. Owens et al.26 gave a broad survey of general-purpose computation on
graphics hardware. Here we focus on work done on GPGPU parallelization of image processing and computer vision, where there is a substantial body of research. Moreland and Angel27 used a fragment shader to compute the Fast Fourier Transform on the GPU and achieved a 4x speedup over a CPU implementation. Strzodka et al.28 presented a motion estimation algorithm that provides dense estimates from optical flow and ran 2.8 times faster on a GeForce 5800 Ultra GPU than an optimized CPU version. Shen et al.29 implemented the color space conversion for MPEG video encoding on the GPU using DirectX and achieved a 2~3x speedup. Many parallel computer vision algorithms have been studied in GPU4Vision30, such as real-time optical flow and total variation based image segmentation. Several open source libraries have been developed to aid the development of GPU-based vision algorithms. OpenVIDIA31 provides a framework for video input and display, along with implementations of feature detection and tracking, skin tone tracking, and real-time blink detection. GpuCV32 was designed to provide GPU acceleration behind OpenCV-compatible interfaces. Another open source library, MinGPU33, provides a set of useful image processing and computer vision functions. CUDA is a programming framework developed for the GPGPU platform; thanks to its ease of programming, the development of GPGPU programs becomes more efficient. Many implementations use GPGPU/CUDA to accelerate computationally intensive tasks in image processing and computer vision. A parallel Canny edge detector, including all stages of the algorithm, was demonstrated under CUDA22; it achieved a 3.8x speedup on images of 3936 x 3936 resolution compared to an optimized OpenCV version.
A real-time 3D visual tracker with an efficient Sparse-Template-based particle filtering was implemented by Lozano and Otsuka34. Their implementation can achieve 10 times performance improvements compared to a similar CPU-only tracker.
3. THE GPURETINEX METHOD
The Retinex algorithm is an inherently parallel problem: each spectral band independently performs the computation of center/surround information at different scales, log-domain processing, and normalization, and each of these steps can be parallelized and run as kernels on the GPGPU. An overview of the proposed GPURetinex is shown in Figure 1. GPURetinex uses the heterogeneous programming model provided by CUDA, in which serial code segments execute on the host (CPU) and only parallel code segments execute on the device (GPU). The host loads an original image and transfers it from host to device memory. Four kernels, Gaussian blur, log-domain processing, reduction35,36, and normalization, are then executed in parallel under the Single Program Multiple Data (SPMD) model on the GPGPU, and the result is finally transferred from device back to host. In each kernel, parallel execution with optimized memory usage and maximum utilization of memory bandwidth is the critical design factor for achieving maximum performance. This matters most for the Gaussian blur, since GPURetinex uses a Gaussian blur convolution with a large kernel size to obtain the center/surround information. Because the 2D Gaussian filter is separable22,23, it can be split into two 1D Gaussian filters, and adopting separable convolution kernels reduces computation time. The Gaussian blur in GPURetinex is therefore divided into a row-filter convolution and a column-filter convolution. The row-filter convolution is shown in Figure 2. Its data distribution adopts a horizontal stripe method36: each thread block, containing 256 threads (a 256 x 1 thread block, T0~T255), corresponds to one row of the TempImage, and each thread in the block reads pixels from its horizontal stripe to perform the row convolution. The row convolution uses the texture and constant caches to improve efficiency, because the original image and the Gaussian kernel are read very frequently; fetching these data through the caches fully exploits spatial locality when threads of the same warp access nearby pixels in the image. The subsequent column convolution, shown in Figure 3, likewise adopts the horizontal stripe data distribution. The next step is the parallel log-domain processing, comprising subtractions, multiplication by the weights and the color restoration factor, and summation. These computations can be performed in parallel at the pixel level, as shown in Figure 4. GPURetinex uses three Gaussian-blurred images with different scales, G1(x,y), G2(x,y), and G3(x,y), to combine the effects of dynamic range compression and color/lightness rendition; I(x,y) and R(x,y) are the input and output images. The data distribution in this step uses a square subimage method36. Each thread block that contains 256 threads (16 * 16 as the dimension of
the thread block: T(0,0)~T(15,15)) corresponds to a square subimage. Each thread computes the operations for all pixels within its square subimage.

Figure 1. Overview of the GPURetinex. [Diagram: the host splits the color image into its B, G, and R spectral bands and copies them to device global memory; the Gaussian kernel resides in constant memory and the source bands are read through texture memory. The four kernels, parallel Gaussian blur, parallel log-domain processing, parallel reduction, and parallel normalization, run on the device, with the per-band max/min values passing through global and constant memory; the resulting bands are copied back to the host and combined into the Retinex image.]
Figure 2. The row convolution in the GPURetinex. [Diagram: each of the 256 thread blocks (threads T0~T255) convolves one row of the SrcImage, read through texture memory, with the 1D row filter (m1, m2, m3) held in constant memory, writing one row of the TempImage to global memory. Samples are replicated at the left border, e.g. T0: o1 = S1*m1 + S1*m2 + S2*m3; T1: o2 = S1*m1 + S2*m2 + S3*m3.]
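The separable filtering performed by the row and column passes can be sketched in plain Python (a sketch, not the CUDA kernels; the clamped borders mirror the replicated samples in the figures, and the function names are ours). The row pass produces the TempImage; the column pass of the same 1D kernel, done here by transposing, yields the same result as the full 2D convolution at O(2m) instead of O(m^2) multiplies per pixel for an m x m kernel.

```python
def conv1d_rows(img, k):
    """Row pass: convolve each row with 1D kernel k, clamping at borders."""
    h, w, r = len(img), len(img[0]), len(k) // 2
    return [[sum(k[i] * row[min(max(x + i - r, 0), w - 1)]
                 for i in range(len(k))) for x in range(w)]
            for row in img]

def transpose(img):
    return [list(col) for col in zip(*img)]

def separable_blur(img, k):
    """Row filter, then column filter (as a row pass over the transpose)."""
    tmp = conv1d_rows(img, k)              # the TempImage, in the paper's terms
    return transpose(conv1d_rows(transpose(tmp), k))
```

Because border clamping is applied per axis, the two 1D passes match a border-clamped 2D convolution with the outer-product kernel exactly.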
Figure 3. The column convolution in the GPURetinex. [Diagram: each thread block (T0~T255) convolves the TempImage with the 1D column filter (n1, n2, n3) held in constant memory, reading through texture memory and writing the DstImage to global memory. Samples are likewise replicated at the top border, e.g. T0: o1 = S1*n1 + S1*n2 + S4*n3; T1: o2 = S2*n1 + S2*n2 + S5*n3.]
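The two data distributions used by the kernels, horizontal stripes for the convolutions and square subimages for the pixel-wise steps, can be illustrated by a small ownership mapping. This is our illustrative sketch (function names and the striding convention are assumptions); it shows which (block, thread) pair touches a given pixel of a 256-wide band.

```python
def stripe_owner(y, x, threads_per_block=256):
    """Horizontal stripe method: block b owns row b; thread t owns
    columns t, t + 256, t + 512, ... of that row."""
    return (y, x % threads_per_block)          # (block id, thread id)

def subimage_owner(y, x, tile=16):
    """Square subimage method: a grid of 16 x 16 thread blocks tiles the
    band; block (by, bx) covers tile (by, bx) and thread (ty, tx) covers
    one pixel of it (for a 256 x 256 band; larger bands would stride)."""
    return ((y // tile, x // tile), (y % tile, x % tile))
```

The stripe mapping keeps each warp reading contiguous pixels of one row, which is what makes the texture-cache fetches in the convolution kernels spatially local.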
Figure 4. The parallel computing of log-domain processing. [Diagram: the input image I(x,y) and the three Gaussian-blurred images G1(x,y), G2(x,y), G3(x,y) reside in global memory; thread blocks B(0,0)~B(15,15), each of 16 x 16 threads, cover square subimages of the 256 x 256 band and write the result R(x,y) back to global memory. Each thread evaluates, e.g., T(0,0): R(0,0) = r(0,0)*exp(W1*(log(I(0,0)+1)-log(G1(0,0)+1)) + W2*(log(I(0,0)+1)-log(G2(0,0)+1)) + W3*(log(I(0,0)+1)-log(G3(0,0)+1))).]
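One thread's share of the log-domain kernel, as written in Figure 4, can be expressed as a scalar Python sketch. The exp and the +1 offsets (guarding log(0) on 8-bit data) follow the figure's per-thread formula; the function name is ours.

```python
import math

def log_domain_pixel(i_val, g_vals, weights, r_val):
    """One thread's work: weighted log differences between the input
    pixel and its three Gaussian-blurred values, scaled by the color
    restoration factor r and mapped back through exp."""
    s = sum(w * (math.log(i_val + 1.0) - math.log(g + 1.0))
            for w, g in zip(weights, g_vals))
    return r_val * math.exp(s)
```

When a pixel equals all three of its blurred values the log differences vanish and the output is just r_val, which is the neutral point of the enhancement.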
Next, the previous processing results must be normalized to the range 0 to 255. The normalization formula is

O_i(x, y) = \left[ R_i(x, y) - \min_i \right] \times \frac{255}{\max_i - \min_i},   (4)

where O_i(x,y) is the output in the ith spectral band, R_i(x,y) is the result of log-domain processing, and max_i and min_i are the maximum and minimum values in the ith spectral band. A parallel reduction method35,36 is therefore used to find the maximum and minimum values of the previous processing results, after which the final restored image is obtained by Eq. (4). The min/max problem is implemented as multiple reduction operations over a single set of data36, so the normalization is divided into two steps: parallel reduction and parallel normalization.
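Eq. (4) itself is a simple affine remapping; a one-line Python sketch (the function name is ours) makes the two-step split explicit: the reduction supplies lo and hi, and this function is the normalization kernel's per-pixel work.

```python
def normalize_band(vals, lo, hi):
    """Eq. (4): linearly map log-domain results to [0, 255] using the
    band's minimum (lo) and maximum (hi) found by the reduction step."""
    scale = 255.0 / (hi - lo)
    return [(v - lo) * scale for v in vals]
```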
The parallel reduction, illustrated in Figure 5 for the local maximum part, uses the horizontal stripe distribution. Each thread block contains 128 threads (a 128 x 1 thread block, T0~T127), and the total number of thread blocks is 128. Each thread sequentially compares its local data to find the local maximum value in a register; the local maxima are then written to the block's shared memory.
Figure 5. The parallel method of finding the local maximum in the GPURetinex. [Diagram: blocks B0~B127 each scan a stripe of the 256 x 256 input image in global memory; within a block, each thread Tn keeps a running maximum in a register (if b > Max then Max = b, and so on) and writes its local maximum into the block's shared memory.]
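The reduction scheme above, a strided per-thread scan into registers followed by a shared-memory tree reduction, can be emulated sequentially in Python. This is a sketch under assumed shapes (small block/thread counts rather than the paper's 128 x 128 layout; names are ours), not the CUDA kernel.

```python
def block_reduce_max(data, num_blocks=4, threads=4):
    """Two-level max reduction: each thread scans a strided slice of its
    block's chunk into a local 'register' value, the block tree-reduces
    the per-thread values as shared memory would, and the per-block
    maxima are reduced at the end (by the host or a second launch)."""
    n = len(data)
    chunk = (n + num_blocks - 1) // num_blocks
    block_max = []
    for b in range(num_blocks):
        part = data[b * chunk:(b + 1) * chunk]
        if not part:
            continue
        # each "thread" t scans elements t, t+threads, t+2*threads, ...
        shared = [max(part[t::threads]) for t in range(threads) if part[t::threads]]
        stride = len(shared)
        while stride > 1:            # tree reduction within the block
            half = (stride + 1) // 2
            for t in range(stride - half):
                shared[t] = max(shared[t], shared[t + half])
            stride = half
        block_max.append(shared[0])
    return max(block_max)
```

The minimum is found the same way with min in place of max, which is why the min/max problem runs as multiple reductions over one data set.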
The final step of the GPURetinex is to normalize the final image in parallel to the range 0 to 255 using the maximum and minimum values. These computations can be performed in parallel at the pixel level, as shown in Figure 6. R(x,y) and O(x,y) are the input and output images, respectively. The data distribution in this step uses the square subimage method: each thread block contains 256 threads (a 16 x 16 thread block, T(0,0)~T(15,15)) and corresponds to a square subimage, and each thread computes the operations for all pixels within its subimage.

Figure 6. The parallel normalization in the GPURetinex. [Diagram: thread blocks B(0,0)~B(15,15) map square subimages of R(x,y) in global memory to O(x,y); each thread evaluates, e.g., T(0,0): O(0,0) = (R(0,0)-min)*(255/(max-min)).]
4. EXPERIMENTAL RESULTS
The performance of GPURetinex was tested on a Tesla C1060 with CUDA 3.0, paired with an Intel Core 2 Duo 3.0 GHz host. The GPGPU has 240 streaming processors (SPs), 16 KB of shared memory per streaming multiprocessor (SM), and 4 GB of device memory. For comparison, a serial implementation of Retinex was developed and run as a single thread on one CPU core; its Gaussian blur filtering adopts the optimized implementation in the OpenCV library. Three color images with poor visibility conditions were tested in our experiments. The first is a scene obscured by water turbidity, where the poor underwater lighting makes objects difficult to see. The second is a scene obscured by fog. The third is an indoor scene with bright sunlight shining in from outside. The illumination
problem of high dynamic range introduces dark regions and local shadows that reduce visibility. Figures 7 (a)(d)(g) show the three original images. Figures 7 (b)(c) show the enhancement of the first image: the whole scene becomes visible and both Retinex implementations improve the color, especially for the fish, the sea urchin, and the texture of the stones. Visibility and color are also improved in Figures 7 (e)(f), aiding the identification of ground objects. The images enhanced by the CPU-based Retinex and GPURetinex are shown in Figures 7 (h)(i); there is a dramatic increase in overall visibility and in shadow detail, with the floor, ceiling, and aircraft notably brought out.
Figure 7. The three sets of original and enhanced images. (a) The first image, (b) the result of CPU-based Retinex, (c) the result of GPURetinex, (d) the second image, (e) the result of CPU-based Retinex, (f) the result of GPURetinex, (g) the third image, (h) the result of CPU-based Retinex, (i) the result of GPURetinex.
We next compare the execution times of the GPURetinex and CPU-based algorithms at four resolutions: 256 x 256, 512 x 512, 1024 x 1024, and 2048 x 2048. Three Gaussian filters of widths 17, 83, and 253 are adopted to compute the center/surround information for each spectral band. The execution times of the four steps of the Retinex algorithm are metered individually; the detailed times are shown in Table 1. The Others columns in Table 1 represent memory management and transfer time: for the CPU this is only the cost of memory management, while for the GPU it also includes data transfers from host memory to device memory, from device memory to host memory, and within device memory. GPURetinex runs faster than CPURetinex, and the Gaussian smooth step always consumes the most processing time; more details are analyzed in the following.
Table 1. Execution times (msec) of the five parts of the Retinex algorithm with respect to image resolution.
Image Size   | Gaussian Smooth   | Log-domain Proc. | Reduction     | Normalization  | Others         | Total
             | CPU       GPU    | CPU       GPU    | CPU     GPU   | CPU      GPU   | CPU     GPU    | CPU        GPU
256 x 256    | 143.15    5.31   | 64.19     0.38   | 1.16    0.44  | 2.23     0.10  | 0.27    3.31   | 211.00     9.54
512 x 512    | 597.71    18.45  | 252.22    0.75   | 4.55    0.44  | 8.77     0.15  | 1.12    12.95  | 864.37     32.74
1024 x 1024  | 2418.96   77.89  | 1006.90   2.67   | 18.65   0.57  | 35.70    0.34  | 4.90    42.65  | 3485.11    124.12
2048 x 2048  | 11097.00  343.03 | 4016.79   10.51  | 74.65   1.45  | 143.90   1.20  | 19.60   163.87 | 15351.94   520.06
Figure 8 compares the total execution times of CPURetinex and GPURetinex at the different image sizes. The total execution time of both versions grows proportionally with the image dimensions, and the total time of GPURetinex is far less than that of the CPU-based implementation, with the gap widening as the image size increases.

Figure 8. Total execution time for different image sizes, CPURetinex vs. GPURetinex. [Log-scale chart; CPURetinex: 211.00, 864.37, 3485.11, and 15351.94 msec, and GPURetinex: 8.80, 31.31, 124.42, and 514.82 msec, for 256 x 256, 512 x 512, 1024 x 1024, and 2048 x 2048, respectively.]
Figure 9 shows the speedup of the GPURetinex over the CPURetinex. The speedup grows with image size, from roughly 24x at 256 x 256 to 30x at 2048 x 2048 resolution. The experimental results of the GPU-accelerated Retinex demonstrate an excellent speed boost.
Figure 9. The speedup of the GPURetinex over CPURetinex. [Bar chart; speedups of 23.98, 27.61, 28.01, and 29.82 for 256 x 256, 512 x 512, 1024 x 1024, and 2048 x 2048, respectively.]
5. CONCLUSIONS
This paper presented GPURetinex, a GPU-accelerated data parallel algorithm that parallelizes the Retinex algorithm. The computational complexity of Retinex is very high, especially in computing the center/surround information; however, Retinex is an inherently parallel problem, and parallelization with GPGPU/CUDA can greatly improve its performance. A thorough understanding of the memory hierarchy and programming model is essential for such parallelization, and high processor occupancy is critical for maximum performance on the GPGPU. Attention must also be paid to the data distribution across thread blocks and to data placement within the memory hierarchy; since the memory bandwidth between host and device can bottleneck performance, fast reads and writes of data are likewise important. Our experimental results show that GPGPU/CUDA can greatly accelerate the Retinex algorithm: GPURetinex gains a 30x speedup over the CPU-based implementation on images of 2048 x 2048 resolution.
REFERENCES
[1] M. Ebner, [Color Constancy], John Wiley & Sons, England, 143-153 (2007).
[2] E. Land, "The Retinex," Amer. Scient., 52(2), 247-264 (1964).
[3] D. J. Jobson, Z. Rahman, and G. A. Woodell, "Properties and performance of a center/surround retinex," IEEE Trans. on Image Processing, 6(3), 451-462 (1997).
[4] Z. Rahman, D. Jobson, and G. A. Woodell, "Multiscale retinex for color image enhancement," Proc. IEEE International Conference on Image Processing, 1003-1006 (1996).
[5] D. J. Jobson, Z. Rahman, and G. A. Woodell, "A multi-scale Retinex for bridging the gap between color images and the human observation of scenes," IEEE Trans. on Image Processing: Special Issue on Color Processing, 6(7), 965-976 (1997).
[6] Z. Rahman, D. Jobson, and G. Woodell, "Retinex processing for automatic image enhancement," The Human Vision and Electronic Imaging VII Conference, 390-401 (2002).
[7] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron, "A performance study of general-purpose applications on graphics processors using CUDA," Parallel and Distributed Computing, 68(10), 1370-1380 (2008).
[8] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu, "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," Proc. 13th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, 73-82 (2008).
SPIE-IS&T/ Vol. 7872 78720E-10 Downloaded from SPIE Digital Library on 27 May 2011 to 211.76.254.14. Terms of Use: http://spiedl.org/terms
[9] E. Land and J. McCann, “Lightness and retinex theory,” J. Opt. Soc. Amer., 61(1), 1-11 (1971).
[10] E. H. Land, “Recent advances in retinex theory,” Vision Research, 26(1), 7-21 (1986).
[11] B. K. P. Horn, “Determining lightness from an image,” Computer Graphics and Image Processing, 277-299 (1974).
[12] D. Brainard and B. Wandell, “Analysis of the Retinex theory of color vision,” J. Opt. Soc. Amer. A, 3(10), 1651-1661 (1986).
[13] A. Rizzi, C. Gatta, and D. Marini, “From Retinex to automatic color equalization: issues in developing a new algorithm for unsupervised color equalization,” J. Electron. Imag., 13(1), 15-28 (2004).
[14] E. Provenzi, L. D. Carli, A. Rizzi, and D. Marini, “Mathematical definition and analysis of the retinex algorithm,” J. Opt. Soc. Amer. A, 2613-2621 (2005).
[15] J. Frankle and J. McCann, “Method and apparatus for lightness imaging,” US Patent 4384336 (1983).
[16] J. McCann, “Lessons learned from Mondrians applied to real images and color gamuts,” Proc. IS&T/SID Seventh Color Imaging Conference, 1-8 (1999).
[17] B. Funt, F. Ciurea, and J. McCann, “Retinex in Matlab,” J. Electron. Imag., 13(1), 48-57 (2004).
[18] F. Ciurea and B. Funt, “Tuning Retinex parameters,” J. Electron. Imag., 13(1), 58-64 (2004).
[19] R. Sobol, “Improving the Retinex algorithm for rendering wide dynamic range photographs,” J. Electron. Imag., 13(1), 65-74 (2004).
[20] E. Land, “An alternative technique for the computation of the designator in the retinex theory of color vision,” Proc. National Academy of Sciences, 3078-3080 (1986).
[21] L. Tao and V. Asari, “Modified luminance based MSR for fast and efficient image enhancement,” 32nd Applied Imagery Pattern Recognition Workshop, 174-179 (2003).
[22] Y. Luo and R. Duraiswami, “Canny edge detection on NVIDIA CUDA,” IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 1-8 (2008).
[23] V. Podlozhnyuk, “Image convolution with CUDA,” NVIDIA white paper (2007).
[24] G. M. Amdahl, “Validity of the single-processor approach to achieving large-scale computing capabilities,” Proc. Am. Federation of Information Processing Societies Conference, 483-485 (1967).
[25] M. D. Hill and M. R. Marty, “Amdahl’s law in the multicore era,” IEEE Computer, 41(7), 33-38 (2008).
[26] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell, “A survey of general-purpose computation on graphics hardware,” Computer Graphics Forum, 26(1), 80-113 (2007).
[27] K. Moreland and E. Angel, “The FFT on a GPU,” SIGGRAPH/Eurographics Workshop on Graphics Hardware, 112-119 (2003).
[28] R. Strzodka and C. Garbe, “Real-time motion estimation and visualization on graphics cards,” Proc. IEEE Visualization, 545-552 (2004).
[29] G. Shen, G.-P. Gao, S. Li, H. Shum, and Y. Zhang, “Accelerate video decoding with generic GPU,” IEEE Trans. on Circuits and Systems for Video Technology, 15(5), 685-693 (2005).
[30] GPU4Vision: http://www.gpu4vision.org.
[31] J. Fung, S. Mann, and C. Aimone, “OpenVIDIA: Parallel GPU computer vision,” Proc. ACM International Conference on Multimedia, 849-852 (2005).
[32] Y. Allusse, P. Horain, A. Agarwal, and C. Saipriyadarshan, “GpuCV: An open-source GPU-accelerated framework for image processing and computer vision,” Proc. ACM International Conference on Multimedia, 1089-1092 (2008).
[33] P. Babenko and M. Shah, “MinGPU: A minimum GPU library for computer vision,” Real-Time Image Processing, 3(4), 255-268 (2008).
[34] M. Lozano and K. Otsuka, “Real-time visual tracker by stream processing,” J. of Sign. Process. Syst., 674-679 (2008).
[35] M. Harris, “Optimizing parallel reduction in CUDA,” NVIDIA Developer Technology (2007).
[36] H. J. Siegel, L. Wang, J. E. So, and M. Maheswaran, “Data parallel algorithms,” Parallel and Distributed Computing Handbook, 466-499 (1996).