Automatic Depth of Field Extension using GPU and CUDA for microscopy applications

J. M. Castillo-Secilla (1), M. Saval (1), L. Medina (1), S. Cuenca-Asensi (1), A. Martínez-Álvarez (1), C. Sánchez (2), G. Cristóbal (2)

{jmcastillo, msaval, lmedina, sergio, amartinez}@dtic.ua.es, [email protected], [email protected]

(1) Dpto. Tecnología Informática y Computación, Universidad de Alicante.
(2) Instituto de Óptica Daza de Valdés, CSIC, Madrid.

Abstract

Many computer vision applications require long processing times. In general terms, those applications carry out several basic real-time morphological operations and convolutions, which are highly parallelizable. Applying these operations to a stack of images acquired with a digital microscope at different focus settings, it is possible to combine them into a single highly focused image, or extended DoF (Depth of Field) image. This process has been parallelized using CUDA and OpenCV4Tegra on a Jetson TK1. Results show that, using the NVIDIA TK1 SoC, it is possible to obtain a real-time embedded implementation of automatic DoF extension.

Introduction

Combining a set of images into a single output image is a common procedure in computer vision. Our project produces a focused image from slices acquired at different focal distances (z). This problem has previously been addressed by a plugin for ImageJ, with an average execution time of 149.339 s, which is not suitable for many real-time applications.

The main goal of this work is a high-performance embedded implementation on the NVIDIA Jetson TK1 SoC.

Figure 1: Example of the focuser system.

Image Stack Fusion

“A wrapper between OpenCV GpuMat for passing the images to the CUDA kernel has been developed.”

[Figure 3 diagram: gpu::GpuMat imageProcessed[] and gpu::GpuMat imageArray[] are the inputs of fusion_GPU]

Figure 3: Image Stack Fusion inputs.

    cv::Mat cudaFusion_GpuMat(cv::gpu::GpuMat imageProcessed[], cv::gpu::GpuMat imageArray[],
                              int WIDTH, int HEIGHT, int STACK_SIZE, Stream stream)
    {
        // Wrapper between OpenCV GpuMat to gpu::PtrStepSz

        // Fill gpu arrays for allocating it in GPU memory

        // Allocate gpu memory
        // Copy data to GPU space

        // Kernel configuration: the grid is rounded up so that partial blocks
        // cover the image border (the kernel checks x < WIDTH and y < HEIGHT)
        dim3 dimBlock = dim3(THREADS, THREADS, 1);
        dim3 dimGrid = dim3((WIDTH + THREADS - 1) / THREADS, (HEIGHT + THREADS - 1) / THREADS);
        fusion_GPU<<<dimGrid, dimBlock>>>(senderImageProcessedArray, senderImageArray, d_output,
                                          WIDTH, HEIGHT, STACK_SIZE);

        // Copy back data from destination device memory to OpenCV output image
        // Free gpu memory
    }

    // Note: the template arguments were lost in extraction. uchar3 follows from the
    // kernel body; uchar for the processed (sharpness) images is an assumption.
    __global__ void fusion_GPU(gpu::PtrStepSz<uchar> arrayProcessed[], gpu::PtrStepSz<uchar3> arrayImages[],
                               PtrStep<uchar3> imageMaximums, int WIDTH, int HEIGHT, int DEPTH)
    {
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        int idx = y * WIDTH + x;

        if (x < WIDTH && y < HEIGHT) {
            // Iterate in the z dimension, keeping the slice with the largest processed value
            int index = 0;
            float max = arrayProcessed[0].ptr(y)[x];
            for (int z = 1; z < DEPTH; z++) {
                if (max < arrayProcessed[z].ptr(y)[x]) {
                    max = arrayProcessed[z].ptr(y)[x];
                    index = z;
                }
            }
            uchar3 pixel = arrayImages[index](y, x);  // BGR format for OpenCV: pixel.x, pixel.y, pixel.z
            imageMaximums[idx] = make_uchar3(pixel.x, pixel.y, pixel.z);
        }
    }

Automatic DoF Extension

The system consists of five main stages (see Figure 1):

1. Autofocus: Determines the best focal distance by finding the sharpest image of the z-stack. Several methods exist for this purpose [1-5]; among them we selected Vollath4 (see Equation 1) because it offers the best results in tuberculosis microscopy [2].

$$F_{VOL4} = \sum_{m=1}^{M-1}\sum_{n=1}^{N} g(m,n)\,g(m+1,n) \;-\; \sum_{m=1}^{M-2}\sum_{n=1}^{N} g(m,n)\,g(m+2,n)$$

Equation 1: Vollath-4, where g(m, n) is the gray level at pixel (m, n) of an M x N image.

2. Depth of Field Extension: Generates a focused image from a subset of the stack by selecting in-focus regions and automatically stitching them together into a single focused image (stack fusion). The process involves the following filters:

   1. Sobel
   2. Maximum
   3. Image Stack Fusion
   4. Gaussian Blur

Sobel, Maximum and Gaussian Blur are convolution operations applied per pixel; hence, they are ideal for parallelization. These functions are well known and their GPU implementations in OpenCV are widely used. For this reason, we used them through the OpenCV4Tegra [6] library and focused our work on the design and development of a CUDA [7] kernel for the Image Stack Fusion process.

Figure 4: Thread 3D grid proposed in the Image Stack Fusion process.

Experimentation

The testbed is an NVIDIA Jetson TK1. This embedded system integrates a quad-core ARM Cortex-A15 CPU and an NVIDIA Kepler GPU (192 cores).

The input stack consists of 17 images with a resolution of 2592x1944. Vollath 4 determines the sharpest image, and the fusion process then runs on a reduced stack (configured with 7 images) centered on that image, that is, the sharpest image and its three neighbors on each side.
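The autofocus step and the reduced-stack selection described above can be sketched on the CPU as follows. This is a minimal C++ illustration, not the authors' implementation: the `GrayImage` struct, the function names, the choice of m as the row index, and the clamping of the reduced stack at the ends of the z-stack are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical container for one grayscale slice, stored row-major:
// g(m, n) = pixels[m * cols + n], with m the row index (assumption).
struct GrayImage {
    int rows;  // M
    int cols;  // N
    std::vector<std::uint8_t> pixels;
};

// Vollath-4 focus measure:
// sum of g(m,n)*g(m+1,n) minus sum of g(m,n)*g(m+2,n).
double vollath4(const GrayImage& img) {
    assert(img.rows >= 0 && img.cols >= 0);
    const std::size_t cols = static_cast<std::size_t>(img.cols);
    assert(img.pixels.size() == static_cast<std::size_t>(img.rows) * cols);
    double lag1 = 0.0;  // products with the next row
    double lag2 = 0.0;  // products with the row after next
    for (int m = 0; m + 1 < img.rows; ++m) {
        for (std::size_t n = 0; n < cols; ++n) {
            const std::size_t i = static_cast<std::size_t>(m) * cols + n;
            const double g = img.pixels[i];
            lag1 += g * img.pixels[i + cols];
            if (m + 2 < img.rows) lag2 += g * img.pixels[i + 2 * cols];
        }
    }
    return lag1 - lag2;
}

// Index of the slice with the highest focus measure (first one wins on ties).
int sharpestIndex(const std::vector<GrayImage>& stack) {
    assert(!stack.empty());
    int best = 0;
    double bestScore = vollath4(stack[0]);
    for (std::size_t z = 1; z < stack.size(); ++z) {
        const double score = vollath4(stack[z]);
        if (score > bestScore) {
            bestScore = score;
            best = static_cast<int>(z);
        }
    }
    return best;
}

// Inclusive [first, last] range of `count` slices centered on `sharpest`.
// Shifting the window so it stays inside the stack is an assumption.
std::pair<int, int> reducedStack(int sharpest, int stackSize, int count) {
    assert(count >= 1 && count <= stackSize);
    const int half = count / 2;
    const int first = std::max(0, std::min(sharpest - half, stackSize - count));
    return std::make_pair(first, first + count - 1);
}
```

With the 17-image stack and a 7-image window, a sharpest slice at index 8 gives the range [5, 11], which corresponds to Sharpest - 3 through Sharpest + 3.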

[Figure 2 diagram: Read stack of images → Vollath-4 on the full set → subset of images → Sobel edge filter and Maximum/dilate filter applied to each image of the subset → Stack image fusion]
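The per-pixel selection performed by the stack fusion stage can be illustrated with a CPU reference. This is a minimal C++ sketch, not the authors' code: the `Bgr` struct, the `fuseStack` name and the flat row-major buffers are assumptions made for illustration. For each pixel it picks the slice whose processed (Sobel plus Maximum) response is largest and copies that slice's color value to the output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical BGR pixel, matching OpenCV's channel order.
struct Bgr {
    std::uint8_t b, g, r;
};

// sharpness[z][y * width + x]: processed response of slice z at (x, y)
// images[z][y * width + x]:    original color pixel of slice z at (x, y)
std::vector<Bgr> fuseStack(const std::vector<std::vector<std::uint8_t>>& sharpness,
                           const std::vector<std::vector<Bgr>>& images,
                           int width, int height) {
    assert(width > 0 && height > 0);
    assert(!sharpness.empty() && sharpness.size() == images.size());
    const std::size_t count =
        static_cast<std::size_t>(width) * static_cast<std::size_t>(height);
    for (std::size_t z = 0; z < sharpness.size(); ++z) {
        assert(sharpness[z].size() == count && images[z].size() == count);
    }

    std::vector<Bgr> fused(count);
    for (std::size_t i = 0; i < count; ++i) {
        // Iterate in the z dimension; the first slice wins on ties.
        std::size_t best = 0;
        for (std::size_t z = 1; z < sharpness.size(); ++z) {
            if (sharpness[best][i] < sharpness[z][i]) best = z;
        }
        fused[i] = images[best][i];
    }
    return fused;
}
```

Each output pixel depends only on the same pixel position across the slices, which is why one GPU thread per pixel is a natural mapping for this stage.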

| Stage              | CPU     | GPU    | Speedup  |
| Vollath 4          | 11.11 s | 1.13 s | x9.85    |
| Sobel              | 1.71 s  | 0.18 s | x9.44    |
| Maximum            | 0.09 s  | 0.04 s | x1.92    |
| Image Stack Fusion | 0.90 s  | 0.10 s | x9.21    |
| Gaussian Blur      | 0.38 s  | 0.18 s | x2.12    |
| Others             | 0.21 s  | 0.40 s | (-)x0.53 |

Table 1: Execution times and speedup.
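The overall speedups reported for the two configurations can be cross-checked against the per-stage times of Table 1. The sketch below assumes that the total time of a configuration is the plain sum of its stage times, and that Config_1 uses the CPU time for Image Stack Fusion while every other stage uses its GPU time. The struct and function names are hypothetical.

```cpp
#include <cassert>

// One row of Table 1 (times in seconds).
struct StageTime {
    const char* name;
    double cpuSeconds;
    double gpuSeconds;
    bool isFusion;  // marks the Image Stack Fusion row
};

static const StageTime kTable1[] = {
    {"Vollath 4", 11.11, 1.13, false},
    {"Sobel", 1.71, 0.18, false},
    {"Maximum", 0.09, 0.04, false},
    {"Image Stack Fusion", 0.90, 0.10, true},
    {"Gaussian Blur", 0.38, 0.18, false},
    {"Others", 0.21, 0.40, false},
};

// Overall speedup = total CPU baseline time / total time of the configuration.
// fusionOnGpu = false models Config_1, fusionOnGpu = true models Config_2.
double overallSpeedup(bool fusionOnGpu) {
    double baseline = 0.0;
    double configured = 0.0;
    for (const StageTime& stage : kTable1) {
        baseline += stage.cpuSeconds;
        const bool runsOnCpu = stage.isFusion && !fusionOnGpu;
        configured += runsOnCpu ? stage.cpuSeconds : stage.gpuSeconds;
    }
    assert(configured > 0.0);
    return baseline / configured;
}
```

Under these assumptions the baseline adds up to 14.40 s, Config_1 to 2.83 s and Config_2 to 2.03 s, giving ratios of about 5.09 and 7.09. These agree with the reported x5.08 and x7.10 up to rounding of the table entries.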

“Our Image Stack Fusion approach gives a x9.21 Speedup compared to its ARM version”

The fusion runs on the reduced stack: Reduced Stack = [Sharpest - 3, Sharpest + 3].

Two different configurations were tested and compared to the baseline implementation on the quad-core ARM CPU, in which Vollath 4 alone takes 11.11 s (see Table 1, Figure 5 and Figure 6):

Config_1: The autofocus process was fully migrated to the GPU. The fusion process was partially migrated (stack fusion remains on the ARM CPU).

Config_2: Autofocus and Fusion were fully migrated to the GPU.

[Figure 5 pie chart, Config_1: Sharpness Time (GPU) 40%, Data Fusion (CPU) 32%, Others 14%, Sobel (GPU) 6%, Gaussian Filter (GPU) 6%, Maximum (GPU) 2%. Speedup: x5.08]

Figure 5: Config_1 profiling and speedup.

[Figure 6 pie chart, Config_2: Sharpness Time (GPU) 55%, Others 20%, Sobel (GPU) 9%, Gaussian Filter (GPU) 9%, Data Fusion (GPU) 5%, Maximum (GPU) 2%. Speedup: x7.10]

Figure 6: Config_2 profiling and speedup.

Figure 2: Diagram of the stages in the system. First, a Sobel and a Maximum filter are applied to all images. After that, Image Stack Fusion and a Gaussian Blur produce the final image.

Conclusions

“OpenCV4Tegra for GPU did not improve significantly in certain functions such as Maximum and Gaussian Blur”

GPU versions achieve improvements from 1.92x to 9.85x with respect to the ARM implementations.

Image Stack Fusion is highly parallelizable, and the fully customized kernel achieves a 9.21x speedup.

Poor performance was observed in OpenCV4Tegra for GPU in the Gaussian and Maximum filters (2.12x and 1.92x).

Next steps: data alignment to improve data access in the Image Stack Fusion kernel, and improvement of Vollath 4 processing using native CUDA code.

Bibliography

[1] J. L. Pech-Pacheco, G. Cristobal, J. Chamorro-Martinez and J. Fernandez-Valdivia, “Diatom autofocusing in brightfield microscopy: a comparative study,” Proceedings 15th International Conference on Pattern Recognition (Barcelona), 2000, vol. 3, pp. 314-317.

[2] M. Russel and T. Douglas, “Evaluation of autofocus algorithms for tuberculosis microscopy,” in Proc. 29th International Conference of the IEEE EMBS, Lyon, France, 3489–3492 (2007).

[3] A. Santos, C. Ortiz de Solorzano, J. J. Vaquero, J. M. Peña, N. Malpica, and F. del Pozo, “Evaluation of autofocus functions in molecular cytogenetic analysis,” Journal of Microscopy 188, 264–272 (1997).

[4] D. Vollath, “The influence of the scene parameters and of noise on the behavior of automatic focusing algorithms,” Journal of Microscopy 151 (1988).

[5] S. Cuenca-Asensi, “Enfoque automatico para microscopía con FPGAs”. JCRA 2011. La Laguna.

[6] S. Gupta. “Introduction to OpenCV for Tegra”. http://on-demand.gputechconf.com/gtc/2013/presentations/S3411-OpenCV-For-Tegra.pdf.

[7] N. Wilt. “The Cuda Handbook: a comprehensive guide to GPU programming”. Addison-Wesley. 2013.