Fast and Accurate Color Image Processing Using 3D Graphics Cards

Philippe COLANTONI†, Nabil BOUKALA‡, Jérôme DA RUGNA†

† Laboratoire LIGIV, Université Jean Monnet, 10, rue Barrouin, 42000 Saint-Étienne, FRANCE. colantoni|[email protected]

‡ Laboratoire DIPI, École Nationale d’Ingénieurs de Saint-Étienne, 58, rue Jean Parot, 42023 Saint-Étienne, FRANCE. [email protected]

Abstract

The aim of this paper is to demonstrate how the new technologies introduced in recent 3D cards can be used in color image processing and analysis. We used the latest programmability features available in 3D cards to implement and test five color image processing algorithms: local mean filtering, RGB to L∗a∗b∗ and RGB to HSV color space conversions, local principal component analysis, and anisotropic diffusion filtering. Using an nVIDIA™ NV30 graphics processing unit (GPU), we obtained, in most of the cases studied, faster results than with the tested processors (CPUs). We show that the GPU can be 10 times faster than the best CPU we tested in the case of per-pixel processing involving complex mathematical functions and vector calculations.

Introduction

Image processing and analysis are fields that naturally demand high performance and large computational capabilities. Most image processing algorithms require a large amount of computation and high precision. This is particularly true in color and multispectral image processing. Nowadays, real-time and high-frame-rate digital image processing is usually obtained with dedicated, expensive hardware, which alone provides enough computational capability to process streams of color images. Meanwhile, the latest 3D cards, which are now more affordable, include interesting features for image processing purposes. The purpose of this study is to show how to perform color image processing using this type of hardware technology. The second aim of this work is to evaluate the performance of 3D cards on several common algorithms, in order to study whether the use of such hardware is relevant for a given class of algorithms. This study is mainly based on the technologies available in the latest generation of chips.

Algorithms can be differentiated and classified by their structure and the programmability resources they require. The tested algorithms were carefully chosen to cover most types of implementations encountered in image processing, excluding those that cannot be applied to graphics hardware. Each proposed algorithm was implemented twice: one implementation makes use of the 3D card, and the other is optimized for the CPU. This allows us to compare and analyze the performance obtained from each version.

VMV 2003

Munich, Germany, November 19–21, 2003

Programmability

So far, 3D cards have always been designed to offer high performance in two main areas:

• arithmetic calculations on vectors and matrices, which are essential for the efficient computation of geometric transformations and color manipulations. In our study, colors are 3D or 4D vectors.
• the necessary accesses to different memory locations, such as the texture memory, the Z-buffer, and the frame buffer, which require a large memory bandwidth.

Due to the rigidity of the graphics pipeline, these capabilities could not be exploited for any purpose other than 3D rendering. Today, the programmability provided by the new generation of GPUs (graphics processing units) makes it possible to alter the two main engines of the pipeline: the transforming and lighting (T&L) engine and the multi-texturing engine.

Basically, the vertex processor allows the traditional transforming and lighting process to be replaced by a specific geometrical manipulation program, called a “vertex program”. A vertex program handles the 3D geometrical data (vertices) entering the graphics pipeline and outputs, for each vertex, the vertex itself and its parameters (color, texture coordinates, ...). Further down the pipeline, the fragment processor, usually in charge of the multi-texturing task, can instead run any complex color manipulation program, and thus processes each polygon pixel by pixel. This flexibility, especially at the multi-texturing stage, is essential for our field of study, since the main idea of our work is to use the fragment processor for image processing. This approach has already been partially explored by [1] in the context of color image segmentation based on basic programming methods. More recently, GPU computational capabilities have been used for other applications such as numerical computations [2, 3] and simulations.
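The per-pixel model described above can be illustrated with a small CPU-side sketch. It is only an analogy, not the authors' GPU code: `run_fragment_program` and `rgb_to_hsv_fragment` are hypothetical names introduced here, and the standard-library `colorsys` module stands in for a color conversion that the GPU would evaluate in a fragment program. The point it shows is that the same small program is applied to each pixel independently, with no dependence on the result of any other pixel.

```python
import colorsys


def run_fragment_program(image, fragment_program):
    """Apply fragment_program to every pixel independently.

    image is a list of rows; each row is a list of (r, g, b) tuples in [0, 1].
    No pixel's result depends on another pixel's result, so the calls could
    run in any order (or in parallel), as on a fragment processor.
    """
    return [[fragment_program(pixel) for pixel in row] for row in image]


def rgb_to_hsv_fragment(pixel):
    """Hypothetical 'fragment program': convert one RGB pixel to HSV."""
    r, g, b = pixel
    return colorsys.rgb_to_hsv(r, g, b)


# A tiny 2x2 test image: red, green, blue and mid-gray.
image = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 1.0), (0.5, 0.5, 0.5)],
]
hsv_image = run_fragment_program(image, rgb_to_hsv_fragment)
```

Because `run_fragment_program` never passes neighbouring results to the per-pixel function, any function written for it respects the ordering constraint of the fragment processor by construction.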

Image processing using the fragment processor

In addition to this programmability, recent 3D cards are particularly well suited to floating-point computations, which allows high-precision calculations. Moreover, off-screen rendering, a very useful feature, can be done with pbuffers, and mathematical functions implemented in hardware make the computation of complex mathematical expressions more efficient. The combination of these features leads us to think that this type of hardware is well suited to running image processing algorithms.

Figure 1: Image processing GPU implementation model

However, some types of image processing algorithms cannot be implemented because of the limitations of the fragment processor. These algorithms have therefore been excluded from our study. The first limitation is that a fragment program can only output, for a given pixel, its color and depth values. Secondly, vertex and fragment programs are restricted to a maximum number of instructions¹: 65536 for vertex programs and 1024 for fragment programs. Moreover, they cannot contain more than 256 loops, 256 constants and 16 temporary registers. Finally, the parallel architecture of the fragment processor enables several pixels to be processed at the same time. This improves the performance of the implemented algorithms, but it also prevents the processor from knowing or controlling the order in which pixels are processed. Therefore, the fragment processor is limited to per-pixel processing. Algorithms based on sequential scans of the image, where the processing of the current pixel needs the result obtained for a previous pixel (e.g. labeling or edge-following segmentation), are unsuitable for the GPU and can be implemented on such a parallel architecture only with difficulty, if at all. Consequently, they are excluded from our study.

¹ In our study we used the nVIDIA™ NV30 processor.

This severe limitation can, to some extent, be overcome. Indeed, as in 3D image rendering, a multi-pass approach can be used to process an image. Two pbuffers are needed: the first contains the data to be processed, and the second receives the processed data. Once a pass is finished, a new pass can be performed on the resulting image, with the roles of the two pbuffers exchanged, and so forth. Thanks to this method, an image can be processed several times, which makes the implementation of iterative algorithms possible. It should also be noted that the GPU and the CPU can work simultaneously, fully exploiting the system and giving better performance. The processing pipeline is used at its best when the CPU and GPU processing times are equal.

Consequently, most image processing algorithms can be implemented using 3D cards, such as filtering and most segmentation methods, including those based on thresholding and multiresolution approaches (e.g. those using splitting and merging processes). Likewise, most mathematical morphology methods can also be adapted without difficulty. Our field of interest concerns only color image processing algorithms.
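The two-pbuffer multi-pass scheme described in this section can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: two plain Python lists stand in for the pbuffers, a single-channel 3x3 local mean stands in for the per-pass fragment program, and the border handling (averaging only over the neighbours that exist) is a choice made here for the example. The names `local_mean_pass` and `multi_pass` are hypothetical.

```python
def local_mean_pass(src, dst):
    """One rendering pass: write the 3x3 local mean of src into dst.

    src and dst are 2D lists of floats (one channel, for brevity).
    Each output pixel reads only from src, never from dst, so the pixels
    can be processed in any order, as the fragment processor requires.
    Border pixels average over the neighbours that exist (an assumption
    of this sketch).
    """
    height, width = len(src), len(src[0])
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for ny in range(max(0, y - 1), min(height, y + 2)):
                for nx in range(max(0, x - 1), min(width, x + 2)):
                    total += src[ny][nx]
                    count += 1
            dst[y][x] = total / count


def multi_pass(image, passes):
    """Ping-pong between two buffers, standing in for the two pbuffers."""
    buf_a = [row[:] for row in image]            # holds the data to process
    buf_b = [[0.0] * len(row) for row in image]  # receives the result
    for _ in range(passes):
        local_mean_pass(buf_a, buf_b)
        buf_a, buf_b = buf_b, buf_a              # swap roles for the next pass
    return buf_a


image = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]
once = multi_pass(image, 1)
twice = multi_pass(image, 2)
```

Keeping reads and writes in separate buffers is what lets an iterative filter run on hardware that cannot guarantee pixel order: within one pass every pixel sees the same, unmodified input.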


to the same processing:

• GPU implementation

    num.xyz = Ixx.xyz*Iy.xyz*Iy.xyz
            - 2*Ixy.xyz*Ix.xyz*Iy.xyz
            + Iyy.xyz*Ix.xyz*Ix.xyz;
    denom.xyz = Ix.xyz*Ix.xyz + Iy.xyz*Iy.xyz;
    dst.xyz += alphag*(num.xyz/denom.xyz);
    return clamp(dst.xyz, 0, 1);

• CPU implementation for (int c=0;c1) value=1; else if (value