EASY-PIPE - An "EASY to use" Parallel Image Processing Environment based on algorithmic skeletons

Cristina Nicolescu and Pieter Jonker
Delft University of Technology, Faculty of Applied Physics, Pattern Recognition Group
Lorentzweg 1, 2628 CJ Delft, The Netherlands
email: [email protected]

Abstract

The paper presents an approach to using algorithmic skeletons to add data parallelism to an image processing library. The method is used to parallelize image processing applications composed of low-level image operators on a distributed-memory system. In this way, a user who wants to parallelize an image processing application is not involved in the design and implementation of parallel algorithms; the only task is to select, for each low-level operator, the appropriate skeleton to obtain the parallel version of the application. The multi-baseline stereo vision application is given as an example.

1. Introduction

Image processing is widely used in many applications, including the film industry, medical imaging, industrial manufacturing, and weather forecasting. In some of these areas the images are very large, yet the processing time has to be very small, and sometimes real-time processing is required. Therefore, during the last decade there has been increasing interest in the development and use of parallel algorithms in image processing. Many algorithms have been developed for parallelizing different image operators on different parallel architectures. Most of these parallel image processing algorithms are either architecture dependent or developed for specific applications, and they are very difficult to implement for an image processing user without sufficient knowledge of parallel computing. In this paper we present an approach to adding data parallelism to an image processing library using algorithmic skeletons [1, 3, 2]. Skeletons are algorithmic abstractions common to a series of applications, which are implemented

using a parallel programming approach. Skeletons are embedded in a sequential host language, thus being the only source of parallelism in a program. Using skeletons we create a data parallel image processing environment which is very easy to use for a typical image processing user. The environment is implemented in C using MPI [6] as the communication library, so all image processing applications are coded in C using skeletons. On top of the data parallel environment, the authors are also working on a task parallel framework, but that work is not described in this paper. The task parallel framework explains the additional task number parameter which appears in the header of each skeleton. The paper is organized as follows. Section 2 briefly presents the concept of algorithmic skeletons and a survey of related work. Section 3 identifies the skeletons for parallel low-level image processing on a distributed-memory system. The code with and without skeletons for multi-baseline stereo vision is presented for evaluation in Section 4. Section 5 concludes the paper.

2. Skeletons and related work

Skeletons are algorithmic abstractions which encapsulate different forms of parallelism common to a series of applications. The aim is to obtain environments or languages that allow easy parallel programming, in which the user does not have to deal with problems such as communication, synchronization, deadlocks, or non-deterministic program runs. Usually they are embedded in a sequential host language and are used to code and hide the parallelism from the application user. The concept of algorithmic skeletons is not new, and much research has been done to demonstrate their usefulness in parallel programming. Most skeletons are polymorphic higher-order functions and can be defined in functional languages

in a straightforward way. This is the reason why most skeletons are built on top of a functional language [1, 3]. Work has also been done on using skeletons in image processing. In [2] Serot et al. present a parallel image processing environment using skeletons on top of the CAML functional language. In this paper we develop algorithmic skeletons to create a parallel image processing environment ready to use for easy development and implementation of parallel image processing applications in the C programming language. The user of the parallel environment sees each skeleton as a higher-order function (template) representing a collection of algorithmic tools by means of which a specific image processing problem can be computed in parallel. The parallel environment designer sees each skeleton as a generic computational pattern for which an equivalent and efficient parallel implementation has to be defined. The skeletons are implemented in C using MPI as the communication library and can be used in a C/C++ programming environment.

3. Skeletons for low-level image processing

3.1 A classification of low-level image operators

Low-level image processing operators use the values of the image pixels to modify the image in some way. They can be divided into point operators, neighborhood operators, and global operators. Point operators depend only on the values of the corresponding pixels from the input image; their parallelization is simple and embarrassingly parallel. Neighborhood operators produce an image in which each output pixel depends on a group of neighboring pixels around the corresponding pixel from the input image. Operations like smoothing, sharpening, noise reduction, or edge detection are highly parallelizable. Global operators depend upon all the pixels of the input image and are also parallelizable.

1. Point operators

Image point operators are the most powerful functions in image processing, and a large group of operators falls into this category. Their main characteristic is that a pixel from the output image depends only on the corresponding pixel from the input image. Point operators are used to copy an image from one memory location to another, in arithmetic and logical operations, table lookup, and image compositing. We discuss arithmetic and logic operators in detail, classifying them by the number of images involved, since this is an important issue in developing skeletons for them.

Arithmetic and logic operations

Image ALU operations are fundamental operations needed in almost any imaging product for a variety of purposes. We refer to operations between an image and a constant as monadic operations, operations between two images as dyadic operations, and operations involving three images as triadic operations.

Monadic image operations

Monadic image operators are ALU operators between an image and a constant. These operations are shown in Table 1, where s(x, y) and d(x, y) are the source and destination pixel values at location (x, y), and K is the constant. Monadic operators may process the input image in place or may create a new output image, which is an important issue in the design and use of skeletons for such operators; see Section 3.2.

Table 1. Monadic image operations

Function             Operation
Add constant         d(x, y) = s(x, y) + K
Subtract constant    d(x, y) = s(x, y) - K
Multiply constant    d(x, y) = K * s(x, y)
Divide by constant   d(x, y) = s(x, y) / K
Or constant          d(x, y) = K or s(x, y)
And constant         d(x, y) = K and s(x, y)
Xor constant         d(x, y) = K xor s(x, y)
Absolute value       d(x, y) = abs(s(x, y))
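As a concrete illustration, a monadic operator such as "add constant" reduces to a single pixel-wise loop. The sketch below is hypothetical (the row-major buffer layout and the function name are ours, not the library's):

```c
/* Hypothetical monadic point operator: d(x,y) = s(x,y) + K over a
   row-major pixel buffer. Each output pixel depends only on the
   corresponding input pixel, so the loop is embarrassingly parallel. */
void add_constant(const int *src, int *dst, int width, int height, int K)
{
    for (int i = 0; i < width * height; i++)
        dst[i] = src[i] + K;
}
```

Because every iteration is independent, a data parallel skeleton can split this loop over processors without any inter-processor communication.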

Monadic operations are useful in many situations. For instance, they can be used to add or subtract a bias value to make a picture brighter or darker.

Dyadic image operators

Dyadic image operators are arithmetic and logical functions between the pixels of two source images producing a destination image. These functions are shown in Table 2, where s1(x, y) and s2(x, y) are the two source images used to create the destination image d(x, y). Dyadic operators may modify one of the input images or may create an additional output image.

Table 2. Dyadic image operations

Function   Operation
Add        d(x, y) = s1(x, y) + s2(x, y)
Subtract   d(x, y) = s1(x, y) - s2(x, y)
Multiply   d(x, y) = s1(x, y) * s2(x, y)
Divide     d(x, y) = s1(x, y) / s2(x, y)
Min        d(x, y) = min(s1(x, y), s2(x, y))
Max        d(x, y) = max(s1(x, y), s2(x, y))
Or         d(x, y) = s1(x, y) or s2(x, y)
And        d(x, y) = s1(x, y) and s2(x, y)
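Dyadic operators follow the same pixel-wise pattern with two source buffers. A hypothetical sketch of the Min entry of Table 2 (names and layout are ours):

```c
/* Hypothetical dyadic point operator from Table 2:
   d(x,y) = min(s1(x,y), s2(x,y)), again a fully independent
   pixel-wise loop over row-major buffers. */
void image_min(const int *s1, const int *s2, int *dst, int width, int height)
{
    for (int i = 0; i < width * height; i++)
        dst[i] = (s1[i] < s2[i]) ? s1[i] : s2[i];
}
```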

Dyadic operators have many uses in image processing. For example, the subtraction of one image from another is useful for studying the flow



of blood in digital subtraction angiography. Addition of images is a useful step in many complex imaging algorithms.

Triadic image operators

Triadic operators use three input images for the computation of an output image. An example of such an operation is alpha blending. Image compositing is a useful function for both graphics and computer imaging. In graphics, compositing is used to combine several images into one; typically, these images are rendered separately, possibly using different rendering algorithms or even different types of rendering hardware. In image processing, compositing is needed by any product that has to merge multiple pictures into one final image. All image editing programs, as well as programs that combine synthetically generated images with scanned images, need this function. In computer imaging, the alpha blend can be defined using two source images S1 and S2, an alpha image alpha(x, y), and a destination image D; see formula (1).

D(x, y) = alpha(x, y) * S2(x, y) + (1 - alpha(x, y)) * S1(x, y)    (1)

Another example of a triadic operator is the squared difference between a reference image and two shifted images, an operator used in the multi-baseline stereo vision application described in Section 4.

Table 3. Triadic image operations

Function       Operation
Alpha blend    d(x, y) = (1 - alpha(x, y)) * s1(x, y) + alpha(x, y) * s2(x, y)
Squared diff   d(x, y) = (ref(x, y) - s1(x, y))^2 + (ref(x, y) - s2(x, y))^2
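The squared-diff entry of Table 3 reappears in the stereo vision application of Section 4. A hypothetical pixel-wise sketch (buffer layout and names are ours):

```c
/* Hypothetical triadic point operator from Table 3:
   d(x,y) = (ref(x,y) - s1(x,y))^2 + (ref(x,y) - s2(x,y))^2
   Three input buffers, one output buffer, still one independent loop. */
void squared_diff(const double *ref, const double *s1, const double *s2,
                  double *dst, int width, int height)
{
    for (int i = 0; i < width * height; i++) {
        double d1 = ref[i] - s1[i];
        double d2 = ref[i] - s2[i];
        dst[i] = d1 * d1 + d2 * d2;
    }
}
```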

2. Local neighborhood operators

Neighborhood operators (filters) create a destination pixel based on criteria that depend on the source pixel and the values of the pixels in the "neighborhood" surrounding it. Neighborhood filters are widely used in computer imaging. They are used for enhancing and changing the appearance of images by sharpening, blurring, crispening edges, and removing noise. They are also useful in image processing applications

such as object recognition, image restoration, and image data compression. We define a filter as an operation that changes pixels of the source image based on their values and those of their surrounding pixels. Filters may be linear or nonlinear.

Linear filtering versus nonlinear filtering

Generally speaking, a filter in imaging refers to any process that produces a destination image from a source image. A linear filter has the property that a weighted sum of the source images produces a similarly weighted sum of the destination images. In contrast, nonlinear filters are somewhat more difficult to characterize, because the output of the filter for a given input cannot be predicted from the impulse response; nonlinear filters behave differently for different inputs.

Linear filtering using two-dimensional discrete convolution

In imaging, two-dimensional convolution is the most common way to implement a linear filter. The operation is performed between a source image and a two-dimensional convolution kernel to produce a destination image. The convolution kernel is typically much smaller than the source image. Starting at the top of the image (the top left corner, which is also the origin of the image), the kernel is moved horizontally over the image, one pixel at a time. Then it is moved down one row and moved horizontally again. This process continues until the kernel has traversed the entire image. For the destination pixel at row m and column n, the kernel is centered at the same location in the source image. Mathematically, two-dimensional discrete convolution is defined as a double summation. Given an M x N image f(m, n) and a K x L convolution kernel h(k, l), we define the origin of each to be at the top left corner and assume that f(m, n) is much larger than h(k, l). Then the result of convolving f(m, n) with h(k, l) is the image g(m, n) given by formula (2):

g(m, n) = sum_{x=0}^{K-1} sum_{y=0}^{L-1} f(m + (K-1)/2 - x, n + (L-1)/2 - y) * h(x, y)    (2)

The sequential time complexity of this operation is O(MNKL). As can be observed, this is a time-consuming operation, well suited to the data parallel approach.
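A direct C rendering of formula (2) makes the O(MNKL) cost visible as four nested loops. This sketch assumes zero padding outside the image border, a convention the paper does not specify:

```c
/* Naive 2D discrete convolution following formula (2): the K x L kernel h
   is centered on each destination pixel of the M x N image f. Out-of-range
   source pixels are treated as zero (one possible border convention).
   The four nested loops make the O(MNKL) cost explicit. */
void convolve2d(const double *f, double *g, int M, int N,
                const double *h, int K, int L)
{
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            double acc = 0.0;
            for (int x = 0; x < K; x++)
                for (int y = 0; y < L; y++) {
                    int r = m + (K - 1) / 2 - x;  /* source row    */
                    int c = n + (L - 1) / 2 - y;  /* source column */
                    if (r >= 0 && r < M && c >= 0 && c < N)
                        acc += f[r * N + c] * h[x * L + y];
                }
            g[m * N + n] = acc;
        }
}
```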

3. Global operators

Global operators create a destination pixel based on the entire image information. A representative example of an operator in this class is the Discrete Fourier Transform (DFT). The Discrete Fourier Transform converts an input data set from the temporal/spatial domain to the frequency domain, and vice versa. It has many applications in image processing, being used for image enhancement, restoration, and compression. In image processing the input is a set of pixels forming a two-dimensional function that is already discrete. The formula for the output pixel Xlm is the following:

Xlm = sum_{j=0}^{N-1} sum_{k=0}^{M-1} xjk * e^(-2*pi*i*(j*l/N + k*m/M))    (3)

where j and k are the pixel coordinates, 0 <= j <= N-1 and 0 <= k <= M-1.

We also include in the class of global operators operators like the histogram transform, which do not produce an image as output, but another data structure.

Figure 1. DCG skeleton for point operators

Figure 2. DCG skeleton for neighborhood operators

3.2 Data parallelism of low-level image operators

Point and neighborhood image processing operators can be parallelized using the data parallel paradigm with a master-slave approach. A master processor is selected to split and distribute the data to the slaves; the master also processes a part of the image. Each slave processes its received part of the image, and then the master gathers and assembles the image back. Figures 1, 2 and 3 show the data parallel paradigm with the master-slave approach for point, neighborhood and global operators. For global operators we send the entire image to the slaves, but each slave processes only a certain part of the image. In order to avoid extra inter-processor communication due to border processing for neighborhood operators, several partition schemes (row-stripe, column-stripe, block) are available. Figure 2 presents the extended row-stripe approach, in which the master processor adds additional rows to the image, according to the kernel (neighborhood operator) size, and distributes the image in such a way that each slave processor receives all the data needed to apply the neighborhood operator.
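The extended row-stripe partition can be sketched as a small helper that computes, for each processor, its stripe plus the border rows implied by the kernel height. The even partition and all names here are our assumptions, not the paper's API:

```c
/* Hypothetical helper behind the extended row-stripe scheme of Figure 2:
   compute the range of image rows (own stripe plus halo/border rows) that
   processor p out of nproc must receive so it can apply a neighborhood
   operator with a kernel_h-row kernel without further communication. */
void stripe_with_border(int height, int nproc, int kernel_h, int p,
                        int *first_row, int *last_row)
{
    int base = height / nproc, rem = height % nproc;
    int lo = p * base + (p < rem ? p : rem);      /* stripe start row */
    int hi = lo + base + (p < rem ? 1 : 0) - 1;   /* stripe end row   */
    int halo = kernel_h / 2;                      /* border rows per side */
    *first_row = (lo - halo < 0) ? 0 : lo - halo;
    *last_row  = (hi + halo > height - 1) ? height - 1 : hi + halo;
}
```

Because each processor already holds its halo rows, no border exchange is needed during the compute phase; only the initial distribute and final gather remain.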

Based on the above observations we identify a number of skeletons for the parallel processing of low-level image operators. They are named according to the type of the low-level operator and the number of images involved in the operation. Headers of some skeletons are shown below. All of them are based on a "Distribute, Compute and Gather" (DCG) main skeleton, previously known as the map skeleton [1, 3], suitable for regular applications such as the low-level image operators. The header of a skeleton contains information about the images which are going to be processed by an operator (given as image identifiers) and the name of the operator. If the operator needs extra parameters, these parameters are also given in the skeleton header. The skeletons are implemented taking into consideration the number of input/output images involved in the low-level operators. For instance, ImagePointDist2 can be a skeleton for some monadic image operators (which produce an additional output image) and it can also be a skeleton for some dyadic image operators (which modify one of the input images). Some other dyadic image operators may produce an additional output image, so they are processed using the ImagePointDist3 skeleton. The parallel environment contains a list of available

skeletons. Details and programming examples using the available skeletons are provided in a user's guide. Based on the description of the low-level image processing application to be implemented (which is actually the pseudocode containing the low-level image operators needed by that specific application), the user only has to select and compose the appropriate skeletons to implement the application.

/* dist.h */
...
void ImagePointDist(unsigned int n, char *name, int no_proc,
                    list_proc set, void (*im_op)());
// DCG skeleton for monadic point operators
void ImagePointDist_c(unsigned int n, char *name, int no_proc,
                      list_proc set, void (*im_op)(), void const);
// DCG skeleton for monadic point operators and additional parameter
void ImagePointDist2(unsigned int n, char *name1, char *name2,
                     int no_proc, list_proc set, void (*im_op)());
// DCG skeleton for monadic/dyadic point operators
void ImagePointDist3(unsigned int n, char *name1, char *name2, char *name3,
                     int no_proc, list_proc set, void (*im_op)());
// DCG skeleton for dyadic/triadic point operators
void ImagePointDist4(unsigned int n, char *name1, char *name2, char *name3,
                     char *name4, int no_proc, list_proc set, void (*im_op)());
// DCG skeleton for triadic point operators
...
void ImageWindowDist(unsigned int n, char *name, Window *win, int no_proc,
                     list_proc set, void (*im_op)());
// DCG skeleton for neighborhood operators
void ImageGlobalDist(unsigned int n, char *name, int no_proc, list_proc set,
                     void (*im_op)());
// DCG skeleton for global operators
...

With each skeleton we associate a parameter which represents the task number corresponding to that skeleton. This parameter is used by the task parallel framework, which is not described in this paper, and has no meaning if only the data parallel framework is used. Depending on the skeleton type, one or more image identifiers are given as parameters. Then we must supply the number of processors on which the skeleton will be executed. Usually, for the data parallel framework, this number is equal to the number of processors on which we run the application, but for the task parallel framework the number is different. The next parameter is the set of processors on which the skeleton is executed. Again, if only the data parallel framework is used, the set of processors usually consists of all the processors on which we want to run the application, but in combination with the task parallel framework the set is different. It should be noted that the user may choose to use fewer processors than the available ones. The next argument is the point operator for processing the image(s). Depending on the operator type and the skeleton type, there may be additional parameters necessary for the image operator.

Figure 3. DCG skeleton for global operators

4. Multi-baseline stereo vision application coded with skeletons

The multi-baseline stereo vision application uses an algorithm developed by Okutomi and Kanade [4] and described by Webb et al. [7, 8], which gives greater accuracy in depth through the use of more than two cameras. The input consists of three n x n images acquired from three horizontally aligned, equally spaced cameras. One image is the reference image; the other two are called match images. For each of 16 disparities, d = 0, ..., 15, the first match image is shifted by d pixels and the second match image is shifted by 2d pixels. A difference image is formed by computing the sum of squared differences between the corresponding pixels of the reference image and the shifted match images. Next, an error image is formed by replacing each pixel in the difference image with the sum of the pixels in a surrounding 13 x 13 window. A disparity image is then formed by finding, for each pixel, the disparity that minimizes the error. Finally, the depth of each pixel is displayed as a simple function of its disparity. Below is the pseudocode of the application; Figure 4 illustrates the data-flow graph of the program.

Input: ref, m1, m2 (the reference and the two match images)
for d = 0, ..., 15
    Task T1,d: m1 shifted by d pixels
    Task T2,d: m2 shifted by 2*d pixels
    Task T3,d: diff = (ref - m1)*(ref - m1) + (ref - m2)*(ref - m2)
    Task T4,d: err(d) = sum diff[i,j]
Task T5: disparity image = d which minimizes the err image

Pseudocode of the multi-baseline stereo vision application

Figure 4. Multi-baseline stereo vision (data-flow graph: task 0 broadcasts ref; tasks 1-16 compute diff0-diff15; tasks 17-32 compute err0-err15; task 33 reduces them to the disparity image)
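For reference, tasks T1,d-T3,d of the pseudocode can be sketched in C on a toy 1-D signal. The shift direction and the zero padding at the border are our assumptions; the real application works on n x n images:

```c
/* Toy 1-D sketch of tasks T1,d-T3,d for a single disparity d: shift the
   two match images by d and 2d pixels and form the squared-difference
   image diff = (ref - m1)^2 + (ref - m2)^2. */
void diff_for_disparity(const double *ref, const double *m1, const double *m2,
                        double *diff, int n, int d)
{
    for (int i = 0; i < n; i++) {
        double a = (i + d < n) ? m1[i + d] : 0.0;         /* T1,d: shift by d  */
        double b = (i + 2 * d < n) ? m2[i + 2 * d] : 0.0; /* T2,d: shift by 2d */
        double e1 = ref[i] - a, e2 = ref[i] - b;
        diff[i] = e1 * e1 + e2 * e2;                      /* T3,d */
    }
}
```

Running this for all 16 disparities and summing diff over a window per pixel (task T4,d) yields the error values from which T5 picks the minimizing d.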

It can be observed that the computation of the difference images requires point operators, while the computation of the error images requires neighborhood operators. The computation of the disparity image also requires a point operator. Below we present the sequential code of the application versus the parallel code. Coding the application by just combining a number of skeletons does not require much effort from the image processing user, yet it parallelizes the application. Besides creating the images on the master processor, the code is the same; only the function headers differ. Skeleton headers have as parameters the names of the images, the window and the image operator, while in the sequential version the operator headers have as parameters the images and the window. There are 14 lines in the sequential program and 16 lines in the skeleton program. The difference comes from the fact that in the skeleton version the images are created only on the MASTER processor and then processed via skeletons by all the processors. Although it is faster to create the images on all the processors, since this avoids part of the communication, we prefer to create the images on a master processor because we further use the data parallel framework together with a task parallel framework, which is not presented in this paper. The skeletons are implemented in C using MPI [6]. Figure 5 presents the output of the upshot visualization tool for the stereo vision algorithm executed on 4 processors, while Figure 6 shows the upshot output of the same application on 4 processors but with the two image point operators (Image2DifSqr and ImageMin) executed on only one processor (see the last part of the code). It can be observed that the second approach is faster. In the first case, it can be noticed that our implementation of skeletons is not efficient if simple point operators are applied in sequence to the same image, because of too much communication (

for each operator the master processor has to distribute the data to the processors and then collect the data from them). Communication can be reduced by executing the corresponding skeletons on a single processor, or by developing more complex skeletons which can process sequences of image operators. We prefer the first approach, executing simple image point operators on a single processor and more complicated low-level image operators on all the available processors, because our task parallel framework is based on this type of implementation. Our implementation is efficient for neighborhood and global operators. In Figure 7 we present the speed-ups of the multi-baseline stereo vision with the point operator skeletons running on all the available processors versus the point operator skeletons running on only one processor. In the first approach, the speed-ups are lower because of the communication involved in the skeletons corresponding to point operators.

........
ref = CreateImage(256,256,"ref",PAR_DOUBLE);
InitImage_ij(ref);
m1 = CreateImage(256,256,"m1",PAR_DOUBLE);
InitImage_ij(m1);
m2 = CreateImage(256,256,"m2",PAR_DOUBLE);
InitImage_ij(m2);
min = CreateImage(256,256,"min",PAR_DOUBLE);
InitImageZero(min);
for(k=0;k
