A Software Architecture for Application Driven High Performance Image Processing

D. Koelma^a, P.P. Jonker^b, and H.J. Sips^c

^a Intelligent Sensory Information Systems, University of Amsterdam
^b Pattern Recognition Group, Delft University of Technology
^c Department of Computer Science, Delft University of Technology

ABSTRACT

The paper introduces a software architecture to support a user from the image processing community in the development of time-constrained image processing applications on parallel computers. The architecture is based on abstract data types with a well defined interface. The interface separates an application from the actual hardware used. On the application side of the interface the programmer is presented with a familiar (sequential) programming model. On the hardware side of the interface detailed knowledge of a parallel machine may be employed to arrive at efficient implementations of basic functionality. Knowledge of both suitable data distributions for images and performance characteristics of operations on those images allows for automated selection of an appropriate data distribution scheme throughout the application. Experiments show that with little effort reasonable levels of efficiency and scalability are achieved on a 32-node MIMD architecture.

Keywords: software development, parallelization of abstract data types, time-constrained image processing

1. Introduction

The high computational requirements and the time constraints associated with many image processing applications have resulted in considerable interest in the utilization of parallel computing resources. Still, parallel computing is not employed by the image processing community at large. In our view, the most important bottleneck is the lack of suitable tools for software development. The paper addresses the design of such tools specifically tuned for the development of time-constrained image processing applications.

The intended user of the proposed software architecture stems from the image processing community. To our user, parallel computing is a means to achieve improved performance. It is not a research issue in itself. As a consequence, the user is not willing to invest the effort needed to master the art of parallel programming. The ideal solution is to offer an environment that parallelizes an application automatically. Despite considerable effort and progress, the current state of the art in parallelizing compiler technology has not achieved a solution for the problem in general. Here, we present an intermediate solution in the form of a software library with a set of parallelized operations on specific abstract data types. It is our goal to eventually incorporate the knowledge of the abstract data types and operations within a compiler to offer automatic parallelization for the specific application domain considered here.

The paper is organized as follows. In Section 2, we identify abstract data types and operations that constitute the basic elements of time-constrained image processing applications. Section 3 demonstrates the parallel programming model used to implement the abstract data types and operations on a variety of hardware platforms. The performance of each operation is modeled to select an appropriate image data distribution scheme and to incorporate time constraints.
In Section 4 the performance model is validated with experiments in a prototype version of the software architecture. Related work is discussed in Section 5.

Other author information: (Send correspondence to D.K.) D.K.: Intelligent Sensory Information Systems, Department of Computer Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands; e-mail: [email protected] P.P.J.: Pattern Recognition Group, Delft University of Technology, Lorentzweg 1, 2628 CJ Delft, The Netherlands; e-mail: [email protected] H.J.S.: Department of Computer Science, Delft University of Technology, Zuidplantsoen 4, 2628 BZ Delft, The Netherlands; e-mail: [email protected]

2. Data concepts

An important issue in our work is to develop abstract data types for key concepts in the development of an application. In the application domain considered here we have identified the required data types to be images, image sequences, and scale spaces. For each of the abstract data types we develop an interface to arrive at a more formal description of their properties and permissible operations.

We consider an image to be a set of pixels. Usually, the set of points is a discrete subset of the n-dimensional Euclidean space with n = 1, 2, or 3. Also, the point set has a rectangular shape. The value of a pixel is a scalar value or a vector of n scalar values (usually n = 2 or 3). A scalar value is represented by a k-bit integer value, a k-bit floating point value, or a complex number.

Operations from the AFATL Image Algebra [1] constitute the basic set of operations on images. The operations include unary pixel operations, binary pixel operations, reduce operations, template operations, (iterative) neighbourhood operations, and geometric operations. The set of operations is readily extended with data driven detection operations from other sources, e.g. [2]. In general, we aim to select operations that have a generic nature.

The use of image sequences and scale spaces has a dual nature. On the one hand, an image sequence or scale space is regarded as a set of images. On the other hand, an image sequence or a scale space is regarded as an image of a higher dimension, e.g. a sequence of two-dimensional images is regarded as a three-dimensional image. As is to be expected from the latter use, basic operations on image sequences and scale spaces include the basic operations on images. Additional operations are required to support the former use, i.e. operations to manipulate individual elements of the set of images.

An important aspect of the use of image sequences is that usually not all data is available at the construction of a sequence. Also, only a small selection of all images (a window) in a sequence is normally used in the computational process. The images in the window comprise the input data of a step in the computation, possibly extended with results from previous computation steps. Upon completion of a step the window advances one or more time steps in the sequence. Thus, the computation iterates over the whole sequence.

The interface of the abstract data types constitutes a dividing line between an application and the data as depicted in Figure 1. The advantage of the separation is that the application programmer is able to process images etc. on a higher level of abstraction, i.e. without actually having to think about individual pixel values and where and how they are stored.
At the same time, the user is able to test the validity and applicability of an application on a wide variety of image data types. The purpose of the separation is also to reduce the number of dependencies between software libraries. Multiple software libraries are inevitable as it is practically not feasible to design an image processing library that fulfills everybody's needs.

Our definition of abstract data types stands somewhere between the functionality required by the user and the properties of the actual hardware platform. The interface aims to define a set of operations that constitutes the basis of image processing in any application domain. Applications should not have to come closer to the data on a specific hardware platform than this set of operations. The interface definition allows software library designers on both sides (see Figure 1) to rely on a clear and usable intermediate stage. On the one side, software libraries for specific application domains can add semantics to the data and introduce additional functionality native to their domain. On the other side, operations in the interface may be implemented using existing software packages such as the IUE [3]. Operations may also be optimized for use on specific hardware platforms following a specific programming paradigm such as Bucket Processing [4,5]. The separation of concerns enforced by the explicit definition of concepts should facilitate software development.

To support implementation of image processing operations on a variety of hardware platforms an extra layer is defined between the abstract data types and the actual hardware. The layer consists of three data types: ImageSequenceCpp, ScaleSpaceCpp, and ImageCpp (see Figure 1). The definition of these data types and their operations is geared towards efficiency instead of functionality, as with the definition of the abstract data types. The ImageCpp concept is a cornerstone of the software architecture.
A key aspect here is the handling of border effects in an image processing operation. Typically, dealing with the effects the presence of a border has on the algorithm is what makes the implementation of an operation difficult. Also, it usually results in a lot of extra code.

Figure 1. A view on the use of image processing libraries (see text). [Diagram: application-domain libraries (tensor image, document image, medical image, MPEG IV) built on the Image, ImageSequence, and ScaleSpace interfaces; below these the ImageCpp, ImageSequenceCpp, and ScaleSpaceCpp layer, with IUE and BP as implementation options; at the bottom, data in memory on a sequential machine, data in shared memory, and data in distributed memory.]

In our approach border handling is taken care of by the software architecture. Operations defined for the ImageCpp concept do not have to do any border handling themselves. Instead, each operation is annotated with its border size, for example defined by the size of the neighbourhood in a template operation or the transformation parameters in a geometric operation. Prior to the execution of an operation the infrastructure assures an appropriate border is set around the image. An operation then has to compute results for the interior of an image only. Typically, the overhead in copying to set a border is outweighed by the increase in efficiency due to simplification of the algorithm. The user is to benefit from decoupling border handling and operations in that the implementation of operations becomes easier and less error prone. Also, it allows for the distributed execution of operations described in the next section.
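The border mechanism can be sketched as follows. This is an illustrative sketch only: `padWithBorder` and the border replication policy are assumptions for the example, not the actual ImageCpp API.

```cpp
#include <algorithm>
#include <vector>

// Sketch: given an N x N image and a border size V taken from an
// operation's annotation (e.g. V = (W - 1) / 2 for a W x W kernel),
// build a "frame": the image surrounded by a replicated border.
// The operation itself then only computes results for interior pixels.
std::vector<double> padWithBorder(const std::vector<double>& img, int N, int V) {
    int M = N + 2 * V;                      // padded side length
    std::vector<double> frame(M * M);
    for (int y = 0; y < M; ++y) {
        for (int x = 0; x < M; ++x) {
            // clamp to the nearest image pixel (border replication)
            int sy = std::min(std::max(y - V, 0), N - 1);
            int sx = std::min(std::max(x - V, 0), N - 1);
            frame[y * M + x] = img[sy * N + sx];
        }
    }
    return frame;
}
```

An operation annotated with border size V then loops over y and x in [V, V + N) only, with no special cases at the image edge.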

3. Programming and performance model for parallel execution

In this section we describe the programming model employed in the software architecture below the dividing line in Figure 1. The model is used to implement operations of the abstract data types on a parallel hardware platform. The basic elements of the model are the operations defined for the ImageCpp concept and the image data distribution algorithms of the infrastructure. Each operation is annotated with a function to model its performance. The programming model is illustrated with the development of a sample operation: an iterative filter operation with kernel size W × W on an N × N image. The pseudo code for the operation is given in Listing 1.

    ImageCpp image(N, N);             // allocate image data, for host only
    ImageCpp block;                   // each process has its own image data block to process
    ImageCpp frame;                   // a frame is a block with a border around it
    Kernel kernel(W, W);
    Boolean workToDo = true;
    // initialization omitted

    block.getDataFromHost();          // scatter the image data
    frame.copyFromBlock(block);       // allocate space for border
    while (workToDo) {
        frame.communicateBorder();    // send and receive data of border shared with neighbouring processes
        frame.updateBorder();         // update the remainder of the border
        frame.convolution(kernel);    // do the actual image processing work
        workToDo = communicateDecision(); // tell all processes whether a next iteration is necessary
    }
    block.copyFromFrame(frame);       // remove border
    image.getDataFromProcs();         // gather the image data

Listing 1: Pseudo code for the parallel iterative filter operation.

All operations except for convolution are obtained from the infrastructure. In the actual implementation the combination of getDataFromHost and copyFromBlock is a single operation (the same holds for the combinations communicateBorder - updateBorder and copyFromFrame - getDataFromProcs). Here, they are listed separately to indicate all computation and communication steps in order.

In order for a computational task to be executed in parallel on P processors, computation and data need to be decomposed into P (or more) tasks. For operations on images two natural decomposition approaches come to mind: one- and two-dimensional block-wise decomposition (see Figure 2). Comparing the decompositions we note that in both cases the computational task of a processor is the same: Tcomp = Tseq / P (with P ≤ N). The difference in efficiency is determined by the size of the border that needs to be communicated to keep the image data coherent. For two-dimensional images using a one-dimensional decomposition the time needed to communicate the border data is:

    Tcomm border 1d = 2 ts + 2 N V tw

with V = (W − 1)/2, V ≤ N/P, ts the startup time for a message, and tw the transfer time per word (a pixel value). For a two-dimensional decomposition the time is:

    Tcomm border 2d = 8 ts + 4 (N V / √P + V²) tw

In general, a one-dimensional decomposition is more efficient when the image is small and a small filter is used in the operation. For larger images and especially 3D images the two-dimensional decomposition is more efficient. Also, the two-dimensional decomposition is a more scalable solution as more processors can be assigned to one image than with the one-dimensional decomposition (N/√P rows per processor versus N/P). However, using a two-dimensional decomposition puts more constraints on the interconnection network of the architecture in that each processor must have a direct connection to its eight neighbours in order for the given formula to hold. With a one-dimensional decomposition only two direct neighbours are required.

The relative efficiency of one- and two-dimensional image decomposition is apparent from their quotient. The quotient is plotted in Figure 2 using the hardware parameters of our target machine (see next section). The graph shows that in all cases considered here the one-dimensional decomposition is more efficient than the two-dimensional decomposition. As the one-dimensional decomposition also puts fewer constraints on the hardware architecture, and we need our software to be portable to as many architectures as possible, the one-dimensional decomposition is used from now on. A fortunate side-effect is that a one-dimensional decomposition reduces the complexity of the programming and performance models.
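The two communication terms can be compared directly. A minimal sketch of the formulas above, using the machine parameters measured later in Section 4 (ts = 225 µsec, tw = 0.1 µsec); the function names are illustrative:

```cpp
#include <cmath>

// Border communication time models (all times in microseconds).
// ts: message startup time, tw: transfer time per pixel, V = (W - 1) / 2.
double tCommBorder1d(double N, double V, double ts, double tw) {
    return 2.0 * ts + 2.0 * N * V * tw;       // two messages of N*V pixels each
}
double tCommBorder2d(double N, double V, double P, double ts, double tw) {
    return 8.0 * ts + 4.0 * (N * V / std::sqrt(P) + V * V) * tw;  // eight neighbours
}
```

For N = 256, W = 3 (V = 1), and P = 16 the one-dimensional cost is about 501 µsec against roughly 1826 µsec for the two-dimensional case: the message startup term dominates, which is why the one-dimensional decomposition wins in Figure 2.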

Figure 2. Left: One- and two-dimensional block-wise decomposition of images. Right: Relative efficiency (Tcomm border 2d / Tcomm border 1d) of one- and two-dimensional image decomposition using hardware parameters of the target machine, plotted against the number of processors (1-32) for N = 256 and 1024 with W = 3, 9, and 21.

The first step in the sample application is to distribute the image data across the processors. At startup, the image data is assumed to be present on one of the processors. The time needed to scatter the image data by means of a hypercube algorithm is:

    Tscatter = ts log2 P + N² tw

In the second step (copyFromBlock) each process puts a border around its part of the image data. The time needed is:

    Tblock to frame = (N² / P) tcopy

with tcopy the time needed to perform a local copy of one pixel. Prior to each iteration in the actual image processing phase the border of a block needs to be updated. The time needed to communicate the top and bottom border with neighbours has already been given. The time needed to update the left and right border of a block including the corners is:

    Tupdate border = 2 (V N + N V / P + 2 V²) tcopy

In the actual image processing phase two versions of the filter operation are considered: a separable and a non-separable template operation with neighbourhood size W × W. In a template operation each pixel value in the resulting image depends upon the W × W surrounding pixel values. For a separable filter operation the computation is performed by two consecutive W × 1 operations, one in the horizontal direction and one in the vertical direction. The computational complexity of one iteration is

    Tsep = (2 N² W / P) top

for a separable filter operation and

    Tnon sep = (N² W² / P) top

for a non-separable operation, with top the time needed to process one pixel. In a template operation this typically includes a multiplication, an addition, and pointer increment operations. The time needed to communicate the decision to continue after each iteration is:

    Tdecision = 2 (ts + tw) log2 P

The final stages of the application (copying data from frame to block and gathering of all the image data by the host) require the same amount of time as the corresponding stages in the initialization phase. As shown in the development of the sample application, each operation is annotated with a function to model its performance. The functions incorporate a number of hardware architecture dependent parameters. Still, the model should be widely applicable as it is based on assumptions valid for most common parallel hardware architectures. The performance functions for a specific parallel computer are easily established by running a number of experiments to determine the appropriate values for the function parameters.
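Putting the terms together, the total time of the sample operation under the one-dimensional decomposition can be sketched as below. This is our assembly of the formulas in this section for the separable filter; `predictTime` and `MachineParams` are illustrative names, and the final copy and gather stages are charged at the same cost as their initialization counterparts, as stated above.

```cpp
#include <cmath>

// Sketch of the end-to-end performance model for the separable iterative
// filter of Listing 1 under 1D decomposition. All times in microseconds.
struct MachineParams { double ts, tw, tcopy, top; };

double predictTime(double N, double W, double P, double I, MachineParams m) {
    double V = (W - 1.0) / 2.0;
    double scatter      = m.ts * std::log2(P) + N * N * m.tw;      // hypercube scatter
    double blockToFrame = (N * N / P) * m.tcopy;                   // copy block into frame
    double commBorder   = 2.0 * m.ts + 2.0 * N * V * m.tw;         // top/bottom exchange
    double updateBorder = 2.0 * (V * N + N * V / P + 2.0 * V * V) * m.tcopy;
    double compute      = (2.0 * N * N * W / P) * m.top;           // separable filter
    double decision     = 2.0 * (m.ts + m.tw) * std::log2(P);      // continue or stop
    double iter = commBorder + updateBorder + compute + decision;
    // gather and frame-to-block cost the same as scatter and block-to-frame
    return 2.0 * (scatter + blockToFrame) + I * iter;
}
```

With the parameters measured in Section 4 (ts = 225, tw = 0.1, tcopy = 0.21, top = 0.31 for a separable filter with W = 3), the model reproduces the sub-linear speedup discussed there: going from one to four processors helps, but by less than a factor of four because of the scatter and gather terms.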

4. Experiments

To validate the performance model a number of experiments were run on a 32-node Parsytec CC-system using a prototype version of the software architecture. The prototype version has been implemented in C/C++ and PVM. The implementation does not use the group communication functions of PVM (pvm_scatter, pvm_gather, and pvm_reduce) as these employ a very simple algorithm requiring a communication time linear in the number of processors. Instead, the software infrastructure is extended with group communication functions based on a hypercube algorithm. The group communication functions are implemented on top of PVM to keep the software portable.

By running these algorithms with a comprehensive set of values for the application parameters N, W, and P the values for the machine parameters ts and tw were experimentally determined to be 225 µsec and 0.1 µsec. Although the CC-system does not have a hypercube architecture (it is a "fat mesh of Clos"), the performance model fits rather well (Figure 3). The actually measured execution times for the sample operation from Listing 1 (the markers) are very close to the execution times predicted by the performance model (the lines). For larger values of the filter width W and the number of iterations I the prediction is more accurate.

For a non-separable template operation the value of top is constant in the number of processors P and the image size N but depends upon the kernel width W (see Table 1). The effect is due to cache influences. For a separable template operation the value of top is less dependent upon W. Also, the value changes for a combination of a large P and a small N, i.e. when a processor has very little to do. The value of tcopy is fairly constant at 0.21 µs.

With the performance function parameters determined we are able to model the performance of the sample application as already indicated in Figure 3. Next, we analyze the effects of the application parameters N, W, and I on the estimated performance.
An impression of the speedup for typical N is given in Figure 4. Although one might expect higher speedups for a non-separable operation than for a separable operation due to its higher computational requirements, the graph shows that this is not the case. The disturbing factor here is top.

Figure 3. Measured and predicted absolute times (µsec) for the iterative (non-separable) filter operation from Listing 1 with W = 3 and I = 1, for N = 128, 256, 512, and 1024 on 1-32 processors.

     W | non-separable top (µsec) | separable top (µsec) | separable, large P / small N top (µsec)
    ---+--------------------------+----------------------+----------------------------------------
     3 |                     0.21 |                 0.31 |                                    0.36
     5 |                     0.12 |                 0.25 |                                    0.32
     9 |                    0.073 |                 0.22 |                                    0.30
    15 |                    0.053 |                 0.21 |                                    0.25
    21 |                    0.044 |                 0.20 |                                    0.23

Table 1. top for typical values of W.

Figure 4. Speedup for the sample operation from Listing 1 according to the performance model with W = 3 and I = 1, for separable and non-separable filters with N = 256, 512, and 1024.

Figure 5. The effect of W (3, 5, 9, 15, 21) using a separable filter with N = 256 and I = 1.

Figure 6. The effect of I (1, 2, 5, 10) using a separable filter with N = 256 and W = 3.

The influence of W is shown in Figure 5. As is to be expected, the increased computational effort induced by a higher value of W improves the speedup. Using other values for N and/or a non-separable filter yields very similar graphs. The effect of I is shown in Figure 6. Again, the graph is as expected and does not depend upon the value of N or the type of filter used.

The scalability of the sample operation is rather poor. For W = 3 and I = 1 the efficiency drops below 50% when using more than 6 processors. The cause is the scattering and gathering of image data. This is demonstrated in Figure 7, where the speedup of the sample operation is depicted but this time assuming that the image data is already present on the processor array. Comparing Figure 7 to Figure 4 the increase in speedup is obvious. Clearly, it is vital to keep the image data on the processor array as long as possible without communication to the host.

The graphs shown thus far are based on the lower bounds of the application parameters N, W, and I. In a sense, the graphs depict worst case behaviour for the speedup of image processing applications. To assess average behaviour we analyze the performance of a computation that forms a basis for many image processing applications: the differential structure of images, see e.g. [6]. Examples are edge detection (based on first or second order derivatives) and invariants (based on n-th order derivatives). Here, we limit ourselves to the use of first and second order derivatives: Lx (first order derivative in the x-direction), Ly (first order derivative in the y-direction), Lxx (second order derivative in the x-direction), Lyy, and Lxy. A derivative is best computed using convolution with a Gaussian kernel [7]. The kernel width W depends upon σ (the smoothing factor) and the order of the derivative. A conservative estimate for the average kernel width using only first and second order derivatives is W = 15.
The speedup obtained in computing the five derivatives in sequence (I = 5) on an image obtained from a normal camera (N = 512), assuming the image data is already distributed over the processors, is depicted in Figure 8. For 32 processors the speedup is 30 (efficiency is 94%), indicating near-linear speedup.

Figure 7. Speedup assuming image data is already distributed, using a separable filter with W = 3 and I = 1, for N = 128, 256, 512, and 1024.

Figure 8. Speedup of the computation of the first and second order derivatives using a separable Gaussian filter with W = 15 on a 512 × 512 image (I = 5).

5. Discussion and related work

An outline of a software architecture for the development of time-constrained image processing applications on a parallel computer has been presented. A key aspect of the architecture is the definition of abstract data types with a well defined interface. The interface offers a user a sequential programming model for the development of applications. The software infrastructure takes care of the parallelization of operations on the abstract data types with reasonable performance. The current implementation offers parallelization of a set of basic operations on images as defined in [1]. Further research is needed to determine whether the set of operations suffices to implement all kinds of image processing applications. With the set of basic operations defined we will investigate the effect of concatenation of basic operations. And, we have to define the set of basic operations on image sequences and scale spaces.

It is often stated, e.g. in [8], that one of the main problems of using a parallel system is the writing of parallel code. Therefore, we consider offering a user a familiar (sequential) programming model a prerequisite for any parallel software development environment to suit an application domain outside the core of parallel computing. Somewhat surprisingly, this topic has received little attention in the literature regarding software environments for high-performance image processing. The work on the Image Understanding Architecture [9] is related to our work in that it offers the user a software library with a sequential programming model. However, their software library is based on a special hardware architecture whereas our work is virtually hardware independent. The project from [10] has practically the same goals as we do. They also state [11] that programmability and application-driven design are very important in the development of software environments for high-performance computing.
Their work differs in that a poly-algorithmic approach is followed [12] to achieve scalability in performance. The approach requires a much larger effort in coding as multiple versions of each algorithm need to be implemented. In our approach we trade a small loss in performance for increased portability and a significant reduction in implementation effort. We feel that our approach brings the performance benefits of parallel computing to users in the image processing community without effort. Thus, it may bridge the gap between high performance computing and image processing.

ACKNOWLEDGEMENTS

This work was supported by ASCI project A Programming Environment for High Performance Image Processing Applications. The project is a collaboration with R.L. Lagendijk (Department of Electrical Engineering, Delft University of Technology) and A.W.M. Smeulders (Intelligent Sensory Information Systems, Department of Computer Science, University of Amsterdam).

REFERENCES

1. G. Ritter and J. Wilson, Handbook of Computer Vision Algorithms in Image Algebra, CRC Press, 1996.
2. D. Noguet, A. Merle, and D. Lattard, "A data dependent architecture based on seeded region growing strategy for advanced morphological operations," in Mathematical Morphology and its Applications to Image and Signal Processing, Maragos et al., ed., Kluwer, 1996.
3. J. L. Mundy and IUE Committee, "The Image Understanding Environment: overview," in Proceedings of the Image Understanding Workshop, pp. 283-288, 1993.
4. J. Olk and P. Jonker, "Bucket Processing: A paradigm for image processing," in Proceedings of the 13th International Conference on Pattern Recognition, vol. IV, pp. 386-390, 1996.
5. J. Olk, P. Jonker, A. Pugh, and P. Baglietto, "The Bucket Processing Paradigm," Deliverable Bn1, Esprit LTR project 8849, SIMD-MIMD systems applied to image processing, 1996. Available on request to [email protected].
6. B. M. Ter Haar Romeny, L. M. J. Florack, A. H. Salden, and M. A. Viergever, "Higher order differential structure of images," Image and Vision Computing 12(6), pp. 317-325, 1994.
7. J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence 8, pp. 679-698, June 1986.
8. P. Hatcher and M. Quinn, Data-Parallel Programming on MIMD Computers, The MIT Press, 1991.
9. C. Weems et al., "Status and current research in the Image Understanding Architecture program," in Proceedings of the Image Understanding Workshop, pp. 1133-1140, 1993.
10. L. Jamieson et al., "Parallel scalable libraries and algorithms for computer vision," in Proceedings of the 12th International Conference on Pattern Recognition, pp. 223-228, 1994.
11. L. Jamieson, S. Hambrusch, A. Khokhar, and E. Delp, "The role of models, software tools, and applications in high performance computing," in Developing a Computer Science Agenda for High-Performance Computing, U. Vishkin, ed., pp. 90-97, ACM Press, 1995.
12. L. Jamieson, A. Khokhar, and J. Patel, "Algorithm scalability: A poly-algorithmic approach," in Proceedings of the 1995 International Conference on Parallel Processing, pp. 90-95, 1995.
