PARALLEL IMAGE PROCESSING IN HETEROGENEOUS COMPUTING NETWORK SYSTEMS Prachya Chalermwat, Nikitas Alexandridis, Punpiti Piamsa-Nga, and Malachy O'Connell

Parallel and Distributed Computing Laboratory, The George Washington University, 2020 K St. Suite 300, Washington DC 20052, USA
E-mail: [email protected]

ABSTRACT

We present a uniform construct of parallel programming for a set of image processing tasks based on our Distributed Computing Primitive (DCP) concept. Our target architecture is a heterogeneous computing network system consisting of various high performance workstations connected through a local area network. We show that DCP has advantages over non-primitive PVM-based parallel programs in three aspects: ease-of-use, automation, and optimization.

[To appear in Proceedings of the IEEE International Conference on Image Processing, 1996, Switzerland.]

1. INTRODUCTION

Most image processing algorithms are computationally intensive and require significant computing power that cannot be provided by a uniprocessor. Parallel architectures contribute higher processing power by several orders of magnitude. However, a dedicated parallel system is not a cost-effective way to improve the performance of parallel applications. Heterogeneous systems widely exist in industrial and academic computing environments, offering a wealth of underutilized resources for high performance computing. A Network Of Workstations (NOW) is one common subclass of heterogeneous systems that consists of a number of high performance workstations connected through a common network like a Local Area Network (LAN). Communication overhead is the only disadvantage that degrades the performance of such systems; thus the performance does not grow proportionally with the number of processors in the system. Parallel computing on a NOW is a cost-effective way to achieve high performance computing comparable to that provided by costly supercomputers and parallel machines. A message passing library like PVM enables efficient parallel computing on a NOW [7] and is a very common tool used by many researchers. However, writing parallel programs is not only a tedious task, but the mapping and scheduling mechanisms are also much more difficult in heterogeneous environments [5, 6]. In this paper, we establish a uniform framework for parallel image processing on heterogeneous network systems. We consider Single Program Multiple Data (SPMD) tasks, where each image processing task can be computed as multiple SPMD subtasks on various kinds of computing nodes. Since well designed/optimized sequential algorithms are very machine dependent, running the same code on different machines would not yield the expected results. The key idea here is to utilize existing hardware and software resources by constructing a uniform interface to user applications and making the complexity of parallel execution transparent. This paper is organized as follows. Section 2 describes related work. Section 3 gives an overview of the Distributed Computing Primitives. The data partitioning schemes and experimental results are described in Sections 4 and 5 respectively. Section 6 gives concluding remarks about our experiments as well as future work.

2. RELATED WORK

The Parallel Primitives Concept has been introduced to ease the burden of parallel programming on the user and to use knowledge of task characteristics to improve mapping, partitioning, and scheduling of tasks onto a parallel machine [1]. That work targets only a large integrated (homogeneous) multiprocessor architecture for image analysis and understanding as well as scientific processing. A number of works have sought to ease PVM-based parallel programming by incorporating visual tools to help construct parallel applications [2, 4]. Our approach is different. We preserve normal serial semantics within a regular programming language and encapsulate the complexity of parallelism away from the user. Our main focus here is to extend the Parallel Primitives Concept to work with heterogeneous computing network systems; we call the result the Distributed Computing Primitive (DCP).

3. DISTRIBUTED COMPUTING PRIMITIVE OVERVIEW

We present here a uniform construct of parallel programming for a set of image processing tasks called Distributed Computing Primitives (DCP). A DCP is a frequently used function; for example, in the image processing application domain, DCPs would be the median filtering, smoothing, averaging, histogram, edge detection, Fourier transform, etc. Operations such as sum, average, and maximum of vectors are frequently used in many application domains, and are primitives in our system. Thus, the DCPs themselves may be expressed as a composition of other primitives. For example, the edge detection task may involve image smoothing, gradient magnitude computation, and thresholding [1].

3.1. Specification of the DCP

To be consistent, the structure of the DCP is formally defined as depicted in Figure 1.

DCPname {      ; name of DCP (median-filtering, etc.)
  id           ; numerical identification
  input        ; image, matrix, vector
  output       ; image, matrix, vector
  prefer       ; best machine
  statistics   ; execution time and communication time
}

Figure 1: Specification of DCP

The DCPname is the name of the primitive, such as median-filtering or FFT. Each DCP consists of five components: id, input, output, prefer, and statistics. An id is a numerical identification. An input specifies the type of input data, such as image, matrix, or vector. An output specifies the type of output data. A prefer indicates the best machine for this primitive. A statistics stores information about execution and communication times.

Figure 2: An overview of the DCP architecture

3.2. The DCP Architecture

Figure 2 illustrates an overview of the DCP architecture, consisting of: a user's program, a DCP manager, a Parallel Virtual Machine, and the DCP Simulator. The user program consists of a sequence of primitives. These primitives are then queued by the DCP manager. The DCP manager consists of a queue, a mapper, a scheduler, and a dynamic knowledge table. The mapper and scheduler use the knowledge table to determine an optimal data partition size and the best set of machines to execute the primitives. The knowledge table contains, for all primitives, information on execution time and communication time for various combinations of the available workstations. The parallel virtual machine consists of heterogeneous workstations connected via a non-dedicated network. The DCP manager communicates with the parallel virtual machine via the Control Unit.

3.3. Programming with DCPs

Figure 3 shows a simple example C++ program that calls the DCP. The program first reads an input image and then performs a 5x5 median filtering. Parallelism is done automatically and is hidden from the user.

#include "dcp.h"

int main() {
    image A;
    A.read("bird.jpg");
    A.median_filter(5);
    A.show();
}

Figure 3: Example of program using DCP.

3.4. The DCP Simulator

We integrate an interactive simulator into our DCP architecture. It uses the precise prediction model of [9], which takes as input the processor's statistical information and/or the raw speed from the manufacturer's specifications. Our interactive simulator captures initial information (task and system characteristics) from actual execution of the primitives and then uses it to simulate the application's performance when the user changes various system characteristics, such as the number of available processors, the heterogeneity of processors (raw speed), and the heterogeneity of memory (size). This enables us to simulate the impact of various system characteristics on the parallel image processing application.

4. DATA PARTITIONING

In this section we briefly discuss the data partitioning scheme used in our experiments. Our partitioning scheme yields better results than a non-primitive-based scheme under the following assumptions: each computing node executes only one subtask at a time, and an input task is an SPMD task which can be partitioned into a number of SPMD subtasks. We consider the following data partitioning schemes: without knowledge of processing speed, using a standard benchmark, and using a primitive benchmark. More detail on the partitioning schemes and future development can be found in [3].

5. EXPERIMENT RESULTS

Obviously, the DCP concept reduces the complication of parallel programming, since the application program is now just a sequence of image processing primitives. In this experiment we use two image processing primitives as our application domain: median filtering and convolution. Both of them are implemented in a distributed fashion where the DCP manager is responsible for distributing correct partition sizes to its slave processors. Figure 4 shows a sample result from our distributed median filtering of a large gray-scale image of size 1024x1024 with different window sizes of 3x3, 5x5, and 7x7. Our virtual parallel computer consists of heterogeneous workstations with different speeds and memory capacities (nine Sun Sparc-20/50MHz, four Sparc-5/85MHz, and two Pentiums at 66 and 100MHz) running the Unix operating system and PVM 3.3.8.

Figure 4: Response time of DCP for median filtering (seconds vs. number of processors, for 3x3, 5x5, and 7x7 windows)

Figure 5 compares the response times of two different partitioning schemes: scheme0 and scheme1. In scheme0 an image is partitioned into n equal partitions, where n is the number of processors involved in the computation. In scheme1, we use the more accurate primitive computing rate (pixels/sec) as a parameter to the partitioning scheme. The result shows that better response time can be achieved if the partitioning scheme produces subimage proportions corresponding to each processor's capability.

Figure 5: Response time of DCP for median filtering (seconds vs. number of processors, for scheme0 and scheme1)

The response times do not decrease when we increase the number of processors, due to the communication overhead. We deal with this problem by carefully estimating the ratio of communication to computation time and using it to determine an optimal number of processors. We also validate our model by experimenting on various image processing tasks in our NOW environment and comparing the results with a non-primitive-based approach.

6. CONCLUSION

Our DCP architecture allows the user to efficiently run actual image processing algorithms on a heterogeneous NOW, while the integrated simulator can be used to predict program performance when system characteristics, such as the number of processors and processor speed, are varied. The major advantages of the DCP over the non-primitive-based approach are ease of use, automation, and optimization. Future work will be to integrate a visual programming interface and increase the number of primitives, as well as to investigate fault tolerance issues. Although the DCP concept is now implemented using PVM, there is no limit on the choice of communication libraries; thus we also plan to investigate implementations of the DCP using MPI and RCP in the near future.

7. REFERENCES

[1] N. Alexandridis, H-A. Choi, B. Narahari, S. Rotensteich, and A. Youssef, "A Hierarchical, partitionable, knowledge based, parallel processing system," 3rd Annual Parallel Processing Symposium, Calif. State University, Fullerton, CA, March 29-31, 1989.

[2] A. Beguelin, J. Dongarra, A. Geist, R. Manchek, K. Moore, and V. Sunderam, "PVM and HeNCE: Tools for Heterogeneous Network Computing," Environments and Tools for Parallel Scientific Computing, edited by J. Dongarra and B. Tourancheau, Advances in Parallel Computing, Vol. 6, North-Holland, 1993.

[3] P. Chalermwat, N. Alexandridis, P. Piamsa-Nga, and M. O'Connell, "An Overview of Distributed Computing Primitive Concept and Experimental Results," Internal Report, The George Washington University, Washington, DC, April 21, 1996.

[4] J. Dongarra and P. Newton, "Overview of VPE: A Visual Environment for Message-Passing Parallel Programming," Proceedings of the 4th Heterogeneous Computing Workshop, Santa Barbara, CA, April 25, 1995.

[5] V. M. Lo, "Heuristic Algorithms for Task Assignment in Distributed Systems," IEEE Trans. on Computers, Vol. 37, No. 11, November 1988, pp. 1384-1401.

[6] H.J. Siegel, J.B. Armstrong, and D.W. Watson, "Mapping Computer-Vision Related Tasks onto Reconfigurable Parallel Processing Systems," Computer, Vol. 25, No. 2, Feb 1992, pp. 54-63.

[7] V.S. Sunderam, "PVM: A Framework for Parallel Computing," Concurrency: Practice and Experience, Vol. 2, No. 4, December 1990, pp. 315-339.

[8] Z. Xu, "Simulation of Heterogeneous Network of Workstation," TR-95-08-02, Computer Science Department, University of Wisconsin-Madison, 1995.

[9] X. Zhang and Y. Yan, "A Framework of Performance Prediction of Parallel Computing on Nondedicated Heterogeneous NOW," Proceedings of the 1995 International Conference on Parallel Processing, CRC Press, Vol. 1, Aug. 1995.

