IMAGE PARTITIONING USING SYSTEM CHARACTERISTICS IN HETEROGENEOUS COMPUTING SYSTEMS Prachya Chalermwat, Nikitas Alexandridis, Punpiti Piamsa-Nga, and Malachy O'Connell

Department of Electrical Engineering and Computer Science, The George Washington University, Washington, DC 20052, U.S.A. email: [email protected]

ABSTRACT

Many image processing tasks can be computed efficiently in a Single Program Multiple Data (SPMD) fashion on massively parallel systems. Although executing SPMD tasks on coarse-grained heterogeneous systems yields a cost-effective solution, heterogeneity introduces more complexity in data partitioning, mapping, and scheduling. In this paper, three image data partitioning schemes for parallel image processing in heterogeneous systems are investigated and implemented using the Parallel Virtual Machine (PVM) message passing library. The partitioning schemes are based on the system characteristics (processing capability) incorporated within the Distributed Computing Primitives (DCP) environment, using the SpecInt92 benchmark and our DCP-based benchmark. We compare the results with the baseline (Eq-based) scheme, which partitions images equally regardless of processing speed. The experimental results show that the DCP-based partitioning scheme outperforms the Eq-based and Spec-based schemes.

1. INTRODUCTION

Computing image processing SPMD tasks on architectures other than SIMD requires extra effort. Running a single SPMD task on a heterogeneous system requires efficient data partitioning because each node in the system has a different processing capability. Heterogeneous systems are widespread in industrial and academic computing environments and offer a wealth of underutilized resources for high-performance computing. One common subclass of heterogeneous systems is a Network Of Workstations (NOW). A NOW consists of a number of high-performance workstations connected through a Local Area Network (LAN). Interprocessor communication for high-performance parallel computing can be achieved by using a message passing library such as PVM [9]. However, writing parallel programs is not only a tedious task; the mapping and scheduling mechanisms are also much more difficult in heterogeneous environments [6, 8]. A number of works have made non-primitive-based parallel programming easier by incorporating visual tools to help construct parallel applications [2, 4]. Works based on an object-oriented approach for data-parallel programming in homogeneous systems can be found in [7, 5]. In contrast, the Parallel Primitives Concept was introduced to ease the burden of parallel programming and to use knowledge of task characteristics to improve partitioning, mapping, and scheduling of tasks onto a parallel machine [1]. That work targets only a large integrated homogeneous multiprocessor architecture for image analysis and understanding as well as scientific processing. We have augmented the parallel primitives concept to work on heterogeneous systems with the following goals in mind: ease of use, automation, and optimization. We call the result the Distributed Computing Primitives (DCP) concept. Our initial application domain is parallel and distributed image processing on heterogeneous computing network systems. In this paper the experimental results for image partitioning using system characteristics are presented. Based on the system characteristics (processing capability), we examine partitioning schemes using the SpecInt92 benchmark and our DCP-based benchmark, and compare the results with the Eq-based scheme, which partitions images equally regardless of processing speed.

This paper is organized as follows. Section two briefly discusses an overview of the DCP environment. Section three describes the details of the data partitioning schemes using system characteristics. Section four reports the results from the experiments. Conclusions and future work are given in the last section.

(To appear in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics 1996, Beijing, China.)

2. AN OVERVIEW OF THE DCP

The DCP is an enhancement of the parallel primitives concept [1] for heterogeneous computing. A DCP system contains a set of frequently used functions. For example, in the image processing application domain these functions would be median filtering, smoothing, averaging, histogram, edge detection, Fourier transform, etc. Operations such as the sum, average, and maximum of vectors are frequently used in many application domains and are primitives in our system. Thus, the primitives themselves may be expressed as a composition of other primitives. For example, the edge detection task may involve image smoothing, gradient magnitude computation, and thresholding. In our NOW environment, each workstation runs a DCP module called the DCP server. The DCP server is capable of executing requested image processing tasks received from a DCP manager residing in any of the workstations. The DCP manager takes as input the requests for predefined primitives. It determines the best data partition as well as the mapping and scheduling, and then sends requests along with data to the DCP servers.


Figure 1: An overview of the DCP architecture.

2.1. The DCP Architecture

Figure 1 illustrates an overview of the DCP architecture, which consists of a user's program, a DCP manager, a parallel virtual machine, and the DCP simulator.

The user program consists of a sequence of primitives, which are queued by the DCP manager. The DCP manager consists of a queue, a mapper, a scheduler, and a dynamic knowledge table. The mapper and scheduler use the knowledge table to determine an optimal data partition size and the best set of machines on which to execute the primitives. The knowledge table contains, for every primitive, execution-time and communication-time information on the various combinations of available workstations. The parallel virtual machine consists of heterogeneous workstations connected via a non-dedicated network. The DCP manager communicates with the parallel virtual machine via the Control Unit.
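As a hedged illustration of the kind of lookup the mapper and scheduler might perform against the knowledge table, consider the sketch below. The dictionary layout, machine names, and timings are our own assumptions for illustration, not the actual DCP data structures:

```python
# Hypothetical knowledge table: for each primitive, previously measured
# total (execution + communication) times in seconds on candidate
# machine subsets. All numbers are made up for illustration only.
knowledge_table = {
    "median_filter": {
        ("P2",): 70.0,
        ("P2", "P3"): 42.0,
        ("P1", "P2", "P3"): 33.0,
        ("P1", "P2", "P3", "P4"): 31.0,
    },
}

def best_machine_set(primitive):
    """Return the machine subset with the lowest recorded time."""
    entries = knowledge_table[primitive]
    return min(entries, key=entries.get)

print(best_machine_set("median_filter"))  # the fastest recorded subset
```

In a real system the table would be refreshed dynamically as primitives execute, which is why the paper calls it a dynamic knowledge table.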

2.2. The DCP Simulator

In addition to executing image processing tasks in parallel, we have integrated an interactive simulator into our DCP architecture. It uses a prediction model that takes as input the processors' statistical information and/or the raw speed from the manufacturer's specifications. Our interactive simulator captures initial information (task and system characteristics) from actual execution of the primitives and then uses it to simulate application performance when the user changes various system characteristics, such as the number of available processors, the heterogeneity of the processors (raw speed), and the heterogeneity of memory (size). The simulation model is based on the simulation of a heterogeneous network of workstations [10].

3. DATA PARTITIONING USING SYSTEM CHARACTERISTICS

3.1. Characteristics of Heterogeneous Systems

Two major sets of characteristics that exhibit the heterogeneity of a system are its communication and computation characteristics. Communication characteristics can be represented by network speed, network bandwidth, latency, etc. Computation characteristics relate to the processing capability of each processor. If the computation characteristics are not carefully considered, parallel execution of tasks in heterogeneous systems can be dramatically degraded due to imbalanced computation. In this paper we concentrate only on computation characteristics; experiments incorporating communication characteristics into the partitioning scheme are left as future work. More communication characteristics issues can be found in [3].

3.2. Image data partitioning

An image is partitioned into sub-images in a row-wise fashion. In order to obtain the fastest parallel execution time on a heterogeneous system, each partition must be weighted properly according to its processor's capability. Two major characteristics that impact system performance are communication time and computation time. The computation of an SPMD task is, in general, data-size dependent; that is, the computation time scales with the data size. The communication time of each SPMD task is also data-size dependent, i.e., larger data requires longer communication time. We consider the following data partitioning schemes:

- Baseline: equal partitions (Eq-based)
- SpecInt benchmark (Spec-based)
- DCP benchmark (DCP-based)

For each scheme a weighting function is used to determine the partition size.

3.3. Weighting functions

To efficiently run SPMD tasks in heterogeneous systems, proper partitioning must be taken into consideration. In our work, an image is partitioned into sub-images in a row-wise fashion, including the border region. In order to obtain the fastest parallel execution time on heterogeneous systems, each partition must be weighted properly according to the processor capability. Our experiments are based on the following assumptions: each computing node executes only one subtask at a time, and each input task is a data-parallel task which can be partitioned into a number of SPMD subtasks. In a heterogeneous system, it is clear that the partition sizes need to be matched to the processing capabilities of the different processors. The problem is which metric should be used in our weighting function to obtain optimal partition sizes. Partitioning can use one of the following metrics to compute the weighting function: standard benchmark throughput or DCP-based benchmark throughput. In this paper we experiment with both approaches. Let L be the length of the data and P the number of processors. The partition size for each processor can be simply calculated as Psize_i = w_i * L, where w_i is the weighting function, w_i = S_i / S_T, S_i is the processing speed of processor i, and S_T is the total of all S_i, S_T = sum_{i=1}^{P} S_i. It is very important to select the proper value of S_i in order to get an appropriate partition size.
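The partition-size computation Psize_i = w_i * L with w_i = S_i / S_T can be sketched as follows. The function name and the rounding policy (handing rows lost to integer truncation to the fastest processors) are our own choices, not prescribed by the paper:

```python
def partition_rows(num_rows, speeds):
    """Split num_rows image rows among P processors in proportion to
    their speeds S_i, i.e. Psize_i = w_i * L with w_i = S_i / S_T."""
    total = sum(speeds)                                # S_T = sum of all S_i
    sizes = [num_rows * s // total for s in speeds]    # floor(w_i * L)
    # Hand out the rows lost to integer truncation, fastest processor first.
    leftover = num_rows - sum(sizes)
    for i in sorted(range(len(speeds)), key=lambda j: -speeds[j])[:leftover]:
        sizes[i] += 1
    return sizes

# Using the DCP-based rates from Table 1 (P1..P4) on a 1024-row image:
print(partition_rows(1024, [10968, 13032, 9803, 4952]))  # [290, 345, 259, 130]
```

With equal speeds this degenerates to the Eq-based scheme, which is why the baseline needs no benchmark information at all.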

3.3.1. Baseline (Eq-based)

In this scheme, the data is partitioned into equal sizes depending on the number of processors. Let L be the length of the data and P the number of processors; the partition size for each processor is simply Psize_i = L / P. This is the baseline scheme, used to identify the benefits of a better matching between processor speed and data partition size. It requires no information other than the problem size and the number of available processors. This scheme works well in systems where all processors are identical. Unfortunately, in a heterogeneous system consisting of different processors with different capabilities, the bottleneck will be the slowest machine.

3.3.2. SpecInt Benchmark (Spec-based)

The processing capability of computer systems can be measured using various metrics: SpecInt92, MFLOPS, MIPS, etc. These representations often involve measuring some set of operations. We use SpecInt92 in our experiment to indicate the relative speed between heterogeneous processors. The advantage of this scheme is that it does not require any experimentation; weights can be determined based on published reports. S_i in this case is simply the SpecInt92 rating of each processor.

3.3.3. DCP Benchmark (DCP-based)

SpecInt92 and other general benchmarks do not represent how fast a computer can execute a specific application. Therefore, we directly measure the processing speed for each function and use this information as our partitioning factor. The selected metric in this case is a primitive-based throughput, e.g., number of operations per second (or pixels per second for the underlying problem). This requires that the sequential code be executed on all nodes off-line to capture the relative weights and store them for subsequent instances of parallel operation. This scheme gives the best representation of relative speeds and can be used to obtain accurate weights. S_i in this case is the DCP-based processing rate of each processor.
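A minimal sketch of how such an off-line, application-specific rate could be captured on one node follows. The timing harness and names are illustrative assumptions; the paper does not show the actual DCP benchmark code:

```python
import time

def measure_rate(primitive, image, repeats=3):
    """Run a primitive sequentially on this node and return its
    throughput in pixels per second, usable as S_i in the weighting."""
    pixels = len(image) * len(image[0])
    best = float("inf")
    for _ in range(repeats):             # keep the best of a few runs
        start = time.perf_counter()
        primitive(image)
        best = min(best, time.perf_counter() - start)
    return pixels / best

# Example with a trivial stand-in primitive (identity copy) on a
# small test image; a real run would use median filter or convolution.
rate = measure_rate(lambda img: [row[:] for row in img],
                    [[0] * 64 for _ in range(64)])
print(rate > 0)
```

The measured rates would then be stored (as in Table 1) and reused for every subsequent parallel invocation of that primitive.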

4. EXPERIMENTAL RESULTS

Our virtual parallel computer consists of heterogeneous workstations with different speeds and memory capacities (Sun Sparc-20 60 MHz, Pentium 150 MHz, Pentium 100 MHz, and Intel 80486 66 MHz) running the Unix operating system and PVM 3.3.8. Table 1 shows the system characteristics in terms of processor speeds (SpecInt92 and DCP-based processing rate). SpecInt92 is a standard benchmark, while the DCP-based rate is an application-specific benchmark. Two image processing primitives are used in our experiments, namely median filter (MF) and convolution (CONV). Both primitives are implemented using the PVM message passing library and executed on our heterogeneous system. The input to the median filter is a 1024 x 1024 gray-scale image with a 5 x 5 kernel. The input to the convolution is a 2048 x 2048 gray-scale image with a 7 x 7 blurring kernel. Parallel virtual machine configurations are varied from one to four processors based on the following assumption: when adding more processors, the fastest processor is selected first, i.e., processors P2, P3, and P4 will be selected respectively.

Table 1: Processing capability of four different processors

  Processor              SpecInt92   DCP Rate
  P1: Sparc-20 60 MHz        88.9      10968
  P2: Pentium 150 MHz       165.2      13032
  P3: Pentium 100 MHz       126.2       9803
  P4: 80486 66 MHz           32.0       4952

4.1. Computation Imbalance

Ideally, the parallel execution times for all processors should be identical. When each processor finishes its task at a different time, the parallel execution is imbalanced. Imbalanced execution can dramatically degrade the overall parallel execution time if the heterogeneous system has a high degree of heterogeneity, i.e., the difference between processing speeds is large. The imbalance forces the master process to wait for the slowest machine to complete its job; thus the overall execution time will be approximately equal to the partial execution time on the slowest machine. Figure 2 depicts different levels of imbalanced execution for the distributed median filtering primitive. The DCP-based scheme yields the smallest parallel execution time, while the Eq-based scheme causes the finish time to equal that of the slowest processor, P4. The Spec-based scheme is not as balanced as the DCP-based scheme, and incorrect determination of partition size causes P3 to execute longer than the other processors. Figure 3 depicts different levels of imbalanced execution for the distributed convolution primitive. Since the input image is quite large (2048 x 2048) and the complexity of the convolution is not too high, communication time is dominant. This shows that a weighting function using only processing speed is not enough to obtain good performance. Comparing the results in Figure 2 with those in Figure 3, we can see that the MF primitive is computation bound while the CONV primitive is communication bound. This leads to the conclusion that a partitioning scheme must not omit the effect of communication characteristics; it is our intention to investigate a more efficient partitioning scheme based on the combination of computation and communication characteristics.
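The waiting-for-the-slowest effect can be made concrete: with per-processor times t_i, the parallel finish time is max_i t_i, and its ratio to the mean per-processor time quantifies the imbalance. A small sketch with hypothetical timings (not measurements from the paper):

```python
def finish_time(times):
    """Parallel execution ends when the slowest processor finishes."""
    return max(times)

def imbalance(times):
    """Finish time divided by the mean per-processor time;
    1.0 means a perfectly balanced execution."""
    return finish_time(times) / (sum(times) / len(times))

# Hypothetical per-processor times (seconds) under an equal (Eq-based)
# split: the slow 80486 dominates even though the Pentiums finish early.
eq_times = [30.0, 25.0, 33.0, 72.0]
print(finish_time(eq_times))          # 72.0
print(round(imbalance(eq_times), 2))  # 1.8
```

A DCP-based split aims to drive this ratio toward 1.0 by shrinking the partition handed to the slowest machine.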

Figure 2: Comparison of communication and computation characteristics of the Eq-, Spec-, and DCP-based partitioning schemes for the median filter

Figure 3: Comparison of communication and computation characteristics of Eq-, Spec-, and DCP-based partitioning schemes for image convolution

4.2. Scalability of the partitioning schemes

We consider response time and the number of processors as scalability factors. Ideally, as the number of processors increases, the response time should decrease. Figure 4 shows the behavior of the three partitioning schemes for distributed median filtering when varying the number of processors from one to four. The DCP-based scheme outperforms the Eq-based and Spec-based schemes. The response time for the Eq-based scheme is quite high for the four-processor configuration due to imbalanced partitioning. Figure 5 shows the behavior of the three partitioning schemes for the distributed convolution primitive. Similarly, we vary the number of processors from one to four to observe the scalability of the partitioning schemes. The DCP-based scheme consistently outperforms the Eq-based and Spec-based schemes.


Figure 4: Response times of distributed median filtering using the Eq-, Spec-, and DCP-based schemes


Figure 5: Response times of distributed image convolution using the Eq-, Spec-, and DCP-based schemes

5. CONCLUSIONS

Our DCP-based partitioning scheme clearly outperforms the other two schemes. The major system characteristics, communication and computation capability, are the keys to good performance when executing SPMD-style tasks on heterogeneous systems. The computation-bound or communication-bound nature of the SPMD tasks must also be taken into consideration in the data partitioning schemes. Future work will incorporate detailed communication characteristics into the weighting function in order to improve the accuracy of the data partitioning.

6. REFERENCES

[1] N. Alexandridis, H.-A. Choi, B. Narahari, S. Rotensteich, and A. Youssef, "A hierarchical, partitionable, knowledge-based, parallel processing system," 3rd Annual Parallel Processing Symposium, California State University, Fullerton, CA, March 29-31, 1989.

[2] A. Beguelin, J. Dongarra, A. Geist, R. Manchek, K. Moore, and V. Sunderam, "PVM and HeNCE: Tools for Heterogeneous Network Computing," in Environments and Tools for Parallel Scientific Computing, J. Dongarra and B. Tourancheau, Eds., Advances in Parallel Computing, Vol. 6, North-Holland, 1993.

[3] P. Chalermwat, N. Alexandridis, P. Piamsa-Nga, and M. O'Connell, "An Overview of Distributed Computing Primitive Concept and Experimental Results," Internal Report, The George Washington University, Washington, DC, April 21, 1996.

[4] J. Dongarra and P. Newton, "Overview of VPE: A Visual Environment for Message-Passing Parallel Programming," in Proceedings of the 4th Heterogeneous Computing Workshop, Santa Barbara, CA, April 25, 1995.

[5] D. Kotz, "A DAta-Parallel Programming Library for Education (DAPPLE)," Technical Report PCS-TR94-235, Dept. of Computer Science, Dartmouth College, November 7, 1994.

[6] V. M. Lo, "Heuristic Algorithms for Task Assignment in Distributed Systems," IEEE Transactions on Computers, Vol. 37, No. 11, November 1988, pp. 1384-1401.

[7] A. Malony, B. Mohr, P. Beckman, D. Gannon, S. Yang, and F. Bodin, "Performance Analysis of pC++: A Portable Data-Parallel Programming System for Scalable Parallel Computers," in Proceedings of the 8th International Parallel Processing Symposium (IPPS), Cancun, Mexico, April 1994.

[8] H. J. Siegel, J. B. Armstrong, and D. W. Watson, "Mapping Computer-Vision-Related Tasks onto Reconfigurable Parallel Processing Systems," Computer, Vol. 25, No. 2, February 1992, pp. 54-63.

[9] V. S. Sunderam, "PVM: A Framework for Parallel Computing," Concurrency: Practice and Experience, Vol. 2, No. 4, December 1990, pp. 315-339.

[10] Z. Xu, "Simulation of Heterogeneous Network of Workstation," TR-95-08-02, Computer Science Department, University of Wisconsin-Madison, 1995.
