
Optimizing Task Distribution for Heterogeneous Desktop Clusters

Neeta Nain, Vijay Laxmi, Bhavitavya B Bhadviya, and Nemi Chand Singh
Department of Computer Engineering, Malaviya National Institute of Technology, Jaipur-302017
and Rajasthan Swayat Shashan Mahavidyalay, Jaipur-302015, Rajasthan, India.

Abstract
A three-CPU, two-node cluster-based parallel multiprocessor system is developed with a Core Duo laptop and an AMD Athlon desktop using Fedora Core 5 (smp and non-smp kernels respectively) and OSCAR. The system is used to execute a number of parallel programs to evaluate their performance and to optimize the number of tasks scheduled on the laptop and the desktop respectively, depending upon their hardware configurations. The aim is to set up a distributed parallel computer right on the desktop, and to extend this idea of optimizing a cluster's performance based on its nodes' hardware configurations to any large heterogeneous cluster.

1. Introduction
Parallel computing is knocking at our doorsteps and clusters are the need of the day. While architecture for supporting parallel processing is a major issue, it is certainly not the only problem; there are other critical issues, mainly related to the programmability of parallel machines. For an effective application, the synergism among parallel architectures, algorithms and programming should be consciously fostered. This paper is written with this point in view.

1.1. Current Computing Scenario
In the context of today's technology, which has reached a limit where it seems impossible to gain higher performance from sequential machines, parallel processing is being viewed as an effective alternative to the von Neumann computer architecture. Clusters demand that nodes be similar in configuration, but for a normal user this is impractical due to economic constraints. As far as a Small Office Home Office (SOHO) user or a student is concerned, he or she may have only a desktop and a laptop, and even if more systems are present, they may be of different configurations because they are bought over time through continuous expansion, e.g., some computers now and a few more after two years.

1.2. Clusters and OSCAR
Cluster-based Linux multiprocessor systems are currently used in industry, research laboratories and educational institutions. These cluster-based systems are used in High Performance Computing (HPC), High Availability (HA), load-balancing and fault-tolerance applications. We come across clusters with a number of nodes ranging from 8, 16 and 32 to even more. Such a large setup is very costly, since switches and many systems are involved. OSCAR (Open Source Cluster Application Resources) [1] is an integrated bundle of software designed for building Linux-based cluster multiprocessor / multi-computer systems for high-performance applications. The OSCAR software package includes the following components: Message Passing Interface software (MPI / MPICH), Local Area Multicomputer software (LAM), Parallel Virtual Machine software (PVM), the Cluster Command and Control software suite (C3), scheduling software (MAUI) and Portable Batch System software (PBS). The latest version, 5.0, released on 12 November 2006, is used for our experimentation.

1.3. Parallel Programming and LAM/MPI
LAM/MPI [2] is a high-performance, freely available, open-source implementation of the MPI standard that is researched, developed and maintained at the Open Systems Lab at Indiana University. LAM/MPI supports all of the MPI-1 standard and much of the MPI-2 standard. The Message Passing Interface (MPI) is a set of API functions enabling programmers to write high-performance parallel programs that pass messages between processes to make up an overall parallel job. LAM/MPI 7.1.2 is used in our experiments.
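To make the message-passing model concrete, a minimal MPI program in C is sketched below. It is an illustration only, not one of the programs timed later in this paper; the build and launch commands in the comments are assumptions based on the standard mpicc/mpirun workflow described above.

/* hello_mpi.c -- minimal illustrative MPI program (not from the paper's sources).
 * Each process reports its rank; rank 0 also reports the job size.
 * Build (assumption): mpicc hello_mpi.c -o hello_mpi
 * Run   (assumption): mpirun C hello_mpi
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime       */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identifier   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes   */

    printf("Process %d of %d is alive\n", rank, size);

    if (rank == 0)
        printf("Master: the parallel job consists of %d processes\n", size);

    MPI_Finalize();                         /* shut down cleanly           */
    return 0;
}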

2. Construction of the Desktop Cluster
A cluster was constructed using a dual-core laptop and a single-core desktop. The dual-core processor was from Intel while the single-core processor was from AMD. The details of the hardware and software used are as follows:

2.1. Hardware Specifications
An ACER Aspire 5672wlmi laptop is used as the head node in our cluster. The single client used was a desktop based on an AMD Athlon XP 2000+. No switch or hub is used; instead a peer-to-peer LAN was set up using a crossover LAN cable. The detailed configuration of the cluster components is listed below:
ACER Aspire 5672wlmi (head node):
- Intel Core Duo T2300 (1.67 GHz).
- 1 GB 533 MHz DDR2 RAM.
- 120 GB SATA hard disk.
- Broadcom 5789 Gigabit Ethernet on board.
Desktop client:
- AMD Athlon XP 2000+ (Palomino, 1.67 GHz).
- 512 MB DDR 266 RAM.
- 80 GB SATA hard disk.
- Realtek 8201 Fast Ethernet on board.

2.2. Software Specifications
Since both computers were used for SOHO purposes, they had dual-boot configurations. The laptop had the option of booting into either
-1) Fedora Core 5 (2.6.15-1.2054_FC5smp) or
-2) Windows XP Professional,
whereas the desktop had the following boot options in the GRUB boot loader:
-1) Fedora Core 5 (2.6.15-1.2054_FC5)
-2) Windows XP Professional
The "Software Development" package group was selected at the time of installing Fedora Core 5. OSCAR 5.0 for Fedora Core 5 (i386) was installed on the laptop. The sample SCSI disk-partition file in the OSCAR sample folder was modified to accommodate the Windows partition on the client desktop.

2.3. Installation of the Cluster
The installation procedure listed in the OSCAR install guide [3] was strictly followed. The laptop and the desktop both had the same secondary username and group apart from the 'root' user. The home directory of this user on the laptop was made the current working directory. The host file used for the lamboot command was initially configured as in Figure 1:
#myhostfile
bhavioscar cpu=2
oscarnode1.oscardomain
Figure 1: Hostfile used for lamboot
After booting the LAM daemon on the desktop and the laptop, the parallel system was ready for testing and evaluation.

3. Experimental Results and Discussion
We have used four real-world applications to test our proposal. The first is matrix multiplication, which involves heavy calculation; the second is the Fast Fourier Transform, which incurs heavy data exchange. The third application is calculating π, which to some extent is architecture (CPU) dependent. The fourth is a combination of all of these and involves the implementation of a complete image-processing algorithm. All the timing values tabulated below are averages over 10 system runs, repeated until an approximately constant value was achieved.


3.1. Matrix Multiplication
Matrix multiplication has widespread use in fields ranging from image processing to circuit solving and many more. An algorithm to multiply two matrices has a time complexity of O(n^3). Let n x n be the dimensions of the matrices and p be the number of parallel processes. The sequential algorithm was broken down into tasks of n/p rows each and executed as p separate processes, reducing the per-process complexity to O((n/p) * n^2). A 600 x 600 matrix was used for testing purposes. The mpirun command was tested with all options, including C (run on all CPUs), N (run on all nodes) and -np (specify the exact number of processes irrespective of CPU or node count), in the form shown in Figure 2:


[~]$ mpirun X mm
Figure 2: LAM/MPI command used to execute the compiled matrix multiplication program mm (built from the C source file mm.c), where X stands for one of the options listed above.


The processing time with various options for mpirun is shown in Table 1 below:

Table 1: Execution time for matrix multiplication
Option with mpirun (X)   Wall-clock timing (s)   Total number of processes (p)
n1                       5.555274                1
c2                       5.544434                1
c0                       4.019474                1
c1                       3.975676                1
n0                       3.961359                1
N                        3.353559                2
-np 3                    2.367699                3
C                        2.336615                3
-np 5                    2.231874                5
c0-1                     2.230702                2
-np 2                    2.215675                2
-np 4                    2.072999                4

From Table 1 it is clear that the execution time decreases until p = 4; at p = 5 the data-exchange time over the 100 Mbps network outweighs the reduction in computation time, leading to an overall loss in performance. With better networking, say Gigabit Ethernet, some further performance gain could be extracted, but the conclusion remains that even in a two-node, three-CPU system the best performance for matrix multiplication is obtained at p = 4 for any value of n. It is important to note that this conclusion applies to all heterogeneous clusters: the nodes should be arranged in descending order of their computational power, and an appropriate value of p should be chosen so that maximum performance can be extracted from the parallel machine. For example, suppose a setup has 3 Core Duo T2500 machines, 4 AMD Athlon 2400+ systems and 5 Pentium III 1 GHz systems; then they should be arranged as:
-The first 3 should be the Core Duo systems.
-The 4th to 7th machines should be the Athlons.
-The 8th to 12th systems should be the Pentium IIIs.
The number of processes can then be chosen such that the CPU load is balanced. Here we can take either 21 (2x3 + 4x1 + 5x1 + 2x6) processes or 25 (2x3 + 4x1 + 5x1 + 2x6 + 1x4) processes for better performance of this system, making sure that the slower machines get the least work.
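As a concrete illustration of the row-block decomposition just described (the mm.c program actually timed in Table 1 is not reproduced in this paper), the following C sketch gives n/p rows of A to each process. The matrix contents, the use of MPI_Bcast for B and the constant NDIM are assumptions made only for the example.

/* mm_sketch.c -- illustrative row-block parallel matrix multiplication.
 * A sketch of the decomposition in Section 3.1, not the actual mm.c used
 * in the experiments; NDIM and the data values are assumptions.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NDIM 600            /* matrix dimension n (as in the 600 x 600 test) */

int main(int argc, char *argv[])
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int rows = NDIM / p;    /* n/p rows per process (assumes p divides n)    */
    double *a = malloc((size_t)rows * NDIM * sizeof(double)); /* strip of A  */
    double *b = malloc((size_t)NDIM * NDIM * sizeof(double)); /* full B      */
    double *c = malloc((size_t)rows * NDIM * sizeof(double)); /* strip of C  */

    /* Rank 0 would normally read or generate the matrices and scatter A;
       here each rank simply fills its own strip with dummy values. */
    if (rank == 0)
        for (long i = 0; i < (long)NDIM * NDIM; i++) b[i] = 1.0;
    for (long i = 0; i < (long)rows * NDIM; i++) a[i] = 1.0;

    /* Broadcast B so every process can compute its strip of C = A * B. */
    MPI_Bcast(b, NDIM * NDIM, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < NDIM; j++) {
            double sum = 0.0;
            for (int k = 0; k < NDIM; k++)
                sum += a[i * NDIM + k] * b[k * NDIM + j];
            c[i * NDIM + j] = sum;
        }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("p = %d, per-process work O((n/p)*n^2), local time %.6f s\n",
               p, t1 - t0);

    /* The strips of C could be gathered on rank 0 with MPI_Gather here. */
    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}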

3.2. Fast Fourier Transform (FFT)
The Fast Fourier Transform is an inseparable part of almost all engineering applications and involves CPU-hungry calculations. A non-parallel algorithm has a time complexity of O(n^2) [4], where n is the number of elements in the vector whose FFT has to be found. A parallel algorithm for the FFT finds the solution in O(n log2 n) [5]. Keeping in mind the configuration of the parallel machine as in Figure 1, it would be erroneous to merely run the FFT on all three processors, since this does not guarantee the fastest computation. We have performed two optimizations:
-The first optimization is selecting log2 n processes irrespective of the CPU or node count.
-The second, and the most important, is the way we distribute these processes on our parallel system.
First we changed our hostfile as in Figure 3:
#myhostfile
bhavioscar cpu=16
oscarnode1.oscardomain cpu=16
Figure 3: Hostfile for FFT
Thus each node was configured to run log2 n processes. We then ran the command
>$ mpirun X fft
Figure 4: Command to execute the fft program (compiled from fft.c)
Various options were tested with mpirun using n = 32768, so the FFT was implemented using log2(32768) = 15 processes, each handling n/log2 n elements. The execution timings are shown in Table 2. In Table 2 the notation 1:6x1+4, 2:6x1 means that 6 contiguous processes were distributed to each node alternately, i.e., processes 0 to 5 were passed to node 1, processes 6 to 11 were passed to node 2, and so on; the 4 processes left unpaired were also distributed to node 1.

Table 2: Execution timing for finding the FFT of each process (set of elements)
Option with mpirun (X)        Wall-clock timing (s)   Processes on each node
c0 c15 c1 c16 … c7 c22        1.026160                1:1x8, 2:1x8
c0-1 c15-16 … c6-7 c22-23     0.850032                1:2x4, 2:2x4
c0-2 c15-17 … c6-8 c21        0.596754                1:3x3, 2:3x2+1
c0-3 c15-18 c4-7 c19-22       0.454382                1:4x2, 2:4x2
c0-4 c15-19 c5-9 c20          0.687652                1:5x2, 2:5x1+1
c0-5 c15-20 c6-9              0.389662                1:6x1+4, 2:6x1
c0-6 c15-21 c7-8              0.538113                1:7x1+2, 2:7x1
c0-8 c15-22                   0.423823                1:8x1, 2:8x1

From Table 2 we conclude that the way the tasks are distributed also affects the performance. This is in accordance with the shuffle-exchange network [6, 7] or the butterfly network [8-10].

Figure 5: Simulated butterfly network when 8 processes were scheduled on each node.
It may further be noted that the fastest performance was not achieved when n/2 and n/2 processes were distributed (as in the last case of Table 2, where each node received 8 processes) but rather when 6 processes were distributed to each node, indicating that theoretical and practical results may differ. Thus giving more work to the faster processor, or to one which is free, while at the same time reducing the amount of data exchanged over the network, leads to faster processing.
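To illustrate the butterfly communication pattern referred to above, the following C sketch performs a recursive-doubling exchange in which, at each stage, a rank exchanges data with the rank that differs from it in one bit. It shows only the pairing scheme, not the fft.c program timed in Table 2, and assumes a power-of-two number of processes.

/* butterfly_sketch.c -- illustrative butterfly (recursive-doubling) exchange.
 * At stage d every rank exchanges data with the partner whose rank differs
 * only in bit d, the same pairing pattern a parallel FFT uses. Illustration
 * only, not the authors' fft.c. Run with a power-of-two number of processes,
 * e.g. mpirun -np 8 butterfly.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;   /* stand-in for this rank's partial result */

    /* log2(size) butterfly stages; after the loop every rank holds the
       global sum, as in an all-reduce built on a butterfly network. */
    for (int stage = 1; stage < size; stage <<= 1) {
        int partner = rank ^ stage;            /* flip one bit of the rank   */
        double incoming;
        MPI_Sendrecv(&local, 1, MPI_DOUBLE, partner, 0,
                     &incoming, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        local += incoming;                     /* combine the partner's data */
    }

    printf("rank %d: combined value = %g (expected %g)\n",
           rank, local, size * (size - 1) / 2.0);

    MPI_Finalize();
    return 0;
}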

3.3. Calculating the Value of Pi (π)
This is one of the most trivial tests for any parallel machine, but we chose to run it because it is architecture dependent, that is, it performs differently on Intel and AMD machines. We chose to calculate the value of π up to 15 decimal places using an integration technique. The various options tried and their execution times are shown in Table 3.

Table 3: Execution time for finding the value of Pi (π)
Option with mpirun (X)   Wall-clock timing (s)   Total number of processes
n0                       3.640854                1
c0-2                     2.113456                3
N                        1.904542                2
-np 4                    1.898810                4
C                        1.824241                3
-np 6                    1.727367                6
-np 5                    1.685756                5
n1                       1.512333                1
C n1                     1.045495                4

From Table 3 we can see that the AMD Athlon calculates the value of π faster than the Core Duo; hence a different approach was used to gain performance. First, the task was divided into an appropriate number of processes; for example, from Section 3.1 we take p = 4 as the optimal number of processes for our system. Every CPU was first given a single copy of the task, but in addition the Athlon system was given some more tasks to utilize its faster processing. Care should also be taken to avoid large queues, or else the performance gain will become negative. For example, in our case, from Table 3, the Core Duo takes 3.640854 seconds while the Athlon takes 1.512333 seconds; therefore the maximum number of tasks queued for a single Athlon system must not exceed 2 (floor(3.640854 / 1.512333)).
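For reference, a minimal C sketch of the integration approach (the standard midpoint-rule quadrature of 4/(1 + x^2) over [0, 1]) is given below. It is not the exact program timed in Table 3, and the number of intervals NSTEPS is an assumed value.

/* pi_sketch.c -- illustrative parallel computation of pi by integrating
 * 4/(1+x^2) over [0,1] with the midpoint rule. This follows the well-known
 * MPI example and is a sketch, not the program timed in Table 3; NSTEPS is
 * an assumption.
 */
#include <mpi.h>
#include <stdio.h>

#define NSTEPS 100000000L   /* number of integration intervals (assumption) */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double h = 1.0 / (double)NSTEPS;
    double partial = 0.0;

    /* Each rank handles an interleaved subset of the intervals. */
    for (long i = rank; i < NSTEPS; i += size) {
        double x = h * ((double)i + 0.5);      /* midpoint of interval i */
        partial += 4.0 / (1.0 + x * x);
    }
    partial *= h;

    double pi = 0.0;
    MPI_Reduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi approximated to %.15f with %d processes\n", pi, size);

    MPI_Finalize();
    return 0;
}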

3.4. Morphological Edge Detection and Corner Detection Algorithm using Chain Encoding
A complete image-processing algorithm involving morphological edge detection and corner detection using chain encoding [11] was implemented in parallel on the cluster. The program for applying a single mask to an image was run as a batch process for 10 masks, 5 of which were 3 x 3 while the remaining 5 were 5 x 5. We have taken an image file of 1200 x 1200 (N x N) pixels for our tests. The image matrix was padded with zeros according to the dimensions of the target mask, that is, (m-1) extra rows and columns were added to the image matrix for a target mask of m x m. The image pixel matrix was broken down into n processes, where n = N/p and p is the number of processors on which execution is carried out. Table 4 below lists the wall-clock time for each execution with different options to mpirun.

Table 4: Execution time for applying 10 masks on an image (erosion, corner detection and edge detection were performed)
Option with mpirun (X)   Wall-clock timing (s)   Total number of processes (n)
c2                       16.784599               1
n0                       16.459122               1
c1                       16.458934               1
c0                       16.456987               1
n1                       16.356372               1
N                        14.291345               2
-np 2                    12.302911               2
c0-1                     12.291872               2
-np 5                    12.291742               5
C                        11.715910               3
-np 3                    11.672309               3
-np 4                    10.514229               4

From Table 4 we conclude that parallel processing of the image over the cluster enhances performance and reduces the computation time. This performance depends not only on the number of processors but also on the way the image-processing task is distributed. Thus the best performance (a gain of approximately 38%) for the 3-CPU cluster was obtained at n = 4, i.e. when 4 simultaneous processes were run, and not at n = 3.
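A minimal C sketch of the row-strip decomposition described above follows. The averaging mask, the image contents and the choice to broadcast the whole padded image (rather than scatter strips with halo rows) are assumptions made to keep the example short; it is not the chain-encoding program of [11].

/* mask_sketch.c -- illustrative row-strip parallel application of a 3x3 mask.
 * Sketch of the decomposition in Section 3.4; image size, mask values and the
 * flat float buffer are assumptions, not the authors' actual code.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1200                       /* image is N x N pixels               */
#define M 3                          /* mask is M x M                       */
#define PAD (M / 2)                  /* zero padding on each side           */
#define NP (N + 2 * PAD)             /* padded dimension                    */

static const float mask[M][M] = {    /* simple averaging mask (assumption)  */
    {1/9.f, 1/9.f, 1/9.f},
    {1/9.f, 1/9.f, 1/9.f},
    {1/9.f, 1/9.f, 1/9.f},
};

int main(int argc, char *argv[])
{
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Every rank holds the padded image (broadcast from rank 0) and computes
       only its own strip of output rows. */
    float *img = calloc((size_t)NP * NP, sizeof(float));
    float *out = calloc((size_t)N * N, sizeof(float));
    if (rank == 0)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                img[(i + PAD) * NP + (j + PAD)] = (float)((i + j) % 256);
    MPI_Bcast(img, NP * NP, MPI_FLOAT, 0, MPI_COMM_WORLD);

    int rows = N / p, r0 = rank * rows;
    int r1 = (rank == p - 1) ? N : r0 + rows;  /* last rank takes the rest  */

    for (int i = r0; i < r1; i++)              /* convolve this rank's rows */
        for (int j = 0; j < N; j++) {
            float s = 0.f;
            for (int u = 0; u < M; u++)
                for (int v = 0; v < M; v++)
                    s += mask[u][v] * img[(i + u) * NP + (j + v)];
            out[i * N + j] = s;
        }

    if (rank == 0)
        printf("rank 0 processed rows %d..%d of %d\n", r0, r1 - 1, N);
    /* Strips of 'out' could be gathered on rank 0 with MPI_Gatherv here. */

    free(img); free(out);
    MPI_Finalize();
    return 0;
}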

3.5. Generalization of Parallel Image Processing
Image-processing functions such as convolution, edge detection and morphological operations involve the application of a number of masks on images of large dimensions and sizes. We tried to generalize the procedure of applying masks of varied sizes on images of different dimensions.

Images with dimensions 1200x1200, 1000x1000, 800x800 and 500x500 were used, and masks of 3x3, 5x5 and 7x7 were applied to each of them. Tables 5, 6 and 7 present the experimental results, where -np 4 and n1 are the options used with mpirun. The times are stated as wall-clock timings in seconds.

Table 5: Execution time for applying a 3x3 mask
Dimensions of image   -np 4 (wall-clock, s)   n1 (wall-clock, s)
1200 x 1200           0.93145                 1.46482
1000 x 1000           0.86992                 1.31471
800 x 800             0.61972                 0.86147
500 x 500             0.42234                 0.48912

Table 6: Execution time for applying a 5x5 mask
Dimensions of image   -np 4 (wall-clock, s)   n1 (wall-clock, s)
1200 x 1200           1.11945                 1.78448
1000 x 1000           1.03854                 1.56712
800 x 800             0.88198                 1.24484
500 x 500             0.62279                 0.84179

Table 7: Execution time for applying a 7x7 mask
Dimensions of image   -np 4 (wall-clock, s)   n1 (wall-clock, s)
1200 x 1200           1.28945                 1.98467
1000 x 1000           1.16954                 1.69712
800 x 800             0.96898                 1.47484
500 x 500             0.70982                 1.10179

Thus from Tables 5, 6 and 7 we can generalize that any parallel implementation of a mask on an image will be successful and useful only if the calculations involved exceed a certain threshold. For example, applying a 3x3 mask on a 500x500 image involves only 0.225 x 10^7 (500x500x3x3) multiplications, compared with 1.296 x 10^7 multiplications for applying a 3x3 mask over a 1200x1200 image. In the first case the performance gained is only 14%, while in the second case it is approximately 37%; several of the larger cases in Tables 5-7 show a performance gain of 30% or more. Thus we need to select carefully which applications to actually parallelize and which are better left sequential. A compromise between performance gain and the cost of parallelism is what determines the procedure.
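The selection rule above can be made mechanical; the small C helper below estimates the multiplication count N^2 m^2 for a mask application and compares it against a cut-off before deciding to parallelize. The cut-off value is an assumption chosen for illustration from Tables 5-7, not a figure reported in this paper.

/* parallel_threshold.c -- illustrative check of whether mask application is
 * large enough to benefit from parallel execution, based on the
 * multiplication count N*N*m*m used in Section 3.5. The threshold value is
 * an assumption for illustration, not a measured figure.
 */
#include <stdio.h>

/* Estimated multiplications for applying one m x m mask to an N x N image. */
static long long mask_work(long long n, long long m)
{
    return n * n * m * m;
}

int main(void)
{
    /* Assumed cut-off: roughly the 3x3-on-800x800 case from Tables 5-7. */
    const long long threshold = 800LL * 800 * 3 * 3;

    long long small = mask_work(500, 3);    /* 0.225 x 10^7 multiplications */
    long long large = mask_work(1200, 3);   /* 1.296 x 10^7 multiplications */

    printf("500x500,   3x3: %lld multiplications -> %s\n", small,
           small >= threshold ? "parallelize" : "run sequentially");
    printf("1200x1200, 3x3: %lld multiplications -> %s\n", large,
           large >= threshold ? "parallelize" : "run sequentially");
    return 0;
}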

4. Conclusions
The paper first demonstrates the construction of a heterogeneous cluster right on the desktop, using a Core Duo laptop and a desktop to build a 2-node, 3-CPU machine without the hassles of space and money. Using different applications, the paper then suggests various ways to optimize a cluster that consists of nodes with different hardware configurations. The conclusions can be summarized as follows:
1) Choose the most appropriate number of processes irrespective of the number of CPUs available, such that no CPU is idle throughout the execution of a package (all tasks in a program, including master and slave sections). For example, in matrix multiplication and image processing the best-performing approach was to choose n = 4 processes on a 3-CPU, 2-node cluster.
2) Depending upon the nature of the task, change the number of CPUs declared on each node and divide the task in the most appropriate fashion. For example, the FFT by its nature is best executed on a butterfly network; therefore simulate this network with your task-distribution scheme.
3) Different CPUs in a parallel system may give varied performance for the same task. In such a scenario the CPU must be chosen appropriately for faster execution. For example, the AMD Athlon calculated π faster than the Core Duo, so more processes should be queued for the Athlon to maximize the performance gained from the system.
Finally, we conclude that parallel computing does not require one to buy expensive systems and set up huge labs; rather, research on a parallel paradigm can be done right away using a single desktop and a single laptop. Also, by simply following the various optimization techniques mentioned above and using mpirun more efficiently, one can do more justice to a heterogeneous cluster setup.

5. References
[1] OSCAR, Open Cluster Group, official website: http://oscar.openclustergroup.org
[2] LAM/MPI, distributed Message Passing Interface: http://www.lam-mpi.org
[3] OSCAR 5.0 install guide: http://svn.oscar.opencluster.org/wiki/oscar5.0:instal_guide
[4] Kai Hwang and Faye A. Briggs, "Computer Architecture and Parallel Processing", McGraw-Hill, 1984.
[5] Michael J. Quinn, "Parallel Computing: Theory and Practice", McGraw-Hill, 2002.
[6] Harry F. Jordan and Gita Alaghband, "Fundamentals of Parallel Processing", PHI, 2003.
[7] Barry Wilkinson and Michael Allen, "Parallel Programming", Pearson Education, 2002.
[8] V. Rajaraman and C.S. Ram Murthy, "Parallel Computers: Architecture and Programming", PHI, 2000.
[9] Wouter Caarls, Pieter Jonker and Henk Corporaal, "Skeletons and Asynchronous RPC for Embedded Data and Task Parallel Image Processing", IEICE Transactions on Information and Systems, 2006, E89-D(7):2036-2043.
[10] E.V. Rusin, "Design and Application of the Library PLVIP for Parallel Image Processing", (489) ACIT - Software Engineering, 2005.
[11] Neeta Nain, Vijay Laxmi, Ankur Jain and Rakesh Agarwal, "Morphological Edge Detection and Corner Detection Algorithm using Chain Encoding", IPCV'06, Vol. II, pp. 520-525, Las Vegas, 2006.