Special session: Advanced information and wireless communication systems
Session organizer: Hesham Eldeeb

PC cluster as a platform for parallel applications
AMANY ABD ELSAMEA, HESHAM ELDEEB, SALWA NASSAR
Computer & System Department
Electronic Research Institute
National Research Center, Dokki, Giza
Cairo, EGYPT-12622

Abstract: - The complexity and size of the current generation of supercomputers has led to the emergence of cluster computing, which is characterized by its scalability, flexibility of configuration and upgrade, high availability, and improvement of cost and time. This paper explains the importance of cluster computing and its advantages and disadvantages. It also presents the types of schedulers and the steps of building the cluster. The work herein also evaluates this cluster with two case studies: matrix multiplication as a simple case study and Sobel edge detection as a heavy computation one.
Key-Words: - cluster computing, middleware, latency, scheduler, image processing, execution time, efficiency.

1 Introduction
The distribution and sharing of resources allows systems like supercomputers and large databases to be built at much lower cost. Also, requirements for high availability and fault tolerance can in many cases only be realized in a distributed system [1]. Distributed systems consist of several computers that communicate with each other by message passing over a communication network and carry out some cooperative activity. They are developed on top of existing networking and operating system software, and they are not easy to build and maintain. To simplify their development and maintenance, a new layer of software called middleware is being developed. This layer provides high-level services, abstracting over low-level details that may differ between platforms, and allows multiple processes running on one or more machines to interact transparently across a network. If the distributed resources happen to be managed by a single global centralized scheduling system, then it is a cluster. A Linux cluster is a collection of interconnected parallel or distributed machines that can be viewed and used as a single, unified computing resource. Clusters can consist of homogeneous and heterogeneous collections of von Neumann (serial) and parallel architecture computers or even sub-clusters [2]. A cluster system can be viewed as being made up of four major components, two hardware and two software. The two hardware components are the nodes that perform the work and the network that interconnects the nodes to form a single system. The two software components are the collection of tools used to develop user parallel application programs and the software environment for managing the parallel resources of the cluster [3].
The paper is organized as follows: Section 2 explains the importance of cluster computing and its advantages and disadvantages. Section 3 presents the architecture of a Linux cluster. Section 4 discusses the types of schedulers. Section 5 explains the steps of building our Linux cluster and the performance evaluation of two case studies on the cluster. Finally, the conclusion is discussed in Section 6.

2 Cluster computing advantages and disadvantages

As high performance local and wide area networks have become less expensive and as the price of commodity computers has dropped, it is now possible to connect a number of relatively cheap computers with a network for the widespread, efficient sharing of data, producing a cluster, which is a type of distributed system. Cluster parallel processing offers several important advantages:
- Cluster computing can scale to very large systems.
- Better price/performance ratios.
- High availability: clusters provide multiple redundant identical resources that, if managed correctly, can provide continuous system operation through graceful degradation even as individual components fail.
- Flexibility of configuration and upgrade [3].
Although clusters have several advantages, they have disadvantages too:
- Generally, network hardware is not designed for parallel processing. Typically latency is very high and bandwidth is relatively low compared to SMP (Symmetric Multiprocessor) systems and attached processors. For example, SMP latency is generally no more than a few microseconds, but it is commonly hundreds or thousands of microseconds for a cluster. SMP communication bandwidth is often more than 100 MBytes/second, whereas even the fastest ATM network connections are more than five times slower.

3 Linux cluster architecture
A cluster is a group of computers which work together toward a final goal. The architecture of a PC cluster is shown in Fig.1, where the first layer is the hardware, that is, the nodes that perform the work and the network that interconnects the nodes. The operating system used in this cluster is Linux, which is the most popular open source operating system in the world. Its success is due to its stability, availability, and straightforward design, and it can easily be modified and rearranged for whatever task is required. While most Linux clusters use a local file system for scratch data, it is often convenient to use network-based or distributed file systems to share data. The most common and most popular are NFS (Network File System), which allows remote hosts to mount partitions of a particular system and use them as though they were local file systems, and NIS (Network Information Services), which allows one to set up a server and then configure a number of client machines that ask that server whether the person logging into the client machine is allowed to do so. The advantage is that usernames and passwords are stored in one place. NIS and NFS represent the middleware layer.
Message-passing libraries are implemented on HPC (High Performance Computing) systems using two separate standards, PVM (Parallel Virtual Machine) and MPI (Message Passing Interface). In many ways, MPI and PVM are similar: each defines portable, high-level functions that are used by a group of processes to make contact and exchange data without having to be aware of the communication medium. Support for each is available over the Internet at low or no cost. Each supports C and Fortran 77, and each provides for automatic conversion between different representations of the same kind of data so that processes can be distributed over a heterogeneous computer network. The difference between MPI and PVM is in the support for the "topology" of the communicating processes. In MPI, the group size and topology are fixed when the group is created, which permits low-overhead group operations. In PVM, group composition is dynamic, which requires the use of a "group server" process and causes more overhead in common group-related operations. Other differences are found in the design details of the two interfaces. MPI, for example, supports asynchronous and multiple message traffic, so that a process can wait for any of a list of message-receive calls to complete and can initiate concurrent sending and receiving [4].
Providing a cluster requires software to effectively control job and system resources, load balance across the network, maximize the use of shared resources, and make sure that everyone can effectively and equitably utilize those resources. This software is known as the resource manager (job scheduler). It takes job requests from user input or other means and schedules them to run on the required number of nodes in the cluster. The next section discusses the different types of schedulers.

Fig.1 Cluster architecture (layers, from top to bottom: application, scheduler, PVM/MPI, middleware, hardware)
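To make the MPI interface described above concrete, the following is a minimal sketch of a message-passing program; it is only an illustration under assumed buffer contents and process counts, not code from the paper. Every process runs the same executable, learns its rank within the fixed group, and rank 0 sends a small array to rank 1:

/* Minimal MPI sketch: a group of processes make contact and exchange data. */
/* Illustrative only; compile with an MPI C compiler and run on >= 2 nodes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    double buf[4] = {1.0, 2.0, 3.0, 4.0};   /* arbitrary example data */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's identity      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* group size, fixed at startup */

    if (rank == 0 && size > 1) {
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process %d received %.1f ...\n", rank, buf[0]);
    }

    MPI_Finalize();
    return 0;
}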

4 Types of schedulers

There are a number of specialized scheduling software products available. These may be divided into batch queueing systems and extended batch systems. Batch queueing systems are designed for use on tightly interconnected clusters, which usually feature shared file systems. Extended batch systems, designed for use in loosely interconnected clusters, do not usually make assumptions about shared file systems, and often offer increased functionality over typical batch queueing systems. Examples of batch queueing systems are DQS, GNQS, PBS, EASY, LSF and LoadLeveler, while examples of extended batch systems are Condor, PRM, CCS and Codine [2]. In our system, we chose PBS (Portable Batch System) since it provides many features and benefits to the cluster administrator, which are:
(a) User interfaces: XPBS provides a graphical interface for submitting both batch and interactive jobs, querying job, queue, and system status, and tracking the progress of jobs. Also available is the OpenPBS command line interface (CLI), providing the same functionality as XPBS.
(b) Job priority: users can specify the priority of their jobs, and defaults can be provided at both the queue and system level.
(c) Job interdependency: it enables the user to define a wide range of interdependencies between batch jobs. Such dependencies include execution order, synchronization, and execution conditioned on the success or failure of a specified other job.
(d) Automatic file staging: it provides users with the ability to specify any files that need to be copied onto the execution host before the job runs, and any that need to be copied off after the job completes. The job will be scheduled to run only after the required files have been successfully transferred.
(e) Single or multiple queue support: it can be configured with as many queues as desired.
(f) Multiple scheduling algorithms: with OpenPBS one can select the standard first-in first-out scheduling or a more sophisticated scheduling algorithm, along with other features discussed in [5], [6].
The next section describes the steps of building the cluster and its performance evaluation.

5 Building and performance evaluation of the cluster
The cluster consists of a number of nodes; one node is the master (server) and the other nodes are the slaves (clients), as shown in Fig.2. The cluster is built as described in the following steps:
1. Linux is used as the operating system.
2. The physical cluster network is built using the Network File System (NFS) and Network Information Services (NIS). The NFS and NIS setup of the network server and clients is then checked.
3. MPI is installed and tested with several programs.
4. The Portable Batch System (PBS) scheduler is installed and configured on the server side and on the client side.
5. The queue manager is configured on the server, and a PBS batch script is then applied to it.

Fig.2 Hardware architecture of the cluster (a server and three clients, Client1 to Client3, interconnected by the network)

For the performance evaluation of this cluster, two case studies, matrix multiplication as a simple one and edge detection as a practical one, are implemented. The performance metrics used to evaluate these two case studies are, first, the execution time and, second, the efficiency, which is defined by

E = S / P    (2)

where E is the efficiency, S is the speedup (the ratio of the serial execution time to the parallel execution time), and P is the number of processors. The closer the efficiency is to one, the more perfectly parallel the task is at that level of parallelism; the closer it is to zero, the lower the degree of parallelism [7].
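As an illustration of how these metrics might be collected in practice (a sketch under assumptions, not the authors' measurement code; the serial time would have to be measured separately), an MPI program can time its computational core with MPI_Wtime and derive the speedup and efficiency on the master:

/* Sketch: measuring execution time and computing efficiency E = S / P. */
#include <mpi.h>
#include <stdio.h>

static void do_work(void) { /* placeholder for the parallel computation */ }

int main(int argc, char *argv[])
{
    int rank, nprocs;
    double t_serial = 42.0;        /* assumed: separately measured 1-CPU time */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double t0 = MPI_Wtime();       /* wall-clock start                     */
    do_work();
    MPI_Barrier(MPI_COMM_WORLD);   /* wait for every process to finish     */
    double t_parallel = MPI_Wtime() - t0;

    if (rank == 0) {
        double speedup    = t_serial / t_parallel;   /* S                  */
        double efficiency = speedup / nprocs;        /* E = S / P          */
        printf("time = %.3f s, speedup = %.2f, efficiency = %.2f\n",
               t_parallel, speedup, efficiency);
    }

    MPI_Finalize();
    return 0;
}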

5.1 Matrix multiplication case study
For the parallelization of the matrix multiplication case study, the master distributes the data among the workers, which perform the actual multiplication in smaller blocks and send their respective results back to the master. We change the size of the matrices and record the execution time of the parallel program using different numbers of processors and the different cluster layers, which are MPI and PBS. It is clear that there is a dramatic reduction of the execution time with an increasing number of processors, as shown in Fig.3.

Fig.3 Execution time for the matrix multiplication case study using matrix size 500x500 (1 to 4 processors, MPI vs PBS)

Fig.4 and Fig.5 show the execution time and the efficiency, respectively, for the matrix multiplication case study using MPI running interactively and PBS with its default scheduler (first-in first-out) applied on four processors. It is clear that the execution time decreased when using PBS because the job startup time is greatly decreased, so to use the resources of the cluster most efficiently, jobs that take more than the allowed CPU time (long jobs) must be executed using batch requests. In the case of MPI, the efficiency is very low for small matrix sizes because of the increased overhead due to communication and synchronization among processors; the chance to exploit efficient parallelism arises only when computations on large matrices are distributed among the processors. When using PBS, the efficiency for both small and large matrix sizes is improved.

Fig.4 Execution time for the matrix multiplication case study using four processors (matrix sizes 100x100 to 500x500, MPI vs PBS)

Fig.5 Efficiency for the matrix multiplication case study using four processors (matrix sizes 100x100 to 500x500, MPI vs PBS)
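The master-worker decomposition described at the start of this subsection might be sketched as follows; the row-block partitioning, the message tags, and the assumption that the matrix size divides evenly among the workers are illustrative choices, not the authors' actual code:

/* Sketch of master-worker matrix multiplication C = A x B over MPI.      */
/* Assumes at least one worker and N divisible by the number of workers.  */
#include <mpi.h>
#include <stdlib.h>

#define N 500                      /* matrix size used in the experiments */

int main(int argc, char *argv[])
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int workers = nprocs - 1;          /* rank 0 is the master            */
    int rows = N / workers;            /* rows of A assigned to a worker  */
    double *B = malloc(N * N * sizeof(double));

    if (rank == 0) {                   /* ---- master ---- */
        double *A = malloc(N * N * sizeof(double));
        double *C = malloc(N * N * sizeof(double));
        /* ... fill A and B with input data here ... */
        for (int w = 1; w <= workers; w++) {
            MPI_Send(&A[(w - 1) * rows * N], rows * N, MPI_DOUBLE,
                     w, 0, MPI_COMM_WORLD);            /* block of A      */
            MPI_Send(B, N * N, MPI_DOUBLE, w, 1, MPI_COMM_WORLD);
        }
        for (int w = 1; w <= workers; w++)             /* gather results  */
            MPI_Recv(&C[(w - 1) * rows * N], rows * N, MPI_DOUBLE,
                     w, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {                           /* ---- worker ---- */
        double *a = malloc(rows * N * sizeof(double));
        double *c = calloc(rows * N, sizeof(double));
        MPI_Recv(a, rows * N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Recv(B, N * N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (int i = 0; i < rows; i++)                 /* local multiply  */
            for (int k = 0; k < N; k++)
                for (int j = 0; j < N; j++)
                    c[i * N + j] += a[i * N + k] * B[k * N + j];
        MPI_Send(c, rows * N, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}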

5.2 Sobel edge detection case study
Computations for edge detection are performed on a pixel-by-pixel basis, with many arithmetic operations performed on each pixel. The complete detection of edges in a gray-level image is generally carried out in three steps. First, the image is convolved with a derivative mask (operator mask) that produces a measure of the intensity gradient. Second, a threshold operation is applied in which points contributing to edges are identified as those exceeding a set level of intensity gradient values. Third, the edge points are combined into coherent edges by applying a linking algorithm.
For this study, the Sobel masks were chosen because of their smoothing and differencing effects. The Sobel operator performs a 2-D spatial gradient measurement on an image. Typically it is used to find the approximate absolute gradient magnitude at each point in an input grayscale image. The Sobel edge detector uses a pair of 3x3 convolution masks, one estimating the gradient in the x-direction (columns) and the other estimating the gradient in the y-direction (rows). A convolution mask is usually much smaller than the actual image; as a result, the mask is slid over the image, manipulating a square of pixels at a time [8], [9]. The actual Sobel masks are shown below:

Gx:  -1   0  +1        Gy:  +1  +2  +1
     -2   0  +2              0   0   0
     -1   0  +1             -1  -2  -1

The magnitude of the gradient is then calculated using the formula:

|G| = sqrt(Gx^2 + Gy^2)    (3)

Parallel Sobel edge detection naturally uses a master-worker paradigm. A flow chart of the responsibilities of the master and the workers is provided in Fig.6. The advantage of the non-working master, which is used in this case, is that as soon as a worker sends its results, the master can almost immediately receive and evaluate them. The master reads the image and then distributes it to the slaves. Each slave works on its part of the image, performs Sobel edge detection, and sends the sub-detected image to the master, which gathers those subimages into the final detected image.

Fig.6 Parallel Sobel edge detection flow chart (master: input image data from file, send images to workers, receive results from workers, put the resulting detected image in a file; worker: receive the image from the master, calculate the number of rows in each subimage, perform Sobel edge detection on a part of the image, send the result to the master)
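A sketch of the per-worker computation is given below; distributing the row strips and gathering the partial results would follow the master-worker flow of Fig.6, in the same style as the matrix multiplication sketch. The strip layout with halo rows and the clamping of the magnitude to 255 are assumptions made for illustration:

/* Sketch: Sobel edge detection applied by one worker to its strip of rows. */
/* Link with -lm for sqrt().                                                */
#include <math.h>

/* Standard Sobel masks for the x- and y-direction gradients. */
static const int GX[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
static const int GY[3][3] = { { 1, 2, 1}, { 0, 0, 0}, {-1,-2,-1} };

/* in:  (rows + 2) x width grayscale strip, including one halo row above
        and below so the mask can be centred on every assigned row.
   out: rows x width gradient-magnitude strip for this worker; the border
        columns are left untouched (assumed zero-initialized by the caller). */
void sobel_strip(const unsigned char *in, unsigned char *out,
                 int rows, int width)
{
    for (int i = 1; i <= rows; i++) {
        for (int j = 1; j < width - 1; j++) {
            int gx = 0, gy = 0;
            for (int u = -1; u <= 1; u++)          /* slide the 3x3 masks */
                for (int v = -1; v <= 1; v++) {
                    int p = in[(i + u) * width + (j + v)];
                    gx += GX[u + 1][v + 1] * p;
                    gy += GY[u + 1][v + 1] * p;
                }
            double g = sqrt((double)(gx * gx + gy * gy));  /* |G|         */
            out[(i - 1) * width + j] = g > 255.0 ? 255 : (unsigned char)g;
        }
    }
}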

Fig.7 shows two columns: column A represents images of different sizes and column B represents the edge-detected images.

Fig.7 Images of different sizes (column A) and their edge-detected images using Sobel edge detection (column B)

Then we record the execution time of the program using images of different sizes applied on different numbers of processors. Fig.8 and Fig.9 show the execution time and the efficiency, respectively, using MPI running interactively and PBS applied on four processors.

Fig.8 Execution time for the Sobel edge detection case study using four processors (image sizes 256x256 to 506x700, MPI vs PBS)

Fig.9 Efficiency for the Sobel edge detection case study using four processors (image sizes 256x256 to 506x700, MPI vs PBS)

From the above figures, the efficiency using PBS is more than 30% better than with MPI. It is also clear that the execution time for this case study is smaller than that of the matrix multiplication case study: since the matrix case study sends two matrices to the processors, its communication time is larger than that of the image case study.

6 Conclusion
We built a Linux cluster from a number of relatively cheap computers connected together by a network to provide widespread, efficient sharing of resources. In this paper the performance of the PC cluster is evaluated by two case studies running on two different layers of the cluster. First, the case studies were run on MPI interactively and showed better results than serial processing. Second, they were run using PBS, which gives better results because it uses the full resources of the cluster. The parallelization of the matrix multiplication and edge detection case studies also decreases the execution time as the number of processors increases, which improves the efficiency. This paper showed that the cluster is an efficient platform for running heavy computation applications and that it gives better price/performance ratios. As future work, this cluster will be extended over a WAN to form a Grid, which provides flexible, secure, coordinated resource sharing.

Acknowledgement: This work is partially funded by NSF project No. 0138760 “RAMSys: Collaborative Metacomputing System”.

References:
[1] N. Nicolas and B. Skarup, "Java Grid: Building A Grid Computer Engine with Jinni and Java", Bachelor Thesis in Computer Science, Distributed Systems, Aalborg University, 2002.
[2] H. A. James, "Scheduling in Meta Computing Systems", PhD thesis, Department of Computer Science, University of Adelaide, July 1999.
[3] T. Sterling, "Beowulf Cluster Computing with Linux", MIT Press, Cambridge, 2001.
[4] D. Cortesi, A. Evans, W. Ferguson and J. Hartman, "Topics in IRIX Programming", Silicon Graphics, 1996.
[5] V. Hazlewood, "Cluster Computing: A Survey and Tutorial", SysAdmin, March 1997.
[6] B. Bode, D. M. Halstead, R. Kendall and Z. Lei, "The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters", Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta, October 2000.
[7] W. Ramadan, "Performance Evaluation of Multithreaded Programming Over Distributed Memory Message Passing in a Multiprocessor Computer System", M.Sc. thesis, Faculty of Engineering, Cairo University, Giza, Egypt, November 1997.
[8] L. Hopwood, W. Miller and A. George, "Parallel Implementation of the Hough Transform for the Extraction of Rectangular Objects", Proc. IEEE Southeastcon, IEEE cat. no. 96CH35880, pp. 261-264, April 1996.
[9] J. Barbosa, J. Tavares and A. J. Padilha, "Parallel Image Processing System on a Cluster of Personal Computers", Vector and Parallel Processing, 4th International Conference, Porto, Portugal, June 2000.
