Writing Parallel Programs on LINUX CLUSTER

Kabir Ahmed
Syed Asadul Haq
Md. Monerul Islam
Md. Iftakher Hossain

A graduation project completed under the supervision of Syed Akhter Hossain, Associate Professor & Chairperson, Department of Computer Science & Engineering

East West University 43 Mohakhali, Dhaka-1212.

Writing Parallel Programs on LINUX CLUSTER

Kabir Ahmed
Syed Asadul Haq
Md. Monerul Islam
Md. Iftakher Hossain

Copyright © October 2003

Modification of any part of this document without the prior permission of the author(s) is considered to be a violation of copyright law.

A graduation project completed under the supervision of Syed Akhter Hossain, Associate Professor & Chairperson, Department of Computer Science & Engineering

East West University 43 Mohakhali, Dhaka-1212.

ACKNOWLEDGMENTS

We would like to thank our project supervisor, Syed Akhter Hossain, Associate Professor and Chairperson of the Department of Computer Science and Engineering. It is obvious that without his endless support, encouragement and assistance it would not have been possible for us to complete this report. We are grateful to Professor Dr. Mozammel Huq Azad Khan for his patience and supportive attitude. We would also like to mention Safiqul Islam, lab assistant, for his tremendous help. We are grateful to all our teachers and to our friends for their kind cooperation during our graduation period. Above all, we are very grateful to our family members for their endless love, encouragement and support.

Abstract

Parallel programming has evolved over the years with new dimensions, and the scientific community keeps fostering new approaches to it. Network clustering with parallel programs has become the cheapest practical alternative to a supercomputer. This paper is an initiative toward a newer approach of using parallel programming languages on top of the Message Passing Interface (MPI) on a Linux cluster. It discusses the basic features of building a Linux cluster and of parallel programming using different implementations of MPI, and briefly measures the performance of the cluster.

Table of Contents

CHAPTER 01 – Introduction
1. Introduction
   1.1 Beowulf – A Parallel Computing Architecture
   1.2 The Evolution of Beowulf
       1.2.1 First-Generation Beowulf
       1.2.2 Second-Generation Beowulf
             1.2.2.1 BProc
             1.2.2.2 The Scyld Implementation
   1.3 References

CHAPTER 02 – Parallel Computing Architecture
2. Parallel Computing Architecture
   2.1 Parallel Computing Systems
       2.1.1 SIMD Systems
       2.1.2 MIMD Systems
       2.1.3 SPMD Systems
   2.2 Beowulf Architecture
   2.3 Cluster Design
       2.3.1 Cluster Setup and Installation
       2.3.2 OS Installation
       2.3.3 Configuration of Master Node and Client Node
   2.4 Testing the System
   2.5 References

CHAPTER 03 – Communication API
3. Communication API
   3.1.1 Parallel Virtual Machine
   3.1.2 Message Passing Interface (MPI)
       3.1.2.1 Architecture of MPI
   3.2 Software Bindings on MPI
       3.2.1 MPICH
             3.2.1.1 Installation and Configuration of MPICH
             3.2.1.2 Testing the Beowulf Cluster and MPICH Configuration
       3.2.2 mpiJava
             3.2.2.1 Class Hierarchy of mpiJava
             3.2.2.2 API of mpiJava
             3.2.2.3 Installation and Configuration of mpiJava
       3.2.3 HPJava
             3.2.3.1 Installing HPJava
             3.2.3.2 Compiling and Running HPJava Programs
   3.3 References

CHAPTER 04 – Writing Parallel Application
4. Parallel Application Architecture
   4.1 Writing Parallel Programs in MPICH
       4.1.1 HelloWorld.c
   4.2 A Parallel Java Application using mpiJava
       4.2.2 Source Code of Matrix-Matrix Multiplication using Scatter-Gather
       4.2.3 Result of Matrix-Matrix Multiplication
   4.3.1 Image Enhancement using Fourier Transform
       4.3.1.1 How the Fourier Transform Works
       4.3.1.2 Implementation
       4.3.1.3 Source Code
   4.4 References

CHAPTER 05 – Performance Analysis
5. Performance Analysis
   5.1 PI Calculation in MPICH
   5.2 Matrix-Matrix Multiplication in MPICH
   5.3 Matrix-Matrix Multiplication in mpiJava
   5.4 References

CHAPTER 06 – Conclusion & Future Works
6. Conclusion
Communicated Paper to ICCIT 2003

WRITING PARALLEL PROGRAMS ON LINUX CLUSTER

Chapter 1

Introduction

1. Introduction


From the dawn of the computer era, computational power has been the main driving force in the development of computers. Scientists and engineers believe that more computational power makes a computer more powerful, and on the basis of computational power they classify computers into categories such as supercomputers, minicomputers and microcomputers. Hardware vendors keep developing more powerful CPUs to gain more processing power; nowadays processors running at over 3 GHz are available for desktop computers. Yet to run a complex scientific program, such as the simulation of a weather-forecasting model, a complex fluid dynamics code or a data mining application, we still need a supercomputer, or more precisely the processing power of a supercomputer.

The obvious question is what causes this ever-escalating need for greater computational power. The answer lies in the fact that, for centuries, science has followed the basic paradigm of first observe, then theorize, then test the theory through experimentation. Similarly, engineers have traditionally first designed (typically on paper), then built and tested prototypes, and finally built a finished product. However, it is becoming less expensive to carry out detailed computer simulations than to perform numerous real experiments or build a series of prototypes. Thus experiment and observation in the scientific paradigm, and design and prototyping in the engineering paradigm, are increasingly being replaced by computation. Furthermore, in some cases we can now simulate phenomena that could not be studied by experimentation at all, e.g., the evolution of the universe.

But the cost of a supercomputer is extremely high, and its installation and maintenance are also very complex. Moreover, it has been shown that, in some cases, greater computational power subsumes both greater speed and greater storage [1]. So, to meet the need for more computational power for complex applications, scientists devised a new approach called parallel computing: a method of computing in which a collection of computers works together to solve a problem.

As the performance of commodity computer and network hardware increases and their prices decrease, it becomes more and more practical to build parallel computing systems from off-the-shelf components rather than to buy CPU time on very expensive supercomputers. In fact, the price/performance ratio of a Beowulf-type machine is between three and ten times better than that of traditional supercomputers. The Beowulf architecture also scales well, it is easy to construct, and one only has to pay for the hardware, since most of the software is free.

1.1 Beowulf – A Parallel Computing Architecture

There are probably as many Beowulf definitions as there are people who build or use Beowulf supercomputing facilities. Some claim that one can call a system Beowulf only if it is built in the same way as NASA's original machine. Others go to the other extreme and call any system of workstations running parallel code a Beowulf. We take a definition that lies between these two views. Beowulf is a multi-computer architecture which can be used for parallel computations. It is a system which usually consists of one server node and one or more client nodes connected together via Ethernet or some other network. It is built from commodity hardware components, such as any PC capable of running Linux, standard Ethernet adapters, and switches. It does not contain any custom hardware components and is trivially reproducible. Beowulf also uses commodity software such as the Linux operating system, Parallel Virtual Machine (PVM) and Message Passing Interface (MPI). The server node controls the whole cluster and serves files to the client nodes. It is also the cluster's console and gateway to the outside world. Large Beowulf machines might have more than one server node, and possibly other nodes dedicated to particular tasks, for example consoles or monitoring stations. In most cases client nodes in a Beowulf system are dumb, the dumber the better. Nodes are configured and controlled by the server node and do only what they are told to do. In a disk-less client configuration, client nodes do not even know their IP address or name until the server tells them what it is. One of the main differences between Beowulf and a Cluster of Workstations (COW) is that Beowulf behaves more like a single machine rather than many workstations. In most cases client nodes do not have keyboards or monitors and are accessed only via remote login or possibly a serial terminal or a KVM (Keyboard/Video/Mouse) switch. Beowulf nodes can be thought of as a CPU + memory package which can be plugged into the cluster, just as a CPU or memory module can be plugged into a motherboard. Beowulf is not a special software package or a new network topology; rather, it is a technology for clustering Linux computers to form a parallel, virtual supercomputer [2].

1.2 The Evolution of Beowulf

The original concept for Beowulf clusters was conceived by Donald Becker while he was at NASA Goddard in 1994 [3]. The premise was that commodity computing parts could be used, in parallel, to produce an order-of-magnitude leap in computing price/performance for a certain class of problems. The proof of concept was the first Beowulf cluster, Wiglaf, which was operational in late 1994. Wiglaf was a 16-processor system with 66 MHz Intel 80486 processors that were later replaced with 100 MHz DX4s, achieving a sustained performance of 74 Mflops/s (74 million floating-point operations per second). Three years later, Becker and the CESDIS (Center of Excellence in Space Data and Information Sciences) team won the prestigious Gordon Bell award. The award was given for a cluster of Pentium Pros assembled for SC'96 (the 1996 Supercomputing Conference) that achieved 2.1 Gflops/s (2.1 billion floating-point operations per second). The software developed at Goddard was in wide use by then at many national labs and universities.

1.2.1 First-Generation Beowulf The first generation of Beowulf clusters had the following characteristics: commodity hardware, open-source operating systems such as Linux or FreeBSD and dedicated compute nodes residing on a private network. In addition, all of the nodes possessed a full operating system installation, and there was individual process space on each node.

These first-generation Beowulfs ran software to support a message-passing interface, either PVM (Parallel Virtual Machine) or MPI (Message Passing Interface). Message passing is typically how slave nodes in a high-performance computing (HPC) cluster environment exchange information.

Some common problems plagued the first-generation Beowulf clusters, largely because the system management tools used to control the new clusters did not scale well, being more platform- or operating-system-specific than the parallel programming software. After all, Beowulf is all about running high-performance parallel jobs, and far less attention went into writing robust, portable system administration code. The following types of problems hampered early Beowulfs:



• Early Beowulfs were difficult to install. There was either the labor-intensive, install-each-node-manually method, which was error-prone and subject to typos, or the more sophisticated install-all-the-nodes-over-the-network method using PXE/TFTP/NFS/DHCP; clearly, getting all of one's acronyms properly configured and running at once is a feat in itself.

• Once installed, Beowulfs were hard to manage. A semi-large cluster with dozens or hundreds of nodes becomes nearly impossible to manage by hand. To run a new kernel on a slave node, one has to install the kernel in the proper place and tell LILO (or another boot loader) all about it, dozens or hundreds of times. To facilitate node updates, the r commands, such as rsh and rcp, were employed. The r commands, however, require user account management on the slave nodes and open a plethora of security holes.

• It was hard to adapt the cluster: adding new computing power in the form of more slave nodes required fervent prayers to the Norse gods. To add a node, one had to install the operating system, update all the configuration files, update the user space on the node and, of course, all the HPC code that had configuration requirements of its own.

• It didn't look and feel like a computer; it felt like a lot of little independent nodes off doing their own thing, sometimes playing together nicely long enough to complete a parallel programming job.

In short, for all the progress made in harnessing the power of commodity hardware, there was still much work to be done in making Beowulf 1 an industrial-strength computing appliance. Over the last year or so, the Rocks and OSCAR clustering software distributions have developed into the epitome of Beowulf 1 implementations [ ``The Beowulf State of Mind'', LJ May 2002, and ``The OSCAR Revolution'', LJ June 2002]. But if Beowulf commodity computing was to become more sophisticated and simpler to use, it was going to require extreme Linux engineering.


1.2.2 Second-Generation Beowulf
The hallmark of second-generation Beowulf is that the most error-prone components have been eliminated, making the new design far simpler and more reliable than first-generation Beowulf. Scyld Computing Corporation, led by CTO Don Becker and some of the original NASA Beowulf staff, has achieved a breakthrough in Beowulf technology as significant as the original Beowulf itself was in 1994. The commodity aspects and message-passing software remain constant from Beowulf 1 to Beowulf 2. However, significant modifications have been made in node setup and process space distribution.

1.2.2.1 BProc
At the very heart of the second-generation Beowulf solution is BProc, short for Beowulf Distributed Process Space, which was developed by Erik Arjan Hendriks of Los Alamos National Lab. BProc consists of a set of kernel modifications and system calls that allow a process to be migrated from one node to another. The process migrates under the complete control of the application itself: the application explicitly decides when to move to another node and initiates the migration via an rfork system call. The process is migrated without its associated file handles, which makes the migration lean and quick. Any required files are re-opened by the application itself on the destination node, giving complete control to the application process.

Of course, the ability to migrate a process from one node to another is meaningless without the ability to manage the remote process. BProc provides such a method by putting a ``ghost process'' in the master node's process table for each migrated process. These ghost processes require no memory on the master; they are merely placeholders that communicate signals and perform certain operations on behalf of the remote process. For example, through the ghost process on the master node, the remote process can receive signals, including SIGKILL and SIGSTOP, and fork child processes. Since the ghost processes appear in the process table of the master node, tools that display the status of processes work in the same familiar ways.

The elegant simplicity of BProc has far-reaching effects. The most obvious effect is that the Beowulf cluster now appears to have a single process space managed from the master node. This concept of a single, cluster-wide process space with centralized management is called single-system image or, sometimes, single-system illusion, because the mechanism provides the illusion that the cluster is a single compute resource. In addition, BProc does not require the r commands (rsh and rlogin) for process management because processes are managed directly from the master. Eliminating the r commands means there is no need for user account management on the slave nodes, thereby removing a significant portion of the operating system from the slaves. In fact, to run BProc on a slave node, only a couple of dæmons are required to be present on the slave: bpslave and sendstats.

1.2.2.2 The Scyld Implementation
Scyld has fully leveraged BProc to provide an expandable cluster computing solution, eliminating everything from the slave nodes except what is absolutely required to run a BProc process. The result is an ultra-thin compute node that runs only a small portion of Linux, just enough to run BProc. The power of BProc and the ultra-thin Scyld node, taken together, have a great impact on the way the cluster is managed. There are two distinguishing features of the Scyld distribution and of Beowulf 2 clusters. First, the cluster can be expanded by simply adding new nodes; because the nodes are ultra-thin, installation is a matter of booting the node with the Scyld kernel and making it a receptacle for BProc-migrated processes. Second, version skew is eliminated. Version skew is what happens on clusters with fully installed slave nodes: over time, because of nodes that are down during software updates or simple update failures, the software on the nodes that is supposed to be in lockstep shifts out of phase. Since only the bare essentials are required on the nodes to run BProc, version skew is virtually eliminated.

The above gives a brief history of the Beowulf evolution. For further information and references, visit the official Beowulf site at http://www.beowulf.org.

Figure 1. A Beowulf Cluster


1.3 References
[1] Peter S. Pacheco. Parallel Programming with MPI.
[2] James Demmel. Lecture notes for introductory parallel computing, Spring 1995. http://www.cs.berkeley.edu/~demmel/cs267
[3] Glen Otero and Richard Ferri. The Beowulf Evolution. Linux Journal, Issue 100. http://www.linuxjournal.com


Chapter 2

Architecture Overview & System Design

2. Parallel Computing Architecture


Parallel processing refers to the concept of speeding up the execution of a program by dividing it into multiple fragments that can execute simultaneously, each on its own processor. A program executing across n processors might execute n times faster than it would using a single processor. The original classification of parallel computers is popularly known as Flynn's taxonomy. In 1966 Michael Flynn classified systems according to the number of instruction streams and the number of data streams. The classical von Neumann machine has a single instruction stream and a single data stream, and hence is identified as single instruction, single data (SISD) [1]. At the opposite extreme is the multiple instruction, multiple data (MIMD) system, in which a collection of autonomous processors operate on their own data streams; in Flynn's taxonomy, this is the most general architecture for parallel computing. Intermediate between SISD and MIMD systems are SIMD and MISD.

2.1 Parallel Computing Systems

2.1.1 SIMD Systems
SIMD (Single Instruction stream, Multiple Data stream) refers to a parallel execution model in which all processors execute the same operation at the same time, but each processor operates on its own data. This model naturally fits the concept of performing the same operation on every element of an array, and is thus often associated with vector or array manipulation. Because all of these operations are inherently synchronized, interactions among SIMD processors tend to be easy and efficient to implement. The execution of the following code,

for (i = 0; i < 1000; i++)
    if (y[i] != 0.0)
        z[i] = x[i] / y[i];
    else
        z[i] = x[i];

gives a sequence of operations like this:

Time step 1: test local_y != 0.0.
Time step 2:
  a. if local_y was nonzero, z[i] = x[i] / y[i];
  b. if local_y was zero, do nothing.
Time step 3:
  a. if local_y was nonzero, do nothing;
  b. if local_y was zero, z[i] = x[i].

This implies a completely synchronous execution of statements, and the example makes the disadvantage of SIMD systems clear: at any given instant of time, a given subordinate processor is either "active" and doing exactly the same thing as all the other active processors, or it is idle. So, in a program with many conditional branches or long segments of code whose execution depends on conditionals, it is entirely possible that many processors will remain idle for long periods of time.

2.1.2 MIMD Systems
MIMD (Multiple Instruction stream, Multiple Data stream) refers to a parallel execution model in which each processor acts essentially independently [2]. This model naturally fits the concept of decomposing a program for parallel execution on a functional basis; for example, one processor might update a database file while another processor generates a graphic display of the new entry. This is a more flexible model than SIMD execution, but it comes with the risk of debugging nightmares called race conditions, in which a program may intermittently fail due to timing variations reordering the operations of one processor relative to those of another.


2.1.3 SPMD Systems
SPMD (Single Program, Multiple Data) is a restricted version of MIMD in which all processors run the same program. Unlike SIMD, each processor executing SPMD code may take a different control-flow path through the program [7]. The SPMD model captures parallelism well, since it avoids the processor idleness of SIMD systems and reduces the race-condition hazards of general MIMD programming, while symmetric execution of the code across the processors is guaranteed.

2.2 Beowulf Architecture
We stated before that Beowulf is not a special software package, a new network topology or the latest kernel hack. It is a technology for clustering Linux computers to form a virtual supercomputer. It is built on the SPMD model of parallel computing, in which a group of processes cooperates by executing identical program images on local data values. The whole system is built with commodity hardware components, such as any PC capable of running Linux, standard Ethernet adapters, hubs and switches. It also uses common software such as the Linux operating system, Parallel Virtual Machine (PVM) and Message Passing Interface (MPI). It does not require any custom hardware components.

Beowulf systems usually consist of one server node and one or more client nodes connected together via Ethernet or some other network. The server node controls the whole cluster and serves files and commands to the client nodes. It is also the cluster's console and gateway to the outside world. Nodes are configured and controlled by the server node and do only the tasks they are asked to do. In a disk-less client configuration, client nodes don't even know their IP address or name until the server node tells them what it is.


A large Beowulf cluster may contain more than one server node, with additional nodes dedicated to specific tasks such as monitoring cluster performance. Most importantly, the client nodes don't have keyboards or monitors: the terminals of the client nodes are attached to the server node through KVM (Keyboard/Video/Mouse) switches, which plays the main role in letting the cluster be seen as a single machine rather than a pile of PCs. The physical layout of a Beowulf system looks like the following picture:

Figure 2. The physical layout of a Beowulf system

2.3 Cluster Design
Beowulf clusters have been constructed from a variety of parts, so the type of application that will run on the cluster and the availability of hardware components determine the system configuration. There is no rule that identical commodity components have to be used throughout the system, but it is commonly stated that a cluster of identically configured nodes will work better than one that is not. In our research work the EWU Beowulf system was built with one server and four client nodes, and the configuration of the nodes was as below:

§ Processor: P4 1.8 GHz
§ RAM: 128 MB
§ Hard disk: 40 GB
§ Network card: Realtek
§ Bandwidth: 100 Mbps

2.3.1 Cluster Setup and Installation
This section covers the construction and configuration of the EWU Beowulf system. It differs from the references in many places, as it is a research work carried out according to the project supervisor's guidance.

There are at least four methods of configuring disk storage in a Beowulf cluster. These configurations differ in price, performance and ease of administration. In this paper we cover the fully local install configuration.

a. Disk-less Configuration
In this configuration the server serves all files to disk-less clients. The main advantage of a disk-less client system is the flexibility it gives in adding new nodes and administering the cluster. Since the client nodes do not store any information locally, when adding a new node you only have to modify a few files on the server or run a script which will do the job. There is no need to install the operating system or any other software on any node except the server. The disadvantages are increased NFS traffic and a slightly more complex initial setup [4,1,2].

b. Fully Local Install
The other extreme is to have everything stored on each client. With this configuration the operating system and all the software have to be installed on every client. The advantage of this setup is that there is no NFS traffic; the disadvantage is a very complicated installation and maintenance. Maintenance of such a configuration can be made easier with shell scripts and utilities such as rsync, which can keep all the file systems up to date.


c. Standard NFS Installation
The third choice is a halfway point between the disk-less client and fully local install configurations. In this setup clients have their own disks with the operating system and swap space stored locally, and only mount /home and /usr/local off the server. This is the most commonly used configuration for Beowulf clusters; a sample server/client configuration for this approach is sketched below.
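The following is only an illustrative sketch of such a standard NFS setup; it is not the configuration used for the EWU cluster described in this report (which is a fully local install), and the node names and 192.168.1.0 network are simply the addresses used elsewhere in this chapter. On the server, /home and /usr/local would be exported to the private network through /etc/exports:

/home       192.168.1.0/255.255.255.0(rw,no_root_squash)
/usr/local  192.168.1.0/255.255.255.0(ro)

and each client would mount them with lines such as these in its /etc/fstab:

node00:/home       /home       nfs  defaults  0 0
node00:/usr/local  /usr/local  nfs  defaults  0 0

After editing /etc/exports, the export list has to be re-read (for example with exportfs -a, or by restarting the NFS service) before the clients can mount the directories.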

2.3.2 OS Installation
For our Beowulf cluster we chose Red Hat Linux version 7.3 (Valhalla). For detailed installation instructions we refer to the Red Hat Linux installation documentation. For the cluster, the setup should be completed with full network support and remote shell facilities. If the packages for basic network communication and remote communication are not installed, they will have to be installed manually later. And since the cluster will operate in a trusted private network, any firewall protection should be removed for proper functioning of the cluster.

2.3.3 Configuration of Master Node and Client Node
1. Create .rhosts files in the user home directories and in /root. Our .rhosts files for the beowulf user are as follows:

node00 beowulf
node01 beowulf
node02 beowulf
node03 beowulf

And the .rhosts files for the root user are:

node00 root
node01 root
node02 root
node03 root


2. Create a hosts file in the /etc directory. The /etc/hosts file for the master node (node00) in our EWU Beowulf cluster is:

192.168.1.220 node00.ewubd.edu node00
127.0.0.1     localhost
192.168.1.221 node01
192.168.1.222 node02
192.168.1.223 node03

The /etc/hosts file for a child node (node01) in our EWU Beowulf cluster is:

192.168.1.221 node01.ewubd.edu node01
127.0.0.1     localhost
192.168.1.220 node00
192.168.1.222 node02
192.168.1.223 node03

Precaution: the ordering of the nodes is very important. The node being configured should be listed first, and the rest of the nodes are listed in ascending order.

3. Modify the hosts.allow file in /etc by adding the following lines.

For node00:
ALL: 192.168.1.220
ALL: 192.168.1.221
ALL: 192.168.1.222
ALL: 192.168.1.223
ALL: node00.ewubd.edu
ALL: node01.ewubd.edu
ALL: node02.ewubd.edu
ALL: node03.ewubd.edu

For node01:
ALL: 192.168.1.221
ALL: 192.168.1.220
ALL: 192.168.1.222
ALL: 192.168.1.223
ALL: node01.ewubd.edu
ALL: node00.ewubd.edu
ALL: node02.ewubd.edu
ALL: node03.ewubd.edu

Precaution: give a space after the colon, and maintain the ordering of the nodes.

4. Modify the hosts.deny file in the /etc directory by adding the following line:


ALL: ALL

5. Add the following entries to the /etc/securetty file (one per line):
rsh
rlogin
rexec
pts/0
pts/1

6. Modify the rsh file in the /etc/pam.d directory so that it reads as follows:

auth    sufficient    /lib/security/pam_nologin.so
auth    optional      /lib/security/pam_securetty.so
auth    sufficient    /lib/security/pam_env.so
auth    sufficient    /lib/security/pam_rhosts_auth.so
auth    sufficient    /lib/security/pam_stack.so service=system-auth
auth    sufficient    /lib/security/pam_stack.so service=system-auth

7. Modify the rsh, rlogin, telnet and rexec files in the /etc/xinetd.d directory: change the disable = yes line to disable = no.

8. After making all the changes, restart xinetd with the following command:
xinetd -restart

2.4 Testing the System
• First use the ping command to test whether there is a physical connection between the nodes.
• Try to log in remotely to each of the machines. A successful login confirms that the remote-access mechanism between the nodes works, so that users of the system can use commands such as rcp and rexec (example commands are shown after this list).
• Install the software needed to run a parallel program, and test the system by running a demo program.
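As a quick illustration, the connectivity and remote-shell checks can be run from the master node roughly as follows; the node names are those from the /etc/hosts files above, and the exact output will of course vary:

prompt $ ping -c 3 node01
prompt $ rsh node01 hostname
prompt $ rlogin node02

If ping reports replies and rsh/rlogin reach the client nodes without asking for a password, the .rhosts and hosts.allow configuration described in Section 2.3.3 is working.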


2.5 References
[1] The latest version of the Beowulf HOWTO: http://www.sci.usq.edu.au/staff/jacek/beowulf
[2] Building a Beowulf System: http://www.cacr.caltech.edu/beowulf/tutorial/building.html
[3] Jacek's Beowulf Links: http://sci.usq.edu.au/staff/jacek/beowulf
[4] Chance Reschke, Thomas Sterling, Daniel Ridge, Daniel Savarese, Donald Becker and Phillip Merkey. A Design Study of Alternative Network Topologies for the Beowulf Parallel Workstation. Proceedings, Fifth IEEE International Symposium on High Performance Distributed Computing, 1996. http://www.beowulf.org/papers/HPDC96/hpdc956.html
[5] Thomas Sterling, Daniel Ridge, Daniel Savarese, Michael R. Berry and Chance Reschke. Achieving a Balanced Low-Cost Architecture for Mass Storage Management through Multiple Fast Ethernet Channels on the Beowulf Parallel Workstation. Proceedings, International Parallel Processing Symposium, 1996. http://www.beowulf.org/papers/IPPS96/ipps96.html
[6] Donald J. Becker, Thomas Sterling, Daniel Savarese, John E. Dorband, Udaya A. Ranawake and Charles V. Packer. Beowulf: A Parallel Workstation for Scientific Computation. Proceedings, International Conference on Parallel Processing, 1995. http://www.beowulf.org/papers/ICPP95/icpp95.html
[7] Beowulf Homepage: http://www.beowulf.org
[8] Extreme Linux: http://www.extremelinux.org
[9] Extreme Linux Software from Red Hat: http://www.redhat.com/extreme

Chapter 3

Communication API

3. Communication API


A basic prerequisite for parallel programming is a good communication API. There are many software packages which are optimized for parallel computation; applications built using these packages pass messages between nodes to communicate with each other. Message-passing architectures are conceptually simple, but their operation and debugging can be quite complex. Two popular message-passing libraries are in use:

• Parallel Virtual Machine (PVM)
• Message Passing Interface (MPI)

3.1.1 Parallel Virtual Machine
PVM is a freely available (http://www.epm.ornl.gov/pvm/pvm_home.htm), portable message-passing library generally implemented on top of sockets. It is clearly established as the de-facto standard for message-passing cluster parallel programming. PVM supports single-processor and SMP Linux machines, as well as clusters of Linux machines linked by socket-capable networks (e.g. SLIP, PLIP, Ethernet and ATM). In fact, PVM will even work across groups of machines in which a variety of different processor types, configurations and physical networks are used – a heterogeneous cluster – even to the scale of treating machines linked by the Internet as a parallel cluster. PVM also provides facilities for parallel job control across a cluster [1,3,5,7]. It is important to note that PVM message-passing calls generally add significant overhead to standard socket operations, which already have high latency. Furthermore, the message-handling calls themselves do not constitute a particularly "friendly" programming model.


3.1.2 Message Passing Interface (MPI)
The MPI standardization effort began at the Williamsburg workshop in April 1992 and was formally organized at Supercomputing '92 (November); the final version of the draft was released in May 1994. Although PVM is the de-facto standard message-passing library, MPI (Message Passing Interface) is the relatively new official standard. MPI is implemented using standard networking primitives. It attempts to preserve the functionality needed by scientific applications while hiding the details of networking, sockets and so on. It is efficient, portable and functional for parallel implementations of programs. The features included in MPI which give it an edge over PVM are:

• Completely separate address spaces and namespaces.
• The library handles all network reliability/retransmission/handshake issues.
• A simple (trivial) naming scheme for communicating participants.
• Point-to-point and collective primitives (send/receive, gather, scatter, broadcast, etc.).
• User-defined data types, topologies and other advanced features.

In our research work we used the Message Passing Interface (MPI) as the communication API between nodes, because we found that MPI provides somewhat more functionality than PVM [1,3,6].

3.1.2.1 Architecture of MPI
MPI has a large library of functions which provides extensive functionality to support the different branches of parallel computing. To be an efficient parallel programmer one needs to master all the parts of MPI. This is essential because there are different communication modes and architectural features; each is distinct from the others and should be used intelligently to achieve parallel efficiency. The architectural features that MPI includes are [7]:

• General
  - Communicators combine context and group for message security.
  - Thread safety.

• Point-to-point communication
  - Structured buffers and derived data types.
  - Communication modes: normal (blocking and non-blocking), synchronous, ready, buffered.

• Collective
  - Both built-in and user-defined collective operations.
  - A large number of data movement routines.
  - Subgroups defined directly or by topology.

• Application-oriented process topologies
  - Built-in support for grids and graphs (uses groups).

• Profiling
  - Hooks allow users to intercept MPI calls and install their own tools.

• Environmental
  - Inquiry.
  - Error control.

MPI has over 126 library routines to support these various communication modes, but only six basic functions are enough to build a complete parallel application. These are:

• MPI_Init() – initializes the MPI environment.
• MPI_Send() – sends a message.
• MPI_Recv() – receives a message.
• MPI_Comm_size() – determines the size of the communicator (the number of nodes in the environment).
• MPI_Comm_rank() – determines the rank of the calling process.
• MPI_Finalize() – finalizes MPI and cleans up all the objects.

The parameters of these routines vary across the different bindings of MPI; that is, implementations of MPI in different languages can have different parameter lists [4].

3.2 Software Bindings on MPI
The Message Passing Interface (MPI) has been implemented in different languages, and many different approaches have been taken to develop MPI libraries. Fortran and C were the first languages for which MPI packages were delivered. The important initiatives are:

• MPICH (Message Passing Interface Chameleon) – in C.
• mpiJava, Java/DSM and JavaPVM use Java as a wrapper for existing frameworks and libraries.
• MPJ, jmpi, DOGMA, JPVM and JavaNOW use pure Java libraries.
• HPJava, Manta, JavaParty and Titanium extend the Java language with new keywords and use a preprocessor or their own compiler to create Java (byte) code.
• WebFlow, IceT and Javelin are web-oriented and use Java applets to execute parallel tasks.

In our research work we completed our experiments using different distributions of MPI. We used MPICH, mpiJava and HPJava, which together cover a wide range of approaches to parallel programming. Here we discuss the features, architecture and other issues of these bindings of MPI; in Chapter 4 we discuss how to write programs for these distributions and give some program listings [4].

3.2.1 MPICH
MPICH is an open-source, portable implementation of the Message Passing Interface standard. It was developed by David Ashton, Anthony Chan, Bill Gropp, Rob Latham, Rusty Lusk, Rob Ross, Rajeev Thakur and Brian Toonen at Argonne National Laboratory. MPICH is a portable implementation of the full MPI specification for a wide variety of parallel and distributed computing environments. MPICH contains, along with the MPI library itself, a programming environment for working with MPI programs. The programming environment includes a portable startup mechanism, several profiling libraries for studying the performance of MPI programs, and an X interface to all of the tools. In MPICH, the six basic MPI functions are implemented with the following parameters:

§ MPI_Init(&argc, &argv);
§ MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
§ MPI_Comm_size(MPI_COMM_WORLD, &p);
§ MPI_Send(msg, msg_length, msg_type, dest, tag, MPI_COMM_WORLD);
§ MPI_Recv(msg, msg_length, msg_type, source, tag, MPI_COMM_WORLD, &status);

MPI_Init() initializes the MPI environment, and every program must start with this call. MPI_Comm_rank() obtains the rank of the calling process. Its first parameter (MPI_COMM_WORLD) is the communicator object, which represents a collection of processes that can send messages to and receive messages from each other; the second parameter (my_rank) stores the rank of the process. MPI_Comm_size() determines the number of processes in the environment; it also takes the communicator object as a parameter and stores the result in its second parameter (p). MPI_Send() and MPI_Recv() send and receive messages between nodes; they take the message buffer, message length, message type and communicator object as parameters. The tag and status arguments are needed when sending or receiving a message because they ensure that the data sent or received has the correct type and length. By conditional branching according to the rank of the process, a program obtains the SPMD structure; a minimal sketch of such a program is given below.
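The following short C program is a sketch of this pattern; it is not one of the report's own listings (those appear in Chapter 4), just a minimal example built from the six calls described above. Every process initializes MPI and queries its rank and the communicator size; the processes with non-zero rank each send one greeting, and process 0 receives and prints them.

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int my_rank, p, source;
    char msg[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);                   /* start the MPI environment   */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);  /* rank of this process        */
    MPI_Comm_size(MPI_COMM_WORLD, &p);        /* total number of processes   */

    if (my_rank != 0) {
        /* every worker process sends one greeting to process 0 */
        sprintf(msg, "Greetings from process %d!", my_rank);
        MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else {
        /* process 0 collects and prints the greetings */
        for (source = 1; source < p; source++) {
            MPI_Recv(msg, 100, MPI_CHAR, source, 0, MPI_COMM_WORLD, &status);
            printf("%s\n", msg);
        }
    }

    MPI_Finalize();                           /* release all MPI objects     */
    return 0;
}

Compiled with mpicc and launched with mpirun -np 4, as described in the next subsection, it should print one greeting per worker process on the master node.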

3.2.1.1 Installation and Configuration of MPICH
In our research we used MPICH version 1.2.5, which is freely available from http://www.mcs.anl.gov/mpi/mpich/download.html. The installation of MPICH 1.2.5 requires a few steps.

1. Unpack mpich.tar.gz in the directory of your choice:
prompt $ gunzip -c mpich.tar.gz | tar xovf -

2. From the newly created directory mpich-1.2.5.x, run:
prompt $ ./configure

3. Compile and install the distribution:
prompt $ make

4. Update the .bash_profile file by adding the MPICH directories to the PATH environment variable, e.g. for the root user:
PATH=/root/mpich-1.2.5/util:/root/mpich-1.2.5/bin:$PATH
and for a user other than root (for example the beowulf user):
PATH=/home/beowulf/mpich-1.2.5/util:/home/beowulf/mpich-1.2.5/bin:$PATH

5. Modify the machines.LINUX file in /root/mpich-1.2.5/util/machines by writing the names of all nodes except the host node. For node00 of the four-node cluster, the file will look as follows:
node01
node02
node03

6. Restart the PC.

3.2.1.2 Testing the Beowulf Cluster and MPICH Configuration
To test the cluster configuration and MPICH, run an example file from the mpich-1.2.5/examples/basic directory. The syntax to run a program is:

prompt $ mpirun -np 4 hello

Here mpirun is the program which launches the executable on the nodes of the cluster, -np 4 determines the number of processes involved in the execution, and hello is the executable binary built from the source program. The detailed process of compiling and executing a parallel program is:

§ Compile the program using mpicc, which includes all the necessary header files and performs the necessary linking with the library files, e.g.
prompt $ mpicc -o hello hello.c

§ Run the program:
prompt $ mpirun -np 4 hello

§ Successful greetings from all of the nodes confirm the installation [5].


3.2.2 mpiJava
mpiJava was developed at Syracuse University by Bryan Carpenter, Mark Baker, Geoffrey Fox and Guansong Zhang. The existing MPI standards specify language bindings for Fortran, C and C++; this approach implements a Java API on top of MPICH (Message Passing Interface Chameleon). More precisely, mpiJava is a Java interface which binds Java Native Interface (JNI) C stubs to the underlying native C interface of MPI – that is, mpiJava uses Java wrappers to invoke the C MPI calls through the JNI [1,11]. mpiJava runs parallel Java programs on top of MPICH through the Java Virtual Machine (JVM). The architecture stack of the environment when running a parallel Java program with mpiJava is shown in Figure 3.1.

3.2.2.1 Class Hierarchy of mpiJava
The existing MPI standard is explicitly object-based. The C and Fortran bindings rely on "opaque objects" that can be manipulated only by acquiring object handles from constructor functions and passing the handles to suitable functions in the library. The C++ bindings specified in the MPI 2 standard collect these objects into suitable class hierarchies and define most of the library functions as class member functions. The mpiJava API follows this model, lifting the structure of its class hierarchy directly from the C++ binding.

[Figure 3.1: Execution stack of a parallel Java program using mpiJava. On each node (node 0, node 1, node 2, ...) the stack is: Java application – Java Virtual Machine (JVM) – mpiJava – MPICH – OS (Linux) – protocol (TCP/UDP) – Ethernet card.]


The class MPI has only static members. It acts as a module containing global services, such as initialization of MPI, and many global constants including the default communicator COMM_WORLD [6]. The most important class in the package is the communicator class Comm. All communication functions in mpiJava are members of Comm or its subclasses. As usual in MPI, a communicator stands for a "collective object" logically shared by a group of processes. The processes communicate, typically, by addressing messages to their peers through the common communicator. The principal classes of mpiJava are shown in Figure 3.2.

[Figure 3.2: Class hierarchy of mpiJava. The MPI package contains the classes MPI, Group, Datatype, Status and Request (with subclass Prequest), and the communicator class Comm with subclasses Intracomm (specialized further into Cartcomm and Graphcomm) and Intercomm.]

Another important class of mpiJava is the Datatype class. It describes the type of the elements in the message buffers passed to send, receive and all the other communication functions. Various datatypes are predefined in the package [1,2]; these mainly correspond to the primitive types of Java, and the mapping used across the interface between mpiJava and MPI via the Java Native Interface (JNI) is shown in Table 3.1.


3.2.2.2 API of mpiJava
There are some basic communication APIs of mpiJava which are used to develop parallel programs on the Java platform. The functions MPI.Init(args) and MPI.Finalize() must be used at the start and at the end of the program: Init() initializes MPI for the current environment, and Finalize() finalizes MPI and frees up the memory by destroying all the communicator objects that were created for communication purposes.

Table 3.1: Basic datatypes of mpiJava

MPI datatype    Java datatype
MPI.BYTE        byte
MPI.CHAR        char
MPI.SHORT       short
MPI.BOOLEAN     boolean
MPI.INT         int
MPI.LONG        long
MPI.FLOAT       float
MPI.DOUBLE      double
MPI.PACKED      –

In basic message passing, the processes coordinate their activities by explicitly sending and receiving messages. The standard send and receive operations of MPI are members of Comm [8]:

• public void Send(Object buf, int offset, int count, Datatype datatype, int dest, int tag)
• public Status Recv(Object buf, int offset, int count, Datatype datatype, int source, int tag)

where buf is an array of primitive type or class type (if the elements of buf are objects, they must be serializable), offset is the starting point of the message within the buffer, datatype describes the type of the elements, count is the number of elements to send or receive, dest and source are the ranks of the destination and source processes, and tag is used to identify the message.


One issue that needs to be addressed is that the commands executed by process 0 (the send operation) are different from those executed by process 1 (the receive operation). However, this does not mean that the programs need to be different: by conditional branching according to the rank of the process, the program obtains the SPMD paradigm [1,3]. For example:

int my_process_rank = MPI.COMM_WORLD.Rank();
if (my_process_rank == 0)
    MPI.COMM_WORLD.Send(buf, 0, count, datatype, 1, tag);
else if (my_process_rank == 1)
    MPI.COMM_WORLD.Recv(buf, 0, count, datatype, 0, tag);

3.2.2.3 Installation and Configuration of mpiJava
1. Install your preferred Java programming environment. In our research we used j2sdk1.4.1-03. After the Java JDK is installed successfully, add the Java JDK bin directory to the path setting in .bash_profile, so that the mpiJava/configure script can find the java, javac and javah commands.

2. Install your preferred MPI software and add the MPI bin directory to your path setting. Test the MPI installation before attempting to install mpiJava. We used MPICH-1.2.5 as our MPI software.

3. Now install the mpiJava interface.

Step 1. Unpack the software, e.g.
prompt $ gunzip -c mpiJava-x.x.x.tar.gz | tar -xvf -
A subdirectory mpiJava/ is created.

Step 2. Go to the mpiJava/ directory and configure the software for the platform:
prompt $ ./configure

Step 3. Build (compile) the software:
prompt $ make
After successful compilation, the makefile puts the generated class files in the directory lib/classes/mpi/ and places a native dynamic library in the directory lib/. Now:
- add the mpiJava src/scripts directory to your PATH environment variable;
- add the mpiJava lib/classes directory to your CLASSPATH environment variable;
- add the mpiJava lib directory to your LD_LIBRARY_PATH (Linux, Solaris, etc.) or LIBPATH (AIX) environment variable.
(A sample .bash_profile addition is sketched after these steps.)

Step 4. Test the installation:
prompt $ make check
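As an illustration only – the actual location depends on where mpiJava was unpacked, and /home/beowulf/mpiJava is an assumed path, not one given in this report – the three environment variables of Step 3 could be set by appending lines like these to ~/.bash_profile:

export PATH=$PATH:/home/beowulf/mpiJava/src/scripts
export CLASSPATH=$CLASSPATH:/home/beowulf/mpiJava/lib/classes
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/beowulf/mpiJava/lib

After re-reading the profile (source ~/.bash_profile), the prunjava script and the mpi classes should be found automatically.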

During the installation of mpiJava we encountered many problems: we found some errors in the configuration scripts of mpiJava and ran into some puzzling situations. These are:

§ After configuring mpiJava, go to the mpiJava/src/C directory, open the makefile in any editor and correct the entry for mpicc. By default it should be /root/mpich-1.2.5/bin/mpicc.

§ mpiJava runs parallel Java byte code through the prunjava command, a Java wrapper over the MPICH bindings. Using prunjava we faced the difficulty that our programs sometimes crashed, reporting that "filename.jig" was not found. We discovered that the *.jig file is nothing but a file which includes all the necessary path environment variables, library paths and the commands needed to execute the program in parallel. The error occurs because, when MPI tries to execute a program on a child node through a remote command (using rsh, rlogin, etc.), it cannot find the appropriate path settings; as a result it cannot generate the *.jig file and the program crashes. To solve this problem we developed our own Java wrapper over MPICH and named it "ewurun". The syntax to run a Java program is

prompt $ ewurun 4 hello

where 4 is the number of processes to be involved in the execution. Below is the beginning of the sample Java wrapper over MPI, ewurun, which we used to execute parallel Java programs. The wrapper may differ according to the application types and requirements.

PNUMBER=$1
CLASSNAME=$2
cat > $CLASSNAME.kab

Suggest Documents