An Extension to MPI for Distributed Computing on MPPs

Thomas Beisel, Edgar Gabriel and Michael Resch
High Performance Computing Center Stuttgart, Parallel Computing Department, D-70550 Stuttgart, Germany

Abstract. We present a tool that allows an MPI application to run on several MPPs without any changes to the application code. PACX (PArallel Computer eXtension) provides the user with a distributed MPI environment offering most of the important functionality of standard MPI. It is therefore well suited for metacomputing. We show how two MPPs are configured by PACX into a single virtual machine, and we present the underlying communication management, which uses the highly optimized vendor MPI for internal communication and standard protocols for external communication. The performance of PACX for several basic message-passing calls is described, covering latency, bandwidth, synchronization and global communication.

1 Introduction

A difficulty in simulating very large physical systems is that even massively parallel processor systems (MPPs) with a large number of nodes may not have enough memory and/or performance. There are many examples of such grand-challenge problems: CFD with chemical reactions, or crash simulations of automobiles with persons inside. To tackle such problems, metacomputing may be an option, and by now a range of metacomputing projects have evolved. The scenario driving the development of PACX [1, 2] was a metacomputing project that aims to combine the compute resources of the Pittsburgh Supercomputing Center (PSC), Sandia National Laboratory (SNL) and the High Performance Computing Center Stuttgart (HLRS) for one single application. The application used is a flow solver (URANUS) developed at the University of Stuttgart and adapted for parallel computation by HLRS. MPI [3] has become a standard for numerical applications and has partially replaced PVM on MPPs. The aim was therefore to use MPI also in metacomputing. However, no implementation of MPI is available today that allows several MPPs to be integrated for one single application. Some tools that aim to support metacomputing try to bridge that gap [4, 5, 6] but require the user to change the application. MPICH [7], which aims to support heterogeneous clusters, does not currently support the Cray T3E. The goal of PACX was to leave the code unchanged and to supply the user with a standard MPI environment, so that applications could run without modification on more than one MPP.

2 The PACX Model

The major design goals of PACX are:

- Provide the user with a single virtual machine. No changes to the code are necessary.
- Use highly tuned MPI for internal communication on each MPP.
- Use fast communication for external communication.

2.1 Basic Concept of PACX

To reach these goals, PACX was developed at HLRS as an extension of the message-passing library MPI. Initially, PACX was developed to connect an MPP and a fileserver to allow fast data transfer from inside the application. This first version of PACX was based on raw HiPPI to exploit the underlying network [1].

[Fig. 1. Schematic view of PACX: on each MPP the application calls into PACX, which sits on top of the local MPI; the two machines are coupled via TCP.]

The next step was to extend PACX to connect two MPPs. Driven by the idea of running one application on two MPPs, the goal was no longer just to send or receive data from a fileserver but to fully support MPI. Thus, one application could exploit the full potential of two fast machines. The design decision was to implement PACX as a library sitting between the application and the local MPI implementation. The application calls MPI functions, which are intercepted and diverted into the PACX library. The PACX library determines whether there is a need for contacting the second MPP. If not, the library passes the unchanged MPI call to the local system, where it is handled internally. This guarantees that the highly tuned, vendor-specific MPI implementation is used for internal communication. Communication via the network is done with TCP sockets.
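To illustrate the idea of a library layer that intercepts MPI calls, the following minimal sketch shows how a send wrapper might decide between the local vendor MPI and forwarding to a communication daemon. This is not the actual PACX implementation; the names PACX_Send, rank_is_local and forward_to_daemon, as well as the bookkeeping variables, are assumptions made for this illustration.

```c
#include <mpi.h>

/* Hypothetical bookkeeping -- in a real system this would be set up
 * from a configuration file when the virtual machine is started.     */
static int pacx_local_offset = 0;   /* first global rank hosted on this MPP */
static int pacx_local_size   = 0;   /* number of application ranks here     */

static int rank_is_local(int g) {
    return g >= pacx_local_offset && g < pacx_local_offset + pacx_local_size;
}

/* Placeholder: a real implementation would hand the message to the
 * outgoing communication daemon.                                      */
static int forward_to_daemon(void *buf, int count, MPI_Datatype type,
                             int dest, int tag) {
    (void)buf; (void)count; (void)type; (void)dest; (void)tag;
    return MPI_SUCCESS;
}

/* Sketch of an intercepted send: use the vendor MPI locally, otherwise
 * ship the message to the outgoing communication node.                */
int PACX_Send(void *buf, int count, MPI_Datatype type,
              int dest, int tag, MPI_Comm comm)
{
    if (comm == MPI_COMM_WORLD && !rank_is_local(dest))
        return forward_to_daemon(buf, count, type, dest, tag);

    /* Destination is on the same MPP: pass the call through unchanged,
       translating the global rank to the local numbering.             */
    return MPI_Send(buf, count, type, dest - pacx_local_offset, tag, comm);
}
```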

PACX provides the user with a global MPI_COMM_WORLD, as can be seen in figure 2. For the communication between the two machines, each side has to provide two additional nodes, one for each communication direction. On each of these nodes a daemon is running that is responsible for the communication between the machines involved. This includes communication with the local nodes, compression and decompression of data for remote communication, and communication with the daemons of other machines. Using two extra communication nodes has turned out to make traffic handling easier.

[Fig. 2. Basic scheme of PACX on two MPPs, showing the global numbering of the application nodes across both machines and the local numbering within each machine.]

Like MPI, PACX has language bindings for FORTRAN 77 and ANSI C. While MPI consists of more than 120 function calls, PACX was restricted to a smaller number: it implements those functions that are used in the applications that drive the development of PACX. At this time PACX supports the following calls:

- Initialization and control of the environment
- Standard point-to-point communication
- Collective operations: MPI_Barrier, MPI_Bcast, MPI_Reduce and MPI_Allreduce
- Standard nonblocking communication

In addition to these calls, communicator constructs have been implemented and are currently in the testing phase. These will allow normal usage of communicator constructs across the machines without restrictions.
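As an illustration of this subset, the short program below uses only calls from the list above plus the usual MPI setup and teardown. Under the stated design goal, such a code should run unchanged whether it executes under a native MPI or under PACX across two machines; the program itself is an example written for this text, not taken from the paper.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* global rank across both MPPs */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The root distributes a parameter to all nodes of the virtual machine. */
    double parameter = (rank == 0) ? 42.0 : 0.0;
    MPI_Bcast(&parameter, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each node contributes a value; the sum is gathered on node 0. */
    local = parameter * rank;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum over %d nodes: %f\n", size, global);

    MPI_Finalize();
    return 0;
}
```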

2.2 Point-to-point Communication

The handling of a point-to-point communication is shown in figure 3. Since PACX provides an MPI_COMM_WORLD across the two machines involved, there has to be a mapping between local and global process numbering. The numbers in the figure indicate the global node numbering. If global node 6 wants to send a message to global node 2, the following steps are taken:

- Node 6 calls MPI_Send, specifying node 2 in communicator MPI_COMM_WORLD as the destination. This call is processed by the PACX library.

[Fig. 3. Point-to-point communication between two MPPs using PACX: the sending node hands data and status to its outgoing daemon, command and data packages cross the network, and a confirmation is returned.]

- PACX finds that global node 2 is on the other machine, so it has to hand the message over to the PACX daemon. For this, the message is split into a command package and a data package. The command package contains all envelope information of the original MPI call plus additional information for PACX (a sketch of such a package is given after this list).
- The data package is compressed to reduce network traffic and sent over to the other system's incoming communication node. There the data is uncompressed and the command package is interpreted.
- Using this information, a normal MPI_Send to the destination node is issued. The return value of this call is handed back to the first system, and from there to the original sender.
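The paper does not spell out the layout of the command package, so the struct below is purely illustrative; the field names are assumptions chosen to match the envelope information mentioned above (source, destination, tag, communicator and message size).

```c
/* Hypothetical layout of a PACX command package -- illustrative only.
 * It carries the MPI envelope plus bookkeeping needed by the daemons. */
typedef struct {
    int src_global;    /* global rank of the sender                    */
    int dest_global;   /* global rank of the receiver                  */
    int tag;           /* MPI message tag                              */
    int comm_id;       /* identifier of the communicator on both sides */
    int datatype_id;   /* encoded datatype of the payload              */
    int count;         /* number of elements in the data package       */
    int compressed;    /* nonzero if the data package is compressed    */
} pacx_command_t;
```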

2.3 Global Communication

For global communication, things become even more complicated. Figure 4 shows how a broadcast to MPI_COMM_WORLD from root node 6 is handled correctly on two machines using PACX. The following steps are taken:

- Node 6 first sends a command package describing the broadcast, together with the data to be broadcast, to the outgoing communication node.
- It then does a broadcast in a communicator PACX_Comm_1, provided by PACX especially to include all local application nodes.
- The outgoing communication node meanwhile hands the information over to the second MPP's incoming communication node.
- This node now sets up a normal MPI_Bcast from the command package and the data package and distributes it in a second communicator PACX_Comm_2, provided by PACX to include the incoming node and all local application nodes.

[Fig. 4. Broadcast operation on two MPPs: the root broadcasts in PACX_Comm_1 on its own machine while the daemons forward the data, which is then broadcast in PACX_Comm_2 on the second machine.]

This concept for global communication allows the communication to the second MPP to be overlapped with internal communication. Furthermore, the local broadcast communication on the two machines is done asynchronously.
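The following sketch illustrates this two-level scheme. It is a simplified illustration of the steps listed above, not the actual PACX code; the communicator handles and helper names (pacx_comm_1, pacx_comm_2, send_to_outgoing_daemon, recv_command_and_data) are assumptions.

```c
#include <mpi.h>

/* Hypothetical handles and helpers -- illustrative only, not the PACX API. */
extern MPI_Comm pacx_comm_1;   /* root's machine: all local application nodes       */
extern MPI_Comm pacx_comm_2;   /* remote machine: incoming daemon + local app nodes */
extern int send_to_outgoing_daemon(void *buf, int count, MPI_Datatype type);
extern int recv_command_and_data(void *buf, int count, MPI_Datatype type);

/* On the root's machine: the root ships the data to the outgoing daemon,
 * then all local nodes take part in a vendor MPI_Bcast in PACX_Comm_1.    */
int bcast_on_root_machine(void *buf, int count, MPI_Datatype type,
                          int local_root, int my_local_rank)
{
    if (my_local_rank == local_root)
        send_to_outgoing_daemon(buf, count, type);
    return MPI_Bcast(buf, count, type, local_root, pacx_comm_1);
}

/* On the second machine: the incoming daemon reconstructs the broadcast
 * from the command and data packages and acts as root of PACX_Comm_2.     */
int bcast_on_remote_machine_daemon(void *buf, int count, MPI_Datatype type)
{
    recv_command_and_data(buf, count, type);
    return MPI_Bcast(buf, count, type, /*root=*/0, pacx_comm_2);
}
```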

3 Results

So far, PACX has been installed and tested on an Intel Paragon and the Cray T3E. The most recent version supports the usage of two MPPs but does not provide any data conversion. It has been used successfully to connect T3Es in Europe and the US. Future developments will include support for heterogeneity and for the usage of more than two machines; this will include handling the data conversion problem. With respect to performance, the following questions are of importance:

- What is the overhead that PACX imposes on communication with respect to protocols, buffer copying, compression, and so on?
- What is the performance one can expect from PACX on a production network like the vBNS?

To answer these questions, one must have access to the resources required for extensive testing. Experience has shown that it is rather easy to do testing on one machine, and even on two machines that are on the vBNS [8]. But working across the Atlantic Ocean, as planned, is still a challenge. We have not yet had reasonable bandwidth between Stuttgart and the US, so those results are not available. Once a network connection is available, we hope to see results similar to those on the vBNS. The following are results for latency, bandwidth, and the time it takes to perform global communication with MPI on a single machine and using PACX between the T3Es at PSC and at the San Diego Supercomputing Center. Furthermore, we provide first results for the URANUS code, though only for small test cases.

3.1 Latency and Bandwidth

Latency is the time it takes for a zero-byte message to travel from one node to another and back, divided by two. The latency overhead introduced by PACX on top of standard MPI send calls for internal communication is about 3 microseconds, which is rather small. The additional latency incurred by accessing the TCP protocol stack and copying data corresponds to the latencies seen on workstations. One has to take into account that PACX actually sends two messages, so that the real latency is only about 2 milliseconds.

Mode          Latency (µs)   Bandwidth (MB/s)
MPI                16             307
MPI+PACX           19             297
PACX local       4000               7
PACX remote    40,000               0.15

Table 1. Latency and bandwidth for different levels of connectivity.

Bandwidth as discussed here is the aggregated bandwidth for very large messages. The overhead introduced by PACX before actually issuing the MPI call for internal communication reduces bandwidth only slightly, by about 3%. Detailed analysis has revealed possibilities for improvement. Bandwidth and latency for remote communication will be reinvestigated when a high-performance network connection is available.
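For reference, latency as defined above is typically measured with a ping-pong kernel along the following lines; this is a generic sketch written for this text, not the benchmark actually used by the authors.

```c
#include <mpi.h>
#include <stdio.h>

#define REPS 1000

int main(int argc, char **argv)
{
    int rank;
    char dummy = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {       /* send a zero-byte message and wait for the echo */
            MPI_Send(&dummy, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&dummy, 0, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(&dummy, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&dummy, 0, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* round-trip time divided by two, averaged over REPS runs */
        printf("latency: %g us\n", (t1 - t0) / (2.0 * REPS) * 1e6);

    MPI_Finalize();
    return 0;
}
```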

3.2 Synchronization and Global Communication

In metacomputing it is time-consuming to synchronize an application. The following results therefore have to be seen as the penalty a code incurs if it makes use of synchronizing message-passing calls.

#Nodes    MPI   PACX local   PACX remote
   4      2.5      13300        157000
   8      3.5      13300        158000
  16      2.7      12900        156000
  32      3.3      13800        153000
  64      3.4      13400        172000

Table 2. Barrier synchronization results (in µs) for different levels of connectivity.

Barrier synchronization is done rather fast on the T3E itself. PACX synchronizes both machines separately and then exchanges a synchronization message. With that message, both machines can be sure that the other one has synchronized locally. Using a second barrier locally, they make sure that both machines run nearly synchronously.
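A minimal sketch of this two-level barrier is given below. The inter-machine exchange is reduced to a single token message between the two daemons; the communicator and helper names are assumptions made for illustration, not the PACX interface.

```c
#include <mpi.h>

/* Hypothetical handles and helpers -- illustrative only. */
extern MPI_Comm local_comm;              /* application nodes + daemon on this MPP */
extern int      i_am_daemon;             /* nonzero on the communication node      */
extern int send_sync_to_peer(void);      /* notify the other machine's daemon      */
extern int wait_sync_from_peer(void);    /* wait for the other machine's daemon    */

/* Two-level barrier: synchronize locally, exchange one message between
 * the two machines, then synchronize locally again.                     */
int PACX_Barrier(void)
{
    MPI_Barrier(local_comm);             /* everyone on this MPP has arrived       */

    if (i_am_daemon) {                   /* daemons tell each other "we are ready" */
        send_sync_to_peer();
        wait_sync_from_peer();
    }

    return MPI_Barrier(local_comm);      /* release only after the peer is ready   */
}
```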

The algorithm for the broadcast operation as described above requires the transmission of one message, which imposes the corresponding latency on that operation.

#Nodes    MPI   PACX local   PACX remote
   4       26      14800        159500
   8       38      14500        156600
  16       49      14700        170300
  32       61      21200        158700
  64       71      12800        158800

Table 3. Broadcast results (in µs) for different levels of connectivity.

The results indicate that for global communication and synchronization, the exchange of messages between the machines is the critical point, whereas the time for global communication within one system is of minor importance. Due to the constant overhead for exchanging a message, the timings are independent of the number of nodes involved and even of the type of communication performed.

URANUS

The application tested is not yet adapted for metacomputing: it does no latency hiding and uses several collective operations. Due to the I/O bottleneck, the results cannot be expected to be spectacular. In the following we give only the overall time it takes to solve a small example (110K cells).

Medium-sized Test Case
#Nodes    MPI   PACX local   PACX remote
  16       94       291           513
  32       59       293           527
  64       46       303            --

Table 4. Timings (in sec) for 10 URANUS iterations using MPI and PACX for a small test case.

The timings given here include preprocessing, processing and postprocessing. Since it was difficult to do testing on more than one machine, we calculated only 10 iterations; normally the code needs between 200 and 10000 iterations to converge. But these first test results certainly point out the metacomputing challenges we face. It is obvious that, by using TCP, PACX imposes such a high overhead on the communication that for the non-optimized version of PACX, and without having adapted the URANUS code, we see a slowdown for all problem sizes even on a single machine. However, the message of these first results is that the timings remain nearly constant, which implies that it is latency that slows down the calculation. If we then go to two machines, we see an additional slowdown and again nearly constant timings. Again, latency seems to dominate the results.

4 Summary

PACX is a flexible and easy-to-handle library that allows two MPPs to be configured into one single virtual machine. First tests with a flow solver application have been done [8]. The concept as described is currently being extended to integrate more than two machines. Furthermore, PACX is being ported to several MPP platforms to allow distributed computing between heterogeneous platforms. The results indicate that PACX at the moment imposes a rather high latency on the communication and limits the bandwidth. We are currently working on an implementation that will reduce the latency by at least a factor of two for local usage. However, one has to keep in mind that the latency between two remote machines will be significantly influenced by the underlying network.

References

1. Beisel, T.: Ein effizientes Message-Passing-Interface (MPI) für HiPPI. Studienarbeit, RUS, 1996.
2. Gabriel, E.: Erweiterung einer MPI-Umgebung zur Interoperabilität verteilter MPP-Systeme. RUS-37, 1997.
3. Message Passing Interface Forum: MPI: A Message-Passing Interface Standard. University of Tennessee, Knoxville, Tennessee, USA, 1995.
4. Fagg, G.E., Dongarra, J.J.: PVMPI: An Integration of the PVM and MPI Systems. Department of Computer Science Technical Report CS-96-328, Knoxville, May 1996.
5. Brune, M., Gehring, J., Reinefeld, A.: A Lightweight Communication Interface for Parallel Programming Environments. HPCN'97, Springer, 1997.
6. Cheng, F.-C., Vaughan, P., Reese, D., Skjellum, A.: The Unify System. Technical Report, NSF Engineering Research Center, Mississippi State University, July 1994.
7. Gropp, W., Lusk, E., Doss, N., Skjellum, A.: A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22, 1996.
8. Resch, M.M., Beisel, T., Boenisch, T., Loftis, B., Reddy, R.: Performance Issues of Intercontinental Computing. Cray User Group Conference, May 1997.
