High Performance MPI: Extending the Message Passing Interface for Higher Performance and Higher Predictability

Anthony Skjellum
Mississippi State University
Department of Computer Science & NSF Engineering Research Center for Computational Field Simulation
Mississippi State, MS 39762
[email protected]

Abstract

The Message Passing Interface (MPI-1 and MPI-2) provides portability, and a degree of high performance, in message passing. A careful review of the design shows that many opportunities for performance have been missed, and that these omissions can be corrected. Especially relevant are mechanisms for taking advantage of program characteristics (common to many regular data-parallel and coarse-grain dataflow uses of MPI) that allow for better runtime optimizability. This paper describes classes of deficiencies and their solutions, through upward-compatible strategies that allow "Fast MPI" implementations to exceed the performance of regular MPI implementations on the same platform. A simple approach to retaining portability is also offered. The most important contribution is the utilization of message-passing "temporal locality" in the specification of middleware behavior and services between the application and the middleware.

Keywords: parallel computing, MPI, persistent communication, performance extensions

This work was supported in part by DARPA through the U.S. Air Force Research Laboratory under contract F30602-95-1-0036, and with additional support from the National Science Foundation, Early Career Program, ASC-95-01917.

1 Introduction

Message passing has long been a method of programming multicomputers and clusters of computers, and has also found surprisingly wide acceptance among symmetric multiprocessor (SMP) programmers. The Message Passing Interface [1] Application Programmer Interface (API) has become the ubiquitous API for portable, parallel message-passing programs, and is spreading into non-technical applications as well. MPI-1 provides point-to-point and collective communication operations, offered within the framework of safe group- and context-oriented communication (communicators), all in a static world of processes. High performance, expressive data gather/scatter, wide portability, "communication safety," and widespread implementation were key achieved goals of MPI-1 [1, 2]. In addition to the primary standardization in 1993-94, additional work to extend MPI's specification was accomplished in 1995-97, called MPI-2 [3, 4]. MPI-2 specified vast new categories of features (not currently in wide implementation or use), covering extensions to collective operations, parallel disk I/O, one-sided distributed-shared-memory communication, C++ bindings, and dynamic process management. Rich functionality was the key outcome of MPI-2's standardization effort.

Neither MPI-1 nor MPI-2 provides the best library-level way to present canonical message-passing primitives for high performance in a number of situations to be described in this paper. This so-called "price of portability" is too high for certain users and application spaces (Footnote 1). Because some of the opportunities for enhanced performance were clear by the onset of the MPI-2 standardization, several proposals were made to enhance performance for specific situations, notably for programs that exhibit temporal locality, thereby supporting the use of "persistent communication" more widely than in MPI-1. Unfortunately, such functionality was not accepted into MPI-2 (Footnote 2). Despite this standardization setback two years ago, this functionality remains important for real applications, and is becoming even more important over time as architectures increase their native performance while adding features that reduce the minimum software overhead (e.g., [7]). This work is "orthogonal" to intensive implementation efforts that optimize the MPI specification as it stands. Where the specification itself can be improved, far greater optimization at runtime becomes tractable, with far less implementation effort. From an implementation perspective, it is far more cost-effective to augment the API and allow applications to declare and exploit their communication structure than it is to try to infer such properties after the fact from the late-binding communication currently standardized in MPI. This paper covers performance-enhancing primitives for MPI-1 (omitting similar techniques for the MPI-2 extensions to MPI-1 for brevity), provides a strategy for offering such extensions in high-performance implementations of MPI, and then describes a simple scheme for retaining full portability to MPI environments where the extensions are not native.

Footnote 1: Note that the "price of portability" of MPI programs is usually significantly less than that of programs formulated with other ad hoc message-passing systems in use at the time of MPI-1's standardization, notably PVM 3.x [5]. There is even more performance available for classes of applications, and this must be accessible to widen MPI's acceptance further.
Footnote 2: This functionality is accepted as part of MPI/RT, a real-time variant of MPI, described in [6].

The remainder of this paper is organized as follows: the second section describes notation, while the third describes the main issues associated with MPI that we seek to enhance. The fourth section details application properties and thereby defines requirements for Fast MPI; the fifth section subsequently provides a sketch of how this extended API actually looks. We overview future work in the sixth section, prior to concluding.

2 Notation

We differentiate message-passing API calls according to their early- or late-binding nature:

- By "early-binding," we refer to API that has a single setup phase, with associated error checking, static resource reservation, and planning, followed by a reuse phase, in which the setup is exploited to deliver a faster and/or more predictable Nth execution of a particular operation. The cost of the setup is amortized over N uses.
- By "late-binding," we refer to API that has no advance reservation, but which presents operations to the library on the fly, requiring the library to do comparative error checking of arguments, plan all dynamically needed resources and scheduling at that time, and tear down any temporaries after the completion of the operation.

Furthermore, we use the name "persistent communication" to refer to the realization of early-binding API in existing MPI, and in the extensions we propose. "Persistence" equates to "early-binding."
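As a concrete illustration of the two binding styles already present in MPI-1, the following minimal sketch has rank 0 send N messages with the late-binding MPI_Isend and then N more through a persistent request; the buffer contents and message count are arbitrary.

#include <mpi.h>

/* Minimal sketch contrasting a late-binding send (MPI_Isend) with an
 * early-binding persistent send (MPI_Send_init + MPI_Start). */
#define N 100

int main(int argc, char **argv)
{
    int rank, i, buf[256] = {0};
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Late binding: arguments are re-examined on every call. */
        for (i = 0; i < N; i++) {
            MPI_Isend(buf, 256, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, &status);
        }
        /* Early binding: one setup, then N inexpensive reuses. */
        MPI_Send_init(buf, 256, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        for (i = 0; i < N; i++) {
            MPI_Start(&req);
            MPI_Wait(&req, &status);
        }
        MPI_Request_free(&req);
    } else if (rank == 1) {
        for (i = 0; i < 2 * N; i++)
            MPI_Recv(buf, 256, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}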

3 Analysis of MPI API

This part of the paper analyzes MPI from the perspective of early- vs. late-binding.

3.1 Parallel Programming Model

MPI addresses several types of parallel programming models, and their arbitrary compositions:

- data-parallel (single thread of control, multiple processes, clique-oriented),
- task-parallel (arbitrary communication and computation), and
- coarse-grain dataflow (bi-partite communication).

The communication models derive from MPI_COMM_WORLD in MPI-1, or multiples thereof in MPI-2. Communicators and their operations provide early-binding operations to define process-group scopes and safe communication space.
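As a minimal illustration of this early-binding step (the color choice, a 4-wide virtual row grid, is arbitrary), a communicator is set up once and then scopes all later traffic on it:

#include <mpi.h>

/* Sketch: communicator construction is MPI's one pervasive early-binding
 * step; the group scope and safe context are fixed once, ahead of traffic. */
int main(int argc, char **argv)
{
    int rank;
    MPI_Comm rowcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Processes with the same color (here: the row of a 4-wide grid)
     * end up in the same communicator, created collectively, up front. */
    MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, &rowcomm);

    /* ... all later point-to-point and collective calls on rowcomm are
     * scoped to this group and context ... */

    MPI_Comm_free(&rowcomm);
    MPI_Finalize();
    return 0;
}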

3.2 Point-to-point Communication

MPI provides both late- and early-binding modes for point-to-point communication (e.g., MPI_ISEND vs. MPI_SEND_INIT, which are both non-blocking, but the latter is early-binding). Early-binding modes are definable on either end of a communication; however, MPI mandates that any send type must match any receive type, so there is never a guarantee that both ends of a connection operate persistently. There is also no way for a program to declare such a "channel" of persistence. Communication paths between all pairs are supported a priori, when a communicator is formed. Also, there is no support for blocking persistent communication. MPI matches communication between sender and receiver based on (tag, source, communicator), where the communicator provides the guarantee of "communication safety" (e.g., to scope messages lexically in a data-parallel setting). No interference of order between communicators occurs, with pairwise ordering possible between unique sender-receiver pairs within a communicator (overridable with tags). Of the three arguments used for matching, source is sufficient for most regular, data-parallel programs, while tag is used for irregular or adaptive parallel programs where the number, type, and order of messages is unanticipated by the receiver. MPI implementations cannot anticipate the tag distribution of messages on a communicator, although they can plan in advance for the number of active communicators and sources, because this part of the MPI interface is early-binding.

3.3 Collective Communication

MPI provides a variety of blocking collective communication operations, defined over the groups of communicators (set up by applications), divided into: synchronization (i.e., MPI_BARRIER), data movement (e.g., MPI_GATHER), and data movement plus reduction (e.g., MPI_ALLREDUCE). These operations always work over the communicator predefined by the application, and have safety apart from any pending non-blocking point-to-point communication. They also provide two additional safeties: they may be posed in any order, and they may be posed repetitively on the same communicator, with the guarantee that there are no mismatches or overruns.

MPI runs into trouble with collective communication in several ways, because of late binding. This can be illustrated in several of the calls; we do so exhaustively in [8]. Here we illustrate with two calls, MPI_ALLTOALLV (general data reorganization) and MPI_REDUCE (all-to-one with associative or associative/commutative reduction). The calls have the following signatures in C [1]:

MPI_Alltoallv(void *sendbuf, int *sendcounts, int *sdispls,
              MPI_Datatype sendtype, void *recvbuf, int *recvcounts,
              int *rdispls, MPI_Datatype recvtype, MPI_Comm comm);

MPI_Reduce(void *sendbuf, void *recvbuf, int count,
           MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);

First we consider MPI_ALLTOALLV (Footnote 3). This call sends data from all processes to all other processes in the group of the communicator comm. It utilizes a complex signature that varies across the set of processes. This signature must be marshalled into N sends and N receives, in general, for a group of size N. All of the transfers must be arranged, scheduled, and completed based on the instantaneous invocation of the call and strictly local information about the operation. For large transfers, synchronous collection of the arguments could be done, and a search performed over previously visited sets of arguments, but this scales poorly in N and would be difficult to implement (Footnote 4). For this reason, implementations of this operation tend to be the naive composition of N sends and receives, perhaps posted in groups to avoid buffering penalties, and with efforts to have receives post before sends. Realizations that exploit sparsity of the send/receive structure are not in evident use in practice.

Footnote 3: As is noted in the MPI-2 standard [4], this call itself is not general enough as a primitive, because it does not have a vector of send and receive datatypes. However, such a generalization does not solve the late-binding problem, but only exacerbates it.
Footnote 4: Both MPI-1 and MPI-2 considered and discarded tags as required arguments for collective operations, arguing their apparent lack of value. Not only would such tags support disambiguation of non-blocking, late-binding collectives, they would also allow MPI implementations to match up late-binding collectives for the purpose of one-time optimization plus on-the-fly reuse with acceptable cost.

Second, we consider MPI_REDUCE. There are many algorithms for accomplishing such reductions, including trees, all-to-one stars, and several modified formats that trade off latency and bandwidth. The literature is full of such examples, suggesting that poly-algorithms are opportune for high-quality implementations. Unfortunately, MPI implementations are confronted with the signatures of the whole group without the ability to know whether the call has been seen previously (as with the previously discussed call). Furthermore, the implementation must make an instantaneous decision, based on knowledge of (count, datatype, op) and the size of the group, as to which algorithm to use. This decision may be expensive, and may require information that demands runtime experimentation (e.g., estimating a useful upper bound on the speed of a user-defined reduction operation op). Again, a synchronization over all the processes in the group could be done, with a search over the aggregated arguments to arrive at a previously defined optimized algorithm. However, this is viewed as a prohibitive approach for practical implementations, because of its time and space complexity (the gather operation exceeds the bandwidth requirements of the reduction itself in a non-trivial set of cases). Consequently, typical implementations of MPI perform no more than optimization according to the size of the group, and many simply use log-trees or binary trees.
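The repetition that a late-binding collective cannot exploit looks like the following in application code; this is a minimal sketch with arbitrary values (64 doubles, 1000 steps, MPI_SUM), not taken from any particular application.

#include <mpi.h>

/* Sketch: every MPI_Reduce call below carries identical arguments, yet a
 * late-binding implementation must re-check the arguments and re-select
 * its algorithm on each of the NSTEPS invocations. */
#define NSTEPS 1000

int main(int argc, char **argv)
{
    double local[64], global[64];
    int i, step, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 64; i++)
        local[i] = (double)rank;

    for (step = 0; step < NSTEPS; step++) {
        /* (count, datatype, op, root, comm) never change across iterations,
         * but MPI-1 gives the implementation no way to know that. */
        MPI_Reduce(local, global, 64, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}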

3.4 Collective Abstraction Level

Besides the late-binding nature of collective operations in MPI, the low level of abstraction associated with all-to-all type communication makes it difficult for implementations to infer when special reorganization of datacubes is underway. This is particularly important to applications that transpose matrices or tensors of data. The DARPA Data Reorganization Standard effort ("Data Reorg" [9]) is standardizing early-binding ways to describe such operations. The extant MPI operations cannot capture either the level of abstraction or the early binding needed to support such primitives efficiently. "Data Reorg" builds on vast practice in the embedded and real-time application spaces, analogous to that in the HPC application space, where transposition and reshaping are required. The mismatch between MPI and what applications need to achieve high performance is evidently most marked in these operations.

3.5 Data Descriptors

A further type of late binding occurs because MPI describes gathers and scatters of data in point-to-point (and appropriate collective) operations with the triple (address, count, datatype), which forms a single descriptor when the arguments are united in an API call.

Datatypes are hierarchically defined, and implementations may or may not optimize the representations of such descriptors even after they are "committed." The commit step for datatypes is intended to be early binding. However, because both the address and the count are absent, the datatype is incomplete until used. This has the negative effect of denying certain optimizations, except on datatypes that are "shallow," that is, not formally deeply recursive in their definitions (e.g., strided vector types of floats). Certain of these optimizations can be recovered by caching the (address, count) pair on datatypes internally, and performing heuristic lookups in order to decide whether optimized marshalling or unmarshalling has previously been orchestrated. This approach would potentially impose a lookup penalty on all datatypes, and at least imply additional conditional statements within the critical path of all sends/receives. It is not evidently exploited in any extant implementation. In general, implementations use datatypes as late-binding entities and do not exploit all the low-level DMA or other marshalling facilities they could readily use if the descriptor triple were unified before use.

For persistent sends and receives, the triple is predefined, and there is no local marshalling problem as just described. However, because of the inability to guarantee a virtual channel, implementations have to work hard to decide on a strategy for marshalling and unmarshalling data across single transfers (such as which side to burden more greatly). For short messages, this is most pronounced; for long messages, strategies may be dynamically arranged, in addition to the RTS/CTS-type rendezvous commonly used to avoid undesired buffering at the receiver side. Consequently, a latency penalty is imposed for marshalling and unmarshalling complex data, even if the operations are individually persistent at each end of the communication, that is, even if the communication is legitimately static end-to-end.
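A small sketch of the descriptor problem follows (the 100x100 array and the choice of column 5 are arbitrary): the strided type is committed early, but the (address, count) half of the triple arrives only at the call site, so the commit step cannot finish the marshalling plan.

#include <mpi.h>

/* Sketch: MPI_Type_commit is intended to be early binding, but the committed
 * type still lacks the base address and count, supplied only at the send. */
int main(int argc, char **argv)
{
    double grid[100][100] = {{0}};
    MPI_Datatype column;
    MPI_Status status;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One column of the 100x100 row-major array: 100 doubles, stride 100. */
    MPI_Type_vector(100, 1, 100, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);   /* early binding of the layout only */

    if (rank == 0)
        /* The descriptor triple (address, count, datatype) is completed
         * only here, too late for commit-time marshalling decisions. */
        MPI_Send(&grid[0][5], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&grid[0][5], 1, column, 0, 0, MPI_COMM_WORLD, &status);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}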

4 Application Properties vs. MPI API

"Regular" applications, particularly data-parallel applications, often define their message structures up front and work with loosely synchronous programming styles. Such applications do not often utilize tags, and many freeze their arguments after defining them once. A number of aspects of the generality supported by MPI are not needed, either for point-to-point or for collective operations. Similarly, the communication structures of coarse-grain dataflow operations may also exhibit such repetitive structure. Finally, irregular algorithms that experience significant dynamic load balancing and restructuring do not exhibit such properties, except possibly on reuse scales that are hard to exploit. We argue that MPI is suboptimal for the common applications that do exhibit temporal locality, which are many of the applications written with MPI.

4.1 Type #1a/1b: Tag Restrictions

If applications use MPI tag wildcards on a communicator, then a deep matching problem occurs, because queue searching is exacerbated, and FIFO ordering based on receipt order is overridden. Type #1a applications are defined as applications that do not use wildcard tags (Footnote 5). If there is a guarantee of no wildcard tags, the possibility of receiver-driven protocols opens; this is useful in connection with Type #1c, which avoids source wildcards as well. For the class of applications that set tags to a single constant (e.g., 0), MPI is too general, because it assesses all communicators with the fixed overhead of supporting out-of-order tag matching, and must compare the tag words (Type #1b). Most communication layers pay a penalty for tag handling, even if tags are not used, and a more significant penalty when tags actually appear (Footnote 6).

Footnote 5: For many implementations, the penalty is imposed if any communicator utilizes wildcard tags.
Footnote 6: It is possible to plan for a small, fixed number of tags with such new network technologies as the VI Architecture [7], by dedicating virtual channels for early demultiplexing.

4.2 Type #1c: Restricting Wildcard Sources

Applications that utilize explicit sources for the selection of messages are inherently advantageous, because this establishes one-to-one endedness between senders and receivers. Wildcard source selection is a costly feature for applications to use, because it limits certain receiver-driven protocols, and also forces selection across multiple network devices when heterogeneous systems are involved. Applications that avoid wildcard sources offer MPI a clear optimization: a single network source and, when coupled with the tag restrictions mentioned above, simplified MPI communication structures. Fast MPI should provide annotations of communicators (at creation) for these restricted point-to-point modes. It may also be necessary to annotate mpirun to impose this restriction on MPI_COMM_WORLD.
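To make the Type #1 restrictions concrete, the fragment below (a minimal sketch; the buffer size and the constant tag 0 are arbitrary) contrasts the fully general receive, which forces the wildcard matching machinery, with the restricted receive that a Type #1a-c application actually issues.

#include <mpi.h>

/* Sketch: the first receive requires the full out-of-order matching
 * machinery (wildcard source and tag); the second is all that a
 * Type #1a-c application ever issues: explicit source, constant tag. */
void receive_styles(int partner, int *buf, MPI_Comm comm)
{
    MPI_Status status;

    /* General matching: any source, any tag within comm. */
    MPI_Recv(buf, 256, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);

    /* Restricted matching: explicit source, fixed tag 0. */
    MPI_Recv(buf, 256, MPI_INT, partner, 0, comm, &status);
}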

4.2.1 Type #2: Static Point-to-Point

For applications that fix the arguments to their send/receive calls, and which are also Type #1c, the channel abstraction defined in [6] can be used in simplified form with MPI. One proposal offered for MPI-2 (defeated) was a pairwise blocking "MPI_BIND()" call, with the two endpoints of a persistent communication previously defined. This works poorly when several channels are to be established, because of deadlock possibilities that become the application's problem to avoid. This is readily corrected, as described in Section 5. Fast MPI should provide a persistent channel construct. It should easily allow a set of persistent channels to be established, without deadlock, over a communicator's group.

4.3 Type #3: Static Collectives

For applications whose collective operations do not change with iteration, it is possible to justify a one-time up-front charge to determine how best to perform the collective operation. Early-binding semantics are not available in MPI, as mentioned above. Early binding would support poly-algorithms, static resource allocation, and one-time error checking. Such steps can markedly enhance the predictability of these algorithms as well. Because of the lack of persistent collective operations, the MPI-2 Forum had a difficult time agreeing on non-blocking collective operations; MPI-2 ultimately did not provide this important capability. The persistent format allows operations to be planned, such as allocating threads and/or devising additional safe communication space for use in a predefined thread, for each such non-blocking collective across a communicator's group. Fast MPI should provide blocking and, optionally, non-blocking persistent API for collective operations.

4.4 Type #4: Data Reorganization

As mentioned earlier, planning for special remapping of "datacubes" is possible with early-binding APIs. This type of API has existed in engineering practice for some time, and is currently receiving attention for the purpose of standardization. Relevant applications, such as in-place and out-of-place 2D FFTs, evidently exhibit this type of persistent use of data reorganization for structured rather than unstructured data motion. Neither the level of abstraction nor the early-binding semantics are present in MPI to address these operations without a significant performance gap. To support Type #4, the operations posed in the abstract by DARPA "Data Reorg" should have Fast MPI realizations.

5 Fast MPI API

The Fast MPI application programmer interface consists of straightforward extensions to MPI operations, together with straightforward generalizations of the MPI operations that support asynchronous request management (completable operations).

5.1 Type #1 Applications

The restrictions on matching (per communicator) can all be addressed through annotations on the construction of communicators, which may be done by modifying MPI_COMM_SPLIT and MPI_COMM_DUP. For brevity, only the latter is shown here:

Fast_MPI_Comm_dup(MPI_Comm oldcomm, MPI_Comm *newcomm, int flags);

The flags argument allows the restrictions the application can accept to be described at the time the communicator is defined and its resources are allocated. This provides access to the Type #1a-c optimizations.
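A possible usage follows; the flag constants FAST_MPI_NO_WILD_TAG, FAST_MPI_NO_WILD_SOURCE, and FAST_MPI_SINGLE_TAG are hypothetical names chosen here for illustration, since the specification does not fix their spelling.

/* Sketch: duplicating MPI_COMM_WORLD while declaring the Type #1a-c
 * restrictions up front; the flag constants are hypothetical. */
MPI_Comm fastcomm;
int flags = FAST_MPI_NO_WILD_TAG      /* no MPI_ANY_TAG on this communicator */
          | FAST_MPI_NO_WILD_SOURCE   /* no MPI_ANY_SOURCE either */
          | FAST_MPI_SINGLE_TAG;      /* all traffic uses one constant tag */

Fast_MPI_Comm_dup(MPI_COMM_WORLD, &fastcomm, flags);
/* ... restricted, faster point-to-point traffic on fastcomm ... */
MPI_Comm_free(&fastcomm);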

5.2 Type #2 Applications

In addition to supporting communicator restrictions that avoid overly expensive matching, it was asserted above that virtual channels should be supported between end-points in a communicator. In order to do this simply, a series of non-blocking setup calls could be supported. First, each end-point creates a persistent send or receive, as appropriate. Next, a non-blocking bind operation is initiated. Finally, completion is awaited. The following example initiates M outgoing virtual channels and N incoming virtual channels in a specific process; these operations would have corresponding endpoints across the rest of the group.

/* any process in group: */
/* do all Send_inits: */
MPI_Send_init(..., &req[0]);
...
MPI_Send_init(..., &req[M-1]);
/* do all Recv_inits: */
MPI_Recv_init(..., &req[M]);
...
MPI_Recv_init(..., &req[M+N-1]);
/* non-blocking Binds: */
Fast_MPI_IBindall(req);
Fast_MPI_Waitall(req, stati);

Both sides initiate MPI_Start in order to accomplish two-sided communication when desired. As in [6], it is easy to define one-sided extensions (put- and get-based) on these virtual channels as well.
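Once the channels are bound, per-iteration reuse could look like the fragment below. This is a sketch only: it assumes the req and stati arrays from the fragment above, loop variables step and nsteps declared elsewhere, and that the standard MPI-1 calls MPI_Startall and MPI_Waitall apply unchanged to channel-bound persistent requests.

/* Sketch: per-iteration reuse of the M+N bound channels. */
for (step = 0; step < nsteps; step++) {
    MPI_Startall(M + N, req);        /* fire all persistent sends/receives */
    /* ... overlap local computation here if desired ... */
    MPI_Waitall(M + N, req, stati);  /* complete the step's communication */
}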

5.3 Type #3 Applications

For Type #3 applications, persistent interfaces with blocking/non-blocking options must be provided. For brevity, we illustrate the case of MPI_REDUCE, together with indications that both MPI_WAIT and MPI_TEST are generalized as well:

Fast_MPI_Reduce_init(void *sendbuf, void *recvbuf, int count,
                     MPI_Datatype datatype, MPI_Op op, int root,
                     MPI_Comm comm, int blockflag, int min_uses,
                     Fast_MPI_Request *request);

Fast_MPI_Start(Fast_MPI_Request request);

Fast_MPI_Wait(Fast_MPI_Request *request, Fast_MPI_Status *status);

This call would be started non-blocking using the modified start, and waited on using the modified wait. Additionally, if blockflag were specified as true, then Fast_MPI_Start would not return until the operation were complete, making a subsequent Wait trivial. The integer min_uses provides a measure of how many reuses the setup will be amortized over, thereby allowing the setup phase to trade off its own cost against the recurring cost of the actual operation.
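A usage sketch for the non-blocking variant follows, based on the signatures above; the buffers local and global, the count of 64 doubles, and the loop variables step and nsteps are assumed to be defined by the application.

/* Sketch: set up the persistent reduction once, then reuse it each step. */
Fast_MPI_Request req;
Fast_MPI_Status status;

Fast_MPI_Reduce_init(local, global, 64, MPI_DOUBLE, MPI_SUM, 0,
                     MPI_COMM_WORLD, /* blockflag */ 0,
                     /* min_uses  */ nsteps, &req);

for (step = 0; step < nsteps; step++) {
    Fast_MPI_Start(req);         /* non-blocking start of the planned reduce */
    /* ... local work for this step ... */
    Fast_MPI_Wait(&req, &status);
}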

6 Future Work

The following tasks remain to be accomplished in order to fully demonstrate Fast MPI:

- Release the full specification, with C, C++, and Fortran-77 bindings.
- Realize the API within an optimized MPI implementation, to show the performance differential in a high-performance setting.
- Realize a portability library for non-Fast-MPI environments.
- Define requirements on the threading library in such portable environments, so that new non-blocking portable services can optionally be used as well.
- Realize the "DARPA Data Reorganization" abstract API in the context of extensions to MPI, and include these in the Fast MPI specification [9].
- Utilize these primitives in applications that motivate widespread adoption of the Fast MPI extensions.

7 Conclusions

Opportunities for MPI performance extensions are essential to reducing the "price of portability" associated with using MPI-1 and MPI-2 in demanding environments, while simultaneously offering higher predictability of execution; that is, they extend into areas MPI-2 was unable to address. Late-binding characteristics of the extant functionality cause the identified performance limitations. Users can nonetheless benefit from performance extensions, despite their absence from the formal MPI-1 and MPI-2 standards. Such extensions add only trivial complexity to the delivery of an MPI library if implemented without particular optimization (to achieve minimal compliance), and they actually simplify runtime optimization for programs with temporal locality of message passing. In some cases, desired functionality is missing from MPI-2 entirely because it could not be achieved adequately without "persistence." This paper identified higher-performance primitives as direct extensions to MPI-1 or MPI-2, applicable to non-trivial classes of "regular" applications, while retaining portability to unenhanced MPI environments through the idea of offering a portability library that layers on MPI. Additional details are contained in [8, 10].

References

[1] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994. http://www.mpi-forum.org.
[2] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI Message-Passing Interface standard. Parallel Computing, 22:789-828, 1996.
[3] A. Geist, W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, W. Saphir, A. Skjellum, and M. Snir. MPI-2: Extending the Message-Passing Interface. In L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Euro-Par '96 Parallel Processing, Lecture Notes in Computer Science, volume 1123, pages 128-135. Springer-Verlag, 1996.
[4] MPI-2 Forum. MPI-2: Extensions to the Message Passing Interface, 1997. http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html.
[5] A. Beguelin, G. A. Geist, W. Jiang, R. Manchek, K. Moore, and V. Sunderam. The PVM project. Technical report, Oak Ridge National Laboratory, February 1993.
[6] MPI/RT Forum. MPI/RT: A Real-Time Message-Passing Interface Standard, 1998. http://www.mpirt.org.
[7] Intel, Compaq, and Microsoft. VI Architecture Specification Version 1.0, 1998. http://www.viarch.org.
[8] A. Skjellum et al. Sub-optimalities of the Message-Passing Interface (and how to fix them), Part I. To be submitted to Parallel Computing, June 1998.
[9] DARPA Reorganization Forum Working Group. DARPA data reorganization standard. http://www.data-re.org, May 1998.
[10] A. Skjellum et al. Standard extensions for Fast MPI. http://www.erc.msstate.edu/labs/hpcl, July 1998.
