Interoperability of Message-Passing Interface (MPI) Implementations: A Position Paper
Anthony Skjellum and Thomas P. McMahon
MPI Software Technology, Inc.
1 Research Blvd., Suite 201, Starkville, MS 39759
[email protected] (domain changes to @mpi-softtech.com on or about March 1, 1997)
601-320-4300; FAX: 601-320-4301

February 20, 1997 (Revised December 18, 1997)
DRAFT VERSION 0.2

Abstract
Interoperability of MPI implementations, unlike that of TCP/IP-based software over the Internet, is significantly more difficult because of the mechanisms and policies used locally to achieve the requirements of the MPI standard, including correctness, progress, and performance-portability. Each MPI implementation can have completely different protocol stacks, and when it is desired to make such stacks interoperate, more global rethinking of designs and implementations becomes necessary. The tendency toward polled, non-threaded implementations of current MPI makes effective multi-protocol support difficult, a situation often encountered in interoperability. However, interoperability over a single transport mechanism is itself a viable and interesting problem independent of such concerns. In this position paper, we overview certain technical issues that we hope will spur a nationwide industry discussion of MPI interoperability. Suggestions for short-term and medium- to long-term standardization activities are included, and a number of areas of possible impact on MPI interoperability are enumerated. This is a paper in progress, which will be updated and augmented over time.
This paper is for public release worldwide. Copyright 1997, MPI Software Technology, Inc., but it may be redistributed and disseminated in unmodified form without restriction, assuming all notices of authorship and copyright are retained. This paper is offered as input to the NIST discussion process now beginning.
1 Introduction
MPI, the Message Passing Interface, is not currently interoperable to the degree desired to undertake high-performance distributed computing [4]. We see the entire technical base and the future/extended uses of MPI as enhanced by the notion of interoperability. Here we define interoperability to include multiple, independent implementations of MPI over networks, multicomputers, and multiprocessors, and interoperation between such systems. A number of significant issues exist in supporting such interoperability. We offer our current perspective on what we anticipate will be an industry-wide discussion under the stewardship of NIST, hopefully leading to useful interoperability where possible, and acceptance of non-interoperability for certain systems where interoperability is less of a requirement, or demonstrably harmful to other systemic performance requirements (such as cost or net performance). Problems in MPI interoperability arise from differences in transport layers, as well as from mechanisms and policies within MPI implementations that are permitted by the standard in order to promote performance portability of codes, yet do not address issues of internal interoperability. Ancillary issues, such as authorization and authentication of remote resource access, compound the question of supporting interoperability, as does the need to coordinate executions through meta-launchers. The clear need to keep MPI independent of TCP/IP or other specific protocol assumptions from a user's perspective adds interesting requirements to interoperability, as does the desire to retain high local performance in an implementation in the face of interoperability (that is, to demand a low cost of interoperability on intra-implementation modes of operation).

It is expected that implementors of single-box, highly tuned implementations will be averse to any form of interoperability, because it will generally have an impact on their individual implementation effort as well as on software performance, compared to the single-box version. It is also asserted that current benchmarks strongly disfavor support for additional flexibility that compromises zero-message latency in any way. Such arguments are more about investment/ROI than about feasibility from an MPI implementor's/vendor's perspective; it is certainly possible, with appropriate investment and market interest, to support both the single-box and interoperable versions, rather than posture these as an either/or decision. Users who do not need interoperability should not have to pay for greater global functionality with probably somewhat lower local performance: avoiding the unwanted compromise can be accomplished in the worst case by providing two separate libraries (footnote 1). Market forces must nonetheless be considered in offering interoperability and other flexibility found outside of MPI's technical computing creche; our view is that interoperability and other flexibility ultimately make MPI a success and broaden its market.

It should be noted that some of the extensions of transport and network capabilities needed in any MPI implementation for MPI-2 support, such as dynamic process table distribution, are also required for MPI-1-type interoperability of multiple implementations, so that a commonality of function can be identified that supports both the move to MPI-2 and interoperability, within single designs.

Footnote 1: A similar investment/ROI and dual-library argument has been made for multithreaded MPI support.
1.1 Medium- and Long-Term Issues
Here is a list of some of the areas of discussion seen as relevant to the development of interoperability profiles.

1. Agreeing on types of interoperability (classes, profiles, or levels)... beginning with MPI-1 interoperability with minimal MPI-2 support (e.g., dynamic process management) is one approach.
2. Common network layers (IP and PacketWay [3]).
3. Common transport layers (TCP/IP, and what else?).
4. The role of Active Networking in the interoperability of MPI.
5. The role of threads and multi-protocol support in point-to-point interoperability, because multiple networks are often at issue when multiple implementations need to work together.
6. The role of distributed services in process management across the enterprise, including possibly other standards that can be leveraged.
7. The role of "consensus mode" standardization for collective operations (picking common algorithms and giving them common designations).
8. How intercommunicators figure in low-grade interoperability.
9. The market for interoperability, and its costs.
10. Internal transport protocols of MPI, and their possible consensus standardization (short, long, eager, put, get).
11. Common reliability mechanisms over almost-reliable networks.
12. SAN-to-SAN operation with special hardware (footnote 2): characterizing smart bridges for networks, what we would like them to do, etc., to help (rather than hinder) interoperability.
13. When interoperability is wrong (highly embedded systems, systems that are closed and will never open, etc.).
14. Delivering quality of service in interoperable implementations [a real-time feature].
15. Interoperability of MPI with Psched (or another acceptable standard) [6] for scheduling, to include local requirements that MPI may pose on future local schedulers as well.
16. The roles of fault detection and fault tolerance in providing for interoperability, and clean-up.

Footnote 2: SAN is System Area Network. It is the new name for local-area networks where there is high-performance, low-error-rate technology, and for which classical IP is too cumbersome to achieve a reasonable fraction of the performance potential, or to reveal new types of network services, such as dynamic discovery, mapping, and so on.
1.2 Short Term
Recognizing that there will be widely varying views on the "big picture," we see as relevant the designation of single-transport interoperability in the near future. TCP/IP interoperability is the logical first candidate. Under such consideration, these issues, among others, could be raised:

1. Registering common ports for messaging and process management with the IETF.
2. Agreeing on interoperable headers (e.g., unifying p4-type [1, 2], LAM-type, and MSTI-type headers).
3. Agreement on protocols for message lengths (short, long, very long).
4. Agreement on consensus for context establishment.
5. Agreement on a minimal consensus for collective operations.
6. Agreement on well-known services for remote process management (daemonology, including security).
7. Allowing tool developers to make their debuggers and other tools run across multiple implementations.
8. Heterogeneous launching.
9. Investigation of Psched vis-a-vis MPI.
1.3 Issues Related, but Explicitly not Considered
The MPI Forum has at least agreed to support limited functionality of the form:

1. Interlanguage interoperability
2. Conversion of datatypes to transportable form
3. Conversion of handles between languages in a user program

If any of these are dropped, or are handled insufficiently by the MPI Forum, then these topics emerge as interesting in the Interoperable MPI discussion. Otherwise, the requirements posed by the MPI Forum are to be retained, and so are not considered further here. Furthermore, solutions that modify the MPI API in ways that force the user to realize that interoperability is ongoing are deemed unacceptable. Portable, interoperable calls that support specification of mapping and other behavior may be a necessity, or may be confined to the heterogeneous launcher standard to be discussed.
1.4 Brief Comments on MPI's Safety for Operations
MPI provides several types of safety for communication, describable in brief as follows:

1. spatial safety (group scoping)
2. lexical scoping (contexts)
3. compositional safety:
   - successive use of point-to-point:
     - partial ordering with matching rules to the same destination
     - decoupling of ordering among contexts
   - successive use of collective:
     - distinct operations
     - repetitive use of the same operation
   - asynchronous point-to-point and collective (correctness in the face of a non-quiescence condition)

Implementations handle these safety requirements through protocol and mux/demux techniques. Primarily, the upshot of these requirements includes the necessity of consensus on the mechanisms for each safety (often bit patterns and widths of bit fields).
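To make the bit-field point concrete, the following sketch packs a receiver-side match key from context, tag, and source rank. All names and widths here are assumptions of this illustration, not taken from any existing implementation; the point is only that interoperating implementations would have to agree on exactly such widths and layouts.

    #include <stdint.h>

    /* Hypothetical wire-level match key; the exact widths and field order
     * would themselves be subjects of consensus among implementations.   */
    #define IMPI_CTX_BITS  16u   /* context id (lexical scoping)          */
    #define IMPI_TAG_BITS  32u   /* user tag                              */
    #define IMPI_SRC_BITS  16u   /* source rank within the global job     */

    /* Pack context, tag, and source rank into a single 64-bit key that a
     * receiver can use for mux/demux of incoming messages.               */
    static uint64_t impi_pack_match_key(uint16_t context, uint32_t tag,
                                        uint16_t source)
    {
        return ((uint64_t)context << (IMPI_TAG_BITS + IMPI_SRC_BITS)) |
               ((uint64_t)tag     <<  IMPI_SRC_BITS)                  |
                (uint64_t)source;
    }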
2 Common Context Support of Interoperable Implementations
The classical example of a context-supporting messaging system, Zipcode [10, 11], used a global name server for the support of contexts across communicating groups of processes. In MPI, explicit choices were made to eliminate this requirement (footnote 3), to allow local context support, and to make this quantity purely ineffable, so that implementations could achieve communication safety independent of an integer that the user could identify and/or manipulate. This choice also led to the "outer group" synchronization rule that is relevant in the MPI_Comm_split and MPI_Comm_create calls, and which subsequently helped motivate the inclusion of intercommunicators as an escape from this requirement.

A proposal for the support of consensus in context management in interoperable MPI is to support both a local and a global context naming service. The local services remain essentially unchanged from current implementations, but reserve a bit to mark global contexts. Global contexts are served through a mechanism yet to be defined, which would, in the worst case, be a server for contexts across implementations, but which we could also generalize to include other algorithms, should they prove useful. The support for hierarchical context management of this type might be useful at more than two levels, but no obvious application is seen as yet.

Footnote 3: Some implementations have used, and still use, a global service for this, though it is not mandated by MPI.
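As a minimal sketch of the reserved-bit idea, assuming a 32-bit context id, the high bit as the marker, and trivially stubbed allocators (the global context service itself remains to be defined):

    #include <stdint.h>

    /* Hypothetical layout: the high bit of a 32-bit context id marks a
     * "global" context whose value came from a cross-implementation
     * context name service rather than from purely local agreement.      */
    #define IMPI_CTX_GLOBAL_BIT  0x80000000u

    static int impi_context_is_global(uint32_t ctx)
    {
        return (ctx & IMPI_CTX_GLOBAL_BIT) != 0;
    }

    /* Stand-ins for the two allocation paths; a real implementation would
     * keep its existing local allocator and contact the yet-to-be-defined
     * global context service for contexts that span implementations.     */
    static uint32_t local_next  = 1;
    static uint32_t global_next = 1;

    static uint32_t impi_local_context_alloc(void)  { return local_next++; }
    static uint32_t impi_global_context_alloc(void) { return global_next++; }

    uint32_t impi_alloc_context(int spans_multiple_implementations)
    {
        if (spans_multiple_implementations)
            return impi_global_context_alloc() | IMPI_CTX_GLOBAL_BIT;
        return impi_local_context_alloc();
    }

The marker bit only matters at implementation boundaries; communicators that never cross such a boundary keep today's purely local allocation untouched.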
3 Consensus Concept for Point-to-Point Operations
An underlying mechanism must be provided, through a common discovery mechanism (at the network level) or through a common service (at the transport level), to identify header types, remote process tables, and endianness, as well as to negotiate primitive channels that allow point-to-point messages to travel between implementations with flow-control and retransmission-policy compatibility. Among other things, interoperability will have to define:

- means for determining the effective MTU of a primitive channel between endpoints
- optional definition of the protocol stack to be used (from a fixed set of retransmission policies, such as go-back-N, including parameters)
- definition of higher-level policies, such as eager, long, and rendezvous, from the perspective of MPI, to control buffering requirements
- consensus on tag-space and similar bit widths
- agreement (from one or more templates) about the header to be used
- the requirement for a small meta-header to support more than one kind of actual header (see the sketch at the end of this section)
- decisions about handling of data heterogeneity; the headers themselves, as network protocols, should be network-byte-ordered, so this point refers to the user data
- mechanisms for routing among implementations, such as when networks are connected end-to-end, but not directly

The introduction of a well-posed primitive channel notion is suggested as a mechanism to achieve point-to-point interoperability. The quality of service offered for non-real-time MPI will emphasize latency, bandwidth, and correctness issues, while offering the potential for expansion to include real-time quality of service later. Primitive channels in no way mandate the inclusion of channels from the user's perspective.
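Purely as an illustration of the meta-header idea, the sketch below shows a meta-header followed by one eager-protocol header template. Every field name and width is a hypothetical assumption of this sketch, not a proposed standard; multi-byte fields are assumed to be carried in network byte order.

    #include <stdint.h>

    /* Hypothetical meta-header: just enough to say which actual header
     * follows, so that more than one header template can coexist.        */
    struct impi_meta_header {
        uint8_t  version;      /* interoperability protocol version         */
        uint8_t  header_type;  /* selects one of the agreed header formats  */
        uint16_t header_len;   /* length in bytes of the header that follows */
    };

    /* One possible point-to-point header template (eager protocol).       */
    struct impi_eager_header {
        uint32_t context;      /* communication context (safety)            */
        uint32_t tag;          /* user tag                                  */
        uint32_t src_rank;     /* global source rank                        */
        uint32_t dst_rank;     /* global destination rank                   */
        uint32_t payload_len;  /* user data bytes following the header      */
    };

The meta-header is the piece that lets two implementations agree at run time on which of several registered header templates follows; the user payload's representation is then governed by the data-heterogeneity decisions listed above.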
4 Consensus Concept for Collective Operations

The properties of consensus for collective operations are as follows:

- A number of well-defined collective algorithms are defined (footnote 4) and given unique interoperability numbers, e.g., IMPI_BCAST_TREE_TYPE1 (footnote 5); a registry of this kind is sketched at the end of this section.
- Some emphasis on ordering to minimize inter-implementation communication is suggested as a mode of support.
- A guarantee of lowest-common-denominator overlap of all implementations ("easiest choices") is offered, but it is clearly desirable to do better.

In addition, research in the area of composition of individual collective algorithms to form interoperable collectives deserves attention, though such algorithms may be difficult to compose. In any event, since interoperable implementations can use pieces of collective operations, composition of user-level operations into interoperable user-level operations may not be that valuable.

Footnote 4: It is strongly suggested that any claims of special hardware that motivate, or make unfavorable, special activities involving collective operations be disclosed and reduced to practice if they are to be considered as arguments for choosing or not choosing a collective interoperability approach.

Footnote 5: This is a canonical example of a poly-algorithm [9].
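One natural realization of such interoperability numbers is a registered enumeration. In the sketch below, only IMPI_BCAST_TREE_TYPE1 comes from the example above; the other identifiers are hypothetical placeholders.

    /* Hypothetical registry of collective-algorithm identifiers.  Only
     * IMPI_BCAST_TREE_TYPE1 is taken from the text; the rest are
     * illustrative placeholders for other agreed algorithms.             */
    enum impi_coll_algorithm {
        IMPI_BCAST_TREE_TYPE1     = 1,
        IMPI_BCAST_FLAT_TYPE1     = 2,
        IMPI_REDUCE_TREE_TYPE1    = 3,
        IMPI_BARRIER_DISSEM_TYPE1 = 4
    };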
5 Consensus for 1-Sided Operations (MPI-2)
MPI-2 has not been adopted yet, but some form of 1-sided communication of a DSM nature is likely to be included. Similar consensus standardization involving store order, byte order, and so on is needed here. It is suggested that, given the likely need for asynchronous agent support in interoperable MPI, support for 1-sided operations will not be terribly difficult to bootstrap, assuming message passing can be made interoperable. Performance is another question.
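For concreteness, the sketch below shows the kind of one-sided transfer at issue, using the MPI-2 draft's window interface (run with at least two processes). The calls themselves are standard; it is the representation of the target-side update between implementations (byte order, store order, completion semantics) that would need consensus.

    #include <mpi.h>

    /* Each process exposes a one-integer window; rank 0 writes into rank
     * 1's window.  When the two ranks live in different implementations,
     * the wire representation of this update is what needs consensus.    */
    int main(int argc, char **argv)
    {
        int      rank, value = 42;
        int     *winbuf;
        MPI_Win  win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Alloc_mem(sizeof(int), MPI_INFO_NULL, &winbuf);
        *winbuf = 0;
        MPI_Win_create(winbuf, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);

        MPI_Win_free(&win);
        MPI_Free_mem(winbuf);
        MPI_Finalize();
        return 0;
    }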
6 Dynamic Process Management (MPI-2)
It is clear that dynamic-process-management operations are closely related to interoperability. It is suggested that supporting dynamic, unrelated process groups joining together to form a single MPI program will be closely related to MPI interoperability issues when different implementations are involved. At the very least, it must be possible for peer implementations to spawn worlds and connect them in a non-parent-child relationship, in order to support interoperability in a reasonable way (see the sketch below). Restrictions placed by the final version of MPI-2 on what kinds of processes may initiate or involve themselves in MPI communication may need to be revisited when considering interoperability.
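As an illustration of the peer, non-parent-child connection called for above, the following sketch uses the MPI-2 draft's port-based connect/accept interface; interoperability would require this to work when the two worlds run different MPI implementations. For brevity, each world is assumed to consist of a single process, and how the port name is published between jobs is deliberately left open.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Two independently launched MPI "worlds" connect as peers (no
     * parent-child relationship).  Run one copy as "server" and pass the
     * printed port name to the other copy.                               */
    int main(int argc, char **argv)
    {
        char     port[MPI_MAX_PORT_NAME];
        MPI_Comm peer;              /* intercommunicator to the other world */

        MPI_Init(&argc, &argv);
        if (argc < 2)
            MPI_Abort(MPI_COMM_WORLD, 1);

        if (strcmp(argv[1], "server") == 0) {
            MPI_Open_port(MPI_INFO_NULL, port);
            printf("port name: %s\n", port);  /* out-of-band publication, hand-waved */
            fflush(stdout);
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &peer);
            MPI_Close_port(port);
        } else {                              /* argv[1] is the published port name */
            MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_WORLD, &peer);
        }

        /* ... exchange messages over the peer intercommunicator ... */

        MPI_Comm_disconnect(&peer);
        MPI_Finalize();
        return 0;
    }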
7 Building a Market for MPI Interoperability

It is suggested in this section to build a prototype of interoperability, with the properties below, to help establish the market for such technology. We refer to this prototype as IMPI for convenience. Similar ideas have been voiced from time to time in various informal discussions [12]. Here are properties that IMPI could possess, as a demonstration:

1. IMPI is an MPI library built on top of individual MPI libraries.
2. IMPI uses the profiling versions of the MPI functions of the individual MPI libraries.
3. Each implementation retains its view of MPI_COMM_WORLD, but this is not the MPI_COMM_WORLD as seen by the application programmer. IMPI maps ranks from the application's MPI_COMM_WORLD into ranks of each individual implementation's MPI_COMM_WORLD.
4. IMPI process startup would rely on the individual implementations' startup mechanisms.
5. IMPI's MPI_Init() would call each individual implementation's PMPI_Init(), and then the root nodes of each individual implementation would communicate to establish IMPI's MPI_COMM_WORLD.
6. Global name clashes may be a problem, and this will have to be worked around by preprocessing.
7. IMPI would be extremely portable, relying only on socket communication [between different implementations] and MPI communication [within each implementation's MPI_COMM_WORLD].
8. IMPI would essentially be a 2-protocol MPI.
9. MPI_Send/MPI_Recv calls by an application to/from ranks within an individual implementation's process space would be implemented by calling that implementation's PMPI_Send/PMPI_Recv, with mapped source and destination ranks (sketched below).
10. MPI_Send/MPI_Recv calls by applications to/from ranks not within a single implementation's process space would be implemented by the IMPI sockets device.
11. IMPI has its own packet types/headers and protocols for its socket device.
12. The root nodes of each individual implementation would act as collection nodes for inter-implementation collective operations.

IMPI, if successful, may create greater user interest in and demand for interoperable MPI implementations, and thus encourage vendors to create or increase support for making their implementations interoperable.
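A minimal sketch of properties 9 and 10 follows. The rank table, its contents, and impi_socket_send() are hypothetical stand-ins invented for this illustration; only the dispatch between PMPI_Send and a socket path reflects the properties listed above.

    #include <mpi.h>

    /* Hypothetical IMPI-internal table mapping the application's global
     * ranks to the local implementation: is the rank local, and if so,
     * what is its rank in the local MPI_COMM_WORLD?  In a real prototype
     * this table would be filled in during IMPI's MPI_Init (property 5);
     * a fixed two-entry table appears here only so the sketch compiles.   */
    struct impi_rank_entry { int is_local; int local_rank; };
    static struct impi_rank_entry impi_rank_table[] = {
        { 1, 0 },   /* global rank 0: rank 0 in this implementation        */
        { 0, 0 },   /* global rank 1: lives in another implementation      */
    };

    /* Hypothetical socket-device send used between implementations
     * (property 10); the real version would marshal an IMPI header and
     * the payload onto a TCP connection to the destination's root node.   */
    static int impi_socket_send(const void *buf, int count, MPI_Datatype type,
                                int global_dest, int tag)
    {
        (void)buf; (void)count; (void)type; (void)global_dest; (void)tag;
        return MPI_SUCCESS;
    }

    /* Properties 9/10: MPI_Send as exported by IMPI.  Local destinations
     * fall through to the underlying implementation via its profiling
     * entry point; remote destinations use the IMPI socket device.  Only
     * the application's MPI_COMM_WORLD case is sketched.                  */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        if (impi_rank_table[dest].is_local)
            return PMPI_Send(buf, count, datatype,
                             impi_rank_table[dest].local_rank, tag, comm);
        return impi_socket_send(buf, count, datatype, dest, tag);
    }

Receives, collectives, and the rank-to-implementation discovery at startup would follow the same pattern, but they also require the inter-root bootstrap of property 5.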
8 Potential for Active Networking
We note that Active Networking [15, 14], an emerging technology, provides a glimpse of a future in which we will be able to download new protocol stacks remotely (mobile code). It would be helpful if the interoperable design presupposed the possibility of extensible code bases in the next few years. Active Networking could likewise support the upgrading of nodes to support common collective algorithms. In general, the problem of interoperability will not go away with Active Networking, but all the static policies and pre-agreed-upon consensus algorithms will be able to be relaxed. We do not expect people who are whittling microseconds off of non-interoperable implementations to embrace Active Networking as a viable technology until it is proven. However, given the fast pace of Internet evolution and of gigabit/s networking, it is strongly suggested that Active Networking concepts be recognized as deployable into interoperable MPIs in three to five years.
9 Obligation to Study Other Standards Efforts
A number of efforts that bear on this effort include, but are not limited to, the following (many references are needed for this section, and will be added in the future):

1. POSIX and real-time POSIX standards,
2. CORBA and real-time CORBA (significant interoperability support),
3. IETF working groups, including IPv6,
4. ISO ASN.1 and BER for the encoding of data,
5. The data interoperability aspects of the SCI standard.

Furthermore, other industry-based efforts may exist that standardize other aspects of software and hardware, and we should classify these if possible, and see what can be exploited without reinvention.
10 Acknowledgements
The first author acknowledges many useful discussions with Nathan Doss (Sanders), Shane Hebert (MSU), and Greg Burns (formerly of OSC) on this topic over the last several years.
References
[1] R. Butler and E. Lusk. User's guide to the p4 programming system. Technical Report TM-ANL-92/17, Argonne National Laboratory, 1992.

[2] Ralph Butler and Ewing Lusk. Monitors, messages, and clusters: The p4 parallel programming system. Parallel Computing, 20:547-564, April 1994. (Also Argonne National Laboratory Mathematics and Computer Science Division preprint P362-0493.)

[3] Danny Cohen, Craig Lund, Anthony Skjellum, Thom McMahon, and Robert George. Proposed specification for the PacketWay protocol. IETF Network Working Group Internet Draft, 1997.

[4] Message Passing Interface Forum. MPI: A Message-Passing Interface standard. The International Journal of Supercomputer Applications and High Performance Computing, 8, 1994.

[5] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press, 1994.

[6] The PSCHED API Working Group. PSCHED: An API for Parallel Job/Resource Management, November 1996. Version 0.1.

[7] Institute of Electrical and Electronics Engineers, New York. Draft Standard for Information Technology - Portable Operating System Interface (POSIX) - Part 1: System Application Program Interface (API) - Amendment 2: Threads Extension [C Language], Draft 8, October 1993.

[8] Boris V. Protopopov and Anthony Skjellum. Multi-threaded MPI architecture with multiple communication devices. To be submitted to Parallel Computing. Also available as ftp://aurora.cs.msstate.edu/pub/reports/Message-Passing/shmempaper97a.ps.Z, February 1997.

[9] J. R. Rice and S. Rosen. Numerical analysis problem solving system. In Proc. 21st ACM Nat. Conf., pages 51-56. ACM Publications, 1966.

[10] Anthony Skjellum and Alvin P. Leung. Zipcode: A portable multicomputer communication library atop the Reactive Kernel. In Proceedings of the Fifth Distributed Memory Computing Conference (DMCC5), pages 767-776. IEEE, April 1990.

[11] Anthony Skjellum, Steven G. Smith, Nathan E. Doss, Alvin P. Leung, and Manfred Morari. The design and evolution of Zipcode. Parallel Computing, 20(4):565-596, April 1994.

[12] Marc Snir. Personal communication about interoperability, October 1994.

[13] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. MPI: The Complete Reference. MIT Press, 1996.
[14] David L. Tennenhouse and David J. Wetherall. Towards an active network architecture. Computer Communication Review, 26(2), April 1996.

[15] David Wetherall and David Tennenhouse. The ACTIVE IP option. In Proc. of the 7th ACM SIGOPS European Workshop, September 1996.