Dynamic Communicators in MPI

Richard L. Graham & Rainer Keller
Oak Ridge National Laboratory
{rlgraham,keller}@ornl.gov

June 3, 2009

Abstract

This paper describes a proposal to add support for dynamic communicators to the MPI standard. This adds the ability to grow or shrink a specified communicator under well-specified circumstances. The goal is to make it possible for a new class of applications - long-running, mission-critical, loosely coupled applications, running in a highly dynamic environment - to use MPI libraries for their communications needs, and to enable HPC applications to adjust to changing system resources. Implementation analysis indicates that the performance impact on existing high-performance MPI applications should be minimal or non-existent, and the changes to MPI implementations are expected to be well compartmentalized.

1 Introduction

The Message Passing Interface (MPI) [2] is the ubiquitous choice for user-level communications in High Performance Computing (HPC). This standard serves as a basis for the implementation of high-performance communication systems, enabling simulation codes to fully benefit from the capabilities of modern, high-performance networks such as InfiniBand [1].

The MPI communicator brings together the notion of a partitioned communication context and the MPI process groups that may communicate in this partitioned space, with communications occurring between an MPI rank in the local group and one or more ranks in the remote group. Two types of communicators are defined: intra-communicators, where the local and remote groups are identical, and inter-communicators, where the local and remote groups are disjoint. All communications occur in the context of a communicator. Point-to-point and collective communications explicitly reference the communicator, while one-sided communications reference it through the window handle, and MPI file I/O through the file handle.

In the current MPI standard the groups associated with a communicator are static, fixed for the lifetime of the communicator. To change the MPI processes involved in a given communication pattern, a new communicator must be constructed. Communicator creation is a collective operation over the processes in the groups forming the new communicator, providing a point at which all processes in the communicator can exchange information and initialize optimized communications. For example, the static nature of the communicator allows MPI implementations to cache the communication patterns used for collective operations, such as a broadcast, and to initialize cached scratch buffers for such communications.

The static nature of the MPI communicator is a good match for current HPC computing environments, where machine resources are allocated as static resource pools, with MPI implementations taking advantage of the communication optimizations this model provides. HPC simulation codes have also adapted to this mode of operation. However, other classes of parallel applications run in much more dynamic environments and cannot easily adapt to the current communicator model used by MPI. In these environments the number of processes, and the resources they use, change considerably over time. In general, these tend to be loosely coupled client/server applications where the clients rarely communicate with each other, and their use of collective communications is non-existent or limited to server management activities.
Examples include long-running mission-critical services, such as Condor-based applications [5], and dynamic client/server applications such as on-line gaming sites, to mention a few. Solution providers in these spaces would like to take advantage of the large amount of effort going into creating portable, high-performance MPI communication sub-systems, but require support for a dynamic process environment. In addition, as part of the on-going work on fault-tolerance support within the MPI standard, application developers have expressed interest in MPI support for changing the size of an existing communicator. This paper describes an extension to the MPI communicator concept that preserves the performance optimization opportunities inherent in the current standard, but also allows communicators to be dynamic in nature.

2 Background

Communication systems targeting parallel applications must in general address the notion of process groups. The Parallel Virtual Machine (PVM) specification [4] provides support for dynamic process groups, but PVM communications are not as efficient as MPI communications. Unified Parallel C (UPC) and Co-Array Fortran are current language specifications that incorporate the notion of parallelism. The team concept in the UPC specification [6] is heavily influenced by the MPI standard and is static in nature. The concept of process groups, or teams, was considered for the Fortran 2008 language standard [3], but has not been included in that specification.

3 Dynamic Communicators

This paper proposes to add to the MPI standard support for dynamic communicators, defined to be communicators that may change size over the lifetime of a healthy communicator. Support is proposed for two classes of applications: loosely coupled and tightly coupled. For loosely coupled applications, a mechanism with relaxed communicator consistency is proposed: at a given point in time, members of the communicator may, for a finite amount of time, have different views of the communicator. Support for tightly coupled applications is aimed at providing high-performance communications; to do so, communicators are kept consistent over all ranks for the duration of communicator use.

Providing support to shrink a communicator makes it possible to support a class of applications not currently supported by the MPI standard. Support for increasing communicator size may be viewed as semantic candy, but it does increase MPI's ease of use for applications needing such support. Overall, this proposal sets the stage for MPI applications to be much more adaptable to changing application and system requirements. Support for tightly coupled and loosely coupled communicators is fundamentally different, and each is described in the following subsections.

3.1 Loosely coupled dynamic communicators

The loosely coupled dynamic communicator model assumes lazy notification when the size of a communicator changes. As part of normal operations, a communications target may vanish, leaving the library unable to complete requested operations while still needing to continue normal library operations. For example, a sender may initiate communication to a rank that no longer exists, requiring the library to return an appropriate warning to the application and continue running, even if the user has set the error handler to MPI ERRORS ARE FATAL.

A communicator may grow either by a request from an existing rank to grow the communicator, or by a pre-existing process requesting to join the communicator. An existing rank in the communicator can increase communicator size by calling the routine:

MPI COMM GROW(communicator handle, n to grow, return code)

  INOUT  communicator handle   communicator (handle)
  IN     n to grow             number of ranks to add to the communicator (int)
  OUT    return code           return error code (int)

Unless MPI COMM WORLD is the communicator being expanded, the newly created ranks are also members of a new MPI COMM WORLD, whose members are the newly created MPI processes. The new processes need to obtain the new communicator handle after the call to MPI Init() or MPI Init thread() completes. The function

MPI COMM GET RESIZED COMM HANDLE(communicator handle)

  OUT  communicator handle   communicator (handle)

is introduced to obtain the handle to the communicator grown by the MPI COMM GROW() routine. No input parameters are needed, as the newly created MPI process will have only the two default communicators, MPI COMM WORLD and MPI COMM SELF, and possibly a third communicator. If the expanded communicator is MPI COMM WORLD, this is the handle returned; otherwise the handle to the third communicator is returned.

For an MPI process to connect to an existing communicator, it first needs to open a connection to the MPI server specified by port name. To become a member of an existing communicator it uses the call

MPI CONNECT TO COMM(port name, communicator name, newcomm)

  IN   port name           network address (string, used only on root)
  IN   communicator name   name of communicator to join (string)
  OUT  newcomm             communicator (handle)

This implies that communicators need to be named for other processes to be able to join them. To get a list of existing communicators that may be grown, the following function may be used:

MPI GET COMMUNICATORS(port name, comm names, count, return code)

  IN   port name     network address (string, used only on root)
  OUT  comm names    array of communicator names (strings)
  OUT  count         number of communicators (int)
  OUT  return code   return error code (int)

The size of the new communicator is increased by one with each process connecting to the communicator. The rank assigned to a new member of the communicator should be the first available rank in this communicator. A rank can leave a communicator using the routine

MPI EXIT COMMUNICATORS(comm, return code)

  IN   comm          communicator (handle)
  OUT  return code   return error code (int)

with the predefined communicator MPI COMM ALL signaling a clean exit from MPI.

3.1.1 Change Notification

When processes join or leave a communicator, the other ranks in the communicator are notified lazily. Since notification needs to be on a per-communicator basis for layered library support, we propose two mechanisms: notification using a library-initiated message delivered to each rank in the communicator, or a callback mechanism. In the first case, notification occurs if the application has posted a receive with the pre-defined tag MPI NOTIFY COMMUNICATOR CHANGE. The return buffer will include the communicator handle and rank information of the newly added ranks, and may return information on as many newly created ranks as the input buffer will hold. In the second case, a user-defined callback registered after communicator construction will be invoked, providing the ranks of the processes that left or joined the communicator. The callback will be provided with the communicator handle, the process rank, and an indication as to whether this rank joined or left the communicator. Only local work is allowed in the callback routine.

3.1.2 Performance Considerations

Special attention needs to be given to the communications performed in the context of support for such loosely coupled communicators. In particular, one must consider how this support may impact communications in the context of static communicators.

Collective operations need special attention, as each rank in a communicator may have a different view of the communicator. It would be tempting to restrict the collective operations supported; however, there is no good reason to do so. A best effort is assumed to complete these operations, with each process including the ranks it is aware of at the start of the collective operation. Since communicator composition may change as the collective operation progresses, algorithms need to be ready to handle such situations and not deadlock as a result. If the latter happens, the return code indicates this, and a user-defined callback function, called just before the collective operation returns, reports which ranks did not participate in the call.
For example, if after a broadcast is issued some of the ranks that the root attempts to reach exit the communicator, the return code will indicate this, with the callback routine listing the ranks that did not get the data. It is clear that the performance of collective operations for such communicators will not be as good as that of static MPI collective operations; however, for these loosely coupled applications the performance of the collective operations is not as important as the convenience of using library-provided collective operations.

The impact of support for dynamic communicators on the performance of point-to-point communications also requires careful consideration. Associating a given set of collective operations with a communicator is fairly common practice, and it is therefore possible to implement less efficient collective operations targeting these dynamic communicators without harming the performance of collective operations within the context of static communicators. However, the authors are unaware of implementations that do so for point-to-point and one-sided communications or MPI I/O, and as such the performance implications for the MPI library as a whole must be considered.

Two items need to be analyzed in the context of point-to-point and one-sided communications: (1) the inability to complete such communications due to the target exiting the communicator, and (2) the impact of a changing communicator size on access to internal library data structures. The first is not a performance problem, as the initiator must already handle error conditions. Local MPI completion semantics continue to be sufficient, as a target exits a communicator at its own initiative, with no further data delivery expected. The second item could have a negative performance impact on communications within static communicators, if these use control structures that are proportional to the size of the communicator. Changing the size of these data structures may change the value of pointers in use. To avoid such problems, these data structures need to be accessed atomically. It should be possible to keep the number of such data structures that need to be protected small, perhaps even as low as a single data structure, making the performance cost similar to that of providing a thread-safe MPI library. Another approach could be to set at run-time an upper limit on the increase in size of a given communicator, removing the extra cost, or to provide support for loosely coupled communicators as a compile-time option.

3.2 Tightly coupled dynamic communicators

Tightly coupled dynamic communicators are dynamic communicators for which each rank in the communicator has the same view of the communicator while it is in use. To keep such communicators consistent, the mechanisms for changing the size of the communicator are defined to be collective operations, and as such there is no need to explicitly declare such a communicator as dynamic. It becomes a dynamic communicator, of the requested size, when the following function is called:

MPI COMM RESIZE ALL(communicator handle, requested size, removed ranks, return code)

  INOUT  communicator handle   communicator (handle)
  IN     requested size        new size (int)
  IN     removed ranks         array of ranks to be removed (struct)
  OUT    return code           return error code (int)

This function may be called on any communicator but MPI COMM SELF, which is not allowed to change its size. The list of ranks to be removed is relevant only when the group is being shrunk, and even then is optional. It includes the group (local or remote) and the rank within this group of each process that is to be removed. In its absence (a NULL pointer), the highest-order ranks of the local communicator will be removed. This list must be consistent across all ranks in the communicator. When shrinking a communicator, the resulting ranks in the redefined communicator of size M will be dense, with values in the range 0 to M-1, and therefore a process's rank in this communicator should be determined after the return from this call.

Growing the size of a communicator requires specifying what happens in the newly created MPI processes that result from growing the communicator. The first consequence is that new processes are created. Unless MPI COMM WORLD is the communicator being expanded, the newly created ranks are members of a new MPI COMM WORLD, whose members are the newly created MPI processes. As in the case of the loosely coupled communicator, the newly created MPI processes obtain the communicator handle for the expanded communicator with the function

MPI COMM GET RESIZED COMM HANDLE(communicator handle)

  OUT  communicator handle   communicator (handle)

If the process was not started as the result of resizing a communicator, the handle MPI COMM NULL is returned. If the returned handle is not the null communicator handle, it is used as input to the routine MPI COMM RESIZE ALL, to complete the new communicator initialization.

As mentioned before, a key goal is to ensure that resizing the communicator does not imply a loss of performance. The collective nature of the call provides a synchronization point at which an implementation can redo any optimizations done during communicator construction, thus avoiding loss of communications performance. However, this does require completing all outstanding communications before resizing the communicator, forming well-defined communication epochs.

4 Summary

This paper proposes adding support for the concept of dynamic communicators to the MPI standard. This addition is aimed at making it possible for a new class of applications - long-running, loosely coupled applications running in a dynamic environment - to use MPI libraries for their communications needs, as well as allowing HPC applications to adjust to changing system resources. The performance impact on existing MPI applications is minimal, and may be non-existent. Future work is planned to implement this new feature set.

References

[1] InfiniBand Trade Association. InfiniBand Architecture Specification, Vol. 1, Release 1.2, 2004.
[2] The MPI Forum. MPI: A Message-Passing Interface Standard, Version 2.1, 2008.
[3] John Reid. The new features of Fortran 2008. SIGPLAN Fortran Forum, 27(2):8-21, 2008.
[4] V. S. Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience, 2:315-339, 1990.
[5] Douglas Thain, Todd Tannenbaum, and Miron Livny. Distributed computing in practice: the Condor experience. Concurrency - Practice and Experience, 17(2-4):323-356, 2005.
[6] UPC Consortium. UPC Language Specifications, v1.2, 2005.
