MPI-IO Implementation Strategies for the Cenju-3

Maciej Brodowicz
Darren Sanders
Olin Johnson
{maciek, sanders,
[email protected]
High Performance Computing Center, The University of Houston, Houston, TX. November 1996.
Abstract. The lack of a portable parallel I/O interface limits the development of scientific applications. MPI-IO is the first widespread attempt to alleviate this problem. Its efficient implementation requires the developer to face and solve several software design and interface issues. Our paper outlines strategies which may be helpful in this task. Although originally targeted at the NEC Cenju-3, our considerations are applicable to other message-passing platforms as well.
1 Introduction.

An initiative led by NASA Ames and the IBM Watson Research Center has resulted in the creation of MPI-IO, which defines a portable interface for parallel I/O. Currently, MPI-IO is officially incorporated into MPI-2 [4], an ambitious extension of the original MPI. This paper summarizes our experiences with the development of an MPI-IO system for the NEC Cenju-3 supercomputer. The Cenju-3 features a multi-level switch, NORMA architecture.
Research supported by a grant from NEC Corporation.
The Cenju/DE operating system is a proto-Unix system based on the Mach 3.0 microkernel. A version of MPI-1 [3], MPI/DE, is presently available. We believe that our observations are general enough to be applicable to other distributed memory parallel computers. No parallel file system is required, although there must exist a number of nodes equipped with disks and methods to access them locally (e.g. via Unix calls). For this purpose, we have ported file support from BSD Unix. Here we present the structure and current functionality of our system, beginning with the motivational issues and decisions behind the resulting configuration. We then describe the chosen implementation issues in more detail.
2 System description.

2.1 Structure.
[Figure 1: MPI-IO system. The figure shows each processing element (PE 0 through PE n-1) running a UNIX server and an MPIO server, with application processes (app1, app2) grouped into communicators 1-3 and linked through the MPI and MPI-IO libraries.]
The MPI-IO implementation reported here consists of two major components: a library linked to the application code, and servers which handle lower level file accesses (see figure 1). This configuration may be viewed as a compromise between two opposite concepts: encapsulating the entire MPI-IO functionality in a library, and placing all the services in external servers. Our approach attempts to find a "golden mean" for the distribution of functionality among the parts of the system. The reasons for deploying separate servers in addition to libraries are as follows:
Disk accessibility. A fundamental part of our strategy is to build a parallel file system on top of BSD Unix file servers which perform independent local file access on each node. These servers must be augmented with "data relaying" routines so that an application is not restricted to run on the same set of nodes across which its files were originally striped. This leaves the user free to define the number of spawned processes as well as the assignment of tasks to nodes. Nodal MPI-IO servers eliminate this difficulty, as the data may now flow freely from any disk to any remote requestor.
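The address arithmetic behind such relaying can be sketched as follows. This is a simplified illustration of round-robin striping, not the actual Cenju-3 code; the block size, function name, and tuple layout are our own assumptions.

```python
# Map a global offset in a striped file to the node holding it and the
# offset within that node's local file fragment (round-robin striping).
# All names and the 4 KB striping unit are illustrative assumptions.

BLOCK_SIZE = 4096

def locate(global_offset, num_nodes, block_size=BLOCK_SIZE):
    """Return (node, local_offset) for the byte at global_offset."""
    block = global_offset // block_size           # global block index
    node = block % num_nodes                      # round-robin placement
    local_block = block // num_nodes              # index among that node's blocks
    local_offset = local_block * block_size + global_offset % block_size
    return node, local_offset
```

Given this mapping, a request arriving at any node's MPI-IO server can be relayed to whichever node actually stores the block, regardless of where the application processes run.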
File caching. Efficient caching is essential to achieving good file system performance. To reduce data access time, it might seem reasonable to cache file blocks within the client application. This, however, is a potentially wasteful solution: since MPI-IO process groups operate independently, several applications may hold replicated blocks in their client caches, which can lead to physical memory exhaustion. Client caching also requires a more complicated mechanism to maintain consistency with other system entities. Our implementation enforces server level caching, so that simpler "horizontal" coherency methods may be deployed. The cost is an additional RPC to the MPI-IO server, but since it is limited to the local node, the advantage of using optimal local code paths may be utilized.
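The idea of one server-level cache shared by all client process groups on a node can be sketched as below. This is a minimal illustration with an assumed LRU eviction policy; the class name and interface are ours, not the system's.

```python
from collections import OrderedDict

# Sketch of a per-node, server-level block cache shared by all client
# process groups, so each file block is held at most once per node.
# Capacity, naming, and the LRU policy are illustrative assumptions.

class ServerBlockCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()        # (file_id, block_no) -> data

    def get(self, file_id, block_no, read_from_disk):
        key = (file_id, block_no)
        if key in self.blocks:
            self.blocks.move_to_end(key)   # mark most recently used
            return self.blocks[key]
        data = read_from_disk(file_id, block_no)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return data
```

Because every process group on the node goes through the same cache, no block is replicated per application, and coherency only needs to be maintained "horizontally" between the per-node servers.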
Global tracking of file accesses. The MPI-IO standard doesn't explicitly specify the outcome of some operations. Perhaps the most dangerous is deleting a named file (or closing a file with the DELETE_ON_CLOSE flag set) while another application is still accessing it. The MPI-IO servers can resolve such cases by postponing the actual deletion until the global reference count drops to zero. Another benefit of global file management is reduced resource consumption. An example for Unix server-based implementations is the number of allocated file descriptors. Note that in order to assure no interference between concurrent accesses to data in the same file, one needs different file descriptors. (Data transfers are always translated into lseek followed by either read or write.) Without global coordination, one would have to allocate file descriptors separately for each process group, potentially causing Unix descriptor table overflow.

With this justification of MPI-IO servers, one might infer that the entire functionality should reside in the servers. However, much of the code needs to be in a client library for the following reasons:
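The deferred-deletion scheme described above can be sketched with a simple reference count. All class and method names below are our own illustration of the bookkeeping, not the servers' actual code.

```python
# Sketch of global file tracking in the MPI-IO servers: deleting a file
# that is still open elsewhere is deferred until its global reference
# count drops to zero.  Names are illustrative assumptions.

class GlobalFileTable:
    def __init__(self):
        self.refs = {}              # file name -> number of opens
        self.pending_delete = set() # deletions postponed while in use
        self.deleted = []           # names actually removed from disk

    def open(self, name):
        self.refs[name] = self.refs.get(name, 0) + 1

    def delete(self, name):
        if self.refs.get(name, 0) > 0:
            self.pending_delete.add(name)   # postpone: still in use
        else:
            self.deleted.append(name)

    def close(self, name):
        self.refs[name] -= 1
        if self.refs[name] == 0 and name in self.pending_delete:
            self.pending_delete.discard(name)
            self.deleted.append(name)       # deferred deletion happens now
```

With this bookkeeping, a DELETE_ON_CLOSE by one application cannot pull the file out from under another group that still holds it open.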
Server load minimization. The MPI-IO servers constitute a shared, centralized resource for the applications running on each node. The response time may grow significantly if all MPI-IO requests require access to the server. Often, this can be avoided: accessing parameters associated with a file handle (datatypes, system maintained pointers, etc.), synchronization within communicators defined inside the application, and datatype processing are all performed faster by a client library.
Traffic reduction to the MPI-IO server. Ideally, communication with the server would be restricted to data transfers and other direct disk accesses (file opening and syncing). The library can be beneficial by processing some requests internally or merging them (e.g. by piggybacking prefetch hints on regular data transfer requests).

2.2 Functionality.
A second fundamental strategy has been to build MPI-IO on top of, and separate from, an existing MPI where possible. Since MPI-IO was incorporated into the MPI-2 standard, the user interface specification changes very frequently (during every MPI Forum meeting), making library and application development a tedious task. In this situation, we decided to base our interface on the latest pre-MPI-2 specification release [2]. Currently, our library supports (INIT(1), FINALIZE), (OPEN, CLOSE, DELETE, FILE_CONTROL, FILE_SYNC), (SEEK, SEEK_SHARED) and the full set of data transfer functions. The latter involves all variations of reads and writes with explicit, shared and individual offsets, plus their non-blocking counterparts. The asynchronous calls require testing for completion, which can be achieved using equivalents of MPI request handles (TEST, REQUEST_FREE and WAIT).
[Figure 2: MPI-IO library. The figure shows the application's threads entering the MPI-IO user interface, which relies on a datatype engine, descriptor tables, a thread pool, and per-communicator token servers; client stubs connect to the local MPI-IO server, with the MPI library carrying other message traffic.]

The internals of the MPI-IO library are depicted in figure 2. The MPI-IO library is thread safe and hence capable of handling concurrently issued requests from multiple threads of control (symbolized in the figure by wavy lines). Since file handles are always associated with certain communicators, indicated by the user in OPEN, the library tracks communicator information via a list. For internal message passing a duplicate communicator is generated (by MPI_COMM_DUP), assuring that user messages (passed inside the application) are not intercepted by the library and vice versa. On inquiry, the library returns the original communicator associated with the file handle, as the user might expect MPI_IDENT as the result of a further communicator comparison rather than MPI_CONGRUENT. Each open file is assigned a slot in a local descriptor table associated with the corresponding communicator. The descriptor is unique and identical for all nodes in the communicator, thus simplifying destination encoding in requests to remote libraries. The library maintains individual file pointers on each node, whereas the value of a shared pointer is stored on node zero of each communicator group. Updates are propagated via MPI messages. For that purpose, as well as for atomicity control and synchronization of collective calls, the system spawns additional "background" threads on the node of rank zero in each communicator. Finally, the application communicates with the MPI-IO servers through client stubs. Due to a limitation of MPI/DE(2) (not uncommon in supercomputers equipped with high-performance interconnection networks), a maximum of one MPI process per node is allowed; thus, inter-server and application-server communication has been based on Mach messages.

(1) The MPIO prefix was removed for brevity.
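The essential operation behind a shared file pointer kept on node zero can be sketched as an atomic fetch-and-add: each requester obtains the current offset and advances the pointer in one step, so concurrent accesses receive disjoint file regions. The class below is our illustration; the real system performs this update via MPI messages handled by the background thread on rank zero.

```python
import threading

# Sketch of the shared file pointer stored on rank 0 of a communicator
# group.  Each data transfer using the shared pointer performs an atomic
# fetch-and-add; the lock stands in for the serialization that the
# rank-0 background thread provides in the actual system.

class SharedPointer:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def fetch_and_add(self, nbytes):
        """Return the current offset and advance the pointer by nbytes."""
        with self.lock:
            offset = self.value
            self.value += nbytes
            return offset
```

Serializing updates at a single node keeps the shared pointer consistent without any distributed locking among the group members.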
3 Implementation issues.

3.1 Layering.
It is expedient, efficient, and a matter of common practice to develop new systems by importing functionality from existing ones. Thus our system uses BSD Unix file facilities, Mach system calls and the existing MPI library. This approach has its problems. With regard to MPI, no MPI-2 implementations are yet available; currently the only choice is to layer the MPI-IO implementation on top of MPI-1. Generalized requests are introduced in MPI-2, yet code which handles requests native to MPI-1 has to be effectively duplicated to support MPI-IO specific requests. This leads directly to interface naming conflicts if one wants to use MPI-2 compliant identifiers. Since datatype objects are opaque, the need to use non-standard typemap-extracting functions arises. (Our methods of handling datatypes are delineated in the next section.) The introduction of MPI-2 libraries will, of course, make this additional code obsolete. Even with MPI-2 present, there are other difficult layering issues. An example is the initialization and termination of the MPI-IO environment, accomplished in earlier standard releases by MPIO_INIT and MPIO_FINALIZE. MPI-2 doesn't provide any functionality to address this problem at the time the MPI environment is initialized. Furthermore, MPIO_INIT and MPIO_FINALIZE were removed from the MPI-IO part of the MPI-2 standard.

(2) MPI/DE is currently the only MPI-1 compliant message passing library for the Cenju-3.

3.2 Datatypes.
There are two primary issues involved in implementing file I/O that relies on MPI datatypes: datatype processing and file access strategies. Processing and mapping of the MPI datatypes used to define a file access is necessary in order to correctly identify the data boundaries within a datatype. The original MPI standard did not provide a method for accessing the internal structure of an MPI datatype, therefore we defined two functions which perform this task. These functions, MPI_Type_first and MPI_Type_next, essentially traverse the "tree" representing an MPI datatype, returning offset and type information about the nodes within the tree. The second and more important issue is file access strategies. Currently, we have identified four primary access strategies for performing file I/O. The simplest case is one in which the etype's density is 100%, the filetype's density is 100%, and the buftype is equal to the filetype. This case allows multiple etypes to be read or written in one access.
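The kind of traversal performed by MPI_Type_first and MPI_Type_next can be sketched as a walk over the datatype tree that yields (offset, basic type) pairs in typemap order. The tuple encoding of a datatype below is our own stand-in for the opaque MPI representation, not the actual data structure.

```python
# Sketch of typemap extraction from an MPI-like derived datatype.
# A datatype is modeled as either a basic type name (str) or a tuple
# ('struct', [(displacement, child), ...]).  This encoding and the
# function name are illustrative assumptions.

def typemap(dtype, base=0):
    """Yield (offset, basic_type) for each leaf of a datatype tree."""
    if isinstance(dtype, str):          # leaf: a basic type
        yield (base, dtype)
    else:
        _, members = dtype
        for disp, child in members:     # recurse with accumulated offset
            yield from typemap(child, base + disp)

# Example: a struct of an int at displacement 0 and, at displacement 8,
# a nested struct of two doubles.
vec = ('struct', [(0, 'int'),
                  (8, ('struct', [(0, 'double'), (8, 'double')]))])
```

The flattened (offset, type) sequence is exactly what is needed to find data boundaries, and gaps in the offsets reveal the holes discussed below.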
Secondly, reading/writing on the etype level involves sparse files with 100% etype density. In this situation each etype can be read or written in one file access. The third possible strategy, "data farming", occurs when the etype density and/or the filetype density is below 100%, but sufficiently high to allow one or more etypes to be read in a single access. This type of access is applicable to reads only and requires that the data be farmed from the data buffer after reading. Finally, the fourth strategy requires that the individual elements of each etype be read or written separately or, at a minimum, in small groups. Clearly, the existence of holes (regions for which no data is defined) in datatypes can impact I/O performance. The use of MPI datatypes for defining file access exacerbates this problem because of the possible existence of "implicit holes" within these datatypes. Implicit holes are holes which were not intended by the user and are transparent to the user. Such holes are most commonly products of hardware defined alignment requirements. This issue also has an impact on correctness across heterogeneous computer architectures and may affect the performance of caching algorithms (e.g. if the Write-Full [5] replacement policy is used).

3.3 Collective calls.
A primary objective for any MPI-IO system is high performance. This objective suggests the use of collective calls, where groups of operations issued from different nodes may be optimized globally. An important implementation question is when the collective calls should involve explicit synchronization. For this reason the collective calls in our implementation exhibit non-uniform semantics. For operations having a long-term effect on the environment (like OPEN or CLOSE), we deploy strong synchronization: the call may return MPI_SUCCESS only if the operation succeeds on all nodes. On the other hand, excessive synchronization degrades performance. Consider the READ_ALL function. There are no constraints requiring a data transfer to be postponed if one node invokes the collective operation ahead of the others. Moreover, this event can prepare the server for the fact that there will be more calls emerging from other (and one can tell exactly which) nodes, so it can e.g. prefetch file data. Collective calls involve various degrees of hardware and software support when transferring data between disks and requesting nodes. Architectures with a fast multicast facility (like the Cenju-3) can take advantage of this feature. The software may enhance data distribution efficiency by using optimized two-phase access [8] or other strategies used with success in some parallel I/O libraries [1, 7]. One cannot neglect MPI itself, as calls like MPI_ALLTOALL can be tailored to perform data transfer directly from I/O caches to user buffers on the requesting nodes. Unfortunately, due to the limitation of a single MPI task per PE, this option is presently unavailable in our implementation.

3.4 Non-blocking calls.
The disparity between disk access time and CPU speed dictates that efficient file systems cannot rely on single-threaded file servers. There are two techniques to assure concurrent processing of client requests: multithreading and interrupt-driven messages. While the latter allocates resources "on demand", it usually incurs higher context switching costs (interrupts). Some message passing libraries directly support interrupt-driven receives (NXLib [6]). Our implementation uses multiple kernel threads (Mach), managed by the C threads library. In this approach the cost of thread switches is much reduced, as C threads are multiplexed on top of kernel threads. Hence, if the execution of some C threads is mutually exclusive due to mutexes, the library may choose to implement all of them using the same kernel thread. In our system, thread pools are managed by special objects, thread servers. A thread server may be initialized in one of three configurations: with a fixed number of threads in the pool, with temporary spawned threads, or with an increasing pool of permanent threads. The first kind is used in the server stub listening for incoming requests from client applications. It enforces a fixed limit on concurrently processed requests, thus protecting the MPI-IO servers from overloading. The second kind of thread server supports inter-server communications. During increased loads, it may be beneficial to have additional threads listening to requests sent from other servers, as processing these incoming requests may be shorter than that of other threads (e.g. those waiting for disk access completion). The temporary thread is terminated when the request is satisfied. Finally, the increasing thread pool is used to implement non-blocking calls in the MPI-IO library. Since it is not known in advance how many concurrent non-blocking requests will be issued by the application, the thread count is initially zero. Spawning additional threads creates associated overhead, therefore it is not a good idea to use temporary threads for non-blocking requests: there is always some chance that another request will be issued after the previous one has completed. Note that this approach allows for adaptive adjustment of the system to the maximal number of non-blocking requests requiring concurrent service.
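The third configuration, the increasing pool of permanent threads, can be sketched as follows. This is a simplified illustration using Python threads in place of Mach C threads; the class name and interface are our assumptions.

```python
import queue
import threading

# Sketch of an "increasing pool of permanent threads" for non-blocking
# requests: the pool starts empty and spawns a new permanent worker only
# when a request arrives and no worker is idle, so it grows to match the
# peak number of concurrent non-blocking requests and no further.

class GrowingThreadPool:
    def __init__(self):
        self.tasks = queue.Queue()
        self.idle = 0                 # workers currently free
        self.threads = []             # permanent workers spawned so far
        self.lock = threading.Lock()

    def _worker(self):
        while True:
            task = self.tasks.get()
            with self.lock:
                self.idle -= 1        # busy while running the task
            task()
            with self.lock:
                self.idle += 1        # permanent: return to the pool

    def submit(self, task):
        with self.lock:
            if self.idle == 0:        # no free worker: grow the pool
                t = threading.Thread(target=self._worker, daemon=True)
                t.start()
                self.threads.append(t)
                self.idle += 1
        self.tasks.put(task)
```

Unlike temporary threads, workers are never torn down, so a burst of non-blocking requests pays the spawning overhead only once.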
4 Conclusion.

We have discussed issues involved in implementing MPI-IO for distributed memory computers. As shown, the creation of efficient support for MPI-IO in the absence of an underlying parallel file system is not straightforward, mainly due to the immaturity of the MPI standard and its peculiarities. Some issues (like datatype mapping) require more sophisticated techniques if performance is the primary objective. Others need to be resolved at the standard definition level, which hopefully will happen in the near future.
References

[1] R. Bennett, K. Bryant, A. Sussman, R. Das, and J. Saltz. Jovian: a framework for optimizing parallel I/O. Technical report, University of Maryland, 1995.

[2] The MPI-IO Committee. MPI-IO: a parallel file I/O interface for MPI, April 1996. Version 0.5, also available from http://lovelace.nas.nasa.gov/MPI-IO/mpi-io-report.0.5.ps.
[3] Message Passing Interface Forum. MPI: A message-passing interface standard, June 1995. Version 1.1, also available from ftp://ftp.mcs.anl.gov/pub/mpi/mpi-1.jun95/mpi-report.ps.

[4] Message Passing Interface Forum. MPI-2: Extensions to the message-passing interface, October 1996. Draft.

[5] David Kotz and Carla Schlatter Ellis. Caching and writeback policies in parallel file systems. Journal of Parallel and Distributed Computing, 17(1-2):140-145, January and February 1993.

[6] Stefan Lamberts, Georg Stellner, and Thomas Ludwig. NXLib User's Guide. Technische Universität München, February 1995. Also available from http://wwwbode.informatik.tu-muenchen.de/lamberts/NXLib.

[7] K. E. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-directed collective I/O in Panda. In Proceedings of Supercomputing '95, December 1995.

[8] Rajeev Thakur and Alok Choudhary. An extended two-phase method for accessing sections of out-of-core arrays. Technical Report CACR-103, Scalable I/O Initiative, Center for Advanced Computing Research, Caltech, June 1995.