Achieving High Performance with MPI-IO

Rajeev Thakur, William Gropp, and Ewing Lusk
Mathematics and Computer Science Division
Argonne National Laboratory
Argonne, IL 60439, USA
{thakur, gropp, lusk}@mcs.anl.gov

Preprint ANL/MCS-P742-0299, February 1999

Abstract

The I/O access patterns of many parallel applications consist of accesses to a large number of small, noncontiguous pieces of data. If an application's I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access noncontiguous data with a single I/O function call, unlike in Unix I/O. In this paper, we explain how critical this feature of MPI-IO is for high performance and how it enables implementations to perform optimizations. An application can be written in many different ways with MPI-IO. We classify the different ways of expressing an application's I/O access pattern in MPI-IO into four levels, called level 0 through level 3. We demonstrate that, for applications with noncontiguous access patterns, the I/O performance improves significantly if users write the application such that it makes level-3 MPI-IO requests (noncontiguous, collective) rather than level-0 requests (Unix style). We describe how our MPI-IO implementation, ROMIO, delivers high performance for noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications, an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC), on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.

Preliminary versions of portions of this work were published in two conference papers [36, 37].

1 Introduction

I/O is a major bottleneck in many parallel applications. Although the I/O subsystems of parallel machines may be designed for high performance, a large number of applications achieve only about a tenth or less of the peak I/O bandwidth. The main reason for poor application-level I/O performance is that I/O systems are optimized for large accesses (on the order of megabytes), whereas parallel applications typically make many small I/O requests (on the order of kilobytes or even less). The small I/O requests made by parallel programs are a result of the combination of two factors:

1. In many parallel applications, each process needs to access a large number of relatively small pieces of data that are not contiguously located in the file [1, 6, 20, 29, 30, 35].

2. Most parallel file systems have a Unix-like API (application programming interface) that allows a user to access only a single, contiguous chunk of data at a time from a file.1 Noncontiguous data sets must therefore be accessed by making separate function calls to access each individual contiguous piece.

With such an interface, the file system cannot easily detect the overall access pattern of one process individually or that of a group of processes collectively. Consequently, the file system is constrained in the optimizations it can perform. Many parallel file systems also provide their own extensions to or variations of the traditional Unix interface, but these variations make programs nonportable.

To overcome the performance and portability limitations of existing parallel-I/O interfaces, the MPI Forum (made up of parallel-computer vendors, researchers, and applications scientists) defined a new interface for parallel I/O as part of the MPI-2 standard [18]. This interface is commonly referred to as MPI-IO. MPI-IO is a rich interface with many features designed specifically for performance and portability. Multiple implementations of MPI-IO, both portable and machine specific, are available [9, 14, 24, 26, 39].

To avoid the above-mentioned problem of many distinct, small I/O requests, MPI-IO allows users to specify the entire noncontiguous access pattern and read or write all the data with a single I/O function call. MPI-IO also allows users to specify collectively the I/O requests of a group of processes, thereby providing the implementation with even greater access information and greater scope for optimization.

A simple way to port a Unix I/O program to MPI-IO is to replace all Unix I/O functions with their MPI-IO equivalents. For applications with noncontiguous access patterns, however, such a port is unlikely to improve performance. In this paper, we demonstrate that to get real performance benefits with MPI-IO, users must use some of MPI-IO's advanced features, particularly noncontiguous accesses and collective I/O.

1 Unix does have the functions readv and writev, but they allow noncontiguity only in memory and not in the file. POSIX has a function, lio_listio, that allows the user to specify a list of requests at a time. However, the requests in the list can be a mixture of reads and writes, and the POSIX standard says that each of the requests will be submitted as a separate nonblocking request [13]. Therefore, POSIX implementations cannot optimize I/O for the entire list of requests, for example, by performing data sieving as described in Section 5. Furthermore, since the lio_listio interface is not collective, implementations cannot perform collective I/O.


An application can be written in many different ways with MPI-IO. We classify the different ways of expressing an application's I/O access pattern in MPI-IO into four levels, called level 0 through level 3. We explain why, for high performance, users must write their application programs to make level-3 MPI-IO requests (noncontiguous, collective) rather than level-0 requests (Unix style). We describe how ROMIO, our portable implementation of MPI-IO, delivers high performance when the user makes noncontiguous, collective MPI-IO requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000. The three applications we used are the following:

1. DIST3D, a template representing the I/O access pattern in an astrophysics application, ASTRO3D [35], from the University of Chicago;

2. the NAS BTIO benchmark [9]; and

3. an unstructured code (which we call UNSTRUC) written by Larry Schoof and Wilbur Johnson of Sandia National Laboratories.

1.1 Contributions of this Paper

This paper demonstrates how users can achieve high parallel-I/O performance by using a good API and by using that API the right way. The same API, if not used the right way, can result in very poor performance. The paper also explains why performance improves when users use MPI-IO the right way: the MPI-IO implementation can then perform optimizations, such as data sieving and collective I/O. Although these optimizations have been proposed earlier [7, 15, 28, 33], this is the only paper that discusses in detail the practical issues involved in implementing these optimizations, in the context of a standard, portable API, on real state-of-the-art parallel machines and file systems. We also present performance results that confirm that these optimizations indeed work well in practice.

1.2 Organization

The rest of this paper is organized as follows. In Section 2, we give a brief overview of MPI-IO. Section 3 describes our portable implementation of MPI-IO, called ROMIO. In Section 4, we provide a classification of the ways in which an application's I/O access pattern can be expressed in MPI-IO. Sections 5 and 6 describe the two optimizations, data sieving and collective I/O, in detail. Performance results are presented in Section 7, followed by conclusions in Section 8.

2 Overview of MPI-IO

In this section we describe the evolution of MPI-IO, the main features of MPI-IO, and the use of MPI's derived datatypes to specify noncontiguous accesses in MPI-IO.

2.1 Background

MPI-IO originated in an effort begun in 1994 at IBM Watson Research Center to investigate the impact of the (then) new MPI message-passing standard on parallel I/O. A group at IBM wrote an important paper [25] that explores the analogy between MPI message passing and I/O. Roughly speaking, one can consider reads and writes to a file system as receives and sends. This paper was the starting point of MPI-IO in that it was the first attempt to exploit this analogy by applying the (then relatively new) MPI concepts for message passing to the realm of parallel I/O. The idea of using message-passing concepts in an I/O library appeared successful, and the effort was expanded into a collaboration with parallel-I/O researchers from NASA Ames Research Center. The resulting specification appeared in [4]. At this point a large email discussion group was formed, with participation from a wide variety of institutions. This group, calling itself the MPI-IO Committee, pushed the idea further in a series of proposals, culminating in [40].

During this time, the MPI Forum had resumed meeting. Its purpose was to address a number of topics that had been deliberately left out of the original MPI Standard, including parallel I/O. The MPI Forum initially recognized that both the MPI-IO Committee and the Scalable I/O Initiative [23] represented efforts to develop a standard parallel-I/O interface and therefore decided not to address I/O in its deliberations. In the long run, however, the three threads of development (by the MPI-IO Committee, the Scalable I/O Initiative, and the MPI Forum) merged because of a number of considerations:

- The Scalable I/O Initiative, originally conceived independently of MPI, came to realize that any parallel-I/O interface would need the following:
  - collective operations,
  - nonblocking operations,
  - a way of describing noncontiguous data, both in memory and in a file, and
  - a mechanism for passing hints to the implementation.
  All these concepts are present in MPI, where considerable effort had already been expended in defining and implementing them. This realization made an MPI-based approach attractive.

- The MPI-IO Committee, originally conceived independently of the MPI Forum, decided that its impact would be greater if its specification became part of the MPI-2 standard; hence, the committee petitioned to become part of the MPI-2 effort.

- The MPI Forum, originally intending to leave I/O out of its deliberations on MPI-2, realized that the MPI-IO Committee had evolved from a narrow collaboration into an open discussion group with considerable expertise. It then voted to incorporate the MPI-IO design group as an MPI Forum subcommittee. This expansion of the MPI Forum membership benefited the rest of the MPI-2 design as well.

The result was that, from the summer of 1996, the MPI-IO design activities took place in the context of the MPI Forum meetings. The MPI Forum used the latest version of the existing MPI-IO specification [40] as a starting point for the I/O chapter in MPI-2. The I/O chapter evolved over many meetings of the Forum and was released in its final form along with the rest of MPI-2 in July 1997 [18]. MPI-IO now refers to this I/O chapter in MPI-2.

2.2 Main Features of MPI-IO

MPI-IO is a rich interface with many features specifically intended for portable, high-performance parallel I/O. The basic I/O functions in MPI-IO provide functionality equivalent to the Unix functions open, close, read, write, and lseek. MPI-IO has many other features in addition. We briefly describe some of these features.

MPI-IO supports three kinds of basic data-access functions: using an explicit offset, an individual file pointer, or a shared file pointer. The explicit-offset functions take as an argument the offset in the file from which the read/write should begin. The individual-file-pointer functions read/write data from the current location of a file pointer that is local to each process. The shared-file-pointer functions read/write data from the location specified by a common file pointer shared by the group of processes that together opened the file. In all these functions, users can specify a noncontiguous data layout in memory and file. Both blocking and nonblocking versions of these functions exist. MPI-IO also has collective versions of these functions, which must be called by all processes that together opened the file. The collective functions enable an implementation to perform collective I/O. A restricted form of nonblocking collective I/O, called split collective I/O, is supported for accesses with explicit offsets and individual file pointers, but not with shared file pointers.

A unique feature of MPI-IO is that it supports multiple data-storage representations: native, internal, external32, and also user-defined representations. native means that data is stored in the file as it is in memory; no data conversion is performed. internal is an implementation-defined data representation that may provide some (implementation-defined) degree of file portability. external32 is a specific, portable data representation defined in MPI-IO. A file written in external32 format on one machine is guaranteed to be readable on any machine with any MPI-IO implementation. MPI-IO also includes a mechanism for users to define a new data representation by providing data-conversion functions, which MPI-IO uses to convert data from file format to memory format and vice versa.

MPI-IO provides a mechanism, called info, that enables users to provide hints to the implementation in a portable manner. Examples of hints include parameters for file striping, prefetching/caching information, and access-pattern information. Hints do not affect the semantics of a program, but they may enable the MPI-IO implementation or underlying file system to improve performance or minimize the use of system resources [3, 22]. The implementation, however, is free to ignore all hints.
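As a small illustration of these access styles and the info mechanism, the following C sketch has each process write its block of a shared file first with an explicit offset and then through its individual file pointer. The file name, block size, and hint value are invented for the example; striping_factor is one of the hints reserved by the standard, and the implementation may ignore it.

    #include <mpi.h>

    /* Each process writes `count` doubles from `buf` into its own block of a
       shared file, first with an explicit offset and then (redundantly, just
       to show the call sequence) through its individual file pointer.
       MPI_File_write_at_all and MPI_File_write_all are the collective
       counterparts of the two write calls used here. */
    void write_block(MPI_Comm comm, char *filename, double *buf, int count)
    {
        int rank;
        MPI_File fh;
        MPI_Info info;
        MPI_Offset offset;

        MPI_Comm_rank(comm, &rank);

        /* Hint: request a striping factor of 16; the implementation is free
           to ignore it. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "16");

        MPI_File_open(comm, filename, MPI_MODE_CREATE | MPI_MODE_RDWR,
                      info, &fh);

        offset = (MPI_Offset)rank * count * sizeof(double);

        /* Explicit offset: no file pointer is consulted or updated. */
        MPI_File_write_at(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

        /* Individual file pointer: seek, then write from the current position. */
        MPI_File_seek(fh, offset, MPI_SEEK_SET);
        MPI_File_write(fh, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
    }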

MPI-IO also has a set of rigorously defined consistency and atomicity semantics that specify the results of concurrent file accesses. For details of all these features, we refer readers to [11, 18]. We elaborate further on only one feature, the ability to access noncontiguous data with a single I/O function by using MPI's derived datatypes, because it is critical for high performance in parallel applications.

2.3 Noncontiguous Accesses in MPI-IO

In MPI-1, the amount of data a function sends or receives is specified in terms of instances of a datatype [17]. Datatypes in MPI are of two kinds: basic and derived. Basic datatypes are those that correspond to the basic datatypes in the host programming language: integers, floating-point numbers, and so forth. In addition, MPI provides datatype-constructor functions to create derived datatypes consisting of multiple basic datatypes located either contiguously or noncontiguously. The different kinds of datatype constructors in MPI are as follows:

- contiguous: Creates a new datatype consisting of contiguous copies of an existing datatype.

- vector/hvector: Creates a new datatype consisting of equally spaced copies of an existing datatype.

- indexed/hindexed/indexed_block: Allows replication of a datatype into a sequence of blocks, each containing multiple copies of an existing datatype; the blocks may be unequally spaced.

- struct: The most general datatype constructor, which allows each block to consist of replications of different datatypes. Any noncontiguous layout can be expressed with struct.

- subarray: Creates a datatype that corresponds to a subarray of a multidimensional array.

- darray: Creates a datatype that describes a process's local array obtained from a regular distribution of a multidimensional global array.

Of these datatype constructors, indexed_block, subarray, and darray were added in MPI-2; the others were defined in MPI-1. The datatype created by a datatype constructor can be used as an input datatype to another datatype constructor. Any noncontiguous data layout can therefore be represented in terms of a derived datatype.

MPI-IO uses MPI datatypes for two purposes: to describe the data layout in the user's buffer in memory and to define the data layout in the file. The data layout in memory is specified by the datatype argument in each read/write function in MPI-IO. The data layout in the file is defined by the file view. When the file is first opened, the default file view is the entire file; that is, the entire file is visible to the process, and data will be read/written contiguously starting from the location specified by the read/write function. A process can change its file view at any time by using the function MPI_File_set_view, which takes as an argument an MPI datatype, called the filetype. From then on, data will be read/written only to those parts of the file specified by the filetype; any "holes" will be skipped. The file view and the data layout in memory can be defined by using any MPI basic or derived datatype; therefore, any general, noncontiguous access pattern can be compactly represented.
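As an illustration of file views, the following sketch builds a strided filetype with MPI_Type_vector and installs it with MPI_File_set_view. The element type, block size, and block count are invented for the example.

    #include <mpi.h>

    #define BLOCK   100     /* contiguous elements per piece (illustrative) */
    #define NBLOCKS 1000    /* number of pieces seen by each process        */

    /* Give each process a view in which it owns one block of BLOCK ints out
       of every nprocs*BLOCK ints in the file; the holes in between are
       skipped by all subsequent reads and writes on fh. */
    void set_strided_view(MPI_File fh, int rank, int nprocs)
    {
        MPI_Datatype filetype;

        /* NBLOCKS blocks of BLOCK ints each, spaced nprocs*BLOCK ints apart. */
        MPI_Type_vector(NBLOCKS, BLOCK, BLOCK * nprocs, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        /* The displacement shifts each process to the start of its first
           block; the etype is MPI_INT and the data representation "native". */
        MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK * sizeof(int),
                          MPI_INT, filetype, "native", MPI_INFO_NULL);

        /* The established view is not affected by freeing the handle. */
        MPI_Type_free(&filetype);
    }

After this call, an ordinary MPI_File_read or a collective MPI_File_read_all on fh touches only the blocks belonging to this process and skips the holes in between.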

3 ROMIO: A Portable Implementation of MPI-IO

We have developed a high-performance, portable implementation of MPI-IO, called ROMIO [39]. It is freely available from the Web site http://www.mcs.anl.gov/romio. ROMIO is designed to run on multiple machines and file systems. The current version runs on the following machines: IBM SP; Intel Paragon; Cray T3E; HP Exemplar; SGI Origin2000; NEC SX-4; other symmetric multiprocessors from HP, SGI, Sun, DEC, and IBM; and networks of workstations (Sun, SGI, HP, IBM, DEC, Linux, and FreeBSD). Supported file systems are IBM PIOFS, Intel PFS, HP HFS, SGI XFS, NEC SFS, NFS, and any Unix file system (UFS). All functions defined in the MPI-2 I/O chapter except support for file interoperability, I/O error handling, and I/O error classes have been implemented in ROMIO. (The missing functions will be implemented in a future release.)

A key component of ROMIO that enables such a portable MPI-IO implementation is an internal layer called ADIO [34]. ADIO, an abstract-device interface for I/O, is a mechanism for implementing multiple parallel-I/O APIs (application programming interfaces) portably on multiple file systems. We developed ADIO before MPI-IO became a standard, as a means to implement and experiment with various parallel-I/O APIs that existed at the time. ADIO consists of a small set of basic functions for parallel I/O. Any parallel-I/O API can be implemented portably on top of ADIO, and ADIO itself must be implemented separately on each different file system. ADIO thus separates the machine-dependent and machine-independent aspects involved in implementing an API. We used ADIO to implement Intel's PFS API and subsets of IBM's PIOFS and the original MPI-IO proposal [40] on multiple file systems. By following such an approach, we achieved portability with very low overhead [34]. Now that MPI-IO has emerged as the standard, we use ADIO as a mechanism for implementing MPI-IO, as illustrated in Figure 1. We can easily implement MPI-IO on other file systems (such as [5, 10, 12, 16, 19]) simply by implementing ADIO on those file systems.

The MPI-2 chapter on external interfaces defines a set of functions that provide access to some of the internal data structures of the MPI implementation. By using these functions, one can build an MPI-IO implementation that can operate with any MPI-1 implementation that also has a few of the MPI-2 external-interface functions. We use this feature of MPI-2 in ROMIO to enable ROMIO to operate with multiple MPI-1 implementations. ROMIO, at present, requires only the two datatype-decoding functions from the external-interfaces chapter: MPI_Type_get_envelope and MPI_Type_get_contents. These functions are used to decipher what an MPI derived datatype represents.

Figure 1: ROMIO Architecture: MPI-IO is implemented portably on top of an abstract-device interface called ADIO, and ADIO is optimized separately for different file systems (Unix/NFS, Intel PFS, HP HFS, IBM PIOFS, SGI XFS, NEC SFS).

ROMIO can operate with any MPI-1 implementation that also has these two functions.2 At present, those implementations are MPICH, HP MPI, and SGI MPI. (In fact, ROMIO is now included as part of these MPI implementations.) Details on how ROMIO is implemented can be found in [38].

2 For a complete implementation of MPI-IO, ROMIO would eventually need a few more functions from the MPI-2 external-interfaces chapter, namely, functions for filling in the status object, generalized requests, adding new error codes and classes, attribute caching on datatypes, and duplicating datatypes.
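A sketch of how these two decoding calls are used is shown below for a type created with MPI_Type_vector; this is only an illustration of the calls, not ROMIO's actual decoding code.

    #include <stdio.h>
    #include <mpi.h>

    /* Decode a derived-datatype handle.  For a type built with
       MPI_Type_vector, the envelope reports the MPI_COMBINER_VECTOR
       combiner and the contents return {count, blocklength, stride}. */
    void describe_type(MPI_Datatype dtype)
    {
        int ni, na, nd, combiner;

        MPI_Type_get_envelope(dtype, &ni, &na, &nd, &combiner);

        if (combiner == MPI_COMBINER_VECTOR) {
            int ints[3];            /* count, blocklength, stride */
            MPI_Aint addrs[1];      /* unused for vector (na == 0) */
            MPI_Datatype types[1];  /* the underlying datatype     */

            MPI_Type_get_contents(dtype, ni, na, nd, ints, addrs, types);
            printf("vector: count=%d, blocklength=%d, stride=%d\n",
                   ints[0], ints[1], ints[2]);
            /* types[0] is predefined (e.g., MPI_INT) for a simple vector,
               so it must not be freed; a derived component type would
               have to be freed with MPI_Type_free. */
        }
    }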

4 A Classification of I/O Request Structures

In this section we examine the different ways of writing an application with MPI-IO and how that choice impacts performance. Any application has a particular "I/O access pattern" based on its I/O needs. The same I/O access pattern, however, can be presented to the I/O system in different ways, depending on which I/O functions the application uses and how. We classify the different ways of expressing I/O access patterns in MPI-IO into four "levels," level 0 through level 3. We explain this classification with the help of an example, accessing a distributed array from a file, which is a common access pattern in parallel applications.

Consider a two-dimensional array distributed among 16 processes in a (block, block) fashion as shown in Figure 2. The array is stored in a file corresponding to the global array in row-major order, and each process needs to read its local array from the file. The data distribution among processes and the array storage order in the file are such that the file contains the first row of the local array of process 0, followed by the first row of the local array of process 1, the first row of the local array of process 2, the first row of the local array of process 3, then the second row of the local array of process 0, the second row of the local array of process 1, and so on. In other words, the local array of each process is located noncontiguously in the file.

Figure 2: Distributed-array access. The large array is distributed among 16 processes (each square represents a subarray in the memory of a single process); the lower portion of the figure shows the resulting access pattern in the file.

Figure 3 shows four ways in which a user can express this access pattern in MPI-IO. In level 0, each process does Unix-style accesses: one independent read request for each row in the local array. Level 1 is similar to level 0 except that it uses collective-I/O functions, which indicate to the implementation that all processes that together opened the file will call this function, each with its own access information. Independent-I/O functions, on the other hand, convey no information about what other processes will do. In level 2, each process creates a derived datatype to describe the noncontiguous access pattern, defines a file view, and calls independent-I/O functions. Level 3 is similar to level 2 except that it uses collective-I/O functions.

The four levels also represent increasing amounts of data per request, as illustrated in Figure 4.3 The greater the amount of data per request, the greater the opportunity for the implementation to deliver higher performance. Users must therefore strive to express their I/O requests as level 3 rather than level 0. How good the performance is at each level depends, of course, on how well the implementation takes advantage of the extra access information at each level. If an application needs to access only large, contiguous pieces of data, level 0 is equivalent to level 2, and level 1 is equivalent to level 3. Users need not create derived datatypes in such cases.

3 In this figure, levels 1 and 2 represent the same amount of data per request, but, in general, when the number of noncontiguous accesses per process is greater than the number of processes, level 2 represents more data than level 1.


[Figure 3: pseudo-code for the four levels of requests (level 0 through level 3) for the distributed-array access; the level-0 version opens the file with MPI_File_open and issues one request per row of the local array inside a loop.]
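For concreteness, the sketch below shows what level-0 and level-3 versions of this read could look like in C. The global array size, element type, and 4 x 4 process-grid layout are assumptions made for the example and are not taken from the paper's figure.

    #include <mpi.h>

    #define N    1024            /* global array is N x N (assumed)      */
    #define PDIM 4               /* 4 x 4 process grid (16 processes)    */
    #define NL   (N / PDIM)      /* each local array is NL x NL          */

    static double local_a[NL][NL];

    /* Level 0: one independent read for each row of the local array. */
    void read_level0(MPI_Comm comm, char *filename)
    {
        int rank, prow, pcol, i;
        MPI_File fh;

        MPI_Comm_rank(comm, &rank);
        prow = rank / PDIM;                /* position in the process grid */
        pcol = rank % PDIM;

        MPI_File_open(comm, filename, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        for (i = 0; i < NL; i++) {
            /* offset (in bytes) of row i of this process's block */
            MPI_Offset off = ((MPI_Offset)(prow * NL + i) * N + pcol * NL)
                             * sizeof(double);
            MPI_File_read_at(fh, off, &local_a[i][0], NL, MPI_DOUBLE,
                             MPI_STATUS_IGNORE);
        }
        MPI_File_close(&fh);
    }

    /* Level 3: describe the whole noncontiguous region with a derived
       datatype, set it as the file view, and read collectively. */
    void read_level3(MPI_Comm comm, char *filename)
    {
        int rank;
        int sizes[2]    = {N, N};
        int subsizes[2] = {NL, NL};
        int starts[2];
        MPI_Datatype filetype;
        MPI_File fh;

        MPI_Comm_rank(comm, &rank);
        starts[0] = (rank / PDIM) * NL;    /* row offset of the local block    */
        starts[1] = (rank % PDIM) * NL;    /* column offset of the local block */

        MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C,
                                 MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_open(comm, filename, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
        MPI_File_read_all(fh, &local_a[0][0], NL * NL, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
    }

The level-3 version hands the complete access information of all processes to the implementation in a single collective call, which is precisely what enables optimizations such as the collective I/O described later in the paper.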