Jovian: A Framework for Optimizing Parallel I/O

Robert Bennett, Kelvin Bryant, Alan Sussman, Raja Das, Joel Saltz
Department of Computer Science and UMIACS
University of Maryland, College Park, MD 20742
{robertb, ksb, als, raja, [email protected]}
Abstract

There has been a great deal of recent interest in parallel I/O. In this paper, we discuss the design and implementation of the Jovian library, which is intended to optimize the I/O performance of multiprocessor architectures that include multiple disks or disk arrays. We also present preliminary performance measurements from benchmarking the Jovian I/O library on the IBM SP1 distributed memory parallel machine for two application templates.
1 Introduction

Recently, there has been a great deal of interest in parallel I/O. In this paper, we present the design and implementation of the Jovian library, a portable library designed to optimize the I/O performance of multiprocessor architectures that include multiple disks or disk arrays. We describe the design of this library and then present preliminary IBM SP1 performance measurements obtained from benchmarking the library on two application templates.

High I/O bandwidths can be obtained by declustering files among a number of disks or disk arrays. The disks or disk arrays are attached to processors. An architecture can be configured so that the disks are attached to processors that have primary responsibility for I/O, or it can be configured so that processors provide both computational and I/O support. A key objective of a library designed to optimize parallel I/O is to coalesce disk access requests in a way that makes it possible to present each secondary storage device with a minimal number of disk access requests.

This work was supported by the National Science Foundation under Grant #ASC 9318183.
Different types of abstractions can be employed to characterize I/O requests. The traditional approach is for each processor to make requests to the I/O subsystem independently. Several mechanisms can serve to coalesce disk access requests. A disk cache maintains a set of recently used disk blocks in primary memory; disk caches rely on spatial and temporal locality to reduce the number of disk access requests. I/O subsystems will sometimes go further by speculatively bringing additional disk blocks into the disk cache and by periodically searching queues of pending I/O requests for opportunities to coalesce.

A large fraction of parallelized programs use a Single Program Multiple Data (SPMD) model of computation. SPMD programs frequently need to move large sets of data from secondary storage to each processor's primary memory. In the SPMD model, large data sets are distributed across a group of processors in some fashion to achieve parallelism. There is also another distribution that must be taken into account when performing I/O: the distribution of data over the parallel disk system. More often than not, these two distributions are not the same, possibly leading to a great deal of disk head movement. From the I/O requests provided by the user, the Jovian library attempts to minimize the number of I/O requests to disk by coalescing them into larger contiguous requests. The library can make use of a varying number of coalescing processes (fixed before program execution) to carry out this aggregation. We call this kind of optimization collective local I/O optimization.

There are at least two other projects that aim to perform collective local I/O optimizations. Choudhary and colleagues [2, 6] assume that each processor has access to either a physical or logical I/O device. They partition parallel I/O into two phases, where processors first
read data in a layout that corresponds to the logical disk layout and then perform in-memory permutations to lay out the data as required. The goal is to do the preprocessing needed so that disk requests access data in large contiguous chunks. Kotz [9] modifies this approach by having compute processors pass their disk access requests to I/O processors. In Kotz's approach, the I/O processors are responsible for coordinating the collective requests to the file system. This makes it possible for the I/O processors to carry out optimizations that require knowledge of how data is physically mapped to disk.

In some cases application programs have access to compact global descriptions of the data that needs to be transferred from secondary storage to primary memory. For instance, this is likely to be the case when executing an out-of-core calculation involving a structured mesh or dense matrix. In such cases all application processes can share a common description that summarizes the data that needs to be transferred between secondary and primary storage. Access to a compact global descriptor will allow the Jovian library to skip the collective communication stage that is needed when performing collective local I/O optimization. The collective global I/O optimizations in the Jovian library will be similar to the optimizations carried out by the multiblock Parti library [1, 12]; we will be able to use many ideas from multiblock Parti to construct this part of the Jovian library.

The paper is organized as follows. Section 2 discusses the execution model the Jovian library uses and presents several application areas that can benefit from the library, and Section 3 introduces a model for out-of-core compilation that can take advantage of the optimizations provided by the library. Section 4 presents the design of the library, and its request interface, in some detail. Section 5 provides the results of experiments that applied the library to two application templates, to show the benefits it can provide. We conclude in Section 6.
2 Execution model

A parallel program that uses the Jovian library must execute using the single program multiple data (SPMD) model. This model requires that the same program text be executed on each processor, but the input data to each of the programs may be different, which may cause the execution path taken in each program to differ. Following an SPMD model may cause performance degradation, but simplifies the
development of both the library and the applications that will use it.

We assume that an application which uses the Jovian library to perform I/O will execute in a loosely synchronous manner. The main feature of loosely synchronous computation is that all the participating processes alternate between phases of computation and I/O. The major implication of loosely synchronous execution within the context of the Jovian I/O library is that even if some process does not need to read or write data, it still has to participate in the I/O preprocessing phase. This participation involves sending a single null request, informing the I/O library that the process does not require any data.

The assumption of a loosely synchronous, SPMD execution model implies that all processes that require I/O in a particular phase of the computation will implicitly synchronize their requests, much like collective communication. For collective communication, all processes that need to communicate at a given point in the program must execute matching communication operations. In this case, matching means that all the sends specified by the various processes must have corresponding receives posted by other processes (and for each receive there should be a corresponding send). Similarly, the Jovian library implements collective I/O operations. By collective I/O we mean that every process participating in the computation must execute every I/O operation (read, write, open file, etc.).
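To make the collective-call requirement concrete, the C sketch below shows every process in an SPMD program calling a collective read, with processes that need no data passing an empty request. The type and function names (io_request, collective_read) are hypothetical stand-ins, not the library's actual interface.

```c
#include <stddef.h>

/* Hypothetical request descriptor: a set of (lower, upper, stride)
 * ranges into the out-of-core data structure, plus the user buffer
 * that the returned data should be copied into. */
typedef struct {
    size_t num_ranges;
    const long *lower, *upper, *stride;
    void *user_buffer;
} io_request;

/* Hypothetical collective read: every process must call it in the
 * same I/O phase, even if its request is empty. */
extern int collective_read(int fd, const io_request *req);

void io_phase(int fd, int have_data, io_request *my_req)
{
    if (have_data) {
        collective_read(fd, my_req);
    } else {
        /* A process with nothing to read still participates by
         * sending a null request, keeping the phase loosely
         * synchronous across all processes. */
        io_request null_req = { 0, NULL, NULL, NULL, NULL };
        collective_read(fd, &null_req);
    }
}
```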
Some of the parallel applications that we plan to implement using the Jovian library to optimize I/O come from the following areas:

Geographic Information Systems: Understanding land cover dynamics is one of the most important challenges in the study of global change. Many of these changes take place at very fine scales (less than 1 km cell size) and require the analysis of high resolution satellite images for accurate measurement. Analysis of large amounts of satellite data (multiple Gbytes) stored in spatial databases is an integral part of this application. This may involve reading blocks of data stored on many disks, resulting in a severe I/O bottleneck. The Jovian library will be used to implement out-of-core algorithms for accessing spatial databases.
Data Mining: The Jovian I/O library can be used to support data mining applications being developed at the University of Maryland [7]. In a database system, similarities exist among the resident objects, and groups of similar objects form classes. Given a database and a set of classes that can be used to categorize the objects present in the database, data mining is the technique that infers the rules required to do the actual classification. The data mining tool uses inductive reasoning to infer the required rules. These rules are developed based on the classes and examples, which are objects stored in the database. The whole database or a subset of the database can be used as the training set to generate the rules. The descriptions of the classes, on which the rules are based, are inferred by looking at common properties of the examples of the classes.
Scientific Codes: A number of sparse and block-structured applications can utilize the Jovian library. In such codes, there are instances where the data is partitioned in an irregular fashion to reduce communication. For this reason each process may have to read data which is not stored contiguously on disk. If every process accessed the disk to read the data it requires, there would be severe contention for the disks. Using the I/O optimizations provided by the Jovian library should alleviate this bottleneck. A number of computational fluid dynamics, computational chemistry and other sparse codes will use the library.
3 Issues for out-of-core compilation

3.1 Global vs. distributed I/O view

There are two complementary views for accessing an out-of-core data structure, such as an array, from each process running a parallel program. With a global view, access to out-of-core data requires copying a globally specified subset of an out-of-core data structure into a globally specified subset of a data structure distributed across the processes. In order to implement such accesses, the library needs to know the distributions of both the out-of-core data structure and the in-core data structure, and also requires a description of the requested data transfer. With a distributed view, each process effectively requests only the part of the data structure that it requires. In that case, the user program is responsible for translating local data structure addresses into the correct global addresses for accessing the entire out-of-core data structure (perhaps through calls to a runtime library such as CHAOS [5] or multiblock PARTI [1]).

A global view of out-of-core data structures is the natural view for a parallelizing compiler for languages such as HPF or Fortran D. Since a
program written in such a language uses a shared address space model for all distributed data structures, any I/O statements in the source code can be directly translated into I/O statements in the code generated for each process. Global I/O routines require knowledge of the user-specified data distributions to determine the portion of the data structure that is requested by each process. Since the source program to the compiler uses a global view for the data structures, the most straightforward way for the compiler-generated parallel code to interact with the I/O system is to use a global view.

On the other hand, using a distributed view for out-of-core data structures greatly simplifies the design and implementation of a collective I/O optimizing library. With a distributed view, the user program is responsible for providing the I/O system with only the requests for that process, so the burden of translating local addresses for a distributed data structure into the corresponding global, out-of-core addresses is on the user program. The I/O system is then only responsible for optimizing the accesses to maximize bandwidth and minimize latency to secondary storage. One way to optimize I/O requests from multiple processes using a distributed view is to restrict the requests to collective operations. In our loosely synchronous, SPMD execution model, this means that all processes must execute every I/O call, even if a process only tells the I/O system that it is not transferring any data for that request.

The current implementation of the Jovian library implements a distributed view. Each process executes every I/O call, and is responsible for translating all requests into offsets into the complete out-of-core data structure. The major advantage of using the library is that the requests are aggregated to optimize access to the file system, taking advantage of the restriction to collective I/O operations. We are planning to implement a global view in the near future, to ease the use of the library by compilers. Implementing a global view requires that the compiler and the I/O routines use a common distributed data descriptor. The format for such a descriptor, which must be able to handle both regular and irregular distributions, is in the process of being defined by the Parallel Compiler Runtime Consortium (PCRC), which includes groups from Syracuse University, University of Indiana, Rice University and several other universities, in addition to our group at Maryland. The library will include methods for dealing with various descriptions of I/O requests, including entire arrays, irregular portions of arrays, and regular sections of arrays (with strides). Efficient
implementations of the mapping from this global view to local addresses will be based on similar existing implementations in the CHAOS and multiblock PARTI libraries.
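As an illustration of the distributed view, the sketch below shows how a process might translate its local piece of a block-distributed two-dimensional array into global (lower, upper, stride) ranges, one per local row, before handing them to the I/O library. The function name, the row-major layout and the square process grid are illustrative assumptions, not the library's actual descriptor format.

```c
/* Translate a process's block of an N x N array (distributed by block
 * in both dimensions over a procs_per_dim x procs_per_dim process grid)
 * into global row-major ranges.  The lower/upper/stride arrays must each
 * hold one entry per local row. */
void local_block_to_global_ranges(long N, long procs_per_dim,
                                  long my_row, long my_col,
                                  long *lower, long *upper, long *stride)
{
    long rows_per_proc = N / procs_per_dim;
    long cols_per_proc = N / procs_per_dim;
    long first_row = my_row * rows_per_proc;
    long first_col = my_col * cols_per_proc;

    /* One contiguous range per local row of the sub-block. */
    for (long r = 0; r < rows_per_proc; r++) {
        long global_row = first_row + r;
        lower[r]  = global_row * N + first_col;                  /* first element  */
        upper[r]  = global_row * N + first_col + cols_per_proc - 1;
        stride[r] = 1;                                           /* dense row      */
    }
}
```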
3.2 Compilation model

The general model for out-of-core compilation relies on having the compiler generate calls to various runtime libraries to handle both I/O and interprocessor communication. For communication, our techniques have been validated for irregular, block structured and structured data access patterns [1, 4, 11, 10]. Extending these techniques to handle out-of-core data structures requires using the local memories as "caches" for out-of-core data. The problems involved in keeping track of the data that is currently available in local memory are very similar to the problems involved in caching data from other processes on a distributed memory machine. For data that is irregularly distributed across processes, a data structure analogous to the CHAOS distributed translation table [5] can serve to store the locations (process and local address) of data currently available in the local memory of some process. We call this data structure an out-of-core data access descriptor. All other data will reside only in secondary storage. Since irregular data access patterns usually cannot be analyzed at compile time, the compiler has to generate code to initialize and maintain the out-of-core data access descriptor at runtime. The compiler can then generate code to use the descriptor to move data to the right place at the right time. Instead of generating low-level communication and I/O statements, the compiler can use high-level calls to both communication libraries (e.g. CHAOS or multiblock PARTI) and I/O libraries (e.g. Jovian) to manage the data. The Jovian library can be used by the compiler to move data from secondary storage into local memories (the equivalent of a gather, for data irregularly distributed across a distributed memory machine) and from the local memories back to secondary storage (the equivalent of a scatter, for irregularly distributed data). For more regular data distributions, the compiler can generate the equivalent of regular section copies between local memories and secondary storage. The Jovian library will provide this functionality once a global view has been implemented, based on existing implementations of this capability for moving data between local processor memories in CHAOS and multiblock PARTI.

There are several tasks that a compiler has to perform for out-of-core data structures on distributed
memory machines that were not necessary for in-core data structures. First, the compiler has to analyze the data layout statements to determine the logical assignment of data to processes. We assume that the user program specifies which data structures potentially do not fit into the local memory of the processes (i.e. which data structures are out-of-core), in addition to specifying the layout across processes. Second, the compiler has to generate the code to build the out-of-core data access descriptor for each out-of-core data structure. This step also involves generating code to update the descriptor whenever an I/O operation is performed, because such operations are the only way to change which parts of the out-of-core data structure are available in local memory. We are still assuming a loosely synchronous execution model, so that all processes can have access to a coherent version of the out-of-core data access descriptor at all times. Note that, for a given process, the descriptor not only contains information about the data in the local memory of that process, but also about the portions of the out-of-core data structure residing in the local memories of other processes. The third new task that the compiler must perform involves optimizing the use of out-of-core data. This task includes strip mining user loops to reference only a portion of the out-of-core data structures at a time, and other loop optimizations to optimize local memory usage and interprocessor communication. Since this task may involve generating I/O that the user did not specify (for moving out-of-core data between local memories and secondary storage), the results of this task affect the code generation for managing the out-of-core data access descriptors. Strip mining algorithms for out-of-core compilation for regular data distributions and regular access patterns are described by Choudhary [2]. The main goal of these optimizations is to minimize both the number of I/O operations (since each one is a collective operation) and the total I/O volume.
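A minimal sketch of what an out-of-core data access descriptor might hold is given below. The structure and field names are assumptions made for illustration, modeled on the description above rather than on the CHAOS translation table's actual layout.

```c
#include <stddef.h>

/* For each page (block) of the out-of-core data structure, record
 * whether it currently resides in some process's local memory and,
 * if so, where.  A page not present in any local memory lives only
 * on secondary storage. */
typedef struct {
    int     in_core;        /* nonzero if the page is cached in memory   */
    int     owner_process;  /* process holding the page (if in_core)     */
    size_t  local_offset;   /* offset within that process's local buffer */
} page_entry;

typedef struct {
    size_t      num_pages;  /* total pages in the out-of-core structure  */
    size_t      page_size;  /* bytes per page                            */
    page_entry *pages;      /* one entry per page, kept coherent across
                               processes by updating it at every
                               collective I/O operation                  */
} ooc_access_descriptor;
```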
4 Jovian I/O runtime library

The model that was used to develop the Jovian runtime library is illustrated in Figure 1. The model distinguishes between two types of processes: application processes and coalescing processes, henceforth referred to as A/Ps and C/Ps, respectively. The A/Ps consist of the user application and part of the Jovian runtime library, and issue I/O requests to a statically assigned C/P. The mapping between A/Ps and C/Ps can be one-to-one or many-to-one.
Figure 1: Jovian application and coalescing processes

The C/Ps, which are in many aspects similar to server processes in database management systems (DBMS), are each responsible for a part of the global data structure on disk. The C/Ps can also be thought of as servers for persistent data (e.g. files) that are created specifically for the application. A simple extension of the model would be for the C/Ps to persist across execution of multiple applications. However, that would seriously complicate the implementation of the runtime library, and might also require operating system support for multithreading the server to handle multiple applications at the same time.

The C/Ps first attempt to minimize the number of I/O requests by coalescing them into larger contiguous requests. This is called collective local I/O optimization. The C/Ps then perform the necessary disk I/O operations and service the I/O requests. In this model, C/Ps can potentially access disjoint data files; that is, in the best possible scenario, each C/P has access to a local disk or to a parallel file system (e.g. IBM's Vesta [3]) that supports parallel access to disjoint partitions (subfiles of a global file). The basic phases employed in the library to handle I/O requests are:

1. All application processes send I/O requests to predetermined coalescing processes (each application process sends requests to one coalescing process).

2. Each coalescing process is responsible for accessing a logically defined I/O device. The logically defined I/O device can be a single disk, a set of disks serviced by a single disk controller or a
communication channel associated with a RAID array.

3. All coalescing processes carry out a collective communication phase to partition the application-level I/O requests. At the end of this collective communication phase, each coalescing process is responsible for only the I/O requests directed at its logically defined I/O device.

4. The coalescing processes perform the necessary disk I/O.

5. Coalescing processes forward data directly to the original requesting application processes.

Currently, the interface provided by the library allows an application to create and destroy C/Ps, open/close distributed files and issue I/O requests to the C/Ps to read/write distributed data structures (i.e., matrices).
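The sketch below illustrates how an application process might drive this pipeline through the library's interface. The routine names and signatures (jov_init, jov_open, jov_read, jov_close, jov_exit) are hypothetical stand-ins, since the paper describes the interface only at the level of create/destroy C/Ps, open/close, and read/write.

```c
/* Hypothetical application-side use of a Jovian-style interface.
 * All calls are collective: every A/P executes each of them. */
extern int  jov_init(int num_coalescing_procs);   /* split A/Ps and C/Ps     */
extern int  jov_open(const char *name);           /* open a distributed file */
extern int  jov_read(int fd, long *lower, long *upper, long *stride,
                     int num_ranges, void *user_buffer);
extern int  jov_close(int fd);
extern void jov_exit(void);                       /* shut down the C/Ps      */

int run_io_phase(long *lo, long *up, long *st, int nranges, double *buf)
{
    int fd, status;

    jov_init(4);                       /* e.g., four coalescing processes */
    fd = jov_open("grid.dat");

    /* Each A/P passes only its own ranges; the C/Ps coalesce them. */
    status = jov_read(fd, lo, up, st, nranges, buf);

    jov_close(fd);
    jov_exit();
    return status;
}
```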
4.1 Process Initialization

Initially, when an application that has been linked with the Jovian library is executed on a parallel system, there is no distinction between A/Ps and C/Ps. When the Jovian library initialization routine is called in the application, the routine determines which processes are designated A/Ps and C/Ps based on information provided by the user (e.g. number of C/Ps, logical C/P ids) and system information (e.g. total number of processes in the partition, logical processor ids). Collectively, all processes generate maps that determine which C/P services each A/P. That is, C/Ps determine how many and which A/Ps each will be servicing, while A/Ps determine which C/P to send requests to. After this phase, the A/Ps return execution to the application program, and the processes that have been designated C/Ps execute a command interface, thus becoming server processes waiting for I/O commands (i.e., read, write, open, close) from the A/Ps.
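One plausible way to compute the role split and the A/P-to-C/P map is sketched below. It assumes, purely for illustration, that the last num_cps processes in the partition become C/Ps and that A/Ps are dealt out to C/Ps round-robin; the library's actual conventions come from the user-supplied and system-supplied information described above.

```c
/* Decide this process's role and, for an A/P, which C/P it will send
 * its requests to.  Illustrative only. */
typedef enum { ROLE_AP, ROLE_CP } role_t;

role_t assign_role(int my_rank, int total_procs, int num_cps, int *my_cp_rank)
{
    int num_aps = total_procs - num_cps;

    if (my_rank >= num_aps) {
        *my_cp_rank = -1;            /* this process is itself a C/P */
        return ROLE_CP;
    }
    /* Many-to-one mapping: A/Ps are assigned to C/Ps round-robin. */
    *my_cp_rank = num_aps + (my_rank % num_cps);
    return ROLE_AP;
}
```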
4.2 Opening/Closing Distributed Files

Jovian provides a Unix-like interface for opening and closing distributed files. These routines are similar in functionality to the Unix system calls open() and close(). When an application opens a file, the C/Ps are requested to execute a collective open routine. The library makes use of the filesystem interface (i.e., AIX, OSF/1, etc.) on the host machine to open the file. This is also the case for reading, writing and closing a file. Each C/P opens its portion of the physical file (i.e., a large file that has been striped across multiple local disks or a file managed by a parallel file system); then, collectively, all C/Ps determine the total file size and the mapping of blocks to C/Ps. The C/Ps use file distribution information provided by the user to access the file metadata. At this point, each C/P has all the information it needs about the layout of the physical file on disk (i.e., the total number of file pages and the sequence of page numbers and blocks that it is responsible for), so that any subsequent I/O requests can be optimized, via collective local I/O optimization, and satisfied.
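A hedged sketch of the layout information computed at open time follows. It assumes each C/P's partition of the striped file is a contiguous run of logical blocks, which matches the row-block striping used in the experiments of Section 5 but is an assumption about the general case; the structure and function names are illustrative.

```c
/* After each C/P learns the size of its local piece of the file, all
 * C/Ps can compute the total file size and which C/P owns each
 * logical block.  Contiguous (block) striping is assumed here. */
typedef struct {
    long  block_size;      /* bytes per logical file block                */
    long  total_blocks;    /* across all C/Ps                             */
    long *first_block;     /* first_block[c] = first block owned by C/P c */
    int   num_cps;
} file_layout;

int owning_cp(const file_layout *fl, long block)
{
    for (int c = fl->num_cps - 1; c >= 0; c--)
        if (block >= fl->first_block[c])
            return c;      /* highest C/P whose range starts at or before block */
    return 0;
}
```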
4.3 Satisfying I/O Requests

We describe read requests in detail. The algorithms used for write requests are similar, so we do not describe them fully in this paper, but we briefly discuss some issues at the end of this section. As described in Section 3, there are two views that an application can use to read a distributed data structure from disk; requests can be made using either a distributed or a global view. Currently, the library only supports a distributed view. Using a distributed view, there are two ways in which A/Ps can specify the data needed from disk:

Range requests consist of a set of ranges range_i, 0 ≤ i < N, in which each range_i is a tuple of the form (l_i, u_i, s_i), where l_i, u_i and s_i are the global lower index, global upper index and stride of range_i, respectively. Also associated with each range is a local user address where the data from disk will be placed in memory. For example, consider an N × N matrix that is block distributed across P A/Ps, so that each A/P is responsible for an (N/√P) × (N/√P) submatrix. Each A/P will request N/√P ranges, numbered range_0, ..., range_{N/√P − 1}, one for each row of its submatrix. From the example, we can see that as the number of dimensions in the data structure increases, more ranges are required to access the data on disk. This means that range requests are, at best, sufficient for data structures with a small number of dimensions, and are also useful for irregularly accessed data arrays (i.e., non-contiguous access using an indirection array), but cumbersome and memory consuming for more complex structures.

Regular sections with strides provide a compact representation for accessing arbitrarily sized, n-dimensional data structures [8]. For each dimension, a lower bound, upper bound and stride (a single tuple) must be specified. Therefore, for an n-dimensional array, only n tuples are necessary.

The library supports both regular and irregular data access patterns by utilizing a set of ranges. Allowing I/O requests to the C/Ps to consist of range information can potentially lead to small, possibly duplicate, data messages being sent back to the A/Ps after the data is retrieved from disk. Regular section requests do not suffer from this problem, but are not supported in the current version of the library. The library preprocesses the ranges in an attempt to eliminate duplicate or overlapping ranges. Ranges are mapped to disk blocks (pages). Since the library generates mapping information to determine where data received by the A/Ps is copied to, it also maps ranges into blocks and sends the list of block numbers as I/O requests to the C/Ps. This reduces the amount of preprocessing on the C/P side.
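The two request forms can be captured by small descriptors such as the ones below. The structures and field names are illustrative, since the paper specifies only the (l_i, u_i, s_i) tuples and the per-dimension bounds, not a concrete C interface.

```c
/* Distributed-view range request: one (lower, upper, stride) tuple per
 * range, plus the local address where the returned data belongs. */
typedef struct {
    long  lower;       /* global lower index l_i                   */
    long  upper;       /* global upper index u_i                   */
    long  stride;      /* stride s_i                               */
    void *local_addr;  /* where the data is placed in user memory  */
} range_request;

/* Regular section with strides: one (lower, upper, stride) triple per
 * dimension, so an n-dimensional section needs only n tuples. */
typedef struct {
    int   ndims;
    long *lower;       /* lower bound per dimension */
    long *upper;       /* upper bound per dimension */
    long *stride;      /* stride per dimension      */
} regular_section;
```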
4.3.1 Library Preprocessing in an A/P
When an application performs a read with the appropriate request information (i.e., range requests), the library determines the structure of the request and builds an I/O message that is sent to the C/Ps. The requests are preprocessed for two main reasons.

First, I/O requests have to be in a form that the C/Ps can readily use; currently, this form consists of blocks. A C/P views its part of the distributed file as a linear sequence of blocks, so when an application requests data from disk, it is requesting a sequence of partial or full blocks and receiving full blocks from the C/P. Any duplicate ranges are effectively mapped to an already requested block. This has the effect of minimizing the number of messages and maximizing the size of the messages returned to an A/P. For example, if m out of n ranges fall within a block j, the A/P sends only one block request for those m ranges and the servicing C/P sends back only one block, instead of m separate data elements. The trade-off is between transmitting a large number of small messages, which can incur a high message start-up overhead, and transmitting a smaller number of larger messages.

Second, the library needs to store mapping information to process the data, which is returned as blocks by the C/Ps. One can easily map a range or set of ranges to a specific block and thus extract the requested data quickly. Block conversion is only used when the requests are in range format, since the regular section representation is already in a compact form.
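A minimal sketch of the range-to-block conversion on the A/P side is given below. It assumes byte offsets can be derived from global element indices and an element size, ignores strides for brevity, and suppresses duplicate block numbers so that a block is requested only once no matter how many ranges fall inside it; the function names are illustrative.

```c
#include <stdlib.h>

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Convert (lower, upper) element ranges into a sorted list of unique
 * logical block numbers.  The blocks array must be large enough to hold
 * every (possibly duplicated) block number before deduplication. */
long ranges_to_blocks(const long *lower, const long *upper, int nranges,
                      long elem_size, long block_size, long *blocks /* out */)
{
    long n = 0;
    for (int i = 0; i < nranges; i++) {
        long first = (lower[i] * elem_size) / block_size;
        long last  = (upper[i] * elem_size) / block_size;
        for (long b = first; b <= last; b++)
            blocks[n++] = b;
    }
    /* Sort and remove duplicates: overlapping ranges map to a block
     * that is requested (and returned) only once. */
    qsort(blocks, (size_t)n, sizeof(long), cmp_long);
    long m = 0;
    for (long i = 0; i < n; i++)
        if (m == 0 || blocks[i] != blocks[m - 1])
            blocks[m++] = blocks[i];
    return m;   /* number of unique blocks */
}
```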
4.3.2 Coalescing Processes
The C/Ps are responsible for performing collective local I/O optimization to read data from disk and satisfy the I/O requests from the A/Ps. In this subsection, each phase that takes place in the C/Ps is detailed.
Exchange Requests

The exchange phase is responsible for routing each block request to the C/P that owns the disk file containing the requested data. This routing is accomplished in three steps: block mapping, sending requests, and receiving requests. Each C/P must determine which of the requests received from A/Ps it can service and which should be routed to other C/Ps. As stated in Section 4.2, the C/Ps engage in a collective communication phase when opening a file, to determine the layout of the striped file (i.e. which blocks are owned by which C/P). The mapping step of the exchange routine uses this information to sort the block requests according to the C/P that will service them. Requests that are forwarded to another C/P must identify the requesting A/Ps, so that the C/Ps can send data directly to the A/P(s) that requested a particular block.
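The mapping step can be sketched as a simple bucketing pass, shown below under the assumption of contiguous striping (each C/P owning blocks_per_cp consecutive blocks), with the requesting A/P carried alongside each forwarded block so the servicing C/P can reply directly to it. The names and the striping assumption are illustrative; the real layout comes from the collective open.

```c
typedef struct {
    long block;    /* logical block number          */
    int  ap;       /* A/P that asked for this block */
} block_request;

/* Sort incoming block requests into one bucket per owning C/P.
 * buckets[c] must be pre-allocated with room for nreqs entries. */
void bucket_by_owner(const block_request *reqs, long nreqs,
                     long blocks_per_cp, int num_cps,
                     block_request **buckets, long *bucket_len)
{
    for (int c = 0; c < num_cps; c++)
        bucket_len[c] = 0;

    for (long i = 0; i < nreqs; i++) {
        int owner = (int)(reqs[i].block / blocks_per_cp);
        if (owner >= num_cps)
            owner = num_cps - 1;          /* last C/P takes any remainder */
        buckets[owner][bucket_len[owner]++] = reqs[i];
    }
    /* Requests in buckets[c] are then forwarded to C/P c; the requesting
     * A/P travels with each block so data can be returned directly. */
}
```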
Coalescing Requests

One of the goals of Jovian is to coalesce blocks into larger requests to optimize disk accesses. Larger requests should minimize disk seek times, thus improving disk performance. In addition, larger read requests potentially reduce the communication delays associated with sending data back to A/Ps. Disk accesses occur in units no smaller than a disk sector. These sectors are grouped into blocks by the file system. This physical block size is used by the C/P to determine the number of blocks in a file stripe. Logical file blocks are numbered from zero, starting with the first file stripe, and extend to the end of the distributed file. Each C/P sorts its block requests, so that requests to consecutive blocks can be converted into single disk operations. Jovian also includes a parameter that allows the specification of a Read Gap Size. This enables the reading of non-consecutive blocks, where the gap between the blocks is less than or equal to the specified gap size. For example, the library may coalesce requests and discover that there are requests for logical pages 0 through 17 and 19 through 50. If the read gap size is 2, then a single read request for pages 0 through 50 will be issued to the file system.
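The coalescing step can be sketched as a single pass over the sorted block list, merging runs whose gaps do not exceed the read gap size; for the example above (blocks 0 through 17 and 19 through 50 with a gap size of 2), the pass below emits the single request [0, 50]. The function name and signature are illustrative.

```c
/* Merge a sorted list of unique block numbers into (first, last)
 * read requests, joining runs separated by at most read_gap blocks. */
long coalesce_blocks(const long *blocks, long nblocks, long read_gap,
                     long *first /* out */, long *last /* out */)
{
    if (nblocks == 0)
        return 0;

    long nreq = 0;
    first[nreq] = blocks[0];
    last[nreq]  = blocks[0];

    for (long i = 1; i < nblocks; i++) {
        if (blocks[i] - last[nreq] - 1 <= read_gap) {
            last[nreq] = blocks[i];          /* extend the current run   */
        } else {
            nreq++;                          /* start a new read request */
            first[nreq] = blocks[i];
            last[nreq]  = blocks[i];
        }
    }
    return nreq + 1;   /* number of coalesced read requests */
}
```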
Disk Reads and Returning Data to A/Ps
Reading and returning data to the A/Ps is performed in three phases. First, Jovian issues a request to the file system. Next, blocks in the read buffer that are destined for the same A/P are copied to a communication send buffer. Finally, a send operation is issued for the blocks in the send buffer. Note that at this point, as will be discussed shortly, the A/Ps that requested the data are waiting on blocking receive calls for data from any C/P.

The coalescing phase creates large disk requests from smaller requests; however, there is a limit on the amount of data that can be read from disk in a single request, based on the availability of in-core memory on the C/P. Jovian scans the request table in sorted order and issues disk read requests for the coalesced logical blocks. Requests that are too large are satisfied with multiple disk requests. Since Jovian coalesces based on block numbers (and not A/P numbers), a single request will potentially satisfy read requests from several A/Ps. After the send buffer is created, a non-blocking send request is issued. Jovian uses a multiple buffering scheme that allows the filling of a new send buffer while previously filled buffers are being sent to the destination A/P. When all the blocks in the current read buffer have been sent, the next set of coalesced requests is issued to the file system.
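The double-buffered return path can be sketched as below. MPI is used purely for illustration (the paper does not name the underlying message-passing layer), and fill_for_ap is a hypothetical helper that copies the blocks destined for one A/P into a send buffer and returns the number of bytes.

```c
#include <mpi.h>

#define BUF_BYTES (64 * 1024)

extern long fill_for_ap(int ap, const char *read_buf, long read_len,
                        char *send_buf);   /* hypothetical helper */

/* Illustrative double-buffered send loop: while one buffer is being
 * sent (non-blocking) to its destination A/P, the other is filled
 * from the current read buffer. */
void return_blocks(const char *read_buf, long read_len,
                   const int *dest_aps, int num_dests, MPI_Comm comm)
{
    static char bufs[2][BUF_BYTES];
    MPI_Request pending[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };
    int cur = 0;

    for (int d = 0; d < num_dests; d++) {
        /* Make sure this buffer's previous send has completed. */
        MPI_Wait(&pending[cur], MPI_STATUS_IGNORE);

        long nbytes = fill_for_ap(dest_aps[d], read_buf, read_len, bufs[cur]);
        MPI_Isend(bufs[cur], (int)nbytes, MPI_BYTE, dest_aps[d],
                  0, comm, &pending[cur]);

        cur = 1 - cur;   /* switch to the other buffer for the next A/P */
    }
    MPI_Waitall(2, pending, MPI_STATUSES_IGNORE);
}
```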
Receiving and Processing Blocks by A/P

Once an A/P sends its requests to its assigned C/P, the A/P determines how much data it will be receiving and issues blocking receive() calls to collect the data. Each A/P uses mapping information to copy the received data to the appropriate address in the user memory space; this step is repeated until all blocks have been processed. Providing non-blocking I/O calls in the Jovian library would only require replacing the blocking receive() calls in this part of the library with non-blocking receives. Non-blocking I/O calls have not yet been implemented.
4.3.3 Write Requests

Write operations are more difficult to handle than read operations. Excess data can be read by the C/Ps, as a consequence of the coalescing phase performed by the C/P, but only the requested data will be copied out of the A/P read buffer and written into the application memory space. For writes to disk after coalescing the requests, some scheme must be used to prevent extra data from being written to a file. For example, suppose a C/P has coalesced a set of blocks into a write buffer, and the last block is only partially filled with valid data. The C/P cannot write this buffer directly to disk as it stands, since data in the file corresponding to the location of the non-valid data in the last block would be overwritten. Instead, the C/P uses a read-modify-write algorithm to write the data to the proper location in the file: the corresponding block is read into the local memory of the C/P, the appropriate locations are modified, and the block is written back to disk. Although a simple example was used to illustrate how write operations are handled, a more interesting case involves a totally irregular write access pattern to the file. The overhead of seeking to and writing each individual data element would be much greater than the overhead of reading a block from the file, modifying some of the data elements within the block, and writing the block back to disk.
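A hedged sketch of the read-modify-write step for a trailing, partially valid block follows. The use of POSIX pread/pwrite, the fixed-size on-stack block buffer, and the function name are assumptions made for illustration, not the library's actual implementation.

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Write a coalesced buffer whose final block is only partially valid.
 * Whole blocks are written directly; the last block is first read from
 * the file, patched with the valid bytes, and written back, so that the
 * unrequested bytes in that block are not clobbered. */
int write_with_rmw(int fd, off_t file_offset, const char *buf,
                   size_t nbytes, size_t block_size)
{
    size_t full = (nbytes / block_size) * block_size;  /* whole blocks */
    size_t tail = nbytes - full;                       /* valid tail   */
    char   block[8192];                                /* assumed >= block_size */

    if (full > 0) {
        if (pwrite(fd, buf, full, file_offset) != (ssize_t)full)
            return -1;
    }
    if (tail > 0) {
        off_t last = file_offset + (off_t)full;
        /* Read-modify-write of the final, partially valid block. */
        if (pread(fd, block, block_size, last) < 0)
            return -1;
        memcpy(block, buf + full, tail);
        if (pwrite(fd, block, block_size, last) != (ssize_t)block_size)
            return -1;
    }
    return 0;
}
```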
5 Experimental Results

Two application templates, one modeling structured grid applications and one modeling accesses to regular sections of structured grids, were used to evaluate the performance of the Jovian library. Each application was run with P_{A/P} = 4 and P_{C/P} = 4 on dedicated IBM SP1 nodes. Each C/P had access to a local 1 GB SCSI disk, with a maximum transfer rate of about 3 MB/sec.
5.1 Structured Grid

The first test application involved reading an N × N, distributed, structured grid from disk using P_{A/P} application processes and P_{C/P} coalescing processes. The distribution of the grid across the A/Ps and the striping of the grid across the C/Ps are illustrated in Figure 2. Initially, the structured grid was striped by blocks of rows across the disks of the C/Ps. Each C/P was responsible for N/P_{C/P} rows of N data elements. In this case each data element is a double, which is 8 bytes on the RS6000. The application was parallelized using the Multiblock PARTI runtime library [1]. The library provides the functionality to lay out distributed arrays. Using Multiblock PARTI, the grid was distributed by block in both dimensions across the A/Ps. Each sub-grid was of size N/√P_{A/P} × N/√P_{A/P}. The library also provided the functionality by which the range request information was generated.
Figure 2: Distribution and Striping of Structured Grid

Figure 3 presents the performance numbers obtained from the structured grid application. The disk read times and transfer rates for each C/P are given in column 2. For the grid sizes listed in column 1, the respective amounts of data read by each C/P were 2 MB, 8 MB and 32 MB. Given that each A/P was requesting its entire sub-grid, each C/P read contiguous disk blocks and was able to sustain a data transfer rate of over 2.0 MB/s from the local SCSI disk. Unfortunately, the A/P does not obtain data at the disk transfer rate. The rate at which data arrives at the A/P is given in column 3. Once the overhead of the library is factored in, the data rates decrease. Notice that the data rates sustained by the Jovian library are much better than those obtained by doing replicated reads of the data from disk. In the replicated read model, each A/P seeks to the proper position in the data file for each range request. Since there is currently no parallel filesystem support on the SP1, each data file must be replicated on the local disk of each A/P. For this data access pattern, the Jovian library performs many fewer disk seeks than the replicated read model, and so performs much better. Although the library overhead is significant, even for large data files, Jovian has a clear advantage over the replicated read method.
5.2 Regular sections of a structured grid

In this application, several sub-matrices of a given distributed matrix (distributed across application processes and disks as in Figure 2) were read from disk.
Global Grid Size | Disk Read ms (MB/sec) | Jovian Gather ms (MB/sec) | Replicated Read ms (MB/sec) | % Overhead
1K × 1K          | 1000 (2.2)            | 1700 (1.2)                | 1900 (1.1)                  | 44.3
2K × 2K          | 3700 (2.3)            | 6600 (1.3)                | 9600 (.9)                   | 43.4
4K × 4K          | 14600 (2.3)           | 24300 (1.4)               | 36800 (.9)                  | 39.8
Figure 3: Jovian performance and overhead for a structured grid (per CP or AP)
Figure 4: Regular sections of distributed array; X's mark the origins

These regular sections are illustrated in Figure 4. Regular sections, labeled RS-1, RS-2 and RS-3 in the figure, are specified by their origin in global indices and a dimension extent. This application template is intended to model a geographic information system (GIS) application for computing a greenness measure (i.e. how much vegetation there is on the ground) from multiple satellite images. The satellite images are taken at different times in approximately the same regions, so the regular sections represent the same area on the ground in different images.
Unlike the structured grid application, there is no guarantee that every A/P will receive data for each regular section. Similarly, there is no guarantee that each C/P will provide data from disk for each regular section. Referring to Figure 4, if the application requests only regular section RS-1, then A/P-0 and A/P-2 will receive the corresponding data. Since the data is striped by blocks of rows across the C/Ps, all four C/Ps will supply the data for RS-1.

The performance of the regular section application using the Jovian library is illustrated in Figure 5. The total distributed array consists of 4096 × 4096 double precision numbers. Each regular section is 3K × 1.5K, for a total of about 36 MB. The A/Ps that request no data for a regular section (e.g. A/P-1 and A/P-3 for RS-1) incur some overhead, since they must call the Jovian gather routine anyway. This overhead is about 14 ms. We expect that the overhead can be decreased further, since many parts of the library have not been optimized. This application has a more complicated access pattern than the structured grid application, and consequently achieves a much lower data transfer rate (0.4-1 MB/sec) to the A/Ps. The two main factors affecting the transfer rate are load imbalance and the data access pattern to the disks. Many of the range requests are not handled by all of the C/Ps, leading to load imbalances. For this data access pattern, many non-contiguous parts of the data owned by each C/P must be read, leading to many disk seek operations.

For RS-1 and RS-3 the library obtains better transfer rates than the replicated reads, because all four C/Ps are supplying data to two A/Ps. The total number of disk seeks in both cases is the same (one for each row in the regular section). However, for the library, the seeks are performed by the four C/Ps in parallel, while for the replicated reads the seeks are performed only by the two A/Ps that require data from the regular section.
Section | A/P | Data Requested (MBytes) | Replicated Read ms (MB/sec) | Jovian Gather ms (MB/sec)
RS-1    |  0  | 18.0                    | 28800 (.7)                  | 18800 (1.0)
RS-1    |  1  |  0.0                    |     0.0 (--)                |    14.1 (--)
RS-1    |  2  | 18.0                    | 29000 (.7)                  | 18600 (1.0)
RS-1    |  3  |  0.0                    |     0.0 (--)                |    14.5 (--)
RS-2    |  0  |  8.2                    | 22800 (.4)                  | 22800 (.4)
RS-2    |  1  |  9.8                    | 22800 (.5)                  | 22900 (.5)
RS-2    |  2  |  8.2                    | 22700 (.4)                  | 23000 (.4)
RS-2    |  3  |  9.8                    | 23600 (.4)                  | 22900 (.5)
RS-3    |  0  |  0.0                    |     0.0 (--)                |    14.3 (--)
RS-3    |  1  | 18.0                    | 25800 (.7)                  | 18500 (1.0)
RS-3    |  2  |  0.0                    |     0.0 (--)                |    14.4 (--)
RS-3    |  3  | 18.0                    | 25500 (.7)                  | 18000 (1.0)
Figure 5: Jovian performance for a regular section access pattern

On the other hand, for RS-2, the library and the replicated reads provide approximately the same performance. For this regular section, the advantages from coalescing requests are offset by the increased communication costs in the library.
6 Conclusions

We have presented the design and implementation of the Jovian I/O library. The library is intended to optimize the I/O performance of distributed memory parallel architectures that include multiple disks or disk arrays. Preliminary performance measurements obtained from benchmarking the library on two application templates on the IBM SP1 show that these optimizations can provide significant improvements over traditional I/O methods for such machines.
Acknowledgements

This research was performed on the IBM SP1's at Argonne National Laboratory and Cornell University, on the SP2 and Intel Paragon DoD HPC supercomputers at Maui and Wright-Patterson AFB, and on the Intel Paragon operated by Caltech on behalf of the Concurrent Supercomputing Consortium. Access to the CCSC Paragon was provided by the Center for Research on Parallel Computation. We wish to gratefully acknowledge the willingness of these organizations to provide access to their multiprocessors.
We also would like to thank John Townshend for productive discussions on I/O requirements associated with the Land Cover Dynamics grand challenge.
References

[1] Gagan Agrawal, Alan Sussman, and Joel Saltz. Compiler and runtime support for structured and block structured applications. In Proceedings Supercomputing '93, pages 578-587. IEEE Computer Society Press, November 1993.

[2] Rajesh Bordawekar, Rajeev Thakur, and Alok Choudhary. Efficient compilation of out-of-core data parallel programs. Technical Report SCCS 622, NPAC, April 1994. Submitted to Supercomputing '94.

[3] P. F. Corbett and D. G. Feitelson. Design and implementation of the Vesta parallel file system. In Proceedings of the Scalable High Performance Computing Conference (SHPCC-94), pages 63-70. IEEE Computer Society Press, May 1994.

[4] Raja Das, Joel Saltz, and Reinhard von Hanxleden. Slicing analysis and indirect access to distributed arrays. In Proceedings of the 6th Workshop on Languages and Compilers for Parallel Computing, pages 152-168. Springer-Verlag, August 1993. Also available as University of Maryland Technical Report CS-TR-3076 and UMIACS-TR-93-42.

[5] Raja Das, Mustafa Uysal, Joel Saltz, and Yuan-Shin Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. Journal of Parallel and Distributed Computing, 22(3):462-479, September 1994. Also available as University of Maryland Technical Report CS-TR-3163 and UMIACS-TR-93-109.

[6] Juan Miguel del Rosario, Rajesh Bordawekar, and Alok Choudhary. Improved parallel I/O via a two-phase run-time access strategy. ACM Computer Architecture News, 21(5):31-38, December 1993.

[7] Christos Faloutsos, M. Ranganathan, and Yannis Manolopoulos. Fast subsequence matching in time-series databases. In Proceedings of the 1994 ACM SIGMOD Conference, May 1994. Also available as University of Maryland Technical Report CS-TR-3190 and UMIACS-TR-93-131.

[8] P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350-360, July 1991.

[9] David Kotz. Disk-directed I/O for MIMD multiprocessors. Technical Report PCS-TR94-226, Department of Computer Science, Dartmouth College, July 1994.

[10] Ravi Ponnusamy, Yuan-Shin Hwang, Joel Saltz, Alok Choudhary, and Geoffrey Fox. Supporting irregular distributions in FORTRAN 90D/HPF compilers. Technical Report CS-TR-3268 and UMIACS-TR-94-57, University of Maryland, Department of Computer Science and UMIACS, May 1994. Submitted to IEEE Parallel and Distributed Technology.

[11] Ravi Ponnusamy, Joel Saltz, and Alok Choudhary. Runtime-compilation techniques for data partitioning and communication schedule reuse. In Proceedings Supercomputing '93, pages 361-370. IEEE Computer Society Press, November 1993. Also available as University of Maryland Technical Report CS-TR-3055 and UMIACS-TR-93-32.

[12] Alan Sussman, Gagan Agrawal, and Joel Saltz. A manual for the multiblock PARTI runtime primitives, revision 4.1. Technical Report CS-TR-3070.1 and UMIACS-TR-93-36.1, University of Maryland, Department of Computer Science and UMIACS, December 1993.