CMMD I/O: A Parallel Unix I/O

Michael L. Best, Craig Stanfill, Adam Greenberg, Lewis W. Tucker

Thinking Machines Corporation, 245 First Street, Cambridge, MA 02142
Abstract

We propose a library providing Unix file system support for highly parallel distributed-memory computers. CMMD I/O supports Unix I/O commands on the CM-5 supercomputer. The overall objective of the library is to provide the node-level (SPMD) parallel programmer with routines for opening, reading, and writing files, and so forth. The default behavior mimics standard Unix running on each node; individual nodes can independently perform file system operations. New extensions to the standard Unix file descriptor semantics provide for co-operative parallel I/O. New functions provide access to very large (multi-gigabyte) files.

Index Terms -- File systems, I/O systems, parallel supercomputing, parallel UNIX, message passing, Connection Machine.
1 Introduction

Want to start an argument? Get together a room of parallel-processing aficionados and propose this problem: You have N processing nodes and you're running on each node the canonical "Hello World" program:

    main() { printf("Hello World\n"); }

What do you see on stdout? Half the crowd will insist you should see "Hello World," printed out once:

    Hello World

Half the crowd will demand that you see N copies of "Hello World" on the display:

    Hello World
    Hello World
    Hello World
    Hello World
    ....

Connection Machine® is a registered trademark of Thinking Machines Corporation. CM, CM-2, CM-5, and DataVault are trademarks of Thinking Machines Corporation. Unix is a registered trademark of AT&T.
And in any crowd there's always one fringe thinker who will require this behavior: N copies of "H", followed by N of "e", and so on:

    HHHHHHHHHeeeeeeeeellllllllllllllllll..

What's our point? Even with years of experience porting Unix file systems onto parallel machines, it's clear that a consensus on the desired behavior of application-parallel Unix I/O has yet to develop within the computer science community [6]. A number of proposals have, however, been presented in the literature, including [1], [2], [4], and [5]. In this paper, we will first overview the CM-5, its high-speed I/O devices, and the CMMD library in general. Next we will discuss some of the major issues in designing a parallel Unix I/O system. We will overview the four major I/O modes provided by the library: Local, Synchronous Sequential, Synchronous Broadcast, and Independent. We will consider buffered and standard file I/O, error handling, and finally support for large files.

1. Author's e-mail address: [email protected]

2 CM-5 Overview (adapted from [7])

A CM-5 system may contain tens, hundreds, or even thousands of parallel processing nodes, each based on industry-standard RISC microprocessor technology. Each processing node may be augmented with a special high-performance hardware arithmetic accelerator that uses wide data paths, deep pipelines, and large register files to improve peak performance. The processing nodes are supervised by a control processor, which runs an enhanced version of the Unix operating system. The control processor consists of a standard RISC microprocessor, associated memory and memory interface, and perhaps I/O devices such as local disks and Ethernet connections. A system administrator may divide the parallel processing nodes into groups, known as partitions. There is a separate control processor, known as a partition manager, for each partition. Each user process executes on a single partition, but may exchange data with processes on other partitions. Since all partitions utilize Unix timesharing
and security features, each allows multiple users to access the partition while ensuring that no user's program interferes with another's. Every control processor and parallel processing node in the CM-5 is connected to two scalable interprocessor communication networks. Any node may present information to the networks, tagged with its logical destination, for delivery via an optimal route. The two interprocessor communication networks are the Data Network and the Control Network. In general, the Control Network is used for operations that involve all the nodes at once; the Data Network is used for bulk data transfers where each item has a single source and destination. Specifically, the Control Network handles broadcasting, combining operations such as reductions and parallel prefix, global bit operations such as logical OR, and barrier synchronization. The Data Network supports general patterns of communication and a variety of flexible message protocols, such as an interrupt protocol, where the receiving node is interrupted when a data item arrives, or a receiver-polls protocol, in which each processor checks its Data Network receive FIFO for incoming messages. The CM-5 runs a Unix-based operating system and provides its own high-speed parallel file system as well as access to ordinary Sun Network File System (NFS) file systems. A special I/O interface supports high-performance parallel mass storage devices such as the DataVault disk farm or Scalable Disk Array. The Scalable Disk Array system is an extremely high performance, highly expandable RAID disk storage system. Capacity ranges from 9 Gigabytes to 3.2 Terabytes, with sustained transfer rates of 12 Mbytes/s to 4.2 Gbytes/s. The size of the Scalable Disk Array can be expanded in both capacity and transfer rates independent of the number of processing nodes.
Just as every partition is managed by a control processor, every I/O device is managed by an input/output control processor (IOCP), which provides the software that supports the file system, device driver, and communication protocols. Like partitions, I/O devices and interfaces use the Data Network and the Control Network to communicate with processes running in other parts of the machine.
3 CMMD Overview

The CM-5 supports a variety of programming models, one of which is the node-level model. In this approach, a single program is replicated across all processing nodes; upon program start-up each node runs completely independently of each other node and can branch to wholly different sections of the initial program. The node-level programmer is fully responsible for all load balancing, memory allocation, inter-processor synchronization, communication, and so forth. (This is in contrast to the data parallel model, wherein the program is single-threaded and issues of load balancing, synchronization, and the like are not dealt with directly by the user.)
The CMMD library is a collection of support routines for the node-level programmer. It provides a wide array of message-passing and synchronization methods, reduction operations, and support for some Unix system calls and library functions. The CMMD library allows a variety of means of access to the control processor or partition manager on the CM-5; the program running on the control processor can be provided by the user or by the CMMD library itself. The favored approach for CMMD I/O is for the control processor to be exclusively under CMMD library control. This is referred to as the hostless model and is the only approach which will be detailed in this paper. (In common usage, the control processor is often referred to as the "host.") In hostless development, the user writes a program for the nodes in a favorite programming language and embeds CMMD library calls where desirable. The program is compiled to an object module using a traditional compiler. The final link stage, however, is performed by a special CMMD linker which takes this single program and produces a version to run on each of the processing nodes of the CM-5. A separate CMMD-provided program runs on the control processor; this program helps service many of the I/O requests described in this paper.
4 Design Goals for a Parallel Unix I/O

An important model underlying CMMD I/O is what we call the process model. Each node may be executing different code at any tick of the clock; we say each node represents an independent thread. These threads, taken together, represent a single process running on that partition. The threads are "heavyweight"; that is, they maintain their own program counter and their own address space, and operate independently of one another. However, as a single process they share things as well: they have a single current working directory, they share a process id and a group id, and to some degree they share open file descriptors. What does it mean for a thread to acquire a file descriptor? Is this an independent or co-operative event? As with most hard questions, the answer is that both behaviors are desirable; users require both node-independent and co-operative I/O in application-parallel programs. Some of the major design goals we considered when developing CMMD I/O are:

4.1 Independent operations

Individual nodes should be able to independently open a file, read, write, and carry out basic file system activities without interference from other nodes. Nodes must maintain their own local list of file descriptors and seek pointers. In this model, each thread looks to some degree as if it were executing on its own workstation, with all nodes networked to a common file server. So, for instance, a few threads could open the same file, read and write to it, and only notice each other's activities to the extent that their writes were conflicting. Any co-ordination between nodes when using locally opened descriptors should be minimized or avoided altogether.

4.2 Co-ordinated operations

Parallel computers bring together a large collection of processors, working in co-ordination, to reduce the time-to-solution of a single problem. An application-parallel I/O system needs to support these efforts with operations which involve all the processing elements working in useful co-operation. For instance, a co-operative mode could map onto the data parallel programming model, wherein typically a single data structure (say an array) spans a number of processing nodes. We would like to deal with this distributed array in a single co-ordinated fashion. Further, the I/O system must support co-ordinated modes which take advantage of the high-bandwidth parallel devices offered for the system.

4.3 Unix look-alike

To facilitate the porting of serial applications to a parallel machine, it is desirable that standard Unix operations on files behave just as they would had they been executed on a normal workstation. Their default behavior should be that of a standard Unix file system operation and should behave in the "expected" serial way when placed on a node. In this way an application programmer could start by compiling the serial program to run "as is" on each node. Then they could selectively change areas of the program which exhibit a high degree of parallelism. In such areas co-ordinated modes of I/O would probably be most appropriate.

4.4 Language independent design

Parallel extensions to Unix I/O need a consistent design across the popular programming languages in which they will be embedded. For instance, the treatment of co-ordinated modes associated with unit numbers in F77 should be similar to co-ordinated modes associated with file descriptors or streams in C.

4.5 Efficient scalability with number of nodes

Our extensions to Unix I/O should scale well with the number of processing nodes added to the system.
As the number of processing nodes increases, we need to take advantage of their power during I/O. Further, we know that all of these nodes might often be accessing the same disks or file server; we need to ensure that as the number of nodes grows, we do not overly tax any shared resources. (A good question to consider is: How will this design scale to a 1024 node system and beyond?)

4.6 Support for big files

People often turn towards parallel systems because they are the only systems which can accommodate their large computational problems. From the initial design, a parallel Unix I/O system must be prepared to deal with very large files, in the multi-gigabyte range.
4.7 Device independent design

When you open and write to a file on a standard workstation you rarely need concern yourself with what sort of device you are writing to. For instance, the operation of writing to an optical disk will appear identical to writing to a local SCSI disk, which in turn looks identical to writing to a remote NFS-mounted disk. With the addition of special high-speed parallel disk systems, this device independence should be preserved. Opening a file on a DataVault or Scalable Disk Array should appear to the coder as an operation identical (though obviously requiring an identifying pathname) to opening a file on a normal disk attached to the control processor. This is in contrast, for example, with the CM-2 I/O system, which had a unique space of commands for working on the DataVault [8].
5 I/O Modes

We satisfy many of the stated goals by one major addition to Unix I/O: an I/O mode is now associated with each and every file descriptor. There are four available I/O modes: Local, Synchronous Sequential, Synchronous Broadcast, and Independent. Note that the latter three modes are collectively called global modes, because they involve a global open across all nodes. For descriptive ease, we will show only the interface to the C programming language.

5.1 Local mode

Consider a program with some I/O, taken from a serial workstation and compiled for the CM-5. This program should exhibit behavior identical to that seen on the serial workstation. This "workstation-like" I/O mode is called LOCAL mode. An open system call, when made on a node, is non-cooperative and non-synchronizing, and produces a file descriptor independent of all other nodes. Once a file descriptor is opened in LOCAL mode it may not change to any other mode; throughout its lifetime it will remain associated with the LOCAL mode. The default semantics of open, creat, fopen, etc. is always to return LOCAL descriptors. A typical use of LOCAL mode is demonstrated by the following code fragment:

    if(error_condition == TRUE) {
        int fd = open("Node_0_Crashfile", O_WRONLY | O_CREAT, 0666);
        /* write crash information to fd */
        close(fd);
        exit(0);
    }

Here some node may have encountered an error condition. We would like this node to be able to open a file and write out crash information completely independent of the actions of any other node. LOCAL mode provides this ability. While this mode has uses and may facilitate porting, it also has important (and perhaps unwanted) performance characteristics. Consider the following program placed on N nodes of a CM-5:

    main()
    {
        int fd;
        fd = open("foo", O_RDWR, 0666);
    }

Executed on all N nodes, this represents N opens of the same file, each returning a file descriptor. An important characteristic of the process model is that all file descriptors come from the same pool of available descriptors for that process. Further, since we have a single file server for the entire process across all N threads, there is a potential performance issue. The above program, run on a 1024 node partition, would represent 1024 calls to open, all against a single pool of available file descriptors and all taxing a single file server. Even if a kernel were configured to allow 1024 open descriptors, strange would be the serial program that tried to open that many. Nevertheless, by replicating this open 1024 times across each node, that is exactly the behavior we would see. Note that while all file descriptors originate from a single pool, they are not "shareable" from node to node. You may not communicate file descriptors from one node to another; they will have no meaning to other nodes. This is a general characteristic of file descriptors in any mode.

5.2 Synchronous Sequential mode

We have seen how the LOCAL mode supports one of our critical design goals: it mimics traditional Unix I/O behavior. The SYNC_SEQ mode solves another of our critical design goals: it supports co-operative and high-bandwidth parallel I/O operations. Consider a problem where a single array is stored across a number of processing elements and computation on this aggregation takes place in parallel, with occasional communication between nodes of boundary values (this will be recognized as a "conventional" means of programming machines like the CM-5). We would like to be able to seed this array with stored values, store out this array throughout the computation (perhaps to checkpoint), and finally store out an end solution.
We could open N files (for our N nodes) and write out each node's contribution to the array into its own file. If N is small enough, we might have enough file descriptors to do this. However, we certainly should expect the file system to take a beating given that it will have to service N independent requests to open, write, etc. With SYNC_SEQ mode only one request is made of the file server during the open. Subsequent writes in SYNC_SEQ mode will be co-operative and allow the file system to perform optimal parallel operations. Since open usually returns file descriptors in LOCAL mode (and once LOCAL, a descriptor's mode cannot change), how can we ever get a descriptor in some other mode, specifically SYNC_SEQ? Consider the following fragment:

    fd = CMMD_global_open("foo", O_RDWR | O_CREAT, 0666);
    CMMD_set_io_mode(fd, CMMD_sync_seq);
    write(fd, array_section, size);

The CMMD I/O library provides two new functions: CMMD_global_open and CMMD_set_io_mode. CMMD_global_open, by default, returns file descriptors in INDEPENDENT mode (we will learn about this mode below). However, we can happily change a file descriptor between any of the global modes. CMMD_global_open takes the same arguments as the normal Unix open. CMMD_set_io_mode's arguments are an open file descriptor and an I/O mode (kept as an enumerated type). Note that CMMD_global_open is a synchronizing call; all nodes must participate and they must pass the same arguments. However, CMMD_set_io_mode is not synchronizing, and we will see the reason when discussing INDEPENDENT mode (when changing to a synchronous mode, however, if all nodes do not participate then you are asking for trouble). See section 6 for an alternative means of opening global files. In the above fragment we are placing fd into SYNC_SEQ mode. A subsequent write to a file descriptor in this mode will be a synchronous operation. Thus, the write must be called on all nodes; failure to do so may result in deadlock or other catastrophe. In a SYNC_SEQ write, each node contributes some amount of data in processor order. That is, first node 0 writes some data to the file, then node 1, and so forth. The amount of data may be different across the nodes (i.e., size above need not be identical). A node wishing not to contribute to a write still needs to make the system call; however, it may contribute zero data. A read works in reverse: node 0 receives its requested part of the file first, node 1 receives the next section, and so forth. All operations on a SYNC_SEQ file descriptor are co-operative and synchronous; argument consistency is checked across all nodes. For instance, if an fcntl is called on one node with a SYNC_SEQ descriptor it is required that all nodes call the fcntl.
Further, they all must pass identical arguments to the function. “Identical” when applied to the file descriptor itself means that all the descriptors must originate from a matching open. (Since each node keeps its own list of open file descriptors, matching calls to CMMD_global_open may not return identical descriptor values across all nodes.) The handling of co-operative seek pointers is simple when a file descriptor keeps to a single mode. Since operations like lseek are co-operative with argument consistency checked, all nodes must request an identical position within the file to seek to for it to succeed. Functions like read and write start at the current seek pointer and leave the seek pointer at the end of the last visited byte; in other words, at the byte last visited by the (N-1)st processing node. The SYNC_SEQ mode supports parallel algorithm developers through co-operative access to shared file descriptors. Further, it allows for high-bandwidth access to parallel I/O devices. For instance, if a pathname passed to the CMMD_global_open happens to reside on a Scalable Disk Array,
subsequent SYNC_SEQ reads or writes to that file descriptor will be very high-bandwidth parallel operations.

5.3 Synchronous Broadcast mode

CMMD I/O provides a second co-operative mode as a convenience to the user: the SYNC_BC mode. All uses of SYNC_BC file descriptors within system calls are synchronizing and co-operative, and argument consistency is checked. A SYNC_BC read is as if a single node performed the entire read and broadcast the value to every other node. A write is as if a single node performed the entire write of its buffer. Unlike SYNC_SEQ mode, wherein the size of each node's buffer could be different, in SYNC_BC the size must be fixed across all of the nodes. Depending on the level of safety a user is executing under, the actual buffer of data across each node can be checked for consistency as well; if consistency is not maintained, then the values written are ill-defined. This mode is especially useful when reading header information out of a file or when printing co-operative messages to standard out. In the following example, every node reads in some common header information -- it is the seek pointer position at which to begin a SYNC_SEQ read.

    fd = CMMD_global_open("foo", O_RDONLY);
    CMMD_set_io_mode(fd, CMMD_sync_bc);
    read(fd, &seek_pos, sizeof(long));
    lseek(fd, seek_pos, SEEK_SET);
    CMMD_set_io_mode(fd, CMMD_sync_seq);
    read(fd, buffer, size);

Note that every use of the file descriptor fd above (except for CMMD_set_io_mode) is synchronous and must be called on all of the nodes.

5.4 Independent mode

The INDEPENDENT mode has much more in common with LOCAL mode than it does with its fellow global modes. In particular, all INDEPENDENT descriptors have an individual seek pointer from node to node, and the use of these file descriptors is neither synchronous nor co-operative. However, all other state associated with a file descriptor is shared between nodes.
For instance, if one node performs an ioctl and changes the blocking mode for itself, it will change the blocking mode for the corresponding independent descriptor on every other node. From the process's perspective, INDEPENDENT mode represents a single descriptor with N seek pointers distributed to each of the N threads. This mode is particularly useful when you wish to perform independent file access on each node to the same file. Since an INDEPENDENT descriptor results from a global open, it represents only a single open request to the file server. Consider the following code fragment, where every node reads from a random byte in a file. The seek is performed in SYNC_BC mode; if it were executed on all nodes under INDEPENDENT mode it would represent N requests to the file server for a seek:

    fd = CMMD_global_open("foo", O_RDONLY);
    CMMD_set_io_mode(fd, CMMD_sync_bc);
    length = lseek(fd, 0L, SEEK_END);
    /* every node gets a different number */
    my_pointer = rand() % length;
    CMMD_set_io_mode(fd, CMMD_independent);
    lseek(fd, my_pointer, SEEK_SET);
    read(fd, buffer, size);

Recall that the default mode for CMMD_global_open is INDEPENDENT. Consider the final two lines of code in the above example. What happens if two seek operations from differing nodes are interleaved before the read is executed? In other words, could some other node seek the file pointer away from a node's desired my_pointer value before this node has a chance to perform the read? In CMMD I/O we guarantee that all read operations are cognizant of their node's independent file pointer and will perform the read from the desired location (note that this may require a seek accompanying some reads). Another very powerful use of this mode is to allow a node to "sneak" into INDEPENDENT mode to perform some special operation, say to print out an error message.

    if(error_condition == TRUE) {
        CMMD_set_io_mode(STDERR, CMMD_independent);
        write(STDERR, error_message, size);
        CMMD_set_io_mode(STDERR, CMMD_sync_bc);
    }

At the bottom of this code fragment, we reset stderr to SYNC_BC because we assumed that this was the mode the other nodes were in. Woe to the program which tries to execute a synchronous operation on a descriptor when all of the nodes are not in the same mode! What happens in the following example to the seek pointer?

    /* mode of fd is currently SYNC_BC */
    lseek(fd, 0L, SEEK_SET);
    read(fd, buffer, size);
    if(this_node == 0) {
        CMMD_set_io_mode(fd, CMMD_independent);
        lseek(fd, 0L, SEEK_END);
        write(fd, buffer, size);
        CMMD_set_io_mode(fd, CMMD_sync_bc);
    }
    write(fd, new_buffer, new_size);

Here, every node has read from the beginning of the file. Node zero, however, has then slipped into INDEPENDENT mode and written to the end of the file. What happens to the seek pointer; specifically, where does the final write occur?
If all nodes had remained in SYNC_BC mode, the seek pointer would be at the byte right after the last byte which all of the nodes read. However, here node zero has moved its seek pointer to the end of the file. In CMMD I/O the rule applied is that all synchronous uses of a file descriptor start with their
seek pointer at the location obtained by taking the maximum of all the nodes' local seek pointers for that descriptor. At the start of our final write above, the maximum of all seek pointers is taken -- all the pointers are identical except for node zero's, which is at the end of the file -- and a seek is performed to this maximum. So, the final write above would occur at the very end of the file. We were motivated towards these semantics because it is the behavior most often wanted. If a subset of nodes performs some independent operations, and then the nodes renew co-operative use of the descriptor, we would like the seek pointer to reflect the totality of those independent operations. Specifically, we don't want to write over or reread the efforts of the nodes working in independent mode. This rule of maximizing local seek pointers when performing subsequent global operations also applies to dup'd file descriptors where one descriptor has the INDEPENDENT mode associated with it and the other has a synchronous mode associated with it. For instance:

    CMMD_set_io_mode(fd, CMMD_sync_bc);
    fd_ind = dup(fd);
    CMMD_set_io_mode(fd_ind, CMMD_independent);
    if(this_node == 0)
        lseek(fd_ind, 0L, SEEK_END);
    length = lseek(fd, 0L, SEEK_CUR);

After fd is dup'd it shares its seek pointer with the file descriptor fd_ind. When fd_ind becomes independent, node zero is allowed to perform operations on its own without involving the other nodes. Node zero seeks to the end of the file. When all nodes perform a SYNC_BC lseek to find the current position of the seek pointer, a max is performed across all local seek pointers. The shared seek pointer between fd and fd_ind at node zero is at the end of the file. Thus, the final lseek operation will show the current seek pointer for fd to be at the end of the file.
6 Default Open Mode

In all of the above code examples we used CMMD_global_open any time we wished to open a global file. Unhappily, since the open in Fortran is a statement and not a function, we cannot create a CMMD_global_open for Fortran. For this reason, we allow you to change the default mode associated with the normal open. This works in C as well as Fortran. Consider this example:

    CMMD_set_open_mode(CMMD_sync_seq);
    /* All subsequent open's will be SYNC_SEQ */
    fd = open("foo", O_RDONLY);

The descriptor fd above is opened in SYNC_SEQ mode. Note that we do not recommend this approach to the C programmer -- who wants to mess around with global dynamic state? Instead we recommend the use of CMMD_global_open.
7 Buffered I/O and Standard Files

Streams are C data structures used for buffered I/O. One of the things they bundle up is a file descriptor. As usual, a new I/O mode is associated with this descriptor and therefore with the stream itself. As a coding convenience, however, we have provided means to globally open streams, change a stream's mode, and so on (see Overview of New Functions below). We currently do not fully support the use of SYNC_SEQ or SYNC_BC mode with buffered I/O. Since we are relying on the C runtime library to provide the buffered I/O code, we do not control the timing of when a buffer spills and actually performs a read or write. Consider this concrete example: We are performing buffered writes to a line-buffered device; thus every time we see a newline we're going to perform a write. This stream is in SYNC_SEQ mode, so those writes will attempt to synchronize. If only some nodes require a write, then they will idle awaiting synchronization. The other nodes, which presumably did not encounter a newline, could easily try some operation which attempts communication across the network. With several nodes already awaiting a pending synchronization, this will cause collisions in the combine network. This problem clearly will respond to treatment; it just means recoding the buffered I/O routines to co-ordinate on all reads and writes. As we have seen in an example, our new modes are associated with stdin, stdout, and stderr, just as they are associated with descriptors and streams opened by users. The only difference is in their default open mode; while user descriptors default to LOCAL or INDEPENDENT mode (depending on which open is called), the standard files always begin in SYNC_BC mode. This is the result of performance considerations on our part, not capriciousness. Some language start-up code, in particular Fortran's, makes a number of Unix system calls on the standard files before dispatching to user code.
If the standard files were initially opened in INDEPENDENT mode (what we initially implemented), then each of these start-up system calls would represent N requests to the file server and would degrade start-up performance greatly. (You'll notice that we've finally answered our initial question posed by the "Hello World" program -- in CMMD I/O you will see a single Hello World.)
8 Error Handling

By and large, error handling proceeds unchanged within CMMD I/O. We have, however, added a few new error types and have made perror cognizant of these.
• ECMMDBADMODE -- The file descriptor passed into this I/O operation is associated with a mode which is not allowed. For instance, a LOCAL descriptor may not be passed into CMMD_set_io_mode.
• ECMMDINCONSISTNT -- The arguments to a global I/O function are not the same across all nodes.
• ECMMDNOMATCH -- A global I/O function has been called on each node; however, the function called is not the same on every node.
9 Overview of New Functions

Here we give the prototyped declarations for the new functions related to I/O modes. Most of these have been seen above; some are new.

• int CMMD_set_open_mode(CMMD_file_mode_t io_mode) -- Change the behavior of Unix open to the mode given by io_mode.
• CMMD_file_mode_t CMMD_get_open_mode(void) -- Returns the current default open mode.
• int CMMD_set_io_mode(int fd, CMMD_file_mode_t io_mode)
• int CMMD_fset_io_mode(FILE *stream, CMMD_file_mode_t io_mode) -- Change the file mode on an already opened descriptor or stream.
• CMMD_file_mode_t CMMD_get_io_mode(int fd)
• CMMD_file_mode_t CMMD_fget_io_mode(FILE *stream) -- Accessor functions for the I/O mode of an already opened descriptor or stream.
• int CMMD_global_open(char *path, int flags, int mode)
• FILE *CMMD_global_fopen(const char *filename, const char *mode) -- Open a file in global INDEPENDENT mode.
10 Support for Large Files

People turn to parallel systems (at least historically) because their problems are big. A parallel I/O system needs to be designed with this in mind and should efficiently and conveniently support access to very large files. In CMMD I/O this is accomplished with the addition of a seek pointer of type double and a few new functions which take these large seek pointers. We would have preferred to use true 64-bit integers rather than storing our large seek pointers in a double. Unhappily, there is not yet full support for, or agreement on, the use of the long long type (it is on the agenda of ANSI's Numerical C Extension Group subcommittee). We intend to use seek pointers of type double temporarily until 64-bit integers are widely available. It is possible (given IEEE-compliant double formats) to store integral values up to 2^52. Whenever we encounter a double seek pointer we treat it as the integral value resulting from a truncation (that is, we round towards zero). NaN's, INF's, denormalized numbers, and their ilk result in errors. The main motivation towards the use of doubles, in contrast to using two 32-bit integers for instance, is that you can use doubles with the normal arithmetic, conditional, and assignment operators. It is easy to appreciate this advantage just by considering something as simple as an equality comparison between two large seek pointers. This is contrasted to storing the pointer as a structure with a most-
and least-significant integer, as is described in [1]. We provide the following functions, all of which take double seek pointers:
• double dlength(int fd) -- Returns the length of the file in bytes as a double.
• double dseek(int fd, double offset, int whence) -- Identical to lseek but takes a double seek pointer and returns the file position as a double.
• double dtell(int fd) -- Identical to dseek(fd, 0.0, SEEK_CUR).
• int dftruncate(int fd, double length) -- Truncates the file to the number of bytes specified by length. Length is first rounded towards zero.
11 Conclusion

We have described CMMD I/O, a new library which supports parallel UNIX I/O on the CM-5. CMMD I/O meets a variety of design goals. These include support for both independent and co-operative methods of I/O, ease of porting from serial computers, scalability with the size of the parallel machine, and support for large files. The majority of these goals are met by the simple addition of I/O modes associated with file descriptors. We described the single local and three global I/O modes which exist in CMMD I/O. Large files are supported by a few new I/O system calls which take double-precision seek pointers. The immediate intent of CMMD I/O is to provide the node-level (SPMD) CM-5 programmer with effective and efficient library support for I/O. We also offer CMMD I/O as a contribution to the growing discussion of how to provide the application-parallel programmer with UNIX-like I/O facilities.
References

[1] Crockett, Thomas W. File Concepts for Parallel I/O. Proceedings of Supercomputing '89, 1989.
[2] DeBenedictis, Erik and Peter Madams. nCUBE's Parallel I/O with Unix Compatibility. Eleventh Annual IEEE International Phoenix Conference on Computers and Communications, 1992.
[3] Intel Corporation. iPSC/2 User's Guide (Preliminary). December 1989.
[4] Kotz, David. Multiprocessor File System Interface.
[5] Russell, Channing H., and Pamela J. Waterman. Variations on UNIX for Parallel-Processing Computers. Communications of the ACM, 30(12):1048-1055, December 1987.
[6] Test, Jack A. Parallel Unix: Up in the Air. Unix Review, 7(4):64-71, April 1989.
[7] Thinking Machines Corporation. CM5 Technical Summary. January 1992.
[8] Thinking Machines Corporation. Connection Machine I/O System Programming Guide. 1991.