CENTER FOR ADVANCED COMPUTING RESEARCH
California Institute of Technology
Scalable I/O Initiative Technical Report

CACR-128

November 1996

Implementation and Evaluation of Collective I/O in the Intel Paragon Parallel File System

Rajesh Bordawekar
California Institute of Technology

ABSTRACT

A majority of parallel applications obtain parallelism by partitioning data over multiple processors. Accessing distributed data structures like arrays from files often requires each processor to make a large number of small non-contiguous data requests. This problem can be addressed by replacing small non-contiguous requests with large collective requests. This approach, known as Collective I/O, has been found to work extremely well in practice [BdC93, Kot96, SCJ+95]. In this paper, we describe the implementation and evaluation of a collective I/O prototype in a production parallel file system on the Intel Paragon. The prototype is implemented in the PFS subsystem of the Intel Paragon Operating System. We evaluate the collective I/O performance by comparing it with the PFS M_RECORD and M_UNIX I/O modes. We observe that collective I/O provides a significant performance improvement over accesses in M_UNIX mode. However, in many cases, various implementation overheads cause collective I/O to provide lower performance than the M_RECORD I/O mode.

Implementation and Evaluation of Collective I/O in the Intel Paragon Parallel File System

Rajesh Bordawekar
Center for Advanced Computing Research, California Institute of Technology
1200 E. California Blvd., Pasadena, CA 91125
[email protected]

1 Introduction

It has been widely known that the task of getting data into and out of parallel systems poses the biggest obstacle to effective use of the available computing power. Traditionally, large scientific problems like weather forecasting or aircraft simulations exhibit large I/O requirements. Furthermore, parallel machines are being increasingly used for non-scientific applications like information processing, hypertext and multimedia systems, and databases, which also require processing of large quantities of data. It is, therefore, essential that parallel machines acquire significant I/O capabilities. Many existing parallel machines approach this problem by providing parallel I/O environments. Generally, a parallel I/O environment consists of a parallel file system and a user interface [TMC92, Div93]. The parallel file systems improve the I/O performance by striping data over multiple disks, and the user interfaces provide special I/O modes that allow multiple processors to simultaneously access a file. Unfortunately, most of the parallel file system interfaces are derived from UNIX. As a result, these interfaces often fail to express the access patterns generated in parallel workloads [KN94]. This often leads to situations where processors are forced to make a large number of small non-contiguous data requests. It has been shown that in such cases, a technique called Collective I/O can be effectively used to improve the I/O performance [BdC93, Kot94].

The primary focus of this work is to provide file system support for collective I/O. In this paper, we describe the design and evaluation of a collective I/O prototype in the Intel Paragon Parallel File System. The implementation was done by modifying the code of a production parallel file system and was evaluated on real machines. Thus, all the overheads that come with production software, which are neglected in simulations, are also present in our software. We believe ours is the first implementation of collective I/O in a production file system. The prototype is in the development stage, and in this paper we present initial experiences and evaluation results.

The rest of the paper is organized as follows. In Section 2, we describe collective I/O in detail and analyze two known collective I/O techniques. A brief description of the Paragon Operating System and the Paragon PFS is presented in Section 3. This is followed by the implementation details of the collective I/O prototype in Section 4. First, a new collective I/O file system interface is introduced and illustrated using an example. Later, various issues in implementing the collective I/O system calls are discussed, and the different costs incurred in the implementation are analyzed. Experimental evaluation of the collective I/O prototype on two Paragon systems is presented in Section 5. In Section 6, we propose extensions to the collective I/O interface for supporting complex data access patterns. In Section 7, we overview related work, and we summarize our conclusions in Section 8.

2 Collective I/O

2.1 Need for Collective I/O

Consider a simple operation of accessing an array from a file. To perform this access using the standard UNIX programming interface, a user would specify a file descriptor, the amount of data to be accessed, and a pointer to the data buffer. From the file system's point of view, any I/O operation is a read or write of a specified amount of data from a given offset. The file system doesn't know that the data to be accessed is part of a high-level data structure like an array. The fundamental problem is that file systems and user programs work in completely different spaces. In the user (program) space, computation is done using entities like arrays, which have a type and a dimension. Traditionally, in the file space, the only active entity is a file, which is a linear sequence of bytes (1). A user has to explicitly translate a computation involving a program entity (e.g., arrays) into an operation in file space (i.e., access n bytes from the given file offset) (2). In this process, valuable semantic information about high-level data structures (e.g., array type and dimension) is lost [Kot96]. Unfortunately, most of the existing parallel machines support extensions of the standard UNIX I/O interface [TMC92, Div93].

In a majority of parallel applications, parallelism is obtained by partitioning arrays over multiple processors in various distribution patterns (3). In parallel applications that use the Single Program Multiple Data (SPMD) paradigm, each processor has a local array which is conceptually a sub-array of the logical global array. For many distribution patterns, in order to read its local array from a shared file, each processor has to make several small non-contiguous read requests, thus degrading the I/O performance. Note that in this case, in addition to the array type and dimension, the distribution information is also lost.

(1) We follow standard UNIX semantics and restrict ourselves to binary files. (2) Memory-mapped files do not require such translation. (3) All regular distribution patterns are special cases of the BLOCK-CYCLIC(k) distribution pattern.


Figure 1: (A) Row-Block Distribution over 4 processors. (B) Corresponding file mapping.

Information about the array dimension and distribution can be used to reduce the I/O cost. To illustrate this point, consider an integer array A(4,4) distributed in ROW-BLOCK fashion over 4 processors (Figure 1). Assume that the array is stored in column-major order in a file.


Using the standard interface, to read its local array, each processor has to make 4 separate I/O requests, where each I/O request reads one integer from the file at a non-contiguous position. Therefore, to read 16 integers, 16 separate I/O requests are required. If the file is striped across multiple disks, each I/O request may result in a separate disk request. As a result, the operation of reading the entire array A will require a significant amount of time.

Consider an alternative approach to reading array A. Now, instead of reading its own data, each processor reads a distinct column of array A and then distributes the data among the processors. In this approach, each processor makes a single large I/O request to read a non-overlapping, contiguous data chunk. This approach requires fewer I/O requests than the earlier one, which in turn improves the I/O performance [BdC93]. To implement the second approach, each processor needs to know what data is required by the other processors. Each processor, therefore, needs to know the global data distribution pattern. Also, each processor needs to know how the data is ordered in the file so that it can decide what to read and how to read it. It should be noted that in this approach, the data read by each I/O request is required by more than one processor. Such a request can be termed collective, since it (partially) satisfies the requirements of several processors in a single operation. Similarly, each processor modifies its access pattern to satisfy the requirements of other processors; therefore, the overall approach can be called a collective approach.

The term collective is usually used to describe communication patterns [Mes94, For93]. A collective operation can be defined as an operation which requires the participation of one or more entities (e.g., processors) for completion, and whose result may affect all the participating entities. Examples of collective operations include global reduction operations like HPF MAX and communication patterns like broadcast (one-to-many communication), both of which require the participation of more than one processor for completion. The collective I/O counterpart can be defined as follows.

Definition: Collective I/O is a technique in which system entities (e.g., compute clients or I/O servers) share the (global) data access and layout information, and make coordinated, non-overlapping, and independent I/O requests in a conforming manner.

Global data access information includes array dimension and distribution information, while data layout information includes information about the file storage order and the file striping strategy. A conforming access pattern consists of one or more accesses where each access reads or writes data at consecutive locations in a file or on disk. In other words, the order in which data is fetched by each access matches the data storage pattern (in the file or on disk). An interface which supports collective I/O is called a collective I/O interface. Such an interface would logically coalesce individual I/O requests and generate hints about the collective request which can be used for optimizing I/O. Collective I/O can be supported by both runtime libraries and file systems.
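To make the contrast concrete, the following C fragment sketches the two access styles for the A(4,4) example using only the standard UNIX interface. It is our own illustration (the layout constants are assumptions), and the final exchange of columns among processors in the collective case is elided.

#include <sys/types.h>
#include <unistd.h>

#define N 4   /* A(4,4) integers, stored column-major in the file */

/* Naive access: processor `me` reads its ROW-BLOCK piece one element at a
   time.  Each of the 4 processors issues 4 tiny non-contiguous requests,
   i.e., 16 requests in all to read 16 integers. */
void naive_read_row(int fd, int me, int row[N])
{
    for (int j = 0; j < N; j++) {                       /* element A(me, j) */
        off_t off = (off_t)(j * N + me) * sizeof(int);  /* column-major offset */
        lseek(fd, off, SEEK_SET);
        read(fd, &row[j], sizeof(int));
    }
}

/* Conforming access: processor `me` reads one whole column, which is
   contiguous in the file, with a single request; the columns would then be
   exchanged among the processors to build the ROW-BLOCK distribution
   (redistribution not shown).  Only 4 requests are needed in total. */
void conforming_read_column(int fd, int me, int col[N])
{
    lseek(fd, (off_t)(me * N) * sizeof(int), SEEK_SET);
    read(fd, col, N * sizeof(int));
}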

2.2 Collective I/O Implementations

In this section, we describe two different implementations of collective I/O, namely Two-phase I/O [dBC93, BdC93] and Disk-directed I/O [Kot94, Kot96].


2.2.1 Two-phase I/O

Two-phase I/O is an implementation of collective I/O in the file space, i.e., its conforming pattern matches the order in which data is stored in files. Two-phase I/O is primarily designed as a runtime optimization technique to be run on compute processors for obtaining consistent and high I/O bandwidth while accessing distributed multi-dimensional arrays stored in files striped over multiple disks. Two-phase I/O consists of two main phases: (1) data access and (2) data distribution. The two-phase access strategy uses the fact that there exists a data distribution pattern that requires the least I/O cost [BdC93]. Examples of such distributions are the COLUMN-BLOCK distribution for FORTRAN arrays and the ROW-BLOCK distribution for C arrays. The two-phase strategy uses this conforming distribution to access the data and then redistributes the data to generate the required distribution. A two-phase read first performs I/O and then distributes the data. A two-phase write executes the two phases in the reverse order.

To illustrate the method, consider the example from Figure 1. Instead of reading one integer at a time, each processor can first read four consecutive integers at a time (processor 0 reads elements 0-3, processor 1 reads elements 4-7, and so on) and then redistribute the data to obtain the ROW-BLOCK distribution. This approach provides several advantages over the naive method (i.e., the method shown in Figure 1). First, the I/O accesses match the column-major file storage order. The total number of I/O requests is now only 4, and each request reads 4 elements. As a result, a large number of small I/O requests is replaced by a small number of large independent I/O requests, each reading distinct and contiguous data. The high-bandwidth communication network can be exploited to obtain efficient data distribution; hence, the data redistribution costs are minimal. The number of processors performing I/O can be a subset of the processors over which arrays are distributed. By choosing a subset of processors for performing I/O, the granularity of I/O operations can be increased, further improving the overall I/O bandwidth. This technique, therefore, substantially reduces the I/O cost while keeping the communication cost as small as possible. Two-phase I/O was initially tested on the Intel Touchstone Delta Concurrent File System (CFS) and was found to be extremely efficient in accessing distributed arrays [BdC93]. Two-phase I/O was later implemented as the core optimization technique in the PASSION runtime system [CBD+95].

The most important advantage of Two-phase I/O is the ease of information flow. Since Two-phase I/O is designed as a runtime technique, it can use high-level array-based interfaces [CBD+95] which can provide sufficient semantic information such as (global) array type, dimension, and distribution. Since each processor has the global program view, Two-phase I/O does not require information coordination among the compute processors. Also, array-based interfaces are sufficiently expressive to cover a wide range of access patterns. What are the disadvantages of the Two-phase I/O strategy? The two-phase strategy requires extra buffer space during data communication. Also, data travels twice through the communication network, once in each phase of the operation. The amount of generated communication can pose a problem for low-bandwidth interconnects, e.g., in distributed computing environments.
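A minimal sketch of a two-phase read for the same A(4,4) example, written from the description above rather than from the PASSION sources; send_to() and recv_from() are hypothetical point-to-point helpers standing in for whatever message-passing library is available.

#include <sys/types.h>
#include <unistd.h>

/* Hypothetical point-to-point helpers (any message-passing library would do). */
extern void send_to(int node, const void *buf, long len);
extern void recv_from(int node, void *buf, long len);

#define N 4   /* A(4,4) integers, stored column-major in the file */

/* Two-phase read into a ROW-BLOCK distribution, as seen by processor `me`.
   Phase 1: read the conforming (COLUMN-BLOCK) piece, one contiguous column.
   Phase 2: redistribute so that each processor ends up with its row. */
void two_phase_read_row(int fd, int me, int row[N])
{
    int col[N];

    /* Phase 1: one large contiguous request instead of N small ones. */
    lseek(fd, (off_t)(me * N) * sizeof(int), SEEK_SET);
    read(fd, col, N * sizeof(int));

    /* Phase 2: col[p] belongs to processor p; this processor's row element
       for column q is held there as col[me].  The exchange is ordered by
       rank to keep the blocking calls deadlock-free. */
    row[me] = col[me];
    for (int p = 0; p < N; p++) {
        if (p == me) continue;
        if (p < me) {
            recv_from(p, &row[p], sizeof(int));
            send_to(p, &col[p], sizeof(int));
        } else {
            send_to(p, &col[p], sizeof(int));
            recv_from(p, &row[p], sizeof(int));
        }
    }
}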


2.2.2 Disk-directed I/O

Disk-directed I/O is an implementation of collective I/O in the disk space, i.e., its conforming pattern matches the order in which data blocks are stored on disk. Disk-directed I/O is designed as a server-side optimization technique to be executed by the I/O servers rather than by the compute processors. In Disk-directed I/O, the compute processors first coordinate the array access information, and a single compute processor broadcasts the information to the I/O servers. The I/O servers can get the file layout information from the file system data structures; therefore, the I/O servers do not perform any communication among themselves. Once each I/O server gets the global access information, it uses the file layout information to compute the list of data blocks involved in the operation. The disk blocks are then sorted by location and accessed by a disk thread spawned by the I/O server. The thread accesses as many consecutive data blocks from the disk as possible. To summarize, compute processors coordinate access information, and I/O servers make independent, non-overlapping, and sorted disk accesses. The I/O servers then use either DMA or a message-passing protocol to transfer the data to and from the user buffers.

Disk-directed I/O provides several advantages over Two-phase I/O. In Disk-directed I/O, data travels only once through the network. Since the conforming access pattern matches the disk storage order, the optimizations become independent of the file storage order (which is application dependent). In addition, the space complexity of the technique is very small, and no extra buffer space is required. One serious drawback of Disk-directed I/O is its potential to generate a large number of small messages to transfer the data from the servers to the user buffers. The communication complexity depends on several factors, which include the file layout, the amount of buffer space available per I/O server, and the global array access pattern. In some cases, Disk-directed I/O can require a very large number of small messages, which can completely overwhelm the file system. Another obvious drawback is that Disk-directed I/O has not yet been implemented in a production-grade file system. Therefore, the conclusions presented in [Kot96] may not hold for some implementations. Since Disk-directed I/O is primarily designed as a file system optimization strategy, it does not propose or depend on a particular file system interface. However, the description of Disk-directed I/O in [Kot96] assumes some sort of high-level interface which can translate global array access patterns to corresponding I/O requests (i.e., access n amount of data from the given file offset).
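The server-side step described above can be pictured with the short sketch below: build the list of blocks touched by a request, sort it by disk location, and then access the device mostly sequentially. The block-map lookup and the raw block read are hypothetical helpers; this is not code from [Kot96].

#include <stdlib.h>

/* Hypothetical helpers: map a logical file block to its physical disk block,
   and read one physical block from the attached device. */
extern long file_block_to_disk_block(int fd, long file_block);
extern void read_disk_block(long disk_block, char *dest);

#define BLOCK_SIZE 8192L          /* 8 KByte blocks, as used by the PFS servers */

struct blk { long fblk, dblk; };  /* logical file block and its disk location */

static int by_disk_location(const void *a, const void *b)
{
    long x = ((const struct blk *)a)->dblk, y = ((const struct blk *)b)->dblk;
    return (x > y) - (x < y);
}

/* Serve one request [offset, offset+len), assumed block-aligned for brevity:
   compute the block list, sort by disk location, then read each block into
   its correct place in the destination buffer. */
void serve_request(int fd, long offset, long len, char *buf)
{
    long first = offset / BLOCK_SIZE, nblocks = len / BLOCK_SIZE;
    struct blk *list = malloc(nblocks * sizeof(struct blk));

    for (long i = 0; i < nblocks; i++) {
        list[i].fblk = first + i;
        list[i].dblk = file_block_to_disk_block(fd, first + i);
    }
    qsort(list, nblocks, sizeof(struct blk), by_disk_location);

    for (long i = 0; i < nblocks; i++)
        read_disk_block(list[i].dblk, buf + (list[i].fblk - first) * BLOCK_SIZE);

    free(list);
}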

3 Implementation Platform

We implemented the collective I/O prototype in the Parallel File System (PFS) of the Paragon Operating System (Paragon OS). The Intel Paragon is a massively parallel supercomputer with a large number of nodes connected by a high-speed mesh interconnect network. The network uses X-Y routing with virtual cut-through and is used both as a communication and an I/O network. Each Paragon node can have 2 (GP node) or 3 (MP node) i860 processors, out of which one processor is exclusively used for communication. The Paragon nodes are classified as compute nodes, I/O nodes, and service (e.g., network, HiPPI) nodes. Usually, each I/O node has more than one I/O device connected to it and has a large memory (64 MBytes), whereas the compute and service nodes have 32 MBytes of memory. Figure 2 illustrates a configuration of the Paragon consisting of 3 compute nodes, 2 I/O nodes, and a single GP node which is used as a network server.


Figure 2: Architectural model of the Intel Paragon

3.1 The Paragon Operating System (Paragon OS)

Our version of the Paragon OS (Release 1.4.1) is based on the OSF/1 AD TNC (4) operating system. The Paragon OS is built on the Mach 3.0 microkernel and the OSF/1 NORMA MK13.26 Single server (NORMA stands for NO Remote Memory Access). Figure 2 illustrates the architectural model of OSF/1 AD TNC [ZRB+93, RNN93].

(4) OSF/1 with Advanced Development (AD) extensions from the OSF Research Institute and Transparent Network Computing (TNC) extensions from Locus Computing Corp.

- The Mach microkernel runs on all nodes of the Paragon. The microkernel provides generic task/thread management, memory management, inter-process communication services, and low-level device support. The microkernel also allows system call traps to be handled in user mode by code executing in the same task [Vah96].

- Every node has an OSF/1 MK Single server in its user space. The server implements Unix functionalities, such as file services, process management, and networking. I/O or network devices managed by the server are co-located on the same node as the server. Usually, every node of the Paragon runs the same server. However, it is possible to run a stripped-down version of the OSF/1 server on the compute nodes. In such a case, compute nodes cannot perform any I/O and act as dedicated compute nodes. We ran the same OSF/1 MK Single server on both compute and I/O nodes; therefore, compute nodes also performed I/O (if required).

- An emulator is also present in each node's address space. A system call trap executed by the application is redirected to the emulator. The emulator converts most system calls to service request messages and sends them to the server. It also contains a thread to receive callback messages from servers in support of interruptible system calls and file caching.

- The Mach microkernel also provides the multiprocessor extension of Mach IPC called NORMA IPC. NORMA IPC takes advantage of the high-bandwidth interconnect and allows direct communication among Mach microkernels on different nodes. NORMA IPC is location independent and is transparent to Mach tasks. Mach/NORMA IPC supports both in-line and out-of-line communication. NORMA IPC adheres to the distributed memory image of the machine and does not support copy-on-write while communicating data between two kernels.

- The Intel Paragon comes with two versions of the Mach microkernel, NICA and NICB. NICA uses 4 KBytes as the basic packet size, while NICB uses 8 KBytes as the basic packet size.

3.2 The Paragon Parallel File System (PFS)

The PFS consists of a set of regular Unix File Systems (UFS) which are located on distinct storage devices. Usually, each UFS is controlled by a different I/O node. A Paragon can have more than one PFS file system, each with different attributes and buffering strategies [Div93, ACR96]. The PFS achieves I/O parallelism by (1) striping data over multiple file systems and (2) providing parallel file access modes. File striping attributes describe how a file is laid out across a PFS. The important stripe attributes are the stripe unit size, SU (the unit of data interleaving), and the stripe group, SD (the number of I/O node disks across which a PFS file is interleaved). The parallel file access modes are used to coordinate access to a PFS file from multiple application processes running on multiple nodes. These modes are essentially hints provided by the application to the file system. The file system uses these hints to optimize the I/O accesses based on the file layout, the degree of parallelism, and the level of data integrity required. Currently, the PFS supports six I/O modes: M_UNIX, M_SYNC, M_ASYNC, M_RECORD, M_GLOBAL, and M_LOG. Benchmark results have shown that M_ASYNC and M_RECORD provide the highest performance [ACR96, Bor96].

The PFS uses a technique called Fast Path I/O to avoid data caching and copying on large transfers to/from the disks [ACR96]. The file system buffer cache on the Paragon OS is bypassed, as is the client-side memory-mapped file support. Instead, Fast Path I/O reads data directly from the disks to the user's buffer and writes from the user's buffer directly to the disks. On the server side, block coalescing and sorting are done, which reduces the number of required disk accesses when blocks of the file are contiguous on the disk. Fast Path I/O is specifically designed to support large data transfers (> 64 KBytes). The read and write routines in the PFS are implemented in three levels. The high-level user interface generates system call traps that are redirected to the emulator. The emulator, using the access pattern information (i.e., the amount of data to be accessed, the current file offset, and the file I/O mode) and the file layout information (i.e., the number of stripe disks and the stripe size), sends messages to the I/O servers to access data from their storage units. The current file I/O mode decides the strategy used to update the file pointers and to access the I/O servers. Each server reads or writes data from/to its storage unit (a single disk or RAID drive) using a block size of 8 KBytes and transfers data to the compute nodes via Mach/NORMA IPC. In a PFS read, the data received from an individual server is first reordered and then copied into the correct locations in the user buffer (linearization), while in a PFS write, the server acknowledges that the data has been written. The majority of the PFS functionality is provided by the emulator. The I/O servers have minimal intelligence and only serve the requests sent by the emulators. Throughout this paper, the term file system will be used to refer to the file management routines in the emulator.
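The round-robin striping just described determines, for any byte offset in a PFS file, which storage unit holds the data and where it falls within that unit. The following small function is our own illustration of that mapping (it is not the emulator's code), with SU the stripe unit size and SD the number of stripe disks:

/* Round-robin striping: stripe unit k of the file lives on storage unit
   (k mod SD).  For a byte offset, return the storage unit that holds it and
   the offset within that unit's portion of the file. */
struct stripe_loc { long io_unit; long unit_offset; };

struct stripe_loc map_offset(long offset, long SU, long SD)
{
    struct stripe_loc loc;
    long k = offset / SU;                            /* which stripe unit      */
    loc.io_unit     = k % SD;                        /* which disk holds it    */
    loc.unit_offset = (k / SD) * SU + offset % SU;   /* position on that disk  */
    return loc;
}

For example, with SU = 64 KBytes and SD = 2, the first 64 KBytes of a file map to disk 0, the next 64 KBytes to disk 1, the next 64 KBytes back to disk 0, and so on, which is the layout assumed in Figures 5 and 6.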

4 Collective I/O Prototype

4.1 Extending the PFS Interface to Support Collective I/O

Section 2 argued that traditional UNIX I/O interfaces cannot fully express user access patterns. This results in the loss of valuable semantic information, and in some cases the user is forced to make several small I/O requests, leading to increased program complexity and I/O overhead. This problem can be tackled in two ways: (1) by providing a suitable interface and (2) by devising strategies to exploit the information provided by the interface. In this section, we propose a simple collective I/O interface, and in the next section we describe an optimization strategy which uses the collective interface.

In order to perform aggressive optimizations, the file system requires as much information as possible about the user access patterns. The file system can get this information using high-level interfaces. Traditional UNIX interfaces and array-based PASSION-like interfaces represent two extreme cases of such interfaces. While array-based interfaces provide sufficient information about user access patterns, they are restricted to certain programming paradigms only (e.g., HPF data distribution). A file system interface should be paradigm independent, and therefore should not directly support paradigm-specific concepts like data distribution. But the file system can provide additional functionality in its interface by which the user can express high-level global access patterns as a function of the amount of data to be accessed from a file position (in other words, an indirect translation from accesses in user space to those in file space). We provide this functionality by introducing the concept of a processor group. Users define a processor group for performing collective I/O and provide the file system with a one-dimensional logical map of the group. Using the logical processor indices and the amount of data accessed by individual processors, the file system reconstructs the collective user-level access patterns in the file space. Note that the file system does not have any idea that the data may be distributed across the processors. To illustrate the concept, consider the example in Figure 3, which presents an array A(4,4) distributed in COLUMN-BLOCK fashion over 4 processors. In order to obtain this distribution, processor 0 needs to read elements 0-3, processor 1 needs to read elements 4-7, and so on. To express this access pattern, the user has to create a processor group with 4 processors, each reading 4 elements. Figure 3:C represents the logical mapping of the group. Using the logical processor mapping, the file system knows that the 4 elements of processor 0 should be read first, followed by those for processor 1, and so on. In other words, the file system generates a collective file access pattern that exactly matches the user access pattern. Since the collective file access pattern consists of contiguous data requests, it can be serviced in a single operation. As the file system uses the logical processor mapping information to implicitly compute the file offsets, this interface is called the implicit-offset interface.


Figure 3: (A) Column-Block Distribution over 4 processors. (B) Corresponding file mapping. (C) Logical processor mapping.

The implicit-offset interface provides support for collective I/O in two ways:

- At the interface level, the processor group explicitly defines a set of processors whose accesses should be implemented in a collective manner. By controlling the processor mapping, the user can generate file mappings that correspond to different data distribution strategies. Of course, this interface, in its naive form, cannot support all possible data distributions. In order to support complicated distribution strategies like BLOCK-CYCLIC(k), simple extensions are required (Section 6). This collective interface, on the one hand, minimizes the loss of semantic information and, on the other, hides user-level data distribution strategies from the file system.

- At the file system level, several small I/O requests from the members of the collective group can be merged to generate a small number of large I/O requests accessing contiguous data. The resulting file access pattern substantially reduces the overall I/O cost.
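As a concrete restatement of the implicit-offset rule (our own sketch, not the emulator's code): a member's data starts at the base offset plus the request sizes of all members with a smaller logical index.

/* sizes[] holds each member's request size in bytes, ordered by logical group
   index as given in the one-dimensional processor map. */
long implicit_offset(long base, const long sizes[], int logical_index)
{
    long off = base;
    for (int j = 0; j < logical_index; j++)
        off += sizes[j];
    return off;
}

For the group of Figure 3, with a base offset of 0 and each of the 4 members reading 4 integers (16 bytes), implicit_offset() yields 0, 16, 32, and 48 for logical indices 0 through 3, which is exactly the contiguous COLUMN-BLOCK file mapping shown in the figure.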

4.1.1 Collective I/O Calls

In order to facilitate the implementation of collective I/O routines in the PFS, we have introduced a new I/O mode called M_COLL. This mode has the following characteristics:

- Each node has its own file pointer.

- All nodes in the program must open the file; however, multiple processor groups (each with a subset of processors) can perform I/O operations on the file.

- M_COLL supports both collective and non-collective operations. A collective I/O operation must be called by all nodes in the operation, but only the members of the active processor group participate in the operation. A collective I/O operation is non-blocking across the nodes in the program. Non-participating processors exit immediately; the order in which participating processors exit depends on the type of I/O call (Section 4.2). In the non-collective mode, the I/O accesses adhere to the PFS M_UNIX semantics. The file accesses are honored on a first-come, first-served basis. Within a program, collective and non-collective I/O operations can be interleaved.

- Actual data access within a collective call is performed by a subset of group members called Masters. The number of Masters is predetermined by the file system or can be changed at runtime by the user. To achieve maximum performance, Masters perform out-of-order transactions with the I/O servers (Section 4.2).

- Members of the collective group can read or write different amounts of data.

- Since collective I/O routines are non-blocking, the user should maintain consistency by using explicit synchronization techniques (such as message-passing); see the short example after this list.
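As a small illustration of the last point (our own fragment, not part of the PFS interface: nbytes and the group handles are placeholders, and gsync() stands in for an NX-style global barrier; any explicit message-passing handshake would serve the same purpose), a program that collectively writes a region and then immediately rereads it should separate the two calls:

/* Collective calls are non-blocking across nodes, so a barrier (or an
   equivalent message exchange) is needed before the freshly written data is
   reread.  gsync() here denotes a global barrier over the application's nodes. */
cwritec(gd, fd, buffer, nbytes);   /* clients may return before the masters finish */
gsync();                           /* wait until every node, masters included, is done */
lseek(fd, 0, 0);                   /* reposition to the start of the region */
creadc(gd2, fd, buffer, nbytes);   /* now safe to read the data just written */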

Collective I/O routines can be classified into: (1) group manipulation routines and (2) collective read and write routines.

1. Group manipulation routines

The processor groups are dynamic. The group manipulation routines can be used to modify the group contents during program execution. These routines are called by all processors participating in the program and are non-blocking.

- mkcgrp(): mkcgrp() creates a new collective group for a given file. mkcgrp() takes the following parameters as input: (1) file descriptor, fd, (2) a one-dimensional array specifying the list of processors, (3) the number of processors in the group, and (4) a group access hint. mkcgrp() returns a group handle which will be used in collective read and write routines. The group access hint may be RDONLY, WRTONLY, or RDWR.

  gd = mkcgrp(fd, nodlst, no_procs, access_hints)

- delcgrp(): delcgrp() deletes a collective group specified by its group descriptor. All future accesses with this group will result in an error.

  delcgrp(gd)

- addcgrp(): addcgrp() adds a processor proc to an existing group. addcgrp() returns an error if unsuccessful.

  error = addcgrp(gd, proc)

- rmcgrp(): rmcgrp() removes a processor proc from an existing group. rmcgrp() returns an error if unsuccessful. If the processor to be removed was the master, a new master is selected.

  error = rmcgrp(gd, proc)

2. Collective read and write routines

Collective read and write routines provide hints to the file system specifying which accesses should be executed in a collective manner. These routines take the following input parameters: (1) group descriptor, gd, (2) file descriptor, fd, (3) a pointer to a data buffer, and (4) the amount of data to be read or written. Each processor provides the buffer pointer in its local address space and can read or write a different amount of data.

  creadc(gd, fd, &buffer, count)
  cwritec(gd, fd, &buffer, count)

4.2 Implementation Details

We implemented the collective I/O prototype at two levels: high-level wrappers for the collective I/O routines (Section 4.1) were developed at the runtime library level, and the corresponding callback and transformation functions were developed in the PFS emulator.

4.2.1 Implementation of Collective I/O Routines

Implementation of the collective I/O routines involved modifying the existing NX library and developing a new library to provide additional functionality. Collective I/O routines require all processors participating in the application to open a file in the M_COLL mode. The M_COLL mode was implemented by extending the PFS gopen routine, which allows simultaneous opening of a global file by more than one processor. gopen also sets up the NORMA IPC communication environment required by the collective I/O routines. Since the collective processor groups support dynamic logical mapping, during the execution of a program any processor can communicate with any other processor. This results in an all-to-all communication pattern. In Mach IPC, data communication is carried out between two ports. Each Mach port gets a receive right on creation, but it needs a send right for any port to which data is to be sent. To obtain an all-to-all communication pattern, each port should have send rights to all other ports. In order to achieve this communication pattern, the Mach port names should be globally visible to all the participating processors. The gopen routine, therefore, allocates port names to processors using a simple function (e.g., a function of the processor index). Using this function, a processor can easily compute the port names of all other processors.

mkcgrp() returns a group handle for a new collective group. This routine must be called by all participating processors. Each processor maintains a table of allocated groups (at present, only 64 groups are supported). For every new group, mkcgrp() allocates space for a new group structure and assigns it to the next available entry in the table. The index of this entry is returned as the group handle. The group structure is initialized to store group information such as the number of processors in the group, their logical indices, the index of the group master, etc. mkcgrp() also connects the NORMA IPC communication: each processor gets send rights to all other processors' ports. delcgrp() uses the group handle as an index into the group table. It finds the group table entry, frees the corresponding group structure, and marks the entry as available. addcgrp() and rmcgrp() modify the group structure indexed by the group handle to reflect the addition or removal of a processor. In addition, for the processor being added, all operations corresponding to mkcgrp() are also executed; similarly, for the processor being removed, all operations corresponding to delcgrp() are executed.
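A rough sketch of the bookkeeping described above, i.e., the per-processor table of group structures indexed by the group handle. The field names and the size bound on a group are our own choices, not the actual emulator data structures.

#define MAX_GROUPS 64             /* at present, only 64 groups are supported */
#define MAX_GROUP_PROCS 1024      /* illustrative upper bound on group size   */

struct coll_group {
    int in_use;                   /* is this table entry allocated?            */
    int nprocs;                   /* number of processors in the group         */
    int procs[MAX_GROUP_PROCS];   /* logical index -> physical node number     */
    int master;                   /* logical index of the group master         */
    int access_hint;              /* RDONLY, WRTONLY, or RDWR                  */
};

static struct coll_group group_table[MAX_GROUPS];

/* mkcgrp()-style allocation: the index of the first free slot becomes the
   group handle returned to the caller, or -1 if the table is full. */
int alloc_group(const int *nodlst, int nprocs, int hint)
{
    for (int gd = 0; gd < MAX_GROUPS; gd++) {
        if (group_table[gd].in_use)
            continue;
        group_table[gd].in_use = 1;
        group_table[gd].nprocs = nprocs;
        for (int i = 0; i < nprocs; i++)
            group_table[gd].procs[i] = nodlst[i];
        group_table[gd].master = 0;            /* first processor in the map */
        group_table[gd].access_hint = hint;
        return gd;
    }
    return -1;
}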

4.2.2 Implementation of Low-level Collective I/O

We have implemented low-level collective I/O in the file space; i.e., the conforming access pattern matches the data storage order in the one-dimensional file space. In this section, we first give an overview of our collective I/O strategy and then describe in detail the implementation of the collective read and write routines.


Figure 4: Trace diagrams of collective read and write.

Figure 4 presents trace diagrams of the collective read and write routines. Each processor (to be specific, its emulator) receives information about the local file access pattern from the high-level runtime interface. In order to generate the global, collective file access pattern, each processor sends its request size (i.e., the amount of data to be read or written) to a special processor called the group master. By default, the first processor in the processor array is chosen as the group master, and the remaining processors are designated as its clients. The group master uses the logical processor mapping to generate the global file access pattern. The file offset of the group master is chosen as the base offset for performing collective operations. In a collective read (Figure 4:A), the master first reads the entire collective data and then, using NORMA IPC, distributes it among the clients according to the global access pattern. In a collective write (Figure 4:B), the master first fetches data from the clients using NORMA IPC and then performs a collective write according to the global file access pattern. (Note that the creadc and cwritec semantics guarantee that the collective data fetched is contiguous in the file space.)

There are two obvious problems in this approach. First, since there is only one master, the space required to store the collective data can become very large; e.g., a case in which 64 processors write 1 MByte each requires 64 MBytes to store the collective data. Also, the clients communicate with a single master, which results in significant communication overhead. The communication overhead can substantially degrade the overall performance. In order to avoid these problems, we provide a special system call, gcntl(). Using gcntl, the user can explicitly control: (1) the number of masters in a collective operation and (2) the maximum amount of space that can be allocated to store the collective data, MAX_COLL_DATA_LEN. These parameters can be modified at runtime for a particular collective group and for a collective operation. For example, the following routine sets MAX_COLL_DATA_LEN to 2048 KBytes:

  buffsize = 2048*1024;
  error = gcntl(gd, fd, F_SETCBSIZE, &buffsize);

The updated value of MAX_COLL_DATA_LEN can be obtained using:

  error = gcntl(gd, fd, F_GETCBSIZE, &buffsize);

Similarly, the number of masters can be changed to 2 in the following manner:

  masters = 2;
  error = gcntl(gd, fd, F_SETMASTER, &masters);

If the user wants more than one master, those processors whose logical group indices satisfy the equality (logical_index mod no_masters = 0) are chosen as masters. Each master is allocated a subset of processors as its clients. For example, for the logical processor mapping shown in Figure 3:C, if the number of masters were 2, processors 0 and 2 would be chosen as masters. Processor 0 would have processor 1 as its client, and processor 2 would have processor 3 as its client. Even if there is more than one master, the file offset of the first processor in the processor list is chosen as the base offset for the collective operations (a consequence of the implicit-offset approach).

The following example shows a program fragment illustrating the use of the creadc() and cwritec() system calls. The program creates two collective groups with different processor maps. Data is written into the file using one processor map and read using the other map. The program also illustrates the use of the fcntl and gcntl system calls.

struct sattr sattr;            /* Stripe attributes structure */
int  fd, gd, gd2;
int  i, iam, nprocs;
int  nodlst[4];
long sunitsize, sfactor, start_dir;
long grp, coll_size, masters, request;
int  *buffer;                  /* Preallocated data buffer */

/* Get environment information */
iam    = mynode();
nprocs = numnodes();

/* Set the stripe information */
sunitsize = 64*1024;           /* Stripe unit size is 64 KBytes */
sfactor   = 64;                /* Number of disks is 64 */
start_dir = 0;                 /* Starting directory is 0 */

/* Set the collective group information */
grp       = 4;                 /* Size of collective group is 4 */
coll_size = 4096*1024;         /* Size of collective buffer is 4 MBytes */
masters   = 2;                 /* Number of masters is 2 */
request   = 256*1024;          /* Each processor accesses 256K integers (1 MByte) */

/* Open a global file 'foobar' using gopen in M_COLL mode */
fd = gopen("/pfs/foobar", O_CREAT | O_TRUNC | O_RDWR, M_COLL, 0766);

/* Create new collective I/O groups */
for (i = 0; i < grp; i++) {
    nodlst[i] = i;             /* Processor map */
}
gd = mkcgrp(fd, nodlst, grp, RDWR);

for (i = 0; i < grp; i++) {
    nodlst[i] = grp - 1 - i;   /* Reverse processor map */
}
gd2 = mkcgrp(fd, nodlst, grp, RDWR);

/* Set up the stripe attributes (sunitsize, sfactor, and start_dir are
   copied into the sattr structure before the call) */
if (fcntl(fd, F_SETSATTR, &sattr) != 0) {
    perror("fcntl");
    exit(1);
}

/* Set up collective buffer size and number of masters for both groups */
if (gcntl(gd,  fd, F_SETCBSIZE, &coll_size) != 0) { perror("gcntl"); exit(1); }
if (gcntl(gd2, fd, F_SETCBSIZE, &coll_size) != 0) { perror("gcntl"); exit(1); }
if (gcntl(gd,  fd, F_SETMASTER, &masters)   != 0) { perror("gcntl"); exit(1); }
if (gcntl(gd2, fd, F_SETMASTER, &masters)   != 0) { perror("gcntl"); exit(1); }

/* Perform a collective write using the preallocated buffer at the start of the file */
cwritec(gd, fd, buffer, request*4);

/* Perform computation and later go back to the start of the file */
lseek(fd, 0, 0);

/* Collective read using the group gd2 */
creadc(gd2, fd, buffer, request*4);


Implementation of creadc()


Figure 5: Implementation of Collective Read

Figure 5 illustrates the implementation of the collective read. Figure 5:A represents a synthetic access pattern in which 4 processors participate in a collective read operation, each reading 4 units of data. Assume that each unit equals 64 KBytes. Figure 5:B shows 4 processors, out of which 2 processors (processors 0 and 2) also serve as I/O servers and group masters. Each I/O server is connected to a disk. Assuming a stripe size of 64 KBytes, the requested data needs to be fetched from both disks. The overall collective read operation can be split into eight phases; a sketch of the master-side loop follows the list.

- Each emulator first receives a system call trap executed by the application. From the system call, the emulator obtains the current group handle, the file descriptor, and information about the local file access pattern. Using the group handle, the emulator updates the group information (i.e., whether the current processor is the master; if not, the index of the master, etc.). The file descriptor is used to obtain the file stripe information. After each emulator receives the local file access information, information coordination takes place between the processors (1). After phase 1, each master has the complete picture of the collective file access.

- Using the client information, each master (that is, the master's emulator) computes the total amount of data to be read, COLL_DATA_LEN. COLL_DATA_LEN represents the amount of data to be read by a master for itself and for its clients (i.e., masters partition the collective data among themselves). Since masters do not share clients, masters read non-overlapping data. For example, in Figure 5, the first master (processor 0) reads the first 512 KBytes and the second master reads the remaining 512 KBytes. In order to store the data to be read, each master allocates a collective buffer of size MIN(COLL_DATA_LEN, MAX_COLL_DATA_LEN). If COLL_DATA_LEN is greater than MAX_COLL_DATA_LEN, then the collective read operation requires several collective iterations; each iteration collectively reads no more than MAX_COLL_DATA_LEN amount of data. For example, processors 0 and 2 in Figure 5 will have a COLL_DATA_LEN of 512 KBytes. If MAX_COLL_DATA_LEN is 256 KBytes, there are two collective iterations, each reading 256 KBytes of data. In every collective iteration, using the file layout of the data to be read, the emulator computes the disks (i.e., I/O servers) on which the requested data resides and sends a single read message to each I/O server (2). The message carries data access information like the amount of data to be read, the beginning and ending file offsets, etc. This information can be used by the I/O servers for computing which disk blocks need to be fetched. The messages are either Mach IPC messages (if a master is also an I/O server) or NORMA IPC messages (if the data lies on a distinct I/O server).

- Each I/O server, upon receipt of a read message, computes the number of disk blocks involved in the read operation, sorts the disk requests, fetches the data from the connected device (3), and sends the data back to the master using Mach/NORMA IPC (4).

- After a (master) emulator receives data from an I/O server, it reorders it and copies it into the correct position in the collective buffer (5). For example, processor 2 receives data units 8, 10, 12, and 14 from processor 0. These units are then copied into the collective buffer at positions separated by an offset of one data unit (i.e., 64 KBytes).

- After the collective data is copied into the collective buffer, data distribution begins. Using the collective access pattern, the master computes the recipients of the fetched data. If the master itself is a recipient, the data is copied from the collective buffer to the user buffer (6); otherwise the data is communicated via an out-of-line NORMA IPC message to the target processor (7). On the receiver end, the mach receive call returns the starting address of the Mach buffer where the received data is stored. The receiver then copies the data from the Mach buffer to the user buffer (8). The Mach buffer is deallocated after the data copy.
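Putting the phases together, the following condensed sketch shows the master-side loop of a collective read, with collective iterations bounded by MAX_COLL_DATA_LEN. The two helper routines stand in for the emulator operations described above (phases 2-5 and phase 7 respectively); they are placeholders, not PFS entry points.

#include <stdlib.h>
#include <string.h>

/* Placeholders for the emulator operations described in the text. */
extern void fetch_from_servers(long file_off, long len, char *coll_buf); /* phases 2-5 */
extern void send_to_client(int client, const char *data, long len);      /* phase 7    */

struct piece { int client; long file_off; long len; };  /* one member's share */

/* Master-side collective read of coll_data_len bytes starting at `base`,
   owned by `npieces` group members, in iterations of at most
   max_coll_data_len bytes (the collective buffer size). */
void master_read(long base, long coll_data_len, long max_coll_data_len,
                 const struct piece *pieces, int npieces,
                 int my_node, char *user_buf)
{
    long bufsize = coll_data_len < max_coll_data_len ? coll_data_len
                                                     : max_coll_data_len;
    char *coll_buf = malloc(bufsize);

    for (long done = 0; done < coll_data_len; done += bufsize) {
        long chunk = coll_data_len - done < bufsize ? coll_data_len - done
                                                    : bufsize;

        /* Phases 2-5: fetch this iteration's data from the I/O servers and
           linearize it into the collective buffer. */
        fetch_from_servers(base + done, chunk, coll_buf);

        /* Phases 6-7: hand each member the part of its piece that falls in
           this iteration: copy locally for the master, send otherwise. */
        for (int i = 0; i < npieces; i++) {
            long lo = pieces[i].file_off - base, hi = lo + pieces[i].len;
            long s = lo > done ? lo : done;
            long e = hi < done + chunk ? hi : done + chunk;
            if (s >= e)
                continue;                       /* nothing in this iteration */
            if (pieces[i].client == my_node)
                memcpy(user_buf + (s - lo), coll_buf + (s - done), e - s);
            else
                send_to_client(pieces[i].client, coll_buf + (s - done), e - s);
        }
    }
    free(coll_buf);
}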


Implementation of cwritec()


Figure 6: Implementation of Collective Write

Figure 6 describes the implementation of the collective write. It uses the same synthetic access pattern and processor configuration as Figure 5. The stripe size can again be assumed to be 64 KBytes. The collective write operation can be split into eight phases; a short master-side sketch follows the list.

- Each emulator receives a system call trap executed by the application. As in the implementation of the collective read, each emulator obtains the group, file, and access information. Each emulator then performs information coordination (1). After phase 1, each master has the complete picture of the collective file access.

- Using the client information, each master computes the amount of data that it needs to write, COLL_DATA_LEN. Each master allocates a collective buffer of size MIN(COLL_DATA_LEN, MAX_COLL_DATA_LEN). If COLL_DATA_LEN is greater than MAX_COLL_DATA_LEN, then the collective write operation requires several iterations; each collective iteration writes no more than MAX_COLL_DATA_LEN amount of data. In every iteration, the master requests the appropriate clients to send the data (2). This approach is also used in Server-directed I/O [SCJ+95]. The clients, upon receipt of the request, send the data via an out-of-line Mach message (3). On the master side, mach receive returns a pointer to the received data. The data is then copied into the collective buffer (5) at the proper offset. If required, the master also copies its own data (4).

- After the collective buffer is filled, each master computes the disks (or I/O servers) on which the data would reside and sends a single data message to every I/O server. The message carries data access information and a pointer to the data to be written (6).

- Each I/O server, upon receipt of a data message, computes the number of disk blocks involved in the operation, sorts the disk requests, writes the data to the connected device (7), and sends an acknowledgment back to the master (8).
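For symmetry with the read sketch above, a condensed master-side view of one collective-write iteration is given below: gather first, then write. The helpers are again placeholders for the emulator operations (phases 2-3 and 6-8), not PFS routines, and the master is assumed to be first in the logical map.

#include <string.h>

/* Placeholders for the emulator operations described in the text. */
extern void request_data(int client, long len);                          /* phase 2    */
extern long recv_data(int client, char *dest, long maxlen);              /* phase 3    */
extern void write_to_servers(long file_off, long len, const char *buf);  /* phases 6-8 */

/* One collective-write iteration of `chunk` bytes starting at file offset
   `base`: the master copies its own data, gathers its clients' contributions
   in logical-map order into the collective buffer, and then writes the whole
   chunk out (one data message per I/O server inside write_to_servers). */
void master_write_iteration(long base, long chunk,
                            const int *clients, const long *lens, int nclients,
                            const char *my_data, long my_len, char *coll_buf)
{
    long filled = 0;

    memcpy(coll_buf, my_data, my_len);               /* phase 4: master's own data */
    filled += my_len;

    for (int i = 0; i < nclients; i++) {             /* phases 2, 3, and 5 */
        request_data(clients[i], lens[i]);
        filled += recv_data(clients[i], coll_buf + filled, lens[i]);
    }

    write_to_servers(base, chunk, coll_buf);         /* phases 6-8 */
}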

We now discuss some implementation issues that are common for both creadc() and cwritec().

- Implementing inter-processor communication using NORMA IPC: Collective I/O requires communication for information coordination and for data communication between masters and clients. As noted before, the Paragon OS performs inter-processor communication via NORMA IPC. NORMA IPC allows both in-line and out-of-line messages. We use in-line NORMA IPC for small messages (e.g., in information coordination) and out-of-line NORMA IPC for large data transfers. In both cases, send is implemented as a non-blocking operation (by specifying MACH_MSG_TIMEOUT as 0) and receive is implemented as a blocking operation. An out-of-line message carries a pointer to the data to be communicated. When a processor sends an out-of-line message, a local copy of the data is first made (a virtual copy, not a byte copy) in the virtual address space of the sending processor. A reference to this data, called an RDMA handle, is passed in the message, which is enqueued on the receiving processor, waiting to be received. When the receiving processor receives the message, the kernel on that processor requests the data to be sent over the mesh. The sending processor deallocates the local copy as soon as the data is transmitted. The kernel of the receiving processor, upon receipt of the data, copies it into the receiving task's virtual address space and returns the pointer to that location. Note that on the receiving side there is a byte copy, and the copy-on-write optimization is not used. The receiving processor has to explicitly deallocate the local copy after copying the data into user space. The communication of an out-of-line message, therefore, requires three separate inter-kernel messages (two request messages and one data message). This results in severe performance degradation. Experiments on our platforms showed that the performance of NORMA IPC is directly proportional to the size of the data communicated and that the observed peak bandwidth for NORMA IPC is significantly lower than that for NX communication.

- Ordering of server accesses: By default, each collective group has one master. However, to reduce the space requirement and the communication cost, the user can choose more than one master. These masters can simultaneously access the I/O servers. By default, the PFS does not allow simultaneous access of I/O servers by more than one processor. In order to maintain UNIX semantics, accesses to the I/O servers are serialized and serviced individually on a first-come, first-served basis. For a large number of processors, this access serialization results in severe degradation of performance. If we used this approach, the additional wait incurred by the masters would offset any gain obtained by reducing the communication. Therefore, we allow the I/O servers to service simultaneous requests from multiple processors. This approach is used to implement the M_ASYNC mode in the PFS [Div93]. Since the masters always access non-overlapping data, file consistency is maintained.

- Semantics of completion: One important issue in implementing collective I/O calls is the semantics of completion. Specifically, when should a collective read/write call return? In a collective read, each client returns when it receives data in its user buffer, and each master returns when it has sent data to all its clients and copied its own data. However, in a collective write the situation is more complicated. Specifically, should the collective write return as soon as all processors get acknowledgments regarding their data, or should each client return immediately after successfully sending data to a master? The first approach is safe, but clients have to wait until the masters finish the data transfer. In the second approach, the clients can resume their computations as soon as the data communication is over (Figure 6, phase 3), but it is the user's responsibility to ensure that clients do not overwrite the file being written. Our implementation follows the second approach. Though unsafe, it provides significant performance gains. Since Mach communication is reliable, from a client's point of view, a successful return of a mach send can be considered a successful data write. We will return to this issue again in the next section.

- File-pointer manipulations: As noted in Section 4.1, the file system uses the logical processor mapping to implicitly compute offsets for members of the processor group. As soon as creadc() or cwritec() returns, each processor updates its file pointer to reflect the new file position. For example, consider a collective write with 4 processors. The processors are arranged in reverse order, i.e., processor 3 has the logical index 0, processor 2 has the logical index 1, and so on. Assume that the base offset for the collective write is 1024 bytes and each processor writes 1024 bytes. After the completion of cwritec(), processor 3's file pointer will be updated to 2048, processor 2's file pointer will be updated to 3072, etc. The PFS provides a unique token manager to handle the file length and offset information. In order to obtain the latest file information, a processor needs to acquire a file token and release it after updating its local offset and length values. At any one time, only one processor can hold the token. In our implementation, since each processor has its own file pointer, the token manager is only required to obtain the updated value of the file length. In a collective read, each master needs to get the file length information to verify that the read request does not cross the end-of-file. This check can be done at the beginning of the collective read. After each processor receives its data, it can update its file pointer to reflect the new file position. However, for a collective write, each master needs to update the file length after every data write. Hence, in every collective iteration, each master acquires the file token and releases it after updating the file length. This operation, though inexpensive, does create a bottleneck when the number of masters is large.

4.2.3 Cost Analysis

In this section, we first analyze the various costs associated with our implementation of the collective read and write routines, and use the information to compare the performance of collective and non-collective I/O in the Paragon PFS.

Let T_read^c and T_write^c be the times required to collectively read or write R bytes. Let P be the number of processors in the processor group, m be the number of masters, and r be the data accessed by each processor in bytes, where r = R/P. From Figures 5 and 6, it can be observed that both the creadc() and cwritec() implementations require 8 phases. These phases can be broadly classified into: (1) communication, (2) copy, and (3) I/O. In order to correctly analyze the performance of creadc() and cwritec(), additional overheads due to either file pointer manipulations or delays in accessing the I/O servers must be taken into consideration (T_misc). Taking these parameters into consideration, T_read^c and T_write^c can be expressed as the following sum:

T_read^c = T_write^c = T_comm^c + T_copy^c + T_io^c + T_misc

It should be observed that, due to the different completion semantics, clients and masters may require different times for collective reads and writes. Specifically,

T_write^c(master) >= T_write^c(client)
T_read^c(client) >= T_read^c(master)

- Communication Costs (T_comm^c): In a collective read, communication is performed in 2 phases: in phase 1 for coordinating access information and in phase 7 for transmitting data from a master to its clients. In phase 1, each processor sends a single in-line NORMA message to its master; hence, the total number of messages in phase 1 is O(P). During phase 7, each master sends a single out-of-line NORMA message per client. Therefore, the number of messages sent during phase 7 is also O(P). A collective write also performs communication in 2 phases: in phase 1 for coordinating access information, and in phase 2 for requesting off-processor data. In each phase, the total number of messages sent is O(P). Even though the communication complexity of collective read and write is the same, a collective write may require more communication time than a collective read. This difference is due to the following reason: in a collective read, in every collective iteration, each master performs several non-blocking sends to communicate data to its clients. Each client, in turn, makes a single blocking receive and exits. In a collective write, in every collective iteration, each master performs several blocking receives to get data from its clients. A sequence of receives in each collective iteration creates a ripple effect and increases the overall communication cost.


- Copy Costs ($T^c_{copy}$): In the collective I/O routines, the cost of copying data depends on the amount of buffer space required to store the collective data. Generally, the collective I/O routines require a constant amount of space to store the collective data (MAX_COLL_DATA_LEN). In some cases, however, additional space is required. For example, consider a collective I/O access using 3 processors. The first processor accesses 800 bytes, the second 600 bytes, and the third 600 bytes. Let MAX_COLL_DATA_LEN be 1000 bytes. If there is a single master, COLL_DATA_LEN will be 2000 bytes. Therefore, the collective I/O access will require 2 iterations. In collective write, during the first iteration, the master will copy 800 bytes and then receive 600 bytes from the second processor. Since MAX_COLL_DATA_LEN is 1000 bytes, it can only write 200 bytes of the second processor's data in the first iteration. The remaining 400 bytes need to be copied into a temporary buffer. The data from the temporary buffer can be copied into the collective buffer as soon as the first data write is finished. Similarly, in collective read, the first iteration will read 800 bytes for the first processor and 200 bytes for the second processor. The data for the second processor needs to be copied into a temporary buffer. However, the size of the temporary buffer will be 600 bytes, not 400 bytes: by allocating a larger buffer, a master can send the data in a single data message. The amount of extra space required therefore depends on the type of collective operation and is a function of the processor request size.
Collective I/O routines perform two types of data copies: (1) Virtual copy, performed using the Mach vm_copy routine. A virtual copy uses the copy-on-write optimization and does not involve physical data transfer. (2) Byte copy, performed using a modified bcopy routine; it involves physical data transfer. Virtual copies are used primarily during out-of-line NORMA communication (on the sender's side). The remaining copy operations perform byte copies using the modified bcopy() routine. Collective write performs byte copies in two phases, namely (4) and (5), and collective read performs byte copies in three phases, namely (5), (6), and (7). Copies from collective-to-user and Mach-to-user buffers can be carried out as a single operation. However, phase (5) in collective read (linearization) involves several physical byte copies. The number of byte copies required is MAX_COLL_DATA_LEN / SU, where SU is the stripe size. For large values of MAX_COLL_DATA_LEN and small values of SU, the required number of byte copies becomes very large.
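The buffer accounting in the example above can be sketched as follows. This is a minimal illustration under the stated example values; the helper names are ours and do not correspond to the PFS implementation.

    /* Illustrative sketch: how a master could size the collective buffer and
     * count the collective iterations (hypothetical names, not PFS internals). */
    #include <stddef.h>

    #define MAX_COLL_DATA_LEN 1000   /* value used in the example above */

    /* Given the request sizes of the group members handled by one master,
     * compute the total collective length and the number of iterations. */
    static size_t coll_iterations(const size_t *req_len, int nprocs,
                                  size_t *coll_data_len)
    {
        size_t total = 0;
        int i;
        for (i = 0; i < nprocs; i++)
            total += req_len[i];
        *coll_data_len = total;
        /* ceiling division: each iteration moves at most MAX_COLL_DATA_LEN bytes */
        return (total + MAX_COLL_DATA_LEN - 1) / MAX_COLL_DATA_LEN;
    }

For the example requests of 800, 600, and 600 bytes, the total is 2000 bytes and the helper yields two iterations; the 400 bytes of the second request that do not fit in the first iteration are what the temporary buffer must hold on the write side.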

- I/O Costs ($T^c_{io}$): $T_{io}$ refers to the time required to access data to/from the I/O servers. $T_{io}$ can be computed as the sum of the time required for communicating with the I/O servers ($T_{io1}$) and the time required for the I/O servers to access data from the devices ($T_{io2}$). $T_{io}$ depends on various factors, which include the collective file access pattern, the number of stripe units, and the stripe size. Ideally, as the number of stripe units is increased, $T_{io2}$ decreases.

Performance Analysis of Collective and Non-Collective I/O

In order to facilitate the analysis, we consider a hypothetical distributed-memory parallel machine with N nodes and a parallel file system with SD storage units. The nodes can be classified into compute nodes, CP, and I/O servers, IOS. An I/O server can also act as a compute node. The parallel file system stripes files over the SD storage units in a round-robin fashion using SU as the stripe size. Data communication between I/O servers and compute nodes is performed using a system-level communication facility. In order to read or write data from/to files, users send I/O requests to the file system. Each I/O request is distributed into several disk requests, which are then serviced in parallel by the I/O servers. The I/O servers then communicate data back to the compute nodes via the communication facility.

Let $T_{read}$ and $T_{write}$ denote the time required for P processors to individually (i.e., non-collectively) read or write r bytes. Similarly, let $T^c_{read}$ and $T^c_{write}$ denote the maximum time required for P processors to collectively read or write R = rP bytes. Since there are m masters, each master accesses R/m = rP/m bytes. Using the analysis presented in the previous section, $T^c_{read}$ and $T^c_{write}$ can be expressed as follows:

    $T^c_{read} = T^{cr}_{copy} + T^{cr}_{comm} + T^{cr}_{io1} + T^{cr}_{io2} + T^{cr}_{misc}$    (1)

    $T^c_{write} = T^{cw}_{copy} + T^{cw}_{comm} + T^{cw}_{io1} + T^{cw}_{io2} + T^{cw}_{misc}$    (2)

Similarly, $T_{read}$ and $T_{write}$ can be expressed as follows:

    $T_{read} = T^r_{copy} + T^r_{io1} + T^r_{io2} + T^r_{misc}$    (3)

    $T_{write} = T^w_{copy} + T^w_{io1} + T^w_{io2} + T^w_{misc}$    (4)

$T^r_{copy}$ includes the cost of linearization and the system-to-user buffer copy. $T^w_{copy}$ represents the cost of the system-to-user buffer copy. Recalling the discussion from Section 2, collective I/O tries to improve I/O performance by replacing a large number of small I/O requests with a smaller number of large I/O requests. Specifically, collective I/O tries to minimize the I/O overhead that results from servicing a large number of small requests; in other words, collective I/O tries to reduce the large $T^r_{misc}$ or $T^w_{misc}$ cost.

Comparing Equations 1 and 3, and 2 and 4, we can observe that in our implementation, the collective I/O routines involve more operations (mainly in the communication phase) than their non-collective counterparts. In the non-collective operation, there are P processors each accessing r bytes, whereas in the collective operation, data access is performed by m <= P masters, each master accessing rP/m >= r bytes. The increase in the I/O granularity can affect the following parameters:

- Copy costs, $T^{cr}_{copy}$ and $T^{cw}_{copy}$: In collective write, the increase in the I/O granularity will not change the copy cost because the number of copies performed will still be the same. However, in collective read, the number of copies performed in the linearization phase (phase 5, Figure 5) will increase. The increase will be prominent for smaller stripe unit sizes. The increased copy cost can degrade collective read performance.

- I/O costs, $T^{cr}_{io}$ and $T^{cw}_{io}$: The two components of the I/O cost, $T_{io1}$ and $T_{io2}$, can show different trends as the I/O granularity increases. $T_{io2}$ can decrease if the increased I/O granularity improves the I/O parallelism (i.e., the collective request spans more disks than its non-collective counterpart). The two cases where $T_{io2}$ can increase with the increase in I/O granularity are:
  1. r < SU and R < SU
  2. r > (SU * SD), which implies R > (SU * SD)
  In both cases, the large collective I/O request does not improve I/O parallelism. The other component, $T_{io1}$, is the cost of communicating data between the I/O servers and the processors. An increase in I/O parallelism leads to an increase in communication complexity, which in turn may lead to communication contention and degrade the $T_{io}$ performance.

- Miscellaneous costs, $T^{cr}_{misc}$ and $T^{cw}_{misc}$: These costs are mainly due to the bottleneck at the token manager. For collective write, for small values of MAX_COLL_DATA_LEN and m > 1, $T_{misc}$ can dominate the overall performance.
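The two granularity cases above can be checked with a small helper that compares how many stripe units, and hence disks (up to SD), a single non-collective request and the coalesced collective request touch. The sketch below assumes an aligned, contiguous request and round-robin striping; it is an illustration, not part of the PFS.

    /* Does coalescing P requests of r bytes into one request of R = P*r bytes
     * touch more stripe units (and hence more disks, up to SD)? */
    #include <stdio.h>

    static long stripe_units_touched(long bytes, long su) {
        return (bytes + su - 1) / su;          /* ceiling(bytes / SU) */
    }

    int main(void) {
        long SU = 64 * 1024, SD = 64;          /* stripe size and stripe directories (trex defaults) */
        long r = 16 * 1024, P = 64, R = r * P;
        long su_single = stripe_units_touched(r, SU);
        long su_coll   = stripe_units_touched(R, SU);
        long disks_single = su_single < SD ? su_single : SD;
        long disks_coll   = su_coll   < SD ? su_coll   : SD;
        printf("non-collective request spans %ld disk(s), collective spans %ld disk(s)\n",
               disks_single, disks_coll);
        return 0;
    }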

In Section 5, we present detailed experimental results and check if the conclusions reached in this section are observed in practice.

5 Experimental Evaluation

This section presents performance evaluation and measurements of the collective I/O implementation.

5.1 Experimental Setup

We ran our experiments on two Paragon XP/S machines, raptor and trex. For the experiments, these machines were rebooted with our version of the file system. Further, all experiments were performed in dedicated mode (only one user had access to the machines).

raptor is a 49-node Paragon XP/S machine with 8 MP nodes (with 64 MBytes of memory) and 41 GP nodes (with 64 MBytes of memory). Out of the 8 MP nodes, 4 nodes also act as I/O servers. Each I/O server is configured with a Fast SCSI-16 card and a 4 GByte Seagate ST15150W disk with an external transfer rate of 20 MBytes/sec. Our experiments used a PFS configured with 4 stripe directories (SD = 4) and a stripe size of 64 KBytes. On raptor, we used NICA as the underlying kernel.

trex is a 554-node Paragon XP/S with 448 GP nodes (with 32 MBytes of memory) and 93 MP nodes (80 with 64 MBytes and 13 with 32 MBytes of memory). The remaining nodes were used as network, HiPPI, or service nodes. We carried out our experiments on the 64-node sio partition. This partition represents a mini Paragon with 64 compute nodes which also act as I/O servers, each with 64 MBytes of memory, a Fast SCSI-16 card, and a 4 GByte Seagate ST15150W disk. Our PFS was configured with 64 stripe directories and a 64 KBytes stripe size. On trex, we used NICB as the default kernel.

Both machines had the same communication network, with a peak bandwidth exceeding 200 MBytes/sec. trex represents an ideal machine with one disk per node, whereas raptor represents an under-balanced machine with a disk per 12 nodes. Results on raptor are important as many real-life systems have an under-balanced I/O subsystem.

5.2 Experimental Methodology

To evaluate the collective I/O prototype, we measured the execution time of a creadc() or cwritec() system call under different workloads. To simulate a workload, we used a synthetic benchmark program and set the data access and layout parameters appropriately. We compared the performance of creadc() and cwritec() with their non-collective counterparts, cread() and cwrite(). To perform the non-collective data accesses, we used two different PFS I/O modes, M_RECORD and M_UNIX. These modes were chosen for the following reasons:
1. Both M_RECORD and M_UNIX provide high I/O bandwidth.
2. Both I/O modes have individual file pointers.
3. In M_RECORD mode, each processor has to access the same amount of data. The file offsets are computed implicitly according to the node numbers. The I/O requests are serviced in a first-come, first-served manner. The implementation of M_RECORD is highly parallelized to give the maximum possible I/O parallelism.
4. In M_UNIX mode, each processor can access a different amount of data. The I/O requests are serviced on a first-come, first-served basis; however, the accesses to the I/O server are serialized to maintain UNIX semantics. In essence, a parallel access by P processors in M_UNIX mode can be logically considered a sequential access in which a single processor makes P requests, each accessing a different amount of data.
5. In many cases, M_ASYNC mode provides better performance than M_RECORD mode [ACR96]. Since M_RECORD is the most used PFS mode, we did not use M_ASYNC mode for comparison. The M_RECORD performance can be considered a lower bound on the M_ASYNC performance.

In our experiments, each processor accessed an equal amount of data. The logical mapping used by the M_COLL mode was adjusted to represent the logical mapping used by the M_RECORD mode. In order to investigate the effects of request sizes, each experiment was repeated for 4 different request sizes (BSIZE): 16, 64, 256, and 1024 KBytes. To analyze scalability, the number of processors was varied from 4 to 64. In each case, the resultant file size was (number of processors * request size) KBytes. The smallest file accessed was 64 KBytes and the largest was 64 MBytes. In addition, we simulated different file layout strategies by changing the stripe size of the PFS using the fcntl() system call. We ran experiments for three different stripe sizes: 64 KBytes (default), 16 KBytes, and 1024 KBytes. The experiments were repeated on both raptor and trex. For the collective I/O routines, MAX_COLL_DATA_LEN was set to 4 MBytes. In each experiment, we measured two timings:

- Wall-clock timing, WALL: WALL was measured in seconds using the user-level dclock() routine. For collective system calls, the largest time required by any processor was reported (a minimal measurement sketch is shown after this list).

- Internal system timings: We used the 64-bit Paragon hardware clock to measure the cost of various internal operations in microseconds. Specifically, we measured the cost of buffer copy, COPY, Mach/NORMA communication, COMM, and low-level I/O, SVIO. Note that SVIO includes the cost of data access and the cost of Mach/NORMA communication between I/O servers and processors. For collective system calls, internal system timings were measured for the masters only, and the maximum time required in each category was reported. For creadc(), we found that COMM measured for clients and masters did not vary significantly. Therefore, we measured COMM only for the masters.
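A minimal sketch of the WALL measurement follows. It assumes only that dclock() returns elapsed time in seconds, as stated above; the cwritec() prototype is our assumption for illustration, and the per-processor result would then be reduced to a maximum over all processors.

    /* Bracket one collective call with the user-level dclock() routine. */
    extern double dclock(void);                                     /* Paragon wall clock, in seconds */
    extern long cwritec(int gd, int fd, char *buf, long nbytes);    /* prototype assumed for illustration */

    static double time_cwritec(int gd, int fd, char *buf, long nbytes)
    {
        double t0, t1;
        t0 = dclock();
        cwritec(gd, fd, buf, nbytes);
        t1 = dclock();
        return t1 - t0;   /* WALL for this processor; report the maximum over all processors */
    }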

5.3 Results

Tables 1 to 6 present the experimental evaluation and comparison of collective I/O on raptor, whereas Tables 7 to 12 present the results of similar experiments on trex. For collective I/O, the number of masters used for a particular access is shown along with the COMM cost (number in parentheses). Using the performance results, we first make some general observations and then analyze the different trends in greater detail.

General Observations

- In all experiments, collective read and write showed significant performance improvement over non-collective read and write in M_UNIX mode. This result is significant as it shows that, in practice, collective I/O improves performance by replacing a large number of small I/O requests with a small number of large I/O requests.

- As compared to M_RECORD, however, collective read and write did not fare as well. For small request sizes, collective read and write showed a performance improvement over M_RECORD accesses. However, as the request size was increased, due to a variety of reasons (discussed later), the collective I/O performance degraded.

- Collective I/O performance was affected by communication, copy, and I/O costs. As per the analysis presented in Section 4.2.3, collective write required more communication time than collective read, and collective read required more time for copying data than collective write. Similarly, in many cases, the time required by the I/O servers (SVIO) to access collective data was significantly more than that required for non-collective data.

- As the number of masters was increased, the amount of data accessed per master decreased. As a result, communication and copy costs decreased. However, the overall (wall-clock) time initially decreased, reached a minimum, and then increased. This pattern was attributed to an increase in the cost of accessing the PFS token manager for a large number of masters.

- Effects of the stripe unit size were visible only on trex. Since raptor had only four disks, variation in the stripe unit size did not significantly change the file striping pattern.

- Between trex and raptor, different access modes illustrated different trends. The M_RECORD performance improved considerably on trex, whereas the M_UNIX performance degraded significantly due to an increased server-side wait.

Specific Trends

- Effect of request size: For small request sizes, collective I/O provided a significant performance improvement over both M_UNIX and M_RECORD accesses. The main reason for this improvement was that in such cases the collective I/O request became large enough to exploit I/O parallelism via data striping. For example, on raptor, a write access pattern with 4 nodes and a 16 KBytes request size (Table 1) required 18733 microseconds in M_UNIX mode, 35884 microseconds in M_RECORD mode, and only 3780 microseconds in M_COLL mode to perform the low-level I/O (SVIO). However, beyond a certain data size (which depends on the stripe unit size and the number of stripe directories, Section 4.2.3), collective I/O failed to exploit I/O parallelism. In fact, the cost of the low-level I/O (SVIO) increased significantly, degrading the overall collective I/O performance. For example, consider the case of 4 processors performing collective accesses on trex (Tables 7 and 8). Notice that SVIO shows a marked increase for the request size of 1024 KBytes (96400 microseconds for collective write and 137358 microseconds for collective read), and correspondingly WALL increases.

- Low-level I/O cost SVIO: SVIO depends on many factors, which include the stripe unit size, the number of stripe directories, the ratio of processors to stripe directories, and the request size. In our experiments, we did not modify the number of stripe directories from the default. In many cases, for large individual request sizes (>= 64 KBytes) and for large numbers of processors, we observed that SVIO for the collective request was larger than that required for the individual requests. The cause of this increase was the increased communication between the I/O servers and the collective masters. For example, on trex, a write with 16 nodes and a request size of 256 KBytes required 26883 microseconds (SVIO) in M_UNIX mode, 52614 microseconds in M_RECORD mode, and 149735 microseconds in M_COLL mode.
To verify the effects of stripe unit sizes, we ran our experiments for three values of the stripe unit size: 16 KBytes, 64 KBytes, and 1024 KBytes. On raptor, we did not observe a significant change in performance as the stripe size was modified. The reason for this trend is that, with 4 disks, the change in stripe unit size did not drastically change the parallel disk access pattern (i.e., the number of disks accessed in parallel remained the same). However, on trex, we could clearly see the effects of the change in the stripe size. The three access patterns, M_UNIX, M_RECORD, and M_COLL, reacted differently to the change. In addition, read and write performance showed different trends for different stripe unit sizes. On trex, the M_UNIX write cost (both SVIO and WALL) increased as the stripe size was decreased to 16 KBytes, but WALL decreased as the stripe size was increased to 1024 KBytes, whereas the M_UNIX read cost (WALL, and in some cases SVIO) increased as the stripe size was changed from 64 KBytes. M_RECORD read and write, on the other hand, showed a marked increase in cost (both SVIO and WALL) for the 16 KBytes stripe size and a corresponding reduction in cost as the stripe unit size was increased to 1024 KBytes. The M_COLL mode illustrated the same trend as the M_UNIX read. These performance trends are very difficult to analyze, and we believe a combination of factors (which include the request size, the cost of communication between I/O servers and masters, the number of stripe directories, and the overheads incurred while accessing I/O servers and token managers) is responsible for this irregular behavior.


- Communication cost COMM: From the experimental results, we can observe the following trends in the communication pattern:
  - The communication cost for sending 16 KBytes messages was relatively large. For collective read, the message cost dropped as the message size was increased; however, after a certain point, the message cost rose again. For collective write, due to the increasing wait incurred while receiving messages, the message cost rose as the message size was increased.
  - In all experiments, collective write required more communication time than collective read. In many cases, the communication cost was significantly more than the I/O cost (SVIO), and the problem became communication bound. Examples of such cases are: (1) a collective write on trex with 64 nodes and a BSIZE of 16 KBytes required 258646 microseconds for communication and 64935 microseconds for I/O (SVIO) (Table 7); (2) a collective write on raptor with 32 nodes and a BSIZE of 16 KBytes required 102550 microseconds for communication and 15566 microseconds for I/O (Table 1).
  - In all experiments, as the number of processors was increased, the communication cost increased. This pattern was observed for both collective read and write. The increase in the communication cost is due to increased communication contention between the different kernels.
  - The choice of kernel (NICA or NICB) did not affect the communication performance.

- Copy cost COPY: In collective read, the cost of linearization (phase 5, Figure 5) increased the overall copy cost. In all experiments, collective write required a smaller copy cost than collective read. For example, on raptor, a collective write with 16 processors and a 64 KBytes request size required 828 microseconds (Table 1), and the corresponding collective read required 5451 microseconds (Table 2). On trex, for the same access pattern, collective write took 365 microseconds for data copying and collective read took 5661 microseconds (Tables 7 and 8).

- Number of masters in collective I/O: In both collective read and write, as the number of masters was increased, the communication, copy, and I/O costs decreased. However, as the number of masters crossed a certain threshold, the cost of accessing the PFS token manager dominated and the overall performance dropped. The optimal number of masters was different for collective read and write. For collective write, each token-manager update took more time than for collective read. Therefore, in collective write, performance often peaked at a smaller number of masters. The effect of token-manager access was more prominent for large numbers of processors. For example, for 64 processors, collective read gave the best performance with 4 masters for request sizes of 16 KBytes and 64 KBytes, but for collective write, performance dropped as soon as the number of masters was increased from 1 (Tables 8, 10, and 12).

- Performance comparison between the three I/O modes: Among the three I/O modes, M_RECORD consistently gave the best performance and M_UNIX gave the worst performance. The collective I/O performance depended on various factors and, in some cases, was better than M_RECORD. The server-side bottleneck severely restricted the performance of M_UNIX mode. The bottleneck was more prominent when a large number of processors wrote a large amount of data (1 MByte per processor). For example, it took 16.33 seconds for 64 processors to read 1 MByte of data per processor in M_UNIX mode (Table 12). The same access pattern required 0.44 seconds in M_RECORD mode and 1.26 seconds in M_COLL mode.

Table 1: Performance of write on raptor. SU = 64 KBytes, SD = 4. (In Tables 1-12, SVIO, COMM, and COPY are in microseconds, WALL is in seconds, BSIZE is in KBytes, and a parenthesized value next to COMM gives the number of masters used.)

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       18733     0.089   35884      0.099   8752          383     3780       0.037
4      64       16730     0.111   31444      0.121   11359         435     18992      0.043
4      256      55823     0.229   33501      0.118   35042(2)      639     55832      0.099
4      1024     60134     0.232   135852     0.197   53280(2)      867     263028     0.329
8      16       29239     0.181   37189      0.163   21018         835     7303       0.014
8      64       10603     0.179   21975      0.119   30860         619     28924      0.086
8      256      24914     0.188   62810      0.152   32217(2)      648     38142      0.167
8      1024     44917     0.433   115986     0.236   36575(4)      771     303676     0.347
16     16       21052     0.291   69581      0.181   53820         567     26468      0.08
16     64       43771     0.384   29323      0.132   72720         828     54921      0.14
16     256      61707     0.421   178980     0.291   23234(4)      587     153272     0.33
16     1024     69939     0.827   157602     0.348   128479(4)     660     194185     0.484
32     16       25699     0.575   86523      0.21    102550        566     12834      0.16
32     64       124652    0.71    51254      0.162   32960(4)      559     30068      0.212
32     256      136331    0.73    370345     0.485   60166(4)      559     190450     0.4
32     1024     86631     1.658   351131     0.51    144610(8)     937     294252     0.75

5.4 Discussion

In the previous section, we found that, in our implementation, collective I/O does not improve performance for all access patterns. The results also showed that a well-implemented non-collective I/O approach can perform better than its collective counterpart.

Let us take another look at the low-level collective I/O strategy. Outwardly, it appears to be an exact replica of the Two-phase I/O strategy. However, there are some significant differences. In our approach, there is neither a conforming distribution nor does the file system know about the data distribution. The implicit-offset interface generates a contiguous collective access request using the logical processor mapping. The order in which data is accessed is independent of the language and of the file storage order. In our collective approach, data travels only once through the communication network. However, as in Two-phase I/O, our implementation also requires additional buffer space. The space requirement could be reduced at the expense of increased communication. Since NORMA IPC is very expensive, we decided to use a large buffer space to minimize the communication.

Another question to be answered is why we did not implement Disk-directed I/O.

Table 2: Performance of read on raptor. SU = 64 KBytes, SD = 4.

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       6408      0.071   35884      0.099   10728         696     8044       0.023
4      64       11009     0.078   16634      0.044   1946          1751    18998      0.038
4      256      83927     0.378   45898      0.13    3666          6093    90333      0.119
4      1024     83352     0.375   244468     0.26    4054(2)       24188   263028     0.329
8      16       13757     0.159   17750      0.04    12881         1055    13496      0.039
8      64       26085     0.215   25601      0.077   10480         3194    51942      0.077
8      256      34421     0.329   87427      0.168   3736(2)       6235    93543      0.179
8      1024     80758     0.721   220056     0.493   3221(4)       13435   419438     0.47
16     16       6211      0.334   16187      0.059   22501         1930    19775      0.064
16     64       33283     0.66    55091      0.098   29194         5451    83361      0.116
16     256      34481     0.648   326656     0.35    4321(4)       5880    207062     0.294
16     1024     80656     1.51    707468     0.898   12441(4)      23382   725633     0.824
32     16       6560      0.67    38951      0.14    37729         3136    51064      0.115
32     64       35112     1.273   72843      0.168   6721(4)       3144    92061      0.148
32     256      27185     1.32    327255     0.629   10516(4)      10881   371946     0.475
32     1024     79636     2.89    1112062    1.807   11828(8)      22930   1397039    1.61

Table 3: Performance of write on raptor. SU = 16 KBytes, SD = 4.

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       13977     0.087   11681      0.064   8590          388     12694      0.0432
4      64       37363     0.136   43251      0.105   11524         673     9208       0.05
4      256      13652     0.174   15625      0.095   20605         397     26608      0.087
4      1024     59490     0.336   91550      0.136   23546(2)      470     155480     0.302
8      16       14923     0.174   18402      0.088   20369         1240    9183       0.045
8      64       22320     0.229   228984     0.182   29263         582     12247      0.07
8      256      9372      0.315   16116      0.126   23471(2)      830     32632      0.171
8      1024     18816     0.587   154220     0.222   22797(4)      418     384837     0.385
16     16       20823     0.283   21968      0.127   46049         959     9408       0.076
16     64       40869     0.393   269361     0.265   60929         831     20640      0.124
16     256      61278     0.63    101229     0.17    22384(4)      834     126276     0.291
16     1024     77039     1.152   307767     0.289   70262(4)      852     330931     0.545
32     16       25268     0.529   21666      0.169   97920         347     12708      0.140
32     64       25723     0.655   272756     0.52    26243(4)      946     40024      0.285
32     256      27194     1.266   146578     0.253   48355(4)      821     99331      0.394
32     1024     74337     2.272   254851     0.430   69164(8)      1038    673962     0.927

Table 4: Performance of read on raptor. SU = 16 KBytes, SD = 4.

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       13977     0.087   11681      0.064   10503         2180    12179      0.028
4      64       37363     0.136   43251      0.105   2027          5662    22967      0.044
4      256      34616     0.153   15625      0.095   1956          1786    19542      0.129
4      1024     98676     0.464   293818     0.306   3424(2)       12921   246879     0.31
8      16       14923     0.174   18402      0.088   16183         3225    15581      0.041
8      64       22320     0.229   52351      0.058   17593         9890    42922      0.074
8      256      32391     0.321   103520     0.181   3536(2)       6152    132127     0.173
8      1024     84042     0.883   154220     0.222   3484(4)       13577   404617     0.478
16     16       6427      0.385   30112      0.047   21751         4975    23203      0.061
16     64       21246     0.399   269361     0.265   30246         19791   84910      0.138
16     256      33705     0.670   264169     0.327   4643(4)       4692    214692     0.305
16     1024     97045     1.8     766385     1.00    11267(4)      22331   530754     0.915
32     16       6482      0.666   44635      0.069   37241         9921    39420      0.097
32     64       17191     0.895   130961     0.246   5584(4)       3449    60119      0.196
32     256      23504     1.33    356027     0.60    10470(4)      10366   497290     0.606
32     1024     74337     2.272   1697150    1.76    12044(8)      24304   1438206    1.55

Table 5: Performance of write on raptor. SU = 1024 KBytes, SD = 4.

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       15798     0.093   43373      0.105   8206          362     11218      0.038
4      64       16417     0.08    14995      0.08    16486         343     3596       0.045
4      256      37671     0.117   46394      0.091   10297(2)      4012    31578      0.148
4      1024     150934    0.34    158623     0.217   58476         947     247702     0.324
8      16       10205     0.178   41968      0.164   30373         378     8178       0.0524
8      64       11659     0.152   23749      0.1     32590         375     19733      0.071
8      256      38307     0.251   25543      0.097   19720(4)      350     64285      0.22
8      1024     52993     0.566   84897      0.169   35691(4)      418     171344     0.309
16     16       63604     0.809   78882      0.246   52978         1227    12117      0.0813
16     64       17844     0.527   97338      0.118   36754(2)      830     24017      0.20
16     256      42541     0.437   115508     0.176   22122(4)      1276    281462     0.40
16     1024     76761     0.747   102816     0.224   60696(4)      442     731197     0.9
32     16       25823     0.55    223188     0.429   99774         2001    29845      0.16
32     64       25340     1.147   61661      0.23    25904         377     22265      0.293
32     256      59652     2.44    383742     0.552   48633(4)      962     173558     0.44
32     1024     79652     1.618   128215     0.312   143311(4)     909     256646     0.676

Table 6: Performance of read on raptor. SU = 1024 KBytes, SD = 4.

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       6583      0.075   44674      0.055   10806         712     7793       0.024
4      64       25845     0.094   41956      0.08    1997          1050    20436      0.04
4      256      63303     0.305   213789     0.22    1076(2)       3630    112224     0.165
4      1024     224222    0.997   205658     0.22    3031(2)       3017    254219     0.291
8      16       6447      0.145   48900      0.066   13002         729     16212      0.05
8      64       23056     0.23    69670      0.11    1891(2)       1774    52975      0.08
8      256      66030     0.616   107246     0.132   1351(4)       1205    124471     0.24
8      1024     206176    1.87    263687     0.549   2794(4)       3314    422383     0.482
16     16       200772    3.53    120999     0.13    22619         725     22619      0.057
16     64       17844     0.527   87242      0.255   6236(2)       761     132521     0.161
16     256      57351     1.29    232007     0.247   4026(4)       6045    241990     0.357
16     1024     202302    3.64    701746     0.91    11154(4)      4104    731197     0.90
32     16       5853      0.671   158769     0.269   13765(2)      1812    44405      0.09
32     64       25340     1.147   297913     0.322   6459(4)       2823    113767     0.223
32     256      59652     2.44    383742     0.55    9963(4)       1898    421655     0.47
32     1024     216027    6.95    915717     1.607   26796(4)      9569    1121729    1.55

Table 7: Performance of write on trex. SU = 64 KBytes, SD = 64.

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       38562     0.117   23703      0.099   8774          323     22911      0.049
4      64       30402     0.192   15171      0.082   12023         312     25932      0.05
4      256      38248     0.21    28341      0.095   32202(2)      384     66521      0.11
4      1024     45629     0.353   77109      0.155   58047(2)      446     96400      0.444
8      16       39584     0.598   70544      0.165   20359         312     17068      0.013
8      64       14207     0.253   34883      0.127   36464         318     141188     0.115
8      256      23971     0.32    49830      0.141   59458         385     79409      0.155
8      1024     50297     0.571   164742     0.281   96591(4)      433     171981     0.612
16     16       20624     0.409   51527      0.159   49377         309     39300      0.109
16     64       18304     0.40    32242      0.136   79287         365     91949      0.185
16     256      26883     0.551   52614      0.167   112308        860     149735     0.37
16     1024     52934     1.10    121518     0.23    63654(4)      435     177484     0.624
32     16       24856     0.717   42097      0.185   127975        834     35923      0.19
32     64       22798     0.716   45406      0.155   131376        351     79242      0.326
32     256      18728     0.99    63887      0.182   22364(8)      371     79298      0.804
32     1024     57273     1.89    149904     0.29    158734(4)     451     34396      0.893
64     16       30899     1.34    25998      0.178   258646        378     64935      0.239
64     64       34517     1.417   43897      0.17    81804         867     115194     0.899
64     256      22456     1.93    69115      0.233   38899(8)      814     60950      0.95
64     1024     89912     3.83    340652     0.49    34679(8)      1256    37995      0.955

Table 8: Performance of read on trex. SU = 64 KBytes, SD = 64.

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       6511      0.084   23703      0.099   10732         921     10844      0.034
4      64       10671     0.085   15171      0.082   1975          1782    18709      0.042
4      256      32174     0.129   34211      0.05    3543          7895    65598      0.099
4      1024     68020     0.302   75929      0.105   2985(2)       15513   137358     0.183
8      16       5995      0.194   31724      0.062   13759         1059    12574      0.049
8      64       10686     0.178   34883      0.127   16751         3359    31208      0.062
8      256      24339     0.266   44065      0.064   10416         13234   124825     0.173
8      1024     84391     0.668   98336      0.127   12107(4)      25297   273883     0.343
16     16       6063      0.393   20382      0.067   22140         1873    29446      0.0813
16     64       12012     0.392   10786      0.056   26973         5661    66066      0.099
16     256      26908     0.526   52614      0.167   10446(2)      12458   139713     0.22
16     1024     75368     1.31    118673     0.152   12147(4)      27334   313446     0.4
32     16       6491      0.834   39252      0.087   37772         3096    38826      0.118
32     64       11022     0.804   11452      0.067   18951         6847    99107      0.133
32     256      46393     1.52    56560      0.097   3524(8)       2369    85412      0.62
32     1024     73267     2.62    171805     0.22    29534(4)      34590   618218     0.856
64     16       6610      1.675   31135      0.093   21460(4)      1711    25015      0.101
64     64       8615      1.544   25911      0.085   17665(4)      5655    90195      0.1494
64     256      40389     2.96    97341      0.167   7734(8)       6061    53397      0.347
64     1024     83418     5.267   381542     0.511   8994(8)       6721    58432      0.4332

Table 9: Performance of write on trex. SU = 16 KBytes, SD = 64.

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       56642     0.228   11210      0.067   9177          373     41246      0.066
4      64       87244     0.383   15802      0.079   7721(2)       460     38981      0.307
4      256      263537    0.614   41734      0.097   20912(2)      869     194251     0.23
4      1024     133661    0.544   153248     0.221   58842(2)      933     272360     0.363
8      16       1311      0.252   17205      0.093   19719         833     26285      0.058
8      64       17334     0.306   14469      0.101   28769         840     65507      0.107
8      256      70312     0.613   105966     0.18    26468         352     129185     0.259
8      1024     146291    1.087   259103     0.353   31582         430     113907     0.544
16     16       15761     0.409   15920      0.099   45877         834     38524      0.097
16     64       14064     0.502   14038      0.111   60668         931     117358     0.194
16     256      44896     0.859   59986      0.176   109652        847     277825     0.421
16     1024     134290    2.138   370626     0.549   65180(4)      461     180367     0.707
32     16       19974     0.718   14838      0.11    99403         827     65239      0.178
32     64       21605     0.979   19890      0.135   129658        949     187124     0.34
32     256      49318     1.623   47888      0.23    56693(4)      823     118244     0.6693
32     1024     140425    4.297   393543     0.792   25784(8)      425     116887     0.981
64     16       23904     1.369   17983      0.116   204085        377     117219     0.357
64     64       33824     1.7     30754      0.187   287464        355     279518     0.602
64     256      55823     3.098   86314      0.364   48292(8)      357     122729     1.033
64     1024     146114    8.49    465727     1.10    165948(8)     442     438352     1.498

Table 10: Performance of read on trex. SU = 16 KBytes, SD = 64.

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       6472      0.064   10193      0.028   9920          2580    13648      0.037
4      64       17481     0.095   30029      0.043   610(2)        1353    31544      0.07
4      256      50628     0.226   60164      0.073   1228(2)       3808    40435      0.125
4      1024     186916    0.77    210284     0.248   3936(2)       14550   139055     0.289
8      16       6497      0.146   15556      0.038   14690         3780    23775      0.053
8      64       17775     0.213   40160      0.053   1996(2)       2364    29130      0.085
8      256      52118     0.463   65265      0.075   4901(2)       6387    188692     0.226
8      1024     193333    1.59    262868     0.27    3693(4)       16214   149148     0.377
16     16       4600      0.292   24079      0.06    23173         5838    47406      0.095
16     64       29810     0.461   47218      0.059   7530(2)       3644    45423      0.137
16     256      52695     0.962   81481      0.092   12567(2)      12661   231572     0.287
16     1024     196844    3.17    295607     0.337   15552(4)      28006   297160     0.5312
32     16       5903      0.622   28749      0.05    13688         1714    34375      0.113
32     64       30495     1.092   49722      0.057   20575(2)      6253    67208      0.235
32     256      50399     1.919   42527      0.147   10949(4)      12030   146158     0.390
32     1024     30829     6.432   401181     0.596   3307(8)       17314   170100     0.738
64     16       25283     1.39    33905      0.05    28299(4)      2975    51792      0.154
64     64       20040     2.124   60063      0.071   19726(4)      7147    76698      0.332
64     256      58357     3.973   249576     0.29    10376(8)      12899   166679     0.749
64     1024     204972    13.17   468817     0.938   28101(8)      64648   708713     1.31

Table 11: Performance of write on trex. SU = 1024 KBytes, SD = 64.

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       35629     0.138   16167      0.102   11023         338     6880       0.044
4      64       13196     0.081   16966      0.109   15456         367     12331      0.05
4      256      64693     0.157   23537      0.093   26482         388     138954     0.183
4      1024     191072    0.449   75334      0.155   96604         428     245371     0.355
8      16       16462     0.175   19445      0.11    21154         344     7270       0.054
8      64       11864     0.18    20205      0.114   31521         833     17823      0.068
8      256      65478     0.351   29261      0.127   73540         844     256611     0.23
8      1024     172515    0.795   90545      0.196   52052(4)      1494    101232     0.607
16     16       21066     0.336   95173      0.394   48972         821     8344       0.084
16     64       46095     0.36    21565      0.128   67250         1002    130121     0.215
16     256      53123     0.51    37984      0.145   150030        367     281911     0.459
16     1024     121566    0.733   114043     0.234   68262(4)      1056    206325     0.832
32     16       25137     0.676   34657      0.309   111607        373     10562      0.084
32     64       49009     0.764   23979      0.137   129679        333     179371     0.323
32     256      55866     0.92    14609      0.17    49634         480     106676     0.639
32     1024     181656    1.33    155018     0.29    143253        1036    467816     1.12
64     16       50081     1.34    36922      0.19    200392        346     125228     0.345
64     64       57501     1.37    28415      0.147   270383        816     219558     0.51
64     256      59660     1.75    37117      0.196   112538(4)     391     183278     0.854
64     1024     63251     2.84    232096     0.4     146376(8)     468     512012     1.58

Figure 7: Implementation of Disk-directed Read. [Diagram omitted: panels (A) and (B) show four compute nodes with their user buffers and emulators, the Mach buffers, the collective buffers, and two I/O servers with Disk 0 and Disk 1; the drawing could not be recovered from the extracted text.]

Table 12: Performance of read on trex. SU = 1024 KBytes, SD = 64.

                M_UNIX            M_RECORD           M_COLL
Nodes  BSIZE    SVIO      WALL    SVIO       WALL    COMM          COPY    SVIO       WALL
4      16       6807      0.098   29356      0.055   10857         712     7689       0.0312
4      64       22102     0.13    11083      0.038   2496          842     20564      0.045
4      256      107470    0.347   28258      0.051   1106(2)       1220    112239     0.13
4      1024     235685    1.14    76729      0.102   3216(2)       3507    268436     0.306
8      16       6490      0.168   21193      0.052   14474         750     12072      0.042
8      64       24239     0.329   16654      0.050   1887(2)       2004    25847      0.061
8      256      121229    0.719   43912      0.063   1110(4)       1230    38242      0.2582
8      1024     253727    2.13    96210      0.13    3355(4)       16017   243733     0.387
16     16       6014      0.34    21821      0.067   22433         751     20682      0.08
16     64       44544     0.36    20081      0.06    8375(4)       899     116710     0.146
16     256      77370     1.432   56230      0.08    4136(4)       1173    240360     0.302
16     1024     217938    4.23    124836     0.163   11771(4)      17609   321037     0.55
32     16       6228      0.718   25285      0.08    37847         797     114272     0.165
32     64       33707     1.31    23979      0.137   7850(4)       3531    40988      0.27
32     256      84554     2.826   53467      0.092   11300(4)      12694   339893     0.431
32     1024     232493    8.27    163270     0.26    29833(4)      12102   646255     0.925
64     16       6115      1.547   29418      0.091   14708(4)      1702    40640      0.249
64     64       24423     2.52    23854      0.082   19228(4)      7092    70482      0.363
64     256      63251     5.52    83355      0.15    24631(4)      22278   312374     0.56
64     1024     230606    16.33   292974     0.44    29532(8)      45897   697973     1.26

Figure 7 illustrates the Disk-directed implementation of collective read. The global access pattern is the same as that used in Figure 5. The Disk-directed I/O implementation requires 8 phases. First, all processors exchange access information (1); in the second phase, a compute processor (in this case, node 2) sends the global access information to the I/O servers. Each I/O server, on receipt of the request, computes what data to fetch (3) and reads the data into the collective data buffer (4). Notice that the fetched data is contiguous in the disk space. The collective data is then communicated via either Mach messages (5) or NORMA messages (7) to the compute processors. The compute processors, on receiving the data messages, copy the data into the user space (linearization, phase 8). Comparing this implementation with our collective implementation (Figure 5), it is easy to observe that Disk-directed I/O does not depend on the file storage order. Furthermore, the data flow is directed by the I/O servers. However, Disk-directed I/O requires more communication and data copying than collective I/O. Figure 7 illustrates an ideal case in which each I/O server has enough space to store the collective data. In practice, this assumption will not hold and the communication complexity will increase. In summary, the following reasons collectively explain why we did not implement Disk-directed I/O.

- Disk-directed I/O assumes that the I/O servers have a certain degree of intelligence, so that they, rather than the compute processors, can direct the I/O flow. In addition, Disk-directed I/O requires an internal compute processor-I/O server (CP-IOP) interface (phases 2 and 3, Figure 7). In [Kot96], Kotz suggests using either a PASSION-like (array-based) or a nested-batched (file-based) interface as the CP-IOP interface.


In the existing Paragon OS framework, the I/O servers have minimal intelligence. They neither have the global file-striping information nor know what the other I/O servers may be accessing. Most of the functionality is implemented in the emulator. Emulators can, therefore, be considered as smarter compute processors. The I/O servers are not burdened with the task of computing what data to fetch, where to send it, etc. In some respects, this configuration is better than the one suggested in Disk-directed I/O.[7]

[7] However, I/O servers should have some intelligence for optimizing cross-application I/O using collective techniques [PEK96].

- Disk-directed I/O exploits the high-bandwidth communication network to minimize the space requirements. In Disk-directed I/O, the total number of messages generated depends on both the global array access and the file layout pattern. In our case, if a file is striped using a small stripe unit size, Disk-directed I/O may generate a large number of messages (phases 5, 6, and 7, Figure 7). In [Kot94], Kotz assumes the interconnection bandwidth to be 200 MBytes/sec (bidirectional). Our experimental results found that the peak bandwidth of NORMA IPC is about 80 MBytes/sec (for message sizes > 1024 KBytes). Therefore, in the current Paragon configuration, it is not advisable to implement Disk-directed I/O.

- The third important reason is the copy cost. As shown in the previous section, the copy cost, especially that required for linearization, is significant (phases 5 and 8). If we implemented Disk-directed I/O, both the read and write routines would have to linearize the data blocks sent by the I/O servers. The cost of data linearization, in addition to the communication cost, can severely degrade the performance.

In [Kot96], Kotz presents performance numbers for an I/O optimization technique, whereas in Section 5 we present a performance evaluation of collective read and write system calls. The wall-clock time (WALL) includes the time required for both user-level and system-level operations. It should be noted that the implementation is in a development stage, and several optimizations, such as client-side caching of the file offset and length, are currently being implemented. These results are specific to the version and type of system software used in the implementation. We believe our collective I/O implementation will perform better if built on a kernel which provides high communication bandwidth (e.g., SUNMOS).

What are the lessons to be learned from our experiments?

1. The importance of communication networks should not be underestimated. All the existing collective I/O strategies exploit high communication bandwidth for improving I/O performance. As the I/O locality is improved, in many cases an I/O-bound problem can become communication bound (Section 5, [Mor96]). As in our implementation, the design of an I/O optimization will often depend on the capabilities of the communication network.

2. Any collective I/O implementation requires a lot of information (e.g., user access patterns and file layout). If not implemented correctly, the information-gathering phase can generate significant overhead. It is, therefore, essential to design an interface (either at the runtime or the file-system level) which is both expressive and easy to use. The success of any collective I/O implementation will depend on the interface used for information retrieval.

    i = 0
    move the file pointer to the correct position
    while (!EOF && (i < quant)) {
        read amount bytes
        update the file pointer by stride
        i++
    }

Figure 8: Simple Strided Access

6 Future Extensions

The collective interface proposed in Section 4.1 can describe only a few array access patterns. However, the interface can be easily extended to support complex array distributions. In this section, we propose two extensions to the basic collective I/O interface for supporting simple strided accesses and nested-batched accesses. At present, these extensions are implemented in the runtime library, and work is underway to implement them in the file system.

Consider the problem of accessing an array in ROW_BLOCK distribution (Figure 1). If the array is stored in column-major order, each processor will make 4 different requests, each reading a single integer. Moreover, each request will fetch data stored in positions that are apart by a fixed offset. For example, processor 0 will read elements 0, 4, 8, and 12, and each element will be fetched from file locations that are 16 bytes apart. This access pattern is called a simple-strided access pattern [NK95]. A simple-strided file access can be characterized by the access triplet (amount, quant, stride), which translates into the access pattern shown in Figure 8.

The implicit-offset interface can be easily extended to support simple-strided file accesses. As for non-strided accesses, the implicit-offset interface uses the processor mapping information to construct the collective access pattern. For example, consider the problem of reading data in ROW_BLOCK distribution. The per-processor simple-strided access can be specified as (1*sizeof(int), 4, 4*sizeof(int)) (i.e., read 4 bytes 4 times, each access separated by a stride of 16 bytes). Assume the logical processor mapping shown in Figure 3:C. Using the logical mapping and the amount of data accessed by each processor, the file system reconstructs a basic collective request. The overall file access pattern consists of quant basic collective requests, each request separated by an offset of stride. In our example, the basic collective request consists of 4 integers (16 bytes): the first integer to be read by processor 0, the second by processor 1, the third by processor 2, and the fourth by processor 3. Therefore, the overall file access pattern consists of 4 collective requests separated by 16 bytes. Since the size of the basic collective request is the same as the stride, the overall file access pattern consists of a single contiguous request for 64 bytes.

The following routines describe the simple-strided collective accesses. These routines require the following input parameters: (1) group descriptor, gd, (2) file descriptor, fd, (3) pointer to a data buffer, and (4) pointer to the request structure describing the simple-strided access. The file system uses the logical processor mapping along with the request triplet to compute the basic collective request and the overall access pattern.

    creads(gd, fd, &buffer, &request)
    cwrites(gd, fd, &buffer, &request)
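A hedged sketch of issuing the ROW_BLOCK strided read described above is shown below. The layout of the request structure and the creads() prototype are our assumptions for illustration; only the (amount, quant, stride) triplet and the call itself come from the text.

    /* Hypothetical sketch: issue the (1*sizeof(int), 4, 4*sizeof(int)) strided
     * collective read for the ROW_BLOCK example.  Field names and prototypes
     * are assumed, not the actual interface definition. */
    #include <stddef.h>

    struct strided_request {      /* assumed layout of the simple-strided request */
        size_t amount;            /* bytes per access          */
        size_t quant;             /* number of accesses        */
        size_t stride;            /* distance between accesses */
    };

    extern long creads(int gd, int fd, char *buffer,
                       struct strided_request *request);   /* prototype assumed */

    void read_row_block(int gd, int fd)
    {
        int buffer[4];                       /* 4 integers per processor */
        struct strided_request request;
        request.amount = 1 * sizeof(int);    /* read one integer         */
        request.quant  = 4;                  /* four times               */
        request.stride = 4 * sizeof(int);    /* 16 bytes apart           */
        creads(gd, fd, (char *)buffer, &request);
    }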

The creads() and cwrites() routines can be used to express a wide variety of data distributions. However, these routines can only support user accesses that generate a contiguous collective access. This restriction can be removed by using an explicit-offset interface. In the explicit-offset interface, each processor provides the file offset from which its access begins. The collective group information is used only to group multiple accesses together; the processor mapping information is not used for computing the file offsets. Each processor can specify a simple request (i.e., access n bytes from the given offset), a simple-strided request, or a nested-batched request [NK95]. The nested-batched request is a structured request in the form of a list of access pairs, (amount, offset). Each pair specifies the amount of contiguous data to be accessed from a given offset. The offset can be absolute or relative. Nieuwejaar and Kotz have proposed an extremely compact data structure to represent a nested-batched access [NK95]. We have used this data structure for representing the collective version of a general access pattern.[8] The following routines describe the nested-batched collective accesses. These routines require the following input parameters: (1) group descriptor, gd, (2) file descriptor, fd, (3) pointer to a data buffer, and (4) pointer to the request structure describing the nested-batched access.

    creads(gd, fd, &buffer, &request)
    cwrites(gd, fd, &buffer, &request)
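A simplified illustration of the information carried by a nested-batched request appears below. This is not the compact structure of Nieuwejaar and Kotz cited above (which also supports nesting and repetition, and is not reproduced here); all names are hypothetical.

    /* Flat list of (amount, offset) access pairs, each describing a run of
     * contiguous data at an absolute or relative file offset. */
    #include <stddef.h>

    struct access_pair {
        size_t amount;     /* bytes of contiguous data to access                    */
        long   offset;     /* file offset; absolute, or relative to the previous one */
        int    relative;   /* nonzero if `offset` is relative                        */
    };

    struct batched_request {
        size_t              npairs;   /* number of (amount, offset) pairs */
        struct access_pair *pairs;    /* the list itself                  */
    };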

7 Related Work In this section we review related work in interface design and I/O optimization techniques. Several runtime libraries use application programmer's interfaces (APIs) for describing user-level access patterns. Such interfaces allows the users to directly read or write (subsections of) data structures like matrices. Examples of such libraries include PASSION [CBD+ 95], PANDA [SCJ+ 95], Jovian [BBS+94], and Chameleon [GGL93]. Though these APIs are suciently expressive, they are dependent on the high-level programming paradigm. For example, a majority of these libraries can only express regular distributed matrix computations. In the present form, these interfaces can not express irregular matrix operations involving indirection arrays or operations in which arrays are distributed in non-standard (i.e., non-HPF) manner. A notable exception is SOLAR, a runtime library developed by Toledo and Gustavson, which uses an interface that supports non-standard array distribution patterns [TG96]. Runtime libraries like MPI-IO take another approach [For96]. In MPI-IO, the user has to translate the high-level access pattern in the le space. This approach has an important advantage that it makes the API 8 Due to space constraints, we will not describe the nested-batched structure. For more information, refer [NK95].


independent of the high-level programming paradigm. However, it is not always easy to express high-level patterns in the le space. In such cases, a layered approach is practical. A layered approach would require users to use high-level libraries like PASSION, PANDA or SOLAR, to express the access patterns and these libraries would then translate these access patterns into corresponding MPI-IO requests. MPI-IO provides both collective and non-collective I/O routines. Since MPI-IO uses the MPI processor group information for collective I/O, we believe that our collective I/O interface can be directly used for implementing MPI-IO. Some le systems support interfaces that can express high-level accesses. An example of such a le system is Vesta [CFPB93]. Low-level le system interfaces are independent of high-level access patterns and represent I/O accesses in form of amount of data to be accessed from a give le o set. Many le systems provide extensions for expressing user-level I/O parallelism. File access modes provided in Intel machines (iPSC 860, Touchstone Delta, and Paragon), and in the Thinking Machine CM-5 are examples of such extensions [BCR93, KN93, ACR95, KR94]. Recently, several research groups have proposed low-level le system interfaces that support irregular le access patterns. The nested-batched interface supported by the Galley le system [Nie96] provides a compact representation of irregular le access patterns. The Scalable I/O low-level interface [lIC96] supports a generalized version of the nested batched interface. This interface also supports collective I/O but it uses a concept of collective iterations rather than the collective data approach used in this paper. Most of the existing runtime libraries use a variation of the Two-phase or Disk-directed I/O as the underlying I/O optimization technique. PASSION uses the extended two-phase strategy [TC96] for accessing out-of-core data. Jovian [BBS+ 94] implements a dynamic variation of the two-phase strategy. Hemy et. al use the two-phase strategy for improving I/O performance over distributed networks [HS95]. The two-phase strategy is also used by the SOLAR runtime library [TG96]. Collective bu ering [Nit95], a technique used for implementing MPI-IO is a variation of the two-phase strategy. MTIO, a runtime library developed by More et. al [Mor96] implements collective I/O in a multi-threaded environment. The Panda runtime library [SCJ+ 95] uses a variation of Disk-directed I/O called Server-directed I/O. ENWRICH [PEK96] uses the disk-directed I/O for implementing compute processor write caching. Most of the related work in collective I/O involves developing techniques to be used by runtime libraries. Disk-directed I/O is developed using the Proteus simulator [Kot94] and at present, no implementations of Disk-directed I/O in a production le system exist. We believe the work presented in this paper describes the rst implementation of collective I/O in a production le system.

8 Conclusions

In this paper, we presented an implementation of a collective I/O prototype in the Intel Paragon Parallel File System (PFS) and evaluated its performance on two different Paragon systems. We proposed simple extensions to the existing PFS interface for supporting collective I/O, and designed and implemented the corresponding optimization techniques in the PFS code. We compared the performance of the collective I/O routines with the non-collective PFS routines in the M_UNIX and M_RECORD I/O modes.


Our results showed that collective I/O substantially improved performance over M_UNIX mode accesses, illustrating that collective I/O can improve performance by coalescing a large number of small requests. However, in the majority of cases, M_RECORD mode gave better performance than its collective I/O counterpart. We found that the performance of collective I/O was affected by the communication cost. The main cause of the excessive communication cost was the low-bandwidth NORMA IPC subsystem of the Paragon Operating System. Other factors that affected the collective I/O performance were the costs of data copying, the increased I/O costs of accessing larger amounts of data, and the time required by various file-management routines.

Acknowledgments

We thank Paul Messina and James Pool for supporting this project from the very beginning. Many thanks to Thomas Metzger for setting up the Scalable I/O infrastructure; to Heidi Lorenz-Wirzba for allocating sufficient system time on the Paragons; to Sharad Garg and Brad Rullman for helping us understand the Paragon PFS architecture; to Jerry Toman and Stephan Zeisset for explaining the NORMA IPC implementation; to Yuqun Chen and Yuanyuan Zhou for help in understanding the Paragon virtual memory system; and to Alok Choudhary and David Kotz for helpful comments and suggestions. This work was sponsored by the Scalable I/O Initiative, contract number DABT63-94-C-0049 from the Advanced Research Projects Agency (ARPA), administered by the US Army at Fort Huachuca, and was performed using the Intel Paragon systems operated by Caltech on behalf of the Center for Advanced Computing Research (CACR).

References [ACR95] Meenakshi Arunachalam, Alok Choudhary, and Brad Rullman. A Prefetching Prototype for the Parallel File System on the Paragon. In Proceedings of the 1995 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, pages 321{323, May 1995. Extended Abstract. [ACR96] Meenakshi Arunachalam, Alok Choudhary, and Brad Rullman. Implementation and Evalaution of Prefetching in the Intel Paragon Parallel File System. In Proc. International Parallel Processing Symposium, 1996. [BBS+ 94] Robert Bennett, Kelvin Bryant, Alan Sussman, Raja Das, and Joel Saltz. Jovian: A Framework for Optimizing Parallel I/O. In Proceedings of the Scalable Parallel Libraries Conference, pages 10{20. IEEE Computer Society Press, October 1994. [BCR93] Rajesh Bordawekar, Alok Choudhary, and Juan Miguel Del Rosario. An Experimental Performance Evaluation of Touchstone Delta Concurrent File System. In Proceedings of the 7th ACM International Conference on Supercomputing, pages 367{376, 1993. [BdC93]

Rajesh Bordawekar, Juan Miguel del Rosario, and Alok Choudhary. Design and Evaluation of Primitives for Parallel I/O. In Proceedings of Supercomputing '93, pages 452{461, 1993.

[Bor96]

Rajesh Bordawekar. Using File Striping for achieving High I/O Bandwidth. Unpublished manuscript, June 1996. [CBD+ 95] Alok Choudhary, Rajesh Bordawekar, Apurva Dalia, Sivaram Kuditipudi, and Sachin More. PASSION User Manual, Version 1.2, October 1995.

November 1996 | Scalable I/O Initiative

42

Tech. Report CACR-128

[CFPB93] P. Corbett, D. Feitelson, J. Prost, and S. Baylor. Parallel Access to Files in the Vesta File System. In Proceedings of Supercomputing '93, pages 472-481, November 1993.
[dBC93] Juan Miguel del Rosario, Rajesh Bordawekar, and Alok Choudhary. Improved Parallel I/O via a Two-Phase Run-time Access Strategy. In IPPS '93 Workshop on Input/Output in Parallel Computer Systems, pages 56-70, 1993. Also published in Computer Architecture News 21(5), December 1993, pages 31-38.
[Div93] Intel Supercomputing Systems Division. Paragon System User's Guide, 1993.
[For93] High Performance Fortran Forum. High Performance Fortran Language Specification, Version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, January 1993.
[For96] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, November 1996.
[GGL93] N. Galbreath, W. Gropp, and D. Levine. Applications-Driven Parallel I/O. In Proceedings of Supercomputing '93, pages 462-471, 1993.
[HS95] Michael Hemy and Peter Steenkiste. Gigabit I/O for Distributed-Memory Machines: Architecture and Applications. In Proceedings of Supercomputing '95, 1995.
[KN93] John Krystynak and Bill Nitzberg. Performance Characteristics of the iPSC/860 and CM-2 I/O Systems. In Proceedings of the Seventh International Parallel Processing Symposium, pages 837-841, 1993.
[KN94] David Kotz and Nils Nieuwejaar. Dynamic File-Access Characteristics of a Production Parallel Scientific Workload. In Proceedings of Supercomputing '94, pages 640-649, November 1994.
[Kot94] David Kotz. Disk-directed I/O for MIMD Multiprocessors. In Proceedings of the 1994 Symposium on Operating Systems Design and Implementation, pages 61-74, November 1994. Updated as Dartmouth TR PCS-TR94-226 on November 8, 1994.
[Kot96] David Kotz. Disk-directed I/O for MIMD Multiprocessors. Technical Report PCS-TR94-226, Dept. of Computer Science, Dartmouth College, October 1996. Revision of Dartmouth TR PCS-TR94-226, November 1994.
[KR94] Thomas T. Kwan and Daniel A. Reed. Performance of the CM-5 Scalable File System. In Proceedings of the 8th ACM International Conference on Supercomputing, pages 156-165, July 1994.
[lIC96] Scalable I/O Low-level I/O Committee. SIO Proposal for Low-level Parallel File System API, November 1996.
[Mes94] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Version 1.0, April 1994.
[Mor96] Sachin More. MTIO, A Multi-threaded Parallel I/O System. Master's thesis, Dept. of Electrical and Computer Engineering, Syracuse University, August 1996.
[Nie96] Nils Nieuwejaar. Galley: A New Parallel File System for Parallel Applications. PhD thesis, Computer Science Department, Dartmouth College, November 1996. Also available as Dartmouth TR PCS-TR96-300.
[Nit95] William J. Nitzberg. Collective Parallel I/O. PhD thesis, Department of Computer and Information Science, University of Oregon, December 1995.
[NK95] Nils Nieuwejaar and David Kotz. Low-level Interfaces for High-level Parallel I/O. In IPPS '95 Workshop on Input/Output in Parallel and Distributed Systems, pages 47-62, April 1995.


[PEK96] Apratim Purakayastha, Carla Schlatter Ellis, and David Kotz. ENWRICH: A Compute-Processor Write Caching Scheme for Parallel File Systems. In Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 55-68, May 1996.
[RNN93] Paul Roy, David Noveck, and Durriya Netterwala. The File System Architecture of OSF/1 AD Version 2, 1993.
[SCJ+95] K. E. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-Directed Collective I/O in Panda. In Proceedings of Supercomputing '95, December 1995.
[TC96] R. Thakur and A. Choudhary. An Extended Two-Phase Method for Accessing Sections of Out-of-Core Arrays. Technical Report CACR-103, Center for Advanced Computing Research, California Institute of Technology, revised May 1996. To appear in Scientific Programming.
[TG96] Sivan Toledo and Fred G. Gustavson. The Design and Implementation of SOLAR, a Portable Library for Scalable Out-of-Core Linear Algebra Computations. In Fourth Workshop on Input/Output in Parallel and Distributed Systems, pages 28-40, Philadelphia, May 1996.
[TMC92] Thinking Machines Corporation. Programming the CM-5 I/O System, November 1992.
[Vah96] Uresh Vahalia. UNIX Internals: The New Frontiers. Prentice Hall, Upper Saddle River, NJ 07458, 1996.
[ZRB+93] Roman Zajcew, Paul Roy, David Black, Chris Peak, Paulo Guedes, Bradford Kemp, John LoVerso, Michael Leibensperger, Michael Barnett, FaraMarz Rabii, and Durriya Netterwala. An OSF/1 UNIX for Massively Parallel Multicomputers. In Proceedings of the 1993 Winter USENIX Conference, pages 449-468, January 1993.

Availability

This technical report (CACR-128) is available on the World-Wide Web at http://www.cacr.caltech.edu/techpubs/. Titles and abstracts of all CACR and CCSF technical reports distributed by Caltech's Center for Advanced Computing Research are available at the above URL. For more information on the Scalable I/O Initiative and other CACR high-performance computing activities, please contact CACR Techpubs, Caltech, Mail Code 158-79, Pasadena, CA 91125, (818) 395-4116, or send email to: [email protected].

