Parallel I/O on Networks of Workstations: Performance Improvement by Careful Placement of I/O Servers

Yong Cho1, Marianne Winslett1, Szu-wen Kuo1, Ying Chen2, Jonghyun Lee1, Krishna Motukuri1
1 Department of Computer Science, University of Illinois, 1304 W. Springfield, Urbana, IL 61801, U.S.A.
2 IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, U.S.A.
email: [email protected]
phone: (+1 217) 244 7232, fax: (+1 217) 333 6500
Abstract
Thanks to powerful processors, fast interconnects, and portable message passing libraries like PVM and MPI, networks of inexpensive workstations are becoming popular as an economical way to run high-performance parallel scientific applications. On traditional massively parallel processors, the performance of parallel I/O is most often limited by disk bandwidth, though the performance of other system components, especially the interconnect, can at times be a limiting factor. In this paper, we show that the performance of parallel I/O on commodity clusters is often significantly affected not only by disk speed but also by interconnect throughput, I/O bus capacity, and load imbalance caused by heterogeneity of nodes. Specifically, we present experimental results from reading and writing large multidimensional arrays with the Panda I/O library on two significantly different clusters: HP workstations connected by FDDI and HP PCs connected by Myrinet. We also discuss the approaches we use in Panda to maximize the I/O throughput available to the application on these two platforms, particularly careful placement of I/O servers.
1 Introduction
Due to the wide availability of powerful processors, fast interconnects, and portable message passing libraries like PVM [15] and MPI [8], networks of commodity workstations are gaining popularity as an economical way to run very large parallel applications. In this environment, scientific applications typically distribute their data across multiple processes which are kept closely synchronized for computation and communication. Often these applications output large intermediate results and read them back in later for a subsequent step of computation. Large final results may also be output for subsequent visualization, especially with time-dependent simulation applications. Long-running applications also typically save a periodic snapshot of their major arrays in a file (a checkpoint), so that they can restart from the checkpoint in the event of failure. Due to the relatively low performance of the I/O subsystem and a lack of efficient software, I/O performance can be a major bottleneck in all these applications.

The Panda parallel I/O library (more information is available at http://drl.cs.uiuc.edu/panda/) is designed to provide SPMD-style parallel applications with I/O portability, an easy-to-use high-level interface, and high-performance collective I/O of multidimensional arrays. On traditional massively parallel processors like the IBM SP2, Panda's performance is mainly limited by disk speed. In this paper, we show that Panda's performance on a commodity cluster can be affected significantly by almost every system component: disk speed, interconnect throughput, main memory size, I/O bus speed, and the presence of heterogeneous nodes. We focus on our experimental results from two very different clusters: HP workstations connected by FDDI and HP PCs connected by Myrinet.

With workstations connected by FDDI that have reasonably fast disks, the I/O bottleneck is likely to lie in message passing, because FDDI is a 100 Mb/s shared-media network. Thus, it is crucial to minimize simultaneous access to the network by multiple processes as well as the amount of data transferred over the network. However, when a more contemporary network like Myrinet is used together with high-performance PCs (each a 2- or 4-processor symmetric multiprocessor (SMP) sharing memory and I/O busses), the interconnect is no longer a bottleneck, but contention among processors for the shared resources can be a limiting factor in parallel
I/O performance. For both types of clusters, careful placement of I/O servers can have a strong positive impact on performance.

In the next section, we introduce the Panda parallel I/O library and describe the two clusters used in our experiments. In section 3, we discuss problems related to parallel I/O on workstations connected by FDDI. In section 4, we show that resource contention among processors in the same SMP can be a limiting factor for I/O performance when a more contemporary, switch-based network is used as the interconnect. Sections 5 and 6 discuss heterogeneity related to parallel I/O performance and related work, respectively. Finally, we conclude the paper in section 7.

2 Background
2.1 The Panda parallel I/O library
Panda is a parallel I/O library for multidimensional arrays. Its original design was intended for SPMD applications running on distributed memory systems, with arrays distributed across multiple processes that are closely synchronized at I/O time. Panda supports HPF-style [5] BLOCK, CYCLIC, and Adaptive Mesh Refinement-style data distributions [12] across the multiple compute nodes on which Panda clients are running. Panda's approach to high-performance collective I/O, in which all clients and servers cooperate to perform I/O, is called server-directed I/O [14].
Fig. 1: Different array data distributions in memory and on disk provided by Panda.

Fig. 1 shows a 2D array distributed (BLOCK, BLOCK) across 4 compute processors arranged in a logical 2x2 mesh. Each piece of the distributed array is called a compute chunk, and each compute chunk resides in the memory of one compute processor. The I/O processors are also arranged in a logical mesh, and the data can be distributed across them, and implicitly across their disks, using a variety of distribution directives. The array distribution on disk can be radically different from that in memory. For instance, the array in Fig. 1 has a (BLOCK, *) distribution on disk. Using this distribution, the resulting data files can be concatenated together to form a single row-major or column-major array, which is particularly useful if the array is to be sent to a workstation for postprocessing with a visualization tool, as is often the case.

With server-directed I/O, each I/O chunk resulting from the distribution chosen for disk is buffered and sent to (or read from) disk by one Panda server, and that server is in charge of reading, writing, gathering, and scattering the I/O chunk. For example, in Fig. 1, during a Panda write operation server 0 gathers compute chunks from clients 0 and 1, reorganizes them into a single I/O chunk, and writes it to disk. In parallel, server 1 gathers, reorganizes, and writes its own I/O chunk. For a read operation, the reverse process is used. During I/O, Panda divides large I/O chunks into a series of smaller pieces, called subchunks, in order to obtain better file system performance and keep server buffer space requirements low. For a write operation, a Panda server repeatedly gathers and writes subchunks, one by one. When reading or writing an entire array, each Panda server reads or writes its file sequentially.

Usually, Panda clients and servers reside on physically different processors. However, on a network of workstations where a limited number of processors are available, it would be wasteful to dedicate processors to I/O, leaving them idle during computation. Panda therefore supports an alternative I/O strategy, part-time I/O, in which there are no dedicated I/O processors. Instead, some of the Panda clients run servers at I/O time and return to computation after finishing the I/O operation.
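To make the gather pattern of Fig. 1 concrete, the following small Python sketch (an illustration only; Panda itself is not written this way) counts how many elements of a 1024x1024 array each I/O server would gather from each client under a (BLOCK, BLOCK) 2x2 in-memory distribution and a (BLOCK, *) disk distribution over 2 servers:

```python
# Illustration only (not Panda source): for the Fig. 1 layout, count how many
# elements each I/O server gathers from each client.  Memory distribution:
# (BLOCK, BLOCK) over a 2x2 compute mesh; disk distribution: (BLOCK, *) over
# 2 I/O servers.

def block_bounds(extent, parts, p):
    """Half-open [lo, hi) index range of block p when extent is split into parts."""
    size = -(-extent // parts)               # ceiling division
    return p * size, min((p + 1) * size, extent)

def overlap(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return max(0, hi - lo)

rows, cols = 1024, 1024
for server in range(2):                      # (BLOCK, *): rows split across servers
    srows = block_bounds(rows, 2, server)
    for client in range(4):                  # clients in row-major 2x2 mesh order
        crows = block_bounds(rows, 2, client // 2)
        ccols = block_bounds(cols, 2, client % 2)
        n = overlap(srows, crows) * overlap((0, cols), ccols)
        if n:
            print(f"server {server} gathers {n} elements from client {client}")
```

These per-client element counts are exactly the entries of the I/O matrix used for server placement in section 3.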
2.2 Cluster systems
The first platform on which our experiments were conducted is an 8-node HP 9000/735 workstation cluster running HP-UX 9.07 on each node, connected by FDDI (see Tab. 1). Each node has two local disks, and each Panda I/O node uses a 4 GB local disk. At the time of our experiments, the disks were 45-90% full, depending on the node. We measured the average file system throughput using 128 sequential 1 MB application requests to the file system (in our experiments on this cluster, Panda servers also read or write one 1 MB subchunk at a time): 5.96 MB/sec for the least occupied disk and 5.63 MB/sec for the fullest disk. For the message passing layer, we used MPICH 1.0.11 and obtained an average message passing bandwidth per node of 3.2, 2.9, or 2.3 MB/sec when there are 1, 2, or 4 pairs of senders and receivers, respectively, for Panda's common message sizes (32-512 KB). Message passing that is slower than the underlying file system is thus a clear bottleneck for parallel I/O on this cluster.

Our second platform is a High Performance Virtual Machine (HPVM) [2], a collection of HP Kayak XU dual-processor PCs with a more advanced, switch-based Myrinet interconnect. Each node consists of symmetric dual 300 MHz Pentium IIs and a 4 GB 10,000 RPM disk, and runs Windows NT Server 4.0. We measured the average file system (NTFS) throughput using the Win32 API as 10-12 MB/sec, depending on the request size, the file caching option, and the total amount of data read or written. With file caching turned on, the peak file system throughput was obtained with requests of size 8-128 KB, whereas without file caching, bigger request sizes (32-1024 KB) performed better on average and also gave much more consistent performance for write operations. These results are consistent with the analysis of NTFS performance presented in [13]. For reads, throughput is best with file caching turned on, because of the performance advantage offered by prefetching into the file cache, and request sizes of 8-64 KB give peak performance. So in our experiments with Panda on the PC cluster, we use 64 KB application read and write requests and turn file caching on only for read operations.

Tab. 1 summarizes the file system and message passing performance of this configuration. The file system throughputs were measured using 2048 sequential 64 KB application requests to the file system (a total of 128 MB is read or written). The message passing throughput per node available from MPI-FM [7] is 40-70 MB/sec, again depending on the message size and the total amount of data transferred. Sharing an SMP between clients and servers hurts performance when multiple processors in a node try to send or receive a message. Even though the theoretical peak performance of PCI is 133 MB/s, the real achieved performance is much lower [13]. All processes on an SMP share one PCI bus, so message passing can be bottlenecked by the PCI bus connecting to a fast network like Myrinet. In the system that we used for our experiments, contention reduces the message passing throughput between two processors in the same node to approximately 40 MB/sec, which is a little over half of the peak throughput obtainable from a pair of processors in different nodes.

System name | # of nodes | Processors | Interconnect | Memory per node | MPI: throughput, latency | File system: write, read throughput
HP 9000/735 workstation cluster | 8 | 99 MHz PA RISC | FDDI | 144 MB | MPICH: 3.2 MB/s, 570 us | HP-UX: 4.1-5.9 MB/s, 4.3-5.7 MB/s
SMP cluster (HPVM) | 64 | dual 300 MHz Pentium II | Myrinet | 512 MB | MPI-FM: 70 MB/s, 17 us | NTFS: 11.7 MB/s, 12.5 MB/s

Tab. 1: Comparison of clusters. MPI latency is measured by sending 100 0-byte messages between two processes, and 32 KB messages are used on both clusters to measure the MPI throughput. For file system throughput, we used a 1 MB request size on the workstation cluster and a 32 KB request size on the SMP cluster for both read and write operations.
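For readers who want to reproduce this kind of measurement on their own cluster, a minimal sketch of the sequential-throughput microbenchmark follows. Our measurements used native HP-UX and Win32 calls; this Python version only approximates the methodology, and the file name and request sizes are illustrative.

```python
# Minimal sketch of a sequential file system throughput benchmark: time N
# back-to-back application write requests of a fixed size, as described above.
# (Our measurements used native HP-UX / Win32 I/O calls; this is only an
# approximation of the methodology.)
import os
import time

def write_throughput(path, request_bytes=1 << 20, count=128):
    buf = os.urandom(request_bytes)
    start = time.time()
    with open(path, "wb", buffering=0) as f:     # unbuffered: each write is one request
        for _ in range(count):
            f.write(buf)
        os.fsync(f.fileno())                     # include the time to reach the disk
    elapsed = time.time() - start
    return request_bytes * count / (1 << 20) / elapsed   # MB/s

if __name__ == "__main__":
    print(f"{write_throughput('panda_bench.dat'):.2f} MB/s sequential write")
```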
3 Parallel I/O on workstations connected by FDDI
As implemented in Panda 2.1, part-time I/O chooses the first m of the compute processors to run Panda servers, regardless of the array distribution or compute processor mesh. A similar strategy has been taken in other libraries that provide part-time I/O [10]. This provides acceptable performance on a high-speed, switch-based interconnect whose processors are homogeneous with respect to I/O ability, as on many SP2s, but on a shared-media interconnect like FDDI or Ethernet, we found that performance is generally unsatisfactory and tends to vary widely with the exact choice of compute and I/O processor meshes.

To see the source of the problem, consider the example in Fig. 1. With the naive selection of compute processors 0 and 1 as I/O servers, compute processor 1 must send its local chunk to compute processor 0 and gather a subchunk from compute processors 2 and 3. This incurs extra message passing that is unnecessary if compute processor 2 acts as an I/O server instead of compute processor 1. In an environment where the interconnect will clearly be the bottleneck for I/O, as is the case for the HP workstation cluster with an FDDI interconnect, the single most important optimization we can make is to minimize the amount of remote data transfer.

Our previous work [3] describes how to place I/O servers so as to minimize the number of array elements that must be shipped across the network during I/O. More precisely, suppose we are given a target number m of part-time I/O servers, the current distribution of data across processor memories, and a desired distribution of data across I/O servers. We show how to choose the I/O servers from among the set of n ≥ m compute processors so that remote data transfer is minimized. We begin by forming an m x n array called the I/O matrix M, where each row represents one of the I/O servers and each column represents one of the n compute processors. The (i, j) entry of the I/O matrix, M(i, j), is the total number of array elements that the i-th I/O server will have to gather from the j-th compute processor, which can be computed from the array size, the in-memory distribution, and the target disk distribution. In Panda, every processor involved in an array I/O operation has access to the array size and distribution information, so M can be generated at run time.

Given the I/O matrix, the goal of choosing the m I/O servers that minimize remote data transfer can be formalized as the problem of choosing m matrix entries M(i_1, j_1), ..., M(i_m, j_m) such that no two entries lie in the same row or column (i_k ≠ i_l and j_k ≠ j_l for 1 ≤ k < l ≤ m) and the sum M(i_1, j_1) + ... + M(i_m, j_m) is maximal. To solve this problem, we can view M as the representation of a bipartite graph, where every row (I/O server) and every column (compute processor) is a vertex and each entry M(i, j) is the weight of the edge connecting vertices i and j. [3] shows that the problem of assigning I/O servers is equivalent to finding the matching of M (a set of edges no two of which share an endpoint) with the largest possible sum of weights. The optimal solution can be obtained using the Hungarian Method [11] in O(m^3) time, where m is the number of part-time I/O servers.
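As an illustration only (this is not Panda's internal code), the placement step can be sketched in Python. The maximum-weight assignment can be computed with scipy.optimize.linear_sum_assignment, which solves the same assignment problem as the Hungarian Method; negating the matrix turns its default minimization into maximization, and we assume a SciPy version that accepts rectangular cost matrices. The example matrix corresponds to the Fig. 1 scenario for a 1024x1024 array.

```python
# Sketch (not Panda source): choose part-time I/O servers so that the total
# number of locally available array elements is maximized, i.e., remote data
# transfer is minimized.
import numpy as np
from scipy.optimize import linear_sum_assignment

def place_io_servers(io_matrix):
    """io_matrix[i][j] = elements I/O server i must gather from compute proc j.
    Returns a dict mapping each I/O server to the compute processor on which
    it should run."""
    M = np.asarray(io_matrix)
    rows, cols = linear_sum_assignment(-M)   # negate to maximize local data
    return {int(i): int(j) for i, j in zip(rows, cols)}

# Fig. 1 example: 1024x1024 array, (BLOCK, BLOCK) over a 2x2 compute mesh in
# memory, (BLOCK, *) over 2 I/O servers on disk.  Server 0 needs the top half
# of the array (held by clients 0 and 1), server 1 the bottom half (clients 2, 3).
q = (1024 // 2) * (1024 // 2)
M = [[q, q, 0, 0],
     [0, 0, q, q]]
print(place_io_servers(M))   # e.g. {0: 0, 1: 2}: each server sits where part of its I/O chunk already resides
```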
We compared the performance of Panda using the first m compute processors as I/O servers ("fixed" I/O servers) and using optimally placed (in terms of minimal remote data transfer) I/O servers. We used all 8 nodes as compute processors in our experiments, as that configuration is probably most representative of scientists' needs. The in-memory distribution was (BLOCK, BLOCK) and we tested performance using a 2x4 compute processor mesh. We used 2, 4, or 8 part-time I/O servers, while increasing the array size from 4 MB (1024x1024) to 16 MB (2048x2048) and 64 MB (4096x4096). For the disk distribution, we used either (BLOCK, *) or (*, BLOCK) to show the effect of a radically different distribution. We present results for writes; reads are similar. Since the cluster is not fully isolated from other networks, we ran our experiments when no other user job was executing on the cluster. All the experimental results shown are the average of 3 or more trials, and error bars show a 95% confidence interval for the average.

Fig. 2 and Fig. 3 compare the time to write an array using different placements of I/O servers. A group of 6 bars is shown for each number of I/O servers; each pair of bars within the group shows the response time to write an array of the given size using fixed and optimal placement of I/O servers, respectively. Optimal placement of I/O servers reduces array output time by at least 19% across all combinations of array sizes and meshes, except for the cases where the fixed and optimal I/O server placements are identical.

We found that even with optimal I/O server placement, performance is very dependent not only on the amount of local data transfer, but also on the compute processor mesh chosen, the array distribution on disk, and the number of I/O servers. For instance, in Fig. 3, moving from 2 to 4 I/O servers gives a superlinear speedup with optimal placement, but that does not happen if the (*, BLOCK) disk distribution is used instead (Fig. 2). To obtain good I/O performance, the user needs help from Panda in determining the effect on I/O performance of seemingly irrelevant decisions such as the choice between a 2x4 or 4x2 compute processor mesh. In [3], we presented a performance model for Panda running on an FDDI cluster, to be used in predicting message passing performance in Panda, and showed its accuracy. The performance model can guide a user to select array distributions and compute processor meshes that give the best performance on this cluster.
Fig. 2: Panda response time (sec) for writing 4 MB, 16 MB, and 64 MB arrays using fixed or optimal I/O server placement with 2, 4, or 8 I/O servers. Memory mesh: 2x4. Memory distribution: (BLOCK, BLOCK). Disk mesh: n x 1, where n is the number of I/O servers. Disk distribution: (BLOCK, *).

Fig. 3: Panda response time (sec) for writing 4 MB, 16 MB, and 64 MB arrays using fixed or optimal I/O server placement with 2, 4, or 8 I/O servers. Memory mesh: 2x4. Memory distribution: (BLOCK, BLOCK). Disk mesh: 1 x n, where n is the number of I/O servers. Disk distribution: (*, BLOCK).
4 Parallel I/O on PCs connected by Myrinet
As summarized in Tab. 1, each node in our SMP cluster consists of dual processors sharing memory, an I/O bus, and a file system. Each SMP has one 4.0 GB Ultra Wide SCSI disk (10,000 RPM) and a 160 MB/sec (full-duplex) Myrinet board; the details are shown in Fig. 4. When a parallel application is running on both processors, contention for shared resources like the I/O bus (the PCI bus in Fig. 4) or the disk can be a serious bottleneck. For example, if both processors perform I/O at the same time, we find that each processor obtains less than half of the file system throughput obtained by using only one processor, because the disk and the I/O bus connecting the disk controller are shared, and the I/O requests coming separately from each processor cause extra disk seeks and rotational delays. So in this configuration, it is crucial to avoid using multiple processors in the same SMP node as Panda I/O servers if possible.

Fig. 4: System architecture of each PC workstation, with 2 processors sharing memory and the I/O subsystem (CPU 1 and CPU 2 on a 528 MB/s system bus, 512 MB of system memory, and a 133 MB/s PCI bus hosting the network interface and a 40 MB/s Ultra Wide SCSI controller attached to a 10K RPM disk). The bandwidths shown are the theoretical peak performance.

Fig. 5 compares Panda output performance when 8 dedicated I/O servers are used with different placements. Each group of 3 bars compares performance using different configurations; the white bars show performance when both processors in the same SMP are used as I/O servers (fixed I/O servers). With fixed I/O servers, each server provides a throughput of only about 3 MB/sec; if a 4-processor SMP were used with all 4 processors as I/O servers, the throughput would be even lower. However, if the I/O servers are carefully placed to avoid multiple servers in the same SMP (black bars in Fig. 5), Panda throughput per server increases by more than 100%. The gray bars in Fig. 5 show the positive impact of placing each client and server in a separate SMP; the resulting performance is close to the peak file system performance reported in Tab. 1. For the 16 MB array, Panda does not perform as well as for larger arrays, because the amount that each I/O node writes is so small that Panda's constant startup/shutdown overhead keeps throughput from scaling.

Fig. 5: Throughput per I/O server (MB/s) for writing arrays of 16 to 512 MB with 16 compute processors and 8 dedicated I/O servers, comparing fixed I/O servers, carefully placed I/O servers, and placements using only one processor per SMP. Memory mesh and disk mesh: 2x2x4. Array distribution in memory and on disk: (BLOCK, BLOCK, BLOCK).

We repeated the tests shown in Fig. 5 for read operations. In all cases, throughput at each I/O server is higher than for write operations, with the same performance trends as for writes. For instance, we obtained 10.4 MB/sec throughput at each I/O server for the 16 MB gray bar and 11.0-11.6 MB/sec for the rest of the gray bars. Experiments not included in Fig. 5 show that if we place a Panda server or client on only one processor per SMP, the throughput at which each Panda server delivers data to the underlying file system averages 50 MB/sec for read and write operations. In other words, Panda can keep the underlying file system busy as long as the underlying file system has a peak throughput of at most that amount times the number of I/O servers sharing the file system. With careful placements, each I/O server delivers data to the file system at about 25 MB/sec (22 MB/sec for read operations), which is just half of the throughput obtained when a Panda server or client is placed on only one processor per SMP. However, if fixed placement is used, throughput drops to 20 MB/sec for both reads and writes, which means that 3-5 MB/sec of message passing bandwidth is wasted by contention between servers on the same SMP for the PCI bus.
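The rule we follow on this platform, at most one Panda I/O server per SMP node, is easy to express once each process knows which node it is running on (for example, via MPI_Get_processor_name). The sketch below is an illustration only, not Panda's implementation, and the rank-to-node map is made up:

```python
# Illustration only (not Panda's code): choose part-time I/O servers so that
# no two servers share an SMP node, and hence never compete for the same PCI
# bus or disk.  The rank-to-node map is a made-up example; in an MPI code it
# could be built from each process's MPI_Get_processor_name result.

def pick_servers(rank_to_node, n_servers):
    """Return n_servers ranks, at most one per node (lowest rank on each node)."""
    chosen_nodes, servers = set(), []
    for rank in sorted(rank_to_node):
        node = rank_to_node[rank]
        if node not in chosen_nodes:
            chosen_nodes.add(node)
            servers.append(rank)
            if len(servers) == n_servers:
                return servers
    raise ValueError("fewer distinct nodes than requested I/O servers")

# 16 compute processes on 8 dual-processor SMPs: ranks 2k and 2k+1 share node k.
ranks = {r: f"smp{r // 2}" for r in range(16)}
print(pick_servers(ranks, 8))   # -> [0, 2, 4, 6, 8, 10, 12, 14]
```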
5 Heterogeneity in clusters
The clusters used in the experiments in this paper had homogeneous system software and hardware, but many clusters will be heterogeneous. Heterogeneity can have a big impact on I/O performance and needs to be taken into consideration when choosing the placement of I/O servers. Since on a large cluster users often will not know in advance which nodes will be assigned to their job, server placement will need to be done at runtime, preferably by the I/O library.

Heterogeneity can hurt parallel I/O performance by causing load imbalance. For example, on the cluster used for the experiments in this paper, file system performance varied from node to node due to different amounts of free space on the local disks. If work is assigned to I/O servers without considering their different capabilities, performance will be limited by the slowest I/O server; in other words, a single server with a very full disk could significantly delay completion of an entire I/O operation. Sources of heterogeneity other than disk free space can also cause load imbalance and reduce I/O performance. Some other examples:

Data placement. To improve computational load balance, data may not be spread evenly across all compute processors. The processors with the most data may become a bottleneck for I/O (e.g., on an FDDI cluster). The algorithm for I/O server placement in Section 3 took this type of heterogeneity into account.

Disk and file system performance. Each node may have a different disk capacity and speed, a different file system, or a differently partitioned file system. In this case both the placement of I/O servers and the distribution of data on disk must take the differing abilities into account. Further, the I/O strategy should be tailored to the file system of each server for best results (e.g., do not use file caching for write operations with NTFS).

Processor characteristics. Main memory size can significantly impact I/O performance, because larger memories often allow larger file caches, which can help performance. Processor speed can also impact I/O performance, because the cost of copying data to and from message and file system buffers is significant. As shown in Section 4, processors that must share resources such as I/O busses or a file system can have very different I/O performance characteristics from stand-alone processors.

Thus optimal I/O server placement and workload distribution in a heterogeneous environment is an extremely complex problem. With so many potential variables to consider, a general portable solution to this problem would probably need to use heuristic search through the space of possibilities, rather than relying entirely on exact algorithms. In general, for top performance, I/O servers need to be placed in such a way that all servers' I/O capabilities are as similar as possible. For instance, suppose a cluster consists of older PCs with a single processor and a 5400 RPM disk, plus a few new SMPs with a 10,000 RPM disk. On such a system, it might be advantageous to place multiple I/O servers on the same SMP node, directly contradicting our advice for a homogeneous system! Further, given that the I/O servers have different capabilities, work should be divided among them according to their abilities; one simple policy is sketched below. We have taken some preliminary steps in this direction in [6], which examined several ways of dividing a workload among heterogeneous servers.
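As a hedged sketch of one such policy (not one of the schemes evaluated in [6]), the fragment below divides an array among I/O servers in proportion to each server's measured throughput; the bandwidth figures are invented for illustration.

```python
# Sketch of proportional workload division among heterogeneous I/O servers
# (illustration only; the bandwidths are invented, not measurements from this
# paper).  Each server receives a share of the array proportional to its
# measured file system throughput.

def split_by_bandwidth(total_elems, bandwidths_mb_s):
    """Return per-server element counts proportional to server bandwidth."""
    total_bw = sum(bandwidths_mb_s)
    shares = [int(total_elems * bw / total_bw) for bw in bandwidths_mb_s]
    shares[-1] += total_elems - sum(shares)   # give the rounding remainder to the last server
    return shares

# A 4096x4096 array spread over two slow servers and two fast servers.
print(split_by_bandwidth(4096 * 4096, [5.6, 5.9, 11.7, 11.7]))
```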
6 Related work
A number of researchers have examined the problem of parallel I/O on workstation clusters; we believe we are the first group to address the problems raised by resource sharing on SMPs and by heterogeneity across nodes in collective I/O.

PIOUS [9] is pioneering work in parallel I/O on a workstation cluster. PIOUS is a parallel file system with a Unix-style file interface; coordinated access to a file is guaranteed using transactions. Heterogeneity also raises performance issues for parallel file systems: if files are automatically striped across all servers, performance can suffer if some servers are slower than others. If the file system allows dynamic allocation of files to servers, our approaches to placing data to minimize contention for the network, I/O bus, and disk may be helpful.

VIP-FS [4] provides a collective I/O interface for scientific applications running in parallel and distributed environments. Its assumed-request strategy is designed for distributed systems where the network is a potentially congested shared medium; it reduces the number of I/O requests made by the compute nodes involved in a collective I/O operation, in order to reduce congestion. In such an environment, careful placement of I/O servers can also reduce the total data traffic.

VIPIOS [1] is a design for a parallel I/O system to be used in conjunction with Vienna Fortran. VIPIOS exploits logical data locality in the mapping between servers and application processes, and physical data locality between servers and disks, which is similar to our approach of exploiting local data on workstations connected by FDDI. Our approach adds an algorithm for server placement that guarantees minimal remote data access during I/O. In [3], we also quantify the savings obtained by careful placement of servers, and use an analytical model to explain other performance trends.

Our work is also related to I/O resource sharing in multiprocessor systems. [16] studies contention for a single I/O bus among accesses to different devices such as video, network, and disk, and characterizes how multiple device types interact when one or more Unix utilities are running on a multiprocessor workstation. Panda could probably benefit from this type of study when heuristic search through the space of all possible placements is used to help place I/O servers.
7 Conclusion
Compared to traditional supercomputers, commodity clusters are an economically attractive platform for running parallel scientific codes. While a few vendors dominate the marketplace for traditional supercomputers, it is relatively easy for any vendor to create a high-performance cluster product. The result is a dizzying array of possible cluster configurations, each with its own capabilities for computation, networking, and I/O, and each with different potential bottlenecks for I/O performance. Thus customization of I/O strategies will be needed for high-performance I/O on many clusters. Making customization more difficult is the ease with which heterogeneous clusters can be constructed and operated; heterogeneous clusters will often require particularly sophisticated approaches to I/O optimization.

This paper discusses our experiments with the Panda parallel I/O library on two different cluster systems. Unlike traditional massively parallel processors, in which the main bottleneck for parallel I/O is usually disk speed, we have found that on commodity clusters the bottleneck can be almost anywhere in the system. We presented a way to improve overall I/O performance on each platform by placing I/O servers carefully. On workstation clusters connected by FDDI, the bottleneck is message passing, and we place I/O servers to minimize the amount of data transferred over the network. On a cluster of SMPs connected by Myrinet, parallel I/O can be bottlenecked by the sharing of disks and I/O busses by multiple processors in the same SMP node, so we place I/O servers to minimize contention for shared resources. We expect 2-processor and 4-processor SMPs to become more popular in the future; unfortunately, resource sharing among processors in the same SMP node introduces a new potential cause of parallel I/O performance degradation.
Acknowledgements. This research was supported in part by NASA under NAGW 4244 and NCC5 106, and by the U.S. Department of Energy through the University of California under subcontract B341494. Experiments were conducted using an HP workstation cluster at HP Labs in Palo Alto and a High Performance Virtual Machine (HPVM) at the National Center for Supercomputing Applications and the Concurrent Systems Architecture Group of the Department of Computer Science at the University of Illinois.

References

1. P. Brezany, T. A. Mueck, and E. Schikuta. A Software Architecture for Massively Parallel Input-Output. In Proceedings of the Third International Workshop PARA'96, Lyngby, Denmark, August 1996. Springer Verlag.
2. A. Chien, S. Pakin, M. Lauria, M. Buchanan, K. Hane, L. Giannini, and J. Prusakova. High Performance Virtual Machines (HPVM): Clusters with Supercomputing APIs and Performance. In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, MN, March 1997.
3. Y. Cho, M. Winslett, M. Subramaniam, Y. Chen, S. Kuo, and K. E. Seamons. Exploiting Local Data in Parallel Array I/O on a Practical Network of Workstations. In Proceedings of the Fifth Workshop on I/O in Parallel and Distributed Systems, pages 1-13, San Jose, CA, November 1997.
4. M. Harry, J. Rosario, and A. Choudhary. VIP-FS: A Virtual, Parallel File System for High Performance Parallel and Distributed Computing. In Proceedings of the Ninth International Parallel Processing Symposium, April 1995.
5. High Performance Fortran Forum. High Performance Fortran Language Specification, November 1994.
6. S. Kuo, M. Winslett, Y. Chen, Y. Cho, M. Subramaniam, and K. E. Seamons. Parallel Input/Output with Heterogeneous Disks. In Proceedings of the 9th International Working Conference on Scientific and Statistical Database Management, pages 79-90, Olympia, Washington, August 1997.
7. M. Lauria and A. Chien. MPI-FM: High Performance MPI on Workstation Clusters. Journal of Parallel and Distributed Computing, 40(1):4-18, January 1997.
8. Message Passing Interface Forum. MPI: Message-Passing Interface Standard, June 1995.
9. S. Moyer and V. S. Sunderam. Parallel I/O as a Parallel Application. International Journal of Supercomputer Applications, 9(2):95-107, Summer 1995.
10. J. Nieplocha and I. Foster. Disk Resident Arrays: An Array-Oriented I/O Library for Out-of-Core Computation. In Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computation, pages 196-204, October 1996.
11. C. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall Inc., 1982.
12. M. Parashar and J. Browne. Distributed Dynamic Data-Structures for Parallel Adaptive Mesh-Refinement. In Proceedings of the International Conference for High Performance Computing, 1995.
13. E. Riedel, C. van Ingen, and J. Gray. A Performance Study of Sequential I/O on Windows NT 4. In Proceedings of the Second USENIX Windows NT Symposium, pages 1-10, Seattle, WA, August 1998.
14. K. E. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-Directed Collective I/O in Panda. In Proceedings of Supercomputing '95, San Diego, CA, November 1995.
15. V. Sunderam. PVM: A Framework for Parallel Distributed Computing. Concurrency: Practice and Experience, 2(4):315-339, 1990.
16. S. VanderLeest and R. Iyer. Measurement of I/O Bus Contention and Correlation among Heterogeneous Device Types in a Single-Bus Multiprocessor System. Computer Architecture News, 22(4):17-22.