Faster Collective Output through Active Buffering

Xiaosong Ma  Marianne Winslett  Jonghyun Lee  Shengke Yu
Department of Computer Science, University of Illinois at Urbana-Champaign
{xma1, winslett, jlee17, syu3}@uiuc.edu
Abstract
Scientific applications often need to write out large arrays and associated metadata periodically for visualization or restart purposes. In this paper, we propose active buffering for collective I/O, in which processors actively organize their idle memory into a hierarchy of buffers for periodic output data. Active buffering exploits one-sided communication for I/O processors to fetch data from compute processors' buffers and performs actual writing in the background while compute processors are computing. It gracefully adapts as buffers at different levels of the hierarchy fill and empty, and as new collective I/O requests arrive. Experimental results with synthetic benchmarks and a real rocket simulation code on the SGI Origin 2000 and IBM SP show that active buffering improves the apparent collective write throughput so that it approaches the local memory bandwidth or the MPI bandwidth under appropriate conditions. These speedups are due entirely to increased parallelism during I/O, and are in addition to any performance improvements that may come from buffering small requests.

1 Introduction

Many scientific simulations need to write out large data sets periodically. Typically, these are multi-dimensional arrays holding snapshot data for visualization and checkpoint data for restart. Efficient transfer of this output from main memory to secondary storage is very important for achieving high performance in such applications. Often the output data belong to a logically shared data set, encouraging the use of collective I/O [3, 8, 12, 14]. In this approach, all the processors cooperate to transfer data between disk and memory. Information about the on-disk and in-memory layouts of the data set is used to plan efficient file operations and reorganize the data across the memory of the processors if necessary. Most collective I/O techniques block participants until file system I/O calls are made, i.e., processors participating in a collective write call do not return from the call until the write is buffered by the file system. For example, in the case of two-phase I/O [3], all the processors first exchange messages to reorganize the data in their memories, then each processor writes out its data before returning to computing. Hence the latency of file system write requests is visible to the whole application.

Because they will not read their output during a run, scientific applications typically do not care whether their periodic output data are actually in the hands of the file system before the next computing phase starts, as long as they can freely overwrite the old data values in memory. On the other hand, the computation time between output requests is often long enough for all the output data to reach the disk. These characteristics lead us to propose active buffering, an aggressive buffering scheme for high-performance collective output. In this scheme, extra processors called I/O servers are used to carry out the output tasks for the compute processors on which the application is running. The I/O servers and their clients actively use their idle memory to form a buffering hierarchy: during a collective I/O call, the output data are buffered at the clients' side as much as possible, shipped to the servers' side only when the client buffers overflow, and written to the disk only when the server buffers overflow. After the clients return to computation, the servers fetch data from the clients' buffers using one-sided communication, and write the data out. State machines are used for the clients and servers to switch between different behaviors when overflow happens at different levels and for the servers to respond to new output requests amid background activities.
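To make the client-side path concrete, the following C sketch illustrates the behavior just described: copy as much output as fits into idle local memory, ship the remainder to the responsible I/O server, and return immediately. The helper names (local_buffer_space, local_buffer_alloc, server_rank_for) and the 2MB block size are illustrative assumptions, not Panda's actual interface.

/* Illustrative sketch (not Panda's actual API) of the client-side path:
 * copy as much output as fits into local buffers, ship the rest to the
 * responsible I/O server, then return to computation immediately. */
#include <stddef.h>
#include <string.h>
#include <mpi.h>

#define BLOCK_SIZE (2 * 1024 * 1024)            /* assumed data-block size */

/* Hypothetical helpers standing in for the library's buffer manager. */
extern size_t local_buffer_space(void);         /* idle memory still available */
extern void  *local_buffer_alloc(size_t n);     /* grab a client-side buffer   */
extern int    server_rank_for(int client_rank); /* server responsible for us   */

void ab_client_collective_write(char *data, size_t nbytes,
                                int client_rank, MPI_Comm comm)
{
    size_t off = 0;

    /* "buffer data" state: keep copying blocks while idle memory remains. */
    while (off < nbytes && local_buffer_space() >= BLOCK_SIZE) {
        size_t n = nbytes - off < BLOCK_SIZE ? nbytes - off : BLOCK_SIZE;
        memcpy(local_buffer_alloc(n), data + off, n);
        off += n;
    }

    /* "send a block" state: overflow goes to the I/O server over MPI. */
    while (off < nbytes) {
        size_t n = nbytes - off < BLOCK_SIZE ? nbytes - off : BLOCK_SIZE;
        MPI_Send(data + off, (int)n, MPI_BYTE,
                 server_rank_for(client_rank), /*tag=*/0, comm);
        off += n;
    }
    /* The caller may now overwrite 'data', as with MPI_Send semantics. */
}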
We implemented active buffering in the Panda Parallel I/O Library [14] and evaluated it with both synthetic benchmarks and real simulation applications. We implemented the client-side and server-side buffering with one-sided communication on the SGI Origin 2000. Our experiments there show active buffering can bring an order of magnitude improvement in apparent I/O throughput when there is sufficient buffer space, and a smaller but quite predictable improvement when buffers overflow. We also con-
ducted experiments on the IBM SP with server-side buffering, which also shows a peak improvement in apparent throughput of between 300% and 1600%, depending on the configuration used. Previous research on buffering or caching for collective I/O [13, 4, 9, 11] focuses on aggregating small or noncontiguous writes into long, sequential writes for better output performance. In contrast, active buffering tries to hide the cost of writing by increasing the overlap between I/O and computation, instead of trying to speed up the actual writes. While traditional collective I/O techniques aim to deliver apparent I/O throughput close to the peak file system throughput, with active buffering we can achieve apparent I/O throughput close to the memory copy throughput or MPI message passing throughput, which is usually higher than the peak file system throughput and more scalable. In addition, active buffering has no minimum buffer space requirement, though more buffer space may bring more benefit. Buffering any part of the data helps to reduce visible I/O cost and raises the apparent write throughput. The full version of this paper [10] also explains why active buffering offers benefits even though the file system and message passing system also offer buffering facilities. In the rest of the paper, Section 2 discusses active buffering, Section 3 presents performance results, Section 4 discusses related work, and Section 5 concludes the paper.
2 Active buffering

Some supercomputers have local disks attached to individual processors, but most such platforms require local files to be removed before the run finishes or shortly thereafter. This is inconvenient and expensive, since migration is usually very slow. Further, shared file systems make it easy for processors to share files, so that users can control the number of files generated by their applications. Because these two factors have made shared file systems very popular with users, we consider only shared file systems. In collective I/O, typically many compute processors assemble output data in their local memory simultaneously, then ship these data to the processors that do the writing (writers), while the writers reorganize the data in their memory if necessary and write them to the shared disks. The farther the output data go along this funnel-shaped path of data flow, the more aggregated they are and the less bandwidth is available, as shown in Figure 1(a). We measured write throughput for three file formats: binary and two versions of HDF [7], a widely used scientific data set format. Figure 1(a) shows that binary files are by far the fastest in terms of write speed. HDF5 is faster than HDF4 when writing large files or when each dataset is small, but its throughput is still considerably lower than that of binary files [6]. In our experiments we find HDF4 has poor scalability compared with the other two: though it can sometimes outperform HDF5, its throughput drops severely when the number of datasets in a file grows, either by increasing the file size or by decreasing the dataset size.

Before we explain more about the buffer hierarchy, we give a picture of our general architecture. To take full advantage of the parallelism between I/O activities and computation allowed by aggressive buffering, and to maintain load balance among compute processors, extra processors called I/O processors are used as dedicated writers. The compute processors run the user application, while the I/O processors run the server executable of the collective I/O library. We use the terms "compute processor" and "client" interchangeably, as well as "I/O processor" and "server". Each client executes the application, processing its own part of the data and communicating with other clients when necessary. The clients periodically write out snapshot or checkpoint data with their associated metadata using collective write calls. In such a call, the clients send a write request to the servers and exchange information with them on how to carry out the write operation. Then each client copies as much of its output data as possible into local buffers and sends overflowing data to the server(s) responsible for writing those data out. The servers listen for requests from the clients. When one arrives, each server determines which parts of the data it is responsible for, collects these data from the appropriate clients by explicit messages or using one-sided communication, and writes them to disk. As shown in Figure 1(b), both the servers and the clients utilize available local memory for active buffering. The servers can use most of their memory for this purpose. The clients can use idle memory not used by the compute application. Since applications normally can find out the peak memory they use per processor, they can set a maximum client buffer size when the run begins, and alter it during the run as needed. Collective write operations using active buffering enforce semantics similar to those of MPI_Send(): the caller can immediately reuse the data source buffer it supplied to the Panda collective output routine. This ensures that the buffering is transparent to the application code. If the client-side buffers have room for all the output, the clients will return to computation as soon as their data are copied to local buffers, and the immediately visible I/O cost is only the cost of that copying. The client-side buffering is fully parallel, with aggregate throughput scaling up as the number of clients grows. If the amount of output data exceeds local buffer capacity, the overflow will be sent to the servers using MPI messages and the immediately visible I/O cost will include both copying and message passing costs. Normally a server serves multiple clients and its communication bandwidth will be shared between them. Hence
Figure 1. Part (a) shows the data transfer rates in different I/O system components on the NCSA Origin 2000. One server handles the output of 8 clients, each with 16MB of output data. Aggregate throughput for the 8 clients to buffer their data in parallel is over 700MB/s, much higher than for the server to collect the data from the clients using 2MB MPI messages, at less than 120MB/s. Writes are even slower, and their throughput depends heavily on the file format. These facts help motivate the buffer hierarchy shown in part (b).
for this explicit data transfer, the cost is often higher than a local copy and the aggregate throughput will not scale up as the number of clients grows, unless the number of servers grows proportionally and the communication system does not become saturated. Further, if the total amount of overflow sent to a server exceeds its buffer capacity, it has to write out data to make room for new incoming data and the immediately visible I/O cost will include the costs of local copying, message passing, and file system requests. The aggregate throughput of file system writes is again limited by the number of servers, and the write throughput on most shared file systems does not scale up so well as the number of concurrent writers increases. In active buffering, a collection of fixed-sized buffers is managed at each processor. We call the data held in each buffer a data block. The servers use one-sided communication to fetch output data from clients’ buffers in the background while clients are computing. On the SGI Origin 2000 and CRAY T3E, this is enabled by the SHMEM library, which provides the fastest interprocessor communication for large messages [15]. The SHMEM library requires the remotely accessible buffers to be allocated and released collectively by all the processors. MPI-2 also provides one-sided communication interfaces, which can be used on other platforms. Implementation of active buffering using MPI-2 will be very similar to that using SHMEM. The full version of this paper [10] describes details of the SHMEM implementation, such as buffer allocation/release and choice of data block size. It is not unusual for periodic write requests to be bursty: during an output phase, different computing modules issue separate collective write requests to write out their snapshot or checkpoint data. While the servers are engaged in fetching and writing data from the first write request in a
burst, the cost of the servers' "background" activities will become visible again if clients have to wait for the second request to be accepted. To take advantage of both available buffer space and the servers' idle time, while maintaining the servers' responsiveness to client requests, we designed state machines to guide the behavior of collective I/O participants. No explicit construction and maintenance of the state machines is needed, as everything can be done with control statements, so the state machine method brings flexibility in resource utilization and responsiveness at no extra overhead. A client goes through the state machine in Figure 2(a) once every time it invokes a collective write call. It enters the collective write routine in the prepare state, where it works with the other processors to collectively free local buffers. Then it sends the servers metadata associated with the current output, such as array rank and size, data type, the name of the array, and mesh type. It then works with other clients to decide how much data can be buffered locally and how much should be sent to the servers, and to inform the servers about this decision. If there is local buffer space available, it switches to the buffer data state, where it allocates local buffers and copies data there. When local buffer space runs out, the client switches to the send a block state, where it sends data blocks to the appropriate servers. When all output data are buffered or sent out, the client returns from the collective write call. Unlike the clients, a server goes through its state machine (Figure 2(b)) only once during its lifetime. After initialization, a server enters the idle-listen state, where it listens for incoming client requests. On receiving a request, it enters the prepare state, where it collaborates with the clients in the same state to free buffers, collect metadata, and receive the clients' decision on how much overflow data it will receive.
[Figure 2: (a) Client state machine; (b) Server state machine.]
Figure 2. The state machines.

If the clients will buffer data, the server switches to the allocate buffers state and joins the other processors to collectively allocate remotely accessible buffers. If there is overflow to receive, it goes to the receive a block state and receives one data block at a time from its clients. Before receiving each data block, the server checks whether it has enough buffer space to buffer that block. If so, it allocates a buffer and uses it to receive the block. Otherwise, it switches to the write a block state and writes out a data block to make room for the incoming block. Then it returns to the receive a block state and checks whether there is enough memory to accept the waiting block. The server switches back and forth between the two states until the incoming data block can be buffered. When there is no more data to receive, each server goes to the busy-listen state, where it probes for new client requests. When there are none, the server goes to the fetch a block state to take a block from a client buffer and write a block out if there are data to fetch, or just writes a block out, before returning to the busy-listen state. If a server senses a new write request in the busy-listen state, it goes to the prepare state. When all data are written out, the server returns to the idle-listen state. Special handling at the end of the run ensures that all data get written before exiting.
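To illustrate the server's background behavior, the following C sketch shows the busy-listen/fetch/write loop using MPI-2 one-sided operations in place of the SHMEM calls of our Origin 2000 implementation. The bookkeeping helpers (blocks_left_to_fetch, next_fetch, blocks_buffered, write_one_block, buffer_for_incoming) are illustrative placeholders rather than the Panda server's actual interface.

/* Sketch of the server's background loop: stay responsive to new requests,
 * pull blocks from client buffers with one-sided Gets, and write blocks out. */
#include <stdio.h>
#include <mpi.h>

#define BLOCK_SIZE  (2 * 1024 * 1024)
#define TAG_REQUEST 1

/* Hypothetical bookkeeping helpers. */
extern int   blocks_left_to_fetch(void);              /* client blocks not yet pulled */
extern void  next_fetch(int *client, MPI_Aint *disp); /* which block to pull next     */
extern int   blocks_buffered(void);                   /* blocks waiting to be written */
extern void  write_one_block(FILE *f);                /* write oldest buffered block  */
extern char *buffer_for_incoming(void);               /* server buffer for a fetch    */

void ab_server_background(MPI_Win client_win, MPI_Comm world, FILE *out)
{
    for (;;) {
        /* busy-listen: probe for a new collective-write request. */
        int pending;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_REQUEST, world, &pending, &st);
        if (pending)
            return;                              /* hand control to the request path */

        if (blocks_left_to_fetch() > 0) {
            /* fetch a block: pull data from a client's buffer while it computes. */
            int client;
            MPI_Aint disp;
            next_fetch(&client, &disp);
            MPI_Win_lock(MPI_LOCK_SHARED, client, 0, client_win);
            MPI_Get(buffer_for_incoming(), BLOCK_SIZE, MPI_BYTE,
                    client, disp, BLOCK_SIZE, MPI_BYTE, client_win);
            MPI_Win_unlock(client, client_win);  /* completes the Get */
            write_one_block(out);                /* and write a block out as well */
        } else if (blocks_buffered() > 0) {
            write_one_block(out);                /* drain remaining buffered blocks */
        } else {
            return;                              /* back to idle-listen */
        }
    }
}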
3 Performance results and analysis

Our implementation uses the existing Panda Library's collective I/O protocol, including the communication protocol and the facilities for packing and shipping metadata. The major part of the collective write operation, transferring output data across layers of buffers to the disk, is driven by the client-server state machine. As mentioned before, our goal is not to show how to aggregate or reorganize data for better file I/O or communication performance. Therefore our experiments use a simple balanced BLOCK distribution of data on the clients and servers: the number of clients is a multiple of the number of servers, each client has the same amount of output data, each server is in charge of writing data for a separate subset of clients to a separate file for each collective write, data from a single client are written in the order supplied by the client, and the data from different clients are ordered by the clients' MPI ranks. We compare active buffering's performance with the file system throughput measured by writing sequential files in the same format, the best we can hope for from traditional collective I/O techniques. Most of our experiments use a synthetic benchmark, for ease of controlling the number of clients and servers, the output data size, and the snapshot frequency. This benchmark creates arrays and metadata, performs meaningless computations on them, and periodically takes snapshots of them. We measure the time spent in the collective write calls and calculate the apparent throughput for the benchmark. We also used a rocket simulation code to test how well active buffering works with a real application.

3.1 Results from the SGI Origin 2000
The Origin 2000 at NCSA is a distributed-shared memory machine with 256 250MHz MIPS R10000 processors running IRIX 6.5. It has 128GB of memory and 456GB of scratch space on XFS with shared RAIDs. All tests used shared job queues, and the error bars show the 95% con-
fidence interval from three or more runs. The HDF4.1r4 library is used to write HDF4 files. The tests on the Origin 2000 use 256MB of buffer space on each server. The total client buffer size will depend on the amount of idle memory on the compute processors; the tests use synthetic benchmarks with a moderate 64MB of client buffer space on all clients. We demonstrate active buffering's performance in writing binary and HDF4 files, the fastest and slowest formats we have experimented with. We report the apparent write throughput per server, calculated as the total amount of output divided by the number of servers times the maximum amount of time spent in the collective write call among the clients. Figure 3 shows the performance of collective writes with active buffering and no client-side buffer overflow. The number of servers is fixed at 2 and the number of clients increases from 2 to 32. The local buffering line shows the throughput per server when each client allocates local buffers and copies its output data from the application there without interacting with any servers. The MPI line shows the throughput per server when each server collects 16MB of data in 2MB messages from each client. The binary and HDF4 write lines show the throughput per server for writing a sequential file of size equal to the total amount of data a server collects from its clients, in binary with 2MB requests and in HDF4 with 2MB datasets respectively. Each client has 16MB of output data, and the time between two collective write requests is enough for the servers to write out all the data in the clients' buffers. This is the ideal case for active buffering, and its ideal performance in this case should be the local buffering throughput. Figure 3 shows that the performance of active buffering (the "AB" line) is very high. Compared to message passing with MPI and file system writes, the I/O components whose costs are visible to user applications in traditional collective I/O techniques, active buffering shows a great performance improvement, raising the apparent write throughput to about 8 times the binary write throughput and more than 30 times the HDF4 write throughput when each server serves 16 clients. The aggregate apparent throughput delivered by active buffering is about 1.4GB/s with 32 clients, at the cost of only 2 dedicated servers. This aggregate throughput is better than what the file system can offer even with many more writers writing in parallel. Table 1 illustrates the performance of writing a total of 2GB of data on XFS using different numbers of writers. The aggregate XFS throughput stops growing when the number of writers reaches 8, and with these 8 writers the aggregate throughput is less than 300MB/s. In addition, by comparing Figure 3(a) and Figure 3(b) we see that active buffering can mask the difference in throughput between writing binary and HDF4 files when buffer space is sufficient. On the other hand, active buffering's
performance scales up more slowly than that of local buffering as the number of clients grows. This is mainly due to the influence of the servers' background remote memory access activities on the clients' local copying efficiency. In general, active buffering's throughput is within 70% of the ideal throughput in Figure 3. To observe active buffering's performance when overflow occurs, we repeat the previous test, but with 96MB of output data on each client (Figure 4). Since there is only 64MB of total local buffer space per client, each client will have 32MB of overflow per snapshot. The server has 256MB of total buffer space, which can hold all the overflow until there are 16 clients per server. Let $D$, $D_c$, $O_c$, and $O_s$ be the total amount of output data, the data buffered at the clients, the overflow at the clients, and the overflow at the servers, respectively. Let $T_{local}$, $T_{MPI}$, and $T_{write}$ be the local buffering throughput, MPI throughput, and file system write throughput, respectively. We can estimate the ideal apparent throughput, $T_{ideal}$, as below:
\[
T_{ideal} = \frac{D}{D_c/T_{local} + O_c/T_{MPI} + O_s/T_{write}}
\]
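For clarity, the estimate can also be computed as in the following C sketch (symbol names match the formula above; this is an illustration, not part of the Panda implementation):

/* Estimate the ideal apparent throughput T_ideal from the formula above.
 * d_total, d_client, o_client, o_server are data volumes (e.g., in MB);
 * t_local, t_mpi, t_write are throughputs in the same units per second. */
double ideal_throughput(double d_total, double d_client, double o_client,
                        double o_server, double t_local, double t_mpi,
                        double t_write)
{
    double visible_time = d_client / t_local   /* copy into client buffers      */
                        + o_client / t_mpi     /* ship overflow to the servers  */
                        + o_server / t_write;  /* foreground writes forced by
                                                  server buffer overflow        */
    return d_total / visible_time;
}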
Plugging in the appropriate data sizes per server, the local buffering throughput at 64MB per client, the MPI throughput at 32MB total message size per client, and the file system write throughput at 256MB, we draw an "ideal" throughput line in Figure 4 and compare it to active buffering's performance. As expected, this time the apparent throughput is considerably lower than in the previous test and the coordination overhead of the collective write protocol weighs less in the overall latency. Therefore, active buffering's performance is closer to ideal than before: over 78% of the ideal throughput, except for one case, HDF4 with 32 clients. The reason is that the blocks collected at the servers do not form one contiguous file region at the point when the servers' buffers begin to overflow, because servers write all the data from one client contiguously before writing data from the next. For HDF4 writes when server buffers overflow, random accessing of datasets is difficult and expensive [7], so a server keeps fetching a missing block from a client local buffer and writing it out, until the "hole" is filled and it can continue sequential writing with blocks in its buffers. Thus with 32 clients in Figure 4, the servers each write 768MB to disk before control returns to the application, and the throughput is only 48% of the ideal. However, as HDF4 write throughput is only 6.8MB/s for a 1536MB file, active buffering still brings a factor of 5.7 improvement in apparent throughput. For binary writes, our implementation of active buffering has the server seek to the right position for the next block to write from its buffers, and fill in the holes later using background writing. When writing in the background, the server always fetches a block before writing one if there are data to fetch, so any missing block will
[Figure 3: (a) Binary writes; (b) HDF4 writes.]

Figure 3. Active buffering performance per server when there is no client buffer overflow on the Origin 2000.

Number of writers:               2      4      8      16
File size per writer (MB):    1024    512    256     128
Aggregate throughput (MB/s): 134.4  214.5  287.9   274.7

Table 1. Aggregate binary throughput on XFS as the number of writers increases.
[Figure 4: (a) Binary writes; (b) HDF4 writes.]

Figure 4. Active buffering's performance when client buffers overflow on the NCSA Origin 2000.
arrive before it needs to be written. For r clients, there are at most r holes to skip during foreground writing and a total of r seeks. This approach is faster than the approach required with HDF4: Figure 4(a) shows its performance with 32 clients to be 82% of the ideal. Overall, active buffering's binary throughput is 40%–200% higher than the file system binary write throughput.
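The seek-and-fill strategy for binary output can be sketched in C as follows; the block_t bookkeeping structure and the split into a foreground and a background pass are illustrative assumptions rather than the exact Panda code.

/* Sketch of hole-skipping binary writes: the foreground pass seeks past
 * blocks that have not arrived yet, and the background pass fills those
 * holes once the missing blocks have been fetched from client buffers. */
#include <stddef.h>
#include <stdio.h>

typedef struct {
    long   file_offset;   /* where this block belongs in the output file     */
    char  *data;          /* NULL until the block is in the server's buffer  */
    size_t len;
} block_t;

/* Foreground: write whatever is already buffered, seeking over holes;
 * control then returns to the application. */
void write_present_blocks(FILE *f, block_t *blocks, int nblocks)
{
    for (int i = 0; i < nblocks; i++) {
        if (blocks[i].data == NULL)
            continue;                          /* hole: skip it, fill later */
        fseek(f, blocks[i].file_offset, SEEK_SET);
        fwrite(blocks[i].data, 1, blocks[i].len, f);
    }
}

/* Background: after fetching a missing block from a client buffer,
 * drop it into its hole. */
void fill_hole(FILE *f, const block_t *b)
{
    fseek(f, b->file_offset, SEEK_SET);
    fwrite(b->data, 1, b->len, f);
}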
Will the servers’ background communication and write activities hurt the application’s performance on the clients? To answer this, we did an experiment measuring the clients’ slowdown when there is background I/O going on at the servers. Since scientific simulations can be viewed as a mixture of computing and communication tasks, we measure the performance of “pure” computation and communication tasks to find the bounding box of possible slowdown. A “pure” computation task performs iterations of floating-point computation on a 128MB array. A “pure” communication task performs iterations of MPI broadcast-
The above discussion is for a single collective write operation, initiated when both client and server buffers are unoccupied. We mentioned that the client-server state machines can also maintain responsiveness in answering new requests while performing background I/O. Experiments simulating such a scenario can be found in the full paper [10].
64 clients run GENX and periodically write approximately 160MB of output data in three back-to-back write requests issued by different computing modules. The output is written in a specific HDF4 format required by Rocketeer, GENX’s visualization tool. Each dataset is relatively small, with the size of about 100KB, which leads to HDF4 write performance as slow as 2MB/s without active buffering. With only two I/O servers, the apparent write throughput per server becomes over 300MB/s and servers can finish their background writing before the next set of GENX write requests arrives. Figure 5 shows the total compute time and the total visible I/O time in a GENX run with 64 compute processors and 30 time steps, as we adjust the frequency of snapshots. The total compute time is the total time spent computing time-step results, and the total visible I/O time is the total time spent on the collective output calls, measured from the application code. When active buffering is not used, GENX uses its original HDF output routines. These routines divide the 64 compute processors into two groups, and all the processors in the same group take turns writing their data to a shared HDF file. When active buffering is used, two additional processors run I/O servers that collect and write the 64 clients’ data. At the end of the run, both tests have created identical files. Figure 5 shows the big performance improvement from using active buffering. Without active buffering, the total I/O time steadily grows, exceeding the total compute time once 11 snapshots are taken. With Panda, as long as each server can finish writing the 80MB of data in its buffer before the next wave of requests comes in, the visible I/O cost to the compute processors is only the communication cost. In Figure 5, the visible I/O cost does not significantly exceed the communication cost until 16 snapshots are taken (one snapshot every two time steps). Then the compute time between two output phases is not enough for the servers to output 80MB of data to HDF files and the servers have to force real writes. The total I/O time with active buffering in this case is still about 6.7 times shorter than without it.
ing of 2MB messages among all the clients, rooted from each one in turn. To test the slowdown with servers driving in full power, we choose the number of iterations in both tasks so that each task is completely overlapped with server background activities, and let servers write in binary to generate heavier one-sided communication and I/O traffic. We carried out this test with 32 clients and 1, 2, 4 and 8 servers, to see the impact of more intense server activities on clients’ computation and communication performance. Results show that the slowdown is small: in most cases less than 4%, with the maximum at 7.1%. Since the slowdown only happens while the servers’ background I/O lasts, and the time those activities last is time that would otherwise be spent with the clients blocked by collective write requests, a small reduction in clients’ concurrent computation performance will not have a significant impact on the gains from active buffering.
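The "pure" communication task can be sketched in C as follows; the iteration count here is a placeholder, chosen in the real experiment so that the task fully overlaps the servers' background activity.

/* Sketch of the "pure" communication task: repeated 2MB broadcasts among
 * the clients, rooted at each rank in turn. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int msg_bytes  = 2 * 1024 * 1024;   /* 2MB messages                 */
    const int iterations = 100;               /* placeholder iteration count  */
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(msg_bytes);
    double t0 = MPI_Wtime();
    for (int it = 0; it < iterations; it++)
        for (int root = 0; root < nprocs; root++)
            MPI_Bcast(buf, msg_bytes, MPI_BYTE, root, MPI_COMM_WORLD);
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0)
        printf("broadcast task took %.2f s\n", elapsed);

    free(buf);
    MPI_Finalize();
    return 0;
}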
3.2 Results from the IBM SP

Given our space constraints, we will give only a short outline of our experimental results on the IBM SP, a distributed memory machine. We used Blue Horizon at SDSC, which has 1152 processors in 144 SMP nodes. Each node has 4GB of memory, shared among 8 375MHz Power3 processors running AIX 4.3. All experiments write to the GPFS shared file system, which has 5TB of scratch space. The HDF4.1r3 library is used to write HDF4 files. We tested server-side buffering on Blue Horizon. In this case, clients' data are collected by the servers using MPI and stored in the servers' buffers. The state machines are used, but there are fewer states compared with buffering at both clients and servers. The servers only issue file system write requests in the background or when their buffers overflow, and the ideal performance is the communication throughput. Since on this IBM SP the gap between MPI throughput and file system throughput is larger than on the NCSA Origin 2000, and HDF4's performance is worse, server-side buffering offers substantial speedups. Apparent throughput per server is up to 350MB/s when buffer space is sufficient: about 3 times higher than the binary write throughput per server without active buffering, and 16 times higher than the HDF4 write throughput per server without active buffering. Server-side active buffering's performance scales up relatively well as the number of servers increases, offering apparent aggregate throughput of 6.8GB/s in both binary and HDF4 format with 32 servers. When buffers overflow, the ideal performance of server buffering can be estimated as discussed earlier, and the actual performance measured is within 90% of the ideal. Active buffering was originally motivated by the poor snapshot performance of GENX, a real rocket simulation application (http://www.csar.uiuc.edu). In a typical run,
4 Related work

Collective I/O can be implemented in multiple ways [3, 8, 14], and active buffering can be applied to all of these to improve write performance. More generally, this paper follows in the tradition of increasing the overlap between I/O and other activities. Optimizations developed by others [1, 2, 5] can be combined with active buffering to provide additional parallelism. This paper also follows in the tradition of investigating parallel buffering/caching issues. Collective I/O on top of parallel file systems [9, 13] can benefit from the proposed techniques, while our work adds extra layers of caching that
! " # $ % & ' ( ) *+ , - . / 0 12 34 56 7 8 9;:=< >@?BABC D E@FHGI J KLNMPORQ S TVU W XRY
t u v w x y z { | } ~
¡£¢¤ ¥ ¦ §©¨«ª ¬ ° ® ¯ ± ² ³ ÑhÒhÓhÔ ÍhÎhÏhÐ
ÉhÊhËhÌ ÅhÆhÇhÈ
ÁhÂhÃhÄ ½ ¾h¿hÀ ¹ ºh»h¼ µh¶¸· ´ Õ
°
ª« ¬
® ¯
Ö × Ø Ù h Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ òó ô õ ö ÷ø ù ú>û üþýÿ
(a) GENX performance without active buffering
x]y]z]{ t]u]v]w p]q]r]s l]m]n]o h]i]j]k d e]f]g ` a]b]c []\_^ Z
± ² ³´ µ ¶· ¸¹ ºR» ¼ ½ ¾¿ À Á | } ~ ]
H¡£¢¤ ¥§¦©¨
(b) GENX performance with active buffering
Figure 5. Running time of GENX with and without active buffering. provides more predictable behavior from the application’s perspective, by adapting to individual applications’ I/O behavior and resource availability.
5 Conclusions and future work

With active buffering, participants in collective write operations can use their local idle memory to lessen the I/O burden of write-intensive applications, by overlapping I/O with computation to the maximum extent. Active buffering allows processors to adjust to the application's memory usage and I/O demands dynamically. When there is enough buffer space, the apparent write throughput observed by the user application approximates the local buffering throughput. A pair of state machines gracefully handles buffer space shortages and bursty requests, and we see a performance benefit even when the data size exceeds the available buffer size at the clients and at the servers. This work can be expanded in many ways. Smarter data fetching and writing strategies can be designed for better buffer utilization and faster writing. Also, data migration can be integrated into the scheme as new states and actions in the server state machine.

References

[1] A. Acharya, M. Uysal, R. Bennett, A. Mendelson, M. Beynon, J. Hollingsworth, J. Saltz, and A. Sussman. Tuning the performance of I/O-intensive parallel applications. In Proceedings of the Fourth Annual Workshop on I/O in Parallel and Distributed Systems, pages 15–27, May 1996.
[2] G. Agrawal, A. Acharya, and J. Saltz. An interprocedural framework for placement of asynchronous I/O operations. In Proceedings of the 10th ACM International Conference on Supercomputing, May 1996. ACM Press.
[3] R. Bordawekar, J. Rosario, and A. Choudhary. Design and evaluation of primitives for parallel I/O. In Proceedings of Supercomputing '93, pages 452–461, 1993.
[4] Y. Chen, M. Winslett, Y. Cho, and S. Kuo. Automatic parallel I/O performance optimization in Panda. In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures, June 1998. ACM Press.
[5] P. Dickens and R. Thakur. Improving collective I/O performance using threads. In Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing, April 1999.
[6] M. Folk, HDF project director, NCSA. Personal communication.
[7] HDF 4.1r3 User's Guide. http://hdf.ncsa.uiuc.edu/UG41r3 html/.
[8] D. Kotz. Disk-directed I/O for MIMD multiprocessors. In Proceedings of the Symposium on Operating Systems Design and Implementation, pages 61–74, Nov. 1994.
[9] D. Kotz and C. S. Ellis. Caching and writeback policies in parallel file systems. In 1991 IEEE Symposium on Parallel and Distributed Processing, December 1991.
[10] X. Ma, M. Winslett, J. Lee, and S. Yu. Faster Collective Output through Active Buffering (full version). http://drl.cs.uiuc.edu/pubs/actbuf-full.ps.
[11] J. Moore and M. J. Quinn. Enhancing disk-directed I/O for fine-grained redistribution of file data. Parallel Computing, 23(4–5):447–499, June 1997.
[12] J. No, S. Park, J. Carretero, A. Choudhary, and P. Chen. Design and implementation of a parallel I/O runtime system for irregular applications. In Proceedings of the Joint International Parallel Processing Symposium and IEEE Symposium on Parallel and Distributed Processing, March 1998.
[13] A. Purakayastha, C. S. Ellis, and D. Kotz. ENWRICH: a compute-processor write caching scheme for parallel file systems. In Proceedings of the Fourth Workshop on Input/Output in Parallel and Distributed Systems, May 1996.
[14] K. E. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-directed collective I/O in Panda. In Proceedings of Supercomputing '95, Nov. 1995.
[15] H. Shan, L. Oliker, R. Biswas, and J. P. Singh. Comparing three programming models for adaptive applications on SGI Origin 2000. In Proceedings of Supercomputing '00, Nov. 2000.