Scalable Parallel I/O Alternatives for Massively Parallel Partitioned Solver Systems

Jing Fu†, Ning Liu†, Onkar Sahni¶, Kenneth E. Jansen¶, Mark S. Shephard¶, Christopher D. Carothers†

†Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York 12180
¶Scientific Computation Research Center, Rensselaer Polytechnic Institute, Troy, New York 12180
E-mails: {fuj,liun2,chrisc}@cs.rpi.edu, {osahni,kjansen,shephard}@scorec.rpi.edu
Abstract—With the development of high-performance computing, I/O has become the bottleneck for many massively parallel applications. This paper investigates scalable parallel I/O alternatives for massively parallel partitioned solver systems. Typically such systems have synchronized "loops" and write data in a well-defined block I/O format consisting of a header and a data portion. Our target use for such a parallel I/O subsystem is checkpoint-restart, where writing is by far the most common operation and reading typically happens only during initialization or during a restart after a system failure. We compare four parallel I/O strategies: 1 POSIX File Per Processor (1PFPP), a synchronized parallel I/O library (syncIO), "Poor-Man's" Parallel I/O (PMPIO), and a new "reduced-blocking" strategy (rbIO). Performance tests using real CFD solver data from PHASTA (an unstructured grid finite element Navier-Stokes solver [1]) show that the syncIO strategy can achieve a read bandwidth of 6.6 GB/sec on Blue Gene/L using 16K processors, which is significantly faster than the 1PFPP or PMPIO approaches. The serial "token-passing" approach of PMPIO yields a 900 MB/sec write bandwidth on 16K processors using 1024 files, and 1PFPP achieves 600 MB/sec on 8K processors, while the "reduced-blocking" rbIO strategy achieves an actual writing performance of 2.3 GB/sec and a perceived (latency-hiding) writing performance of more than 21,000 GB/sec (i.e., 21 TB/sec) on a 32,768 processor Blue Gene/L.
I. INTRODUCTION

Solver systems are an important software tool across a wide range of scientific, medical and engineering domains. To date, great strides have been made in enabling complex solvers to scale to massively parallel platforms [1], [2]. In particular, the PHASTA computational fluid dynamics (CFD) solver has demonstrated strong scaling (85% processor utilization) on 294,912 Blue Gene/P processors [1] compared to a base case using 16,384 processors. Thus, for PHASTA and other petascale applications, the central bottleneck to scaling is parallel I/O performance, especially at high processor counts (i.e., > 16K processors).

The computational work involved in implicit solver systems, using PHASTA as a driving example, consists of two components: (a) formation of the linearized algebraic system of equations and (b) computation of the implicit, iterative solution to the linear system of equations. In the context of implicit CFD solver systems, entity-level evaluations over the mesh are performed to construct a system of equations (Ax = b) as part of the finite element method. This results in an implicit solution matrix that is sparse but involves a large number of unknowns. The underlying mesh is partitioned into balanced parts and distributed among processors such that the workload imbalance is minimized for parallel processing.
In the second component, the pre-conditioned iterative solver computes q = Ap products and generates an update vector, x. Since q is distributed among processors, complete values in q are assembled through a two-pass communication step.

In terms of I/O, PHASTA, like many partitioned solver systems such as the Blue Waters PACT Turbulence system [3], [4], GROMACS [5], and LAMMPS [6], reads an initial dataset across all processors prior to starting the main computational components of the solver. This initialization data is divided into parts, where a part of the mesh or grid is assigned to a specific processor. In such an approach, each processor reads a dedicated part mesh/grid file coupled with an additional file for solution fields. Thus, there are typically two files per processor open for reading during initialization. At large processor counts, this can lead to situations where over 500,000 files are open simultaneously, which results in very slow I/O read performance [7].

For writing part data, especially for the purposes of system checkpoint-restart, the previously mentioned update vector x for each entity (i.e., vertex, element, particle, etc.) needs to be written out to stable storage. The frequency of system-wide entity writes to disk can vary. For PHASTA, it is sufficient to write the solution state only about every 10 time steps, which amounts to approximately one massively parallel write every 10 seconds of compute time. Checkpoint data sizes range between two and eight gigabytes per system-wide write depending on mesh size, and only a single checkpoint is typically maintained (i.e., the data is overwritten each checkpoint epoch).

Based on the above, efficient, massively parallel I/O for partitioned solver systems should meet the following set of objectives:
1) The parallel I/O system needs to provide scalable parallel I/O performance and maintain that peak scaling, if saturated, as the processor counts continue to increase.
2) The parallel I/O interface needs to be sufficiently flexible that the number of files used is not directly dependent on the number of processors used. In fact, the number of files should be a tunable parameter of the parallel I/O system.
3) The parallel I/O system should be able to "hide" I/O latency and not keep compute tasks waiting while I/O is being committed to disk.

The key contribution of this paper is a performance study of I/O alternatives that meet these objectives and ultimately dramatically decrease the time required for both reading and writing of checkpoint-restart data files for massively parallel partitioned solver systems.
Fig. 1. Example parallel I/O using the solver parallel I/O interface for a 2 processor (Proc 0, 1), 2 part per processor, and 3 field (solution, time derivative, discretization error) scenario, with nFiles = 1 and nParts per file = 4 (Proc 0 is assigned Part[0] and Part[1]; Proc 1 is assigned Part[2] and Part[3]). The precise details of the arguments provided to the write_header and write_datablock functions are beyond the scope of this paper.
The rest of the paper is organized as follows: Section II provides a detailed description of the I/O file format and system design. Section III presents an overview of Blue Gene/L and its I/O subsystem along with detailed performance results and analysis of our approaches using actual data files from PHASTA, which serves as our example system. Section IV contrasts the approaches in this paper with prior related work in parallel I/O, and finally, in Section V, conclusions and future work are discussed.

II. PARTITIONED SOLVER PARALLEL I/O FILE FORMAT AND INTERFACE

The central assumption of our partitioned solver I/O interface is that data is written and read across all processors in a coordinated manner. That is, each processor participates in an I/O operation, either reading or writing. With this assumption, the partitioned solver file format is defined as follows: every file contains a master header and data blocks. The master header contains file information, such as the number of data blocks and the number of fields, and, more importantly, an offset table that specifies the starting address of each data block in the file. The data blocks are divided into two sections: a data header and data. The header specifies information about the data block (e.g., data block size, field string, etc.) and the actual data generated from solver computations is contained in the data section.

An example using the parallel I/O interface is shown in Figure 1. Here, a two processor case is given where each processor is assigned two parts and each part contains three fields (e.g., solution, time derivative and discretization error). These two processors write in parallel to one file. That is, the I/O is done in parallel using part-specific data assigned to each processor. In the multi-file case, each file has the same general format, except that a different processor will have a different rank; in this case, the global part ID (GPid) is used instead of the MPI rank when writing/reading parts. It is observed that by providing multiple parts per file, the number of different processor configurations that can be supported for a given set of input data increases. Currently, with the 1PFPP approach, a specific set of input data files must be created for every processor configuration.
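To make the layout concrete, the following C sketch mirrors the format described above. It is a hedged illustration only: the structure and field names (master_header_t, offset_entry_t, data_block_header_t) and the fixed-width types are assumptions made for clarity, not the actual on-disk encoding used by PHASTA's I/O routines.

#include <stdint.h>

/* Sketch of the per-file layout: a master header, an offset table with one
 * entry per (part, field) pair, then a sequence of data blocks, each made of
 * a data header followed by the raw solver data. */
typedef struct {
    uint64_t num_fields;      /* e.g., solution, time derivative, disc. error */
    uint64_t num_files;       /* total number of files in the checkpoint set  */
    uint64_t parts_per_file;  /* "nppf" used by the offset-index formula      */
} master_header_t;

typedef struct {
    uint64_t start_offset;    /* starting byte address of one data block      */
} offset_entry_t;             /* table length: num_fields * parts_per_file    */

typedef struct {
    char     field_tag[64];   /* e.g., "solution@2" (field name @ global part)*/
    uint64_t block_size;      /* number of data bytes that follow this header */
} data_block_header_t;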
Fig. 2. Master Header and Body structure for the 1 file case (4 parts, 3 fields; one header box and one data box per part and field). White header and data boxes are read/written by Processor 0 and gray header and data boxes are read/written by Processor 1.
The structure of the Master Header and Body is shown in Figure 2 for the two processor, two parts per processor, three field example from above, assuming only 1 file for both processors. The index into the offset table is computed as follows:

offset_index = part_i + (field_j × nppf)

where nppf is the number of parts per file. As an example, consider what entry (0, 0) means. The first index is the global part ID, part_i. The second is the field index, field_j. Thus, (0, 0) returns the offset index into the file for part_0, field_0.
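To make the indexing concrete, the following small C helper evaluates the formula; it is a hedged sketch in which the function name and integer types are illustrative and not part of the interface described in this paper.

/* Offset-table slot for (global part i, field j), with nppf parts per file. */
static long offset_index(long part_i, long field_j, long nppf)
{
    return part_i + field_j * nppf;
}

/* For the Figure 1 scenario (nppf = 4):
 *   offset_index(0, 0, 4) == 0   -> part 0, field 0
 *   offset_index(2, 1, 4) == 6   -> part 2, field 1                          */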
Fig. 3. Architecture for 1-POSIX-File-Per-Processor – 1PFPP. X in each of the panels denotes which of the processors are writers. Also, a horizontal listing of X-writer processors denotes parallel access such as the case with syncIO.
Fig. 4. Architecture for PMPIO. X in each of the panels denotes which of the processors are writers. Also, a vertical line of processors with a single X-writer processor denotes the “baton” passing approach of PMPIO.
Similarly, (2, 1) returns the offset index into the file for part_2, field_1. Last, the @ symbol is used as a string separator for the field tag. That is, a field tag of "solution@2" means the solution field on part two, which corresponds to part_i = 2, field_j = 0 in the offset table of the example file, again assuming only one file for both processors.

The above I/O architecture interface was implemented across the following parallel I/O systems. Each of these alternatives is described next.

A. PMPIO: Poor Man's Parallel I/O
Fig. 5. Architecture for syncIO. X in each of the panels denotes which of the processors are writers.
Fig. 6. Architecture for rbIO. X in each of the panels denotes which of the processors are writers. rbIOc is the client side software for rbIO which runs on the workers where rbIOs is the server side software for rbIO which runs on the writers.
PMPIO (Poor Man's Parallel I/O), which is part of the Silo mesh and field I/O library and scientific database [8], provides a straightforward migration path for existing applications that use the 1PFPP approach (like PHASTA), shown in Figure 3, by adding the ability to vary the number of files used by the application. PMPIO is based on the idea of dividing processors into groups, where each group writes serially to a file. As denoted in Figure 4, a "baton" is passed between processors within a group to denote which processor has permission to initiate a write or read operation. Because the number of groups (which determines the number of files used) is a tunable parameter, the application can decide which decomposition yields the best overall I/O performance. In the data format shown in Figures 1 and 2, the number of mesh parts has some implications for the eligible processor/file configurations. In terms of lower-level data formats, the user can select either the Hierarchical Data Format (HDF5) [9] or the raw MPI-IO interface [10] built on top of the ROMIO package [11]. Performance results using both the HDF5 and raw MPI-IO forms of PMPIO are examined in this study.
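The group-serial "baton" passing can be illustrated with the hedged MPI sketch below. This is not the Silo PMPIO API itself, only the serialization idea behind it; the tag value and the write_my_parts() callback are hypothetical names introduced for illustration.

#include <mpi.h>

/* One PMPIO-style file group: ranks take turns writing to the shared file.
 * Each rank waits for the baton from its predecessor, performs its serial
 * write, and then hands the baton to its successor. */
void pmpio_style_write(MPI_Comm group_comm, void (*write_my_parts)(void))
{
    int rank, size, baton = 0;
    const int BATON_TAG = 99;                 /* assumed message tag */

    MPI_Comm_rank(group_comm, &rank);
    MPI_Comm_size(group_comm, &size);

    if (rank > 0)                             /* wait for permission to write */
        MPI_Recv(&baton, 1, MPI_INT, rank - 1, BATON_TAG, group_comm,
                 MPI_STATUS_IGNORE);

    write_my_parts();                         /* hypothetical callback: write
                                                 this rank's parts serially   */

    if (rank < size - 1)                      /* pass the baton onward        */
        MPI_Send(&baton, 1, MPI_INT, rank + 1, BATON_TAG, group_comm);
}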
B. syncIO: Synchronized Parallel I/O

The synchronized parallel I/O library (syncIO) is based directly on the MPI-IO library, as shown in Figure 5. Here, all processors are divided evenly into groups before writing, which determines the file format based on the previous discussion. For reading, the number of groups is driven by the number of processors in the configuration and the number of parts written in each file, as well as the total number of files in the mesh/element dataset. When writing, however, the number of groups is a flexible, user-controllable parameter, although the number of mesh parts does come into play when setting this parameter. The grouping strategy is similar to PMPIO [8] with one file per group, but makes use of the MPI parallel I/O routines within a group. Here, an MPI communicator enables a group of processors to communicate with each other. In our case, each group of processors shares a local MPI communicator and writes to one file in parallel. In order to avoid file system lock contention, each group of processors has its own directory for its file.

C. rbIO: Reduced-blocking Worker/Writer Parallel I/O

The reduced-blocking worker/writer parallel I/O (rbIO) approach, as shown in Figure 6, divides compute processors into two categories: workers and writers. A writer is dedicated to a specific, fixed group of workers and takes charge of the I/O operations that those workers require. When the workers want to write data to disk (i.e., the rbIOc "client" layer in Figure 6), they send their data to their dedicated writer with MPI_Isend() and return from this non-blocking call very quickly without any interruption for I/O. The writer (the rbIOs "server" layer in Figure 6) aggregates the data from all of its workers, reorders the data blocks, and writes to disk using a single call per writer to the MPI_File_write_at() function, as described in Figure 2. The number of writers is a tunable parameter in the I/O system.

III. PERFORMANCE RESULTS

A. I/O Experiment Setup

All experiments were executed on a 32,768 processor Blue Gene/L system located at RPI's CCNI supercomputer facility [12]. The Blue Gene/L is a well known large-scale supercomputer system that balances the computing power of the processor against the data delivery speed and capacity of the interconnect, which is a 3D torus along with auxiliary networks for global communications, I/O and management. The Blue Gene divides work between dedicated groups of processors: compute and I/O. CCNI has 64 compute processors for each I/O processor-pair (i.e., node) in the system, but this ratio can vary. Because compute processors are diskless, the OS forwards all I/O service calls from the compute processors to the dedicated I/O processors, which execute the real I/O system calls on behalf of the compute processors and process the I/O requests in conjunction with a farm of file servers running the GPFS filesystem [13], [14].
Fig. 7. I/O performance for 1 POSIX File Per Processor (1PFPP) as a function of the number of processors (2048 to 32768): aggregate read and write bandwidth in GB/s for 7.7GB of PHASTA restart data on the CCNI BG/L.
The CCNI filesystem is 800TB of storage built using 26 IBM DS4200 storage arrays connected to 48 GPFS SMP file servers. The disk array itself is split into 240 logical units (LUNs), where each LUN is replicated across two file servers. Thus, each file server services 10 LUNs; there is a further split whereby 3 LUNs on each server are dedicated to the "small" 128KB block partition and 7 LUNs are dedicated to the "large" 1MB block partition. Finally, not only data but also file metadata is striped across the disk array to avoid disk contention. All experiments are executed on the GPFS large 1MB block partition. The peak I/O performance of this filesystem is 8 GB/sec for reading and 4 GB/sec for writing. The available write bandwidth is cut in half because each data block is written twice for data integrity purposes (i.e., the two file servers responsible for a particular LUN are both involved in committing data to disk).

Performance test data is from an abdominal aortic aneurysm (AAA) case with a one billion element mesh, which is partitioned into 2K, 4K, 8K, 16K and 32K parts [1] for the respective processor configurations. Recall that there is a one-to-one correspondence between the number of parts and the number of processors. Thus, as the number of processors increases, the computation required and data per processor decrease, while the total computation and total data written in one step remain nearly constant (with a slight increase due to boundary duplication). For example, with 2K parts, each processor writes 3.5MB of data.

B. 1PFPP Results

Performance results for the 1 POSIX File Per Processor (1PFPP) approach are shown in Figure 7, where the bandwidth in GB/sec is reported as a function of the number of processors used. Here, it is observed that the obtained bandwidth grows but then shrinks as the number of processors is increased to 32,768. The sharp decrease in performance at 16K processors is a consequence of small writes per processor being issued to the underlying I/O system. As previously mentioned, the GPFS filesystem is configured with a 1 MB block size, which is the basic data unit the filesystem manages. Disk writes smaller than that size incur similar filesystem overheads (e.g., metadata and block allocation on a per-file basis). Consequently, as the number of files increases, the chunk size per file decreases, which increases the filesystem overhead per byte written. That is, at 32K processors, each file holds less than one-quarter of 1 MB. Consequently, more filesystem overhead is experienced at 32K processors than at 16K and 8K processors, etc. In terms of absolute peak performance, the peak read bandwidth is 1.3 GB/sec at 4K processors and the peak write bandwidth is 700 MB/sec at 8K processors. Here, nearly 1MB of data per processor (and per file) is written, which aligns best with the filesystem's 1 MB block size and incurs the least filesystem overhead compared to the other processor configurations. These results represent 16% of the peak read and 14% of the peak write performance.
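For concreteness, a quick back-of-the-envelope check using the fixed 7.7GB checkpoint size quoted above explains both the 8K peak and the 32K drop:

7.7 GB / 8,192 files ≈ 0.94 MB per file (well aligned with the 1 MB GPFS block size)
7.7 GB / 32,768 files ≈ 0.24 MB per file (less than one-quarter of a block)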
C. PMPIO Results
Performance results for the PMPIO approach are shown in Figure 8, where bandwidth in GB/sec is reported as a function of the number of files used across 8K, 16K and 32K processor configurations. It is observed that while the performance is slightly better than 1PFPP, it does not scale at high processor counts. The decrease in write I/O performance at higher processor counts can be attributed to the overheads associated with "baton" passing within groups. As the number of processors increases for a given file count, the size of a "file group" increases, which increases the number of "baton" passes for the given amount of data being written. Additionally, the concurrency among writers decreases as the file group size increases. Another source of overhead involves the metadata management layer of HDF5. At 32K processors, the time required to open and close a file is significantly longer than the time to actually write the data. To improve the PMPIO performance, direct calls to MPI-IO routines are used. These results show an improvement in I/O bandwidth, as shown in Figure 8 (bottom). However, it is clear that this approach does not scale at higher processor counts because of the "baton" passing overheads associated with large processor file groups (i.e., for 32K processors and 1024 files, the group size is 32). Finally, the performance loss when going from 16K to 32K processors is attributed to the size of per-processor writes decreasing by half, since the overall data set size is fixed at 7.7 GB. Smaller writes in turn lower the effective utilization of system I/O node bandwidth.
Fig. 8. Write performance for PMPIO with HDF5 (top) and MPI-IO (bottom) as a function of the number of files (up to 2048) and processors: aggregate bandwidth in GB/s for 7.7GB of PHASTA restart data on the CCNI BG/L.
D. syncIO Results

SyncIO is implemented using the MPI_File_write_at_all_begin()/end() function pair. This decision was based on our own performance tests, which compared the performance of this function pair to MPI_File_write_at_all() and MPI_File_write_at(). While the implementation of MPI_File_write_at_all_begin() uses MPI_File_write_at_all() [15], we found the "begin/end" pair to be slightly faster on the CCNI Blue Gene/L system. A more detailed performance analysis is required to fully understand this phenomenon and is beyond the scope of this paper.

The read/write I/O performance of the syncIO approach is shown in Figure 9. Here, the bandwidth is measured as a function of the number of files across 8K, 16K and 32K processor configurations. SyncIO's peak performance occurs at a modest number of files: between 64 and 128 for reading (6.6 GB/sec at 16K processors, which is 82% of the peak read performance) and between 32 and 64 for writing (1.4 GB/sec at 16K processors, which is 35% of the peak write performance). The decrease in read and write performance at 32K processors compared with 16K processors is attributed to smaller data sizes on a per-processor basis, which results in higher filesystem overheads on a per-byte basis, since the read/write sizes are significantly less than the 1 MB data block size of GPFS. Overall, this is a substantial improvement (400% for read and 50% for write) over using either the 1PFPP or PMPIO approaches.
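A hedged sketch of the syncIO write path follows: processors are split into one group per file, each group opens its own file (in its own directory) on the group communicator, and the split-collective begin/end pair discussed above performs the write. The group-assignment rule, file naming, and offset handling are illustrative assumptions rather than the library's actual implementation.

#include <mpi.h>
#include <stdio.h>

/* syncIO-style write: split MPI_COMM_WORLD into nfiles groups, then write
 * collectively within each group using the begin/end split-collective pair. */
void syncio_style_write(int nfiles, void *buf, int count, MPI_Offset my_offset)
{
    int world_rank, color;
    MPI_Comm group_comm;
    MPI_File fh;
    MPI_Status status;
    char fname[256];

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    color = world_rank % nfiles;              /* assumed group assignment     */
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &group_comm);

    /* one directory and one file per group, to avoid filesystem lock
     * contention between groups (directory layout is an assumption)          */
    snprintf(fname, sizeof(fname), "group%d/restart-part.%d", color, color);

    MPI_File_open(group_comm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_write_at_all_begin(fh, my_offset, buf, count, MPI_BYTE);
    /* ...the caller could overlap other work here before completing...       */
    MPI_File_write_at_all_end(fh, buf, &status);
    MPI_File_close(&fh);
    MPI_Comm_free(&group_comm);
}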
Fig. 9. I/O performance of syncIO as a function of the number of files (1 to 1024) and processors (8K, 16K, 32K) for 7.7GB of PHASTA restart data on the CCNI BG/L. Write performance (aggregate bandwidth in GB/s) is the top panel and read performance is the bottom panel.
E. rbIO Results

The rbIO writing performance results are shown in Figure 10. In terms of the write implementation, unlike syncIO, rbIO uses the MPI_File_write_at() function because writing is not done across all processors. Since each file is written to by only one writer processor, the writer opens its file with the MPI_COMM_SELF communicator and commits data with the MPI_File_write_at() routine. Because rbIO worker processors only transmit data in a non-blocking manner via MPI_Isend() to writer processors, the notion of perceived writing speed is defined as the speed at which worker processors can transfer their data. This measure is calculated as the total amount of data sent across all workers divided by the average time that MPI_Isend() takes to complete. To be clear, this assumes that the configuration of the writer pool, the worker pool, and the disks is such that the writers can flush their I/O requests in roughly the time between writes (recall, the typical delay between steps that write solution states is O(10) seconds). Figure 10 (top) shows the average "perceived" writing performance from the worker's view. A bandwidth of more than 21 terabytes per second using 16,384 processors is obtained. At first glance, this result might appear overly optimistic. However, this level of performance coincides with the total bandwidth of the memcpy() function coupled with a small overhead for MPI non-blocking message processing. In other words, rbIO leverages the Blue Gene internal memory bandwidth and torus network to transfer I/O data to the writer.
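A hedged sketch of the worker/writer exchange behind this measurement is shown below. The message tag, the arrival-order aggregation, and the zero file offset are simplifications; in particular, the reordering of data blocks and the offset bookkeeping performed by the real writers are omitted.

#include <mpi.h>
#include <stdlib.h>

#define DATA_TAG 42                            /* assumed message tag */

/* Worker side: the time spent inside MPI_Isend() is the basis of the
 * "perceived" write bandwidth; the worker then returns to computation. */
void rbio_worker_write(void *block, int nbytes, int writer_rank,
                       MPI_Request *req, double *isend_seconds)
{
    double t0 = MPI_Wtime();
    MPI_Isend(block, nbytes, MPI_BYTE, writer_rank, DATA_TAG,
              MPI_COMM_WORLD, req);
    *isend_seconds = MPI_Wtime() - t0;
}

/* Writer side: gather one block from each assigned worker into an aggregation
 * buffer, then commit it with a single independent write on a file that was
 * opened with MPI_COMM_SELF (one writer per file).  Block reordering and
 * per-block file offsets are omitted in this sketch. */
void rbio_writer_service(MPI_File fh, int nworkers, int nbytes_per_worker)
{
    char *agg = malloc((size_t)nworkers * (size_t)nbytes_per_worker);
    MPI_Status st, wst;
    int w;

    for (w = 0; w < nworkers; w++)
        MPI_Recv(agg + (size_t)w * nbytes_per_worker, nbytes_per_worker,
                 MPI_BYTE, MPI_ANY_SOURCE, DATA_TAG, MPI_COMM_WORLD, &st);

    MPI_File_write_at(fh, 0, agg, nworkers * nbytes_per_worker, MPI_BYTE, &wst);
    free(agg);
}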
From the worker's view, the writing speed is on par with the total memory system bandwidth, which is 5.5 GB/s peak per node, or 45 TB/sec peak for a 16K partition. Because 6% or fewer of the processors are dedicated as writers, the majority of processors (the workers) incur little overhead for write I/O operations and return to the next computation step immediately. Since the write time (3 to 4 seconds) on the I/O side is typically shorter than the computational time required to execute 10 time steps (e.g., the time that separates solution writes), the writers are ready to receive the workers' data before the workers actually require a new write and initiate I/O write requests. This tradeoff can dramatically reduce the application execution time. Also, based on the results collected so far, the number of files (i.e., number of writers) does not appear to impact the perceived write performance. The caveat to this approach is that the application has to be written such that it can "give up" 3 to 6% of the physical compute processors (256 to 512 on 16K processors) and re-map the computational work among the remaining 94 to 97% of compute processors.

Figure 10 (bottom) reports the actual writing speed from the file system's perspective. Here, every writer commits its aggregated data to disk using MPI_File_write_at(). The actual writing speed is measured as the total amount of data across all writers divided by the total wall-clock time it takes for the slowest writer to finish. This is a conservative measurement, but its performance still outperforms the other approaches, including syncIO, for writing. This improvement in actual write performance is attributed to lower filesystem overheads, because a writer task commits data in chunk sizes that are in line with the GPFS filesystem's 1 MB block size. Additionally, there are no synchronization overheads for data blocks, since there is only one writer task per file. Finally, it is observed that the number of writer processors/files that yields the best performance is larger than the number of real Blue Gene system I/O processors, which is reasonable since the Blue Gene/L architecture allows up to 64 compute processors to write through a single real system I/O node.

IV. RELATED WORK

The research most closely related to, and which inspired, our rbIO system is by Nisar et al. [16]. Here, an I/O delegate and caching system (IODC) is created that aggregates data by combining many small I/O requests into fewer but larger I/O requests before initiating requests to the file system, while keeping a small number of processors dedicated to performing I/O. By using 10% extra compute processors as I/O processors, they improved the performance of several I/O benchmarks using up to 400 processors. This work differs from the results presented here in several ways. First, IODC operates below the MPI-IO layer. The consequence of this implementation is that it has to be general and requires a complex cache/replica consistency mechanism. In contrast, rbIO is implemented at the user level and leverages knowledge of the application in terms of how to partition writers and workers. Second, rbIO only requires MPI/MPI-IO functionality, whereas IODC requires OS support for threading, which the Blue Gene/L does not support and the Blue Gene/P supports only in a limited fashion. Finally, rbIO targets parallel I/O at scales greater than 16K processors.
Fig. 10. Perceived (top, aggregate bandwidth in TB/s) and actual (bottom, aggregate bandwidth in GB/s) write performance with rbIO as a function of the number of files (64 to 2048) and processors (8K, 16K), for 7.7GB of PHASTA restart data on the CCNI BG/L.
More recently, Choudhary et al. [17] improve the collective I/O performance of MPI by providing three new methods for file domain partitioning. The first method aligns the partitioning with the file system's lock boundaries. The second method partitions a file into fixed blocks guided by the filesystem lock granularity and then round-robin assigns the blocks to the I/O aggregators. The third approach divides the I/O aggregators into equalized groups, where the size of a group is equal to the number of I/O servers. These methods are evaluated on the GPFS and Lustre filesystems using the FLASH and S3D I/O benchmarks on the Cray XT4 (Jaguar at ORNL) and a Linux cluster (Mercury at NCSA). Using the group-cyclic approach on the Jaguar Cray XT4 configured with the Lustre filesystem, a peak of 14GB/sec is obtained using 2048 processors; increasing the processor count does not improve performance (configurations with up to 16K processors are reported). Because this approach operates within the MPI library software layer, our syncIO approach could leverage these improvements. The performance of rbIO may also improve, but the writer tasks would need to be re-implemented to make use of the collective I/O operations. All rbIO writers are currently implemented using the "self" MPI communicator
and write to their own file. H. Yu et al. [18] designed a parallel file I/O architecture for Blue Gene/L to exploit the scalability of GPFS. They used MPI-IO as the interface between the application and the file system. Several I/O benchmarks and an I/O-intensive application were evaluated with their solution up to 1K processors, with a reported bandwidth of 2 GB/s. Saini et al. [19] ran several I/O benchmarks and synthetic compact application benchmarks on Columbia and the NEC SX-8 using up to 512 processors. They conclude that MPI-IO performance depends on access patterns and that I/O is not scalable when all processors access a shared file. Dickens et al. [20] report poor MPI-IO performance on Lustre and present a user-level library that redistributes data such that overall I/O performance is improved. W. Yu et al. [21] characterized the performance of several I/O benchmarks on Jaguar using the Lustre filesystem. Here, they demonstrate the efficacy of their I/O tuning approaches, such as two-phase collective I/O. Shan et al. [22], [23] analyzed the disk access patterns of I/O-intensive applications at the National Energy Research Scientific Computing Center (NERSC) and selected parameters for the IOR benchmark [24] to emulate the application behavior and overall workloads. They tested I/O performance with different programming interfaces (POSIX, MPI-IO, HDF5 and pNetCDF) up to 1K processors, and they show that the best performance is achieved at relatively low processor counts on both ORNL Jaguar and NERSC Bassi. In contrast to these past results, this paper presents scaling data at higher processor counts. Larkin et al. [7] ran the IOR benchmark on the Cray XT3/XT4 and report a tremendous performance drop at higher processor counts (around 2K). When a subset of processors (1K out of 9K processors) is used for I/O, which is similar to our rbIO approach, they achieved much better performance. A key difference is that, for their approach, the I/O between workers, writers and the filesystem is synchronous, based on the performance data provided. Additionally, Fahey et al. [25] performed sub-setting experiments on the Cray XT4 using four different I/O approaches (mpiio, agg, ser and swp) and up to 12,288 processors. Here, they achieved a write performance that is about 40% of peak. However, in their sub-setting benchmark, all subset master processors write data into a single file; the rbIO and syncIO approaches allow the number of files to be a tunable parameter. Borrill et al. [26], [27] developed the MADbench2 benchmark to examine file I/O performance on several systems, including Lustre on the Cray XT3, GPFS and PVFS2 [28] on Blue Gene/L, and CXFS [29] on the SGI Altix 3700. They extensively compared I/O performance across several different parameters, including concurrency, POSIX vs. MPI-IO, unique vs. shared-file access, and I/O tuning parameters, using up to 1K processors. They report no difference in performance between POSIX and MPI-IO for the large contiguous I/O access patterns of MADbench2. They also found that GPFS and PVFS2 provide nearly identical performance between shared and unique files under default configurations. The Blue Gene/L achieved slow I/O performance even at low processor counts (256 processors). In contrast, the results reported here target parallel I/O performance at much higher processor counts than discussed in this previous work. Finally, Lang et al. [30] have conducted the most recent benchmark study of large-scale parallel I/O. Here, they used
the 131,072 processor Blue Gene/P located at Argonne National Laboratory and report I/O bandwidths of nearly 60 GB/s (read) and 45 GB/s (write) using up to 131,072 processors across a number of I/O benchmarks, including BTIO (block tridiagonal I/O), Flash3, IOR, and MADbench2, when coupled with the PVFS parallel filesystem. However, non-blocking I/O mechanisms, which are a focus of this research, are not discussed in that paper.

V. CONCLUSIONS

While system designers continue to build larger and larger systems comprising hundreds of thousands of processors, current I/O subsystems continue to be a bottleneck in obtaining peak petascale system performance. SyncIO and rbIO provide a new perspective on combining portable parallel I/O techniques for massively parallel partitioned solver systems. Performance tests using real partitioned CFD data from the PHASTA solver show that it is possible to use a small portion (3 to 6%) of the processors as dedicated writers and achieve an actual writing performance of 2.3GB/sec and a perceived (latency-hiding) writing performance of more than 21TB/sec (nearly 50% of aggregate peak memory bandwidth) on the Blue Gene/L. It is additionally reported that the syncIO strategy can achieve a read bandwidth of 6.6GB/sec on Blue Gene/L using 16K processors and is significantly faster than the 1PFPP approach. In contrast, using a serial "token-passing" approach within each group of files, PMPIO achieves only a 900MB/sec write bandwidth on 16K processors using 1024 files. Parallel I/O performance studies at even higher processor counts are planned by leveraging the Blue Gene/P systems at both ANL (163,840 processors) and JSC (294,912 processors), as well as an investigation of syncIO and rbIO performance on other massively parallel systems such as the Cray XT5 coupled with the Lustre filesystem. In particular, we plan to investigate how the distribution of the designated I/O processors impacts the performance of rbIO across these systems for a variety of data sets.

ACKNOWLEDGMENTS

We acknowledge funding support from the NSF PetaApps Program, OCI-0749152, and computing resource support from CCNI-RPI. We thank Rob Latham, Ray Loy and Rob Ross at Argonne National Laboratory for a number of discussions that led to this research, and Rob Ross for an early review of this manuscript. We also acknowledge that the results presented in this article made use of software components provided by ACUSIM Software Inc. and Simmetrix. Last updated: January 29, 2010.

REFERENCES

[1] O. Sahni, M. Zhou, M. S. Shephard, and K. Jansen, "Scalable implicit finite element solver for massively parallel processing with demonstration to 160k cores," in Proceedings of the Supercomputing 2009 Conference, Portland, Oregon, 2009 (the 294,912 processor case was presented only at the conference and is not covered in the details of the paper).
[2] O. Sahni, C. D. Carothers, M. S. Shephard, and K. E. Jansen, "Strong scaling analysis of a parallel, unstructured, implicit solver and the influence of the operating system interference," Scientific Programming, vol. 17, pp. 261–274, 2009.
[3] D. A. Donzis, P. Yeung, and K. Sreenivasan, "Dissipation and enstrophy in isotropic turbulence: Resolution effects and scaling in direct numerical simulations," Physics of Fluids, vol. 20, 2008.
[4] "Blue Waters PACT Team Collaborations." [Online]. Available: http://www.ncsa.illinois.edu/BlueWaters/prac.html
[5] GROMACS. [Online]. Available: http://www.gromacs.org
[6] LAMMPS, Sandia National Laboratories. [Online]. Available: http://www.cs.sandia.gov/~sjplimp/lammps.html
[7] J. M. Larkin and M. R. Fahey, "Guidelines for Efficient Parallel I/O on the Cray XT3/XT4," in Cray User Group Meeting (CUG) 2007, May 2007.
[8] "Silo: A mesh and field I/O library and scientific database." [Online]. Available: https://wci.llnl.gov/codes/silo/
[9] "HDF5." [Online]. Available: http://www.hdfgroup.org/HDF5
[10] "Message Passing Interface." [Online]. Available: http://www.mcs.anl.gov/research/projects/mpi
[11] R. Thakur, E. Lusk, and W. Gropp, "Users guide for ROMIO: A high-performance, portable MPI-IO implementation," Technical Memorandum ANL/MCS-TM-234, Mathematics and Computer Science Division, Argonne National Laboratory, 2004.
[12] "Computational Center for Nanotechnology Innovations (CCNI), Rensselaer Technology Park, North Greenbush, N.Y." [Online]. Available: http://www.rpi.edu/research/ccni
[13] "General Parallel File System." [Online]. Available: http://www-03.ibm.com/systems/clusters/software/gpfs/index.html
[14] F. B. Schmuck and R. L. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," in FAST, 2002.
[15] R. Ross, personal communication via email, December 12, 2009.
[16] A. Nisar, W. Liao, and A. Choudhary, "Scaling Parallel I/O Performance through I/O Delegate and Caching System," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
[17] A. Choudhary, W. K. Liao, K. Gao, A. Nisar, R. Ross, R. Thakur, and R. Latham, "Scalable I/O and analytics," Journal of Physics: Conference Series, vol. 180, no. 1, 2009.
[18] H. Yu, R. K. Sahoo, C. Howson, G. Almási, J. G. Castaños, M. Gupta, J. E. Moreira, J. J. Parker, T. Engelsiepen, R. B. Ross, R. Thakur, R. Latham, and W. D. Gropp, "High performance file I/O for the Blue Gene/L supercomputer," in HPCA, 2006.
[19] S. Saini, D. Talcott, R. Thakur, P. A. Adamidis, R. Rabenseifner, and R. Ciotti, "Parallel I/O Performance Characterization of Columbia and NEC SX-8 Superclusters," in IPDPS, 2007.
[20] P. M. Dickens and J. Logan, "Y-lib: a user level library to increase the performance of MPI-IO in a Lustre file system environment," in HPDC, 2009.
[21] W. Yu, J. S. Vetter, and S. Oral, "Performance characterization and optimization of parallel I/O on the Cray XT," in IPDPS, 2008.
[22] H. Shan and J. Shalf, "Using IOR to analyze the I/O performance for HPC platforms," in Cray User Group Meeting (CUG) 2007, May 2007.
[23] H. Shan, K. Antypas, and J. Shalf, "Characterizing and predicting the I/O performance of HPC applications using a parameterized synthetic benchmark," in SC, 2008.
[24] "IOR Benchmark." [Online]. Available: https://asc.llnl.gov/sequoia/benchmarks/#ior
[25] M. R. Fahey, J. M. Larkin, and J. Adams, "I/O Performance on a Massively Parallel Cray XT3/XT4," in IPDPS, 2008, pp. 1–12.
[26] J. Borrill, L. Oliker, J. Shalf, and H. Shan, "Investigation of leading HPC I/O performance using a scientific-application derived benchmark," in SC, 2007.
[27] J. Borrill, L. Oliker, J. Shalf, H. Shan, and A. Uselton, "HPC global file system performance analysis using a scientific-application derived benchmark," Parallel Computing, vol. 35, no. 6, pp. 358–373, 2009.
[28] "Parallel Virtual File System 2." [Online]. Available: http://www.pvfs.org
[29] "Cluster XFS." [Online]. Available: http://www.sgi.com/products/storage/software/cxfs.html
[30] S. Lang, P. Carns, R. Latham, R. Ross, K. Harms, and W. Allcock, "I/O performance challenges at leadership scale," in Proceedings of Supercomputing, November 2009.