A Study of Real World I/O Performance in Parallel Scientific Computing

Dries Kimpe1,2, Andrea Lani3, Tiago Quintino1,3, Stefan Vandewalle1, Stefaan Poedts2, and Herman Deconinck3

1 Technisch-Wetenschappelijk Rekenen, K.U.Leuven, Celestijnenlaan 200A, BE-3001 Leuven, België
[email protected]
2 Centrum voor Plasma-Astrofysica, K.U.Leuven, Celestijnenlaan 200B, BE-3001 Leuven, België
3 Von Karman Instituut, Waterloosesteenweg 72, BE-1640 Sint-Genesius-Rode, België
Abstract. Parallel computing is indisputably present in the future of high performance computing. For distributed memory systems, MPI is widely accepted as a de facto standard. However, I/O is often neglected when considering parallel performance. In this article, a number of I/O strategies for distributed memory systems will be examined. These will be evaluated in the context of COOLFluiD, a framework for object oriented computational fluid dynamics. The influence of the system and software architecture on performance will be studied. Benchmark results will be provided, enabling a comparison between some commonly used parallel file systems.
1 Motivation and Problem Description

1.1 Parallel Programming
Numerical simulation and other computationally intensive problems are often successfully tackled using parallel computing. Frequently these problems are too large to solve on a single system, or the time needed to complete them makes single-CPU calculation impractical. Successful parallelisation is usually measured by the problem “speedup”. This quantity indicates how much faster a given problem is solved on multiple processors, compared to the solution time on one processor. More often than not, this speedup is based only on the computationally intensive part of the code, and phases such as program startup or data loading and saving are excluded from the measurement. Also, when the ratio of computation to input data is high enough, I/O time is negligible in the total execution time. However, when scaling to larger problem sizes (and consequently more processors), I/O often becomes an increasingly large bottleneck. The main reason for this is that without parallel I/O, the I/O and calculation potential of a cluster quickly becomes unbalanced. This is visible both in hardware and in software; often there is but a single file server managing data for the
whole cluster. Moreover, traditional I/O semantics do not offer enough expressive power to coordinate requests, leading to file server congestion and reducing the already limited I/O bandwidth even further.
1.2 Computational Fluid Dynamics and COOLFluiD
Computational fluid dynamics (CFD) deals with the solution of a system of partial differential equations describing the motion of a fluid. This is commonly done by discretizing these equations on a mesh. Depending on the numerical algorithm, a set of unknowns is associated with either the nodes or the cells of the mesh. The amount of computational work is proportional to the number of cells. For realistic problems this quickly leads to simulations larger than a single system can handle. COOLFluiD[4] is an object oriented framework for computational fluid dynamics, written in C++. It supports distributed memory parallelisation through MPI, but still allows optimized compilation without MPI for single-processor systems. COOLFluiD utilises parallel I/O for two reasons. One is to guarantee scalability of the code. The other is to hide parallelisation from the end user. During development, a goal was set to mask the differences between serial and parallel builds of COOLFluiD as much as possible. This, among other things, requires that the data files used and generated by the parallel version do not differ from those of the serial version. This depends on parallel I/O, as opening a remote file for writing on multiple processors using POSIX semantics is ill-defined and often leads to corrupted files.
1.3 I/O in a Parallel Simulation
There has been much research on the optimal parallel solution of a system of PDEs. However, relatively little study has been devoted to creating scalable I/O algorithms for this class of problems. Generally speaking, there are three reasons for performing I/O during a simulation. At the start of the program, the mesh (its geometric description and an initial value for each of the associated unknowns) needs to be loaded into memory. During the computation, snapshots of the current solution state are stored. Before ending the program, the final solution is saved. In a distributed memory machine, the mesh is divided between the nodes. Consequently, each CPU requires a different portion of the mesh to operate on. This offers opportunities for parallel I/O, since every processor only accesses distinct parts of the mesh. Figure 1 shows an example of a typical decomposition and the resulting I/O access pattern. On the left, the partitioned mesh is shown. On the right, the file layout (row-major ordering) can be seen. Color indicates which states are accessed by a given CPU.
Fig. 1. Decomposition and file access pattern of a 3D sphere
2 I/O Strategies
Within COOLFluiD, I/O is fully abstracted. This simplifies supporting multiple file formats and access APIs, and allows run-time selection of the desired format. Mesh input and output is provided by file plugins. A file plugin offers a well-defined, format-independent interface to the stored mesh, and can implement any of the following access strategies:

Parallel Random Access: This strategy has the potential to offer the highest performance. It allows every processor to read and write arbitrary regions of the file. If the system architecture has multiple pathways to the file, this can be exploited. File plugins implementing this interface enable all CPUs to concurrently access those portions of the mesh required for their calculations.

Non-Parallel Random Access: In this model, the underlying file format (or access API) does not support parallel access to the file. Only a single CPU is allowed to open the file, which can then be accessed randomly. This strategy can be used with data present on non-shared resources, for example local disks.

Non-Parallel Sequential Access: Sometimes the way data is stored prohibits meaningful true parallel access. For example, within an ASCII-based file format, it is not possible to read a specific mesh element without first reading all the previous elements. This is due to the varying stride between the elements. As such, even when the OS and API allow parallel writing to the file, for mesh-based applications this cannot be done without corrupting the file structure. Note that applications that do not care about the relative ordering of the entries in the file can still use parallel I/O to read and write from this file (using shared file pointer techniques). However, as this article studies I/O patterns for mesh-based applications, this is not taken into consideration.
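To make the abstraction concrete, the following C++ sketch shows what such a file plugin interface could look like. It is a hypothetical illustration written for this article, not COOLFluiD's actual class hierarchy; the names (MeshFilePlugin, readStates, and so on) are invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of a format-independent file plugin interface.
// The names below are illustrative and do not correspond to COOLFluiD's
// actual classes.
enum class AccessStrategy {
  ParallelRandom,        // all CPUs read/write arbitrary regions concurrently
  NonParallelRandom,     // one CPU opens the file, random access allowed
  NonParallelSequential  // one CPU, elements must be read in file order
};

class MeshFilePlugin {
public:
  virtual ~MeshFilePlugin() = default;

  // Which of the three access strategies this format supports.
  virtual AccessStrategy strategy() const = 0;

  virtual void open(const std::string& path) = 0;
  virtual void close() = 0;

  // Read the states owned by the calling process; 'indices' holds the
  // global state numbers assigned to this process by the partitioner.
  virtual void readStates(const std::vector<std::size_t>& indices,
                          std::vector<double>& states) = 0;

  // Write the locally owned states back to their global positions.
  virtual void writeStates(const std::vector<std::size_t>& indices,
                           const std::vector<double>& states) = 0;
};
```

A concrete plugin for, say, an HDF5-based format would implement this interface on top of the corresponding storage library, while the rest of the code remains unaware of the format in use.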
3 Performance Testing
Currently, obtaining good parallel I/O performance is still somewhat of a black art. By making use of the flexibility COOLFluiD offers concerning mesh I/O, an attempt is made to explore and analyse the many different combinations of file system, API and interconnect that can be found in modern clusters.
3.1 Test Description
We will concentrate on the parallel random access pattern, since the other two access strategies are inherently non-scalable (when considering I/O bandwidth). Although COOLFluiD supports them, they are offered as a convenience. For large simulations, converting the mesh to one of the formats supporting true parallel random access is recommended.
Figure 2 shows the software invoked during mesh transfers. COOLFluiD has file plugins that utilise a storage library (HDF5[3] or PnetCDF[7]) or that directly employ MPI-IO to access the mesh. Internally, these storage libraries rely on the I/O functions of MPI to access raw files. ROMIO[6] is an implementation of these I/O functions, and is used in almost all research or open source MPI implementations. ROMIO has a number of ADIO (abstract-device interface for I/O) drivers providing optimized access to a particular file system. While PVFS2[2] and NFS have specific ADIO implementations, Lustre[1], aiming for full POSIX compliance, is accessed through the generic “UFS” driver.

Fig. 2. Software stack for parallel mesh I/O
For testing, the time needed to access a set of unknowns (of a given dimension) will be measured. These unknowns are stored as a linear sequence of “states” (figure 3), each state consisting of a number of doubles. The storage library (HDF5, PnetCDF) is responsible for the mapping between the virtual layout (n × d doubles) and the file layout (a linear byte sequence). In general, each state is only accessed by one CPU. However, states on the border of a partition will be accessed by multiple CPUs. States are loaded or stored in groups, where the group size is determined by the buffer size. Since MPI-IO requires file datatypes to have positive type displacements, states need to be addressed in increasing order (for a given CPU). This means that, in each access round, a CPU will access buffer_size / (sizeof(double) × d) states. Because the state partitions are balanced to evenly distribute the computational cost between the CPUs, this also causes the I/O load to be balanced.

Fig. 3. (Virtual) file layout of the unknowns

3.2 Test Hardware
All tests were conducted on VIC, an 862-CPU cluster located at K.U.Leuven. The cluster has a number of different interconnect fabrics. All nodes possess a gigabit ethernet connection. Two 144-port InfiniBand (4X) switches provide InfiniBand connections to most of the nodes.
For PVFS2[2], 4 I/O servers were employed, each server having 2 opteron CPUs. Data is stored locally on a SATA disk attached to the node. The disk has a raw read bandwidth of approximately 50 MB/s. One of the I/O servers doubles as metadata server. Connections between the servers and clients were made using native InfiniBand. PVFS2 version 1.5.1 was used.
The Lustre[1] file system used for testing ran on the same 4 servers. Here too, one of them performed both metadata (MDS) and storage (OST) functions, while the others only served as storage servers. Connections were made using IPoIB, an IP emulation mode running over the InfiniBand network. All files were striped over the available I/O servers. The Lustre version was 1.4.6.
A dedicated server (of the same type) was installed to export the NFS file system, also using IPoIB. The async option was enabled, allowing the server to cache writes in order to increase performance.
Eight nodes were reserved as I/O clients. Only one of the two opteron CPUs of every node was used, avoiding contention for the network ports. All nodes were installed for the purpose of this article, in order to exclude any interference from other jobs running on the cluster.
3.3 MPI-IO File Hints
The MPI-2 file interface enables the user to specify implementation-specific hints. These hints can be used to communicate additional information to the underlying software layers. However, an implementation is free to ignore hints. Both PnetCDF and HDF5 allow the user to specify hints, which are subsequently passed unmodified to MPI-IO during file access. Unfortunately, most hints are useless in combination with high-level storage libraries. In order to specify meaningful hints, an application needs to know intimate details of the underlying file layout. However, the goal of a storage library is to abstract this underlying file layout and to present a higher level interface to the application. Because of this, and also considering the fact that an implementation can ignore hints, the influence of file hints on performance was not studied.
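As an illustration, the sketch below shows how such hints are passed through an MPI_Info object when opening a file directly with MPI-IO. The hint keys shown (the ROMIO collective buffering controls) are commonly recognised by ROMIO, but an implementation remains free to ignore them; the file name is only a placeholder for this example.

```cpp
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Hints are collected in an MPI_Info object and handed to MPI_File_open.
  MPI_Info info;
  MPI_Info_create(&info);
  // ROMIO hints controlling collective buffering; keys and values are
  // strings, and an implementation may silently ignore any of them.
  MPI_Info_set(info, "romio_cb_read", "enable");
  MPI_Info_set(info, "romio_cb_write", "enable");
  MPI_Info_set(info, "cb_buffer_size", "4194304");

  MPI_File fh;
  // "mesh.dat" is a placeholder file name for this sketch.
  if (MPI_File_open(MPI_COMM_WORLD, "mesh.dat",
                    MPI_MODE_RDONLY, info, &fh) == MPI_SUCCESS)
    MPI_File_close(&fh);

  MPI_Info_free(&info);
  MPI_Finalize();
  return 0;
}
```

When a storage library is used instead, the same MPI_Info object is typically handed over through the library's own property mechanism (for HDF5, the parallel file access property list), but as argued above its effect is then hard to predict.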
3.4 MPI-IO on PVFS2
Since both HDF5 and PnetCDF rely on MPI-IO for actual file access, ROMIO performance can help explain their test results; therefore, pure MPI-IO performance is discussed first. PVFS2 is accessed through the user-space PVFS2 ROMIO driver, which does not do client-side caching. As such, OS file caches do not influence the performance. Figure 4 shows the read and write performance of a number of MPI-IO access methods (test dataset: 400000 × 8 doubles). For MPI-IO, two completely different access methods were studied. For the first (the left graph), no file datatypes were used. This is referred to as “level 0” for independent accesses, and “level 1” for collective accesses[8]. The second method does use file datatypes, which make it possible to describe the full non-contiguous access pattern in one operation. This method is known as “level 2” for independent and “level 3” for collective accesses. Additionally, each method was tested using a combination of the following optimizations:

optimize: Try to group adjacent requests into larger ones. This is done by the client application, before passing the request to MPI-IO or the storage library.

typed: Use a one-dimensional array of an array type (with base type double) instead of a two-dimensional array of doubles. Since there was no relevant performance difference between typed and non-typed tests, this data was omitted from the graphs.

collective: Use MPI_File_write_all instead of MPI_File_write.

Selecting the right I/O method can make a huge difference in I/O performance. Level 0 and 1 lead to unusable performance (less than 1 MB/s!). Although level 2 performs a little better, the graph already indicates a scaling problem, even with eight nodes for four I/O servers! Only level 3 I/O leads to acceptable performance.
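To make the distinction concrete, the following sketch illustrates a level 3 access: the non-contiguous state selection is described once with a file datatype and then read with a single collective call. It is a simplified illustration under assumed names (the file "mesh.dat", the state dimension d and the trivial round-robin index list are placeholders), not the exact benchmark code.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int d = 8;  // doubles per state (assumed for this sketch)

  // Global indices of the states owned by this process; a round-robin
  // assignment stands in for the real mesh partitioning here.
  std::vector<int> displs;
  for (int s = rank; s < 1000; s += nprocs)
    displs.push_back(s);  // displacement in units of one state (d doubles)

  // Level 3: describe the whole non-contiguous pattern as a file datatype...
  MPI_Datatype state, filetype;
  MPI_Type_contiguous(d, MPI_DOUBLE, &state);
  MPI_Type_create_indexed_block(static_cast<int>(displs.size()), 1,
                                displs.data(), state, &filetype);
  MPI_Type_commit(&filetype);

  MPI_File fh;  // "mesh.dat" is assumed to exist for this sketch
  MPI_File_open(MPI_COMM_WORLD, "mesh.dat", MPI_MODE_RDONLY,
                MPI_INFO_NULL, &fh);
  MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

  // ...and access it with one collective call instead of many small reads.
  std::vector<double> buf(displs.size() * d);
  MPI_File_read_all(fh, buf.data(), static_cast<int>(displs.size()), state,
                    MPI_STATUS_IGNORE);

  MPI_File_close(&fh);
  MPI_Type_free(&filetype);
  MPI_Type_free(&state);
  MPI_Finalize();
  return 0;
}
```

Dropping the file datatype and issuing one small read per state corresponds to level 0 (or level 1 when collective calls are used); keeping the file view but replacing MPI_File_read_all by MPI_File_read corresponds to level 2.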
3.5 PnetCDF and HDF5 on PVFS2
PnetCDF currently does not offer a suitable API for unstructured dataset access. Because of this, client software is forced to repeatedly call the library to access a small number of elements, leading to many small read or write operations (level 0/1). Although ROMIO is capable of aggregating and grouping some of these requests, the data volume is still too small to really benefit from this. For the same reason, collective I/O (which has some additional overhead) performs even worse. Figure 4 shows the results. Because of the slow I/O speed, a small test dataset (50000 × 8) was chosen. This dataset was actually smaller than the buffer size. On one CPU, when optimization of the access pattern was enabled, this resulted in one large read/write request of the full dataset. Therefore, speed measurements with only one client node were omitted from the graph when optimization was enabled. Because of the huge difference between the performance of the different I/O levels, the performance of the PnetCDF library (for this access pattern) is ultimately determined by the access pattern presented by PnetCDF to MPI-IO. The graph clearly resembles the MPI-IO graph (without file datatypes).

Fig. 4. Unstructured access pattern performance on PVFS2 (transfer rate in MB/s versus number of client nodes; panels: MPI-IO without file datatypes, MPI-IO with file datatypes, PnetCDF, HDF5)

Figure 4 also demonstrates that HDF5 has serious problems dealing with unstructured access patterns. Although the API offers two ways to set up the access pattern, both have problems. The first way is to use a union of hyperslabs. The H5Sselect_hyperslab call allows extending an existing dataset selection with a specified hyperslab. By repeatedly calling this function, the full access pattern can be described. Unfortunately, every time the function is called, it loops over the old selection to make sure no duplicate selections exist. This causes the setup time of the access pattern to become unreasonably large, making it even slower than the actual data transfer. The second method uses the H5Sselect_elements call, which allows the user to specify an array of coordinates of points that have to be read. Although application-level optimization (by grouping adjacent points) is not possible with this call, the access pattern can be described in one fast function call. However, internally, HDF5 (currently) does not relay this access pattern to MPI-IO. Instead, it is broken up again into separate one-element read operations, issued as independent I/O requests. Because of this, its performance is comparable to that of pure MPI-IO using independent accesses without file datatypes. Results with only one client node were not obtained, as the latest stable HDF5 release (1.6.5) contains a bug preventing its use on only one CPU when using user-space ROMIO drivers.
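For reference, the sketch below shows how such an unstructured selection is built with the hyperslab-union approach; the dataset dimensions and the owned index list are invented for the example. This is exactly the construction whose setup cost grows with the number of hyperslabs already in the selection.

```cpp
#include <hdf5.h>
#include <cstddef>
#include <vector>

int main() {
  const hsize_t n = 400000;  // number of states (assumed for this sketch)
  const hsize_t d = 8;       // doubles per state

  // File dataspace describing the full n x d dataset.
  hsize_t dims[2] = { n, d };
  hid_t space = H5Screate_simple(2, dims, NULL);

  // Global state indices owned by this process (placeholder values).
  std::vector<hsize_t> owned = { 3, 17, 18, 19, 4211 };

  // Build the access pattern as a union of one-state hyperslabs. Every
  // additional H5Sselect_hyperslab call is checked against the existing
  // selection, which is what makes the setup time grow with its size.
  for (std::size_t i = 0; i < owned.size(); ++i) {
    hsize_t start[2] = { owned[i], 0 };
    hsize_t count[2] = { 1, d };
    H5Sselect_hyperslab(space, i == 0 ? H5S_SELECT_SET : H5S_SELECT_OR,
                        start, NULL, count, NULL);
  }

  // The selection now covers owned.size() * d elements and would be passed
  // as the file dataspace argument of H5Dread / H5Dwrite.
  H5Sclose(space);
  return 0;
}
```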
3.6 Unstructured MPI-IO Access on NFS and Lustre
The previous tests demonstrated that PnetCDF performance closely follows that of MPI-IO with the same access pattern. For HDF5, at least when using H5Sselect_elements-style selections, this is also the case. The other HDF5 method, a hyperslab union, is limited not by I/O time but by the setup time. Therefore, only MPI-IO performance will be shown for Lustre and NFS.
Figure 5 shows that NFS and PVFS2 had different design goals. NFS was designed as a general purpose network file system, optimized to handle multiple small file requests. As such, for unstructured access without file datatypes (causing small requests), on average NFS performs better than PVFS2 if the number of clients is small. For more than 4 clients, the NFS server cannot handle the load any more. Since NFS uses UDP, an unreliable transport protocol, lost packets trigger retransmission delays that cause a serious performance drop. This can clearly be seen for collective operations, which by their nature increase network contention. For 5 client nodes and more, transfer speed approached zero as some transfers took multiple hours to complete. For this reason, no performance numbers were obtained for collective modes with more than 6 clients. When using file datatypes, collective calls are able to improve performance by aggregating data. Because of the small dataset size, performance becomes less predictable with 6 or more clients due to client-side caching. If (part of) the data is in the client-side cache, in addition to avoiding the data transfer, the load on the NFS server is reduced, resulting in extra performance. Application-based merging of adjacent access requests (sketched at the end of this section) is a delicate issue on NFS. When dealing with very small requests, some benefit can be seen. However, when using level 3 I/O (collective mode and file datatypes), optimizing collective read patterns can cause a 10-fold performance drop. This is probably due to network or server congestion. The graph shows the average, minimum and maximum transfer speed obtained. For collective reading with 6 or more clients, the average is misleading; in reality, either very high or very low numbers were obtained.

Fig. 5. MPI-IO performance on NFS and Lustre (transfer rate in MB/s versus number of client nodes; panels: NFS and Lustre, each with and without file datatypes)

The Lustre file system performs very well for level 0 I/O. This is partly due to the client-side cache, write accumulation and read-ahead. By default, on the client nodes, Lustre utilized up to 1500 MB for client-side caching. Also, a maximum read-ahead window of 40 MB was automatically set. Writes are accumulated until at least 1 MB of dirty pages is available. When doing independent reads without file datatypes, these optimizations enable the best performance of all file systems tested. Independent write speed is still adequate, but performance drops when the number of clients increases. This is probably due to lock contention. However, collective operations without file datatypes need to be avoided on Lustre. From two clients on, transfer speeds become unworkable. As is the case with NFS, collective operations cause the most lock contention, and this translates into low performance. PVFS2, which does not have file locking, is not affected by this. Looking at level 2 and 3 I/O (independent and collective using file datatypes), figure 5 shows good results. In independent access modes, there is the usual performance drop going from one to two clients, due to the locking protocol. Collective access is less affected, because of its ability to avoid issuing small requests[9].
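As a rough illustration of what the "optimize" option and the application-based merging discussed above amount to, the following sketch coalesces adjacent or overlapping (offset, length) requests before they are handed to the I/O layer. It is a generic helper written for this article, not code taken from COOLFluiD or ROMIO.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Request {
  std::size_t offset;  // byte offset in the file
  std::size_t length;  // number of bytes requested
};

// Merge requests that touch or overlap each other into larger contiguous
// ones, reducing the number of operations issued to MPI-IO or a storage
// library.
std::vector<Request> mergeAdjacent(std::vector<Request> reqs) {
  if (reqs.empty()) return reqs;
  std::sort(reqs.begin(), reqs.end(),
            [](const Request& a, const Request& b) { return a.offset < b.offset; });
  std::vector<Request> merged;
  merged.push_back(reqs.front());
  for (std::size_t i = 1; i < reqs.size(); ++i) {
    Request& last = merged.back();
    if (reqs[i].offset <= last.offset + last.length) {
      // Contiguous or overlapping: extend the previous request.
      last.length = std::max(last.offset + last.length,
                             reqs[i].offset + reqs[i].length) - last.offset;
    } else {
      merged.push_back(reqs[i]);
    }
  }
  return merged;
}
```

For example, requests for bytes [0,8), [8,8) and [32,8) would be merged into two requests, [0,16) and [32,8), before being issued.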
4 Conclusion
As a first conclusion, one can state that the most important factor influencing performance is the access pattern. Using contiguous accesses results in much better performance, up to an order of magnitude. The large gap in transfer speed between NFS, a traditional shared file system, and true parallel file systems such as Lustre and PVFS2 is clearly visible. Even with only 8 clients, the NFS server load became unreasonably high, demonstrating the need for scalable I/O solutions. For non-contiguous access patterns (such as those resulting from unstructured meshes), performance using pure MPI-IO is adequate when using collective I/O and file datatypes. However, at this time, none of the tested storage libraries is ready for this kind of access. Unless these libraries can accept a non-contiguous file access pattern, and pass this information on to MPI-IO, they cannot be used. Until they do, if true non-contiguous file access is really needed, MPI-IO should be used directly.
However, if eventually all data needs to be accessed, utilizing an application-level parallel cache will outperform any non-contiguous file access method. In such a scheme, the application would first read all data in contiguous chunks, store everything in a parallel cache, and serve all future (non-contiguous) requests from this cache (a sketch of such a scheme is given below). Also, although this could equally well be done by MPI-IO (or the storage library), application-level grouping of read and write requests slightly increases performance.
PVFS2 is easy to install (no superuser access required), is well supported by research MPI implementations and performs very well compared to commercial file systems such as Lustre. PVFS2 enables end-users to easily set up and run I/O servers alongside their jobs, allowing scalable I/O on any cluster offering local disk access.
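As an illustration of the parallel cache scheme described above, the sketch below has every process read one contiguous chunk of the file collectively and expose it through an MPI window, after which arbitrary states can be fetched with one-sided MPI_Get calls. The file name, chunk layout and state size are assumptions made for this example; it is a minimal sketch, not the scheme's actual implementation.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int d = 8;                    // doubles per state (assumed)
  const MPI_Offset nstates = 400000;  // total number of states (assumed)
  const MPI_Offset chunk = (nstates + nprocs - 1) / nprocs;  // states per rank
  MPI_Offset first = rank * chunk;
  MPI_Offset mine = (first < nstates)
                        ? ((first + chunk <= nstates) ? chunk : nstates - first)
                        : 0;

  // Step 1: every rank reads one contiguous chunk with a collective call.
  std::vector<double> cache(static_cast<std::size_t>(mine) * d);
  MPI_File fh;  // "mesh.dat" is assumed to exist for this sketch
  MPI_File_open(MPI_COMM_WORLD, "mesh.dat", MPI_MODE_RDONLY,
                MPI_INFO_NULL, &fh);
  MPI_File_read_at_all(fh, first * d * static_cast<MPI_Offset>(sizeof(double)),
                       cache.data(), static_cast<int>(mine) * d, MPI_DOUBLE,
                       MPI_STATUS_IGNORE);
  MPI_File_close(&fh);

  // Step 2: expose the chunk as a window so it acts as a parallel cache.
  MPI_Win win;
  MPI_Win_create(cache.data(),
                 static_cast<MPI_Aint>(mine * d * sizeof(double)),
                 sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  // Step 3: any rank can now fetch an arbitrary state from the cache.
  std::vector<double> state(d);
  const MPI_Offset wanted = 12345;  // placeholder global state index
  int owner = static_cast<int>(wanted / chunk);
  MPI_Aint disp = static_cast<MPI_Aint>((wanted - owner * chunk) * d);
  MPI_Win_fence(0, win);
  MPI_Get(state.data(), d, MPI_DOUBLE, owner, disp, d, MPI_DOUBLE, win);
  MPI_Win_fence(0, win);

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```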
References

1. Lustre: A Scalable, High-Performance File System. White paper (November 2002), http://www.lustre.org/docs/whitepaper.pdf
2. Latham, R., Miller, N., Ross, R., Carns, P.: A Next-Generation Parallel File System for Linux Clusters. LinuxWorld, vol. 2 (January 2004)
3. HDF5: http://hdf.ncsa.uiuc.edu/HDF5/
4. Lani, A., Quintino, T., Kimpe, D., Deconinck, H., Vandewalle, S., Poedts, S.: The COOLFluiD Framework: Design Solutions for High-Performance Object Oriented Scientific Computing Software. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J.J. (eds.) ICCS 2005. LNCS, vol. 3514, pp. 281–286. Springer, Heidelberg (2005)
5. Gabriel, E., Fagg, G.E., Bosilca, G., Angskun, T., Dongarra, J., Squyres, J.M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R.H., Daniel, D.J., Graham, R.L., Woodall, T.S.: Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In: PVM/MPI 2004, pp. 97–104 (2004)
6. Thakur, R., Gropp, W., Lusk, E.: An Abstract-Device Interface for Implementing Portable Parallel-I/O Interfaces. In: Proc. of the 6th Symposium on the Frontiers of Massively Parallel Computation, pp. 180–187 (October 1996)
7. Li, J., Liao, W.-k., Choudhary, A., Ross, R., Thakur, R., Gropp, W., Latham, R., Siegel, A., Gallagher, B., Zingale, M.: Parallel netCDF: A High-Performance Scientific I/O Interface. In: Proceedings of SC2003, Phoenix, AZ (November 2003)
8. Thakur, R., Gropp, W., Lusk, E.: A Case for Using MPI's Derived Datatypes to Improve I/O Performance. In: Proc. of SC98: High Performance Networking and Computing (1998)
9. Thakur, R., Gropp, W., Lusk, E.: Data Sieving and Collective I/O in ROMIO. In: Proc. of the 7th Symposium on the Frontiers of Massively Parallel Computation, pp. 182–189 (February 1999)