Auto-Tuning of Parallel I/O Parameters for HDF5 Applications
Babak Behzad, University of Illinois at Urbana-Champaign
Joseph Huchette, Rice University
Huong Vu Thanh Luu, University of Illinois at Urbana-Champaign
Ruth Aydt, The HDF Group
Quincey Koziol, The HDF Group
Prabhat, Lawrence Berkeley National Laboratory
Surendra Byna, Lawrence Berkeley National Laboratory
Mohamad Chaarawi, The HDF Group
Yushu Yao, Lawrence Berkeley National Laboratory
ABSTRACT
Parallel I/O is an unavoidable part of modern high-performance computing (HPC), but its system-wide dependencies mean it has eluded optimization across platforms and applications. This can introduce bottlenecks in otherwise computationally efficient code, especially as scientific computing becomes increasingly data-driven. Various studies [4] have shown that dramatic improvements are possible when the parameters are set appropriately. However, because the HPC I/O stack has multiple layers, each with its own optimization parameters, and a test run takes nontrivial execution time, finding the optimal parameter values is a very complex problem. Additionally, optimal sets do not necessarily translate between use cases, since tuning I/O performance can be highly dependent on the individual application, the problem size, and the compute platform being used.

Tunable parameters are exposed primarily at three levels in the I/O stack: the system, middleware, and high-level data-organization layers. HPC systems need a parallel file system, such as Lustre, to intelligently store data in a parallelized fashion. Middleware communication layers, such as MPI-IO, support this kind of parallel I/O and offer a variety of optimizations, such as collective buffering. Scientists and application developers often use HDF5, a high-level cross-platform I/O library that offers a hierarchical object-database representation of scientific data.

One solution to this problem is using empirical optimization techniques, also called auto-tuning. Auto-tuning has been investigated as a solution to problems of this type because it is an autonomous, portable, and scalable approach [3]. In order to assign best-possible parameter sets, the auto-tuner must have a set of high-performing configurations for indicative test cases. To traverse the intractably large parameter search space, we chose these sets via a genetic-algorithm heuristic, which we found to produce well-performing configurations after a suitably small number of test runs. Since different applications inevitably invoke different I/O patterns, in this project we propose benchmark-guided auto-tuning covering all three layers of the I/O stack. To this end, we have worked towards a benchmark framework that identifies well-performing parameter sets for a given system and problem size.

I/O parameters are specified in an XML configuration file that is read by the H5Tuner library. Since the parameters are not embedded in the application, it is easy to try different settings without changing or recompiling the application. Currently, H5Tuner is implemented as a dynamic library, which is preloaded before the HDF5 and MPI libraries. It intercepts application HDF5 function calls, adjusts parameters based on the configuration file contents, and then calls the stock HDF5 functions.

The framework also includes H5Evolve, built on Pyevolve [2], which uses a genetic algorithm to find well-performing parameter sets. It has a discrete set of values for each of the tunable parameters and uses crossover and mutation functions to intelligently search for a well-performing set.

As the runtime of the application may not be the only output of interest, we have also developed H5PerfCapture, an extension to Darshan [1]. H5PerfCapture uses the same dynamic-library approach as H5Tuner, and captures performance characteristics of the HDF5 and MPI-IO function calls: the time taken to read/write data and metadata, the number of bytes read/written from/to disk by the application, and so on. These values are tabulated, compressed, and written to log files as the application terminates. This information offers insights into the specifics of a particular I/O stack, and will be used to further understand and address I/O bottlenecks.

To date, we have auto-tuned three application-based I/O benchmarks (VPIC-IO, GCRM-IO, and VORPAL-IO) with this framework, running on two HPC systems (Hopper at LBNL and Ranger at TACC). Speedups of 3.3X to 16.7X were achieved with the auto-tuned I/O parameters in comparison to the default values. Our work shows that auto-tuning parallel I/O parameters in HDF5 applications can improve I/O performance without requiring hands-on optimization or code changes. We plan to use our extensible framework to further explore I/O performance issues and tuning opportunities, and have begun work on H5Recorder/Augmenter to automatically construct I/O application kernels from full HDF5 applications.

Copyright is held by the author/owner(s). SC'12, November 17–21, 2012, Salt Lake City, UT, USA. ACM X-XXXXX-XX-X/XX/XX.
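To make the configuration-file idea concrete, a parameter file covering all three layers of the stack might look like the sketch below. The element names and values here are illustrative only and do not reproduce the actual H5Tuner schema; the underlying settings (Lustre striping, MPI-IO collective-buffering hints, HDF5 alignment and sieve buffer) are the kinds of parameters the text describes.

```xml
<!-- Hypothetical H5Tuner-style configuration; element names are illustrative -->
<Parameters>
  <!-- Parallel file system (Lustre) layer -->
  <striping_factor>64</striping_factor>      <!-- number of stripes (OSTs) -->
  <striping_unit>1048576</striping_unit>     <!-- stripe size in bytes -->
  <!-- Middleware (MPI-IO) layer -->
  <cb_nodes>16</cb_nodes>                    <!-- collective-buffering aggregators -->
  <cb_buffer_size>16777216</cb_buffer_size>  <!-- collective buffer size in bytes -->
  <!-- High-level (HDF5) layer -->
  <alignment>0,1048576</alignment>           <!-- threshold, alignment in bytes -->
  <sieve_buf_size>4194304</sieve_buf_size>   <!-- data sieving buffer in bytes -->
</Parameters>
```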
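The intercept-and-delegate pattern behind H5Tuner (read the configuration, override file-access parameters, then call the stock routine) can be sketched generically in Python; the function and parameter names below are stand-ins for illustration, not the real HDF5 C API or the H5Tuner implementation, which works by dynamic-library preloading.

```python
import functools

# Stand-in for a "stock" library open routine (hypothetical, not the HDF5 API).
def stock_file_open(name, alignment=0, sieve_buf_size=65536):
    return {"name": name, "alignment": alignment, "sieve_buf_size": sieve_buf_size}

# Tuned settings that would come from the XML configuration file.
CONFIG = {"alignment": 1048576, "sieve_buf_size": 4194304}

def tuned(stock_call):
    """Wrap a stock call so configured parameters override the defaults,
    without the application being changed or recompiled."""
    @functools.wraps(stock_call)
    def wrapper(name, **params):
        params.update(CONFIG)              # inject tuned parameters
        return stock_call(name, **params)  # delegate to the stock routine
    return wrapper

# Interception point: the application still calls the same name.
stock_file_open = tuned(stock_file_open)

handle = stock_file_open("output.h5")
print(handle["alignment"])  # prints 1048576, the tuned value, not the default
```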
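The genetic-algorithm search that H5Evolve performs can be illustrated with a minimal sketch in plain Python (rather than Pyevolve), using a discrete value set per parameter and crossover/mutation over a small population. The parameter names and the synthetic fitness function stand in for a real timed benchmark run and are assumptions for illustration only.

```python
import random

random.seed(0)  # reproducible sketch

# Discrete candidate values for each tunable parameter (illustrative choices).
SPACE = {
    "striping_factor": [4, 16, 64, 128],
    "stripe_size_mb": [1, 4, 16, 64],
    "cb_nodes": [4, 16, 64, 256],
}
KEYS = list(SPACE)

def fitness(ind):
    # Stand-in for running the I/O benchmark and timing it; in the real
    # framework this is a test run of the application kernel.
    return ind["striping_factor"] * 0.5 + ind["stripe_size_mb"] + ind["cb_nodes"] * 0.1

def random_individual():
    return {k: random.choice(v) for k, v in SPACE.items()}

def crossover(a, b):
    # Uniform crossover: each parameter inherited from one parent at random.
    return {k: random.choice((a[k], b[k])) for k in KEYS}

def mutate(ind, rate=0.2):
    # Occasionally resample a parameter from its discrete value set.
    for k in KEYS:
        if random.random() < rate:
            ind[k] = random.choice(SPACE[k])
    return ind

pop = [random_individual() for _ in range(8)]
for _gen in range(10):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:4]  # truncation selection: keep the best half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(len(pop) - len(parents))]
    pop = parents + children

best = max(pop, key=fitness)
print(best)
```

Because each fitness evaluation corresponds to an actual benchmark run, keeping the population and generation counts small is what makes this search practical on real systems.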
2. ACKNOWLEDGMENTS
This work was supported by the Director, Office of Science, Office of Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Some architectural concepts were drawn from I/O benchmarking work with LLNL under Contract B592421. H. Luu is partially supported by NSF grant 0938064. This research used resources of the National Energy Research Scientific Computing Center and of the Texas Advanced Computing Center.
3. REFERENCES
[1] P. Carns, K. Harms, W. Allcock, C. Bacon, R. Latham, S. Lang, and R. Ross. Understanding and improving computational science storage access through continuous characterization. In Proceedings of the 27th IEEE Conference on Mass Storage Systems and Technologies, 2011.
[2] C. S. Perone. Pyevolve: a Python open-source framework for genetic algorithms. SIGEVOlution, 4(1):12–20, 2009.
[3] A. Tiwari, C. Chen, J. Chame, M. Hall, and J. K. Hollingsworth. A scalable auto-tuning framework for compiler optimization. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS '09, pages 1–12, Washington, DC, USA, 2009. IEEE Computer Society.
[4] A. Uselton, M. Howison, N. Wright, D. Skinner, N. Keen, J. Shalf, K. Karavanic, and L. Oliker. Parallel I/O performance: From events to ensembles. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–11, April 2010.