NVRAM as Main Storage of Parallel File System


MALINOWSKI Artur
Gdansk University of Technology, Poland, Faculty of Electronics, Telecommunications and Informatics
Narutowicza 11/12, 80-233 Gdansk, Poland, [email protected]

Abstract – The main bottleneck of cluster environments used to be the computational power provided by CPUs and GPUs, but recently clusters suffer more and more from insufficient performance of input and output operations. Apart from better network infrastructure and more sophisticated processing algorithms, many solutions are based on emerging memory technologies. This paper presents an evaluation of non-volatile random-access memory used as the main storage of a Parallel File System. The author discusses the feasibility of such a configuration and evaluates it with MPI I/O, OrangeFS as the file system, two popular cluster I/O benchmarks and a software memory simulation. The obtained results suggest that, with a Parallel File System highly optimized for block devices, small differences in access time and memory bandwidth do not influence system performance.

Keywords: NVRAM; Parallel File System; MPI I/O; benchmarking.

I. INTRODUCTION

The memory hierarchy of most existing computer architectures is based on two different memory types. In typical usage, fast but volatile memory – e.g. random-access memory (RAM) – is used for computations, while persistent, high-capacity devices – e.g. hard disk drives (HDD) or solid state drives (SSD) – are responsible for data storage. The drawbacks of both memory types have led to research on a new type of component: high-capacity, fast, non-volatile random-access memory (NVRAM).

Fast input and output operations are extremely important for many data-intensive high performance computing (HPC) applications. In most cases, computational clusters are equipped with a Parallel File System (PFS) responsible for efficient management of storage devices. Many PFS optimizations involve not only software modifications, but also make use of emerging memory technologies like NVRAM. Although many experiments combine NVRAM with a PFS, to the author's knowledge all of them assumed a very limited capacity of NVRAM devices. However, recent manufacturers' reports announce higher expected capacities, which brings the possibility of using NVRAM in a PFS not only as an accelerator, but also as the main storage technology. On the other hand, modern PFS are optimized for block devices like HDD and SSD, which leads to the conclusion that simply replacing them with NVRAM would not imply better performance.

This paper aims to evaluate NVRAM as a potential main storage technology for a PFS. A set of experiments shows that using NVRAM is possible without any PFS modification, but without a significant improvement in performance, as modern PFS are optimized for block storage devices. The author will show that, in order to make the best use of NVRAM's advantages, more customized I/O solutions are required.

The paper continues as follows: Section II presents related work connected with NVRAM technology and its usage in improving PFS performance. Section III provides detailed information about the experiments, including the testbed environment, a description of the simulation platform, the list of software components used, and the experiment results with conclusions. Section IV contains an overall summary and proposes future work.

II. RELATED WORK

Nowadays, with the increasing size of clusters and the huge demand for data in scientific applications, system performance is not only a result of many CPUs and GPUs, but also requires efficient I/O operations. An MPI I/O [1] implementation, together with a PFS, is a widely used solution for file operations in high-performance computing that is constantly being improved in order to provide maximum performance. Optimizing noncontiguous access in MPI I/O [2], an MPI I/O implementation for the Lustre file system [3] or the design of the PVFS parallel file system [4] are examples of contributions based on software implementations. On the other hand, most hardware-based solutions are connected with memory and storage devices.

Searching for better memory technologies is not a recent idea. In 2009, Mark H. Kryder and Chang Soo Kim prepared a comprehensive survey of popular non-volatile memory types, predicting that some of them (e.g. Phase Change RAM) could potentially replace SSDs and HDDs [5]. In 2015, Intel® and Micron® announced 3D XPoint – an NVRAM technology completely different from previous solutions [6][7]. According to the provided information, 3D XPoint would be available on the market in 2016 and should offer performance comparable to DRAM, with density similar to NAND (the technology used in SSDs) [8].


Many studies show that extending systems with fast, byte-addressable, non-volatile memory could be a promising solution in terms of performance and energy consumption. Huang et al. sped up a transaction system by using NVRAM for the logging component [9], and Ryu et al. proposed a new, efficient logging mechanism for mobile devices [10]. Another example is MN-MATE – an energy-efficient resource management architecture for cloud nodes that uses NVRAM to accelerate the average memory access speed of guest virtual machines [11].

NVRAM has also been proposed to improve PFS performance. Many systems (e.g. HeRMES [12], Conquest [13], PRIMS [14]) use a hybrid approach, where metadata is stored in non-volatile RAM, while an HDD is used as the storage device. Another common idea is to include the new memory type in caching, buffering or I/O staging [15].

III. EXPERIMENTS

To support this paper's thesis, the author performed several experiments using two popular MPI I/O benchmarks: IOR^1, developed at Lawrence Livermore National Laboratory, and MPI Tile IO, proposed by the Parallel I/O Benchmarking Consortium^2. Both applications are configurable; IOR measures read and write bandwidth, while MPI Tile IO provides information only about write speed. On the other hand, an advantage of MPI Tile IO is that it splits the file into two-dimensional tiles of a specified size; configuring the regions where tiles overlap makes it possible to test different file access patterns.
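The tiled access pattern exercised by MPI Tile IO can be expressed with standard MPI I/O file views. The sketch below is not code from the benchmark; it is a minimal illustration, assuming 16 processes arranged as a 4 x 4 grid of non-overlapping tiles, with arbitrary tile and element sizes and a placeholder file path.

/* Minimal sketch of a tiled collective write in the spirit of MPI Tile IO.
 * Not the benchmark's code; grid, tile and element sizes are arbitrary. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int tiles_per_dim = 4;    /* 4 x 4 grid -> 16 tiles/processes */
    const int tile_elems    = 64;   /* elements per tile side           */
    const int elem_size     = 128;  /* bytes per element                */

    /* The file is viewed as a 2-D array of elements; each rank owns one tile. */
    int gsizes[2] = { tiles_per_dim * tile_elems, tiles_per_dim * tile_elems };
    int lsizes[2] = { tile_elems, tile_elems };
    int starts[2] = { (rank / tiles_per_dim) * tile_elems,
                      (rank % tiles_per_dim) * tile_elems };

    MPI_Datatype elem, tile;
    MPI_Type_contiguous(elem_size, MPI_BYTE, &elem);
    MPI_Type_commit(&elem);
    MPI_Type_create_subarray(2, gsizes, lsizes, starts, MPI_ORDER_C, elem, &tile);
    MPI_Type_commit(&tile);

    char *buf = calloc((size_t)tile_elems * tile_elems, elem_size);

    MPI_File fh;
    /* "/mnt/pfs/tile_test" is a placeholder for a file on the PFS mount. */
    MPI_File_open(MPI_COMM_WORLD, "/mnt/pfs/tile_test",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, elem, tile, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, tile_elems * tile_elems, elem, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Type_free(&tile);
    MPI_Type_free(&elem);
    MPI_Finalize();
    return 0;
}

Adjusting the starts[] offsets so that neighbouring tiles share rows or columns would reproduce the overlap setting used in the experiments below.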

A. Testbed environment

All tests were performed on a five-node cluster connected with InfiniBand; the configuration of each node is described in Table 1. One of the nodes hosted the PFS server, while the four other nodes were responsible for computations.

TABLE 1. Single cluster node configuration.
  Element        Specification
  CPU            2 x Intel® Xeon® Processor 2.80 GHz (two cores each)
  RAM            4 GiB
  Network card   Mellanox MT25208 InfiniHost III Ex

OrangeFS^3 2.8.7 was chosen as the PFS solution due to its simplicity of configuration and relatively good performance, while MVAPICH2^4 2.1 with ROMIO served as the MPI I/O implementation. Both software tools support InfiniBand connections.
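Since the benchmarks reach OrangeFS through ROMIO, the behaviour of collective operations can be influenced at file-open time via MPI_Info hints. The fragment below is only a sketch under assumptions: the mount point is a placeholder, and the hint names are taken from ROMIO's generally documented hints rather than from a configuration reported in this paper.

/* Sketch: opening a file on the OrangeFS volume through MPI I/O (ROMIO)
 * with optional collective-buffering hints.  Path and hint values are
 * illustrative only. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* Commonly documented ROMIO hints (assumed supported by this
     * MVAPICH2/ROMIO build). */
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_buffer_size", "33554432");  /* 32 MiB */

    MPI_File fh;
    /* "/mnt/orangefs/testfile" stands for a file on the PFS client mount. */
    MPI_File_open(MPI_COMM_WORLD, "/mnt/orangefs/testfile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

    /* ... collective reads and writes issued by IOR or MPI Tile IO ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}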

B. Memory simulation

The RAM storage device was simulated using the UNIX tmpfs mechanism. The size of the test files was limited to 1 GiB to prevent the performance drop that could be caused by swapping parts of the tmpfs storage out to the HDD. Memory latency was introduced by adding constant software delays of up to 3 milliseconds for each storage operation. Memory bandwidth was also controlled with software delays, but in this case the delay was directly proportional to the size of the accessed data. All software delays were located in OrangeFS on the server side. Data persistence was not an issue, because the scope of the experiments did not require it.
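The server-side modification itself is not listed in the paper; the fragment below is only a minimal sketch of the delay model described above, assuming a hypothetical hook called once per storage operation (function and constant names are invented for illustration).

/* Sketch of the NVRAM simulation delays: a constant latency per storage
 * operation plus a per-byte delay that throttles bandwidth.  Illustrative
 * only; not the actual OrangeFS server patch. */
#include <stddef.h>
#include <time.h>

static const long LATENCY_NS  = 1000000L; /* e.g. 1 ms per storage operation */
static const long NS_PER_BYTE = 3L;       /* ~3 ns/byte emulates ~330 MiB/s  */

static void sleep_ns(long ns)
{
    /* nanosleep is coarse for nanosecond-scale values; because the delay is
     * applied once per whole operation, the aggregate is large enough. */
    struct timespec ts = { ns / 1000000000L, ns % 1000000000L };
    nanosleep(&ts, NULL);
}

/* Hypothetical hook invoked on the server for each storage access of `len` bytes. */
void simulate_nvram_delay(size_t len)
{
    sleep_ns(LATENCY_NS + NS_PER_BYTE * (long)len);
}

With a 32 MiB chunk, for example, the per-byte term alone adds about 0.1 s per operation, which is how the simulation caps the effective storage bandwidth near 330 MiB/s.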

C. Results – various memory latency

The first experiment investigates the PFS bandwidth obtained with different memory latencies. Table 2 describes the experiment configuration in detail.

TABLE 2. Benchmark configuration for various memory latencies.
  Common
    Number of processes          16 (4 per CPU)
    Iterations                   10
  IOR benchmark
    Size of single data chunk    32 MiB
    File size                    1 GiB
  MPI Tile IO benchmark
    Number of tiles              16
    Number of elements in tile   4M
    Element size                 128 B
    Overlapping                  20%

According to the specifications of modern SSDs, random access latency can be kept below 0.1 milliseconds [16]. Even though NVRAM is expected to be faster than an SSD, the test parameters were extended up to 3 milliseconds to better explore the influence of latency on bandwidth. Fig. 1 and Fig. 2 show that introducing additional latency at the level of milliseconds does not reduce the final system bandwidth by more than 5%. In this case, for both benchmarks, small changes in memory latency do not influence system performance significantly.
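A rough model makes this result plausible (the symbols below are mine, not the paper's): if each storage operation moves a chunk of size $c$, the undelayed path sustains bandwidth $B$, and the injected per-operation delay is $d$, the relative bandwidth loss is

\[
\frac{\Delta B}{B} = \frac{d}{c/B + d},
\]

which stays below 5% as long as the baseline time to move one chunk, $c/B$, exceeds roughly $19d$ (about 60 ms for $d = 3$ ms and $c = 32$ MiB).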


1 https://github.com/LLNL/ior
2 http://www.mcs.anl.gov/research/projects/piobenchmark/
3 http://www.orangefs.org/
4 http://mvapich.cse.ohio-state.edu/

Fig. 1. IOR benchmark results for various memory latency.


Fig. 2. MPI Tile IO benchmark results for various memory latency.

D. Results – various memory bandwidth

In the second experiment, memory bandwidth was simulated. As stated previously, NVRAM should have better parameters than an SSD, whose bandwidth is at the level of at least 330 MiB/s [16]. A bandwidth of 330 MiB/s can be interpreted as a delay of less than 3 nanoseconds per accessed byte (the conversion is shown below Table 3). The detailed benchmark configuration is presented in Table 3. In this experiment, the additional latency per byte was the same for both read and write access operations.

TABLE 3. Benchmark configuration for different bandwidth simulation parameters.
  Common
    Number of processes          16 (4 per CPU)
    Iterations                   10
  IOR benchmark
    Size of single data chunk    32 MiB
    File size                    1 GiB
  MPI Tile IO benchmark
    Number of tiles              16
    Number of elements in tile   4M
    Element size                 128 B
    Overlapping                  20%
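The per-byte figure follows directly from the quoted SSD bandwidth; the conversion (my arithmetic, matching the "less than 3 nanoseconds" statement above) is

\[
\frac{1}{330\,\mathrm{MiB/s}} = \frac{1}{330 \cdot 2^{20}\,\mathrm{B/s}} \approx 2.9\,\mathrm{ns\ per\ byte}.
\]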

Fig. 3 and Fig. 4 show the PFS bandwidth for various memory bandwidth simulation parameters. In this test, the additional latency per byte has no measurable impact on file system performance; the fluctuations in PFS bandwidth are caused by system instability.

Fig. 3. IOR benchmark results for bandwidth simulation.

Fig. 4. MPI Tile IO benchmark results for bandwidth simulation.

E. Results – various data chunk size

In the first two experiments, the size of a single data chunk transmitted between client and server was fixed. In the third test, the chunk size is varied while the simulation parameters are fixed at a level below that of a modern SSD, as listed in Table 4.

TABLE 4. Benchmark configuration for various sizes of a single data chunk.
  Common
    Number of processes                                        16 (4 per CPU)
    Iterations                                                 5
    Additional latency per storage operation (NVRAM simulation)  1 ms
    Additional latency per accessed byte (NVRAM simulation)      3 ns
  IOR benchmark
    File size                                                  512 MiB
  MPI Tile IO benchmark
    Number of tiles                                            16
    Number of elements in tile                                 adjusted so that a single tile contains 512 MiB of data
    Overlapping                                                20%

The results presented in Fig. 5 and Fig. 6 show the system bandwidth of the NVRAM simulation both at the level of SSD parameters (as specified in Table 4) and at the level of RAM (no additional latency). The charts suggest that differences in storage device performance at the level of milliseconds do not change the PFS bandwidth by more than 5%. Moreover, for the chosen experiment parameters, the size of the data chunk does not differentiate system performance between PFS storage on RAM and PFS storage with the NVRAM simulation tuned to a modern SSD specification. The suspected cause of the bandwidth limitation for bigger data chunks is a fully saturated network connection. The disparity between the benchmarks is a result of their different bandwidth calculation methods.

Fig. 5. IOR benchmark results for different size of accessed data chunk.

Fig. 6. MPI Tile IO benchmark results for different size of element of a single tile.

IV. CONCLUSIONS AND FUTURE WORK


The expected capacity of the newest NVRAM technology should allow NVRAM devices to be used as the main storage of a PFS. However, the experiments described in this paper, performed with MPI I/O and a particular PFS, show that beyond a certain level, better storage device performance (understood as memory latency and bandwidth) does not translate into a significant PFS performance improvement (understood as PFS server bandwidth). This leads to the conclusion that, in order to benefit from all of the NVRAM properties, additional solutions – on both the server and the client side – are required.

In the near future, the author plans to investigate possibilities to improve the performance of HPC I/O operations using updated NVRAM parameters as new specifications are published. As soon as the first devices based on the new NVRAM technology are available, the research will include experiments on actual hardware instead of simulation.

REFERENCES

[1] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard, Version 3.1", June 2015

[2] R. Thakur, W. Gropp, E. Lusk, "Optimizing noncontiguous accesses in MPI-IO", Parallel Computing, vol. 28, pp. 83-105, January 2002
[3] P.M. Dickens, J. Logan, "A high performance implementation of MPI-IO for a Lustre file system environment", Concurrency and Computation: Practice & Experience, vol. 22, pp. 1433-1449, 2010
[4] P.H. Carns, W.B. Ligon, R.B. Ross, R. Thakur, "PVFS: A parallel file system for Linux clusters", 4th Annual Linux Showcase and Conference, pp. 317-327, October 2000
[5] M.H. Kryder, C.S. Kim, "After Hard Drives – What Comes Next?", IEEE Transactions on Magnetics, vol. 45, no. 10, pp. 3406-3413, October 2009
[6] Micron Technology Inc., "3D XPoint Technology. Breakthrough Nonvolatile Memory Technology", https://www.micron.com/about/emerging-technologies/3d-xpoint-technology, July 2015
[7] Intel Corporation, "Intel and Micron Produce Breakthrough Memory Technology", http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology, July 2015
[8] Intel Corporation, "Breakthroughs in Memory Technology. 2015: 3D XPoint Technology", http://www.intelsalestraining.com/memorytimeline/, January 2016
[9] J. Huang, K. Schwan, M.K. Qureshi, "NVRAM-aware logging in transaction systems", Proceedings of the VLDB Endowment, vol. 8, pp. 389-400, December 2014
[10] S. Ryu, K. Lee, H. Han, "In-memory Write-ahead Logging for Mobile Smart Devices with NVRAM", IEEE Transactions on Consumer Electronics, vol. 61, pp. 39-46, February 2015
[11] K.H. Park, W. Hwang, H. Seok, C. Kim, D.J. Shin, D.J. Kim, M.K. Maeng, S.M. Kim, "MN-MATE: Elastic Resource Management of Manycores and a Hybrid Memory Hierarchy for a Cloud Node", ACM Journal on Emerging Technologies in Computing Systems, vol. 12, Article No. 5, July 2015
[12] E.L. Miller, S.A. Brandt, D.D.E. Long, "HeRMES: High-Performance Reliable MRAM-Enabled Storage", Proceedings of the Eighth Workshop on Hot Topics in Operating Systems, pp. 95-99, May 2001
[13] A.I.A. Wang, P. Reiher, G.J. Popek, G.H. Kuenning, "The Conquest file system: Better performance through a disk/persistent-RAM hybrid design", ACM Transactions on Storage (TOS), vol. 2, pp. 309-348, August 2006
[14] K.M. Greenan, E.L. Miller, "PRIMS: making NVRAM suitable for extremely reliable storage", HotDep'07: Proceedings of the 3rd Workshop on Hot Topics in System Dependability, Article No. 10, 2007
[15] S. Kannan, A. Gavrilovska, K. Schwan, D. Milojicic, V. Talwar, "Using Active NVRAM for I/O Staging", Proceedings of the 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC '11), pp. 15-22, 2011
[16] Samsung Group, "Why SSDs Are Awesome", http://www.samsung.com/global/business/semiconductor/minisite/SSD/global/html/whitepaper/whitepaper01.html
