A hierarchical parallel storage system based on ... - ACM Digital Library

4 downloads 0 Views 276KB Size Report
memory for large scale systems. Francisco Rodrigo Duro. University Carlos III. Avda. de la Universidad, 30. 28911 Leganes, Spain [email protected].
A hierarchical parallel storage system based on distributed memory for large scale systems Francisco Rodrigo Duro University Carlos III Avda. de la Universidad, 30 28911 Leganes, Spain

[email protected]

Javier Garcia Blas

University Carlos III Avda. de la Universidad, 30 28911 Leganes, Spain

[email protected]

ABSTRACT This paper presents the design and implementation of a storage system for high performance systems based on a multiple level I/O caching architecture. The solution relies on Memcached as a parallel storage system, preserving its powerful capacities such as transparency, quick deployment, and scalability. The designed parallel storage system targets to reduce the I/O latency in data-intensive high performance applications. The proposed solution consists of a user-level library and extended Memcached servers. The solution aims to be hierarchical by deploying Memcachedbased I/O servers across all the infrastructure data path. Our experiments demonstrate that our solution is up to 40% faster than PVFS2.

Categories and Subject Descriptors D.4.2 [Operating Systems]: Storage Management—Storage hierarchies, distributed memories

Keywords 1. INTRODUCTION Nowadays, storage systems are one of the main bottlenecks in high performance systems and this challenge is expected to continue in next generation exascale systems [2]. It is greatly accepted that large scale storage systems will be necessarily hierarchical [3]. This can be done by organizing the memory spaces in a complex hierarchy, moving data to local cache as fast as possible and throwing it to slower devices in an asynchronous way [5]. This hierarchical structure should be constructed with decoupling in mind, which consists on splitting and isolating compute nodes, storage nodes, and services (network, admin, etc.) as much as possible. Current research works depict that emerging highspeed networks outperform physical disk performance, reducing the relevance of disk locality [1].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EuroMPI ’13, September 15 - 18 2013, Madrid, Spain Copyright 2013 ACM 978-1-4503-1903-4/13/09 ...$15.00.

139

[email protected]

One of the most popular solutions for providing an efficient and scalable distributed cache infrastructure is Memcached [4]. Memcached aims to provide a caching system for web-based database services, and is used in some of the most important web applications such as YouTube, Facebook, and Twitter. The generic design of Memcached allows to store raw data inside its distributed cache as key-value entries. Furthermore, Memcached can be used to provide a distributed memory solution for different application fields. The main motivation of this work is to present a hierarchical storage solution for large scale parallel systems regarding the continuous increase of the data access latencies. Our proposed solution aims to fill the latency gap between compute nodes and final storage sub-systems. We present a Memcached based storage system, namely MemcacheFS or MFS. Our solution offers the possibility to be used to deploy a storage system on all the levels of the data path hierarchy. Applications benefit from this solution by improving data locality, reducing storage latencies, and overlapping computation and I/O operations.

2.

Memcached, parallel storage system, distributed cache

Jesus Carretero

University Carlos III Avda. de la Universidad, 30 28911 Leganes, Spain

MEMCACHED STORAGE SYSTEM

In this section we present the detailed design of MemcachedFS. Our solution relies on two main components: client library and Memcached servers (as shown in Figure 1). First, clients access data through a user-level library, denoted as MFS library in the figure. This library has been designed with two main goals: portability and flexibility. Portability has been achieved by providing a POSIX-like interface and flexibility by designing the solution on top of Memcached. The library provides a logical view of the Memcached servers as a block device. All Memcached servers conform an unique shared memory space while offering transparent data mapping. The key-value pairs symbolize blocks in the device. The key field is used as the block address and the value stores its contents. When a new block is needed, the key is generated based on both a file unique identifier and the file offset. Following accesses are done calculating again this block identifier. The benefits of this method is that it avoids the necessity of storing the pointers to the blocks as metadata. Data and metadata are treated equally as key-value items, so are fully distributed between all the Memcached nodes. The MemcachedFS library takes advantage of the Memcached’s hashing mechanisms, transparently mapping blocks with a known list of servers. This map is done at the client side, so no information exchange is needed between clients and servers.

Compute node

Compute node

Application

Application

Application

MPI-IO

MPI-IO

MPI-IO

MFS Library

MFS Library

MFS Library

libMemcached

libMemcached

libMemcached

IOR Benchmark - Write

Throughput (MB/s)

Compute node

Network

Persitence service Memcached server I/O node

FS

Internal hash table

MFS

80

80

60

60

40

40

20

20

0

0

PVFS2

MFS 2srv

2

4

8

16

32

1

Number of clients

Memcached server

2

4

8

16

32

Number of clients

Figure 2: Throughput performance of MemcachedFS against PVFS2 on IOR benchmark.

DB

MFS

Figure 1: MemcachedFS architecture. Parallel applications access data using MPI-IO. The MemcachedFS (MFS) library maps key-value based blocks to Memcached-based I/O nodes. The I/O nodes provide a distributed cache system which aggregates items in a transparent way. Finally, the persistence service forwards items to the next level of the hierarchy. Focusing on the potential use of the proposed solution for parallel applications based on MPI, we have implemented an ADIO interface for MemcachedFS. The ADIO interface is built on top of the MemcachedFS library and allows any application written for MPI-IO to use our solution. The second component is the Memcached servers. Memcached has been selected as our storage server for two main reasons. First, its popularity: Memcached is largely used in production environments, guaranteeing the reliability and optimization of the code (multi-thread, light protocol, memory use optimization, etc.) Second, the simplicity of deployment of Memcached servers. However, the Memcached server was initially designed without any kind of persistence in mind, items dropped by LRU become unrecoverable. In order to provide a persistent storage, we have modified the Memcached server regular operation. MemcachedFS relies on a persistence management service, which stores dropped key-value entries into a mounted file system (FS), a Berkeley DB (DB) or another instance of MemcahcedFS, resulting in a hierarchical architecture.

3.

100

MFS 4srv

Persitence service

FS

100

1

I/O node

DB

120

MFS 1srv

Distributed Cache (caches aggregation) Internal hash table

IOR Benchmark - Read

120

EVALUATION

We have evaluated our solution against PVFS2 using the IOR benchmark. Each node of the cluster used for executing the IOR benchmark is configured as follows: Intel Xeon CPU E5405, 4GB DDR3, Gigabit Ethernet, Ubuntu 10.10 server, and mpich2 1.4.1p1. PVFS2 2.8.4 was used with 64KB of stripping size and 4 I/O nodes. MemcachedFS was configured with a block size of 512KB and 128MB cache in each node, using 1 to 4 MemcachedFS servers. IOR benchmark was executed writing/reading 4GB of data, independently of the number of clients and always using 512KB blocks. The results shown in Figure 2 are the best values obtained for each case in 5 iterations. Our solution outperforms PVFS2 by 40% in the best case,

140

while it does not suffer from any significant performance hits. As we demonstrate in the previous figure our solution is able to scale with the number of clients. It should be noticed that the best performance cases are around the limits of the network used.

4.

CONCLUSIONS

In this work we have presented a highly portable and flexible storage infrastructure for hierarchical large computational systems. Our solution is based on Memcached, the most used distributed cache memory software architecture. The proposed solution takes in advantage the main features offered by Memcached such as scalability, flexibility, and performance. It also extends the functionality of the Memcached servers with a persistence service. Our ongoing work targets improving the persistence mechanisms included in MemcachedFS with a more asynchronous implementation and improving persistence over MemcachedFS servers, allowing the construction of complex hierarchical storage architectures.

5.

ACKNOWLEDGMENTS

This work was supported in part by Spanish Ministry of Science and Innovation under the project “Input/Output Scalable Techniques for distributed and high-performance computing environments” - TIN2010-16497.

6.

REFERENCES

[1] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Disk-locality in datacenter computing considered irrelevant. In HotOS’13, pages 12–12, Berkeley, CA, USA, 2011. USENIX Association. [2] G. Bell, J. Gray, and A. Szalay. Petascale computational systems. Computer, 39(1):110 – 112, jan. 2006. [3] J. Dongarra, P. Beckman, T. Moore, and Aerts. The international exascale software project roadmap. Int. J. High Perform. Comput. Appl., 25(1):3–60, Feb. 2011. [4] B. Fitzpatrick. Distributed caching with memcached. Linux J., 2004(124):5–, aug 2004. [5] F. Isaila, J. G. Blas, J. Carretero, R. Latham, and R. Ross. Design and evaluation of multiple-level data staging for blue gene systems. IEEE Transactions on Parallel and Distributed Systems, 22(6):946–959, 2011.

Suggest Documents