2015 IEEE 21st International Conference on Parallel and Distributed Systems

Accelerating Cloud Storage System with Byte-addressable Non-Volatile Memory

Qingsong Wei, Mingdi Xue, Jun Yang, Chundong Wang, Chen Cheng
Data Storage Institute, Singapore
{WEI_Qingsong, XUE_Mingdi, yangju, wangc, CHEN_Cheng}@dsi.a-star.edu.sg

Abstract—As the building block of cloud storage, a distributed file system uses underlying local file systems to manage objects. However, the underlying file system, which is limited by metadata and journaling I/O, significantly affects the performance of the distributed file system. This paper presents an NVM-based file system (referred to as NV-Booster) to accelerate object access on storage nodes. NV-Booster leverages the byte-addressability and persistency of non-volatile memory (NVM) to speed up metadata accesses and file system journaling. With NV-Booster, metadata is kept in NVM and accessed in a byte-addressable manner through the memory bus, while objects are stored on hard disk and accessed through the I/O bus. In addition, the proposed NV-Booster enables fast object search and mapping between object ID and on-disk location with efficient in-memory namespace management. NV-Booster is implemented in kernel space on NVDIMM and has been extensively evaluated under various workloads. Our experiments show that NV-Booster improves Ceph performance by up to 10X compared to Ceph running on existing local file systems.

Keywords—Cloud Storage; Non-volatile Memory; Local File System

I. INTRODUCTION

Cloud storage is emerging as a powerful paradigm for sharing information across the network, satisfying people's demand for mobile data anywhere and anytime. Distributed file systems such as Ceph [16], Lustre [15] and the Hadoop Distributed File System (HDFS) [13] are being developed as the key component of cloud storage [14]. Rather than relying on a few large central storage arrays, these distributed file systems aggregate the storage resources of a large number of commodity storage nodes into a single file system image and provide low-cost, large-capacity and high-performance storage service.

A distributed file system stripes a file into multiple objects and distributes them across storage nodes to enable parallel data transfer. In practice, general-purpose file systems are used to manage objects on each storage node. The distributed file system heavily relies upon the stability and performance of the underlying file systems [15,16]. Because large-scale distributed file systems may employ thousands of storage nodes, even a small inefficiency in the underlying file system can result in a significant loss of performance in the overall system.

General-purpose file systems have several disadvantages that limit the effectiveness of distributed storage systems. First, while they have been very successful at exploiting disk bandwidth for large I/O requests, they have failed to do so for metadata I/O requests, which are small and frequent. Research work [23] reported that more than 60% of disk I/Os are metadata operations. In current file systems, metadata is organized into blocks on disk, so a partial metadata access results in a whole-block read or write, which significantly amplifies disk I/O traffic. Second, journaling file systems have been widely used to protect metadata from corruption. A journaling file system commits dirty metadata to the journal first and then periodically updates it in its original location through checkpointing. This technique enables fast file system recovery from a crash at the cost of double writes, which further aggravates metadata I/O amplification. Last, the multi-level indirections among object ID, directory, inode, and on-disk object involve several small disk accesses, which result in significant overhead for object retrieval. Quickly locating and fetching objects from disk by object ID is desirable for a distributed storage system.

The past decade has witnessed significant advancement in CPU computation power and DRAM access speed. In contrast, the access latency of the hard disk has stayed relatively stagnant. This issue is partially alleviated by the development of Flash-based solid state disks (SSDs). However, Flash memory has a few inherent limitations, such as slow random writes and limited lifetime [2]. Recently, next-generation NVM technologies such as phase-change memory (PCM) and spin-transfer torque memory (STT-MRAM) [5] have come under active development. They provide DRAM-like performance and disk-like persistency [22]. Since the current capacity of NVM is small, block storage such as disk or SSD is still needed to store large amounts of data, while NVM can be used to optimize metadata, which is small and frequently accessed.

This paper presents an NVM-based file system (referred to as NV-Booster) to accelerate object access on storage nodes. With NV-Booster, metadata is kept in NVM and accessed in a byte-addressable manner through the memory bus, while objects are stored on hard disk and accessed through the I/O bus. The proposed NV-Booster enables fast object search and mapping between object ID and on-disk location with efficient in-memory namespace management. NV-Booster is implemented in kernel space on NVDIMM [1]. Testing results show that NV-Booster is very efficient in speeding up the performance of distributed file systems.

In summary, this paper makes the following contributions:

1) In-memory metadata and direct mapping are proposed to speed up metadata access and the mapping between object ID and object on-disk location. A customized disk layout and inode are designed.

2) A dedicated memory management scheme and layout are provided to prevent metadata in NVM from being overwritten and to recover from failure.

3) A consistency mechanism for in-memory metadata and the file system is designed.

4) NV-Booster is implemented in kernel space on NVDIMM. Our experiments show that NV-Booster improves Ceph performance by up to 10X compared to Ceph with existing file systems.

The rest of this paper is organized as follows. Section II provides relevant background. The detailed design of NV-Booster is presented in Section III. Section IV gives evaluation results. Related work in the literature is presented in Section V, and Section VI concludes the paper.

II. BACKGROUND AND MOTIVATION

A. Non-volatile Memory

NVM has attracted more and more attention in both academia and industry [3,4,7]. As shown in Table I, Flash memory is still unsuitable to replace DRAM due to its much higher latency and limited endurance [2]. Recent work has focused on next-generation NVM, such as PCM [6] and STT-MRAM [5], which (i) is byte-addressable, (ii) has DRAM-like performance, (iii) is persistent, and (iv) provides better endurance than Flash. PCM is several times slower than DRAM and its write endurance is limited to as few as 10^8 cycles. However, PCM has higher density than DRAM and shows promising potential for increasing the capacity of main memory. Although wear-leveling is necessary for PCM, it can be done by the memory controller [6]. STT-MRAM has the advantages of lower power consumption than DRAM, practically unlimited write cycles compared with PCM, and lower read/write latency than PCM. As a promising NVM technology, a commercial 64Mb STT-MRAM chip with a DDR3 interface is already available [27]. In this paper, NVM refers to the next generation of non-volatile memory, excluding flash memory.

TABLE I: A SUMMARY OF NVM CHARACTERISTICS

  Technology    Read latency (ns)   Write latency (ns)   Endurance (writes/cell)
  Flash (SLC)   25,000              200,000-500,000      10^5
  PCM           48                  150                  10^8
  STT-MRAM      32                  40                   10^15
  RRAM          10-50               10-50                10^8
  DRAM          15                  15                   10^18

Two approaches have been proposed to incorporate NVM into a computer. The first is to use it as fast block storage attached through PCIe [2]. This approach requires no modification to system or application software, but it is inefficient since (i) NVM is treated as a block device, so the byte-addressability of NVM is wasted, and (ii) the latency of the I/O controller and storage stack dominates the data access time, leaving the fast access time of NVM in vain. The second approach is to put NVM on the memory bus and use it as a memory device. This approach can fully utilize the merits of NVM, i.e., ultra-low latency and byte-addressability. However, current software is designed for the DRAM-disk hierarchy and is not NVM-aware. For example, existing memory management does not support instant reboot and failure recovery from NVM. To make full use of NVM, the operating system needs to be redesigned. This approach has been adopted in state-of-the-art works [8-11,22]. This paper also adopts memory-based NVM to accelerate large-scale storage systems.

B. Distributed File System

With the rapid growth of Internet services, many large server infrastructures have been set up as data centres and cloud storage platforms, such as those at Google, Amazon [14] and Yahoo!. Compared with traditional large-scale storage systems built for HPC (High Performance Computing), they focus on providing storage services over the Internet and scalable data management for data-intensive computing. The key components of cloud storage infrastructures are distributed file systems. Three famous examples are GoogleFS [12], HDFS [13] and Ceph [16].

A distributed file system consists of a metadata server (MDS), a master server that manages the global file system namespace and regulates access to files by clients, and a number of storage nodes. Internally, a file is striped into one or more objects, and these objects are stored in a set of storage nodes (see Fig. 1). The object size is 4MB by default for Ceph and 64MB for HDFS, and is configurable per file. The MDS executes global file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of objects to storage nodes. The storage nodes are responsible for serving object read/write requests from clients. They also perform object creation, deletion, and replication upon instruction from the MDS.

Fig. 1. Distributed file system (clients reach the metadata server and the storage nodes over the network; each storage node manages its local disks and the objects stored on them).

In practice, distributed file systems employ general-purpose file systems (e.g., Ext4 or XFS) to manage objects. Each object on a storage node is represented by a file and an attribute in the local file system. Usually, the local file system uses the object ID as the file name. The file contains the data itself, and the attribute holds the object's information, including checksums for the object data and the object's generation stamp. The attribute is stored either as a separate file or as an extended attribute (XATTR) alongside the object file.
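As an illustration of this convention, the following user-space sketch stores an object as a local file named by its object ID and attaches its attribute as an extended attribute. The object ID format, attribute layout, and file path are hypothetical and only mirror the description above; they are not the exact format used by any particular distributed file system.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Hypothetical attribute kept with each object (checksum, generation stamp). */
struct obj_attr {
    uint32_t checksum;
    uint64_t generation;
};

/* Store one object as a file named by its object ID, with the attribute
 * saved as an extended attribute on the same file. */
static int store_object(uint64_t object_id, const void *data, size_t len,
                        const struct obj_attr *attr)
{
    char path[64];
    snprintf(path, sizeof(path), "/data/objects/%016llx",
             (unsigned long long)object_id);     /* object ID as file name */

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len) {  /* the object data itself */
        close(fd);
        return -1;
    }
    /* Attribute stored as an XATTR alongside the object file. */
    int rc = fsetxattr(fd, "user.obj_attr", attr, sizeof(*attr), 0);
    close(fd);
    return rc;
}
```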

Fig. 2. The proposed NV-Booster is optimized with in-memory metadata and direct mapping for the distributed file system: (a) NV-Booster as the local file system for a distributed system; (b) architecture of NV-Booster.

C. Motivation

Distributed file systems rely heavily upon the stability and performance of the underlying file system [16]. However, the following issues of the underlying file system limit the effectiveness of distributed file systems.

Namespace Management: Given an object identifier, the storage node needs to retrieve the corresponding object from disk. In current file systems, there are complex multi-level indirections among object ID, directory, inode, and on-disk location (see Figure 2). The first indirection maps the object ID to an inode number, the second maps the inode number to the inode, and the third maps the inode to the object's on-disk location. These indirections involve several small disk I/Os, which result in significant overhead for object retrieval.

Metadata I/O: The performance of underlying file systems is dominated by small and frequent metadata I/O. In current general-purpose file systems, metadata is organized into blocks on disk, and one block contains multiple metadata entries. A partial metadata access results in a whole-block read or write, which significantly amplifies disk I/O traffic. The volume of metadata is relatively small, but metadata is accessed much more frequently than data. For example, most file system calls, such as stat, open and close, need to access metadata first. We have analyzed real enterprise workloads. Table II shows the fraction of disk I/O generated by accessing data and metadata for enterprise workloads on the Linux Ext4 file system. The percentage of metadata I/O in the Varmail and Webserver workloads reaches 85% and 71%, respectively.

TABLE II. METADATA I/O IN ENTERPRISE WORKLOADS

  Workloads   Metadata I/O (%)   Data I/O (%)
  Varmail     85                 15
  Webserver   71                 29

File system journaling I/O: Journaling file systems are commonly used to recover from crashes, power outages, etc. These file systems journal all of the changes they will make before performing the writes. This technique enables fast file system recovery from a crash by simply scanning the journal and redoing any committed operations that were not yet completed. However, journaling's double writes further aggravate I/O amplification and degrade file system performance.

With emerging byte-addressable NVM, we are motivated to tackle the performance issue by redesigning the underlying file system with in-memory metadata and object mapping (see Fig. 2(a)).

III. DESIGN OF NV-BOOSTER

Figure 2(b) shows the architecture of NV-Booster. NV-Booster decouples the data and metadata I/O paths, putting data on disk and metadata in NVM at runtime. Thus, data is accessed in blocks over the I/O bus, and metadata is accessed in a byte-addressable manner over the memory bus. Metadata I/O is eliminated because metadata in NVM is no longer flushed back to disk. The following mechanisms are designed and implemented:

1) In-memory metadata and object mapping are proposed to speed up the mapping from object ID to object on-disk location.


2) A dedicated memory management scheme and layout are provided to prevent metadata in NVM from being overwritten and to recover from failure.

3) A consistency mechanism for in-memory metadata and the file system is designed.

A. In-memory Metadata and Object Mapping

The NV-Booster keeps metadata in NVM at all times and data on disk. As in a traditional file system, an inode is used to describe the attributes of an object stored on disk. In a distributed file system, given an object identifier, the storage node needs to retrieve the corresponding object from disk, which involves object lookup and locating. Because hundreds of thousands of objects may coexist on a single storage node, a traditional file system incurs significant indirection and disk I/O overhead from retrieving inode blocks and directory blocks.

To speed up object search and locating, metadata in NVM is organized as key-value pairs, as shown in Figure 3. In particular, the mapping between object ID and inode is organized as a namespace B+tree for efficient object lookup. In the B+tree, each valid object has an entry of the form <key, value>, in which the key is the object ID and the value is the memory address of its corresponding inode. With the in-memory inode, the address of the object on disk can be located immediately. This organization removes the several levels of indirection between an object ID and the corresponding data that occur in general-purpose file systems when mapping a file name to an on-disk inode, each of which can result in a separate disk access. Consequently, mapping from object ID to on-disk location is extremely fast and requires no disk I/O. NV-Booster allocates inodes based on a metadata bitmap.

Fig. 3. In-memory metadata and object mapping
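To make the lookup path concrete, the sketch below resolves an object ID to an in-NVM inode and then to an on-disk address. A sorted array with binary search stands in for the namespace B+tree, and the structure and function names (nv_inode, ns_entry, object_locate) are illustrative placeholders for the design described above, not NV-Booster's actual symbols.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative in-NVM inode: one on-disk address plus the object length. */
struct nv_inode {
    uint64_t disk_addr;   /* starting block of the object on disk */
    uint64_t length;      /* object size in bytes */
};

/* One namespace entry: key = object ID, value = address of the in-NVM inode.
 * A sorted array stands in here for the namespace B+tree kept in NVM. */
struct ns_entry {
    uint64_t object_id;
    struct nv_inode *inode;
};

/* Resolve an object ID directly to its on-disk location by binary search.
 * No directory or inode blocks are read from disk on this path. */
static int object_locate(const struct ns_entry *ns, size_t n, uint64_t object_id,
                         uint64_t *disk_addr, uint64_t *length)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ns[mid].object_id < object_id)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == n || ns[lo].object_id != object_id)
        return -1;                        /* object not present on this node */

    *disk_addr = ns[lo].inode->disk_addr; /* read from byte-addressable NVM */
    *length    = ns[lo].inode->length;
    return 0;
}
```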

B. NVM Memory Management and Layout

Since memory is allocated dynamically, we need to protect the metadata from being overwritten by other applications after a system reboot. To address this problem, a special memory zone, called the NVM Zone, is physically reserved from the memory space (see Figure 4). The NVM Zone is isolated from other memory space by dedicated management functions. A set of dedicated kernel functions is designed to allocate and free memory from the NVM Zone for NV-Booster. Other applications, which use the legacy memory management functions, are not allowed to access the reserved NVM Zone. In this way, the NVM can only be accessed by NV-Booster, and metadata in NVM will not be overwritten by other applications.

Metadata stored in the NVM Zone should be reusable after a system reboot. To this end, a fixed access entrance is defined at the beginning of the NVM Zone to store the file system index, the core structure that maintains system-level status (such as failure or normal shutdown) and file-system-level information (see Figure 4). The file system index maintains pointers to the superblock, inode bitmap, block bitmap, committing transactions, and the root of the inode tree. This entrance is the access point of NV-Booster, from which all valid metadata can be recognized and retrieved after a system reboot. Committing transactions are checked to recover the file system to its latest consistent state after a failure.
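A minimal sketch of such a fixed entrance is shown below. The field layout, names, and the use of offsets rather than raw pointers are assumptions made for illustration; the paper states which pointers the index holds, but not its exact structure.

```c
#include <stdint.h>

/* System-level status recorded in the index (normal shutdown vs. failure). */
enum fs_status { FS_CLEAN = 1, FS_DIRTY = 2 };

/* Hypothetical file system index placed at a fixed location at the start of
 * the NVM Zone. Offsets within the zone (rather than raw pointers) are stored
 * so the metadata remains valid across reboots even if the zone is remapped. */
struct fs_index {
    uint32_t magic;            /* identifies a valid NV-Booster zone        */
    uint32_t status;           /* enum fs_status                            */
    uint64_t superblock_off;   /* offset of the superblock in the zone      */
    uint64_t inode_bitmap_off; /* offset of the inode bitmap                */
    uint64_t block_bitmap_off; /* offset of the on-disk block bitmap        */
    uint64_t commit_off;       /* offset of the committing transaction list */
    uint64_t inode_root_off;   /* offset of the root of the inode B+tree    */
};
```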

Fig. 4. Layout of NVM Zone

C. File System Consistency

With metadata and its corresponding data structures stored in NVM, the immediate benefit is that they can be updated in place and become persistent as soon as they are updated. However, metadata consistency must be handled carefully to avoid any inconsistency caused by a system failure. In particular, if the system crashes while an update is being made to a piece of metadata in NVM, the metadata may be left in a corrupted state because the update is only half-done. In that case, we need a mechanism to recover the metadata to its last consistent state after reboot.

To achieve metadata consistency in NVM, ordered memory writes are fundamental. However, existing CPUs and memory controllers may reorder memory writes. Without modifying existing hardware, we can use the instruction sequence {MFENCE, CLFLUSH, MFENCE} to form ordered memory writes [22]. Specifically, MFENCE issues a memory barrier which guarantees that memory operations after the barrier cannot proceed until those before the barrier complete, but it does not guarantee the order of write-back from the CPU cache to memory. CLFLUSH, on the other hand, explicitly invalidates the corresponding dirty CPU cache lines so that they are flushed to NVM, which eventually makes the memory write persistent.
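A small helper in the spirit of that sequence is sketched below using the x86 compiler intrinsics for MFENCE and CLFLUSH. The cache-line size and the helper name are assumptions for illustration; kernel code would typically use the equivalent in-kernel primitives rather than these user-space intrinsics.

```c
#include <emmintrin.h>   /* _mm_mfence, _mm_clflush */
#include <stdint.h>
#include <stddef.h>

#define CACHELINE_SIZE 64   /* assumed cache-line size */

/* Persist a range of NVM-resident bytes with ordered writes:
 * fence, flush every cache line covering [addr, addr+len), fence. */
static void nvm_persist(const void *addr, size_t len)
{
    uintptr_t start = (uintptr_t)addr & ~(uintptr_t)(CACHELINE_SIZE - 1);
    uintptr_t end   = (uintptr_t)addr + len;

    _mm_mfence();                              /* order prior stores            */
    for (uintptr_t p = start; p < end; p += CACHELINE_SIZE)
        _mm_clflush((const void *)p);          /* write cache lines back to NVM */
    _mm_mfence();                              /* wait until flushes complete   */
}
```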

The CPU provides atomicity only for memory writes of no more than a few bytes (8 bytes for 64-bit CPUs, 4 bytes for 32-bit CPUs), so updating metadata larger than 8 bytes requires a mechanism to ensure that the metadata can be recovered even if a system failure happens before it is completely updated. NV-Booster uses a hybrid approach for metadata consistency, switching between atomic in-place updates and CoW-based in-memory commit.

Atomic in-place update: The processor natively supports 8-byte atomic writes. NV-Booster uses the CPU's atomic write to update metadata no larger than 8 bytes.

CoW-based in-memory commit: NV-Booster uses copy-on-write (CoW) and in-memory commit to guarantee metadata consistency for updates larger than 8 bytes (see Figure 5). NV-Booster uses transactions to control when a metadata update is committed, and the in-memory commit runs periodically. A committing transaction is atomic and may contain multiple updates. MFENCE is issued at the beginning of a transaction. During the transaction, an original copy is created before each piece of metadata is updated, followed by CLFLUSH. MFENCE is issued again after all the updates are successfully committed. The commit finalizes the metadata updates and deletes the original copies right after the transaction completes. If the system crashes before the transaction commits, the original copies are used to recover the metadata in NVM.
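The sketch below illustrates one way such a transaction could be structured on top of the nvm_persist helper shown earlier; the undo-record layout, function names, and recovery step are assumptions for illustration rather than NV-Booster's actual code.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Persist helper from the previous sketch: fence + clflush + fence. */
void nvm_persist(const void *addr, size_t len);

/* Undo record kept in NVM for one metadata object inside a transaction. */
struct cow_record {
    void    *target;          /* metadata location being updated   */
    size_t   len;             /* size of the metadata object        */
    uint8_t  original[256];   /* original copy (assumed max size)   */
};

/* Update one metadata object inside a committing transaction:
 * 1) persist an original copy, 2) apply the update in place, 3) persist it. */
static void tx_update(struct cow_record *rec, void *target,
                      const void *new_value, size_t len)
{
    rec->target = target;
    rec->len    = len;
    memcpy(rec->original, target, len);     /* copy-on-write: keep old value */
    nvm_persist(rec, sizeof(*rec));         /* make the undo copy durable    */

    memcpy(target, new_value, len);         /* in-place update of metadata   */
    nvm_persist(target, len);
}

/* After a crash before commit, roll every recorded object back. */
static void tx_recover(struct cow_record *recs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        memcpy(recs[i].target, recs[i].original, recs[i].len);
        nvm_persist(recs[i].target, recs[i].len);
    }
}
```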

Fig. 5. CoW and in-memory commit.

IV. IMPLEMENTATION AND EVALUATION

A. Implementation

Due to the price and immaturity of NVM, mass production with large capacity is still impractical today. Commercially available NVDIMM [1] offers a practical NVM with which to prototype NV-Booster. During normal operation, the NVDIMM works as DRAM while its flash is invisible to the host. Upon power failure, however, the NVDIMM SAVEs all the data from DRAM to flash using a supercapacitor to make the data persistent. Upon power-on, the modified BIOS asks the NVDIMM to RESTORE the backup content from flash to DRAM. Since this process is transparent to the other parts of the system, the NVDIMM can be treated as NVM. Our NV-Booster is implemented and evaluated on an NVDIMM platform (see Figure 6).

Fig. 6. NVDIMM platforms: (a) server with NVDIMM; (b) NVDIMM.

The modified BIOS presents the operating system with a single physical memory space that includes both DRAM and NVDIMM. To differentiate the NVDIMM memory space, we modified the memory management code and mapped the NVDIMM as the NVM Zone in the virtual memory space. The reserved NVM Zone is used only for storing metadata persistently. A set of dedicated memory allocation functions, such as kmalloc_nvm, alloc_page_nvm, free_page_nvm, and mem_cache_alloc_nvm, is developed to allocate memory only from the NVM Zone/NVDIMM, which prevents metadata from being overwritten by other applications.
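As a rough illustration of how such dedicated allocators could be surfaced, the sketch below wraps the standard kernel allocator with a zone-selecting flag. The __GFP_NVM flag and the plumbing into the NVM Zone are hypothetical; the paper names these functions but does not describe their implementation.

```c
#include <linux/slab.h>
#include <linux/gfp.h>

/* Hypothetical GFP modifier that restricts an allocation to the NVM Zone. */
#define __GFP_NVM ((gfp_t)0)   /* placeholder value for illustration only */

/* Allocate metadata memory from the NVM Zone only; callers that use the
 * legacy kmalloc() never receive memory from this zone. */
static void *kmalloc_nvm(size_t size, gfp_t flags)
{
    return kmalloc(size, flags | __GFP_NVM);
}

static void kfree_nvm(void *ptr)
{
    kfree(ptr);
}
```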

The first 4KB page of the NVDIMM is reserved to store the file system index. After a reboot, NV-Booster checks the file system index to detect whether a system failure happened. To speed up file system mounting after a normal shutdown, our current implementation achieves instant mount. More specifically, during a normal shutdown we mark the status field of the file system index to indicate a successful shutdown. The file system mount then (1) starts by checking the status entry and (2) if it is marked, resets it and uses the superblock stored in NVM to complete the mount. If the status is not marked, a failure occurred, and a transaction undo is executed to recover the file system.
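A condensed sketch of that mount-time decision is given below, reusing the hypothetical fs_index fields introduced in Section III; the function names and the stubbed recovery steps are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

enum fs_status { FS_CLEAN = 1, FS_DIRTY = 2 };

/* Simplified view of the fixed file system index at the start of the NVM Zone. */
struct fs_index {
    uint32_t status;           /* FS_CLEAN after a normal shutdown */
    uint64_t superblock_off;   /* superblock kept in NVM           */
};

/* Stub: instant mount path that reuses the superblock already in NVM. */
static void mount_from_superblock(struct fs_index *idx)
{
    printf("instant mount using superblock at offset %llu\n",
           (unsigned long long)idx->superblock_off);
}

/* Stub: crash recovery path that undoes half-done committing transactions. */
static void undo_committing_transactions(struct fs_index *idx)
{
    (void)idx;
    printf("failure detected: undoing committing transactions\n");
}

/* Mount-time check: instant mount after a clean shutdown, undo after a crash. */
static void nvbooster_mount(struct fs_index *idx)
{
    if (idx->status == FS_CLEAN) {
        idx->status = FS_DIRTY;              /* re-arm crash detection  */
        mount_from_superblock(idx);          /* reuse superblock in NVM */
    } else {
        undo_committing_transactions(idx);   /* roll back half-done work */
        mount_from_superblock(idx);
    }
}
```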

A’’

In-memory Commit

Fig. 5. CoW and In-memory Commit

B. Consistency Evaluation

To validate the consistency of NV-Booster, we performed two types of tests: normal shutdown and abnormal shutdown. We manually trigger the abnormal shutdown by cutting the power supply. For both normal and abnormal shutdown, the NVDIMM detects the power loss and triggers the SAVE operation, which copies the whole DRAM content to flash with the help of the embedded supercapacitor. Once power resumes, the modified BIOS signals the NVDIMM to perform the RESTORE operation, which copies the backup content from flash back to DRAM. After a normal shutdown, NV-Booster is able to mount instantly by locating the superblock through the file system index, which is fixed at the beginning of the NVM. After an abnormal shutdown, NV-Booster detects the failure and recovers the file system by undoing the committing transactions. We then check whether NV-Booster has any data inconsistency. We repeated these tests a few hundred times, and NV-Booster passed the check in all cases.

Fig. 7. Performance comparison under the Filebench benchmark: throughput (OPS) of Ext4 and NV-Booster versus number of files for (a) Varmail, (b) File Server, and (c) Web Server.

C. Standalone Performance Evaluation

As shown in Table III, three Filebench workloads, Fileserver, Varmail and Webserver, are generated to compare the performance of NV-Booster and Ext4.

TABLE III: CONFIGURATIONS OF WORKLOADS FOR FILEBENCH

  Workloads   Number of files (M)   File Size (KB)   Mean Dir Width   Threads
  Fileserver  3, 6, 9, 12           16               2000             50
  Varmail     3, 6, 9, 12           16               1M               16
  Webserver   3, 6, 9, 12           16               20               100

Figure 7 shows the throughput (operations per second) of Ext4 and NV-Booster for the three workloads with a varying number of files. For the Varmail workload, the throughput of NV-Booster is 845 OPS when the number of test files is set to 12 million, whereas the throughput of Ext4 is 537 OPS (see Figure 7(a)). NV-Booster achieves up to 57% performance improvement over Ext4 for the write-intensive Varmail workload. There is also up to 52% performance improvement of NV-Booster over Ext4 for the Fileserver workload (see Figure 7(b)). For the read-intensive Webserver workload, the throughput of NV-Booster is 27690 OPS for 12 million test files, whereas the throughput of Ext4 is 17340 OPS (see Figure 7(c)), a 60% improvement over Ext4.

We also see a higher performance improvement of NV-Booster over Ext4 as the number of test files increases from 3 million to 12 million. For example, the performance improvement of NV-Booster over Ext4 under the Varmail workload is 34%, 39%, 46%, and 57% when the number of test files is set to 3 million, 6 million, 9 million and 12 million, respectively. Similarly, the performance improvement of NV-Booster over Ext4 under the Webserver workload is 11%, 23%, 56%, and 60% for 3 million, 6 million, 9 million and 12 million test files. As the number of files increases, the metadata and journaling I/O operations increase accordingly, which significantly slows down the conventional Ext4 file system. Increasing metadata operations, however, do not affect the performance of NV-Booster, because NV-Booster eliminates metadata I/O and journaling I/O with in-memory metadata and in-memory commit.

D. Cluster Performance Evaluation

We then test NV-Booster as the underlying file system for the distributed file system Ceph. In particular, a Ceph cluster containing 12 storage nodes is set up. Each storage node has a 1GB journal partition on SSD and a 500GB data partition on HDD. Ceph is set up as a block device that stores data striped over multiple OSDs in the Ceph cluster. For the experiment, we use 8 clients and up to 32 threads per client. The Rados benchmark is used to generate workloads, which are issued by multiple clients and multiple threads across the Ceph platform.

Figure 8 shows the throughput of Ceph under Ext4 and NV-Booster for small I/O (4KB) with a varying number of running threads. For 4KB random writes, the throughput of NV-Booster is 4100 IOPS with 32 running threads, whereas the throughput of Ext4 is 350 IOPS (see Figure 8(a)); NV-Booster delivers up to a 10X performance improvement for small random writes. For 4KB random reads, the throughput of NV-Booster is 1400 IOPS with 32 running threads, whereas the throughput of Ext4 is 450 IOPS (see Figure 8(b)), a 3X improvement. For 4KB sequential reads, the throughput of NV-Booster is 7800 IOPS with 32 running threads, whereas the throughput of Ext4 is 5000 IOPS (see Figure 8(c)), a 50% improvement. We can see from Figure 8 that the performance improvement for random writes is higher than that for random reads. The reason is that NV-Booster reduces metadata I/O and journaling I/O with in-memory metadata and in-memory commit.

Figure 9 shows the throughput of Ceph under Ext4 and NV-Booster for big I/O (4MB) with a varying number of running threads. For 4MB random writes, the throughput of NV-Booster is 780 MB/s with 16 running threads, whereas the throughput of Ext4 is 420 MB/s (see Figure 9(a)); NV-Booster delivers up to an 85% performance improvement for big random writes. For 4MB random reads, the throughput of NV-Booster is 950 MB/s with 16 running threads, whereas the throughput of Ext4 is 580 MB/s (see Figure 9(b)), a 63% improvement. For 4MB sequential reads, the throughput of NV-Booster is 840 MB/s with 16 running threads, whereas the throughput of Ext4 is 600 MB/s (see Figure 9(c)), a 40% improvement.

Fig. 8. Performance comparison of small I/O (request size: 4KB, 8 clients and 12 OSDs): throughput (IOPS) of Ext4 and NV-Booster versus number of threads for (a) random writes, (b) random reads, and (c) sequential reads.

Fig. 9. Performance comparison of big I/O (request size: 4MB, 8 clients and 12 OSDs): bandwidth (MB/s) of Ext4 and NV-Booster versus number of threads for (a) random writes, (b) random reads, and (c) sequential reads.

These results show that NV-Booster clearly outperforms the existing Ext4 file system under distributed file system workloads. This is because the proposed NV-Booster leverages NVM to speed up metadata mapping and reduce disk I/O for the distributed file system. The results indicate that the performance gain of NV-Booster comes from two aspects: in-memory metadata with direct mapping, and in-memory commit with reduced disk I/O.

V. RELATED WORK

A. NVM-based File System

NVM-only file systems: BPFS [3] is a byte-addressable file system that resides in NVM. It adopts short-circuit shadow paging, together with hardware modifications (epoch barriers), to maintain file system consistency. SCMFS [11] utilizes the memory management of the operating system to do the block management and keeps the space always contiguous for each file. PMFS [4] enables direct NVM access via CPU load/store instructions and uses journaling (undo logging) to provide consistent updates to metadata. A flexible architecture called Aerie [10] exposes NVM to user-mode programs so that they can access files without kernel interaction.

NVM-based hybrid file systems: Large-capacity NVM is still far from practical deployment, so building hybrid systems with NVM is a practical option. The Conquest file system [19] stores metadata and small files in NVM and big files on hard disk. In [24,26], file system metadata is stored in MRAM in a specially designed structure and layout to accelerate performance. PFFS2 is a flash memory file system for the hybrid architecture of PCM and Flash [20]. It saves file system metadata into PCM and manages all directories in the PCM. Based on the virtual metadata storage, PFFS2 can manage metadata in a virtually fixed location through byte-level in-place updates. However, PFFS2 is designed only for flash memory and does not consider file system consistency issues. FRASH is a hybrid file system that designs an in-memory metadata structure as well as an on-disk structure [18]. In summary, the above works have the following issues: they do not consider 1) consistency, 2) memory partition and isolation, or 3) reboot and failure recovery; 4) in addition, their evaluation and implementation are based on DRAM, not real NVM. This paper addresses these issues. Compared with our previous work FSMAC [21], this paper differs in the following ways. First, NV-Booster proposes in-memory metadata and direct mapping to improve performance for distributed storage systems, together with a customized disk layout; to save NVM space, unused inodes are not kept and only one on-disk address is maintained per object. Second, a CoW and in-memory commit mechanism is proposed to maintain consistency. Finally, an extensive evaluation on a distributed storage system is conducted.

B. Customized Underlying File System

OBFS [17] is an object-based file system developed specifically for Object-based Storage Devices (OSDs). OBFS uses two block sizes: small blocks, roughly equivalent to the blocks in a general-purpose file system, and large blocks, equal to the maximum object size. OBFS separates the raw disk into regions, and all of the blocks in a region have the same size. Blocks are laid out in regions that contain both the object data and the onodes for the objects. Free blocks of the appropriate size are allocated sequentially.

EBOFS [25] is another object-based file system built upon the ideas of OBFS. EBOFS is an extent- and B+tree-based object file system; it allows arbitrarily sized objects and preserves intra-object locality of reference by allocating data contiguously on disk, maintaining high levels of contiguity even over the entire lifetime of a disk's file system, which allows OSDs to operate more efficiently and distributed file systems to maximize performance. EBOFS uses buckets to allocate the closest extents from free space, but it cannot keep the disk layout compact or prevent disk space fragmentation.

VI. CONCLUSION

This paper presents an NVM-based file system (called NV-Booster) to accelerate object access on storage nodes. With NV-Booster, metadata is kept in NVM and accessed in a byte-addressable manner through the memory bus, while data is stored on hard disk and accessed through the I/O bus. Thus, metadata access is significantly accelerated, and metadata I/O is eliminated because metadata in NVM is no longer flushed back to disk. The proposed NV-Booster enables fast object search and direct mapping between object ID and on-disk location with an efficient in-memory mapping and a customized disk layout. Evaluation of a real NV-Booster prototype on NVDIMM shows that NV-Booster is very efficient in improving the performance of distributed file systems.

REFERENCES

[1] Viking Technology, ArxCis-NV (TM) non-volatile DIMM. http://www.vikingtechnology.com/nvdimm-technology
[2] A. M. Caulfield, A. De, J. Coburn, T. I. Mollow, R. K. Gupta, and S. Swanson, "Moneta: A high performance storage array architecture for next-generation, non-volatile memories," in Proc. MICRO, 2010.
[3] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee, "Better I/O through byte-addressable, persistent memory," in Proc. SOSP, 2009.
[4] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson, "System software for persistent memory," in Proc. EuroSys, 2014.
[5] R. F. Freitas and W. W. Wilcke, "Storage-class memory: The next storage system technology," IBM Journal of Research and Development, vol. 52, no. 4.5, pp. 439-447, 2008.
[6] H. Kim, S. Seshadri, C. L. Dickey, and L. Chiu, "Evaluating phase change memory for enterprise storage systems: A study of caching and tiering approaches," in Proc. FAST, 2014.
[7] E. Lee, H. Bahn, and S. H. Noh, "Unioning of the buffer cache and journaling layers with non-volatile memory," in Proc. FAST, 2013.
[8] D. Narayanan and O. Hodson, "Whole-system persistence," in Proc. ASPLOS, 2012.
[9] S. Venkataraman, N. Tolia, P. Ranganathan, R. H. Campbell, et al., "Consistent and durable data structures for non-volatile byte-addressable memory," in Proc. FAST, 2011.
[10] H. Volos, S. Nalli, S. Panneerselvam, V. Varadarajan, P. Saxena, and M. M. Swift, "Aerie: Flexible file-system interfaces to storage-class memory," in Proc. EuroSys, 2014.
[11] X. Wu and A. Reddy, "SCMFS: A file system for storage class memory," in Proc. SC, 2011.
[12] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proc. SOSP, 2003.
[13] K. Shvachko, H. Huang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," in Proc. MSST, 2010.
[14] Amazon S3. Amazon Simple Storage Service. http://www.amazon.com/s, 2009.
[15] Lustre: A Scalable, High-Performance File System. Whitepaper, Cluster File System, Inc. http://www.lustre.org/docs/lustre.pdf
[16] S. A. Weil, S. A. Brandt, E. L. Miller, and D. D. E. Long, "Ceph: A scalable, high-performance distributed file system," in Proc. OSDI, 2006.
[17] F. Wang, S. A. Brandt, E. L. Miller, and D. D. E. Long, "OBFS: A file system for object-based storage devices," in Proc. MSST, 2004.
[18] J. Jung, Y. Won, E. Kim, H. Shin, and B. Jeon, "FRASH: Exploiting storage class memory in hybrid file system for hierarchical storage," ACM Transactions on Storage, vol. 6, no. 1, pp. 1-25, 2010.
[19] A.-I A. Wang, G. Kuenning, P. Reiher, and G. Popek, "The Conquest file system: Better performance through a disk/persistent-RAM hybrid design," ACM Transactions on Storage, vol. 2, no. 3, pp. 309-348, 2006.
[20] Y. W. Park and K. H. Park, "High-performance scalable flash file system using virtual metadata storage with phase-change RAM," IEEE Transactions on Computers, vol. 60, no. 3, pp. 321-334, 2011.
[21] J. X. Chen, Q. S. Wei, C. Chen, and L. K. Wu, "FSMAC: A file system metadata accelerator with non-volatile memory," in Proc. MSST, 2013.
[22] J. Yang, Q. Wei, C. Chen, C. Wang, K. L. Yong, and B. He, "NV-Tree: Reducing consistency cost for NVM-based single level systems," in Proc. FAST, 2015.
[23] A. W. Leung, S. Pasupathy, G. Goodson, and E. L. Miller, "Measurement and analysis of large-scale network file system workloads," in Proc. USENIX ATC, 2008.
[24] K. M. Greenan and E. L. Miller, "Reliability mechanisms for file systems using non-volatile memory as a metadata store," in Proc. EMSOFT, 2006.
[25] S. A. Weil, "Leveraging intra-object locality with EBOFS," UCSC CMPS 290S Project Report, May 2004.
[26] K. M. Greenan and E. L. Miller, "PRIMS: Making NVRAM suitable for extremely reliable storage," in Proc. HotDep, 2007.
[27] Everspin. Second generation MRAM: Spin torque technology. http://www.everspin.com/products/second-generation-stmram.html
