Optimizing Local File Accesses for FUSE-Based Distributed Storage

Shun Ishiguro∗, Jun Murakami∗, Yoshihiro Oyama∗‡, Osamu Tatebe†‡

∗ Department of Informatics, The University of Electro-Communications
  Email: {shun,murakami}@ol.inf.uec.ac.jp, [email protected]
† Faculty of Engineering, Information and Systems, University of Tsukuba
  Email: [email protected]
‡ Japan Science and Technology Agency, CREST
Abstract—Modern distributed file systems can store huge amounts of information while retaining the benefits of high reliability and performance. Many of these systems are prototyped with FUSE, a popular framework for implementing user-level file systems. Unfortunately, when such a system is mounted on a client through FUSE, it suffers from I/O overhead caused by extra memory copies and context switches during local file access. This overhead is not small and may significantly degrade the performance of data-intensive applications. In this paper, we propose a mechanism that achieves rapid local file access in FUSE-based distributed file systems by reducing the number of memory copies and context switches. We incorporate the mechanism into the FUSE framework and demonstrate its effectiveness through experiments with the Gfarm distributed file system.
I. INTRODUCTION

Highly scalable distributed file systems have become a key component of data-intensive science, where low latency, high capacity, and high scalability of storage are crucial to application performance. As a result, a large body of literature has been devoted to distributed file systems and to how they can store large numbers of large files by federating storage across multiple servers [1]–[7]. Several modern distributed file systems [1], [3], [6], [7] allow users to mount them on UNIX clients by using FUSE [8], a widely used framework for implementing user-level file systems. Once a file system is mounted in this way, it can be accessed with standard system calls such as open, close, read, and write. While this approach significantly improves simplicity and separation of concerns, it also has a drawback: the current FUSE framework imposes considerable overhead on I/O and can degrade the overall performance of data-intensive applications. Previous papers report that FUSE may incur large runtime overhead [9]–[11]. For example, Bent et al. [9] reported that FUSE imposed approximately 20% overhead on file system bandwidth. The FUSE framework consists of a kernel module and userland daemons. The kernel module mediates between applications and a userland daemon, forwarding requests for file system access to the daemon and returning the daemon's results to the application. Because
these communications between the kernel module and the userland daemon involve frequent memory copies and context switches, they introduce significant runtime overhead. The framework forces applications to access data in the mounted file system via the userland daemon, even when the data is stored locally and could be accessed directly. The memory copies also increase memory consumption because redundant copies of the same data are kept in different page caches.

In this paper, we propose a mechanism that allows applications to access local storage directly via the FUSE kernel module. We focus on the flow of operations performed in local file accesses by FUSE-based file systems and propose a mechanism that bypasses redundant context switches and memory copies. We then explain the modifications to the FUSE kernel module needed to incorporate our mechanism. The result is a kernel module that accesses local storage directly wherever possible, thereby eliminating unnecessary kernel-daemon communication and the memory copies and context switches associated with it, which in turn improves the I/O performance of FUSE-based file systems. We demonstrate the effectiveness of the proposed mechanism by running the Gfarm distributed file system [1] on our modified version of FUSE and measuring its performance. Our results show that the mechanism significantly improves the I/O throughput of the Gfarm file system (by 26% to 63%) and also greatly reduces the amount of memory consumed by the page cache (to half of the original). Because our mechanism has no strict dependencies on Gfarm, we believe that our results generalize to other FUSE-based distributed file systems and that the proposed mechanism can be cleanly integrated into the FUSE kernel module. The contributions of this work are (1) a mechanism for reducing the context switches and memory copies that occur during local storage accesses in FUSE-based distributed file systems, and (2) a demonstration of its effectiveness by adapting the mechanism to Gfarm and presenting experimental results.

II. GFARM AND FUSE
A. Gfarm

Fig. 1. Gfarm Architecture

Figure 1 shows the basic architecture of Gfarm, which consists of metadata servers and I/O servers. A metadata
server manages file system metadata, including inodes, directory entries, and file locations. I/O servers manage file data and run on either dedicated nodes or client nodes. Depending on the configuration, an I/O server and its client programs may run on the same node or on different nodes. One way for applications to access a Gfarm file system is to call functions in the Gfarm library, including those for standard operations such as open and read. When a client accesses a file, the flow of the operation is as follows:

1) The client sends an open request to a metadata server when it wants to open a file.
2) The client receives the location of the file from the metadata server.
3) The client sends an open request to the I/O server that contains the file.
4) The I/O server communicates with the metadata server to check whether the open request is valid. File information such as an inode number and a generation number is also exchanged.
5) The client sends further requests, such as read and write, to the I/O server.
6) The client sends a close request to both the metadata server and the I/O server.

By restricting read and write requests to the I/O server containing the file, Gfarm reduces the load on the metadata server and the network.
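To make this flow concrete, the sketch below reads a file through the Gfarm client library directly, without any mounted file system. The gfs_pio_* names, signatures, constants, and the path format are recalled from the Gfarm 2.x API and should be treated as assumptions rather than a verified example.

/* Sketch of direct access through the Gfarm client library (no FUSE
 * involved).  API names, signatures, and the path format are assumptions
 * based on the Gfarm 2.x client library and may differ in detail. */
#include <stdio.h>
#include <gfarm/gfarm.h>

int main(int argc, char **argv)
{
    gfarm_error_t e;
    GFS_File gf;
    char buf[8192];
    int n;

    e = gfarm_initialize(&argc, &argv);
    if (e != GFARM_ERR_NO_ERROR) {
        fprintf(stderr, "gfarm_initialize: %s\n", gfarm_error_string(e));
        return 1;
    }
    /* Steps 1-4 above: the library resolves the file location via the
     * metadata server and opens the file on the I/O server that stores it. */
    e = gfs_pio_open("gfarm:///tmp/example.dat", GFARM_FILE_RDONLY, &gf);
    if (e == GFARM_ERR_NO_ERROR) {
        /* Step 5: read requests go only to that I/O server. */
        while (gfs_pio_read(gf, buf, (int)sizeof(buf), &n) == GFARM_ERR_NO_ERROR
               && n > 0)
            fwrite(buf, 1, (size_t)n, stdout);
        gfs_pio_close(gf);              /* step 6 */
    } else {
        fprintf(stderr, "gfs_pio_open: %s\n", gfarm_error_string(e));
    }
    gfarm_terminate();
    return 0;
}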
B. FUSE

Fig. 2. Access to a user-level file system using FUSE

The basic FUSE architecture consists of a kernel module (hereafter referred to as the "FUSE module") and userland daemons. Developers create their user-level file systems within these daemons by providing file system operations such as open, close, read, and write. Figure 2 shows how the FUSE framework is used to access a user-level file system. The steps are as follows:

1) The application issues a system call to the operating system to access files in a FUSE-based file system. The service routine for the system call sends a request for the corresponding file operation to the FUSE module.
2) The FUSE module forwards the request to a userland daemon.
3) The daemon processes the request in one of several ways: some daemons use a local file system such as ext3, while others communicate with a remote server over the network.
4) The daemon returns the result to the FUSE module.
5) The FUSE module forwards the result to the application program.

To clarify the runtime overhead imposed on reading and writing files, we outline the flow of a read operation in a FUSE-based file system. First, the application issues a read system call on a file in the user-level file system, and the service routine for the system call sends a corresponding read request to the FUSE module. The FUSE module forwards the request to the userland daemon by writing the request and its properties to the special file /dev/fuse. The daemon picks up the request from /dev/fuse and performs the operations appropriate for the read request, such as copying the requested data to the FUSE module's buffer. Finally, the FUSE module returns the result to the application. Due to these additional context switches and memory copies, the performance of a FUSE-based user-level file system tends to be poorer than that of a kernel-level file system. At minimum, two additional context switches occur when an application process sends a request to a user-level file system and receives the result: one to transfer control from the application process to the daemon, and another to transfer control back from the daemon to the application process. Additional memory copies also occur when request data (such as the request type) is communicated to the daemon via /dev/fuse, and when the daemon copies file data to the FUSE module's own buffer so that it can be returned to the application.
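To make the daemon side of this exchange concrete, the following is a minimal FUSE 2.x daemon in the style of the classic "hello" example: a hypothetical read-only file system serving a single in-memory file. It is unrelated to gfarm2fs and is included only as a sketch of where the callbacks driven through /dev/fuse live.

/* Minimal FUSE 2.x daemon sketch: a hypothetical read-only file system that
 * serves one in-memory file.  The kernel module forwards requests arriving
 * through /dev/fuse to these callbacks; readdir and other handlers are
 * omitted for brevity. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

static const char *hello_path = "/hello";
static const char *hello_data = "hello, FUSE\n";

static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = (off_t)strlen(hello_data);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
    size_t len = strlen(hello_data);

    (void)fi;
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - (size_t)offset;
    /* This copy is the userland half of the extra memory copies discussed
     * above; the kernel module copies the data again on its way back to
     * the application. */
    memcpy(buf, hello_data + offset, size);
    return (int)size;
}

static struct fuse_operations hello_ops = {
    .getattr = hello_getattr,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    /* fuse_main() mounts the file system and enters the request loop,
     * reading requests from /dev/fuse and dispatching them to hello_ops. */
    return fuse_main(argc, argv, &hello_ops, NULL);
}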
C. Accessing a FUSE-Mounted Gfarm File System

Fig. 3. Access to a Gfarm file system using FUSE

Gfarm2fs is a FUSE-based userland daemon for mounting a Gfarm file system on a UNIX client. Figure 3 shows the flow of a file access when gfarm2fs is used. The steps are as follows:

1) The application program issues a system call for a file access.
2) The FUSE module receives a request for file system access.
3) The FUSE module forwards the request to the gfarm2fs daemon. gfarm2fs processes the request by communicating with a metadata server and I/O servers using the Gfarm library.
4) The results of the request are returned to the FUSE module.
5) The FUSE module forwards the results to the application.

A gfarm2fs daemon sends requests to the I/O server that contains the target files. However, when the I/O server and the gfarm2fs daemon run on the same node, the daemon accesses local files directly without communicating with the I/O server. Unfortunately, this direct local access occurs only within the gfarm2fs daemon and not earlier in the request path. Since the daemon itself is unnecessary for local access, it constitutes additional I/O overhead.

III. PROPOSED MECHANISM

A. Design

We have designed a mechanism that allows programs to access locally stored data in a mounted FUSE-based Gfarm file system without routing requests through a gfarm2fs daemon. The key idea is that information on a locally opened file is passed from our modified gfarm2fs to our modified FUSE module, which then uses that information to access the file directly in subsequent read/write operations. Because our modified FUSE module does not need to communicate with gfarm2fs when read and write requests target local files, the mechanism eliminates the context switches and memory copies needed for that communication.

The mechanism also reduces cache memory use. In the current versions of FUSE and gfarm2fs, both the local file system (such as ext3) and the FUSE module maintain their own page cache in the kernel. Hence, when a local file is accessed, a redundant page cache is always created, and this duplication wastes memory. In our mechanism, the FUSE module does not create a page cache, and thus less memory is used than in the current implementation.

With our mechanism, client programs that issue system calls to a Gfarm file system can directly read from and write to the local file system where possible. To accomplish this, our mechanism uses two basic components: an open component and a read/write component.

Fig. 4. Flow of the open component. The FUSE module obtains a file descriptor from gfarm2fs, which opens the file.

Figure 4 shows the procedural flow of the open component. The steps are as follows:

1) The application sends an open call to the user-level file system.
2) The FUSE kernel module forwards the request to the gfarm2fs daemon.
3) gfarm2fs sends the open request to a metadata server, which returns the location of the I/O server that contains the target file.
4) gfarm2fs accesses the I/O server. If the I/O server is on the local node, gfarm2fs opens the file directly in local storage and returns the file descriptor to the FUSE module, along with the results of the open request.
5) The FUSE module associates the structure of the locally opened file with the original structure of the target file.

Fig. 5. Flow of the read/write component. The FUSE module passes requests from applications to a local file system.

Figure 5 shows the procedural flow of the read/write component. The steps are as follows:

1) On receiving a read or write system call, the FUSE module retrieves the file structure that was opened by gfarm2fs.
2) The FUSE module checks whether a locally opened file structure is associated with the original structure. If so, the FUSE module issues the read or write request to that file and returns the results to the client program. Otherwise, the FUSE module sends the request to gfarm2fs.

For system calls other than read and write, the modified FUSE module forwards requests to gfarm2fs in the same way as the standard FUSE module.
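Step 5 of the open component is where the bypass is established. The sketch below is one way the kernel side of this association could look, assuming the descriptor arrives with the open reply and is resolved with fget() in the daemon's context, as Section III-B describes; the function name, its call site, and the reference-handling details are illustrative assumptions, not the authors' patch.

/* Illustrative sketch of step 5 (not the authors' exact code): associate the
 * file that gfarm2fs opened locally with FUSE's per-open state so that later
 * read and write requests can bypass the daemon.  struct fuse_file is the
 * FUSE module's internal per-open structure, extended with the new member
 * file as described in Section III-B. */
#include <linux/fs.h>
#include <linux/file.h>         /* fget() */

static void fuse_attach_local_file(struct file *fuse_filp, int local_fd)
{
    struct fuse_file *ff = fuse_filp->private_data;

    if (local_fd >= 0) {
        /* The open reply is handled in the gfarm2fs process context
         * (see Section III-B), so the descriptor can be resolved against
         * the daemon's file table; fget() takes a reference. */
        ff->file = fget(local_fd);
    } else {
        /* Remote file: point back to the FUSE-level file so that the
         * read/write path behaves exactly as before. */
        ff->file = fuse_filp;
    }
}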
B. Implementation

The proposed mechanism was implemented by modifying gfarm2fs, the FUSE module, and the FUSE library.

1) Gfarm2fs: We created an extension to the open request handler used by gfarm2fs. In our implementation, gfarm2fs accesses an I/O server after communicating with a metadata server, because the proposed mechanism requires the descriptor of an opened file. In the handler, a pointer to a struct fuse_file_info structure is passed as a function argument for interacting with the FUSE module. This structure is usually used to communicate information such as file permissions to the FUSE module. To this structure, we added a new member for communicating the descriptor of an opened file.
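A minimal sketch of this extension, under the assumption that the added member is named local_fd and that the open handler fills it in only when the file is stored locally, is shown below; the helper functions are hypothetical placeholders for the Gfarm-library logic.

/* Sketch of the extension to struct fuse_file_info (shared between the FUSE
 * library and the kernel module).  Existing members are elided; the name
 * local_fd is an assumption. */
struct fuse_file_info {
    int flags;                  /* open flags (existing member) */
    /* ... other existing members, including uint64_t fh ... */
    int local_fd;               /* NEW: descriptor of the file opened directly
                                   in local storage, or -1 if the file is on a
                                   remote I/O server */
};

/* Sketch of the extended gfarm2fs open handler.  is_local_file() and
 * local_path() are hypothetical helpers standing in for the Gfarm-library
 * logic that determines whether the I/O server is on this node and where the
 * backing file lives. */
static int gfarm2fs_open(const char *path, struct fuse_file_info *fi)
{
    /* ... open the file through the Gfarm library, as before ... */

    fi->local_fd = -1;                  /* default: remote I/O server */
    if (is_local_file(path))
        fi->local_fd = open(local_path(path), fi->flags);
    return 0;
}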
2) FUSE: The modified FUSE library and module receive the extended structure from gfarm2fs. The FUSE module obtains the descriptor of the opened file from the new member added to the structure. After receiving an open system call, the FUSE module normally sends an open request to gfarm2fs and waits until the results are returned via a write operation on /dev/fuse. When data are written to /dev/fuse, the write handler of the FUSE module is called and the data are copied into a buffer in the FUSE module. In our extension, the process structure of gfarm2fs is found in this handler, because the handler is executed in the gfarm2fs context. When the copy is finished, the FUSE module obtains the file descriptor. If the value of the descriptor is −1, the FUSE module knows that the file is on a remote I/O server. Otherwise, the FUSE module obtains the structure of the local file opened by gfarm2fs.

Fig. 6. Management of structures for files

Figure 6 shows our modification to the kernel-level data structures for managing files. The file structure, of type struct file, contains a member called private_data whose type is void *. This member can be used for a variety of purposes in Linux kernel modules. The original FUSE stores a pointer to a struct fuse_file structure in private_data. In our modified FUSE framework, we add a new member, file, to this structure. If the opened file is in local storage, the member points to the struct file structure that manages the local file. Otherwise, the member points to the original struct file structure.

FUSE registers itself as the fuse file system type in the kernel. When a read or a write system call is issued for a FUSE-mounted file, the FUSE module calls the generic kernel functions generic_file_read and generic_file_write, respectively. In our implementation, new functions are used instead for such calls. These functions obtain the structure of the file opened by gfarm2fs from the file structure passed as an argument, and then call the generic kernel functions on the newly obtained file structure.

ssize_t fuse_generic_file_read(struct file *filp, char __user *buf,
                               size_t len, loff_t *ppos)
{
    /* ff->file is the locally opened file, or filp itself for a remote
     * file, so the same path handles both cases. */
    struct fuse_file *ff = filp->private_data;

    return generic_file_read(ff->file, buf, len, ppos);
}

Fig. 7. Modified function for file read

Figure 7 shows the new read function. Here, filp is a pointer to the original file structure used in the FUSE module. The function references private_data to obtain the struct fuse_file pointer ff, whose member ff->file is the structure of the file opened by gfarm2fs. When a remote I/O server is accessed, ff->file equals filp, so the proposed mechanism works seamlessly whether the storage is local or remote. For write operations, the new write function obtains the structure of the file opened by gfarm2fs and calls the corresponding generic kernel function in a manner similar to the read implementation.
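Following the description above, the write-side counterpart would presumably mirror Figure 7. The sketch below is an assumed reconstruction, relying on generic_file_write, which, like generic_file_read, is available in the 2.6.18 kernel used in the evaluation; it is not the authors' exact code.

ssize_t fuse_generic_file_write(struct file *filp, const char __user *buf,
                                size_t len, loff_t *ppos)
{
    /* As in the read path, ff->file points at the locally opened file when
     * the data is local and at filp itself when the file is remote. */
    struct fuse_file *ff = filp->private_data;

    return generic_file_write(ff->file, buf, len, ppos);
}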
IV. EXPERIMENTAL RESULTS

We conduct several experiments to evaluate the performance of the proposed mechanism. In all of the experiments, we use IOzone [12] as the benchmark for measuring throughput and page cache memory consumption, and we use the same environment throughout. Our cluster consists of a single metadata server and three client nodes, each of which runs an I/O server and client applications. Each node has two 6-core Intel Xeon 2.40 GHz CPUs, 48 GB of RAM, and a 600 GB 15,000 rpm SAS HDD, and the nodes are connected with InfiniBand QDR. The software environment is 64-bit CentOS 5.6 (kernel 2.6.18), Gfarm version 2.4.2, FUSE version 2.7.4, and gfarm2fs version 1.2.3.

In these experiments, we measure the read and write throughput for a temporary file of size 1 GB with a record size of 8 MB. The temporary file is placed in the local storage of the client node that runs the benchmark. We perform four operations: sequential read, sequential write, random read, and random write. We compare three cases: (1) the original FUSE framework, (2) the proposed mechanism, and (3) direct access without FUSE. Through the experiments we measure the performance improvement achieved by the proposed mechanism and the reduction in page cache memory consumption. We also examine whether the performance of the proposed mechanism is close to that of direct access.

Figure 8 shows the results of the benchmark comparison. In sequential read, the original FUSE performed best, and both the proposed mechanism and direct access showed 21% lower performance. In sequential write, the proposed mechanism achieved a 63% improvement over the original FUSE. In random read and random write, the proposed mechanism achieved 57% and 26% improvements, respectively, over the original FUSE. Throughput for all operations except sequential read improves when the proposed mechanism is used instead of the original FUSE. The improvements range from 26% to 63% and bring the performance of the proposed mechanism close to that of direct access.
Fig. 8. Throughput (MB/s) of sequential read and write, and random read and write, for the original FUSE, the proposed mechanism, and direct access without FUSE. Throughput for all operations except sequential read improves when using the proposed mechanism. The improvements bring the performance of the proposed mechanism close to that of direct access.
The performance degradation of the proposed mechanism compared with direct access in sequential read, sequential write, random read, and random write is 0.2%, 2.6%, 1.3%, and 1.4%, respectively. The reason the throughput of sequential read using the original FUSE is higher than that of the proposed mechanism and direct access is that the read-ahead algorithm is modified under the FUSE framework [11].

We estimate the number of context switches eliminated by the proposed mechanism in this experiment. The block size of the data communicated between the FUSE module and the FUSE daemon is 4 KB for file reads and 128 KB for file writes. Hence, the estimated number of context switches eliminated when reading a 1 GB file is 1 GB / 4 KB = 262,144, and the estimated number when writing a 1 GB file is 1 GB / 128 KB = 8,192.

We also measure the size of the page cache, using the free command, before and after executing a program that reads a 1 GB temporary file. With the original FUSE, the page cache grows by 2 GB, whereas with the proposed mechanism it grows by 1 GB. It is natural that the increase with the original FUSE is twice as large, because in the original FUSE both the local file system (ext3) and the FUSE module create their own page cache for the same data, while in the proposed mechanism only the local file system creates a page cache.

V. RELATED WORK

LDPLFS [13] is an approach that uses a dynamically loadable library to retarget POSIX file operations to the functions of a parallel file system. Like our proposed mechanism, it can accelerate file reads and writes without modifying application code. However, LDPLFS does not work well with binaries that are statically linked with the standard libraries. Narayan et al. [10] propose an approach in which a stackable FUSE module is used to reduce the context-switch overhead of FUSE-based user-level file systems. They apply the approach to encryption file systems and do not discuss its application to distributed file systems.
Ceph [4] is a distributed file system that provides highly scalable storage. Because the Ceph client is merged into the Linux kernel, users can mount a Ceph file system without FUSE. Ceph is designed under the assumption that client nodes are distinct from storage nodes. This is in contrast to Gfarm, which aims to improve performance by using local storage effectively when a client and an I/O server share the same node. Lustre [5] is a distributed file system that consists of a metadata server and object storage targets, which store the contents of files. Object storage targets only provide storage, whereas in Gfarm, I/O servers may provide further functionality. As above, Lustre client nodes are assumed to be distinct from object storage targets.

VI. SUMMARY AND FUTURE WORK

In this paper, we proposed a mechanism for accessing local storage directly from a modified FUSE module and adapted the mechanism to the Gfarm distributed file system. Applications running on an operating system with our modifications installed can read and write a local Gfarm file system without the intervention of a FUSE daemon, resulting in significant performance improvements. In our experiments, we measured a 26-63% increase in read and write throughput and a 50% reduction in the size of the page cache. In future work, we will design and implement a kernel driver that also allows access to remote I/O servers without the intervention of a gfarm2fs daemon, thereby avoiding further unnecessary context switches and memory copies.

REFERENCES

[1] O. Tatebe, K. Hiraga, and N. Soda, "Gfarm Grid File System," New Generation Computing, vol. 28, no. 3, pp. 257–275, 2010.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003, pp. 20–43.
[3] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," in Proceedings of the 26th IEEE Symposium on Massive Storage Systems and Technologies, 2010.
[4] S. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn, "Ceph: A Scalable, High-Performance Distributed File System," in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, 2006, pp. 307–320.
[5] P. Schwan, "Lustre: Building a File System for 1,000-node Clusters," in Proceedings of the 2003 Linux Symposium, 2003.
[6] "GlusterFS," http://www.gluster.org/.
[7] S. Patil and G. Gibson, "Scale and Concurrency of GIGA+: File System Directories with Millions of Files," in Proceedings of the 9th USENIX Conference on File and Storage Technologies, 2011.
[8] "FUSE: Filesystem in Userspace," http://fuse.sourceforge.net/.
[9] J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, and M. Wingate, "PLFS: A Checkpoint Filesystem for Parallel Applications," in Proceedings of SC '09, 2009.
[10] S. Narayan, R. K. Mehta, and J. A. Chandy, "User Space Storage System Stack Modules with File Level Control," in Proceedings of the 12th Annual Linux Symposium in Ottawa, 2010, pp. 189–196.
[11] A. Rajgarhia and A. Gehani, "Performance and Extension of User Space File Systems," in Proceedings of the 2010 ACM Symposium on Applied Computing, 2010, pp. 206–213.
[12] "IOzone Filesystem Benchmark," http://www.iozone.org/.
[13] S. A. Wright, S. D. Hammond, S. J. Pennycook, I. Miller, J. A. Herdman, and S. A. Jarvis, "LDPLFS: Improving I/O Performance without Application Modification," in Proceedings of the 13th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing, 2012, pp. 1352–1359.