Bigfoot-NFS: A Parallel File-Striping NFS Server (Extended Abstract)

Gene H. Kim
University of Arizona
Tucson, AZ 85721
Ronald G. Minnich
Supercomputing Research Center
Bowie, MD 20715-4300
Larry McVoy
Sun Microsystems Computer Corp.
Mountain View, CA 94043-1100

July 21, 1994

Abstract

Bigfoot-NFS allows the transparent use of the aggregate space of multiple NFS file servers as a single file system. By presenting a single apparent file system to the user, Bigfoot-NFS allows the use of available storage without the maintenance overhead of tracking multiple mounted file systems. Unlike most other network parallel file systems, Bigfoot-NFS runs without a central "metadata" server that binds a file to its location. Bigfoot-NFS uses vectored remote procedure calls to get reasonable performance with multiple servers while retaining the simplicity of stateless NFS semantics. To date, we have demonstrated Bigfoot-NFS file systems as large as 30 gigabytes spanning 28 nodes. The server currently runs as a user-level process under SunOS. Because of context-switching overhead, Bigfoot-NFS generally runs 2-3x slower than the kernel-based implementation of NFS. However, measurements show that many file operations retain reasonable performance as the number of servers increases, and demonstrate how parallel file systems can be designed without centralized metadata servers.
1 Introduction

At the Supercomputing Research Center, we have a machine cluster composed of 32 SPARCstation Classics and 16 SPARCstation ELCs, interconnected via a 15-port Kalpana Ethernet crossbar. (This research was supported in part by Sun Microsystems Computer Corporation; SMCC provided the 32 SPARCstation Classics and the Kalpana Ethernet crossbar switch.) Each of the Classics has a 1 gigabyte disk. Although our primary interest in the cluster is the support of computations, the presence of 30 gigabytes of aggregate disk space motivated the building of a cluster file server. The result of this work is Bigfoot-NFS, hereafter referred to as Bigfoot. Bigfoot allows the transparent and convenient use of the aggregate disk space of multiple NFS servers as a single file system.

Bigfoot differs from other parallel network file systems such as Zebra [5] and SWIFT [8, 2] in many ways. Significant design departures include using files as the unit of interleaving and the absence of a centralized metadata server to manage file location bindings; instead, Bigfoot clients resolve a file's location dynamically. Bigfoot uses vectored RPC to achieve reasonable performance with even moderately large numbers of NFS servers (e.g., 32 nodes). Our vectored RPC library allows the complete implementation of Bigfoot as a client program accessing conventional NFS servers. We have demonstrated a 28 node, 30 gigabyte Bigfoot file system — performance measurements of most file operations degrade only slightly as the number of nodes increases. (Because of NFS version 2 semantics, file creation and renaming remain a glaring exception.)

In its current implementation, Bigfoot is a user-level NFS server similar to the Sun automounter [3]. File operation requests to and from the Bigfoot server and the operating system use NFS semantics.
Figure 1: High-level view of NFS vs. Bigfoot operation

Bigfoot receives NFS requests from the kernel, evaluates the requests, and issues modified requests as a client to multiple remote servers. Figure 1 shows a high-level comparison of conventional NFS and Bigfoot operation. Because Bigfoot is not implemented in the kernel, performance suffers from context switches and numerous memory copies. A Bigfoot file system using between one node (equivalent to normal NFS) and four nodes performs 2-3x slower than kernel NFS. File system operations on a sixteen-node Bigfoot file system are 0-30% slower; on a 24-node Bigfoot file system they are 0-50% slower.

The rest of this paper is organized as follows. We first describe how disk space is often underutilized: too small for a real file system, but too large to be used simply as a cache. We then describe key design issues of other network parallel file systems and show how the Bigfoot design differs from them. Next, Bigfoot semantics and implementation are discussed. We present performance numbers, and we close by discussing possible applications of Bigfoot.
2 Motivation

In the full paper, we motivate how aggregating the disk space of multiple servers into a single file system puts to use disks that might otherwise go unused.
3 Background

In this section, we describe key design issues of other parallel network file systems. We discuss why applying RAID techniques to network file systems often will not deliver increased throughput, and examine the problems that centralized metadata servers can present.
3.1 Networked RAID architectures

The problem of managing many disks in parallel is not a new one: it appears on massively parallel machines (e.g., the CM-2 and CM-5 with their parallel disk arrays), on supercomputers (e.g., Crays and the Maxximum Strategies systems), and on networks of machines (e.g., the Zebra file system). However, the most common system in use for managing an array of disks is known as RAID [4]. Several models of RAID exist, but a theme common to all is the interleaved storage system, in which data is stored across a set of storage units. A common motivation for RAID is to increase apparent disk throughput. Moving data to disk is essentially a serial operation; if disks are connected in parallel, entire "words" can be transferred at once, with each disk storing a single bit of the word. (Using such a strategy, Maxximum Strategies achieves data rates of approximately 80 Mbytes/sec.) Many network parallel disk designs have taken their cue from RAID architectures (e.g., SWIFT [8]), with concepts transplanted unchanged from the uniprocessor setting.
Figure 2: Topological layout of a traditional vs. a networked RAID system. In the traditional system, a HIPPI link (800 Mbits/sec) connects the RAID server to the network; in the networked system, each node's SCSI bus (80 Mbits/sec) exceeds the shared network bandwidth (10/100/155 Mbits/sec).

However, these implementations do not show the expected performance improvements because of the differences in disk and network bandwidth. In conventional RAID systems, a very high bandwidth channel exists between the file server and the network (e.g., an 800 Mbit/sec HIPPI connection; see Figure 2). In these configurations, the network bandwidth far exceeds the disk bandwidth. Consequently, to match the network bandwidth, disk I/O must be done in parallel; only then does the aggregate disk I/O become comparable to the network throughput. In a typical network file system, however, the bandwidth of the network is far smaller than the disk I/O bandwidth. In this configuration the network is the bottleneck, and interleaving data to increase disk throughput is no longer necessary.
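A rough back-of-the-envelope comparison makes the point; the figures below are assumptions drawn from Figure 2 and a 10 Mbit/sec Ethernet like ours, not measurements:

```latex
% Assumed link speeds: 10 Mbit/s Ethernet vs. one 80 Mbit/s SCSI bus (Figure 2).
\[
  B_{\mathrm{net}} = 10\ \mathrm{Mbit/s} \approx 1.25\ \mathrm{MB/s}
  \quad\ll\quad
  B_{\mathrm{SCSI}} = 80\ \mathrm{Mbit/s} = 10\ \mathrm{MB/s}
\]
```

Under these assumptions a single server's disk path already saturates the network, so striping a file's bytes or blocks across servers cannot raise the throughput seen by a client.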
3.2 Problems with metadata

A problem that arises when interleaving files across multiple disks is metadata. Because a file or directory no longer resides on one disk, information about a file's location in the set of disks must be stored somewhere. If this metadata and an individual disk become out of sync, data can be lost. While keeping them consistent may not be difficult on a single-processor RAID system, problems arise when the scheme is implemented across a network of machines: messages can be lost, reordered, delayed, and so on. To ensure that all clients have a consistent view of the file system, Zebra uses a centralized metadata server (the file manager) that handles all file location information. A machine accessing a Zebra file first queries the file manager for the location and, having received the actual location of the file, then accesses the specified disk server. Using a centralized metadata server poses some limits to parallelism: the server is a single point of failure, and because all clients must interact with it, it is a possible performance bottleneck. (Client name caching to alleviate this problem is discussed in [5].)
4 Striping in Bigfoot

Rather than interleaving bytes or blocks, Bigfoot uses files as the unit of interleaving (i.e., the entirety of a file resides on one machine). Individual files in a directory may be interleaved across several machines. This allows parallelism in lookups and other directory operations, but file operations such as read and write remain serialized. In this section, we describe how we eliminate metadata by enforcing several invariants. We also discuss some other advantages that the Bigfoot design affords.
4.1 Eliminating metadata

When a file system encompasses only one server, determining a file's location is trivial — it either resides on the given server, or it does not exist. When striping files across multiple hosts, binding a file to its location becomes more
complex. Using centralized metadata servers to provide a consistent file mapping for all clients yields a system far more complex than the stateless server operation afforded by NFS [9]. Bigfoot stripes files across an arbitrary number of NFS servers (called slices), while still preserving all NFS semantics. Provided that three invariants hold, Bigfoot can bind a file to its location without any metadata:
1. Directory trees are replicated on all the slices.
2. Files reside on one slice only.
3. Files must be uniquely named across all slices in any given directory.

In Bigfoot, the entire directory tree structure is stored on every slice, but files are interspersed among the slices. Consider the resulting Bigfoot readdir operation, which is the union of the NFS readdir operations on each of the slices. Since no two slices hold a file of the same name in any directory, no duplicate filenames can appear. Bigfoot lookup operations are done similarly, sending NFS lookup calls to each of the component slices. If the file exists, all but one NFS lookup call will fail; the slice that returns successfully holds the specified file. If the file does not exist, all the NFS lookup calls fail. Because a file resides on either one slice or none, no consistency issues arise.
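As a concrete illustration, here is a minimal sketch in Go of this metadata-free binding, with slices simulated as in-memory sets. It is not the Bigfoot code, and the real system issues these lookups in parallel as vectored NFS RPCs (Section 5) rather than in a loop.

```go
// Sketch only: resolve a file's slice by asking every slice, with no metadata.
package main

import (
	"errors"
	"fmt"
)

// slice simulates one NFS server's view of a single directory.
type slice struct {
	name  string
	files map[string]bool
}

var errNotFound = errors.New("no such file")

// lookup stands in for an NFS lookup call against one slice.
func (s *slice) lookup(file string) error {
	if s.files[file] {
		return nil
	}
	return errNotFound
}

// resolve asks every slice for the file. The invariants guarantee that at
// most one slice answers successfully, so no metadata is needed.
func resolve(slices []*slice, file string) (*slice, error) {
	for _, s := range slices {
		if err := s.lookup(file); err == nil {
			return s, nil
		}
	}
	return nil, errNotFound
}

func main() {
	slices := []*slice{
		{name: "cl3", files: map[string]bool{"paper.tex": true}},
		{name: "cl4", files: map[string]bool{"data.tar": true}},
		{name: "cl5", files: map[string]bool{}},
	}
	if s, err := resolve(slices, "data.tar"); err == nil {
		fmt.Println("data.tar lives on slice", s.name)
	}
	if _, err := resolve(slices, "missing"); err != nil {
		fmt.Println("missing: lookup failed on every slice")
	}
}
```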
4.2 Advantages

In addition to simplicity, the Bigfoot design also affords the following advantages:
- Bigfoot requires no changes to the NFS server; all aspects of Bigfoot operation are implemented on the client.
- Although not encouraged, file systems being exported as Bigfoot slices can continue to be accessed through conventional NFS. (Doing so can destroy Bigfoot invariants, for example by allowing files with the same name to reside in a given directory on more than one slice.)
- The read and write paths are identical to NFS. Consequently, performance for the common case is no worse than NFS.
- NFS servers used as Bigfoot slices store no metadata. If a server is lost, it is sufficient to bring another server online with a restored copy of the old server's disk.
- Any NFS server can be a Bigfoot slice. These servers can continue to be used as conventional NFS servers, even concurrently with Bigfoot operations.
- Bigfoot slices each have a complete and self-contained directory tree. Therefore, backups can be done on all slices simultaneously; given 32 backup devices, a 32-slice, 30 Gbyte file system could be backed up in less than an hour.
4.3 Files as the interleaving unit

The types of files found on our file servers suggest that files are a logical choice for the unit of interleaving. Surveys of file systems, including [1], show that small files appear most frequently. The study in [6] surveyed 6.2 million files and showed that 92% of them were 32 Kbytes or smaller. The full paper motivates why file interleaving is reasonable for the file sizes observed on typical file systems.
5 Vectored RPC

For each file operation, Bigfoot semantics require that RPC transactions with all component slices complete before the Bigfoot reply can be assembled. Conventional, sequentially ordered RPC is clearly unacceptable, since service times would grow linearly with n, the number of Bigfoot slices. Bigfoot solves this performance quandary through the use of vectored RPCs: we replaced the standard Sun RPC libraries with our experimental vectored RPC library, originally developed at Sun Microsystems Computer Corporation by Larry McVoy. We discuss aspects of using vectored RPC in the full paper. For now, it suffices to say that without it, Bigfoot would be impractical.
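The following sketch illustrates the idea behind vectored RPC, with goroutines standing in for the RPC library; the interface is ours and is not the Sun vectored RPC library's API. The point is that n concurrent calls cost roughly one round trip (the slowest server) rather than n of them.

```go
// Sketch of a vectored call: fan one request out to every server and wait
// for all replies (or timeouts) before returning.
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

type reply struct {
	server int
	ok     bool
}

// callAll sends the same request to every server concurrently and collects
// every reply before returning, mirroring Bigfoot's all-slices semantics.
func callAll(n int, call func(server int) bool) []reply {
	replies := make([]reply, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			replies[i] = reply{server: i, ok: call(i)}
		}(i)
	}
	wg.Wait()
	return replies
}

func main() {
	// Simulated per-server service time of 5-15 ms; only "server 7" succeeds.
	fakeRPC := func(server int) bool {
		time.Sleep(time.Duration(5+rand.Intn(10)) * time.Millisecond)
		return server == 7
	}
	start := time.Now()
	for _, r := range callAll(32, fakeRPC) {
		if r.ok {
			fmt.Println("file found on server", r.server)
		}
	}
	fmt.Println("32 vectored calls took", time.Since(start))
}
```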
Figure 3: Vectored RPC scaling with increasing number of hosts (time vs. number of remote servers, 0-40; scalar RPC shown for comparison)
6 Semantics

This section describes the semantics of Bigfoot file operations. Each Bigfoot operation corresponds to a conventional NFS operation. In the full paper, we describe each file operation and how we avoid race conditions and maintain a consistent view of the file system for all clients. In this extended abstract, we simply give an overview and describe the interesting challenges presented by the file creation operation.
6.1 Overview of Bigfoot operations

Bigfoot replicates the entire directory tree on all slices, but each file resides on one slice only. Furthermore, filenames must be unique within a directory (i.e., no two slices may hold a file of the same name in any given directory). In general, when Bigfoot receives a file system request from the operating system, it sends out a vector of NFS requests to the slice servers. Bigfoot collects the replies and, depending on the operation, assembles the appropriate reply to pass back to the operating system. To eliminate ambiguity, the following conventions are used to describe the transactions: a Bigfoot transaction is the local RPC between the operating system and the Bigfoot server (using NFS semantics); the resulting NFS transactions are the vectored RPCs between the Bigfoot server and its component slices. In the simplest case, a Bigfoot call results in n vectored NFS calls, one to each of its slices. However, operations like readdir are more complicated because of implied state, and other operations require further semantics to ensure consistency and eliminate race conditions (not discussed in this abstract).

6.1.1 NFS protocol limitations
Consider how the bf_create operation should work. The Bigfoot server sends out a vector of nfs_create requests. If any fail, then the file already exists; the Bigfoot server cleans up the files that were created and returns a failed create call to the operating system. Any arbitrary method could be used to select which host holds the file, because a vectored lookup will always find it. However, NFS version 2 does not support exclusive creates — any existing file is overwritten without notice. The SunOS kernel handles O_EXCL over NFS by first doing a lookup; if the file does not exist, a create call is sent. Because this sequence is not atomic, it cannot be used as a mechanism for Bigfoot file creation. (Consider the following problem: Client A does the lookup to ensure that the file does not exist. Before Client A can send the create call, Client B does a lookup and then sends a create, which succeeds. The delayed create call from Client A will also succeed. The invariant that a file resides on only one slice has been broken.)
2:33am (descartes) Work/myp 4 % bfdf
Bigfoot-NFS     Kbytes       used    avail       capacity  Mounted on
(numhosts=27)   23,223,186   3,209   23,219,977  0%        /export/home/bigfoot
cl3             860118       474     859644      0%        /export/home/bigfoot
cl4             860118       575     859543      0%        /export/home/bigfoot
cl5             860118       45      860073      0%        /export/home/bigfoot
cl6             860118       1517    858601      0%        /export/home/bigfoot
cl7             860118       35      860083      0%        /export/home/bigfoot
cl8             860118       30      860088      0%        /export/home/bigfoot
 :                :            :       :          :           :
cl30            860118       24      860094      0%        /export/home/bigfoot
Figure 4: Sample output from the bfdf program

Exclusive create calls are supported in NFS version 3. However, we opted to stick with NFS version 2 and use a less scalable workaround: we choose a single host on which to place a file as a function of its name (e.g., we generate an index by summing the character values of the name string), and send a single create packet to that host. Because the location of a file now depends on its name, the meaning of bf_rename changes: bf_rename copies the specified file using nfs_read and nfs_write, since when a file's name changes, so too does the host the file must live on. In short, NFS version 2 limitations precluded elegant implementations of several Bigfoot operations.
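The placement rule and its consequence for renaming can be sketched as follows. The byte-sum hash and the modulus over the slice count are illustrative assumptions; the paper says only that the character values of the name are summed to pick a host.

```go
// Sketch of name-based file placement under the NFS v2 workaround.
package main

import "fmt"

// sliceFor picks the slice that must hold a file of the given name: sum the
// bytes of the name and reduce modulo the number of slices (illustrative).
func sliceFor(name string, numSlices int) int {
	sum := 0
	for _, c := range []byte(name) {
		sum += int(c)
	}
	return sum % numSlices
}

func main() {
	const slices = 28
	oldName, newName := "results.dat", "results-final.dat"
	fmt.Printf("create %q -> slice %d\n", oldName, sliceFor(oldName, slices))
	// A rename changes the name and hence (usually) the owning slice, so the
	// contents must be copied to the new slice with nfs_read/nfs_write.
	fmt.Printf("rename to %q -> slice %d\n", newName, sliceFor(newName, slices))
}
```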
7 Implementation issues

7.1 Supplemental programs

Bigfoot file systems reside beneath the operating system's file abstraction; in the case of SunOS, this is the VFS interface [7]. Consequently, the user notices nothing unusual when accessing Bigfoot-mounted directories: operations like cp and tar behave exactly as they would on a UFS- or NFS-mounted file system. Because of the VFS abstraction in SunOS, no special programs are needed to interface with Bigfoot file systems; with only one exception, conventional file I/O system calls are used. SunOS uses 32-bit unsigned integers to represent the amount of space (in bytes) in a file system, so the statfs system call would overflow the file system space fields, and the df utility that reports free disk space printed meaningless numbers. Figure 4 shows output from bfdf, the Bigfoot equivalent of the df program. (Note the commas bfdf adds when printing the amount of available disk space: interpreting 8-digit numbers, in Kbyte units, is difficult, but such numbers are commonplace in Bigfoot file systems.)
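For illustration, the arithmetic below (ours, using the aggregate size shown in Figure 4) shows why a 32-bit byte count cannot represent a Bigfoot file system:

```go
// Why df's 32-bit space fields overflow on a ~23 GB Bigfoot file system.
package main

import "fmt"

func main() {
	const kbytes = 23_223_186      // aggregate size reported by bfdf in Figure 4
	bytes := uint64(kbytes) * 1024 // roughly 23 GB expressed in bytes
	fmt.Println("actual size in bytes:     ", bytes)
	fmt.Println("fits in 32 bits?          ", bytes <= 1<<32-1)
	fmt.Println("what a 32-bit field holds:", uint32(bytes)) // silently truncated
}
```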
7.2 Filehandle management

Filehandles returned by Bigfoot map internally to a vector of NFS filehandles. Consider what happens during a bf_lookup of the directory entry "..": Bigfoot must save the vector of NFS filehandles to send to the slices, because it may be used as an argument in a future file operation.
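A minimal sketch of one way such a mapping could be kept; the types and table are hypothetical, since the paper does not describe Bigfoot's internal data structures.

```go
// Hypothetical table mapping a Bigfoot filehandle to one handle per slice.
package main

import "fmt"

// nfsFH stands in for an opaque NFS version 2 filehandle (32 bytes).
type nfsFH [32]byte

// bigfootFH is the opaque handle Bigfoot hands to the local kernel.
type bigfootFH uint64

// handleTable remembers, for each Bigfoot filehandle, the per-slice handles
// so that a later operation (e.g., a lookup of "..") can be fanned out again.
type handleTable struct {
	next    bigfootFH
	handles map[bigfootFH][]nfsFH
}

func (t *handleTable) register(perSlice []nfsFH) bigfootFH {
	if t.handles == nil {
		t.handles = make(map[bigfootFH][]nfsFH)
	}
	t.next++
	t.handles[t.next] = perSlice
	return t.next
}

func main() {
	var t handleTable
	dir := t.register(make([]nfsFH, 28)) // one NFS filehandle per slice
	fmt.Printf("Bigfoot handle %d maps to %d slice handles\n", dir, len(t.handles[dir]))
}
```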
7.3 Performance optimizations

In general, Bigfoot semantics send RPC call packets to all the slices for each file operation. However, once a file is located, it is no longer necessary to send NFS requests to the other slices — they are guaranteed to fail because of the one-file-per-directory invariant. Because flow of control is not returned from the RPC layer until all the packets have arrived or have timed out, sending the minimal number of packets required improves performance.
Performance of multi-slice Bigfoot-NFS (times in seconds)

Test                                    SunOS NFS  BNFS(1)  BNFS(4)  BNFS(8)  BNFS(16)  BNFS(24)
/bin/ls 100 files (no caching)               0.06     0.15     0.16     0.22      0.28      0.34
/bin/ls -lF 100 files (no caching)           0.56     1.38     1.53     1.42      2.12      2.33
/bin/ls 600 files (no caching)               0.20     0.77     0.80     0.88      0.97      1.02
/bin/ls -lF 600 files (no caching)           3.23     6.46     6.12     9.33     10.82     13.60
write /vmunix (2.2 MB file)                 14.51    26.61    32.23    35.24     27.51     32.27
touch 100 files                              1.74     2.30     2.25     2.64      3.36      4.45
read 600 empty files and vmunix (tar)       19.84    53.34    56.21    65.24     64.75     92.13
untar 100 files (creating files)             7.99    13.70    13.73    17.40     24.00     30.00
untar 100 files (not creating files)         4.90     8.05     9.57    13.09     18.31     23.15
untar 500 files (creating files)            30.51    73.73    70.82    95.22    120.30    151.90

Table 1: Scaling performance of Bigfoot-NFS vs. kernel SunOS NFS
To reduce the number of RPC call packets, Bigfoot maintains two caches: a "readdir/lookup" cache and a "lookup/read-write" cache. The "readdir/lookup" cache handles the storm of lookup calls that immediately follows readdir operations; the "lookup/read-write" cache eliminates unnecessary calls when doing read and write operations.
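The effect of the "lookup/read-write" cache can be sketched as follows; the structure and names are ours, not Bigfoot's.

```go
// Sketch: once a lookup has bound a file to a slice, later reads and writes
// can go to that single slice instead of being fanned out to all of them.
package main

import "fmt"

// binding records which slice holds a file once a lookup has succeeded.
type binding struct {
	slice int
}

// lookupCache maps a filename to its binding within one directory.
type lookupCache map[string]binding

// target returns the single slice to contact for a file, or -1 meaning
// "unknown: fan the request out to every slice, then cache the answer".
func (c lookupCache) target(name string) int {
	if b, ok := c[name]; ok {
		return b.slice
	}
	return -1
}

func main() {
	c := lookupCache{}
	fmt.Println("first access of foo.c: slice", c.target("foo.c"))  // -1: fan out
	c["foo.c"] = binding{slice: 5}                                  // learned from the lookup replies
	fmt.Println("later accesses of foo.c: slice", c.target("foo.c")) // 5: a single RPC
}
```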
8 Observed performance

Table 1 shows performance numbers for a number of sample user-level applications. The measurements include the kernel-based implementation of NFS, a single-slice Bigfoot file system, and a number of multi-slice Bigfoot file systems. The performance difference between the first two configurations indicates the penalty of a user-space server, and serves as a baseline for the other Bigfoot measurements. The measurements show that many of the Bigfoot operations scale relatively well with an increasing number of slices. For example, the vectored readdir operation for 24 hosts is less than 2x slower than for the single-slice case. Similarly, because binding a file to its location is done only once, read and write throughput does not degrade with an increasing number of slices. File creation remains particularly unsatisfactory because of our workaround. (We discuss this in detail in the full paper.) Implementing Bigfoot using NFS version 3 semantics, which support exclusive creates, would obviate this problem. More measurements will be available in the full paper, using LADDIS-style benchmarks that exercise NFS-specific operations. How these operations scale in Bigfoot is of particular interest to us. [Note to reviewers: I just discovered why file creation numbers are so slow. It's sending out an extra vector lookup every time! How did that get there? I'll have better numbers soon. –ghk]
9 State of Bigfoot and Availability

At the time of this writing, Bigfoot remains in experimental operation. It runs on several production machines and is stable enough for performance measurements. We continue to modify the vectored RPC libraries and the Bigfoot code to improve performance. Our initial plans were to move the entire Bigfoot functionality into the kernel for performance reasons. Source code to Bigfoot is publicly available and freely redistributable; it will be made available via anonymous FTP.
10 Conclusion

RAID-style striping is an attractive enough idea that it has been tried several times for network file servers. We argue that, because the network environment is very different, the byte or block interleaving used by RAID may be inappropriate there. Interleaving itself, however, remains a useful idea. In this paper, we have described a system called Bigfoot that uses file interleaving to provide a 30 Gbyte file server spanning 28 different machines, with modest losses in performance and no loss of convenience. Bigfoot could be used in systems such as the Sun SPARCcluster-1, which consists of four networked SPARCstation-10s: it currently looks like four different file servers, and Bigfoot could make it look like one large server.
References

[1] Mary G. Baker, John H. Hartman, Michael D. Kupfer, Ken W. Shirriff, and John Ousterhout. Measurements of a Distributed File System. In Proc. of the 13th Symposium on Operating Systems Principles, pages 198–212. ACM, 1991.

[2] Luis-Felipe Cabrera and Darrell D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Computing Systems, 4(4), Fall 1991.

[3] Brent Callaghan and Tom Lyon. The Automounter. Technical report, Sun Microsystems, Inc., 1989.

[4] Garth A. Gibson, Lisa Hellerstein, Richard M. Karp, Randy H. Katz, and David A. Patterson. Failure correction techniques for large disk arrays. In Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 123–132, April 1989.

[5] John H. Hartman and John K. Ousterhout. Zebra: A striped network file system. In Proceedings of the USENIX File Systems Workshop, pages 71–78, Ann Arbor, Michigan, May 1992. USENIX.

[6] Gordon Irlam. A static analysis of UNIX file systems circa 1993, October 1993. Personal communication.

[7] S. R. Kleiman. Vnodes: An architecture for multiple file system types in Sun UNIX. In Proc. Summer 1986 USENIX Conf., pages 238–247, Atlanta, GA, June 1986. USENIX.

[8] Bruce R. Montague. The Swift/RAID distributed transaction driver. Technical Report UCSC-CRL-93-99, UC Santa Cruz, January 1993.

[9] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. Design and Implementation of the Sun Network File System. In USENIX Conference Proceedings, pages 119–130, June 1985.