Distributed Access to Parallel File Systems
by Dean Hildebrand
A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in The University of Michigan 2007
Doctoral Committee:
Adjunct Professor Peter Honeyman, Chair
Professor Farnam Jahanian
Professor William R. Martin
Associate Professor Peter M. Chen
Professor Darrell Long, University of California, Santa Cruz
© Dean Hildebrand 2007
All Rights Reserved
To my family… individually they exceed my wildest expectations, together they give me strength to climb the highest mountains.
ACKNOWLEDGEMENTS

This dissertation is a testament to the dedication and abilities of many people. Their support and guidance transformed me into who I am today.

My Ph.D. advisor Peter Honeyman is the main pillar of this dissertation. His insights and criticisms taught me the fundamental elements of successful research. He has a true love of computer science, which transfers to everyone around him. The other members of my committee, Farnam Jahanian, Bill Martin, Pete Chen, and Darrell Long, made many helpful suggestions at my proposal that helped guide the focus of this dissertation.

My work depends heavily upon many bleeding edge technologies, each of which would not exist without the dedication of many brilliant and talented people. Garth Gibson, Lee Ward, and Gary Grider in particular championed pNFS to the wider storage and high-performance community, sparking interest for its continued research. The IETF pNFS working group, with all their bluster, raised many critical requirements and issues. pNFS would still be an unfinished IETF specification without the support and tireless efforts put forth by Marc Eshel at IBM and Rob Ross, Rob Latham, Murali Vilayannur, and the entire PVFS2 development team. In addition, I cannot forget Andy Adamson, Bruce Fields, Trond Myklebust, Olga Kornievskaia, David Richter, Jim Rees, and everyone else at CITI who have transformed Linux NFSv4 into the best distributed file system in existence.

This material is based upon work supported by the Department of Energy under Award Numbers DE-FC02-06ER25766 and B548853, Lawrence Livermore National Laboratory under contract B523296, and by grants from Network Appliance and IBM.

I am eternally thankful for the endless wisdom and knowledge shared by Lee Ward, Gary Grider, and James Nunez. They brought a Canadian to Albuquerque, New Mexico and bestowed motivation for this dissertation. Moreover, the snakes, yucca, caverns, deserts, mountains, and heat of New Mexico bestowed a reason for living.
Ann Arbor could not be a better place; its kind people share a thirst for knowledge and a better world. Within Ann Arbor, late night beers with Jay, rock climbing, WCBN, and the CBC all helped prevent me from burning out years ago. Last but not least, without my family’s continued support, I would have lacked the strength to tackle this phase of my life.
This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF FIGURES
ABSTRACT

CHAPTER
I. Introduction
   1.1. Motivation
   1.2. Thesis statement
   1.3. Overview of dissertation
II. Background
   2.1. Storage infrastructures
   2.2. Remote data access
   2.3. High-performance computing
   2.4. Scaling data access
   2.5. NFS architectures
   2.6. Additional NFS architectures
   2.7. NFSv4 protocol
   2.8. pNFS protocol
III. A Model of Remote Data Access
   3.1. Architecture for remote data access
   3.2. Parallel file system data access architecture
   3.3. NFSv4 data access architecture
   3.4. Remote data access requirements
   3.5. Other general data access architectures
IV. Remote Access to Unmodified Parallel File Systems
   4.1. NFSv4 state maintenance
   4.2. Architecture
   4.3. Fault tolerance
   4.4. Security
   4.5. Evaluation
   4.6. Related work
   4.7. Conclusion
V. Flexible Remote Data Access
   5.1. pNFS architecture
   5.2. Parallel virtual file system version 2
   5.3. pNFS prototype
   5.4. Evaluation
   5.5. Additional pNFS design and implementation issues
   5.6. Related work
   5.7. Conclusion
VI. Large Files, Small Writes, and pNFS
   6.1. Small I/O requests
   6.2. Small writes and pNFS
   6.3. Evaluation
   6.4. Related work
   6.5. Conclusion
VII. Direct Data Access with a Commodity Storage Protocol
   7.1. Commodity high-performance remote data access
   7.2. pNFS and storage protocol-specific layout drivers
   7.3. Direct-pNFS
   7.4. Direct-pNFS prototype
   7.5. Evaluation
   7.6. Related work
   7.7. Conclusion
VIII. Summary and Conclusion
   8.1. Summary and supplemental remarks
   8.2. Supplementary observations
   8.3. Beyond NFSv4
   8.4. Extensions
BIBLIOGRAPHY
LIST OF FIGURES

Figure 2.1: ASCI platform, data storage, and file system architecture
Figure 2.2: The ASCI BlueGene/L hierarchical architecture
Figure 2.3: A typical high-performance application
Figure 2.4: Symmetric and asymmetric out-of-band parallel file systems
Figure 2.5: NFS remote data access
Figure 2.6: NFS with databases
Figure 2.7: NFS exporting symmetric and asymmetric parallel file systems
Figure 3.1: General architecture for remote data access
Figure 3.2: Parallel file system data access architectures
Figure 3.3: NFSv4-PFS data access architecture
Figure 3.4: Swift architecture
Figure 3.5: Reference Model for Open Storage Systems Interconnection
Figure 3.6: General data access architecture view of OSSI model
Figure 4.1: Split-Server NFSv4 data access architecture
Figure 4.2: Split-Server NFSv4 design and process flow
Figure 4.3: Split-Server NFSv4 experimental setup
Figure 4.4: Split-Server NFSv4 aggregate read throughput
Figure 4.5: Split-Server NFSv4 aggregate write throughput
Figure 5.1: pNFS data access architecture
Figure 5.2: pNFS design
Figure 5.3: PVFS2 architecture
Figure 5.4: pNFS prototype architecture
Figure 5.6: Aggregate pNFS write throughput
Figure 5.7: Aggregate pNFS read throughput
Figure 6.1: pNFS small write data access architecture
Figure 6.2: pNFS write threshold
Figure 6.3: Determining the write threshold value
Figure 6.4: Write throughput with threshold for small write requests
Figure 6.5: ATLAS digitization write request size distribution with 500 events
Figure 6.6: ATLAS digitization write throughput for 50 and 500 events
Figure 7.1: pNFS file-based architecture with a parallel file system
Figure 7.2: pNFS file-based data access
Figure 7.3: Direct-pNFS data access architecture
Figure 7.4: Direct-pNFS with a parallel file system
Figure 7.5: Direct-pNFS prototype architecture with the PVFS2 parallel file system
Figure 7.6: Direct-pNFS aggregate write throughput
Figure 7.7: Direct-pNFS aggregate read throughput
Figure 7.8: Direct-pNFS scientific and macro benchmark performance
Figure 9.1: pNFS and inter-cluster data transfers across the WAN
ABSTRACT

Large data stores are pushing the limits of modern technology. Parallel file systems provide high I/O throughput to large data stores, but are limited to particular operating system and hardware platforms, lack seamless integration and modern security features, and suffer from slow offsite performance. Meanwhile, advanced research collaborations are requiring higher bandwidth as well as concurrent and secure access to large datasets across myriad platforms and parallel file systems, forming a schism between file systems and their users.

It is my thesis that a distributed file system can improve I/O throughput to modern parallel file system architectures, achieving new levels of scalability, performance, security, heterogeneity, transparency, and independence.

This dissertation describes and examines prototypes of three data access architectures that use the NFSv4 distributed filing protocol as a foundation for remote data access to parallel file systems while maintaining file system independence. The first architecture, Split-Server NFSv4, targets parallel file system architectures that disallow customization and/or direct storage access. Split-Server NFSv4 distributes I/O across the available parallel file system nodes, offering secure, heterogeneous, and transparent remote data access. While scalable, the Split-Server NFSv4 prototype demonstrates that the absence of direct data access limits I/O throughput.

Remote data access performance can be increased for parallel file system architectures that allow direct data access plus some customization. The second architecture analyzes the pNFS protocol, which uses storage-specific layout drivers to distribute I/O across the bisectional bandwidth of a storage network between filing nodes and storage. Storage-specific layout drivers allow universal storage protocol support and flexible security and data access semantics, but can diminish the level of heterogeneity and transparency. The third architecture, Direct-pNFS, uses a commodity distributed file system for direct access to a parallel file system's storage nodes, bridging the gap between performance and transparency.

The dissertation describes the importance of and necessity for both direct data access architectures, depending on user and system requirements. I analyze prototypes of both direct data access architectures and demonstrate their ability to match and even exceed the performance of the underlying parallel file system.
CHAPTER I

Introduction

Modern research requires local and global access to massive data stores. Parallel file systems, which provide direct and parallel access to storage, are highly specialized, lack seamless integration and modern security features, are often limited to a single operating system and hardware platform, and suffer from slow offsite performance. However, grid computing, legacy software, and other factors are increasing the heterogeneity of clients, creating a schism between file systems and their users.

Distributed filing protocols such as NFS [1] and CIFS [2] are widely used to bridge the interoperability gap between storage systems. Unfortunately, implementations of these protocols deliver only a fraction of the exported storage system's performance. They continue to have limited network, CPU, memory, and disk I/O resources due to their "single server" design, which binds one network endpoint to an entire file system. By continuing to use RPC-based client/server architectures, distributed filing protocols have an entrenched lack of scalability. The NFSv4 protocol [3] improves functionality by providing integrated security and locking facilities, as well as migration and replication features, but continues to use a client/server architecture and consequently retains the single server bottleneck.

Scalable file transfer protocols such as GridFTP [4] are also used to enable high-throughput, operating system independent, and secure WAN access to high-performance file systems. The HTTP protocol is by far the most widespread way to access remote data stores. Both are difficult to integrate with a local file system, and neither provides shared access to a single data copy; instead, a copy is created for each user, which increases the complexity of maintaining single-copy semantics.
1.1. Motivation

Many application domains demonstrate the need for high-bandwidth, concurrent, and secure access to large datasets across a variety of platforms and file systems. DNA sequences, other biometric data, and artwork databases have large data sets, ranging up to tens of gigabytes in size, that are often loaded independently by concurrent clients [5, 6]. Full database scans of huge files are often unavoidable, even when using indexing [7].

The Earth Observing System Data and Information System (EOSDIS) [8, 9] manages data from NASA's earth science research satellites and field measurement programs, providing data archive, distribution, and information management services. As of 2005, EOSDIS had stored more than three petabytes of data while continuing to generate more than seven terabytes per week. In 2004, EOSDIS supported more than 1.9 million unique users and fulfilled more than 36 million product requests [10].

Digital movie studios that generate terabytes of data every day require access from Sun, Windows, SGI, and Linux workstations and compute clusters [11]. Users edit files in place or copy files between heterogeneous data stores.

The scientific computing community connects large computational and data facilities around the globe to perform physical simulations that generate petabytes of data. The Advanced Simulation and Computing (ASC) program in the U.S. Department of Energy estimates that one gigabyte per second of aggregate I/O throughput is necessary for every teraflop of computing power [12], which suggests that file systems will need to support terabyte per second data transfer rates by 2010 [13].
1.2. Thesis statement

To meet this diverse set of data access requirements, I set out to demonstrate the following thesis statement: It is feasible to use a distributed file system to realize the I/O throughput metrics of parallel file system architectures and to achieve different levels of scalability, performance, security, heterogeneity, transparency, and independence.
1.3. Overview of dissertation

To validate this thesis I outline a general architecture for remote data access and use it to describe and validate prototype implementations that use the NFSv4 distributed file system protocol as a foundation for remote data access to modern parallel file system architectures. I demonstrate that specialized reorganization of the general architecture to suit a parallel file system can improve I/O throughput while maintaining operating system, hardware platform, and parallel file system independence. The dissertation is organized as follows.

In Chapter II, I discuss the background of storing and accessing data. This includes a history of remote data access techniques intended to accommodate the spiraling growth in performance requirements. I also give an overview of supercomputing and its I/O requirements, including a description of parallel file system architectures that require high-performance remote data access. Chapter II also details how organizations use NFS with UNIX file systems, parallel file systems, and databases. I discuss the challenge of achieving full utilization of a storage system's available bandwidth with NFS, and discuss NFS architectures that attempt to overcome the single server bottleneck with these data stores. I introduce NFSv4, the emerging Internet standard for distributed filing, and discuss how its stateful server presents a new scalability obstacle. I also introduce pNFS, a high-performance extension of NFSv4 under design by the Internet Engineering Task Force.

Chapter III presents a general architecture for remote data access to parallel file systems. I use this architecture to demonstrate the interactions of the data and metadata subsystems of NFSv4 with modern parallel file system architectures. The major components of the data access architecture are the application interface, the metadata service, the data and control components, and storage.

Chapter IV introduces Split-Server NFSv4, a variant of the general architecture that targets parallel file system architectures that do not admit modification or prohibit direct storage access. Split-Server NFSv4 distributes I/O across parallel file system nodes, offering secure, heterogeneous, and transparent remote data access.
The Split-Server NFSv4 prototype scales I/O throughput linearly with available parallel file system bandwidth, but lacks direct data access, which limits performance. The aggregate I/O performance of the prototype achieves 70% of the maximum read bandwidth and 50% of the maximum write bandwidth.

Empowering clients with the ability to access data directly from storage—eliminating intermediary servers—is key to achieving high performance. The dissertation introduces two more variants of the general architecture tailored for parallel file systems that allow direct data access.

In Chapter V, I analyze the pNFS protocol, which uses storage-specific layout drivers to distribute I/O across the bisectional bandwidth of a storage network between filing nodes and storage. Storage-specific layout drivers allow universal storage protocol support, flexible security, and well-defined data access semantics, but can diminish the level of heterogeneity and transparency offered by distributed file systems. Discussion of a prototype implementation that demonstrates and validates the potential of pNFS concludes this chapter.

Chapter VI demonstrates how pNFS can be engineered to improve the overall write performance of parallel file systems by using direct, parallel I/O for large write requests and a distributed file system for small write requests.

Chapter VII introduces Direct-pNFS, a final variant of the general architecture, which offers high I/O throughput while retaining the security, heterogeneity, and transparency of NFSv4. Direct-pNFS uses a commodity distributed file system for direct access to the storage nodes of a parallel file system, bridging the gap between performance and transparency. Experiments with my Direct-pNFS prototype demonstrate that its I/O throughput matches or outperforms the native parallel file system client across a range of workloads.

Through the understanding and exploration of remote data access architectures, this dissertation helps to bring scalable, secure, heterogeneous, transparent, and file system independent data access to high-performance computing and its large data stores.
CHAPTER II

Background

This chapter presents background information on storing and accessing data. This includes a history of remote data access, a description of modern storage architectures, and techniques invented to accommodate the spiraling growth in performance requirements. I also describe the storage architectures of modern supercomputers and characterize the I/O requirements of supercomputing applications. I conclude the chapter with a discussion of modern NFS infrastructures and techniques used to scale NFS data access.
2.1. Storage infrastructures

In the beginning, access to storage entailed directly attaching disks to a single computer. Even the fastest disks consistently fail to transfer data at the rate offered by their enclosure's interface, e.g., SATA [14], Ultra320 [15], and Fibre Channel [16], leaving the disk itself as the I/O bottleneck. Salem and Garcia-Molina [17] introduced the term disk striping for splitting files across multiple disks to improve I/O performance, a technique well known to the designers of I/O subsystems for early supercomputers [18]. The use of a Redundant Array of Inexpensive Disks (RAID) [19], now prolific, also improves performance and, with some RAID levels, fault tolerance. Although RAID systems can overcome the failure of a single disk, they continue to suffer from the host server's single point of failure and lack of scalability. Storage area network (SAN) and network attached storage (NAS) architectures alleviate these problems by distributing data across multiple storage devices.

A SAN is a dedicated network, e.g., Fibre Channel, in which I/O requests access storage devices directly using a block access protocol such as SCSI [20] or FCP [21]. SAN delivers high-throughput data access, but is expensive, difficult to manage, and lacks data sharing capabilities.

NAS is an IP-based network in which a NAS device—a processor plus disk storage—handles I/O requests by using a file-access protocol such as NFS or CIFS. The NAS device translates file requests into storage device requests. NAS provides ease of management, cost-effectiveness, and data sharing, but introduces performance bottlenecks.

As users demand the performance of SAN with the cost and manageability of NAS, the differences between SAN and NAS are disappearing thanks to storage virtualization, which takes several forms. One form enables direct data access using block protocols on IP-based networks, e.g., iSCSI [22, 23]. Another form connects NAS appliances to both IP-based and private networks. The IP-based network supports data sharing while the private network provides high-throughput data access.
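To make the striping idea above concrete, the following minimal C sketch maps a logical byte offset onto a disk number and an offset within that disk for a simple RAID-0 style layout. The stripe unit size, disk count, and sample offsets are invented for illustration; this is not code from any particular RAID implementation.

#include <stdio.h>
#include <stdint.h>

/* Simple RAID-0 style striping: logical bytes are laid out in fixed-size
 * stripe units distributed round-robin across ndisks disks. */
struct stripe_loc {
    unsigned disk;        /* which disk holds the byte    */
    uint64_t disk_offset; /* byte offset within that disk */
};

static struct stripe_loc map_offset(uint64_t logical, uint64_t unit, unsigned ndisks)
{
    uint64_t unit_index = logical / unit;      /* which stripe unit overall */
    struct stripe_loc loc;
    loc.disk = (unsigned)(unit_index % ndisks);
    /* Full stripes already stacked on this disk, plus the offset within the unit. */
    loc.disk_offset = (unit_index / ndisks) * unit + logical % unit;
    return loc;
}

int main(void)
{
    /* Example: 64 KB stripe unit across 4 disks. */
    uint64_t offsets[] = { 0, 65536, 200000 };
    for (int i = 0; i < 3; i++) {
        struct stripe_loc l = map_offset(offsets[i], 65536, 4);
        printf("logical %llu -> disk %u, offset %llu\n",
               (unsigned long long)offsets[i], l.disk,
               (unsigned long long)l.disk_offset);
    }
    return 0;
}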
2.2. Remote data access

The Internet file transfer protocol, FTP [24], was first developed in 1971 to transfer data in the ARPANET. FTP had three primary objectives: to promote data sharing, to provide storage system and hardware platform independence, and to transfer data reliably and efficiently. The transition from an ARPANET consisting of a few mainframe-based timesharing machines to a global Internet made up of many smaller PCs, each with its own hard drive, introduced a model of computing alien to the FTP design. Many independent name spaces replaced the mainframe's monolithic name space, resulting in multiple copies of shared data. Increased storage requirements, data inconsistencies, and slow networks sparked the creation of distributed file systems. The initial goals of distributed file systems [25] were:

• Efficient remote data access.
• Avoid whole file transfers by transferring only requested data.
• Seamless integration of remote data into a single file system.
• Avoid clients accessing stale data.
• Enable diskless clients.
• Provide file locking.
In 1988, the Portable Operating System Interface (POSIX) [26] standard was defined to promote portability of application programs across UNIX system environments by providing a clear, consistent, and unambiguous interface specification for UNIX-like operating systems. POSIX quickly became synonymous with UNIX semantics, which greatly influenced file system design. POSIX-compliant file systems increase application portability by guaranteeing a specific set of semantics. Unfortunately, these semantics sometimes prove difficult, impossible, or unnecessary to implement. The following sections describe how some file systems choose to support a relaxed version of these semantics.
2.2.1. Distributed filing protocols

This section gives an overview of several successful and innovative distributed filing protocols and distributed file systems. The nomenclature for systems that provide remote data access can be confusing. These systems are known by several terms: file access protocol, distributed filing protocol, filing protocol, distributed file system protocol, or distributed file system. At the core of every client/server architecture is a wire protocol to communicate data between the client and the server. Sometimes the publication and distribution of this protocol is convoluted and limited, but it always exists in some form, although that form might be source code. This dissertation focuses on the Network File System (NFS) protocol, which is distinguished by its precise definition by the IETF, availability of open source implementations, and support on virtually every modern operating system.
2.2.1.1 Newcastle Connection

Newcastle Connection [27], one of the first distributed file systems, is a portable user-level C library that enables data transfer and supports full UNIX semantics. To stitch remote data stores together, a superroot directory contains the local root directory and the host names of all the available remote systems. Newcastle Connection performance is hampered by a lack of data and attribute caching. In addition, it requires programs to be relinked with a new C library that routes system calls between the local and remote file systems. Many modern distributed filing implementations now use a kernel-based client, which increases transparency to programs at the expense of portability. A kernel-based client does not require programs to relink and allows processes to share attribute and data caches. Newcastle Connection is no longer supported.
2.2.1.2 Apollo Domain operating system

Developed for Apollo workstations in the early 1980s, Domain [28-30] is one of the first distributed file systems. It is a peer-based system designed for tightly integrated groups of collaborators. A system object is identified by a tuple consisting of the object's creation time and a unique number, set at the time of manufacture, that identifies the Apollo workstation on which the object was created. Domain stores an object on a single workstation, supports data and attribute caching, and uses a lock manager to maintain consistency. A user logged onto an Apollo workstation has access to all workstations in the work group. Access lists enforce file access permissions. Domain's tight integration with the Apollo hardware and operating system has many benefits but also interferes with adoption on other operating system platforms.
2.2.1.3 LOCUS operating system

Developed at UCLA in the early 1980s, the distributed file system in the LOCUS operating system [31] supports location independence and uses a primary copy replication scheme. It also focuses on improving fault tolerance semantics in comparison with other distributed operating systems. Like Domain, the LOCUS distributed file system's tight integration with the LOCUS operating system and its use of specialized remote operation protocols limit its portability and widespread use.
2.2.1.4 Remote File Sharing file system (RFS)

Developed by AT&T in the mid-1980s for UNIX System V Release 3, the Remote File Sharing (RFS) [32] distributed file system supports full UNIX semantics. A name server advertises available file systems, allowing clients to mount a file system without knowledge of its precise location, using only the file system's identifier. RFS eventually supported client caching, using a stateful server for lock management, although caching is disabled for multiple writers and on all readers when there is a single writer. RFS offers no secure way for clients to authenticate, relying instead on standard UNIX file and directory protection mechanisms. Lack of fault tolerance, sole support for UNIX System V Release 3, and the use of a specialized transport protocol limited the widespread adoption and commercial success of RFS.
2.2.1.5 Network File System (NFS)

The Network File System (NFS) [1] was developed at Sun Microsystems in 1985. Sun Microsystems distinguished NFS from previous distributed file systems by designing a protocol instead of an implementation. The NFS protocol encouraged the development of many implementations by being "agnostic" as to operating system, hardware platform, and underlying file system. This was accomplished by defining a virtual file system (VFS) interface and by virtualizing file system metadata in the form of a Vnode definition [33]. In addition, NFS encapsulated the file system's use of network architectures and transport protocols. Network architecture and transport protocol independence is achieved by using the Open Network Computing Remote Procedure Call (ONC RPC) [34] for communication. By hiding the transport protocol from the application, ONC RPC supports heterogeneous transport protocols.
NFS also uses an External Data Representation (XDR) [35] format to ensure that data is understood by both the sender and recipient.

NFS versions 2 [36] and 3 [37, 38] are stateless, meaning that the NFS server does not maintain state across client requests. This simplifies crash recovery and the protocol itself, but weakens support for UNIX semantics and limits the features the protocol can support. For example, the POSIX "last close" requirement, which mandates that a removed file not be physically removed until all clients have closed the file, is impossible to implement without keeping track of all clients that have the file open. The lack of server state also interferes with client cache consistency. NFS supports close-to-open consistency semantics: a client must flush all data blocks to the server when it closes a file. When the client opens (or re-opens) the file, it checks whether its cached data is out of date and, if necessary, retrieves the latest copy from the server. Many implementations put a timeout of three seconds on cached file blocks and a timeout of thirty seconds on cached directory attributes. This creates a trade-off between performance and data integrity, with performance heavily favored. An additional protocol, the Network Lock Manager, was later created to isolate the inherently stateful aspects of file locking.

A supporting MOUNT protocol performs the operating system-specific functions that allow clients to attach remote file systems to a node within the local file system. The mount process also allows the server to grant remote access privileges to a restricted set of clients via export control. An Automounter mechanism [39] can be used to enable read-only replication by allowing a client to evaluate NFS mount points and choose the best server at the time data is requested.

As with most distributed file systems at that time, security in NFS depends on standard UNIX file protection mechanisms. The UNIX security model trusts the identity of users without authentication. This security model is sometimes feasible within small organizations, but it is definitely not sufficient in larger organizations or across the Internet.
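The following self-contained C sketch models the close-to-open consistency logic described above. The in-memory "server" structure and change counter stand in for the GETATTR and WRITE RPCs of a real NFS implementation; it illustrates the consistency check only, not any actual client code.

/* Minimal model of NFS close-to-open consistency (assumed simplifications). */
#include <stdbool.h>
#include <stdio.h>

struct server_file  { long change_time; };                       /* server-side change attribute */
struct client_cache { long cached_change; bool valid, dirty; };  /* per-client cached state      */

static void client_close(struct server_file *srv, struct client_cache *c)
{
    if (c->dirty) {                 /* flush dirty blocks before close returns */
        srv->change_time++;         /* the write bumps the server's change attribute */
        c->dirty = false;
        c->cached_change = srv->change_time;
    }
}

static void client_open(struct server_file *srv, struct client_cache *c)
{
    /* Revalidate on (re)open: if the server's change attribute differs from
     * the one recorded when the cache was filled, discard the cached data. */
    if (!c->valid || c->cached_change != srv->change_time) {
        c->valid = true;
        c->cached_change = srv->change_time;
        printf("cache invalidated; fetching fresh data\n");
    } else {
        printf("cache still valid; serving reads locally\n");
    }
}

int main(void)
{
    struct server_file f = { .change_time = 1 };
    struct client_cache a = { 0 }, b = { 0 };

    client_open(&f, &a);        /* client A caches the file                    */
    client_open(&f, &b);        /* client B caches the file                    */
    b.dirty = true;             /* B writes to its cached copy                 */
    client_close(&f, &b);       /* close-to-open: B's writes reach the server  */
    client_open(&f, &a);        /* A re-opens and detects the change           */
    return 0;
}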
2.2.1.6 Andrew File System (AFS) and Coda

Andrew [40] is a distributed computing environment developed at Carnegie Mellon University in 1983 on the 4.3 BSD version of UNIX. Andrew features a distributed file system called Vice, later renamed the Andrew File System (AFS), which became a commercial product in 1989. An early goal was scalability, targeting support for up to 10,000 clients with approximately 200 clients for every server, an order of magnitude improvement over NFS in the ratio of clients to servers. Disks are divided into partitions, and partitions into volumes. AFS can automatically migrate or replicate heavily used volumes across server nodes to balance load. Only one mutable copy of a replica exists, with all updates forwarded to it. Special servers maintain a fully replicated location database that maps volumes to home servers, enabling full location transparency. All clients see a shared namespace under /afs, which contains links to groups (or cells) of AFS servers.

AFS caches large chunks of files on local disk; early versions cached entire files. If a server allows a client to cache data, the server returns a promise that it will inform (or "call back") the client if the data is modified. Note that since clients flush data on file close or an explicit call to fsync, the server discovers changes only after the client has already modified its cached version. Servers, which are stateful, may revoke the promise, i.e., issue a callback, at any time and for any reason (principally memory exhaustion). Callbacks give AFS a scalable way to achieve the close-to-open semantics of NFS by enabling aggressive client caching, which reduces client/server communication.

NFS and most other distributed file systems in the 1980s were targeted for use by a small collection of trusted workstations. The large number of AFS clients breaks this model, requiring a stronger security mechanism. AFS therefore abjures UNIX file protection semantics, instead requiring users to obtain Kerberos [41] tokens that map onto access control lists, which control access at the granularity of a directory. Kerberos is a network authentication protocol that provides strong authentication for client/server applications by using secret-key cryptography. AFS uses a secure RPC called Rx for all communication. Some work has begun to create an implementation of AFS that provides remote access to existing data stores, although it appears such a system does not yet exist.

Vice, a predecessor of AFS, forms the basis of the highly available Coda file system [42], which adds support for mutable server replication and disconnected operation. Coda has three client access strategies: read-one-data, read data from a single preferred server; read-all-status, obtain version and status information from all servers; and write-all, write updates to all available servers. Clients can continue to work with cached copies of files when disconnected, with updates propagated to the servers when the client is reconnected. If servers contain different versions of the same file, stale replicas are asynchronously refreshed. If conflicting versions exist, user intervention is usually required.
2.2.1.7 DCE/DFS

The Open Software Foundation uses AFS [40] as the basis for the DEcorum file system (DCE/DFS) [43], a major component of its Distributed Computing Environment (DCE). It redesigned the AFS server with an extended virtual file system interface, called VFS+, enabling it to support a range of underlying file systems. A specialized underlying file system, Episode [44], supports data replication, data migration, and access control lists (ACLs), which specify the users that can access a file system resource. DFS supports single-copy consistency semantics, ensuring that clients see the latest changes to a file. A token manager running on each server manages consistency by returning various types of tokens to clients, e.g., read and write tokens, open tokens, and file attribute tokens. A server can prohibit multiple clients from modifying cached copies of the same file. Leases [45] are placed on the tokens to let the server revoke tokens that are not renewed by the client within a lease period, allowing quick recovery from a failed client holding exclusive-access tokens. A recovered server enters a grace period—lasting for a few minutes—which allows clients to detect server failure and reacquire tokens.
2.2.1.8 AppleTalk

Developed by Apple Computer in the early 1980s, the AppleTalk protocol suite [46] facilitated file transfer, printer sharing, and mail service among Apple systems. Built from the ground up, AppleTalk managed every layer of the OSI reference model [47]. AppleTalk currently includes a set of protocols to work with existing data link protocols such as Ethernet, Token Ring, and FDDI. The AppleTalk Filing Protocol (AFP) allows Macintosh clients to access remote files in the same manner as local files. AFP uses several other protocols in the AppleTalk protocol suite, including the AppleTalk Session Protocol, the AppleTalk Transaction Protocol, and the AppleTalk Echo Protocol. The Mac OS continues to use AFP as a primary file sharing protocol, but Mac support for NFS is growing.
2.2.1.9 Common Internet File System

The Server Message Block (SMB) protocol, now known as the Common Internet File System (CIFS) [2], was created for PCs in the 1980s by IBM and later extended by 3COM, Intel, and Microsoft. SMB was designed to provide remote access to the DOS/FAT file system, but NTFS now forms the basis for CIFS. CIFS uses NetBIOS (Network Basic Input Output System) sessions, a session management layer originally designed to operate over a proprietary transport protocol (NetBEUI) but now operating over TCP/IP and UDP [48, 49]. Once a session ends, the server may close all open files. Server failure results in the loss of all server state, including all open files and current file offsets.

CIFS has a unique cache consistency model that uses opportunistic locks (oplocks). On file open, a client specifies the access it requires (read, write, or both) and the access to deny others, and in return receives an oplock (if caching is available). There are three types of oplocks: exclusive, level II, and batch. The first client to open a file receives an exclusive oplock. The server disables caching if two clients request write access to the same file, forcing both clients to write through the server. When a client requests read access to a file under an exclusive oplock, the server disables caching on the requesting client and downgrades the writing client's cache to read-only (level II). Batch oplocks allow a client to retain an oplock across multiple file opens and closes. Other features of CIFS include request batching and the use of the Microsoft DFS facility to stitch together several file servers into a single namespace. Security is handled with password authentication on the server.

Although CIFS is exclusively controlled by Microsoft, Samba [50] is a suite of open source programs that provides file and print services to PC clients using the CIFS protocol.
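The oplock behavior described above can be summarized by a small decision function. The sketch below is a simplified model of that behavior, not the actual CIFS oplock state machine; cases the text does not specify are handled with assumed defaults.

/* Simplified model of oplock granting (assumptions noted; not real CIFS code). */
#include <stdio.h>

enum oplock { OPLOCK_NONE, OPLOCK_LEVEL2, OPLOCK_EXCLUSIVE };

struct open_state {
    int num_opens;
    int has_writer;             /* some existing open requested write access   */
    enum oplock holder_oplock;  /* oplock currently held by the first opener   */
};

/* Returns the oplock granted to a new opener and may downgrade the holder. */
static enum oplock grant_oplock(struct open_state *s, int wants_write)
{
    enum oplock granted;
    if (s->num_opens == 0) {
        granted = OPLOCK_EXCLUSIVE;          /* first opener caches freely        */
    } else if (wants_write && s->has_writer) {
        s->holder_oplock = OPLOCK_NONE;      /* two writers: both write through   */
        granted = OPLOCK_NONE;
    } else if (s->holder_oplock == OPLOCK_EXCLUSIVE) {
        s->holder_oplock = OPLOCK_LEVEL2;    /* writer's cache becomes read-only  */
        granted = OPLOCK_NONE;               /* new reader gets no caching        */
    } else {
        granted = OPLOCK_LEVEL2;             /* assumed default for other cases   */
    }
    s->num_opens++;
    s->has_writer = s->has_writer || wants_write;
    if (s->num_opens == 1) s->holder_oplock = granted;
    return granted;
}

int main(void)
{
    struct open_state s = { 0 };
    printf("first open (write): %d\n", grant_oplock(&s, 1));   /* exclusive */
    printf("second open (read): %d\n", grant_oplock(&s, 0));   /* none; holder -> level II */
    return 0;
}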
2.2.1.10 Sprite operating system

The Sprite operating system [51, 52] was developed at the University of California at Berkeley for networked, diskless, and large main memory workstations. The Sprite distributed file system supports full UNIX semantics. To resolve a file location, each client maintains a file prefix table, which caches the association of file paths and their home server. Caching of file prefixes reduces recursive lookup traffic as a client walks through the directory tree, but the broadcast mechanism used to distribute file prefix information limits Sprite to LAN environments. Sprite supports single-copy semantics by tracking open files on clients and whether clients are reading or writing, allowing only a single writer or multiple readers to cache file data. The server uses callbacks to invalidate client caches when conflicting open requests occur. Sprite uses a write-back cache, flushing dirty blocks to the server after thirty seconds and writing them to disk within another thirty seconds.

This caching model was integrated into Spritely NFS [53] at the cost of file open performance and a more complicated server recovery model, which stores the list of clients that have opened a file on disk. Not Quite NFS (NQNFS) [54] avoids storing state on disk and the use of open and close commands by using a lease mechanism on the client data cache. This avoids introducing additional server recovery semantics to NFS by simply allowing leases to expire before a failed server recovers.
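A minimal C sketch of the prefix-table lookup described above follows. The table contents and server names are invented, and a real implementation would also match on path-component boundaries and fall back to a broadcast when no prefix matches.

/* Sprite-style prefix table: resolve a path to its home server by finding
 * the longest registered prefix that matches (illustrative data only). */
#include <stdio.h>
#include <string.h>

struct prefix_entry { const char *prefix; const char *server; };

static const struct prefix_entry table[] = {
    { "/",          "serverA" },
    { "/users",     "serverB" },
    { "/users/src", "serverC" },
};

static const char *resolve(const char *path)
{
    const char *best = "unknown";
    size_t best_len = 0;
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        size_t len = strlen(table[i].prefix);
        if (strncmp(path, table[i].prefix, len) == 0 && len >= best_len) {
            best = table[i].server;   /* longest matching prefix wins */
            best_len = len;
        }
    }
    return best;   /* in Sprite, a miss triggers a broadcast to locate the server */
}

int main(void)
{
    printf("/users/src/main.c -> %s\n", resolve("/users/src/main.c"));
    printf("/tmp/scratch      -> %s\n", resolve("/tmp/scratch"));
    return 0;
}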
2.3. High-performance computing
2.3.1. Supercomputers

This section gives an overview of modern supercomputers from a data perspective. Figure 2.1 displays a typical supercomputer data access architecture, consisting of a primary I/O (storage) network that connects compute nodes, login/development nodes, visualization facilities, archival facilities, and remote compute and data facilities.

Figure 2.1: ASCI platform, data storage, and file system architecture (from ASCI Technology Prospectus, July 2001).

The Thunderbird cluster at Sandia National Laboratories, currently the largest PC cluster in the world, achieves 38 teraflops. Thunderbird consists of 4,096 dual-processor nodes using InfiniBand for inter-node communication and Gigabit Ethernet for storage access. This architecture supports direct storage access for all nodes.

ASCI Purple has 1,536 8-way nodes (12,288 CPUs) and achieves 75 teraflops [55]. ASCI Purple is a hybrid of commodity and specialized components. Custom-designed compute nodes communicate via the IBM Federation interconnect, which has a peak bidirectional bandwidth of 8 GB/s and a latency of 4.4 µs; the system uses a commodity InfiniBand storage network for data access. To meet I/O throughput requirements, 128 nodes are designated I/O nodes to act as a bridge between the Federation and InfiniBand interconnects. Compute nodes trap I/O calls and automatically re-execute them on the I/O nodes, with the results shipped back to the originating compute node. I/O nodes also handle process authentication, accounting, and authorization on behalf of the compute nodes.

Figure 2.2: The ASCI BlueGene/L hierarchical architecture (from www.llnl.gov/asc/computing_resources/bluegenel/configuration.html).

Figure 2.2 describes the hierarchical architecture of the IBM BlueGene/L hybrid machine [56, 57]. A full BlueGene/L system has 65,536 dual-processor compute nodes, orders of magnitude more than contemporary systems such as ASCI White [58], Earth Simulator [59], and ASCI Red [60]. BlueGene/L has one I/O node for every sixty-four compute nodes, although this ratio is configurable. Currently, each node has a small amount of memory, limiting it to highly parallelizable applications.
2.3.2. High-performance computing applications

The design of modern distributed file system architectures derives mainly from several workload characterization studies [40, 61-64] of UNIX users and their applications. Applications that use thousands of processors have entirely different workloads. This section gives an overview of HPC applications and their I/O requirements.

1. Load initialization data
2. Begin loop
3. BARRIER
4. Compute results for current time step
5. Distribute ghost cells to dependent nodes
6. If required, checkpoint data to storage
7. Advance time forward
8. End loop
9. Write result

Figure 2.3: A typical high-performance application. Pseudo-code of a typical high-performance computing application that each node executes in parallel.
High-performance applications aim to use every available processor. Each node calculates the results on a piece of the analysis domain, with the results from every node being combined at some later point. The example parallel application in Figure 2.3 executes on all cluster nodes in parallel. A physical simulation, e.g., propagation of a wave, moves forward in time. The unit of time depends on the required resolution of the result. In step 1, nodes load their inputs individually or designate one node to load the data and distribute it among the nodes. At the beginning of each computation (step 3), nodes synchronize among themselves. After a node completes its computation for a time step, step 5 communicates dependent data, a ghost cell, to domain neighbors for the next time step. Step 6 checkpoints (writes) the data to guard against system or application failures. The next section discusses checkpointing in more detail. In step 9, all nodes write their results directly to storage or have a single node gather the results and write the combined result to storage. Flash [65] is an example of the former; mpiBLAST [66] exemplifies the latter.
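To make the control flow of Figure 2.3 concrete, the following C/MPI sketch implements the same loop for a one-dimensional domain. The computation, the single ghost-cell exchange, the checkpoint interval, and the file names are illustrative choices, not taken from any application discussed in this dissertation.

/* Hedged sketch of the Figure 2.3 loop; compile with mpicc, run with mpirun. */
#include <mpi.h>
#include <stdio.h>

#define NSTEPS        100
#define CKPT_INTERVAL 10
#define NCELLS        1024

int main(int argc, char **argv)
{
    int rank, size;
    double cells[NCELLS + 2];                 /* local domain plus two ghost cells */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < NCELLS + 2; i++)      /* step 1: load initialization data */
        cells[i] = 0.0;

    for (int step = 0; step < NSTEPS; step++) {            /* steps 2-8 */
        MPI_Barrier(MPI_COMM_WORLD);                       /* step 3: BARRIER */

        for (int i = 1; i <= NCELLS; i++)                  /* step 4: compute */
            cells[i] += 1.0;

        /* Step 5: exchange one ghost cell around a ring of neighbors. */
        int right = (rank + 1) % size, left = (rank - 1 + size) % size;
        MPI_Sendrecv(&cells[NCELLS], 1, MPI_DOUBLE, right, 0,
                     &cells[0],      1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (step % CKPT_INTERVAL == 0) {                   /* step 6: checkpoint */
            char name[64];
            snprintf(name, sizeof name, "ckpt-step%d-rank%d.dat", step, rank);
            FILE *f = fopen(name, "wb");
            if (f) { fwrite(cells + 1, sizeof(double), NCELLS, f); fclose(f); }
        }
    }

    /* Step 9: write the final result (here, one file per rank). */
    char name[64];
    snprintf(name, sizeof name, "result-rank%d.dat", rank);
    FILE *f = fopen(name, "wb");
    if (f) { fwrite(cells + 1, sizeof(double), NCELLS, f); fclose(f); }

    MPI_Finalize();
    return 0;
}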
2.3.2.1 I/O

Miller and Katz [67] divided high-performance application I/O into three categories: required, data staging, and checkpoint. Required I/O consists of loading initialization data and storing the final results. Data staging supports applications whose data do not fit in main memory. These data can be stored in virtual memory automatically by the operating system, or manually written to disk using out-of-core techniques for supercomputers that do not support virtual memory. Checkpoint I/O, also known as defensive I/O, stores intermediate results to prevent data loss. Checkpointing results after every computation increases application runtime to an unacceptable extent, but deferring checkpoint I/O too long risks unacceptable data loss. The time between checkpoints depends on a system's mean time between hardware failures (MTBHF). Failure rates vary widely across systems, depending mostly on the number of processors and the type and intensity of application workloads [68].

Applications write checkpoint and result files in different ways. The three most common methods [69] are:

1. Single file per process/node. Each node or process creates a unique file. This method is clumsy since an application must use the same number of nodes to restart computation after failure or interruption. In addition, the files must later be integrated or mined by special tools and post-processors to generate a final result.

2. Small number of files. An application can write a smaller number of files than processes/nodes in the computation by performing some integration work at each checkpoint (and for the final result). This method allows an application to restart computation on a different number of processes/nodes. However, special processing and tools are still required.

3. Single file. An application integrates all information into a single file. This method allows applications to restart computation easily on any number of processes/nodes and obviates the need for special post-processing and tools. (A sketch of this method appears at the end of this subsection.)

The overall performance of a supercomputer depends not only on its raw computational speed but also on job scheduling efficiency, reboot and recovery times (including checkpoint and restart times), and the level of process management. The Effective System Performance (ESP) [70] of a supercomputer is a measure of its total system utilization in a real-world operational environment. ESP is calculated by measuring the time it takes to run a fixed number of parallel jobs through the batch scheduler of a supercomputer. The U.S. Department of Energy procures large systems with an ESP of at least seventy percent [12]. Defensive I/O is very time consuming and decreases the ESP of a supercomputer. Some systems report that up to seventy-five percent of I/O is defensive, with only twenty-five percent being productive I/O, consisting of visualization dumps, diagnostic physics data, traces of key physics variables over time, etc. To ensure defensive I/O does not reduce the ESP of a machine below seventy percent, Sandia National Laboratories uses a rule of thumb requiring a throughput of one gigabyte per second for every teraflop of computing power [12].
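As an illustration of the third checkpoint method above, the following C/MPI-IO sketch has every rank write its block of a checkpoint into one shared file at the offset rank x block size. The file name and data sizes are invented for the example; because the result is a single flat file, it can be re-partitioned on restart across any number of ranks.

/* Hedged sketch: single shared checkpoint file written collectively. */
#include <mpi.h>

#define NCELLS 1024

int main(int argc, char **argv)
{
    int rank;
    double cells[NCELLS];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < NCELLS; i++) cells[i] = (double)rank;   /* placeholder data */

    /* All ranks open the same file; the collective write lets the MPI-IO
     * layer (e.g., ROMIO) optimize the aggregate access pattern. */
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    offset = (MPI_Offset)rank * NCELLS * sizeof(double);
    MPI_File_write_at_all(fh, offset, cells, NCELLS, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}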
2.3.2.2 I/O workload characterizations

Workload characterization studies of supercomputing applications on vector computers in the early 1990s [67, 71] found that I/O is cyclic, predictable, and bursty, with file access sizes relatively constant within each application. Files are read from beginning to end, benefiting from increased read-ahead. Buffered writes on the client were found to be of little benefit due to the large amount of data being written and the small data cache typical of vector machines.

The CHARISMA study [72-74] instrumented I/O libraries to characterize the I/O access patterns of distributed memory computers in the mid-1990s. The main difference from the vector machine study is the large number of small I/O requests. CHARISMA found that approximately 90% of file accesses are small, but approximately 90% of data is transferred in large requests. The large number of small write requests, and the short interval between them, indicates that buffered writes benefit performance in some instances. Small requests often result from partitioning a data set across many processors but may also be inherent in some applications. Write requests dominate read requests in the applications studied by CHARISMA.

Write data sharing between clients is infrequent, since it is rarely useful to re-write the same byte. Among read-only files, approximately 24% are replicated on each client and 64% experience false sharing. A data set can be divided into a series of fixed-size data regions, or data blocks. With false sharing, nodes share data blocks but do not access the same byte ranges within the data blocks. This can occur when a data set is divided among clients according to block divisions instead of the requested byte ranges, creating unnecessary data contention. Read-write files exhibit very little byte sharing between clients due to the difficulty of maintaining consistency, but also experience false sharing. Separate jobs never share files. Clients interleave access to a single file, creating high inter-process spatial locality on the I/O nodes and benefiting from I/O server data caching. Strided data access usually uses standard UNIX I/O operations, with application developers citing a lack of portability of the available and much more efficient strided interfaces provided by parallel file systems. The rapid pace of technological change means that applications generally outlast their targeted platform and must be portable to newer machines.

The Scalable I/O initiative [75-77] in the mid-1990s instrumented applications to characterize I/O access patterns on distributed memory machines. The findings are similar to CHARISMA, but emphasize the inefficiency of the UNIX I/O interface with different file sizes and with spatial and temporal data access within a file. For example, one sample application generates most of its I/O by seeking through a file. The study led to improved application performance by suggesting file access hints, access pattern information passed from the application to the file system, to improve interaction with the I/O nodes. The Scalable I/O initiative also found that using a single, shared file for storing information was still common and slow.

In 2004, a group at the University of California, Santa Cruz, analyzed applications on a large Linux cluster [78]. Like the findings from a decade earlier, this study found that I/O is bursty, most requests consist of small data transfers, and most data is transferred in a few large requests. It is common for a master node to collect results from other nodes and write them to storage using many small requests. Each client reads back the data in large chunks. In addition, use of a single file is still common, and accessing that file—even with modern parallel file systems—is slower than accessing separate files by a factor of five.
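A tiny worked example of the false sharing just described, assuming a 64 KB block size and made-up byte ranges: two clients request disjoint bytes that nevertheless fall in the same block, so both contend for that block.

/* False sharing illustration (block size and ranges invented for the example). */
#include <stdio.h>

#define BLOCK_SIZE 65536UL   /* 64 KB data blocks */

static unsigned long block_of(unsigned long byte) { return byte / BLOCK_SIZE; }

int main(void)
{
    /* Client A reads bytes [0, 40000); client B reads bytes [40000, 80000). */
    unsigned long a_last = 39999, b_first = 40000;
    if (block_of(a_last) == block_of(b_first))
        printf("block %lu is falsely shared by clients A and B\n", block_of(a_last));
    return 0;
}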
2.4. Scaling data access This section discusses common techniques for increasing file system I/O throughput.
2.4.1. Standard parallel application interfaces A parallel application’s lifespan may be ten years or longer, and it may run on several different supercomputer architectures. A standard communication interface ensures the portability of applications. Three communication libraries dominate the development and execution of supercomputer applications. MPI (Message Passing Interface) [79] is an API for inter-node communication. MPI-2 [80] includes a new interface for data access named MPI-IO, which defines standard I/O interfaces and standard data access hints. MPICH2 [81] is an open-source implementation of MPI that includes an MPI-IO framework named ROMIO [82]. Specialized implementations of MPI-IO also exist [83]. ROMIO improves single client performance by increasing client request size through a technique called data sieving [84]. With data sieving, noncontiguous data requests are converted into a single, large contiguous request consisting of the byte range from the first to the last requested byte. Data sieving then places the requested data portions in the user’s buffer. Access to large chunks of data usually outperforms access to smaller chunks, but data sieving also increases the amount of transferred data and the number of read-modify-write sequences. MPI-IO can also improve I/O performance from multiple clients using collective I/O, which can exist at the disk level (disk-directed I/O [85]), at the server level (server-directed I/O [86]), or at the client level (two-phase I/O [87]). In disk-directed I/O, I/O nodes use disk layout information to optimize the client I/O requests. In server-directed I/O, a master client communicates the in-memory and on-disk distributions of the arrays to a master I/O node prior to client data access. The master I/O node then shares the information with other I/O nodes. This communication allows clients and I/O nodes to coordinate and improve access to logically sequential regions of files, not just physically sequential regions. Finally, ROMIO uses two-phase I/O, which organizes noncontiguous client requests into contiguous requests. For example, two-phase I/O converts interleaved
client read requests from a single file into contiguous read requests by having clients read contiguous data regions, re-distributing the data to the appropriate clients. Writing a file is similar, except clients first distribute data among themselves so that clients can write contiguous sections of the file. Accessing data in large chunks outweighs the increased inter-process communication cost for data redistribution. Other APIs for inter-process communication include OpenMP [88] and High Performance Fortran (HPF) [89], but they do not include an I/O interface. OpenMP divides computations among processors in a shared memory computer. Many applications use a mixture of OpenMP and MPI, using MPI between nodes and OpenMP to improve performance on a single SMP node. HPF first appeared in 1993. HPF-2, released in 1997, includes support for data distribution, data and task parallelism, data mapping, external language support, and asynchronous I/O, but has limited use due to its restriction to programs written in Fortran.
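As a concrete illustration of collective I/O, the sketch below has each MPI process write an interleaved, strided slice of a shared file with a single collective MPI-IO call, giving an implementation such as ROMIO the opportunity to apply two-phase I/O. The file name and sizes are arbitrary choices for the example, not values taken from any system described here.

    /* A minimal sketch of interleaved collective writes with MPI-IO.
     * Compile with an MPI C compiler (e.g., mpicc). */
    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK   1024   /* contiguous doubles a process owns per stripe */
    #define NBLOCKS 64     /* stripes per process                          */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Strided file view: this process owns every nprocs-th block. */
        MPI_Datatype filetype;
        MPI_Type_vector(NBLOCKS, BLOCK, BLOCK * nprocs, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        double *buf = malloc(NBLOCKS * BLOCK * sizeof(double));
        for (int i = 0; i < NBLOCKS * BLOCK; i++)
            buf[i] = rank;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* Displacement shifts each rank to its first block of the file. */
        MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK * sizeof(double),
                          MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

        /* Collective write: the library may merge the interleaved
         * per-process requests into large contiguous file accesses. */
        MPI_File_write_all(fh, buf, NBLOCKS * BLOCK, MPI_DOUBLE,
                           MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        free(buf);
        MPI_Finalize();
        return 0;
    }

Because every process calls MPI_File_write_all together, the library sees the global access pattern and can redistribute data among processes before issuing a few large, contiguous writes.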
2.4.2. Parallel file systems Early high-performance computing systems connected monolithic computers to monolithic storage systems. The emergence of low cost clusters broke this model by creating many more producers and consumers of data than the memory, CPU, and network interface of a single file server could manage. Initial efforts to increase performance included client caches, buffered writes on the client (Sprite, AFS), and write gathering on the server [90], but they did not address the single server bottleneck. Many system administrators manually stitch together several file servers to create a larger and more scalable namespace. This has implications for administration costs, backup creation, load balancing, and quota management. In addition, re-organizing the namespace to meet increased demand is visible to users. A more transparent way to combine file servers into a single namespace is to forward requests between file servers [91]. Clients access a single server; requests for files not stored on that server are forwarded to the file’s home server. Unfortunately, servers are still potential bottlenecks since a directory or file is still bound to a single server. Furthermore, data may now travel through two servers.
(a) Symmetric (b) Asymmetric Figure 2.4: Symmetric and asymmetric out-of-band parallel file systems (PFS).
Another method aggregates all storage behind file servers using a storage network [92]. This allows a client to access any server, with servers acting as intermediaries between clients and storage. This is an in-band solution since control and data both traverse the same path. This design still requires clients to send all requests for a file to a single server and can require an expensive storage network. Out-of-band solutions (OOB), which separate control and data message paths [93-96], currently offer the best I/O performance in a LAN. They enable direct and parallel access to storage from multiple endpoints. OOB parallel file systems stripe files across available storage nodes, increasing the aggregate I/O throughput by distributing the I/O across the bisectional bandwidth of the storage network between clients and storage. This technique can reduce the likelihood of any one storage node becoming a bottleneck and offers scalable access to a single file. A single network connects clients, metadata servers, and storage, with client-to-client communication occurring over this network or on an optional host network. Out-of-band separation of data and control paths has been advocated for decades [93, 94] because it allows an architecture to improve data transfer and control messages separately. Note that inter-dependencies between data and control may restrict unbounded, individual improvement. Symmetric OOB parallel file systems, depicted in Figure 2.4a, combine clients and metadata servers into single, fully capable servers, with metadata distributed among them. Locks can be distributed or centralized. Maintaining consistent metadata information among an increasing number of servers limits scalability. These systems generally require a SAN.
Examples include GPFS [97], GFS [98, 99], OCFS2 [100], and
PolyServe Matrix Server [101].
Asymmetric OOB parallel file systems, depicted in Figure 2.4b, divide nodes into clients and metadata servers. To perform I/O, clients first obtain a file layout map describing the placement of data in storage from a metadata server. Clients then use the file layout map to access data directly and in parallel. These systems allow data to be accessed at the block, object, or file level. Block-based systems access disks directly using the SCSI protocol via Fibre Channel or iSCSI. Object- and file-based systems have the potential to improve scalability with a smaller file layout map, shifting the responsibility of knowing the exact location of every block from clients to storage. Examples of block-based systems include IBM TotalStorage SAN FS (also known as Storage Tank) [102] and EMC HighRoad [103]. Examples of object-based systems include Lustre [104] and Panasas ActiveScale [105]. Examples of file-based systems include Swift [96, 106] and PVFS [107].
2.4.2.1 Parallel file systems and POSIX The POSIX API and semantics impede efficient sharing of a single file’s address space in a cluster of computers. The POSIX I/O programming model is a single system in which processes employ fast synchronization and communication primitives to resolve data access conflicts. Due to an increase in synchronization and communication time, this model is invalid for multiple clients accessing data in a parallel file system. Several semantics exemplify the mismatch between POSIX semantics and parallel file systems.
• Single process/node. The POSIX API forces every application node to execute the same operation when one node could perform operations such as name resolution on behalf of all nodes in the distributed application.
• Time stamp freshness. POSIX mandates that file metadata is kept current. Each I/O operation on the storage nodes alters a time stamp, with single-second resolution, which must be propagated to the metadata server. Examples include modification time and access time.
• File size freshness. As part of file metadata, POSIX mandates that the size of a file is kept current. Clients, storage nodes, and the metadata server must all coordinate to determine the current size of a file while it is being updated and extended.
• Last writer wins. POSIX mandates that when two or more processes write to a file at the same location, the file contains the last data written. While a parallel file system can implement this requirement on each individual storage node, it is hard to implement for write requests that span multiple storage nodes.
• Data visibility. POSIX mandates that modified data is visible to all processes immediately after a write operation. With each client maintaining a separate data cache, satisfying this requirement is tricky.
(a) Simple NFS remote data access (b) NFS namespace partitioning Figure 2.5: NFS remote data access.
The high-performance community is proposing extensions to the POSIX I/O API to address the needs of the growing high-end computing sector [69]. These extensions leverage the intensely cooperative nature of high-performance applications, which are capable of arranging data access to avoid file address space conflicts.
2.5. NFS architectures Figure 2.5a displays the NFS architecture, in which trusted clients access a single disk on a single server. Depending on the required performance, reliability, etc. of an NFS installation, each component of this model can be realized in different ways using a wide range of technologies.
• NFS clients. NFS client implementations exist for nearly every operating system. Clients can be diskless and can contain multiple network interfaces (NICs) for increased bandwidth.
• Host network. IP-based networks use TCP, UDP, or SCTP [108]. Remote Direct Memory Access (RDMA) support is currently under investigation [109].
• NFS server. The NFS server provides file, metadata, and lock services to the NFSv4 client. The VFS/Vnode interface translates client requests to requests to the underlying file system.
• Storage and storage network. NFS supports almost any modern storage system. SCSI and SATA command sets are common with directly connected disks. Hardware RAID systems are also common, and software RAID systems are emerging as a cheaper (yet slower) alternative. iSCSI is emerging as a ubiquitous storage protocol for IP-based networks. FCIP enables communication between Fibre Channel SANs by tunneling FCP, the protocol for Fibre Channel networks, over IP. iFCP, on the other hand, allows Fibre Channel devices to connect directly to IP-based networks by replacing the Fibre Channel transport with TCP/IP.
Figure 2.6: NFS with databases.
NFS works well with small groups, but is limited in every aspect of its design: consumption of compute cycles, memory, bandwidth, storage capacity, etc. To scale up the number of clients, Figure 2.5b shows how many enterprises, such as universities and large organizations, partition a file system among several NFS servers. For example, partitioning student home directories among many servers spreads the load among the servers. The Automounter automatically mounts file systems as clients access different parts of the namespace [110]. High-demand read-only data may be replicated similarly, with the Automounter automatically mounting the closest replica server. One problem with the Automounter is a noticeable delay as users navigate into mounted-on-demand directories. Other disadvantages of this approach include administration costs, backup management, load balancing, visible namespace reorganization, and quota management. Despite these problems, many organizations use this technique to provide access to very large data stores.
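For illustration, the namespace partitioning of Figure 2.5b is commonly expressed as automounter maps; the sketch below shows a hypothetical autofs configuration that spreads home directories across several NFS servers. The host names, paths, and mount options are assumptions for the example, not taken from any installation described here.

    # /etc/auto.master: delegate /home to an indirect map
    /home   /etc/auto.home

    # /etc/auto.home: partition home directories across NFS servers
    alice   -rw,hard,intr   nfs1.example.edu:/export/home/alice
    bob     -rw,hard,intr   nfs2.example.edu:/export/home/bob
    *       -rw,hard        nfs3.example.edu:/export/home/&

Each directory is mounted on first access, which is the source of the navigation delay noted above; the wildcard entry gives remaining users a default server without enumerating them individually.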
Database deployments are emerging as another environment for NFS. Figure 2.6 illustrates database clients using a local disk with the database server storing object and log files in NFS. Database systems manage caches on their own and depend on synchronous writes; therefore, these NFS installations disable client caching and asynchronous writes. To improve the performance of synchronous writes, some NFS hardware vendors (such as Network Appliance) write to NVRAM synchronously, and then asynchronously write this data to disk. It is common for database applications to have many servers accessing a single file at the same time. CIFS is unsuitable for this type of database deployment due to its lack of appropriate lock semantics, a specific write block size, and appropriate commit semantics. The original goals of distributed file systems were to provide distributed access to local file systems, but NFS is now widely used to provide distributed access to other network-based file systems. Although parallel file systems already have remote data access capability, many lack heterogeneous clients, a strong security model, and satisfactory WAN performance. Figure 2.7 illustrates standard NFSv4 clients accessing symmetric and asymmetric out-of-band parallel file systems. The NFSv4 server accesses a single parallel file system client and translates all NFS requests to parallel file system specific operations. Symmetric OOB file systems are often limited to a small number of nodes due to the high cost of a SAN. NFS can increase the number of clients accessing a symmetric OOB file system by attaching additional NFSv3 clients to each node in Figure 2.7a.
2.6. Additional NFS architectures This section examines NFS architecture variants that attempt to scale one or more aspects of NFS. These architectures transform NFS into a type of parallel file system, increasing scalability but eliminating the file system independence of NFS.
(a) Symmetric (b) Asymmetric Figure 2.7: NFS exporting symmetric and asymmetric parallel file systems (PFS).
2.6.1. NFS-based asymmetric parallel file system Many examples exist of systems that use the NFS protocol to create an asymmetric parallel file system. In these systems, a directory service (metadata manager) typically manages the namespace while files are striped across NFS data servers. NFS clients use the directory service to retrieve file metadata and file location information. Data servers store file segments (stripes) in a local file system such as Ext3 [111] or ReiserFS [112]. Several directory service strategies have been suggested, each offering different advantages. Explicit metadata node. This strategy uses an NFS server as the metadata node that manages the file system. Clients access the metadata node to retrieve a list of data servers and layout information describing how files are striped across them. Clients maintain data consistency by applying advisory locks to the metadata node. Unmodified NFS servers are used for storage. Support for mandatory locks or access control lists requires a communication channel to coordinate state information among the metadata nodes and data servers. Store metadata information in files. The Expand file system [113] stores file metadata and location information in regular files in the file system. Clients determine the NFS data server and pathname of the metadata file by hashing the file pathname. To perform I/O, a client opens the metadata file for a particular file to retrieve data access information. Expand uses unmodified NFS servers but extends NFS clients to locate, parse, interpret, and use file layout information. A major problem with file hashing based on pathname is that renaming files requires migrating metadata between data servers. In addition,
Expand uses data server names with the metadata file to describe file striping information, which complicates incremental expansion of data servers. Directory service elimination. The Bigfoot-NFS file system [114] eliminates the directory service outright, instead requiring clients to gather file information by analyzing the file system. Any given file is stored on a single unmodified NFS server. Clients discover which data server stores a file by requesting file information from all servers. Clients ignore failed responses and use the data server that returned a successful response to access the file. Bigfoot-NFS reduces file discovery time by parallelizing NFS client requests. The lack of a metadata service simplifies failure recovery, but the inability to stripe files across multiple data servers and the increased network traffic limit the I/O bandwidth to a given file.
2.6.2. NFS client request forwarding The nfsp file system [91] architecture contains clients, data servers, and a metadata node. Unmodified NFS clients mount and issue file metadata and I/O requests to the metadata node. Each file is stored on a single NFS data server. Metadata nodes forward client I/O requests to the data server containing the file, which replies directly to the client by spoofing the metadata node’s IP address. The inability to stripe files across multiple data servers and the forwarding of all I/O requests through a single metadata node limit the I/O bandwidth to a given file.
2.6.3. NFS-based peer-to-peer file system The Kosha file system [115] uses the NFS protocol to create a peer-to-peer file system. Kosha taps into available storage space on client nodes by placing an NFS client and server on each node. To perform I/O, unmodified NFS clients mount and access their local NFS server (through the loopback network device). An NFS server routes local client requests to the remote NFS server containing the requested file. An NFS server determines the correct NFS data server by hashing the file pathname. Each file is stored on a single NFS data server, which limits the I/O bandwidth to a given file.
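Pathname hashing of this kind, used by both Expand and Kosha to place metadata or files, can be sketched in a few lines of C; the hash function and server-table layout below are illustrative assumptions, not the algorithms those systems actually use.

    /* A minimal sketch: map a file pathname to one of n_servers data servers. */
    #include <stddef.h>

    static unsigned long hash_path(const char *path)
    {
        unsigned long h = 5381;                 /* djb2-style string hash */
        for (; *path != '\0'; path++)
            h = h * 33 + (unsigned char)*path;
        return h;
    }

    static size_t pick_server(const char *path, size_t n_servers)
    {
        return hash_path(path) % n_servers;     /* index into a server table */
    }

Any scheme keyed on the full pathname shares Expand’s rename problem noted above: renaming a file changes its hash and therefore forces metadata (or data) migration between servers.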
2.6.4. NFS request routing The Slice file system prototype [116] divides NFS requests into three classes: large I/O, small-file I/O, and namespace. A µProxy, interposed between clients and servers, routes NFS client requests between storage servers, small-file servers, and directory servers. Large I/O flows directly to storage servers while small-file servers aggregate I/O operations of small files and the initial segments of large files. (Chapter VI investigates the small I/O problem in more depth, demonstrating that small I/O requests are not limited to small files.) Slice introduces two policies for transparent scaling of the name space among the directory servers. The first method uses a directory as the unit of distribution. This works well when the number of active directories is large relative to the number of directory servers, but it binds large directories to a single server. The second method uses a file pathname as the unit of distribution. This balances request distributions independent of workload by distributing them probabilistically, but increases the cost and complexity of coordination among directory servers.
2.7. NFSv4 protocol NFSv4 extends versions 2 and 3 with the following features:
• Fully integrated security. NFSv4 offers authentication, integrity, and privacy by mandating support of RPCSEC_GSS [117], an API that allows a variety of security mechanisms to be used by the RPC layer. NFSv4 requires support of the LIPKEY [118] and SPKM-3 [118] public key mechanisms and the Kerberos V5 symmetric key mechanism [119]. NFSv4 also supports security flavors other than RPCSEC_GSS, such as AUTH_NONE, AUTH_SYS, and AUTH_DH. AUTH_NONE provides no authentication. AUTH_SYS provides UNIX-style authentication. AUTH_DH provides DES-encrypted authentication based on a network-wide string name, with session keys exchanged via the Diffie-Hellman public key scheme. The requirement of support for a base set of security protocols is a departure from earlier NFS versions, which left data privacy and integrity support as implementation details.
• Compound requests. Operation bundling, a feature supported in CIFS, allows clients to combine multiple operations into a single RPC request. This feature reduces the number of round trips between the client and the server needed to accomplish a job, e.g., opening a file, and simplifies the specification of the protocol.
• Incremental protocol extensions. NFSv4 allows extensions that do not compromise backward compatibility through a series of minor versions.
• Stateful server. The introduction of OPEN and CLOSE commands creates a stateful server. This allows enhancements such as mandatory locking and server callbacks and opens the door to consistent client caching. See Section 2.7.1.
• Root file handles. NFSv4 does not use a separate mount protocol to provide the initial mapping between a path name and file handle. Instead, a client uses a root file handle and navigates through the file system from there.
• New attribute types. NFSv4 supports three new types of attributes: mandatory, recommended, and named. NFSv4 also supports access control lists. This attribute model is extensible in that new attributes can be introduced in minor revisions of the protocol.
• Internationalization. NFSv4 encodes file and directory names with UTF-8 to accommodate international character sets.
• File system migration and replication. The fs_locations attribute provides for file system migration and replication.
• Cross-platform interoperability. NFSv4 enhances interoperability with the introduction of recommended and named attributes, and by mandating support for TCP and Windows share reservations.
2.7.1. Stateful server: the new NFSv4 scalability hurdle The broadest architectural change for NFSv4 is the introduction of a stateful server to support exclusive opens called share reservations, mandatory locking, and file delegations. This change significantly increases the complexity of the protocol, its implementations, and most notably its fault tolerance semantics.
In addition, a single, shared data store can no longer be exported by multiple NFSv4 servers without a mechanism for maintaining global state consistency among the servers. A share reservation controls access to a file, based on the CIFS oplocks model [2]. A client issuing an OPEN operation to a server specifies both the type of access required (read, write, or both) and the types of access to deny others (deny none, deny read, deny write, or deny both). The NFSv4 server maintains access/deny state to ensure that future OPEN requests do not conflict with current share reservations. NFSv4 also supports mandatory and advisory byte-range locks. An NFSv4 server maintains information about clients and their currently open files, and can therefore safely pass control of a file to the first client that opens it. A delegation grants a client exclusive responsibility for consistent access to the file, allowing client processes to acquire file locks without server communication. Delegations come in two flavors. A read delegation guarantees the client that no other client has the ability to write to the file. A write delegation guarantees the client that no other client has read or write access to the file. If another client opens the file, it breaks these conditions, so the server must revoke the delegation by way of a callback. The server places a lease on all state, e.g., client connection information, file and byte-range locks, and delegations. If the server does not hear from a given client within the lease period, the server is permitted to discard all of the client’s associated state. A failed server that recovers enters a grace period, lasting up to a few minutes, which allows clients to detect the server failure and reacquire their previously acquired locks and delegations. The NFSv4 caching model combines the efficient support for close-to-open semantics of AFS with the block-based caching and easier recovery semantics provided by DCE/DFS, Sprite, and Not Quite NFS. As with NFSv3, clients cache attribute and directory information for a duration determined by the client. However, a client holding a delegation is assured that the cached data for that file is consistent. Proposed extensions to NFSv4 include directory delegations, which grant clients exclusive responsibility for consistent access to directory contents, and sessions, which provide exactly-once semantics, multipathing and trunking of transport connections, RDMA support, and enhanced security.
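The lease bookkeeping described above amounts to a simple timer per client; the sketch below shows one way a server might test for lease expiry. The structure and function names are assumptions made for illustration, not the Linux NFSv4 server’s actual data structures.

    #include <stdbool.h>
    #include <time.h>

    /* Per-client lease record: renewed by RENEW or by any stateful request. */
    struct client_lease {
        time_t       last_renewal;   /* time of the last renewal            */
        unsigned int lease_time;     /* server's lease period, in seconds   */
    };

    /* The server may discard a client's opens, locks, and delegations only
     * after the lease has gone unrenewed for a full lease period. */
    static bool lease_expired(const struct client_lease *l, time_t now)
    {
        return (now - l->last_renewal) > (time_t)l->lease_time;
    }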
2.8. pNFS protocol To meet enterprise and grand challenge-scale performance and interoperability requirements, the University of Michigan hosted a workshop on December 4, 2003, titled “NFS Extensions for Parallel Storage (NEPS)” [120]. This workshop raised awareness of the need for increasing the scalability of NFSv4 [121] and created a set of requirements and design considerations [122]. The result is the pNFS protocol [123], which promises file access scalability as well as operating system and storage system independence. pNFS separates the control and data flows of NFSv4, allowing data to transfer in parallel from many clients to many storage endpoints. This removes the single server bottleneck by distributing I/O across the bisectional bandwidth of the storage network between the clients and storage devices. The goals of pNFS are to:
• Enable implementations to match or exceed the performance of the underlying file system.
• Provide high per-file, per-directory, and per-file system bandwidth and capacity.
• Support any storage protocol, including (but not limited to) block-, object-, and file-based storage protocols.
• Obey NFSv4 minor versioning rules, which require that all future versions have legacy support.
• Support existing storage protocols and infrastructures, e.g., SBC on Fibre Channel [16] and iSCSI, OSD on Fibre Channel and iSCSI, and NFSv4.
For a file system to realize scalable data access, it must be able to achieve performance gains proportional to the amount of additional hardware. For example, if physical disk access is the I/O bottleneck, a truly scalable file system can realize benefits from increasing the number of disks. The cycle of identifying bottlenecks, removing them, and increasing performance is endless. pNFS provides a framework for continuous saturation of system resources by separating the data and control flows and by not specifying a data flow protocol. Focusing on the control flow and leaving the details of the data flow to implementers allows continuous I/O throughput improvements without protocol modification.
Implementers are free to use the best storage protocol and data access strategy for their system. pNFS extensions to NFSv4 focus on device discovery and file layout management. Device discovery informs clients of available storage devices. A file layout consists of all information required by a client to access a byte range of a file. For example, a layout for the block-based Fibre Channel Protocol may contain information about block size, offset of the first block on each storage device, and an array of tuples that contains device identifiers, block numbers, and block counts. To ensure the consistency of the file layout and the data it describes, pNFS includes operations that synchronize the file layout among the pNFS server and its clients. To ensure heterogeneous storage protocol support and unlimited data layout strategies, the file layout is opaque in the protocol. pNFS does not address client caching or coherency of data stored in separate client caches. Rather, it assumes that existing NFSv4 cache-coherency mechanisms suffice. Separating the control and data paths in pNFS introduces new security concerns. Although RPCSEC_GSS continues to secure the NFSv4 control path, securing the data path may require additional effort. pNFS does not define a new security architecture but discusses general security considerations. For example, certain storage protocols cannot provide protection against eavesdropping. Environments that require confidentiality must either isolate the communication channel or use standard NFSv4. In addition, pNFS does not define mechanisms to recover from errors along the data path, but leaves their definition to the supporting data access protocols instead.
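As a purely illustrative rendering of the block-based layout described earlier in this section, a client-side representation might look like the following C sketch; the names and field widths are assumptions, not the pNFS wire format.

    #include <stdint.h>

    /* One extent of a striped file on a block storage device. */
    struct block_extent {
        uint64_t device_id;      /* which storage device holds the extent */
        uint64_t first_block;    /* starting block number on that device  */
        uint64_t block_count;    /* number of consecutive blocks          */
    };

    /* A block-based file layout: block size, per-device starting offset,
     * and an array of (device, block number, count) tuples. */
    struct block_layout {
        uint32_t block_size;             /* bytes per block                       */
        uint64_t initial_offset;         /* offset of the first block on a device */
        uint32_t num_extents;
        struct block_extent *extents;
    };

Because the protocol treats the layout as opaque, an object- or file-based layout can substitute an entirely different structure without any change to the pNFS operations that fetch and return it.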
CHAPTER III A Model of Remote Data Access Modern data infrastructures are complex, consisting of numerous hardware and software components. Posing a simple question such as, “What is the size of a file?” may spark a flurry of network traffic and disk accesses, as each component gathers its portion of the answer. A necessary condition for improving data access is a clear picture of a system’s control and data flows. This picture includes the components that generate and receive requests, the number of components that generate and receive requests, the timing and sequence of requests, and the traversal path of the requests and responses. Successful application of this knowledge helps to identify bottlenecks and inefficiencies that fetter the scalability of the system. To clarify the novel contributions of the NFSv4 and parallel file system architecture variants discussed in the subsequent chapters, this chapter presents a coarse architecture that identifies the components and communication channels for remote data access. I use this architecture to illustrate the performance bottlenecks of using NFSv4 with parallel file systems. Finally, I detail the principal requirements for accessing remote data stores.
3.1. Architecture for remote data access This section describes an architecture that encapsulates the components and communication channels of data access. Shown in Figure 3.1, remote data access consists of five major components: application, data, control, metadata, and storage. In addition to the flow of data, remote data access also consists of five major control paths that manage data integrity and facilitate data access. These components and communication channels form a coarse architecture for describing remote data access, which I use to describe and analyze the remote data access architectures in subsequent chapters.
Figure 3.1: General architecture for remote data access. Application components generate and analyze data. Data and control components fulfill application requests by accessing storage and metadata components. Metadata components describe and control access to storage. Storage is the persistent repository for data. Directional arrows originate at the node that initiated the communication.
I use circles, squares, pentagons, hexagons, and disks to represent the application, data, control, metadata, and storage components of the data access architecture, respectively. This dissertation applies the architecture at the granularity of a file system, providing a clear picture of file system interactions. A file system contains each of the five components, although some are “virtual”, and comprise other components. In addition, a single machine may assume the roles of multiple components, which I portray by adjoining components. The following provides a detailed description of each component:
1. Application. Generate file system, file, and data requests. Typically, these are nodes running applications.
2. Data. Fulfill application component I/O requests through communication with storage. These support a specific storage protocol.
3. Control. Fulfill application component metadata requests through communication with metadata components. These support a specific metadata protocol.
4. Metadata. Describe and control access to storage, e.g., file and directory location information, access control, and data consistency mechanisms. Examples include an NFS server and parallel file system metadata nodes.
5. Storage. Persistent repository for data, e.g., a Fibre Channel disk array or nodes with a directly attached disk.
With data components in the middle, application and storage components bookend the flow of data. Control components support a metadata protocol for communication with file system metadata component(s). These components connect to one or more networks, each supporting different types of traffic. A storage network, e.g., Fibre Channel, Infiniband, Ethernet, or SCSI, connects data and storage components. A host network is IP-based and uses metadata components to facilitate data sharing. Independent control flows request, control, and manage different types of information. The different types of control flows are as follows:
1. Control ↔ Metadata. To satisfy application component metadata requests, control components retrieve file and directory information, lock file system objects, and authenticate. To ensure data consistency, metadata components update or revoke file system resources on clients.
2. Control ↔ Control. This flow coordinates data access, e.g., collective I/O.
3. Metadata ↔ Metadata. Systems with multiple metadata nodes use this flow to maintain metadata consistency and to balance load.
4. Metadata ↔ Storage. This flow manages storage, synchronizing file and directory metadata information as well as access control information. It can also facilitate recovery.
5. Storage ↔ Storage. This flow facilitates data redistribution and migration.
(a) Symmetric (b) Asymmetric Figure 3.2: Parallel file system data access architectures. (a) A symmetric parallel file system has data, control, metadata, and application components all on the same machine; storage consists of storage devices accessed via a block-based protocol. (b) An asymmetric parallel file system has data, control, and application components on the same machine, metadata components on separate machines; storage consists of storage devices accessed via a block-, file-, or object-based protocol.
3.2. Parallel file system data access architecture Figure 3.2a depicts a symmetric parallel file system with data, control, metadata, and application components all residing on the same machine. Storage consists of storage devices accessed via a block-based protocol, e.g., GPFS, or GFS. Figure 3.2b shows an asymmetric parallel file system with application, control, and data components residing on the same machine, metadata components on separate machines. Storage consists of storage devices accessed via a block-, file-, or object-based protocol, e.g., Lustre, or PVFS2. Storage for block-based systems consists of a disk array while storage for object- and file-based systems consists of a fully functional node formatted with a local file system such as Ext3 or XFS [124].
3.3. NFSv4 data access architecture Viewing the NFSv4 and parallel file system architectures as a single, integrated architecture allows the identification of performance bottlenecks and opens the door for devising mitigation strategies. Figure 3.3 displays my base instantiation of the general architecture, NFSv4 exporting a parallel file system. NFSv4 application metadata requests are fulfilled by an NFSv4 control component that communicates with an NFSv4 metadata component on the NFSv4 server, which in turn uses a PFS control component to communicate with the PFS metadata component. NFSv4 application data requests are fulfilled by an NFSv4 data component that proxies requests through a PFS data component, which in turn communicates directly with storage.
Figure 3.3: NFSv4-PFS data access architecture. With NFSv4 exporting a parallel file system, NFSv4 application metadata requests are fulfilled by an NFSv4 control component that communicates with the NFSv4 metadata component, which in turn uses a PFS control component to communicate with the PFS metadata component. NFSv4 application data requests are fulfilled by an NFSv4 data component that proxies requests through a PFS data component, which in turn communicates directly with storage. The NFSv4 storage component is the entire PFS architecture. The PFS application component is the entire NFSv4 architecture. With a symmetric parallel file system, the PFS metadata and data components are coupled.
The NFSv4 virtual storage component is the entire PFS architecture. The PFS virtual application component is the entire NFSv4 architecture. Figure 3.3 does not display the virtual components. Figure 3.3 readily illustrates the NFSv4 “single server” bottleneck discussed in Section 2.4.2.1. Data requests from every NFSv4 client must fit through a single NFSv4 server using a single PFS data component. Subsequent chapters vary this architecture to attain different levels of scalability, performance, security, heterogeneity, transparency, and independence.
3.4. Remote data access requirements The utility of each remote data access architecture variant presented in this dissertation derives from several data access requirements. These requirements fall into the following categories:
• I/O workload. A data access architecture must deliver satisfactory performance. An application’s I/O workload determines the type of performance required, e.g., single and multiple client I/O throughput, small I/O requests, file creation and metadata management, etc.
• Security and access control. Many high-end computers run applications that deal with both private and public information. Systems must be able to handle both types of applications, ensuring that sensitive data is separate and secure. Security can be realized through air gaps, encryption, node fencing, and numerous other methods. In addition, cross-realm access control to encourage research collaboration must be transparent and foolproof.
• Wide area networks. Beyond heightened security and access control requirements, successful global collaborations require high performance, heterogeneous, and transparent access to data, independent of the underlying storage system.
• Local area networks. Performance and scalability are key requirements of applications designed to run in LAN environments. Heterogeneous data access and storage system independence are also becoming increasingly important. For example, in many multimedia studios, designers using PCs and large UNIX rendering clusters access multiple on- and off-site storage systems [11].
• Development and management. With today’s increasing reliance on middleware applications, reducing development and administrator training costs and problem determination time is vital.
3.5. Other general data access architectures
3.5.1. Swift architecture The Swift parallel file system was an early pioneer in achieving scalable I/O throughput using distributed disk striping [96]. Figure 3.4 displays the Swift architecture components. Swift did not define a specific architecture, but instead listed four optional components. In general, client components perform I/O by using a distribution agent component to retrieve a transfer plan from a storage mediator component and transfer data to/from storage agent components. The original Swift prototype used a standard transfer plan, obviating the need for storage mediators.
Figure 3.4: Swift architecture [96]. The Swift architecture consists of four components: clients, distribution agents, a storage mediator, and storage agents. Clients perform I/O by using a distribution agent to retrieve a transfer plan from a storage mediator and transfer data to/from storage agents.
Swift architecture components map almost one-to-one with the general architecture components introduced in this chapter. Swift storage mediators function as metadata components, Swift storage agents function as storage components, and Swift clients function as application components. The architecture presented here splits the role of a Swift distribution agent into control and data components. Separating I/O and metadata requests into separate components lets us represent out-of-band systems that use different protocols for each channel. In addition, applying the architecture components iteratively and including all communication channels provides a holistic view of remote data access.
3.5.2. Mass storage system reference model In the late 1980s, the IEEE Computer Society Mass Storage Systems and Technology Technical Committee attempted to organize the evolving storage industry by creating a Mass Storage System Reference Model [93, 94], now referred to as the IEEE Reference Model for Open Storage Systems Interconnection (OSSI model) [125]. Shown in Figure 3.5, its goal is to provide a framework for the coordination of standards development for storage systems interconnection and a common perspective for existing standards. One system—perhaps the only one—based directly on the OSSI model is the High Performance Storage System (HPSS) [126]. The OSSI model decomposes a complete storage system into the following storage modules, which are defined by several IEEE P1244 standards documents:
Figure 3.5: Reference Model for Open Storage Systems Interconnection (OSSI) [125]. The OSSI model diagram displays the software design relationships between the primary modules in a mass storage system to facilitate standards development for storage systems interconnection.
• Application Environment Profile. The environmental software interfaces required by open storage system services.
• Object Identifiers. The format and algorithms used to generate globally unique and immutable identifiers for every element within an open storage system.
• Physical Volume Library. The software interfaces for services that manage removable media cartridges and their optimization.
• Physical Volume Repository. The human and software interfaces for services that stow cartridges and mount these cartridges onto devices, employing either robotic or human transfer agents.
• Data Mover. The software interfaces for services that transfer data between two endpoints.
• Storage System Management. A framework for consistent and portable services to monitor and control storage system resources as motivated by site-specified storage management policies.
• Virtual Storage Service. The software interfaces to access and organize persistent storage.
The goals of the architecture presented in this chapter complement those of the OSSI model. Figure 3.6 demonstrates how my architecture encompasses the OSSI modules and protocols. The IEEE developed the OSSI model to expose areas where standards are necessary (or need improvement), so they could be implemented and turned into commercial products.
Figure 3.6: General data access architecture view of OSSI model. The OSSI modules in the general data access architecture. The Virtual Storage Service uses Data Movers to route data between application and storage components. Control components use the Physical Volume Library, Physical Volume Repository, and Virtual Storage Service to mount and obtain file metadata information. Metadata components use the Storage System Management protocol to manage storage.
The OSSI model does not capture the physical nodes or data and control flows in a data architecture, but rather the design relationships between components. For example, Figure 3.5 displays data movers and clients as separate objects connected with a request flow, a representation more in line with modern software design techniques than physical implementation. The architecture presented in this chapter focuses on identifying potential bottlenecks by grouping a node’s components and identifying the data and control flows that bind the nodes.
CHAPTER IV Remote Access to Unmodified Parallel File Systems Collaborations such as TeraGrid [127] allow global access to massive data sets in a nearly seamless environment distributed across several sites. Data access transparency allows users to seamlessly access data from multiple sites using a common set of tools and semantics. The degree of transparency between sites can determine the success of these collaborations. Factors affecting data access transparency include latency, bandwidth, security, and software interoperability. To improve performance and transparency at each site, the use of parallel file systems is on the rise, allowing applications high-performance access to a large data store using a single set of semantics. Parallel file systems can adapt to spiraling storage needs and reduce management costs by aggregating all available storage into a single framework. Unfortunately, parallel file systems are highly specialized, lack seamless integration and modern security features, are often limited to a single operating system and hardware platform, and suffer from slow offsite performance. In addition, many parallel file systems are proprietary, which makes it almost impossible to add extensions for a user’s specific needs or environment. NFSv4 allows researchers access to remote files and databases using the same programs and procedures that they use to access local files, as well as obviating the need to create and update local copies of a data set manually. To meet quality of service requirements across metropolitan and wide-area networks, NFSv4 may need to use all available bandwidth provided by the parallel file system. In addition, NFSv4 must be able to provide parallel access to a single file from large numbers of clients, a common requirement of scientific applications. This chapter discusses the challenge of achieving full utilization of an unmodified storage system’s available bandwidth while retaining the security, consistency, and heterogeneity features of NFSv4—features missing in many storage systems.
I introduce extensions that allow NFSv4 to scale beyond a single server by distributing data access across the data components of the remote data store. These extensions include a new server-to-server protocol and a file description and location mechanism. I refer to NFSv4 with these extensions as Split-Server NFSv4. The remainder of this chapter is organized as follows. Section 4.1 discusses scaling limitations of the NFSv4 protocol. Section 4.2 describes the NFSv4 protocol extensions in Split-Server NFSv4. Sections 4.3 and 4.4 discuss fault tolerance and security implications of these extensions. Section 4.5 provides performance results of my Linux-based prototype and discusses performance issues of NFS with parallel file systems. Section 4.5.5 reviews alternate possible architectures and Section 4.6 concludes this chapter.
4.1. NFSv4 state maintenance NFSv4 server state is used to support exclusive opens (called share reservations), mandatory locking, and file delegations. The need to manage consistency of state information on multiple nodes fetters the ability to export an object via multiple NFSv4 servers. This “single server” constraint becomes a bottleneck if load increases while other nodes in the parallel file system are underutilized. Partitioning the file system space among multiple NFS servers helps, but it increases administrative complexity and management cost, and it fails to address scalable access to a single file or directory, a critical requirement of many high-performance applications [7].
4.2. Architecture Figure 4.1 shows how Split-Server NFSv4 modifies the NFSv4-PFS architecture of Figure 3.3 by exporting the file system from all available parallel file system clients. NFSv4 clients use their data component to send data requests to every available PFS data component, distributing data requests across the bisectional bandwidth of the client network. Any increase or decrease in available throughput of the parallel file system, e.g., additional nodes or increased network bandwidth, is reflected in Split-Server NFSv4 I/O throughput. NFSv4 access control components exist with each PFS data component to ensure that data servers allow only authorized data requests.
Figure 4.1: Split-Server NFSv4 data access architecture. NFSv4 application components use a control component to obtain metadata information from an NFSv4 metadata component and an NFSv4 data component to fulfill I/O requests. The NFSv4 metadata component uses a PFS control component to retrieve PFS metadata and shares its access control information with the data servers to ensure the data servers allow only authorized data requests. The PFS data component uses a PFS control component to obtain PFS file layout information for storage access.
The PFS data component uses a control component to retrieve PFS file layout information for storage access. The Split-Server NFSv4 extensions have the following goals:
• Read and write performance that scales linearly as parallel file system nodes are added or removed.
• Support for unmodified parallel file systems.
• A single file system image with no partitioning.
• Negligible impact on the NFSv4 security model and fault tolerance semantics.
• No dependency on special features of the underlying parallel file system.
4.2.1. NFSv4 extensions To export a file from multiple NFSv4 servers exporting shared storage, the servers need a common view of their shared state. NFSv4 servers must therefore share state information and do so consistently, i.e., with single-copy semantics. Without an identical view of the shared state, conflicting file and byte-range locks can cause data corruption or allow malicious clients to read and write data without proper authorization.
(a) Design (b) Process flow Figure 4.2: Split-Server NFSv4 design and process flow. Storage consists of a parallel file system such as GPFS. NFSv4 servers are divided into data servers, which handle READ, WRITE, and COMMIT requests, and a state server, which handles file system and stateful requests. The state server coordinates with the data servers to ensure only authorized client I/O requests are fulfilled.
To provide a consistent view, I use a state server to copy the portions of state needed to serve READ, WRITE, and COMMIT requests to the I/O nodes (designated data servers). Figure 4.2a shows the Split-Server NFSv4 architecture. Transforming NFSv4 into the out-of-band protocol shown in Figure 4.2b unleashes the I/O scalability of the underlying parallel file system. Many clients performing simultaneous metadata operations can overburden a state server, for example, when cooperating clients simultaneously open separate result files. To reduce the load on the state server, a system administrator can partition file system metadata among several state servers, ensuring that all state for a single file resides on a single state server. In addition, control processing can be distributed by allowing data servers to handle operations that do not affect NFSv4 server state, e.g., SETATTR and GETATTR.
4.2.2. Configuration and setup The mechanics of a client connection to a server are the same as in NFSv4, with the client mounting the state server that manages the file space of interest. Data servers register with the state server at start-up or any time thereafter and are immediately available to Split-Server NFSv4 clients, allowing easy incremental growth.
4.2.3. Distribution of state information On receiving an OPEN request, a state server picks a data server to service the data request. The selection algorithm is implementation defined. In my prototype, I use round-robin. The state server then places share reservation state for the request on the selected data server. The following items constitute a unique identifier for share reservation state (a record holding them is sketched after the list):
• Client name, IP address, and verifier
• Access/Deny authority
• File handle
• File open owner
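A data server could hold the replicated reservation in a record like the following C sketch; the field names and sizes are illustrative assumptions, not the prototype’s actual definitions.

    #include <stdint.h>
    #include <netinet/in.h>

    /* Replicated share reservation state pushed from the state server to the
     * data server chosen at OPEN time. Buffer sizes are illustrative only. */
    struct share_reservation {
        char            client_name[256];  /* client identifier string          */
        struct in_addr  client_addr;       /* client IP address                 */
        uint64_t        client_verifier;   /* client boot/instance verifier     */
        uint32_t        access;            /* READ, WRITE, or BOTH              */
        uint32_t        deny;              /* NONE, READ, WRITE, or BOTH        */
        unsigned char   fh[128];           /* NFSv4 file handle (<= 128 bytes)  */
        uint32_t        fh_len;
        char            open_owner[256];   /* open-owner supplied by the client */
        uint32_t        owner_len;
    };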
When a client issues a CLOSE request, the state server first reclaims the state from the data server. Once reclamation is complete, the standard NFSv4 close procedure proceeds. Support for locks does not require distributing additional state beyond share reservations. NFSv4 uses POSIX locks and relies on the locking subsystem of the underlying parallel file system. Delegations also require no additional state on the data servers as the state server manages conflicting access requests for a delegated file.
4.2.4. Redirection of clients Split-Server NFSv4 extends the NFSv4 protocol with a new attribute called FILE_LOCATION to enable Split-Server NFSv4 to provide access to a single file via multiple nodes. The FILE_LOCATION attribute specifies:
• Data server location information
• Root pathname
• Read-only flag
Clients use FILE_LOCATION information to direct READ, WRITE, and COMMIT requests to the named server. The root pathname allows each data server to have its own namespace. The read-only flag declares whether the data server will accept WRITE commands.
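A client-side representation of this attribute might look like the brief C sketch below; the names are assumptions made for illustration, not the attribute’s actual encoding.

    /* Decoded FILE_LOCATION attribute as a client might hold it in memory. */
    struct file_location {
        char *server;      /* data server location (hostname or address) */
        char *root_path;   /* root pathname exported by that data server */
        int   read_only;   /* nonzero if the data server rejects WRITEs  */
    };

Since FILE_LOCATION is an NFSv4 attribute, a client would presumably fetch it with GETATTR like any other attribute and cache it alongside the file handle it describes.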
4.3. Fault tolerance The failure model for Split-Server NFSv4 follows that of NFSv4 with the following modifications: 1. A failed state server can recover its runtime state by retrieving each part of the state from the data servers. 2. The failure of a data server is not critical to system operation.
4.3.1. Client failure and recovery An NFSv4 server places a lease on all share reservations, locks, and delegations issued to a client. Clients must send RENEW operations, akin to heartbeat messages, to the server to retain their leases. If a server does not receive a RENEW operation from a client within the lease period, the server may unilaterally revoke all state associated with the given client. Leases are also implicitly renewed as a side effect of a client request that includes its identifier. However, Split-Server NFSv4 redirects READ, WRITE, and COMMIT operations to the data servers, so the renewal implicit in these operations is no longer visible to the state server. Therefore, RENEW operations are sent to a client’s mounted state server either by modifying the client to send explicit RENEW operations, or by engineering the data server that is actively fulfilling client requests to send them. Enabling data servers to send RENEW messages on behalf of a client improves scalability by limiting the maximum number of renewal messages received by a state server to the number of data server nodes.
4.3.2. State server failure and recovery A recovering state server stops servicing requests and queries data servers to rebuild its state.
Figure 4.3: Split-Server NFSv4 experimental setup. The system has four Split-Server NFSv4 clients and five GPFS servers exporting a common file system. The GPFS servers are exported by Split-Server NFSv4, consisting of a state server and at most four data servers.
4.3.3. Data server failure and recovery A failed data server is discovered by the state server when it tries to replicate state and by clients who issue requests. A client obtains a new data server by reissuing the request for the FILE_LOCATION attribute. A data server that experiences a network partition from the state server immediately stops fulfilling client requests, preventing a state server from granting conflicting file access requests.
4.4. Security The addition of data servers to the NFSv4 protocol does not require extra security mechanisms. The client uses the security protocol negotiated with a state server for all nodes. Servers communicate over RPCSEC_GSS, the secure RPC mandated for all NFSv4 commands.
4.5. Evaluation This section compares unmodified NFSv4 with Split-Server NFSv4 as they export a GPFS file system. The test environment is shown in Figure 4.3. All nodes are connected via an IntraCore 35160 Gigabit Ethernet switch with 1500-byte Ethernet frames. Server System: The five server nodes are equipped with Pentium 4 processors with a clock rate of 850 MHz and a 256 KB cache; 2 GB of RAM; one Seagate 80 GB, 7200 RPM hard drive with an Ultra ATA/100 interface and a 2 MB cache; and two 3Com 3C996B-T Gigabit Ethernet cards. Servers run a modified Linux 2.4.18 kernel with Red Hat 9.
Client System: Client nodes one through three are equipped with dual 1.7 GHz Pentium 4 processors with a 256 KB cache; 2 GB of RAM; a Seagate 80 GB, 7200 RPM hard drive with an Ultra ATA/100 interface and a 2 MB cache; and a 3Com 3C996B-T Gigabit Ethernet card. Client node four is equipped with an Intel Xeon processor with a clock rate of 1.4 GHz and a 256 KB cache; 1 GB of RAM; an Adaptec 40 GB, 10K RPM SCSI hard drive using an Ultra 160 host adapter; and an AceNIC Gigabit Ethernet card. All clients run the Linux 2.6.1 kernel with a Red Hat 9 distribution. Netapp FAS960 Filer: The storage device has two processors, 6 GB of RAM, and a quad Gigabit Ethernet card. It is connected to eight disks running RAID4. The five servers run the GPFS v1.3 parallel file system with a 40 GB file system and a 16 KB block size. GPFS maintains a 32 MB file and metadata cache known as the pagepool.
All experiments use forty NFSv4 server threads except the Split-Server NFSv4 write experiments, which use a single NFSv4 server thread to improve performance (discussed in Section 4.5.4).
4.5.1. Scalability experiments To evaluate scalability, I measure the aggregate I/O throughput while increasing the number of clients accessing GPFS, NFSv4, and Split-Server NFSv4. Since both standard NFSv4 and Split-Server NFSv4 export a GPFS file system, the GPFS configuration constitutes the theoretical ceiling on NFSv4 and Split-Server NFSv4 I/O throughput. The extra hop between the GPFS server and the NFS client prevents the performance of NFSv4 and Split-Server NFSv4 from equaling GPFS performance. The goal is for Split-Server NFSv4 to scale linearly with GPFS. GPFS is configured as a four node GPFS file system directly connected to the filer. NFSv4 is configured with a single NFSv4 server running on a GPFS node and four clients. Split-Server NFSv4 is configured with a state server, four data servers (each running on a GPFS file system node), and four clients. At most one client accesses each data server during an experiment. To measure the aggregate I/O throughput, I use the IOZone [128] benchmark tool. In the first set of experiments, each client reads/writes a separate 500 MB file. In the second set of experiments, each client reads/writes disjoint 500 MB portions of a single pre-
existing file. The aggregate I/O throughput is calculated when the last client completes its task. The value presented is the average over ten executions of the benchmark. The write timing includes the time to flush the client's cache to the server. Clients and servers purge their caches before each read experiment. All read experiments use a warm filer cache to reduce the effect of disk access irregularities. The experimental goal is to test whether Split-Server NFSv4 scales linearly with additional resources. I engineered a server bottleneck in the system by using a small GPFS pagepool and block size, and by cutting the number of server clock cycles in half. This ensures that each server is fully utilized, which implies that the results are applicable to any system that needs to scale with additional servers.
4.5.2. Read performance First, I measure read performance while increasing the number of clients from one to four. Figure 4.4a shows the results with separate files. Figure 4.4b presents the results with a single file. GPFS imposes a ceiling on performance with an aggregate read throughput of 23 MB/s with a single server. With four servers, GPFS reaches 94.1 MB/s and 91.9 MB/s in multiple and single file experiments respectively. The decrease in performance for the single file experiment arises because all servers must access a single metadata server. With Split-Server NFSv4, as I increase the number of clients and data servers the aggregate read throughput increases linearly, reaching 65.7 MB/s with multiple files and 59.4 MB/s for the single file experiment. NFSv4 aggregate read throughput remains flat at approximately 16 MB/s in both experiments, a consequence of the single server bottleneck.
4.5.3. Write performance The second experiment measures the aggregate write throughput as I increase the number of clients from one to four. I first measure the performance of all clients writing to separate files, shown in Figure 4.5a. GPFS sets the upper limit with an aggregate write throughput of 16.7 MB/s with a single server and 61.4 MB/s with four servers. The fourth server overloads the filer’s
Figure 4.4: Split-Server NFSv4 aggregate read throughput. (a) Separate files. (b) Single file. GPFS consists of up to four file system nodes. NFSv4 is up to four clients accessing a single GPFS server. Split-Server NFSv4 consists of up to four clients accessing up to four data servers and a state server. Split-Server NFSv4 scales linearly as I increase the number of GPFS nodes but NFSv4 performance remains flat.
Figure 4.5: Split-Server NFSv4 aggregate write throughput. (a) Separate files. (b) Single file. GPFS consists of up to four file system nodes. NFSv4 is up to four clients accessing a single GPFS server. Split-Server NFSv4 consists of up to four clients accessing up to four data servers and a state server. With separate files, Split-Server NFSv4 scales linearly as I increase the number of GPFS nodes but NFSv4 performance remains flat. With a single file, Split-Server NFSv4 performance is fettered by mtime synchronization.
CPU. NFSv4 and Split-Server NFSv4 initially have an aggregate write throughput of approximately 8 MB/s. The aggregate write throughput of Split-Server NFSv4 increases linearly, reaching a maximum of 32 MB/s. As in the read experiments, the aggregate write throughput of NFSv4 remains flat as the number of clients is increased. Figure 4.5b shows the results of each client writing to different regions of a single file. The write performance of GPFS and NFSv4 is similar to the separate file experiments. The major difference occurs with Split-Server NFSv4, which achieves an initial aggregate throughput of 6.1 MB/s, increasing to 18.7 MB/s. The poor performance and lack of scalability are the result of modification time (mtime) synchronization between GPFS servers. GPFS avoids synchronizing the mtime attribute when accessed directly. GPFS must synchronize the mtime attribute when accessed with NFSv4 to ensure NFSv4 client
cache consistency. Furthermore, GPFS includes the state server among servers that synchronize the mtime attribute, further reducing performance.
4.5.4. Discussion Split-Server NFSv4 scales linearly with the number of GPFS nodes except when multiple clients write to a single file, which experiences lower performance since GPFS synchronizes the mtime attribute to comply with the NFS protocol. Client cache synchronization relies on the mtime attribute, but it is unnecessary in some environments. For example, some programs cache data themselves and use the OPEN option O_DIRECT to disable client caching for a file. Other programs require only non-conflicting write consistency, handling data consistency without relying on locks or cache consistency mechanisms. PVFS2 [129] is designed for such programs. To succeed in these environments, the NFS protocol must relax its client cache consistency semantics. NFS block sizes have tended to be small. Block sizes were 4 KB in NFSv2, and grew to 8 KB in NFSv3. Most recent implementations now support 32 KB or 64 KB. Synchronous writes along with hardware and kernel limitations are some of the original reasons for small block sizes. Another is UDP, which uses IP fragmentation to divide each block into multiple requests. Consequently, the loss of a single request means the loss of the entire block. The introduction in 2002 of TCP and a larger buffer space to the Linux implementation of NFS allows for larger block sizes, but the current Linux kernel has a 32 KB limit. This creates a disparity with many parallel file systems, which use a stripe size of greater than 64 KB. To avoid this data request inefficiency, NFS implementations need to catch up to parallel file systems like GPFS that support block sizes of greater than 1 MB. Multiple NFS server threads can also reduce I/O throughput. Even with a single NFS client, the parallel file system assumes all requests are from different sources and performs locking between threads. In addition, server threads can process read and write requests out of order, hampering the parallel file system’s ability to improve its interaction with the physical disk. In NFSv3, the lack of OPEN and CLOSE commands leads to an implicit open and close of a file in the underlying file system on every request. This does not degrade per-
formance with local file systems such as Ext3, but the extra communication required to contact a metadata server in parallel file systems restricts NFSv3 throughput.
4.5.5. Supplementary Split-Server NFSv4 designs
4.5.5.1 File system directed load balancing Split-Server NFSv4 distributes clients among the data servers using a round-robin algorithm. Allowing the underlying file system to direct clients to data servers may improve efficiency of available resources since the parallel file system may have more insight into the current client load. For example, coordinated use of the parallel file system’s data cache may prove effective with certain I/O access patterns. Allowing the underlying file system to direct client load may also facilitate the use of multiple metadata servers without an additional server-to-server protocol. This suggests extending the interface between an NFSv4 server and its exported parallel file system.
4.5.5.2 Client directed load balancing Split-Server NFSv4’s use of the FILE_LOCATION attribute enables a centralized way to balance client load. To avoid having to modify the NFSv4 protocol, random distribution of client load among data servers may prove sufficient in certain cases. Once clients discover the available data servers through configuration files or specialized mount options, they can randomly distribute requests among the data servers. Data servers can retrieve required state from the state server as needed.
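As a rough illustration of this alternative, a client could pick a data server uniformly at random from a locally configured list and direct its I/O there; the types and helper names below are assumptions for the sketch, not part of the prototype.

#include <stdlib.h>

struct data_server {
    const char *address;   /* discovered via configuration files or mount options */
};

/* Hedged sketch: random client-directed load balancing.  Any state the chosen
 * data server is missing is fetched lazily from the state server. */
const struct data_server *
pick_data_server(const struct data_server *servers, int nservers)
{
    return &servers[rand() % nservers];
}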
4.5.5.3 NFSv4 client directed state distribution Split-Server NFSv4's server-to-server communication increases the load on a central resource and delays stateful operations. Clients can reduce load on the metadata server by assuming responsibility for the distribution of state. After a client successfully executes a state generating request, e.g., LOCK, on the state server, it sends the same operation to every data server in the FILE_LOCATION attribute. In addition to reducing load on the state server, this design also isolates all modifications to the NFS client.
4.6. Related work Several systems aggregate partitioned NFSv3 servers into a single file system image [91, 113, 114].
These systems, discussed in Section 2.6, transform NFS into a type of
parallel file system, which increases scalability but eliminates the file system independence of NFS. The Storage Resource Broker (SRB) [130] aggregates storage resources, e.g., a file system, an archival system, or a database, into a single data catalogue. The HTTP protocol is the most common and widespread way to access remote data stores. SRB and HTTP also have some limits: they do not enable parallel I/O to multiple storage endpoints and do not integrate with the local file system. The notion of serverless or peer-to-peer file systems was popularized by xFS [131], which eliminates the single server bottleneck and provides data redundancy through network disk striping. More recently, wide-area file systems such as LegionFS[132] and Gfarm [133] provide a fully integrated and distributed environment and a secure means of cross-domain access. Targeted for the grid, these systems use data replication to provide reasonable performance to globally distributed data. The major drawback of these systems is their lack of interoperability with other file systems–mandating themselves as the only grid file system. Split-Server NFSv4 allows file system independent access to remote data stores in the LAN or across the WAN. GPFS-WAN [134, 135], used extensively in the TeraGrid [127], features exceptional throughput across high-speed, long haul networks, but is focused on large I/O transfers and is restricted to GPFS storage systems. GridFTP [4] is also used extensively in Grid computing to enable high I/O throughput, operating system independence, and secure WAN access to high-performance file systems. Successful and popular, GridFTP nevertheless has some serious limitations: it copies data instead of providing shared access to a single copy, which complicates its consistency model and decreases storage capacity; it lacks direct data access and a global namespace; runs as an application, and cannot be accessed as a file system without operating system modification. Split-Server NFSv4 is not intended to replace GridFTP, but to work alongside it. For example, in tiered projects such as ATLAS at CERN, GridFTP remains a natural choice for long-haul scheduled transfers among the upper tiers, while
Split-Server NFSv4 offers advantages in the lower tiers by letting scientists work with files directly, promoting effective data management.
4.7. Conclusion This chapter introduces extensions to NFSv4 to utilize an unmodified parallel file system's available bandwidth while retaining NFSv4 features and semantics. Using a new FILE_LOCATION attribute, Split-Server NFSv4 provides parallel and scalable access to existing parallel file systems. The I/O throughput of the prototype scales linearly with the number of parallel file system nodes except when multiple clients write to a single file, which experiences lower performance due to mtime synchronization in the underlying parallel file system.
CHAPTER V Flexible Remote Data Access Parallel file systems teach us two important lessons. First, direct and parallel data access techniques can fully utilize available hardware resources, and second, standard data flow protocols such as iSCSI [22], OSD [136], and FCP [21] can increase interoperability and reduce development and management costs. Unfortunately, standard protocols are useful only if file systems use them, and most parallel file systems support only one protocol for their data channel. In addition, most parallel file systems use proprietary control protocols. Distributed file systems view storage through parallel file system data components, intermediary nodes through which data must travel. This extra layer of processing prevents distributed file systems from matching the performance of the exported file system, even for a single client. An architectural framework should be able to encompass all storage architectures, i.e., symmetric or asymmetric; in-band or out-of-band; and block-, object-, or file-based, without sacrificing performance. The NFSv4 file service, with its global namespace, high level of interoperability and portability, simple and cost-effective management, and integrated security provides an ideal base for such a framework. This chapter analyzes a prototype implementation of pNFS [121, 122], an extension of NFSv4 that provides file access scalability plus operating system, hardware platform, and storage system independence. pNFS eliminates the performance bottlenecks of NFS by enabling the NFSv4 client to access storage directly. pNFS facilitates interoperability between standard protocols by providing a framework for the co-existence of NFSv4 and other file access protocols. My prototype demonstrates and validates the potential of pNFS. The I/O throughput of my prototype equals that of its exported file system (PVFS2 [129]) and is dramatically better than standard NFSv4.
Figure 5.1: pNFS data access architecture. NFSv4 control components access the NFSv4 metadata component and use a parallel file system (PFS) data component to perform I/O directly to storage.
The remainder of this chapter is organized as follows. Section 5.1 describes the pNFS architecture. Sections 5.2 and 5.3 present PVFS2 and my pNFS prototype. Section 5.4 measures performance of my Linux-based prototype. Section 5.5 discusses additional pNFS design and implementation issues, including the impact of locking and security support on the pNFS protocol. Section 5.6 summarizes and concludes this chapter.
5.1. pNFS architecture To meet enterprise and grand challenge-scale performance and interoperability requirements, a group of engineers—initially ad-hoc but now integrated into the IETF—is designing extensions to NFSv4 that provide parallel access to storage systems. The result is pNFS, which promises file access scalability as well as operating system and storage system independence. pNFS separates the control and data flows of NFSv4, allowing data to transfer in parallel from many clients to many storage endpoints. This removes the single server bottleneck by distributing I/O across the bisectional bandwidth of the storage network between the clients and storage devices. Figure 5.1 shows how pNFS alters the original NFSv4-PFS data access architecture (Figure 3.3) by integrating NFSv4 application and control components with the parallel file system data component. The NFSv4 client (NFSv4 control component) continues to send control operations to the NFSv4 server (NFSv4 metadata component), but shifts the
Figure 5.2: pNFS design. pNFS extends NFSv4 with the addition of a layout and I/O driver, and a file layout retrieval interface. The pNFS server obtains an opaque file layout map from the storage system and transfers it to the pNFS client and subsequently to its layout driver for direct and parallel data access.
responsibility for achieving scalable I/O throughput to a storage-specific driver (PFS data component). Figure 5.2 depicts the architecture of pNFS, which adds a layout and I/O driver, and a file layout retrieval interface to the standard NFSv4 architecture. pNFS clients send I/O requests directly to storage and access the pNFS server for file metadata information. A benefit of pNFS is its ability to match the performance of the underlying storage system’s native client while continuing to support all standard NFSv4 features. This support is ensured by introducing pNFS extensions into a “minor version”, a standard extension mechanism of NFSv4. In addition, pNFS does not impose restrictions that might limit the underlying file system’s ability to provide quality-enhancing features such as usage statistics or storage management interfaces.
5.1.1. Layout and I/O driver The layout driver understands the file layout and storage protocol of the storage system. A layout consists of all information required to access any byte range of a file. For example, a block layout may contain information about block size, offset of the first block on each storage device, and an array of tuples that contains device identifiers, block numbers, and block counts. An object layout specifies the storage devices for a file and the information necessary to translate a logical byte sequence into a collection of objects. A file layout is similar to an object layout but uses file handles instead of object identifiers. The layout driver uses the layout to translate read and write requests from the pNFS
client into I/O requests understood by the storage devices. The I/O driver performs raw I/O, e.g., Myrinet GM [137], Infiniband [138], TCP/IP, to the storage nodes. The layout driver can be specialized or (preferably) implement a standard protocol such as the Fibre Channel Protocol (FCP), allowing multiple file systems to use the same layout driver. Storage systems adopting this architecture reduce development and management obligations by obviating a specialized file system client. This holds the promise of reducing development and maintenance of high-end storage systems.
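For illustration, the block layout described above might be rendered in C as follows; the structure and field names are assumptions for this sketch, not definitions taken from the pNFS drafts.

#include <stdint.h>

/* One extent tuple: which device holds the blocks, where they start, and how many. */
struct block_extent {
    uint64_t device_id;
    uint64_t start_block;
    uint64_t block_count;
};

/* Hedged sketch of a block layout: block size, per-device starting offsets,
 * and an array of extent tuples. */
struct block_layout {
    uint32_t block_size;            /* bytes per block */
    uint32_t ndevices;
    uint64_t *first_block_offset;   /* offset of the first block on each device */
    uint32_t nextents;
    struct block_extent *extents;
};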
5.1.2. NFSv4 protocol extensions This section describes the NFSv4 protocol extensions to support pNFS. File system attribute. A new file system attribute, LAYOUT_CLASSES, contains the layout driver identifiers supported by the underlying file system. Upon encountering an unknown file system identifier, a pNFS client retrieves this attribute and uses it to select an appropriate layout driver. To prevent namespace collisions, a global registry maintainer such as IANA [139] specifies layout driver identifiers. LAYOUTGET operation. The LAYOUTGET operation obtains file access information for a byte-range of a file from the underlying storage system. The client issues a LAYOUTGET operation after it opens a file and before it accesses file data. Implementations determine the frequency and byte range of the request. The arguments (sketched in C after this list) are:
• File handle
• Layout type
• Access type
• Offset
• Extent
• Minimum size
• Maximum count
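A hypothetical C rendering of the LAYOUTGET arguments and result is shown below; the protocol itself encodes these in XDR, and the names here are illustrative only.

#include <stdint.h>

enum layout_access { LAYOUT_ACCESS_READ, LAYOUT_ACCESS_WRITE, LAYOUT_ACCESS_RW };

/* Hedged sketch of the LAYOUTGET arguments listed above. */
struct layoutget_args {
    unsigned char      fh[128];     /* file handle (opaque) */
    uint32_t           layout_type; /* preferred layout type */
    enum layout_access access;      /* reading, writing, or both */
    uint64_t           offset;      /* start of the requested byte range */
    uint64_t           extent;      /* length of the requested byte range */
    uint64_t           min_size;    /* minimum overlap with the requested range */
    uint32_t           max_count;   /* maximum reply size, including XDR overhead */
};

/* Hedged sketch of the result: an opaque layout and the byte range it covers. */
struct layoutget_res {
    uint64_t       offset;
    uint64_t       extent;
    uint32_t       layout_len;
    unsigned char *layout;          /* passed untouched to the layout driver */
};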
The file handle uniquely identifies the file. The layout type identifies the preferred layout type. The offset and extent arguments specify the requested region of the file. The access type specifies whether the requested file layout information is for reading, writing, or both. This is useful for file systems that, for example, provide read-only replicas of data. The minimum size specifies the minimum overlap with the requested offset and length. The maximum count specifies the maximum number of bytes for the result, including XDR overhead. LAYOUTGET returns the requested layout as an opaque object and its associated offset and extent. By returning file layout information to the client as an opaque object, pNFS is able to support arbitrary file layout types. At no time does the pNFS client attempt to interpret this object; it acts simply as a conduit between the storage system and the layout driver. The byte range described by the returned layout may be larger than the requested size due to block alignments, layout prefetching, etc. LAYOUTCOMMIT operation. The LAYOUTCOMMIT operation commits changes to the layout information. The client uses this operation to commit or discard provisionally allocated space, update the end of file, and fill in existing holes in the layout. LAYOUTRETURN operation. The LAYOUTRETURN operation informs the NFSv4 server that layout information obtained earlier is no longer required. A client may return a layout voluntarily or upon receipt of a server recall request. CB_LAYOUTRECALL operation. If layout information is exclusive to a specific client and other clients require conflicting access, the server can recall a layout from the client using the CB_LAYOUTRECALL callback operation.3 The client should complete any in-flight I/O operations using the recalled layout and write any buffered dirty data directly to storage before returning the layout, or write it later using normal NFSv4 write operations.
Figure 5.3: PVFS2 architecture. PVFS2 consists of clients, metadata servers, and storage nodes. The PVFS2 kernel module enables integration with the local file system. Data is striped across storage nodes using a user-defined algorithm.
GETDEVINFO and GETDEVLIST operations. The GETDEVINFO and GETDEVLIST operations retrieve additional information about one or more storage nodes. The layout driver issues these operations if the device information inside the file layout does not provide enough information for file access, e.g., SAN volume label information or port numbers.
5.2. Parallel virtual file system version 2 This section presents an overview of PVFS2, a user-level, open-source, scalable, asymmetric parallel file system designed as a research tool and for production environments. Despite its lack of locking and security support, I chose PVFS2 because its user level design provides a streamlined architecture for rapid prototyping of new ideas. Figure 5.3 depicts the PVFS2 architecture. PVFS2 consists of clients, storage nodes, and metadata servers. Metadata servers store all information about the file system in a Berkeley DB database [140], distributing metadata via a hash on the file name. File data is striped across storage nodes, which can be increased in number as needed. PVFS2 uses algorithmic file layouts for distributing data among the storage nodes. The data distribution algorithm is user defined, defaulting to round-robin striping. The clients and storage nodes share the data distribution algorithm, which does not change during the lifetime of the file. A series of file handles, one for each storage node,
3 NFSv4 already contains a callback operation infrastructure for delegation support.
Figure 5.4: pNFS prototype architecture. The pNFS server obtains the opaque file layout from the PVFS2 metadata server via the PVFS2 client, transferring it back to the pNFS client and subsequently to the PVFS2 layout driver for direct and parallel data access.
uniquely identifies the set of file data stripes. Data is not committed with the metadata server; instead, the client ensures that all data is committed to storage by negotiating with each individual storage node. An operating system specific kernel module provides for integration into user environments and for access by other VFS file systems. Users are thus able to mount and access PVFS2 through a POSIX interface. Currently, only a Linux implementation of this module exists. Data is memory mapped between the kernel module and the PVFS2 client program to avoid extra data copies. Efficient lock management with large numbers of clients is a hard problem. Large parallel applications generally avoid using locks and manage data consistency through organized and cooperative clients. PVFS2 shuns POSIX consistency semantics, which require sequential consistency of file system operations, and replaces them with non-conflicting write semantics, guaranteeing that writes to non-overlapping file regions are visible on all subsequent reads once the write completes.
5.3. pNFS prototype Prototypes of new protocols are essential for their clarification and provide insight and evidence of their viability. A minimum requirement for the fitness of pNFS is its ability to provide parallel access to arbitrary storage systems. This agnosticism toward storage system particulars is vital for widespread adoption. As such, my prototype fo-
cuses on the retrieval and processing of the file layout to demonstrate that pNFS is agnostic of the underlying storage system and can match the performance of the storage system it exports. Figure 5.4 displays the architecture of my pNFS prototype with PVFS2 as the exported file system.
5.3.1. PVFS2 layout The PVFS2 file layout information consists of the following fields (a C sketch follows the list):
• File system id
• Set of file handles, one for each storage node
• Distribution id, uniquely defines layout algorithm
• Distribution parameters, e.g., stripe size
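A C sketch of these fields follows; the structure and field names are assumptions for illustration, not the prototype's wire format.

#include <stdint.h>

/* Hedged sketch of the PVFS2 file layout described above. */
struct pvfs2_layout {
    uint32_t  fs_id;            /* file system id */
    uint32_t  nservers;         /* number of storage nodes */
    uint64_t *handles;          /* one file handle per storage node */
    uint32_t  distribution_id;  /* uniquely defines the layout algorithm */
    uint64_t  stripe_size;      /* example distribution parameter */
};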
Since a PVFS2 layout applies to an entire file, no matter what byte range the pNFS client requests using the LAYOUTGET operation, the returned byte range is the entire file. Therefore, my prototype requests a layout once for each open file, incurring a single additional round trip. If a pNFS client is eager with its requests, it can even eliminate this single round trip time by including the LAYOUTGET in the same request as the OPEN operation. The differences between these two designs are apparent in my evaluation. The pNFS server obtains the layout from PVFS2 via a Linux VFS export operation.
5.3.2. Extensible "Pluggable" layout and I/O drivers Our prototype facilitates interoperability by providing a framework for the coexistence of the NFSv4 control protocol with all storage protocols. As shown in Figure 5.5, layout drivers are pluggable, using a standard set of interfaces for all storage protocols. An I/O interface, based on the Linux file_operations interface4, facilitates the management of layout information and the performance of I/O with storage. A policy interface informs the pNFS client of storage system specific policies, e.g., stripe and block size, and layout retrieval timing. The policy interface also enables layout drivers to specify whether they support NFSv4 data management services or use customized implementations. The pNFS client can provide the following data management services: data cache, writeback cache with write gathering, and readahead.
4 The file_operations interface is the VFS interface that manages access to a file.
Figure 5.5: Linux pNFS prototype internal structure. pNFS clients use I/O and policy interfaces to access storage nodes and determine file system policies. The pNFS server uses VFS export operations to communicate with the underlying file system.
A layout driver registers with the pNFS client along with a unique identifier. The pNFS client matches this identifier with the value of the LAYOUT_CLASSES attribute to select the correct layout driver for file access. If there is no matching layout driver, standard NFSv4 read and write mechanisms are used. The PVFS2 layout driver supports three operations: read, write, and set_layout. To inject the file layout map, the pNFS client passes the opaque layout as an argument to the set_layout function. Once the layout driver has finished processing the layout, the pNFS client is free to call the layout driver's read and write functions. When data access is complete, the pNFS client issues a standard NFSv4 close operation to the pNFS server. The syntax for the PVFS2 layout driver I/O interface is:
ssize_t read(struct file *file, char __user *buf, size_t count, loff_t *offset)
ssize_t write(struct file *file, const char __user *buf, size_t count, loff_t *offset)
int set_layout(struct inode *ino, struct file *file, unsigned int cmd, unsigned long arg)
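The registration step described in the preceding paragraphs might be expressed as follows. The operations structure, the registration call, and the LAYOUT_PVFS2 identifier are assumed names for this sketch and do not necessarily match the prototype's actual interface.

/* Hedged sketch (kernel-style): bundle the three PVFS2 layout driver
 * operations and register them with the pNFS client under a unique
 * layout identifier, which is later matched against LAYOUT_CLASSES. */
extern ssize_t pvfs2_layout_read(struct file *file, char __user *buf,
                                 size_t count, loff_t *offset);
extern ssize_t pvfs2_layout_write(struct file *file, const char __user *buf,
                                  size_t count, loff_t *offset);
extern int pvfs2_set_layout(struct inode *ino, struct file *file,
                            unsigned int cmd, unsigned long arg);

struct pnfs_layoutdriver_ops {
    ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
    int (*set_layout)(struct inode *, struct file *, unsigned int, unsigned long);
};

static struct pnfs_layoutdriver_ops pvfs2_layout_ops = {
    .read       = pvfs2_layout_read,
    .write      = pvfs2_layout_write,
    .set_layout = pvfs2_set_layout,
};

static int pvfs2_layoutdriver_init(void)
{
    /* Hypothetical registration call; LAYOUT_PVFS2 is an assumed identifier. */
    return pnfs_register_layoutdriver(LAYOUT_PVFS2, &pvfs2_layout_ops);
}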
5.4. Evaluation In this section, I describe experiments that assess the performance of my pNFS prototype. They demonstrate that pNFS can use the standard layout driver interface to scale with PVFS2, and can achieve performance vastly superior to NFSv4.
5.4.1. Experimental Setup The experiments are performed on a network of forty identical nodes partitioned into twenty-three clients, sixteen storage nodes, and one metadata server. Each node is a 2 GHz dual-processor Opteron with 2 GB of DDR RAM and four Western Digital Caviar Serial ATA disks, which have a nominal data rate of 150 MB/s and an average seek time of 8.9 ms. The disks are configured with software RAID 0. The operating system kernel is Linux 2.6.9-rc3. The version of PVFS2 is 1.0.1. I test four configurations: two that access PVFS2 storage nodes directly via pNFS and PVFS2 clients and two with unmodified NFSv4 clients. One NFSv4 configuration accesses an Ext3 file system. The other accesses a PVFS2 file system with an NFSv4 server, exported PVFS2 client, and PVFS2 metadata server all residing on the metadata server. The metadata server runs eight pNFS or NFSv4 server threads when exporting the PVFS2 or Ext3 file systems. I verified that varying the number of pNFS or NFSv4 server threads does not affect performance. I compare the aggregate I/O throughput using the IOZone [128] benchmark tool while increasing numbers of clients. The first set of experiments has two processes on each client reading and writing separate 200 MB files. In the second set of experiments, each client reads and writes disjoint 100 MB portions of a single pre-existing file. Aggregate I/O throughput is calculated when the last client completes its task. The value presented is the average over several executions of the benchmark. The write time includes a flush of the client's cache to the server. All read experiments use warm storage node caches to reduce disk access irregularities.
5.4.2. LAYOUTGET performance If a layout does not apply to an entire file, a LAYOUTGET request would be required on every read or write. In the test environment, the time for a LAYOUTGET request is 0.85 ms. On a 1 MB transfer, this reduces I/O throughput by only 3-4 percent; with a 10 MB transfer, the relative cost is less than 0.5 percent.
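The relative cost of the extra round trip can be approximated with a simple ratio; the transfer times below are back-of-the-envelope figures implied by the numbers above, not additional measurements.

\[ \text{relative cost} \approx \frac{t_{\text{LAYOUTGET}}}{t_{\text{LAYOUTGET}} + t_{\text{transfer}}} \]

With t_LAYOUTGET = 0.85 ms, a 3-4 percent penalty on a 1 MB transfer implies t_transfer of roughly 20-27 ms; a 10 MB transfer at the same rate takes roughly 200-270 ms, giving a relative cost of about 0.3-0.4 percent, consistent with the figure quoted above.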
Figure 5.6: Aggregate pNFS write throughput. (a) Separate files. (b) Single file. pNFS scales with PVFS2 while NFSv4 performance remains flat. pNFS and PVFS2 use sixteen storage nodes. With separate files, each client spawns two write processes.
Figure 5.7: Aggregate pNFS read throughput. (a) Separate files. (b) Single file. pNFS and PVFS2 scale linearly while NFSv4 performance remains flat. With a single file, pNFS performance is slightly below PVFS2 due to increasing layout retrieval congestion. pNFS-2, which removes the extra round trip time of LAYOUTGET, matches PVFS2 performance. pNFS and PVFS2 use sixteen storage nodes. With separate files, each client spawns two read processes.
5.4.3. I/O throughput performance In all experiments, the performance of NFSv4 exporting PVFS2 achieves an aggregate read and write throughput of only 1.9 MB/s and 0.9 MB/s respectively. I discuss this in Section 5.4.4. Figure 5.6a shows the write performance with each client writing to separate files. In Figure 5.6b, all clients write to a single file. NFSv4 with Ext3 achieves an average aggregate write throughput of 38 MB/s and 68 MB/s for the separate and single file experiments. pNFS performance tracks PVFS2, reaching a maximum aggregate write throughput of 384 MB/s with sixteen processes for separate files and 240 MB/s with seven cli-
ents writing to a single file. With separate files, the bottleneck is the number of storage nodes. Metadata processing limits the performance with a single file. Figure 5.7a shows the read performance with two processes on each client reading separate files. NFSv4 with Ext3 achieves its maximum network bandwidth of 115 MB/s. pNFS again achieves the same performance as PVFS2. Initially, the extra overhead required to access sixteen storage nodes reduces read throughput for two processes to 27 MB/s, but it scales almost linearly, reaching an aggregate read throughput of 550 MB/s with 46 processes. Figure 5.7b shows the read performance with each client reading from disjoint portions of the same pre-existing file. NFSv4 with Ext3 again achieves its maximum network bandwidth of 115 MB/s. PVFS2 scales linearly, starting with an aggregate read throughput of 15 MB/s with a single client, increasing to 360 MB/s with twenty-three clients. The pNFS prototype, which incurs a single round trip time for the LAYOUTGET, suffers slightly as the PVFS2 layout retrieval function takes longer with increasing numbers of clients, reaching an aggregate read throughput of 311 MB/s. A modified prototype combines the LAYOUTGET and OPEN operations into a single call. The prototype labeled pNFS-2 excludes the LAYOUTGET operation from the measurements and matches the performance of PVFS2.
5.4.4. Discussion While these experiments offer convincing evidence that pNFS can match the performance of the underlying file system, they also demonstrate that pNFS performance can be adversely affected by a costly LAYOUTGET operation. The poor performance of NFSv4 with PVFS2 stems from a difference in block sizes. Per-read and per-write processing overhead is small in NFSv4, which justifies a small block size (32 KB on Linux), but PVFS2 has a much larger per-read and per-write overhead and therefore uses a block size of 4 MB. In addition, PVFS2 does not perform write gathering on the client, assuming each data request to be a multiple of the block size. To make matters worse, the Linux kernel breaks the NFSv4 client's request on the NFSv4 server into 4 KB chunks before it issues the requests to the PVFS2 client. Data transfer
overhead, e.g., creating connections to the storage nodes and determining stripe locations, dominates with 4 KB requests. The impact on performance is devastating. Lack of a commit operation in the PVFS2 kernel module also reduces the write performance of NFSv4 with PVFS2. To prevent data loss, PVFS2 commits every write operation, ignoring the NFSv4 COMMIT operation. Write gathering [90] on the server combined with a commit from the PVFS2 client would comply with NFSv4 fault tolerance semantics and improve the interaction of PVFS2 with the disk.
5.5. Additional pNFS design and implementation issues
5.5.1. Locking NFSv4 supports mandatory locking, which requires an additional piece of shared state between the NFSv4 client and server: a unique identifier of the locking process. An NFSv4 client includes a locking identifier with every read and write operation. How pNFS storage nodes support mandatory locks is not covered in the pNFS operations Internet Draft [123]. Several possibilities exist: enable the storage nodes to interpret NFSv4 lock identifiers, bundle a new pNFS operation to retrieve file system specific lock information with the NFSv4 LOCK operation, or include lock information in the existing file layout.
5.5.2. Security considerations Separating control and data paths in pNFS introduces new security concerns to NFSv4. Although RPCSEC_GSS continues to secure the NFSv4 control path, securing the data path requires additional care. The current pNFS operations Internet Draft describes the general mechanisms that will be required, but does not go all the way in defining the new security architecture. A file-based layout driver uses the RPCSEC_GSS security mechanism between the client and storage nodes. Object storage uses revocable cryptographic capabilities for file system objects that the metadata server passes to clients. For data access, the layout driver requires the cor-
rect capability to access the storage nodes. It is expected that the capability will be passed to the layout driver within the opaque layout object. Block storage access protocols rely on SAN-based security, which is perhaps a misnomer, as clients are implicitly trusted to access only their allotted blocks. LUN masking/unmapping and zone-based security schemes can fence clients to specific data blocks. Some systems employ IPsec to secure the data stream. Placing more trust in the client for SAN file systems is a step backwards from the NFSv4 trust model.
5.6. Related work Several systems aggregate partitioned NFSv3 servers into a single file system image [91, 113, 114].
These systems, discussed in Section 2.6, transform NFS into a type of
parallel file system, which increases scalability but eliminates the file system independence of NFS. The Storage Resource Broker (SRB) [130] aggregates storage resources, e.g., a file system, an archival system, or a database, into a single data catalogue. The HTTP protocol is the most common and widespread way to access remote data stores. SRB and HTTP also have some limits: they do not enable parallel I/O to multiple storage endpoints and do not integrate with the local file system. EMC HighRoad [103] uses the NFS or CIFS protocol for its control operations and stores data in an aggregated LAN and SAN environment. Its use of file semantics facilitates data sharing in SAN environments, but is limited to the EMC Symmetrix storage system. A similar, non-commercial version is also available [141]. Several pNFS layout drivers are under development. At this writing, Sun Microsystems, Inc. is developing file- and object-based layout implementations. Panasas object and EMC block drivers are currently under development. GPFS-WAN [134, 135], used extensively in the TeraGrid [127], features exceptional throughput across high-speed, long haul networks, but is focused on large I/O transfers and is restricted to GPFS storage systems. GridFTP [4] is also used extensively in Grid computing to enable high I/O throughput, operating system independence, and secure WAN access to high-performance file systems. Successful and popular, GridFTP nevertheless has some serious limitations: it
copies data instead of providing shared access to a single copy, which complicates its consistency model and decreases storage capacity; it lacks direct data access and a global namespace; and it runs as an application and cannot be accessed as a file system without operating system modification. Distributed replicas can be vital in reducing network latency when accessing data. pNFS is not intended to replace GridFTP, but to work alongside it. For example, in tiered projects such as ATLAS at CERN, GridFTP remains a natural choice for long-haul scheduled transfers among the upper tiers, while the file system semantics of pNFS offer advantages in the lower tiers by letting scientists work with files directly, promoting effective data management.
5.7. Conclusion This chapter analyzes an early implementation of pNFS, an NFSv4 extension that uses the storage protocol of the underlying file system to bypass the server bottleneck and enable direct and parallel storage access. The prototype validates the viability of the pNFS protocol by demonstrating that it is possible to achieve high throughput access to a high-performance file system while retaining the benefits of NFSv4. Experiments demonstrate that the aggregate throughput with the prototype equals that of its exported file system and far exceeds NFSv4 performance.
CHAPTER VI Large Files, Small Writes, and pNFS Parallel file systems improve the aggregate throughput of bulk data transfers by scaling disks, disk controllers, network, and servers—every aspect of the system architecture. As system size increases, the cost of locating, managing, and protecting data increases the per-request overhead. This overhead is small relative to the overall cost of large data transfers, but considerable for smaller data requests. Many parallel file systems ignore this high penalty for small I/O, focusing entirely on large data transfers. Unfortunately, not all data comes in big packages. Numerous workload characterization studies have highlighted the prevalence of small and sequential data requests in modern scientific applications [72-78]. This trend will likely continue since many HPC applications take years to develop, have a productive lifespan of ten years or more, and are not easily re-architected for the latest file access paradigm [12]. Furthermore, many current data access libraries such as HDF5 and NetCDF rely heavily on small data accesses to store individual data elements in a common (large) file [142, 143]. This chapter investigates the performance of parallel file systems with small writes. Distributed file systems are optimized for small data accesses [2, 36]; not surprisingly, studies demonstrate that small I/O is their middleware niche [63]. I demonstrate that distributed file systems can increase write throughput to parallel data stores—regardless of file size—by overcoming small write inefficiencies in parallel file systems. By using direct, parallel I/O for large write requests and a distributed file system for small write requests, pNFS improves the overall write performance of parallel file systems. The pNFS heterogeneous metadata protocol allows these improvements in write performance with any parallel file system. The remainder of this chapter is organized as follows. Section 6.1 explores the issues that arise when writing small amounts of data in scientific applications. Section 6.2 de-
scribes how pNFS can improve the performance of these applications. Section 6.3 reports the results of experiments with synthetic benchmarks and a real scientific application. Section 6.5 summarizes and concludes the chapter.
6.1. Small I/O requests Several scientific workload characterization studies demonstrate the need to improve performance of small I/O requests to small and large files. The CHARISMA study [72-74] finds that file sizes in scientific workloads are much larger than those typically found in UNIX workstation environments and that most scientific applications access only a few files.
Approximately 90% of file accesses are
small—less than 4 KB—and represent a considerable portion of application execution time, even though approximately 90% of the data is transferred in large accesses. In addition, most files are read-only or write-only and are accessed sequentially, but read-write files are accessed primarily non-sequentially. The Scalable I/O study [75-77] had similar findings, but remarked that most requests are small writes into gigabyte sized files, consuming, for example, 98% of the execution time of one application that was studied. Furthermore, it is common for a single node to handle the majority of reads and writes, gathering the data from, or broadcasting the data to the other nodes as necessary. This indicates that single node performance still requires attention from parallel file systems. The study also notes that a lack of portability prevents applications from using enhanced parallel file system interfaces. A more recent study in 2004 of two physics applications [78] amplifies the earlier findings. This study found that I/O is bursty, most requests consist of small data transfers, and most data is transferred in a few large requests. It is common for a master node to collect results from other nodes and write them to storage using many small requests. Each client reads back the data in large chunks. In addition, use of a single file is still common and accessing that file—even with modern parallel file systems—is slower than accessing separate files by a factor of five. NetCDF (Network Common Data Form) provides a portable and efficient mechanism for sharing data between scientists and applications [142]. It is the predominant file format standard within many scientific communities [144]. NetCDF defines a file format
and an API for the storage and retrieval of a file's contents. NetCDF stores data in a single array-oriented file, which contains dimensions, variables, and attributes. Applications individually define and write thousands of data elements, creating many sequential and small write requests. HDF5 is another popular portable file format and programming interface for storing scientific data in a single file. It provides a rich data model, with emphasis on efficiency of access, parallel I/O, and support for high-performance computing, but continues to define and store each data element separately, creating many small write requests. This chapter demonstrates how pNFS can improve small write performance with parallel file systems for small and large files, regardless of whether an application or file format library generates the write requests.
6.2. Small writes and pNFS pNFS improves file access scalability by providing the NFSv4 client with support for direct storage access. I now turn to an investigation of the relative costs of the direct I/O path and the NFSv4 path.
6.2.1. File system I/O features A single large I/O request can saturate a client's network endpoint. Engineering a parallel file system for large requests entails the use of large transfer buffers, synchronous data requests, deploying many storage nodes, and the use of a write-through cache or no cache at all. NFS implementations have several features that are sometimes an advantage over the direct write path:
• Asynchronous client requests. Many parallel file systems incur a per-request overhead, which adds up for small requests. Directing requests to the NFSv4 server allows the server to absorb this overhead without delaying the client application or consuming client CPU cycles. In addition, asynchrony allows request pipelining on the NFSv4 server, reducing aggregate latency to the storage nodes.
• One server per request. Data written to a byte-range that spans multiple storage nodes (e.g., multiple stripes) requires two separate requests, further increasing the per-request overhead. The NFSv4 single server design can reduce client request overhead for small requests in these instances.
• Open Network Computing Remote Procedure Call. NFSv4 uses ONC RPC [34], a low-overhead, low-latency network protocol that is well suited for small data transfers.
• Client writeback cache. NFSv4 gathers sequential write requests into a single request, which lowers the aggregate cost of small write requests.
• Server write gathering. Similarly, the NFSv4 server combines sequential write requests into a single request to the exported parallel file system. This can be useful, e.g., for applications performing strided access into a single file.
File System      Write Throughput (MB/s)
Ext3             5.02
NFSv4/Ext3       4.03
pNFS/PVFS2       0.65
NFSv4/PVFS2      2.44
Table 6.1: Postmark write throughput with 1 KB block size. NFSv4 outperforms direct, parallel I/O for small writes.
6.2.2. Small write performance example: Postmark benchmark Comparing the performance of the Postmark benchmark on my pNFS prototype, unmodified NFSv4, and Ext3 demonstrates the performance mismatch of parallel file systems in a real-world computing environment. Postmark simulates applications that have a large number of metadata and small I/O requests such as electronic mail, NetNews, and Web-based services [145]. Postmark creates and performs transactions on a large number of small files (between 1 KB and 500 KB). Each transaction first deletes, creates, or opens a file. If the transaction creates or opens a file, it then appends 1 KB. Data is sent to stable storage before the file is closed. Postmark performs 2,000 transactions on 100 files. The experimental evaluation uses eight nodes with dual 1.7 GHz P4 processors and
Figure 6.1: pNFS small write data access architecture. Clients use a PFS data component for large write requests and an NFSv4 data component for small write requests.
a 3Com 3C996B-T Gigabit Ethernet card. PVFS2 has six storage nodes and one metadata server. Table 6.1 shows the Postmark results for Ext3, NFSv4, and pNFS. Ext3 outperforms remote clients, achieving a write throughput of 5.02 MB/s. NFSv4 achieves a write throughput of 4.03 MB/s. pNFS exporting the PVFS2 parallel file system achieves a write throughput of only 0.65 MB/s due to its inability to parallelize requests effectively and its use of a write-through cache. By using the features discussed in Section 6.2.1, NFSv4 raises the write throughput to the same PVFS2 file system by 1.79 MB/s. This demonstrates that the parallel, direct I/O path is not always the best choice and the indirect path is not always the worst choice.
6.2.3. pNFS write threshold To enable the indirect I/O path for small writes, I modified the pNFS client prototype to allow it to choose between the NFSv4 storage protocol and the storage protocol of the underlying file system. To switch between them, I added a write threshold to the layout driver. Write requests smaller than the threshold follow the NFSv4 data path. Write requests larger than the threshold follow the layout driver data path. Figure 6.1 shows how the write threshold alters the pNFS data access architecture. Clients use a PFS data component for large write requests and an NFSv4 data component for small write requests. An additional PFS data component on the metadata server funnels small write requests to
Figure 6.2: pNFS write threshold. (a) pNFS data paths: pNFS utilizes NFSv4 I/O along the small write path when the write request size is less than the write threshold. (b) pNFS prototype with write threshold: pNFS retrieves the write threshold from the PVFS2 layout driver to determine the correct data path for a write request.
storage. Figure 6.2 illustrates the implementation of the write threshold in both the general pNFS architecture and in the prototype. pNFS features a heterogeneous metadata protocol that enables it to benefit from the strengths of disparate storage protocols. A write threshold improves overall write performance for pNFS by hitting the sweet spot of both the NFSv4 and underlying file system storage protocols. Just as any improvement to NFSv4 improves access to the file system it exports, these improvements to pNFS are portable and benefit all parallel file systems equally by allowing pNFS (and its exported parallel file systems) to concentrate on large data requirements, while NFSv4 efficiently processes small I/O.
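A minimal sketch of the dispatch decision follows, assuming a hypothetical helper that exposes the layout driver's write threshold; the function names do not reproduce the prototype's actual code.

/* Hedged sketch: route a write through NFSv4 when it is smaller than the
 * layout driver's write threshold, and directly to storage otherwise. */
extern size_t layout_write_threshold(struct file *file);
extern ssize_t nfsv4_write_path(struct file *file, const char __user *buf,
                                size_t count, loff_t *offset);
extern ssize_t layout_driver_write(struct file *file, const char __user *buf,
                                   size_t count, loff_t *offset);

ssize_t pnfs_file_write(struct file *file, const char __user *buf,
                        size_t count, loff_t *offset)
{
    if (count < layout_write_threshold(file))
        return nfsv4_write_path(file, buf, count, offset);   /* small write path */

    return layout_driver_write(file, buf, count, offset);    /* direct, parallel I/O */
}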
6.2.4. Setting the write threshold The advantage of a write threshold is that applications that mix small and large write requests get the better performing I/O path automatically. The optimal write threshold value depends on several factors, including server capacity, network performance and capability, system load, and features specific to the distributed and parallel file system. One way to choose a good threshold value is to compare execution times for distributed and parallel file systems with various write sizes and see where the performance indicators cross.
Figure 6.3: Determining the write threshold value. Write execution time increases with larger request sizes. Application write requests are either small or large, with few requests in the middle. The write threshold can be any value in this middle region.
Figure 6.3 abstracts write request execution time with increasing request size for a parallel file system and for an idle and busy distributed file system. When the distributed file system is lightly loaded, the transfer size at which the parallel file system outperforms the distributed file system, labeled B, is the optimal write threshold. When the distributed file system is heavily loaded, each request takes longer to complete, so the slope increases and intersects the parallel file system at the smaller threshold size, labeled A. (If the distributed file system is thoroughly overloaded, the threshold value tends to zero, i.e., an overloaded distributed file system is never a better choice.) The workload characterization studies mentioned in Section 6.1 state that scientific applications usually have a large gap between small and large write request sizes, with very few requests in the middle. Experiments reveal that small requests are smaller than the “busy” write threshold value, shown as A in Figure 6.3, and the large requests are larger than the “idle” write threshold values, shown as B. Applications should reap large gains for any write threshold value between A and B. For example, the ATLAS digitization application (Section 6.3.3) achieves the same performance for any write threshold between 32 KB and 274 KB. In addition, 87 percent of the write requests are smaller than 4 KB, which suggests that the threshold could be even smaller without hurting performance. The write threshold can be set at any time, including compile time, when a module loads, and run time. For example, system administrators can determine the write threshold as part of a file system and network installation and optimization. A natural value for the write threshold is the write gather size of the distributed file system.
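One way to formalize the crossover in Figure 6.3 is with a simple linear cost model; this is an idealization for intuition only, not a model used in the experiments.

\[ t_{\text{nfs}}(s) = a_{\text{nfs}} + \frac{s}{b_{\text{nfs}}}, \qquad t_{\text{pfs}}(s) = a_{\text{pfs}} + \frac{s}{b_{\text{pfs}}}, \qquad s^{*} = \frac{a_{\text{pfs}} - a_{\text{nfs}}}{1/b_{\text{nfs}} - 1/b_{\text{pfs}}} \]

Here a is per-request overhead, b is sustained bandwidth, and s* is the request size at which the two paths cost the same. A heavier load on the distributed file system increases a_nfs and decreases b_nfs, shrinking s*, which corresponds to the shift from threshold B to threshold A in the figure.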
6.3. Evaluation In this section, I evaluate the performance of the write threshold heuristic in my pNFS prototype.
6.3.1. Experimental setup IOR and random write IOZone experiments use a pair of sixteen node clusters connected with Myrinet. One cluster consists of dual 1.1 GHz processor PIII Xeon nodes. The other consists of dual 1 GHz processor PIII Xeon nodes. Each node has 1 GB of memory. The PVFS2 1.1.0 file system has eight storage nodes and one metadata server. Each storage node has an Ultra160 SCSI disk controller and one Seagate Cheetah 18 GB, 10,033 RPM drive, with an average seek time of 5.2 ms. The NFSv4 server, PVFS2 client, and PVFS2 metadata server are installed on a single node. All nodes run Linux 2.6.12-rc4. ATLAS experiments use an eight node cluster of 1.7 GHz dual P4 processors, 2 GB of memory, a Seagate 80 GB 7200 RPM hard drive with an Ultra ATA/100 interface and a 2 MB cache, and a 3Com 3C996B-T Gigabit Ethernet card. The PVFS2 1.1.0 file system has six storage nodes and one metadata server. The NFSv4 server, PVFS2 client, and PVFS2 metadata server are installed on a single node. All nodes run Linux 2.6.12-rc4.
6.3.2. IOR and IOZone benchmarks
6.3.2.1 Experimental design The first experiment consists of a single client issuing one thousand sequential write requests to a file, using the IOR benchmark [146]. A test completes when data is committed to disk. I repeat this experiment with ten clients writing to disjoint portions of a single file. The second experiment consists of a single client randomly writing a 32 MB file using IOZone [128]. For each experiment, I first compare the aggregate write throughput of pNFS and NFSv4 with a range of individual request sizes. I then set the write threshold to be the
Figure 6.4: Write throughput with threshold. (a) Single/consecutive: write throughput of a single client issuing consecutive small write requests. NFSv4 exporting PVFS2 outperforms pNFS until a write size of 64 KB. pNFS with a 32 KB write threshold achieves the best overall performance. (b) Multiple/consecutive: aggregate write throughput of ten clients issuing consecutive small write requests to a single file. NFSv4 exporting PVFS2 outperforms pNFS until a write size of 8 KB. pNFS with a 4 KB write threshold achieves the best overall performance. (c) Single/random: write throughput of a single client issuing random small write requests. NFSv4 exporting PVFS2 outperforms pNFS until a write size of 128 KB. pNFS with a 64 KB write threshold achieves the best overall performance. Data points are a power of two; lines are for readability.
request size at which pNFS and NFSv4 have the same performance, and re-execute the benchmark.
6.3.2.2 Experimental evaluation
Our first experiment, shown in Figure 6.4a, examines single client performance. The performance of NFSv4 writing to PVFS2 or Ext3 is comparable because the NFSv4 32 KB write size is less than the PVFS2 64 KB stripe size, which isolates writes to a single disk. With a single pNFS client, writing through the NFSv4 server to PVFS2 is superior to writing directly to PVFS2 until the request size reaches 64 KB. For 16-byte writes, NFSv4 has sixty-seven times the throughput, with the ratio decreasing to one at 64 KB. Write performance through the NFSv4 server reaches its peak at 32 KB, the NFSv4 client request size. At 64 KB, direct storage access begins to outperform indirect access. pNFS with a write threshold of 32 KB offers the performance benefits of both storage protocols by using NFSv4 I/O until 32 KB, then switching to direct storage access with the PVFS2 storage protocol. Figure 6.4b shows the results of ten nodes writing to disjoint segments of the same file. Ext3 performance is limited by random requests from the NFSv4 server daemons.
Using NFSv4 I/O to access PVFS2 does not incur as many random accesses since the writes are spread over eight disks. PVFS2 throughput grows approximately linearly as the impact of the request overhead diminishes. The aggregate performance of NFSv4 is the same as with a single client, with the write performance crossover point between pNFS and NFSv4 occurring at 4 KB. With 16-byte writes, NFSv4 has twenty times the bandwidth, with the ratio decreasing to one at just below 8 KB. The maximum bandwidth difference of 9 MB/s occurs at 1 KB. At 8 KB, direct storage access begins to outperform indirect access. pNFS with a write threshold of 4 KB offers the performance benefits of both storage protocols. Figure 6.4c shows the performance of writing to a 32 MB file in a random manner with increasing request sizes. NFSv4 outperforms pNFS until the individual write size reaches 128 KB, with a maximum difference of 13 MB/s occurring at 16 KB. pNFS using a write threshold of 64 KB experiences the performance benefits of both storage protocols.
6.3.3. ATLAS applications
Not every application behaves like the ones studied in Section 6.1. For example, large writes dominate the FLASH I/O benchmark workload [147], with 99.7 percent of requests greater than 163 KB (with default input parameters). However, beyond the workload characterization studies, there is increasing anecdotal evidence to suggest that small writes are quite common. To assess the impact of the small write heuristic, I use the ATLAS simulator, which does make many small writes. ATLAS [148] is a particle physics experiment that seeks new discoveries in head-on collisions of high-energy protons using the Large Hadron Collider accelerator [149]. Scheduled for completion in 2007, ATLAS will generate over a petabyte of data each year to be distributed for analysis to a multi-tiered collection of decentralized sites. Currently, ATLAS physicists are performing large-scale simulations of the events that will occur within its detector. These simulation efforts influence detector design and the development of real-time event filtering algorithms for reducing the volume of data. The ATLAS detector can detect one billion events with a combined data volume of forty
Figure 6.5: ATLAS digitization write request size distribution with 500 events. (a) Breakdown of total number of requests. (b) Breakdown of total amount of data output.
terabytes each second. After filtering, data from fewer than one hundred events per second are stored for offline analysis. The ATLAS simulation event data model consists of four stages. The Event Generation stage produces pseudo-random events drawn from a statistical distribution of previous experiments. The Simulation stage then simulates the passage of particles (events) through the detectors. The Digitization stage combines hit information with estimates of internal noise, subjecting the hits to a parameterization of the known response of the detectors to produce simulated digital output (digits). The Reconstruction stage performs pattern recognition and track reconstruction algorithms on the digits, converting raw digital data into meaningful physics quantities.
6.3.3.1 Experimental design
Experiments focus on the Digitization stage, the only stage that generates a large amount of data. With 500 events, Digitization writes approximately 650 MB of output data to a single file. Data are written randomly, with write request size distributions shown in Figure 6.5. Figure 6.5a shows that only 4 percent of write request sizes are 275 KB or greater, with the rest below 32 KB. Figure 6.5b shows that 96 percent of write requests are only responsible for 5 percent of the data, while 95 percent of the data are written in requests whose size is greater than 275 KB. This distribution of write request size and total amount of data output closely matches the workload characterization studies discussed in Section 6.1. Analysis of the Digitization write request distribution
with varying numbers of events indicates that the distribution in Figure 6.5 is a representative sample. Analysis of the Digitization trace data reveals a large number of fsync system calls. For example, executing Digitization with 50 events produces more than 900 synchronous fsync calls. Synchronously committing data to storage reduces request parallelism and the effectiveness of write gathering. ATLAS developers explain that the overwhelming use of fsync is an implementation issue rather than an application necessity [150]. Therefore, to evaluate Digitization write throughput I used IOZone to replay the write trace data while omitting fsync calls for 50 and 500 events.
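A minimal sketch of this kind of trace replay is shown below. It assumes a hypothetical text trace format with one record per line, which is not the actual ATLAS trace format, and it simply skips fsync records, committing data once at the end.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Replay a write trace against a target file, skipping fsync records.
 * Assumed (hypothetical) trace format, one record per line:
 *     write <offset> <size>
 *     fsync
 */
#define MAX_IO (1 << 20)   /* 1 MB payload buffer */

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <trace> <target>\n", argv[0]);
                return 1;
        }
        FILE *trace = fopen(argv[1], "r");
        int fd = open(argv[2], O_WRONLY | O_CREAT, 0644);
        char *buf = calloc(1, MAX_IO);            /* zero-filled write payload */
        if (!trace || fd < 0 || !buf)
                return 1;

        char op[16];
        long long off, size;
        while (fscanf(trace, "%15s", op) == 1) {
                if (strcmp(op, "fsync") == 0)
                        continue;                 /* omit fsync, as in the experiment */
                if (fscanf(trace, "%lld %lld", &off, &size) != 2)
                        break;
                if (size > MAX_IO)
                        size = MAX_IO;            /* clamp to the payload buffer */
                pwrite(fd, buf, (size_t)size, (off_t)off);
        }
        fsync(fd);                                /* single commit at the end */
        free(buf);
        fclose(trace);
        close(fd);
        return 0;
}
```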
6.3.3.2 Experimental evaluation
To evaluate pNFS with the ATLAS simulator, I analyze the Digitization write throughput with several write threshold values. First, I use the IOZone benchmark to determine the maximum PVFS2 write throughput. The maximum write throughput for a single-threaded application and an entire client is 18 MB/s and 54 MB/s, respectively. The single-threaded application maximum performance value sets the upper limit for ATLAS write throughput. Increasing the number of threads simultaneously writing to storage increases the maximum write throughput three-fold. Since ATLAS Digitization is a single-threaded application generating output for serialized events, it cannot directly take advantage of this extra performance. As shown in Figure 6.6, pNFS achieves a write throughput of 11.3 MB/s and 11.9 MB/s with 50 and 500 events, respectively. The small write requests reduce the application's peak write throughput by approximately 6 MB/s. With a write threshold of 1 KB, 49 percent of requests are re-directed to the NFSv4 server, increasing performance by 23 percent. With a write threshold of 32 KB, 96 percent of write requests use the NFSv4 I/O path. With 50 events, the increase in write performance is 57 percent, for a write throughput of 17.8 MB/s. With 500 events, the increase in write performance is 100 percent, for a write throughput of 23.8 MB/s.
Figure 6.6: ATLAS digitization write throughput for 50 and 500 events. pNFS with a 32 KB write threshold achieves the best overall performance by directing small requests through the NFSv4 server and the 275 KB and 1 MB requests to the PVFS2 storage nodes.
It is interesting to note that the 32 KB write threshold performance exceeds the single-threaded application maximum write throughput. The NFSv4 server is multi-threaded, so it can process multiple simultaneous write requests and outperform a single-threaded application. This is another benefit of the increased parallelism available in distributed file systems. When pNFS funnels all Digitization output through the NFSv4 server, performance drops dramatically, but is still slightly better than the performance of pNFS with direct I/O. In this experiment, the improved write performance of the smaller requests overshadows the reduced performance of sending large write requests through the NFSv4 server. The 50 and 500 event experiments have slightly different write request size and offset distributions. In addition, the 500 event simulation has ten times the number of write requests. The difference between the pNFS write threshold performance improvements in the 50 and 500 event experiments seems to be due to a difference in the behavior of the NFSv4 writeback cache with these different write workloads.
6.3.4. Discussion
Experiments show that writing to the direct data path is not always the best choice. Write request size plays an important role in determining the preferred data path.
The Linux NFSv4 client gathers small writes into 32 KB requests. With very small requests, the overhead of gathering requests diminishes the benefit. As the size of each write request grows, the increase in throughput is considerable. Performing an increased number of parallel asynchronous write requests also improves performance. This is seen in both Figure 6.4a and Figure 6.4c, as the performance of writing 32 KB requests exceeds that of writing directly to storage. The Linux NFSv4 server does not perform write gathering. Our experiments clearly show the benefit of increasing the write request size. The ability of the NFSv4 server to combine small requests from multiple clients into a single large request should lead to further advantages.
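The sketch below illustrates the write-gathering idea the Linux NFSv4 client relies on: contiguous small writes accumulate in a buffer and are flushed as one larger request. The structure and the flush_fn callback are illustrative, not the Linux implementation.

```c
#include <string.h>
#include <sys/types.h>

/* Minimal sketch of client-side write gathering: contiguous small writes
 * are accumulated and flushed as a single larger request once the buffer
 * reaches the gather size (32 KB here, matching the Linux NFSv4 client
 * request size).  flush_fn is a placeholder for the transport call. */

#define GATHER_SIZE (32 * 1024)

struct gather_buf {
        char    data[GATHER_SIZE];
        off_t   start;        /* file offset of data[0] */
        size_t  len;          /* bytes currently buffered */
        ssize_t (*flush_fn)(off_t off, const void *buf, size_t len);
};

static void gather_flush(struct gather_buf *g)
{
        if (g->len > 0)
                g->flush_fn(g->start, g->data, g->len);
        g->len = 0;
}

/* Buffer a small write; flush first if it is not contiguous with the
 * buffered data or would overflow the buffer. */
int gather_write(struct gather_buf *g, off_t off, const void *buf, size_t len)
{
        if (len >= GATHER_SIZE) {                 /* large write bypasses the buffer */
                gather_flush(g);
                return g->flush_fn(off, buf, len) < 0 ? -1 : 0;
        }
        if (g->len > 0 &&
            (off != g->start + (off_t)g->len || g->len + len > GATHER_SIZE))
                gather_flush(g);
        if (g->len == 0)
                g->start = off;
        memcpy(g->data + g->len, buf, len);
        g->len += len;
        return 0;
}
```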
6.4. Related work
Log-structured file systems [151] increase the size of writes by appending small I/O requests to a log and later flushing the log to disk. Zebra [152] extends this to distributed environments. Side effects include large data layouts and erratic block sizes. The Vesta parallel file system [153] improves I/O performance by using workload characteristics provided by applications to optimize data layout on storage. Providing this information can be difficult for applications that lack regular I/O patterns or whose I/O access patterns change over time. The Slice file system prototype [116] divides NFS requests into three classes: large I/O, small-file I/O, and namespace. A µProxy, interposed between clients and servers, routes NFS client requests between storage, small-file servers, and directory servers, respectively. Large I/O flows directly to storage while small-file servers aggregate I/O operations of small files and the initial segments of large files. This method benefits small file performance, but ignores small I/O to large files. Both the EMC HighRoad file system [103] and the RAID-II network file server [95] transfer small files over a low-bandwidth network and use a high-bandwidth network for large file requests, but differentiating small and large files does not help with small requests to large files. This re-direction benefits only large requests, and may reduce the performance of small requests.
GPFS [97] forwards data between I/O nodes for requests smaller than the block size. This reduces the number of messages with the lock manager and possibly reduces the number of read-modify-write sequences. Both the Lustre [104] and the Panasas ActiveScale [105] file systems use a write-behind cache to perform buffered writes. In addition, Lustre allows clients to place small files on a single storage node to reduce access overhead. Implementations of MPI-IO such as ROMIO [82] use application hints and file access patterns to improve I/O request performance. The work reported here benefits and complements MPI-IO and its implementations. MPI-IO is useful to applications that use its API and have regular I/O access patterns, e.g., strided I/O, but MPI-IO small write performance is limited by the deficiencies of the underlying parallel file system. Our pNFS enhancements are beneficial for existing and unmodified applications. They are also beneficial at the file system layer of MPI-IO implementations, to improve the performance of the underlying parallel file system.
6.5. Conclusion
Diverse file access patterns and computing environments in the high-performance community make pNFS an indispensable tool for scalable data access. This chapter demonstrates that pNFS can increase write throughput to parallel data stores—regardless of file size—by overcoming the inefficient performance of parallel file systems when write request sizes are small. pNFS improves the overall write performance of parallel file systems by using direct, parallel I/O for large write requests and a distributed file system for small write requests. Evaluation results using a real scientific application and several benchmark programs demonstrate the benefits of this design. The pNFS heterogeneous metadata protocol allows any parallel file system to realize these write performance improvements.
CHAPTER VII
Direct Data Access with a Commodity Storage Protocol
Parallel file systems feature impressive throughput, but they are highly specialized, have limited operating system and hardware platform support and poor cross-site performance, and often lack strong security mechanisms. In addition, while parallel file systems excel at large data transfers, many do so at the expense of small I/O performance. While large data transfers dominate many scientific applications, numerous workload characterization studies have highlighted the prevalence of small, sequential data requests in modern scientific applications [74, 77, 78]. Many application domains demonstrate the need for high bandwidth, concurrent, and secure access to large datasets across a variety of platforms and file systems. Scientific computing connects large computational and data facilities across the globe and can generate petabytes of data. Digital movie studios that generate terabytes of data every day require access from Sun, Windows, SGI, and Linux workstations and compute clusters [11]. This need for heterogeneous data access creates a conflict between parallel file systems and application platforms. Distributed file systems such as NFS [38] and CIFS [2] bridge the interoperability gap, but they are unable to deliver the superior performance of a high-end storage system. pNFS overcomes these enterprise- and grand challenge-scale obstacles by enabling direct access to storage from clients while preserving operating system, hardware platform, and parallel file system independence. pNFS provides file access scalability by using the storage protocol of the underlying parallel file system to distribute I/O across the bisectional bandwidth of the storage network between clients and storage devices, removing the single server bottleneck that is so vexing to client/server-based systems. In combination, the elimination of the single server bottleneck and the ability of clients to access storage directly yield superior file access performance and scalability.
Regrettably, pNFS does not retain NFSv4 file system access transparency, and it therefore cannot shield applications from different parallel file system security protocols and metadata and data consistency semantics. In addition, implementing pNFS support for every storage protocol on every operating system and hardware platform is a colossal undertaking. File systems that support standard storage protocols may be able to share development costs, but full support for a particular protocol is often unrealized, hampering interoperability. The pNFS file-based layout access protocol helps bridge this gap in transparency with middle-tier data servers, but eliminates direct data access, which can hurt performance. This chapter introduces Direct-pNFS, a novel augmentation to pNFS that increases portability and regains parallel file system access transparency while continuing to match the performance of native parallel file system clients. Architecturally, Direct-pNFS uses a standard distributed file system protocol for direct access to a parallel file system's storage nodes, bridging the gap between performance and transparency. Direct-pNFS leverages the strengths of NFSv4 to improve I/O performance over the entire range of I/O workloads. I know of no other distributed file system that offers this level of performance, scalability, file system access transparency, and file system independence. Direct-pNFS makes the following contributions:
Heterogeneous and ubiquitous remote file system access. Direct-pNFS benefits are available with a conventional pNFS client: Direct-pNFS uses the pNFS file-based layout type, and does not require file system specific layout drivers, e.g., object [154] or PVFS2 [155].
Remote file system access transparency and independence. pNFS uses file system specific storage protocols that can expose gaps in the underlying file system semantics (such as security support). Direct-pNFS, on the other hand, retains NFSv4 file system access transparency by using the NFSv4 storage protocol for data access. In addition, Direct-pNFS remains independent of the underlying file system and does not interpret file system-specific information.
I/O workload versatility. While distributed file systems are usually engineered to perform well on small data accesses [63], parallel file systems target scientific workloads
dominated by large data transfers. Direct-pNFS combines the strengths of both, providing versatile data access to manage efficiently a diversity of workloads.
Scalability and throughput. Direct-pNFS can match the I/O throughput and scalability of the exported parallel file system without requiring the client to support any protocol other than NFSv4. This chapter uses numerous benchmark programs to demonstrate that Direct-pNFS matches the I/O throughput of a parallel file system and has superior performance in workloads that contain many small I/O requests.
A case for commodity high-performance remote data access. Direct-pNFS complies with emerging IETF standards and can use an unmodified pNFS client. This chapter makes a case for open systems in the design of high-performance clients, demonstrating that standards-compliant commodity software can deliver the performance of a custom-made parallel file system client. Using standard clients to access specialized storage systems offers ubiquitous data access and reduces development and support costs without cramping storage system optimization.
The remainder of this chapter is organized as follows. Section 7.1 makes the case for open systems in distributed data access. Section 7.2 reviews pNFS and its departure from traditional client/server distributed file systems. Sections 7.3 and 7.4 describe the Direct-pNFS architecture and Linux prototype. Section 7.5 reports the results of experiments with micro-benchmarks and four different I/O workloads. I summarize and conclude in Section 7.6.
7.1. Commodity high-performance remote data access
NFS owes its success to an open protocol, platform ubiquity, and transparent access to file systems, independent of the underlying storage technology. Beyond performance and scalability, standards-based high-performance data access needs all these properties to be successful in Grid, cluster, enterprise, and personal computing. The benefits of standards-based data access with these qualities are numerous. A single client can access data within a LAN and across a WAN, which reduces development, administration, and support costs. System administrators can select a storage solution with confidence that no matter the operating system and hardware platform, users
are able to access the data. In addition, storage vendors are free to focus on advanced data management features such as fault tolerance, archiving, manageability, and scalability without having to custom tailor their products across a broad spectrum of client platforms.
7.2. pNFS and storage protocol-specific layout drivers
This section revisits the pNFS architecture described in Chapter V and discusses the drawbacks of using storage protocol-specific layout drivers.
7.2.1. Hybrid file system semantics
Although parallel file systems separate control and data flows, there is tight integration of the control and data protocols. Users must adapt to different semantics for each data repository. pNFS, on the other hand, allows applications to realize common file system semantics across data repositories. As users access heterogeneous data repositories with pNFS, the NFSv4 metadata protocol provides a degree of consistency with respect to the file system semantics within each repository. Unfortunately, certain semantics are layout driver and storage protocol dependent, and they can drastically change application behavior. For example, Panasas ActiveScale [105] supports the OSD security protocol [136], while Lustre [104] uses a specialized security protocol. This forces clients that need to access both parallel file systems to support multiple authentication, integrity, and privacy mechanisms. Additional examples of these semantics include client caching and fault tolerance.
7.2.2. The burden of layout and I/O driver development
The pNFS layout and I/O drivers are the workhorses of pNFS high-performance data access. These specialized components understand the storage system's storage protocol, security protocol, file system semantics, device identification, and layout description and management. For pNFS to achieve broad heterogeneous data access, layout and I/O drivers must be developed and supported on a multiplicity of operating system and hardware platforms—an effort comparable in magnitude to the development of a parallel file system client.
Figure 7.1: pNFS file-based architecture with a parallel file system. The pNFS file-based layout architecture consists of pNFS data servers, clients and a metadata server, plus parallel file system (PFS) storage nodes, clients, and metadata servers. The three-tier design prevents direct storage access and creates overlapping and redundant storage and metadata protocols. The two-tier design, with pNFS servers, PFS clients, and storage on the same node, suffers from these problems plus diminished single client bandwidth.
7.2.3. The pNFS file-based layout driver
Currently, the IETF is developing three layout specifications: file, object, and block. The pNFS protocol includes only the file-based layout format, with object- and block-based to follow in separate specifications. As such, all pNFS implementations will support the file-based layout format for remote data access, while support for the object- and block-based access methods will be optional. A pNFS file-based layout governs an entire file and is valid until recalled by the pNFS server. To perform data access, the file-based layout driver combines the layout information with a known list of data servers for the file system, and sends READ, WRITE, and COMMIT operations to the correct data servers. Once I/O is complete, the client sends updated file metadata, e.g., size or modification time, to the pNFS server.
Figure 7.2: pNFS file-based data access. (a) 3-tier: intermediary pNFS data servers access PFS storage nodes. (b) 2-tier: pNFS data servers access both local and remote PFS storage nodes. Both architectures lose direct data access.
pNFS file-based layout information consists of:
• Striping type and stripe size
• Data server identifiers
• File handles (one for each data server)
• Policy parameters
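A minimal sketch of this layout information as a data structure, together with the offset-to-server mapping a client performs under simple round-robin striping, is shown below; the field names are illustrative and do not follow the protocol's XDR definitions.

```c
#include <stdint.h>

/* Illustrative sketch of the contents of a pNFS file-based layout and of
 * the mapping from a file offset to a data server under round-robin
 * striping.  Types and names are not the protocol's actual definitions. */

struct pnfs_file_layout {
        uint32_t  stripe_type;       /* striping/aggregation scheme */
        uint32_t  stripe_size;       /* bytes per stripe unit */
        uint32_t  num_servers;       /* number of data servers */
        uint32_t *server_ids;        /* data server identifiers */
        void    **file_handles;      /* one file handle per data server */
        /* ... policy parameters omitted ... */
};

/* For round-robin striping, return the index of the data server that holds
 * the byte at file offset 'off', and the offset within that server's densely
 * packed portion of the file. */
uint32_t layout_map(const struct pnfs_file_layout *l, uint64_t off,
                    uint64_t *server_off)
{
        uint64_t stripe = off / l->stripe_size;            /* global stripe unit */
        uint32_t server = (uint32_t)(stripe % l->num_servers);
        uint64_t local  = stripe / l->num_servers;          /* stripe index on server */

        *server_off = local * l->stripe_size + off % l->stripe_size;
        return server;
}
```

A client splitting a large request would call this mapping once per stripe unit and issue the resulting per-server byte ranges as parallel READ or WRITE operations.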
Figure 7.1 illustrates how the pNFS file-based layout provides access to an asymmetric parallel file system. (Henceforth, I refer to this unspecified file system as PFS). pNFS clients access pNFS data servers that export PFS clients, which in turn access data from PFS storage nodes and metadata from PFS metadata servers. A PFS management protocol binds metadata servers and storage, providing a consistent view of the file system. pNFS clients use NFSv4 for I/O while PFS clients use the PFS storage protocol.
7.2.3.1 Performance issues
Architecturally, using a file-based layout offers some latitude. The architecture depicted in Figure 7.1 might have two tiers, or it might have three. The three-tier architecture places PFS clients and storage on separate nodes, while the two-tier architecture places PFS clients and storage on the same nodes. As shown in Figure 7.2, neither choice features direct data access: the three-tier model has intermediary data servers while with two tiers, tier-two PFS clients access data from other tier-two storage nodes. In addition, the two-tier model transfers data between data servers, reducing the available bandwidth
Figure 7.3: Direct-pNFS data access architecture. NFSv4 application components use an NFSv4 data component to perform I/O directly to a PFS data component bundled with storage. The NFSv4 metadata component shares its access control information with the data servers to ensure the data servers allow only authorized data requests.
between clients and data servers. These architectures can improve NFS scalability, but the lack of direct data access—a primary benefit of pNFS—scuttles performance. Block size mismatches and overlapping metadata protocols also diminish performance. If the pNFS block size is greater than the PFS block size, a large pNFS data request produces extra PFS data requests, each incurring a fixed amount of overhead. Conversely, a small pNFS data request forces a large PFS data request, unnecessarily taxing storage resources and delaying the pNFS request. pNFS file system metadata requests to the pNFS server, e.g., file size, layout information, become PFS client metadata requests to the PFS metadata server. This ripple effect increases overhead and delay for pNFS metadata requests. It is hard to address these remote access inefficiencies with fully connected block-based parallel file systems, e.g., GPFS [97], GFS [98, 99], and PolyServe Matrix Server [101], but for parallel file systems whose storage nodes admit NFS servers, Direct-pNFS offers a solution.
7.3. Direct-pNFS
Direct-pNFS supports direct data access—without requiring a storage system specific layout driver on every operating system and hardware platform—by exploiting file-based layouts to describe the exact distribution of data on the storage nodes. Since a Direct-pNFS client knows the exact location of a file's contents, it can target I/O requests to the correct data servers.
Figure 7.4: Direct-pNFS with a parallel file system. Direct-pNFS eliminates overlapping I/O and metadata protocols and uses the NFSv4 storage protocol to directly access storage. The PFS uses a layout translator to convert its layout into a pNFS file-based layout. A Direct-pNFS client may use an aggregation driver to support specialized file striping methods.
Direct-pNFS supports direct data access to any parallel file system that allows NFS servers on its storage nodes—such as object-based [104, 105], PVFS2 [129], and IBRIX Fusion [156]—and it inherits the operational, fault tolerance, and security semantics of NFSv4.
7.3.1. Architecture
In the two- and three-tier pNFS architectures shown in Figure 7.1, the underlying data layout is opaque to pNFS clients. This forces them to distribute I/O requests among data servers without regard for the actual location of the data. To overcome this inefficient data access, Direct-pNFS, shown in Figure 7.4, uses a layout translator to convert a parallel file system's layout into a pNFS file-based layout. A pNFS server, which exists on every PFS data server, can satisfy Direct-pNFS client data requests by accessing the local PFS storage component. Direct-pNFS and PFS metadata components also co-exist on the same node, which eliminates remote PFS metadata requests from the pNFS server. The Direct-NFSv4 data access architecture, shown in Figure 7.3, alters the NFSv4-PFS data access architecture (Figure 3.3) by using an NFSv4 data component to perform direct I/O to a PFS data component bundled with storage. The PFS data component proxies NFSv4 I/O requests to the local disk. An NFSv4 metadata component on storage maintains NFSv4 access control semantics.
In combination, the use of accurate layout information and the placement of pNFS servers on PFS storage and metadata nodes eliminate extra PFS data and metadata requests and obviate the need for data servers to support the PFS storage protocol altogether. The use of a single storage protocol also eliminates block size mismatches between storage protocols.
7.3.2. Layout translator
To give Direct-pNFS clients exact knowledge of the underlying data layout, a parallel file system uses the layout translator to specify a file's storage nodes, file handles, aggregation type, and policy parameters. The layout translator is independent of the underlying parallel file system and does not interpret PFS layout information. The layout translator simply gathers file-based layout information, as specified by the PFS, and creates a pNFS file-based layout. The overhead for a PFS to use the layout translator is small and confined to the PFS metadata server.
7.3.3. Optional aggregation drivers
It is impossible for the pNFS protocol to support every method of distributing data among the storage nodes. At this writing, the pNFS protocol supports two aggregation schemes: round-robin striping and a second method that specifies a list of devices that form a cyclical pattern for all stripes in the file. To broaden support for unconventional aggregation schemes such as variable stripe size [157] and replicated or hierarchical striping [19, 158], Direct-pNFS also supports optional "pluggable" aggregation drivers. An aggregation driver provides a compact way for the Direct-pNFS client to understand how the underlying parallel file system maps file data onto the storage nodes. Aggregation drivers are operating system and platform independent, and are based on the distribution drivers in PVFS2, which use a standard interface to adapt to most striping schemes. Although aggregation drivers are non-standard components, their development effort is minimal compared to the effort required to develop an entire layout driver.
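The following sketch suggests what such a pluggable aggregation driver interface might look like; the hook names are hypothetical and only indicate the kind of mapping each driver must provide.

```c
#include <stdint.h>

/* Hypothetical "pluggable" aggregation driver interface, in the spirit of
 * the PVFS2 distribution drivers described above: each driver maps a
 * logical file offset to a storage node and a node-local offset, and
 * reports how many contiguous bytes that node holds from that point.
 * Names are illustrative, not the prototype's actual interface. */

struct aggregation_driver {
        const char *name;                    /* e.g., "round-robin", "variable-stripe" */
        const void *params;                  /* driver-specific layout parameters */

        /* storage node index holding the byte at 'logical_off' */
        uint32_t (*node_of)(const void *params, uint64_t logical_off);

        /* node-local offset corresponding to 'logical_off' */
        uint64_t (*physical_off)(const void *params, uint64_t logical_off);

        /* contiguous bytes stored on that node starting at 'logical_off' */
        uint64_t (*contiguous_len)(const void *params, uint64_t logical_off);
};
```

A Direct-pNFS client would select a driver by name from the layout's striping type and call these hooks while splitting an I/O request into per-data-server byte ranges.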
7.4. Direct-pNFS prototype
I implemented a Direct-pNFS prototype that maintains strict agnosticism of the underlying storage system and, as we shall see, matches the performance of the storage system that it exports. Figure 7.5 displays the architecture of my Direct-pNFS prototype, using PVFS2 for the exported file system. Scientific data is easily re-created, so PVFS2 buffers data on storage nodes and sends the data to stable storage only when necessary or at the application's request (fsync). To match this behavior, my Direct-pNFS prototype departs from the NFSv4 protocol, committing data to stable storage only when an application issues an fsync or closes the file. At this writing, the user-level PVFS2 storage daemon does not support direct VFS access. Instead, the Direct-pNFS data servers simulate direct storage access by way of the existing PVFS2 client and the loopback device. The PVFS2 client on the data servers functions solely as a conduit between the NFSv4 server and the PVFS2 storage server on the same node. My Direct-pNFS prototype uses special NFSv4 StateIDs for access to the data servers, round-robin striping as its aggregation scheme, and the GETDEVLIST, LAYOUTGET, and LAYOUTCOMMIT pNFS operations. A layout pertains to an entire file, is stored in the file's inode, and is valid for the lifetime of the inode.
7.5. Evaluation
In this section I assess the performance and I/O workload versatility of Direct-pNFS. I first use the IOR micro-benchmark [146] to demonstrate the scalability and performance of Direct-pNFS compared with PVFS2, pNFS file-based layout with two and three tiers, and NFSv4. To explore the versatility of Direct-pNFS, I use two scientific I/O benchmarks and two macro benchmarks to represent a variety of access patterns to large storage systems:
NAS Parallel Benchmark 2.4 – BTIO. The NAS Parallel Benchmarks (NPB) are used to evaluate the performance of parallel supercomputers. The BTIO benchmark is based on a CFD code that uses an implicit algorithm to solve the 3D compressible Navier-Stokes equations. I use the class A problem set, which uses a 64x64x64 grid, performs
Figure 7.5: Direct-pNFS prototype architecture with the PVFS2 parallel file system. The PVFS2 metadata server converts the PVFS2 layout into a pNFS file-based layout, which is passed to the pNFS server and then to the Direct-pNFS file-based layout driver. The pNFS data server uses the PVFS2 client as a conduit to retrieve data from the local PVFS2 storage server. Data servers do not communicate.
200 time steps, checkpoints data every five time steps, and generates a 400 MB checkpoint file. The benchmark uses MPI-IO collective file operations to ensure large write requests to the storage system. All parameters are left as default.
ATLAS Application. ATLAS [148] is a particle physics experiment that seeks new discoveries in head-on collisions of high-energy protons using the Large Hadron Collider accelerator [149] under construction at CERN. The ATLAS simulation runs in four stages; the Digitization stage simulates detector data generation. With 500 events, Digitization spreads approximately 650 MB randomly over a single file. Each client writes to a separate file. More information regarding ATLAS can be found in Section 6.3.3.
OLTP: OLTP models a database workload as a series of transactions on a single large file. Each transaction consists of a random 8 KB read, modify, and write. Each client performs 20,000 transactions, with data sent to stable storage after each transaction.
Postmark: The Postmark benchmark simulates metadata and small I/O intensive applications such as electronic mail, NetNews, and Web-based services [145]. Postmark performs transactions on a large number of small randomly sized files (between 1 KB and 500 KB). Each transaction first deletes, creates, or opens a file, then reads or appends
512 bytes. Data are sent to stable storage before the file is closed. Postmark performs 2,000 transactions on 100 files in 10 directories. All other parameters are left as default.
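For reference, a minimal sketch of the OLTP-style transaction loop described above appears below; the file name and 1 GB working-set size are illustrative parameters, not part of the benchmark definition.

```c
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

/* Sketch of the OLTP-style workload: 20,000 transactions, each a random
 * 8 KB read-modify-write followed by a commit to stable storage.
 * "oltp.dat" and FILE_SIZE are illustrative; short-I/O returns are
 * ignored for brevity. */

#define XFER      8192
#define NUM_TXN   20000
#define FILE_SIZE (1024LL * 1024 * 1024)     /* assumed 1 GB working file */

int main(void)
{
        int fd = open("oltp.dat", O_RDWR);
        if (fd < 0)
                return 1;

        char block[XFER];
        for (int i = 0; i < NUM_TXN; i++) {
                off_t off = ((off_t)(rand() % (int)(FILE_SIZE / XFER))) * XFER;
                pread(fd, block, XFER, off);      /* read */
                block[0] ^= 1;                    /* modify */
                pwrite(fd, block, XFER, off);     /* write */
                fsync(fd);                        /* stable storage per transaction */
        }
        close(fd);
        return 0;
}
```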
7.5.1. Experimental setup
All experiments use a sixteen-node cluster connected via Gigabit Ethernet with jumbo frames. To ensure a fair comparison between architectures, we keep the number of nodes and disks in the back end constant. The PVFS2 1.5.1 file system has six storage nodes, with one storage node doubling as a metadata manager, and a 2 MB stripe size. The pNFS three-tier architecture uses three NFSv4 servers and three PVFS2 storage nodes. For the three-tier architecture, we move the disks from the data servers to the storage nodes. All NFS experiments use eight server threads and 2 MB wsize and rsize. All nodes run Linux 2.6.17.
Storage System: Each PVFS2 storage node is equipped with dual 1.7 GHz P4 processors, 2 GB memory, one Seagate 80 GB 7200 RPM hard drive with Ultra ATA/100 interface and 2 MB cache, and one 3Com 3C996B-T Gigabit Ethernet card.
Client System: Client nodes one through seven are equipped with dual 1.3 GHz P3 processors, 2 GB memory, and an Intel Pro Gigabit Ethernet card. Client nodes eight and nine have the same configuration as the storage nodes.
7.5.2. Scalability and performance
Our first set of experiments uses the IOR benchmark to compare the scalability and performance of Direct-pNFS, PVFS2, pNFS file-based layout with two and three tiers, and NFSv4. In the first set of experiments, clients sequentially read and write separate 500 MB files. In the second set of experiments, clients sequentially read and write a disjoint 500 MB portion of a single file. To view the effect of I/O request size on performance, the experiments use a large block size (2 to 4 MB) and a small block size (8 KB). Read experiments use a warm server cache. The presented value is the average over several executions of the benchmark.
Figure 7.6a and Figure 7.6b display the maximum aggregate write throughput with separate files and a single file. Direct-pNFS matches the performance of PVFS2, reaching a maximum aggregate write throughput of 119.2 MB/s and 110 MB/s for the separate and single file experiments.
Figure 7.6: Direct-pNFS aggregate write throughput. (a) and (b) With a separate or single file and a large block size, Direct-pNFS scales with PVFS2 while pNFS-2tier suffers from a lack of direct file access. pNFS-3tier and NFSv4 are CPU limited. (c) With separate files and 100 Mbps Ethernet, pNFS-2tier is bandwidth limited due to its need to transfer data between data servers. (d) and (e) With a separate or single file and an 8 KB block size, all NFSv4 architectures outperform PVFS2.
pNFS-3tier write performance levels off at 83 MB/s with four clients. pNFS-3tier must split the six available servers between data servers and storage nodes, which cuts the maximum network bandwidth in half relative to the network bandwidth for the other pNFS and PVFS2 architectures. In addition, using two disks in each storage node does not offer twice the disk bandwidth of a single disk due to the constant level of CPU, memory, and bus bandwidth. Lacking direct data access, pNFS-2tier incurs a write delay and performs a little worse than Direct-pNFS and PVFS2. The additional transfer of data between data servers limits the maximum bandwidth between the pNFS clients and data servers. This is not visible in Figure 7.6a and Figure 7.6b because network bandwidth exceeds disk bandwidth, so Figure 7.6c repeats the multiple file write experiments with 100 Mbps Ethernet. With this change, pNFS-2tier yields only half the performance of Direct-pNFS and PVFS2, clearly demonstrating the network bottleneck of the pNFS-2tier architecture. NFSv4 performance is unaffected by the number of clients, indicating a single server bottleneck.
Figure 7.7: Direct-pNFS aggregate read throughput. (a) With separate files and a large block size, Direct-pNFS outperforms PVFS2 for some numbers of clients. pNFS-2tier and pNFS-3tier are bandwidth limited due to a lack of direct file access. NFSv4 is bandwidth and CPU limited. (b) With a single file and a large block size, PVFS2 eventually outperforms Direct-pNFS due to a prototype software limitation. pNFS-2tier and pNFS-3tier are bandwidth limited due to a lack of direct file access. NFSv4 is CPU limited. (c) and (d) With a separate or single file and an 8 KB block size, all NFSv4 architectures outperform PVFS2.
Figure 7.6d and Figure 7.6e display the aggregate write throughput with separate files and a single file using an 8 KB block size. The performance for all NFSv4-based architectures is unaffected in the large block size experiments due to the NFSv4 client write back cache, which combines write requests until they reach the NFSv4 wsize (2 MB in my experiments). However, the performance of PVFS2, a parallel file system designed for large I/O, decreases dramatically with small block sizes, reaching a maximum aggregate write throughput of 39.4 MB/s. Figure 7.7a and Figure 7.7b display the maximum aggregate read throughput with separate files and a single file. With separate files, Direct-pNFS matches the performance of PVFS2, reaching a maximum aggregate read throughput of 509 MB/s and 482 MB/s. With a single file, PVFS2 has lower throughput than Direct-pNFS with only a few clients, but outperforms Direct-pNFS with eight clients, reaching a maximum aggregate read throughput of 530.7 MB/s. Direct-pNFS places the NFSv4 and PVFS2 server modules on the same node, inherently placing a higher demand on server resources. In addition, PVFS2 uses a fixed
number of buffers to transfer data between the kernel and the user-level storage daemon, which creates an additional bottleneck. This is evident in Figure 7.7b, where PVFS2 achieves a higher aggregate I/O throughput than Direct-pNFS. The division of the six available servers between data servers and storage nodes in pNFS-3tier limits its maximum performance again, achieving a maximum aggregate bandwidth of only 115 MB/s. NFSv4 aggregate performance is flat, limited to the bandwidth of a single server. The pNFS-2tier bandwidth bottleneck is readily visible in Figure 7.7a and Figure 7.7b, where disk bandwidth is no longer a factor. Each data server is responding to client read requests and transferring data to other data servers so they can satisfy their client read requests. Sending data to multiple targets limits each data server's maximum read bandwidth. Figure 7.7c and Figure 7.7d display the aggregate read throughput with separate files and a single file using an 8 KB block size. The performance for all NFSv4-based architectures remains unaffected in the large block size experiments due to NFSv4 client read gathering. The performance of PVFS2 again decreases dramatically with small block sizes, reaching a maximum aggregate read throughput of 51 MB/s.
7.5.3. Micro-benchmark discussion
Direct-pNFS matches or outperforms the aggregate I/O throughput of PVFS2. In addition, the asynchronous, multi-threaded design of Linux NFSv4 combined with its write back cache achieves superior performance with smaller block sizes. In the write experiments, both Direct-pNFS and PVFS2 fully utilize the available disk bandwidth. In the read experiments, data are read directly from server cache, so the disks are not a bottleneck. Instead, client and server CPU performance becomes the limiting factor. The pNFS-2tier architecture offers comparable performance with fewer clients, but is limited by network bandwidth as I increase the number of clients. The pNFS-3tier architecture demonstrates that using intermediary data servers to access data is inefficient: those resources are better used as storage nodes. The remaining experiments further demonstrate the versatility of Direct-pNFS with workloads that use a range of block sizes.
Figure 7.8: Direct-pNFS scientific and macro benchmark performance. (a) ATLAS. Direct-pNFS outperforms PVFS2 with a small and large write request workload. (b) BTIO. Direct-pNFS and PVFS2 achieve comparable performance with a large read and write workload. Lower time values are better. (c) OLTP. Direct-pNFS outperforms PVFS2 with an 8 KB read-modify-write request workload. (d) Postmark. Direct-pNFS outperforms PVFS2 in a small read and append workload.
7.5.4. Scientific application benchmarks
This section uses two scientific benchmarks to assess the performance of Direct-pNFS in high-end computing environments.
7.5.4.1 ATLAS
To evaluate ATLAS Digitization write throughput I use IOZone to replay the write trace data for 500 events. Each client writes to a separate file. Figure 7.8a shows that Direct-pNFS can manage efficiently the mix of small and large write requests, achieving an aggregate write throughput of 102.5 MB/s with eight clients. While small write requests reduce the maximum write throughput achievable by Direct-pNFS by approximately 14 percent, they severely reduce the performance of PVFS2, which achieves only 41 percent of its maximum aggregate write throughput.
7.5.4.2 NAS Parallel Benchmark 2.4 – BTIO
The Block-Triangle I/O benchmark is an industry standard for measuring the I/O performance of a cluster. Without optimization, BTIO I/O requests are small, ranging from a few hundred bytes to eight kilobytes. The version of the benchmark used in this chapter uses MPI-IO collective buffering [80], which increases the I/O request size to one MB and greater. The benchmark times also include the ingestion and verification of the result file. BTIO performance experiments are shown in Figure 7.8b. BTIO running time is approximately the same for Direct-pNFS and PVFS2, with a maximum difference of five percent with nine clients.
7.5.5. Synthetic workloads
This section uses two macro-benchmarks to analyze the performance of Direct-pNFS in a more general setting.
7.5.5.1 OLTP
Figure 7.8c displays the OLTP experimental results. Direct-pNFS scales well with the workload's random 8 KB read-modify-write transactions, achieving 26 MB/s with eight clients. As expected, PVFS2 performs poorly with small I/O requests, achieving an aggregate I/O throughput of 6 MB/s.
7.5.5.2 Postmark
For the small I/O workload of Postmark, I reduce the stripe size, wsize, and rsize to 64 KB. This allows a more even distribution of requests among the storage nodes. The Postmark experiments are shown in Figure 7.8d, with results given in transactions per second. Direct-pNFS again leverages the asynchronous, multi-threaded Linux NFSv4 implementation, designed for small I/O intensive workloads like Postmark, to perform up to 36 times as many transactions per second as PVFS2.
7.5.6. Macro-benchmark discussion
This set of experiments demonstrates that Direct-pNFS performance compares well to the exported parallel file system with the large I/O scientific application benchmark BTIO. Direct-pNFS performance for ATLAS, for which 95% of the I/O requests are smaller than 275 KB, far surpasses native file system performance. The Postmark and OLTP benchmarks, also dominated by small I/O, yield similar results. With Direct-pNFS demonstrating good performance on small I/O workloads, a natural next step is to explore performance with routine tasks such as a build/development environment. Following the SSH build benchmark [159], I created a benchmark that uncompresses, configures, and builds OpenSSH [160]. Using the same systems as above, I compare the SSH build execution time using Direct-pNFS and PVFS2. I find that Direct-pNFS reduces compilation time, a stage heavily dominated by small read and write requests, but increases the time to uncompress and configure OpenSSH, stages dominated by file creates and attribute updates.
Tasks like file creation—
relatively simple for standalone file systems—become complex on parallel file systems, leading to a lot of inter-node communication. Consequently, many parallel file systems distribute metadata over many nodes and have clients gather and reconstruct the information, relieving the overloaded metadata server. The NFSv4 metadata protocol relies on a central metadata server, effectively recentralizing the decentralized parallel file system metadata protocol. The sharp contrast in metadata management cost between NFSv4 and parallel file systems—beyond the scope of this dissertation—merits further study.
7.6. Related work
Several pNFS layout drivers are under development. At this writing, Sun Microsystems, Inc. is developing file- and object-based layout implementations. Panasas object and EMC block drivers are currently under development. pNFS file-based layout drivers with the architecture in Figure 7.1 have been demonstrated with GPFS, Lustre, and PVFS2. Network Appliance is using the Linux file-based layout driver to bind disparate filers (NFS servers) into a single file system image. The architecture differs from Figure 7.1 in
that the filers are not fully connected to storage; each filer is a standalone server attached to Fibre Channel disk arrays. This continues previous work that aggregates partitioned NFS servers into a single file system image [91, 113, 114]. Direct-pNFS generalizes these architectures to be independent of the underlying parallel file system. The Storage Resource Broker (SRB) [130] aggregates storage resources, e.g., a file system, an archival system, or a database, into a single data catalogue. The HTTP protocol is the most common and widespread way to access remote data stores. SRB and HTTP also have some limits: they do not enable parallel I/O to multiple storage endpoints and do not integrate with the local file system. EMC's HighRoad [103] uses the NFS or CIFS protocol for its control operations and stores data in an aggregated LAN and SAN environment. Its use of file semantics facilitates data sharing in SAN environments, but is limited to the EMC Symmetrix storage system. A similar, non-commercial version is also available [141]. Another commodity protocol used along the high-performance data channel is the Object Storage Device (OSD) command set, which transmits variable length storage objects over SCSI transports. Currently OSD can be used to access only OSD-based file systems, and it cannot be used for file system independent remote data access. GPFS-WAN [134, 135], used extensively in the TeraGrid [127], features exceptional throughput across high-speed, long haul networks, but is focused on large I/O transfers and is restricted to GPFS storage systems. GridFTP [4] is also used extensively in Grid computing to enable high I/O throughput, operating system independence, and secure WAN access to high-performance file systems. Successful and popular, GridFTP nevertheless has some serious limitations: it copies data instead of providing shared access to a single copy, which complicates its consistency model and decreases storage capacity; it lacks direct data access and a global namespace; it runs as an application, and cannot be accessed as a file system without operating system modification. Distributed replicas can be vital in reducing network latency when accessing data. Direct-pNFS is not intended to replace GridFTP, but to work alongside it. For example, in tiered projects such as ATLAS at CERN, GridFTP remains a natural choice for long-haul scheduled transfers among the upper tiers, while the file system semantics of Direct-pNFS offers advantages in the lower tiers by letting scientists work with files directly, promoting effective data management.
7.6.1. Small I/O performance
Log-structured file systems [151] increase the size of writes by appending small I/O requests to a log and later flushing the log to disk. Zebra [152] extends this to distributed environments. Side effects include large data layouts and erratic block sizes. The Vesta parallel file system [153] improves I/O performance by using workload characteristics provided by applications to optimize data layout on storage. Providing this information can be difficult for applications that lack regular I/O patterns or whose I/O access patterns change over time. Both the Lustre [104] and the Panasas ActiveScale [105] file systems use a write-behind cache to perform buffered writes. In addition, Lustre allows clients to place small files on a single storage node to reduce access overhead. Implementations of MPI-IO such as ROMIO [82] use application hints and file access patterns to improve I/O request performance. The work reported here benefits and complements MPI-IO and its implementations. MPI-IO is useful to applications that use its API and have regular I/O access patterns, e.g., strided I/O, but MPI-IO small write performance is limited by the deficiencies of the underlying parallel file system. Direct-pNFS is beneficial for existing and unmodified applications. It is also beneficial at the file system layer of MPI-IO implementations, to improve the performance of the underlying parallel file system.
7.7. Conclusion
Universal and transparent data access is a critical enabling feature for high-performance access to storage. Direct-pNFS enhances the heterogeneity and transparency of pNFS by using an unmodified NFSv4 client to support high-performance remote data access. Experiments demonstrate that a commodity storage protocol can match the I/O throughput of the specialized parallel file system that it exports. Furthermore, Direct-pNFS also "scales down" to outperform the parallel file system client in diverse workloads.
CHAPTER VIII
Summary and Conclusion
This chapter summarizes my dissertation research on high-performance remote data access and discusses possible extensions of the work.
8.1. Summary and supplemental remarks
Scientific collaboration requires scalable and widespread access to massive data sets. This dissertation introduces data access architectures that use the NFSv4 distributed file system to realize levels of scalability, performance, security, heterogeneity, transparency, and independence that were heretofore unrealized. Parallel file systems manage massive data sets by scaling disks, disk controllers, network, and servers—every aspect of the system architecture. As the number of hardware components increases, the difficulty of locating, managing, and protecting data grows rapidly. Parallel file systems consist of a large number of increasingly complex and interconnected components, e.g., a metadata service, data and control components, and storage. As described in this dissertation, these components manage many different aspects of the system, including file system control information, data, metadata, data sharing, security, and fault tolerance. Providing distributed access to parallel file systems further increases the complexity of interacting components. Additional components can include new data and control components as well as a separate metadata service. Distributed access components lack direct and coherent integration with parallel file system components, increasing overhead and reducing system performance. For example, client/server-based distributed file systems require a server translation component to convert remote file system requests into parallel file system requests.
This dissertation analyzed the organization and properties of distributed and parallel file system components and how these components influence the performance and capabilities of remote data access. For example, tight integration of their separate metadata services can provide performance benefits through reduced communication and overhead, but decreases the level of independence between the file systems and increases the development effort required to support remote data access. Similarly, the placement (and/or existence) of both data components can affect the performance, heterogeneity, transparency, and security levels of remote data access.
Chapter IV discusses possible remote data access architectures with existing and unmodified parallel file system components. Lack of access to detailed parallel file system information prevents the distributed and parallel file system components from optimizing their interaction. I introduce Split-Server NFSv4, which scales I/O throughput by spreading client load among existing ports of entry into the parallel file system. Split-Server NFSv4 retains NFSv4 semantics and offers heterogeneous, secure, transparent, and independent file system access. The Split-Server NFSv4 prototype raises several issues with accessing existing and unmodified parallel file systems, including:
1. Parallel file systems target single-threaded applications such as scientific MPI codes. As a result, some parallel file systems serialize data access among the threads of a multi-threaded application. Overlooking the distinction between competing and non-competing components can considerably reduce performance.
2. A similar issue arises when clients spread I/O requests among data servers. Because the parallel file system does not recognize that the I/O requests come from a single application, it performs unnecessary coordination and communication. In addition, synchronization of file attributes between data servers degrades remote access performance.
3. Block size mismatches unnecessarily tax storage resources and delay application requests.
4. Distributing I/O requests across multiple data servers inhibits the underlying file system's ability to perform effective readahead.
5. Overlapping metadata protocols lead to a communication ripple effect that increases overhead and delay for remote metadata requests.
Chapter V investigates a remote data access architecture that allows full access to the information stored by parallel file system components. This enables a distributed file system to utilize resources using the same information parallel file systems use to scale to large numbers of clients. I demonstrate that NFSv4 can use this information along with the storage protocol of the underlying file system to increase performance and scalability while retaining file system independence. In combination, the pNFS protocol, storage-specific layout drivers, and some parallel file system customizations can overcome the above I/O inefficiencies and enable customizable security and data access semantics. The pNFS prototype matches the performance of the underlying parallel file system and demonstrates that efficient layout generation is vital to achieve continuous scalability.
While components in distributed and parallel file systems may provide similar services, their capabilities can be vastly different. Chapter VI demonstrates that pNFS can increase the overall write throughput to parallel data stores—regardless of file size—by using direct, parallel I/O for large write requests and a distributed file system for small write requests. The large buffers, limited asynchrony, and high per-request overhead inherent to parallel file systems scuttle small I/O performance. By completely isolating and separating the control and data protocols in pNFS, a single file system can use any combination of storage and metadata protocols, each excelling at specific workloads or system environments. Use of multiple storage protocols increases the overall write performance of the ATLAS Digitization application by 57 to 100 percent.
Beyond the interaction of distributed and parallel file system components, the physical location of a component can affect overall system capabilities. Chapter VII analyzes the cost of having pNFS clients support a parallel file system data component. A storage-specific layout driver must be developed for every platform and operating system, reducing heterogeneity and widespread access to parallel file systems. In addition, requiring support for semantics that are layout driver and storage protocol dependent, e.g., security, client caching, and fault tolerance, reduces the data access transparency of NFSv4. To increase the heterogeneity and transparency of pNFS, Direct-pNFS removes the requirement that clients incorporate parallel file system data components. Direct-pNFS
uses the NFSv4 storage protocol for direct access to NFSv4-enabled parallel file system storage nodes. A single layout driver potentially reduces development effort while retaining NFSv4 file access and security semantics. To perform heterogeneous data access with only a single layout driver, parallel file system-specific data layout information is converted into the standard pNFS file-based layout format. Pluggable aggregation drivers provide support for most file distribution schemes (a round-robin example is sketched below). While aggregation drivers can limit widespread data access, their development effort is likely to be less than for a layout driver and they can be shared across storage systems. Direct-pNFS experiments demonstrate that a commodity storage protocol can match the I/O throughput of the exported parallel file system. Furthermore, Direct-pNFS leverages the small I/O strengths of NFSv4 to outperform the parallel file system client in diverse workloads.
The pNFS extensions to the NFSv4 protocol are included in the upcoming NFSv4.1 minor version specification [123]. Implementation of NFSv4.1 is under way on several major operating systems, bringing effective global data access closer to reality.
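As an illustration of what an aggregation driver computes, the sketch below maps a logical file offset onto a (data server, server offset) pair for a simple round-robin striping scheme. The data structure, server names, and 64 KB stripe size are assumptions for illustration; they are not drawn from the pNFS specification or any particular parallel file system.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FileLayout:
    """Hypothetical file-based layout: an ordered list of data servers plus a stripe size."""
    data_servers: List[str]   # e.g., NFSv4-enabled storage node addresses
    stripe_size: int          # bytes per stripe unit (64 KB here, purely illustrative)

def round_robin_map(layout: FileLayout, offset: int) -> Tuple[str, int]:
    """Map a logical file offset to the server that holds it and the offset within that server's file."""
    stripe_index = offset // layout.stripe_size                       # which stripe unit
    server = layout.data_servers[stripe_index % len(layout.data_servers)]
    stripes_on_server = stripe_index // len(layout.data_servers)      # full stripes already on that server
    server_offset = stripes_on_server * layout.stripe_size + offset % layout.stripe_size
    return server, server_offset

layout = FileLayout(data_servers=["ds0", "ds1", "ds2", "ds3"], stripe_size=64 * 1024)
for off in (0, 70_000, 300_000):
    print(off, round_robin_map(layout, off))
```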
8.2. Supplementary observations
From a practical standpoint, this dissertation can serve as a guide for capacity planning decisions. Organizations often allocate limited hardware resources to satisfy local data access requirements; adding resources for remote access is an afterthought. Used during the planning and acquisition phases, the analyses in this dissertation can help improve remote data access.
Unfortunately, the storage community suffers from a narrow vision of data access. As this dissertation explains, many parallel file system providers continue to believe that a storage solution can exist in isolation. This idea is quickly becoming old-fashioned. Specialization of parallel file systems limits the availability of data and reduces collaboration, a vital part of innovation [161]. In addition, the bandwidth available for remote access across the WAN continues to increase, with global collaborations now using multiple ten Gigabit Ethernet networks [162].
Storage solutions need a holistic approach that accounts for every data access persona. Each persona has a specific I/O workload, set of supported operating systems, and required level of scalability, performance, transparency, security, and data sharing. For
example, the data access requirements of compute clusters, archival systems, and individual users (local and remote) are all different, but they all need to access the same storage system. Specializing file systems for a single persona widens the gap between applications and data.
pNFS attempts to address these diverse requirements by combining the strengths of NFSv4 with direct storage access. As a result, pNFS uses NFSv4 semantics, which include a level of fault tolerance as well as close-to-open cache consistency. As described in Section 7.4, NFSv4 fault tolerance semantics can decrease write performance by pushing data aggressively onto stable storage. In addition, some applications or data access techniques, e.g., collective I/O, require more control over the NFSv4 client data cache. Mandating that an application close and re-open a file to refresh its data cache (or acquire a lock on the file, which has the same effect) can increase delay and reduce performance. Maintaining consistent semantics across remote parallel file systems is important, but a distributed file system should allow applications to tune these semantics to meet their needs.
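To show what close-to-open consistency implies for an application, the minimal sketch below re-opens a file before each read; the open forces the NFS client to revalidate its cached view against the server, and the close pushes dirty data. The file path is a hypothetical NFS mount chosen for illustration.

```python
import os

def read_fresh(path, size, offset=0):
    """Under close-to-open semantics, opening the file causes the client to
    revalidate its cached attributes and data with the server, so a fresh open
    before each read is one way an application can observe another client's writes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, size)
    finally:
        os.close(fd)   # close ends the consistency window and flushes dirty data

# A long-running reader that kept a single descriptor open could instead see
# stale cached data until it reopened or locked the file.
path = "/mnt/nfs/shared/status.txt"   # hypothetical NFSv4 mount
if os.path.exists(path):
    data = read_fresh(path, 4096)
```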
8.3. Beyond NFSv4
This dissertation focuses on the Network File System (NFS) protocol, which is distinguished by its precise definition by the IETF, the availability of open source implementations, and support on virtually every modern operating system. It is natural to ask whether the scalability and performance benefits of the pNFS architecture can be realized by other distributed file systems such as AFS or CIFS. In other words, is it possible to design and engineer pAFS, pCIFS, or even pNFSv3? If so, what are the necessary requirements of a distributed file system that permit this transformation? To answer these questions, let us first review the pNFS architecture:
1. pNFS client. pNFS extends the standard NFSv4 client by delegating application I/O requests to a storage-specific layout driver. Either the pNFS client or each individual layout driver can manage and cache layout information.
2. pNFS server. pNFS extends the standard NFSv4 server with the ability to relay client file layout requests to the underlying parallel file system and respond with the resultant opaque layout information. In addition, the pNFS server tracks
outstanding layout information so that it can be recalled in case a file is renamed or its layout information is modified.5
3. pNFS metadata protocol. The base NFSv4 metadata protocol enables clients and servers to request, update, and recall file system and file metadata information, e.g., vnode information, and to request file locks. pNFS extends this protocol to request, return, and recall file layout information.
Fundamentally, pNFS extends NFSv4 in its ability to retrieve, manage, and utilize file layout information. Any distributed file system with a client, a server, and a metadata protocol that can be extended to retrieve layout information is a candidate for pNFS transformation (although the implementation details will vary). A single server is not necessary along the control path, but a distributed file system must have the ability to fulfill metadata requests. For example, some file systems do not centralize file size on a metadata server, but dynamically build the file size through queries to the data servers.
pNFS does not rely on distributed file system support of a storage protocol since it isolates the storage protocol in the layout driver. However, it is important to note that Split-Server NFSv4, Direct-pNFS, and file-based pNFS use the NFSv4 storage protocol along the data path, and a distributed file system must support its own storage protocol to realize the benefits of these architectures. The stateful nature of NFSv4, which enables server callbacks to the clients, is not strictly required to support the pNFS protocol. Instead, a layout driver could discover that its layout information is invalid through error messages returned by the data servers. Unfortunately, block-based data servers such as Fibre Channel disks cannot run NFSv4 servers and therefore cannot return pNFS error codes. This would prohibit stateless distributed file systems, such as NFSv3, from supporting block-based layout drivers.
A core benefit of pNFS is its ability to provide remote access to parallel file systems. pNFS achieves this level of independence by leveraging the NFS Vnode definition and VFS interface. DFS, SMB, and the AppleTalk Filing Protocol all use their own mechanisms
5. File layout information can change if it is restriped due to load balancing, lack of free disk space on a particular data server, or server intervention.
to achieve this level of independence. Other distributed file systems, such as AFS, do not currently support remote data access to existing data stores.6
Most NFSv4 implementations also include additional features such as locks, readahead, a write-back cache, and a data cache. These features provide many benefits but are unrelated to the support of the pNFS protocol.
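To summarize the division of labor in the three-part architecture reviewed above, the sketch below outlines the kind of interface a layout driver might present to a pNFS-style client: the client obtains an opaque layout through the metadata protocol and hands it to the driver, which performs I/O directly against the data servers. The method names and the opaque-layout representation are assumptions for illustration, not the Linux pNFS layout driver API.

```python
from abc import ABC, abstractmethod

class LayoutDriver(ABC):
    """Hypothetical interface between a pNFS-style client and a storage-specific driver."""

    @abstractmethod
    def set_layout(self, file_handle: bytes, opaque_layout: bytes) -> None:
        """Decode and cache the opaque layout returned by the metadata server."""

    @abstractmethod
    def read(self, file_handle: bytes, offset: int, count: int) -> bytes:
        """Read directly from the data servers named in the cached layout."""

    @abstractmethod
    def write(self, file_handle: bytes, offset: int, data: bytes) -> int:
        """Write directly to the data servers; return the number of bytes written."""

    @abstractmethod
    def invalidate_layout(self, file_handle: bytes) -> None:
        """Drop cached layout state, e.g., after a server recall or a data server error."""
```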
8.4. Extensions
This section presents some research themes that hold the potential to further improve the overall performance of remote data access.
8.4.1. I/O performance
MPI is the dominant tool for inter-node communication, while MPI-IO is the nascent tool for cluster I/O. Native support for MPI-IO implementations (e.g., ROMIO) by remote access file systems such as pNFS is critical.
Figure 8.1 offers an example of an inter-cluster data transfer over the WAN. The application cluster is running an MPI application that wants to read a large amount of data from the server cluster and perhaps write to its own backend. The MPI head node obtains the data location from the server cluster and distributes portions of the data location information (via MPI) to the other application cluster nodes, enabling direct access to server cluster storage devices. The application then uses MPI-IO to read data in parallel from the server cluster across the WAN, processes the data, and directs output to the application cluster backend. A natural use case for this architecture is a visualization application processing the results of a scientific application run on the server cluster. Another use case is an application making a local copy of data from the server cluster on the application cluster.
Regarding NFSv4, several improvements to the protocol could benefit applications. For example, List I/O support has shown great application performance benefits with other storage protocols [163, 164]. In addition, many NFSv4 implementations have
6. HostAFSd, currently under development, will allow AFS servers to export the local file system.
Figure 8.1: pNFS and inter-cluster data transfers across the WAN. A pNFS cluster retrieves data from a remote storage system, processes the data, and writes to its local storage system. The MPI head node distributes layout information to pNFS clients.
a maximum data transfer size of 64 KB to 1 MB, while many parallel file systems support a maximum transfer size of 4 MB or larger. Efficient implementation of larger NFSv4 data transfer sizes is vital for high-performance access to parallel file systems.
Finally, this dissertation uses a distributed file system to improve I/O throughput to parallel file systems, but leaves the architecture of the parallel file system largely untouched. Parallel file system architecture modifications may also improve remote data access performance, e.g., data and metadata replication and caching [165, 166], and various readahead algorithms.
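The workflow of Figure 8.1 can be sketched with mpi4py: the head node (rank 0) obtains data location information, scatters per-rank portions, and each rank then reads its own byte range with MPI-IO. The get_layout helper, the equal-sized partitioning, and the file path are assumptions for illustration; the sketch also assumes the remote data set is visible at that path and that the script is launched with mpirun.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

def get_layout(path):
    # Placeholder for a head-node query to the server cluster's metadata service;
    # here it simply invents equal-sized byte ranges for each rank.
    file_size = 1 << 26                       # assume a 64 MB remote file
    chunk = file_size // nprocs
    return [(r * chunk, chunk) for r in range(nprocs)]

# Rank 0 (the MPI head node) obtains the data location and scatters it.
portions = get_layout("/remote/cluster/dataset.bin") if rank == 0 else None
offset, length = comm.scatter(portions, root=0)

# Each rank now reads its own byte range in parallel with MPI-IO.
fh = MPI.File.Open(comm, "/remote/cluster/dataset.bin", MPI.MODE_RDONLY)
buf = bytearray(length)
fh.Read_at_all(offset, buf)                   # collective read at per-rank offsets
fh.Close()
```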
8.4.2. Metadata management
This dissertation focuses on improving I/O throughput for scientific applications. Some research has focused on improving metadata performance in high-performance computing [116, 167, 168], but most parallel file systems implement proprietary metadata management schemes.7
To offload overburdened metadata servers, many parallel file systems distribute metadata information among multiple servers and employ their clients to perform tasks that involve communication with storage, e.g., file creation. Most distributed file systems
use a single metadata server, creating a metadata management bottleneck that threatens to eliminate the benefits of the parallel file system's distributed metadata management. Stateful distributed file servers only exacerbate the problem. Additional research is needed to explore how distributed file systems can optimize metadata management without introducing new bottlenecks and inefficiencies.
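One direction, sketched below, is to partition the namespace across several metadata servers, for example by hashing the parent directory of each path so that entries in one directory stay on one server. The server list and hash-based placement are illustrative assumptions, not a scheme taken from the systems cited above.

```python
import hashlib
import posixpath

METADATA_SERVERS = ["mds0.example.org", "mds1.example.org", "mds2.example.org"]

def metadata_server_for(path: str) -> str:
    """Choose the metadata server responsible for a path by hashing its parent
    directory, spreading lookups and file creates across servers instead of
    funneling them through a single metadata service."""
    parent = posixpath.dirname(posixpath.normpath(path)) or "/"
    digest = hashlib.sha1(parent.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(METADATA_SERVERS)
    return METADATA_SERVERS[index]

for p in ("/home/alice/run01/output.h5", "/home/alice/run02/output.h5", "/scratch/job/checkpoint"):
    print(p, "->", metadata_server_for(p))
```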
7. I am unaware of any additional research beyond this dissertation into the interaction between distributed and parallel file system metadata subsystems.
Bibliography
[1] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon, "Design and Implementation of the Sun Network Filesystem," in Proceedings of the Summer USENIX Technical Conference, Portland, OR, 1985.
[2] Common Internet File System File Access Protocol (CIFS), msdn.microsoft.com/library/en-us/cifs/protocol/cifs.asp.
[3] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck, NFS Version 4 Protocol Specification. RFC 3530, 2003.
[4] B. Allcock, J. Bester, J. Bresnahan, A.L. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnal, and S. Tuecke, "Data Management and Transfer in High-Performance Computational Grid Environments," Parallel Computing, 28(5):749-771, 2002.
[5] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, "Query by Image and Video Content: The QBIC System," IEEE Computer, 28(9):23-32, 1995.
[6] Biowulf at the NIH, biowulf.nih.gov/apps/blast.
[7] S. Berchtold, C. Boehm, D.A. Keim, and H. Kriegel, "A Cost Model For Nearest Neighbor Search in High-Dimensional Data Space," in Proceedings of the 16th ACM PODS Symposium, Tucson, AZ, 1997.
[8] P. Caulk, "The Design of a Petabyte Archive and Distribution System for the NASA ECS Project," in Proceedings of the 4th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 1995.
[9] J. Behnke and A. Lake, "EOSDIS: Archive and Distribution Systems in the Year 2000," in Proceedings of the 17th IEEE/8th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2000.
[10] J. Behnke, T.H. Watts, B. Kobler, D. Lowe, S. Fox, and R. Meyer, "EOSDIS Petabyte Archives: Tenth Anniversary," in Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, Monterey, CA, 2005.
[11] D. Strauss, "Linux Helps Bring Titanic to Life," Linux Journal, 46, 1998.
[12] ASCI Purple RFP, www.llnl.gov/asci/platforms/purple/rfp.
[13] Petascale Data Storage Institute, www.pdl.cmu.edu/PDSI.
[14] Serial ATA Workgroup, "Serial ATA: High Speed Serialized AT Attachment," Rev. 1, 2001.
[15] Adaptec Inc., "Ultra320 SCSI: New Technology-Still SCSI," www.adaptec.com.
[16] T.M. Anderson and R.S. Cornelius, "High-Performance Switching with Fibre Channel," in Proceedings of the 37th IEEE International Conference on COMPCON, San Francisco, CA, 1992. [17] K. Salem and H. Garcia-Molina, "Disk Striping," in Proceedings of the 2nd International Conference on Data Engineering, Los Angeles, CA, 1986. [18] O.G. Johnson, "Three-Dimensional Wave Equation Computations on Vector Computers," IEEE Computer, 72(1):90-95, 1984. [19] D.A. Patterson, G.A. Gibson, and R.H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," in Proceedings of the ACM SIGMOD Conference on Management of Data, Chicago, IL, 1988. [20] Small Computer Serial Interface (SCSI) Specification. ANSI X3.131-1986, www.t10.org, 1986. [21] T10 Committee, "Draft Fibre Channel Protocol - 3 (FCP-3) Standard," 2005, www.t10.org/ftp/t10/drafts/fcp3/fcp3r03d.pdf. [22] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner, Internet Small Computer Systems Interface (iSCSI). RFC 3720, 2001. [23] K.Z. Meth and J. Satran, "Design of the iSCSI Protocol," in Proceedings 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, San Diego, CA, 2003. [24] J. Postel and J. Reynolds, File Transfer Protocol (FTP). RFC 765, 1985. [25] B. Callaghan, NFS Illustrated. Essex, UK: Addison-Wesley, 2000. [26] PASC, "IEEE Standard Portable Operating System Interface for Computer Environments," IEEE Std 1003.1-1988, 1988.
[27] D.R. Brownbridge, L.F. Marshall, and B. Randell, "The Newcastle Connection or UNIXes of the World Unite!," Software-Practice and Experience, 12(12):11471162, 1982. [28] P.J. Leach, P.H. Levine, J.A. Hamilton, and B.L. Stumpf, "The File System of an Integrated Local Network," in Proceedings of the 13th ACM Annual Conference on Computer Science, New Orleans, LA, 1985. [29] P.H. Levine, The Apollo DOMAIN Distributed File System. New York, NY: Springer-Verlag, 1987. [30] P.J. Leach, B.L. Stumpf, J.A. Hamilton, and P.H. Levine, "UIDs as Internal Names in a Distributed File System," in Proceedings of the 1st Symposium on Principles of Distributed Computing, Ottawa, Canada, 1982. [31] G. Popek, The LOCUS Distributed System Architecture. Cambridge, MA: MIT Press, 1986. [32] A.P. Rifkin, M.P. Forbes, R.L. Hamilton, M. Sabrio, S. Shah, and K. Yueh, "RFS Architectural Overview," in Proceedings of the Summer USENIX Technical Conference, Atlanta, GA, 1986. [33] S.R. Kleiman, "Vnodes: An Architecture for Multiple File System Types in Sun UNIX," in Proceedings of the Summer USENIX Technical Conference, Altanta, GA, 1986. [34] R. Srinivasan, RPC: Remote Procedure Call Protocol Specification Version 2. RFC 1831, 1995. [35] R. Srinivasan, XDR: External Data Representation Standard. RFC 1832, 1995. [36] Sun Microsystems Inc., NFS: Network File System Protocol Specification. RFC 1094, 1989. [37] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz, "NFS Version 3 Design and Implementation," in Proceedings of the Summer USENIX Technical Conference, Boston, MA, 1994. [38] B. Callaghan, B. Pawlowski, and P. Staubach, NFS Version 3 Protocol Specification. RFC 1813, 1995. [39] B. Callaghan and T. Lyon, "The Automounter," in Proceedings of the Winter USENIX Technical Conference, San Diego, CA, 1989. [40] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West, "Scale and Performance in a Distributed File System," ACM Transactions on Computer Systems, 6(1):51-81, 1988.
[41] J.G. Steiner, C. Neuman, and J.I. Schiller, "Kerberos: An Authentication Service for Open Network Systems," in Proceedings of the Winter USENIX Technical Conference, Dallas, TX, 1988. [42] M. Satyanarayanan, J.J. Kistler, P. Kumar, M.E. Okasaki, E.H. Siegel, and D.C. Steere, "Coda: A Highly Available File System for a Distributed Workstation Environment," IEEE Transactions on Computers, 39(4):447-459, 1990. [43] M.L. Kazar, B.W. Leverett, O.T. Anderson, V. Apostolides, B.A. Bottos, S. Chutani, C.F. Everhart, W.A. Mason, S. Tu, and R. Zayas, "DEcorum File System Architectural Overview," in Proceedings of the Summer USENIX Technical Conference, Anaheim, CA, 1990. [44] S. Chutani, O.T. Anderson, M.L. Kazar, B.W. Leverett, W.A. Mason, and R.N. Sidebotham, "The Episode File System," in Proceedings of the Winter USENIX Technical Conference, Berkeley, CA, 1992. [45] C. Gray and D. Cheriton, "Leases: an Efficient Fault-tolerant Mechanism for Distributed File Cache Consistency," in Proceedings of the 12th ACM Symposium on Operating Systems Principles, Litchfield Park, AZ, 1989. [46] G.S. Sidhu, R.F. Andrews, and A.B. Oppenheimer, Inside AppleTalk. Reading, MA: Addison-Wesley, 1989. [47] International Standards Organization, Information Processing Systems - Open Systems Interconnection - Basic Reference Model. Draft International Standard 7498, 1984. [48] Protocol Standard for a NetBIOS Service on a TCP/UDP Transport: Detailed Specifications. RFC 1002, 1987. [49] Protocol Standard for a NetBIOS Service on a TCP/UDP Transport: Concepts and Methods. RFC 1001, 1987. [50] J.D. Blair, SAMBA, Integrating UNIX and Windows,. Specialized Systems Consultants Inc., 1998. [51] J. K. Ousterhout, A. R. Cherenson, F. Douglis, M. N. Nelson, and B. B. Welch, "The Sprite Network Operating System," IEEE Computer, 21(2):23-36, 1988. [52] M.N. Nelson, B.B. Welch, and J.K. Ousterhout, "Caching in the Sprite Network File System," ACM Transactions on Computer Systems, 6(1):134-154, 1988. [53] V. Srinivasan and J. Mogul, "Spritely NFS: Experiments With Cache-Consistency Protocols," in Proceedings of the 12th ACM Symposium on Operating Systems Principles, 1989.
[54] R. Macklem, "Not Quite NFS, Soft Cache Consistency for NFS," in Proceedings of the Winter USENIX Technical Conference, San Fransisco, CA, 1994. [55] "Purple: Fifth Generation ASC Platform," www.llnl.gov/asci/platforms/purple. [56] The BlueGene/L Team, "An Overview of the BlueGene/L Supercomputer," in Proceedings of Supercomputing '02, Baltimore, MD, 2002. [57] BlueGene/L, www.llnl.gov/asc/computing_resources/bluegenel/ bluegene_home.html. [58] ASCI White, www.llnl.gov/asci/platforms/white. [59] S. Habata, M. Yokokawa, and S. Kitawaki, "The Earth Simulator System," NEC Research and Development Journal, 44(1):21-26, 2003. [60] ASCI Red, www.sandia.gov/ASCI/Red. [61] M. Satyanarayanan, "A Study of File Sizes and Functional Lifetimes," in Proceedings of the 8th ACM Symposium on Operating System Principles, Pacific Grove, CA, 1981. [62] D. Ellard, J. Ledlie, P. Malkani, and M. Seltzer, "Passive NFS Tracing of Email and Research Workloads," in Proceedings of the USENIX Conference on File and Storage Technologies, San Francisco, CA, 2003. [63] M.G. Baker, J.H. Hartman, M.D. Kupfer, K.W. Shirriff, and J.K. Ousterhout, "Measurements of a Distributed File System," in Proceedings of the 13th ACM Symposium on Operating Systems Principles, Pacific Grove, CA, 1991. [64] J.K. Ousterhout, H. Da Costa, D. Harrison, J.A. Kunze, M. Kupfer, and J.G. Thompson, "A Trace-Driven Analysis of the UNIX 4.2 BSD File System," in Proceedings of the 10th ACM Symposium on Operating Systems Principles, Orcas Island, WA, 1985. [65] B. Fryxell, K. Olson, P. Ricker, F.X. Timmes, M. Zingale, D.Q. Lamb, P. MacNeice, R. Rosner, and H. Tufo, "FLASH: An Adaptive Mesh Hydrodynamics Code for Modeling Astrophysical Thermonuclear Flashes," Astrophysical Journal Supplement, 131:273-334, 2000. [66] A. Darling, L. Carey, and W. Feng, "The Design, Implementation, and Evaluation of mpiBLAST," in Proceedings of the ClusterWorld Conference and Expo, in conjunction with the 4th International Conference on Linux Clusters: The HPC Revolution, San Jose, CA, 2003.
[67] E.L. Miller and R.H. Katz, "Input/Output Behavior of Supercomputing Applications," in Proceedings of Supercomputing '91, Albuquerque, NM, 1991. [68] B. Schroeder and G. Gibson, "A Large Scale Study of Failures in HighPerformance-Computing Systems," in Proceedings of the International Conference on Dependable Systems and Networks, Philadelphia, PA, 2006. [69] G. Grider, L. Ward, R. Ross, and G. Gibson, "A Business Case for Extensions to the POSIX I/O API for High End, Clustered, and Highly Concurrent Computing," www.opengroup.org/platform/hecewg, 2006. [70] A.T. Wong, L. Oliker, W.T.C. Kramer, T.L. Kaltz, and D.H. Bailey, "ESP: A System Utilization Benchmark," in Proceedings of Supercomputing '00, Dallas, TX, 2000. [71] B.K. Pasquale and G.C. Polyzos, "Dynamic I/O Characterization of I/O Intensive Scientific Applications," in Proceedings of Supercomputing '94, Washington, D.C., 1994. [72] D. Kotz and N. Nieuwejaar, "Dynamic File-Access Characteristics of a Production Parallel Scientific Workload," in Proceedings of Supercomputing '94, Washington, D.C., 1994. [73] A. Purakayastha, C. Schlatter Ellis, D. Kotz, N. Nieuwejaar, and M. Best, "Characterizing Parallel File-Access Patterns on a Large-Scale Multiprocessor," in Proceedings of the Ninth International Parallel Processing Symposium, Santa Barbara, CA, 1995. [74] N. Nieuwejaar, D. Kotz, A. Purakayastha, C. Schlatter Ellis, and M. Best, "FileAccess Characteristics of Parallel Scientific Workloads," IEEE Transactions on Parallel and Distributed Systems, 7(10):1075-1089, 1996. [75] E. Smirni and D.A. Reed, "Workload Characterization of Input/Output Intensive Parallel Applications," in Proceedings of the Conference on Modeling Techniques and Tools for Computer Performance Evaluation, Saint Malo, France, 1997. [76] E. Smirni, R.A. Aydt, A.A. Chien, and D.A. Reed, "I/O Requirements of Scientific Applications: An Evolutionary View," in Proceedings of the 5th IEEE Conference on High Performance Distributed Computing, Syracuse, NY, 1996. [77] P.E. Crandall, R.A. Aydt, A.A. Chien, and D.A. Reed, "Input/Output Characteristics of Scalable Parallel Applications," in Proceedings of Supercomputing '95, San Diego, CA, 1995. [78] F. Wang, Q. Xin, B. Hong, S.A. Brandt, E.L. Miller, D.D.E Long, and T.T. McLarty, "File System Workload Analysis For Large Scale Scientific Computing
Applications," in Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2004. [79] M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker, and J. Dongarra, MPI: The Complete Reference. Cambridge, MA: MIT Press, 1996. [80] W. Gropp, S. Huss-Lederman, A. Lumsdaine, E. Lusk, B. Nitzberg, W. Saphir, and M. Snir, MPI: The Complete Reference, volume 2--The MPI-2 Extensions. Cambridge, MA: MIT Press, 1998. [81] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard," Parallel Computing, 22(6):789-828, 1996. [82] R. Thakur, W. Gropp, and E. Lusk, "Data Sieving and Collective I/O in ROMIO," in Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation, 1999. [83] J.P Prost, R. Treumann, R. Hedges, B. Jia, and A.E. Koniges, "MPI-IO/GPFS, an Optimized Implementation of MPI-IO on top of GPFS," in Proceedings of Supercomputing '01, Denver, CO, 2001. [84] R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi, "Passion: Optimized I/O for Parallel Applications," IEEE Computer, 29(6):70-78, 1996. [85] D. Kotz, "Disk-directed I/O for MIMD Multiprocessors," ACM Transactions on Computer Systems, 15(1):41-74, 1997. [86] K. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett, "Server-Directed Collective I/O in Panda," in Proceedings of Supercomputing '95, 1995. [87] J. del Rosario, R. Bordawekar, and A. Choudhary, "Improved Parallel I/O via a Two-Phase Run-time Access Strategy," in Proceedings of the Workshop on I/O in Parallel Computer Systems at IPPS '93, Newport Beach, CA, 1993. [88] OpenMP Consortium, "OpenMP C and C++ Application Program Interface, Version 1.0," www.openmp.org, 1997. [89] High Performance Fortran Forum, "High Performance Fortran language specification version 2.0," hpff.rice.edu/versions/hpf2/hpf-v20, 1997. [90] C. Juszczak, "Improving the Write Performance of an NFS Server," in Proceedings of the Winter USENIX Technical Conference, San Fransisco, CA, 1994. [91] P. Lombard and Y. Denneulin, "nfsp: A Distributed NFS Server for Clusters of Workstations," in Proceedings of the 16th International Parallel and Distributed Processing Symposium, Fort Lauderdale, FL, 2002.
[92] C.A. Thekkath, T. Mann, and E.K. Lee, "Frangipani: A Scalable Distributed File System," in Proceedings of the 16th ACM Symposium on Operating Systems Principles, Saint-Malo, France, 1997. [93] R.A. Coyne and H. Hulen, "An Introduction to the Mass Storage System Reference Model, Version 5," in Proceedings of the 12th IEEE Symposium on Mass Storage Systems, Monterey, CA, 1993. [94] S.W. Miller, A Reference Model for Mass Storage Systems. San Diego, CA: Academic Press Professional, Inc., 1988. [95] A.L. Drapeau, K. Shirriff, E.K. Lee, J.H. Hartman, E.L. Miller, S. Seshan, R.H. Katz, K. Lutz, D.A. Patterson, P.H. Chen, and G.A. Gibson, "RAID-II: A HighBandwidth Network File Server," in Proceedings of the 21st International Symposium on Computer Architecture, Chicago, IL, 1994. [96] L. Cabrera and D.D.E. Long, "SWIFT: Using Distributed Disk Striping To Provide High I/O Data Rates," Computing Systems, 4(4):405-436, 1991. [97] F. Schmuck and R. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," in Proceedings of the USENIX Conference on File and Storage Technologies, San Francisco, CA, 2002. [98] Red Hat Software Inc., "Red Hat Global File System," www.redhat.com/ software/rha/gfs. [99] S.R. Soltis, T.M. Ruwart, and M.T. O'Keefe, "The Global File System," in Proceedings of the 5th NASA Goddard Conference on Mass Storage Systems, College Park, MD, 1996. [100] M. Fasheh, "OCFS2: The Oracle Clustered File System, Version 2," in Proceedings of the Linux Symposium, Ottawa, Canada, 2006. [101] Polyserve Inc., "Matrix Server Architecture," www.polyserve.com. [102] C. Brooks, H. Dachuan, D. Jackson, M.A. Miller, and M. Resichini, "IBM TotalStorage: Introducing the SAN File System," IBM Redbooks, 2003, www.redbooks.ibm.com/redbooks/pdfs/sg247057.pdf. [103] EMC Corp., "EMC Celerra HighRoad," www.emc.com/pdf/products/ celerra_file_server/HighRoad_wp.pdf, 2001. [104] Cluster File Systems Inc., Lustre: A Scalable, High-Performance File System. www.lustre.org, 2002. [105] Panasas Inc., "Panasas ActiveScale File System," www.panasas.com.
[106] D.D.E. Long, B.R. Montague, and L. Cabrera, "Swift/RAID: A Distributed RAID System," Computing Systems, 7(3):333-359, 1994. [107] P.H. Carns, W.B. Ligon III, R.B. Ross, and R. Thakur, "PVFS: A Parallel File System for Linux Clusters," in Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta, GA, 2000. [108] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson, Stream Control Transmission Protocol. RFC 2960, 2000. [109] RDMA Consortium, www.rdmaconsortium.org. [110] B. Callaghan and S. Singh, "The Autofs Automounter," in Proceedings of the Summer USENIX Technical Conference, Cincinnati, OH, 1993. [111] S. Tweedie, "Ext3, Journaling Filesystem," in Proceedings of the Linux Symposium, Ottawa, Canada, 2000. [112] ReiserFS, www.namesys.com. [113] F. Garcia-Carballeira, A. Calderon, J. Carretero, J. Fernandez, and J.M. Perez, "The Design of the Expand File System," International Journal of High Performance Computing Applications, 17(1):21-37, 2003. [114] G.H. Kim, R.G. Minnich, and L. McVoy, "Bigfoot-NFS: A Parallel File-Striping NFS Server (Extended Abstract)," 1994, www.bitmover.com/lm. [115] A. Butt, T.A. Johnson, Y. Zheng, and Y.C. Hu, "Kosha: A Peer-to-Peer Enhancement for the Network File System," in Proceedings of Supercomputing '04, Pittsburgh, PA, 2004. [116] D.C. Anderson, J.S. Chase, and A.M. Vahdat, "Interposed Request Routing for Scalable Network Storage," in Proceedings of the 4th Symposium on Operating System Design and Implementation, San Diego, CA, 2000. [117] M. Eisler, A. Chiu, and L. Ling, RPCSEC_GSS Protocol Specification. RFC 2203, 1997. [118] M. Eisler, LIPKEY - A Low Infrastructure Public Key Mechanism Using SPKM. RFC 2847, 2000. [119] J. Linn, The Kerberos Version 5 GSS-API Mechanism. RFC 1964, 1996. [120] NFS Extensions for Parallel Storage (NEPS), www.citi.umich.edu/NEPS. [121] G. Gibson and P. Corbett, pNFS Problem Statement. Internet Draft, 2004.
[122] G. Gibson, B. Welch, G. Goodson, and P. Corbett, Parallel NFS Requirements and Design Considerations. Internet Draft, 2004. [123] S. Shepler, M. Eisler, and D. Noveck, NFSv4 Minor Version 1. Internet Draft, 2006. [124] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck, "Scalability in the XFS File System," in Proceedings of the USENIX Annual Technical Conference, San Diego, CA, USA, 1996. [125] IEEE Storage System Standards Working Group (SSSWG) (Project 1244), "Reference Model for Open Storage Systems Interconnection - Mass Storage System Reference Model Version 5," ssswg.org/public_documents/MSSRM/V5pref.html, 1994. [126] R.A. Coyne, H. Hulen, and R. Watson, "The High Performance Storage System," in Proceedings of Supercomputing '93, Portland, OR, 1993. [127] TeraGrid, www.teragrid.org. [128] W.D. Norcott and D. Capps, "IOZone Filesystem Benchmark," 2003, www.iozone.org. [129] Parallel Virtual File System - Version 2, www.pvfs.org. [130] C. Baru, R. Moore, A. Rajasekar, and M. Wan, "The SDSC Storage Resource Broker," in Proceedings of the Conference of the Centre for Advanced Studies on Collaborative Research, Toronto, Canada, 1998. [131] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, D.S. Roselli, and R.Y. Wang, "Serverless Network File Systems," in Proceedings of the 15th ACM Symposium on Operating System Principles, Copper Mountain Resort, CO, 1995. [132] B.S. White, M. Walker, M. Humphrey, and A.S. Grimshaw, "LegionFS: A Secure and Scalable File System Supporting Cross-Domain High-Performance Applications," in Proceedings of Supercomputing '01, Denver CO, 2001. [133] O. Tatebe, Y. Morita, S. Matsuoka, N. Soda, and S. Sekiguchi, "Grid Datafarm Architecture for Petascale Data Intensive Computing," in Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, Berlin, Germany, 2002. [134] P. Andrews, C. Jordan, and W. Pfeiffer, "Marching Towards Nirvana: Configurations for Very High Performance Parallel File Systems," in Proceedings of the HiperIO Workshop, Barcelona, Spain, 2006.
[135] P. Andrews, C. Jordan, and H. Lederer, "Design, Implementation, and Production Experiences of a Global Storage Grid," in Proceedings of the 23rd IEEE/14th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2006. [136] R.O. Weber, SCSI Object-Based Storage Device Commands (OSD). Storage Networking Industry Association. ANSI/INCITS 400-2004, www.t10.org, 2004. [137] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, and C.L. Seitz, "Myrinet: A Gigabit-per-Second Local-Area Network," IEEE Micro, 15(1):29-36, 1995. [138] Infiniband. Arch. Spec. Vol 1 & 2. Rel. 1.0, www.infinibandta.org/download_spec10.html, 2000. [139] Internet Assigned Numbers Authority, www.iana.org. [140] M. Olson, K. Bostic, and M. Seltzer, "Berkeley DB," in Proceedings of the Summer USENIX Technical Conference, FREENIX track, Monterey, CA, 1999. [141] A. Bhide, A. Engineer, A. Kanetkar, A. Kini, C. Karamanolis, D. Muntz, Z. Zhang, and G. Thunquest, "File Virtualization with DirectNFS," in Proceedings of the 19th IEEE/10th NASA Goddard Conference on Mass Storage Systems and Technologies, College Park, MD, 2002. [142] R. Rew and G. Davis, "The Unidata netCDF: Software for Scientific Data Access," in Proceedings of the 6th International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Anaheim, CA, 1990. [143] NCSA, "HDF5," hdf.ncsa.uiuc.edu/HDF5. [144] Unidata Program Center, "Where is NetCDF Used?," www.unidata.ucar.edu/software/netcdf/usage.html.
[145] J. Katcher, "PostMark: A New File System Benchmark," Network Appliance, Technical Report TR3022, 1997. [146] IOR Benchmark, www.llnl.gov/asci/purple/benchmarks/limited/ior. [147] FLASH I/O Benchmark, flash.uchicago.edu/~jbgallag/io_bench. [148] ATLAS, atlasinfo.cern.ch. [149] The Large Hadron Collider, lhc.web.cern.ch. [150] ATLAS Development Team (private communication), 2005.
[151] M. Rosenblum and J.K. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, 10(1):26-52, 1992. [152] J.H. Hartman and J.K. Ousterhout, "The Zebra Striped Network File System," ACM Transactions on Computer Systems, 13(3):274-310, 1995. [153] P.F. Corbett and D.G. Feitelson, "The Vesta Parallel File System," ACM Transactions on Computer Systems, 14(3):225-264, 1996. [154] B. Halevy, B. Welch, J. Zelenka, and T. Pisek, Object-based pNFS Operations. Internet Draft, 2006. [155] D. Hildebrand and P. Honeyman, "Exporting Storage Systems in a Scalable Manner with pNFS," in Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, Monterey, CA, 2005. [156] IBRIX Fusion, www.ibrix.com. [157] S.V. Anastasiadis, K.C. Sevcik, and M. Stumm, "Disk Striping Scalability in the Exedra Media Server," in Proceedings of the ACM/SPIE Multimedia Computing and Networking, San Jose, CA, 2001. [158] F. Isaila and W.F. Tichy, "Clusterfile: A Flexible Physical Layout Parallel File System," in Proceedings of the IEEE International Conference on Cluster Computing, Newport Beach, CA, 2001. [159] M. Seltzer, G. Ganger, M.K. McKusick, K. Smith, C. Soules, and C. Stein, "Journaling versus Soft Updates: Asynchronous Meta-data Protection in File Systems," in Proceedings of the USENIX Annual Technical Conference, San Diego, CA, 2000. [160] OpenSSH, www.openssh.org. [161] M. Enserink and G. Vogel, "Deferring Competition, Global Net Closes In on SARS," Science, 300(5617):224-225, 2003. [162] H. Newman, J. Bunn, R. Cavanaugh, I. Legrand, S. Low, S. McKee, D. Nae, S. Ravot, C. Steenberg, X. Su, M. Thomas, F. van Lingen, and Y. Xia, "The UltraLight Project: The Network as an Integrated and Managed Resource in Grid Systems for High Energy Physics and Data Intensive Science," Computing in Science and Engineering, 7(6), 2005. [163] A. Ching, W. Feng, H. Lin, X. Ma, and A. Choudhary, "Exploring I/O Strategies for Parallel Sequence Database Search Tools with S3aSim," in Proceedings of the International Symposium on High Performance Distributed Computing, Paris, France, 2006.
[164] A. Ching, A. Choudhary, W.K. Liao, R. Ross, and W. Gropp, "Noncontiguous I/O through PVFS," in Proceedings of the IEEE International Conference on Cluster Computing, Chicago, IL, 2002. [165] S. Weil, S.A. Brandt, E.L. Miller, and C. Maltzahn, "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data," in Proceedings of Supercomputing '06, Tampa, FL, 2006. [166] S. Ghemawat, H. Gobioff, and S.T. Leung, "The Google File System," in Proceedings of the 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, 2003. [167] S. Weil, K.T. Pollack, S.A. Brandt, and E.L. Miller, "Dynamic Metadata Management for Petabyte-scale File Systems," in Proceedings of Supercomputing '04, Pittsburgh, PA, 2004. [168] R.B. Avila, P.O.A. Navaux, P. Lombard, A. Lebre, and Y. Denneulin, "Performance Evaluation of a Prototype Distributed NFS Server," in Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing, Foz do Iguaçu, Brazil, 2004.