Consistent File Replication for Wide Area Collaboration

Jiaying Zhang
[email protected]

Peter Honeyman
[email protected]

Center for Information Technology Integration, University of Michigan, Ann Arbor

Abstract

To meet availability, performance, and scalability requirements, distributed services naturally turn to replication. File service is no exception, as wide-area collaborations drive enterprises to seek simple and efficient mechanisms to share data across geographically distributed organizations. This paper describes the design and implementation of an extension to NFSv4 that enables mutable (i.e., read/write) replication with flexible consistency guarantees, a small performance penalty, and good scaling properties. Evaluation data collected with a prototype implementation shows that our replication system can improve the performance of applications that need to access remote data, even applications that span the Internet.

1. Introduction

Recent years have seen an increasing demand for global collaborations in scientific studies, spanning disciplines from high energy physics, to climatology, to genomics. Applications in these fields demand intensive use of computational resources, far beyond the scope of a single organization, and require access to massive amounts of data, introducing the need for scalable, efficient, and reliable data access and management schemes. Take the Atlas experiment as an example. Atlas searches for new discoveries in high-energy proton collisions. The protons will be accelerated in the Large Hadron Collider accelerator, currently under construction at the European Organization for Nuclear Research (CERN) near Geneva. After the accelerator starts running, Atlas is expected to collect several petabytes of data per year, which need to be distributed to a collection of decentralized sites for analysis. Atlas is the largest collaborative effort ever attempted in the physical sciences: 1800 physicists from more than 150 universities and laboratories in 34 countries participate in the experiment. The wide distribution of the participating researchers and the massive amounts of data to be collected, distributed, stored, and processed demand scalable and efficient data access and storage schemes that allow physicists in all world regions to contribute effectively to the analysis and the physical results.

To meet availability, performance, and scalability requirements, distributed services naturally turn to replication; file service is no exception. While the concept of file replication is not new, existing solutions either forsake read/write replication entirely or severely weaken consistency guarantees, and thus fail to satisfy the requirements of global scientific collaborations, in which users want to use widely distributed computation and storage resources as though they were using them locally. Returning to the Atlas example, the workloads of Atlas mix production and analysis activities. Physics analysis, for example, is an iterative, collaborative process. The stepwise refinement of analysis algorithms requires using multiple clusters to reduce development time. Although the workloads during this process are dominated by reads, they also demand that the underlying system support write operations. Furthermore, strong consistency is often taken for granted, e.g., an executable might incorporate user code that was finished only seconds before the submission of the command that requires the use of this code. The rapid evolution of wide-area collaborations calls for a mutable replicated file system that supports consistent data access. However, when designing such a system, we must also consider the tradeoffs among consistency, performance, and availability. Most scientific applications are read-dominated, so a file system cannot become widely deployed if it supports mutable replication at the cost of impairing normal read performance. Similarly, support for strong consistency guarantees should not slow down applications for which weaker consistency semantics suffice; otherwise, those applications will not choose to employ the system. In this paper, we describe a consistent mutable file replication protocol designed to meet the needs of global collaborations. The protocol supports mutable replication without adding overhead to normal reads. It guarantees "ordered writes" by default, and provides stronger consistency for applications that request it with POSIX synchronization flags, so that the overhead of enforcing strong consistency is incurred only when users demand it.

The replication protocol itself uses a variant primary-copy scheme and a view-based failure recovery protocol that tolerates any number of crash failures and network partitions. Failure detection and recovery are driven by client accesses, so no heartbeat messages or expensive group communication services are required. We implement the protocol as an extension to NFS version 4 (NFSv4) [26], and evaluate its performance for normal operation and failure recovery. The contribution of this paper is the design, implementation, and evaluation of a practical, consistent, and efficient mutable replication scheme. For the most common accesses, the performance penalty of replication is zero; it remains slight in typical deployments with mixed I/O, and is modest in the worst case. The practicality of the system follows from its design as an extension to NFSv4, which opens the door to IETF consensus on a minor version of NFSv4 that standardizes the extensions. The remainder of the paper is organized as follows: Section 2 presents our design principles for a mutable replication protocol. Section 3 describes in detail our replication system, which offers flexible consistency guarantees. Section 4 examines the performance of a prototype implementation. Section 5 reviews related work, and Section 6 concludes.

2. Design Principles

This section introduces a mutable replicated file system whose particular goal is to fulfill the needs of emerging wide-area collaborations. Sections 2.1 through 2.3 discuss our design principles in terms of consistency guarantees, failure model, and performance tradeoffs. Following that, Section 2.4 outlines the scheme of our mutable replicated file system.

2.1. Consistency Guarantees

One challenge when replicating files is keeping copies synchronized as updates occur. In a distributed storage system, synchronization semantics are expressed in terms of the consistency guarantees granted to applications sharing data. Essentially, a consistency guarantee is a contract between processes and the data store: if processes agree to obey certain rules, the store promises to work correctly. In distributed file systems, various consistency guarantees have been introduced. The most stringent guarantee, strict consistency, assures that all clients see precisely the same data at all times. Although semantically ideal, strict consistency can be detrimental to performance and availability in networks with high latency, many clients, and the potential for partition.

On the other end of the spectrum, consistency guarantees are abandoned altogether, e.g., in P2P systems that strive to maximize availability [27, 25, 31], or are replaced by heuristics for addressing conflicts when they happen [33, 18], i.e., optimistic replication. To balance the benefit of replication with the cost of guaranteeing consistent access, some distributed file systems provide read-only access to replicated files, side-stepping update consistency problems altogether [32, 3, 36]. We observe that although optimistic replication has been widely studied, few applications in practice are prepared to deal with the conflicts that might happen. Even if applications can provide such support, conflict resolution must be performed carefully; otherwise, the cost to reproduce data, if that is possible at all, can be considerable. With the particular goal of developing a replicated file system that facilitates global scientific collaborations, we therefore rule out optimistic replication. For its superior read performance, read-only replication has been favored in current experimental platforms for supporting wide-area scientific collaborations, e.g., DataGrid [10]. With read-only replication, once a file is declared as shared by its creator, it cannot be modified. An immutable file has two important properties: its name may not be reused and its contents may not be altered. Notwithstanding its simple implementation, read-only replication has several deficiencies. First, it fails to support complex sharing behavior, e.g., concurrent writes. Second, to guarantee uniqueness of file names, file creation and retrieval require a special API that most applications are not designed for, which hinders the reuse of existing software developed in traditional computing environments for global collaborations. In conclusion, we argue that global scientific collaborations require consistent mutable replication. A desirable solution is not to bypass update consistency problems, but to develop an advanced system that supports write operations and consistency guarantees without hurting ordinary-case performance. In considering the options for consistency guarantees, the "principle of least surprise" argues for the strictest possible semantics, yet this choice is rarely offered as the default. Even in single-system file sharing, applications typically read and write through private, thus potentially stale, I/O buffers. It is tacitly understood that weaker guarantees, such as "ordered writes" or "read your own writes", suffice for most applications, and that applications requiring more stringent guarantees provide for themselves with explicit synchronization mechanisms, such as lock or sync calls. We follow this approach. By default, our replication system guarantees sequential consistency, in which distributed nodes do not necessarily see updates simultaneously, but they are guaranteed to see them in the same order [20].

We take advantage of this relaxed consistency requirement to design and build a system that imposes no penalty on reading files. Furthermore, to meet the needs of applications that require synchronized access, we support the use of POSIX synchronization flags at open time to dictate requirements for shared access to a file.

2.2. Failure Model

One objective of our work is to increase data availability with a replication scheme that tolerates a large class of failures. There is a well-studied hierarchy of failures in distributed systems: omission failure, performance failure, and Byzantine failure [11]. An omission failure occurs when a component fails to respond to a service request. Server crash and network partition are typical omission failures. A performance failure occurs when a system component fails to respond to a service request within a specified time limit. Occasional message delays caused by overloaded servers or network congestion are examples of performance faults. In Byzantine failure, components act in arbitrary, even malicious ways [19]. Compromised security can lead to Byzantine failure. Although security breaches are increasingly common on the Internet, Byzantine failure is beyond the scope of the work presented in this paper. When considering omission failure and performance failure, we expect that the latter happens more frequently as a system scales to wide area networks. To develop a replication protocol that performs well in the face of typical Internet conditions, it is important that the performance of our protocol is insensitive to temporary message delays. We achieve this by allowing a write request to be answered as soon as a majority of replication servers have acknowledged the update. Consequently, even though occasional message delays can cause a handful of replication servers to respond sluggishly, the penalty to application performance is small. Furthermore, to tolerate long-term omission failures, we developed a failure recovery scheme that allows the system to continuously process read and asynchronous write requests as long as most of the replication servers are in working order.

2.3. Performance Tradeoff

Good performance is a critical goal for all file systems. Our design follows a simple but fundamental rule: make common accesses fast. Based on insights from workload analyses of real file systems [9, 8, 4, 30] and a recent workload study of future global-scale scientific applications [15, 34], the following cases are considered, ordered by expected frequency:

• Exclusive read: most common. Support for replication should add negligible overhead to unshared reads.
• Shared read: common. Blaze observes that files used by multiple workstations make up a substantial proportion of read traffic. For example, files read by more than one user make up more than 60% of read traffic, and files read by more than ten users make up more than 30% of read traffic. This motivates us to avoid cost penalties for shared reads.

• Exclusive write: less common. File system workload studies show that writes are less common than reads. When we consider accesses to data that need to be replicated over wide area networks, this difference can become even larger. This allows us to design a file replication system in which data updates are more expensive than in the single-copy case, while still achieving good average performance.
• Write with concurrent access: infrequent. A longer delay due to enforcing consistency can be justified when a user tries to access an object being updated by another client.
• Server failure and network partition: rare. Special failure recovery procedures can be used when a server crashes or a network partitions. During a failure, write accesses might even be blocked if stringent consistency must be guaranteed, without doing much damage to overall performance.

To guarantee data consistency without penalizing exclusive reads and shared reads, our replication system uses a variant primary-copy scheme with operation forwarding to support concurrent access during file modifications. We depart from the usual primary copy scheme by allowing dynamic determination of a primary server, at the granularity of a single file. When there are no writes, a client's read requests are served by a nearby server, as in a read-only replication system. Furthermore, failure detection and recovery are driven by client accesses; no heartbeat messages or special group communication services are needed. When a primary server fails, recovery mechanisms generate a replacement primary server that then brings the other replication servers to a consistent state. With these approaches, the five cases discussed above incur overheads roughly inversely proportional to their expected frequencies.

2.4. Outline

Following the principles discussed above, we have designed a mutable replication protocol suitable for emerging wide-area collaborations. The protocol is well suited for read-dominated applications, as mutable replication adds no cost to exclusive reads or shared reads. It supports file modifications by using a variant primary copy scheme with operation forwarding to guarantee consistency. Flexible consistency guarantees are supported: the protocol guarantees sequential consistency by default, and provides support for synchronized access with POSIX synchronization flags, so that the overhead of enforcing stringent consistency is incurred only when users demand it. The protocol tolerates any number of component omission and performance failures, even when these lead to network partition. We have implemented our replication protocol as an extension to NFSv4, the emerging Internet standard for distributed filing. Our implementation utilizes several existing features provided by NFSv4, such as client-side failure recovery, mechanisms to support read-only replication, compound RPC, and the delayed-write policy. We pay special attention to keeping the protocol simple enough that the extensions can potentially be standardized as an IETF minor version of NFSv4. Furthermore, the protocol requires only minuscule extensions to the client-side implementation, which makes it easy and practical to deploy. In the rest of this paper, we refer to the implemented replication system as rNFS for short.

3. Architecture and Scheme

In this section, we first briefly describe the naming scheme of rNFS. Following that, we discuss in detail how our mutable replication protocol guarantees sequential consistency and synchronized access.

3.1. Naming Scheme

The NFSv4 protocol includes features to support file system migration and read-only replication using a special file attribute, FS_LOCATIONS. The NFSv4 specification calls for a client access to a migrated file system to yield a special error (NFS4ERR_MOVED); retrieving the FS_LOCATIONS attribute gives new locations for the file system. The client uses this information to connect to the new server. For replication, a client's first access to a file system might yield the FS_LOCATIONS attribute, which lists alternative locations for the file system. Complying with the published NFSv4 protocol, we also use the FS_LOCATIONS attribute to communicate replica location information between servers and clients. However, the namespace of rNFS includes two extended features. First, we do not rely on the FS_LOCATIONS attribute to locate a replication server for file system replication. Instead, we extend NFSv4 to support a single global name space that hides server location details from users.

By convention, a special directory, /nfs, is the global root of all NFSv4 file systems. To an NFSv4 client, /nfs resembles a directory that holds recently accessed file system mount points. Entries under /nfs are mounted on demand. Initially, /nfs is empty. The first time a user accesses any NFSv4 file system, the referenced name is forwarded to a daemon that queries DNS to map the given name to one or more file server locations, selects a file server, and mounts it at the point of reference. The format of reference names under the /nfs directory follows Domain Name System [23] conventions. We use a TXT Resource Record [24] for server location information. The content of a TXT RR maps a reference name to a list of file servers, in this case the replicas holding copies of the data. The second extended feature is support for directory replication. We implement directory replication in rNFS by exporting a directory with an attached reference string that includes information on how to obtain directory replica locations, such as the replica lookup method and lookup key. Our prototype supports four types of reference string: LDAP, DNS, FILE, and SERVER REDIRECT. The format of each type is provided in an extended technical report [37]. When a client first accesses a replicated directory, the server resolves the replica locations of that directory with the attached reference string. It then sends this information to the client through FS_LOCATIONS, as specified in the NFSv4 protocol.
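To make the name-resolution step concrete, here is a small sketch of how a mount daemon might map a reference name under /nfs to a list of candidate replication servers via a DNS TXT record, as described above. It is not taken from the rNFS implementation: the record name and the comma-separated host-list layout are assumptions, since the paper defers the exact formats to the extended technical report [37].

```python
# Illustrative sketch only: resolve an rNFS reference name to candidate
# replication servers via a DNS TXT record, then pick one to mount.
# The record layout (comma-separated host list) is an assumed format.
import dns.resolver  # third-party package "dnspython"

def lookup_replicas(reference_name: str) -> list[str]:
    """Return the file servers listed in the TXT record for reference_name."""
    servers = []
    for rdata in dns.resolver.resolve(reference_name, "TXT"):
        # A TXT record may be split into several character strings; join them.
        text = b"".join(rdata.strings).decode()
        servers.extend(host.strip() for host in text.split(",") if host.strip())
    return servers

if __name__ == "__main__":
    # Hypothetical reference name, accessed by a user as /nfs/data.example.org
    for host in lookup_replicas("data.example.org"):
        print("candidate replication server:", host)
```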

3.2. Sequential Consistency

To support mutable replication, we need mechanisms to distribute updates when a replica is modified and to control concurrent accesses when writes occur. To prevent mutable replication from affecting exclusive read or shared read performance, we adopt an extended primary copy scheme with operation forwarding to coordinate concurrent writes. Compared to the traditional single primary copy scheme, our design has the following advantages. First, the overhead to support mutable replication is incurred only when writes occur. When there are no writes, the system behaves as a read-only replication system, i.e., a client accesses data from any nearby replication server. Second, a primary server is selected at the granularity of a single file, which allows fine-grained load balancing. Third, in our scheme, a primary server is dynamically chosen when a file is opened for writing. In most cases (exclusive writes), a client's write requests are served by a nearby primary server. This solution is well suited for wide-area collaborations, in which a replica is often created dynamically and it is hard to determine an optimal primary server for a file in advance. Fourth, it provides higher availability because a client can usually choose any working server to read or write a file.


Below we first describe the mechanisms we use to support file updates in Section 3.2.1. Section 3.2.2 then presents the failure recovery mechanism of the protocol in case of server crashes and network partitions. For directory updates, a similar approach is used, with several performance improvements presented in Section 3.2.3.

3.2.1. File Updates

When a client opens a file for writing, the chosen server temporarily becomes the primary for that file. All other replication servers are instructed to forward client write requests for that file to the primary server. When the file is closed, the primary server withdraws from its leading role by notifying other replication servers to stop forwarding writes. In the following discussion, we refer to the first procedure as disabling replication and the latter as re-enabling replication, although by default, read requests received on other replication servers are still processed locally. While a file is open for writing, the primary server is responsible for distributing updates to other replicas. We consider two strategies for distributing updates. The first is distributing updates when the modified file is closed. The second strategy distributes updated data to other replicas as they arrive. Although naive, the update-on-close strategy does avoid multiple updates should some or all of the file be written several times. However, if a client writes a massive amount of data to a file and then closes it, the close operation takes a long time to return. Furthermore, we run the risk of losing all client updates if the primary server fails before distributing the new data, which invalidates any assurance of durability to the client for individual write operations. Distributing updated data to other replication servers every time the primary server receives a write request eliminates the update propagation delay for a close request. It also facilitates recovery from primary server failure: a client receives a positive acknowledgment for every successful write, so if the primary server fails, the client can connect to a new server (using standard NFSv4 client recovery mechanisms) and reissue at most one unacknowledged write request. However, unlike distributing updates at the time the file is closed, this strategy adds to network traffic if a client overwrites file blocks. We prefer the latter scheme. We hesitate to impose a sweeping change to system call behavior, and we are willing to expend some network resources to reduce latency. Yet, by making client-to-server writes synchronous with updates to other replication servers, it appears that we are increasing client write latency, not reducing it. The paradox is resolved by observing that NFSv4 writes usually pass through an I/O daemon that delays writes for some seconds [21].

This relaxes the dependency of application performance on primary server latency. The I/O daemon's delayed-write policy also increases the likelihood that the updates will be long-lived [4]. The primary server distributes updates to other replication servers in parallel. Updates must be delivered in order, either by including a serial number with the update or through a reliable transport protocol such as TCP. In addition to the data payload, each update message from the primary server to other replicas also includes metadata related to the update, such as modification time. Each replication server modifies its copy of the file metadata appropriately after updating the file data. This guarantees that the metadata of the file is consistent among replicas, which, as we show in Section 3.2.2, makes it easy to determine the most recent file copies during failure recovery. Initially, a replication server is unsure whether a received update is valid, e.g., the primary server might send out an update request and then immediately crash, so it does not apply the update at once. Rather, the request is cached until the next update or a replication re-enabling message is received from the primary. Two or more servers may try to become the primary for a file at the same time. When these servers are in the same partition, contention is always apparent to the conflicting servers. We resolve the conflict by having the conflicting servers cooperate: the server that has already disabled more replicas is allowed to continue; the server that has so far disabled fewer replicas hands its collection of disabled replicas to the first server; in case of a tie, the server with the larger IP address is allowed to proceed. If the conflicting servers are in different partitions, at most one can collect acknowledgments from a majority of the replication servers. For some kinds of failure, e.g., multiple network partitions, it is possible that no primary server can be elected. We discuss this case further in the next subsection.
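As a minimal illustration of the contention rule just described, the sketch below encodes the ordering used to decide which would-be primary proceeds: the server that has already disabled more replicas wins, and a tie goes to the numerically larger IP address. The Contender type and its field names are our own, purely for exposition.

```python
# Sketch of the primary-contention rule: more disabled replicas wins,
# ties go to the numerically larger IP address. Names are illustrative.
import ipaddress
from dataclasses import dataclass

@dataclass
class Contender:
    ip: str                 # address of the server trying to become primary
    disabled: set[str]      # replicas it has already disabled for this file

def winner(a: Contender, b: Contender) -> Contender:
    """Return the contender allowed to continue as primary for the file."""
    if len(a.disabled) != len(b.disabled):
        return a if len(a.disabled) > len(b.disabled) else b
    # Tie: the server with the larger IP address proceeds.
    return a if ipaddress.ip_address(a.ip) > ipaddress.ip_address(b.ip) else b

# Example: s1 has disabled two replicas and s2 only one, so s1 continues and
# s2 hands its collection of disabled replicas over to s1.
s1 = Contender("10.0.0.7", {"r2", "r3"})
s2 = Contender("10.0.0.9", {"r4"})
assert winner(s1, s2) is s1
```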

3.2.2. Failure Recovery

Our primary copy scheme guarantees consistent access when all replicas are in working order. However, failure complicates matters. Different kinds of failure may occur, including client failure, replication server crash failure, network partition, and combinations of these cases. Here, we briefly describe the failure detection and recovery mechanisms for each case. A detailed description and proof of correctness is presented elsewhere [38]. Following the NFSv4 specification, a file opened for writing is associated with a lease on the primary server, subject to renewal by the client. In the event of a client failure, the server receives no further renewal requests, so the lease expires. Once the primary decides that the client has failed, it closes any files left open by the failed client on its behalf. If the client was the only writer, then the primary re-enables replication for the file at this time. Unsurprisingly, the file content reflects all writes acknowledged by the primary server prior to the failure.

To support sequential consistency, the system maintains an active group view among replicas and allows updates only in the active group. We require that an active group contain a majority of the replicas to ensure its uniqueness. During file modifications, the primary server removes from its active group view any replicas that fail to acknowledge replication disabling requests or update requests. The primary server updates its local copy and acknowledges a client write request only after it has received update acknowledgments from a majority of replicas. If the active view shrinks to less than a majority, the primary server "fails" the client request. The primary server sends its active view to the other replication servers when it re-enables replication. A server not in the active view may have stale data, so the re-enabled servers must refuse any later replication disabling or update requests that come from a server not in the active group. A failed replication server can rejoin the active group only after it synchronizes with the up-to-date copy. If a primary server crashes or is separated into a minority partition, a replication server in the majority partition detects the failure when a forwarded request times out. In that case, the replication server starts a failure recovery procedure to become the replacement primary. Basically, the replication server asks other active replicas for permission to become the new primary server. If this succeeds, the replacement synchronizes all active replicas with the most up-to-date copy found in the majority partition, and distributes a new active group view. It then re-enables replication on the active servers. With these mechanisms, our system can guarantee sequential consistency and continuously serve client requests as long as a majority of replicas are in working order and can communicate. If there are multiple partitions and no partition includes a majority of the replication servers, read requests can continue to be satisfied, but no write requests can be served until the partition heals. We assume this happens rarely.
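The active-view rule above can be summarized in a few lines. The following schematic model (not the kernel implementation) shows the primary distributing an update, dropping unresponsive replicas from its active view, and acknowledging the client write only if a majority of the full replica set, counting its own copy, has acknowledged the update; send_update stands in for the real NFSv4 message exchange and is assumed to raise TimeoutError on failure.

```python
# Schematic model of the majority-acknowledgment rule for client writes.
# send_update() is a stand-in for the real NFSv4 update RPC; it is assumed
# to raise TimeoutError when a replica does not respond in time.

class PrimaryCopy:
    def __init__(self, self_id, peers):
        self.self_id = self_id
        self.peers = set(peers)                  # the other replication servers
        self.active_view = {self_id, *peers}     # replicas believed up to date

    def _majority(self) -> int:
        return (len(self.peers) + 1) // 2 + 1    # majority of the full replica set

    def handle_client_write(self, update, send_update) -> bool:
        """Distribute an update; return True iff the write may be acknowledged."""
        acks = 1                                  # the primary's own copy counts
        for peer in sorted(self.peers & self.active_view):
            try:
                send_update(peer, update)
                acks += 1
            except TimeoutError:
                self.active_view.discard(peer)    # drop unresponsive replicas
        if acks >= self._majority():
            return True                           # apply locally, acknowledge client
        return False                              # view below a majority: fail request
```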

To tolerate various edge scenarios, e.g., the case in which all servers crash while the primary server is processing a write request, it appears that a replication server should record in stable storage all information related to an update, such as the current primary server, the cached update, the serial number, and the failed replicas. This strategy, however, would reduce write performance. Instead, we rely on the system administrator's involvement in the case that more than a majority of replication servers fail. Because our protocol guarantees sequential consistency, the administrator can simply use a synchronization tool (i.e., rsync) that compares the data states among replication servers and selects the most recent copy if inconsistent copies are detected. After the synchronization is complete, the administrator can bring all replication servers back to the normal state. Thus, a replication server needs to record in stable storage only minimal information, i.e., the failed replication servers it currently knows about and the primary server it currently admits for a file. El-Abbadi et al. have studied the failure recovery problem of the read-one-write-all replication scheme in distributed database systems. They point out that replicas cannot independently or asynchronously update their can-communicate views because network connections may be intransitive or replicas may detect connection changes at different times. Taking a further step, they present a series of properties and rules that they show are sufficient conditions for a replication protocol to guarantee one-copy serializability. We have extended this theory to distributed file systems and have proved that our failure recovery protocol guarantees sequential consistency in the face of node crashes or network partitions. The basic rationale lies in having a single server (the primary server or the replacement server) decide the view of the majority partition. This view is then distributed to and sustained by all the members contained in it. Thus, a unique and consistent majority view is guaranteed. For more details, readers can refer to our technical report [38].

3.2.3. Directory Updates

Directory modifications include creation, deletion, and modification of entries in a directory. Unlike file writes, a directory modification may involve more than one object. We require replication for all involved objects to be disabled before processing a directory update. These disabling requests are grouped and processed together, so that no deadlock can occur. Furthermore, little time elapses between the start and finish of a directory update, which reduces the likelihood of concurrent access to a directory while it is being updated. So instead of redirecting access requests to a replicated directory while an update is in progress, replication servers block access requests to that directory until the primary server re-enables replication. Like directory modifications, attribute updates proceed quickly, so we handle them the same way. When disabling directory replication, the primary server sends the replication disabling request and the update request together in one compound RPC. A replication server receiving this compound RPC caches the update, begins to block local updates, and acknowledges the request. After receiving replies from a majority of replication servers, the primary server acknowledges the client request. Simultaneously, the primary server could send a commit message, notifying other replication servers to apply the update. However, to reduce network traffic, we delay this notification until the primary server re-enables replication.

That is, a replication server applies the cached update when it receives the re-enabling replication request from the primary server. One issue introduced by this optimization is the possibility for a replication server to receive "invalid" replication disabling requests. Consider a simple example in which a client first creates a file c in directory /a/b/, then opens it for writing. As described above, along with the create request, the connected server sends replication disabling requests for directory /a/b/, combined with the update request to create entry c, to the other replication servers. It acknowledges the client after receiving replies from a majority of replication servers. Because the client might send the write open request for file c immediately after receiving this acknowledgment, it is possible that the connected server sends a replication disabling request for file /a/b/c before it re-enables replication for directory /a/b/ on the other replication servers. As a result, the replication disabling request for file c would be rejected by the other replication servers, since they have not yet applied the cached create request. Although the problem could be solved by having the primary server simply keep re-sending the second request, that would induce a great deal of redundant network traffic, especially when the system contains slow (distant) replication servers. Instead, we take another approach. In our implementation, before sending a replication disabling request, the primary server first checks whether any parent directory of the to-be-modified entry is still disabled; if so, it waits until that directory has been re-enabled. Consequently, the performance of directory updates is normally determined by the RTT between the primary server and the majority of replication servers; however, when a client issues a burst of directory updates, performance might be slowed by a distant replication server, because the primary server re-enables replication for a directory only when it receives acknowledgments for the previous replication disabling requests from all other replication servers, or upon a timeout if a failure happens. Other solutions to this kind of problem exist. For example, the primary server could pre-send a commit request for a directory update. In future work, we want to compare these solutions in terms of their performance and induced network traffic under real application workloads.
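A minimal sketch of the ordering check described above: before sending a replication disabling request for an entry, the primary waits until no ancestor directory of that entry is still disabled, i.e., until replication for the parent has been re-enabled and the cached create committed. The condition-variable wait is our own simplification of whatever mechanism the kernel implementation uses.

```python
# Sketch of the check that avoids "invalid" replication-disabling requests:
# a child (e.g., /a/b/c) is not disabled while an ancestor (/a/b) is still
# disabled on the other replicas. The Condition-based wait is illustrative.
import threading
from pathlib import PurePosixPath

class DisableTracker:
    def __init__(self):
        self._disabled: set[str] = set()      # paths currently replication-disabled
        self._cond = threading.Condition()

    def disable(self, path: str) -> None:
        """Block until no ancestor of `path` is disabled, then disable `path`."""
        ancestors = {str(p) for p in PurePosixPath(path).parents}
        with self._cond:
            self._cond.wait_for(lambda: not (ancestors & self._disabled))
            self._disabled.add(path)          # now safe to send the disabling request

    def reenable(self, path: str) -> None:
        """Called when the primary re-enables replication for `path`."""
        with self._cond:
            self._disabled.discard(path)
            self._cond.notify_all()
```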

3.3. Access Synchronization

The protocol discussed so far efficiently provides sequential consistency guarantees. However, applications may sometimes require stronger guarantees, i.e., synchronized access. To meet such needs without imposing overhead on data access that requires only sequential consistency, our system provides a synchronization guarantee as an option that applications can request through POSIX synchronization flags. The rest of this subsection presents this scheme in detail.

3.3.1. POSIX Synchronization Flags

POSIX provides three synchronization flags, O_SYNC, O_DSYNC, and O_RSYNC, in the open system call interface [1]. Applications can use these flags to specify different synchronization behaviors for subsequent reads and writes of the open file:
• O_DSYNC: If set, write I/O operations on the file complete as defined by synchronized I/O data integrity completion, i.e., a write is complete only when the written data has been successfully transferred, along with all file system information required to retrieve that data.
• O_SYNC: If set, write I/O operations on the file complete as defined by synchronized I/O file integrity completion, which is identical to synchronized I/O data integrity completion with the addition that all file attributes relative to the I/O operation, e.g., modification time, are successfully transferred before returning to the calling process.
• O_RSYNC: If set, read I/O operations on the file complete at the same level of integrity as specified by the O_DSYNC or O_SYNC flag, i.e., a read is complete only when an image of the data has been successfully transferred to the requesting process; if any writes affecting the data to be read are pending when the synchronized read is requested, those writes are successfully transferred before the data is read.

In a local file system, support for these synchronization requirements usually adopts the same solution, so some operating systems, notably Linux, treat the three synchronization flags identically. However, in a distributed environment, it is beneficial to distinguish these different synchronization requirements, as the cost to support them can be considerably different (see footnote 1). In NFSv4, a client's WRITE operation request includes a special flag that declares whether the written data is to be synchronized. An NFSv4 client sets this flag on the user's behalf if the O_SYNC flag of the file is set or the user issues an fsync system call. However, the synchronization requirement specified with the O_RSYNC flag is not addressed.

We notice that the synchronization guarantees required with both the O_SYNC and O_RSYNC flags set correspond to the synchronized read requirement. Taking these flags as the hint that the application demands synchronized access, we refine our replication protocol as follows. When the primary server receives a synchronous write request from a client, it must ensure that replication for the file has been disabled on every other replication server before returning a reply to the client. By default, a replication server forwards only write requests while its replication is disabled. However, if during this period a client opens the file with both the O_SYNC and O_RSYNC flags set, the replication server forwards that client's read requests to the primary server as well. In most cases, the update distribution procedure works the same way as in the sequential consistency model: the primary server acknowledges a client's write request after it gets acknowledgments from a majority of replication servers; if a failure is detected during update distribution, the primary server can still process the client's read and write requests as long as it is in the majority partition. However, if the primary server is separated into a minority partition, it is not guaranteed that the distributed update reaches the majority partition. If a client opens the file with both O_SYNC and O_RSYNC flags set, the primary server must refuse its subsequent read requests for the file to guarantee that no stale data is served. In the majority partition, the failure recovery mechanism described in Section 3.2.2 can be used to recover the fresh copy of the data; after that, read requests can be served in the majority partition. With the described mechanism, slight overhead is incurred to guarantee synchronized access when applications demand it; a longer delay is charged to forwarded operations if concurrent writes occur; and if a file is not under modification, any read request for the file, even one with a synchronization requirement, is processed by a nearby server. It is easy to see that this approach provides the synchronization guarantee at the cost of sacrificing system availability for synchronous write operations, i.e., if a failure happens, an application cannot synchronously write a file. Several methods can be used to bypass this restriction if it is critical to guarantee system availability for write operations as well as synchronized access. For example, we can use periodic heartbeat messages to detect partition failures, and require a replication server to reject any client requests if it fails to receive replies from a majority of replication servers. Consequently, the system can continue to process synchronous write operations after a heartbeat period, as long as a majority of replication servers are active. However, we believe that the current solution is superior in most scenarios because it adds no overhead or network traffic to normal operations.

Footnote 1: The use of compound RPC allows our replication system to update file attributes at the same time as it updates file data. So unless otherwise mentioned, the following discussion does not distinguish the O_SYNC and O_DSYNC flags.



There are other possible ways to decide consistency guarantees for a file. For example, we could implement the choice as an extended attribute associated with the file. The proposed approach is favored because it allows applications to control file sharing behavior more flexibly. Consider the example of an edit-and-run procedure: a program is edited on one client, and then a number of clients are instructed to execute it. Because the execution instruction can be issued immediately after the program is edited, access to the file must be coordinated. In our replication system, correct synchronization behavior is guaranteed if the editor application (writer) issues an fsync system call after completing the edit, and the executing application (reader) opens the file with both the O_SYNC and O_RSYNC flags set. On the other hand, another application, e.g., a snapshot tool, can choose to open the file without setting any synchronization flags, as sequential consistency is sufficient to guarantee its correctness. In the extreme case, if a file is always opened with the O_SYNC and O_RSYNC flags set, strict consistency is provided. However, using open synchronization flags also introduces two issues. First, existing programs might not use these flags to specify synchronization requirements, so modifications to a program's open calls are required to ensure synchronized access. We believe such modifications can be performed easily enough that they do not hinder adoption of the system. Second, the NFSv4 protocol does not provide a mechanism for a client to send open synchronization flags to its server. Our current implementation conveys this information using the extra bits of the share_access flag in the OPEN operation request, which requires an extension to the published NFSv4 protocol. In Table 1, we summarize the invoking methods, normal operations, and behavior during failures for the sequential consistency model and the consistency model that supports synchronized access.
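The edit-and-run example above can be expressed directly with the POSIX interface. The sketch below shows the writer flushing its changes with fsync and the reader opening the file with O_SYNC and O_RSYNC so that, under the scheme above, its reads are forwarded to the primary while the file is being modified. The file name is hypothetical, and O_RSYNC is only available on platforms that expose it.

```python
# Writer/reader sides of the edit-and-run scenario, using POSIX flags.
# "program.py" is a hypothetical shared file; O_RSYNC is platform-dependent.
import os

def writer(path: str, new_source: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, new_source)
        # fsync makes the write synchronous; per the scheme above, the primary
        # must disable replication on every replica before acknowledging it.
        os.fsync(fd)
    finally:
        os.close(fd)

def reader(path: str) -> bytes:
    # Opening with O_SYNC | O_RSYNC requests synchronized reads, so a
    # replication server forwards these reads to the primary while the file
    # is being modified; otherwise a nearby server answers them.
    flags = os.O_RDONLY | os.O_SYNC | getattr(os, "O_RSYNC", 0)
    fd = os.open(path, flags)
    try:
        return os.read(fd, os.fstat(fd).st_size)
    finally:
        os.close(fd)

if __name__ == "__main__":
    writer("program.py", b"print('hello, rNFS')\n")
    print(reader("program.py").decode())
```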

4. Evaluation

We conducted a series of experiments to explore the performance of rNFS under different network conditions and application environments. In Section 4.1, we use a modified Andrew benchmark (MAB) with replication servers running in a local area network and in (simulated) high-latency, wide area networks. In Section 4.2, we examine the recovery time for primary server failure. We measured all the experiments presented in this paper with a prototype implemented in the Linux 2.6.12 kernel. Servers and clients all run on dual 2.8 GHz Intel Pentium 4 processors with 1024 KB L2 cache, 1 GB of memory, and dual onboard Intel 82547GI Gigabit Ethernet cards. We use TCP as the transport protocol. The number of bytes NFS uses for reading (rsize) and writing (wsize) files is set to 32768 bytes. All numbers presented are mean values from three trials of each experiment; standard deviations (not shown) were within five percent of the mean values.

Table 1: Comparison of schemes to guarantee sequential consistency and to guarantee synchronized access.
Sequential Consistency. Invoking method: default. Normal operation: the primary server ensures replication is disabled on a majority of replication servers; replication servers forward write requests while replication is disabled. During failure: reads are allowed everywhere; writes are allowed in the majority partition.
Synchronized Access. Invoking method: the writer synchronously writes the file; the reader opens the file with O_SYNC & O_RSYNC. Normal operation: the primary server ensures replication is disabled on all replication servers; replication servers forward read and write requests while replication is disabled. During failure: reads are allowed in the majority partition; clients cannot synchronously write a file.


4.1. Benchmark Evaluation

This subsection presents the experimental results of running a modified Andrew benchmark over rNFS. The Andrew benchmark [16] is a widely used file system benchmark that models a mix of file operations. It measures five stages in the generation of a software tree. Stage (I) creates the directory tree, (II) copies source code into the tree, (III) scans all the files in the tree, (IV) reads all of the files, and finally (V) compiles the source code into a number of libraries. The modified Andrew benchmark used in our experiments differs from the original Andrew benchmark in two respects. First, in the last stage, it compiles different source code than that included in the Andrew benchmark package. Second, the Andrew benchmark writes a log file in the generated directory; if writes are slow compared to reads, the cost of updating the log file dominates the overall cost of a stage that mostly reads, hindering analysis. Therefore, we use a local disk to hold the log file. Our first experiment looks at replication in a LAN environment, such as a cluster. Figure 1 depicts the performance of the modified Andrew benchmark as the number of replicas increases. The measured RTT between any two machines is around 200 µsec. Figure 1 shows that in a LAN, the penalty for replication is small. Replication induces no performance overhead in Stages (III) and (IV), as these two phases consist of read operations only. Stage (V) is compute-intensive, so the performance difference between a single server and replicated servers is negligible. Most of the performance penalty for replication comes in Stages (I) and (II), which consist of file and directory modifications. However, with a fast network, the aggregate penalty is still only a few percent. Furthermore, because a primary server distributes updates to other replication servers in parallel, performance is not adversely affected as the number of replication servers increases.

[Figure 1: MAB in LAN replication. Andrew benchmark time (in seconds) for the mkdir, copy, scandir, readall, and make stages, comparing local access, a single server, and two to five replication servers.]

The next experiment, depicted in Figure 2, compares the cost of replicating to a distant server with the cost of accessing a distant server directly. We ran the modified Andrew benchmark with an increasingly distant file server (the upper line in Figure 2), and again with a local replication server and an increasingly distant replication server (the middle of the three lines in Figure 2). The RTT marked on the x-axis shows the round-trip time between the primary server and the remote replication server for the replication experiments, and between the client and the remote file server for the remote access experiments. In Figure 2, the smallest RTT measured is 200 µsec, the network latency of our test bed LAN. For the other measurements, we use Netem [14], a Linux tool that simulates network delays. Each experiment first warmed the client's cache with two runs. Figure 2 shows that replication outperforms remote access in all five stages. In Stages (III) and (IV), the read-only stages, replication is as fast as local access, since no messages need to be sent to the other replication server in these stages. But replication also dominates remote access in the other three stages. To see why, we take a close look at the network traffic in the measured experiments, where we find that with replication, fewer messages are sent to the remote server, accounting for its advantage. We model this as follows. The running time in each stage can be estimated as

T = Tbasic + RTT × NumRPC    (1)

where Tbasic denotes the computation time at the client and the request processing time at the connected server. For replication, RTT represents the round-trip time between the primary server and the replication server, and NumRPC represents the number of RPC messages sent from the primary to the replication server in the corresponding stage. For remote access, RTT represents the round-trip time between the client and the remote server, and NumRPC is the total number of RPC requests sent from the client to the server. The Tbasic cost is about the same for replication and remote access, so any difference in performance must be accounted for by the second part of the formula. For example, Stage (I) creates 20 directories at a cost of 85 RPC requests sent from the client to the connected server (20 create, 13 access, 17 getattr, and 35 lookup). The reported access, getattr, and lookup requests are unavoidable even for a warm cache run because they request information on newly created directories. However, with replication, access, getattr, and lookup requests are served locally at the primary server, eliminating their cost altogether at the scale of this experiment. Furthermore, although each create costs two RPC messages, the primary server replies to the client after receiving the response to only the first of the two. Consequently, the number of latency-inducing remote RPC messages in the replication experiment decreases from 85 to 20. Table 2 shows summary RPC counts for the other stages.

Table 2: Number of remote RPCs of MAB in Replication and Remote Access.
System Model     Mkdir   Copy   Scandir   Readall   Make   Total
Replication         20    228         0         0     71     309
Remote Access       85    735       154       510    589    2073

[Figure 2: MAB in WAN replication. Modified Andrew benchmark time (in seconds) versus RTT (in milliseconds) for remote access, remote replication, and a configuration with three replication servers.]

One important feature of rNFS is that a primary server can reply to a client request as soon as it gets acknowledgments from half of the other replication servers. Given a set of replication servers, then, the performance of rNFS is dictated by the largest RTT between the primary server and the nearest half of the replication servers, which we call the majority diameter. To illustrate how this feature can be used to advantage, we added a third replication server halfway (in terms of RTT) between the other two and re-ran the modified Andrew benchmark. The result is the lowest of the three lines in Figure 2. Placing the third replication server midway between the local and remote replication servers cuts the majority diameter in half, and for the Andrew benchmark this cuts the overall run time, which is dominated by the cost of remote RPCs, nearly in half. The results imply that if most writes to a replicated file come from one site, the performance overhead for remote replication can be made scant by putting a majority of replication servers near that site. Furthermore, if a site is using local replication, then the penalty for adding a distant replication server, say, for off-site backup, is negligible. Figure 3 compares the running time of the modified Andrew benchmark with a fixed majority diameter and a varying number of replication servers. The servers used in this experiment are described in Table 3.

Table 3: Servers used in Figure 3 experiments.
Replication servers        RTT to the primary server
P (the primary server)     -
R2, R3                     20 msec
R4                         40 msec
R5, R6, R7                 60 msec

[Figure 3: MAB with different replication server sets. Modified Andrew benchmark time (in seconds) for the mkdir, copy, scandir, readall, and make stages with server sets (P, R4), (P, R4, R5), (P, R2, R4, R5), (P, R2, R4, R5, R6), (P, R2, R3, R4, R5, R6), and (P, R2, R3, R4, R5, R6, R7).]

The experimental results show that with the majority diameter fixed (at 40 msec), increasing the number of replication servers has a negligible effect on system performance, which is key to good scaling. To summarize, the evaluation data presented in this subsection illustrate two main points. First, network RTT is the dominant factor in rNFS WAN performance. By locating a replication server close to the client, rNFS can mask RTT-induced latency. Second, rNFS scales well in this workload. Application performance is unaffected by adding replication servers while keeping the majority diameter fixed. As a gedanken experiment, we might imagine the practical limits to scalability as the number of replication servers grows. A primary server takes on an output bandwidth obligation that multiplies its input bandwidth by the number of replication servers. For the near term, unicast communication and the cost of bandwidth seem to be the first barriers to massive replication.
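As a quick plausibility check of formula (1), the sketch below combines the per-stage remote RPC counts from Table 2 with a few RTT values to compute the latency term RTT × NumRPC for replication and for remote access. Tbasic is deliberately omitted, so these numbers are only the network component of the predicted stage times, not reproductions of the measured results.

```python
# Network component of formula (1), T = Tbasic + RTT * NumRPC, using the
# remote RPC counts from Table 2. Tbasic is omitted, so these are only the
# latency terms, not predictions of the full measured run times.
RPC_COUNTS = {              # stage: (replication, remote access)
    "mkdir":   (20,   85),
    "copy":    (228, 735),
    "scandir": (0,   154),
    "readall": (0,   510),
    "make":    (71,  589),
}

def latency_seconds(rtt_ms: float) -> dict[str, tuple[float, float]]:
    """RTT * NumRPC per stage, in seconds, as (replication, remote access)."""
    return {stage: (rtt_ms * rep / 1000.0, rtt_ms * remote / 1000.0)
            for stage, (rep, remote) in RPC_COUNTS.items()}

if __name__ == "__main__":
    for rtt in (40, 120, 200):           # milliseconds, as on Figure 2's x-axis
        terms = latency_seconds(rtt)
        total_rep = sum(r for r, _ in terms.values())
        total_rem = sum(a for _, a in terms.values())
        print(f"RTT {rtt} ms: replication {total_rep:.1f} s, "
              f"remote access {total_rem:.1f} s")
```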

4.2. Failure Evaluation

In this subsection, we evaluate system recovery time for primary server failures. As described in Section 3.2.2, a replication server can detect primary server failure when a forwarded client request times out. In that case, the replication server starts the failure recovery procedure. We denote this replication server as the replacement server to distinguish it from the other replicas. During recovery, the replacement server first asks the other active replicas for permission to become the new primary server. Upon receiving a switching-primary request, each replication server first checks the state of the original primary server to avoid switching the primary blindly under unusual network conditions. For example, network connections may not be transitive, so a pair of servers might not agree about the status of a third even though the two are able to communicate with one another. In our implementation, we check whether the primary server is truly down by sending a NULL RPC request. If the request times out, the replication server marks the original primary server as a failed replica and acknowledges the replacement server's request. The reply also includes the modification time of the replication server's copy of the file. The replacement server becomes the new primary if it receives acknowledgments from a majority of the replication servers. The replacement server then determines which replicas have stale data by comparing the received modification times and synchronizes those servers with the freshest file copy, which it may first have to retrieve for itself. Following that, the replacement server constructs a new active view, distributes it to the other replication servers in the view, and re-enables their replication.

From the above analysis, we can estimate the overall failure recovery time with the following formula:

T = 2 × timeout + sync    (2)

Here, timeout represents the waiting period for detecting primary server failure, which is first encountered by the replacement server and then incurred again by the other replication servers for failure verification. In our experiments, we set it to one minute. sync represents the time for synchronizing all active replicas, and depends on network topology, especially RTT. If the replication servers happen to be synchronized, sync is 0. However, if some nodes need to synchronize, sync is the time required to bring their data up to date. In our prototype, we use rsync [35], an open source utility available on most UNIX systems, to perform synchronization. To quantify failure recovery time for both cases, we conducted two series of experiments with three replication servers. In the first series of experiments, we locate the failed primary halfway between the replacement server and the third replication server. Empirically, we find that the replacement server and the third replication server are almost always synchronized, so these experiments minimize failure recovery time. In the second series of experiments, we co-locate the failed primary and the replacement server but locate the third replication server remotely. These results illustrate maximal failure recovery time. We conducted the experiments by disabling the primary server while a client is writing a 100 MB file, followed by a client write request for that file issued to the replacement server, inducing the failure recovery procedure. The results we report are measured from the time that the replacement server receives the client request to the point that it sends back a reply. Figure 4 shows the measured failure recovery time as the RTT between the replacement server and the third replication server increases, as well as the predicted minimum and maximum failure recovery times calculated with formula (2). For the predicted maximum failure recovery time, we estimate sync with the time measured when running rsync on the replication server to retrieve the file from the replacement server. These results are also presented in Figure 4. Figure 4 shows that the measured results are consistent with the predicted values. Failure recovery time ranges from two minutes to four minutes as the majority diameter increases to 200 milliseconds. Synchronization is slower than we expected, but this appears to be an artifact of our test bed: the WAN simulator we used in our experiments to add network delays also extends the TCP slow start stage, which is much longer than the measured synchronization time. As a result, synchronization appears to be taking place over a slow network. We expect this phase to take less time for nodes that synchronize over a high-speed network.
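Formula (2) can be evaluated directly. With the one-minute detection timeout used in our experiments, recovery costs two minutes of timeouts plus whatever synchronization takes; the sketch below simply plugs in a few assumed sync durations and does not model the rsync transfer itself.

```python
# Evaluate formula (2), T = 2 * timeout + sync, with the one-minute timeout
# used in the experiments and a few assumed synchronization durations.
TIMEOUT_S = 60.0            # failure-detection timeout from the experiments

def recovery_time(sync_s: float, timeout_s: float = TIMEOUT_S) -> float:
    return 2 * timeout_s + sync_s

if __name__ == "__main__":
    for sync_s in (0.0, 30.0, 120.0):   # replicas in sync, modest lag, large lag
        print(f"sync {sync_s:5.1f} s -> recovery {recovery_time(sync_s):5.1f} s")
```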

[Figure 4: Primary server failure recovery time. Measured and predicted minimum and maximum recovery times and synchronization time (in seconds) as a function of diameter RTT (in milliseconds).]

Here, timeout represents the waiting period for detecting primary server failure, which is first encountered by the replacement server and then repeated by the other replication servers for failure verification. In our experiments, we set it to one minute. sync represents the time for synchronizing all active replicas, and depends on the network topology, especially RTT. If the replication servers happen to be synchronized, sync is 0. However, if some nodes need to synchronize, sync is the time required to bring their data up to date. In our prototype, we use rsync [35], an open source utility available on most UNIX systems, to perform synchronization.

To quantify failure recovery time for both cases, we conducted two series of experiments with three replication servers. In the first series, we locate the failed primary halfway between the replacement server and the third replication server. Empirically, we find that the replacement server and the third replication server are almost always synchronized, so these experiments minimize failure recovery time. In the second series, we co-locate the failed primary and the replacement server but locate the third replication server remotely. These results illustrate maximal failure recovery time. We conducted the experiments by disabling the primary server while a client is writing a 100 MB file, followed by a client write request for that file issued to the replacement server, inducing the failure recovery procedure. The results we report are measured from the time that the replacement server receives the client request to the point at which it sends back a reply.

Figure 4 shows the measured failure recovery time as the RTT between the replacement server and the third replication server increases, as well as the predicted minimum and maximum failure recovery times calculated with formula (2). For the predicted maximum failure recovery time, we estimate sync with the time measured when running rsync on the replication server to retrieve the file from the replacement server. These results are also presented in Figure 4.

Figure 4: Primary server failure recovery time versus diameter RTT (in milliseconds), showing predicted and measured minimum and maximum recovery times and the measured synchronization time.

Figure 4 shows that the measured results are consistent with the predicted values. Failure recovery time ranges from two minutes to four minutes as the majority diameter increases to 200 milliseconds. Synchronization is slower than we expected, but this appears to be an artifact of our test bed: the WAN simulator we used to add network delays also extends the TCP slow start stage, which is much longer than the measured synchronization time. As a result, synchronization appears to be taking place over a slow network. We expect this phase to take less time for nodes that synchronize over a high-speed network. Recovery time also suffers from the lengthy RPC failure timeout. This value can be tuned, but if the timeout is too short, adverse network conditions might be misinterpreted as primary server failure.
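As a rough plausibility check on these numbers, and not a restatement of formula (2) given earlier in the paper, recovery time can be approximated as two sequential timeout periods (failure detection at the replacement server, then verification at its peers) plus the synchronization time and a few wide-area round trips. The constants and the estimated_recovery_time helper below are illustrative assumptions only.

# Back-of-the-envelope estimate of failure recovery time; an approximation
# consistent with the description above, not the paper's formula (2).

TIMEOUT = 60.0   # seconds: the detection timeout used in our experiments


def estimated_recovery_time(sync: float, rtt: float = 0.0) -> float:
    # One timeout at the replacement server to detect the failure, roughly one
    # more at the peers to verify it, plus data synchronization and a handful
    # of round trips for the switch-primary and new-view messages (assumed: 4).
    return 2 * TIMEOUT + sync + 4 * rtt


print(estimated_recovery_time(sync=0.0))            # ~120 s when replicas are already in sync
print(estimated_recovery_time(sync=90.0, rtt=0.2))  # ~211 s with a stale copy over a 200 ms WAN

Both estimates fall within the two-to-four-minute range observed in Figure 4.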

5. Related Work

During the design of rNFS, we studied many replication systems, concurrency control algorithms, failure recovery schemes, and recent work on Grid computing. Space limitations prevent a detailed enumeration in this paper, so we discuss only the replicated file systems and the recent Grid computing work that are directly related to our study.

Echo [7] and Harp [22] are file systems that use the primary copy scheme to support mutable replication. In these systems, replication is used only to increase data availability; potential performance benefits from replication are not targeted. Both systems use a pre-determined primary server for a collection of disks, a potential bottleneck if those disks contain hot spots or if the primary server is distant. rNFS avoids this problem by allowing any server to be primary for any file, determined dynamically in response to client behavior. Unlike Echo and Harp, rNFS uses replication to improve performance as well as availability. A client can choose a nearby or lightly loaded replication server to access data and switch to a working replication server if the originally selected server fails.

Coda [33, 17] achieves its primary design goal of constant data availability through server replication and disconnected operation. When a client opens a file for the first time, it contacts all replicas to make sure it will access the latest copy and that all replicas are synchronized. On close, updates are propagated to all available replicas. In the presence of failure, Coda sacrifices consistency for availability. When a Coda client is not connected to any servers, users can still operate on files in their cache. The modified files are automatically transferred to a preferred server upon reconnection. This can lead to conflicting updates; in some cases, user involvement is needed to obtain the desired version of the data.

Recent years have seen much work on peer-to-peer (P2P) file systems, including OceanStore [28], Ivy [25], Pangaea [31], and Farsite [2]. These systems address design for untrusted, highly dynamic environments. Consequently, reliability and continuous data availability are usually critical goals, while performance or data consistency is often sacrificed. Compared to these systems, our system addresses the data access and storage needs of global scientific collaboration, which can employ more reliable hardware but has more stringent requirements on average I/O performance. This leads to different design strategies in our approach.

The Grid [12] is an emerging infrastructure that aims to connect globally distributed resources into a shared virtual computing and storage system. Driven by the needs of scientific collaborations, the sharing that the Grid is concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as required by a range of collaborative scientific problem-solving patterns. Various middleware systems have been developed to facilitate data access on the Grid. For example, Storage Resource Broker (SRB) [5] uses a metadata catalog service to allow location-transparent access to heterogeneous data sets. NeST [6], a user-level local storage system, provides best-effort storage space guarantees, mechanisms for resource and data discovery, user authentication, quality of service, and multiple transport protocol support, with the goal of bringing appliance technology to the Grid. The Chimera system [13] provides a virtual data catalog that application environments can use to describe a set of application programs and then track all the data files produced by executing those applications. Their work is motivated by the observation that much scientific data is not obtained from measurements but rather derived from other data by the application of computational procedures, which implies the need for a flexible data sharing and access system.

A common missing feature in these systems is semantic support for fine-grained data sharing. Furthermore, most of them provide extended features by defining their own APIs; to use them, an application has to be re-linked with their libraries. Consequently, scientific researchers are generally hesitant to install and use this Grid software. rNFS would be a complementary data access scheme to these systems in the sense that it can provide a unified name space, consistency, and the file system semantics necessary to support global applications.

6. Discussion and Future Work

By requiring a distributed update to reach only a majority of the replication servers before replying to a client write request, the default behavior of rNFS is transparent recovery from most omission failures and the ability to serve I/O requests continuously as long as a majority of the replication servers are in working order. On the other hand, no application is allowed to open a file for synchronous write if even a single replication server fails. Although these two design points seem to conflict, given the radical difference in availability they offer, they are based on the same rationale: the desire to offer applications a dependable data service, with increasing dependability guarantees under application control. The principle of least surprise argues for guaranteed durability of data written by a client and acknowledged by the server. Consequently, if a request fails, we elect to report the failure to the application immediately instead of masking it, which risks losing the results of a computation.

Usually, the latency of distributing an update to a majority of the replication servers is hidden by the client daemon's delayed-write behavior. However, large bursty writes can affect performance, either by exhausting the client daemon's buffers or (similarly) by inducing application delay when a file is closed, forcing the application to wait as the daemon flushes pending writes. That delay can be substantial if the majority of the replication servers are distant. The evaluation results in Figure 2 highlight the latter problem.

It is clear that wide-area replication does not hurt the performance of applications that write data at a moderate rate. The question then arises: do common scientific applications satisfy this requirement? Large-scale scientific applications usually write computational results at a relatively constant (i.e., non-bursty) rate [34], so we expect the answer is "yes". However, settling the matter requires more focused evaluations than the Andrew benchmark can provide. We are currently evaluating the performance of real scientific applications running on rNFS [39]. Building grid applications that combine rNFS with existing grid replica location services [29] is another direction for future study.
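To make the write-path tradeoff concrete, the following sketch shows one way a primary could fan out an update and acknowledge the client as soon as a majority of replication servers respond, while a synchronous-write open is refused if any replica is unreachable. It is a hypothetical illustration under simplifying assumptions (the send and is_up callables, the thread-based fan-out), not the rNFS implementation.

# Hypothetical sketch of the majority-acknowledgment write path discussed above.

from concurrent.futures import ThreadPoolExecutor, as_completed


def distribute_update(replicas, update, send) -> bool:
    """Acknowledge the client once a majority of replicas accept `update`.

    `send(replica, update)` is assumed to return True on success and False on
    an omission failure, without raising.
    """
    needed = len(replicas) // 2 + 1
    acked = failed = 0
    pool = ThreadPoolExecutor(max_workers=max(1, len(replicas)))
    futures = [pool.submit(send, r, update) for r in replicas]
    try:
        for f in as_completed(futures):
            if f.result():
                acked += 1
            else:
                failed += 1
            if acked >= needed:
                return True        # majority reached: safe to reply to the client
            if failed > len(replicas) - needed:
                return False       # a majority is unattainable: report the failure
    finally:
        pool.shutdown(wait=False)  # let remaining transfers finish in the background
    return False


def can_open_for_sync_write(replicas, is_up) -> bool:
    # Synchronous-write opens are refused if even one replication server is down.
    return all(is_up(r) for r in replicas)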

7. Conclusion

In this paper, we describe a file replication protocol designed to meet the needs of emerging global collaborations. The protocol supports mutable replication without adding overhead to normal reads. It guarantees sequential consistency by default, and provides strict consistency or close-to-open semantics if applications request it with POSIX synchronization flags, so that the overhead of enforcing strong consistency is incurred only when users demand it. Failure detection and recovery are driven by client accesses; no heartbeat messages or special group communication services are needed. We believe that with its flexible consistency guarantees, easy failure recovery, superior performance, implementation in the emerging standard for Internet-based distributed filing, and a clarified mechanism for adopting our extensions, our system provides a practical, reliable, efficient, and forward-looking way to access and share data in wide-area collaborations.
References

[1] UNIX man pages: open(2), second edition, 1997.

[2] A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. P. Wattenhofer. Farsite: federated, available, and reliable storage for an incompletely trusted environment. SIGOPS Oper. Syst. Rev., 36(SI):1–14, 2002.

[3] B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, C. Kesselman, S. Meder, V. Nefedova, D. Quesnel, S. Tuecke, and I. Foster. Secure, efficient data transport and replica management for high-performance data-intensive computing. In Proceedings of the Eighteenth IEEE Symposium on Mass Storage Systems and Technologies, page 13, Washington, DC, USA, 2001. IEEE Computer Society.

[4] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout. Measurements of a distributed file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 198–212. ACM SIGOPS, 1991.

[5] C. Baru, R. Moore, A. Rajasekar, and M. Wan. The SDSC Storage Resource Broker, 1998.

[6] J. Bent, V. Venkataramani, N. LeRoy, A. Roy, J. Stanley, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and M. Livny. Flexibility, manageability, and performance in a grid storage appliance. In Proceedings of the 11th IEEE Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland, July 2002.

[7] A. D. Birrell, A. Hisgen, C. Jerian, T. Mann, and G. Swart. The Echo distributed file system. Technical Report 111, Palo Alto, CA, USA, October 1993.

[8] M. Blaze. NFS tracing by passive network monitoring. In Proceedings of the USENIX Winter 1992 Technical Conference, pages 20–24, San Francisco, CA, USA, 1992.

[9] M. A. Blaze. Caching in large-scale distributed file systems. PhD thesis, Princeton, NJ, USA, 1993.

[10] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, and S. Tuecke. The Data Grid: towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications.

[11] F. Cristian, H. Aghali, R. Strong, and D. Dolev. Atomic broadcast: from simple message diffusion to Byzantine agreement. In Proc. 15th Int. Symp. on Fault-Tolerant Computing (FTCS-15), pages 200–206, Ann Arbor, MI, USA, 1985. IEEE Computer Society Press.

[12] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1998.

[13] I. Foster, J. Voeckler, M. Wilde, and Y. Zhao. Chimera: a virtual data system for representing, querying, and automating data derivation. In Proceedings of the 14th Conference on Scientific and Statistical Database Management, 2002.

[14] S. Hemminger. Netem - emulating real networks in the lab, April 2005.

[15] K. Holtman. CMS data grid system overview and requirements. The Compact Muon Solenoid (CMS) Experiment Note 2001/037, CERN, Switzerland, 2001.

[16] J. H. Howard, M. L. Kazar, S. G. Menees, D. A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West. Scale and performance in a distributed file system. ACM Trans. Comput. Syst., 6(1):51–81, 1988.

[17] J. J. Kistler and M. Satyanarayanan. Disconnected operation in the Coda file system. In Thirteenth ACM Symposium on Operating Systems Principles, volume 25, pages 213–225, Asilomar Conference Center, Pacific Grove, CA, USA, 1991. ACM Press.

[18] P. Kumar and M. Satyanarayanan. Supporting application-specific resolution in an optimistically replicated file system. In Workshop on Workstation Operating Systems, pages 66–70, 1993.

[19] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. In Advances in Ultra-Dependable Distributed Systems, N. Suri, C. J. Walter, and M. M. Hugue (Eds.). IEEE Computer Society Press, 1995.

[20] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, 28(9):690–691, 1979.

[21] C. Lever. Using the Linux NFS client with Network Appliance filers. Technical Report NetApp TR-3183, 2003.

[22] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira, and M. Williams. Replication in the Harp file system. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 226–238. ACM SIGOPS, 1991.

[23] P. Mockapetris. Domain names - concepts and facilities. RFC 1034, 1987.

[24] P. Mockapetris. Domain names - implementation and specification. RFC 1035, 1987.

[25] A. Muthitacharoen, R. Morris, T. M. Gil, and B. Chen. Ivy: a read/write peer-to-peer file system. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, 2002.

[26] B. Pawlowski, S. Shepler, C. Beame, B. Callaghan, M. Eisler, D. Noveck, D. Robinson, and R. Thurlow. The NFS version 4 protocol. In Proceedings of the 2nd International System Administration and Networking Conference (SANE 2000), page 94, 2000.

[27] G. J. Popek, R. G. Guy, T. W. Page, Jr., and J. S. Heidemann. Replication in Ficus distributed file systems. In IEEE Computer Society Technical Committee on Operating Systems and Application Environments Newsletter, volume 4, pages 24–29. IEEE Computer Society, 1990.

[28] S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, B. Zhao, and J. Kubiatowicz. Pond: the OceanStore prototype. In Proceedings of the Conference on File and Storage Technologies. USENIX, 2003.

[29] M. Ripeanu and I. Foster. A decentralized, adaptive replica location mechanism. In Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, page 24, Washington, DC, USA, 2002. IEEE Computer Society.

[30] D. Roselli, J. R. Lorch, and T. E. Anderson. A comparison of file system workloads. In Proceedings of the USENIX 2002 Technical Conference, pages 41–54, San Diego, CA, USA, 2002.

[31] Y. Saito, C. Karamanolis, M. Karlsson, and M. Mahalingam. Taming aggressive replication in the Pangaea wide-area file system. SIGOPS Oper. Syst. Rev., 36(SI):15–30, 2002.

[32] M. Satyanarayanan, J. H. Howard, D. A. Nichols, R. N. Sidebotham, A. Z. Spector, and M. J. West. The ITC distributed file system: principles and design. SIGOPS Oper. Syst. Rev., 19(5):35–50, 1985.

[33] M. Satyanarayanan, J. J. Kistler, P. Kumar, M. E. Okasaki, E. H. Siegel, and D. C. Steere. Coda: a highly available file system for a distributed workstation environment. IEEE Transactions on Computers, 39(4):447–459, 1990.

[34] D. Thain, J. Bent, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and M. Livny. Pipeline and batch sharing in grid workloads. In Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC'03), page 152, Washington, DC, USA, 2003. IEEE Computer Society.

[35] A. Tridgell. Efficient Algorithms for Sorting and Synchronization. PhD thesis, 1999.

[36] B. S. White, M. Walker, M. Humphrey, and A. S. Grimshaw. LegionFS: a secure and scalable file system supporting cross-domain high-performance applications. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing, pages 59–59, New York, NY, USA, 2001. ACM Press.

[37] J. Zhang and P. Honeyman. Naming, migration and replication in NFSv4. Technical Report CITI-TR-03-2, Ann Arbor, MI, USA, December 2003.

[38] J. Zhang and P. Honeyman. A replica control protocol for distributed file systems. Technical Report CITI-TR-04-1, Ann Arbor, MI, USA, April 2004.

[39] J. Zhang and P. Honeyman. File replication for large scale scientific applications. Technical Report CITI-TR-05-3, Ann Arbor, MI, USA, October 2005.
