Replicated File Management in Large-Scale Distributed Systems

Özalp Babaoğlu

Alberto Bartoli

Gianluca Dini

Technical Report UBLCS-94-16 June 1994 Revised January 1995

Department of Computer Science University of Bologna Piazza di Porta S. Donato, 5 40127 Bologna (Italy)

The University of Bologna Department of Computer Science Research Technical Reports are available in gzipped PostScript format via anonymous FTP from the area ftp.cs.unibo.it:/pub/TR/UBLCS or via WWW at URL http://www.cs.unibo.it/. Plain-text abstracts organized by year are available in the directory ABSTRACTS. All local authors can be reached via e-mail at the address [email protected]. Questions and comments should be addressed to [email protected].


Replicated File Management in Large-Scale Distributed Systems

Özalp Babaoğlu 1

Alberto Bartoli2

Gianluca Dini2

Technical Report UBLCS-94-16
June 1994 (Revised January 1995)

Abstract

Large-scale systems spanning geographically distant sites are potentially appropriate environments for distributed applications supporting collaboration. In this paper, we examine the possibility of using such systems as repositories for replicated files to facilitate low-latency data sharing. Asynchrony in communication and computation, complex combinations of site and communication failures, and, in particular, the network partitions that characterize these systems make the design of algorithms to operate on them a difficult task. We show that view-synchronous communication is not only an appropriate conceptual model for reasoning about large-scale distributed systems, it is also an effective programming model. We support these claims by developing algorithms for managing replicated files with one-copy serializability as the correctness criterion.

1. Dipartimento di Matematica, Università di Bologna, Piazza Porta S. Donato 5, 40127 Bologna (Italy). Tel. +39 51 354430, Fax: +39 51 354490, E-mail: [email protected]
2. Dipartimento di Ingegneria dell'Informazione, Università di Pisa, Via Diotisalvi 2, 56126 Pisa (Italy), E-mail: {alberto,gianluca}@iet.unipi.it

1 Introduction

Very large-scale distributed systems, such as the Internet, present interesting opportunities and challenges as infrastructures for collaborative distributed applications. The two aspects of large scale, geographic separation and number of sites, combine to provide an environment where computations may tap a huge number of resources with independent failure modes. In this paper, we concentrate on the "geographic separation" aspect of large scale and consider the problem of replicated file service over a wide-area distributed system in order to facilitate data sharing. In such systems, replication could not only improve the availability of data, it could also improve the latency of data access by allowing certain requests to be serviced completely locally.

The principal impediment to achieving these goals is the possibility of failures. In a large-scale system, failures may result in complex communication scenarios including network partitions. Furthermore, unpredictable communication and computation delays due to transient failures and highly-variable loads are sources of asynchrony and make reasoning based on time and timeouts impossible. Developing and reasoning about applications to be deployed in a large-scale distributed system would be an extremely difficult task if all of this complexity had to be confronted directly.

Our solutions are based on a view-synchronous communication (VSC) service [15, 24] that hides most of the complexities due to failures and asynchrony3. Informally, VSC cleanly transforms failures into group membership changes and provides global guarantees about the set of messages that have been delivered by a group as a function of changes to the group's composition. Being able to reason at this level despite failures and asynchrony of the underlying system greatly simplifies application development without sacrificing efficiency.

The main contribution of this work is in substantiating the above claim by developing algorithms for replicated file management that achieve low-latency access while guaranteeing one-copy serializability [10]. In doing so, we show how the conceptual VSC model can form the basis of a programming model by completing its semantics. We identify and solve several interesting problems related to the inherent asynchrony between applications and the VSC layer that render the model more realistic.

The current work reports only on the algorithmic issues within the context of a larger project implementing the replicated file service. As such, many important file system issues (e.g., naming, locating, physical storage) are ignored since they are outside the scope of this work. Furthermore, the correctness criterion that is considered (one-copy serializability) represents only one of several consistency options provided by the file service. Our overall design supports typed files where different correctness criteria may be associated with different files. One-copy serializability represents only one extreme in this spectrum, with weaker consistency options being available for those applications that can cope with the resulting inconsistencies [19, 23, 1].

2 System Model

The system is a collection of processes executing at potentially remote sites. Processes communicate through a message exchange service provided by the network. The network is not fully connected and is typically quite sparse. Both processes and communication links may fail by crashing. Given that the computing and communication resources may be shared by large numbers of processes and messages, the load on the system will be highly variable and unpredictable. Thus it is not possible to place bounds on communication delays and relative speeds of processes. As such, the system is adequately modeled as an asynchronous distributed system.

Asynchronous systems place fundamental limits on what can be achieved by distributed computations in the presence of failures [12]. In particular, the inability of some process p to communicate with another process q cannot be attributed to its real cause: q may have crashed, q may be slow, communication to q may have been disconnected or may be slow. From the point of view of p, all of these scenarios result in process q being unreachable.

3. In [24], the abstraction is called virtually-synchronous communication. We are reluctant to use this term [4] since it is loaded with other semantics that are associated with the Isis system [6].

Given two processes p1 and p2, let ⇝ be a binary relation such that p1 ⇝ p2 iff p2 is reachable from p1, in the sense that p1 can effectively communicate with p2. Reachability information is typically derived from a system service called the failure suspector [9]: processes that are suspected as having failed are declared unreachable while all others are reachable. What distinguishes a large-scale distributed system with respect to failures from an ordinary distributed system are the following properties of the reachability relation:

1. Asymmetric: (p ⇝ q) ⇏ (q ⇝ p).
2. Non-transitive: (p ⇝ q) ∧ (q ⇝ r) ⇏ (p ⇝ r).
3. Non-connected: The graph representing the reachability relation need not be connected.

In a large-scale system, communication delays could be comparable to inter-failure times, which may result in significant periods during which symmetry and transitivity of the reachability relation are not satisfied, due to inconsistencies either among the failure suspectors or in the network routing tables. Property 3, which corresponds to partitions, on the other hand, is a consequence of the sparse network connectivity and may be provoked by a small number of failures and persist for extended periods.
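As a small illustration (not from the paper), the following Python sketch records reachability as a set of directed edges and checks the first two properties on a made-up example; the process names and edge set are assumptions chosen purely for illustration.

# Reachability as a directed graph: p ~> q and q ~> r hold, nothing else.
reachable = {("p", "q"), ("q", "r")}

def can_reach(a, b):
    return (a, b) in reachable

# Asymmetry: p ~> q does not imply q ~> p.
assert can_reach("p", "q") and not can_reach("q", "p")

# Non-transitivity: p ~> q and q ~> r do not imply p ~> r.
assert can_reach("q", "r") and not can_reach("p", "r")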

3 View Synchrony

The basic primitive of view-synchronous communication (VSC) is the reliable multicast of a message to a group of processes. For the multicast primitive to be terminating in an asynchronous system with failures, VSC includes a membership service that provides consistent information regarding the components of the group that are currently believed to be reachable. At each process, in addition to delivering multicast messages, VSC also delivers views denoting the set of reachable processes. In an asynchronous system, views constructed by individual processes may be inaccurate with respect to the actual reachability state. What is guaranteed by VSC, however, is that views are mutually consistent in the sense that they are agreed upon by all of the processes in the view before being delivered. View changes are triggered by process crashes and recoveries, communication failures and repairs, network partitions and mergers, and explicit requests to join or leave the group. The particular VSC membership service we rely upon is called strong-partial in that multiple views are allowed to exist concurrently but their intersection is guaranteed to be empty. Concurrent views model network partitions and represent the major challenge for consistency of replicated data.

The real utility of VSC is not in its individual components (reliable multicasts and membership service) but in their integration. Informally, VSC guarantees that message deliveries are totally ordered with respect to view changes. Two views v and v' are said to be consecutive if there exists some process in their intersection for which v' is the next view to be delivered after view v. A process p is said to multicast a message in view v if the last view to be delivered at p before the multicast is v.

Definition 3.1 Let v and v' be two consecutive views. Communication is said to be view-synchronous iff (1) all processes in v ∩ v' deliver the same set of messages that were multicast in view v, and (2) no message is delivered in more than one view.

Note that VSC specifies neither the relative order in which individual messages are delivered within a view nor the semantics of view changes and message deliveries. The actions associated with these events are specified by the application through pieces of code called handlers.

To better understand the VSC abstraction, let us consider a view v made up of four processes p1, p2, p3, p4 and the multicast of m by p1 in view v. Suppose that p1 crashes while executing the multicast and, as a result, the message is delivered only by p2. Eventually, the failure suspector of one of the other processes will declare p1 to be unreachable and an agreement among the processes will be triggered to install a new view v' excluding p1. The protocol executed by the VSC run-time support guarantees that m will be delivered also to p3 and p4 (part 1 of the VSC definition) and that this event will happen before installing v' (part 2). The VSC definition also guarantees that if p2 in turn fails before having a chance to relay m to p3 and p4, then the new view v' will not include p2; otherwise part (1) of the definition would have been violated. This differs from the multicast semantics of the Isis system, where consecutive views differ by at most one element [6].

The conceptual power of VSC rests in the observation that the cuts associated with view deliveries define consistent global states of the system in which there are no messages in transit. As such, the delivery of a new view may be thought of as a synchronization point (thus the name "view synchrony") for the distributed computation, in that it allows local reasoning about the global system state based on a common view and a common set of delivered messages.
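The following minimal sketch (not from the paper) checks part (1) of Definition 3.1 on a recorded execution; the data structures and process names are assumptions made for illustration.

def view_synchronous(v, v_next, delivered_in_v):
    """delivered_in_v maps each process to the set of messages (multicast in
    view v) that it delivered before installing the consecutive view v_next.
    Every process surviving into v_next must have delivered the same set."""
    survivors = set(v) & set(v_next)
    delivered_sets = [frozenset(delivered_in_v.get(p, set())) for p in survivors]
    return len(set(delivered_sets)) <= 1   # all survivors agree on one set

# Example: p1 crashes after its multicast of m reaches only p2; VSC forces the
# support layer to relay m to p3 and p4 before the new view is installed.
v, v_next = {"p1", "p2", "p3", "p4"}, {"p2", "p3", "p4"}
print(view_synchronous(v, v_next, {"p2": {"m"}, "p3": {"m"}, "p4": {"m"}}))  # True
print(view_synchronous(v, v_next, {"p2": {"m"}, "p3": set(), "p4": set()}))  # False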

4 Generalized View Synchrony

While VSC as defined above is rather attractive, it cannot be implemented exactly as defined. In any implementation of VSC, including the one described in [4], there will be asynchrony between the support layer and the applications running on top of it. In particular, new views can be delivered by the VSC support layer asynchronously with respect to processes executing applications. For example, suppose that a process p believes that it multicast some message in view vi because the last view to be delivered by p was vi. There can be no guarantee that the actual servicing of the multicast and the delivery of the message will occur in view vi. In fact, an arbitrary number of view changes may occur between vi and the view where the message will be delivered (Fig. 1). Moreover, large-scale systems, where communication delays may be large and failures frequent, further increase the likelihood of this occurring. Thus, when a message is delivered, assumptions about the global system state that were made at the time of its multicast may no longer be true. Note that these observations are not a consequence of any particular implementation; rather, they are intrinsic to the nature of view synchrony.

Figure 1. Process delivers view v1 and starts executing a handler. In the meantime, two view changes occur, resulting in the multicast of message m being serviced in view v3.

To cope with these problems that complicate global reasoning with local information, we generalize the VSC semantics as follows. The view $v_i^m$ in which a process p multicasts message m is called the presumed view. Let the future view $v_{i+k}^m$ be the real view in which the delivery of message m actually occurs. Define the collapsing of view $v_i^m$ with respect to message m as the set
$$\bar{v}_i^m = \bigcap_{j=0}^{k} v_{i+j}^m$$
where $v_{i+j}^m$ and $v_{i+j+1}^m$ are consecutive views for all j = 0, ..., k. With these notions, we can now define the generalized VSC (GVSC) semantics as follows.

Definition 4.1 Communication is said to be generalized view-synchronous iff, for each message m, (1) all processes in $\bar{v}_i^m \cap v_{i+k+1}^m$ deliver the same set of messages that were multicast in presumed view $v_i^m$, and (2) no message is delivered in more than one view.

Note that if the presumed view and the real view are one and the same for all multicast messages, the generalized and original VSC semantics coincide. Informally, GVSC guarantees that a message is delivered only by those processes that were in the presumed view and that never left and rejoined the sender's view of the group until its delivery. The ability to reason about replicated file management with GVSC greatly simplifies both the algorithms and their proofs.
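As an illustration of these notions, the sketch below (not from the paper) computes the collapsed view and the set of processes allowed to deliver a message under GVSC; the view contents are made-up assumptions.

from functools import reduce

def collapsed_view(views):
    """Intersection of all views from the presumed view up to the delivery view."""
    return reduce(lambda a, b: a & b, views)

def gvsc_delivery_set(views, next_view):
    """Processes allowed to deliver the message: members of every view from the
    presumed view through the delivery view that also survive into the next view."""
    return collapsed_view(views) & next_view

# Example in the spirit of Section 6.1: q leaves in v2 and rejoins in v3, so q
# must not deliver a message multicast (presumed) in v1 and delivered in v3.
v1, v2, v3, v4 = {"p", "q", "r"}, {"p", "r"}, {"p", "q", "r"}, {"p", "q", "r"}
print(sorted(gvsc_delivery_set([v1, v2, v3], v4)))   # ['p', 'r'] -- q excluded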

5 Replicated Files and Server Groups

The basic unit of replication is a file, which is an unstructured sequence of bytes. The file system consists of client processes that issue access requests to files and server processes that service them. A replicated file is implemented as a set of ordinary files, each managed by a server residing at a (geographically) distinct site. In carrying out their function, servers rely on traditional (non-replicated) file services for local file storage and access. Clients interact with the server group to access files that are replicated. While in principle a client may contact any server, it will typically contact the local server (if there is one) for obvious performance reasons. In this paper we concentrate on the coordination among the servers in order to achieve low-latency access to consistent shared data. The details of the client-server interactions are beyond the scope of this work and are omitted.

Let fp denote the replica of file f managed by server p, and let S[f] denote those servers having a replica of f. Each replica fp of file f has an associated vote, denoted w[fp]. Both the set S[f] and the vote assignment are assumed to be static and known by all servers. Let w[f] denote the sum of the votes assigned to all replicas of f. Any subset QS[f] of S[f] such that $\sum_{f_p \in QS[f]} w[f_p] > w[f]/2$ is called a quorum set of file f. To simplify notation in the rest of the paper, we omit explicit references to the replicated file name when it is clear from context.
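To make the quorum condition concrete, here is a minimal Python sketch (an illustration, not part of the paper); the vote values and server names are assumptions.

votes = {"p": 1, "q": 1, "r": 1, "s": 2}      # w[f_p] for each server in S[f]
total = sum(votes.values())                    # w[f]

def is_quorum(servers):
    """A quorum set holds a strict majority of the total votes."""
    return sum(votes[p] for p in servers) > total / 2

print(is_quorum({"p", "q", "r"}))   # True  (3 of 5 votes)
print(is_quorum({"p", "q"}))        # False (2 of 5 votes)
print(is_quorum({"s", "p"}))        # True  (3 of 5 votes)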

Each server in S joins the Replica Management Group (RMG) that implements the abstraction of replicated files. Group management is performed by a layer providing VSC semantics as discussed in Section 3. We assume that the set S is static and that all servers join RMG and never leave it explicitly. While the set S is static, the membership of RMG, as represented by the views delivered to servers, is dynamic due to failures and repairs. In fact, network partitions may split the set of servers, resulting in multiple views of RMG existing concurrently.

The abstraction of a file implemented by RMG should be identical to that of a non-replicated file. In other words, the fact that multiple replicas exist should be transparent to the clients and the file should behave as if there were a single copy of it. This intuition is formalized as the one-copy correctness criterion for replicated data. The other aspect of correctness has to do with concurrent accesses to the file. Here, we adopt the traditional database notion known as serializability, which requires the behavior of the file under concurrent accesses to be identical to its behavior under some serialization of the accesses [8, 5]. Taken together, the correctness criterion is called one-copy serializability [10] and describes concurrent access to replicated files.

Despite network partitions, the strong-partial membership guarantees of the VSC service (Section 3), along with the fact that the set of all replicas and the vote assignment are static, lead to the following two properties of quorum sets:

Property 5.1 There cannot be multiple concurrent views of RMG that define quorum sets.


Property 5.2 Any two views of RMG that define quorum sets have non-empty intersection.

In other words, quorums transform the strong-partial membership service of VSC to a linear membership service [24]. Existence of a quorum determines which servers are allowed to make progress in the presence of partitions. Property 5.2 is used to determine which server has the most recent replica after a total failure.

A crucial aspect of one-copy behavior is whether a replica is up-to-date or not:

Definition 5.1 A replica is up-to-date if, after suspending all future write requests, it will eventually reflect all the writes performed on the file.

Note that the definition cannot be based on the state of the replica at a single cut, but has to be based on an interval. This is because any single cut may include a write operation in progress and the replica may not yet reflect the result of the operation. We define a second group over the set of servers S:

Definition 5.2 Up-to-date Group (UG): A server belongs to UG if it has a quorum view of RMG and has an up-to-date replica. Note that servers may belong to two different groups — RMG and UG. Since they join RMG once and never leave it explicitly, while operational, they always track RMG view changes.

Properties of these views determine whether a server may join the other group. In particular, if its view for RMG defines a quorum, the server joins UG after performing a state-transfer protocol to obtain a copy of the most recent replica (Section 6.3). At this point the server begins tracking views for both groups. If UG does not exist within the quorum view for RMG, it has to be created. Servers belonging to UG are able to satisfy requests to access the file. More precisely, if the current view for UG defines a quorum,4 the server can serve both read and write requests; otherwise, it can only serve read requests. The server leaves UG either when its current view stops being a quorum or as its first action during crash recovery. After leaving UG, a server waits for view changes in RMG such that a quorum is re-established. At that point the server joins UG as described above.

To avoid creating two new server groups for each replicated file, in practice those files that have the same distribution of votes among the sites are grouped into volumes. In this manner, votes and replicas become associated with volumes rather than with single files, resulting in only two server groups per volume. This modification affects only the algorithms dealing with the creation of and joins to UG (Section 6.3).
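The following minimal sketch (an illustration under stated assumptions, not the paper's code) summarizes the group-tracking rule just described; the callback names and the boolean quorum flags are assumptions.

class GroupTracker:
    def __init__(self):
        self.in_ug = False

    def on_recovery(self):
        self.in_ug = False                        # first action after a crash: leave UG

    def on_rmg_view_change(self, rmg_view_is_quorum):
        if rmg_view_is_quorum and not self.in_ug:
            self.state_transfer_and_join()        # Section 6.3; UG is created if it does not exist
            self.in_ug = True

    def on_ug_view_change(self, ug_view_is_quorum):
        if self.in_ug and not ug_view_is_quorum:
            self.in_ug = False                    # leave UG; wait for a quorum to be re-established

    def can_serve(self, op, ug_view_is_quorum):
        if not self.in_ug:
            return False
        return op == "read" or ug_view_is_quorum  # writes require a quorum view of UG

    def state_transfer_and_join(self):
        pass                                      # placeholder for the protocol of Section 6.3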

6 The Replica Management System

We consider the implementation of read and write operations on replicated files. Informally, reads are served by the local replica (if it exists) and writes are performed on all the replicas. As for concurrency control, we solve read-write conflicts by guaranteeing mutually-exclusive access to each local replica; we solve write-write conflicts by electing a lock manager (LM) among the servers in order to serialize write requests. This concurrency control policy, together with the read-local/write-all replica control policy within UG, guarantees one-copy serializability of file accesses [10].

Let S be a server that has been asked to perform a write (read) operation and let us denote by vx the views of UG installed at S; that is, we omit both the group and the server from the notation. Let vi be the presumed view at the time S is delivered a write (read) request from a client; let vi+k (k ≥ 0) be the presumed view when S returns the result of the operation to the client5; let, finally, the views vi+j, j ∈ [0, k], be consecutive.

Write  A write operation attempts to install a new version of the file and returns one of the following status codes to the client:
- Success: The new version has been installed at a quorum set. This happens iff vi+j contains a quorum set for all j ∈ [0, k]. The quorum set holding the new replica is contained in the intersection vi ∩ vi+1 ∩ ... ∩ vi+k. In the background, the new version will eventually be installed on all the servers in UG.
- Fail: All the already existing replicas remain unaffected. This happens iff vi = ∅, that is, if S does not belong to UG.
- NotQuorum: All the already existing replicas remain unaffected. This happens iff vi ≠ ∅ and vi does not contain a quorum set yet.
- Unknown: The outcome is either that associated with Success or that associated with Fail, but it is not known which. This happens iff vi contains a quorum set and there exists j ∈ [1, k] such that vi+j does not contain a quorum set.

Read  A read operation attempts to get the latest version of the file and returns one of the following status codes to the client:
- Success: The returned version is the latest one. This happens as long as S belongs to UG.
- Fail: The latest version cannot be located. This happens in any other case.

The algorithms for managing locks and for writing are logically subdivided into four modules of the servers. The modules communicate through messages (either point-to-point or reliable multicast) and each module is executed only when the corresponding server belongs to a UG that defines a quorum. To simplify the presentation, we assume that the client-server interface is such that while a server is busy satisfying a request, no further client requests are directed to it.

We present our algorithms using a pseudo programming language with a Pascal-like syntax. The notation is extended by primitives for expressing communication and concurrency. Communication is initiated through send and v-cast statements. The former is used for (reliable) point-to-point communication while the latter invokes a reliable multicast with VSC semantics. The destination for the v-cast primitive is implicitly UG. Messages are typed through identifiers that are contained within the message body. We use SMALL-CAPS font to denote message types. The two events provided by VSC are denoted delivery and view-change and correspond to message and view change deliveries, respectively. A process may be required to synchronize itself with one of the following events: delivery of a view change, delivery of a message, or the satisfaction of a boolean condition on local variables. The statement wait-for(condition) blocks the process until the condition is verified. On the other hand, the statement upon(event) is used to specify handlers that are invoked asynchronously with respect to the execution of the process when the relevant event occurs. We assume that once invoked, handlers for message delivery and view changes execute either to the end or until they block, without any interleaving of other handler executions. In analogy to interrupt handlers, we say that they are uninterruptible.

Variables that are local to handlers will not be declared and we will assume that their type may be understood from context. Variables that survive across multiple handler executions (global variables) are enclosed within var, end statements. Global variables can also be accessed by all the handlers in that module. The initial statement defines the initial value for a global variable and it is executed whenever a server starts executing the corresponding module. Variables not associated with such a statement are initialized within proper protocols (Section 6.3). We have assumed that all internal clean-up actions that become necessary when any algorithm is aborted (possibly including group-leave(UG)) are performed within the procedure abort().

4. Once created, UG always defines a quorum. There may be periods during its creation, however, where this may not be true (see Section 6.3).
5. This event may happen before the actual completion of the protocols (Section 6.2).
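Before turning to the pseudocode, here is a minimal Python sketch (an illustration, not the paper's notation) of how the write status code follows from the sequence of views vi ... vi+k observed between the request and the reply; encoding each view as a boolean that says whether it contains a quorum set is an assumption made for brevity.

def write_status(view_has_quorum):
    """view_has_quorum[j] tells whether v_{i+j} contains a quorum set;
    an empty list stands for v_i being empty (S is not in UG)."""
    if not view_has_quorum:
        return "Fail"
    if not view_has_quorum[0]:
        return "NotQuorum"                       # v_i non-empty but not yet a quorum
    if all(view_has_quorum):
        return "Success"                         # quorum held in every v_{i+j}
    return "Unknown"                             # quorum lost at some point after v_i

print(write_status([True, True, True]))          # Success
print(write_status([True, False]))               # Unknown
print(write_status([False]))                     # NotQuorum
print(write_status([]))                          # Fail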


var
  queue: fifo queue of sid initial empty;
  LockHolder: sid initial NULL;
  LockManager: sid;
end;

1  upon(delivery of L-REQ)
2    place msg.sender in queue;
3    distribute_lock();
4  upon(delivery of L-REL)
5    LockHolder := NULL;
6    distribute_lock();
7  upon(view-change in UG)
8    if(UG is a quorum view) then
9      if(LockManager = mysid) then
10       if(LockHolder ∉ UG) then
11         send(mysid, I-ENQ);
12         wait-for(delivery of (I-RES, ires))
13           upon(view-change in UG)
14             if(UG is not a quorum view) then
15               abort();
16         LockHolder := ires.writer;
17       remove from queue requests of servers that are not in UG;
18       distribute_lock();
19     elseif(LockManager ∉ UG) then
20       LockManager := new-lockmgr(UG);
21       if(LockManager = mysid) then
22         wait-for(delivery of L-STATE from all servers in UG)
23           upon(view-change in UG)
24             if(UG is not a quorum view) then
25               abort();
26         rebuild queue and LockHolder;
27         distribute_lock();
28   else abort();

29 procedure distribute_lock()
30   if((LockHolder = NULL) and (queue is not empty)) then
31     LockHolder := extract first sid from queue;
32     send(LockHolder, L-GRANTED);

Figure 2. The algorithm executed by the lock manager. The type sid is used to store server identifiers. The variable mysid contains the identifier of the executing server. A message sent to the server itself (line 11) is received by the participant module of the writer. The group name is used also in referring to the current view for that group. The function new-lockmgr() deterministically selects an element of its argument.

var
  LockManager: sid;
  granted: boolean initial FALSE;
  lreq: message initial NULL;
end;

1  upon(request by client to acquire lock)
2    lreq := ⟨L-REQ, get-time(), mysid⟩;
3    send(LockManager, lreq);
4    wait-for(receipt of L-GRANTED)
5    granted := TRUE;
6  upon(request by client to release lock)
7    lreq := NULL;
8    granted := FALSE;
9    send(LockManager, L-REL);
10 upon(view-change in UG)
11   if(UG is a quorum view) then
12     if(LockManager ∉ UG) then
13       LockManager := new-lockmgr(UG);
15       lstate := ⟨L-STATE, lreq⟩;
16       if(granted) then
17         lstate.info := "I have the lock";
18       else send(mysid, I-ENQ);
19         wait-for(delivery of (I-RES, ires));
20           upon(view-change in UG)
21             if(UG is a quorum view) then
22               if(LockManager ∉ UG) then
23                 LockManager := new-lockmgr(UG);
24             else abort();
25         if(ires.writer = mysid) then
26           lstate.info := "I inherited the lock";
27       send(LockManager, lstate);
28   else abort();

Figure 3. The algorithm executed by the lock agent. Function get-time() returns the local time.

6.1 Write Lock Management

Write-write conflicts are solved by requiring that no more than one writer be active in any quorum. This is achieved by electing a lock manager to distribute write locks among the components of UG. Within each server, two modules are devoted to lock management. The code describing the server's behavior when granting write access is called the lock manager (LM), while that describing its behavior when requesting write access is called the lock agent (LA). The lock management algorithms guarantee that at most one server is active as the LM and that it issues a write lock to a single LA at a time (see also the Propositions below).

Before discussing the code, it is worthwhile noting the usefulness of the "generalized VSC" notion presented in Section 4. Consider three consecutive views for UG, say v1, v2, v3. Suppose that the lock manager p is delivered a lock request from process q in v1 and that p sends a message granting the lock. Suppose further that (1) while p is handling the request, views v2 and v3 have already been installed, and (2) v2 does not include q but v3 does (due, for example, to a transient communication failure). The message granting the lock is actually sent in v3 because of (1). While handling the installation of view v2, p detects that the lock must be broken and allocates it to another process r. Without the notion of generalized VSC, process q would receive the old granting message, so we might end up in a situation with two lock holders in v3. Although this problem could be solved in several ways, with our generalized VSC notion the reasoning is greatly simplified, since q does not deliver the old message because it does not belong to the set v1 ∩ v2 ∩ v3.

The protocol executed by LM is shown in Figure 2. At any time, LM may receive a lock

request (L-REQ), a lock release (L-REL), or may be delivered a view change. A lock request is immediately granted if the lock is free; otherwise it is queued. When a lock is released, if there are pending requests, one of them is granted. To avoid starvation, the choice is made on a FIFO basis. If a view change is delivered, requests that originated from servers that left the view are removed from the queue. Moreover, if the current lock holder left the view, the lock is either broken or inherited by some other server (see below). If instead LM itself leaves the view, it gives up the role of lock manager.

The algorithm for lock acquisition performed by LA, shown in Figure 3, is invoked by a write operation. As soon as the request is received, LA timestamps an L-REQ message with its local clock, sends it to LM, and waits for either the granting of the lock or the delivery of a new view. In the latter case, if the new view still constitutes a quorum set but does not include the old lock manager, a recovery action is executed.

By convention, the first server joining UG becomes the lock manager.6 Whenever a server joins UG, it is told (via the state-transfer-and-join algorithm described in Section 6.3) who the current LM is. Whenever a server in UG is delivered a view change that excludes LM, a new lock manager is elected by applying a deterministic function to the new composition of UG. For the new LM to take over correctly, it must figure out who the lock holder is, if any, and it must reconstruct the queue of pending requests. To this end we simply require that the new LM does not serve any request until it has received a message tagged L-STATE from all the components of UG. Such messages contain the "lock state" of the sending server, which may be any one of "I have the lock", "I inherited the lock", or "I requested a lock at my local time T0"7. The LA module of each server sends its own lock state whenever it detects the election of a new LM. After having received all lock states, LM reconstructs the (FIFO) lock-request queue based on the local time information contained in the L-STATE messages, implicitly grants the lock to the server that said either "I have the lock" or "I inherited the lock", if any, and then starts working (a sketch of this takeover step is given at the end of this subsection). Ordering the new lock-request queue according to the time information carried by L-STATE messages enables us to avoid starvation of requests, provided clocks are monotonically increasing and clock drifts are bounded.

Lock inheritance may happen if the lock holder leaves UG. More precisely, if the lock holder started a write protocol before leaving, then this write is completed by the remaining servers, in which case the lock is inherited by one of them. Otherwise, the lock is broken. To figure out whether there was a write in progress at the time the lock holder left UG, each LA sends an I-ENQ message to the participant module of the write protocol at its own site (Section 6.2). This in turn replies with an I-RES message containing the identity of the server currently acting as writer, which may possibly be NULL. This point is discussed in more detail in Section 6.2.

We can prove the following properties for the above algorithms (all proofs can be found in Appendix A); they will be used by the write protocol:

Proposition 6.1 There can be no more than one LM in existence concurrently.

Proposition 6.2 If there are no failures, eventually there will be exactly one LM.

Proposition 6.3 There can be no more than one lock holder in existence concurrently.

Proposition 6.4 Every lock request is eventually granted, provided the UG in which the request was made does not disappear and the requesting server does not leave UG before the lock is granted.

6. The way we create UG guarantees that it does not contain more than one server initially (see Section 6.3).
7. These pieces of information are actually coded within the message in a slightly different manner.


Proposition 6.5 While the current lock holder belongs to UG, the lock is not broken (forcefully taken away).

Proposition 6.6 When the current lock holder leaves UG, the lock is broken only if there is no write in progress. Otherwise, the lock is inherited by some other server in UG.

Property 6.5 implicitly assumes that any lock holder will eventually release the lock. This in turn implies that any write protocol eventually completes (Section 6.2). Property 6.6 is necessary for the write protocol (Section 6.2).
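To make the takeover step concrete, here is a minimal sketch (an illustration under assumed message formats, not the paper's code) of a deterministic new-lockmgr() choice and of rebuilding the lock state from L-STATE messages; choosing the smallest sid and the dictionary layout are assumptions.

def new_lockmgr(ug_view):
    """Deterministic choice of the new lock manager, e.g. the smallest sid."""
    return min(ug_view)

def rebuild_lock_state(lstates):
    """lstates: sid -> {'info': lock-state tag or None,
                        'request': (local_time, sid) or None}."""
    holder = None
    for sid, st in lstates.items():
        if st["info"] in ("I have the lock", "I inherited the lock"):
            holder = sid                              # implicitly re-grant the lock
    pending = sorted(st["request"] for st in lstates.values() if st["request"])
    queue = [sid for (_t, sid) in pending]            # FIFO by local timestamps
    return holder, queue

holder, queue = rebuild_lock_state({
    1: {"info": "I have the lock", "request": None},
    2: {"info": None, "request": (105, 2)},
    3: {"info": None, "request": (99, 3)},
})
print(holder, queue)   # 1 [3, 2]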

6.2 Coordinating Writes

For each write request, the servers that perform updates on its behalf are called the participants. Each participant updates the replica that is local to it. The participants' activity is coordinated by the writer, which is the server that received the client write request. The write protocol is greatly simplified by the fact that all participants belong to UG, and thus all of them are up-to-date. Furthermore, we require that servers do not join UG while a write protocol is in progress (Section 6.3). This implies that during a write protocol UG can only shrink.

To cope with servers that leave the write protocol before completion, thus leading to replicas that potentially differ from each other, each replica has the following associated information: (i) a monotonically increasing version number (vn); (ii) the UG incarnation number (ugid) in which the replica has been created; and (iii) a state that may be either W, meaning "the last write operation attempted on this replica has been completed", or R, meaning "the last write operation attempted on this replica is ready-to-be-completed". Contents of replicas in state R are never given to clients: write and read operations are performed only on replicas in state W. The ugid is an integer maintained (in volatile memory) by every server in UG that is incremented every time the composition of UG changes, provided UG remains a quorum. When a server joins UG, it learns the current value of ugid through the state-transfer-and-join algorithm (Section 6.3). When a new UG is created from scratch, the new ugid is obtained by incrementing the ugid of the replica held by the server that created UG. Since the server that created UG has a replica with the highest ugid, any write performed in the new incarnation of UG will be given a ugid greater than all those associated with the existing replicas. How this information is used to detect and resolve inconsistencies among the various replicas is discussed in Section 6.3.

Figure 4 shows the algorithm executed by the writer. The writer acquires a lock and multicasts a W-REQ message to all participants (including itself). Then, it sets about waiting for acknowledgment messages W-ACK. Receipt of an acknowledgement implies that the corresponding participant has created a replica in state R for the new version (an R-replica for short). Having collected acknowledgements from a number of participants defining a quorum, the writer multicasts a W-COMMIT message and releases the lock. The write request of the client can be replied to as soon as a W-ACK has been received from a quorum of participants, typically much earlier than when all acknowledgements have been received. The reason why we can reduce the write latency experienced by clients in this manner is that once a quorum of participants has created an R-replica, it is guaranteed that the effects of the write will not be lost. If the writer is notified of a view change before sending the W-COMMIT message, then UG must have shrunk. If the new view remains a quorum, there is nothing to do.8 Otherwise, the writer aborts the operation and returns an "Unknown" status to its client.

8. Actually, it might be necessary to perform some actions having to do with lock management, e.g., if the LM left UG. For the sake of clarity, we have presented lock management and writing in isolation.

Note that when a participant is asked to update its secondary storage, it is expected to either perform the update or leave the UG. That is, a participant can never respond "NO" to a write request [16, 17] as it could in an atomic commitment algorithm [13]. In other words, our write protocol does not require an atomic commitment. The reason why we can make this assumption is the observation that a server unable to write to its disks must have serious problems and will eventually be declared unreachable and leave the view.

Figure 4 also shows the behavior of a participant in the write algorithm. When a participant receives a W-REQ, it realizes that there is a write protocol in progress. The participant registers the request to secondary storage, thus creating an R-replica of the new version. It then acknowledges the request through a W-ACK. When a participant receives a W-COMMIT, it simply switches the state of its R-replica to W, thus installing the new version of the file. If a new view is delivered while a participant is waiting for the commit message, there are two cases to consider: (1) the participant left UG; (2) the writer left UG. In the former case, the participant just leaves the protocol. In the latter case, a new server S is elected among the remaining components of UG, through a deterministic function new-writer(), to complete the operation. S behaves similarly to the original writer: it waits for a W-SYNCACK from all the remaining participants (including itself) and then sends a W-COMMIT.

Note that S must collect W-SYNCACK messages before sending the W-COMMIT because VSC, while guaranteeing that all processes see view changes in the same order with respect to message deliveries, does not guarantee any order between view changes and secondary-storage accesses. This property is analyzed in more detail in Figure 5, where the dashed lines indicate cuts defined by possible view changes that exclude the writer in the middle of the protocol. Views VC1 and VC3 of the figure are the two extreme cases, since the former is delivered before any participant has created its R-replica, whereas the latter is delivered after all of them have created their R-replicas. Between these two cases there are other possibilities, e.g., view VC2. Obviously, from the point of view of a single participant, the knowledge that a new view has been installed and that all the surviving servers have received the W-REQ message does not allow it to conclude anything about the state of the various replicas: a participant could crash before having a chance to access its secondary storage. A form of end-to-end acknowledgement is thus necessary, which is provided exactly by the W-SYNCACK. Similar considerations are also valid for the writer side: if the writer is notified of a view change that excludes some participants from UG, it can only conclude that all surviving participants have received its W-REQ, but it cannot conclude anything about whether they have already created R-replicas or not. As an aside, notice that when the writer leaves UG, VSC guarantees that either all remaining servers have been delivered the W-REQ, or none has. Therefore, if the lock has to be inherited, then all servers are aware of this (Section 6.1).

Figure 6 shows a possible interleaving between consecutive executions of the write protocol. A server holding an R-replica for version vn (the one labelled S) may receive a W-REQ that attempts to install version vn + 1 before the W-COMMIT for version vn. The reason why this may happen is that the system is asynchronous and channels only guarantee FIFO ordering.
To get around this problem, note that S may conclude, upon delivery of the W-REQ for version vn + 1, that version vn has been successfully installed; if it had not, S would have left UG. Thus, before creating an R-replica for version vn + 1, S may automatically switch to W the state of its R-replica for version vn. Note also that no further write operations can start until S acknowledges the creation of an R-replica for version vn + 1.

To prove the correctness of the write algorithms, we need to show that they achieve one-copy semantics. This in turn can be derived from the following propositions (a minimal sketch of the per-replica version and state bookkeeping is given after the propositions):

Proposition 6.7 If a server installs a new version of the file in a UG that defines a quorum, then eventually either all servers in UG also install the new version or UG disappears.

Proposition 6.8 The sequence of installed versions of the file by any server in UG corresponds to a subsequence of the write lock acquisitions, which are totally ordered.
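As referenced above, the following minimal sketch (illustrative; the class and field names are assumptions) summarizes the per-replica bookkeeping used by the write protocol: the version number vn, the UG incarnation number ugid, and the R/W state, including the switch of a leftover R-replica to W upon delivery of the W-REQ for the next version.

class Replica:
    def __init__(self, vn=0, ugid=0, state="W", data=b""):
        self.vn, self.ugid, self.state, self.data = vn, ugid, state, data

    def readable(self):
        return self.state == "W"        # contents of R-replicas are never given to clients

    def on_w_req(self, new_data, new_vn, current_ugid):
        # A W-REQ for vn+1 implies the previous write reached a quorum, so an
        # R-replica left over for vn can be switched to W before proceeding.
        if self.state == "R" and new_vn == self.vn + 1:
            self.state = "W"
        self.data, self.vn, self.ugid, self.state = new_data, new_vn, current_ugid, "R"

    def on_w_commit(self, vn):
        if vn == self.vn and self.state == "R":
            self.state = "W"            # install the new version

r = Replica(vn=3, ugid=7, state="W", data=b"old")
r.on_w_req(b"new", 4, 7)
print(r.vn, r.state)                    # 4 R  -- not yet readable
r.on_w_commit(4)
print(r.vn, r.state, r.readable())      # 4 W True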


Writer:

var
  replied: boolean initial FALSE;
  active: boolean initial FALSE;
  vn: integer;
end;

1  upon(delivery of a write request from client)
2    active := TRUE;
3    replied := FALSE;
4    acquire_lock();
5    vn := get-vn();
6    wreq := ⟨W-REQ, vn+1, data⟩;
7    v-cast(wreq);
8    wait-for(delivery of W-ACK from a quorum in UG);
9    return SUCCESS to client;
10   replied := TRUE;
11   v-cast(W-COMMIT);
12   wait-for(delivery of remaining W-ACK in UG);
13   release_lock();
14   active := FALSE;
15 upon(view-change in UG)
16   if(UG is not a quorum view) then
17     if(not replied and active) then
18       return UNKNOWN to client;
19     abort();

Participant:

var
  writer: sid initial NULL;
  ugid: integer;
end;

1  upon(delivery of (W-REQ, wreq))
2    if(wreq.vn = get-vn() + 1) then
3      if(disk_write("W", ugid) = FAIL)
4        then abort();
5    writer := wreq.sender;
6    if(disk_write(wreq.data, "R", ugid, wreq.vn) = FAIL)
7      then abort();
8    send(writer, W-ACK);
9  upon(delivery of (W-COMMIT, wcom))
10   if(wcom.vn = get-vn()) then
11     writer := NULL;
12     if(disk_write("W", ugid) = FAIL)
13       then abort();
14 upon(delivery of (I-ENQ, ienq))
15   lockinfo := ⟨I-RES, writer⟩;
16   send(ienq.sender, lockinfo);
17 upon(view-change in UG)
18   if(UG is a quorum view) then
19     ugid := ugid + 1;
20     if(writer ≠ NULL and writer ∉ UG) then
21       writer := new-writer(UG);
22       send(writer, W-SYNCACK);
23       if(mysid = writer) then
24         wait-for(delivery of W-SYNCACK from a quorum in UG);
25         v-cast(W-COMMIT);
26         wait-for(delivery of W-SYNCACK from all others in UG);
27         release_lock();
28   else abort();

Figure 4. The algorithms executed by the writer (top) and by a participant in the write protocol (bottom). Function get-vn() returns the version number of the replica (in either W or R state). The function new-writer() deterministically selects an element among its argument. The functions acquire_lock() and release_lock() invoke the appropriate actions of the lock agent.

Figure 5. An execution of the write protocol (writer and three participants exchanging W-REQ and W-COMMIT) in the absence of view changes during the protocol. Dashed lines show possible view changes VC1, VC2, VC3 that exclude the writer from the rest of the UG.

6.3 Creating and Joining the Up-To-Date Group

A problem that still needs to be addressed is the creation of the up-to-date group UG. This problem arises when there is a quorum view for RMG but none of its members knows whether it is up-to-date or not. In such a case, the algorithm for creating UG (Fig. 7) is activated. It is meant to check whether UG already exists and, if it does not, to create it from scratch. In the latter case, an up-to-date server is selected within the quorum view of RMG and entrusted with creating UG. As proved in the Appendix, the algorithm guarantees the following:

Proposition 6.9 If, during the execution of the reformation algorithm, the number of failures is finite and the view for RMG continues to define a quorum, then UG will eventually be created.

The decision as to which servers can create the new UG is taken by a coordinator in RMG using the information maintained by the servers as described in Section 6.2. Let the volume state of a server be the largest ugid among all the replicas, either in state R or W, belonging to the volume at that server.9 The following result holds:

Proposition 6.10 The server with the highest volume state among those in a quorum view of RMG creates UG.

The server that reforms UG, say S, leaves the state of its replicas, either R or W, unaffected. In particular, it does not switch to W the state of the R-replicas it (possibly) has. The reason is that if UG ceases to exist again before forming a quorum, the R-replicas that


Figure 6. Two interleaved executions of the write protocol (writer1 releases the lock with L-REL, the lock manager grants it to writer2 with L-GRANTED). A message pertinent to version vn + 1 may reach server S before the latter has observed the completion of the attempt to install vn.

have been selected as up-to-date might not exist on a quorum. Thus, if they were transformed into W-replicas, they might not be taken into account when attempting to reform UG (see also Proposition 6.10 and its proof in the Appendix), which would violate serializability. These R-replicas will be transformed into W-replicas when writes start again (lines 5-7 of the writer algorithm and lines 2-3 of the participant algorithm, Fig. 4).

Once a new UG has been created, other servers in RMG can join it through the state-transfer-and-join algorithm shown in Figure 8.10 The details of the state transfer depend on whether UG already forms a quorum set or not. Figure 8 shows the algorithm followed in the former case, i.e., the normal case. For convenience of presentation, we will discuss the issues related to the initial transitory phase, in which UG does not form a quorum yet, at the end of this section. The state transferred includes the contents of the up-to-date replica11 and the values of the variables LockManager and ugid. The state transfer occurs in two phases to avoid blocking UG with respect to writes for the entire duration of the state transfer, which could be fairly long in the case of large volumes. The first phase occurs in the background, in parallel with ongoing write operations (Fig. 8, state provider, line 4). The second phase consists of obtaining the write lock and transferring the modifications to the volume that may have occurred during the first phase (Fig. 8, state provider, lines 5-6). Thus, UG remains blocked for new write requests due to state transfer only during this relatively short second phase. The correctness of the algorithm is established through the following properties:

Proposition 6.11 If there are a finite number of failures and UG does not disappear while executing the protocol, a server S wishing to join will eventually belong to UG.

Once a new UG has been created, other servers in RMG can join it through the statetransfer-and-join algorithm shown in Figure 8.10 Indeed, the details of the state-transfer depend on whether UG already forms a quorum set or not. Figure 8 shows the algorithm followed in the former case, e.g. the normal case. For convenience of presentation, we will discuss the issues related to the initial transitory phase, in which UG does not form a quorum yet, at the end of this Section. The state transferred includes the contents of the up-to-date replica, 11 and the values of the variables LockManager and ugid. The state transfer occurs in two phases to avoid blocking UG with respect to writes for the entire duration of the state transfer, which could be fairly long in case of large volumes. The first phase occurs in the background, in parallel with ongoing write operations (Fig. 8 right, line 4). The second phase consists of obtaining the write lock and transferring the modifications to the volume that may have occurred during the first phase (Fig. 8 right, lines 5-6). Thus, the UG remains blocked for new write requests due to state transfer only during this relatively short second phase. The correctness of the algorithm is established through the following properties: Proposition 6.11 If there are a finite number of failures and UG does not disappear while executing the protocol, a server S wishing to join will eventually belong to UG. Proposition 6.12 A server S joins UG only if there is no write in progress. 10. As with the client-server interface, we assume that the lock request for the state transfer is not forwarded to the lock acquisition module (Figure 3), until that module has served all previous requests.

R

11. As discussed above, while UG does not form a quorum joining servers might receive via state-transfer -replicas. Recall also that -replicas are not seen by clients, since reads always return the -replica associated with the largest version number (Section 6.2).

R

UBLCS-94-16

W

15

1 var
2   ps, rs : set of sid initial ∅;
3   c : sid initial NULL;
4   rss : set of ReplicaState initial ∅;
5 end

6  procedure uga()
7    if(UG exists) then stj();
8    else
9      ps := ps ∩ RMG;
10     if(ps ≠ RMG) then
11       if(c ≠ mysid) then          /* UGA is restarted */
12         rs := ∅;
13         ps := RMG;
14         c := NULL;
15         rss := ∅;
16         v-cast(V-STATE);
17       else                         /* UGA continues */
18         ∀p: p ∈ rs ∧ p ∉ ps
19           remove ⟨p⟩ from rs;
20           delete V-STATE of ⟨p⟩ from rss;
21         if(c ≠ NULL ∧ c ∉ rs)
22           create_UG(rss, rs);

23 procedure create_UG(rss, rs)
24   c := select-candidate(rss, rs);
25   if(c = mysid) then
26     group-join(UG);
27     init();
28     v-cast(UG-BUILT);

29 upon(view-change(RMG))
30   if(RMG defines a quorum) then
31     uga();
32   else abort();

33 upon(delivery(V-STATE))
34   rs := rs ∪ {V-STATE.SND};
35   put V-STATE in rss;
36   if(rs = ps) then
37     create_UG(rss, rs);

38 upon(delivery(UG-BUILT))
39   if(c ≠ mysid) then
40     stj();

Figure 7. Algorithm for creating UG. Communication is within RMG. The init() procedure sets to proper values the variables of the various modules that have not been associated with any initial statement in the pseudocode. In particular, LockManager is set to the sid of the invoking server whereas ugid is set to the volume state increased by one.
Joining server:

var sp: sid initial NULL;
end;

1  procedure state_transfer_and_join()
2    if(mysid ∉ UG and UG exists) do
3      sp := select(members(UG));
4      send(sp, J-REQ);
5      wait-for(completion of state transfer from sp)
6        upon(delivery(J-ABORT))
7          abort state_transfer_and_join
8          uga()
9  upon(view-change(RMG))
10   if(RMG is not a quorum view) then
11     abort();
12   if(sp ≠ NULL and sp ∉ RMG) then
13     abort state_transfer_and_join
14     uga();

State provider:

var js: sid initial NULL;
end;

1  upon(delivery(J-REQ, jreq))
2    if(I am in UG) then
3      js := jreq.sender;
4      transfer state to js;
5      acquire_lock();
6      update state sent to js;
7      wait-for(js ∈ UG);
8      release_lock();
9      js := NULL;
10   else
11     send(J-ABORT);
12 upon(view-change in UG)
13   if(UG is not a quorum view) then
14     abort();
15 upon(view-change in RMG)
16   if(RMG is not a quorum view) then
17     abort();
18   if(js ≠ NULL and js ∉ RMG) then
19     release_lock();
20     abort state_transfer();

Figure 8. The state-transfer-and-join algorithm: joining server (top) and state provider in UG (bottom). Selecting the state provider is done by applying the function select() to the current composition of UG as returned by members(UG).
Proposition 6.12 is necessary for the correctness of the write algorithm.

Concerning the initial transitory phase: since servers join UG one at a time, there will be a certain interval in which UG exists but does not form a quorum yet. During this interval servers in UG may only serve read requests; they cannot serve write requests (Section 6). Furthermore, in this phase the state provider must not invoke abort() when the new composition of UG is not a quorum (Fig. 8, state provider, lines 12-14). To this end it suffices to activate a slightly different module in this phase. As soon as a server in UG is delivered a view that forms a quorum, it will activate the modules for managing locks and for writing, as well as the state provider module of Fig. 8.

Finally, putting together the above results, we can prove the following:

Proposition 6.13 The replica management system implements replicated files with one-copy serializability semantics.
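To illustrate how the coordinator of the reformation algorithm could realize Proposition 6.10, the following is a minimal sketch (an assumption-laden illustration, not the paper's code) of computing a server's volume state and selecting the candidate with the highest one; the data layouts and tie-breaking by server identifier are assumptions.

def volume_state(replicas):
    """replicas: list of (vn, ugid, state) tuples for one server's volume;
    the volume state is the largest ugid among them (R or W state alike)."""
    return max((ugid for (_vn, ugid, _state) in replicas), default=0)

def select_candidate(v_states):
    """v_states: sid -> volume state reported in that server's V-STATE message.
    The candidate entrusted with recreating UG has the highest volume state."""
    return max(v_states, key=lambda sid: (v_states[sid], sid))

states = {"p": 7, "q": 9, "r": 9}
print(select_candidate(states))   # 'r' -- highest volume state, ties broken by sid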

7 Dynamically Changing Votes and Composition of RMS

7.1 Introduction

In this section we will discuss how to modify the set of sites involved in the replica management system as well as the distribution of votes to these sites. Changing the set of sites is useful for adding/removing replicas, whereas changing votes may be useful for dynamically modifying the “relative importance” of the various replicas. We will discuss here only the mechanisms for doing this and will not enter into any detail about the policies that can be built upon them. For instance, one might dynamically change votes so as to let them “migrate” toward those sites that have used the corresponding volume more heavily. Or, one might create (remove) replicas at sites running clients that frequently (rarely) use the volume. We require that the votes associated with the various replicas be bound to geographical sites. Each server participating in the system will have a voting table describing such a binding and will keep it in permanent storage. Names of geographical sites are used for indexing the voting table, and to this end we assume that (1) the association of a name with a site is permanent; and (2) different sites have different names. These are the only assumptions we make; e.g., we do not care whether names of geographical sites are “pure names” or not [18], whether they are a machine- or a human-processable form of identification [21], and so on. It is important to point out, however, that an actual application cannot name geographical sites by means of the identifier assigned by the VSC run-time support to the server running at that site. The reason is that in VSC a process ceases to exist when it crashes, i.e. when it recovers it is given a new identifier 12, which does not match our notion of a permanent association between votes and sites. The application must thus implement on its own the binding of site names to VSC names. There are several simple ways of doing this; one possible arrangement is sketched below.

The major problem to solve in order to be able to change votes dynamically is that the ability to transform the strong-partial membership service of VSC into a linear membership service (Section 5) must be preserved. Differently stated, we want Property 5.1 (there can be no multiple concurrent views of RMG that define quorum sets) to hold even with dynamically changing votes. In Section 7.2 we will assume that the composition of the replica management system is still static, and will show in Section 7.4 how to extend such a framework to deal with the addition/removal of sites.
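One such arrangement is sketched below. This is only an illustration under our own assumptions (the class and method names are ours, not part of the VSC interface): a voting table indexed by permanent site names, together with an application-level table that re-binds each site name to the current VSC identifier after every recovery.

    # Illustrative sketch only (names and structure are our assumptions).
    class VotingTable:
        """Votes are bound to permanent geographical site names."""
        def __init__(self, votes):
            self.votes = dict(votes)        # site name -> number of votes

        def total_votes(self):
            return sum(self.votes.values())

    class SiteBinding:
        """Application-level binding of permanent site names to VSC identifiers."""
        def __init__(self):
            self.sid_of = {}                # site name -> current VSC sid
            self.site_of = {}               # current VSC sid -> site name

        def register_incarnation(self, site_name, vsc_sid):
            # Called when the server at site_name (re)starts and obtains a
            # fresh identifier from the VSC run-time support.
            old_sid = self.sid_of.get(site_name)
            if old_sid is not None:
                self.site_of.pop(old_sid, None)
            self.sid_of[site_name] = vsc_sid
            self.site_of[vsc_sid] = site_name

Because votes are keyed by site name, a crash and recovery changes only the binding, never the vote assignment.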

7.2 Changing Votes: Overview

The voting table is a permanent data structure replicated across the various sites and managed similarly to volume replicas, i.e. it is modified by means of a voting-update algorithm that can be run only within a view of UG that defines a quorum. A server that wants to modify the voting table first acquires a write lock for the corresponding volume — i.e., it locks all files in that volume with a single lock request 13 — then performs the necessary updates and finally releases the lock. The voting-update algorithm itself is very similar to the write algorithm, the major difference being that the decision on when an execution can be considered successful is different in the two cases. To illustrate some of the problems involved, let us consider the situation depicted in Fig. 9, top (step 1) — three servers p, q, r with voting table (1, 1, 1) — and let us suppose that we want to install the voting table (3, 0, 0). Assuming that the algorithm has completed successfully once the new voting table has been written at a quorum of the new table does not work (steps 2-4): there might be a total failure when the new voting table has been written only at p — which is sufficient to form a quorum in the new voting table — and upon recovery there might be two concurrent partitions composed of, respectively, p and q, r, each believing to form a quorum. The voting-update algorithm satisfies the following Propositions, which are very similar to those for the write algorithm (Section 6.2):

Proposition 7.1 If a server installs a new version of the voting table, then that server belongs to a UG that defines a quorum in both the old and the new voting table.

Proposition 7.2 If a server installs a new version of the voting table, then eventually either all servers in UG install the new version or UG disappears.

Proposition 7.3 The sequence of voting-table versions installed by any server in UG corresponds to a subsequence of the write lock acquisitions, which are totally ordered.

12. This is typical also of other VSC-like tools; for instance, in Isis [6] a process that leaves the primary partition and then re-joins it assumes a different Isis identifier.
13. For the sake of brevity, we do not show the corresponding simple modifications needed to the lock manager.

[Figure 9 shows two executions of the voting-update algorithm, each in four steps.
Top execution: (1) p (1,1,1), q (1,1,1), r (1,1,1); (2) p installs (3,0,0) while q and r still hold (1,1,1); (3) TOTAL FAILURE; (4) upon recovery, the partitions {p (3,0,0)} and {q (1,1,1), r (1,1,1)} exist concurrently and both believe to form a quorum.
Bottom execution: (1) p (1,0,0), q (1,0,0), r (1,0,0); (2) p (0,1,0), q (0,1,0), r (1,0,0); (3) TOTAL FAILURE; (4) upon recovery, the partition {q (0,1,0), r (1,0,0)} must wait for p.]
Figure 9. Possible executions in an environment where the voting table may change. The top execution shows a situation in which, after recovering from a total failure in the middle of the voting-update algorithm, there might be two concurrent partitions that both believe to form a quorum.

Note that a new voting table is installed only if it is known to be present at a quorum of both the old and the new voting table: it is straightforward to realize that this prevents the scenario of the previous example from occurring. Also note that if a server belongs to UG when the voting-update algorithm is initiated, it will still belong to UG when the algorithm has completed and the new voting table has been installed, unless meanwhile the server has crashed or become isolated: amongst other things, this implies that servers participating in the voting-update algorithm can keep on servicing read requests, much as they do while participating in an ordinary write algorithm. Furthermore, it is straightforward to realize that the changes to the algorithms presented so far involve only how to deal with servers that do not belong to UG, i.e. those whose voting tables might not be up-to-date. To cope with possible inconsistencies among replicas of the voting table, these are associated, similarly to replicas of volumes (Section 6.2), with certain additional pieces of information: (i) a state that may be either W or R; (ii) a monotonically increasing version number (vt-vn); (iii) a monotonically increasing tentative number (vt-ugid). Whereas (i) and (ii) are substantially identical

to their analogues for volume replicas, piece (iii) is slightly different from its counterpart ugid: it plays the same role — identifying the most recent replica in R state when recovering from a total failure — but is maintained differently, as will be clarified below. Unless otherwise specified, we will denote both servers and sites as S1, S2, ..., Sn, and whenever we refer to quorums, we will implicitly assume that they are computed according to the up-to-date voting table. The Propositions that are necessary for preserving the correctness of the algorithms presented so far are the following:

Proposition 7.4 If Sj starts the UG reformation algorithm in presumed view v of RMG, then v defines a quorum.

Proposition 7.5 If Sj starts the UG reformation algorithm in presumed view v of RMG, then for every Si ∈ v eventually one of the following will happen: (1) Si starts the reformation algorithm; (2) Si installs a view of RMG that does not include Sj; (3) Si crashes.

Proposition 7.6 If every failure eventually recovers and there are no further failures for a sufficiently long time, then the UG reformation algorithm will eventually be started.

Proposition 7.4 is necessary for preserving Property 5.1 (there can be no multiple concurrent views of RMG that define quorum sets). Note that it is stated in terms of “presumed view”, because between the time at which Sj is delivered the event that leads it to decide to start the reformation algorithm and the start itself, the view might have changed an arbitrary number of times. However, it is straightforward to prove that if any (real) view of RMG does not define a quorum during the execution of the reformation algorithm, the algorithm will fail. Proposition 7.5 is necessary for guaranteeing that either all servers in the view start the reformation algorithm or none does. Note that Propositions 7.4 and 7.5 do not state that every quorum is recognized as such. There might be certain failure patterns in which, after recovering, servers cannot figure out whether they really belong to a quorum or not; in these scenarios they would not start the reformation algorithm. Differently stated, allowing dynamic modification of votes implies that, in certain situations, a quorum might not be able to make progress. To guarantee liveness, we make the additional assumption embedded in Proposition 7.6, i.e. that there is a sufficiently long time in which all servers can communicate with each other. We point out again, however, that this is not always necessary: a quorum is “usually” sufficient for recovering from a total failure, except for a narrow set of scenarios in which some more servers are needed. These scenarios will be highlighted in the proof of Proposition 7.4, and an example is given in Fig. 9, bottom: three servers (p, q, r) with voting table (1, 0, 0) (step 1) that are going to install voting table (0, 1, 0). At step 2 there is a replica of the new table only at p and q, which form a quorum in both the old and the new voting table. Let us suppose that the total failure of step 3 happens when the copy at p is already in W state while that at q is still in R state. Although the partition composed, after recovering, of q and r forms a quorum and hence would be sufficient to reform UG (step 4), q and r have to wait for p to become available again: from their point of view, p might still have the old voting table and might have already reformed UG. Finally:

Proposition 7.7 If Sj joins a quorum view of RMG, then eventually one of the following will happen: (1) Sj starts the state-transfer-and-join algorithm; (2) Sj is delivered the installation of a view of RMG that does not define a quorum; (3) Sj crashes.


7.3 Changing Votes: Algorithms

The ability to change dynamically the distribution of votes among sites relies on two additional modules, voting-update and votes-exchange. Furthermore, the state transferred to a server during the state transfer and join must be enriched with the up-to-date voting table and the current value of vt-ugid, which is maintained as described below. The voting-update module is shown in Fig. 10. It is very similar to the writer/participant module (Section 6.2), except for some differences related to the fact that, if UG does not cease to exist, a write always completes successfully whereas a voting update might fail. The reason is that there might be UGs that do not define a quorum in the voting table being installed (Proposition 7.1).

Therefore, the current composition of UG is checked against the voting table being installed before starting the algorithm (line 3, left), as well as when dealing with view changes in the middle of the algorithm (lines 18-23, left, and 26-30, right). Furthermore, when a voting update fails but UG continues to exist, a server that created a voting table in R state might later be requested to create a different replica in R state for the same version number. Note that this cannot happen in the write algorithm, where after creating a replica in R state a server either switches it to W state or leaves UG. We solve this problem as follows. First, we replace the notion of ugid associated with volume replicas by a vt-ugid that is incremented not only when the composition of UG changes (line 15, right), but also upon delivery of a V-W-REQ (line 2, right). This preserves in the context of voting tables the property guaranteed by ugids for volume replicas: in a quorum of servers, the only replicas in R state that might be switched to W state are those associated with the largest version number and the largest vt-ugid. Second, we release the lock after the creation of a replica in R state has been acknowledged by all servers in UG rather than by just a quorum of the old and new voting table. Lacking this restriction, it would be difficult for a server receiving two consecutive V-W-REQ to decide whether the second message should act as an “implicit commit” for the first one or not (Section 6.2). Rather than developing a new algorithm from scratch, we preferred to adapt the write algorithm and pay the consequent price in terms of increased latency of voting updates. The votes-exchange module is necessary to deal with servers not in UG and is shown in Fig. 11. If UG exists within the current view of RMG, the state transfer and join algorithm is started (lines 2 and 10). Otherwise, a comparison algorithm is started, that is, each server multicasts its own voting table within RMG by means of a VOTE-TABLE message (lines 7-9). When a server is delivered a VOTE-TABLE, it places the message in its own vrep-set (lines 11-13). Whenever a server has been delivered a VOTE-TABLE from every server in its view of RMG, it compares the corresponding voting tables and takes a decision (lines 18-19). The decision algorithm is given in Fig. 12. If the outcome is “this view is really a quorum” (i.e. vt ≠ NULL at line 20), then vt is the up-to-date voting table (lines 20-22). Otherwise, the decision is “this view is not, or might not be, a quorum” (i.e. vt = NULL at line 20), in which case the server takes no further steps. View changes are treated as follows: if the composition of RMG has only shrunk since the beginning of the comparison algorithm (vreq-set = RMG at line 4), then the algorithm continues (lines 3-6); otherwise, the comparison algorithm is restarted from scratch (lines 7-9). Note that the algorithm is fully distributed, i.e. all servers play the same role.
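To make the quorum tests of Fig. 10 concrete, here is a small sketch in our own notation rather than the paper’s pseudocode: defines_quorum() counts votes over a view, check_votes() is the test required before and during a voting update (a quorum in both the old and the new table), and VotingTableReplica mirrors the metadata (state, vt-vn, vt-ugid) introduced above.

    # Illustrative sketch of the quorum tests used by the voting-update module.
    def defines_quorum(view, voting_table):
        """view: set of site names; voting_table: dict site name -> votes.
        A set of sites defines a quorum when it holds a strict majority of all votes."""
        held = sum(voting_table.get(site, 0) for site in view)   # unlisted sites hold zero votes
        return 2 * held > sum(voting_table.values())

    def check_votes(ug_view, old_table, new_table):
        """TRUE iff the current view of UG is a quorum in both tables (cf. Fig. 10)."""
        return defines_quorum(ug_view, old_table) and defines_quorum(ug_view, new_table)

    class VotingTableReplica:
        """Per-replica metadata for the voting table: (i) state, (ii) vt-vn, (iii) vt-ugid."""
        def __init__(self, table, state="W", vt_vn=0, vt_ugid=0):
            self.table = dict(table)     # site name -> votes
            self.state = state           # "R" (tentative) or "W" (installed)
            self.vt_vn = vt_vn           # monotonically increasing version number
            self.vt_ugid = vt_ugid       # bumped on UG view changes and on each V-W-REQ

For the example of Fig. 9, top, check_votes({'p','q'}, {'p':1,'q':1,'r':1}, {'p':3,'q':0,'r':0}) holds, whereas check_votes({'p'}, {'p':1,'q':1,'r':1}, {'p':3,'q':0,'r':0}) does not, because p alone is not a quorum of the old table.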

7.4 Changing Composition of RMS

We deal with the addition/removal of sites to the replica management system by relying on the ability to dynamically change votes described in Section 7.3. We assume that voting tables are managed so as to satisfy the following properties: (1) a site not included in the voting table is implicitly associated with zero votes; and (2) entries associated with zero votes in both the i-th and the (i + 1)-th version of the voting table may be removed at the time the (i + 1)-th version is installed, i.e. switched to W state. As will become clearer in the following, property 2 is necessary to prevent problems with the garbage collection of voting table entries.

Writer (left):

var active: boolean initial FALSE;
    vt: voting-table initial NULL;
end;

1  upon(delivery(SET-VOTES, msg))
2    vt := msg.vt;
3    if(not check-votes(vt)) then
4      return FAILED to invoker;
5    active := TRUE;
6    acquire lock();
7    vtvn := get-vtvn();
8    wreq := ⟨V-W-REQ, vtvn+1, vt⟩;
9    v-cast(wreq);
10   wait-for(delivery of V-W-ACK from all servers in UG);
11   return SUCCESS to invoker;
12   v-cast(V-W-COMMIT);
13   active := FALSE;
14   release lock();
15   vt := NULL;
16 upon(view-change in UG)
17   if(active) then
18     if(not check-votes(vt)) then
19       return FAILED to invoker;
20       abort wait-for at line 10;
21       active := FALSE;
22       release lock();
23       vt := NULL;

Participant (right):

var v-writer: sid initial NULL;
    vt-ugid: integer;
    vt: voting-table initial NULL;
end;

1  upon(delivery of (V-W-REQ, wreq))
2    vt-ugid := vt-ugid + 1;
3    v-writer := wreq.sender;
4    vt := wreq.data;
5    if(disk write(vt, “R”, vt-ugid, wreq.vn) = FAIL)
6      then abort();
7    send(v-writer, V-W-ACK);
8  upon(delivery of (V-W-COMMIT, wcom))
9    if(disk write(“W”) = FAIL)
10     then abort();
11   v-writer := NULL;
12   vt := NULL;
13 upon(view-change in UG)
14   if(UG is a quorum view) then
15     vt-ugid := vt-ugid + 1;
16     if(v-writer ≠ NULL and v-writer ∉ UG) then
17       v-writer := new-writer(UG);
18       send(v-writer, V-W-SYNCACK);
19       if(mysid = v-writer) then
20         wait-for(delivery of V-W-SYNCACK from all servers in UG);
21         v-cast(V-W-COMMIT);
22         release lock();
23         v-writer := NULL;
24         vt := NULL;
25     elseif(v-writer = mysid) then
26       if(not check-votes(vt)) then
27         release lock();
28         v-writer := NULL;
29         vt := NULL;
30         abort wait-for at line 20;
Figure 10. The algorithm for updating voting tables: writer (left), participant (right). The boolean function check-votes() returns TRUE iff the current view of UG defines a quorum in both the old and the new voting table. Functions acquire lock() and release lock() get a lock for the whole volume, as explained in the text.


var vreq-set: set of sid initial ∅;
    vrep-set: set of message initial ∅;
    vt: voting-table initial NULL;
end;

1  upon(delivery(view-change(RMG)))
2    if(members(UG) = ∅) then
3      vreq-set := vreq-set ∩ RMG;
4      if(vreq-set = RMG) then
5        remove from vrep-set messages from servers not in RMG;
6        v-try-again();
7      else vreq-set := RMG;
8        msg := ⟨VOTE-TABLE, current-voting-table⟩;
9        v-cast(msg);
10   else state transfer and join();
11 upon(delivery(VOTE-TABLE, msg))
12   remove from vrep-set message from msg.sender (if any);
13   insert msg into vrep-set;
14   v-try-again();
15 upon(delivery(UG-BUILT))
16   state transfer and join();

17 procedure v-try-again()
18   if(senders in vrep-set = RMG) then
19     vt := compare-vt(vrep-set);
20     if(vt ≠ NULL) then
21       start using msg.vt as voting-table, but do not install it in permanent storage;
22       activate create-UG module and invoke uga();
Figure 11. The algorithm for exchanging voting tables between servers not in UG. Function compare-vt() is shown separately.

1  function compare-vt(set of message: rep-set): voting-table
2    vt-rcvd := voting tables in rep-set;
3    vn := largest-version-number(vt-rcvd);
4    svt := set of elements in vt-rcvd with vn;
5    candidate-vt := element in svt in W state;
6    if(candidate-vt ≠ NULL) then
7      if(RMG is a quorum view for candidate-vt) then
8        return candidate-vt;
9      else return NULL;
10   else old-vt := element in vt-rcvd with vn − 1 and in W state;
11     if(old-vt = NULL or RMG is not a quorum view for old-vt) then
12       return NULL;
13     else candidate-vt := element in svt with largest vt-ugid;
14       if(RMG is a quorum view for candidate-vt) then
15         return candidate-vt;
16       else return NULL;
Figure 12. The algorithm for comparing voting tables between servers not in UG.


Addition of a site Sk to the replica management system is a process that, logically, involves the following steps:

1. Associating Sk with zero votes.
2. When Sk belongs to a quorum view of RMG, executing the state-transfer-and-join algorithm.
3. When Sk belongs to UG, (possibly) modifying the distribution of votes.

Note that the only administrative burden is starting a proper server on the site being added to the replica management system. The actual transfer of a replica of the volume and of the voting table will be done automatically during the state-transfer-and-join algorithm. It is guaranteed that, although Sk starts without any voting table, step 2 will eventually complete, i.e. Sk will eventually realize that it belongs to a quorum (Proposition 7.7). Removal of a site from the replica management system is an issue that might raise certain problems when reforming UG. To illustrate them, let us go back to the example of Fig. 9, bottom: let us suppose that, upon recovering, p decides to leave the replica management system. From p’s point of view this seems harmless, because it sees a voting table in W state in which it has zero votes, so its availability seems not to be necessary to the rest of the system. On the other hand, as already pointed out, q and r are waiting to contact p because they cannot figure out which is the up-to-date voting table: from their point of view, p might have reformed UG by using the old voting table. Therefore, had p left the system, UG would never be created again.

These considerations illustrate the core of the problem: p left because its availability was not necessary under the new voting table, but one must make sure that it is not necessary under the old table either. Differently stated, if p had zero votes also in the old voting table, then q and r would not be waiting for it. Accordingly, removal of a site Sk is done as follows: as soon as Sk is delivered the installation of a voting table such that it is associated with zero votes in both the old and the new voting table, 14 Sk may leave the replica management system.
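As a rough illustration (the function names are ours), both addition and removal can be phrased as vote changes; the resulting table is then installed through a normal voting update:

    # Illustrative sketch: site addition/removal expressed through vote changes.
    def table_with_new_site(current_table, new_site):
        # Step 1 of the addition process: the new site appears with zero votes;
        # steps 2 and 3 (join, vote redistribution) follow later.
        table = dict(current_table)
        table[new_site] = 0
        return table

    def may_leave(site, old_installed_table, new_installed_table):
        # Removal rule stated above: a site may leave as soon as it is delivered
        # the installation of a table in which it holds zero votes in both the
        # old and the new voting table.
        return (old_installed_table.get(site, 0) == 0 and
                new_installed_table.get(site, 0) == 0)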

8 Structure of the Run-Time Support

8.1 Overview

As we said in Section 4, our algorithms have been presented with a notion of “generalized VSC” (GVSC), which may simplify the reasoning at the application level (Section 6.1). In this section we describe in more detail the overall software architecture of the run-time support. Then, in Section 8.2, we will describe an algorithm that supports the GVSC notion and that can be implemented without modifying the VSC support layer. The overall structuring is shown in Fig. 13. It can be seen that the run-time support is split into three layers — VSC, linear membership, GVSC — and that the application may access either GVSC or VSC (more details about this point will be given below). The VSC layer may or may not be part of the process’ address space, depending on the implementation. Conversely, the other two layers must be part of the process’ address space, and for this reason we will denote them as the per-process run-time support (PPRTS). As soon as VSC delivers events to the process, these are queued in the process’ address space. 15 Then, when the process is ready to process the next event in the queue, PPRTS performs some processing described below, which may involve discarding the event, and then passes control to the application. Similarly, when the process wants to pass events to the VSC layer, these events may be processed by PPRTS and then passed to VSC. The only job of the linear membership layer is “filtering” group views exported by VSC as follows: if the view defines a quorum set then it is left unchanged, otherwise it is replaced by the

14. Recall that a site associated with zero votes is equivalent to a site not included in the voting table.
15. Note that this scenario is a realistic one, since the VSC layer cannot synchronize itself with the application.


[Figure 13 depicts the layering, from top to bottom: Application, GVSC, Linear Membership, VSC.]

Figure 13. Logical organization of the run-time support

empty set. Notice that this layer is controlled by the application, because it knows about votes and quorums: since VSC provides only a strong-partial membership service, the job of building a linear service must be left to the application itself. On top of linear membership there is the GVSC layer, which is responsible for implementing the GVSC abstraction. Its behaviour will be discussed in detail in Section 8.2, where it will also be shown that it is not application-dependent. Our algorithms make use of the run-time support being discussed as follows. A process is composed, basically, of all the modules of pseudocode discussed so far. 16 All these modules lie on top of GVSC except for the votes-exchange module (Section 7.3), which lies directly on top of VSC. The reason why such a module needs special treatment is that it is the one used for building the linear membership layer (i.e. for exchanging voting tables between servers not in UG); therefore it cannot run on top of the linear membership layer itself 17. Notice that an application that does not need the ability to change votes dynamically may be composed of processes that lie completely on top of GVSC.
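Returning to the linear membership layer, its view filtering can be sketched as follows (an illustration under our own naming; the vote-counting test is the one sketched in Section 7.3):

    # Illustrative sketch: the linear membership layer either passes a view
    # through unchanged (quorum) or replaces it with the empty view.
    def linear_view(vsc_view, voting_table):
        held = sum(voting_table.get(site, 0) for site in vsc_view)
        if 2 * held > sum(voting_table.values()):
            return set(vsc_view)          # quorum view: exported as-is
        return set()                      # non-quorum view: exported as the empty set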

8.2 Supporting Generalized VSC (GVSC)

In this section we will discuss the internals of the GVSC layer. To simplify the notation, it is useful to recall some facts. The VSC layer exports to each server S a view v_G(S) of each group G. At any given server S, views of a group can be totally ordered — v_G^i(S) ≼ v_G^{i+1}(S) ≼ ... — and there is reciprocity on view composition — v_G^i(S_a) = v_G^j(S_b), ∀S_b ∈ v_G^i(S_a). In the following we will omit the specification of the group G and/or the server S whenever ambiguities cannot arise. The GVSC layer maintains a view vector having as many integer elements as processes 18. Let VV_i^k denote the view vector at process p_i associated with view v^k. Let VV[j] denote the component of VV associated with process p_j. The view vector associated with the first view exported to the process 19 has all its elements set to zero. The actions undertaken by the GVSC

16. In practice there will be additional modules not shown here, such as those for performing reads, for interfacing with clients, etc.
17. For instance, a process that belongs to a quorum without being aware of it (because its voting table is not up-to-date) would be exported an empty view of the group by the linear membership layer, and therefore it would be unable to communicate.
18. There will be a separate view vector for each group.
19. “First view” means “since the last recovery of the process”. More clearly, we do not require that view vectors be maintained in permanent storage.


layer at process p_i are the following: 20

1. Upon handling the delivery of view v^{k+1}, the view vector VV_i^{k+1} is obtained from VV_i^k as follows:

     VV_i^{k+1}[j] = VV_i^k[j] + 1   if p_j ∈ v^{k+1}(p_i) and p_j ∉ v^k(p_i)
     VV_i^{k+1}[j] = VV_i^k[j]       otherwise

The updating above is done before multicasting or delivering any message. Informally, the generic element VV_i^k[j] of the view vector counts how many times p_i and p_j have joined.

2. When the application layer at p_i sends a message m in presumed view v^k, the GVSC layer attaches the view vector of the presumed view, VV_i^k, to m and then passes m to the VSC layer. Let VV^m denote the view vector carried by message m.

3. When the GVSC layer at p_j is handling, in view v^k, the delivery of a message m sent by p_i (it might be i = j), it forwards it to the application iff VV_j^k[i] = VV^m[j]; otherwise it drops the message.

As proved in the Appendix, the following result holds:

Proposition 8.1 Actions (1)-(3) implement GVSC according to Definition 4.1.
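The three actions can be summarized by the following sketch (an illustration of ours, not the RELACS implementation; the dictionary-based bookkeeping and the method names are assumptions of the sketch):

    # Illustrative sketch of the GVSC layer's view-vector bookkeeping (actions 1-3).
    class GVSCLayer:
        def __init__(self):
            self.vv = {}                   # view vector: one join counter per process
            self.view = set()              # composition of the current view

        def on_view_change(self, new_view):
            # Action 1: increment the counter of every process joining the view.
            for p in new_view:
                if p not in self.view:
                    self.vv[p] = self.vv.get(p, 0) + 1
            self.view = set(new_view)

        def tag_message(self, payload):
            # Action 2: attach a copy of the presumed view's vector to the message.
            return {"payload": payload, "vv": dict(self.vv)}

        def on_delivery(self, msg, sender, myself):
            # Action 3: forward iff my counter for the sender equals the sender's
            # counter for me; otherwise one of us has rejoined in the meantime, so drop.
            if self.vv.get(sender, 0) == msg["vv"].get(myself, 0):
                return msg["payload"]
            return None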

9 Considerations about the Membership Service

Our algorithms rely on a VSC layer providing a strong-partial membership service, i.e. one in which the intersection between any two concurrent views is guaranteed to be the empty set. Differently stated, when there are concurrent views, every process belongs to at most one of them. Concurrent views represent situations of network partitions: informally, two views are said to be concurrent if neither precedes the other. Strong-partial semantics is a cornerstone of our system, since it guarantees Property 5.1: there can be no multiple concurrent views of RMG that define quorum sets. A weak-partial membership service — one in which concurrent views may overlap in arbitrary ways — would not be useful. It may be helpful, however, to discuss certain aspects of this semantics in order to understand its practical implications. Consider a system composed of two processes p, q and a view v_a including both. Suppose that p believes q faulty, which leads p to install a view v_b that excludes q. The fact that the semantics of the membership service is strong-partial means that eventually q will install a view v_c that excludes p: v_b and v_c are concurrent, thus their intersection is the empty set. Since the system is asynchronous, however, there may be an arbitrarily long interval of time between p’s installation of v_b and q’s installation of v_c: during that interval, the intersection between the view at p and the view at q will not be the empty set.

This effect has an important practical consequence. Suppose p is given the majority of votes and that the views above refer to UG; then, p would have installed a UG that excludes q, which in turn would still believe itself to be in UG for an arbitrarily long time. As a consequence, during this time q would keep on servicing read requests, since it would believe that it holds an up-to-date replica. Note that q would not be able to perform writes, since these involve communicating with p, which would not be delivered messages from q. 1-SR, thus, is not affected.

If this effect were strongly undesirable for applications, a server might perform reads by contacting a quorum of servers in UG rather than by servicing them locally on the sole basis of its belonging to UG. Roughly speaking, contacting a quorum in UG would be a sort of “membership validation”.
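Such a validated read could look roughly as follows (our sketch; quorum_ack is a placeholder for whatever round-trip gathers acknowledgements from a quorum of UG):

    # Illustrative sketch: re-validate membership before serving a read locally.
    def validated_read(local_replica, ug_view, voting_table, quorum_ack):
        held = sum(voting_table.get(site, 0) for site in ug_view)
        if 2 * held > sum(voting_table.values()) and quorum_ack(ug_view):
            return local_replica.read()    # membership just re-confirmed by a quorum
        raise RuntimeError("not currently in a validated quorum view of UG")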

VV

UBLCS-94-16

V V [J]

VV

associated with process

pj ” rather than “the j -th component of 26

The analysis above also shows that, actually, our algorithms work even with a relaxed semantics of the membership service. Consider a flavour of weak-partial semantics in which, informally, two concurrent views that are the successors of a common view may overlap only provided one is a proper subset of the other [4]. In this semantics there may be concurrent views where a process q believes process p to be reachable, whereas p believes q unreachable. It is straightforward to realize that, from the point of view of our algorithms, this scenario is identical to the one outlined above. Finally, it is worthwhile discussing the way we select the primary partition. As already seen, we make use of a linear membership service that lies on top of the strong-partial membership service provided by VSC. A different flavour of VSC, called PP-VSC [25], has been proposed in the literature; it implements view-synchronous communication in an environment where it is the membership service itself that defines a primary partition. That is, PP-VSC prevents concurrent views from occurring and exports to applications a unique, totally ordered sequence of views [20]. Whereas VSC is easier than consensus [24], it has been proved that PP-VSC is not [25]. Although our run-time support at first glance looks similar to PP-VSC — they both provide view-synchronous communication and primary partitions — there are in fact substantial differences between the two. Apart from the fact that we give applications control over vote distribution, thus allowing them to control how the primary partition is selected, the crucial point is that the impossibility results of [25] do not apply to our scenario. The reason is that our run-time support solves a problem that is in fact different from PP-VSC. A requirement of PP-VSC is that, informally, once an instance of the problem has been solved — that is, amongst other things, a new view has been scheduled — any two processes that have solved the problem must agree on the composition of the new view. We do not satisfy this requirement, because two processes belonging to concurrent views obviously do not agree on view composition. The basic reason why we can always make progress, even in scenarios where PP-VSC cannot, is that a process might be delivered a view that includes only itself: roughly speaking, this allows the VSC run-time support to always terminate any agreement round. Of course we do not solve the unsolvable: although a process is never blocked by a run-time support unable to schedule a primary partition, it may still get blocked, but at the application level. In a sense, we have simplified the lowest layers of the system on the basis of end-to-end considerations [22].

10 Related Work

There have been other studies of achieving one-copy serializability using the notion of “views”. Even though the context for these previous studies has been database management, a comparison with our work is warranted. In [2] view changes are managed by the replica control protocol and are automatically triggered upon changes in the processes’ ability to communicate. A view change while a transaction is running causes it to abort, even when the change is due to the joining of a process. Therefore, read and write operations are required to either start and complete within the same view or abort. This drawback can be overcome by restricting the concurrency control protocol. In [3], a read or write operation is still bound to a single view, but it is the application itself that decides when a view change occurs. Unlike our approach based on VSC, their approach has no need to maintain agreement on the composition of the view — each site can independently decide which sites to include in its current view. This decision may be based on several factors, among them the ability to communicate. The consequences of the decision bear only on availability and performance, and do not affect correctness. Upon every view change, the replica control protocol invokes a system “update” transaction that has to contact all replicas of the data in the


view and serves to make all the replicas in the view up-to-date. In our work, view changes are handled by the VSC support layer, so applications have no control over them. Since we control when a server can join the up-to-date group, a joining server never causes an operation to abort. We guarantee that servers become up-to-date at join time, so contacting all available replicas is necessary only in the case of a total failure. Other works close to ours are based on the notion of virtually-synchronous communication 21. Isis [6] is a toolkit for programming reliable distributed applications that has been developed at Cornell University and that first proposed and implemented the notion of virtually-synchronous communication [15, 7]. The major differences between the Isis framework and ours are consequences of the slightly different VSC model we have used (Section 3): 1) we have a different multicast semantics in the presence of failures; 2) we do not assume any order other than FIFO on multicasts; 3) we rely on a membership service that tolerates partitions. We have never used point 1, since when a server leaves a view we do not make any assumption about its state. Concerning point 3, notice that although we allow only one partition at a time to make progress, the choice of which partition that is cannot be delegated to the membership service, since it depends on the actual distribution of votes among the servers. Therefore, a membership service that allows concurrent views to exist is essential in our framework. RNFS [14] is a network file service that is tolerant to fail-stop failures and can be run on top of NFS [26], a standard network file service. A goal of the project was to make fault tolerance transparent to client machines. The authors used the state machine approach [11] as a design methodology and Isis broadcast protocols and toolkit routines [15] in building it. NFS servers are replicated for fault tolerance. Moreover, the semantics of the server is deliberately exploited to reduce the complexity of the broadcast protocol. In particular, the designers make use of idempotency and the synchrony of updates with respect to failures, which are typical features of NFS, to deal with failures. RNFS does not address partitions at all, probably because at that time Isis did not provide any support for them. Indeed, the system seems conceived for a local-area-network environment. Another file system developed on top of Isis is Deceit [1]. Deceit provides a superset of NFS functionality, including replication, file migration and version control. Like RNFS, Deceit cannot tolerate long-term network partitions because of Isis’ features. Deceit attempts to support a wide range of semantics and performance by allowing users to set a number of parameters per file, which regulate the likelihood of inconsistency over multiple partitions and the degree of consistency within a single partition. However, according to the authors, the final implementation came out too complex. Our design, in a sense, made some explicit choices about the degrees of freedom that Deceit leaves to its users.

11 Conclusions

While one-copy serializability may be considered an unacceptably heavy-weight consistency criterion for a large-scale system, it is worthwhile pointing out that the proposed algorithms are able to deal with typical file access patterns in an efficient manner. First, note that the proposed scheme favors reads over writes, in that membership in UG is sufficient for local servicing of reads since the local replica is up-to-date. Data sharing in large-scale systems will typically involve files that are mainly read, so this choice seems appropriate. Our scheme is also able to deal with files that are “owned” by a “home” site that is willing to share them with other sites provided it is guaranteed continued access to them. Most updates to these files originate at the home site, which would like to maximize their availability. This type of sharing can be handled by allocating votes for the file such that the home site alone is sufficient to define a quorum. If we can also guarantee that the lock manager for the file will be at its home site, then all clients at the home site experience local-cost latency in accessing the file.

21. Recall that we prefer to use the term view-synchronous communication instead (Section 1).
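Returning to the home-site allocation just described, a quick worked example (the numbers are our own): giving the home site three votes and each of two remote replicas one vote makes the home site alone a quorum.

    # Worked example with assumed numbers: the home site alone defines a quorum.
    votes = {"home": 3, "remote1": 1, "remote2": 1}
    total = sum(votes.values())                              # 5 votes overall
    print(2 * votes["home"] > total)                         # True: {home} is a quorum
    print(2 * (votes["remote1"] + votes["remote2"]) > total) # False: the remotes alone are not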


View-synchronous communication has proven to be an appropriate paradigm for the replicated file management problem in a large-scale distributed system. It leads to conceptually simple and efficient algorithms for maintaining one-copy serializability in a completely asynchronous system in the presence of complex failure patterns. A major role in this simplicity is played by the clean failure model exported to the programmer. We believe that the framework of this paper could be used to achieve other flavors of consistency weaker than one-copy serializability with equal simplicity. In this paper we have concentrated on the algorithmic issues of consistency for data sharing in a large-scale distributed system. A communication infrastructure called RELACS exporting the VSC abstraction has been implemented on top of a common operating system (Unix) and a prototype is currently functional [4]. The performance of the proposed solutions has to be verified in practice on top of this system. There are many opportunities for further improvements in access latency through simple optimizations and heuristics. We are currently exploring these issues in the context of a replicated file system under development.

Acknowledgments

We are grateful to Sape Mullender (University of Twente), Marco Avvenuti and Luigi Rizzo (University of Pisa) for their contributions to this work in the early stages of the replicated file system design. This work has been supported in part by the Commission of European Communities under ESPRIT Programme Basic Research Project 6360 (BROADCAST).

References

[1] A. Siegel, K.P. Birman, and K. Marzullo. Deceit: a flexible distributed file system. Technical Report TR89-1042, Department of Computer Science, Cornell University, Ithaca, NY, November 1989.
[2] A. El Abbadi, D. Skeen, and F. Cristian. An efficient, fault-tolerant protocol for replicated data management. In Proceedings 4th SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 215–229. ACM, 1985.
[3] A. El Abbadi and S. Toueg. Maintaining availability in partitioned replicated databases. ACM Transactions on Database Systems, 14(2):264–290, June 1989.
[4] Ö. Babaoğlu, M.G. Baker, R. Davoli, and L.A. Giachini. RELACS: A communications infrastructure for constructing reliable applications in large-scale distributed systems. Technical Report UBLCS-94-15, Laboratory for Computer Science, University of Bologna, Italy, June 1994.
[5] P.N. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys, 13(2):185–221, June 1981.
[6] K. Birman, R. Cooper, T. Joseph, K. Marzullo, M. Makpangou, K. Kane, F. Schmuck, and M. Wood. The ISIS System Manual, Version 2.1. Department of Computer Science, Cornell University, September 1993.
[7] K.P. Birman. The process group approach to reliable distributed computing. Communications of the ACM, 36(12):36–53, December 1993.
[8] C.H. Papadimitriou. The serializability of concurrent database updates. Journal of the ACM, 26(4):631–653, October 1979.
[9] T.D. Chandra and S. Toueg. Unreliable failure detectors for asynchronous systems. In Proceedings 10th ACM Symposium on Principles of Distributed Computing, pages 325–340. ACM, August 1991.
[10] S.B. Davidson, H. Garcia-Molina, and D. Skeen. Consistency in partitioned networks. ACM Computing Surveys, 17(3):341–370, September 1985.
[11] F.B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys, 22(4):299–319, 1990.
[12] M.J. Fischer, N.A. Lynch, and M.S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.


[13] V. Hadzilacos. On the relationship between the atomic commitment and consensus problems. In B. Simons and A.Z. Spector, editors, Fault-Tolerant Distributed Computing, pages 201–208. Springer-Verlag, Lecture Notes in Computer Science 448, 1990.
[14] K. Marzullo and F. Schmuck. Supplying high availability with a standard network file system. In Proceedings of the 8th International Conference on Distributed Computing Systems, pages 13–17, San Jose, CA, June 1988. IEEE.
[15] K.P. Birman and T.A. Joseph. Exploiting virtual synchrony in distributed systems. In 11th ACM Symposium on Operating Systems Principles, pages 123–138. ACM, 1987.
[16] L. Lamport. The part-time parliament. Technical Report 49, DEC SRC, Palo Alto, 1989.
[17] T. Mann, A. Hisgen, and G. Swart. An algorithm for data replication. Technical Report 46, DEC SRC, Palo Alto, 1989.
[18] R.M. Needham. Names. In S. Mullender, editor, Distributed Systems, pages 315–327. ACM Press, second edition, 1993.
[19] T.W. Page Jr., R.G. Guy, G.J. Popek, and J.S. Heidemann. Architecture of the Ficus scalable replicated file system. Technical report, University of California, Los Angeles, 1991.
[20] A. Ricciardi and K. Birman. Using process groups to implement failure detection in asynchronous environments. In Proceedings ACM Symposium on Principles of Distributed Computing, pages 341–352, August 1991.
[21] J.H. Saltzer. On the naming and binding of network destinations. In Local Computer Networks, pages 311–317. North-Holland Publishing Company, 1982. Also available as RFC 1498.
[22] J.H. Saltzer, D.P. Reed, and D.D. Clark. End-to-end arguments in system design. ACM Transactions on Computer Systems, 2:277–278, November 1984.
[23] M. Satyanarayanan, J.J. Kistler, P. Kumar, M.E. Okasaki, E.H. Siegel, and D.C. Steere. Coda: A highly available file system for a distributed workstation environment. IEEE Transactions on Computers, 39(4):447–459, 1990.
[24] A. Schiper and A. Ricciardi. Virtually-synchronous communication based on a weak failure suspector. In Proceedings 23rd International Symposium on Fault-Tolerant Computing Systems, pages 534–543. IEEE, June 1993.
[25] André Schiper and Alain Sandoz. Primary partition virtually synchronous communication harder than consensus. In Distributed Algorithms, Lecture Notes in Computer Science 857, pages 39–52. Springer-Verlag, 1994.
[26] Sun Microsystems, Inc. NFS: Network file system protocol specification. Technical Report RFC 1094, Network Information Center, SRI International, March 1989.


A Proofs

In this Appendix we give the proofs of the various Propositions previously stated in the paper. We will use the term “quorum view” to denote views of a group that contain a quorum set and will assume that, unless otherwise stated, a “view” is a “quorum view”. Recall that when a server is delivered a view of a group that is no longer a quorum view, it will invoke group-leave() to leave that group, even if it is delivered a quorum view again in the meanwhile. That said, we will mean, for convenience of presentation and unless otherwise stated, that a “member of a group” is a member of a quorum view that is not about to leave that group. Furthermore, we will adopt notational simplifications for views and groups similar to those in Section 4, i.e. we will omit the specification of the group and/or the server at which the view is exported whenever ambiguities cannot arise. Finally, we will try to highlight explicitly those points in which GVSC semantics has been especially useful for simplifying both the reasoning and the structure of the pseudocode.

A.1 Write Lock Management

Proposition 6.1. There can be no more than one LM that exists concurrently.

Proof: By definition, the LM must belong to UG. Given Property 5.1, we have to prove only that in any UG exactly one LM will eventually be elected. Let S1 be the server that creates UG; let c(S1) be the cut corresponding to the first installation of a view of UG that does not include S1 22; let, finally, Si be any server joining UG before c(S1).

Let us denote by SLM(S) the server believed by S ∈ UG to be the current LM. During the UG creation protocol S1 sets SLM(S1) = S1 (Section 6.3). During the state-transfer protocol Si sets SLM(Si) = S1 (Section 6.3). Si performs a new election iff a new view is installed that does not include SLM(Si) (lines 19-27, Fig. 2; lines 12-27, Fig. 3). The election is carried out by applying a deterministic function to the composition of the current view (lines 20 and 13, respectively). Recall that VSC guarantees agreement on the composition of the view; if S1 leaves UG then: (1) any server Si ∈ UG eventually will perform an election; and (2) the outcome will be the same, i.e. SLM(Sj) = SLM(Sk), ∀Sj, Sk ∈ UG. The reasoning can now be iterated. □

Proposition 6.2. If there are no failures, eventually there will be exactly one LM.

Proof: This follows directly from Property 5.1, Proposition 6.1, and the definition of concurrent views. Note that there may be certain (endless) sequences of failures under which the election of the LM is never completed. This actually happens if the LM fails, then the server being elected fails as well before completing the election, and so on. □

Proposition 6.3. There can be no more than one lock holder that exists concurrently.

Proof: Let SLH ∈ UG be a server holding a lock. From the proof of Proposition 6.1 it follows that in any UG there is either one LM or none, and that, eventually, there will be one. It suffices then to prove the following points:

1. A necessary condition for LM to allocate a new lock is that either SLH has left the quorum view of UG and its lock has not been inherited by any other server S′LH ∈ UG; or SLH has explicitly released the lock.
2. Once LM has left the quorum view of UG, messages it (presumably) sent before then and not yet delivered do not have any effect.
3. Once SLH has left the quorum view of UG, its lock is not valid any more.

Concerning point 1, new locks are allocated within the procedure distribute lock() (Fig. 2, line 29). A necessary condition is that LockHolder is not NULL (line 30). LockHolder may be set to NULL only at lines 5, 16 and 26. In the first case the lock has been explicitly released, whereas in the other two SLH has

22. The case in which S1 never leaves UG is trivial.

left the quorum view of UG and its lock has not been inherited by any other server (see also the proof of proposition 6.6).

Concerning point 2, let LM = SLM ; let vim be the (presumed) view of UG in which SLM sends a message m tagged L-GRANTED and let vim+k be the view in which the sending actually occurs (k 0). We have to prove that if j (0;k) such that vim+j is not a quorum view, then such m does not have any effect. This is straightforwardly guaranteed by GVSC: The collapsing of vim with respect to m is the empty set, therefore m is not delivered to any server.



9 2

Finally, concerning point 3, there are the three cases below to consider. receives the L-GRANTED message and leaves the quorum view of UG before sending any message; S knows that it is the lock holder, therefore it can perform internal clean-up actions (procedure leave(), line 28 fig. 3). 2. SLH receives the L-GRANTED message, sends some messages and then leaves the quorum view of UG. Let vim be the (presumed) view of UG in which SLH sends message m while holding the lock and let vim+k be the view in which the sending actually occurs (k 0). We must obviously guarantee that m does not have any effect, because when SLH left the quorum view of UG the lock has been broken (fig. 2, lines 10-18). We achieve this in a straightforward way via GVSC: since the sender of the message left the quorum view of UG at least once, the collapsing of vi with respect to m is the empty set, therefore the message is not delivered to any process. 3. SLH leaves the quorum view of UG before receiving the L-GRANTED message; Let vim be the (presumed) view of UG in which LM sends the L-GRANTED message being considered to SLH and let vim+k be the view in which the sending actually occurs (k 0). Reasonings similar to the case above can be applied to conclude that SLH is not delivered such a message.

1.

SLH





2 Proposition 6.4. Every lock request is eventually granted, provided the UG in which the request was made does not disappear and the requesting server does not leave UG before the lock is granted. Proof: That UG can disappear before a given lock request be granted is obvious, so in the following we will assume that UG does not disappear. Let S UG be the server acting as LM. Lock requests delivered to S when a lock is already allocated are queued and served on a FIFO basis (lines 2 and 31, fig. 2). The only possible troubles are thus when S leaves the quorum view of UG and is taken over by another server S 0 . S 0 reconstructs the queue of the pending requests and the identity of the LockHolder, which may possibly be NULL (lines 22-26). To prove that all servers in UG indeed send L-STATE messages (lines 12-27, fig. 3), it suffices to recall the agreement on view composition and that the exchange of I-ENQ, I-RES messages involves different modules in the same server, which cannot thus fail independently.

2

The internal order of the queue reconstructed by S 0 may be different from that of S . One has then to make sure that every lock request will eventually reach the head of that queue. Assuming that clocks are monotonically increasing and that clock drifts are bounded, it suffices to note that: (1) each lock request is timestamped with the local clock value of the requesting server (line 2, fig. 3); (2) the order of the queue reconstructed by S 0 is based on the timestamps attached to each request (line 26, fig. 2). 2

In the proof above we have made implicit use of GVSC: servers sending L-STATE do not need to worry about whether the new lock manager is still there or whether it left and re-entered the quorum view of UG in between the presumed and real view. Similarly, when a server is taking over the lock manager, it does not have to worry about whether L-STATE messages are really sent to it or rather they were sent to its (possible) previous incarnation(s) as lock manager. Proposition 6.5. While the current lock holder belongs to UG, the lock is not broken (forcefully taken away). Proof: Breaking the lock means changing the value of LockHolder when its current value is not NULL, without waiting for the delivery of an L-REL message. This may happen only at the lines 16 and 26 of fig. 2. In both cases, necessary condition for changing the value of LockHolder is that the current lock holder has left the quorum view of UG. 2



Lemma A.1 For any message m multicast in vi while there is a write in progress, if any server S 2 vi delivers m, eventually either all servers in vi will deliver m or a further view change will take place and all servers in vi+1 will have delivered m. Proof: Obvious consequence of proposition 6.12 along with VSC definition (Section 4). Note that delivered, if delivered at all, in vi . 2

m is

Lemma A.2 If a server S 2 UG installs version n of the file, then it does so while handling the delivery of a message m multicast in UG. Furthermore, either m is a W-COMMIT for version n, or a W-REQ for version n + 1. Proof: Let n be the new version of the file. Hence it turns out that version n may be installed only upon delivery of either a message m1 carrying a W-COMMIT for version n (fig. 4, line 13), or a message m2 carrying a W-REQ for version n + 1 (fig. 4, line 3). Note that a server could be delivered m2 before than m1 , because we are assuming only FIFO ordering among multicasts. 2

The three following lemmas will be used for proving Proposition 6.6. They avail themselves of the following definitions: Let UG(i) and UG(i+1) be two consecutive quorum views of UG; let ei denote the event — local to Si 2 UG(i+1) — “Si begins the handler of the installation of UG(i+1)”; let ci be any cut that includes events ei ; 8Si 2 UG(i+1). Lemma A.3 If any Si 2 UG(i+1) has an have an R-replica for that version along ci .

R-replica for version n along ci , then all servers in UG(i+1)

Proof: An R-replica for version n can be created only upon delivery of a message m tagged with W-REQ and carrying version number n, therefore Si UG(i+1) has been delivered m before event ei . This is also true Si UG(i+1) because of Lemma A.1. Since handlers are uninterruptible, the (local) event “creation of replica in R state for version n” thus precedes event ei Si UG(i+1). Furthermore, Si UG(i+1) such that the (local) event “installing of version n” precedes ei , because of Lemmas A.2 and A.1. 2

2

8 2

8 2

69 2

Lemma A.4 If S 2 UG has an R-replica along ci , then the writer variable in its participant module contains the identity of the server that made it create such a replica. Otherwise, writer is NULL. Proof: The participant module is shown in fig. 4. Writer is set to a not NULL value upon creation of an R-replica (lines 5-7), and it is set to NULL only at line 11, e.g. when the state of the replica is being switched to W. 2

Lemma A.5 Let SLH 2 UG(i) be the current lock holder; let n be the version of the file it is attempting to install; let SLH 62 UG(i+1). If at least one server in UG(i+1) has an R-replica for version n along ci , then all the following conditions hold: (1) every server Si 2 UG(i+1) elects a new server S i 2 UG(i+1) to be the responsible for completing the write operation; (2) such an election is performed within the handler of the installing of UG(i+1); (3) S i = S j 8Si ; Sj 2 UG(i+1).

8 2

Proof: Lemmas A.3 and A.4 imply that Si UG(i+1) when the handler of the installing of UG(i+1) begins, writer is not NULL. Thus all servers in UG(i+1) will elect a new server for completing the write operation (lines 20-21). 2

Proposition 6.6. When the current lock holder leaves UG, the lock is broken only if there is no write in progress. Otherwise, the lock is inherited by some other server in UG.

Proof: Let SLH be the current lock holder; let UG(i) and UG(i+1) be two consecutive quorum views of UG such that SLH ∈ UG(i) ∧ SLH ∉ UG(i+1). There are two cases to consider:

2

1. LM UG(i+1). The leaving of SLH is detected by lines 10-18 (fig. 2). To figure out whether there was a write in progress, LM sends an I-ENQ message to its own participant module (fig. 2, line 11), which replies with an I-RES message containing the current value of writer (fig. 4, lines 14-16). LockHolder is then set to res.writer (fig. 2, line 16). Lemmas A.3, A.4 and A.5 guarantee that if the received res.writer is NULL then there is no write in progress; otherwise, there is a write in progress and res.writer contains the identity of the server that has taken over SLH . 2. LM UG(i+1). This has instead to do with lines 20-27 (fig. 2). Once the new LM has received an L-STATE message from every server in UG, it proceeds by reconstructing the queue of requests and the value of LOCKHOLDER (see also the proof of Proposition 6.4). Each server constructs its own L-STATE message in the LA module (fig. 3, lines 12-27), by possibly asking the participant module to figure out whether there was a write in progress (lines 18-19) via I-ENQ messages already discussed above.

62

In both cases, a module handling a view change could be notified of further view changes, because it could block on a wait-for statement. These further view changes are handled by new handlers installed before suspending on the wait-for 23 . Concerning the LM module (lines 13 and 23, fig. 2) the only interesting event is if it leaves UG. Concerning the LA module, (line 20, fig. 3), one has also to keep track of the identity of the LM, which could change in the meanwhile. 2

In the proof above GVSC has been implicitly used again to simplify the reasoning. The server acting as LM might leave and rejoin the quorum view of UG and be elected again as LM. In that case, it does not have to worry about the delivery of messages (L-REQ L-REL and so on), that in fact were sent to its previous incarnation as LM: they are not delivered at all. This simplification of the reasoning becomes especially important within the handler of the view-change event (where one has to wait for delivery of I-RES and L-STATE messages). The same considerations apply to the other side: any server trying to contact LM does not have to worry about whether LM is still there or whether it disappeared and rejoined. A.2

Coordinating Writes

Lemma A.6 Once a lock has been allocated, eventually one of the following will happen: (1) an R-replica will be created by all servers in UG before releasing the lock; (2) the lock will be broken; (3) UG will disappear.

Proof: That case 3 may happen before both 1 and 2 is obvious, so in the following we will assume that UG does not disappear. Let us start from the very beginning, i.e. all servers are at version 0. Let S_LH ∈ UG be a server holding the lock (fig. 4). If S_LH does not leave the quorum view of UG before releasing the lock, we are obviously in case 1. Otherwise, Proposition 6.6 guarantees that if there is a write in progress the lock is inherited by some other S'_LH ∈ UG; otherwise the lock is broken. In the former case Lemma A.3 applies, which allows concluding that all servers in UG have a replica at version 1. The reasoning can now be iterated. □

Lemma A.7 Once a server S ∈ UG has created an R-replica for version n, eventually either it will install that replica before creating any other replica with a larger version number, or it will leave the quorum view of UG.

Proof: That S may leave the quorum view of UG before installing the replica is obvious, so in the following we will assume that it does not. An R-replica for version n can only be created upon handling the delivery of a message m carrying the W-REQ tag and version number n (line 6, fig. 4). S must thus have been delivered such a message, say sent by S_LH ∈ UG. Recall that S_LH can send such a message only when it is the lock holder. After the delivery of m and before the installation of version n, the events that can be delivered to S are the following:

1. W-COMMIT for version n. The state of the replica is switched to W, thus the replica is installed (lines 10-14, fig. 4).
2. W-COMMIT for version numbers smaller than n. The message is ignored.
3. W-REQ for version n + 1. Before creating an R-replica for version n + 1, the state of the replica for version n is switched to W, thus installing it (lines 2-4, fig. 4).
4. Installation of a new view of UG that includes S_LH. The reasoning is iterated.
5. Installation of a new view of UG that does not include S_LH. The lock is inherited by another server S'_LH ∈ UG because of Proposition 6.6. The reasoning can be iterated having observed that the server inheriting the lock behaves as the original writer, except that it does not have to maintain internal state for returning a status code to the client (lines 25-28, fig. 4 and lines 8-13, fig. 4).

If the reasoning is iterated, note that eventually either event 1 or event 2 will be delivered to S, because of Lemma A.5 along with the fact that S does not leave the quorum view of UG.

To complete the proof, we show that the delivery of W-REQ or W-COMMIT messages other than the above is not possible:

- W-REQ for version numbers larger than n + 1. A W-REQ for version n + 2 can be sent only after acquiring the lock; the lock is released only after the creation of an R-replica for version n + 1 (which follows the delivery of the corresponding W-REQ) has been acknowledged by all servers in UG (line 13, fig. 4). Concerning version numbers larger than n + 2, the reasoning is iterated.
- W-REQ for version numbers smaller than n. The reasoning is the same as above.
- W-COMMIT for version numbers larger than or equal to n + 1. If a server S ∈ UG is delivered a W-COMMIT for a version, then S must have been previously delivered a W-REQ for that version: a W-COMMIT may be sent either by the writer (line 11, fig. 4), or by a participant that took over the writer because the writer left the quorum view of UG (line 25, fig. 4); in the former case just recall that multicasts are delivered in FIFO order; in the latter case, first note that a participant takes over the writer only if it was delivered a W-REQ before seeing the writer leave the quorum view of UG, then consider that, because of VSC and of Proposition 6.12, any other server has delivered such a W-REQ as well. □
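The event analysis above boils down to a small per-replica state machine. The sketch below (Python) is only an illustration of that behaviour under the stated delivery guarantees; the Replica class and the apply() method are invented for the example and are not the code of figure 4.

    # Illustrative per-file replica state machine for the events discussed above.
    # A replica is a pair (version, state), with state "R" (created, tentative)
    # or "W" (installed). Only the message tags W-REQ / W-COMMIT come from the
    # text; everything else is an assumption of this example.

    class Replica:
        def __init__(self):
            self.version, self.state = 0, "W"     # version 0 assumed installed

        def apply(self, tag, version):
            if tag == "W-REQ" and version == self.version + 1:
                if self.state == "R":
                    self.state = "W"              # install version n first
                self.version, self.state = version, "R"   # then create n + 1
            elif tag == "W-COMMIT" and version == self.version and self.state == "R":
                self.state = "W"                  # install version n
            # other messages (e.g. stale W-COMMITs) are ignored

    r = Replica()
    for msg in [("W-REQ", 1), ("W-COMMIT", 1), ("W-REQ", 2), ("W-REQ", 3)]:
        r.apply(*msg)
    print(r.version, r.state)   # -> 3 R (version 2 installed implicitly by W-REQ 3)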

Proposition 6.7. If a server installs a new version of the file in a UG that defines a quorum, then eventually either all servers in UG also install the new version or UG disappears.

Proof: That UG can disappear before all servers in UG have installed the new version is obvious, so in the following we will assume that UG does not disappear. Let S ∈ UG be a server that installed the new version of the file; let m be the message multicast in UG that made S install the new version (Lemma A.2); let v^{i_m} be the presumed view of UG in which m was sent and let v^{i_m+k} (k ≥ 0) be the view in which the sending actually occurs; let, finally, v̄^{i_m} be the collapsing of the presumed view with respect to m. There are the following cases:

1. In between v^{i_m} and v^{i_m+k}, one or more participants leave UG. Lemma A.1 allows concluding that eventually all servers in UG will deliver m. Let us consider any S' ≠ S, S' ∈ UG. According to Lemma A.2, either m is a W-COMMIT for version n, or it is a W-REQ for version n + 1. From the proof of Lemma A.7 it follows that in either case the last W-REQ message delivered to S' was one for version n. When S' is handling the delivery of m it will thus have a replica for version n, either in R state or in W state. If the replica is still in R state, S' will install version n, i.e. it will switch its state to W (lines 9-13, fig. 4). Between the delivery of m and the actual writing on secondary storage S' might fail, but it will eventually leave UG. Since we are assuming that UG does not disappear, eventually all servers in UG will install version n.

2. In between v^{i_m} and v^{i_m+k}, the writer (and possibly one or more participants) leaves UG. This case cannot happen. Since S has installed the new version, it delivered m; if the writer had left UG, the message would not have been delivered to any server because of VSC. □

Proposition 6.8. The sequence of versions of the file installed by any server in UG corresponds to a subsequence of the write lock acquisitions, which are totally ordered.

Proof: Having noted that the sequence of lock acquisitions is unique because of Proposition 6.3, the proof is an immediate consequence of Lemmas A.6 and A.7. □

Proposition 6.13. The replica management system implements replicated files with one-copy serializability semantics.

Proof: Consequence of the linear membership approach, the read-one/write-all replica control protocol, the concurrency control protocol, and the way UG is re-created after a total failure. □

A.3  Creating the Up-To-Date Group

The UG creation and joining algorithm (UGA) is meant to check whether UG already exists and, if it does not, to create it from scratch (Proposition 6.9). UGA is initially activated by the view change which notifies that RMG comes to form a quorum after a total failure. An up-to-date server, called the candidate, is selected out of the quorum of servers and entrusted with creating UG. Such a selection is accomplished based upon the replicas' states, according to Proposition 6.10. If instead UG already exists, any server can join it directly. Joining UG requires getting up-to-date first and joining the group afterwards (Propositions 6.11 and 6.12). The pseudo-code expressing UGA is shown in figure 7. An informal overview follows, mainly focusing on the scenario in which RMG comes to form a quorum after a total crash, UG does not exist, and UG consequently has to be created from scratch.

During UGA each server p avails itself of four variables. Variable ps contains the set of processes that p knows to participate in UGA. Variable rs is a subset of ps containing those servers that, so far, have sent their replica's state to p. Variable c contains the identifier of the candidate, once it has been elected. Finally, variable rss is a set containing the states of the replicas managed by the servers in rs. Initially, all variables are cleared.

When UGA is activated, each server sets ps to RMG, multicasts its own replica's state in a V-STATE message (lines 12–16) and sets about delivering the replica states of all the other members in ps. Whenever a V-STATE message is delivered, the sender is registered in rs and the related replica state in rss (lines 34–37). Once a server has delivered the state from all servers in ps (that is, ps = rs), it evaluates the identity of the candidate (invoking procedure create-UG() at line 37). If it happens to be the candidate, it creates UG and notifies this fact to the others (lines 25–28). Otherwise it waits for the UG creation notification, upon which it joins UG (lines 38–40).
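A rough Python sketch of this failure-free flow follows. It is not the pseudo-code of figure 7: the multicast and delivery primitives are stubbed out, and select_candidate stands for the deterministic choice described by Proposition 6.10.

    # Rough sketch of the failure-free UGA flow described above (not figure 7).
    # 'deliveries' models the stream of V-STATE messages; 'multicast' is a stub.

    def uga(my_id, rmg_view, my_state, multicast, deliveries, select_candidate):
        ps = set(rmg_view)                 # servers known to participate in UGA
        rs, rss, c = set(), {}, None       # senders heard from, their states, candidate
        multicast("V-STATE", my_id, my_state)
        for sender, state in deliveries:   # one V-STATE per server in ps
            rs.add(sender)
            rss[sender] = state
            if rs == ps:                   # state received from everybody
                c = select_candidate(rss)  # same outcome at every server (VSC)
                break
        return "create UG and multicast UG-BUILT" if c == my_id \
               else "wait for UG-BUILT, then join UG"

    # Example: three servers; the one exposing the "highest" state becomes candidate.
    msgs = [("s1", 3), ("s2", 5), ("s3", 4)]
    print(uga("s2", {"s1", "s2", "s3"}, 5, lambda *a: None, msgs,
              lambda states: max(states, key=states.get)))   # -> create UG ...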

If view changes occur while UGA is running, there are several notable scenarios that have to be accounted for. If a view change makes RMG stop being a quorum, UGA is aborted. Otherwise, the effects of the view change depend upon which of the following two mutually exclusive events occurred: only leavings, that is, some servers left the current view of RMG and none joined; or at least one joining, regardless of whether some server also left. In the former case UGA is continued with the servers remaining in RMG. In the latter, UGA is always restarted unless the candidate has already been elected. In a moment we will see some more detail. In case of only leavings, the new membership of RMG is trivially a subset of ps. Variable ps is reset to this new value (line 9), the processes that left are removed from rs, their replica states are removed from rss as well (lines 18–20), and UGA goes on with the updated values of ps, rs and rss. If a candidate had already been set but it also left, a new one is selected (lines 21–22).

In case of at least one joining, RMG contains some servers, those that joined, which are not contained in ps (remember that ps contains the previous membership of RMG). All servers different from the candidate reset their local variables, that is, they set ps to the current composition of RMG and clear up the other variables, and multicast their replica's state (lines 12–16). This implies that UGA is restarted from scratch if the candidate does not belong to RMG after the view change. If, instead, the candidate is still in RMG, it will neither reset its variables nor multicast, but will continue on its way to building UG (lines 26–28). Notice that the fact that the candidate does not multicast its replica's state causes, in practice, the other servers to block, because the condition rs = ps will never hold for them. Such servers get unblocked upon being notified that the candidate has created UG (lines 38–40). In both cases, if another view change strikes, these reasonings are applied recursively.

In order to prove Proposition 6.9 we need to exploit certain properties of UGA. These properties are summarized by the following lemma. First, however, we need some definitions.

Let V^1 be the view where RMG comes to form a quorum after a total failure. Let us suppose that view changes occur but RMG never stops being a quorum. Let V^i be the i-th view in the resulting sequence V^1, ..., V^i, .... Let us consider the view change from V^i to V^{i+1}. Upon this view change, each server p in V^{i+1} executes a handler h_p[i+1]. Let c[i+1] be the cut passing along the end of h_p[i+1], ∀p ∈ V^{i+1}.

Let us also assume that UG does not exist along the cut defined by the view change (although it might seem rather obvious, in fact it is not; we will come back to this point later). This causes procedure uga() in figure 7 to be invoked upon the view change in RMG. Finally, let ps_p, rs_p and c_p be the instances of variables ps, rs and c at server p.

Lemma A.8 Let us consider the following predicates, defined along c[i+1]:

1. If ∃q ∈ V^{i+1} such that c_q = q, then:
   (a) ∃j (j ≤ i) such that V^j is the view where c_q was set to q, and ps_q = V^j ∩ ... ∩ V^{i+1}.
   (b) i. If ∃p ∈ V^{i+1}, p ≠ q, such that ps_p = ps_q, then ∀p ∈ V^{i+1}: ps_p = ps_q.
       ii. If ∃p ∈ V^{i+1}, p ≠ q, such that ps_p ≠ ps_q, then ps_p = V^{i+1} ∧ c_p = NULL.
2. If ∀p ∈ V^{i+1}: c_p = NULL, then ps_p = ps_q = V^{i+1} for all p, q ∈ V^{i+1}.

Proof:

Point 1a. If c_q = q then procedure create-UG() was executed at q. This may have happened either at line 22 or at line 37 in figure 7. If at line 22, q must have executed the else branch of the if statement at line 10. Let V^j be the view where c_q was set to q; ps_q was therefore set to V^j. Any view change that leads from V^j to V^{i+1} causes the intersection at line 9 to be computed, but ps_q is never reset because the test at line 11 always fails at q. Therefore ps_q = V^j ∩ ... ∩ V^{i+1}.

If at line 37, a set of V-STATE messages was delivered in such a way that ps_q = rs_q. These messages were multicast at the end of a run of uga() activated by an RMG view change which notified the installation of some V^a (1 ≤ a ≤ i). Moreover, let us consider the V-STATE message, among those sent upon the view change that installed V^a, delivered the latest by server q, in a certain view V^b (a ≤ b ≤ i). This delivery causes procedure create-UG() to be invoked and c_q to be set (line 24).

There cannot have been any joinings in between V^a and V^b, because otherwise c_q would have already been set, violating the fact, just stated, that c_q was set in V^b. Therefore, since only leavings may have occurred, it follows that ps_q = V^a ∩ ... ∩ V^b = V^b and j = b. Moreover, from view V^b on, all view changes that finally produce V^{i+1} cause q to compute the intersection at line 9 but never to reassign ps_q, because c_q is set to q's identifier; therefore ps_q = V^j ∩ ... ∩ V^{i+1}.

Point 1(b)i. If ps_p = ps_q, from the previous point it is straightforward to show that p executed the else branch at line 17. Therefore view V^i changed to V^{i+1} because of only leavings, and the strong-partial character of VSC's membership service proves point 1(b)i.

Point 1(b)ii. If ps_p ≠ ps_q there are two cases.

1. If p executed the then branch at line 11, at least one joining took place; therefore ps_p = V^{i+1}, rs_p = ∅ and c_p = NULL.
2. If p executed the else branch, only leavings occurred between V^i and V^{i+1} and therefore ps_p = V^{i+1}. However, at least one joining after V^b must have occurred, because otherwise ps_p would have been equal to ps_q; so c_p = NULL.

Point 2. This proof too is articulated in two branches.

1. If the else branch is taken, then ps = V^{i+1} and the claim is obviously true (only leavings occurred).
2. If, instead, the then branch is taken, ps ≠ V^{i+1} (at least one joining occurred) and ps is set to RMG, that is, to V^{i+1}. □

Before proving Proposition 6.9, we give some intuitions about the practical meaning of Lemma A.8. In general, point 1 accounts for the case in which the candidate is set and belongs to V^{i+1}. More in detail, point 1a is ancillary to the proof of point 1(b)i, and therefore its usage is limited to that context only. Point 1(b)i accounts for the case where the candidate is set and only leavings occurred; UGA goes on, the candidate will create UG, whereas the other processes will wait for notification of this event. Point 1(b)ii contemplates the case where at least one joining occurred after the candidate was set. All processes but the candidate try or have tried to restart UGA; however, they are blocked, since the candidate is instead proceeding with the creation of UG. Point 2 covers the scenarios in which the candidate does not belong to V^{i+1}. From the proof of point 2, it is worthwhile to note that, in case of only leavings, the candidate was not set even before the view change: processes in V^{i+1} are still waiting for the candidate to be selected. In case of at least one joining, whether the candidate was set in V^i or not does not matter very much, because UGA is restarted.

Proposition 6.9. If, during the execution of the reformation algorithm, the number of failures is finite and the view of RMG continues to define a quorum, then UG will eventually be created.

Proof: Let c1 be the cut defined by the view change that notifies that RMG comes to form a quorum again. As shown in figure 14, given the asynchrony of the system, a previous instance of UG may still exist along c1. For the moment we will assume that this does not happen; Lemma A.9 shows that, although this event may occur, it is harmless. The proof is articulated in two parts. In the first, we prove that UG is created if no view change occurs after the one which notifies that RMG came to form a quorum following a total failure. In the second, we prove that the proposition holds even when view changes take place.

Let V^1 be the view which notifies that RMG forms a quorum after a total failure; the installation of this view makes UGA begin [24]. Furthermore, let us suppose that UG does not exist along the cut defined by V^1 (see later).

24. Please notice that UGA is really invoked after the voting table has been validated, so it is not activated exactly upon the RMG view change. However, without loss of generality, we assume that UGA starts upon the view change. This is equivalent to assuming that both the number of copies and their weights are fixed.

1. If no view change takes place after V^1, UG is created. For each server p ∈ V^1, initially all local variables are cleared; in particular, ps = ∅ and c = NULL, therefore lines 12–16 are executed. Through these lines, p sets ps to the current composition of RMG (that is, V^1) and multicasts its replica's state to the other members of RMG. Whenever p delivers a V-STATE message (line 33), it stores the sender in rs and the state of the related replica in rss. Upon delivering a V-STATE from all servers in V^1 (rs = ps), p evaluates the candidate through procedure create-UG(). This procedure internally invokes select-candidate(), which deterministically picks an up-to-date server out of rss (Proposition 6.10 states which one). Since VSC guarantees that ∀p, q ∈ V^1: rss_p = rss_q, all servers select the same server. The identity of the server so selected is assigned to variable c. The server q such that q = candidate creates UG by invoking group-join() (line 26) and notifies this fact by multicasting UG-BUILT. Upon delivering such a message (line 38), any server different from the candidate attempts to join UG through the stj() procedure.

2. Let V^j (j ≥ 1) be a view of RMG in the sequence V^1, ..., V^j; then UG is eventually created, provided RMG never stops being a quorum. That RMG may stop being a quorum is obvious; therefore, hereafter, we assume that it never does. Let us consider c[j]. Let us also suppose that along c[j] there is a q ∈ V^j such that c_q = q; then create-UG() was executed at q. If no view change occurs, q creates UG. Any view change such that q ∈ V^{j+1} produces case 1(b)i of Lemma A.8 if only leavings occur; it produces case 1(b)ii, instead, in case of at least one joining. In both cases, because c_q = q and q is still in RMG, this reasoning has to be applied again.

Figure 14. Initially UG contains p, q and r. Along cut c1, UG stops forming a quorum. The related handlers performed by servers p, q and r will, at the end of their execution, invoke group-leave(UG) in order to leave UG. Once all of them have left, UG will not exist anymore. Along cut c2, however, RMG forms a quorum again and therefore UGA is started. However, server p is so slow that, along c2, it has not executed group-leave yet. Therefore, along c2, UG still exists. When server q serves the view change in RMG, it detects the existence of UG and activates the state transfer procedure at once (the J-REQ message in the figure). Notice, however, that UG is correctly destined to disappear as soon as p executes group-leave.



Any view change which excludes q produces case 2, treated in the next point.

If along c[j], ∀p ∈ V^j: c_p = NULL, the system is in the state described by point 2 of Lemma A.8. A view change, due as usual to either only leavings or even a single joining, is responsible for having put the system in such a state. In the former case, ∀p ∈ V^j, either p elected the candidate, if rs_p = ps_p (lines 18–22), or some more V-STATE messages have to be delivered for that condition to become true. In the latter case, ∀p ∈ V^j, UGA is restarted (p executes lines 12–16). For the candidate to be elected, a number of messages such that ps = rs must be delivered. In both cases other view changes may subsequently occur. If they occur before the candidate is elected, the reasonings of this point must be applied again; otherwise, those of the previous one. □

It is worthwhile to notice that we have assumed that, when RMG forms a quorum after a total failure, no UG exists. Due to the asynchrony of the system, this might not be the case, as shown in figure 14. Actually, when a total failure occurs, unless it is due to the contemporaneous physical crash of all servers in UG, servers take a certain interval of time to realize that UG does not form a quorum anymore and, therefore, to leave UG. Roughly speaking, this time amounts to the time needed for suspecting and managing the failure and executing the related view change handler. Given the asynchrony of the system, this transient time may be arbitrarily long. Therefore, when RMG comes to form a quorum, some servers may observe that UG still exists, when in fact it is the UG that is going to disappear as a consequence of the previous total failure, and they could therefore try to join it, so skipping the creation of UG. However, we can prove the following.

Lemma A.9 When RMG comes to form a quorum after a total crash, eventually every server in RMG participates in UG creation.

Proof: Of course a server may fail and RMG may disappear; hereafter we will assume that they do not. Let c1 be the cut defined by the view change that notifies that RMG comes to form a quorum after a total failure. Along c1, UG may still exist (owing to the reasons mentioned above) or not. The latter case has already been treated in Proposition 6.9, so we focus only on the former. We will prove that, even in this scenario, eventually every server in RMG will execute UGA. It is worthwhile to notice that, since RMG has re-formed after a total failure, all processes still in UG must, sooner or later, serve the view change notifying that UG stopped being a quorum, before serving the one that notifies that RMG is a quorum again. The UG view change is served by a handler that we denote h_p[UG], where p indicates the process that executes it (examples of those handlers can be found in the modules for writing and for locking files). Such handlers invoke group-leave(UG) as their last action, therefore UG is destined to disappear. Without any loss of generality, let us suppose that UG contains a server p only. Let us now consider the behaviour of the other processes in RMG. It is straightforward to notice that these processes either have recovered from a physical crash or, somewhere in their past, executed group-leave(UG). When RMG comes to form a quorum, these servers execute UGA, find out that UG exists, and invoke stj() (line 7 of figure 7) in order to attempt to join UG. Therefore, each joining server selects server p as the state provider, requests joining and sets about waiting for the completion of the state transfer (figure 8 (left), lines 1–5).

State provider p will deliver and serve J-REQ (figure 8 (right), line 1) after h_p[UG] has terminated, that is, when p is no longer a member of UG. Therefore p will send back the J-ABORT message (figure 8 (right), line 11), whereby the joining server returns to UGA (figure 8 (left), lines 6–8). Since p was the last member of UG, UG does not exist anymore and UG creation is started (the test at line 7 in figure 7 fails).

Let us suppose that a view change occurs while a joining server is performing the stj, more precisely while it is transferring the state (line 5 of figure 8 (left)). Apart from the case in which RMG stops being a quorum, and therefore the server aborts, the other notable situation is when the current state provider fails. This event causes the joining server to return to UGA (figure 8 (left), lines 12–14). Moreover, it is worthwhile to notice that a joining server does not care about any view change caused by a server joining RMG. The just-joined server checks for the existence of UG and the reasonings shown above have to be iterated. □
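The joiner's side of this interaction can be pictured as a simple retry loop. The Python sketch below is an interpretation only: ug_exists, pick_provider and try_state_transfer stand in for the checks and the J-REQ/J-ABORT exchange of figures 7 and 8 rather than reproducing them.

    # Sketch of a joining server alternating between joining an existing UG and
    # falling back to UG creation, as in the scenario above. All callbacks are
    # stand-ins invented for the example.

    def join_or_create(ug_exists, pick_provider, try_state_transfer, run_creation):
        while True:
            if not ug_exists():                  # UG gone: fall back to creation
                return run_creation()
            provider = pick_provider()           # e.g. the last known member of UG
            if try_state_transfer(provider) == "joined":
                return "member of UG"
            # J-ABORT or provider failure: re-evaluate whether UG still exists

    # Example: UG is seen once (the stale instance), then disappears.
    answers = iter([True, False])
    print(join_or_create(lambda: next(answers),
                         lambda: "p",
                         lambda provider: "aborted",
                         lambda: "UG created from scratch"))   # -> UG created from scratch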

Proposition 6.10. The server with the highest volume state among those in a quorum view of RMG creates UG.

Proof: Let S be the server with the highest volume state among those in a quorum view of RMG [25]. We have to prove that: (1) by switching to W the state of all replicas at S, they all become up-to-date; and (2) this does not contradict the status codes returned to clients, i.e. no write operation that returned a SUCCESS is lost. To follow the proof, recall that whenever a replica (either in R or in W state) is created, it is tagged with the current value of ugid (lines 3, 7, 13, fig. 4). Recall also that message deliveries are totally ordered with respect to view changes and that handlers are uninterruptible.

25. If there are several such servers, any one of them can be selected.

R Let us denote by UG’ the last composition of UG before its disappearing; let vnW f (vnf ) be the largest version number among all replicas in W state (R state) of file f ; let f ( f ) be the set of servers S UG’ R holding a replica of f in W state (R state) whose version number is vnW f (vnf ); let, finally, S be a server selected according to the proposition being proved. With respect to any file f , UG may have ceased to exist in any of the following scenarios:

W R

2

1. 2.

Wf = UG’ ^ Rf = ; (all the writes performed on f have been completed). Wf 6= ; ^ Rf 6= ; (there is a write in progress, at least one server has installed a replica in W state, some

3.

Wf 6= ; ^ Rf = ; (there is a write in progress, at least one server has installed a replica in W state, no server

4.

Wf = ; ^ Rf 6= ; (there is a write in progress, no server has installed a replica in W state).

servers have an R-replica).

has an R-replica).

Concerning scenarios 1, 2 and 3, recall that a version can be installed only after an R-replica for that version has been created on at least a quorum of servers. Then consider that if the largest version number among all the write requests that created a quorum of R-replicas is n, then any quorum set contains at least one replica (either in state R or in W ) associated with the ugid of the UG in which that replica was created. This, along with the definition of S , allows concluding that S f S f , e.g. that S certainly holds an up-to-date replica, either in state R or W. Concerning scenario 4, first consider that f may or may not form a quorum. Then consider the two following subscenarios:

2W _ 2R

R


1. S ∈ R_f. The replica at S is installed, even if it does not exist on a quorum. This is necessary in order to make it possible to decide which server shall create UG without waiting for all servers to become available, i.e. to allow any quorum to make progress.

2. S ∉ R_f. In this scenario the attempted write that led to the creation of version vn^R_f is ignored and the up-to-date version of f becomes the one at S. To prove that this is correct, we have to prove that the version at S is really the outcome of the latest successful write. Let U_f be the set of servers holding the up-to-date version of f. Since a write can be successful only after a replica in R state for that version has been created on a quorum of servers, the definition of S guarantees that S ∈ U_f.

Therefore, the first time that UG is being re-created, S is up-to-date. Concerning the status codes returned to clients, it suffices to note that a necessary condition for returning a SUCCESS status is that a replica has been created on at least a quorum of servers, and that in none of the scenarios above is such a replica lost. The Proposition then follows by just iterating the reasoning. □

As an aside, note that the choice of the server for re-creating UG cannot be made by just looking at the version numbers held by the various servers. The reason is that, outside UG, replicas in R state with the same version number might not be identical, i.e. they could be the outcome of distinct write operations. To realize this, just consider the following scenario: a server S_i ∈ UG creates an R-replica for version vn^R_f; UG disappears and such a replica has not been installed on a quorum; UG is re-created starting from a quorum view of RMG that does not include S_i; a further attempt is made to install version vn^R_f; S_j ∈ UG creates a replica in R state and then UG disappears; a quorum view of RMG is formed that includes both S_i and S_j. The Proposition just proved is aimed precisely at selecting a server holding a replica created in the most recent incarnation of UG, which is the only one that might have been successful.

Proposition 6.11. If there are a finite number of failures and UG does not disappear while executing the protocol, a server S wishing to join will eventually belong to UG.

Proof: It suffices to note that the lock request made by the state provider (Fig. 8 right, line 5) will eventually be granted, because of Proposition 6.4. □
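One way to picture the selection criterion is as a lexicographic comparison in which the ugid of the replica dominates. The Python sketch below is an interpretation of Proposition 6.10 and of the aside above, with an invented tuple encoding of the volume state; it is not the actual select-candidate() of figure 7.

    # Plausible encoding of the comparison discussed above: prefer the replica
    # created in the most recent incarnation of UG (largest ugid), then the
    # largest version number, then W over R. Illustration only.

    def volume_rank(replica):
        ugid, version, state = replica          # e.g. (2, 7, "W")
        return (ugid, version, 1 if state == "W" else 0)

    def pick_candidate(states):
        """states: dict server -> (ugid, version, state); ties broken by server id."""
        return max(sorted(states), key=lambda s: volume_rank(states[s]))

    print(pick_candidate({"s1": (1, 7, "W"), "s2": (2, 7, "R"), "s3": (2, 6, "W")}))
    # -> s2: its replica was created in the most recent incarnation of UG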

Proposition 6.12. A server S joins UG only if there is no write in progress.

Proof: Obvious consequence of the state-transfer-and-join protocol: a necessary condition for completing the state transfer is that the state provider in UG has acquired the lock (Fig. 8 right, lines 5-8) (Section 6.3). □

A.4  Dynamically Changing Votes and Composition of RMG

Concerning Propositions 7.1, 7.2 and 7.3, the proofs can be obtained from their counterparts that refer to the write algorithm, having observed that in this case the lock is released after the creation of a replica in R state has been acknowledged by all servers in UG. The remaining Propositions are proved in the following.

Proposition 7.4. If S_j starts the UG reformation algorithm in presumed view v of RMG, then v defines a quorum.

Proof: Let v denote the composition of RMG in which the UG reformation algorithm is started (line 22, fig. 11). We need to prove that, when function compare-vt() (fig. 12) returns a non-NULL value (lines 8 and 15), v really defines a quorum (lines 20-22, fig. 11). A necessary condition for invoking compare-vt() is that vrep-set contains a message from every server in v (lines 18-19), thus compare-vt() operates on the voting table of each server in v.

The non-NULL value candidate-vt at line 8 is returned iff the largest version number vn among the voting tables in v refers to a voting table in W state and v defines a quorum according to that voting table. To prove that such a voting table is also up-to-date, let us suppose that it is not, i.e. that a voting table exists outside v that has version number vn + k (k ≥ 1) and, furthermore, that it exists at a set of sites defining a quorum according to that table itself. A necessary condition for this is the previous installation of version number vn + 1. Because of Propositions 7.1 and 7.2, this can be done only provided a quorum of version vn is notified. Since any two quorum sets have a non-empty intersection, at least one server in v would be aware of this, but this contrasts with the choice of candidate-vt made at lines 3-4.

If instead the voting table with the largest version number is not in W state, the else branch at line 10 is entered, which contains the remaining non-NULL return value at line 15. This candidate-vt is selected at line 13 as the voting table in R state associated with the largest vt-ugid. This is the only suitable candidate because the voting-update algorithm guarantees that any copy in R state associated with a smaller vt-ugid is certainly not up-to-date. Having reached line 10, the necessary conditions to return the non-NULL value at line 15 are: (1) the current view of RMG defines a quorum for the old-vt selected at line 10; (2) such an old-vt is in W state. By applying again the reasoning of the previous paragraph to this scenario, it is easy to conclude that the largest version number of voting tables in W state is either (a) vn or (b) vn - 1. Case (a) applies when candidate-vt exists on a quorum for that table, in which case the replica in W state would be outside v; otherwise, case (b) applies. In either case Propositions 7.1 and 7.2, along with the fact that v defines a quorum for candidate-vt (line 14), guarantee that forcing candidate-vt to W state is always a safe decision.

Note that, in any case in which a NULL value is returned, it is not possible to take any safe decision because Propositions 7.1 and 7.2 cannot be applied; thus one has to wait to contact more servers. □
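The argument repeatedly uses the fact that any two quorums defined by the same voting table intersect. The small Python sketch below illustrates that property for a majority-of-weights quorum rule; the dictionary layout of the voting table is an assumption made for the example and is not the vt structure of figures 11–12.

    # Weighted quorum check: a set of servers is a quorum when its weights sum
    # to more than half of the total. Two such sets must share at least one
    # server, which is the intersection property used in the proof above.

    def is_quorum(servers, vt):
        return 2 * sum(vt[s] for s in servers) > sum(vt.values())

    vt = {"s1": 1, "s2": 1, "s3": 1, "s4": 2}        # total weight 5, quorum > 2.5
    q1, q2 = {"s1", "s2", "s3"}, {"s3", "s4"}
    assert is_quorum(q1, vt) and is_quorum(q2, vt)
    print(q1 & q2)                                    # -> {'s3'}: non-empty intersection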

Proposition 7.5. If S_j starts the UG reformation algorithm in presumed view v of RMG, then ∀S_i ∈ v eventually one of the following will happen: (1) S_i starts the reformation algorithm; (2) S_i installs a view of RMG that does not include S_j; (3) S_i crashes.

Proof: That either case 2 or case 3 may happen before case 1 is obvious. In the following we will assume that they do not happen and will prove that eventually case 1 will happen. Since S_j has started the reformation algorithm, vrep-set at S_j contains a message from every server in v (line 18, fig. 11) and the corresponding voting tables allow compare-vt() to return a non-NULL value (line 19). Having observed that messages put in vrep-set are sent via multicast (lines 8-9 and 10-13), and that certain messages may be removed upon delivery of view changes (line 5), it is straightforward to realize that eventually every server in v will be able to start the reformation algorithm. □

Proposition 7.6. If every failure eventually recovers and there are no further failures for a sufficiently long time, then the UG reformation algorithm will eventually be started.

Proof: The hypothesis of the Proposition can be restated by saying that for a sufficiently long time there is a view v of RMG that includes all servers belonging to RMG. We need to prove that, within that view and under this hypothesis, function compare-vt() (fig. 12) certainly returns a non-NULL value (lines 19-22, fig. 11). Line 8 certainly cannot be reached because if candidate-vt is in W state then the current view of RMG certainly defines a quorum. Concerning line 12, it would be reached iff there were no element of svt in W state. In this case the voting-update algorithm, along with the fact that the current view of RMG includes all servers, would guarantee that: (1) a voting table in W state whose version number is equal to that of candidate-vt decreased by 1 indeed exists; and (2) v certainly defines a quorum for that voting table. Therefore, neither of the tests at line 11 will be satisfied and line 12 will not be reached. Finally, line 16 cannot be reached because, again, v includes all servers in RMG. □

Proposition 7.7. If S_j joins a quorum view of RMG, then eventually one of the following will happen: (1) S_j starts the state-transfer-and-join algorithm; (2) S_j is delivered the installation of a view of RMG that does not define a quorum; (3) S_j crashes.

Proof: That either case 2 or case 3 may happen before case 1 is obvious. We need to prove that, assuming they do not happen, eventually case 1 will happen. Let c be any consistent cut including the event "S_j joins a quorum view of RMG". If UG exists along c, the test at line 2 of fig. 11 will not be satisfied and S_j will jump to line 10, which will make it switch to the state-transfer-and-join. Otherwise, Propositions 7.4, 7.5 and 7.6 guarantee that, under the previous assumption, UG will eventually exist. Thus eventually S_j will be delivered a UG-BUILT message (multicast at line 24 of fig. 7), which will make it switch to the state-transfer-and-join (lines 15-16). □

A.5  Supporting GVSC

The following refers to the algorithm for supporting GVSC presented in Section 8.2.

Proposition 8.1. Actions (1)–(3) implement GVSC according to Definition 4.1.

Proof: Let us consider any message m multicast by p_i in presumed view v^{k1}(p_i). Let us consider a sequence of views v^{k1}(p_i), ..., v^{k3}(p_i), where v^{k3}(p_i) (k3 ≥ k1) is the view in which the sending of m actually occurs. Let v̄^{k1}_m(p_i) be the collapsing of v^{k1}(p_i) with respect to m. By comparing Definitions 3.1 and 4.1, it turns out that, for implementing GVSC on top of VSC, it suffices to guarantee that process p_j will not receive m iff:

    ∃k2 ∈ [k1, k3] such that v^{k2}(p_i) is installed by p_i between v^{k1}(p_i) and v^{k3}(p_i), and p_j ∉ v^{k2}(p_i)    (1)

Case k2 = k3 is trivial: VSC alone guarantees that p_j will not receive m. Otherwise, we have to analyze how p_j will apply action 3. The view vector attached to m is VV^m = VV_i^{k1} because of action 2. As a consequence of action 1 and of the agreement on view composition, it is:

    VV_i^{a(i)}[j] = VV_j^{a(j)}[i],   ∀j ∈ v^{a(i)}(p_i)    (2)

Indexes a(j) may obviously be different because the processes may have had different histories. Note that action 1 is such that equality (2) starts holding before delivering or multicasting any message in the current view. Then, let us distinguish two cases:

- ∃k2 ∈ [k1, k3) satisfying eqn. (1). In this case it will be VV_i^{k3}[j] > VV_i^{k1}[j] = VV^m[j], because p_i and p_j have joined together at least once between the presumed view and the real view. Let k3' be the index of the view at p_j satisfying the agreement relation v^{k3}(p_i) = v^{k3'}(p_j). Eqn. (2) guarantees that VV_j^{k3'}[i] = VV_i^{k3}[j]. Therefore, VV_j^{k3'}[i] > VV^m[j] and the message will not be delivered to the application layer (action 3).

- No k2 satisfies eqn. (1). Reasoning similar to the above leads to the conclusion that VV_j^{k3'}[i] = VV^m[j]; therefore the message is delivered to the application layer. □
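A compact sketch of the delivery filter applied in action 3 follows. The view vectors are reduced to plain Python dictionaries, and how they are maintained by actions 1–2 is assumed rather than reproduced from Section 8.2; only the comparison used in the proof above is shown.

    # Minimal sketch of the GVSC delivery filter (action 3). vv_msg is the view
    # vector VV^m attached to m by the sender p_i; vv_local is the receiver
    # p_j's own view vector. Both are plain dicts here, maintained elsewhere.

    def deliver_to_application(sender, receiver, vv_msg, vv_local):
        """True iff m is passed up: the receiver's entry for the sender has not
        advanced past the sender's entry for the receiver at sending time."""
        return vv_local[sender] <= vv_msg[receiver]

    # p_i and p_j re-joined together after m was sent, so the message is dropped.
    print(deliver_to_application("p_i", "p_j",
                                 vv_msg={"p_j": 3},       # VV^m[j]
                                 vv_local={"p_i": 4}))    # VV_j[i]  -> False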
