Selecting a “Primary Partition” in Partitionable Asynchronous Distributed Systems

Alberto Bartoli, Dip. Ingegneria dell’Informazione, University of Pisa, Italy. E-mail:
[email protected]
Özalp Babaoglu, Dept. of Computer Science, University of Bologna, Italy. E-mail:
[email protected]
Abstract
We consider network applications that are based on the process group paradigm. When such applications are deployed over networks that are subject to failures, they may partition across several disconnected clusters, resulting in multiple views of the group's current composition existing concurrently. Application semantics determine which operations, if any, can be performed in different partitions without compromising consistency. For certain application classes, most (possibly all) operations need to be confined to a single primary partition, while other partitions are allowed to service only a (possibly empty) subset of the operations. In this paper, we propose a mechanism for deciding when a view constitutes the primary partition for the group. Our solution is highly flexible and has the following novel features: each group member can establish whether it belongs to the primary partition based solely on local information; the group can be dynamic, as processes voluntarily join and leave it; the selection rule for establishing the primary partition need not be universal but can be decided on a per-application basis and can be modified at run time; and the primary partition can be re-established even after total failures. Layering our solution on top of a partitionable group membership service allows a wide range of applications with different and possibly conflicting notions of “primary partition” to be supported on a common computing base.

1. Introduction

Process groups have proven to be a useful technology for developing network applications to be deployed in asynchronous distributed systems [7]. Informally, a group is a set of processes that cooperate towards some common goal or share some global state. The composition of the group is dynamic due to processes that voluntarily join and leave the computation, those that need to be excluded due to failures, and those that need to be integrated after repairs. A group membership service tracks these changes and transforms them into views that are agreed upon as defining the group's current composition. Group members communicate through reliable multicasts whose semantics are formalized as view synchrony, which defines global ordering guarantees on message deliveries as a function of view changes [4,8].

Partitions are a fact of life in most practical distributed systems, and they tend to become more frequent as the geographic extent of the system grows or its connectivity weakens, for instance due to the presence of wireless links. A partitionable group membership service allows multiple views of the group, each corresponding to a different partition, to co-exist and evolve concurrently [12,4]. In asynchronous systems subject to failures, multiple views can also be the result of virtual partitions indistinguishable from real ones [18]. A partitionable membership service can support partition-aware applications [6], that is, applications programmed so as to establish which of their services are available in which partitions based on the composition of the associated views [3]. As such, partitions may result in a reduction of services in some partitions but do not necessarily render them completely unavailable. In contrast, a primary-partition group membership service maintains a single agreed view of the group at any given time [16,15]. To achieve this requirement in a partitionable system, a primary-partition group membership service has to limit group membership changes to the primary partition and block all processes in non-primary partitions, either by not delivering them any views [19] or by pretending that they have crashed so as to force rejoins after recovery [16]. In summary, a primary-partition group membership service promotes what can be called partition-ignorant applications, since the presence of multiple partitions is hidden from them.

Even when progress is possible in multiple partitions, many applications are able to offer the full ensemble of their operations in one partition alone, the primary partition. Only a subset of their operations may be available in non-primary partitions. In the limit, it may be the case that only the primary partition is allowed to make progress and non-primary partitions serve no operations at all and (temporarily) block. In this paper we propose a methodology and associated algorithms such that a primary partition can be
established on top of a partitionable group membership service. Our methodology is based on the Enriched View Synchrony paradigm, which facilitates the maintenance of application-defined shared state in a process group [3]. Each group member determines whether it belongs to the primary partition by applying a selection rule to its current view. Our solution has several novel features that make it appropriate for building highly flexible partition-aware applications. Our methodology does not assume a static group membership but admits processes joining and leaving the group dynamically. It can re-establish a primary partition even after a “total failure” scenario where all group members crash. The selection rule uses local information only. It is possible to support different selection rules simultaneously, each appropriate for a particular application class, amortizing their cost over a common group infrastructure core [13]. Furthermore, selection rules can be modified at run time without having to halt and restart the application. This allows applications to “adapt” themselves to their operating environment by basing their selection rules on properties observed during an execution, for instance the availability of sites or the number of requests received from the outside. As a result, both the availability and the performance of an application could be rendered significantly higher than if the selection rule were decided a priori and could not be modified. The major advantage of our solution for establishing a primary partition stems from its composite construction: a per-application selection rule is layered on top of a system-wide partitionable group membership service. The alternative of embedding the notion of primary partition into the group membership service itself (as is done in a primary-partition group membership service) would preclude partition-aware application development. Our solution, however, cannot contradict the result that the “primary-partition group membership problem” is not solvable in asynchronous systems subject to failures [10]. Very informally, the latter problem requires the existence of a single, agreed-upon view of the group, whereas our solution guarantees that, given a set of concurrent views, the predicate “is this view primary?” evaluates to true in at most one of these views. As we show, this weaker requirement is useful and indeed achievable in an asynchronous system.
2. System model and view synchrony

The system is a collection of processes that communicate through a network. Processes may crash and the network may partition. Crashed processes eventually recover and partitions eventually repair. The system is asynchronous in that no bounds are assumed on communication delays or relative speeds of processes. Each process has access to a local failure detector module
[9] with properties sufficient to implement view synchrony as described in [4] and summarized in the following. View synchrony implements the notion of process group and provides reliable FIFO multicast as the basic communication primitive among group members (we shall consider only processes that are group members). View synchrony includes a group membership module that provides processes with consistent information about the subset of the group that appears to be currently reachable. This information takes the form of views. New views are communicated to processes in the form of view change events. A process that delivers a view change event vchg(v) is informed that the new view is v. In this case we say that the process has delivered or installed v. Every view installed by a process includes the process itself. We say that an event e occurs in view v at a given process iff the last view delivered at that process before e is v. We denote by comp(v) the set of processes composing the view v. An essential feature of view synchrony is that view changes are globally ordered with respect to message deliveries [8,4], which can be stated informally as: any two processes that survive from some view v to some other view v' have delivered the same set of messages in view v. A process may thus reason globally on the basis of local information, e.g., the sequence of messages and view changes it has delivered. We assume a partitionable membership service, that is, one that allows multiple views of the group to exist concurrently. Given two views v and w, we say that: (i) w is an immediate successor of v iff there is a process for which w is the next view to be installed after v (in this case v and w are said to be consecutive); (ii) w is a successor of v iff there is a process that installed w after v; (iii) w and v are concurrent iff neither view is a successor of the other. Intuitively, concurrent views are views that are installed at different processes and constitute different perceptions of the group membership, typically as a result of partitions. Ideally, any pair of concurrent views should have an empty intersection, in order to present the intuitive notion of “clean” partitions to processes. Unfortunately, a membership service with such semantics (called strong partial [17]) may be blocking, that is, a failure (or a failure suspicion) occurring at an inopportune point of the membership protocol may delay a view installation until the failure recovers [5]. In this paper we shall assume quasi-strong partial semantics because it admits non-blocking implementations [4]. The salient properties of this semantics are: (R1) Two concurrent views may have non-empty intersection iff one is a proper subset of the other. (R2) Two concurrent views that merge into a common successor view have empty intersection. For instance, let w1…wk be a set of
concurrent views that have non-empty intersection. By definition of concurrent views, no process can install two or more of these views. If any of these views, say w1, has been installed by all of its members, then none of w2…wk has been installed by all of its members (R1). No pair of views in w1…wk will have an immediate successor in common (R2). Of particular importance to this paper is the fact that two or more views in w1…wk could contain a majority of the processes. However, it follows from the previous discussion that at most one of these views, say w1, may indeed have been installed by a majority, in which case processes that installed w2…wk will not be able to communicate with a majority. In other words, concurrent majority views can be (intuitively) seen as “transitory views” that have not been installed by all of their members and that will disappear soon without having performed any global activity among all of their members. We shall assume that a process that crashes and recovers maintains the same identifier that it had before the crash (i.e., it remains a group member until it leaves the group explicitly). To this end, each process stores its identifier in stable storage [1]. Let v be the last view delivered by a process p before its crash and let w be the first view delivered after its recovery. In order to extend the successor relation across failures of group members and across total failures, we shall say that w is an immediate successor of v. Moreover, we shall assume that comp(w) = {p}.
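As an aside for readers who want to experiment with these definitions, the following Python sketch (ours, not part of the specification in [4]; the per-process installation-history representation is an assumption) derives the successor and concurrency relations from local histories.

from itertools import combinations

# Each process is mapped to the ordered list of view identifiers it installed.
histories = {
    "p": ["v1", "v2", "v4"],
    "q": ["v1", "v3", "v4"],
    "r": ["v1", "v3", "v4"],
}

def is_successor(w, v):
    """w is a successor of v iff some process installed w after v."""
    return any(v in h and w in h and h.index(w) > h.index(v)
               for h in histories.values())

def concurrent(v, w):
    """v and w are concurrent iff neither is a successor of the other."""
    return v != w and not is_successor(v, w) and not is_successor(w, v)

if __name__ == "__main__":
    views = sorted({v for h in histories.values() for v in h})
    for v, w in combinations(views, 2):
        if concurrent(v, w):
            print(v, "and", w, "are concurrent")   # prints: v2 and v3 are concurrent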
3. Enriched view synchrony

Enriched view synchrony (EVS) is an extension of view synchrony that greatly simplifies the programmer's handling of recoveries and merging of partitions after repairs [3]. In this section we give a very informal outline of EVS; more details can be found in the cited paper, along with a detailed discussion of its advantages. The cited paper actually specifies EVS in the context of a strong partial membership service, but a similar specification can easily be obtained for a quasi-strong partial membership. A view as defined in view synchrony is a set of process names. This notion is replaced in EVS by the notion of enriched view (e-view for short), which augments the “flat” contents of a view with structural and historical information in the form of subviews and subview sets, sv-sets for short. Processes in a view are grouped into non-overlapping subviews. Subviews in a view are grouped into non-overlapping sv-sets (see Figure 1 and ignore ev5 and ev6 for the moment). Each view is constructed out of at least one subview and one sv-set. An e-view is thus a tree: the view is composed of sv-sets; an sv-set is composed of subviews; a subview is composed of
Figure 1: Example of e-view evolution. Circles indicate processes. Thick, dashed and thin frames denote views, sv-sets and subviews, respectively. Arrows indicate view changes and dashed arrows indicate e-view changes that are not view changes.
processes. The first e-view delivered to a process is such that the process appears in a subview and an sv-set by itself. Changes in either the composition of the e-view or its structure are delivered to processes in the form of vchg(ev) events, which replace the traditional notion of view change. Traditional view changes correspond to e-view changes where there is a change in the composition of the e-view. The successor relation can be defined for e-views similarly to views. The system attaches no meaning to subviews and sv-sets. It simply maintains the structuring information on behalf of applications. In particular, it guarantees the following Structure Property: given two consecutive views v and w, processes in their intersection that were in the same subview (sv-set) in v remain in the same subview (sv-set) also after the installation of w; moreover, processes that were not in the same subview or sv-set remain in different subviews or sv-sets. This property is illustrated in Figure 1 (e-views ev1, ev2, ev3, ev4). Even when the view membership remains unaltered, e-view change events may be provoked by applications requesting to merge subviews or sv-sets. The operation SV-SetMerge(sv-set-list) creates a new sv-set that is the union of the sv-sets given in sv-set-list. The operation SubviewMerge(sv-list) creates a new subview that is the union of the subviews given in sv-list. If the subviews in sv-list do not all belong to the same sv-set, the call has no effect; otherwise, the resulting subview belongs to the sv-set containing the input subviews. With reference to Figure 1, ev5 is installed as a result of an SV-SetMerge() whereas ev6 is installed as a result of a SubviewMerge(). This extended interface maintains the semantics of view synchrony regarding view changes and message deliveries. With respect to e-view changes, the interface guarantees the following additional properties, stated informally: (Total Order Property) e-view change events within a given view (i.e., between two consecutive view change events) are totally ordered by all processes in the view; (Causal Order Property) causality relations between message multicasts and e-view changes are preserved. Note that there are no ordering requirements between e-view changes (that are not view changes) and multicasts. What distinguishes subviews and sv-sets from views,
and renders them useful in partitionable systems, is the fact that their composition can grow only under application control, and not at arbitrary times. This feature is best exploited if applications are structured according to a simple methodology described in detail in [3]. In particular, an algorithm execution shall involve only processes in the same subview or sv-set. In this way, the initial set of participants may only shrink, even if the view expands while the algorithm is in progress.
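To make the e-view structure and the two merge operations concrete, here is a small Python sketch of the bookkeeping involved (the class and method names are ours; the sketch only mimics the tree that EVS maintains on behalf of applications, not its distributed implementation).

# Sketch of the e-view tree: a view is a set of sv-sets, an sv-set is a set of
# subviews, a subview is a set of process names.  The merge operations below
# only restructure the tree; they never change the view composition.

class EView:
    def __init__(self, svsets):
        # svsets: list of sv-sets, each sv-set being a list of subviews (frozensets)
        self.svsets = [list(svset) for svset in svsets]

    def composition(self):
        """The 'flat' view composition, i.e. comp(v)."""
        return set().union(*(sv for svset in self.svsets for sv in svset))

    def svset_merge(self, svset_indices):
        """SV-SetMerge: replace the listed sv-sets by their union."""
        merged = [sv for i in sorted(svset_indices) for sv in self.svsets[i]]
        self.svsets = [s for i, s in enumerate(self.svsets)
                       if i not in set(svset_indices)] + [merged]

    def subview_merge(self, sv_list):
        """SubviewMerge: union of the given subviews; no effect unless they
        all belong to the same sv-set, as in the EVS interface."""
        for svset in self.svsets:
            if all(sv in svset for sv in sv_list):
                merged = frozenset().union(*sv_list)
                svset[:] = [sv for sv in svset if sv not in sv_list] + [merged]
                return
        # subviews spread over different sv-sets: call has no effect

# Example loosely mirroring Figure 1: two singleton subviews in distinct
# sv-sets are first brought into one sv-set, then merged into one subview.
ev = EView([[frozenset({"p"})], [frozenset({"q"})]])
ev.svset_merge([0, 1])                                      # like ev5
ev.subview_merge([frozenset({"p"}), frozenset({"q"})])      # like ev6
print(ev.svsets)   # [[frozenset({'p', 'q'})]]  (element order may vary)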
4. Application design

The methodology proposed in this paper for selecting the primary view is summarized as follows:
• The selection rule and the application are structured in terms of subviews. The notion of primary view is replaced by the notion of primary subview.
• A process evaluates the selection rule for its current subview locally.
• While in the primary subview, a process can change the selection rule.
• Subviews in the same view select a common “consistent” selection rule and then merge into a single subview. Processes in the resulting subview decide (locally) whether they are in the primary subview or not.
• The group membership can be dynamic, and the primary subview is reconstructed automatically even if the group experiences a total failure.
The fact that the selection rule is evaluated locally is particularly important to applications. Consider a process p in the primary subview that is delivered a view change. If p's subview continues to be primary then no actions need be taken, except for the few local steps that evaluate the selection rule. We are not aware of any other protocol with this feature, which makes the overall scenario simple and modular and substantially simplifies the “porting” of applications designed for non-partitionable systems. Our methodology takes a “neutral” approach: the membership service delivers views and the selection rule is changed only when the application decides to change it. Other protocols attempt instead to “prolong” the lifetime of the primary view, perhaps by automatically changing the selection rule upon each view change [11,16]. In these protocols, if a view v contains a majority of the last primary view then v is the new primary view. A similar strategy is used in the replica control protocols in [14,2]: the only partition that can apply updates is the one containing a majority of the replicas that applied the last successful update. Clearly, these approaches do not necessarily lead to higher availability, because one could end up with views that are not primary even though they include “almost all” of the group members. Besides, the protocol in [11] requires that each process
p maintains, in stable storage, state information about all previous instances of the protocol whose outcome is not yet known to p. This information is exchanged with all participants in any subsequent execution of the protocol and is discarded by means of a dedicated garbage collection mechanism or when p becomes part of a primary view. In contrast, with our protocol, p maintains information only about the last instance not yet completed. This difference makes the proposed protocol simpler, reduces the amount of information that must be exchanged with the other processes and makes it simpler to transfer all pertinent data structures to stable storage by means of a single atomic action.
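Regarding the “single atomic action” by which a process transfers its data structures to stable storage, a standard way to obtain it on ordinary file systems is the write-to-temporary-file-then-rename idiom sketched below in Python (the file name and the JSON encoding are assumptions of ours; the paper does not prescribe any particular implementation of stable storage).

# Sketch: persisting a set of protocol variables as one atomic action using
# the write-temp-then-rename idiom (rename/replace is atomic on POSIX).

import json, os, tempfile

STATE_FILE = "member_state.json"   # assumed location, for illustration only

def save_state(state, path=STATE_FILE):
    """Atomically replace the stable copy of 'state' (a JSON-serializable dict)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())       # force the data to the storage device
    os.replace(tmp, path)          # readers see the old or the new state, never a mix

def load_state(path=STATE_FILE, default=None):
    """Return the last saved state, or 'default' if nothing was ever saved."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)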
5. Voting table and properties

Each group is associated with a voting table. The voting table has one entry for each process; each entry contains the identifier of the process and the vote associated with it. If a process is assigned vote zero, then its entry is not present in the table. In any voting table, no two disjoint sets of processes can both collect a majority of the votes, i.e., form a quorum. The voting table includes a version number that counts the number of updates applied to the table. The voting table and its version number are replicated at all processes. To this end, each process maintains four variables in stable storage:
• curr-vt: the voting table;
• vn: its version number;
• prop-vt: the last pending attempt to update the voting table, or a null value (if the process has no pending attempt recorded);
• update-id: a unique identifier of the last pending attempt to update the voting table, or a null identifier (if prop-vt is null); it may be implemented by a sufficiently long bit pattern.
Whenever these variables are updated, they are transferred to stable storage as a single atomic action. When a process joins the group, its variables are initialized as follows: curr-vt, prop-vt and update-id are null; vn is zero. A process may leave the group only when its curr-vt is null or assigns it vote zero. We leave unspecified the interface presented to applications in order to enforce this constraint. In order to bootstrap the system, we shall assume that an initial (non-null) voting table is defined whose version number is zero. Each process with an entry in this voting table joins the group with curr-vt equal to the initial voting table rather than with a null voting table. Each process p maintains in volatile storage a boolean variable called PRIMARY with the following interpretation: p performs the actions that are allowed only in the primary subview iff PRIMARY is true. The current value of PRIMARY is determined on the basis of p's local variables, as follows. Assume prop-vt = null (the case prop-vt ≠ null is described below): PRIMARY is true iff the composition of p's subview defines a quorum
according to curr-vt. Moreover, PRIMARY is false when curr-vt is null. Management of the voting table is split into two parts:
• The update algorithm updates the voting table upon requests issued by the application. Updating the voting table corresponds to changing the selection rule. A process can participate in this algorithm iff its PRIMARY variable is true.
• The propagation algorithm merges a set of subviews into a single subview. All processes in the resulting subview have an identical voting table, and this voting table is the “most recent” one among the voting tables of all participants. At the end, the value of PRIMARY is re-evaluated.
Along any given cut, the current voting table is the one with the highest version number among the voting tables of all group members. Unless stated otherwise, we shall always refer to the current voting table and we shall not specify the cut when it can be inferred from the context. A quorum subview is a subview that defines a quorum. A primary subview is a quorum subview that has been installed by all of its members (although a primary subview should be defined as a quorum subview that has been installed by a quorum of processes, our definition is simpler to handle and does not alter the essence of the reasoning). The main properties of the proposed methodology are:
(VT1) Let p and q be any two processes that deliver vchg(ev) and that are in the same subview in ev. The value of PRIMARY in ev at p and q is identical.
(VT2) If PRIMARY is true at process p in e-view ev, then the subview of p in ev is a quorum subview.
(VT3) Let p be a process whose subview sv in ev defines a quorum. PRIMARY at p is true in ev iff either prop-vt at p is null, or sv defines a quorum according to prop-vt.
(VT4) Let ev be any e-view that does not contain the primary subview. If every failure eventually recovers and the number of view changes is finite, then there exists an e-view ev', successor of ev, that contains the primary subview.
(VT5) Let p be a process whose view contains the primary subview and that does not belong to the primary subview. If the number of view changes is finite, eventually one of the following will hold: (i) p will belong to the primary subview and will have its PRIMARY variable true; or (ii) p will be delivered a view that does not contain the primary subview.
Let svi be a subview defined in e-view evi and let svj be a subview defined in evj (evi ≠ evj).
(VT6) If svi and svj are primary subviews then evi and evj are not concurrent.
(VT7) Let vni and vnj denote the version numbers of the voting table that is current upon the installation of evi and evj, respectively. If both svi and svj are primary and evj is a
successor of evi, then vni ≤ vnj. In short, the resulting scenario is as follows. Upon delivery of an e-view, each process decides locally whether its subview defines a quorum (by updating PRIMARY). All processes in the same subview take the same decision (VT1). Decisions are safe in the sense that if a process decides that its subview defines a quorum then its subview indeed defines a quorum (VT2). The opposite is not true, hence there might be quorum subviews that are not detected by their members (VT3). However, this scenario may happen only as a consequence of certain failures occurring during the update algorithm (see the next sections). In any case, the primary subview does not disappear indefinitely, even if the group experiences a total failure (VT4), and all processes have equal chances to belong to the primary subview (VT5). The set of primary subviews forms a sequence that is totally ordered by the version number of the voting table (VT6, VT7). Observe that VT6 guarantees that primary subviews form a totally ordered sequence, but it does not exclude the possibility of multiple quorum subviews existing concurrently. By reasoning similar to that in Section 2 about majority views, however, it can easily be concluded that: (i) this feature is a necessary consequence of the quasi-strong partial semantics of the membership service, not of the proposed algorithm; (ii) quorum subviews that are not primary are “transitory” subviews that will disappear soon without having performed any global activity with a majority; (iii) concurrent quorum subviews can exist only in concurrent e-views.
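As an illustration of how cheap the local evaluation of PRIMARY is, the following Python sketch (ours, not code from the paper) renders the rule of Section 5 and property VT3 under an assumed representation of voting tables: a map from process identifiers to votes, paired with a version number.

# Sketch: a voting table as (votes, version-number) and the local
# re-evaluation of PRIMARY performed upon each e-view delivery.

def quorum(processes, vt):
    """True iff the set of processes collects a majority of the votes in vt.
    Corresponds to the predicate written Quorum(s, vt) in Section 6.1."""
    if vt is None:
        return False
    votes, _vn = vt
    return sum(votes.get(p, 0) for p in processes) * 2 > sum(votes.values())

def evaluate_primary(my_subview, curr_vt, prop_vt):
    """Following Section 5 and VT3: quorum of curr-vt, and, if an update
    attempt is pending (prop-vt != null), also a quorum of prop-vt."""
    if not quorum(my_subview, curr_vt):
        return False
    return prop_vt is None or quorum(my_subview, prop_vt)

# Example: three equal votes; {p, q} is a quorum, {r} alone is not.
vt0 = ({"p": 1, "q": 1, "r": 1}, 0)
print(evaluate_primary({"p", "q"}, vt0, None))   # True
print(evaluate_primary({"r"}, vt0, None))        # False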
6. The Algorithm

6.1 Notation

We use pseudo-code with a Pascal-like syntax. Indentation levels implicitly delimit blocks. Execution is driven by events delivered by the underlying run-time support. Upon delivery of an event, an uninterruptible code segment is executed. The association between events and code segments is specified informally in the pseudo-code. The statement wait-for(condition) blocks until the specified condition becomes true; view changes delivered before its completion are handled by executing the corresponding handler. The statement Receive-SV(t,replies) blocks until the executing process p has received a message carrying the specified tag t from each member of p's subview. These messages are returned in the replies variable. View changes delivered while p is blocked are handled by executing the corresponding handler and by discarding messages that were sent by processes that have left p's subview. The statement Receive-SVSet(t,replies) is identical, except that it refers to the sv-set of the executing process. Predicate Quorum(s,vt) is true iff the set of processes s defines a quorum according to voting table vt. Function elect(s) returns the identifier of a process chosen deterministically from the set s. Function new-upid() returns an identifier that can be employed as the unique identifier of an attempted update. Identifiers MyID, MySV and MySVSet denote, respectively, the name, subview and sv-set of the executing process.

procedure Update(new-vt: voting-table);
    if not Check(new-vt) then return;
    msg := <ATTEMPTEDVT, new-vt, new-upid()>;
    multicast msg in MySV;
    Receive-SV(ACK, ack-set);
    multicast <COMMITTEDVT> in MySV;

procedure CompleteUpdate();
    coord-u := elect(MySV);
    send ACK to coord-u;
    if (MyID = coord-u) then
        Receive-SV(ACK, ack-set);
        multicast <COMMITTEDVT> in MySV;

Upon delivering msg with tag ATTEMPTEDVT:
    prop-vt := msg.new-vt;
    update-id := msg.upid;
    send ACK to coord-u;

Upon delivering msg with tag COMMITTEDVT:
    curr-vt := prop-vt;
    vn := vn + 1;
    prop-vt := null;
    update-id := null;

Upon view change:
    if (prop-vt ≠ null) then
        if not Quorum(comp(MySV), curr-vt) or not Quorum(comp(MySV), prop-vt) then
            PRIMARY := false;
            abort update algorithm;
        elseif (coord-u ∉ MySV) then
            CompleteUpdate();
    elseif (coord-u ∉ MySV) then
        coord-u := elect(MySV);

Figure 2: Update algorithm.
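For concreteness, plausible realizations of elect() and new-upid() are sketched below in Python; any deterministic choice and any sufficiently long identifier would do, so these particular choices are merely assumptions of ours.

# Sketch: possible realizations of two helpers from Section 6.1.

import uuid

def elect(processes):
    """Deterministic choice from a set of process identifiers: every member
    of the same subview/sv-set picks the same coordinator."""
    return min(processes)

def new_upid():
    """Unique identifier of an attempted update; a 128-bit random value plays
    the role of the 'sufficiently long bit pattern'."""
    return uuid.uuid4().hex

print(elect({"p3", "p1", "p2"}))   # p1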
6.2 Update algorithm

The update algorithm is run in a quorum subview. Update requests are directed to a dedicated process in the subview, denoted coord-u, that is elected by applying a deterministic function to the composition of the subview. Election is performed whenever multiple subviews merge (e.g., upon termination of the propagation algorithm) or when coord-u leaves the subview; in the latter case, pending requests are resubmitted to the new coord-u. Requests arriving at coord-u while an update is in progress are queued and processed in FIFO order. For brevity, we omit all the actions related to request handling as they do not provide any additional insight. The algorithm is given in Figure 2. Coord-u processes each request by invoking procedure Update(new-vt), which attempts to install new-vt and proceeds in two phases. Boolean function Check(vt) returns true iff both of the following constraints are satisfied (Switched(vt) denotes the set of processes that are given a zero vote in curr-vt and a non-zero vote in vt):
(C1) Quorum(comp(MySV), curr-vt) ∧ Quorum(comp(MySV), vt);
(C2) there is a set of processes disjoint from Switched(vt) that defines a quorum in vt.
As clarified later, constraints C1 and C2 allow recovering from failures occurring during the update algorithm, in particular from total failures. Intuitively, C1 is necessary because a quorum of curr-vt cannot decide whether it can complete a pending attempt vt unless it is also a quorum of vt (C1 appears in similar forms in virtually any algorithm for changing the notion of “primary partition” dynamically [11,16,14]); C2 simplifies the handling of groups with dynamic membership. View changes occurring while an update is in progress may only cause the set of participants to shrink (Structure Property of EVS). In this case, each process checks that C1 is still satisfied. Coord-u then discards ACKs sent by processes that have left (within the Receive-SV() statement) and either continues to wait or multicasts the related COMMITTEDVT. If coord-u has left the quorum subview, it is taken over by another process. The algorithm assumes that, at the beginning of its execution, any two participants have identical data structures. Due to view changes occurring during its execution, the algorithm may proceed along concurrent e-views and it will involve, in each e-view, only the members of a single (quorum) subview. The algorithm guarantees that, upon its termination, participants in the same subview still have identical data structures. The main property of the update algorithm is the following: let PropSet denote the set of participants that either installed new-vt or recorded the attempt in prop-vt; if any participant installed new-vt, then PropSet defines a quorum according to both curr-vt and new-vt.
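The constraints C1 and C2 behind Check() translate directly into quorum tests; the following Python sketch (ours, reusing the voting-table representation assumed earlier, i.e. a process-to-vote map paired with a version number) shows one way to write them.

# Sketch of Check(new-vt): constraints C1 and C2 of the update algorithm.

def quorum(processes, vt):
    if vt is None:
        return False
    votes, _vn = vt
    return sum(votes.get(p, 0) for p in processes) * 2 > sum(votes.values())

def check(my_subview, curr_vt, new_vt):
    # C1: the current subview must define a quorum of both curr-vt and new-vt.
    if not (quorum(my_subview, curr_vt) and quorum(my_subview, new_vt)):
        return False
    # Switched(new-vt): vote zero in curr-vt, non-zero vote in new-vt.
    curr_votes, _ = curr_vt
    new_votes, _ = new_vt
    switched = {p for p, v in new_votes.items()
                if v > 0 and curr_votes.get(p, 0) == 0}
    # C2: some set disjoint from Switched(new-vt) defines a quorum of new-vt.
    # Testing the set of all non-switched voters suffices: if that set is not
    # a quorum, no subset of it can be.
    return quorum(set(new_votes) - switched, new_vt)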
6.3 Propagation algorithm

When a process is delivered an e-view composed of multiple subviews, it starts participating in an instance of the propagation algorithm. The algorithm is driven by a dedicated process, denoted coord-p, that performs the following steps (Figure 3): (i) construct an sv-set encompassing all sv-sets composed of a single subview; (ii) wait for a COLLECTVT message from each participant; (iii) select the “most recent” voting table among the voting tables of the participants (procedure Analyze(), see below); (iv) multicast a PROPEND message containing the selected voting table and the decision about any pending attempts recorded; (v) merge all participants into a single subview. At the end of the algorithm, each participant re-evaluates its PRIMARY variable. If it has become true, coord-u is elected and, if necessary, the pending update is completed (procedure CompleteUpdate()).

procedure Propagation();
    participants := <the sv-sets of the current view composed of a single subview>;
    coord-p := elect(participants);
    if (MyID = coord-p) then
        SVSetMerge(participants);
    wait-for(creation of sv-set);
    CompletePropagation();

procedure CompletePropagation();
    send <COLLECTVT, curr-vt, vn, prop-vt, update-id> to coord-p;
    if (MyID = coord-p) then
        Receive-SVSet(COLLECTVT, DecisionRcvd);
        Analyze(DecisionRcvd);
        msg := <PROPEND, selected-vt, selected-vn, decision, try-vt, try-upid>;
        multicast msg in MySVSet;
        SubviewMerge(MySVSet);
    wait-for(sv-set merging and delivery of PROPEND);
    EndOfPropagation();

procedure Analyze(s: set of messages);
    selected-vt := <voting table with highest version number in s>;
    selected-vn := <version number of selected-vt>;
    decision := <CLEAR, PROPAGATE or WAIT, according to rules R1-R3>;
    if (decision = PROPAGATE) then
        try-vt := <the pending attempt ai to be propagated>;
        try-upid := <its update identifier>;

procedure EndOfPropagation();
    if (decision was WAIT) then
        PRIMARY := false;
        return;
    if not Quorum(comp(MySV), curr-vt) then
        PRIMARY := false;
        return;
    if (prop-vt = null) then
        PRIMARY := true;
        coord-u := elect(MySV);
    elseif Quorum(comp(MySV), prop-vt) then
        PRIMARY := true;
        CompleteUpdate();
    else
        PRIMARY := false;

Upon delivering msg with tag PROPEND:
    curr-vt := msg.selected-vt;
    vn := msg.selected-vn;
    if (msg.decision = PROPAGATE) then
        prop-vt := msg.try-vt;
        update-id := msg.try-upid;
    elseif (msg.decision = CLEAR) then
        prop-vt := null;
        update-id := null;

Upon view change:
    if (coord-p ∉ MySVSet) then
        if (sv-set has not been created yet) then
            restart Propagation();
        else
            coord-p := elect(MySVSet);
            CompletePropagation();

Figure 3: Propagation algorithm.

Sv-sets composed of multiple subviews are not collected at step (i) because they are already running an instance of the algorithm. Once these sv-sets have merged into a single subview, a further instance of the propagation algorithm will be run. If the algorithm is started (i.e., the sv-set is created) while an instance of the update algorithm is in progress, processes in the quorum subview do not send the COLLECTVT message until the update has terminated (this aspect is not included in the pseudo-code for simplicity). View changes occurring while the algorithm is in progress may cause the initial sv-set to split among concurrent e-views. In this case, each resulting subset will continue the algorithm in the respective e-view (sv-sets may only shrink because of the Structure Property of EVS). When coord-p leaves the sv-set, it is taken over by another process that will receive a further COLLECTVT from each participant. Let S denote the set of participants in the algorithm. Procedure Analyze() assigns selected-vt the voting table with the highest version number among the processes in S, and selected-vn the version number of selected-vt. These variables define the voting table and version number that will be installed by all participants upon delivery of the PROPEND message. Let a1...ak denote all the different pending attempts (i.e., pending attempts with different update-ids) recorded at participants with curr-vt = selected-vt. Analyze() takes one of the following decisions concerning the handling of these attempts: (CLEAR) clear prop-vt at all participants; (PROPAGATE) propagate an
attempt ai at all participants; (WAIT) leave all prop-vts unaltered. The decision is taken as follows. S is partitioned into k+2 disjoint subsets: S-1 contains the participants with vn < selected-vn; S0 contains the participants with curr-vt = selected-vt and prop-vt = null; Si (i ∈ [1,k]) contains the participants with curr-vt = selected-vt and prop-vt = ai. Any of these subsets can be empty. If S = S-1 ∪ S0 (i.e., there are no pending attempts recorded) then the decision is CLEAR. Otherwise, if ¬Quorum(S, selected-vt) then the decision is WAIT. Otherwise, let OutS denote the set of processes that are not in S:
(R1) ∀ ai: ¬Quorum(OutS ∪ Si, ai) ∨ (¬Quorum(OutS ∪ Si, selected-vt) ∧ Quorum(Si, ai)) ⇒ CLEAR;
(R2) ∃ ai: Quorum(Si, ai) ∧ Quorum(Si, selected-vt) ⇒ PROPAGATE (attempt ai);
(R3) otherwise ⇒ WAIT.
Intuitively, CLEAR is performed when it is guaranteed that none of the pending attempts has been installed by a process in OutS. PROPAGATE ai is performed when the current voting table is either selected-vt or ai: ai is the only attempt that might have been installed by a process pout in OutS, and if pout indeed installed ai, then it certainly has not installed a more recent voting table. WAIT is performed when neither CLEAR nor PROPAGATE is safe; in this case, at the end of the algorithm, PRIMARY is certainly set to false. Observe that it is not necessary to know the composition of OutS (which cannot be achieved); what is needed is only the ability to tell whether OutS ∪ Si is a quorum according to a specified voting table vt. This can be achieved by collecting the votes of all processes that are not in S and have a non-zero vote in vt.
The conditions that allow reconstructing the primary subview after a total failure can be summarized, very informally, as follows. S must define a quorum. If the total failure occurred while no instance of the update algorithm was in progress (i.e., S = S-1 ∪ S0, so that the decision is CLEAR), then a quorum is also sufficient. Otherwise, depending on the actual failure pattern, a quorum becomes sufficient if either it contains a quorum of identical pending attempts (R2), or it is possible to exclude the success of all known pending attempts (R1). The fact that R3 is not applied in every execution (Liveness I) can be derived from the following informal considerations. R3 means that the set of processes in OutS with a non-zero vote in a pending attempt ai is “too big” to be ignored. This set will always be too big only if some of its members have left the group, so that it is no longer possible to form a quorum of ai. Suppose p ∈ OutS left the group; then p has a zero vote in its voting table. If p installed ai, it is possible to collect a quorum of ai without p; otherwise, by constraint C2 of the update algorithm, there is certainly a set of processes that cannot leave the group and can form a quorum of ai.
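For readers who prefer code to prose, the decision rules R1-R3 can be transcribed as follows (a Python sketch of ours; it takes OutS as an explicit argument for simplicity, whereas the algorithm only needs the quorum tests over OutS ∪ Si discussed above).

# Sketch of the CLEAR / PROPAGATE / WAIT decision of Analyze(), transcribing
# rules R1-R3.  A voting table is again a (votes, version-number) pair.

def quorum(processes, vt):
    votes, _vn = vt
    return sum(votes.get(p, 0) for p in processes) * 2 > sum(votes.values())

def decide(s, out_s, selected_vt, s_minus1, s0, pending):
    """s: all participants; out_s: processes outside s; pending: list of
    (a_i, s_i) pairs, s_i being the participants that recorded attempt a_i
    with curr-vt = selected-vt."""
    if set(s) == set(s_minus1) | set(s0):          # no pending attempt recorded
        return ("CLEAR", None)
    if not quorum(s, selected_vt):
        return ("WAIT", None)
    # R1: every attempt a_i can be excluded.
    if all((not quorum(set(out_s) | set(s_i), a_i)) or
           (not quorum(set(out_s) | set(s_i), selected_vt) and quorum(s_i, a_i))
           for a_i, s_i in pending):
        return ("CLEAR", None)
    # R2: some attempt a_i such that S_i is a quorum of both a_i and selected-vt.
    for a_i, s_i in pending:
        if quorum(s_i, a_i) and quorum(s_i, selected_vt):
            return ("PROPAGATE", a_i)
    return ("WAIT", None)                          # R3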
7. Conclusions

We have presented a methodology and associated algorithms for establishing a primary partition in a partitionable asynchronous distributed system. With our methodology, based on the Enriched View Synchrony programming paradigm, the lower layers deliver views and the application applies the selection rule. This sharp layering makes it easy to support multiple applications with different and possibly conflicting notions of a primary partition on the same computing base. In particular, each application can define its own selection rule and modify it at run time according to its specific needs. Furthermore, the primary partition is reconstructed automatically after recovering from total failures, and the group membership can be dynamic.
Acknowledgments

This work has been funded by the Italian Ministry of University and Scientific and Technological Research (M.U.R.S.T. 60% and 40% funds).
References

[1] Y. Amir, D. Dolev, S. Kramer, D. Malki, “Transis: a communication sub-system for high availability”, Proc. 22nd Symposium on Fault-Tolerant Computing, July 1992, pp. 76-84.
[2] D. Davcev, W.A. Burkhard, “Consistency and recovery control for replicated files”, Proc. 10th ACM Symposium on Operating Systems Principles, December 1985, pp. 87-96.
[3] Ö. Babaoglu, A. Bartoli, G. Dini, “Enriched view synchrony: a programming paradigm for partitionable asynchronous systems”, IEEE Transactions on Computers, vol. 46, n. 6, June 1997, pp. 642-658.
[4] Ö. Babaoglu, R. Davoli, A. Montresor, “Group membership and view synchrony in partitionable asynchronous systems: specifications”, Technical Report UBLCS-95-18, Dept. of Computer Science, University of Bologna, November 1995 (revised September 1996).
[5] Ö. Babaoglu, R. Davoli, L. Giachini, P. Sabattini, “The inherent cost of strong-partial view synchronous communication”, in Distributed Algorithms (WDAG 9), Lecture Notes in Computer Science 972, October 1995, pp. 72-86.
[6] Ö. Babaoglu, R. Davoli, A. Montresor, R. Segala, “System support for partition-aware network applications”, Technical Report UBLCS-97-4, Dept. of Computer Science, University of Bologna, March 1997.
[7] K. Birman, “The process group approach to reliable distributed computing”, Communications of the ACM, vol. 36, n. 12, December 1993, pp. 36-53.
[8] K. Birman, “Virtual synchrony model”, in Reliable Distributed Computing with the Isis toolkit, IEEE CS Press, 1994.
[9] T. Chandra, S. Toueg, “Unreliable failure detectors for reliable distributed systems”, Journal of the ACM, vol. 43, n. 2, March 1996, pp. 225-267.
[10] T. Chandra, V. Hadzilacos, S. Toueg, B. Charron-Bost, “On the impossibility of group membership”, Proc. 15th ACM Symposium on Principles of Distributed Computing, May 1996, pp. 322-330.
[11] D. Dolev, I. Keidar, E. Lotem, “Dynamic voting for consistent primary components”, Technical Report CS96-7, Institute of Computer Science, The Hebrew University of Jerusalem, 1996.
[12] D. Dolev, D. Malki, R. Strong, “A framework for partitionable membership service”, Technical Report CS95-4, Institute of Computer Science, The Hebrew University of Jerusalem, 1995.
[13] B. Glade, K. Birman, R. Cooper, R. van Renesse, “Lightweight process groups in the Isis system”, Distributed Systems Engineering, July 1993.
[14] S. Jajodia, D. Mutchler, “Dynamic voting algorithms for maintaining the consistency of a replicated database”, ACM Transactions on Database Systems, vol. 15, n. 2, June 1990, pp. 230-280.
[15] F. Kaashoek, A. Tanenbaum, “Group communication in the Amoeba distributed operating system”, Proc. 12th IEEE International Conference on Distributed Computing Systems, May 1991, pp. 222-230.
[16] A. Ricciardi, K. Birman, “Consistent process membership in asynchronous environments”, in Reliable Distributed Computing with the Isis toolkit, IEEE CS Press, 1994.
[17] A. Schiper, A. Ricciardi, “Virtually-synchronous communication based on a weak failure suspector”, Proc. 23rd International Symposium on Fault-Tolerant Computing, June 1993, pp. 534-543.
[18] A. Schiper, A. Ricciardi, K. Birman, “Understanding partitions and the ‘no-partition’ assumption”, Proc. 4th IEEE Workshop on Future Trends of Distributed Systems, September 1993, pp. 354-360.
[19] A. Schiper, A. Sandoz, “Primary partition ‘virtually-synchronous communication’ harder than consensus”, in Distributed Algorithms (WDAG 8), Lecture Notes in Computer Science 857, October 1994, pp. 39-52.