A General Framework to Solve Agreement Problems
Michel Hurfin†, Raimundo Macêdo‡, Michel Raynal†, Frédéric Tronel†
† IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France
[email protected]
‡ LaSiD-CPD-UFBA, Campus de Ondina, CEP 40170-110 Bahia, Brazil
[email protected]
Abstract
Agreement problems are among the most important problems that designers of distributed systems have to cope with. One way to solve them is to first provide a solution to the Consensus problem and then to reduce each agreement problem to Consensus. This “run-time customizing” approach is particularly relevant when upper layer applications have to solve several distinct agreement problems. In this paper we investigate a “compile-time customizing” approach to automatically generate ad hoc agreement protocols. A general agreement framework, characterized by six “versatility” parameters, is defined. Appropriate instantiations of these parameters provide particular agreement protocols. This approach is particularly suited to generating efficient agreement protocols.
1 Introduction
Atomic Broadcast, Atomic Multicast and Weak Non-Blocking Atomic Commitment (NBAC) are typical examples of agreement problems encountered in the design and implementation of fault-tolerant distributed systems. An agreement problem involves a set of processes; it is characterized by the fact that these processes have to agree on a common value. For example, in the Atomic Broadcast problem, processes have to agree on a single delivery order for a set of messages. In the NBAC problem, processes have to agree on a single outcome for an operation (COMMIT or ABORT). Many systems solve each agreement problem with a specific protocol, independently of the protocols solving their other agreement problems.
It appears, however, that agreement problems can be perceived as instances of a more abstract problem, namely, the Consensus problem. In the Consensus problem, each process has an initial value that it proposes to the other processes, and all correct processes have to eventually agree on a common value (called the decision value) that has to be one of the proposed values. Several efforts have focused on making Consensus usable in distributed systems: they consider Consensus as a basic building block on top of which solutions to particular agreement problems can be designed (among others, solutions to Atomic Broadcast [3], Non-Blocking Atomic Commitment [7] and Atomic Multicast [6] have adopted such an approach). This approach has been exploited and deepened in [9], where the notion of a Consensus service is proposed to solve agreement problems. This service is implemented by a set of “Consensus server” processes. When they have to solve a particular agreement problem, clients interact with this service. The notion of a Consensus filter is used to provide a run-time customizing of the Consensus service for each agreement problem (a filter transforms messages received by a server into an initial value for the underlying Consensus protocol).
Basically, the principle used in this approach consists in considering the Consensus problem as a black box, and consequently in reducing each agreement problem to the Consensus problem. Such a reduction proceeds in two phases. There is first a preliminary exchange phase during which each process sends its initial value to the others (for example, in the NBAC problem, each process sends its YES/NO vote to the other processes). Then each correct process, according to the values it has received and to the failures it suspects, constructs a value that constitutes its local view of the computation and consequently defines the initial value it proposes to Consensus (to continue the classical NBAC example, if a process has received a YES vote from every other process, it proposes COMMIT to the Consensus; if it has received a NO vote, or if it has not received a vote from some process and suspects it of having crashed, it proposes ABORT to the Consensus). The second phase consists in executing a Consensus protocol, which imposes a single decision value on all processes. The main advantage of this approach lies in the fact that a single Consensus service suffices to solve different agreement problems.
In this paper, we investigate a compile-time customizing approach that automatically generates ad hoc agreement protocols. More precisely, we propose a general framework suited to generating agreement protocols. This framework is characterized by six versatility parameters. A protocol solving a particular agreement problem is obtained by instantiating the general framework with appropriate values. Next, the resulting protocol is installed as a basic component within the corresponding system. Note that the resulting protocol is specialized with respect to the corresponding agreement problem.
Interestingly, this approach provides an original insight into the deep structure of agreement problems and into their relations. Previous works have already focused on particular extensions of consensus protocols (e.g., decision on a vector of values [10], multiple propositions for an initial value [1, 2], early decision [8], deferring of initial value proposals [4]). But, to our knowledge, this paper presents the first proposal that exhibits a single framework featuring all these extensions. Obtaining such an agreement framework relies on the definition of a small set of well-identified versatility parameters. It is important to note that these parameters are tightly related and cannot be defined independently. So, a clear and simple set of rules has been attached to these parameters to guarantee that the particular instantiations of the framework work correctly.
The paper is composed of six sections. Section 2 introduces the runtime model and agreement problems. Section 3 presents the idea of the framework. Then, Section 4 presents the skeleton of the General Agreement Framework (GAF), presents its “versatility” parameters and provides examples of instantiations. Section 5 discusses the constraints that the framework parameters have to satisfy so that GAF yields consistent agreement protocols. Finally, Section 6 concludes the paper.
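As a rough illustration of the first phase of the black-box reduction described in the Introduction for the NBAC example, the following Python sketch builds the value a process would propose to Consensus from the votes it has collected and the processes it currently suspects. The function name nbac_proposal, the dictionary-based vote representation and the string constants are assumptions made for this example only; they are not taken from the paper.

```python
from typing import Dict, Optional, Set

YES, NO = "YES", "NO"
COMMIT, ABORT = "COMMIT", "ABORT"


def nbac_proposal(processes: Set[str],
                  votes: Dict[str, str],
                  suspected: Set[str]) -> Optional[str]:
    """Return the value to propose to Consensus, or None to keep waiting.

    `votes` maps each process whose vote has been received (own vote
    included) to YES or NO; `suspected` is the current suspicion list.
    """
    # A single NO vote is enough to propose ABORT.
    if any(v == NO for v in votes.values()):
        return ABORT
    missing = processes - set(votes)
    # A YES vote from everybody: propose COMMIT.
    if not missing:
        return COMMIT
    # A vote is missing and its sender is suspected of having crashed.
    if missing & suspected:
        return ABORT
    # Votes are missing but nobody is suspected yet: no proposal for now.
    return None
```

The value returned here would then be handed to the Consensus protocol in the second phase; a caller is expected to re-evaluate the function whenever a vote arrives or the suspicion list changes.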
2 Runtime Model and Agreement Problems
2.1 Asynchronous Distributed Systems
A distributed system is composed of a finite set of n sites interconnected through a communication network. Each site has a local memory (and possibly a stable storage, according to the needs of applications). Processes synchronize and communicate by exchanging messages through channels of the underlying network. We consider asynchronous distributed systems: there are bounds neither on communication delays nor on process speeds.
2.2 Process Failure Model
We consider a failure model in which processes may crash silently. A non-crashed process behaves according to its specification. When a process crashes, it stops working. A process can be correct or faulty. By definition, a correct process is a process that does not crash during the execution of the (upper layer) application. Otherwise it is faulty.
In asynchronous distributed systems with process crash failures, the main difficulty lies in the impossibility of safely distinguishing a crashed process from a very slow process or from a process with which communications are very slow. Chandra and Toueg have introduced the concept of unreliable failure detectors and have exhibited a set of properties these detectors must satisfy so that reliable distributed systems can be built on top of them [3]. Due to asynchrony, a failure detector can make mistakes either by not suspecting a crashed process or by erroneously suspecting a correct one. According to their behaviors, several families of failure detectors can be defined. These definitions rely on the notions of completeness and accuracy. More precisely, completeness requires that every crashed process is eventually suspected. Accuracy restricts the mistakes (erroneous suspicions) that a failure detector can make. In the following we consider the family of failure detectors (denoted ◊S) that satisfies the following properties: (1) Strong completeness: eventually, every crashed process is permanently suspected by every correct process. (2) Eventual weak accuracy: there is a time after which some correct process is not suspected by any correct process.
Note that completeness can be realized by using “I am alive” messages and timeouts. On the other hand, even if accuracy is satisfied by some system executions, there is no way to guarantee that it will be satisfied by all system executions. This observation shows the limit of asynchronous systems subject to process crashes, as far as crash detection is concerned. There is no means of ensuring safe process crash detection. Such a detection can be at best approximate.
2.3 Consensus Problem
The Consensus problem [3, 5] is defined in terms of two primitives called propose and decide. Initially, each process p_i selects a value v_i among a set of possible values and invokes the primitive propose with this value as a parameter (we say “p_i proposes v_i”). A process ends its participation in Consensus by executing decide(v) (we say “p_i decides the value v”).
Many theoretical works have been devoted to the Consensus problem¹. The most famous one is due to Fischer, Lynch and Paterson [5], who proved that Consensus has no deterministic solution in asynchronous distributed systems subject to (even a single) process crash failure. To overcome this impossibility result, Chandra and Toueg have shown that Consensus can be solved in distributed systems equipped with unreliable failure detectors when those satisfy some completeness and accuracy properties [3]. They (with Hadzilacos) have shown that the class ◊S of failure detectors is the weakest one that allows Consensus to be solved. Several protocols based on ◊S have been designed to solve the Consensus problem. The first to have been proposed (called CT in the following) is due to Chandra and Toueg [3]. It requires reliable channels and a majority of correct processes.
¹ We consider the definition of the Uniform Consensus problem.
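The remark in Section 2.2 that completeness can be realized with “I am alive” messages and timeouts can be sketched as follows. This is only a minimal illustration, not the failure detector assumed by the paper: the class name HeartbeatDetector, its methods, and the choice of passing the current time explicitly are all assumptions made for the example.

```python
from typing import Dict, Set


class HeartbeatDetector:
    """Timeout-based suspicion list (illustrative only)."""

    def __init__(self, processes: Set[str], timeout: float, now: float = 0.0) -> None:
        self.timeout = timeout
        # Time at which the last "I am alive" message of each process arrived.
        self.last_seen: Dict[str, float] = {p: now for p in processes}

    def on_alive(self, sender: str, now: float) -> None:
        """Record an "I am alive" message received from `sender`."""
        self.last_seen[sender] = now

    def suspected(self, now: float) -> Set[str]:
        """Processes that have been silent for longer than the timeout."""
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}
```

A crashed process stops sending messages and therefore ends up permanently suspected, which matches the completeness requirement. A slow but correct process can also be suspected and later cleared when one of its messages finally arrives, which is exactly the kind of mistake accuracy is meant to restrict and that no timeout value can rule out in an asynchronous system.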
2.4 Non-Blocking Atomic Commitment Problem
The NBAC problem originated from databases, more precisely from transactions. In a distributed system, a transaction usually involves several participant sites (i.e., several processes). At the end of a transaction, its participants are required to enter a commitment protocol in order to commit it (when enough things went well) or to abort it (when too many things went wrong). Each participant votes YES or NO. If for any reason (deadlock, storage problem, concurrency control conflict, etc.) a participant cannot locally commit the transaction, it votes NO. Otherwise it votes YES, which means that the participant commits itself to making its updates permanent if it is required to do so. Then, the decision to commit or to abort is determined. The decision must be COMMIT if enough participants (usually all) voted YES. It must be ABORT if too many participants (usually one) voted NO. More precisely, if a participant decides COMMIT, then at least x participants have voted YES (x = n characterizes the classical all-or-nothing NBAC problem, while x = ⌈(n + 1)/2⌉ characterizes the majority NBAC problem). On the other hand, the trivial solution, where the decision value is always ABORT, has to be eliminated by ensuring that if x participants vote YES and do not suspect each other, then the decision value must be COMMIT².
² This definition is the one used by Guerraoui in [7], and called weak non-blocking atomic commitment.
2.5 Atomic Broadcast Problem
Here we are concerned with a communication primitive which is actually an agreement problem, namely, Atomic Broadcast (also called Ordered Reliable Broadcast). Informally, processes broadcast messages and they have to deliver them in the same order. So, this problem is characterized by an agreement on a single message delivery order.
3 The Idea of the Framework
Several variants of Consensus protocols have been introduced in the recent past to get more appropriate solutions to particular agreement problems. Basically, each of these variants modifies CT. We list here some of these variants.
Possibility for a process to change the value it has previously proposed. The possibility of changing the value proposed by a process during a Consensus execution has been investigated (for the first time to our knowledge) in the context of Atomic Broadcast [1]. The Consensus-based solution to Atomic Broadcast described in [3] uses repeated executions of Consensus. The k-th execution of Consensus is used to define the k-th batch of messages that processes have to deliver in the same order. The initial value proposed by a process to a Consensus execution is a sequence of messages this process has received but not yet delivered. As processes may continuously broadcast messages, a sequence of Consensus executions is necessary. This approach has been improved in the following way [1]. During a Consensus execution, a process that has already proposed a set of messages can later “increase” this set of messages in its future proposals. This allows a Consensus execution to decide on a larger sequence of messages to be delivered. To attain this goal, the underlying Consensus protocol (namely CT) is actually modified. The new building block (called Prefix Agreement) is specific to the needs of Atomic Broadcast.
Possibility for a process to defer the proposal of its initial value. Several proposals have modified CT to allow some processes to participate in a Consensus without being obliged to initially propose a value. According to the particular agreement problem and to the actual failure pattern, a process can be required to subsequently provide a proposal. Such modified Consensus protocols have been used to solve agreement problems in a mobile computing environment [2], and to implement a variant of passive replication (called semi-passive replication [4]).
Possibility for a process to gather several proposed values. The CT protocol has been modified to allow processes to internally gather and agree on a set of values rather than on a single value [10]. This modification is obtained by embedding a function within the Consensus protocol itself. This function maps each set of possibly collected estimate values to a corresponding output value. This is why this modification requires defining a set of input values and a set of output values.
Possibility for a process to decide early. In some agreement problems a decision can be made as soon as some particular value(s) has (have) been proposed. This is the case of the (all-or-nothing) NBAC problem: as soon as a vote NO has been issued, the decision is ABORT whatever the other votes are. So, a vote NO allows an early decision. The protocol presented in [8] modifies CT to allow such a mechanism, which increases time efficiency.
Each of these proposals considers a particular issue and appropriately modifies CT to solve it. This paper proposes a more general approach. It identifies several “versatility” parameters and provides a General Agreement Framework (GAF) that takes into account all these parameters. Then, by customizing these parameters, particular agreement protocols can be obtained. The skeleton of GAF is the CT protocol. So, all the protocols obtained by instantiating GAF assume reliable channels, a majority of correct processes and a failure detector of the class ◊S.
It is important to note that GAF is not obtained by simply “aggregating” versatility-providing ingredients. The parameters are related through a set of rules. These rules guarantee sound instantiations of the framework. As an example, let us consider the possibility for a process to change its mind: after having proposed a value v1, it wants to propose a value v2. This possibility actually requires that the set of possible proposed values forms a lattice and, for consistency reasons (see Section 5), the sequence of values successively proposed by a process must be an increasing sequence in this lattice. Informally, this means that a process can only “increase the significance” of its proposal. In the same way, the set of possible decided values has to form a lattice. Actually, the design of GAF relies on a sound formal basis which defines the constraints on parameters that have to be satisfied in order to obtain a consistent agreement problem. These points are illustrated with examples and developed in the next sections.
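The requirement that successive proposals form an increasing sequence in a lattice can be illustrated with a small Python sketch. It assumes, purely for the example, that proposed values are sets of message identifiers ordered by inclusion (a lattice whose join is set union), in the spirit of the Atomic Broadcast variant where a process “increases” its set of messages. The names less_or_equal and is_increasing are hypothetical.

```python
from typing import FrozenSet, Iterable

Proposal = FrozenSet[str]  # a set of message identifiers (assumed representation)


def less_or_equal(a: Proposal, b: Proposal) -> bool:
    """Order of the lattice: a is at most as significant as b when a is included in b."""
    return a <= b


def is_increasing(proposals: Iterable[Proposal]) -> bool:
    """Check that each proposal is at least as significant as the previous one."""
    seq = list(proposals)
    return all(less_or_equal(x, y) for x, y in zip(seq, seq[1:]))
```

For instance, proposing {m1} and then {m1, m2} is allowed, whereas proposing {m1} and then {m3} is not, because the two sets are incomparable: the second proposal replaces information instead of adding to it.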
4 Description of the Framework
4.1 The Framework
The general framework is described in Figure 1. It is composed of a protocol description that uses six functions. Those functions have to be appropriately instantiated to get a particular agreement protocol. They are GET, ⪯, ACCEPTABLE, F, EXCUSED, and EARLY. We first provide the protocol description. Then, we present the functions.
4.1.1 Skeleton of the Framework
As previously indicated, the framework skeleton is CT. Indeed, the framework generates CT-like protocols: each of them is based on the rotating coordinator paradigm and proceeds in consecutive asynchronous rounds until a decision is reached (execution of the return statement at line 16 or 38). At a given time, the value of the variable r_i is equal to p_i’s current round number (this variable is modified at lines 1 and 4). Each round is coordinated by a predetermined process that tries to impose a decision value. When considering a round r, the “current” coordinator is the process p_c such that c = (r mod n) + 1 (line 4). Each process p_i manages a local variable est_i that represents its current estimate of the final decision value. A timestamp ts_i is associated with this value. When the protocol starts, est_i, initialized to ⊥, is timestamped 0 (line 1). This value is updated as the protocol progresses and converges to the final decision value.
During a round r, the cooperation between processes is based on a centralized communication scheme: each message (except the DECISION messages) is either sent to or received from the coordinator. Moreover, a message (except DECISION) sent during a round r can only be taken into account (lines 17, 23 and 34) by a process currently executing the same round. A round spans several while loop executions (lines 2-48). Accordingly, the variable new_round_i is used to indicate that a new round has to be started. Each round is divided into four phases (those are the four phases of CT).
Phase 1: In the first phase (lines 3-12), each process sends to the current coordinator its own estimate of the final value (line 12). This message, called ESTIMATE, contains four values: the identity of the sender (p_i), its current round number (r_i), its current estimate value (est_i) and the associated timestamp (ts_i). The boolean phase1_begin_i is used by p_i to know if it has to (re)start phase 1. As in CT, this phase is always executed at the beginning of a round: the boolean phase1_begin_i is set to true at line 4 when a new round is launched. This phase is also executed each time a new value proposed by p_i can be taken into account: the boolean phase1_begin_i is set to true at line 10 when the upper layer application is allowed to provide a more significant input value (see below the descriptions of both the function GET (line 9) and the order relation ⪯ (line 10)).
Phase 2: Since all the ESTIMATE messages are exclusively sent to the current coordinator, the second phase (lines 23-33) of a round is executed only by the coordinator. The boolean phase2_end_i is used by the coordinator p_i to know if it has already completed this phase (line 24). This variable is initialized to false when a new round, coordinated by p_i, starts (line 6). It is set to true when the coordinator ends the second phase by broadcasting a new estimate value (line 32). During this phase, the coordinator p_i first gathers the estimates sent by processes during the first phase. Each time the current coordinator p_i receives an ESTIMATE message from a process, it executes three main actions:
(a) p_i adds the identity of the sender to the set received_from_i. This set contains the identities of the processes from which p_i has received an estimate during the current round (lines 6 and 25). Thanks to this variable, the coordinator knows if a majority of estimates has been collected (|received_from_i| ≥ ⌈(n + 1)/2⌉) and, moreover, it knows the identities of the processes from which it has not yet received an ESTIMATE message.
(b) Then the coordinator updates the value of the variable new_est_i, called herein its new estimate. If it turns out that no other estimate messages have to be gathered, the new estimate will be proposed to all the processes (line 32). The new estimate is either selected among the received estimates (line 26) or computed by applying the function F to the set of gathered information (line 28).
Framework general agreement
begin
(1)  r_i ← 0; new_round_i ← true; est_i ← ⊥; ts_i ← 0; est_from_i ← [⊥, ⊥, ..., ⊥];
(2)  while (true) do   % The loop is from line 2 until line 48 %
(3)    if (new_round_i)   % Initialize the round variables of p_i %
(4)    then new_round_i ← false; r_i ← r_i + 1; c ← (r_i mod n) + 1; phase1_begin_i ← true;
(5)         if (i = c)   % Initialize the round coordinator variables %
(6)         then received_from_i ← ∅; tsm_i ← 0; phase2_end_i ← false; accept_i ← ∅; reject_i ← ∅
(7)    endif endif;
(8)    if (ts_i = 0)   % The value proposed by the upper layer application can be changed %
(9)    then est_from_i[i] ← GET();   % Get a new proposal %
(10)        if (est_i ≺ est_from_i[i]) then est_i ← est_from_i[i]; phase1_begin_i ← true endif
(11)   endif;
(12)   if (phase1_begin_i) then send(ESTIMATE<p_i, r_i, est_i, ts_i>) to p_c; phase1_begin_i ← false endif;
(13)   if a message m (as defined below) has been received
(14)   then case m of
(15)     m = DECISION<j, est>   % m is from any p_j %
(16)       send(DECISION<i, est>) to all except {p_i, p_j}; return(est)
(17)     m = NEW_ESTIMATE<c, r, new_est> such that r = r_i   % m is from p_c %
(18)       if (ACCEPTABLE(new_est))
(19)       then est_i ← new_est; ts_i ← r_i; send(VOTE<i, r_i, ack>) to p_c   % p_i accepts new_est %
(20)       else send(VOTE<i, r_i, nack>) to p_c   % p_i refuses new_est %
(21)       endif;
(22)       if (i ≠ c) then new_round_i ← true endif
(23)     m = ESTIMATE<j, r, est, ts> such that r = r_i   % m is from any p_j to p_c (i = c) %
(24)       if not(phase2_end_i)
(25)       then received_from_i ← received_from_i ∪ {j};
(26)            if (tsm_i < ts) then tsm_i ← ts; new_est_i ← est endif;
(27)            if ((tsm_i = 0) and not(est ≺ est_from_i[j]))
(28)            then est_from_i[j] ← est; new_est_i ← F(est_from_i) endif;
(29)            if ( ( (|received_from_i| ≥ ⌈(n + 1)/2⌉) and (∀ p_j : j ∈ received_from_i or EXCUSED(j)) )
(30)                 or
(31)                 ( EARLY(new_est_i) and (tsm_i = 0) ) )
(32)            then send(NEW_ESTIMATE<i, r_i, new_est_i>) to all; phase2_end_i ← true
(33)       endif endif;
(34)     m = VOTE<j, r, answer> such that r = r_i   % m is from any p_j to p_c (i = c) %
(35)       if (answer = ack)
(36)       then accept_i ← accept_i ∪ {j};   % The coordinator p_i counts the positive acknowledgments %
(37)            if ( |accept_i| = ⌈(n + 1)/2⌉ )
(38)            then send(DECISION<i, est_i>) to all except {p_i}; return(est_i)
(39)            endif
(40)       else reject_i ← reject_i ∪ {j}   % The coordinator p_i counts the rejections %
(41)       endif;
(42)       if ( |accept_i ∪ reject_i| = ⌈(n + 1)/2⌉ ) then new_round_i ← true endif   % Deadlock prevention %
(43)   endcase
(44)   endif;
(45)   if ( not(new_round_i) and (i ≠ c) and (p_c ∈ suspected_i) )
(46)   then send(VOTE<p_i, r_i, nack>) to p_c; new_round_i ← true
(47)   endif
(48) enddo
end
Figure 1. A General Agreement Framework
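To make the coordinator's role in Figure 1 more concrete, the following Python sketch shows the rotating coordinator rule of line 4 and a batch version of the new-estimate selection of lines 26-28. It is a simplification under stated assumptions: the figure updates new_est_i incrementally as each ESTIMATE message arrives and also applies the order test of line 27, whereas this sketch processes all received estimates at once and omits that test. The names coordinator, new_estimate and first_non_bottom, and the use of None for the default value ⊥, are choices made for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Value = Optional[str]  # None stands for the default value ⊥


def coordinator(r: int, n: int) -> int:
    """Index (in 1..n) of the coordinator of round r, as computed at line 4."""
    return (r % n) + 1


def new_estimate(n: int,
                 received: Dict[int, Tuple[Value, int]],
                 f: Callable[[List[Value]], Value]) -> Value:
    """Compute the coordinator's new estimate from the (value, timestamp)
    pairs received in the current round; `received` must not be empty."""
    tsm = max(ts for _, ts in received.values())
    if tsm > 0:
        # Some estimate was adopted in an earlier round: keep one that
        # carries the greatest timestamp (line 26).
        return next(v for v, ts in received.values() if ts == tsm)
    # Only zero-timestamped estimates: build est_from, with ⊥ for the
    # processes that sent nothing, and apply F to it (line 28).
    est_from = [received[j][0] if j in received else None for j in range(1, n + 1)]
    return f(est_from)


def first_non_bottom(est_from: List[Value]) -> Value:
    """One possible F: select a value different from ⊥."""
    return next(v for v in est_from if v is not None)
```

With three processes, round 1 is coordinated by p_2 and round 3 by p_1. When one received estimate carries a non-zero timestamp it takes precedence over F, which is how the skeleton preserves a value that may already be locked.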
More precisely, p_i keeps track (by means of the variable tsm_i) of the maximal timestamp received during the current round (lines 6 and 26). If the coordinator has received at least one estimate whose timestamp is greater than zero, the value of new_est_i is set to an estimate whose timestamp is the greatest one (line 26). Otherwise, all the estimates it has received were timestamped zero, and they have been saved in an array est_from_i (line 28). In that case the value of new_est_i is the result returned by applying the function F to the array est_from_i (line 28, see the description of the parameter F below). Note that, for any k, the value of the variable est_from_i[k] is the most significant zero-timestamped estimate³ which has been sent by p_k and received by p_i, or the default value ⊥ (line 1) if no value has been received.
(c) The gathering of estimate values ends when enough estimates have been received by the current coordinator (lines 29-31). The coordinator is assured of receiving at least a majority of estimates, because a majority of processes is correct by assumption. Yet the gathering can possibly end earlier (line 31, see below the description of the parameter EARLY). Depending on the problem to solve, it can be necessary to collect more values (line 29, see below the description of the parameter EXCUSED). Then, the coordinator proposes its new estimate by sending it to all processes (line 32). The message NEW_ESTIMATE broadcast by the coordinator contains three fields: the identity of the sender (p_i = p_c), its current round number (r_i) and the new estimate (new_est_i).
Phase 3: In the third phase (lines 17-22 and lines 45-47) each process p_i waits for the receipt of a new estimate from the coordinator. Either p_i suspects the coordinator of having crashed (line 45), or p_i receives the new estimate (line 17). In the former case, the process sends a negative acknowledgment “NACK” to the coordinator (line 46). In the latter case, it either refuses the new estimate (by sending a negative acknowledgment “NACK” to the coordinator, line 20; see below the description of the function ACCEPTABLE) or adopts it (by sending a positive acknowledgment “ACK”, line 19). If the new estimate is adopted, p_i updates the timestamp associated with its current estimate to the current value of the round counter (line 19). A message VOTE sent to the coordinator (lines 19, 20 or 46) contains three fields: the identity of the sender (p_i), its current round number (r_i) and a positive or negative acknowledgment (ACK or NACK).
Phase 4: The fourth phase (lines 34-42) is performed only by the coordinator. It waits for a majority of acknowledgment messages. The variables accept_i and reject_i are used only by p_i when it is the current coordinator. Initialized to ∅ at the beginning of a round (line 6), these sets contain process identities: accept_i (resp. reject_i) contains the processes that have sent a positive acknowledgment ACK (resp. a negative acknowledgment NACK) to the current coordinator, namely p_i. If the current coordinator receives positive acknowledgments from a majority of processes (line 37), it reliably broadcasts a message DECISION which contains the decision value (lines 38 and 16). Reliable Broadcast guarantees that (1) all correct processes deliver the same set of messages, (2) all messages broadcast by correct processes are delivered, and (3) no spurious messages are ever delivered [3]. Otherwise, the current coordinator p_i proceeds to the next round (line 42).
As previously indicated, this framework extends CT. More precisely, the original CT algorithm can be obtained by considering the instantiation defined in Table 1 (see also Section 4.2). So, the framework inherits the locking property from CT. This property is the following one: as soon as a majority of processes have positively acknowledged a new estimate new_est sent by the current coordinator, no other value can be decided. This means that henceforth the current value of new_est will be the decided value: this value is locked. Whatever the round during which a process decides, the locked value is the decided value.
Parameter     Description of the function
GET           Return the initial value v_i proposed by p_i
⪯             ∀ v_i: ⊥ ⪯ v_i
F             Select (in the array est_from) a value ≠ ⊥
ACCEPTABLE    Always return true
EXCUSED       Always return true
EARLY         Always return false
Table 1. Solving the Consensus Problem
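The instantiation of Table 1 can be written down as a bundle of six functions. The Python sketch below is one possible rendering, not the authors' code: the Parameters dataclass, its field names and the factory consensus_parameters are hypothetical, None stands for ⊥, the order is rendered in its strict form (⊥ strictly below every proposed value, with distinct proposed values incomparable), and “select a value ≠ ⊥” is arbitrarily implemented as “select the first one”.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Value = Optional[str]  # None stands for the default value ⊥


@dataclass(frozen=True)
class Parameters:
    """The six versatility parameters of the framework (hypothetical container)."""
    get: Callable[[], Value]
    precedes: Callable[[Value, Value], bool]
    f: Callable[[List[Value]], Value]
    acceptable: Callable[[Value], bool]
    excused: Callable[[int], bool]
    early: Callable[[Value], bool]


def consensus_parameters(initial_value: str) -> Parameters:
    """Instantiation of Table 1 for a process whose initial value is given."""
    return Parameters(
        get=lambda: initial_value,                          # always the initial value
        precedes=lambda a, b: a is None and b is not None,  # only ⊥ is below a value
        f=lambda est_from: next(v for v in est_from if v is not None),
        acceptable=lambda v: True,
        excused=lambda j: True,
        early=lambda v: False,
    )
```

Because GET always returns the same value and two distinct proposed values are never comparable, a process can never replace its proposal, which corresponds to the behavior of the original CT protocol.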
4.1.2 Parameters of the Framework
As indicated in the previous sections, GAF provides six degrees of freedom, each of them being defined by a function which has to be appropriately instantiated to solve a given agreement problem.
The function GET: A process p_i starts with a default proposed value, namely, ⊥. Then, the protocol repeatedly asks the upper layer application to provide new proposed values: this is the role played by the function GET called at line 9. We assume that a value returned by this function necessarily belongs to a predefined set of input values whose definition is intrinsically related to a given agreement problem. The provided value becomes the current estimate of process p_i. Note that the function GET is called at least once. Nothing ensures that the input value est_from_k[i] taken into account by a coordinator p_k when it computes the decision value (line 28) is equal to the last value proposed by p_i. Due to the asynchrony of communications, est_from_k[i] can be equal to the default value ⊥ or to any input value previously proposed by p_i. As the protocol progresses, there comes a point after which processes are no longer allowed to change their proposals. This occurs as soon as ts_i becomes different from 0 (line 8): then the protocol stops calling GET. Intuitively, when the timestamp ts_i is greater than zero, the current estimate value est_i is not a value received directly from the upper layer application (at line 9) but a value adopted by p_i at line 19. Henceforth, to ensure the locking property, the only updates (of the estimate value) that are allowed are the ones that can be performed by the skeleton CT.
The function ⪯: When the function GET is called twice or more, the upper layer application can either propose the same value or provide a more significant one. The function ⪯ expresses the fact that some values are more significant than others. As indicated in Section 5.1, this function must be a partial order relation on the input values.
³ Since a process can only send more and more significant values (line 10), the test not(est ≺ est_from_i[j]) performed at line 27 is useless when channels are FIFO.
The function F: The computation of a decision value is done at line 28. If enough values have been gathered (line 29), the result is broadcast to all the processes. The decided value is the result returned by a deterministic function F applied to an array including: (1) the values that have been proposed by processes, plus (2) a ⊥ value for each process that has not proposed a value (see line 1). The value returned by F belongs to a predefined set of possible output results. Note that the set of input values and the set of output values are not necessarily equal. The data structures used in the framework have the following properties:
- If est_from_i[k] = v, v is necessarily an input value.
- If new_est_i = v, v is necessarily an output value.
- If est_i has been set to v, v is an input value if and only if ts_i = 0. Otherwise, it is an output value.
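The distinction between input values and output values can be illustrated with a function F for an NBAC-like problem, where inputs are votes and outputs are outcomes. The sketch below is only one plausible instantiation written for this illustration; it is not claimed to be the one used by the authors. The name f_nbac, the threshold parameter x (borrowed from Section 2.4) and the use of None for ⊥ are assumptions.

```python
from typing import List, Optional

YES, NO = "YES", "NO"              # input values (votes)
COMMIT, ABORT = "COMMIT", "ABORT"  # output values (outcomes)

Vote = Optional[str]  # None stands for ⊥ (no vote received from that process)


def f_nbac(est_from: List[Vote], x: int) -> str:
    """Map an array of gathered votes to an outcome.

    COMMIT is returned only when at least x entries are YES; a missing
    vote (⊥) is never counted as a YES.
    """
    return COMMIT if est_from.count(YES) >= x else ABORT
```

With x = n this gives the all-or-nothing behavior: f_nbac([YES, None, YES], 3) yields ABORT because one vote is missing, while f_nbac([YES, YES, YES], 3) yields COMMIT. The function is deterministic and its result never belongs to the input set, which matches the properties listed above for est_from_i and new_est_i.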
As indicated in the Introduction, when the Consensus problem is used as a black box, the validity property (see Section 2.3) states that the decision value is one of the proposed values. Such an approach requires an exchange of local information between all the processes. At the end of this preliminary exchange phase, each process computes the initial value it proposes to Consensus. In the GAF approach, the gathering of local information and the computation of the decision value (done by the function F) are integrated within the generated agreement protocol.
The function ACCEPTABLE: The set of possible output values is partitioned into two subsets: one subset contains values that are said to be acceptable, while the other one contains unacceptable values. The boolean function ACCEPTABLE applied to an output value returns true if and only if this element is an acceptable value. Thanks to this distinction between the output values, processes can participate in an execution of the agreement protocol even if they do not have any significant value to propose. But if too many meaningless input values are gathered, the computed decision can also be insignificant. In that case, the proposed value has to be rejected and extra rounds have to be performed. The function ACCEPTABLE is called at line 18 when a new estimate is received from the coordinator of the current round. As the new estimate which has been computed at line 28 or selected at line 26 is not necessarily an acceptable value, the receiver has to test its validity. The process will adopt the new estimate and send a positive acknowledgment if and only if this value is an acceptable one. Otherwise, it refuses the value and sends a negative acknowledgment to the coordinator. As a consequence, extra rounds will be executed until an acceptable value is obtained. For the sake of simplicity, all the types of messages defined in the proposed solution are similar to those used in CT. Thus the behavior of the coordinator is the same whether its new estimate is acceptable or not. For the same reason, a process sends a negative acknowledgment either because it suspects the coordinator or because it rejects its new estimate. In this paper, the fact that a value is acceptable or not is a property of the value itself. But the value returned by the function ACCEPTABLE might also depend on the local context of process p_i. In that case, ensuring the liveness property of the generated protocol would become more intricate. Note that, in any case, the termination of the generated protocol still relies on the fact that enough processes eventually accept an output value (see Section 5).
The function EXCUSED: The function F has to be applied when enough values have been gathered.
When the early detection feature is not used (i.e., a call to the EARLY function always returns false), the locking property of CT is satisfied if a majority of values has been gathered. Depending on the problem to solve, the gathering of extra values is either useless (in the Consensus problem, a single value will be selected), essential (in the NBAC problem, at least $\max(x, n - x + 1)$ values have to be gathered) or indifferent because both choices present advantages (in the atomic broadcast problem, a tradeoff has to be found between the time required to gather extra values and the possible increase of the number of messages ordered in a single decision value). The function EXCUSED is applied only when a majority of estimate values has been collected. A call EXCUSED(j) returns true if and only if the coordinator no longer has to wait for the missing proposal of process $p_j$. The function F cannot be applied as long as a non-excused process has not proposed a value. Furthermore, processes are prone to crash, so some of them may never propose any value. Thus, to ensure liveness, the predicate EXCUSED(j) must eventually be satisfied if the process $p_j$ has crashed. For this reason, this function may use information returned by a failure detector or use timeout mechanisms.

The predicate EARLY: As previously indicated, some problems are characterized by the possibility of an early decision. For example, in the all-or-nothing NBAC problem, as soon as a coordinator is aware that at least one process is in favor of aborting the transaction, it can immediately decide on ABORT. Informally, if enough processes have proposed some values, the decided value is independent of the values proposed by the other processes. GAF makes it possible to force an early decision if the array of proposed values est_from satisfies the predefined predicate EARLY. It is important to note that this parameter does not have to be explicitly defined by the application programmer, since its definition is derived from the definitions of the other parameters (see Section 5).
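The waiting rule of the coordinator and the acknowledgment rule of a receiver, as described above, can be sketched as follows. This is only an illustrative Python sketch, not the protocol of the paper: the names `ready_to_apply_f` and `acknowledge` are hypothetical, and a missing proposal is modeled here by `None` rather than by the least element of the input lattice.

```python
from typing import Callable, Optional, Sequence


def ready_to_apply_f(est_from: Sequence[Optional[object]],
                     excused: Callable[[int], bool]) -> bool:
    """Coordinator side: may F be applied to the gathered values?

    Assumption of this sketch: est_from[j] is None while no estimate
    has been received from process j.
    """
    n = len(est_from)
    received = sum(1 for v in est_from if v is not None)
    # EXCUSED is consulted only once a majority of estimates is collected.
    if received <= n // 2:
        return False
    # F cannot be applied while a non-excused process has not proposed.
    return all(excused(j) for j, v in enumerate(est_from) if v is None)


def acknowledge(new_estimate: object,
                acceptable: Callable[[object], bool],
                coordinator_suspected: bool) -> bool:
    """Receiver side: True means a positive acknowledgment is sent.

    A negative acknowledgment is sent either because the coordinator is
    suspected or because its new estimate is not acceptable.
    """
    return (not coordinator_suspected) and acceptable(new_estimate)
```

For instance, with five processes, three received estimates and the two silent processes suspected, `ready_to_apply_f` returns true; if one of the silent processes is not excused, the coordinator keeps waiting.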
Parameter        Description of the function
GET              Return NO or YES
Input lattice    $\bot$ = NO; NO $\le$ YES
F                Return COMMIT if at least x values YES have been collected. Otherwise, return ABORT
ACCEPTABLE       Always return true
EXCUSED          EXCUSED($p_i$) is satisfied if $p_i$ is currently suspected
EARLY            Return true when at least $(n - x + 1)$ NO values have been gathered

Table 2. Solving the NBAC Problem
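The rows of Table 2 can be written down as small functions. The following Python sketch is one possible reading of the table under stated assumptions: all function names are hypothetical, the threshold `x` is passed explicitly, the set of currently suspected processes is supplied by the caller, and a vote that has not been received yet is modeled by `None`.

```python
from typing import Optional, Sequence, Set

YES, NO = "YES", "NO"
COMMIT, ABORT = "COMMIT", "ABORT"


def nbac_f(votes: Sequence[Optional[str]], x: int) -> str:
    """F: COMMIT if at least x YES votes have been collected, else ABORT."""
    return COMMIT if sum(1 for v in votes if v == YES) >= x else ABORT


def nbac_acceptable(value: str) -> bool:
    """ACCEPTABLE: every output value is acceptable."""
    return True


def nbac_excused(j: int, suspected: Set[int]) -> bool:
    """EXCUSED: process j is excused if it is currently suspected."""
    return j in suspected


def nbac_early(votes: Sequence[Optional[str]], x: int) -> bool:
    """EARLY: true once at least n - x + 1 NO votes have been gathered.

    Only votes actually received are counted; None (missing) is ignored.
    """
    n = len(votes)
    return sum(1 for v in votes if v == NO) >= n - x + 1
```

With `x = n` (the all-or-nothing case), `nbac_early` becomes true as soon as a single NO vote has been received, which matches the example given for the predicate EARLY.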
4.2 Instantiating the Framework

The Consensus Problem. Let us consider the instantiation described in Table 1. With this instantiation, the framework generates a Consensus protocol which is nothing else than CT. The set of input values (equal to the set of output values) contains all the values v potentially proposed by a process. In both sets, no value is more significant than another one. As the function GET called by process $p_i$ always returns the same value $v_i$, phase 1 is executed only once per round (when a round is started).

The NBAC Problem. Let us consider the instantiation described in Table 2. With this instantiation, the framework generates a solution to the NBAC problem. The set of input values contains two values denoted NO and YES. The set of output values contains two values denoted ABORT and COMMIT. We assume that the value NO is less significant than YES. Thus a participant in a transaction is allowed to initially propose NO if it is not sure to be able to commit later (this may be the case if temporary writes are not yet stored on non-volatile storage). Sending a more significant value (YES) is still possible and may be profitable if the Consensus protocol has not yet converged to a decision in the meantime. As indicated in [10], the fact that the gathering of information and the computation of the decision value are done within the same framework leads to an efficient solution. As no preliminary exchange phase is required, this protocol induces fewer communications than other classical approaches. Moreover, this solution is a good trade-off between the number of messages exchanged and the number of communication steps (latency) required to obtain the final decision.

The Atomic Broadcast Problem. Let us consider the instantiation described in Table 3. With this instantiation, the framework generates a protocol which can be used by the upper layer application to solve the Atomic Broadcast problem. The upper layer application has to use repeated sequential executions of the generated protocol. By definition, any set of message identities is a potential input/output value. A value $v$ is more significant than a value $v'$ if the set $v'$ is included in the set $v$. Identities of messages provided by the function GET correspond to received messages that have been submitted for Atomic Broadcast but not yet delivered. We assume that all sites a priori agree on a deterministic function that is applied to each decision value (i.e., a set of message identities) to obtain a total order on the delivery of these messages. The proposed solution is more efficient than the Prefix Agreement solution described in [1]. Both solutions allow a process to complete its initial proposal. But GAF also allows a process to propose an empty value ($\bot = \emptyset$) if it has no message to order (this process is still allowed to subsequently propose a bunch of messages received in the meantime). This feature possibly speeds up the outcome. Moreover, more messages can possibly be ordered during a single execution of the generated protocol because the decision is the union of all the proposed values rather than a single one. Note that if no significant value has been collected, the function F returns the empty set, which is not an acceptable value. If no message is broadcast, processes will forever propose $\emptyset$. Thus, coordinators will apply F to an array filled with empty sets, and obtain an unacceptable value ($\emptyset$) refused by all the processes.

Parameter        Description of the function
GET              Return a set of message identities
Input lattice    $\bot = \emptyset$; $\forall v', \forall v: \; v' \le v \Leftrightarrow v' \subseteq v$
F                Return the union of all the gathered sets of message identities
ACCEPTABLE       Return true when not applied to the empty set
EXCUSED          EXCUSED($p_i$) is satisfied after a fixed delay
EARLY            Always return false

Table 3. Solving the Atomic Broadcast Problem
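The Atomic Broadcast instantiation can be illustrated by a short Python sketch. It is given under assumptions that are not in the paper: the function names are hypothetical, message identities are plain strings, a proposal not yet received is modeled by `None`, and the agreed deterministic delivery function is taken to be a simple sort of the identities (any deterministic function known to all sites would do).

```python
from typing import FrozenSet, List, Optional, Sequence, Set


def ab_f(gathered: Sequence[Optional[Set[str]]]) -> FrozenSet[str]:
    """F: union of all the gathered sets of message identities."""
    result: Set[str] = set()
    for proposal in gathered:
        if proposal is not None:  # missing proposals contribute nothing
            result |= proposal
    return frozenset(result)


def ab_acceptable(value: FrozenSet[str]) -> bool:
    """ACCEPTABLE: any value except the empty set."""
    return len(value) > 0


def ab_early(gathered: Sequence[Optional[Set[str]]]) -> bool:
    """EARLY: always false (the input lattice has no greatest element)."""
    return False


def more_significant(v: Set[str], v_prime: Set[str]) -> bool:
    """True if v is at least as significant as v_prime (v_prime included in v)."""
    return v_prime <= v


def delivery_order(decision: FrozenSet[str]) -> List[str]:
    """Assumed deterministic function: deliver in sorted identity order."""
    return sorted(decision)
```

When every gathered proposal is empty, `ab_f` returns the empty set and `ab_acceptable` rejects it, so the processes refuse the estimate and further rounds are executed, as described above.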
5 How is Consistency Ensured?

First we define a set of constraints which have to be satisfied by any instantiation of the framework. Then, we indicate why this set of constraints is necessary to ensure that a single decision value is adopted by all the processes.

5.1 Constraints On Parameters

Input Lattice. The set of input values is embedded into a lattice denoted $L_{IN}$. We assume this lattice has a least element, namely $\bot$. This lattice may possibly contain no greatest element (see footnote 4). Nevertheless, if the lattice has a greatest element, it is denoted $\top$. Each time it is asked to propose a new value, a process is allowed to supply a value that is different from its previous ones. To be manageable, this possibility has to be constrained in some way. We consider here the following constraint: the sequence of proposed values must be monotonically non-decreasing with respect to the lattice $L_{IN}$. Informally, this means that a process cannot propose a value "carrying less information" than its previous ones.

As indicated previously, when a process acts as a coordinator (i.e., every n rounds), it gathers the estimate values sent by the processes and computes a new estimate of the final decision. The value of the new estimate, which is broadcast at the end of the second phase, has the following property: it is either a value previously computed by another coordinator or a value returned by the function F. In the latter case, we assume that the function is applied when enough values have been gathered. To define more precisely the meaning of "enough", let us adopt the following definitions.

Each process repeatedly executes a while loop (lines 2-48). Let us consider the $l$-th execution of this loop, and the value of the array est_from after all the statements of line 28 have been executed by the coordinator. This value is denoted $E_l^{\bot}$. Recall that the array has been initialized to $[\bot, \ldots, \bot]$ at line 1 when the protocol was launched. As long as no estimate has been received from $p_k$, est_from[k] remains equal to $\bot$. In this case, $p_k$'s value is said to be "missing". Thus the symbol $\bot$ that appears in the notation $E_l^{\bot}$ indicates that missing values have been substituted by the least element of $L_{IN}$.

Similarly, if a greatest element $\top$ exists, $E_l^{\top}$ denotes the value of the array est_from_i where the missing values have been replaced by the most significant input value $\top$. Both $E_l^{\bot}$ and $E_l^{\top}$ are valid elements of the lattice $L_{IN}^n$. They exhibit the following important property: if extra executions of the while loop were performed, the coordinator would complete its collection of received estimates. In the case where the already gathered values were not changed, one would have:

$$\forall l' \ge l: \quad E_l^{\bot} \le E_{l'}^{\bot} \le E_{l'}^{\top} \le E_l^{\top} \tag{1}$$

Thus, one can see that $E_l^{\bot}$ is in fact the least possible completion, while $E_l^{\top}$ is the greatest possible one.

Output Lattice. Ordering input values would be useless if there were no order relation on output values as well, and if the mapping F between input and output values were not compatible with these order relations. Thus, the function F must be a monotone non-decreasing mapping. In other words, any increase of information with respect to input values induces an increase of information with respect to output values. Thus, $F([\bot, \ldots, \bot])$ is the least possible element of the output lattice that can be obtained by F (see footnote 5).

Early Decision. The function EARLY(x) has to satisfy the following two properties. First, this function must always return false if the lattice $L_{IN}$ has no greatest element (i.e., when $\top$ is not defined). Second, if this function is called during the execution of the loop $l$ by a coordinator, then it returns true if and only if $F(E_l^{\top}) = F([\bot, \ldots, \bot])$. Let us consider the coordinatewise order relation on the elements of $L_{IN}^n$. Obviously $E_l^{\bot} \le E_l^{\top}$. By monotonicity of F, if $F(E_l^{\top}) = F([\bot, \ldots, \bot])$, we conclude that:
$$\forall x \le E_l^{\top}: \quad F(x) \le F(E_l^{\top}) = F([\bot, \ldots, \bot])$$

In particular, since $F([\bot, \ldots, \bot])$ is the least value that F can return, this implies that:

$$F(E_l^{\bot}) = F([\bot, \ldots, \bot])$$

By equation (1), any future completion of the array est_from is also mapped onto $F([\bot, \ldots, \bot])$. This allows the coordinator to decide early on a particular output value, namely $F([\bot, \ldots, \bot])$, which is equal to $F(E_l^{\bot})$. In the NBAC problem, the value NO is the least element of the input lattice while YES is the greatest one. Thus the decision is equal to ABORT $= F([\text{NO}, \ldots, \text{NO}])$ as soon as $n - x + 1$ values NO have been collected. In that case, the set of gathered information completed with the most significant value YES is the greatest completion, and the function F applied to the greatest completion returns the least element of the output lattice.
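The way EARLY is derived from the other parameters can be made concrete with a small Python sketch. It rests on assumptions of this illustration only: the helper names `complete`, `early` and `make_nbac_f` are hypothetical, a missing estimate is modeled by `None`, and a lattice without greatest element is signalled by passing `top=None`.

```python
from typing import Callable, List, Optional, Sequence

YES, NO = "YES", "NO"
COMMIT, ABORT = "COMMIT", "ABORT"


def complete(est_from: Sequence[Optional[str]], filler: str) -> List[str]:
    """Replace every missing value (None) by the given lattice element."""
    return [filler if v is None else v for v in est_from]


def early(est_from: Sequence[Optional[str]],
          f: Callable[[Sequence[str]], str],
          bottom: str,
          top: Optional[str]) -> bool:
    """Derived EARLY: F(greatest completion) == F([bottom, ..., bottom])."""
    if top is None:  # no greatest element: never decide early
        return False
    return f(complete(est_from, top)) == f([bottom] * len(est_from))


def make_nbac_f(x: int) -> Callable[[Sequence[str]], str]:
    """NBAC instance of F: COMMIT if at least x YES values, else ABORT."""
    def f(votes: Sequence[str]) -> str:
        return COMMIT if list(votes).count(YES) >= x else ABORT
    return f
```

With `n = 4` and `x = 3`, two received NO votes leave at most two YES values in the greatest completion, so F maps it to ABORT and `early` returns true; with a single NO vote the greatest completion still yields COMMIT and no early decision is possible. This agrees with the threshold of $n - x + 1$ NO values.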
5.2 Safety and Termination

Termination. The termination property is mainly inherited from the corresponding property of CT. But the possibility for processes to propose non-significant values implies that it is possible to obtain non-acceptable output values. Consequently, termination also relies on the fact that processes will provide enough significant values leading to an acceptable output value.

Safety. One of the major issues of Consensus protocols is to cope with decisions taken during different rounds by different coordinators. This issue is handled in CT by using a timestamp mechanism to ensure the locking property. Indeed, as soon as a new estimate value has been approved by at least a majority of processes, one is guaranteed that any other new estimate proposed in the future will be approved by at least one process which is aware of the previous one (majority condition). Thus, such a process ensures that the current coordinator is also aware of previous decisions. Allowing changes of mind and early decisions makes it more difficult to ensure that two different decisions will never be taken by two coordinators. But thanks to the defined restrictions (see Section 5.1) and the timestamp mechanism of CT, this requirement is naturally satisfied. Let us assume that two consecutive decisions are taken by two coordinators at rounds $r_1$ and $r_2$ with $r_1 \le r_2$. Four cases have to be considered:
(1) The two decisions are taken after collecting at least a majority of estimates (M-decisions).
(2) The two decisions are early decisions (E-decisions).
(3) The first decision is an E-decision and the second one is an M-decision.
(4) The first decision is an M-decision and the second one is an E-decision.

The first case is already solved in CT. The second case is solved by the fact that an early decision leads to select a single predefined value, namely $F([\bot, \ldots, \bot])$. So the decisions are identical. The third case is also solved by the timestamp mechanism of CT. Indeed, the value of the second decision is necessarily equal to the previously adopted value. This is due to the fact that this decision is taken after gathering a majority of estimate values, while a previous decision (an M-decision or an E-decision) has been obtained after the gathering of a majority of positive acknowledgments. Thus the value which is selected by the coordinator of round $r_2$ is necessarily the value of a process which has previously adopted the decision value of round $r_1$. The last case is slightly trickier and relies entirely on the properties of the function F.

Let us assume that, on the one hand, during round $r_1$ the coordinator has received at least a majority of estimates and has proposed a new estimate during the execution of the while loop identified by $l_1$. On the other hand, during round $r_2$ an E-decision is taken. So, when the coordinator suggested its new estimate, it was executing a while loop $l_2$ and the following property was satisfied:

$$F(E_{l_2}^{\top}) = F([\bot, \ldots, \bot])$$

To prove that the decisions are identical, one has to show that:

$$F(E_{l_1}^{\bot}) = F(E_{l_2}^{\top})$$

This is due to the fact that $E_{l_1}^{\bot} \le E_{l_2}^{\top}$. Indeed:
(1) If $E_{l_1}^{\bot}[i] = x_i$, one has two possible cases:
(a) $E_{l_2}^{\top}[i] = y_i$. As a process can only increase its proposition, we conclude that $x_i \le y_i$.
(b) $E_{l_2}^{\top}[i] = \top$. Obviously $x_i \le E_{l_2}^{\top}[i]$.
(2) If $E_{l_1}^{\bot}[i] = \bot$, then obviously $E_{l_1}^{\bot}[i] \le E_{l_2}^{\top}[i]$.

Moreover, since F is monotone non-decreasing, one has:

$$F(E_{l_2}^{\top}) = F([\bot, \ldots, \bot]) \ge F(E_{l_1}^{\bot})$$

Since $F([\bot, \ldots, \bot])$ is the least element of the output lattice (see footnote 5), this implies that:

$$F(E_{l_1}^{\bot}) = F([\bot, \ldots, \bot])$$

Footnote 4: For instance, in the atomic broadcast problem, no such element exists because the number of possibly broadcast messages is unbounded.
Footnote 5: Since F is assumed to be onto, it means that $F([\bot, \ldots, \bot])$ is the least element of the output lattice.

6 Conclusion

In this paper we have investigated a "customizing" approach to automatically generate ad hoc agreement protocols. A general agreement framework, characterized by six "versatility" parameters, has been defined. Appropriate instantiations of these parameters provide particular agreement protocols. This approach makes it possible to generate efficient agreement protocols. We have identified a set of natural constraints that have to be satisfied by any instantiation to ensure the liveness and the safety properties of the generated protocol.

References
[1] Anceaume E. A Lightweight Solution to Uniform Atomic Broadcast for Asynchronous Systems. Proc. IEEE FTCS'27, pp. 292-301, Seattle, WA, June 1997.
[2] Badache N., Hurfin M., and Macedo R. Solving the Consensus Problem in a Mobile Environment. Proc. 17th IEEE IPCCC'99, Phoenix, AZ, February 1999.
[3] Chandra T. and Toueg S. Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM, 43(2):225-267, March 1996.
[4] Defago X., Schiper A., and Sergent N. Semi-Passive Replication. Proc. 17th IEEE Int. SRDS, Purdue University, IN, pp. 43-50, 1998.
[5] Fischer M.J., Lynch N., and Paterson M.S. Impossibility of Distributed Consensus with one Faulty Process. Journal of the ACM, 32(2):374-382, April 1985.
[6] Fritzke U., Ingels Ph., Mostefaoui A., and Raynal M. Fault-Tolerant Total Order Multicast to Asynchronous Groups. Proc. 17th IEEE Int. SRDS, pp. 228-234, 1998.
[7] Guerraoui R. Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus. Proc. 9th Int. WDAG, LNCS 972, pp. 87-100, 1995.
[8] Guerraoui R., Larrea M., and Schiper A. Non-Blocking Atomic Commitment with an Unreliable Failure Detector. Proc. 14th IEEE SRDS, pp. 41-50, Bad Neuenahr, 1995.
[9] Guerraoui R. and Schiper A. Consensus Service: A Modular Approach for Building Fault-Tolerant Agreement Protocols in Distributed Systems. Proc. IEEE Int. Symposium FTCS'26, pp. 168-177, Sendai, Japan, June 1996.
[10] Hurfin M. and Tronel F. A Solution to Atomic Commitment Based on an Extended Consensus Protocol. In Proc. 6th IEEE Workshop FTDCS, pp. 98-103, Tunisia, October 1997.