Using Active Clients to Minimize Replication in Primary-Backup Protocols

Parvathi Chundi    Ragini Narasimhan    Daniel J. Rosenkrantz    S. S. Ravi
Department of Computer Science
University at Albany - State University of New York
Albany, NY 12222
August 31, 1995
Abstract
We consider the primary-backup approach for providing fault-tolerant services in a distributed system. In this approach, a fault-tolerant service is implemented using a collection of servers. One of the servers functions as the primary while the others function as backups. Clients send their service requests to the primary. When the primary fails, one of the backups takes over as the primary. For this approach, previous work (see [Bu93] and the references cited therein) has presented a model that we call the restricted client model. Under this model, protocols have been presented for various types of server failures including crash failures, send-omission failures and receive-omission failures. For crash failures and send-omission failures, the degree of replication used by these protocols is a minimum; that is, for tolerating up to f crash or send-omission failures, these protocols use at most f + 1 servers. For tolerating f receive-omission failures, it has been shown that ⌊3f/2⌋ servers are necessary and 2f + 1 servers are sufficient. These protocols do not handle omission failures of clients. In this paper, the primary-backup approach is considered under a model in which the clients play an active role when their service requests are not fulfilled. Each client maintains an ordered list of servers and sends its service requests to the first server in its list. If the server does not respond within a specified timeout period, the client retransmits the request to the next server in its list. Under this active client model, we construct protocols that tolerate the three types of failures mentioned above. For each type of failure, our protocol achieves the minimum degree of replication. More precisely, all our protocols tolerate up to f server failures using only f + 1 servers. In addition, these protocols tolerate an arbitrary number of client failures. Further, the protocols ensure that the service provided by the system is functionally equivalent to that provided by a single failure-free server.
Research supported by NSF Grant CCR-90-06396. Email addresses: {paru, raginis, djr, ravi}@cs.albany.edu
1 Introduction

The Primary-Backup approach to providing fault-tolerant services is widely used in distributed systems [Ba81, OL88, Ly90, BE+91, LG+91, Bu93, Ja94]. In this approach, a fault-tolerant service is implemented through the use of multiple servers. The state information of the service is fully replicated at each server. One of the servers is designated as the primary and the others are designated as backups. Normally, a client sends its service request to the primary, which processes the request and sends a response to the client. In order to preserve the consistency of the replicated service, the primary also sends to all the backups the database update that occurs when each request is processed. When the primary fails, one of the backups takes over as the primary and notifies the clients so that subsequent requests will be sent to the new primary. The primary-backup approach was introduced in [AD76]. Foundational aspects of the primary-backup approach are studied in [Bu93, BM+92a, BM+92b]. We refer to their model as the Restricted Client (RC) model because the clients are driven by the servers in the system. Moreover, the clients play no role in determining the next primary or in identifying faulty servers. Primary-Backup protocols based on the RC model are described in [BM+92a, BM+92b, BM93]. These protocols do not handle client omission failures. Making a system fault-tolerant to client failures may be as important as handling server failures because there is little reason to believe that clients are less susceptible to failures than servers. In such a case, a protocol must ensure that a client failure does not threaten the consistency of the entire system [Sch90]. We present protocols that can handle an arbitrary number of client failures. In our model, the clients play a more active role in the protocols. So, we refer to this model as the Active Client (AC) model.
Under this model, each client maintains an ordered and identical list containing all the servers. Normally, a client sends its service request to the first server in its list. If a response is not received within a predetermined amount of time, the client sends the same request to the next server in its list. The clients may also receive messages from the servers regarding failed servers. In such a case, the clients suitably modify their server lists. Using this active client model, we present protocols for three types of failures, namely crash failures (i.e., servers/clients may crash in a fail-stop manner), send-omission failures (i.e., servers/clients may crash or fail to send messages), and receive-omission failures (i.e., servers/clients may crash or fail to receive messages). Additional discussion on these failure models appears in Section 2.1. Our approach of having more active clients leads to protocols with the following advantages.

1. For all three types of failures, our protocols are minimal with respect to the degree of replication. More precisely, our protocols use only f + 1 servers to tolerate f server failures. (Obviously, f + 1 servers are needed to tolerate f server failures.) In particular, our protocol for receive-omission failures allows the system to tolerate f failures using only f + 1 servers. This is in contrast to the model of [BM+92a, BM+92b], where it is shown that at least ⌊3f/2⌋ servers are necessary and 2f + 1 servers are sufficient to tolerate f receive-omission failures.

2. For each type of failure, our protocols allow the system to tolerate f server failures and an unlimited number of client failures while using only f + 1 servers.

We note that ensuring the consistency of the database at each non-faulty server and tolerating client failures render the protocols, as well as their correctness proofs, nontrivial. Before we end this section, we point out some differences between the Active Replication approach (or the state machine approach [Lam78, Sch90]) and the primary-backup approach. In the former, the state of the service is replicated across all the servers implementing the service and a client's request is (atomically) broadcasted to all servers. Upon receiving a request, each server computes the response, updates its state if necessary, and then responds to the client. In the primary-backup approach, under both the RC and AC models, a client sends a request to only one server. In the AC model, this server may be either a primary or a backup. In the RC model, this server can only be a primary. Under both models, if the server receiving the client's request is faulty, then such a request may never get a response. (A client may eventually receive a response by resending the same request to the system.) However, in active replication, such failures of individual servers can be immediately masked from the clients because clients broadcast their requests to multiple servers. Some methods to achieve fault tolerance in systems where both servers and clients may be faulty are discussed in [Sch90]. It is shown there that the number of servers required to tolerate f crash failures of servers is f + 1, and the number required to tolerate f Byzantine failures of servers is 2f + 1. Crash and Byzantine failures of clients are handled by replicating each client and running each replica on hardware that fails independently.
We note that our protocols do not replicate clients and do not tolerate Byzantine failures of clients. The remainder of this paper is organized as follows. Section 2 presents the system model and a comparison with previous work. Sections 3, 4 and 5 discuss our protocols for send-omission failures, receive-omission failures and crash failures, respectively. Some concluding remarks are presented in Section 6. For reasons of space, only informal descriptions of the protocols and sketches of correctness proofs appear in this paper. Reference [CN+95] is a more detailed version of this paper.
2 The system model and preliminary definitions

2.1 System description
The system consists of an ensemble of servers and clients, which are collectively referred to as sites. The number of servers and clients in the system are denoted by m and n, respectively. Each server is connected to all other servers and all clients by point-to-point, non-faulty, FIFO (first-in, first-out) links.
Hence, if a site S sends two messages to another site R and R receives both messages successfully, then they are enqueued at R in the order they were sent. The site R may fail to receive one or both of these messages, but R will never receive them out of order. The servers in our system use broadcast to propagate messages dealing with system maintenance and updates. The broadcast operation used in our protocols has the following properties.

1. It is atomic. That is, if a site broadcasts a message to other sites, the message is received either by all destination sites that do not fail or by no destination sites.

2. It is FIFO. That is, two broadcast messages from the same site will be enqueued, at the destination sites that do not fail, in the order they were sent.

We now explain the types of failures in the context of broadcast. We say that a send or broadcast operation on a message is unsuccessful if the message is not emitted by the sender. For all types of failures considered in this paper, if a broadcast is unsuccessful, the message is delivered to none of the sites. In the case of crash and send-omission failures, a successful broadcast message is received by all destination sites that have not crashed. In the case of receive-omission failures, some of the destination sites that have not crashed may fail to receive the message; so, a successful broadcast may be received by a subset of the destination sites. The transmission delay of a message is bounded and has a known nonzero upper bound d. Therefore, a message sent by a site S to another site R at time t will be received by R before t + d (or will not be received at all in the event of a failure). We assume that server and client clocks advance at the same rate as real time (this requirement can be relaxed by adjusting the wait times associated with our protocols based on the maximum difference in rates), although the clocks are not necessarily synchronized. Finally, we assume that, for any request, the maximum computation time needed to generate the response and update has a known nonzero upper bound c. For simplicity of analysis, we assume that servers need zero time to prepare a message for broadcasting. The parameter f denotes the maximum number of faulty servers in the system. We do not place an upper bound on the number of faulty clients.
2.2 Design features of the system

We assume that the system provides access to a service that incorporates a database state. Each server maintains a fully replicated copy of the database at its site. Each site contains an ordered list of servers which is initialized to ⟨1, 2, ..., m⟩. Each site considers the server whose id is at the head of the list to be the primary. Thus, the server with server-id equal to 1 is the initial primary for all sites. Throughout this paper, we use "server P" to mean the server whose server-id is P. A client C requesting service sends a request message to a primary P. This request message includes a unique request identification number (or reqid) of the form ⟨client-id, sequence-number⟩. Then, C waits for a response for a predetermined amount of time which we call the wait-time (denoted by δ). If the system successfully responds, then the response reaches the client before the wait-time expires. If the wait-time expires without C receiving a response, C sends the request with the same reqid to the next server on its list. Consecutive requests from C have consecutive sequence numbers. All servers start in the same initial database state. When a primary server receives a request message, it computes the response, changes its state, and broadcasts an update message to all the backups. This update message includes the reqid and enough information to compute the response and the database update. We note that the details of the update message can be made appropriate to the application. The update message might include the response and a list of modified data items (possibly empty); otherwise, the update message might indicate the request, with the backup recomputing the response and update. Each server maintains a response table that contains one entry per client. Each entry is a pair ⟨lreqid, resp⟩ where lreqid is the reqid of the last request from that client that was serviced by the system and resp is the response computed. We assume that the time taken to carry out an update (by a backup) with respect to a request R is no more than the time taken to generate the update and the response corresponding to R (at the primary site). We now formally define the terms request-message, request and response.
Definition 2.1 (a) A request-message is a message with a sequence number from a client to a server requesting a service. (b) A request R = ⟨r1, r2, ..., ri⟩ is a chronologically ordered sequence of request-messages, all with the same sequence number, but sent to different servers. (c) A response to a request R = ⟨r1, r2, ..., ri⟩ is the response to the last request-message ri in the sequence R.

Definition 2.2 An outage of a server occurs at time t if and only if a non-faulty client sends a request-message to the server at time t and never receives a response to this message.
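The retransmission behavior of a client described in this section can be sketched as follows (a minimal hypothetical Python sketch; the callback names `send` and `wait_for_response` are assumptions for illustration, not part of the paper's model):

```python
import itertools

class ActiveClient:
    """Sketch of a client under the AC model: send the request to the
    first server in the list; on timeout, retransmit the same reqid to
    the next server in the list."""

    def __init__(self, client_id, servers, wait_time):
        self.client_id = client_id
        self.servers = list(servers)   # ordered list, identical at all sites
        self.wait_time = wait_time     # the wait-time (delta)
        self.seq = itertools.count(1)  # consecutive sequence numbers

    def request(self, payload, send, wait_for_response):
        reqid = (self.client_id, next(self.seq))  # <client-id, sequence-number>
        for server in self.servers:               # head of list = believed primary
            send(server, reqid, payload)
            resp = wait_for_response(reqid, self.wait_time)
            if resp is not None:
                return resp            # response arrived within the wait-time
        return None                    # unreachable with at most f server failures
```

Note that the reqid stays fixed across retransmissions of the same request, which is what lets a server's response table recognize a request it has already serviced.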
Our protocols ensure that in any run of the system, the duration of any outage and the number of outages are bounded. This requirement is formalized by the following definition of a bofo (bounded outage finitely often) system. The term bofo was introduced in [BM+92a]. Our definition of a bofo system is slightly different from that given in [BM+92a]. We explain the difference in Section 2.4.
Definition 2.3 A system of m servers acts like a (k, Δ)-bofo system for f (f < m) server failures iff there exist finite values of k and Δ such that for every run of the system with at most f failures, all outages can be covered using at most k intervals, each of length at most Δ.

Note that the definition allows more than k outages to occur as long as they can be covered using k or fewer intervals, each of length at most Δ.
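To make the covering condition of Definition 2.3 concrete, it can be checked greedily: scanning outage times from left to right and opening a new interval of length Δ whenever the current interval cannot cover the next outage uses the fewest possible intervals. A sketch (hypothetical Python; not part of the protocols themselves):

```python
def is_k_delta_bofo(outage_times, k, delta):
    """Return True iff all outage times can be covered by at most k
    intervals, each of length at most delta (cf. Definition 2.3).
    Greedy left-to-right covering is optimal for points on a line."""
    intervals_used = 0
    cover_end = float("-inf")          # right end of the current interval
    for t in sorted(outage_times):
        if t > cover_end:              # current interval cannot cover t
            intervals_used += 1
            cover_end = t + delta      # open a new interval starting at t
    return intervals_used <= k
```

For example, outages at times 0, 1, 2 and 10 form a (2, 3)-bofo pattern: one interval of length 3 covers the first three, and a second covers the last.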
2.3 Requirements on the protocol

All our protocols satisfy the following requirements.

1. For each run containing at most f server failures, every non-faulty client eventually receives a response to each of its requests and the corresponding update is incorporated into the database state of every non-faulty server. We call this the guaranteed-response requirement.

2. There exist finite values k and Δ such that the system acts as a single (k, Δ)-bofo system for f server failures. We call this the bofo requirement.

3. For each run σ consisting of at most f server failures, the following property holds. Let R be the (possibly infinite) set of requests in σ whose updates are incorporated into the database state by any non-faulty server. Then, there exists a total ordering on R such that the response produced by the system for a request R′ in R is identical to that generated by a single failure-free server that responds to these requests in the same order, assuming that all servers in the system and the single failure-free server start in the same initial state. We call this the consistency requirement.

We note that if C is a non-faulty client, then R contains all the requests sent by C. If C is a faulty client, R contains a prefix (possibly all) of the requests sent by C.
2.4 Comparison with Previous Work

Here, we point out some of the important differences between the RC model and the AC model. Reference [Bu93] identifies three cost metrics for any primary-backup protocol. They are degree of replication (the number of servers used to implement the protocol), failover time (the worst-case time period during which there is no primary), and blocking time (the worst-case interval between the time a request is received by the system and the time its response is sent in any failure-free execution of the protocol). For several types of failures, including those discussed in Section 1, lower bounds on the above metrics for tolerating a given number of failures are established in [BM+92a]. To derive bounds on these metrics, the RC model assumes that such a system has the following four properties (for reasons of space, we state the properties informally). First, there can be at most one primary at any time during a run of the protocol. Second, a client can send a request only to the server that it believes to be the primary. Third, if a client's request reaches a backup, it is not enqueued (and hence not processed) by that backup. Lastly, a primary-backup service behaves like a (k, Δ)-bofo server. As defined in [Bu93], a server P is a (k, Δ)-bofo server if in any run of P there are at most k outages, each of length at most Δ. A system under the RC model behaves like a single (k, Δ)-bofo server iff every run of the system is equivalent to some run of the single server. In [Bu93], two runs σ1 and σ2 are defined to be equivalent if for all clients the following holds. A client
sends a request at time t in σ1 iff it sends the same request at t in σ2, and a client receives a response at time t in σ1 iff it receives the same response at time t in σ2. Note that the issues of performance of the system and consistency of responses are combined into one requirement, namely the bofo requirement. The protocols presented in [BM+92b, Bu93] achieve the lower bounds established in [BM+92a], but do not tolerate clients' omission failures. An important difference between the protocols given in [BM+92a] and the protocols given here is that our protocols can tolerate any number of faulty clients with minimal replication. The AC model specifies the properties of a primary-backup system in terms of external behavior observable by clients (e.g., the responses provided and the outages experienced by a client). Therefore, the protocols described here do not satisfy some of the internal properties of the RC model mentioned above. For example, the AC model does not require that there be at most one primary in the system. It is possible that different servers and clients have different primaries. Also, a client can contact the next server on its list when it experiences an outage with a server. (Such requests are enqueued at backups and may lead to the detection of a faulty primary and the election of a new primary.) By weakening the first two properties of the RC model, we are able to design protocols that have less overhead in terms of messages exchanged between servers during failure-free operation of the system. In all our protocols, the servers (except for the primary in the protocol for send-omission failures) do not send periodic messages to each other. As a result, the servers may not detect a faulty primary immediately. A consequence of not including periodic messages in the protocol is that our protocols can have arbitrarily long periods of failover time should the clients all be quiescent for an arbitrarily long time. The AC model separates the issues of consistency and performance. It requires that the system act like a (k, Δ)-bofo system, which means that every non-faulty client experiences only a finite number of bounded outages. The AC model also requires that the responses sent by the system be identical to those produced by a single failure-free server starting in the same initial state as any server in the system. If a single failure-free server services the requests from non-faulty clients, then it computes each response exactly once. Therefore, our protocols must ensure that any request processed by a non-faulty server is never processed again by any other non-faulty server. This is not the case for the protocols in [BM+92a] under the RC model, which permits runs in which a client sends a request that is incorporated into the database, but for which the client receives no response. In [BM+92a], it is the client's responsibility to cope with this possibility. We also note that all our protocols and the protocols given in [BM+92a] have the same values for blocking time. We end this section by noting that [HJ92] presents a different approach for analyzing primary-backup systems using a queuing model. The subsequent sections of this paper discuss our protocols for send-omission failures, receive-omission failures and crash failures respectively.
3 Send Omission Failures

3.1 Description of protocol

In this section, we describe our protocol for tolerating send-omission failures. Under this failure model, the sites may crash or fail to send messages. Our method for tolerating the send-omission failures of clients and servers uses a technique similar to that in [Bu93], which translates the send-omission failure of a server into a crash failure. Both protocols use only f + 1 servers to tolerate f faults. In our method, every non-faulty client continues to receive consistent responses from the system. Faulty clients may not receive responses even though there is at least one non-faulty server in the system. We begin with the protocol executed by the primary. A primary server P broadcasts a periodic message to the backups every τ time units. Every outgoing message from P to the backups has an associated sequence number, and consecutive messages from a primary to the backups have consecutive sequence numbers. Each broadcast message from P is either an update message (abbreviated as "upd" in the figures) corresponding to a client's request or an alive message. An update message is broadcasted immediately after computing the response for a request. An alive message is broadcasted if no message was broadcasted by P during the past τ time units. This message simply indicates to the backups that the primary has not halted. Messages from primary P are monitored by each backup B that has not halted. A skip in the sequence numbers associated with the messages, or the absence of a message from P during any period of length τ + d, is an indication to B that P committed a send omission on at least one message or has crashed. None of the messages from P after the skip are accepted by B. In these cases, B sends a kill message to P. (Note that in the above cases, all backups that have not crashed realize that P is faulty.) Upon receiving a kill message, P simply halts.
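The backup's monitoring of the primary's message stream reduces to two checks: a gap in the sequence numbers, or silence longer than the periodic interval plus the transmission delay bound. A sketch of one failure-detector step (hypothetical Python; the message representation is an assumption):

```python
def detect_primary_fault(last_seqno, msg, elapsed, tau, d):
    """One failure-detector step at a backup (send-omission protocol).
    msg is a pair (seqno, body) from the primary, or None if nothing
    has arrived; elapsed is the time since the last accepted message.
    Returns 'kill' if the backup must send a kill message, else 'ok'."""
    if msg is None:
        # No message for more than tau + d: the primary crashed or
        # committed a send omission on a periodic message.
        return "kill" if elapsed > tau + d else "ok"
    seqno, _body = msg
    if seqno != last_seqno + 1:
        # Skip in sequence numbers: send omission on at least one
        # message; later messages from this primary are not accepted.
        return "kill"
    return "ok"
```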
[Figure 3.1: Normal Processing]

[Figure 3.2: Faulty Primary P]
The normal processing of a primary server is illustrated in Figure 3.1. When a primary P receives a request req from a client C, it sends the response to C if req is present in its response table (recall that the response table stores, for each client, the response to its previous request). Otherwise, P computes the response and update and then broadcasts the update to all the backups.
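The primary's handling of a single request, including the duplicate check against the response table, can be sketched as follows (hypothetical Python; `compute`, `broadcast_update`, and `respond` are assumed callbacks, with `respond` standing in for the protocol's delayed delivery of the response):

```python
def primary_handle_request(table, reqid, payload, compute,
                           broadcast_update, respond):
    """Primary's processing of one client request (send-omission
    protocol). table maps client-id -> (reqid of the last serviced
    request, response). respond is assumed to delay delivery long
    enough for a kill message, triggered by a failed broadcast, to
    arrive first."""
    client_id, _seq = reqid
    entry = table.get(client_id)
    if entry is not None and entry[0] == reqid:
        respond(client_id, entry[1])        # duplicate: resend stored response
        return
    response, update = compute(payload)     # response and database update
    broadcast_update(reqid, update)         # atomic broadcast to all backups
    table[client_id] = (reqid, response)    # record last serviced request
    respond(client_id, response)            # reaches C only if P is not killed
```

The duplicate branch is what makes a client's retransmission of the same reqid safe: the request is never processed twice.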
[Figure 3.3: Faulty Primary P]
It delays sending the response (denoted by "resp" in the figures) for a bounded interval of time in order to ensure that the above broadcast is successful. During this interval, P continues to service other requests that it receives. The broadcast of the update message is successful if P does not receive a kill message within this interval. Hence, at the end of the interval, if P is still functioning, it sends the response to C (since the corresponding update is incorporated in at least one non-faulty server in the system). If P made a send-omission on the update message (see Figure 3.2, where a send-omission on a message is indicated by a dashed arrow with an X at the source of the arrow), one or more kill messages sent by non-halted backups will arrive at P within this interval. (At least one kill message is sent to P because there is at least one non-faulty server in the system.) Hence, P halts. A backup waits for the periodic message from the primary. If the message is an update, it incorporates the update and stores the request id and the corresponding response in its response table. Each backup executes a failure-detector process that examines the messages received by the backup; if it detects a send-omission fault of the primary (i.e., a skip in sequence numbers), it sends a kill message to the primary immediately and waits for an alive message from the next primary server within a bounded amount of time. A backup may also receive requests from clients. As discussed in Section 2.4, allowing a backup to receive requests is one of the main differences between our model and the RC model. Since a client C sends requests to some backup B′ only when it experiences failure (i.e., C times out) with the current primary P, B′ must determine whether P is faulty or C is faulty. It is possible that P never received the request req from C because C made a send-omission on req; in this case, C is faulty.
Alternatively, P may have received req, computed the update, and successfully broadcasted the update, but omitted to send the response to C; hence, P is faulty (see Figure 3.3). Therefore, if B′ finds an update for req from P in its response table, it detects that P is faulty and sends a kill message to P. Server P halts upon receipt of this message. Then, all non-halted backups (B′ and B″ in Figure 3.3) broadcast a next primary(B′) message (abbreviated as "nextp" in Figures 3.2 and 3.3) to the clients (assuming B′ is the server after P in the list of servers). This message announces to the clients that B′ is taking over as the next primary. The server B′ takes over as primary (if it has not already crashed) by broadcasting an alive message to the backups and starts executing the primary protocol. If the backups time out waiting for an alive message from B′, then they send a kill message to B′, broadcast to the clients the next primary(B″) message, and expect an alive message from B″. When a backup takes over as primary, since broadcast is atomic, all backups agree with that backup on the last update message from P that is incorporated in the database state at each site. In the above scenario, if B′ makes a send-omission on the kill message to P, then B′ realizes its own send-omission failure when it subsequently receives a message from P after a certain bounded interval. In such a case, B′ halts. When B′ receives the request req from C, if B′ does not find req in its table and the primary P has been broadcasting a stream of messages without a skip in sequence numbers, then C must have made a send-omission on the request-message to P. In this case, B′ ignores C's request. (Note that if P crashed after receiving the request, but before processing it, all backups realize the failure of P because of the absence of a periodic message. In such a case, they elect a new primary and inform the clients.) Note that the servers, except for the primary, do not send periodic messages among themselves in our protocol. Therefore, our protocol has a lower message complexity than those proposed in [Bu93].
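The decision a backup B′ makes when a client retransmits a request can be summarized as a three-way rule (a hypothetical Python sketch; the inputs are assumptions about what the backup can observe):

```python
def backup_on_retransmitted_request(table, reqid, stream_has_no_skip,
                                    kill_primary):
    """Decision at a backup B' receiving a retransmitted client request
    (send-omission protocol). table is the response table at B';
    stream_has_no_skip is True if the primary's message stream shows
    no gap in sequence numbers so far."""
    client_id, _seq = reqid
    entry = table.get(client_id)
    if entry is not None and entry[0] == reqid:
        # The update for reqid was broadcast, yet the client timed out:
        # the primary omitted the response, so the primary is faulty.
        kill_primary()
        return "primary-faulty"
    if stream_has_no_skip:
        # The primary never saw the request although its stream is
        # intact: the client omitted the send; ignore the request.
        return "ignore"
    # Otherwise the failure detector handles the skip/timeout case.
    return "await-detector"
```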
3.2 Discussion of correctness

In this section, we briefly discuss how the above protocol satisfies the three properties given in Section 2. We use dmin and cmin to denote the minimum values of the transmission delay and computation time, respectively. To simplify the discussion, we assume that the failure-detector process at the backups needs zero processing time and that τ = c. We first show that a primary server sends a response to a client only if the broadcast of the corresponding update message to the backups is successful.
Lemma 3.1 Suppose a primary P broadcasts a message M with sequence number N to all backups at time t. If it does not receive a kill message by t + 3d + c, then M is received by all non-halted backups.
Proof: Suppose P omitted to send M. A backup B detects this fault in one of the following two ways.

Case 1: B received a message M′ with sequence number N′ > N from P and detected a skip in the sequence of messages from P. The server P can broadcast M′ in the interval [t + cmin, t + c]. This message must reach B in the interval [t + cmin + dmin, t + c + d). Therefore, the latest time at which B detects the send-omission fault of P on M and sends a kill message is t + c + d. This kill message must reach P by t + c + 2d.

Case 2: B timed out waiting for a message from P. The message prior to M must have been sent by P in the interval [t − c, t − cmin], and this message reaches B in the interval [t − c + dmin, t − cmin + d). The backup B waits c + d time to receive the next message and therefore expects the message in the interval [t + dmin + d, t − cmin + c + 2d). Since B timed out waiting for the message, the latest time at which B can send a kill message to P is t − cmin + c + 2d, and this message must reach P before t + 3d + c − cmin.

In both of the above cases, the kill message sent by B reaches P before t + 3d + c if P makes a send-omission on M sent at t. If there is no such kill message, M is received by all backups. □

We can use the above lemma to bound the wait-time for the above protocol.
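The two cases in the proof of Lemma 3.1 can be checked numerically; the sketch below computes the latest possible arrival time of a kill message in each case and confirms that both fall within t + 3d + c (hypothetical Python with illustrative parameter values; not part of the protocol):

```python
def latest_kill_arrival(t, d, c, c_min):
    """Latest time a kill message can reach the primary P when P makes
    a send omission on a message broadcast at time t (per the two
    cases in the proof of Lemma 3.1)."""
    # Case 1: a backup sees a later message M' and detects the skip;
    # the kill is sent by t + c + d and takes at most d to arrive.
    case1 = t + c + 2 * d
    # Case 2: a backup times out; the kill is sent by
    # t - c_min + c + 2d and takes at most d to arrive.
    case2 = t - c_min + c + 3 * d
    return max(case1, case2)
```

For any d > 0 and 0 < c_min ≤ c, the returned value is below t + 3d + c, matching the bound in the lemma.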
Lemma 3.2 Suppose a client C sends a request to a primary P at time t. If P sends a response to C, then it must reach C before t + 5d + (n + 1)c. □
It can be seen that, since there is at least one non-faulty server in the system, all non-faulty clients must eventually receive a response to their requests. Hence the guaranteed-response requirement can be shown to be satisfied by the above protocol. Now, we briefly explain how the bofo requirement is satisfied.
Lemma 3.3 The system acts like a (f, 2fδ + d)-bofo system for f failures.

Proof sketch: It can be shown that if a non-faulty client C sends a request-message at time t to a primary P and times out waiting for a response, but P processed this request (i.e., the update from P is incorporated by the other backups), then C will receive a response (from some server on its list) for this request before t + (f − P + 1)δ. By this time all non-faulty clients will have stopped sending requests to P. Further, it can be shown that, in any run of the system, if t1 is the time at which the first outage occurred involving requests sent to P (ties broken arbitrarily) and t2 is the time at which the last outage occurred involving P (ties broken arbitrarily), then t2 − t1 ≤ 2fδ + d. (The time t2 is well defined because all clients stop sending requests to P when the next primary server takes over.) Hence, we can associate an interval of length at most 2fδ + d with each faulty server, and this interval covers all outages corresponding to that server. Since there can be at most f such faulty servers, we can conclude that the system acts like a (k, Δ)-bofo system for f failures, where k = f and Δ = 2fδ + d. □

We now briefly explain why the above protocol satisfies the consistency requirement. To prove that this requirement is met, we now define the follows relation on requests. We use the phrase "a request R is processed by a server S" to denote that S received a request-message for R, computed the response, and broadcasted the update message to the servers. This broadcast message is received by all the non-crashed servers and is incorporated into their database states.
Definition 3.1 Consider any run of the system. A request Rq follows a request Rp in this run iff (a) the same server i processed both Rp and Rq, processing Rp first, or (b) Rp is processed by a server i, Rq is processed by a server j (i ≠ j), and i precedes j in the list of servers.

The following lemma proves that any backup, prior to taking over as primary, incorporates all updates that it received from its primary.
Lemma 3.4 Suppose a client C sends a request R to a primary P at time t and times out waiting
for a response. Suppose further that P processed R. Then, if any backup B dequeues the request R, it must have already dequeued the update corresponding to R from P .
Proof: The primary P sends an update U for R no later than t + d + nc. From the definition of "processed by", all backups receive the update by t + 2d + nc. The client C times out at t + 5d + (n + 1)c and retransmits R to a backup, say B. The earliest time R can reach the backup B is t + 5d + (n + 1)c + d_min. Hence, R is enqueued at B after U. Therefore, R can be dequeued only after U. □

Using the above lemma, we can prove that a request is processed exactly once in our system. Therefore, follows is a total ordering on the set of requests R processed in the system. We say that a backup B realizes the failure of a primary P at time t iff one of the following conditions holds:
1. B received a message from P at t with sequence number l, the sequence number of the previous message from P is l', and l > l' + 1.
2. B timed out at t waiting for a message from P.
If a backup B realizes the failure of primary P at time t and the sequence number of the last valid message from P is l, then it can be shown that B incorporates all update messages from P with sequence numbers in the range 1 through l before B incorporates any update messages from the next primary, and before servicing new request-messages from its receive queue (if B becomes the next primary). Since the broadcasts are atomic, all non-halted backups realize the failure of P at roughly the same time and incorporate in their databases the same number of update messages from P. We use these observations to prove that the response computed by the system of servers is identical to that produced by a single failure-free server.
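The two failure-realization conditions above translate directly into a small predicate. A sketch (the function name and argument shapes are our own):

```python
def realizes_failure(last_seq, new_seq=None, timed_out=False):
    """A backup realizes the failure of its primary iff it observes a gap in
    the primary's sequence numbers (condition 1) or times out waiting for a
    message from the primary (condition 2). new_seq is None on a timeout."""
    if timed_out:
        return True
    return new_seq is not None and new_seq > last_seq + 1

assert not realizes_failure(last_seq=4, new_seq=5)   # consecutive: no failure
assert realizes_failure(last_seq=4, new_seq=7)       # skip: messages 5, 6 were lost
assert realizes_failure(last_seq=4, timed_out=True)  # timeout: primary silent
```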
Lemma 3.5 Consider the sequence of requests ⟨R1, ..., Rp, ...⟩ ordered by the follows ordering in any run of the system of servers V. Assume that each server in V starts in the same initial state s0. For each p (p > 0), the update message associated with processing Rp contains the same response and state as those produced by a single failure-free server S when it starts in initial state s0 and services the requests in the order R1, R2, ..., Rp.
Proof: We prove this lemma using induction on p.
Basis: p = 1. The request R1 is the first request processed in V. Assume that R1 is processed by the primary server P. The state of P is s0 before R1 is processed and is identical to the initial state of S. Hence, the broadcast message contains a state and a response corresponding to processing R1 which are identical to those produced by S in servicing R1 while in state s0.
Induction step: Assume that the lemma is true for all 0 < p < l. We show that it is true for p = l. Let R_{l-1} be processed by the primary server P. From the induction hypothesis, the broadcast message from P contains a state s_{l-1} that is identical to the state of S after it services the requests R1, ..., R_{l-1}. The server Q which processes Rl must be in state s_{l-1} before dequeuing Rl (from the above discussion). So Rl will be applied to the same state in both servers Q and S. Therefore, the update message broadcasted contains the state and response that are identical to those produced by S after it services the requests R1, ..., Rl in that order. □

The protocol ensures that once a request from a client is processed by the system, the request is not processed again. If the client retries a request, the response computed previously is sent to the client. Using this observation and Lemma 3.5, it can be seen that the above protocol satisfies the consistency requirement.
4 Receive Omission Failures

4.1 Description of protocol
In this section, we summarize the protocol for tolerating receive-omission failures. The protocol described here differs significantly from that given in [Bu93]. Our protocol needs only f + 1 servers to tolerate f faulty servers, whereas the method given in [Bu93] uses 2f + 1 servers. The protocol described here attempts to translate a receive-omission failure of a server S into a crash failure by broadcasting a faulty(S) message to all sites, including S. Upon receipt of this message, S must halt. Such an attempt may not be successful because S, being faulty, may omit to receive the faulty message. Since a broadcast message may not be received by all destination sites under this failure model, the protocol is somewhat complex. Also, faulty clients may omit to receive responses and other system maintenance messages, further complicating the protocol. We first discuss some general issues about the protocol. In this protocol, a set of faulty clients may continue to get service from a faulty primary P_f, never realizing that P_f is faulty. (This happens because faulty clients may have omitted to receive the faulty messages or the next primary messages.) Such clients may later send a request to a non-faulty primary P. A server detects a faulty client C by detecting a skip in the sequence numbers of the requests from C, and ignores such requests. A request that does not show a skip is called proper. The update messages from a primary are numbered
consecutively, and if a backup discovers a skip of sequence numbers in the stream of updates, it halts because it omitted to receive one or more update messages. No other message from a primary has a sequence number. A server uses a faulty message to accuse another server of being faulty. A server (or client) I which receives such a message has sufficient information to determine whether the sender of the message or the accused is faulty. The server which is identified to be faulty is marked faulty in I's server list, and all subsequent messages from that faulty server are ignored by I. If I receives a faulty message accusing itself, I may halt. Figure 4.1 shows the normal processing of a primary server. A client C sends a request R to primary P and waits for wait-time to obtain a response. Upon receipt of R, P first checks if R is proper. If not, P ignores R. Otherwise, P searches its response table for R. If R is found in the table, the corresponding response is sent to C; otherwise, P computes the response, broadcasts the update message to all backups, and immediately after that sends the response to C.
[Figure 4.1: Normal Processing — message-flow diagram omitted]
[Figure 4.2: Faulty Primary P — message-flow diagram omitted]
[Figure 4.3: Faulty Client C and Faulty Backup B' — message-flow diagram omitted]
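The normal-case processing just described (proper-request check, response-table lookup, update broadcast, then reply) can be sketched as follows. The class, the compute function, and the message formats are our own illustrative assumptions, not the paper's notation:

```python
class Primary:
    """Sketch of a primary's normal-case request handling. broadcast and
    send are stand-ins for the real (atomic-broadcast) communication layer."""
    def __init__(self, state, broadcast, send):
        self.state = state
        self.response_table = {}   # reqid -> previously computed response
        self.last_seq = {}         # client id -> last request sequence number
        self.broadcast, self.send = broadcast, send

    def handle_request(self, client, reqid, seq, body):
        # A skip in the client's request sequence numbers marks the request
        # improper (the client is faulty); improper requests are ignored.
        if seq > self.last_seq.get(client, 0) + 1:
            return
        self.last_seq[client] = seq
        if reqid in self.response_table:
            # Duplicate of an already-processed request: resend old response.
            self.send(client, self.response_table[reqid])
            return
        response, self.state = self.compute(body, self.state)
        self.response_table[reqid] = response
        # Broadcast the update to all backups, then answer the client.
        self.broadcast(("update", reqid, response, self.state))
        self.send(client, response)

    def compute(self, body, state):
        # Hypothetical deterministic service: append the request body.
        return ("ok", body), state + [body]

# Hypothetical driver: one fresh request, one retry, one improper request.
updates, replies = [], []
p = Primary([], updates.append, lambda c, r: replies.append((c, r)))
p.handle_request("C", "r1", 1, "x")   # processed: one update, one response
p.handle_request("C", "r1", 1, "x")   # retry: cached response resent, no new update
p.handle_request("C", "r9", 5, "y")   # sequence skip: ignored as improper
print(len(updates), len(replies))     # 1 2
```

Broadcasting the update before replying is what guarantees that any response accepted by a client is backed by an update some server holds.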
A backup takes over as primary if it can establish that the current primary is faulty. Since the servers do not send periodic messages to each other, the only way to detect a faulty primary is the following. When a client C times out waiting for P to respond to a request req, it retransmits req to some backup B' in its list. When B' receives req, it cannot immediately conclude that P is faulty. This is because C may have timed out because it committed a receive-omission on the response sent by P. If B' finds an update corresponding to req in its response table, B' sends the response to C. Now, let us consider the case where B' does not have an update for req. Even now, B' cannot conclude that P is faulty, because it must distinguish between the following two scenarios:
1. B' made a receive-omission on the update corresponding to req, and
2. P made a receive-omission on req.
In the former scenario, since P processed req, at least one non-faulty server must contain the update for req. In the latter scenario, no backup will have an update for req. To decide which of these two alternatives is actually the case, B' broadcasts to all servers a check message containing the reqid and the primary id of B'. A backup can send at most one check message in any run of the system. After sending the check message, B' waits for a sufficient amount of time so that the above decision can be made. If there is another backup B'' with an update for req with P as primary (see Figure 4.3), and B'' receives the check message from B' at some time t, then B'' broadcasts a faulty message to all clients and backups accusing B'. A faulty message is also broadcasted to all servers and clients if the primary of B' is marked faulty at time t in the list of servers at B''. All clients and backups that believe the faulty message from B'' mark B' faulty in their list of servers, ignore all further messages from B', and continue to regard P as their primary. When B' receives (if at all) the faulty message from B'', it realizes its own receive-omission and halts. On the other hand, if there is no backup with an update for req, P must have made a receive-omission on req (see Figure 4.2). Therefore, faulty messages accusing B' will not be sent. When P receives (if at all) the check message from B', it detects its own fault and halts.
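The decision B'' makes upon receiving a check message reduces to a small predicate. The following is a sketch under assumed local data structures (a response table keyed by reqid and a per-server status map), not the paper's actual message handler:

```python
def decide_check(sender, primary, reqid, response_table, server_status):
    """Whom does backup B'' accuse on receiving check(reqid, primary) from
    B' (= sender)? Returns the id of the server declared faulty, or None."""
    # B'' holds the update for reqid: the primary did process the request,
    # so B' must have omitted to receive that update -- accuse B'.
    if reqid in response_table:
        return sender
    # B'' has already marked B''s primary faulty: B' has been following a
    # faulty primary -- accuse B'.
    if server_status.get(primary) == "faulty":
        return sender
    # No update anywhere is consistent with the primary having omitted to
    # receive the request; B'' stays silent and the primary (if it ever
    # receives the check message) detects its own fault and halts.
    return None

print(decide_check("B1", "P", "r1", {"r1": "resp"}, {}))  # B1 accused
print(decide_check("B1", "P", "r2", {}, {}))              # no accusation
```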
(P may never receive this check message and may continue to service requests. However, all non-faulty clients mark P faulty.) In order to take over as primary, B' broadcasts the next primary(B') message to all servers and clients. When clients receive this message from B', they stop sending requests to P and will not accept any responses from P. Now, B' waits for a sufficient amount of time to receive all the update messages whose responses have already been accepted by the non-faulty clients. Then, B' broadcasts a lastuid message (denoted by "luid" in Figure 4.2) containing the sequence number of the last update message from P that B' incorporated in its state. At this stage, it is possible that B' may be detected to be faulty by another backup B'' that received more updates from P than did B'. This is because B' may have omitted to receive some of the updates B'' received. After broadcasting the lastuid message, B' waits for a specified amount of time to receive any faulty messages generated as a result of its lastuid message. If it receives none, then B' changes mode to primary, starts processing requests, and begins sending responses. Note that B' may commit receive-omissions on all the faulty messages broadcasted about itself, and may believe that it has taken over as primary. Also, a faulty client or a faulty server that commits a receive-omission on the faulty messages about B' may erroneously install B' as its primary. Now, we describe how a backup installs a new primary server. After determining that its current
primary P is faulty using the check message from B', a backup B'' marks P faulty in its list of servers and waits for B' to take over as its primary. The server B'' first receives the next primary message from B'. At this point, we say that B'' installs B' as its "inactive primary". After receiving the next primary message from B', B'' incorporates the updates from P for a specified amount of time, and queues up the subsequent update messages. The lastuid message from B' to B'' specifies the last update message from P that B' incorporated in its state. If B'' incorporated no more updates than B', it rolls forward to the state at B' (if needed) by incorporating the missing updates from its queued updates. If it has received no faulty messages accusing B' by this point, we say that B'' installs B' as its "active primary" and is ready to accept update messages from B'. If B'' determines that B' omitted to receive some of the update messages from P that B'' itself received, it broadcasts a faulty message about B' to the other servers and clients, ignores all subsequent messages from B', and reverts to accepting updates from P. If B'' fails to receive either or both of the next primary and lastuid messages from B', it marks B' faulty in its list of servers and continues to accept updates from P. If B'' receives either of these messages from a server without receiving a check message first, B'' realizes its own receive-omission and halts. Note that a backup that has been following a faulty primary can never be installed as primary by non-faulty clients and non-faulty servers, because such a backup will be detected to be faulty when it broadcasts a check message. It is also possible that a backup receives two check messages at around the same time, or receives one check message while waiting for another backup to take over as its primary.
In these cases, the check messages have enough information to determine which backup should take over as primary. When a client C receives a next primary message from a server B', and C has not received a faulty message about B' and has not marked B' faulty in its list, then C removes all servers preceding B' in its list and installs B' as its new primary. When a client C marks its primary P faulty in its list of servers, it stops sending requests to P and waits to receive a next primary message. If C does not receive a next primary message within a specified amount of time, C sends its requests to the next server after P that is not marked faulty in its list.
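The client-side rules just described — an ordered server list, marking servers faulty, and installing a next primary only when it is not already marked faulty — can be sketched as follows (class and method names are hypothetical):

```python
class Client:
    """Sketch of the active-client bookkeeping: the head of the ordered
    server list (skipping servers marked faulty) is the current primary."""
    def __init__(self, servers):
        self.servers = list(servers)   # ordered list of server ids
        self.faulty = set()            # servers this client marked faulty

    def primary(self):
        for s in self.servers:
            if s not in self.faulty:
                return s
        return None

    def on_faulty(self, accused):
        # A believed faulty message: ignore the accused from now on.
        self.faulty.add(accused)

    def on_next_primary(self, b):
        # Install b as primary unless b is already marked faulty; all
        # servers preceding b are removed from the list.
        if b in self.faulty or b not in self.servers:
            return
        self.servers = self.servers[self.servers.index(b):]

c = Client(["P", "B1", "B2"])
c.on_faulty("P")            # client marks its primary faulty
c.on_next_primary("B1")     # B1 announces itself as next primary
print(c.primary())          # B1
```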
4.2 Discussion of Correctness

We now discuss the correctness of the protocol outlined above. The following definitions are used in the correctness proofs. We say that a server (client) is an absolutely good server (client) (AGS and AGC, respectively) if it receives all the messages sent to it and never crashes. A server i is installed as an inactive primary by another server j at time t if j receives the next primary message from i at t and i is not marked faulty in j's list of servers at time t. A server i is installed as an active primary by another server j at time t if j installs i as an inactive primary at time t' (t' < t) and at time t, j is ready to accept update messages from i as described in the protocol. A primary i is a valid primary
if an AGS installs it as an active primary. A server (client) is faulty if it is not an AGS (AGC). Note that the system has at least one AGS. In order to ensure consistency, we need to show that a faulty client does not cause any inconsistencies in the database and that it does not mislead non-faulty clients. Our correctness proofs rely on a lemma which establishes that all good servers behave in the same manner and that a good client never marks a good server faulty in its list of servers. The statement of this lemma requires some definitions. We use the term event to represent one of the following actions: the receipt (sending) of a message at (from) a site, a timeout, and the end of a computation while processing a request. The order of events is specified using a hypothetical global clock which associates a timestamp with each event (breaking ties arbitrarily). We can obtain a total order on a global set of events by arranging the events in increasing order of their timestamps. We define such a sequence of events to be linearized. Finally, recall that d denotes the strict upper bound on the transmission delay. We are now ready to state the lemma.
Lemma 4.1 Let O = e1, e2, ..., ek, e_{k+1}, ... be a linearized sequence of the global set of events for any run of the system. For all i > 0, the following conditions hold at the end of execution of the partial sequence O_i = e1, e2, ..., e_{i-1}, e_i.
1. If there is an event at time t which results in a server S being marked faulty by an AGS a_p, then all other AGSs mark S faulty by time t + d.
2. If there is an event at time t which results in installing a server S as an inactive primary by an AGS a_p, then all other AGSs install S as an inactive primary by time t + d.
3. If there is an event at time t which results in installing a server S as an active primary by an AGS a_p, then all other AGSs also install the server S as an active primary by time t + d.
4. Given any two AGSs a_p and a_q, let PS(a_p) and PS(a_q) be the sequences of valid primaries of a_p and a_q, respectively. Also, let US(a_p) and US(a_q) denote the sequences of sequence numbers of update messages incorporated by a_p and a_q, respectively, in the order in which they were received. Then, at any time t, either PS(a_p) and US(a_p) are prefixes of PS(a_q) and US(a_q), respectively, or vice versa.
5. No AGS is marked faulty in the list of servers at another AGS.
6. If there is an event at time t which results in marking a server S faulty by an AGS, then all AGCs also mark S faulty in their lists by time t + d.
7. If there is an event at time t which results in installing a server S as an inactive primary by an AGS a_p, then all AGCs install S as primary by time t + 3d.
8. No AGS is marked faulty by an AGC in its list of servers. □
Lemma 4.1 can be proven by induction on i. We omit the proof because of space limitations. The following lemma estimates the wait-time of the above protocol during failure-free operation.
Lemma 4.2 Suppose a client C sends a request req to a primary P at time t. If P sends a response
to req , then the response reaches C before t + 2d + nc.
Proof: P receives req no later than t + d. It computes and broadcasts the update for req before t + d + nc and immediately sends a response to C. Hence, this response must reach C before t + 2d + nc. □

The determination of the wait-time when there are failures is much more involved. Suppose a client C times out with primary P and retransmits the request to a backup B'. If C receives a response from B', it can be shown that the response reaches C before t + 12d + 2c. Therefore, the wait-time of the protocol is 12d + nc. It can also be shown that the system acts like a (k, ∆)-bofo system with k = f and ∆ = (f + 1)δ + 9d. We now briefly show how the above protocol satisfies the consistency requirement. Given any run σ of the system and an AGS a_p, let R = ⟨R1, ..., R_{i1}, ..., R_{il}, ...⟩ be the sequence of requests, ordered by the follows relation, for which the updates are incorporated in the database state by a_p in σ. Let PS(a_p) be the sequence of valid primaries installed at a_p in this run.
Lemma 4.3 Consider the sequence of requests in R. Suppose a request-message q in a request Ri is sent to a valid primary T. If T dequeues q in the run, then T has already incorporated the update corresponding to the immediately preceding request R_{i-1} in the follows relation.
Proof: The lemma is trivial if T processed R_{i-1}. So, assume that some valid primary P, different from T, processed R_{i-1}. Since P processed R_{i-1}, it must have broadcasted an update message corresponding to R_{i-1} that is incorporated into the state of a_p (from the definition of "processed by"). Since a_p installs T as its active primary, the lastuid message from T must indicate to a_p that T received all the update messages from P that a_p received. Therefore, T must have received the update message corresponding to R_{i-1}. According to the protocol, T first incorporates all the update messages and only then starts processing requests. Hence, the lemma is proved. □

The following lemma can be proven using Lemma 4.3.
Lemma 4.4 Consider the sequence of requests R. Assume that each server in the system starts in the same initial state s0. For each p (p > 0), the update message associated with processing Rp contains the same response and state as those produced by a single failure-free server S when it starts in initial state s0 and services the requests in the order R1, ..., Rp. □
Using the above lemma, the consistency requirement can be proven.
5 Protocol for Crash Failures

5.1 Description of protocol
We now provide a description of a protocol for crash failures. The protocol is similar to that described in [Bu93]. Both protocols require f + 1 servers to tolerate f crash failures. During the failure-free operation of the protocol, a client C sends a request R to primary P and waits for a predetermined amount of time to receive a response. Upon receiving R, P checks whether R has been processed earlier (by some former primary) by consulting its response table. If so, it sends the corresponding response to C. Otherwise, it computes the response and the update and changes the database state. It then broadcasts the update and response to all backups. Immediately after that, the primary sends the response to the client. If P sends the response successfully, the response will reach the client before the wait-time expires. Every non-halted backup B receives the update message broadcasted by P and incorporates it in its state. B also stores the reqid and response in its response table. If C times out waiting for a response from P, it sends R to the next backup on its list. If it times out waiting for this backup, it will contact the backup after that in the list. Since we have f + 1 servers, one of them must respond. When some backup B receives this request, it realizes that P has crashed. So, backup B broadcasts a next primary(B) message to all the clients to announce that it is taking over as the next primary. It incorporates all updates from P in its state and then starts acting as primary. When a client receives a next primary(i) message, it deletes all servers j, j < i, from its list of servers and considers i as its primary. Since faulty clients are assumed to be fail-stop, a faulty client cannot cause any inconsistency in the databases stored at the servers.
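The client side of this crash-failure protocol is simply a walk down the ordered server list, retransmitting after each timeout. A minimal sketch, with `send_and_wait` standing in for the real transport (all names are our own):

```python
def issue_request(servers, send_and_wait, req, wait_time):
    """Sketch of the active client under crash failures: try each server in
    list order, retransmitting on timeout. With f + 1 servers and at most f
    crashes, some server must eventually respond.
    send_and_wait(server, req, wait_time) returns a response, or None on
    timeout."""
    for server in servers:
        response = send_and_wait(server, req, wait_time)
        if response is not None:
            return response
    raise RuntimeError("more than f servers failed")

# Hypothetical run: the first two servers have crashed, the third responds.
crashed = {"S0", "S1"}
resp = issue_request(
    ["S0", "S1", "S2"],
    lambda s, r, w: None if s in crashed else ("resp", r),
    "R1",
    wait_time=1,
)
print(resp)  # ('resp', 'R1')
```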
5.2 Discussion of Correctness

It can be shown that the above protocol has a wait-time of 3d + (n + 2)c. (Recall that d denotes the upper bound on the transmission delay, c denotes the upper bound on computation time, and n denotes the number of clients in the system.) It is easy to see that the above protocol satisfies the response-guarantee requirement. It can also be shown that the system acts as an (f, 6d + (n + 4)c)-bofo system. It is straightforward to see that in any run of the system, each request is processed exactly once. As a consequence, follows is a total ordering. The next lemma proves that any backup, prior to taking over as primary, incorporates all updates that it received from its primary.
Lemma 5.1 Consider the sequence of requests ⟨R1, R2, ..., R_{p-1}, Rp, ...⟩ in a run, ordered by the follows relation. Suppose a request-message q in a request Rp is sent to server i. If server i dequeues q in the run, then i has already incorporated the update corresponding to the immediately preceding request R_{p-1}. □
The final lemma of this section points out that the response and the update computed by the system for a request in a given run are identical to those generated by a single failure-free server starting in the same state. Hence, the protocol satisfies the consistency requirement.
Lemma 5.2 Consider the sequence of requests ⟨R1, ..., Rp, ...⟩ in a run of the system. Assume that each server in the system starts in the same initial state s0. For each p (p > 0), the update message associated with processing Rp contains the same response as that produced by a single failure-free server when it starts in initial state s0 and services the requests in the order R1, ..., Rp. □
6 Conclusions

The primary-backup approach is commonly used to provide fault tolerance in distributed systems. In this paper, we proposed a new model, which we call the active client (AC) model, for this approach. Here, each client maintains an ordered list of servers and uses it to detect faulty servers and elect new primary servers. Using the AC model, we proposed protocols that tolerate crash failures, send-omission failures, and receive-omission failures of servers and clients. One important feature of our protocols is that they tolerate these three types of failures using minimal replication; that is, they use only f + 1 servers to tolerate f faulty servers. Another important feature is that for each type of failure, our protocols can tolerate f server failures and an unlimited number of client failures using only f + 1 servers. The AC model also leads to protocols with lower message complexity during failure-free operation of the system. Our results point out that by focusing on benign client failures, the number of non-faulty servers required to make the system fault-tolerant can be significantly reduced. We are currently investigating protocols that can tolerate classes of failures where the type of client failures is different from that of server failures. We are also planning to extend our work to handle Byzantine client failures.
References

[AD76] P. A. Alsberg and J. D. Day, "A Principle for Resilient Sharing of Distributed Resources", Proc. Second Intl. Conf. on Software Engineering, Oct. 1976, pp. 627-644.

[Ba81] J. F. Bartlett, "A NonStop™ Kernel", Proc. Eighth ACM Symposium on Operating System Principles, Dec. 1981, pp. 22-29.

[BE+91] A. Bhide, E. N. Elnozahy and S. P. Morgan, "A Highly Available Network File Server", Proc. 1991 Winter USENIX Conf., Jan. 1991, pp. 199-205.

[BM93] N. Budhiraja and K. Marzullo, "Tradeoffs in Implementing Primary-Backup Protocols," Tech. Report, Department of Computer Science, Cornell University, Ithaca, NY, 1993.

[BM+92a] N. Budhiraja, K. Marzullo, F. B. Schneider and S. Toueg, "Primary-Backup Protocols: Lower Bounds and Optimal Implementations," Tech. Report, Department of Computer Science, Cornell University, Ithaca, NY, Sept. 1992.

[BM+92b] N. Budhiraja, K. Marzullo, F. B. Schneider and S. Toueg, "Optimal Primary-Backup Protocols," Proc. Sixth Intl. Workshop on Distributed Algorithms, Haifa, Israel, Nov. 1992, pp. 362-378.

[Bu93] N. Budhiraja, "The Primary-Backup Approach: Lower and Upper Bounds" (Ph.D. Thesis), Technical Report No. 93-1353, Department of Computer Science, Cornell University, Ithaca, NY, June 1993.

[CN+95] P. Chundi, R. Narasimhan, D. J. Rosenkrantz and S. S. Ravi, "Active Client Primary-Backup Protocols", Technical Report 95-2, Computer Science Dept., University at Albany, SUNY, Feb. 1995.

[HJ92] Y. Huang and P. Jalote, "Effect of Fault Tolerance on Response Time," IEEE Trans. Software Engineering, Vol. 41, No. 4, April 1992, pp. 420-428.

[Ja94] P. Jalote, Fault Tolerance in Distributed Systems, PTR Prentice-Hall, Englewood Cliffs, NJ, 1994.

[Lam78] L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Communications of the ACM, 21(7), July 1978, pp. 558-565.

[LG+91] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson and M. Williams, "Replication in the Harp File System", Proc. Thirteenth ACM Symposium on Operating System Principles, Oct. 1991, pp. 226-238.

[Ly90] J. Lyon, "Tandem's Remote Data Facility", COMPCON Spring 90, Digest of Papers, Feb. 1990, pp. 562-567.

[Mu89] S. Mullender (ed.), Distributed Systems, Addison-Wesley, Reading, MA, 1989.

[OL88] B. Oki and B. Liskov, "Viewstamped Replication: A New Primary Copy Method to Support Highly Available Distributed Systems", Proc. Seventh ACM Symp. on Principles of Distributed Computing, Aug. 1988, pp. 8-17.

[Sch90] F. B. Schneider, "Implementing Fault-tolerant Services Using the State Machine Approach: A Tutorial," ACM Computing Surveys, Dec. 1990, pp. 299-319.