HPP: A Reliable Causal Broadcast Protocol for Large-Scale Replication in Wide Area Networks

Akhil Kumar (1) and Noha Adly (2)

(1) College of Business, University of Colorado, Boulder, CO 80309-0419, USA
(2) Computer Science Department, Alexandria University, Alexandria, Egypt

Email: [email protected]

This paper describes a fast, reliable, scalable and efficient broadcast protocol called HPP (hierarchical propagation protocol) for weak-consistency replica management. It is based on organizing the nodes in a network into a logical hierarchy and maintaining a limited amount of state information at each node. It ensures that messages are not lost due to failures or partitions and minimizes redundancy. Furthermore, the protocol allows messages to be diffused while nodes are down, provided the parent and child nodes of a failed node are alive. Moreover, the protocol allows nodes to be moved in the logical hierarchy and the network to be restructured dynamically in order to improve performance, while still ensuring that no messages are lost while the switch takes place and without disturbing normal operation. A performance study of the protocol in terms of availability and propagation delay indicates that the protocol reduces the delay by a factor of four compared to a protocol that does not diffuse messages past failed nodes.

Received September 26, 1996; revised March 31, 1998

The Computer Journal, Vol. 41, No. 2, 1998

1. INTRODUCTION

Data are replicated in distributed systems to improve system availability and performance. There are two approaches for managing replicated data: synchronous protocols that enforce strict serializability by means of quorums [1, 2, 3] and asynchronous protocols which permit updates and queries to occur at any replica. Synchronous protocols are impractical for large networks as they suffer from high latency and low throughput, since links tend to be slow and unreliable, and a large number of replicas generate considerable traffic over the network. Asynchronous protocols provide higher availability and give better response time. An asynchronous protocol should provide a propagation scheme which ensures that updates are efficiently and reliably propagated to all replicas even if the communication network does not provide such a guarantee. However, such an approach is based on the assumption that the applications can tolerate some inconsistency and that reconciliation methods are available to resolve conflicts.

The semantics of some applications are such that they do not require strict serializability, and weaker forms of consistency are adequate and acceptable. Some examples are updates to replicated data, e.g. bibliographic databases, software distribution, managing mailing lists and bulletin board services. The common features in all these applications are that: (i) they require only weak consistency; (ii) they can tolerate small delays in propagation; (iii) they must be reliable; (iv) there is a need to minimize the communication cost and redundant communication. For instance, in a bulletin board application, the main consistency requirements are that messages generated by a node must be seen by all other nodes in the order in which they were generated, and if a single node receives a message and posts a response or a follow-up message it should be seen by all other nodes after the original message to which the follow-up relates, i.e. causal ordering must be maintained. Hence, what is required is a method that can asynchronously propagate messages generated at any node to all other nodes in a network while respecting the above ordering constraints. In a large network the challenge lies in ensuring that such asynchronous propagation is fast, reliable and scalable.

Other applications that have used weak consistency are air traffic control [4], resource discovery systems such as archie [5], stock exchange systems and Lotus Notes. In this paper we describe a propagation protocol called HPP (hierarchical propagation protocol) for managing asynchronous data replication.

Grapevine [6] and the Global Name Service [7] were among the first systems to use weak consistency. Another weak consistency protocol called Timestamped Anti-Entropy (TSAE) is proposed in [8]. The basic idea is that nodes maintain state information such as version numbers about the data they store and at periodic intervals they exchange the state information to reduce entropy. If the states are different, the node with the most recent version will update the node with an earlier version of a file or object. A node communicates with its 'peers', but keeps state information for all nodes. The flood-d protocol [9] extends TSAE to a hierarchical protocol, thus making it more scalable and lowering the overhead of communication. Reference [9] also gives algorithms to create a good logical topology from a physical topology. Other weak consistency protocols were presented in [4, 10, 11, 12, 13].

The above weak consistency protocols are useful and interesting; however, they all create redundant messages to achieve reliability and require a node to maintain state information for all nodes. This is acceptable for small networks with a few replicas, but it will incur a large communication overhead for wide area networks like the Internet. Moreover, a node sends not only messages it originates, but also messages it has received from other replicas. Hence, a node may receive a message several times: once directly from the sender and more copies that are relayed by other nodes. This redundancy leads to increased communication overhead and wasted network bandwidth. Some of the above protocols (e.g. [13]) mitigate the problem of redundant messages by maintaining information about what other sites know. However, this state information is quadratic in the number of nodes in the network. Furthermore, although redundancy is reduced it still exists in varying degrees. Of course, redundancy does usually contribute to an increase in availability.

A well-known broadcasting technique is the reverse path forwarding algorithm proposed by Dalal and Metcalfe [14]. In this method a node that receives a packet or message will forward it to other nodes only if the node is along the shortest path from the source (i.e. the sender) to other nodes in the network. This technique is based on constructing minimum spanning trees and is optimal under various conditions; however, it can drop messages and is unreliable if the network changes dynamically. It is also not fully distributed and does not perform ordering.
Another solution to this problem is to have a central node which receives all messages from various sender sites, imposes an order on them (e.g. the method proposed in [15]) and then sends them to all other sites. Then, all sites will see all the messages in the same order. This scheme places too much burden on the central site. Other broadcast protocols are discussed in [16, 17, 18].

Hierarchies are a natural and logical way of organizing a group of nodes for message propagation. This idea has been proposed in [19] for ordered and reliable multicast communication. It is a more distributed version of the scheme proposed in [15], where ordering was imposed by a central node. In [19], however, the nodes are all arranged in a hierarchy. There are several multicast groups, and each sender sends its message to the root node and then the root propagates the message through the hierarchy. The hierarchy is arranged in such a way that all multicast groups receive the message in the correct order. The drawback with this approach is that it is not fully distributed and places an undue burden on the root node. Moreover, failure of the root node will interrupt the propagation.

Our proposal also exploits the notion of a hierarchy, but in a different way from the [15, 19] proposals, so as to reduce the burden on the root. The basic idea is that a message generated at any node of the hierarchy is propagated to its child nodes and the parent node, and each receiving node further sends the message to its parent and child nodes, except the node from where it came. Of course, the exceptions to this general scheme are that a leaf node does not have any child nodes and the root does not have a parent. The nice property of such a propagation scheme is that it is scalable, because each node sends a message to only a few other nodes and the burden of propagation is distributed quite uniformly throughout the network. Our hierarchy is dynamic and can change depending upon the traffic conditions. If performance is poor, a node can relocate to another position in the hierarchy. The initial hierarchy is somewhat arbitrary, but it evolves into a good arrangement from a performance point of view through reorganizations.

Another important factor which makes this scheme attractive and novel is that each node needs to keep limited information only about its own parent and child nodes, and need not know anything about the rest of the network. Therefore, it avoids the considerable additional overhead incurred by other distributed schemes such as ISIS [20, 21] and HARP [22]. ISIS is a distributed programming toolkit that provides atomic, interactive delivery with total or causal message ordering. It is based on virtual synchronous process groups and has been used to develop a variety of applications including replicated file systems. However, ISIS is aimed towards small systems and ensures strong consistency of group views at the expense of latency and communication overhead. Causal order is maintained by timestamping each message with ordering information, of size $N$, representing message i.d.s already seen by the sender.
When a message $m$ arrives at a destination, if one or more of $m$'s predecessor messages have not arrived, then $m$'s delivery is delayed until the appropriate messages arrive. However, the overhead of piggybacking with each message a timestamp of size proportional to the total number of nodes in the network is quite substantial and hinders scalability. In contrast, our method incurs a much smaller overhead. HARP [22] is based on a generalized hierarchical structure, consisting of node clusters arranged in a hierarchy, and also maintains global state information.

Our goals in designing a new hierarchical propagation scheme are as follows. First, it should be reliable, i.e. messages must not get lost. Second, it must minimize redundancy. Third, it must be scalable and efficient. This means that the workload should be uniformly distributed and no node should have to maintain a global view of the network. Fourth, it must allow messages to propagate in spite of failed nodes. Our failure model enables such propagation to occur by diffusion past any failed node provided its parent node and child nodes are up. Fifth, it must ensure that messages are seen in the same order in which they are sent, i.e. multiple source ordering is maintained [19]. Finally, it must also allow the network to be reorganized dynamically without loss of messages. This means, for instance, that a node anywhere in the hierarchy should be able to move to a new position elsewhere in the hierarchy without losing messages during the time it switches positions and without affecting the operation of other nodes in the network.

This paper describes in detail our propagation protocol called HPP (Hierarchical Propagation Protocol) for managing replicated data. The organization of this paper is as follows. Section 2 briefly reviews the overall system model and states our assumptions about the processors and the communications network. Section 3 describes the operation of our propagation algorithm during normal conditions. Section 4 explains how network reorganization and restructuring take place. Section 5 describes operation in failure mode. Section 6 evaluates the performance of the protocol analytically and also through various simulation experiments. Finally, Section 7 concludes the paper.

2. SYSTEM MODEL

The system consists of $N$ nodes connected by an internetwork. Processors may fail and then restart; however, only fail-stop processors are assumed and Byzantine faults do not occur. The underlying communication network is unreliable: it may lose or duplicate messages and does not guarantee any order of delivery. Link failures can cause the network to be partitioned. These partitions are eventually merged again. In the special case where a node loses its communication with all other nodes in the network, it is treated like a node failure. Messages are delayed due to transmission over the network, but a finite delay is assumed. Therefore, a node can eventually send a message to any other node by retransmitting the message if it does not receive an acknowledgement after a certain timeout period. Hence, message delivery is ensured by means of timeouts and retries.

Nodes are organized in a logical hierarchy, where a node $i$ at level $l$ has a parent at level $l - 1$, a grandparent at level $l - 2$, $n$ children at level $l + 1$, $n \times n$ grandchildren at level $l + 2$ and so on. The root node is at level 1. Neighbors are nodes with the same parent node. The assumption that each parent node has exactly $n$ child nodes is made only for ease of presentation of the protocol; in general, the protocol does not require it. Each node communicates only with its parent and children, which are together referred to as its correspondents. We use the notation $P(i)$ to denote the parent node of node $i$, and $C_j(i)$ to denote the $j$th child of node $i$.

Messages on the network are classified into normal and control messages. Normal messages are regular messages sent by an application and they contain application data. Control messages are special messages related to the protocol and they do not contain any application data. Our data consistency model is based on an assumption of weak consistency. Updates to data are relayed by broadcasting the update message. The protocol provides ordering of messages and reliability.
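As an illustration of the terminology in this section, the following minimal Python sketch models a logical hierarchy with parents, children, correspondents and neighbors. The class `Node` and its method names are hypothetical and are introduced only for this example; they are not part of the protocol specification.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """A node in the logical hierarchy (hypothetical class for illustration)."""
    ident: int
    parent: Optional["Node"] = None
    children: list["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> None:
        child.parent = self
        self.children.append(child)

    def level(self) -> int:
        # The root is at level 1; each step down adds one level.
        return 1 if self.parent is None else self.parent.level() + 1

    def correspondents(self) -> list["Node"]:
        # The parent P(i), if any, followed by the children C_j(i).
        return ([self.parent] if self.parent is not None else []) + self.children

    def neighbors(self) -> list["Node"]:
        # Nodes that share this node's parent.
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]


# A small example tree: 1 is the root, 2 and 3 are its children, 4 is a child of 2.
root, a, b, leaf = Node(1), Node(2), Node(3), Node(4)
root.add_child(a)
root.add_child(b)
a.add_child(leaf)

corr_ids = [n.ident for n in a.correspondents()]
neighbor_ids = [n.ident for n in a.neighbors()]
```

In this sketch node 2 communicates only with nodes 1 and 4, which mirrors the rule that a node exchanges messages only with its correspondents.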

3. PROPAGATION DURING NORMAL OPERATION

The basic scheme for propagation is very simple: a node generating a message sends it to all its correspondents (parent and children). A node receiving a message from a correspondent sends the message to all correspondents except the one from which the message came. This works recursively and a message originating at any site will eventually propagate everywhere. A node receives each message only once, so there is no redundancy during normal operation. We call this basic scheme HPP0. The problem with this scheme is that it does not tolerate any faults: if a node fails, messages cannot propagate past it either upwards or downwards. We describe this basic scheme in this section and in subsequent sections show how it can be improved so that messages can propagate past a failed node.

Every message $m$ must carry the identity of the originator $m_{orig}$ (i.e. the node that creates the message) and a corresponding sequence number $m_{seq}$. The combination $\langle m_{orig}, m_{seq} \rangle$ creates a unique message identifier. Also, a node tags every message it sees with two values $S^{\downarrow}$ and $S^{\uparrow}$, where $S^{\downarrow}$ is a sequence number of messages sent downwards, and $S^{\uparrow}$ is a sequence number of messages sent upwards by the node. Thus, each node strips off the previous values of $S^{\downarrow}$ and $S^{\uparrow}$ attached to a message it receives, and replaces them with new values before propagating the message. Messages are also tagged with an indication of the previous sender: 'local' (if the message was generated locally), 'child' or 'parent'. A bit map is kept along with each message (1 bit per correspondent) to keep track of acknowledgements received from the correspondents.

Each node $i$ keeps the following state vectors or data structures:

• $V_i$, a vector to keep track of the messages node $i$ has generated or received from its own correspondents: $V_i = (R_p, R_{c_1}, R_{c_2}, \ldots, R_{c_n}, L, S^{\downarrow}, S^{\uparrow})$, where $R_p$ is the number of the last message received by $i$ from up, i.e. from the parent $P(i)$; $R_{c_j}$ is the number of the last message received by $i$ from down, i.e. from its $j$th child $C_j(i)$; $L$ is the number of the last local message generated by node $i$ itself; $S^{\downarrow}$ is the sequence number of the downward stream, i.e. a counter to keep track of the number of messages sent by $i$ to its child nodes; and $S^{\uparrow}$ is the sequence number of the upward stream of messages sent, i.e. another counter for messages sent by $i$ to its parent. (In the rest of this paper, we use the notation $V_i \cdot x$ to denote entry $x$ in the data structure $V_i$.)

• $V_i^P$, a vector describing node $i$'s view of its parent's $V$ vector: $V_i^P = (R_p, R_{c_1}, R_{c_2}, \ldots, R_{c_n}, L, S^{\downarrow}, S^{\uparrow})$. The definition of each entry is the same as for $V_i$ but with respect to the parent.

• $V_i^C$, a vector keeping track of received messages that were originated by child nodes of node $i$: $V_i^C = (L_1, L_2, L_3, \ldots, L_n)$, where $V_i^C \cdot L_j$ is the number of the last message generated locally by the child $C_j(i)$ and received by $i$.
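The basic HPP0 scheme and the counters of $V_i$ can be sketched in a few lines of Python. This is a simplified simulation under stated assumptions, not the protocol itself: the class `HppNode` and the functions `link` and `broadcast` are hypothetical names, delivery is modelled as an in-memory queue with no loss, acknowledgements and the $V_i^P$ and $V_i^C$ vectors are omitted, and the counter increments follow the ones listed in Figure 1.

```python
from collections import deque


class HppNode:
    """Hypothetical HPP0 node holding the counters of the vector V_i."""

    def __init__(self, ident):
        self.ident = ident
        self.parent = None
        self.children = []
        self.R_p = 0         # messages received from the parent
        self.R_c = {}        # child id -> messages received from that child
        self.L = 0           # messages generated locally
        self.S_down = 0      # downward stream counter
        self.S_up = 0        # upward stream counter
        self.delivered = []  # message identifiers seen, in arrival order

    def correspondents(self):
        return ([self.parent] if self.parent is not None else []) + self.children


def link(parent, child):
    child.parent = parent
    parent.children.append(child)
    parent.R_c[child.ident] = 0


def broadcast(origin):
    """Originate one message at `origin` and propagate it through the tree."""
    origin.L += 1
    msg_id = (origin.ident, origin.L)   # the pair <m_orig, m_seq>
    origin.delivered.append(msg_id)
    origin.S_down += 1
    origin.S_up += 1
    pending = deque((origin, c) for c in origin.correspondents())
    while pending:
        sender, receiver = pending.popleft()
        receiver.delivered.append(msg_id)
        if sender is receiver.parent:
            receiver.R_p += 1
            receiver.S_down += 1
        else:
            receiver.R_c[sender.ident] += 1
            receiver.S_down += 1
            receiver.S_up += 1
        # Forward to every correspondent except the one the message came from.
        for nxt in receiver.correspondents():
            if nxt is not sender:
                pending.append((receiver, nxt))
    return msg_id


# Example tree: 1 -> (2, 3), 2 -> (4, 5), 3 -> (6, 7); node 5 originates a message.
nodes = {i: HppNode(i) for i in range(1, 8)}
for p, c in [(1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (3, 7)]:
    link(nodes[p], nodes[c])
msg = broadcast(nodes[5])
```

Because the hierarchy is a tree and a node never sends a message back to its sender, every node in the example sees the message exactly once, which is the no-redundancy property claimed for normal operation.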

Therefore, state vector $V_i$ contains information about messages received by node $i$ itself, and $V_i^P$ and $V_i^C$ contain information that node $i$ maintains about the states of its parent and child nodes respectively. By aggregating messages in this way, in terms of local ($L$), received from child ($R_{c_j}$) and received from parent ($R_p$), these three vectors encapsulate a large part of the information that a node needs to maintain. The state vectors $V$, $V^P$ and $V^C$ are updated upon the generation or receipt of any message.

The details of updating the state vectors are shown in Figure 1. The figure lists the steps in the three functions Propagate, Propagate_down and Propagate_up that a node must perform depending upon whether it originates a message, receives a message from the parent node or receives a message from a child node respectively. It should be noted that the function calls (except the wait) made within these functions, either recursively or to another function, are non-blocking. In the figure, we assume that node $i$ is the $r$th child of its parent $P(i)$ (in general, a node will know the value of $r$ by examining its View array, to be described shortly). We also assume that when a node receives a message from a child it can identify which child the message is coming from.

Finally, each site maintains and updates a vector of the highest message it has received from every other node and processes the received messages in sequence. When node $i$ sends messages to correspondent $j$, $j$ processes these messages in the order in which they were sent from $i$, i.e. in FIFO (first-in, first-out) order. Messages received out of order are detected by a missing sequence number and inserted in a queue for later processing. We call these FIFO queues. In general, there are $n + 1$ such FIFO queues kept by each node, one per correspondent, and this ensures that messages are received in the order in which they were sent between a pair of correspondent nodes.
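A per-correspondent FIFO queue of the kind described above can be sketched as follows. This is a minimal illustration under assumptions: the class name `FifoChannel` and its fields are hypothetical, and sequence numbers are taken to start at 1 on each link.

```python
class FifoChannel:
    """Receiver-side FIFO queue for one correspondent (hypothetical helper)."""

    def __init__(self):
        self.last_in_sequence = 0   # highest number received in sequence
        self.held = {}              # out-of-order messages keyed by number

    def receive(self, seq, payload):
        """Return the payloads that become processable, in FIFO order."""
        if seq <= self.last_in_sequence or seq in self.held:
            return []               # duplicate: ignore
        self.held[seq] = payload
        ready = []
        # Release the longest run of consecutive numbers now available.
        while self.last_in_sequence + 1 in self.held:
            self.last_in_sequence += 1
            ready.append(self.held.pop(self.last_in_sequence))
        return ready


ch = FifoChannel()
out = []
# Messages 2 and 4 arrive early, and message 1 is duplicated by the network.
for seq, payload in [(2, "b"), (1, "a"), (1, "a"), (4, "d"), (3, "c")]:
    out.extend(ch.receive(seq, payload))
```

A gap in the sequence numbers holds later messages back, and duplicates produced by the unreliable network are dropped, so the node processes each link's messages in the order in which they were sent.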
The FIFO order is supported by maintaining two arrays $VS$ and $VR$, which respectively maintain the highest message number sent in sequence to and received in sequence from various nodes. For example, consider the simple hierarchy of Figure 2. In this hierarchy, the correspondents of node 3 are 1, 8, 9 and 10. If $VR_3 = (5, 7, 9, 4)$, it means that node 3 has received messages numbered 1 through 5 from node 1, messages 1 through 7 from node 8 and so on. Similarly, if $VS_3 = (9, 7, 8, 7)$, it means that node 3 has sent messages numbered 1 through 9 to node 1, messages 1 through 7 to nodes 8 and 10, and 1 through 8 to node 9.

Since all nodes observe FIFO order while propagating messages and there is only one unique path in the logical hierarchy for a message between any pair of nodes, causal ordering is maintained during normal operation, i.e. if a node $i$ receives a message $m_1$ at time $t_1$ and generates a message $m_2$ at time $t_2$, such that $t_2 > t_1$, then every other node in the network receives the two messages in the order $m_1$, $m_2$. Theorem A.1 (in the Appendix) establishes that causal order is achieved during normal operation.

Each node must know the identities of its own correspondents (i.e. its parent node and all child nodes) and their correspondents. Therefore, each node maintains an $[n + 2] \times [n + 1]$ matrix View, in which row 1 holds the i.d.s of its own correspondents, row 2 gives the i.d.s of its parent's correspondents and rows 3 through $n + 2$ give the i.d.s of the correspondents of its children (i.e. child nodes 1 through $n$ respectively). In each row, the first entry denotes the i.d. of the parent and entries 2 through $n + 1$ are the i.d.s of the child nodes 1 through $n$ respectively. Table 1 gives an example of a View array maintained by node 3 for the example hierarchy shown in Figure 2. The array contains five rows. The first row shows the view of node 3 itself, while the second row shows the view of node 3's parent, i.e. node 1. Finally, the last three rows show the view of the three child nodes of node 3, i.e. nodes 8, 9 and 10. In each row, by convention, the first node is the parent node and the remaining nodes are the child nodes. A $-1$ indicates a null value. Thus, the root node has a $-1$ for its parent node and the leaf nodes have a $-1$ for their child nodes.

Messages are kept in a log. A message is inserted into the log (in FIFO order) when it is received and is removed from the log only after all correspondents have acknowledged receipt of the message. A summary of the data structures maintained at each node $i$, their descriptions and their sizes is presented in Table 2.

The algorithms for a node to join or leave the hierarchy are relatively straightforward and are discussed next. The join algorithm is shown in Figure 3. A node $v$ wishing to join the hierarchy identifies a parent node and sends a join request message to it.
The parent node adds an entry for the new node $v$ into its View matrix and $V$ vectors, and notifies its correspondents (i.e. its parent and all the child nodes). It then sends its View matrix and $V$ vector to the joining node. The joining node initializes its own View matrix using the View matrix from the parent, and adopts the $V$ vector from the parent as its own $V^P$ vector. Finally, it initializes the elements in its own $V$ vector to 0, except that $V \cdot R_p = V^P \cdot S^{\downarrow}$. To leave the hierarchy, a node notifies its parent. The parent in turn updates its view, notifies its correspondents and sends an acknowledgement to the leaving node. The correspondents update their View matrix accordingly.

Propagate(m)
(node i generates message m and propagates to parent and all child nodes)
(node i is the r-th child of its parent)
{
    Increment V_i · L;
    For (k = 1 to n) { Propagate_down(m, "local", 0); }    (send to k-th child)
    Propagate_up(m, "local");
    Increment V_i · S↓ and V_i · S↑;
    wait(acknowledgement from parent);
    Increment V_i^P · S↓, V_i^P · S↑ and V_i^P · Rc_r;
}

Propagate_down(m, source, cnum)
(node i receives message m from its parent for downward propagation)
{
    If m is out of order Then queue m
    Else If m is not a duplicate Then    (check message number for duplicates)
    {
        Increment V_i · R_p, V_i^P · S↓, V_i · S↓;
        If source = "local" Then increment V_i^P · L;
        If source = "parent" Then increment V_i^P · R_p;
        If source = "child" Then increment V_i^P · Rc_cnum;
        Send acknowledgement to parent node;
        For (k = 1 to n) { Propagate_down(m, "parent", 0); }
    }
}

Propagate_up(m, source)
(node i receives message m from its j-th child for upward propagation)
(node i is the r-th child of its parent)
{
    If m is out of order Then queue m
    Else If m is not a duplicate Then
    {
        Increment V_i · Rc_j;
        If source = "local" Then increment V_i^C · L_j;
        For (k = 1 to n; k ≠ j) { Propagate_down(m, "child", j); }
        Propagate_up(m, "child");
        Increment V_i · S↓, V_i · S↑;
        Send acknowledgement to child node;
        wait(acknowledgement from parent);
        Increment V_i^P · S↓, V_i^P · S↑, V_i · R_p and V_i^P · Rc_r;
    }
}

FIGURE 1. Algorithm for updating the state vectors upon originating or receiving a message.

[Tree diagram of nodes 1 to 16, not reproduced. Per Table 1, node 1 is the root with children 2, 3 and 4; node 3 has children 8, 9 and 10; node 9 has children 14, 15 and 16.]

FIGURE 2. Example of a four-level hierarchy.

TABLE 1. The 5 × 4 View array for node 3, View_3.

Row (view of)     Parent   Child 1   Child 2   Child 3
1 (node 3)           1        8         9        10
2 (node 1)          -1        2         3         4
3 (node 8)           3       -1        -1        -1
4 (node 9)           3       14        15        16
5 (node 10)          3       -1        -1        -1

4. REORGANIZATION

This section describes an algorithm that allows a node $v$ to change its parent node without affecting normal operation. The moving node $v$ must either be a leaf node or it should move all of its descendants along with it to the new position. The ability to reorganize the hierarchy helps to overcome various problems such as link congestion. Periodic reorganization can also be used as a technique to keep the hierarchy balanced and to flatten it; however, we do not discuss details of such algorithms here. Keeping the hierarchy tree balanced and flattened through periodic reorganization in this manner helps to distribute the work load uniformly across the various nodes. It also lowers the transmission time by reducing the number of levels. Such reorganization should occur when the mean and variance values of the number of child nodes in the hierarchy exceed a certain threshold.

The reorganization scheme is designed so as to guarantee that no messages are lost while the switch is taking place. To achieve this, the moving node informs the new parent of its intention to join. On receiving a confirmation from the new parent, it leaves the old parent and joins the new parent. The protocol must ensure that no messages are lost even if a failure occurs along the path from the old parent to the new parent while the reorganization is in progress. To ensure correctness, a node attempts to move only if its parent is not down. Figure 4 shows visually the main steps of the reorganization protocol and the sequence in which events happen at various nodes. The detailed reorganization algorithm is given in Figure 5 and a brief explanation of the algorithm follows.

Consider a node $v$ with an old parent OP that wishes to join a new parent NP. A node experiencing poor performance will identify a potential new parent node NP, perhaps by querying a central node that maintains information on traffic patterns. Without loss of generality, the algorithm in Figure 5 assumes that $v$ is the $r$th child of the old parent and will become the $(n + 1)$th child of the new parent. $v$ sends a join_req1 message to NP. NP, on receiving join_req1, will send a join_ack1 message to $v$ and will start buffering any further messages generated or received for $v$. When $v$ receives join_ack1, it sends a special_marker message to NP to mark the last message sent from $v$ to NP through OP (in the present hierarchy). The join_ack1 and special_marker messages serve as demarcators for timing purposes, as explained below. Then, $v$ sends to OP a leave_req message and stops sending further messages to OP. However, it starts buffering for NP (in a log called $Mlog_{NP}$) all new messages generated locally or received from its child nodes. It should be noted that the join_ack1 and special_marker messages are special control messages which travel through the hierarchy exactly like a normal data message. Unlike normal messages, these control messages do not contain any application data and are only required for the correct operation of the protocol.

When OP receives the leave_req message, it stops sending further messages to $v$, discards $v$ from its View, informs its own correspondents of the change and sends a leave_ack message to $v$. On receiving leave_ack, $v$ removes OP from its View and sends join_req2 to NP. NP will then add $v$ to its View, inform all its correspondents about the new child and send an acknowledgement join_ack2 to $v$, including all messages it has previously buffered for $v$ (in a log called $Mlog_v$). On receiving join_ack2, $v$ processes these messages, adds NP to its view and finally sends to NP all messages it has buffered for it. NP waits for special_marker and, after receiving it, starts processing messages received from $v$. The move is considered complete when node $v$ receives join_ack2 from NP. While adding a new node, NP will also increase the size of its vectors and of the View array if necessary. Recall from Section 3 that a $-1$ denotes a null value (see Table 1).

It should also be noted that the scheme does not need synchronization between all nodes to commit a change and it does not disturb normal operation. This feature of the reorganization scheme enhances scalability. The old views are maintained so that, in case a failure occurs during reorganization, the old views can be restored and the reorganization is aborted. (Further discussion of reorganization in the presence of failures is given in Section 5.5.)
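The role of the special marker at the new parent can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the class `NewParent` and its method names are hypothetical, message transport is reduced to direct method calls, and only the ordering rule is modelled (NP holds back messages received directly from $v$ until the marker has arrived over the old path through OP).

```python
class NewParent:
    """Hypothetical model of how NP orders messages from a moving node v."""

    def __init__(self):
        self.marker_seen = False
        self.buffered_from_v = []   # direct messages from v held back
        self.processed = []         # messages in the order NP processes them

    def receive_via_old_path(self, msg):
        # Messages that v sent before leave_req still arrive through OP.
        if msg == "special_marker":
            self.marker_seen = True
            # Everything v sent on the old path has now arrived, so the
            # messages v sent directly can be released in order.
            self.processed.extend(self.buffered_from_v)
            self.buffered_from_v.clear()
        else:
            self.processed.append(msg)

    def receive_direct_from_v(self, msg):
        if self.marker_seen:
            self.processed.append(msg)
        else:
            self.buffered_from_v.append(msg)


np_node = NewParent()
np_node.receive_via_old_path("m1")       # sent by v before the move
np_node.receive_direct_from_v("m3")      # sent directly by v, arrives early
np_node.receive_via_old_path("m2")       # older message still in transit via OP
np_node.receive_via_old_path("special_marker")
np_node.receive_direct_from_v("m4")
```

In this run NP processes the messages in the order m1, m2, m3, m4 even though m3 arrived before m2, which is the ordering effect the marker is meant to provide.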
It can be shown that no messages will be lost as a result of the reorganization by arguing that OP and NP will receive all messages $v$ receives from below, and that $v$ will receive all messages that NP has received before the move or will receive after the move. Assume $v$ sends leave_req at time $t_1$. Then, NP will receive directly from $v$ every message $v$ has received from below at time $t \geq t_1$. All messages received by $v$ at time $t < t_1$ from below would be propagated by it to OP and then OP will further propagate them to NP by the normal propagation mechanism. Therefore, NP receives all messages received by $v$ from below. Similarly, OP receives from $v$ all messages it has received from below at time $t < t_1$. All messages received by $v$ from below at time $t \geq t_1$ will be sent to NP and NP will further propagate them to OP by the normal propagation mechanism. Therefore, OP receives all messages received by $v$ from below.

Assume NP received $v$'s join_req1 at time $t_2$. Messages received by NP at time $t \geq t_2$ will be sent to $v$. NP sends join_ack1 at time $t = t_2$ (NP sends the join_ack1 immediately after receiving join_req1; this happens atomically). Since the join_ack1 propagates through the network like other normal messages and since the FIFO order is preserved in transmitting messages between correspondents, any message that is received by NP at time $t < t_2$ will be received by OP and $v$ before join_ack1. Thus, $v$ is guaranteed to receive from OP all messages that NP received at time $t < t_2$, and no messages are missed.

To preserve the causal order while the switch is taking place, node NP does not process messages received directly from $v$ until the special_marker has been received. The special_marker is a demarcator which ensures that NP first processes all messages sent to NP through OP, and only then processes any new messages received directly from $v$. Furthermore, join_ack1 is another demarcator which also propagates through the network like a normal message before the change of correspondents occurs; thus, $v$ receives join_ack1 after all regular messages propagated by NP before the move. Since $v$ starts receiving new messages directly from NP after NP receives join_req2 (which is sent by $v$ only after join_ack1 is received), the ordering is preserved. A more rigorous version of this heuristic argument, showing that the causal order is not violated while the change is taking place, is given in Theorem A.2 in the Appendix.

TABLE 2. Summary of various data structures, their descriptions and sizes.

Variable name          Description                                             Size
View_i[n+2][n+1]       Partial view of the hierarchy                           (n+2)(n+1)
V_i                    Vector describing i's state                             n+3
V_i^P                  Vector describing i's record of its parent's state      n+4
V_i^C                  i's record of messages originated by its child nodes    n
log_i                  Log keeping received messages at node i                 variable
n+1 FIFO queues        Queues maintained for FIFO ordering of messages         variable
VS_i[n+1]              Vector keeping sequences of sent messages               n+1
VR_i[n+1]              Vector keeping sequences of received messages           n+1

[Message sequence diagram between v and the new parent of v, not reproduced. v sends join_request; the new parent updates its view and informs its correspondents, then returns its V vector and view; v initializes its view, sets V^P = V and starts normal operation.]

FIGURE 3. Main steps of the join algorithm.

5. FAILURES

In this section we describe how the protocol circumvents failures, i.e. how messages can be propagated past a failed node in both directions. This is important because in a hierarchical arrangement even one failed node can prevent messages from being propagated across it and cause a partition. Our protocol can bypass one failure per correspondent group, i.e. if node $i$ fails, and none of its correspondents is down, it is possible to propagate messages past $i$ in spite of the failure. We call this process by which messages propagate past a failed node diffusion.

If none of the correspondents of a node can communicate with it, then the node is deemed to have failed. Therefore, the special case in which a node gets partitioned from all its correspondents is treated as a node failure for the purposes of our algorithm. In such cases, the child nodes of the failed node elect a coordinator CO, and CO takes over the functions of the failed node. CO also queues a start_recovery message for the failed node and tries to send it periodically so that, after the failed node comes up, it knows that it has to trigger the recovery algorithm and go back to its normal operation mode. The information in the state vectors kept by the various nodes is adequate for a transition to be made which enables a coordinator to take over the functions of a failed parent.

The various steps required to be performed after a failure have been grouped into four phases. Assume the failed node is $f$. Once all its child nodes agree that $f$ has failed, they elect a coordinator by exchanging messages among themselves (election phase described in [23]). The coordinator then performs actions to temporarily assume the role of $f$ and bring all correspondents of $f$ up-to-date as of the time $f$ failed (transition phase). In the next phase, CO assumes the role of the failed node (diffusion phase) and finally, when $f$ recovers, it takes back its function from CO (recovery phase). In the first phase, it is assumed that a standard election algorithm is run (see [23]). The details of the next three phases are described below.

[Figure 4: message sequence diagram with five columns: OP's correspondents, OP, v, NP and NP's correspondents. Messages shown: join_req1, join_ack1, special_marker, leave_req, del_child, leave_ack, join_req2 (with view), join_ack2 (with view and buffered messages), buffered messages, add_child. Actions shown: start buffering messages for v, start buffering messages for NP, update view, process messages, wait(special_marker), process received messages. Control messages are distinguished in the legend.]

FIGURE 4. Main steps of the reorganization algorithm.

(Algorithm at node v; node v initiates the reorganization)
Reorg_node(v) {
    wait(node OP is alive);
    send join_req1 to NP;
    (wait for acknowledgement from NP)
    wait(join_ack1);
    wait(NP is alive);
    send a special control message special_marker;
    send leave_req to OP and stop sending it messages;
    insert new or child node messages in MLog_NP;
    (wait for leave acknowledgement)
    wait(leave_ack from OP);
    set old_View = View_v[2]; old_V^P = V^P_v;
    send join_req2(View_v[1]) to NP;
    (wait for acknowledgement from NP)
    wait(join_ack2(V_NP, View_NP[1], MLog_v) from NP);
    set View_v[2] = View_NP[1]; V^P_v = V_NP; V_v · R_p = V_NP · S↓;
    send MLog_NP to NP;
    send commit to OP and discard old_View and old_V^P;
    process messages in MLog_v;
}

(Algorithm at new parent NP)
Reorg_node(NP) {
    wait(join_req1);
    send a special control message join_ack1;
    insert newly generated or received messages in MLog_v;
    wait(join_req2(View_v[1]) from v);
    add v to its list of child nodes:
        set View_NP[1, n + 2] = v;
        set View_NP[n + 3] = View_v[1];
        V_NP · D_{n+1} = 0; V^C_NP · L_{n+1} = 0;
    send add_child(v) to its correspondents;
    send acknowledgement join_ack2(V_NP, View_NP[1], MLog_v) to v;
    wait(MLog_NP from v);
    wait(special_marker);
    process messages in MLog_NP;
}

(Algorithm at old parent OP)
Reorg_node(OP) {
    wait(leave_req from v);
    (copy old status vectors of node v)
    set old_V_v = V_OP · Rc_r; old_V^C_v = V^C_OP · L_r;
    set old_View_v[1] = View_OP[1]; old_View_v[2] = View_OP[r + 2];
    remove v from View_OP[1], V_OP and V^C_OP;
    remove row r + 2 from View_OP and rearrange the rows;
    send del_child(v) to its correspondents;
    send leave_ack message to v;
    wait(commit message from v), discard copy of old status vectors of node v;
}

FIGURE 5. Reorganization algorithm.

5.1. Transition

Assume f was in the process of sending a message to its correspondents when it failed. The transition algorithm ensures that, even if the message was sent to only a subset of the correspondents (in the worst case to only one of them), it will be delivered reliably everywhere. The algorithm is initiated by CO after being elected. Without loss of generality, assume that f is the jth child of its parent P(f). The algorithm is run by CO and its steps are listed in Figure 6. In the first four steps, CO gathers the state information it needs. In the next four steps, it instructs some nodes to send to other nodes messages that the latter have missed. The objectives of this exercise are: (1) to find out which node has the most current information as of the time node f failed and (2) to arrange for that node to send messages to other nodes that are behind. At the end of this phase all nodes are current as of the time node f failed, and CO is ready to assume the functions of its parent node f.

In steps 1 and 2, CO gathers the relevant state information from the parent and child nodes of f. In step 3, it determines which correspondent of f (among f's parent and child nodes) has received the largest number of messages that were generated locally at f and keeps this number in $MaxL_f$. Since each child node maintains a view of its parent's V vector, and these views could be different, step 4 compares the vectors maintained by the child nodes of f. The objective is to determine the highest numbered message received by f from each of its child nodes until failure and also to determine, for all messages received by f from each child node, how far behind the other child nodes are (since f must propagate messages received from one child node to all of the other child nodes). Consequently, $VP_{MAX}[i]$ is the highest numbered message f has received from its child node i; $VP_{MIN}[i]$ is the number of messages out of these that have reached the child node that is most behind.

The subsequent steps of the algorithm can be explained better by examining all possible ways in which f might send a message to, or receive a message from, one of its correspondents and then fail before propagating the message to all its other correspondents. In all such cases our transition algorithm must ensure that such a message reaches all correspondents of f. We divide this problem into four cases depending upon whether the correspondent of f is a parent node or a child node, and also depending upon whether f has sent a message to or received a message from the correspondent. These four cases are:

• f receives a message from P(f) and then f dies;
• f sends a message to P(f) and then f dies;
• f sends a message to a child node $C_i(f)$ and then f dies;
• f receives a message from a child node $C_i(f)$ and then f dies.

Each of these four cases is discussed separately below, and in each case we explain how the appropriate step of our algorithm ensures that a message that falls in that case is propagated to all other correspondents of f (the step numbers below refer to Figure 6).

• Case 1. P(f) generates a message itself (or receives a message from a child or a parent node), sends it to f and then f dies (denoted P(f) → f, f dies). Since P(f) has the message, it will get propagated upwards in spite of the failure of f. However, to ensure that it will also be propagated downwards, CO must compare $V_{P(f)} \cdot S_{\downarrow}$ with $V^{P}_{C_i(f)} \cdot R_p$: if $V_{P(f)} \cdot S_{\downarrow} > V^{P}_{C_i(f)} \cdot R_p$, then $(V_{P(f)} \cdot S_{\downarrow} - V^{P}_{C_i(f)} \cdot R_p)$ messages have been missed by $C_i(f)$ and CO asks P(f) to send those messages to $C_i(f)$ (see step 5).

• Case 2. f generates a message, sends it to P(f) and then dies (denoted f → P(f), f dies). Since P(f) has received the message, it will again get propagated upwards in spite of the failure of f. To ensure that it is propagated downwards, CO compares $V^{C}_{P(f)} \cdot L_j$ with $V^{P}_{C_i(f)} \cdot L$: if $V^{C}_{P(f)} \cdot L_j > V^{P}_{C_i(f)} \cdot L$, then $C_i(f)$ has missed some messages and CO asks P(f) to send $C_i(f)$ the missing messages (see step 6).

• Case 3. f generates a message, sends it to one of its child nodes (say, $C_i(f)$) and then dies (denoted f → $C_i(f)$, f dies). To ensure that such messages are propagated upwards, CO compares $V^{P}_{C_i(f)} \cdot L$ with $V^{C}_{P(f)} \cdot L_j$: if $V^{P}_{C_i(f)} \cdot L > V^{C}_{P(f)} \cdot L_j$, then P(f) has missed one or more messages and CO must ask $C_i(f)$ to send the missing messages to P(f). To ensure that the messages are propagated downwards, CO compares $V^{P}_{C_i(f)} \cdot L$ with $V^{P}_{C_k(f)} \cdot L$: if $V^{P}_{C_i(f)} \cdot L > V^{P}_{C_k(f)} \cdot L$, then $C_k(f)$ has missed some messages and CO asks $C_i(f)$ to send the missing messages to $C_k(f)$ (see step 6).

• Case 4. $C_i(f)$ generates a message locally (or receives a message from below), sends it to f and then f dies (denoted $C_i(f)$ → f, f dies). To ensure that such messages are propagated downwards, CO compares $V^{P}_{C_i(f)} \cdot Rc_i$ with $V^{P}_{C_j(f)} \cdot Rc_i$: if $V^{P}_{C_i(f)} \cdot Rc_i > V^{P}_{C_j(f)} \cdot Rc_i$, then $C_j(f)$ has missed some messages sent by $C_i(f)$ and CO asks $C_i(f)$ to send those messages to $C_j(f)$ (see step 7). To ensure that such messages are propagated upwards, CO checks whether $(V_{P(f)} \cdot Rc_j - MaxL_f) < \sum_k V^{P}_{C_k(f)} \cdot Rc_k$. If so, then P(f) is missing some messages sent by child nodes and CO asks them to send those messages to P(f) (see step 8).

The different cases, along with the conditions used to detect the need for upwards and downwards propagation, are summarized in Table 3. A '–' in Table 3 means that the message has already been propagated in that direction and no action is required.

Since missing messages can fall into only one of the four cases described above, and since all of them are detected and propagated, it follows that if a node fails and it has sent a message to at least one of its correspondents before failing, then this message will be reliably propagated to all of its other correspondents and, consequently, to every other node in the network.

CO assembles all requests to P(f) or $C_i(f)$ into one message, including the total number of messages that each of them is supposed to receive from the others. P(f) or $C_i(f)$, on receiving diffused messages, treats them exactly as if they were coming from f. The only change is that the child nodes $C_i(f)$ (i = 1, ..., n) need not update $V^{P}_{C_i(f)}$, since it represents f's state and that is frozen because f is down. Furthermore, the batch of diffused messages should be sorted such that normal messages are processed first, and then control messages.

(1) Ask node P(f) for $V_{P(f)} \cdot S_{\downarrow}$, $V^{C}_{P(f)} \cdot L_j$ and $V_{P(f)} \cdot Rc_j$;
(2) Ask nodes $C_i(f)$ for $V^{P}_{C_i(f)}$, i = 1 ... n;
(3) Let $MaxL_f = \max\{V^{C}_{P(f)} \cdot L_j, \max_i\{V^{P}_{C_i(f)} \cdot L\}\}$;
    If ($MaxL_f = V^{C}_{P(f)} \cdot L_j$) Then $i^* = 0$;
    Else $i^* = i$ s.t. $MaxL_f = V^{P}_{C_i(f)} \cdot L$;
(4) Construct the vectors $VP_{MAX}$ and $VP_{MIN}$ s.t.
    $VP_{MAX}[i] = V^{P}_{C_i(f)} \cdot Rc_i$, ∀i;
    $VP_{MIN}[i] = \min_j\{V^{P}_{C_j(f)} \cdot Rc_i\}$, ∀i;
(5) Ask P(f) to send $C_i(f)$, ∀i, the last $(V_{P(f)} \cdot S_{\downarrow} - V^{P}_{C_i(f)} \cdot R_p)$ messages sent downwards;
(6) If ($i^* = 0$) Then
        Ask P(f) to send $C_i(f)$, ∀i, the last $(MaxL_f - V^{P}_{C_i(f)} \cdot L)$ messages received with m_orig = f;
    Else {
        Ask node $i^*$ to send P(f) the last $(MaxL_f - V^{C}_{P(f)} \cdot L_j)$ messages received with m_orig = f;
        ∀k ≠ $i^*$, Ask node $i^*$ to send $C_k(f)$ the last $(MaxL_f - V^{P}_{C_k(f)} \cdot L)$ messages received with m_orig = f;
    }
(7) Ask nodes $C_i(f)$, ∀i, to send to their neighbours their last $(VP_{MAX}[i] - VP_{MIN}[i])$ messages sent upwards;
(8) Let $y = \sum_k V^{P}_{C_k(f)} \cdot Rc_k - (V_{P(f)} \cdot Rc_j - MaxL_f)$;
    If y > 0 Then Ask $C_i(f)$, ∀i, to send the last y messages sent upwards to P(f);

FIGURE 6. Transition algorithm.

5.2. Diffusion

While node f is down, CO takes over the responsibilities of f temporarily until f comes up again. That is, the nodes $C_i(f)$ and P(f) send messages to CO, and CO diffuses them to the other correspondents of f. This ensures that the flow of propagation continues while f is down. Messages are transmitted in FIFO order between CO and the correspondents of f. This phase starts once the transition phase has terminated, i.e. when P(f) or the $C_i(f)$ have received and processed all messages they were supposed to receive during the transition phase. Each node can then move into the diffusion phase independently.
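As an illustration of the transition phase of Section 5.1, the following sketch shows how a coordinator might turn the state gathered in steps 1 and 2 of Figure 6 into the requests of steps 5 to 8. It is a minimal sketch under stated assumptions, not the authors' implementation: the class names, field names, node labels ("P", "C1", ...) and the tuple format of a request are all hypothetical, and only the counts are computed, not the messages themselves.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ParentReport:
    """State returned by P(f) in step 1 (field names are illustrative)."""
    s_down: int  # V_P(f).S_down: messages P(f) has sent downwards
    l_f: int     # V^C_P(f).L_j: messages originated at f that P(f) has recorded
    rc_f: int    # V_P(f).Rc_j: messages P(f) has received from f


@dataclass
class ChildView:
    """Child C_i(f)'s copy of f's vector, V^P_Ci(f), returned in step 2."""
    r_p: int            # messages f had received from P(f)
    l: int              # messages f had generated locally
    rc: Dict[int, int]  # rc[k]: messages f had received from child k


# (sender, receiver, kind of message, how many of the most recent ones)
Request = Tuple[str, str, str, int]


def plan_transition(parent: ParentReport, views: Dict[int, ChildView]) -> List[Request]:
    reqs: List[Request] = []

    # Step 3: find who holds the most messages that originated at f.
    best = max(views, key=lambda i: views[i].l)
    max_lf = max(parent.l_f, views[best].l)
    source = "P" if max_lf == parent.l_f else f"C{best}"

    # Step 4: per child i, the most and the fewest of i's messages seen via f.
    vp_max = {i: views[i].rc[i] for i in views}
    vp_min = {i: min(v.rc[i] for v in views.values()) for i in views}

    # Step 5 (Case 1): P(f) resends what a child missed from above.
    for i, v in views.items():
        if parent.s_down > v.r_p:
            reqs.append(("P", f"C{i}", "sent_down", parent.s_down - v.r_p))

    # Step 6 (Cases 2 and 3): the best-informed node resends f's own messages.
    if source != "P":
        reqs.append((source, "P", "orig_f", max_lf - parent.l_f))
    for i, v in views.items():
        if f"C{i}" != source and max_lf > v.l:
            reqs.append((source, f"C{i}", "orig_f", max_lf - v.l))

    # Step 7 (Case 4, downwards): children resend to their siblings.
    for i in views:
        if vp_max[i] > vp_min[i]:
            reqs.append((f"C{i}", "siblings", "sent_up", vp_max[i] - vp_min[i]))

    # Step 8 (Case 4, upwards): children resend to P(f) if it is behind.
    y = sum(vp_max.values()) - (parent.rc_f - max_lf)
    if y > 0:
        for i in views:
            reqs.append((f"C{i}", "P", "sent_up", y))
    return reqs


example = plan_transition(
    ParentReport(s_down=5, l_f=2, rc_f=7),
    {1: ChildView(r_p=5, l=3, rc={1: 4, 2: 1}),
     2: ChildView(r_p=4, l=2, rc={1: 3, 2: 1})},
)
```

In this example child 1 has seen one more message from f than anyone else, so it becomes the source in step 6, while P(f) fills in the one downward message that child 2 missed. Receivers are assumed to discard duplicates, as they do during recovery.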

TABLE 3. A summary of failure scenarios and conditions to detect upwards and downwards propagation.

Case | Upwards | Downwards
P(f) → f, f dies | – | $V_{P(f)} \cdot S_{\downarrow} > V^{P}_{C_i(f)} \cdot R_p$
f → $C_i(f)$, f dies | $V^{P}_{C_i(f)} \cdot L > V^{C}_{P(f)} \cdot L_j$ | $V^{P}_{C_i(f)} \cdot L > V^{P}_{C_k(f)} \cdot L$
f → P(f), f dies | – | $V^{C}_{P(f)} \cdot L_j > V^{P}_{C_i(f)} \cdot L$
$C_i(f)$ → f, f dies | $V_{P(f)} \cdot Rc_j - MaxL_f < \sum_k V^{P}_{C_k(f)} \cdot Rc_k$ | $V^{P}_{C_i(f)} \cdot Rc_i > V^{P}_{C_k(f)} \cdot Rc_i$

CO, on receiving a message m from a correspondent of f, treats it as if it is coming from a parent node and performs the algorithm described in Section 3 (see Figure 1), i.e. CO must update $MV_{CO}$, increment $V_{CO} \cdot S_{\downarrow}$ and $V_{CO} \cdot R_p$, send m to its own correspondents, etc. Additionally, it must also send the message to the correspondents of f, other than the one the message is coming from. If CO generates a message locally or receives a message from one of its own correspondents, it treats it normally (see Figure 1) but, in addition, sends it to all correspondents of f. A correspondent of f, on receiving a message from CO, acts as follows: if the correspondent is P(f), then it treats the message as if it is coming from a child (see Figure 1); if the correspondent is a child node of f, then it treats the message as if it is coming from a parent (again, see Figure 1), but does not update $V^{P}_{C_i(f)}$.

5.3. Recovery

While f is down, each correspondent of f, on generating a new message or receiving a message from its own correspondents, inserts it in the log, marks it as not acknowledged by f and sends it to CO. When node f comes up and receives a start_recovery message from CO, it triggers the recovery algorithm, which consists of the following steps.

1. f sends a recovering message to each of its correspondents. The message to correspondent i should include any messages in f's log which have not been acknowledged by i.
2. Correspondent i, upon receiving such a message:
   (a) checks whether any of the included messages is a duplicate (as it might have already received it during the transition phase) and, if so, discards it; otherwise, it accepts the message and updates $V_i$;
   (b) checks whether it is holding any pending messages in the FIFO queue for f and flushes the queue;
   (c) extracts from its log all messages not acknowledged by f and sends them to f along with an ack.
3. f, upon receiving the ack from correspondent i:
   (a) accepts non-duplicate messages (but does not forward them to its other correspondents as they already have them);
   (b) increments $V_f \cdot R_p$, if i is a parent, or $V_f \cdot Rc_j$, if i is the jth child, by the number of non-duplicate messages received from correspondent i.
4. After f receives acks from all its correspondents, it:
   (a) increments $V_f \cdot S_{\downarrow}$ by the total number of non-duplicate messages received from all correspondents;
   (b) sends $V_f$ to all its child nodes $C_i(f)$;
   (c) asks P(f) for $V_{P(f)}$.
5. Normal operation of f is resumed.
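The counter updates in steps 3 and 4(a) of the recovery algorithm can be sketched as follows. This is an illustrative sketch only: it assumes that every message carries a unique identifier (represented here as a string) and that f keeps a set of identifiers it has already accepted; the function name and the shape of its arguments are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def apply_acks(seen: Set[str],
               acks: Dict[str, Iterable[str]],
               parent: str) -> Tuple[int, Dict[str, int], int]:
    """Return the increments for V_f.R_p, V_f.Rc_j (per child) and V_f.S_down.

    seen   -- identifiers of messages f has already accepted (updated in place)
    acks   -- for each correspondent, the message identifiers sent with its ack
    parent -- the label of P(f) among the correspondents
    """
    r_p_inc = 0
    rc_inc: Dict[str, int] = {}
    for corr, msg_ids in acks.items():
        fresh = 0
        for m in msg_ids:
            if m in seen:      # step 3(a): duplicates are discarded
                continue
            seen.add(m)
            fresh += 1
        if corr == parent:     # step 3(b): credit the parent counter ...
            r_p_inc += fresh
        else:                  # ... or the counter of the corresponding child
            rc_inc[corr] = fresh
    # Step 4(a): S_down grows by the total number of non-duplicate messages.
    s_down_inc = r_p_inc + sum(rc_inc.values())
    return r_p_inc, rc_inc, s_down_inc


accepted = {"m1"}
increments = apply_acks(accepted, {"P": ["m1", "m2"], "C1": ["m2", "m3"]}, parent="P")
```

Here m1 is already known to f and m2 arrives twice, so only two messages are counted overall.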

Steps 1 through 2(b) ensure that the diffusion process is completed, e.g. if a node had received a message from f but had deleted it from its log before the transition algorithm started and hence the message was not diffused, or if f had generated a message but did not send it to any other node before failing. Furthermore, it is essential for the correspondents of f, who might or might not have received these messages, to flush any pending messages in the FIFO queue each of them keeps for f. Step 2(c) ensures that f will receive all messages it missed during the failure. Steps 3 and 4 allow f to bring itself up-to-date before resuming its functions. Recall that, while f was down, its child nodes were not required to update their $V^{P}_{C_i(f)}$ vector. Therefore, on recovering, f sends its $V_f$ vector to all of its child nodes $C_i(f)$ so that they can use the incoming vector as their new $V^{P}_{C_i(f)}$. Further, f needs to update its record of its parent's state; so f asks P(f) for its $V_{P(f)}$ and stores it as $V^{P}_{f}$. Afterwards, normal operation of f is resumed. (If CO receives a message from a correspondent of f after f comes up, then CO returns the message to the sender, noting that f is alive and that the sender must resend the message directly to f.)

5.4. Partitions

In large-scale systems, link failures can occur frequently. A sequence of link failures leads to partitions. It is assumed that an external monitor process is responsible for maintaining link status, i.e. detecting when partitions occur and when they are restored. This process constantly monitors the status of the network. When partitions are detected, correspondents in one partition mark in their view their correspondents in the other partition as being isolated, so as not to attempt sending them messages. However, messages generated or received by correspondents in one partition are kept in the log for the isolated correspondents. When the partition heals, each pair of previously isolated correspondents exchanges the messages that were kept for each other, updates their state vectors and propagates received messages as usual. This ensures that any messages that were missed during the partition will be received. Afterwards, normal operation is restored.

5.5. Reorganization despite failure

In this section we discuss the case in which failures occur while reorganization is taking place. The reorganization algorithm can proceed in spite of some kinds of failures, but it assumes that nodes v, OP and NP do not fail while reorganization is taking place. If one of these nodes fails, then the reorganization must be aborted. This is the reason that nodes OP and v maintain their old views and state vectors (see Figure 5), so that they can be restored in case of failure. In this way node v is still able to receive all messages from above through OP, in spite of a temporary interruption and a small delay on account of the attempted reorganization. On the other hand, if any of the other nodes, say the ones along the path between OP and NP, were to fail, then the reorganization algorithm can proceed in spite of these failures. The detailed proof of this claim is given in Theorem A.3 of the Appendix. The proof is somewhat laborious and considers all cases and subcases to show that no messages are lost if a failure occurs while a reorganization is in progress. This proof is, however, important because the causal ordering cannot be guaranteed in the presence of failures, and it is necessary to show that any messages sent by NP before the control message join_ack1 will be received by OP before join_ack1. Otherwise, v might lose some messages in reorganization.
We prove that this will not happen by arguing that, in the transition phase, the coordinator (CO) site will sort the messages for diffusion upwards such that the control messages appear last. This is the central idea behind the proof.

6. ANALYSIS AND SIMULATION RESULTS

In this section we discuss the performance results of the HPP protocol obtained both analytically and through simulation experiments. The simulation compares our HPP protocol which tolerates successive failures against HPP0 which does not. HPP0 is a surrogate for a conventional protocol like the one used on Usenet for news delivery. We also discuss issues of additional space and processing time overhead that arise in our protocol.

6.1. Analytical results

First, we analytically derive the probability that no pair of consecutive nodes has failed (i.e. both nodes are down) anywhere in the hierarchy. Two nodes are consecutive if they are at successive levels in the hierarchy and have a direct link between them. Assuming each node is independently 'up' with a probability p and 'down' with a probability q = 1 − p, we calculate the probability that a message generated anywhere in the hierarchy is propagated throughout the hierarchy without being blocked. In the following analysis, we represent the branching factor as B and the number of levels as L (recall, the root is numbered as level 1). A (sub) hierarchy is called 'good' if there is no pair of consecutive nodes in the (sub) hierarchy that are down; else, it is called 'bad'. The probability that no pair of consecutive nodes is down in an L-level hierarchy with a branching factor B at each level is denoted by P(L). The expressions for computing P(L) are given below:

$$P(L) = p\,\bigl(P(L-1)\bigr)^{B} + q\,\bigl(p\,P(L-2)^{B}\bigr)^{B}, \quad L \ge 2$$
$$P(1) = 1$$
$$P(0) = 1$$

P(L) is derived recursively as follows. To compute the probability of an L-level hierarchy (or sub-hierarchy) being 'good', we consider two cases: the root node is good (i.e. it is up) or the root node is bad (i.e. it is down). The overall probability is the sum of the individual probabilities of the two cases. The first part of the expression covers the case when the root is up (probability p). In this case, the probability that the rest of the hierarchy is good is given by $(P(L-1))^{B}$. The second part of the expression for P(L) corresponds to the case where the root node is down (probability q). In this case, all B nodes at the next (i.e. second) level must be up, and each of the sub-hierarchies below the second-level nodes must be good, which gives $(p\,P(L-2)^{B})^{B}$. Finally, note that P(0) and P(1) are boundary conditions and both have a value of 1.

The results for various L and B combinations are shown in Tables 4 and 5. In Table 4, the results are shown for p = 0.99, while Table 5 shows the results for p = 0.95. In Table 4, the probability of a 'good' hierarchy falls below 0.875 only for the largest hierarchies: five levels with a branching factor of 6, and six levels with a branching factor of 5 or more. Thus, even with 1365 nodes (six levels and a branching factor of 4), the probability of the entire hierarchy being 'good' is 0.875. The results in Table 5 show that if the reliability of an individual node drops to 0.95, then only small hierarchies (say, with 100 nodes) can be constructed while maintaining a high probability of the entire hierarchy being good. Please note that these results are conservative by design because they require that, in all consecutive pairs, at least one node is up, i.e. under these conditions the propagation delay is almost zero. However, this analysis does not give any realistic estimate of the actual propagation delay. In the next section we use simulation to get an estimate of the propagation delay.
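The recursion for P(L) is easy to evaluate numerically. The sketch below is one possible way to do so; the function names `p_good` and `node_count` are ours, not the paper's, and the node count assumes a complete hierarchy with the same branching factor B at every node, as the analysis does.

```python
from functools import lru_cache


def p_good(levels: int, b: int, p: float) -> float:
    """Probability that an L-level hierarchy with branching factor b is 'good'."""
    q = 1.0 - p

    @lru_cache(maxsize=None)
    def big_p(l: int) -> float:
        if l <= 1:  # boundary conditions P(0) = P(1) = 1
            return 1.0
        root_up = p * big_p(l - 1) ** b
        root_down = q * (p * big_p(l - 2) ** b) ** b
        return root_up + root_down

    return big_p(levels)


def node_count(levels: int, b: int) -> int:
    """Number of nodes in a complete hierarchy: 1 + b + ... + b^(levels-1)."""
    return (b ** levels - 1) // (b - 1)


if __name__ == "__main__":
    for b in range(2, 7):
        row = [f"{p_good(l, b, 0.99):.3f} ({node_count(l, b)})" for l in range(2, 7)]
        print(f"B={b}:", "  ".join(row))
```

For example, with two levels, B = 2 and p = 0.99 the recursion gives 0.99 + 0.01 × 0.99² = 0.999801, which appears as 1 after rounding to three decimals.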
We use the simulation experiments, in particular, to determine how long it actually takes to propagate a message generated at a leaf node through the hierarchical network to all of the other leaf nodes.

TABLE 4. Probability of a 'good' hierarchy for various levels, branching factor combinations (p = 0.99). The number of nodes is shown in parentheses.

# Levels | B = 2 | B = 3 | B = 4 | B = 5 | B = 6
2 | 1 (3) | 1 (4) | 1 (5) | 1 (6) | 0.999 (7)
3 | 0.999 (7) | 0.999 (13) | 0.998 (21) | 0.997 (31) | 0.996 (43)
4 | 0.999 (15) | 0.996 (40) | 0.992 (85) | 0.985 (156) | 0.975 (259)
5 | 0.997 (31) | 0.988 (121) | 0.967 (341) | 0.927 (781) | 0.861 (1555)
6 | 0.994 (63) | 0.965 (364) | 0.875 (1365) | 0.684 (3906) | 0.406 (9331)

TABLE 5. Probability of a 'good' hierarchy for various levels, branching factor combinations (p = 0.95). The number of nodes is shown in parentheses.

# Levels | B = 2 | B = 3 | B = 4 | B = 5 | B = 6
2 | 0.995 (3) | 0.993 (4) | 0.991 (5) | 0.989 (6) | 0.987 (7)
3 | 0.986 (7) | 0.973 (13) | 0.956 (21) | 0.936 (31) | 0.914 (43)
4 | 0.968 (15) | 0.914 (40) | 0.828 (85) | 0.712 (156) | 0.576 (259)
5 | 0.932 (31) | 0.760 (121) | 0.467 (341) | 0.182 (781) | 0.036 (1555)
6 | 0.865 (63) | 0.436 (364) | 0.047 (1365) | 0 (3906) | 0 (9331)

6.2. Results of simulation experiments

We ran extensive simulation tests to measure the average delay incurred in the HPP protocol for a message to propagate throughout the hierarchy. The results of the protocol were benchmarked against a plain hierarchical protocol (HPP0) in which no propagation takes place past any failed node. The HPP0 protocol serves as a surrogate for a conventional protocol for comparison purposes.

In the simulation we considered various hierarchies with different numbers of levels and branching factors. For simplicity, the branching factor was assumed to be constant at all nodes in the hierarchy in any single experiment. The reliability of the nodes was characterized by MTTF (mean time to failure) and MTTR (mean time to repair). The actual durations of up and down times of each node were generated from exponential distributions using these values as averages. A message was generated at a leaf node and propagated according to the HPP protocol. The delay is the amount of time it takes for all of the other nodes that are up at the leaf level to receive the message. We generated one message every simulation time unit. The normal propagation delay between nodes and the normal processing time at each node were disregarded. In comparison with the delay caused by failures, the normal propagation delay is small (less than a few seconds). Moreover, the normal propagation delay and the normal processing time are approximately the same in both methods, so neglecting them does not bias the comparison. The sample parameters are summarized in Table 6.

TABLE 6. Sample parameters for the simulation to evaluate HPP and HPP0.

Parameter | Value
Branching factor | varied from 2 to 6
Number of levels | varied from 3 to 6
MTTF/MTTR ratio | 100
MTTF | 1000 min
MTTR | 10 min
Length of simulation | 20,000 time units
Processing delay at a node | 0.1 s (neglected)
Propagation delay between two nodes | 1 s (neglected)

The results of the experiments are shown in Figures 7(a)–(d), for the cases of 3–6 levels, respectively. In each figure we plot the delay versus the branching factor for the HPP protocol and the benchmark HPP0 protocol. In these figures the values for MTTF and MTTR are 100 and 1 simulation time units, respectively, and the average was computed over 20,000 simulation time units. This means that the average probability of a node being up is approximately 0.99. The results show that the HPP protocol is considerably superior to HPP0 in all cases and that the gap between the protocols widens with more levels in the hierarchy. This is reasonable because with more levels the chances of a message getting blocked along the hierarchy are much higher in the HPP0 protocol. Assuming that a simulation time unit is 10 min, the results indicate that for the case of six levels and a branching factor of six, the average delay incurred in the propagation of a message is 8.84 min for the HPP protocol and 37.54 min for the HPP0 protocol.

It should be noted that, in the interest of simplifying the simulation, we have neglected the processing overhead that occurs in running our failure handling algorithm of Section 5. However, compared to the savings produced, this overhead is very small. Thus, we believe this simplification is reasonable. For instance, in both cases the total processing and propagation delay is less than 30 s, assuming 1 s propagation delay per link and 0.1 s processing delay at each node.
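The node availability quoted above follows from the steady-state relation MTTF / (MTTF + MTTR). The sketch below checks it both in closed form and by sampling alternating up and down periods from exponential distributions, as the simulation description suggests. It is only an illustration of that failure model, not a reconstruction of the authors' simulator; the function names and the fixed random seed are our own choices.

```python
import random


def availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time a node is up."""
    return mttf / (mttf + mttr)


def simulate_up_fraction(mttf: float, mttr: float, horizon: float, seed: int = 0) -> float:
    """Sample alternating up/down periods and measure the fraction of time up."""
    rng = random.Random(seed)
    t = 0.0
    up = 0.0
    while t < horizon:
        up_time = rng.expovariate(1.0 / mttf)   # mean length of an up period = MTTF
        up += min(up_time, horizon - t)         # do not count time past the horizon
        t += up_time
        if t >= horizon:
            break
        t += rng.expovariate(1.0 / mttr)        # mean length of a down period = MTTR
    return up / horizon


closed_form = availability(100, 1)              # MTTF = 100, MTTR = 1 time units
sampled = simulate_up_fraction(100, 1, horizon=20_000)
```

With MTTF = 100 and MTTR = 1 the closed form is 100/101, i.e. about 0.99, and a single sampled run over 20,000 time units should land close to that value.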
[Figure 7: four panels, (a) to (d), for hierarchies of 3 to 6 levels. Each panel plots the average delay against the branching factor for the HPP and HPP0 protocols.]

FIGURE 7. Delay versus branching factor.

In summary, these simulation results show that: (1) HPP improves the propagation time considerably as compared to a conventional protocol represented by HPP0; (2) even in a hierarchy with six levels and a branching factor of six (which consists of more than 9300 nodes), the average propagation delay is reasonably small.

As a final point, it should be noted that the space overhead of maintaining the additional vectors is minimal. Assuming 2 bytes per entry in the View array, 4 bytes per entry in the $V_i$, $V_i^P$ and $V_i^C$ vectors and a branching factor of six, we estimate the overhead per node to be 216 bytes. This does not include the sizes of the message queues where messages are stored prior to delivery. The queue lengths do not represent additional overhead because the messages would in any case require storage space at each node.

7. CONCLUSIONS

This paper has described a weak-consistency replica control protocol called HPP which efficiently and reliably propagates messages in wide-area networks and can be used for various applications such as updates to replicated data, software distribution, managing mailing lists and bulletin board services. The HPP protocol is based on organizing the nodes in the network into a logical hierarchy and is both scalable and efficient. It delivers messages in causal, but not total, order. Three important features of the HPP protocol are minimum redundancy, high reliability and the ability to restructure.

The unique aspect of our protocol is that each node maintains only local state information by encapsulating information about itself and its parent and child nodes into state vectors. These aggregate state vectors provide enough information for the determination of missing messages. This reduces communication traffic and the overhead of information that must be maintained at each node, and makes the scheme more scalable. It ensures reliable, eventual delivery of messages in spite of failures or partitions. Further, it minimizes redundancy by making efficient use of network bandwidth. It also allows nodes to dynamically change their position in the hierarchy to improve performance (thus keeping it balanced and relatively flattened) while ensuring that messages are not lost. The performance evaluation of the protocol shows that the propagation delay is reasonably small even for a six-level logical hierarchy.

The limitation of our protocol lies in the fact that messages can be diffused past a failed node only if its parent and all child nodes are alive. This means that 'successive' failures (i.e. where a pair consisting of a parent node and a child node are down) will cause the diffusion to stop. However, it should be noted that successive failures can be handled by keeping more information in the state vector at each node. Clearly there is a trade-off between the amount of additional information stored and the ability of the protocol to tolerate failures. We expect to study this trade-off in future research. We also plan to study the issue of designing a good logical topology for the network given a certain physical topology, a topic discussed in [9]. Another useful research direction is to investigate modifications to the protocol that would allow the use of more than one alternate path between nodes for transmission. Finally, it will also be interesting to explore whether underlying multicast mechanisms can be incorporated into our algorithms.

ACKNOWLEDGEMENTS

Noha Adly has been supported by an FCO grant from the British Council.

REFERENCES

[1] Gifford, D. K. (1979) Weighted Voting for Replicated Data. In Proc. 7th Symp. on Operating System Principles, pp. 150–162, ACM Press, Pacific Grove, CA.
[2] Cheung, S., Ammar, M. and Ahamad, M. (1990) The Grid Protocol: a High Performance Scheme for Maintaining Replicated Data. In Proc. IEEE 6th Int. Conf. on Data Engineering, pp. 438–445, IEEE Computer Society Press, Los Angeles, CA.
[3] Kumar, A. (1991) Hierarchical quorum consensus: a new algorithm for managing replicated data. IEEE Trans. Comput., 40, 996–1004.
[4] Queinnec, P. and Padiou, G. (1993) Flight plan management in a distributed air traffic control system. In Proc. Int. Symp. on Autonomous Decentralized Systems, Kawasaki, Japan, pp. 323–329.
[5] Emtage, A. and Deutsch, P. (1992) Archie: An Electronic Directory Service for the Internet. In Conf. Proc. Usenix, San Francisco, CA, pp. 93–110.
[6] Schroeder, M., Birrell, A. and Needham, R. (1984) Experience with Grapevine. ACM Trans. Comput. Syst., 2, 3–23.
[7] Lampson, B. (1986) Designing a global name service. In Proc. 5th Symp. on Principles of Distributed Computing, ACM SIGACT-SIGOPS, pp. 1–10, ACM Press, Calgary, Canada.
[8] Golding, R. (1992) Weak-consistency Group Communication and Membership. PhD thesis, University of California, Santa Cruz.
[9] Obraczka, K. (1995) Massively Replicated Services in Wide-area Internetworks. PhD thesis, University of Southern California.
[10] Demers, A., Greene, D., Hauser, C., Irish, W., Larson, J., Shenker, S., Sturgis, H., Swinehart, D. and Terry, D. (1987) Epidemic algorithms for replicated database maintenance. In Proc. 6th Symp. on Principles of Distributed Computing, ACM SIGACT-SIGOPS, pp. 1–12, ACM Press, Vancouver, Canada.
[11] Downing, A., Greenberg, I. and Peha, J. (1990) OSCAR: an Architecture for Weak Consistency Replication. In Proc. IEEE PARABASE-90, pp. 350–358, IEEE Computer Society Press, Miami Beach, FL.
[12] Ladin, R., Liskov, B. and Shrira, L. (1990) Lazy replication: exploiting the semantics of distributed services. In Proc. 9th ACM Symp. on Principles of Distributed Computing, Quebec City, CA, pp. 43–57, ACM Press.
[13] Wuu, G. and Bernstein, A. (1984) Efficient Solutions to the Replicated Log and Dictionary Problems. In Proc. 3rd ACM Symp. on Principles of Distributed Computing, pp. 233–242, ACM Press, Vancouver, Canada.
[14] Dalal, Y. and Metcalfe, R. (1978) Reverse Path Forwarding of Broadcast Packets. Commun. ACM, 21, 1040–1048.
[15] Chang, J. and Maxemchuk, N. (1984) Reliable Broadcast Protocols. ACM Trans. Comput. Syst., 2, 251–273.
[16] Agarwal, D., Moser, L., Melliar-Smith, P. and Budhia, R. K. (1995) A reliable ordered delivery protocol for interconnected local-area networks. In Proc. Int. Conf. on Network Protocols, Tokyo, Japan.
[17] Floyd, S., Jacobson, V., McCanne, S., Lui, C. G. et al. (1995) A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing. In Proc. ACM SIGCOMM '95, p. 25, ACM Press, Cambridge, MA.
[18] Melliar-Smith, P. and Moser, L. (1993) Trans: A Reliable Broadcast Protocol. In IEEE Trans. Commun., 140, 481–493.
[19] Garcia-Molina, H. and Spauster, A. (1991) Ordered and reliable multicast communication. ACM Trans. Comput. Syst., 9, 242–271.
[20] Birman, K. and Joseph, T. (1987) Reliable communication in the presence of failures. ACM Trans. Comput. Syst., 5, 47–76.
[21] Birman, K., Schiper, A. and Stephenson, P. (1991) Lightweight causal and atomic group multicast. ACM Trans. Comput. Syst., 9, 272–314.
[22] Adly, N. (1993) HARP: A Hierarchical Asynchronous Replication Protocol for Massively Replicated Data. Technical Report TR-310, Computer Laboratory, University of Cambridge, UK.
[23] Garcia-Molina, H. (1982) Elections in a distributed computing system. IEEE Trans. Comput., C-31, 48–59.

APPENDIX

DEFINITION A.1. A path from node i to node j is the set of nodes that a message traverses while propagating from node i to node j by following the hierarchical propagation protocol.

COROLLARY A.1. There is a unique path between any pair of nodes i and j.

Proof. The proof follows directly from the propagation algorithm. Node i sends a message to its correspondents, which in turn send the message (recursively) to their correspondents until it reaches j. Since the nodes are organized in a hierarchy, messages from node i to node j must take the same, unique path.

THEOREM A.1. If every node observes FIFO order while propagating messages, then causal order is achieved provided the hierarchy is not reorganized and no node failures occur.

Proof. Assume node o originates a message m1; node i receives m1 at t = t1 and then posts m2 after it sees m1. We shall show that all nodes in the network will see m2 after m1. There are two cases.

Case 1. i = o. From Corollary A.1, all other nodes (correspondents of i, their correspondents, etc.) will receive both m1 and m2 from node i by the same path. Since every node along a path processes messages in FIFO order, it follows that all nodes will see m1 before m2.

Case 2. i ≠ o. Consider S, the set of nodes along the unique path (see Corollary A.1) from node i to node o. For any node x ∈ S, at t = t1, x has already received m1, and therefore it will see m2 after m1. Any node x ∉ S will receive both m1 and m2 from a node y, where y ∈ S. From Corollary A.1, the path between x and y is unique. Since every node along a path processes messages in FIFO order, x will see m2 after m1. Therefore, every node in the network sees m2 after m1.

THEOREM A.2. Assuming no node failures occur, the causal order is preserved while reorganization takes place.

Proof. The proof of causality in Theorem A.1 was based on there being a unique path between any two nodes. The causal order could conceivably be violated due to the fact that while the change is taking place, messages sent from node v may follow different paths to reach NP or OP, and vice versa. The proof is based on showing that even if messages follow different paths: (i) nodes NP and OP will still receive messages sent by v in their correct order, and (ii) node v will still receive messages sent by NP or OP in their correct order.

Case 1. For messages received by NP and OP from v. Assume node v sends a message m1 to OP for propagation at t0, and then sends the control message special marker at t1. Subsequently, it sends a message m2 to NP at t2 (t0 < t1 < t2). NP receives m1 and m2 through different paths: m1 is received through OP, while m2 is received directly from v. Similarly, OP receives m1 directly from v while m2 is received through NP. We shall show that both NP and OP process m1 before m2. Since the special marker propagates through the network like a normal message, from Theorem A.1 both NP and OP must receive m1 before the special marker. Since NP will start processing messages received directly from v only after it receives special marker, NP will process m2 after m1. Moreover, since OP will receive m2 through NP, it follows that OP will receive m2 after m1.

Case 2. For messages received by v from NP or OP. Assume NP sends a message m1 for propagation at t0, then it receives a join req1 and sends a join ack1 at t1.
Subsequently, it sends a message m2 for propagation at t2 (t0 < t1 < t2). We shall show that although v receives the messages via different paths (i.e. m1 from OP and m2 from NP), it will still process m2 after m1. Since v receives both m1 and join ack1 from OP, from Theorem A.1 v must receive m1 before the join ack1. Since NP will send m2 to v only after NP receives join req2, and since v will send join req2 only after it receives join ack1, NP will send m2 only after v has processed m1. The same argument holds for messages sent to v by OP while the change is taking place.

THEOREM A.3. If a node v changes its parent from OP to NP, then it is guaranteed that no messages are lost as a result of the move, even if a failure occurs along the path from OP to NP while the reorganization takes place.

Proof. The proof is based on arguing that OP and NP will receive all messages v receives from below and that v will receive all messages that NP has received before the move or will receive after the move.

Node NP. Assume v sends leave req at time t1 and receives join ack2 at time t2. Every message generated by v or received from below at t1 ≤ t ≤ t2 will be buffered by v in MLog_NP. Since v will send the log to NP at time t2 and will then start sending messages directly to NP, this means that NP will receive directly from v every message received by v from below at time t ≥ t1. All messages received by v at t < t1 from below are received by OP directly from v and OP propagates them to NP by the normal propagation mechanism. If a failure occurs along the path from OP to NP while messages are propagating, the transition and diffusion phases guarantee that no messages will be missed by NP. Therefore, NP receives all messages received by v from below.

Node OP. Similarly, every message received by v from below at t < t1 will be received by OP directly from v. All messages received by v from below at t ≥ t1 will be sent to NP and NP will further propagate them to OP by the normal propagation mechanism. If a failure occurs along the path from NP to OP while messages are propagating, the transition and diffusion phases guarantee that no messages will be missed by OP. Therefore, OP receives all messages received by v from below.

Node v. Assume NP receives v’s join req1 at time t1. We need to show:

(i) v receives every message received by NP at t ≥ t1;
(ii) v receives every message received by NP at t < t1.

To prove (i), assume NP receives v’s join req2 at time t2. Since NP will buffer every message generated or received at t ≥ t1 and will send the log at t = t2 (and since it adds v as a correspondent at t = t2), it follows that v will receive every message received by NP at t ≥ t1 directly from NP.

To prove (ii), we need to show that any message that NP has received at t < t1 will also be received by v through OP. Assume that NP has received a message m at t < t1. Since NP generates the join ack1 at t = t1 and since v will still be receiving messages from OP until v receives join ack1, we need to show that OP (and consequently v) will receive m before receiving join ack1. Three cases can occur while m and join ack1 are propagating.

Case 1. While m and join ack1 are propagating, no failure or reorganization occurs anywhere along the path to OP. Then, from Theorem A.1, OP will receive m before receiving the join ack1.

Case 2. While m and join ack1 are propagating, a failure occurs in their path of propagation towards OP. If the failure occurs after m reaches OP, then whether join ack1 is sent during the transition phase or the diffusion phase, it will be received by OP after m anyway. If the failure happens before m reaches OP, there are four subcases.

Case 2.1. m and join ack1 are diffused by the transition algorithm. If OP was one of the failed node’s correspondents, then, upon sorting m and join ack1, join ack1 will always come last because it is a control message. This guarantees that OP will process m first, send it to v and then process join ack1. If OP was not one of the failed node’s correspondents, then OP will receive m and join ack1 through normal propagation. Since each correspondent of the failed node sorts messages (such that join ack1 comes last) before processing and further propagating them, and since they maintain the FIFO order in propagation, OP will receive m first and then join ack1.

Case 2.2. m and join ack1 are sent during the diffusion phase. Since the coordinator CO becomes a correspondent of the failed node’s correspondents and each node maintains FIFO order, from Theorem A.1 OP will receive m before join ack1.

Case 2.3. m is diffused by the transition algorithm while join ack1 is sent during the diffusion phase. Since the diffusion phase starts only after the transition phase is over, m will precede join ack1 in propagation.

Case 2.4. join ack1 is diffused by the transition algorithm while m is sent during the diffusion phase. We will show that this case cannot occur. Assume that the failure occurs at time tf. For this case to occur, at some time t < tf, one or more correspondents of the failed node must have received the join ack1 but not m. Again, from Theorem A.1, this is not possible at t ≤ tf, and, therefore, there is a contradiction.

Case 3. While m and join ack1 are propagating, a reorganization takes place. From Theorem A.2, since the causal order is preserved while a reorganization occurs, OP receives m before receiving join ack1.
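As an informal illustration of Theorem A.1 (not part of the protocol specification), the following Python sketch floods two causally related messages over a small hierarchy, with one FIFO queue per directed link and a randomized delivery schedule. It assumes that a node's correspondents are exactly its neighbours in the hierarchy and that no failures or reorganizations occur; the names simulate, tree, origin and replier are hypothetical and chosen only for this example.

```python
import random
from collections import deque


def simulate(tree, origin, replier, seed=0):
    """Flood m1 from `origin`; `replier` posts m2 as soon as it has seen m1.

    `tree` maps each parent to a list of its children. Returns a dict that
    maps every node to the list of messages in the order it delivered them.
    """
    rng = random.Random(seed)

    # Assumption: the correspondents of a node are its tree neighbours.
    nbrs = {}
    for parent, children in tree.items():
        nbrs.setdefault(parent, set())
        for child in children:
            nbrs[parent].add(child)
            nbrs.setdefault(child, set()).add(parent)

    # One FIFO queue per directed link (a, b).
    links = {(a, b): deque() for a in sorted(nbrs) for b in sorted(nbrs[a])}
    seen = {n: [] for n in nbrs}

    def deliver(node, msg, sender):
        seen[node].append(msg)
        # Forward to every correspondent except the one the message came from.
        for b in sorted(nbrs[node]):
            if b != sender:
                links[(node, b)].append(msg)
        # m2 is posted only after m1 has been seen, so m2 causally follows m1.
        if node == replier and msg == "m1":
            deliver(node, "m2", None)

    deliver(origin, "m1", None)
    while True:
        busy = [link for link, queue in links.items() if queue]
        if not busy:
            break
        a, b = rng.choice(busy)                  # arbitrary link schedule
        deliver(b, links[(a, b)].popleft(), a)   # FIFO within each link
    return seen
```

For example, calling simulate({0: [1, 2], 1: [3, 4], 2: [5, 6]}, origin=3, replier=5) with any seed should leave every node with the delivery order ['m1', 'm2'], because each node receives both messages over the same link and that link is FIFO.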

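The sorting step used in Case 2.1 of the proof of Theorem A.3 can be sketched as follows. The appendix states only that control messages such as join ack1 come last after sorting; the Message fields, the timestamp used to order ordinary messages, and the function name transition_order are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    name: str
    timestamp: int            # hypothetical ordering field for ordinary messages
    is_control: bool = False  # True for control messages such as join ack1


def transition_order(pending):
    """Return the pending messages with all control messages placed last.

    False sorts before True, so ordinary messages precede control messages;
    within each group the (assumed) timestamp decides the order.
    """
    return sorted(pending, key=lambda msg: (msg.is_control, msg.timestamp))
```

With this ordering a correspondent of the failed node processes m before join ack1 even if join ack1 happened to arrive first, which is the property the proof relies on.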
