Reliable probabilistic communication in large-scale information dissemination systems

A.-M. Kermarrec, L. Massoulié, A.J. Ganesh
Microsoft Research
St George House, 1 Guildhall Street
Cambridge CB2 3NH, UK
{annemk, lmassoul, ...}@microsoft.com
Phone: +44 1223 724 823   Fax: +44 1223 744 777

Technical Report 2000-105

Microsoft Research Microsoft Corporation One Microsoft Way Redmond, WA 98052

Abstract

Reliable group communication is important for large-scale distributed applications such as information dissemination systems. The challenging issue in this context remains scalability: the computation time and the amount of data dedicated to the reliability mechanism should remain manageable as the number of nodes in a system grows, and no bottleneck should emerge. Probabilistic algorithms have proven their ability to fill this gap. In this paper, we present the theoretical analysis and evaluation of a scalable reliable group communication protocol for wide-area dissemination systems. The scalability of the protocol relies on its probabilistic flavor: the protocol provides a probabilistic guarantee of delivery and can thus make do with a lightweight recovery protocol. A distributed membership service is described, and simulation results show that the protocol exhibits very stable behavior in the presence of transient and/or permanent failures.

1 Introduction

Reliable group communication protocols are essential for distributed systems and applications that need to collaborate across networks [1, 14]. Large-scale information dissemination systems are typically composed of a large number of nodes and require the broadcast of data to a group of members. Until recently, most efforts were targeted towards deterministic approaches, which have proven efficient in the context of local-area networks. Unfortunately, these protocols suffer from a severe lack of scalability: detecting and retransmitting messages becomes a burden for a centralized server as the number of participants increases. Distributing the service overcomes this drawback, but the high degree of synchronization required between distributed servers to manage reliability considerably limits the scalability of the system as well.

Recently, probabilistic multicast protocols have received increasing attention as a means of coping with the wide-area dimension of distributed applications, and of limiting the amount of data that needs to be stored for reliability management. The probabilistic nature of the delivery guarantees of these protocols is a reasonable and pragmatic concession to large-scale settings.

In this paper, we propose a probabilistic reliable dissemination protocol. The scalability of our approach relies on its probabilistic flavor combined with a gossip-based algorithm used to disseminate information. Our protocol is based on the one presented in [10], but we present a new theoretical analysis; this enables us to accurately tune parameters to achieve a given probability of success. The results of the analysis lead us to the design of a distributed membership service. The protocol provides a probabilistic guarantee of reliability. In addition, we describe a very lightweight detection and recovery protocol to cope with missing messages, which are very rare. Finally, simulation results show that the protocol exhibits very stable behavior in the presence of many simultaneous failures; this makes it suitable to support failures as well as multiple disconnections.

The remainder of this paper is organized as follows. We survey related work in Section 2 and describe the system model in Section 3. Section 4 contains a description of our protocol and of the membership service. The theoretical analysis is presented in Section 5, and Section 6 displays our simulation results. We conclude in Section 7 with suggestions for future work.

2 Reliable Multicast

Reliable multicast protocols have been widely studied at the scale of local area networks [3, 5, 12]. However, the scale of distributed systems has changed considerably with the success of the Internet. Where distributed systems used to consist of dozens of nodes connected through a local area network, they are now typically composed of tens of thousands of nodes connected through the Internet, which exhibits unstable and unpredictable behavior. These characteristics make distributed systems impossible to control in a centralized way while increasing their requirements with respect to fault tolerance.

Several reliable multicast protocols have been proposed in the context of wide-area networks but remain limited with respect to scalability. The Reliable Multicast Transport Protocol (RMTP) [13], for example, is a sender-reliable multicast transport protocol (i.e., the detection of missing messages is ensured by senders) providing sequenced, lossless delivery of data from one sender to a group of receivers. Retransmission is handled on a per-packet basis, and the scalability of RMTP relies on a hierarchical, distributed management of acknowledgements. However, sender-reliable protocols generate a lot of acknowledgement traffic, which intrinsically limits their scalability. RMTP does not deal with crash failures or network partitions. Moreover, this protocol is implemented at the transport level, which makes it efficient but leaves no opportunity to use the membership information at the middleware or application level, an important requirement for the applications we consider.

Many reliable multicast protocols rely on loggers to provide stable storage and handle the retransmission of missing messages; one such protocol is LBRM [11]. However, centralizing the charge of storing and retransmitting information on one or several loggers intrinsically limits scalability: the amount of information to be stored grows with the number of nodes, and loggers could quickly become overloaded. Another class of reliable protocols relies on peer-based schemes to ensure the detection and recovery of missing messages, such as Scalable Reliable Multicast (SRM) [8]. In such protocols, each node takes its share of the global load and is responsible for retransmitting missing messages to its peer members.

As mentioned earlier, probabilistic protocols have recently emerged and appear to provide the flexibility required to cope with the scalability requirements of Internet settings. A recent and innovative protocol called pbcast [2] combines the peer-based approach, probabilistic guarantees and gossiping algorithms to provide scalable and reliable multicast. Messages are first broadcast using either IP multicast or a randomly generated multicast tree if IP multicast is not available. In addition, each node periodically chooses other members at random to gossip to and sends them a digest of the most recent messages. Upon receipt of these digests, receivers check for missing messages and, if needed, solicit retransmission. Several optimizations enable this protocol to behave stably in the face of perturbations in the network. A hybrid of pbcast and LBRM called Reliable Probabilistic Multicast (rpbcast) [15] proposes a three-phase probabilistic algorithm: the first phase uses unreliable IP multicast; during the second phase, an a la pbcast gossiping step is initiated; and, if it fails, a third, deterministic phase using loggers is initiated.

These protocols use an unreliable multicast protocol first and then initiate a detection-recovery phase. The protocol presented in this paper uses a gossiping algorithm to ensure epidemic-style dissemination of information, with adequate redundancy to provide a very high probability that each member of the group is reached by any message in the first place. Consequently, missing messages are very rare.
The detection-recovery mechanism is thus rarely needed and can be very lightweight. Our approach is centered on the theoretical analysis, which gives us the parameters needed to design the distributed membership service efficiently.

3 Design Guidelines

In this section, we describe our system model, our reliability and failure assumptions, and the features which have guided the design of our protocol.

3.1 System Model

We consider a system composed of N nodes. Each node is either a standard member, i.e., a receiver or sender, or a server. A distributed network of servers is in charge of the membership service and provides standard nodes with access points to the group. Every node participates in the dissemination of messages. Many dissemination systems, such as publish-subscribe systems [4], rely on a distributed set of servers. This service does not represent a bottleneck here since it manages only the membership information, the dissemination of information being distributed among all the nodes.

Standard nodes. A node is a member of the group and is potentially a sender and/or a receiver. Its access point to the group for registration and information dissemination is a single server; for example, servers can be associated with nodes according to a geographical criterion. Three types of messages can be sent by a node to its server: (i) a join message to register as a member of the group, (ii) a leave message to leave the group, and (iii) an information message to disseminate data among the group's members. Besides, each node participates in the dissemination of information messages sent by other nodes. It acts as a gossiper to k other nodes randomly chosen from its local list, which is provided by a server upon a join operation. k is called the fanout in the remainder of the paper.

Servers. Each server acts as an access point for a number of nodes and manages a part of the membership service. Servers are also in charge of initiating message disseminations. Servers exchange their membership information with each other in a lazy manner in order to maintain an approximate global view of the membership. Servers are responsible for (i) upon receipt of a join message, integrating a new member in the group, i.e., providing the new member with a randomly generated local list of k nodes and integrating it in the local lists of k other members, and (ii) initiating the gossiping algorithm upon receipt of a leave or information message.
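To make the random target selection concrete, here is a minimal sketch in Java (the language of our simulator) of the uniform choice performed both by servers (first round) and by standard nodes; the class and method names are ours and purely illustrative:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: k gossip targets are drawn uniformly at random from the local
// list that a server supplied at join time (k is the fanout).
class GossipTargets {
    static List<Integer> chooseTargets(List<Integer> localList, int k) {
        List<Integer> shuffled = new ArrayList<>(localList);
        Collections.shuffle(shuffled);  // uniform random order
        return shuffled.subList(0, Math.min(k, shuffled.size()));
    }
}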

3.2 Failure assumptions

The goal of our protocol is to ensure that an information message sent by a member of the group reaches all the members despite transient and permanent failures of nodes and/or links in the network. A transient failure refers to the temporary inability of a node to receive a message, such as a buffer overflow, or of the network to deliver a message [2]. Permanent failures refer to node crashes or network partitioning.

In our protocol, messages are ordered on a per-sender basis. Each sender timestamps its messages, and the protocol delivers a message at a node only if all earlier messages emanating from the same sender have been received and delivered. In this way, messages are uniquely identified, and each message is delivered exactly once even upon multiple receipts. This preliminary version of our system does not support the persistence of messages: when a node has experienced a crash failure or a long disconnection, our protocol does not ensure that messages sent in the meantime will be delivered upon reconnection. We support the failure of one server at a time and ensure the persistence of the membership information.

Our protocol provides a probabilistic guarantee: the parameters of the protocol are set such that there is a very high probability that each message reaches all members in a fixed and known number of rounds. We also provide a lightweight detection-recovery protocol which enables nodes to recover messages missed because of transient failures.

3.3 Design guidelines

Our protocol is based on the probabilistic broadcast protocol of [10], which is used in pbcast [2] for the detection/recovery phase. However, our theoretical analysis is different, and so is the use we make of the algorithm.


As previously mentioned, most reliable multicast protocols rely on a first phase of unreliable dissemination of messages followed by a detection-recovery phase. This phase is always necessary and may be expensive, depending on the number of missing messages, which potentially increases with the size of the system. Our approach is rather different. Parameters are set, according to our theoretical analysis, to ensure a very high probability of delivery in the first place, and can easily be tuned to guarantee different levels of reliability. Consequently, the detection-recovery phase is rarely required and can be implemented by a very lightweight protocol, which could be switched on or off depending on the requirements of the application. The scalability of our protocol relies on several features:

- Epidemic dissemination: our protocol is based on a gossiping algorithm, which provides eventual convergence and has been used for various purposes [6, 9, 10, 16]. Epidemic algorithms are scalable by nature since the load is distributed among a number of nodes: the effort required from each node increases only slightly even as the number of nodes increases dramatically. Moreover, although scalability and redundancy are often contradictory, mainly due to the synchronization issue in deterministic approaches, redundancy here serves both scalability and reliability through the use of the gossiping algorithm.

- Probabilistic management: the theoretical analysis enables us to accurately predict the behavior of the system and guides the tuning of parameters. Moreover, since target nodes for gossiping are randomly chosen, the management of data is very lightweight and does not require any heavyweight synchronization to ensure reliability. Most of the information in the system is randomly and dynamically generated without requiring much storage.

- Stable behavior: one of the most interesting properties of our protocol is its stability with respect to failures. The protocol exhibits a high degree of fault tolerance and provides an adequate level of support for mobile users as well.

- Distributed membership service: our system model is scalable since only servers need to hold approximate global membership information (no strict synchronization is required). Other nodes only require a partial knowledge of the system, delivered by the distributed servers.

4 Probabilistic reliable dissemination of events

4.1 Distributed membership service

Each server of the service acts as an access point for a set of standard nodes. Upon receipt of a join message, a server adds the node identifier to its list of members and initiates a two-phase commit algorithm with another server (randomly chosen among the servers) to ensure that the identifier of the node is stored on at least two servers and that the join operation has been safely taken into account. This makes it possible to tolerate one server failure at a time; obviously, tolerating more failures means increasing the degree of redundancy using the same mechanism. We thus ensure the persistence of the membership information.

The server then has to integrate the new member in the graph of connections. This requires two steps: (i) providing the new member with the partial knowledge of the system required to participate in the gossiping algorithm afterwards, and (ii) disseminating the new member's identity among the other nodes. The server randomly generates from its own list a sublist of at least k nodes; the bound depends on the expected number of subsequent leave messages. An acknowledgment is then sent to the joining node with the sublist. This list provides nodes with a partial knowledge of the system. The main result of the analysis presented in the next section is to accurately define the fanout (k) required for a given level of reliability. In addition, the server randomly generates a second list of k nodes to which the identifier of the new member is sent. This enables the new member to be integrated in k other lists and, consequently, in future graphs of connections.

After having served a given number of join requests, or periodically, a server disseminates the list of new members (since the last synchronization) to the other servers. This weak synchronization is required in order not to partition the network into server areas, which would produce s graphs disconnected from each other (s being the number of servers).

A node identifier is thus potentially in any local list. Consequently, a leave message needs to be disseminated in the whole system as well. Upon receipt of a leave message, a server initiates a gossip and removes the node from its list, and all other nodes do the same, if relevant, when they get the gossip.
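The join handling described above can be sketched as follows; this is our own illustrative code, not the authors' implementation, with the two-phase commit and the network sends abstracted behind hypothetical replicate() and sendRefresh() calls:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a server's join handling: store the identifier on a second
// server, hand the joiner a random sublist, and push its identifier into
// k other local lists so that it appears in future connection graphs.
class MembershipServer {
    final Set<Integer> members = new HashSet<>();
    final int k; // fanout

    MembershipServer(int k) { this.k = k; }

    List<Integer> onJoin(int newNode) {
        replicate(newNode);                                // two-phase commit with a random peer server
        members.add(newNode);
        List<Integer> sublist = randomSublist(newNode, k); // local list returned with the acknowledgment
        for (int target : randomSublist(newNode, k)) {
            sendRefresh(target, newNode);                  // integrate the joiner into k other local lists
        }
        return sublist;
    }

    List<Integer> randomSublist(int exclude, int size) {
        List<Integer> pool = new ArrayList<>(members);
        pool.remove(Integer.valueOf(exclude));
        Collections.shuffle(pool);
        return new ArrayList<>(pool.subList(0, Math.min(size, pool.size())));
    }

    void replicate(int node)               { /* two-phase commit with a randomly chosen server */ }
    void sendRefresh(int target, int node) { /* network send, omitted */ }
}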

4.2 Gossiping protocol

The basic gossiping protocol is simple and works as follows. Upon receipt of an information message, a server initiates the gossiping algorithm. A list of k targets is randomly generated (a new one is generated for each message to avoid determinism) and the server gossips to the k chosen nodes. Upon receipt of a gossip message, a node gossips in turn to k nodes randomly chosen from its local list (if the size of the list is greater than k). Each node gossips only once for a given message. For each information message, the server initiating the gossip sets the number of rounds (nbRounds), which is decremented at each gossip round; the gossip stops when the initial number of rounds has been processed.

The pseudo-code for the gossiping algorithm is presented in Algorithm 1, using the standard epidemic vocabulary: a node is susceptible when it has not yet received the current message; it is infected between the time it receives the message and the time it gossips it; and it is dead when it has finished gossiping the current message and takes no further action.

Each node manages three lists. The waiting list contains received messages that have not been delivered yet because a missing message from the same sender has been detected. The missing list is the list of detected missing messages; it is checked and updated each time a new message is received. The history list is composed of the h most recent messages; limiting the size to h avoids unreasonable growth. The history list is sent with each gossip and is likewise updated upon receipt of a new message. The missing list is used by a node to ensure its own recovery, whereas the history list is used to help other nodes recover. With each information message, a node receives the history list of its predecessors; this list is used by the node to detect and recover from any missing messages.

The example depicted in Figure 1 shows a 20-node configuration with two servers, Node 0 and Node 1. The fanout is 4. In four rounds the message is propagated, and we can easily see in this figure that a failure of Node 5, or of the links to Node 5, for example, would have had no effect on the dissemination of the message.
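The interplay of these lists for per-sender ordering can be pictured as below. This is a sketch under our own naming, not the authors' code; it assumes a message is identified by its sender and timestamp and that timestamps are consecutive per sender, with the history bounded at h entries as described:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of per-sender FIFO delivery: a message is delivered only once all
// earlier messages from the same sender have been delivered; later messages
// sit in the waiting list, and the gap constitutes the missing messages.
class DeliveryState {
    final Map<Integer, Integer> nextExpected = new HashMap<>();             // sender -> next timestamp
    final Map<Integer, Map<Integer, String>> waitingList = new HashMap<>(); // blocked messages per sender
    final Deque<String> historyList = new ArrayDeque<>();                   // h most recent message ids
    final int h;

    DeliveryState(int h) { this.h = h; }

    void receive(int sender, int timestamp, String payload) {
        int expected = nextExpected.getOrDefault(sender, 0);
        if (timestamp < expected) return;                                   // duplicate: deliver only once
        waitingList.computeIfAbsent(sender, s -> new HashMap<>()).put(timestamp, payload);
        Map<Integer, String> blocked = waitingList.get(sender);
        while (blocked.containsKey(expected)) {                             // deliver in per-sender order
            deliver(blocked.remove(expected));
            remember(sender + ":" + expected);
            expected++;
        }
        nextExpected.put(sender, expected);                                 // any gap above is missing
    }

    void remember(String id) {
        historyList.addLast(id);
        if (historyList.size() > h) historyList.removeFirst();              // bound the history at h
    }

    void deliver(String payload) { /* hand the message to the application */ }
}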

Algorithm 1: Probabilistic gossiping algorithm on each node

Receive gossip(sender, message, nbRounds, senderHistoryList);
if (state == susceptible) then    {this state is associated with the current gossip}
    state = infected;
    if noMissingMessage(message.sender, message.timestamp) then    {check for a missing message from the same sender}
        deliver(message);
    else
        store(message, waitingList);    {for future delivery}
    end if
    update(historyList);    {add the new message to the history and remove the oldest one}
    update(waitingList);    {deliver previously blocked messages from the same sender if relevant}
    if (nbRounds != 0) then
        for (i = 0; i < k; i++) do
            target = randomChoice(localList);    {targets are uniformly randomly chosen from the local list}
            send to target gossip(myself, message, nbRounds - 1, historyList);
        end for
    end if
    state = dead;    {the node takes no further action for this message}
end if

4.3 Detection and recovery

As we will demonstrate next, the parameters of our system can be tuned so that it is very unlikely that even one member fails to receive a message. However, when such a situation arises, it means either that one node is disconnected (isolated) from the graph or that one or more failures have occurred.

[Figure 1: Example of a message dissemination in a 20-node configuration. Four panels show Rounds 0 to 3 of the gossip from the sender; each node is marked susceptible, infected, or dead.]


4.3.1 Recovery from graph isolation

A node becomes isolated from the connection graph when its identifier is present in no local list but that of its server. Such a node has a substantial probability of being isolated during the next dissemination as well. The fact that servers randomly generate the first round of targets limits the probability that a node stays disconnected forever but does not provide any guarantee of quick reconnection. To overcome this problem, we propose two different solutions: refreshment of lists by servers and a periodic check by isolated nodes. These two mechanisms can be run in parallel, or only one can be chosen depending on the application requirements.

List refreshment. Each server is in charge of providing every new member with a local list, which is used afterwards to generate its gossip targets. If a node, due to the randomized and thus unpredictable generation of local lists, is present in no local list, it can be disconnected from the graph for several message disseminations; its only chance to be reconnected is to be chosen by a server, and this probability decreases with the number of nodes. To cope with this rare situation, servers periodically refresh local lists with a random portion of their own list. List refreshment also occurs on k randomly chosen nodes each time a new member joins the group. Upon receipt of a list refreshment, nodes update their local lists; if a list needs to be pruned, the least recently added nodes are removed, which can be necessary to keep local lists small. How often a list refreshment should happen is still under study. A sketch of the node-side update is given below.

Periodic check. In addition to list refreshment, a node which has not received messages for a given period (the period depends on the average delay between messages and is much larger than this delay) checks with its server whether no messages have been sent or whether it has become disconnected. Upon receipt of such a message, and after checking, the server refreshes local lists with this very node identifier if the node turns out to be disconnected. This operation prevents a node from being disconnected forever but, since our system does not so far provide for persistence of messages, the node only gets the most recent messages (the ones present in the history lists) when it reconnects. Thanks to the probabilistic model and the list refreshment mechanism, a periodic check is very seldom triggered if activated.
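A refresh can be applied on the node side as sketched here (the names are ours; insertion order stands in for the age of an entry, so the least recently added identifiers are pruned first):

import java.util.Iterator;
import java.util.LinkedHashSet;

// Sketch of local-list refreshment: fresh identifiers are appended and, if
// the list exceeds its bound, the least recently added entries are removed.
class LocalList {
    final LinkedHashSet<Integer> nodes = new LinkedHashSet<>(); // keeps insertion order
    final int maxSize;

    LocalList(int maxSize) { this.maxSize = maxSize; }

    void refresh(Iterable<Integer> freshIds) {
        for (int id : freshIds) {
            nodes.remove(id);     // re-adding moves the identifier to the most recent end
            nodes.add(id);
        }
        Iterator<Integer> oldestFirst = nodes.iterator();
        while (nodes.size() > maxSize && oldestFirst.hasNext()) {
            oldestFirst.next();
            oldestFirst.remove(); // prune the least recently added identifiers
        }
    }
}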

4.3.2 Recovery from transient failures

We provide a lightweight message detection and recovery protocol for transient failures, similar to the one in [2]. To this end, each node keeps track of the recent history of received messages (the ids are stored in the history list) and temporarily stores the messages themselves as well. Periodically, each node checks whether its list of non-received messages is empty. If not, it gossips a recovery request to a target chosen from its local list for the messages whose age exceeds at least twice the average time needed to propagate a message by gossiping; this avoids requesting messages that may still be in transit. In practice, missing messages are very rare in our protocol, and we definitely do not want to solicit retransmission of messages which will eventually arrive. The pseudo-code for this simple algorithm is depicted in Algorithm 2. The nbRecovery parameter depends on the number of failures we are willing to tolerate and is in any case limited by the number of nodes in the local list.

Algorithm 2: Detection of failures and recovery

{detect missing messages from previous gossips using the history}
if failureDetected() then
    for (i = 0; i < nbRecovery; i++) do
        target = randomChoice(localList);
        send to target recovery(missingMsg, myself);
    end for
end if

5 Analysis of gossiping performance

In this section we establish theoretical results on the performance of the gossiping algorithm, in terms of the following key parameters: fanout, failure rate (both link and node), and system size. The main question is whether the gossiping will succeed with high probability, for a given set of parameters. Birman et al. [2] write down the transition matrix of the Markov chain describing the number of nodes which have received the message after a given number of rounds of the protocol. However, this chain cannot be solved analytically, and they resort to numerical methods. In contrast, we use a relation with random graphs to obtain analytical results that hold as the number of nodes grows large.

The random directed graph $\mathcal{G}(n, p)$ is defined as follows. It has vertex set $V$ of cardinality $n$. For every ordered pair of vertices $(x, y)$, the arc $(x, y)$ (directed from $x$ to $y$) is present with probability $p$, independently of every other arc. Our results will be developed for this graph model, which corresponds to a gossiping protocol where each person who hears some news repeats it to each other person independently with probability $p$; equivalently, the number of people to whom he relays the news has the binomial distribution $B(n-1, p)$. An alternative model is one where each individual gossips to a fixed number of people, $k$, but these individuals are chosen at random. This corresponds to a random graph model, which we denote $\mathcal{G}_{n,k}$, where each vertex has outgoing arcs to $k$ vertices chosen uniformly at random from the set of $n-1$ remaining vertices.

The success of the gossip protocol corresponds to the existence of a directed path from a specified source vertex $s$ to every other vertex in the random graph. The connectivity of undirected graphs was studied in a classical paper of Erdős and Rényi [7]. They consider the graph on $n$ vertices where the edge between each (unordered) pair of nodes is present with probability $p_n$, independently of other edges. They show that if $p_n = (\log n + c + o(1))/n$, then the probability that the graph is connected goes to $e^{-e^{-c}}$. The random graph model corresponding to the gossiping protocol does not appear to have been studied in the literature.

Consider the random graph $\mathcal{G}(n, p_n)$ with a specified source vertex $s$. We denote by $\pi(p_n, n)$ the probability that there is a directed path from $s$ to every other vertex of $\mathcal{G}(n, p_n)$. Likewise, given a subgraph of $\mathcal{G}(n, p_n)$ with $j$ vertices including $s$, we denote by $\pi(p_n, j)$ the probability that each of these vertices is reachable from $s$ along the edges of the subgraph. Our first result concerns the error-free case:

Theorem 1 Consider the sequence of random graphs $\mathcal{G}(n, p_n)$ with $p_n = [\log n + c + o(1)]/n$, where $c$ is a constant. We have
$$\lim_{n \to \infty} \pi(p_n, n) = e^{-e^{-c}}.$$

The theorem states that for the gossip to succeed, it is both necessary and sufficient that each person gossips to about $\log n$ people.

Remark 2 The result of Theorem 1 holds if the number of nodes each node gossips to is a constant. To be precise, if each node gossips to exactly $k_n = \log n + c + o(1)$ other nodes, then the probability that everyone gets the message goes to $e^{-e^{-c}}$.

The proof is given in the Appendix. It relies on the identity
$$\pi(p_n, n) = 1 - \sum_{r=1}^{n-1} \binom{n-1}{r-1} (1-p_n)^{r(n-r)}\, \pi(p_n, r) = 1 - \sum_{r=1}^{n-1} \Big(1 - \frac{r}{n}\Big) \binom{n}{r} (1-p_n)^{r(n-r)}\, \pi(p_n, n-r). \qquad (1)$$

To see this, we note that reaching all $n$ vertices is the complement of reaching exactly $r$ vertices, for some $r$ between 1 and $n-1$. For each fixed $r$, the first term in the sum on the right is the number of ways of choosing $r-1$ vertices other than the source, the second is the probability that there is no arc from any of these $r$ vertices to any of the remaining $n-r$, and the third term is the probability that all $r$ vertices are reached from the source, conditional on there being no edges from any of them to the outside. The second equality is obtained by simple manipulation.

We briefly sketch the intuition behind the rest of the proof. We can show that, for $p_n$ having the form in the statement of the theorem,
$$\lim_{n \to \infty} \binom{n}{r} (1-p_n)^{r(n-r)} = \frac{e^{-cr}}{r!}$$
for each fixed $r$. Assuming that $\pi = \lim_{n\to\infty} \pi(p_n, n)$ exists and is the same as $\lim_{n\to\infty} \pi(p_n, n-r)$ for fixed $r$, and that sums and limits are interchangeable in (1), we expect to have
$$\pi = 1 - \sum_{r=1}^{\infty} \frac{e^{-cr}}{r!}\, \pi.$$
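Spelling out the simplification (this expansion of the step is ours, using the exponential series):
$$\sum_{r=1}^{\infty} \frac{e^{-cr}}{r!} = e^{e^{-c}} - 1, \qquad \pi = 1 - \big(e^{e^{-c}} - 1\big)\pi \;\Longrightarrow\; \pi\, e^{e^{-c}} = 1 \;\Longrightarrow\; \pi = e^{-e^{-c}}.$$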

This gives $\pi = e^{-e^{-c}}$. The arguments for the random graph model $\mathcal{G}_{n,k_n}$ are very much the same, but start from a suitable modification of (1).

The meaning of these results is that there is a sharp threshold in the required fanout at $\log n$. If the fanout exceeds this threshold by a safety margin $c$, then the limiting probability that at least one person fails to get the message is $1 - e^{-e^{-c}}$. The table below shows how quickly this probability decreases to zero as a function of the safety margin $c$.

c        -2     0      2      4      6      8     10    12
1 − π    0.999  0.632  0.127  0.018  0.002  3E-4  5E-5  6E-6

Table 1: Dependency on c of the probability of failure of the gossip

We now consider the impact of link and node failures on the success of the gossiping algorithm. Suppose links fail independently of each other, each with probability $\varepsilon$. In the random graph model $\mathcal{G}(n, p_n)$, this simply corresponds to replacing $p_n$ by $(1-\varepsilon)p_n$. Therefore, if we take $p_n = [\log n + c + o(1)]/((1-\varepsilon)n)$, then the probability of success is asymptotically $e^{-e^{-c}}$. We believe an analogous result holds for the model $\mathcal{G}_{n,k_n}$; that is to say, a fanout of $k_n = [\log n + c + o(1)]/(1-\varepsilon)$ would give a limiting probability of success of $e^{-e^{-c}}$. The advantage of the $\mathcal{G}(n, p_n)$ model is that independent failures leave us within the same model with modified parameters, which is not true of the $\mathcal{G}_{n,k_n}$ model.

Suppose next that nodes other than the source fail, independently of the arcs present. The question is then whether the message reaches every node that has not failed. Let us condition on $n'$, the number of nodes that have not failed. By the independence assumption, the random graph model for this situation is $\mathcal{G}(n', p_n)$. If $p_n = [\log n' + c + o(1)]/n'$, which corresponds to a fanout of
$$k = (n/n')[\log n' + c + o(1)], \qquad (2)$$

[Figure 2: Fanout required versus probability of node failure, for n = 1000, 10000, 100000.]

Fanout              4     6     8     10
σ for 500 nodes     3.01  0.94  0.33  --
σ for 1000 nodes    4.86  1.48  0.26  0
σ for 10000 nodes   14    4.89  1.61  0.17

Table 2: Standard deviation of the number of dead nodes in non-converging simulations

then the gossip succeeds with limiting probability $e^{-e^{-c}}$. Let $\varepsilon$ be the failure probability of each individual node. Then $n'$ is binomial with parameters $n$ and $1-\varepsilon$, and concentrates sharply around $(1-\varepsilon)n$. Figure 2 shows the mean fanout required to achieve a success probability greater than 99.9%, as a function of the node failure probability $\varepsilon$. From top to bottom, the three plots correspond to $n = 100000, 10000, 1000$ respectively.
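Equation (2) can be turned directly into a parameter-tuning rule. The Java helper below is ours, not part of the protocol: it inverts the target success probability $e^{-e^{-c}}$ to obtain the safety margin $c$ and then applies Equation (2) with the expected number of surviving nodes:

// Hypothetical helper (ours): fanout suggested by Equation (2) for a target
// success probability when a fraction eps of the nodes is expected to fail.
static int requiredFanout(int n, double eps, double targetProb) {
    double c = -Math.log(-Math.log(targetProb)); // invert targetProb = exp(-exp(-c))
    double nPrime = (1.0 - eps) * n;             // expected number of surviving nodes
    return (int) Math.ceil((n / nPrime) * (Math.log(nPrime) + c)); // Equation (2)
}

For instance, n = 10000, eps = 0.1 and a 99.9% target give c ≈ 6.9 and a fanout of 18, of the same order as the values plotted in Figure 2.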

6 Simulation results

We evaluated the behavior of the protocol through simulations. Simulations are complementary to the analysis and provide additional information about the average number of rounds, the stability of the system in the presence of failures, and its general behavior. The simulator, written in Java, models a system of N nodes. Each node is implemented as a Java object, with an operational state (faulty, susceptible, ...), a sublist of k nodes representing the local list provided by the servers, and in and out message queues. One gossip is initiated by node 0, representing the server receiving an information message from a member of the group. The aim of this preliminary simulator was to evaluate the efficiency of the gossiping algorithm with regard to the dissemination of one message. We evaluated the fraction of converging simulations (meaning simulations in which every node of the group receives the message), the impact of node failures, and the mean number of rounds required for a gossip message to reach all nodes (in the case of converging simulations). Node failures were simulated by changing the state of randomly chosen nodes to faulty: when a faulty node receives a message, it takes no further action.

The simulation results presented in this paper are based on 1000 simulations for each configuration. The standard deviations obtained, some of which are shown in Table 2 (note that all simulations converge in 500-node configurations with a fanout of 10), indicate a high degree of confidence in the results. For instance, in the 1000-node configuration with a fanout of 8, the spread around the average number of nodes failing to get the message is 0.26 in the 219 simulations which have not converged; an average of 99.88% of nodes do receive the message in these non-converging simulations.

We first present the impact of fanout on the number of converging simulations in Section 6.1. Section 6.2 is devoted to the impact of failures, and we conclude this section with the impact of fanout and failures on the average number of rounds.
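To make the setup concrete, here is a minimal Monte Carlo sketch of such a simulator, written by us under the assumptions above (node 0 initiates the gossip, each node gossips once to k targets, faulty nodes receive but never forward; local lists are abstracted by choosing targets uniformly among all nodes, and the round limit is omitted):

import java.util.ArrayDeque;
import java.util.Random;

// Minimal sketch of the gossip simulator: one message is disseminated from
// node 0; every infected node gossips once to k uniformly chosen targets;
// faulty nodes take no further action upon receipt of a message.
class GossipSim {
    static final Random rnd = new Random();

    // Returns the number of non-faulty nodes reached by the message.
    static int disseminate(int n, int k, int failures) {
        boolean[] faulty = new boolean[n];
        for (int f = 0; f < failures; ) {
            int x = 1 + rnd.nextInt(n - 1);      // node 0 (the server) never fails
            if (!faulty[x]) { faulty[x] = true; f++; }
        }
        boolean[] reached = new boolean[n];
        ArrayDeque<Integer> infected = new ArrayDeque<>();
        reached[0] = true;
        infected.add(0);
        while (!infected.isEmpty()) {
            int node = infected.poll();
            if (faulty[node]) continue;          // faulty nodes do not gossip
            for (int i = 0; i < k; i++) {
                int target = rnd.nextInt(n);     // uniform target choice
                if (!reached[target]) {
                    reached[target] = true;
                    infected.add(target);        // each node gossips only once
                }
            }
        }
        int count = 0;
        for (int i = 0; i < n; i++) if (reached[i] && !faulty[i]) count++;
        return count;
    }

    public static void main(String[] args) {
        int n = 5000, k = 11, failures = 256, runs = 1000, converged = 0;
        for (int i = 0; i < runs; i++)
            if (disseminate(n, k, failures) == n - failures) converged++;
        System.out.println("converging fraction: " + (double) converged / runs);
    }
}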

6.1 Impact of fanout

In each plot, and for each configuration, two bars are presented. The dark bar reflects the proportion of converging simulations; the light bar represents the proportion of nodes that have received the message in the non-converging simulations. We ran simulations for configurations varying from 10 to 20000 nodes. For obvious space reasons we do not present all the results, but the examples depicted below are representative of a homogeneous behavior of the system. Figures 3, 4 and 5 depict the results for configurations varying from 100 to 15000 nodes.

Our results show three different stages in the behavior of the system. In the first stage, the fanout is too small and very few nodes get the message. In the second stage, nearly all, but not all, nodes are reached, i.e., the resulting connection graph has left a few nodes out; this corresponds to the case in which the union of the local lists provided by the servers does not contain all nodes, and illustrates the node isolation problem. The third stage depicts the situation we are interested in: most simulations converge, meaning that all nodes receive the message. It is interesting to note that even in the few simulations that do not converge, almost all nodes are reached, and only a few nodes fall victim to graph isolation. This means that the fraction of nodes which do not receive the message in a failure-free execution is very low. Figure 6 summarizes the impact of fanout on the proportion of converging simulations. Note that the unexpected results displayed for 500 nodes with fanout 9, and for 15000 nodes with fanout 10, are due to side effects of the simulations.

The analysis of the previous section suggests that a proportion $e^{-e^{-c}}$ of the simulations should converge, where $c = k - \log n$, $k$ being the fanout and $n$ the number of nodes. According to this formula, for $n = 100$, fanout values of $k = 4$ and $8$ should give 16% and 96.7% of converging simulations. This closely matches the simulation results displayed here, and the match improves as $n$ increases: for $n = 15000$, fanout values of $k = 9$ and $14$ should give 15.7% and 98.8% of converging simulations.

6.2 Reliability

We have shown that our protocol exhibits favorable behavior during failure-free executions. However, our purpose here is to provide reliability in the presence of node and link failures. Figures 7 and 8 (resp. Figures 9 and 10) display the impact of failures on 5000-node (resp. 10000-node) configurations with different fanouts. The results demonstrate a high degree of immunity to failures. We do observe that the behavior of the system starts to change when more than half the nodes are faulty. However, thanks to the property observed during failure-free executions, even when the fraction of converging simulations falls drastically, the fraction of nodes reached is still very high, so the recovery mechanism is very seldom required.

The fraction of converging simulations is given by $e^{-e^{-c}}$, where Equation (2) gives $c = n'k/n - \log n'$, with $k$ the fanout, $n$ the number of nodes, and $n'$ the number of nodes that have not failed. This formula yields, for $n = 5000$ and $k = 11$, that the fractions of converging simulations corresponding to 32, 256 and 1024 failed nodes should be 91.5%, 87% and 53.2%, which matches well the simulation results of Figure 7.

This stability in the face of a large number of failures is very interesting since it implies that the protocol provides good support for mobile nodes, which may disconnect, voluntarily or not, for non-negligible periods.


[Figure 3: Results in failure-free executions for 100- and 500-node configurations. For each fanout, the bars show the proportion of converging simulations and the proportion of dead nodes in non-converging simulations.]

[Figure 4: Results in failure-free executions for 1000- and 5000-node configurations, in the same format as Figure 3.]

[Figure 5: Results in failure-free executions for 10000- and 15000-node configurations, in the same format as Figure 3.]


[Figure 6: Converging simulations versus fanout, for configurations from 100 to 20000 nodes.]

[Figure 7: Stability in the presence of failures in 5000-node configurations (fanouts 11 and 12). For each number of failures, the bars show the proportion of converging simulations and the proportion of dead nodes in non-converging simulations.]

[Figure 8: Stability in the presence of failures in 5000-node configurations (fanouts 14 and 15).]


[Figure 9: Stability in the presence of failures in 10000-node configurations (fanouts 12 and 13).]

[Figure 10: Stability in the presence of failures in 10000-node configurations (fanouts 14 and 15).]


6.3 Number of rounds

Figure 11 depicts the impact of the fanout and of the number of failures on the average number of rounds needed to achieve converging simulations. Not surprisingly, the results show that in failure-free executions, as the fanout increases, the average number of rounds decreases and the number of converging simulations increases. This provides both increasing reliability and a smaller delay for a given node to get a message. The second plot shows that the number of failures has no impact until quite a large number of failures occur in the system; as the number of converging simulations decreases, the average number of rounds increases slightly.

Simulation results are complementary to, and corroborate, the theoretical analysis, which is of an asymptotic character. The simulations confirm the resilience of the system to failures. One obvious limitation of this algorithm is the amount of network traffic generated: providing reliability has a cost, and a tradeoff is always necessary. In this protocol, we provide scalability in terms of load balancing and reliability management, and a very stable behavior, at the cost of an increase in network traffic.

7 Conclusion

In this paper we have described a reliable probabilistic protocol suitable for large-scale information dissemination systems and provided theoretical guarantees on its performance. The randomized generation of gossip targets obviates the need to store information for reliability management. The protocol is based on an algorithm similar to pbcast [10], but presents different characteristics in its use and analysis. The contribution of this paper is, first, a new theoretical analysis which specifies analytically the fanout required to achieve a given probability of success: to achieve a probability of delivery of $e^{-e^{-c}}$, the fanout has to be set to $k_n = \log n + c$ in an $n$-node system. Second, we use the gossiping algorithm to ensure reliable dissemination, instead of using it just for detection and recovery. The membership service is distributed and relies on a set of servers; such an infrastructure is often used in information dissemination systems. Nodes only need to obtain a partial knowledge of the system from the servers, and this makes the protocol scalable. A list refreshment mechanism ensures that no node is isolated for long. Simulation results show that the system's performance degrades very little even when a large number of nodes fail. These encouraging results demonstrate that probabilistic epidemic algorithms are useful in distributed systems where highly reliable dissemination of information is required: event notification, eventual consistency, etc.

We are working on a number of open problems. The distributed membership service still needs to be evaluated, especially the list refreshment mechanism; this is under investigation both from a theoretical point of view and from the design and implementation angle. Dynamic adaptation of the parameters of the protocol, such as the number of rounds and the fanout, to changes in system size is also under study. Taking the locality of nodes into account in our probabilistic model is one of our main priorities. A more sophisticated simulator is scheduled, as well as an implementation. Finally, part of our future work is to consider persistence of messages in order to tolerate crash failures and long periods of disconnection.

8 Appendix

Let us now establish the result of Theorem 1. Call a vertex isolated if it has no incoming arcs. Clearly, there is a directed path from the source to every other vertex only if there are no isolated vertices (other than, possibly, the source itself). Now, for each vertex $x \neq s$, we have
$$\mathrm{IP}(x \text{ is isolated}) = (1-p_n)^{n-1} = \frac{e^{-c}}{n}\,[1+o(1)].$$

Moreover, under the random graph model $\mathcal{G}(n, p_n)$, the isolation of distinct vertices are independent events. Thus,
$$\mathrm{IP}(\text{no vertex other than } s \text{ is isolated}) = \Big(1 - \frac{e^{-c}}{n}[1+o(1)]\Big)^{n-1} = e^{-e^{-c}}\,[1+o(1)].$$
Recalling that $\pi(p_n, n)$ denotes the probability that every vertex is reachable from the source via a directed path, it is immediate from the above that
$$\limsup_{n\to\infty} \pi(p_n, n) \le e^{-e^{-c}}. \qquad (3)$$

In fact, this simple calculation essentially yields the correct estimate of the probability of there being a directed path to every vertex. In other words, isolated vertices constitute the main contribution to the probability of a vertex not being reachable from the source. We now establish the reverse inequality to (3) and, thereby, the claim of Theorem 1. The following estimates will be needed.

Lemma 3 Let $np_n = \log n + c + o(1)$. Then, for all $n$ sufficiently large, and all $r \in \{1, \ldots, n/2\}$, we have
$$f_n(r) := \binom{n}{r}(1-p_n)^{r(n-r)} \le \frac{e^{-(c-1)r}}{\lfloor r/2 \rfloor!}.$$
Moreover, $\lim_{n\to\infty} f_n(r) = e^{-cr}/r!$ for each fixed $r$.

Proof We have $n\log(1-p_n) = -\log n - c + o(1)$, and so
$$\log f_n(r) = r\log n - \log r! + \sum_{j=1}^{r-1}\log\Big(1-\frac{j}{n}\Big) - \Big(r-\frac{r^2}{n}\Big)\big(\log n + c + o(1)\big)$$
$$= -\log r! - r[c+o(1)] + \sum_{j=1}^{r-1}\log\Big(1-\frac{j}{n}\Big) + \frac{r^2(\log n + c)}{n}.$$
It is clear from the second equality above that, for fixed $r$,
$$\lim_{n\to\infty} \log f_n(r) = -\log r! - cr.$$

This verifies the second claim of the lemma. A standard comparison of sums and integrals ensures that $\log r! \ge \int_1^r \log x\, dx$, which in turn yields $\log r! \ge r\log r - r + 1$ for all $r \ge 1$. Hence, for all $r \in \{1, \ldots, n/2\}$ and $n$ sufficiently large,
$$\log f_n(r) \le -(c-1)r - r\log r\,\Big(1 - \frac{\log n / n}{\log r / r}\Big).$$
It can be verified by differentiation that $\log x / x$ is a decreasing function of $x$ for $x > e$. Hence, for all $r \in \{3, \ldots, n/2\}$,
$$\log f_n(r) \le -(c-1)r - r\log r\,\Big(1 - \frac{\log n}{2\log(n/2)}\Big) = -(c-1)r - r\log r\,\Big(\frac{1}{2} - \frac{\log 2}{2\log(n/2)}\Big)$$
$$\le -(c-1)r - \frac{r}{2}\log\frac{r}{2} \le -(c-1)r - \log\Big\lfloor \frac{r}{2} \Big\rfloor!\,.$$
We have used the fact that $\log n! \le n\log n$ to obtain the last inequality. This establishes the first claim of the lemma for $3 \le r \le n/2$. It is straightforward to verify the claim for $r = 1, 2$.

Lemma 4 Let $f_n$ be defined as in Lemma 3. Given $\varepsilon > 0$, we can find $R$ such that, for all $n$ sufficiently large,
$$\sum_{r=R+1}^{n-1} \Big(1 - \frac{r}{n}\Big) f_n(r) < \varepsilon.$$

Proof We have from Lemma 3 that, for all $r \in \{1, \ldots, n/2\}$ and $n$ sufficiently large,
$$f_n(r) \le g(r) := \frac{e^{-(c-1)r}}{\lfloor r/2 \rfloor!}.$$
But the positive sequence $g(r)$ is summable, since
$$\sum_{r=0}^{\infty} g(r) \le \sum_{r=0}^{\infty} \frac{1}{r!}\Big[e^{-(c-1)2r} + e^{-(c-1)(2r+1)}\Big] \le \big(1 + e^{-(c-1)}\big)\, e^{e^{-2(c-1)}}.$$
Hence, given $\varepsilon > 0$, we can choose $R$ large enough that, for all $n$ sufficiently large,
$$\sum_{r=R+1}^{n/2} f_n(r) \le \sum_{r=R+1}^{\infty} g(r) \ldots$$
Since $f_n(r) = f_n(n-r)$, ...