Robust Monitoring of Network-wide Aggregates through Gossiping

Fetahi Wuhib, Mads Dam, Rolf Stadler
KTH Royal Institute of Technology, Stockholm, Sweden
{fetahi, mfd, stadler}@kth.se

Alexander Clemm
Cisco Systems, San Jose, California, USA
[email protected]

Abstract— We examine the use of gossip protocols for continuous monitoring of network-wide aggregates. Aggregates are computed from local management variables using functions such as AVERAGE, MIN, MAX, or SUM. A particular challenge is to develop a gossip-based aggregation protocol that is robust against node failures. In this paper, we present G-GAP, a gossip protocol for continuous monitoring of aggregates, which is robust against discontiguous failures (i.e., under the constraint that neighboring nodes do not fail within a short period of each other). We formally prove this property, and we evaluate the protocol through simulation using real traces. The simulation results suggest that the design goals for this protocol have been met. For instance, the tradeoff between estimation accuracy and protocol overhead can be controlled, and a high estimation accuracy (below some 5% error in our measurements) is achieved by the protocol, even for large networks and frequent node failures. Further, we perform a comparative assessment of G-GAP against a tree-based aggregation protocol using simulation. Surprisingly, we find that the tree-based aggregation protocol consistently outperforms the gossip protocol for comparable overhead, both in terms of accuracy and robustness.

Keywords: gossip protocol, robust protocol, epidemic aggregation, decentralized monitoring

I. INTRODUCTION

The motivation of this research is to investigate the use of gossip protocols for decentralized real-time monitoring. Recent research in gossip protocols suggests to us that these types of protocols may help engineer a new generation of monitoring systems that are highly scalable and fault tolerant. Gossip protocols, also known as epidemic protocols, can be characterized by asynchronous and often randomized communication among nodes in a network [10][3]. Originally, they were proposed for disseminating information in large dynamic environments [10]; more recently, they have been applied to various tasks, including constructing robust overlays [1][17] and estimating the network size [9][6]. We are specifically interested in assessing the use of gossip protocols for decentralized aggregation of device data in near real-time. Aggregation functions, commonly used by management applications, include SUM, MAX and AVERAGE of device-level counters and other variables. Examples of such aggregations are:

(a) the average load across all network links; (b) the number of active voice calls in a given domain. Specific applications that require such information include network surveillance, service assurance, and traffic control in large-scale or dynamic networks. Admission control, for example, can make use of cross-device network load and QoS measurements to decide whether to accept or reject flows into a network domain.

A gossip protocol for monitoring network-wide aggregates executes in the context of a decentralized management architecture. Figure 1 shows an example of such an architecture, which we propose using for this purpose. In this architecture, monitoring nodes with identical functionality organize themselves into a management overlay. The aggregation protocol (in this case, the gossip protocol) runs in the monitoring nodes, which communicate via the overlay. Each monitoring node collects data from one or more network devices. This data is aggregated, in a decentralized fashion, to estimate the MIN, MAX, SUM, AVERAGE, etc., of the device variables through the aggregation protocol. A management station or an application server can access the monitoring layer at any node. Node or link failures, whether in the physical network or in the management overlay, trigger a re-organization of the management overlay, thereby enabling continuous operation. (Note that the protocol introduced in this paper can also run on a different architecture, as long as the architecture includes monitoring nodes that execute the protocol, provides functions in the monitoring nodes to access local device variables, and maintains an overlay topology for the monitoring nodes to exchange information.)

Recently, other approaches to decentralized aggregation, which are based on creating and maintaining spanning trees in the management overlay, have been investigated by others [12][13][14], and also by us [4][15][16][18]. There are qualitative and quantitative differences between tree-based and gossip-based aggregation. First, gossip-based aggregation protocols tend to be simpler in the sense that they do not maintain a distributed tree in the management overlay. Second, in tree-based aggregation, the result of an aggregation operation is available on the so-called root node of the tree, while in gossip-based aggregation, the result is available on all nodes. Third, failure handling is very different for tree-based



FIGURE 1: ARCHITECTURE OF THE DECENTRALIZED MONITORING SYSTEM. GOSSIP PROTOCOLS RUN IN THE MANAGEMENT OVERLAY (MIDDLE LAYER)

aggregation than for gossip-based aggregation. If a node fails, a tree-based aggregation protocol needs to reconstruct the aggregation tree, for which there are well-understood techniques. In gossip protocols, node failure can produce mass loss (which is further explained in section III). Unfortunately, the problem of mass loss has not been sufficiently studied to date. As a consequence, there is no gossip protocol for monitoring aggregates available today that is robust to node failures. We conclude that, in order to perform a comparative assessment of tree-based and gossip-based aggregation, the problem of mass loss needs to be addressed first.

In this paper, we extend a gossip protocol in [3], which computes a snapshot of the network aggregate at a specific time, to a protocol that continuously produces an estimate of the aggregate as it changes over time. More importantly, we extend the protocol in such a way that certain mass invariants hold, which makes the protocol robust against many classes of node failures. We call the resulting protocol G-GAP, for Gossip-based Generic Aggregation Protocol. We prove these invariants for the synchronous and the asynchronous version of the protocol. We evaluate G-GAP through simulation, focusing on the accuracy of the estimates produced, the tradeoff between estimation accuracy and protocol overhead, the relationship between accuracy and network size (i.e., scalability), and the relationship between accuracy and failure rate (i.e., robustness). For the sake of comparison, we run the same simulation scenarios with a tree-based aggregation protocol that has a similar overhead, which provides us with insight into the performance of tree-based vs. gossip-based monitoring.

The rest of the paper is organized as follows. Section II reviews related work. Section III presents our protocol, starting out with push synopses [3], which is developed first into a synchronous robust protocol and then into an asynchronous robust protocol, G-GAP. Section IV presents results from our experimental evaluation. Finally, Section V concludes the paper and presents future work.

II. RELATED WORK

A paper by van Renesse et al. [8] is the first work known to us where a gossip protocol is used in the context of monitoring aggregates. Here, the authors present a hierarchically structured monitoring system called Astrolabe. Aggregation functions are computed using tree-based aggregation, while a gossip protocol disseminates partial aggregates among peers in the hierarchy. The choice of using gossiping is motivated by achieving robustness: in case a parent node fails, an elected child node can replace the parent instantly. In contrast to our research, [8] aims at providing periodic estimates of a (slowly) changing global aggregate, while the goal for our protocol is to give a continuous estimate of a global aggregate in near real-time and with high accuracy. Since [8] uses a hierarchical aggregation scheme, the authors do not address the problem of mass loss (see section III) in gossip protocols as our work does.

Jelasity et al. in [2] present a gossip protocol for computing global aggregates. The protocol can be used for distributed polling or for computing a snapshot of the system state. In order to support continuous monitoring of aggregates, the authors suggest restarting the protocol periodically, which amounts to periodic polling. While the protocol in [2] is not robust against node failures, our protocol is robust in the sense that it recovers lost mass for many types of node failures. (Since the protocol in [2] is restarted periodically, mass loss does not accumulate, and estimation errors are mitigated in this way.)

In a paper by Kempe et al. [3], the authors introduce a set of gossip protocols for computing a number of aggregation functions. The protocols presented do not directly support continuous monitoring of an aggregate, and they are not robust to node failures. Our work is based on one of these protocols, and, in this paper, we extend it to support continuous monitoring and robust operation.

Ghodsi et al. in [6] present a gossip algorithm for estimating the system size of a peer-to-peer system that uses DHTs and a ring topology. The algorithm computes the average inter-node distance on the ring to determine the system size. The protocol is robust to node failures. It periodically generates a wave that moves around the ring for the purpose of detecting mass loss and recovering from it. This work is similar to ours in that the protocol provides a continuous estimate and is robust to node failures. However, the specific recovery mechanism is different from ours, since it relies on the ring topology and is directly applicable only to estimating the system size.

III. THE PROTOCOL: G-GAP

A. Design goals and design approach

The management architecture. G-GAP is designed for the management architecture shown in Figure 1, in which each network device participates in protocol processing by running a management process, either internally or on an external, associated device. (A monitoring node in Figure 1 corresponds to a management process.) These management processes communicate via a network overlay for the purpose of monitoring a network-wide aggregate. We refer to this overlay also as the network graph. A node of this graph represents a network device together with its management process. Each node of the network graph has an associated local (management) variable x_i(t). The local variable can represent a MIB variable, e.g., a device counter. In this paper, we assume that the variables are aggregated using AVERAGE. It is also possible to use other aggregation functions, such as SUM, MIN, or MAX, with this protocol.


Round 0 {
  1. s_i = x_i ;
  2. w_i = 1 ;
  3. send (s_i, w_i) to self
}
Round r+1 {
  1. let {(s_l, w_l)} be all pairs sent to i during round r
  2. s_i = ∑_l s_l ;  w_i = ∑_l w_l
  3. choose shares α_{i,j} ≥ 0 for all nodes j such that ∑_j α_{i,j} = 1
  4. for all j send (α_{i,j}·s_i, α_{i,j}·w_i) to j
}

FIGURE 2: PUSH-SYNOPSES – PSEUDO CODE FOR NODE i

Design goals. Our aim is to develop a distributed protocol for continuously computing aggregation functions in a scalable and robust manner. The design goals for G-GAP are as follows:

• Accuracy: for a given protocol overhead, the estimation error should be small, and the variance of the estimation error across all nodes should be small.
• Controllability: it should be possible from a management station to control the tradeoff between protocol overhead and accuracy of the estimation.
• Scalability: for a fixed accuracy, the local protocol overhead at any node or link should in general increase sub-linearly with system size.
• Robustness: the protocol should be robust to node failures and should allow for nodes dynamically joining and leaving the network. During transient periods, the estimation error due to reconfiguration should be small.

Design approach. As stated earlier, G-GAP is based on “Push synopses”, a gossip protocol for polling aggregates proposed by Kempe et al. [3]. We extend this protocol in two directions. First, we adapt it to continuous monitoring. (In fact, such an adaptation is already suggested in [3].) Second, and more importantly, we extend the protocol with a scheme that provides accurate estimates in the event of (many classes of) node failures. These extensions occur in two steps: first, for the case of fully synchronized rounds with guaranteed, timely message delivery; then, for the more general, asynchronous case.

B. Push synopses

We first recall the push-synopses protocol of Kempe et al. [3]. We consider only aggregation using AVERAGE, but the protocol is easily generalized to other aggregation functions, such as SUM, MIN, MAX, etc. (cf. [2][3]). In this protocol, each node i maintains, in addition to the local management variable x_i, a weight w_i and a sum s_i. The local estimate of the aggregate is computed as a_i = s_i / w_i. We assume the total number of nodes is n. The protocol proceeds in synchronized rounds, assuming reliable and timely communication, such that a message sent within a given round is also guaranteed to be delivered to the recipient within that round.

The correctness of the push-synopses protocol relies crucially on the following invariant for mass conservation. We use x_{r,i} to refer to the value of a variable x_i at the end of round r.

Proposition 1 (Mass Conservation, Push Synopses [3]) For all rounds r ≥ 0,
1. ∑_i s_{r,i} = ∑_i x_{0,i}
2. ∑_i w_{r,i} = n

Proof: Since the only communication between nodes is by message passing, it suffices to show that if the property holds before the main protocol cycle is executed by a node i, then it holds after execution as well. □

In [3], the push-synopses protocol is shown to converge exponentially fast to the correct average, and probabilistic bounds on the accuracy of the local aggregates are determined. This convergence result carries over to the protocols proposed here with minor modifications, which we will show in detail in a forthcoming publication.

The assumption of round synchronization can be lifted by adding round identifiers to messages and by buffering. Mass conservation remains valid under these changes [3]. Also, message loss can be tolerated, provided that the underlying transport mechanism guarantees reporting of message loss. In the event of message loss, the protocol retransmits the lost message to self. The mass conservation property remains valid if message loss is always reported within the same round the message was sent; if this cannot be guaranteed, then the statement of prop. 1 must be adapted by taking into account the “mass” of lost messages that were sent, but have so far not been reported lost, in earlier rounds.

The protocol is given for the case of polling, i.e., where the variables x_i are constant. It is easily adapted to continuous monitoring, by sampling the value of x_i at each round and adding the change in x_i to s_i in step 2 of Figure 2. Step 2 for round r thus becomes

2. s_i = ∑_l s_l + (x_{r,i} − x_{r−1,i}) ,  w_i = ∑_l w_l

This protocol assumes that all nodes can directly send messages to each other. However, it can be easily adapted to the case where the protocol executes on a network graph, which means that only adjacent nodes can communicate directly with each other. This adaptation is crucial in practice. In this case, we require that α_{i,j} = 0 if i ≠ j and j is not adjacent to i. This applies also to the other protocols given below, where it also suffices to store state information for adjacent nodes.
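To make the mechanics concrete, the following is a minimal simulation sketch of push synopses with the continuous-monitoring variant of step 2. It is not the authors' code: it assumes a complete network graph, the classic push-sum share choice (each node keeps half of its mass and pushes the other half to one uniformly random node), and illustrative variable names. It checks the mass conservation invariants of Proposition 1 and reports the local estimates s_i/w_i.

import random

def push_synopses(x_series, rounds):
    # x_series[r][i]: value of the local variable of node i in round r (hypothetical trace)
    n = len(x_series[0])
    s = list(x_series[0])                      # round 0: s_i = x_i
    w = [1.0] * n                              #          w_i = 1
    prev_x = list(x_series[0])
    for r in range(1, rounds):
        x = x_series[min(r, len(x_series) - 1)]
        sent_s = [0.0] * n
        sent_w = [0.0] * n
        for i in range(n):                     # shares: keep half, push half to a random node
            j = random.randrange(n)
            sent_s[i] += s[i] / 2; sent_w[i] += w[i] / 2
            sent_s[j] += s[i] / 2; sent_w[j] += w[i] / 2
        # step 2, continuous-monitoring variant: add the change of the local variable
        s = [sent_s[i] + (x[i] - prev_x[i]) for i in range(n)]
        w = sent_w
        prev_x = list(x)
        assert abs(sum(s) - sum(x)) < 1e-6     # mass conservation: sum_i s_i = sum_i x_{r,i}
        assert abs(sum(w) - n) < 1e-6          #                    sum_i w_i = n
    return [s[i] / w[i] for i in range(n)]     # local estimates of the network-wide average

random.seed(1)
trace = [[random.uniform(40, 50) for _ in range(20)]]
for _ in range(60):                            # a slowly drifting local variable per node
    trace.append([v + random.uniform(-1, 1) for v in trace[-1]])
estimates = push_synopses(trace, rounds=60)
print(sum(trace[59]) / 20, min(estimates), max(estimates))

On a sparse network graph the same sketch applies with shares restricted to adjacent nodes, as required above; the estimates then converge more slowly but remain mass-conserving.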


Round 0 {
  1. s_i = s1s_{i,i} = s2s_{i,i} = x_i ;
  2. w_i = s1w_{i,i} = s2w_{i,i} = 1 ;
  3. for each node j {
       (rs_{i,j}, rw_{i,j}) = (0,0) ;     // recovery share for node j
       (s1s_{i,j}, s1w_{i,j}) = (0,0) ;   // share sent to j in the previous round
       (s2s_{i,j}, s2w_{i,j}) = (0,0) } ; // share sent to j in the round before last
  4. send (s_i, w_i, 0, 0) to self
  5. send (0, 0, 0, 0) to all other nodes
}
Round r+1 {
  1. let {(s_l, w_l, rs_l, rw_l) | l ∈ L} be all messages sent to i from sender l during round r
  2. s_i = ∑_{l∈L} s_l ;  w_i = ∑_{l∈L} w_l ;
     for all l ∈ L let (rs_{i,l}, rw_{i,l}) = (rs_l, rw_l)
  3. for all k that did not send a message {
       s_i = s_i + s1s_{i,k} + s2s_{i,k} + rs_{i,k} ;  s1s_{i,k} = s2s_{i,k} = rs_{i,k} = 0 ;
       w_i = w_i + s1w_{i,k} + s2w_{i,k} + rw_{i,k} ;  s1w_{i,k} = s2w_{i,k} = rw_{i,k} = 0 }
  4. for all j choose shares α_{i,j} ≥ 0 such that ∑_j α_{i,j} = 1 and α_{i,j} = 0 whenever j ∉ L
  5. for all j { (s2s_{i,j}, s2w_{i,j}) = (s1s_{i,j}, s1w_{i,j}) ;
                 (s1s_{i,j}, s1w_{i,j}) = (α_{i,j} s_i, α_{i,j} w_i) }
  6. for all j choose shares β_{i,j} ≥ 0 such that ∑_j β_{i,j} = 1 and β_{i,j} = 0 whenever j = i or j ∉ L
  7. for all j send (s1s_{i,j}, s1w_{i,j}, β_{i,j}(α_{i,i} s_i − x_i), β_{i,j}(α_{i,i} w_i − 1)) to j
}

FIGURE 3: SYNCHRONOUS G-GAP – PSEUDO CODE FOR NODE i

C. Synchronous G-GAP

In case of crash failures, the push-synopses protocol of section III.B can no longer be guaranteed to converge to the true value. There are essentially two problems:
1. The failed node's contribution to the aggregate should be undone.
2. Mass conservation will in general fail.

In this section we present a first adaptation of the push-synopses protocol to the case of benign crash failures, under rather strict assumptions. Later we show how these assumptions can be partially lifted, at the expense of a somewhat more complex protocol. The synchronous G-GAP protocol is shown in Figure 3. The protocol is given for polling; it is adapted to continuous monitoring in the same way as the push-synopses protocol. The basic idea is that each node i distributes recovery shares to its neighbors such that, if node j discovers that node i has failed, node j can use the recovery share previously received to recover the relevant part of node i's mass. The solution in Figure 3 relies on five rather strong assumptions:

1. Reliable and timely message delivery: There is a number t_d < t_r (the round length) such that any message sent from a node i to a node j at time t is delivered to j at a time no later than t + t_d, provided node j is live.
2. Synchronized rounds: Rounds are globally synchronized to within some bound t_Δr. That is, all live nodes start a round within t_Δr of each other. The input variables x_i change only between rounds.
3. Round atomicity: All protocol cycles are executed as atomic statements.
4. Discontiguous failure events: No two nodes fail within two rounds of each other. (When running this protocol on a network graph, it suffices that the nodes are adjacent.)
5. Connectedness: No failure will cause a node to become disconnected.

The assumption of round synchronicity allows us to unambiguously determine the value of a variable during each (global) round r. For instance, the value of x_i during r is x_{r,i}, the value of s1s_{i,j} is s1s_{r,i,j}, etc. Round atomicity reduces, essentially, to atomic broadcast, since then the protocol cycles can be executed within a single transaction. This can be realized efficiently on some physical media, but the general implications of this assumption remain to be investigated. Round atomicity ensures that, during each round, if some message is received then all messages are received and responded to. With this, assumptions 1 and 2 guarantee that a round time t_r can be found such that all messages are received during the same round they were sent.

Node failure, then, can be detected as in step 3, when a node fails to receive a message from a neighbor. Thus, if node i did not receive a message from node k in round r − 1, it means that node k did not receive the message sent from i to k in round r − 1 (s1s_{i,k}), nor did it process the message sent in the round before (s2s_{i,k}). Hence node i must, in round r, restore not only i's recovery share for k but also the contributions sent by i to k in the two previous rounds.
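As an illustration of this recovery mechanism, the following is a minimal simulation sketch, not the authors' implementation. It assumes a complete graph, uniform shares (α_{i,j} = 1/|L| and β_{i,j} = 1/(|L|−1) for j ≠ i), polling (constant local variables), and a single crash. After the crash, the surviving nodes restore the stored shares and the recovery share for the failed node, and their estimates converge to the average over the surviving nodes only.

import random

def sg_gap_sketch(x, crash_node, crash_round, rounds=30):
    n = len(x)
    live = set(range(n))
    s = list(x); w = [1.0] * n
    # per-node bookkeeping: shares sent to j in the last two rounds, recovery share for j
    s1s = [[0.0] * n for _ in range(n)]; s1w = [[0.0] * n for _ in range(n)]
    s2s = [[0.0] * n for _ in range(n)]; s2w = [[0.0] * n for _ in range(n)]
    rs = [[0.0] * n for _ in range(n)]; rw = [[0.0] * n for _ in range(n)]
    # round 0: each node sends (x_i, 1, 0, 0) to itself and (0, 0, 0, 0) to all others
    inbox = {i: {j: (0.0, 0.0, 0.0, 0.0) for j in range(n)} for i in range(n)}
    for i in range(n):
        inbox[i][i] = (x[i], 1.0, 0.0, 0.0)
    for r in range(1, rounds + 1):
        if r == crash_round:
            live.discard(crash_node)            # the node silently stops participating
        new_inbox = {i: {} for i in range(n)}
        for i in live:
            msgs = inbox[i]
            senders = set(msgs)
            s[i] = sum(m[0] for m in msgs.values())
            w[i] = sum(m[1] for m in msgs.values())
            for l in senders:
                rs[i][l], rw[i][l] = msgs[l][2], msgs[l][3]
            for k in range(n):                  # step 3: restore state kept for silent nodes
                if k not in senders:
                    s[i] += s1s[i][k] + s2s[i][k] + rs[i][k]
                    w[i] += s1w[i][k] + s2w[i][k] + rw[i][k]
                    s1s[i][k] = s2s[i][k] = rs[i][k] = 0.0
                    s1w[i][k] = s2w[i][k] = rw[i][k] = 0.0
            alpha = 1.0 / len(senders)          # steps 4-7: uniform shares to last round's senders
            beta = 1.0 / (len(senders) - 1) if len(senders) > 1 else 0.0
            for j in range(n):
                s2s[i][j], s2w[i][j] = s1s[i][j], s1w[i][j]
                in_l = j in senders
                s1s[i][j] = alpha * s[i] if in_l else 0.0
                s1w[i][j] = alpha * w[i] if in_l else 0.0
                if in_l:
                    b = beta if j != i else 0.0
                    new_inbox[j][i] = (s1s[i][j], s1w[i][j],
                                       b * (alpha * s[i] - x[i]),
                                       b * (alpha * w[i] - 1.0))
        inbox = new_inbox
    return {i: s[i] / w[i] for i in sorted(live)}

random.seed(0)
x = [random.uniform(0, 100) for _ in range(8)]
print(sum(v for i, v in enumerate(x) if i != 3) / 7)    # average over the survivors
print(sg_gap_sketch(x, crash_node=3, crash_round=5))    # every surviving estimate should match it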

For the statement of the mass conservation property for SG-GAP, we assume that round 0 by convention refers to the initialization phase, and that all nodes are alive during this round.

Proposition 2 (Mass Conservation, SG-GAP) Let L_r be the set of nodes live during round r. At the end of each round r ≥ 0,
1. ∑_{i∈L_r} s_{r,i} + ∑_{i∈L_r, j∉L_r} s2s_{r,i,j} + ∑_{i∈L_{r−1}−L_r, j∈L_r} rs_{r,j,i} = ∑_{i∈L_r} x_{r,i}
2. ∑_{i∈L_r} w_{r,i} + ∑_{i∈L_r, j∉L_r} s2w_{r,i,j} + ∑_{i∈L_{r−1}−L_r, j∈L_r} rw_{r,j,i} = |L_r|

Proof: See appendix.

The invariants are explained thus: the total system mass at round r is composed of three components:
1. Mass “belonging” to each live node i, the s_{r,i}.
2. Mass which was sent by a currently live node i to a currently dead node j in the previous round (and which j consequently is unable to answer in the current round), the s2s_{r,i,j}.
3. Recovery mass which was sent in the previous round from a now dead node i to a currently live node j.

Observe that proposition 2 is sufficient to re-establish the exponential convergence result of [3] for the case of networks that are eventually stable (i.e., from some time t_c onwards, no nodes fail or restart). This is so, since one round after t_c the behaviour of SG-GAP is in some sense identical to that of push synopses, and the invariant of prop. 2 reduces to ∑_{i∈L_r} s_{r,i} = ∑_{i∈L_r} x_{r,i}.

D. Asynchronous G-GAP

In this section we relax the synchrony assumptions of SG-GAP. The basic idea is to drop the strictly synchronous failure handling of SG-GAP and, instead, to let a node compute recovery shares incrementally, by explicitly acknowledging the mass it receives from a peer. The pseudo-code for a protocol based on this idea, asynchronous G-GAP, is shown in Figure 4. From an application point of view, this asynchronous version of the protocol is the most significant protocol described here, and, therefore, we refer to it simply as G-GAP in the rest of this paper. As in the case of SG-GAP, the protocol is given for polling and for a complete (i.e., fully connected) network graph. The adaptation to more general network graphs follows the pattern set out in subsection III.B.

round(0) {
  1. s_i = x_i ;
  2. w_i = 1 ;
  3. L_i = {i} ;
  4. for each node j (rs_{i,j}, rw_{i,j}) = (0,0) ;
  5. for each node j (srs_{i,j}, srw_{i,j}) = (0,0) ;
  6. send (s_i, w_i, 0, 0, 0, 0) to self ;
  7. for all j ≠ i send (0, 0, 0, 0, 0, 0) to j
}
round(r+1) {
  1. let Rec be all messages received by i during round r
  2. s_i = ∑_{m∈Rec} s(m) + (x_r − x_{r−1}) ;  w_i = ∑_{m∈Rec} w(m)
  3. for all j (acks_{i,j}, ackw_{i,j}) = (0, 0)
  4. L_i = L_i ∪ orig(Rec)
  5. for all j ∈ N {
       a. (rs_{i,j}, rw_{i,j}) = (rs_{i,j}, rw_{i,j}) + ∑_{m: orig(m)=j} (s(m), w(m))
       b. (acks_{i,j}, ackw_{i,j}) = (srs_{i,j}, srw_{i,j}) + ∑_{m: orig(m)=j} (rs(m) − acks(m), rw(m) − ackw(m))
     }
  6. for all j: if (detected_failure(j)) {
       a. (s_i, w_i) = (s_i, w_i) + (rs_{i,j}, rw_{i,j})
       b. (rs_{i,j}, rw_{i,j}) = (srs_{i,j}, srw_{i,j}) = (0, 0)
       c. L_i = L_i \ {j}
     }
  7. for all j ∈ L_i {
       a. choose α_{i,j} ≥ 0 such that ∑_j α_{i,j} = 1
       b. choose β_{i,j} ≥ 0 such that ∑_j β_{i,j} = 1 and β_{i,i} = 0
       c. (srs_{i,j}, srw_{i,j}) = (β_{i,j}(α_{i,i} s_i − x_i), β_{i,j}(α_{i,i} w_i − 1))
       d. send (α_{i,j} s_i, α_{i,j} w_i, srs_{i,j}, srw_{i,j}, acks_{i,j}, ackw_{i,j}) to j
       e. (rs_{i,j}, rw_{i,j}) = (rs_{i,j} + α_{i,j} s_i, rw_{i,j} + α_{i,j} w_i)
     }
}

FIGURE 4: (ASYNCHRONOUS) G-GAP – PSEUDO CODE FOR NODE i

Compared to SG-GAP, the assumption of round synchronization is relaxed, as is the assumption of timely message delivery. With these modifications, nodes have less precise knowledge of other nodes' state. For this reason, in addition to the recovery mass already introduced in SG-GAP, a further pair (acks_{i,j}, ackw_{i,j}) is added to the message sent in step 7.d of Figure 4, in order to let nodes acknowledge the receipt of messages. The pair (acks_{i,j}, ackw_{i,j}) is computed as what node i believes to be j's recovery information on i, i.e., (rs_{j,i}, rw_{j,i}). However, lack of synchronization and message transmission delays may make these values differ.

The asynchronous setting makes some changes in notation convenient.

Most importantly, we consider system events to be serialized in a discrete time model. That is, failure events and protocol cycle executions (both of which we call transitions) are considered to be atomic, and the time axis is discretized into points t_0, t_1, ... such that, at each instant t_n, exactly one of two events occurs:
• a protocol cycle is executed at node i;
• node i fails.

Node recovery is not considered yet, though we believe it should be possible to extend prop. 3 below to this case as well. In the absence of round synchrony, local variables need to be sampled in a slightly different way than in the case of SG-GAP. Here we apply the following convention: if, say, s_{i,j} is a local variable at node i, then s_{t,i,j} refers to the value of s_{i,j} at time instant t+, t ∈ {t_n | n ∈ ω}, i.e., immediately upon completion of the corresponding event.

Concerning the communication model, we assume reliable message transmission in the sense that a message m, generated (i.e., sent) by the execution of a protocol cycle at node i, is regarded as “pending” until either the destination node j fails or the message is consumed (i.e., read) by the execution of a protocol cycle at j. We can thus define the following sets:
• M_pending,t,i,j: the set of all messages from origin i to destination j which are pending immediately upon completion of the transition at time t.
• M_read,t,i,j: the set of messages from origin i to destination j which are read during a transition on node j at time t. If no transition takes place on node j at time t, then M_read,t,i,j = ∅.
• M_write,t,i,j: the set of messages from origin i to destination j which are generated by node i through the execution of a protocol cycle at time t. Again, if no cycle is executed on node i at time t, then M_write,t,i,j = ∅.
• M_transit,t,i,j = M_pending,t,i,j − M_write,t,i,j: those messages that are in transit, i.e., pending but not generated at time t.

We obtain the following straightforward axiom reflecting this model:

Axiom 1 (Communication model) For all n ∈ ω,
M_pending,t_{n+1},i,j = (M_pending,t_n,i,j ∪ M_write,t_{n+1},i,j) − M_read,t_{n+1},i,j

Messages in G-GAP have the format (s, w, srs, srw, acks, ackw), where s, w, etc., are real-valued variables. We use the notation
s_pending,t,i,j = ∑ {s | ∃w, srs, ... : (s, w, srs, ...) ∈ M_pending,t,i,j}
and, similarly, for the other variables w, srs, etc., and the other indices read, write and transit.

For provably correct operation, the G-GAP protocol requires the following assumptions to be met:
1. Self messages: A message a node sends to itself will be immediately available for consumption. This assumption can be lifted, we conjecture, by adding sequence numbers to the protocol or by storing the message content in a local variable.
2. Correlated failure and message signaling: Message generation / consumption events and failure events occur in the same order at origin and destination nodes. Observe that this does not entail ordered communication in general.
3. Discontiguous failures: If a failure occurs, then no other live node can fail until the failure event has been processed by all nodes. Alternatively, we can say: if a failure event occurs at time t_fail, then there is a time Δt_detect such that all nodes alive at time t_fail have processed this event by t_fail + Δt_detect. In addition, no node fails in between t_fail and t_fail + Δt_detect.

The statement of mass conservation now needs to take into account both received and pending messages.

Proposition 3 (Mass Conservation, G-GAP) Let L be the set of all nodes and L_{t_n} the set of live nodes at time t_n. Then, at all times t_n > 0:
1. ∑_{i∈L_{t_n}} x_i = ∑_{j∈L, i∈L_{t_n}} s_pending,t_n,j,i + ∑_{i∈L_{t_n}, j∉L_{t_n}} rs_{t_n,i,j} + ∑_{i∈L_{t_n}, j∉L_{t_n}} (rs_pending,t_n,j,i − acks_pending,t_n,j,i)
2. |L_{t_n}| = ∑_{j∈L, i∈L_{t_n}} w_pending,t_n,j,i + ∑_{i∈L_{t_n}, j∉L_{t_n}} rw_{t_n,i,j} + ∑_{i∈L_{t_n}, j∉L_{t_n}} (rw_pending,t_n,j,i − ackw_pending,t_n,j,i)

Proof: See appendix.

Proposition 3 expresses that the sum of the local variables at all live nodes can be computed as the sum of all pending mass to all live nodes, plus the sum of the recovery shares for the failed nodes at the live nodes, plus the sum of pending recovery shares, minus the sum of pending acknowledgements.

Prop. 3 allows us to derive a convergence result only when message transmission time is bounded. Under this assumption, if no more failures can occur after some time t, then it is possible to identify a time t_stable > t at which all pending recovery information from failed nodes will have been received, and the recovery shares for failed nodes will be 0. Prop. 3 then reduces to
∑_{i∈L_{t_stable}} x_i = ∑_{i,j∈L_{t_stable}} s_pending,t_stable,i,j
and the behavior of G-GAP reduces to that of push synopses.

Note that if G-GAP executes on a general network graph, the assumption of discontiguous failures relates to the neighborhood of a node, rather than to the set of all network nodes. As a consequence, G-GAP is robust to multiple concurrent failures as long as they occur in different neighborhoods.


IV. EXPERIMENTAL EVALUATION

We have evaluated G-GAP (called asynchronous G-GAP in section III.D) through extensive simulations using the SIMPSON simulator, a discrete event simulator that allows us to simulate packet exchanges over large network topologies and packet processing on the network nodes [5]. In various scenarios, we measure the estimation error of G-GAP on the network nodes, as a function of the round rate, the network size, and the failure rate, in order to evaluate the protocol against our design objectives. In addition to G-GAP, we run most simulation scenarios also with GAP [4], a tree-based aggregation protocol that gives an estimate of the aggregate at the root node. This allows us to compare the use of a gossip protocol with a protocol that is based on spanning trees for the purpose of monitoring network-wide aggregates. To make the comparison fair, we measure the performance metrics of both protocols for a comparable overhead. An overview of GAP is presented in the Appendix.

A. Simulation setup and evaluation scenarios

Evaluation metrics. The main evaluation metric is the estimation error of the protocols. For G-GAP, we compute the estimation error as the (absolute) difference between the actual aggregate and the estimate of the aggregate on the nodes. For each simulation run, we determine the average estimation error over the simulation time and over all nodes. In addition, to indicate the dispersion of the error values, we determine the 90th percentile of the error values. In the case of GAP, all measurements relate to the root node, since the estimate of the aggregate is available only at this node. A second evaluation metric is the mass loss, which measures the correctness of the protocol in the case of failures.

Local variables. For all simulation runs, a local variable represents the number of HTTP flows that enter the network at a specific router, and the aggregate represents the current average number of these flows in the network. We simulate the behavior of the local variables based on packet traces captured at the University of Twente [11]. Specifically, we use two traces. The first trace, which we call the University of Twente (UT) trace, is obtained as follows. Packet traces captured at two measurement points were divided into 150 sec segments. From each segment i, we sample every second the number of HTTP flows that were traversing the measurement point. This number gives the value of the local variable x_{t,i} of node i at time t. Across all segments, the average value of x_{t,i} is about 45 flows, and the standard deviation of the change between two consecutive values is about 3.4. The second trace, which we call the Randomized Periodic UT trace, is obtained by scaling the UT trace with a random periodic factor as follows:

x*_{t,i} = int( (1 + rand(0..1) · cos(2π t / 30)) · x_{t,i} )

where rand(0..1) returns a random number uniformly distributed between 0 and 1, and int() returns the integer part of its argument. The average value of x*_{t,i} across all segments is about 47, and the standard deviation of the change between two consecutive values is about 14. The second trace provides us

with local variables that have higher dynamics than the first trace.

Overlay topology. The overlay topologies used for our simulations are generated by GoCast [17], a gossip protocol that builds topologies with bidirectional edges and small diameters. The protocol allows setting the (target) connectivity of the overlay. For this evaluation of G-GAP, we do not simulate the dynamics of GoCast. This means that the overlay topology does not change during a simulation run. Unless stated otherwise, the overlay topology used in the simulations has 654 nodes (this is the size of Abovenet, an ISP). It is generated with a target connectivity of 10, which produces an average distance of 3.1 hops and a diameter of 4 hops in the overlay.

Failures. We assume a failure detection service in the system that allows a node to detect the failure of a neighbor. For our simulations, we assume that the failure of a node is detected within 1 sec.

Other simulation parameters. In addition to the above, we run the simulations with the following parameters unless stated otherwise:

• For G-GAP, the default round length is 250 ms, which means 4 rounds/sec. For GAP, the maximum message rate is 4 msg/sec per overlay link.
• For all nodes i and times t, α_{t,i,j} = 1/(1 + # of neighbors) and β_{t,i,j} = 1/# of neighbors.
• Processing overhead: 1 ms/cycle.
• Network delay across overlay links: 20 ms.
• The length of a simulation run is 50 sec, with a warm-up period of 25 sec and a measurement period of 25 sec.
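For concreteness, the following short snippet illustrates one possible reading of the Randomized Periodic UT trace construction described above. It is not the code used in the evaluation, and ut_trace is a hypothetical stand-in for one UT trace segment.

import math, random

def randomized_periodic(ut_trace, period=30):
    # x*_{t,i} = int( (1 + rand(0..1) * cos(2*pi*t/period)) * x_{t,i} )
    out = []
    for t, x in enumerate(ut_trace):
        factor = 1 + random.random() * math.cos(2 * math.pi * t / period)
        out.append(int(factor * x))
    return out

ut_trace = [45] * 150            # flat stand-in for a 150 sec UT trace segment
print(randomized_periodic(ut_trace)[:10])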

B. Estimation Accuracy vs. Protocol Overhead

In this experiment, we measure the estimation accuracy of G-GAP and GAP as a function of the protocol overhead. We run simulations for round rates of 1, 2, 4, 6, 8 and 10 messages per sec. (For G-GAP, a round rate of 1 per sec means that the protocol executes one round per second, i.e., one protocol cycle per second. For GAP, a round rate of 1 per sec means a maximum of 1 msg/sec on an overlay link.) We run two scenarios for the evaluation of estimation accuracy.

For the first scenario, we use the UT trace to simulate the behavior of the local variables. Figure 5 shows the results. Each measurement point corresponds to one simulation run. The top of the bar indicates the 90th percentile of the estimation error. As expected, for both protocols, increasing the round rate results in a decrease of the estimation error. Therefore, the round rate controls the tradeoff between estimation accuracy and protocol overhead. In addition, for comparable overhead (i.e., the same round rates), the average error of G-GAP is around 8 times that of GAP.


FIGURE 5: ESTIMATION ERROR VS. PROTOCOL OVERHEAD FOR UT TRACE

FIGURE 6: ESTIMATION ERROR VS. PROTOCOL OVERHEAD FOR RANDOMIZED PERIODIC UT TRACE

TABLE 1: TOPOLOGICAL PROPERTIES OF GOCAST-GENERATED OVERLAYS USED IN SIMULATIONS

  # nodes   diameter   avg. distance
  82        3 hops     2.1 hops
  164       4          2.4
  327       4          2.7
  654       4          3.1
  1308      5          3.4
  2616      5          3.7
  5232      6          4.0

In the second scenario, we study the influence of higher dynamics of the local variables by using the Randomized Periodic UT trace to simulate the behavior of the local variables. Figure 6 shows the results. The tops of the bars indicate the 90th percentile of the estimation error. The results show that the average estimation error of both protocols is larger than for the UT trace (2 times larger for G-GAP and 2-10 times larger for GAP). We explain this by the fact that changes in the values of the local variables tend to be larger for this trace than for the UT trace. We observe that the estimation error for GAP is smaller than the error for G-GAP: namely, by a ratio of 1.5 for low round rates and by a ratio of 5 for high round rates. This ratio is smaller than that for the UT trace. More importantly, within the parameter ranges explored, we conclude that GAP outperforms G-GAP in terms of accuracy.

C. Scalability

In this scenario, we measure the estimation accuracy of G-GAP and GAP as a function of the network size. The round rate is set to 4 rounds/sec. We run simulations with GoCast-generated overlays for networks of size 82, 164, 327, 654, 1308, 2616 and 5232 nodes. The target connectivity of GoCast is 10, which results in about 80% of the nodes having a connectivity of 10 and the rest a connectivity of 11. We use the UT trace to simulate the behavior of the local variables. Topological properties of the overlays are presented in Table 1. Figure 7 shows the results. Each measurement point corresponds to one simulation run. The top of the bar indicates the 90th percentile of the estimation error.

FIGURE 7: ESTIMATION ERROR VS. NETWORK SIZE

We observe that, for both protocols, the estimation error seems to be independent of the network size. In the general case, for synthetic traces generated by the same (random) process, we would expect such a result for both GAP and G-GAP. Further, [2] shows, for a polling-based gossip protocol, that the variance of the estimates of the global average across all nodes is independent of the network size. Therefore, this simulation result is not entirely surprising. Also, in this scenario, GAP clearly outperforms G-GAP in terms of accuracy.

D. Robustness against node failures

In this section, we evaluate the robustness properties of G-GAP in three scenarios. In the first scenario, we validate the mass conservation property of G-GAP for the case of discontiguous failures (for which we proved the protocol to be robust), and we compare the result to a gossip protocol for computing aggregates that is not robust. Then, we compare the estimation accuracy of G-GAP and GAP by measuring the estimation error as a function of the failure rate.


FIGURE 8: MASS LOSS IN G-GAP-- (TOP) AND G-GAP (BOTTOM) FOR DISCONTIGUOUS FAILURES

FIGURE 9: ESTIMATION ERROR VS. FAILURE RATE FOR GAP AND G-GAP

Finally, we compare the estimation accuracy of G-GAP and the non-robust gossip protocol as a function of the failure rate. This allows us to assess the effectiveness of the failure recovery mechanism in G-GAP.

For the first scenario, we use the default topology (654 nodes) and simulation settings as described in section IV.A, and simulate the local weight changes using the UT trace. We generate failures as follows. Every 1.25 sec, a node is selected at random. The node fails and recovers after 10 sec. (Note that the generated failures are discontiguous, and therefore the protocol is robust by design.) We run the scenario with G-GAP and with G-GAP without failure recovery, which we call here G-GAP--. (In our simulation runs, G-GAP-- was realized by executing G-GAP and skipping the node-failed calls.) During the simulation run, we measure the mass loss, computed, as in proposition 3.1, for the case of s as
∑_{i∈L_t} x_i − ∑_{i∈L_t} s_{t,i} − ∑_{i∉L_{≤t}} ∑_j s_transit(t,i,j)
and, similarly, for the case of w. The simulation is run for 150 sec, and Figure 8 shows the result.

As can be seen from the figure, G-GAP corrects the effects of node failures, and thus mass loss is transient. For the case of G-GAP--, however, we observe that mass loss is not corrected but accumulates over time. Note that mass loss can be positive or negative. This scenario experimentally validates the robustness property of our G-GAP implementation, which says that, as long as failures are discontiguous, the protocol recovers lost mass and executes correctly.

For the second scenario, we use the same simulation parameters as above but vary the failure rate from 0 to 10 node failures/sec. Failure arrivals are generated by a Poisson process, and failures are uniformly distributed over all running nodes. A node that failed recovers after 10 sec and reappears in the place it had in the overlay before the failure. Note that there is a chance that contiguous failures can occur and that the chance of contiguous failures increases with growing failure rate.

FIGURE 10: ESTIMATION ERROR VS. FAILURE RATE FOR G-GAP AND G-GAP-- (UPPER CURVES). PART OF THE ERROR DUE TO MASS LOSS (LOWER CURVES)

We use the UT trace to simulate the behavior of the local variables. Each simulation run of the scenario is 150 sec (which includes a 25 sec warm-up period). Figure 9 shows the result obtained. Each measurement point corresponds to one simulation run. The tops of the bars indicate the 90th percentile of the estimation error.

As can be seen from the figure, the estimation error for both GAP and G-GAP increases (almost linearly) with the failure rate. We also see that the slope is steeper and the spread is wider for G-GAP than for GAP. This result is somewhat surprising to us. We would have expected a gossip protocol to perform better, compared to a tree-based protocol, under high node failure rates.

In the third scenario, we compare G-GAP and G-GAP--, using the same simulation setup as in the second scenario. We measure, for both protocols, the estimation error and the part of the error that is due to mass loss. The result is presented in Figure 10.


As can be seen from the figure, the part of the error due to mass loss in G-GAP (lowest curve) increases with the failure rate. We explain this behavior by the occurrence of contiguous failures. The figure also shows that the estimation error of G-GAP increases with the failure rate and that the increase is larger than the error caused by mass loss. We suspect that this additional increase is due to the error recovery mechanism of G-GAP, which results in increased dynamics of the s and w values.

Figure 10 further shows that the part of the error due to mass loss in G-GAP-- increases with the failure rate. At the same time, the estimation error increases with the failure rate, by an amount similar to that caused by mass loss. We expect this behavior, since the protocol is not robust and mass losses accumulate. What can look surprising at first sight is that the estimation error for G-GAP and G-GAP-- is almost the same. We suspect that this result is due to the short simulation time of 150 sec, which probably does not allow mass losses to accumulate sufficiently. For longer simulation times, we expect a significant difference between the accuracy of G-GAP and G-GAP-- in failure scenarios. We are currently investigating this issue further.

V. DISCUSSION AND FUTURE WORK

This paper makes two major contributions. First, we present a gossip protocol, G-GAP, which enables continuous monitoring of network-wide aggregates. The hard part is making the protocol robust against node failures, and we solved the problem for failures that are not contiguous (i.e., neighbors do not fail within a short time of each other). We formally prove the correctness of the protocol by showing that invariants for mass conservation hold. Applying gossip protocols to continuous monitoring is not possible without solving the problem of mass loss due to node failures, and this paper gives a significant result in this direction.

The simulation results suggest that we have achieved the design goals for G-GAP set out in III.A. First, we have shown that the tradeoff between estimation accuracy and protocol overhead can be controlled by varying the round rate. Second, with the traces we used, an estimation error of some 5% or less can be achieved for all network sizes and failure scenarios we simulated. We have observed that the estimation accuracy of the protocol, for a given overhead, does not seem to depend on the network size, which makes the protocol scalable. Finally, we have proven and validated that the protocol is robust to discontiguous failures.

The second contribution of this paper is a comparative assessment of G-GAP with GAP, a fairly standard tree-based aggregation protocol. The significance of this assessment is a comparison between gossip-based and tree-based monitoring. Our simulation results show that, within the parameter ranges of the simulation scenarios, the tree-based protocol consistently outperforms the gossip-based protocol. For comparable overhead, the tree-based protocol shows a smaller average estimation error and a smaller variance of the error than the gossip-based protocol, independent of network size and independent of the frequency of failures that occur in the network.

While more work is needed to evaluate the relative advantages and disadvantages of tree-based vs. gossip-based monitoring, this paper makes a significant contribution to the discussion towards a new paradigm for distributed real-time monitoring.

Our simulation results show that the dynamics of the local variables influences the estimation accuracy of G-GAP. Not surprisingly, local variables with high dynamics lead to a lower accuracy and vice versa. Our experience shows that the choice of the overlay topology significantly affects the performance of G-GAP, e.g., the estimation accuracy of the protocol. Generally speaking, a lower diameter and a higher connectivity of the overlay topology lead to a better performance. On the other hand, increasing the connectivity increases the load on the management nodes for a given round rate. Taking all this into account, we chose an overlay protocol that produces a uniform connectivity and, for our scenarios, we found that a connectivity of 10 is an appropriate choice for real-time monitoring purposes.

All simulation results given in this paper are for AVERAGE as the aggregation function. We expect the performance of G-GAP to be affected by the particular choice of the aggregation function. Specifically, in scenarios with contiguous failures, we expect the estimation error to be different, and we plan to investigate this issue further. For instance, in the case of SUM, we expect the estimation error to be larger, while we expect it to be smaller for MIN and MAX.

G-GAP, as presented in this paper, is robust against discontiguous node failures. Our simulations have shown that in the case of frequent contiguous failures, where 20% of the nodes are down, mass loss and hence estimation errors can accumulate. Therefore, in a real system, the protocol would have to be restarted in such cases. We see the possibility of further improving the robustness of G-GAP and plan more work in this direction.

Acknowledgements. This work has been supported by a grant from Cisco Systems, a personal grant from the Swedish Research Council, and the EC IST-EMANICS Network of Excellence (#26854).

VI. BIBLIOGRAPHY

[1] D. Ernst, A. Hamel, and T. Austin, “Cyclone: a broadcast-free dynamic instruction scheduler with selective replay,” In Proc. of the 30th Annual International Symposium on Computer Architecture (ISCA'03), San Diego, California, USA, June 7-14, 2003.
[2] M. Jelasity, A. Montresor, and O. Babaoglu, “Gossip-based aggregation in large dynamic networks,” ACM Transactions on Computer Systems, vol. 23, no. 3, pp. 219-252, August 2005.
[3] D. Kempe, A. Dobra, and J. Gehrke, “Gossip-Based Computation of Aggregate Information,” In Proc. of the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS'03), Cambridge, MA, USA, October 11-14, 2003.
[4] M. Dam and R. Stadler, “A generic protocol for network state aggregation,” In Proc. Radiovetenskap och Kommunikation (RVK'05), Linköping, Sweden, June 14-16, 2005.
[5] K. S. Lim and R. Stadler, “SIMPSON — a SIMple Pattern Simulator fOr Networks,” http://www.s3.kth.se/lcn/software/simpson.htm, August 2006.
[6] A. Ghodsi, S. El-Ansary, S. Krishnamurthy, and S. Haridi, “A Self-stabilizing Network Size Estimation Gossip Algorithm for Peer-to-Peer Systems,” SICS Technical Report T2005:16, 2005.
[7] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Randomized Gossip Algorithms,” IEEE/ACM Transactions on Networking, vol. 14, issue SI, pp. 2508-2530, June 2006.
[8] R. van Renesse, K. Birman, and W. Vogels, “Astrolabe: A Robust and Scalable Technology for Distributed System Monitoring,” ACM Transactions on Computer Systems, vol. 21, no. 2, pp. 164-206, May 2003.
[9] D. Kostoulas, D. Psaltoulis, I. Gupta, K. Birman, and A. Demers, “Decentralized Schemes for Size Estimation in Large and Dynamic Groups,” In Proc. of the 4th IEEE International Symposium on Network Computing and Applications (NCA'05), Cambridge, MA, USA, July 27-29, 2005.
[10] A. Demers, D. Greene, C. Hauser, W. Irish, and J. Larson, “Epidemic algorithms for replicated database maintenance,” In Proc. of the 6th Annual ACM Symposium on Principles of Distributed Computing, Vancouver, British Columbia, Canada, August 10-12, 1987.
[11] University of Twente - Traffic Measurement Data Repository, http://m2c-a.cs.utwente.nl/repository/, August 2006.
[12] A. Deligiannakis, Y. Kotidis, and N. Roussopoulos, “Hierarchical in-Network Data Aggregation with Quality Guarantees,” In Proc. of the 9th International Conference on Extending Database Technology (EDBT'04), Heraklion, Crete, Greece, March 14-18, 2004.
[13] S. Madden, M. Franklin, J. Hellerstein, and W. Hong, “TAG: a Tiny Aggregation Service for Ad-Hoc Sensor Networks,” In Proc. of the Fifth Symposium on Operating Systems Design and Implementation (USENIX OSDI '02), Boston, MA, USA, December 9-12, 2002.
[14] M. A. Sharaf, J. Beaver, A. Labrinidis, and P. K. Chrysanthis, “Balancing energy efficiency and quality of aggregate data in sensor networks,” The International Journal on Very Large Data Bases, vol. 13, no. 4, pp. 384-403, December 2004.
[15] K. S. Lim and R. Stadler, “Real-time Views of Network Traffic Using Decentralized Management,” In Proc. of the 9th IFIP/IEEE International Symposium on Integrated Network Management (IM 2005), Nice, France, May 16-19, 2005.
[16] F. Wuhib, A. Clemm, M. Dam, and R. Stadler, “Decentralized Computation of Threshold Crossing Alerts,” In Proc. of the 16th IFIP/IEEE Distributed Systems: Operations and Management (DSOM'05), Barcelona, Spain, October 24-26, 2005.
[17] C. Tang and C. Ward, “GoCast: Gossip-Enhanced Overlay Multicast for Fast and Dependable Group Communication,” In Proc. of the International Conference on Dependable Systems and Networks (DSN'05), Yokohama, Japan, June 28 - July 1, 2005.
[18] A. G. Prieto and R. Stadler, “Adaptive Distributed Monitoring with Accuracy Objectives,” to appear in Proc. ACM SIGCOMM Workshop on Internet Network Management (INM'06), Pisa, Italy, September 11, 2006.


APPENDIX: THE GENERIC AGGREGATION PROTOCOL (GAP)

GAP (Generic Aggregation Protocol) [4] is an asynchronous distributed protocol that builds and maintains a BFS (Breadth First Search) spanning tree. The tree is maintained in a similar way as the algorithm that underlies the 802.1d Spanning Tree Protocol (STP).


In GAP, each node maintains information about its children in the BFS tree, in order to compute the partial aggregate, i.e., the aggregate value of the weights from all nodes of the subtree where this node is the root (see Figure 11). GAP is event-driven in the sense that messages are only exchanged as results of events, such as the detection of a new neighbor, the failure of a neighbor, an aggregate update, a change in local weight or a parent change. Since an event-driven protocol can cause a high load on the root node and on nodes close to the root, GAP uses a simple rate limitation scheme, which imposes an upper bound on message rates on each link.


FIGURE 11: GAP MAINTAINS A SPANNING TREE ON THE MANAGEMENT OVERLAY THAT ENABLES INCREMENTAL AGGREGATION OF LOCAL VARIABLES ON A CONTINUOUS BASIS. THE AGGREGATION FUNCTION USED IS SUM.
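To illustrate the partial-aggregate idea (but not the GAP protocol logic itself, which also handles neighbor detection, parent changes and rate limitation), the following minimal sketch computes subtree SUMs bottom-up over a given spanning tree. The tree and the weights are hypothetical examples in the spirit of Figure 11.

def partial_aggregates(children, local_weight, root):
    # partial aggregate of a node = SUM of the local weights in its subtree
    partial = {}
    def visit(node):
        total = local_weight[node]
        for child in children.get(node, []):
            total += visit(child)
        partial[node] = total
        return total
    visit(root)
    return partial

children = {"root": ["a", "b"], "a": ["c", "d"], "b": []}
local_weight = {"root": 10, "a": 10, "b": 20, "c": 10, "d": 10}
print(partial_aggregates(children, local_weight, "root"))
# {'c': 10, 'd': 10, 'a': 30, 'b': 20, 'root': 60} -- the root holds the network-wide SUM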


APPENDIX: PROOF OF PROPOSITION 2

Proposition 2 (Mass Conservation, SG-GAP) Let L_r be the set of nodes live during round r. At the end of each round r ≥ 0,
1. ∑_{i∈L_r} s_{r,i} + ∑_{i∈L_r, j∉L_r} s2s_{r,i,j} + ∑_{i∈L_{r−1}−L_r, j∈L_r} rs_{r,j,i} = ∑_{i∈L_r} x_{r,i}
2. ∑_{i∈L_r} w_{r,i} + ∑_{i∈L_r, j∉L_r} s2w_{r,i,j} + ∑_{i∈L_{r−1}−L_r, j∈L_r} rw_{r,j,i} = |L_r|

Proof (by induction): We prove only 1; the proof of 2 is almost identical. The equation for r = 0 is straightforward, so we go directly to the inductive step. Let

M_r = ∑_{i∈L_r} s_{r,i} + ∑_{i∈L_r, j∉L_r} s2s_{r,i,j} + ∑_{i∈L_{r−1}−L_r, j∈L_r} rs_{r,j,i}

and calculate M_{r+1} by rewriting each of its three terms according to the protocol of Figure 3.

First, terms involving dead nodes vanish: if i ∉ L_{r+1} or i ∉ L_r, then s2s_{r+1,i,j} = 0, and if j ∉ L_{r+1} and j ∉ L_r, then s2s_{r+1,i,j} = 0 as well. Since a node i that joins in round r+1 starts with s_{r+1,i} = x_i, and since the contributions in the other cases are 0,

M_{r+1} = ∑_{i∈L_{r+1}∩L_r} s_{r+1,i} + ∑_{i∈L_{r+1}−L_r} x_i + ∑_{i∈L_{r+1}∩L_r, j∈L_r−L_{r+1}} s2s_{r+1,i,j} + ∑_{i∈L_r−L_{r+1}, j∈L_{r+1}∩L_r} rs_{r+1,j,i}

Next, by the protocol, for i ∈ L_{r+1} ∩ L_r the value s_{r+1,i} is the sum of the shares α_{r,l,i} s_{r,l} received from the nodes l ∈ L_r, plus the restored values s1s_{r,i,k} = α_{r,i,k} s_{r,i}, s2s_{r,i,k} and rs_{r,i,k} for the nodes k ∈ L_{r−1} − L_r whose failure is detected in round r+1. Further, s2s_{r+1,i,j} = s1s_{r,i,j} = α_{r,i,j} s_{r,i}, and rs_{r+1,j,i} = β_{r,i,j}(α_{r,i,i} s_{r,i} − x_i) for i ∈ L_r − L_{r+1} and j ∈ L_{r+1} ∩ L_r.

By the non-contiguous failures assumption, whenever L_r − L_{r+1} is inhabited, if j ∉ L_r then j ∉ L_{r−1} as well, so no restored mass is counted a second time among the shares or recovery shares of the same round. Summing the rewritten terms, and using ∑_j α_{r,i,j} = 1 for every i ∈ L_r and ∑_j β_{r,i,j} = 1 for every i ∈ L_r − L_{r+1}, the shares α_{r,i,j} s_{r,i} of each node i ∈ L_r recombine to s_{r,i}, and the recovery shares of each node i ∈ L_r − L_{r+1} recombine to α_{r,i,i} s_{r,i} − x_i. This yields

M_{r+1} = ∑_{i∈L_r} s_{r,i} + ∑_{i∈L_r, j∉L_r} s2s_{r,i,j} + ∑_{i∈L_{r−1}−L_r, j∈L_r} rs_{r,j,i} + ∑_{i∈L_{r+1}−L_r} x_i − ∑_{i∈L_r−L_{r+1}} x_i
= M_r + ∑_{i∈L_{r+1}−L_r} x_i − ∑_{i∈L_r−L_{r+1}} x_i
= ∑_{i∈L_r} x_i + ∑_{i∈L_{r+1}−L_r} x_i − ∑_{i∈L_r−L_{r+1}} x_i     (by the induction hypothesis)
= ∑_{i∈L_{r+1}} x_i   □

APPENDIX: PROOF OF PROPOSITION 3 Proposition 3: Mass conservation Let L be the set of all nodes and Ltn the set of live nodes at time tn . Then, at all times tn > 0 :



1.

i∈Ltn

xi = ∑ j∈L ,i∈L s pending ,tn , j ,i + ∑ i∈L tn

tn

Ltn = ∑ j∈L ,i∈L wpending ,tn , j ,i + ∑ i∈L

2.

tn

tn

, j∉Ltn

, j∉Ltn

rstn ,i , j + ∑ i∈L

tn

rwtn ,i , j + ∑ i∈L

tn

, j∉Ltn

, j∉Ltn

(rs pending ,tn , j ,i − acks pending ,tn , j ,i )

(rwpending ,tn , j ,i − ackwpending ,tn , j ,i )

We prove only part 1; the proof of part 2 is almost identical.

Lemma 3.1 (Recovery information for all live nodes): Let L be the set of all nodes. If no failure occurs until time t_n, then for all nodes a \in L:

\sum_{j \in L} s_{pending,t_n,j,a} - x_a = \sum_{j \in L \setminus a} rs_{t_n,j,a} + \sum_{j \in L \setminus a} (rs_{pending,t_n,a,j} - acks_{pending,t_n,a,j})

Proof (by induction): At time t_0, the right-hand side

\sum_{j \in L \setminus a} rs_{t_0,j,a} + \sum_{j \in L \setminus a} (rs_{pending,t_0,a,j} - acks_{pending,t_0,a,j}) = 0

since all terms are equal to 0. On the left-hand side,

\sum_{j \in L} s_{pending,t_0,j,a} - x_a = s_{pending,t_0,a,a} - x_a = x_a - x_a = 0

Assume that the lemma is true at some time t_n, i.e.,

\sum_{j \in L} s_{pending,t_n,j,a} - x_a = \sum_{j \in L \setminus a} rs_{t_n,j,a} + \sum_{j \in L \setminus a} (rs_{pending,t_n,a,j} - acks_{pending,t_n,a,j})

Let

Q_{t_n} = \sum_{j \in L} s_{pending,t_n,j,a} - x_a    and
R_{t_n} = \sum_{j \in L \setminus a} rs_{t_n,j,a} + \sum_{j \in L \setminus a} (rs_{pending,t_n,a,j} - acks_{pending,t_n,a,j})

We prove that Q_{t_{n+1}} - Q_{t_n} = R_{t_{n+1}} - R_{t_n}. There are two cases to consider: the case where node a \in L_{t_{n+1}} performs a transition at time t_{n+1}, and the case where some node b \in L_{t_{n+1}}, b \ne a, performs a transition at t_{n+1}.

Case 1: Let node a \in L_{t_{n+1}} perform a transition at time t_{n+1}. Then:

Q_{t_n} = \sum_{j \in L} s_{pending,t_n,j,a} - x_a = \sum_{j \in L} s_{transit,t_{n+1},j,a} + \sum_{j \in L} s_{read,t_{n+1},j,a} - x_a

It is the case that:

Q_{t_{n+1}} = \sum_{j \in L} s_{pending,t_{n+1},j,a} - x_a    (Axiom 1)
            = \sum_{j \in L} s_{transit,t_{n+1},j,a} + s_{write,t_{n+1},a,a} - x_a

Therefore:

Q_{t_{n+1}} - Q_{t_n} = s_{write,t_{n+1},a,a} - \sum_{j \in L} s_{read,t_{n+1},j,a}

In the same manner:

R_{t_{n+1}} = \sum_{j \in L \setminus a} rs_{t_{n+1},j,a} + \sum_{j \in L \setminus a} (rs_{pending,t_{n+1},a,j} - acks_{pending,t_{n+1},a,j})
            = \sum_{j \in L \setminus a} rs_{t_n,j,a} + \sum_{j \in L \setminus a} (rs_{pending,t_n,a,j} - acks_{pending,t_n,a,j}) + \sum_{j \in L \setminus a} (rs_{write,t_{n+1},a,j} - acks_{write,t_{n+1},a,j})

(since the effect of executing a transition on a is the creation of further messages). Therefore:

R_{t_{n+1}} - R_{t_n} = \sum_{j \in L \setminus a} (rs_{write,t_{n+1},a,j} - acks_{write,t_{n+1},a,j})
                      = \sum_{j \in L \setminus a} rs_{write,t_{n+1},a,j} - \sum_{j \in L \setminus a} acks_{write,t_{n+1},a,j}
                      = \sum_{j \in L \setminus a} \beta_{a,j} (s_{write,t_{n+1},a,a} - x_a) - \sum_{j \in L \setminus a} s_{read,t_{n+1},j,a} - \sum_{j \in L \setminus a} srs_{t_x,a,j}

(since this is how the protocol computes rs and acks; t_x refers to the last transition on a before t_{n+1})

                      = s_{write,t_{n+1},a,a} - x_a - \sum_{j \in L \setminus a} s_{read,t_{n+1},j,a} - (s_{write,t_x,a,a} - x_a)

(by the definitions of \beta, srs and \alpha_{i,i} in the protocol)

                      = s_{write,t_{n+1},a,a} - \sum_{j \in L} s_{read,t_{n+1},j,a}

(since s_{write,t_x,a,a} is also consumed (read) at t_{n+1}), which equals Q_{t_{n+1}} - Q_{t_n}.

Case 2: Let a node b \in L_{t_{n+1}}, b \ne a, perform a transition at time t_{n+1}. Assume that Q_{t_n} and R_{t_n} are defined as above. Recall that

Q_{t_n} = \sum_{j \in L} s_{pending,t_n,j,a} - x_a

Then

Q_{t_{n+1}} = \sum_{j \in L} s_{pending,t_n,j,a} - x_a + s_{write,t_{n+1},b,a}

(since the transition on b adds a message), which gives

Q_{t_{n+1}} - Q_{t_n} = s_{write,t_{n+1},b,a}

Also, recall that

R_{t_n} = \sum_{j \in L \setminus a} rs_{t_n,j,a} + \sum_{j \in L \setminus a} (rs_{pending,t_n,a,j} - acks_{pending,t_n,a,j})

and therefore

R_{t_{n+1}} = \sum_{j \in L \setminus a} rs_{t_{n+1},j,a} + \sum_{j \in L \setminus a} (rs_{pending,t_{n+1},a,j} - acks_{pending,t_{n+1},a,j})
            = \sum_{j \in L \setminus a} rs_{t_{n+1},j,a} + \sum_{j \in L \setminus a} (rs_{pending,t_n,a,j} - acks_{pending,t_n,a,j}) - (rs_{read,t_{n+1},a,b} - acks_{read,t_{n+1},a,b})

(since a transition on b consumes these messages). Then,

R_{t_{n+1}} - R_{t_n} = \sum_{j \in L \setminus a} rs_{t_{n+1},j,a} - \sum_{j \in L \setminus a} rs_{t_n,j,a} - (rs_{read,t_{n+1},a,b} - acks_{read,t_{n+1},a,b})
                      = rs_{t_{n+1},b,a} - rs_{t_n,b,a} - (rs_{read,t_{n+1},a,b} - acks_{read,t_{n+1},a,b})

(since the rs values on nodes other than b are not affected by the transition)

                      = (rs_{t_n,b,a} + rs_{read,t_{n+1},a,b} - acks_{read,t_{n+1},a,b} + s_{write,t_{n+1},b,a}) - rs_{t_n,b,a} - (rs_{read,t_{n+1},a,b} - acks_{read,t_{n+1},a,b})

(due to the way the protocol computes rs)

                      = s_{write,t_{n+1},b,a} = Q_{t_{n+1}} - Q_{t_n}

QED.


Lemma 3.2 (Proposition 3 for the case of all live nodes): Let L be the set of all nodes. If no failure occurs, then, at all times t_n > 0:

\sum_{i \in L_{t_n}} x_i = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i}

Note that Proposition 3 reduces to Lemma 3.2 for a system in which no failures occur.

Proof (by induction): At time t_0 the claim holds, since

\sum_{j,i \in L} s_{pending,t_0,j,i} = \sum_{i \in L} s_{pending,t_0,i,i} = \sum_{i \in L} x_i

(see steps 1 and 5 of round 0 of the protocol). For the induction step, assuming that the claim is true at time t_n, we show that it is true at time t_{n+1}. Assume that a cycle is executed on node a at time t_{n+1}. Then

\sum_{j,i \in L} s_{pending,t_{n+1},j,i}
  = \sum_{j,i \in L} s_{pending,t_n,j,i} - \sum_{j \in L} s_{read,t_{n+1},j,a} + \sum_{j \in L} s_{write,t_{n+1},a,j}
  = \sum_{j,i \in L} s_{pending,t_n,j,i} - \sum_{j \in L} s_{read,t_{n+1},j,a} + \sum_{j \in L} \alpha_{a,j} s_{t_{n+1},a}
  = \sum_{j,i \in L} s_{pending,t_n,j,i} - \sum_{j \in L} s_{read,t_{n+1},j,a} + s_{t_{n+1},a} \sum_{j \in L} \alpha_{a,j}
  = \sum_{j,i \in L} s_{pending,t_n,j,i} - \sum_{j \in L} s_{read,t_{n+1},j,a} + s_{t_{n+1},a}    (by the definition of \alpha)
  = \sum_{j,i \in L} s_{pending,t_n,j,i}    (by the protocol, since s_{t_{n+1},a} = \sum_{j \in L} s_{read,t_{n+1},j,a})
  = \sum_{i \in L} x_i    (by the induction hypothesis)

QED.
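The two facts used in this chain, namely that the weights \alpha_{a,j} sum to 1 and that the new local sum s_{t_{n+1},a} equals the mass just read, can be seen in a short sketch of a failure-free round. The dictionary-based message pool and the uniform weights below are illustrative assumptions, not the protocol's actual data structures.

# Illustrative failure-free round (not the G-GAP implementation).
def execute_round(a, pending, alpha):
    """Node a reads all pending messages addressed to it, sets its local sum to the
    mass read, and redistributes that sum with weights alpha[a], which sum to 1.
    The total pending mass is therefore unchanged, as used in Lemma 3.2."""
    s_a = sum(v for (src, dst), v in list(pending.items()) if dst == a)
    for key in [k for k in pending if k[1] == a]:
        del pending[key]                                       # consume (read) the messages
    for j, w in alpha[a].items():
        pending[(a, j)] = pending.get((a, j), 0.0) + w * s_a   # write the new messages
    return s_a

# Example with three nodes and uniform weights (an assumption made for illustration).
pending = {("a", "a"): 1.0, ("b", "b"): 2.0, ("c", "c"): 3.0}
alpha = {n: {"a": 1/3, "b": 1/3, "c": 1/3} for n in ("a", "b", "c")}
mass_before = sum(pending.values())
execute_round("a", pending, alpha)
assert abs(sum(pending.values()) - mass_before) < 1e-9         # mass is conserved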

Lemma 3.3 (Proposition 3 for the case of a single failure): Let L be the set of all nodes. Suppose that node z fails at time t_{fail} and that no other node fails. Then, for all times t_n \geq t_{fail}:

\sum_{i \in L_{t_n}} x_i = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i} + \sum_{i \in L_{t_n}} rs_{t_n,i,z} + \sum_{j \in L_{t_n}} (rs_{pending,t_n,z,j} - acks_{pending,t_n,z,j})

Note that Proposition 3 reduces to Lemma 3.3 for a system in which a single failure occurs.

Proof (by induction): We first show that the claim holds at time t_{fail}. For all times t_n < t_{fail}, it was shown in Lemma 3.2 that

\sum_{i \in L_{t_n}} x_i = \sum_{j,i \in L_{t_n}} s_{pending,t_n,j,i}

Therefore, at time t_{fail-1},

(1)  \sum_{i \in L_{t_{fail-1}}} x_i = \sum_{j,i \in L_{t_{fail-1}}} s_{pending,t_{fail-1},j,i}

At time t_{fail},

(2)  \sum_{i \in L_{t_{fail}}} x_i = \sum_{i \in L_{t_{fail-1}}} x_i - x_z    and
     \sum_{j \in L, i \in L_{t_{fail}}} s_{pending,t_{fail},j,i} = \sum_{j,i \in L_{t_{fail-1}}} s_{pending,t_{fail-1},j,i} - \sum_{i \in L_{t_{fail-1}}} s_{pending,t_{fail-1},i,z}

We can rewrite the latter as

(3)  \sum_{j,i \in L_{t_{fail-1}}} s_{pending,t_{fail-1},j,i} = \sum_{j \in L, i \in L_{t_{fail}}} s_{pending,t_{fail},j,i} + \sum_{i \in L_{t_{fail-1}}} s_{pending,t_{fail-1},i,z}

Replacing (3) in (1) and using the result in (2) gives

\sum_{i \in L_{t_{fail}}} x_i = \sum_{j \in L, i \in L_{t_{fail}}} s_{pending,t_{fail},j,i} + \sum_{i \in L_{t_{fail-1}}} s_{pending,t_{fail-1},i,z} - x_z

Lemma 3.1, applied to node z at time t_{fail-1}, gives an expression for the last two terms of the above equation:

\sum_{j \in L_{t_{fail-1}}} s_{pending,t_{fail-1},j,z} - x_z = \sum_{j \in L_{t_{fail-1}} \setminus z} rs_{t_{fail-1},j,z} + \sum_{j \in L_{t_{fail-1}} \setminus z} (rs_{pending,t_{fail-1},z,j} - acks_{pending,t_{fail-1},z,j})

Therefore

\sum_{i \in L_{t_{fail}}} x_i
  = \sum_{j \in L, i \in L_{t_{fail}}} s_{pending,t_{fail},j,i} + \sum_{j \in L_{t_{fail-1}} \setminus z} rs_{t_{fail-1},j,z} + \sum_{j \in L_{t_{fail-1}} \setminus z} (rs_{pending,t_{fail-1},z,j} - acks_{pending,t_{fail-1},z,j})
  = \sum_{j \in L, i \in L_{t_{fail}}} s_{pending,t_{fail},j,i} + \sum_{j \in L_{t_{fail}}} rs_{t_{fail},j,z} + \sum_{j \in L_{t_{fail}}} (rs_{pending,t_{fail},z,j} - acks_{pending,t_{fail},z,j})

(since these terms are not affected by the failure of z). This establishes the basis step. For the induction step, we assume that Lemma 3.3 is true at some time t_n \geq t_{fail} and show that it is also true at time t_{n+1}.

Let

M_{t_n} = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i} + \sum_{i \in L_{t_n}, j \notin L_{t_n}} rs_{t_n,i,j} + \sum_{i \in L_{t_n}, j \notin L_{t_n}} (rs_{pending,t_n,j,i} - acks_{pending,t_n,j,i})

If we assume that the round is executed on some node a at time t_{n+1}, then we have

M_{t_{n+1}} = \sum_{j \in L, i \in L_{t_{n+1}}} s_{pending,t_{n+1},j,i} + \sum_{i \in L_{t_{n+1}}, j \notin L_{t_{n+1}}} rs_{t_{n+1},i,j} + \sum_{i \in L_{t_{n+1}}, j \notin L_{t_{n+1}}} (rs_{pending,t_{n+1},j,i} - acks_{pending,t_{n+1},j,i})

  = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i} - \sum_{j \in L} s_{read,t_{n+1},j,a} + \sum_{j \in L_{t_{fail-1}}} \alpha_{a,j} s_{t_{n+1},a}
    + \sum_{j \in L_{t_n} \setminus a} rs_{t_n,j,z} + rs_{t_{n+1},a,z}
    + \sum_{j \in L_{t_n}} (rs_{pending,t_n,z,j} - acks_{pending,t_n,z,j}) - (rs_{read,t_{n+1},z,a} - acks_{read,t_{n+1},z,a})

(since only z has failed, and only a performs a transition at t_{n+1}, by the protocol)

  = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i} - \sum_{j \in L} s_{read,t_{n+1},j,a} + \sum_{j \in L} s_{read,t_{n+1},j,a}
    + \sum_{j \in L_{t_n} \setminus a} rs_{t_n,j,z} + rs_{t_n,a,z} + (rs_{read,t_{n+1},z,a} - acks_{read,t_{n+1},z,a})
    + \sum_{j \in L_{t_n}} (rs_{pending,t_n,z,j} - acks_{pending,t_n,z,j}) - (rs_{read,t_{n+1},z,a} - acks_{read,t_{n+1},z,a})

(since, if a has learnt about the failure of z, then

\sum_{j \in L_{t_{fail-1}}} \alpha_{a,j} s_{t_{n+1},a} = \sum_{j \in L} s_{read,t_{n+1},j,a} + rs_{t_n,a,z} + (rs_{read,t_{n+1},z,a} - acks_{read,t_{n+1},z,a})    and    rs_{t_{n+1},a,z} = 0,

or, if a has not learnt about the failure of z, then

\sum_{j \in L_{t_{fail-1}}} \alpha_{a,j} s_{t_{n+1},a} = \sum_{j \in L} s_{read,t_{n+1},j,a}    and    rs_{t_{n+1},a,z} = rs_{t_n,a,z} + (rs_{read,t_{n+1},z,a} - acks_{read,t_{n+1},z,a}))

  = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i} + \sum_{j \in L_{t_n}} rs_{t_n,j,z} + \sum_{j \in L_{t_n}} (rs_{pending,t_n,z,j} - acks_{pending,t_n,z,j})
  = \sum_{i \in L_{t_n}} x_i = \sum_{i \in L_{t_{n+1}}} x_i    (by the induction hypothesis)

QED.
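The case split above (whether or not a has learnt of z's failure when it executes its round) can be sketched as follows. The function and its dictionaries are illustrative assumptions, not the protocol's data structures; the point is that in both branches the recovery mass is neither lost nor duplicated.

# Illustrative round at node a with a possibly failed node z (not the G-GAP implementation).
def execute_round_with_recovery(a, z, a_knows_z_failed, pending, rs, rs_inbox, alpha):
    """pending[(src, dst)]: unread partial sums; rs[(a, z)]: a's recovery share for z;
    rs_inbox[(z, a)]: net pending recovery mass (rs_read - acks_read) addressed to a
    on behalf of z; alpha[a]: split weights summing to 1."""
    s_a = sum(v for (src, dst), v in list(pending.items()) if dst == a)
    for k in [k for k in pending if k[1] == a]:
        del pending[k]                                   # read the pending messages
    net_recovery = rs_inbox.pop((z, a), 0.0)             # consume recovery messages from z
    if a_knows_z_failed:
        # First branch: fold the recovery share into the mass redistributed this
        # round and clear it (rs_{t_{n+1},a,z} = 0).
        s_a += rs.pop((a, z), 0.0) + net_recovery
    else:
        # Second branch: keep accumulating the share for later recovery.
        rs[(a, z)] = rs.get((a, z), 0.0) + net_recovery
    for j, w in alpha[a].items():
        pending[(a, j)] = pending.get((a, j), 0.0) + w * s_a
    return s_a

In both branches the right-hand side of Lemma 3.3 is unchanged: the same amount either appears in the newly written pending messages or remains counted in rs and the pending recovery terms.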

Lemma 3.4 (Recovery information): Let L be the set of all nodes. For any node a \in L and any time t_n:

\sum_{j \in L_{t_n}} s_{pending,t_n,j,a} - x_a = \sum_{j \in L_{t_n} \setminus a} rs_{t_n,j,a} + \sum_{j \in L_{t_n} \setminus a} (rs_{pending,t_n,a,j} - acks_{pending,t_n,a,j})

Note that Lemma 3.1 is the special case of Lemma 3.4 in which no failure occurs.

Proof: First, we show by induction that, for any two nodes a and b and any time t_n at which both nodes are alive:

\beta_{t_n,a,b} (\alpha_{t_n,a,a} s_{t_n,a} - x_a) + s_{pending,t_n,b,a} = rs_{t_n,b,a} + (rs_{pending,t_n,a,b} - acks_{pending,t_n,a,b})

(Recall that pending and other message quantities are defined as

s_{pending,t,i,j} = \sum \{ s \mid \exists w, srs, \ldots : (s, w, srs, \ldots) \in M_{pending,t,i,j} \}.)

For t = t_0 the claim is immediate, since both sides are 0 due to the initialization in the protocol. Let

Q_{t_n} = \beta_{t_n,a,b} (\alpha_{t_n,a,a} s_{t_n,a} - x_a) + s_{pending,t_n,b,a}
        = \beta_{t_n,a,b} (\alpha_{t_n,a,a} s_{t_n,a} - x_a) + s_{transit,t_{n+1},b,a} + s_{read,t_{n+1},b,a}

R_{t_n} = rs_{t_n,b,a} + (rs_{pending,t_n,a,b} - acks_{pending,t_n,a,b})
        = rs_{t_n,b,a} + (rs_{transit,t_{n+1},a,b} - acks_{transit,t_{n+1},a,b}) + (rs_{read,t_{n+1},a,b} - acks_{read,t_{n+1},a,b})

We prove that Q_{t_{n+1}} - Q_{t_n} = R_{t_{n+1}} - R_{t_n}. There are three cases to consider: a transition is performed at t_{n+1} on node a, on node b, or on some node c with c \ne a and c \ne b.

Case 1 (node c performs a transition at t_{n+1}): This case is trivial, because none of the terms is affected by a transition on c.

Case 2 (node a performs a transition at t_{n+1}):

Q_{t_{n+1}} = \beta_{t_{n+1},a,b} (\alpha_{t_{n+1},a,a} s_{t_{n+1},a} - x_a) + s_{pending,t_{n+1},b,a}
            = \beta_{t_{n+1},a,b} (\alpha_{t_{n+1},a,a} s_{t_{n+1},a} - x_a) + s_{pending,t_n,b,a} - s_{read,t_{n+1},b,a}

(a transition on a may consume pending messages from b to a). Therefore,

Q_{t_{n+1}} - Q_{t_n} = \beta_{t_{n+1},a,b} (\alpha_{t_{n+1},a,a} s_{t_{n+1},a} - x_a) - \beta_{t_n,a,b} (\alpha_{t_n,a,a} s_{t_n,a} - x_a) - s_{read,t_{n+1},b,a}

Also,

R_{t_{n+1}} = rs_{t_{n+1},b,a} + (rs_{pending,t_{n+1},a,b} - acks_{pending,t_{n+1},a,b})
            = rs_{t_n,b,a} + (rs_{transit,t_n,a,b} - acks_{transit,t_n,a,b}) + srs_{t_{n+1},a,b} - acks_{t_{n+1},a,b}

Therefore,

R_{t_{n+1}} - R_{t_n} = rs_{t_n,b,a} + (rs_{transit,t_n,a,b} - acks_{transit,t_n,a,b}) + srs_{t_{n+1},a,b} - acks_{t_{n+1},a,b} - (rs_{t_n,b,a} + (rs_{pending,t_n,a,b} - acks_{pending,t_n,a,b}))
                      = srs_{t_{n+1},a,b} - acks_{t_{n+1},a,b}
                      = \beta_{t_{n+1},a,b} (\alpha_{t_{n+1},a,a} s_{t_{n+1},a} - x_a) - (srs_{t_n,a,b} + s_{read,t_{n+1},b,a})
                      = \beta_{t_{n+1},a,b} (\alpha_{t_{n+1},a,a} s_{t_{n+1},a} - x_a) - \beta_{t_n,a,b} (\alpha_{t_n,a,a} s_{t_n,a} - x_a) - s_{read,t_{n+1},b,a}
                      = Q_{t_{n+1}} - Q_{t_n}

Case 3 (node b performs a transition at t_{n+1}):

Q_{t_{n+1}} = \beta_{t_{n+1},a,b} (\alpha_{t_{n+1},a,a} s_{t_{n+1},a} - x_a) + s_{pending,t_{n+1},b,a}
            = \beta_{t_n,a,b} (\alpha_{t_n,a,a} s_{t_n,a} - x_a) + s_{pending,t_n,b,a} + \alpha_{t_{n+1},b,a} s_{t_{n+1},b}

(a transition on b generates a new message from b to a). Therefore,

Q_{t_{n+1}} - Q_{t_n} = \beta_{t_n,a,b} (\alpha_{t_n,a,a} s_{t_n,a} - x_a) + s_{pending,t_n,b,a} + \alpha_{t_{n+1},b,a} s_{t_{n+1},b} - (\beta_{t_n,a,b} (\alpha_{t_n,a,a} s_{t_n,a} - x_a) + s_{pending,t_n,b,a})
                      = \alpha_{t_{n+1},b,a} s_{t_{n+1},b}

Also,

R_{t_{n+1}} = rs_{t_{n+1},b,a} + (rs_{pending,t_{n+1},a,b} - acks_{pending,t_{n+1},a,b})
            = rs_{t_{n+1},b,a} + (rs_{pending,t_n,a,b} - acks_{pending,t_n,a,b}) - (rs_{read,t_{n+1},a,b} - acks_{read,t_{n+1},a,b})

Therefore,

R_{t_{n+1}} - R_{t_n} = rs_{t_{n+1},b,a} + (rs_{pending,t_n,a,b} - acks_{pending,t_n,a,b}) - (rs_{read,t_{n+1},a,b} - acks_{read,t_{n+1},a,b}) - (rs_{t_n,b,a} + (rs_{pending,t_n,a,b} - acks_{pending,t_n,a,b}))
                      = rs_{t_{n+1},b,a} - rs_{t_n,b,a} - (rs_{read,t_{n+1},a,b} - acks_{read,t_{n+1},a,b})

From the protocol,

rs_{t_{n+1},b,a} = rs_{t_n,b,a} + (rs_{read,t_{n+1},a,b} - acks_{read,t_{n+1},a,b}) + \alpha_{t_{n+1},b,a} s_{t_{n+1},b}

Therefore,

R_{t_{n+1}} - R_{t_n} = \alpha_{t_{n+1},b,a} s_{t_{n+1},b} = Q_{t_{n+1}} - Q_{t_n}

Hence we have shown that

\beta_{t_n,a,b} (\alpha_{t_n,a,a} s_{t_n,a} - x_a) + s_{pending,t_n,b,a} = rs_{t_n,b,a} + (rs_{pending,t_n,a,b} - acks_{pending,t_n,a,b})

Now, summing both sides of the above equation over all live nodes except a, and writing j for b, we obtain

\sum_{j \in L_{t_n} \setminus a} (s_{pending,t_n,j,a} + \beta_{t_n,a,j} (\alpha_{t_n,a,a} s_{t_n,a} - x_a)) = \sum_{j \in L_{t_n} \setminus a} (rs_{t_n,j,a} + rs_{pending,t_n,a,j} - acks_{pending,t_n,a,j})

\sum_{j \in L_{t_n} \setminus a} s_{pending,t_n,j,a} + \alpha_{t_n,a,a} s_{t_n,a} - x_a = \sum_{j \in L_{t_n} \setminus a} rs_{t_n,j,a} + \sum_{j \in L_{t_n} \setminus a} (rs_{pending,t_n,a,j} - acks_{pending,t_n,a,j})

(since \sum_{j \in L_{t_n} \setminus a} \beta_{t_n,a,j} = 1)

\sum_{j \in L_{t_n}} s_{pending,t_n,j,a} - x_a = \sum_{j \in L_{t_n} \setminus a} rs_{t_n,j,a} + \sum_{j \in L_{t_n} \setminus a} (rs_{pending,t_n,a,j} - acks_{pending,t_n,a,j})

(since s_{pending,t_n,a,a} = \alpha_{t_n,a,a} s_{t_n,a}).

QED.

Lemma 3.5 (Induction step of Proposition 3): Let L be the set of all nodes and L_{t_n} the set of live nodes at time t_n. Assume that node z \in L fails at time t_{fail} > 0 and that no other failure occurs after t_{fail}. Assume further that the failure at time t_{fail} is discontiguous (see section III.D). Then, if

(*)  \sum_{i \in L_{t_n}} x_i = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i} + \sum_{i \in L_{t_n}, j \notin L_{t_n}} rs_{t_n,i,j} + \sum_{i \in L_{t_n}, j \notin L_{t_n}} (rs_{pending,t_n,j,i} - acks_{pending,t_n,j,i})

holds at time t_n = t_{fail-1}, then (*) also holds for all times t_n \geq t_{fail}.

Note that Lemma 3.2 provides the basis and Lemma 3.5 the induction step (over the number of failures) for Proposition 3; together they prove Proposition 3. Note also that Lemma 3.3 is the special case of Lemma 3.5 in which no other failure occurs before t_{fail}.

Proof: Recall that Proposition 3 states that

\sum_{i \in L_{t_n}} x_i = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i} + \sum_{i \in L_{t_n}, j \notin L_{t_n}} rs_{t_n,i,j} + \sum_{i \in L_{t_n}, j \notin L_{t_n}} (rs_{pending,t_n,j,i} - acks_{pending,t_n,j,i})

At time t_{fail-1}, as a consequence of the discontiguous failures assumption and the correlated message and failure signaling assumption (see section III.D), the proposition reduces to

\sum_{i \in L_{t_{fail-1}}} x_i = \sum_{j,i \in L_{t_{fail-1}}} s_{pending,t_{fail-1},j,i}

We first show that Proposition 3 holds at time t_{fail}, in the form shown below. Then, using the induction principle, we show that the proposition also holds for all times t_n > t_{fail}.

\sum_{i \in L_{t_{fail}}} x_i = \sum_{j \in L_{t_{fail-1}}, i \in L_{t_{fail}}} s_{pending,t_{fail},j,i} + \sum_{i \in L_{t_{fail}}} rs_{t_{fail},i,z} + \sum_{i \in L_{t_{fail}}} (rs_{pending,t_{fail},z,i} - acks_{pending,t_{fail},z,i})

By definition,

\sum_{i \in L_{t_{fail}}} x_i = \sum_{i \in L_{t_{fail-1}}} x_i - x_z

and

\sum_{j \in L_{t_{fail-1}}, i \in L_{t_{fail}}} s_{pending,t_{fail},j,i} = \sum_{j,i \in L_{t_{fail-1}}} s_{pending,t_{fail-1},j,i} - \sum_{i \in L_{t_{fail-1}}} s_{pending,t_{fail-1},i,z}

Therefore

\sum_{i \in L_{t_{fail}}} x_i = \sum_{j \in L_{t_{fail-1}}, i \in L_{t_{fail}}} s_{pending,t_{fail},j,i} + \sum_{i \in L_{t_{fail-1}}} s_{pending,t_{fail-1},i,z} - x_z

Lemma 3.4, applied to node z at time t_{fail-1}, gives

\sum_{j \in L_{t_{fail-1}}} s_{pending,t_{fail-1},j,z} - x_z = \sum_{j \in L_{t_{fail-1}} \setminus z} rs_{t_{fail-1},j,z} + \sum_{j \in L_{t_{fail-1}} \setminus z} (rs_{pending,t_{fail-1},z,j} - acks_{pending,t_{fail-1},z,j})

Therefore

\sum_{i \in L_{t_{fail}}} x_i
  = \sum_{j \in L_{t_{fail-1}}, i \in L_{t_{fail}}} s_{pending,t_{fail},j,i} + \sum_{j \in L_{t_{fail-1}} \setminus z} rs_{t_{fail-1},j,z} + \sum_{j \in L_{t_{fail-1}} \setminus z} (rs_{pending,t_{fail-1},z,j} - acks_{pending,t_{fail-1},z,j})
  = \sum_{j \in L_{t_{fail-1}}, i \in L_{t_{fail}}} s_{pending,t_{fail},j,i} + \sum_{j \in L_{t_{fail}}} rs_{t_{fail},j,z} + \sum_{j \in L_{t_{fail}}} (rs_{pending,t_{fail},z,j} - acks_{pending,t_{fail},z,j})

(since these terms are not affected by the failure of z). Thus we have shown that Proposition 3 holds at time t_{fail}. For the induction step, we assume that (*) is true at some time t_n \geq t_{fail} and show that it is also true at time t_{n+1}. Let

M_{t_n} = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i} + \sum_{i \in L_{t_n}, j \notin L_{t_n}} rs_{t_n,i,j} + \sum_{i \in L_{t_n}, j \notin L_{t_n}} (rs_{pending,t_n,j,i} - acks_{pending,t_n,j,i})

Now assume that the round is executed on node a at time t_{n+1}. Then we have

M_{t_{n+1}} = \sum_{j \in L, i \in L_{t_{n+1}}} s_{pending,t_{n+1},j,i} + \sum_{i \in L_{t_{n+1}}, j \notin L_{t_{n+1}}} rs_{t_{n+1},i,j} + \sum_{i \in L_{t_{n+1}}, j \notin L_{t_{n+1}}} (rs_{pending,t_{n+1},j,i} - acks_{pending,t_{n+1},j,i})

  = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i} - \sum_{j \in L} s_{read,t_{n+1},j,a} + \sum_{j \in L_{t_{fail-1}}} \alpha_{a,j} s_{t_{n+1},a}
    + \sum_{j \in L_{t_n} \setminus a} rs_{t_n,j,z} + rs_{t_{n+1},a,z}
    + \sum_{j \in L_{t_n}} (rs_{pending,t_n,z,j} - acks_{pending,t_n,z,j}) - (rs_{read,t_{n+1},z,a} - acks_{read,t_{n+1},z,a})

(due to the discontiguous failures assumption, of all failed nodes only z's messages are pending in the network, and only a performs a transition at t_{n+1})

  = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i} - \sum_{j \in L} s_{read,t_{n+1},j,a} + \sum_{j \in L} s_{read,t_{n+1},j,a}
    + \sum_{j \in L_{t_n} \setminus a} rs_{t_n,j,z} + rs_{t_n,a,z} + (rs_{read,t_{n+1},z,a} - acks_{read,t_{n+1},z,a})
    + \sum_{j \in L_{t_n}} (rs_{pending,t_n,z,j} - acks_{pending,t_n,z,j}) - (rs_{read,t_{n+1},z,a} - acks_{read,t_{n+1},z,a})

(since, if a has learnt about the failure of z, then

\sum_{j \in L_{t_{fail-1}}} \alpha_{a,j} s_{t_{n+1},a} = \sum_{j \in L} s_{read,t_{n+1},j,a} + rs_{t_n,a,z} + (rs_{read,t_{n+1},z,a} - acks_{read,t_{n+1},z,a})    and    rs_{t_{n+1},a,z} = 0,

or, if a has not learnt about the failure of z, then

\sum_{j \in L_{t_{fail-1}}} \alpha_{a,j} s_{t_{n+1},a} = \sum_{j \in L} s_{read,t_{n+1},j,a}    and    rs_{t_{n+1},a,z} = rs_{t_n,a,z} + (rs_{read,t_{n+1},z,a} - acks_{read,t_{n+1},z,a}))

  = \sum_{j \in L, i \in L_{t_n}} s_{pending,t_n,j,i} + \sum_{j \in L_{t_n}} rs_{t_n,j,z} + \sum_{j \in L_{t_n}} (rs_{pending,t_n,z,j} - acks_{pending,t_n,z,j})
  = \sum_{i \in L_{t_n}} x_i = \sum_{i \in L_{t_{n+1}}} x_i    (by the induction hypothesis)

QED.
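Proposition 3 can also be exercised end to end in a toy synchronous simulation. The sketch below is illustrative only: recovery shares are recomputed by an oracle after every round (a stand-in for the protocol's incremental rs/acks exchange), the failure is detected instantaneously, and split weights are uniform; none of these simplifications or identifiers belong to the G-GAP implementation. The assertion checks that the mass attributable to live nodes equals \sum_{i \in live} x_i before and after a single (hence discontiguous) failure.

# Toy synchronous check of the Proposition 3 invariant across a single failure.
# NOT the G-GAP implementation (see the assumptions listed above).
import random

def run_toy(seed=0, rounds=30):
    random.seed(seed)
    x = {"a": 1.0, "b": 2.0, "c": 3.0, "z": 4.0}        # hypothetical local values
    live = set(x)
    pending = {(i, i): x[i] for i in x}                  # round 0: each node holds its own value
    rs = {}                                              # rs[(holder, owner)]

    def refresh_recovery_shares():
        # Oracle step enforcing a Lemma 3.4 style relation: for every live node a,
        #   sum over h != a of rs[(h, a)] = (pending mass addressed to a) - x[a].
        rs.clear()
        for a in live:
            others = [h for h in live if h != a]
            for h in others:
                rs[(h, a)] = pending.get((h, a), 0.0) + \
                             (pending.get((a, a), 0.0) - x[a]) / len(others)

    def do_round(i):
        # Node i reads all mass addressed to it and redistributes it uniformly.
        s_i = sum(v for (src, dst), v in list(pending.items()) if dst == i)
        for k in [k for k in pending if k[1] == i]:
            del pending[k]
        for j in live:
            pending[(i, j)] = pending.get((i, j), 0.0) + s_i / len(live)
        refresh_recovery_shares()

    def rhs_mass():
        # Right-hand side of Proposition 3 (in-flight recovery messages are 0 here).
        m = sum(v for (src, dst), v in pending.items() if dst in live)
        m += sum(v for (h, o), v in rs.items() if h in live and o not in live)
        return m

    for r in range(rounds):
        if r == rounds // 2:                             # node z fails halfway through
            live.discard("z")
            pending = {k: v for k, v in pending.items() if k[1] in live}  # mass in flight to z is lost
            for k in [k for k in rs if k[1] == "z" and k[0] in live]:
                # Survivors fold their recovery shares for z into their own pending mass.
                pending[(k[0], k[0])] = pending.get((k[0], k[0]), 0.0) + rs.pop(k)
        do_round(random.choice(sorted(live)))
        assert abs(rhs_mass() - sum(x[i] for i in live)) < 1e-9

run_toy()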
