FailDetect: Gossip-based Failure Estimator for Large-Scale Dynamic Networks

Andrei Pruteanu, Venkat Iyer, Stefan Dulman
Delft University of Technology, The Netherlands
Emails: {a.s.pruteanu, v.g.iyer, s.o.dulman}@tudelft.nl

Abstract—Ubiquitous, wirelessly connected devices are the present status quo of the networks around us. With ever-increasing scale comes the problem of transmission failures, usually caused by hardware or software faults or by medium access contention. In mobile networks, node mobility adds path uncertainty. All of this leads to low quality of service and reduced user experience. The main contribution of this paper is a novel distributed algorithm, FailDetect, for the statistical estimation of the average packet failure rate in large-scale wireless distributed systems. It is based on gossip protocols, augmented with periodic resets of the exchanged values. It is a fully distributed scheme that does not assume time synchronization of the reset intervals across nodes. A model and an evaluation by means of simulation and experiments show that FailDetect succeeds in estimating the average packet failure rate of the network while exhibiting low message complexity.

Index Terms—large-scale systems, failures, packet loss, distributed, gossiping, reset

I. INTRODUCTION AND MOTIVATION

The omnipresence of wirelessly connected devices around us is no longer a prediction about the future, but a fact of today's technological status quo. Along with the ubiquitous, always-connected experience comes the problem of transmission failures, due to noise in the wireless environment or to various hardware and software faults. Wireless communication environments are inherently different from wired ones. For the majority of devices (equipped with omnidirectional antennas), every transmission is a broadcast. Additionally, nodes have to share a limited part of the wireless spectrum. Due to contention and radio propagation issues (multi-path effects [8], etc.), the chances of transmission failures are high [25].

Failure detection is one of the most important building blocks of distributed systems applications such as transactions [12], consensus [5] and replication services [19]. In systems where synchronization is hard to achieve (such as MANETs), a failure detection service may be used to improve various agreement problems [9]. Inspired by real-world deployments of WSNs, where periodic node resets are a known failure mode [3], we propose a new failure detection algorithm that incorporates resets into a gossiping algorithm. We show that the new mechanism, called DiffusionReset, retains the property of converging exponentially fast.

Although our extension is derived from gossiping algorithms that are sensitive to mass conservation [16], [17], our approach specifically exploits the property that total mass varies in a dynamic network. Based on DiffusionReset, we develop the FailDetect algorithm as a solution for the online fail rate estimation within the network (defined as the percentage of packets that are lost within a defined period of time). We do not assume that nodes advertise their packet transmission success rate. In short, random subsets of nodes reset the local values used by the algorithm. The result of the gossiping algorithm is an average aggregate value, available at all nodes. The deviation from the expected estimate (in the absence of message loss) indicates the amount of transmission failures in the system. To the best of our knowledge, this is the first work to address an arbitrary mobile multihop topology while still offering accurate fail rate estimates in a fully distributed manner and with low message complexity. We validate our work with a model checked by both simulation and experiments on our wireless testbed. For the analysis of our algorithm via simulation we considered different mobility and network density scenarios that cannot be matched with corresponding traces from real deployments, due to the scarcity and difficulty of collecting such traces, especially for large-scale mobile ad-hoc networks.

The paper is structured as follows: Section II describes the existing state of the art. Section III introduces the failure detection mechanism. We analyze the proposed FailDetect algorithm in Section IV and conclude the paper in Section V.

II. RELATED WORK

Due to the detrimental nature of wireless communication failures, many studies are dedicated to the detection [18], [20], [25], the impact estimation [11], [23], [24], the mitigation [13] and the repair [1] of communication links [14], in order to restore the system to a normal state of functioning. In wireless sensor networks (WSNs), devices are usually capable of providing two pieces of information about channel quality: the Link Quality Indication (LQI) and the Received Signal Strength Indication (RSSI). They constitute a form of Channel State Information (CSI) [4], [14]. While each device is able to estimate the quality of its own links, obtaining a global view of the average packet loss across the system in a fully distributed manner has not been extensively studied.

The traditional approach to failure detection is to have each node send heartbeat broadcast messages at regular time intervals.

Fig. 1. DiffusionReset (200 nodes, transmission range 0.15 units, R (reset interval) = 50, µ = 0). Residual value of nodes 1-10 vs. time [time steps].

Fig. 2. Discrete time model for three nodes (φ_i - reset phase): reset rounds (R) of nodes 1-3 vs. timesteps [k].

Other nodes (such as link neighbors, in the case of MANETs) evaluate the presence or absence of a node based on the reception of heartbeat messages [7], [10]. This approach is feasible for static networks, provided that the network density is low and, as a consequence, the beacon (heartbeat) interval is also low. For dynamic topologies, intermittent links might be interpreted as failures [10].

The algorithm closest to our work [22] assumes that each host in the network runs a failure detector process that executes the protocol and offers continuous process failure and recovery reports to interested clients. Each member maintains a list of member addresses and an integer (heartbeat counter) that can be used to detect node failures. Each member typically records the last time its corresponding heartbeat counter increased. If the heartbeat counter has not increased for more than T_fail seconds, the member is considered to have failed. The protocol only detects hosts that become entirely unreachable; it does not detect link failures between hosts. For large-scale systems, the algorithm needs many optimizations to reduce its message complexity, at the cost of slower convergence.

III. THE FailDetect ALGORITHM

We introduce the FailDetect algorithm under the same assumptions as the Push-Sum [17] algorithm: i) communication takes place at discrete time intervals; ii) nodes do not need globally unique IDs (although at the lowest communication layer we need to be able to distinguish between neighbors); iii) the network does not become partitioned over time. We consider time discretized in order to simplify the description of the algorithm and to ease the intuitive understanding of the concepts. The term communication round is used similarly to [15]: it captures the fact that each node performs, in a given (large) time interval, an equal amount of actions. The beginnings of the rounds need not be synchronized (see Figure 2); in fact, FailDetect relies on this.

Rounds are considered to be orders of magnitude longer than the clock drifts of the devices, so usual communication networks can be modeled this way. The last assumption (an unpartitioned network or, equivalently, a network forming a single multihop cluster) should be interpreted over large periods of time: it may be invalidated in mobile scenarios for short moments (as when single nodes have no neighbors for a small interval of time). This behavior might introduce "glitches" in the algorithm; nevertheless, mobility actually helps, by significantly accelerating the convergence of diffusion algorithms [21].

A. The DiffusionReset Algorithm

In this section we introduce the first contribution of this paper, the DiffusionReset algorithm (see Algorithm 1), which is the foundation of our solution to the transmission failure detection problem. We build upon a basic diffusion algorithm (lines 1-3 and 9-14 in Algorithm 1), adding the novel feature that each node periodically (albeit asynchronously) resets its local variables to a default value, i.e., the tuple {µ; 1} (lines 4-8 in Algorithm 1). The rationale for this mechanism is to have a number of nodes reset periodically (detailed in Section III-B). The inspiration for this algorithm comes from a very common failure pattern in real-world WSN deployments, where nodes reset randomly [3] (see Figure 1 for the expected behavior). In our case we talk about a computational reset, not an actual restart of the system by software or hardware means.

As Lemma 1 shows (see Section III-B), the local state tuple {m_i[k]; ω_i[k]} will converge with time to {µ; 1}, regardless of the initial values {m_i[0], ω_i[0]}_{i∈S}; equivalently, the distributed variable M_i = m_i[k]/ω_i[k] converges to µ. Intuitively, the network "forgets" the initial values exponentially fast. This property extends to disturbances in the network: if a node's local values become arbitrary, the system converges back to {µ; 1} exponentially fast.

Basic diffusion mechanism – DiffusionReset borrows parts of Push-Sum and Push-Vector, introduced in [17] (lines 1-3 and 9-14 in Algorithm 1). In short, these work as follows: each node i holds a local state variable (the tuple {m_i[k]; ω_i[k]}) at the beginning of communication time step k (m_i is usually referred to as "mass"). During the time step, each node splits its local variables into several shares that get distributed to its neighbors. In our case, since we use unicast, the state gets distributed to only one neighbor. At the end of the time step, the node adds all the received shares and updates to the new state value.

Algorithm 1 DiffusionReset(µ, φ_i)
1: ⊲ state update step
2: m_i[k] ← Σ_{j∈S_i^+[k−1]} λ_{j,i}[k−1] m_j[k−1]
3: ω_i[k] ← Σ_{j∈S_i^+[k−1]} λ_{j,i}[k−1] ω_j[k−1]
4: ⊲ reset step
5: if rem(k, R) == φ_i then
6:   {m_i[k]; ω_i[k]} ← {µ, 1}
7:   Choose values λ_{i,j}
8: end if
9: ⊲ communication step
10: for all neighbors j do
11:   Send j: {λ_{i,j} m_i[k]; λ_{i,j} ω_i[k]}
12: end for
13: ⊲ return value
14: {m_i[k]; ω_i[k]}

Algorithm 2 FailDetect(φ_i)
1: ⊲ initialization step
2: if random uniform ≤ 0.5 then
3:   Update phase: φ_i ← rem(k, R)
4:   Reset mass value: µ_i ← 0
5: else
6:   Reset mass value: µ_i ← 1
7: end if
8: ⊲ diffusion step
9: {m_i[k]; ω_i[k]} ← DiffusionReset(µ_i, φ_i)
10: ⊲ return value
11: m_i[k]/ω_i[k]
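To make the pseudocode concrete, below is a minimal Python sketch of one node running Algorithms 1 and 2 in a synchronous round-based simulator. It is our own illustration rather than the authors' implementation: the class and method names are hypothetical, and the share choice (keep 0.5, send 0.5 to one random neighbor) is the unicast gossiping variant discussed below.

import random

class FailDetectNode:
    """One node running DiffusionReset (Algorithm 1) with the
    FailDetect initialization step (Algorithm 2)."""

    def __init__(self, R):
        self.R = R                       # reset interval, in time steps
        self.phi = random.randrange(R)   # reset phase phi_i
        # Algorithm 2, initialization: coin flip for the reset mass mu_i
        self.mu = 0.0 if random.random() <= 0.5 else 1.0
        self.inbox = [(self.mu, 1.0)]    # shares to be summed next step
        self.m, self.w = self.mu, 1.0    # local state tuple {m_i; w_i}

    def update(self, k):
        # lines 1-3: sum all shares received during the previous step
        self.m = sum(dm for dm, _ in self.inbox)
        self.w = sum(dw for _, dw in self.inbox)
        self.inbox.clear()
        # lines 4-8: periodic reset to {mu_i; 1} at phase phi_i
        if k % self.R == self.phi:
            self.m, self.w = self.mu, 1.0

    def send(self, neighbors):
        # lines 9-12, unicast gossip: keep one half, send the other half
        j = random.choice(neighbors)
        j.inbox.append((0.5 * self.m, 0.5 * self.w))     # may be lost
        self.inbox.append((0.5 * self.m, 0.5 * self.w))  # retained share

    def estimate(self):
        # return value of Algorithm 2: the local estimate m_i[k]/w_i[k]
        return self.m / self.w if self.w > 0 else self.mu

A driver calls update(k) on every node and then send(...) on every node, once per time step; in the error-free case every local estimate converges toward the average of the reset values µ_i (0.5 in expectation).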

The effect of this mechanism is that, with time, all local variables converge to the same value (the average of the initial variable set) regardless of the synchronization model [17] (allowing us to relax the synchronous communication assumption).

Let i indicate the current node and j the index of a neighbor, j ∈ S_i^+[k]. During each time step, node i defines a share vector Λ_i[k] of size n_i[k], with elements corresponding to the share of the local variables to be distributed to each neighbor. Let λ_{i,j}[k] be the share assigned by node i to neighbor j in time step k. The shares are chosen such that, at any time step k, Σ_{j∈S_i^+[k]} λ_{i,j}[k] = 1 holds. During each time step k, each node i sends to its neighbors the weighted vectors {λ_{i,j}[k] m_i[k]; λ_{i,j}[k] ω_i[k]} and receives the sets {λ_{j,i}[k] m_j[k]; λ_{j,i}[k] ω_j[k]} from its neighbors. At time step k+1, the node updates its m_i value (ω_i is updated similarly) as follows: m_i[k+1] = Σ_{j∈S_i^+[k]} λ_{j,i}[k] m_j[k]. In matrix form (M and Ω being column vectors with the m_i and, respectively, ω_i elements), we have M[k+1] = Λ^T M[k], Ω[k+1] = Λ^T Ω[k]. As shown in [17], if no errors occur and the set of nodes remains the same, the sums Σ_{i∈S} m_i[k] and Σ_{i∈S} ω_i[k] remain constant over time. The share vector Λ_i[k] allows great flexibility in the algorithm design: if all the elements of Λ_i[k] are zero, except for two entries (corresponding to i and a random neighbor j) equal to 0.5 each, the algorithm maps onto the classic definition of gossiping using unicasts. If all the entries of Λ_i[k] are taken to be equal, we model a local broadcasting mechanism.

Reset mechanism – The reset mechanism (lines 4-8 in Algorithm 1) works as follows: every R time steps, a node resets its state value to {µ; 1}. The reset phase of each node is φ_i (see Figure 2). Let δ[k] be the discrete Dirac function. The moment k when node i resets is signaled by t_i[k] = 1, where t_i[k] = δ[rem(k − φ_i, R)] (rem(a, b) gives the remainder of the division of a by b). Let x_i[k] be the local state variable on node i (i.e., the vector [m_i[k], ω_i[k]]). The state transition can be written as

x_i[k+1] = (1 − t_i[k]) Σ_{j∈S_i^+} λ_{j,i}[k] x_j[k] + t_i[k] [µ, 1].   (1)

We define the vector X = [x_1[k], x_2[k], ..., x_n[k]]^T. Let A[k] be the adjacency matrix and I the identity matrix. We define the square matrix ∆[k] with the terms t_i[k] on its diagonal. Let D be an n × 2 matrix with elements µ on the first column and 1 on the second column. The algorithm can be written in matrix form as

X[k+1] = (I − ∆[k]) (I + A[k]) Λ^T[k] X[k] + ∆[k] D,

where the left term captures the basic diffusion mechanism and the right term the asynchronous resets.
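The matrix form lends itself to a compact numerical sanity check. The NumPy sketch below is our own, not from the paper; for simplicity it folds the topology directly into Λ[k] (each node keeps a share of 0.5 and sends 0.5 to a random ring neighbor) instead of carrying (I + A[k]) explicitly, and it applies ∆[k] as a boolean reset mask. The printed total mass settles around nµ, anticipating Lemma 1 below.

import numpy as np

rng = np.random.default_rng(0)
n, R, mu, steps = 50, 10, 0.7, 400

phi = rng.integers(0, R, size=n)                   # random reset phases
X = np.column_stack([rng.uniform(0, 100, n),       # arbitrary initial mass m
                     np.ones(n)])                  # weights w
D = np.column_stack([np.full(n, mu), np.ones(n)])  # reset values {mu; 1}

for k in range(steps):
    # build Lambda[k]: row i holds the shares node i gives out this step
    L = np.zeros((n, n))
    for i in range(n):
        j = (i + rng.choice([-1, 1])) % n          # a random ring neighbor
        L[i, i] = 0.5                              # retained share
        L[i, j] = 0.5                              # unicast share
    X = L.T @ X                                    # diffusion: X <- Lambda^T X
    resets = (k % R) == phi                        # diagonal of Delta[k]
    X[resets] = D[resets]                          # asynchronous resets

print(X[:, 0].sum(), "is close to n*mu =", n * mu)   # total mass -> n*mu
print((X[:, 0] / X[:, 1]).mean(), "is close to mu")  # local estimates -> mu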

B. Convergence of DiffusionReset

As shown by Sarwate and Dimakis [21], mobility enables the construction of "short" routes between all pairs of agents, accelerating the diffusion process. The convergence time is T_conv(n, ε) = O((n²/m) log ε⁻¹), where m is the number of nodes that are mobile. If the entire network becomes mobile (m = n), then T_conv(n, ε) = O(n log ε⁻¹), meaning that the speed of information diffusion in a fully mobile network approaches that of a fully connected network. In our case, the influence of mobility and of the multihop topology is captured by the A[k] and Λ[k] matrices, which change at each moment in time.

In the following we provide general proofs of the convergence of DiffusionReset, which hold as long as each node distributes mass around itself without loss (Σ_{i=1}^{n} λ_{i,j} = 1). Each node resets every R time steps, and the reset phase of each node is random and follows a uniform distribution. This results in an approximately constant number of nodes resetting at each time step. The average mass value to which all nodes reset in one gossiping step is equal to µ. The expected values are E{t_i[k]} = 1/R, E{t_i[k]²} = 1/R, E{λ_{i,j}[k]} = 1/2, where we used the fact that t_i[k] can be either 0 or 1, leading to t_i[k] = t_i[k]². Let f = 1 − 1/R. The error on each node is defined as |m_i[k] − µ|. We can prove the following two lemmas:

Lemma 1 (Convergence of Mass for DiffusionReset). With time, the total mass of the system converges to lim_{k→∞} M[k] = nµ.

Proof: The total mass in the network at time k+1 is M[k+1] = Σ_{i=1}^{n} m_i[k+1]. From Equation 1,

M[k+1] = Σ_{i=1}^{n} (1 − t_i[k]) Σ_{j∈S_i^+} λ_{j,i}[k] m_j[k] + nµ/R.   (2)

We can extend the expression above from local neighborhoods (S_i^+) to the full network because the shares λ_{j,i} = 0 for j ∉ S_i^+. It follows (knowing that t_i[k] and λ_{j,i}[k] are independent, and that n_f is the number of nodes not resetting m and ω at time k+1):

M[k+1] = Σ_{i=1}^{n_f} (1 − t_i[k]) Σ_{j=1}^{n} λ_{j,i}[k] m_j[k] + nµ/R   (3)

E{M[k+1]} = Σ_{i=1}^{n} Σ_{j=1}^{n} (1 − E{t_i[k]}) E{λ_{j,i}[k]} m_j[k] + nµ/R   (4)

= (1 − 1/R) (1/2) n_f µ + nµ/R.   (5)

As f < 1, we obtain lim_{k→∞} M[k] = nµ.

Fig. 3. Packet loss estimation error [percentage] vs. network diameter [hops], for a static network and maximum node speeds of 0.05, 0.1 and 0.15 units/time unit. Network size = 400. Mobility model: Random Walk. Reset interval: 50.

Fig. 4. Packet loss estimation error [percentage] vs. maximum node speed [units/time unit], for injected packet loss levels from 0% to 70%. Network size = 400. Mobility model: Random Walk. Reset interval: 50.

C. The FailDetect Algorithm

The main idea of the algorithm is that each node flips a coin when it has to reset, and sets µ to either 0 or 1 with 50% probability. We assume that the links between nodes are affected by errors described by the random variable ψ_{i,j}[k]. The only assumption we make is that its mean value does not fluctuate widely on a timescale of the same order as R, so that we may take E{ψ} = ψ_0. The constant ψ_0 is the probability that a message is sent successfully (1 − error rate). We can rewrite the equation for mass variation (Equation 1) as:

x_i[k+1] = (1 − t_i[k]) Σ_{j∈S_i^+} ψ_{j,i}[k] λ_{j,i}[k] x_j[k] + t_i[k] [µ_i, 1].   (6)

Following the same reasoning as in the previous lemmas and the fact that, for the unicast case, λ_{j,i} = 1/2, one can easily show that:

lim_{k→∞} M[k] = nµ/R + (1/2)(1 + ψ_0) n_f µ.   (7)

Let µ̂ be the computed average at any of the nodes via the gossiping process (after a low-pass filter has been applied). The node can determine the mean value ψ_0 from the following equations:

nµ/R + (1/2)(1 + ψ_0) n_f µ = n µ̂.   (8)

Dividing by n (with n_f = nf in expectation) gives

µ/R + (1/2)(1 + ψ_0) f µ = µ̂,   (9)

whence

ψ_0 = (2µ̂R − 2µ − Rµf) / (fµR).   (10)

The average error rate is simply 1 − ψ_0 = 2R(µ − µ̂) / ((R − 1)µ).   (11)
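As a quick worked example of this estimator (the numbers are hypothetical, not from the paper): with reset interval R = 50, reset average µ = 0.5 and a filtered gossip estimate µ̂ = 0.45, the recovered average error rate is 2·50·(0.5 − 0.45)/(49·0.5) ≈ 0.204, i.e., about 20% packet loss:

def estimated_loss(mu, mu_hat, R):
    # average packet error rate 1 - psi_0, recovered from the
    # low-pass-filtered gossip estimate mu_hat (Equation 11)
    return 2 * R * (mu - mu_hat) / ((R - 1) * mu)

print(estimated_loss(mu=0.5, mu_hat=0.45, R=50))  # 0.2040..., ~20% loss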

IV. FailDetect ALGORITHM ANALYSIS

We base our evaluation on simulations first, conducted in Matlab. The mobile nodes are assumed to be deployed in a square space of 1 unit². A circular disk communication model is assumed. The nodes move through space with speeds ranging from a minimum of 0.0 units/time step (static network) to a maximum of 0.15 units/time step, using the Random Walk [2] mobility model. Each experiment consisted of simulations running for 500 time steps. The number of nodes, the maximum node speed, the packet failure rate and the network diameter (transmission range) were varied across simulations to achieve different topologies and network dynamics.
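This setup is straightforward to approximate in a few lines. The sketch below is our own Python reconstruction (the original simulations were in Matlab): it places nodes in the unit square, moves them with a Random Walk and rebuilds the circular-disk adjacency A[k] at every time step; the constants mirror the ranges reported above.

import numpy as np

rng = np.random.default_rng(1)
n, tx_range, max_speed, steps = 400, 0.15, 0.1, 500

pos = rng.uniform(0.0, 1.0, size=(n, 2))   # deployment in a 1 x 1 square

def random_walk(pos):
    # Random Walk mobility: each node moves with a random heading and a
    # speed in [0, max_speed] per time step, clipped to the unit square
    angle = rng.uniform(0.0, 2.0 * np.pi, n)
    speed = rng.uniform(0.0, max_speed, n)
    step = speed[:, None] * np.column_stack([np.cos(angle), np.sin(angle)])
    return np.clip(pos + step, 0.0, 1.0)

def disk_adjacency(pos):
    # circular disk model: i and j are linked iff their distance
    # is at most the transmission range
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    return (d <= tx_range) & ~np.eye(n, dtype=bool)

for k in range(steps):
    pos = random_walk(pos)
    A = disk_adjacency(pos)   # A[k], to be fed into the DiffusionReset step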

Fig. 5. Packet failure estimation error [percentage] vs. network diameter [hops] and maximum node speed [units/time unit]. Network size = 108. Mobility model: Random Walk. Reset interval: 50.

Fig. 6. Packet failure estimation error [percentage] vs. real packet loss [percentage], for a static network and maximum node speeds of 0.05, 0.1 and 0.15 units/time unit. Network size = 400. Mobility model: Random Walk. Reset interval: 50.

A. Influence of Network Diameter and Mobility on Accuracy

From the accuracy point of view, a trade-off exists between the network diameter and the mobility of the nodes (i.e., their maximum speed). While the former aspect is already well understood [6], the latter is still a subject of active research [21]. The importance of this trade-off is that, given a network diameter, if the mobility of the nodes is above a certain threshold then, from the convergence speed and accuracy perspectives, the mobile multihop network becomes equivalent to a single-hop network. Intuitively, a large network diameter increases the number of diffusion steps needed to spread information. When nodes are mobile, links between distant nodes have a higher chance of being created within a given time window than in a static network, which reduces the number of diffusion steps needed to distribute the same amount of information.

As seen in Figure 3, there is a gradient in the accuracy of the prediction. For networks where the nodes have increased mobility and/or a high transmission range (lower network diameter), the packet loss estimation accuracy is higher. When the network is static and/or has a low density (nodes have reduced transmission ranges), information spreads more slowly, which results in a higher estimation error. Conversely, when the nodes become mobile and/or the transmission range increases, the network becomes more connected and the estimation improves, due to an increase in information diffusion speed.

The simulation results confirm our hypothesis. In Figure 3 it can be seen that when the network diameter is small, the estimation error is around 20%. When the network diameter increases (the transmission range gets smaller), the density decreases and the network becomes more disconnected. After 25 hops the error is above 30% and increases rapidly. An interesting aspect is that the beneficial influence of node mobility shows when the network becomes more disconnected. After 10 hops, a gap between the static topology and the dynamic ones starts to appear. It widens towards the right of the graph, confirming again the hypothesis that node mobility does help, especially when the network density is low.

B. Influence of Packet Loss Ratios on Accuracy

FailDetect was designed to compute, in a distributed manner, the percentage of packets that are lost in a network due to various failures and contention conditions. While we claim that our solution is elegant, has low overhead in terms of communication and computational costs, and is easy to implement, there are also boundaries that we have to consider. We searched for them by setting the uniform distribution of packet loss at different levels across several experiments, from 0% to 80%, and analyzed the estimation error.

What we noticed from the experiments, showcased in Figure 4 and Figure 6, is that there is an interval in which the estimation error of FailDetect is lower than 40%. It is able to accurately detect packet failure rates between 15% and 70%. Outside this interval, the accuracy of the algorithm is low. A possible explanation for the lower bound is that when the packet loss is low (less than 10%), any fluctuation introduced by the random failures can cause high variations of the estimation error; since the value to be estimated is small, the resulting relative error is high. For the upper bound, at high packet loss rates, FailDetect is not able to cope with the increased level of packet loss, since the basic diffusion mechanism via gossiping barely works at all. Due to the high number of packets that are lost, not even node mobility can help in this case.

A similar behavior can be seen in Figure 5: the higher the diameter, the higher the estimation error of the algorithm. Here, because the size of the network is smaller, the effects are much more pronounced, since the network gets disconnected much more quickly as the diameter increases. As expected, our algorithm performs well in detecting the average packet loss within a certain average packet loss interval. Since gossiping as a mechanism depends on the quality of the communication links, such a limitation comes naturally. Still, FailDetect proves valuable within the confidence interval described above.

C. Testbed Evaluation

In order to further validate our algorithm on real devices, we implemented FailDetect on our wireless sensor network testbed, consisting of 108 GNode nodes statically deployed across the floor of our department, using TinyOS 2.x as the software platform. The GNodes are sensor nodes built around an MSP430 microcontroller combined with a Chipcon CC1101 transceiver.

TABLE I
ESTIMATION OF PACKET ERROR ON TESTBED. 50% OF THE NODES RESET TO 0.

Gossip aggregate | Actual packet error | Estimated packet error
0.56             | 0.26                | 0.24
0.58             | 0.44                | 0.33

We implemented the network reset schedule using a lookup table of node IDs, indexed by gossiping round number. Typically, for a gossiping round, every node ID in the lookup table resets its value to 0. In order to estimate the average packet error within the network, we let 50% of the nodes reset to 0 and let the network compute the gossiping average. The relation between the observed average packet error and the computed value is captured in Table I. Since packet errors cannot be fully accounted for and controlled, we showcase two particular cases of packet errors that we observed: 26% and 44%. The average values computed through gossiping (0.56 and 0.58, respectively) translate into packet error estimates of comparable accuracy to those captured in Figure 6.

V. CONCLUSIONS

Failure detection (e.g., of the packet loss rate) is an important building block for most distributed systems applications, such as transactions, consensus and replication services. In systems where time synchronization is hard to achieve (such as MANETs), a failure detection service may be used to improve various applications that make use of this information. In this paper we introduced FailDetect, an algorithm for the online estimation of transmission failures in dynamic networks. To our knowledge, this is one of the first algorithms specifically targeted at multihop, mobile networks. The solution we presented is elegant due to its reduced number of assumptions; it has low message complexity (being derived from gossiping algorithms) and incorporates the notion of periodic asynchronous resets in order to provide, at each node, an estimation of the packet failure rate. We analyzed the FailDetect algorithm and validated our contribution analytically, through simulations, and with experiments on a wireless sensor network testbed. We showed that our method is applicable to large-scale networks and works in both the static and the mobile case. As future work, we plan to extend our analysis to larger experiments, with more diverse conditions in terms of topology dynamics, and to the adaptiveness of the algorithm.

REFERENCES

[1] H. Balakrishnan, V. Padmanabhan, S. Seshan, and R. Katz, "A comparison of mechanisms for improving TCP performance over wireless links," IEEE/ACM Trans. on Networking, vol. 5, no. 6, pp. 756–769, 2002.
[2] C. Bettstetter, "Int. workshop on modeling analysis and simulation of wireless and mobile systems," in MSWiM 2001, 2001, pp. 19–27.
[3] J. Beutel et al., "Deployment techniques for sensor networks," in Sensor Networks, ser. Signals and Communication Technology, G. Ferrari, Ed. Springer, 2009, pp. 219–248.

[4] C. Boano et al., "The Triangle Metric: Fast Link Quality Estimation for Mobile Wireless Sensor Networks," in Proc. of ICCCN 2010. IEEE, 2010, pp. 1–7.
[5] Chandra et al., "Unreliable failure detectors for reliable distributed systems," Journal of the ACM (JACM), vol. 43, no. 2, pp. 225–267, 1996.
[6] A. G. Dimakis, A. D. Sarwate, and M. J. Wainwright, "Geographic gossip: efficient aggregation for sensor networks," in Proc. of IPSN 2006. New York, NY, USA: ACM, 2006, pp. 69–76.
[7] M. Elhadef and A. Boukerche, "A Gossip-Style Crash Faults Detection Protocol for Wireless Ad-Hoc and Mesh Networks," in Proc. of IPCCC 2007. IEEE, 2007, pp. 600–605.
[8] D. Estrin et al., "Instrumenting the world with wireless sensor networks," in Proc. of ICASSP 2001, vol. 4. IEEE, 2001, pp. 2033–2036.
[9] M. Fischer, N. Lynch, and M. Paterson, "Impossibility of distributed consensus with one faulty process," Journal of the ACM (JACM), vol. 32, no. 2, pp. 374–382, 1985.
[10] R. Friedman and G. Tcharny, "Evaluating failure detection in mobile ad-hoc networks," International Journal of Pervasive Computing and Communications, vol. 5, no. 4, pp. 476–496, 2009.
[11] Z. Fu et al., "The impact of multihop wireless channel on TCP throughput and loss," in Proc. of INFOCOM 2003, vol. 3. IEEE, 2003, pp. 1744–1753.
[12] R. Guerraoui et al., "Non blocking atomic commitment with an unreliable failure detector," in Proc. of the 14th Symposium on Reliable Distributed Systems. IEEE, 1995, pp. 41–50.
[13] N. Hamed Azimi, H. Gupta, X. Hou, and J. Gao, "Data preservation under spatial failures in sensor networks," pp. 171–180, 2010.
[14] M. Ilyas and H. Radha, "Measurement based analysis and modeling of the error process in IEEE 802.15.4 LR-WPANs," in Proc. of INFOCOM 2008. IEEE, 2008, pp. 1274–1282.
[15] K. Iwanicki and M. van Steen, "On hierarchical routing in wireless sensor networks," in Proc. of IPSN 2009. IEEE, 2009, pp. 133–144.
[16] M. Jelasity, A. Montresor, and O. Babaoglu, "Gossip-based aggregation in large dynamic networks," ACM Trans. on Computer Systems, vol. 23, no. 3, pp. 219–252, 2005.
[17] D. Kempe, A. Dobra, and J. Gehrke, "Gossip-based computation of aggregate information," in Proc. of FOCS 2003, 2003.
[18] K. Naidu, D. Panigrahi, and R. Rastogi, "Detecting anomalies using end-to-end path measurements," in Proc. of INFOCOM 2008. IEEE, 2008, pp. 1849–1857.
[19] B. Parno, A. Perrig, and V. Gligor, "Distributed detection of node replication attacks in sensor networks," in Proc. of the 2005 IEEE Symposium on Security and Privacy. IEEE, 2005, pp. 49–63.
[20] N. Samaraweera, "Non-congestion packet loss detection for TCP error recovery using wireless links," IEE Proceedings - Communications, vol. 146, no. 4, pp. 222–230, 2002.
[21] A. Sarwate and A. Dimakis, "The impact of mobility on gossip algorithms," in Proc. of INFOCOM 2009. IEEE, 2009, pp. 2088–2096.
[22] R. van Renesse, Y. Minsky, and M. Hayden, "A gossip-style failure detection service," in IFIP 2009. Springer-Verlag, 2009, pp. 55–70.
[23] F. Xing and W. Wang, "On the critical phase transition time of wireless multi-hop networks with random failures," in Proc. of MobiCom 2008. ACM, 2008, pp. 175–186.
[24] Y. Xu et al., "Characterizing the spread of correlated failures in large wireless networks," in Proc. of INFOCOM 2010. IEEE, 2010, pp. 1–9.
[25] J. Zhao et al., "Understanding packet delivery performance in dense wireless sensor networks," in Proc. of the 1st International Conference on Embedded Networked Sensor Systems. ACM, 2003, pp. 1–13.