
Distributed Recovery from Network Partitioning in Movable Sensor/Actor Networks via Controlled Mobility

Kemal Akkaya1, Fatih Senel2, Aravind Thimmapuram1 and Suleyman Uludag3

1 Department of Computer Science, Southern Illinois University Carbondale, Carbondale, IL 62901. Email: [email protected], [email protected]
2 Department of Computer Science, University of Maryland Baltimore County, Baltimore, MD 21250. Email: [email protected]
3 Department of Computer Science, Engineering and Physics, University of Michigan-Flint, Flint, MI 48502. Email: [email protected]



Abstract—Mobility has been introduced to sensor networks through the deployment of movable nodes. In movable wireless networks, network connectivity among the nodes is a crucial factor for relaying data to the sink node, exchanging data for collaboration and performing data aggregation. However, such connectivity can be lost due to a failure of one or more nodes. Even a single node failure may partition the network and thus eventually reduce the quality and efficiency of the network operation. To handle this connectivity problem, we present PADRA, which detects possible partitions and then restores network connectivity through the controlled relocation of movable nodes. The idea is to identify in advance, in a distributed manner, whether the failure of a node will cause partitioning. If a partitioning is to occur, PADRA designates a failure handler to initiate the connectivity restoration process. The overall goal in this process is to localize the scope of the recovery and minimize the overhead imposed on the nodes. We further extend PADRA to handle multiple node failures. This approach, namely MPADRA, strives to provide a mutual exclusion mechanism in repositioning the nodes to restore connectivity. The effectiveness of the proposed approaches is validated through simulation experiments. Index Terms—Movable sensors and actors, relocation, fault-tolerance, connectivity, node failure, partitioning

1 Introduction
In recent years, there has been growing interest in deploying heterogeneous movable nodes within wireless sensor networks (WSNs) for various purposes. These nodes range from small mobile motes such as Robomote [1], [2] to powerful actors [3] which can take certain actions. While the former have given rise to Movable/Mobile Sensor Networks (MSNs) ([1], [4]), where all the sensors can move on demand in addition to their sensing capabilities, the latter have introduced the networking of mobile actors with static sensors, called Wireless Sensor and Actor Networks (WSANs) [3]. Actors in WSANs are movable units such as robots, rovers and unmanned vehicles which can process the sensed data, make decisions, and then perform appropriate actions. The response of the actors mainly depends

on their capabilities and the application. For instance, actors can be used in lifting debris to search for survivors, extinguishing fires, chasing an intruder, etc. Examples of WSAN applications include facilitating/conducting Urban Search And Rescue (USAR), detecting and countering pollution in coastal areas, detecting and deterring terrorist threats to ships in ports, destruction of land and underwater mines, etc. In both MSNs and WSANs, connectivity of the network is crucial throughout its lifetime in order to meet the desired application-level requirements. For instance, in MSNs, sensors need to periodically, or in response to a query, send their data to the sink node so that all spots of the region can be monitored accurately. In addition, they need to perform aggregation/fusion on the data they receive from their neighbors and relay the fused information toward the sink. This requires that the whole network, with the sink node and sensors, remain connected throughout the lifetime of the network. Similarly, as far as WSANs are concerned, in most application setups, actors need to coordinate with each other in order to share and process the sensors' data, plan an optimal response and pick the most appropriate subset of actors for executing such a plan. For instance, in a forest monitoring application, in case of a fire, the actors should collaboratively decide the best possible solution in terms of the number of actors to employ, their traveling time, and distance to the fire spot. Again, this process requires that all the actors form and maintain a connected inter-actor network at all times so as to be aware of the current states of others.
In such connected MSNs and WSANs, the failure of one or multiple nodes (i.e., sensors or actors, respectively) may cause the loss of multiple inter-node communication links, partition the network if alternate paths among the affected nodes are not available, and disable the actuation capabilities of the node, if any. Such a scenario will not


only hinder the nodes' collaboration but also have very negative consequences for the considered applications. Therefore, MSNs and WSANs should be able to tolerate the failure of mobile nodes and self-recover from them in a distributed, timely and energy-efficient manner: First, the recovery should be distributed since these networks usually operate autonomously and unattended. Secondly, rapid recovery is desirable in order to maintain responsiveness to detected events. And finally, the energy overhead of the recovery process should be minimized to extend the lifetime of the network. In this paper, we present a distributed PArtition Detection and Recovery Algorithm (PADRA) to determine possible partitioning in advance and self-restore the connectivity in case of such failures with minimized node movement and message overhead. Since partitioning is caused by the failure of a node which is serving as a cut-vertex (i.e., a gateway for multiple nodes), each node determines in advance, in a distributed manner, whether it is a cut-vertex. This is achieved by utilizing the Connected Dominating Set (CDS) of the whole network. Once such cut-vertex nodes are determined, each node designates the appropriate neighbor to handle its failure when such a contingency arises in the future. The designated neighbor picks a node, called a dominatee, whose absence does not lead to any partitioning of the network to replace the failed node when it actually fails. Such a dominatee is found through either a greedy algorithm that follows the closest neighbors or a dynamic programming approach that explores all possible paths at the expense of increased message cost. The replacement is done through a cascaded movement in which all the nodes from the dominatee to the cut-vertex are involved. The goal is to share the movement load so that the energy of the selected dominatees will not drain quickly as a result of a long mechanical motion.
PADRA assumes that only one node fails at a time and no other nodes fail until the connectivity is restored. However, although the probability is lower, concurrent failure of multiple nodes is still possible. Such a case is very challenging and warrants further investigation. More specifically, two important issues are: (1) how to deal with a loss of multi-hop links if the failed nodes are neighbors; and (2) how to coordinate multiple recovery efforts if the failed nodes are located in different parts of the network. In this respect, we further extend PADRA to handle multiple simultaneous node failures. The approach, namely MPADRA, can coordinate the execution of multiple recovery efforts by introducing mutual exclusion on the use of nodes. The idea is to reserve the nodes before they actually move to restore the connectivity so that they cannot be moved for other recovery purposes. In the case of unsuccessful reservation, another failure handler (i.e., secondary failure handler) is involved to restore the connectivity. Our contributions in this paper are as follows: 1) We propose a new distributed cut-vertex detection algorithm with a low false alarm ratio; 2) We use connected dom-

inating set and cascaded movement ideas together in order to recover from the failure of cut-vertices. The proposed algorithm is both distributed and proactive and is thus message efficient; and 3) We propose a novel algorithm to handle multiple simultaneous node failures. Simulation results confirm that both PADRA and MPADRA perform very close to the optimal solution in terms of travel distance with only local knowledge and thus with minimal messaging cost.

2 Related Work
Node mobility has been exploited in wireless networks in order to improve various performance metrics such as network lifetime, throughput, coverage and connectivity [5]. Two types of mobility were considered in these efforts: inherent and controlled mobility. Inherent mobility can be further classified as random and predictable depending on the travel path. In this case, mobile nodes are used as carriers to relay data from sources to the destination [6]. This may be due to the lack of communication paths between the source and the destination or simply to preserve the very limited energy supply aboard the source node. These mobile relays, often referred to as MULEs (Mobile Ubiquitous LAN Extensions), may be passing-by cars, humans, animals, etc. In controlled mobility, on the other hand, nodes move only when needed and follow a prescribed travel path. For such a purpose either external mobile nodes are introduced into the system [7], [8] or existing (internal) movable nodes in the network are used [12]. Due to its practicality, the bulk of the published work in this area has pursued controlled mobility. As we exploit such mobility to maintain connectivity in this paper, the discussion in this section will be limited to schemes that consider network connectivity. A robot, called Packbot, has been introduced in [7] to serve as a mobile relay in WSNs. A Packbot comes close to sensors to collect their data and then carries all data reports to the base-station. The Packbot's proximity to the sensor nodes significantly reduces the energy consumed in wireless transmission and thus lengthens the sensors' lifetime. In addition, the use of the Packbot enables reaching isolated nodes or blocks (network partitions) and links them to the rest of the network. An algorithm has been proposed for determining the trajectory of the Packbot to serve multiple nodes. A similar work that employs mobile relay nodes is presented in [13].
Unlike [7], the mobile relays stay within at most two hops of the sink and thus do not need to travel around the network. While Packbots, and other similar devices, can in principle repair disconnected networks, the use of a mobile relay is not practical in wireless networks, given the expected latency in data delivery caused by the travel time while touring sources and by the slow mechanical motion relative to wireless transmission. With multiple Packbots, the latency still stays high even if a distinct Packbot is designated for every individual data source.


Controlled mobility of the internal nodes within the WSN has mostly been exploited to improve the network lifetime and coverage [9], [15]. Some studies considered connectivity as a constraint while striving to improve other performance metrics. For example, in [9], nodes on a data route are repositioned in order to minimize the total transmission energy while still preserving connectivity with the base-station. Meanwhile, the objective of C2AP [12] is to maximize the coverage of actor nodes while maintaining a connected inter-actor topology. COCOLA [10] deals with the effect of moving a node on the network connectivity. Basically, the goal is to avoid breaking any inter-node links. None of these approaches pursues the repositioning of existing nodes to restore network connectivity that gets severed by the failure of a node. The closest work to ours is reported in [16]. This work presents DARA, which also strives to restore connectivity when a cut-vertex node fails. The idea is similar to ours in the sense that it explores cascading movement when replacing the failed node. However, there are many differences from our work. First, DARA does not provide a mechanism to detect cut-vertices. It is assumed that this information is available at the node, which may require knowledge of the whole topology. Our approach, on the other hand, determines the cut-vertices in advance through the underlying CDS. Secondly, in DARA, the selection of the node to replace the failed node is done based on the neighbors' node degree and distance to the failed node, which may require excessive replacements until a leaf node is found. Our approach looks for a dominatee rather than a leaf node to replace the failed cut-vertex. We also note that a preliminary version of this paper has appeared in [18]. This paper extends that work by providing more analysis, performance enhancements and handling of multiple simultaneous failures.
As the actors employed in WSANs can also be robots, some of the research in mobile robot networks can be applied to WSANs. One example of providing fault-tolerance in such networks has been studied in [17]. The approach is based on moving a subset of the robots to restore the 2-connectivity that has been lost due to the failure of a robot. While the idea of moving robots is similar to ours, it is performed based on block movements and requires a centralized approach. The same problem is also studied in [19], which presents a distributed approach instead. Unlike [17] and [19], our paper focuses on providing 1-connectivity, which provides more room for coverage when compared to 2-connectivity. Note that the idea of cascaded movement is similar to that of [15]. This work identifies some spare sensors from different parts of the network that can be repositioned in the vicinity of the holes. Since moving a node for a relatively long distance can drain a significant amount of energy, a cascaded movement is proposed if there is a sufficient number of sensors on the way. The idea is to determine intermediate sensor nodes along the path and to replace those nodes gradually. However,

connectivity is not considered in this work. Regarding the handling of multiple node failures causing partitioning, our approach MPADRA is the first work, to the best of our knowledge, to address this challenging problem.

3 Partition Detection and Recovery
3.1 System Model and Assumptions
We consider a WSAN or an MSN. We assume that the nodes (i.e., actors or sensors) are randomly deployed in an area of interest and form a connected network. In the case of WSANs, this is in fact an inter-actor network, i.e., no sensor nodes are involved. However, in MSNs, we consider the network with the sink node and the sensors to be connected. Note that in either case the nodes have the ability to move but, unlike in Mobile Ad Hoc Networks (MANETs), they are not moving most of the time. The mobility capability is only exploited whenever needed. The nodes in such networks have a limited on-board energy supply. The radio range, denoted by r, of a node refers to the maximum Euclidean distance that its radio can reach. r is assumed to be constant and the same for all nodes. The interference range of each node, on the other hand, is assumed to be 2r. Each node periodically broadcasts heartbeat messages to its neighbors. If such a heartbeat message is not received within a certain amount of time, the node is assumed to have failed. We also assume that there are no obstacles on the path of a moving node and that the nodes can reach their exact locations by maintaining a constant speed. However, movement is assumed to be more costly than message transmission [15]. Finally, we assume that all the nodes have the same sensing capabilities and thus the movement of these nodes would not cause sensing coverage holes in terms of some sensing modalities.
3.2 Problem Definition
When the lost node is a leaf node, no other nodes will be affected. However, when the failed node serves as a gateway node in the network, serious damage to the network connectivity will be inflicted. Such nodes are referred to as cut-vertices. The loss of a cut-vertex partitions the network into disjoint sub-networks.
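The heartbeat-based failure detection assumed in the system model above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the timeout value, class and method names are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
import time

FAILURE_TIMEOUT = 3.0  # assumed: silence longer than this marks a neighbor failed

class NeighborTable:
    """Tracks the last heartbeat time of each one-hop neighbor."""

    def __init__(self):
        self.last_seen = {}

    def on_heartbeat(self, neighbor_id, now=None):
        # Record when this neighbor was last heard from.
        self.last_seen[neighbor_id] = time.time() if now is None else now

    def failed_neighbors(self, now=None):
        # A neighbor silent for longer than FAILURE_TIMEOUT is presumed failed.
        now = time.time() if now is None else now
        return [nid for nid, t in self.last_seen.items()
                if now - t > FAILURE_TIMEOUT]

table = NeighborTable()
table.on_heartbeat("A1", now=0.0)
table.on_heartbeat("A2", now=2.5)
print(table.failed_neighbors(now=4.0))  # A1 has been silent for 4.0 s > 3.0 s
```

Note that, as the recovery-process discussion later acknowledges, a missed heartbeat may be a false alarm (e.g., channel congestion) rather than an actual failure.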
Note that in this paper, we do not consider the coverage holes formed as a result of node failures, as this problem has already been extensively studied in [14], [15]. Nonetheless, a joint consideration of connectivity and coverage holes is left as future work. Our problem can be defined as follows: "n mobile nodes that know their locations are randomly deployed in an area of interest and form a connected network G. In the case of a failure of a particular node, Ai, our goal is twofold: 1) Determine locally whether such a failure causes any partitioning within the network G; and 2) If there is a partitioning, determine (again locally) the set of movements to restore the connectivity with minimum travel distance. By travel distance, we mean two different metrics: 1) Total Movement distance of


all the Nodes (TMN): Σ_{i∈S} Mi; 2) The Maximum of the Movement distance of all Individual nodes (MMI): max_{i∈S} Mi, where S denotes the set of nodes in the network and Mi denotes the total movement distance for a particular node i." In order to minimize these metrics, we present a Connected Dominating Set (CDS) based approach which informs a particular node u in advance whether a partitioning will occur in the case of its failure. As a result, each node will know in advance how its failure will be handled. Obviously, if u's failure does not cause any partitioning, no handling will be needed. However, if the failure of u leads to partitioning, then it designates a set of nodes to be repositioned so that the connectivity is restored in the network. Determining such nodes and how they will be repositioned is our second goal. While our goal is to minimize TMN, as mentioned above, it is also important to share this movement load among all the moving nodes evenly so that fairness can be achieved. Otherwise, if a particular node moves very long distances to fix the connectivity, it may deplete all of its energy and die more quickly than the rest of the nodes in the network. Therefore, we also consider the second metric, MMI, and would like to minimize it first. The rationale is to put a cap on MMI (e.g., r) and then try to minimize TMN so as to extend the lifetime of the whole network. As will be elaborated later in Section 3.4, this means that in some cases TMN will be sacrificed in order to provide a better MMI. Next, we describe our approach for cut-vertex determination.
3.3 Cut-vertex Determination
Determining whether a node is a cut-vertex can easily be done by using depth-first search (DFS) trees. However, this approach requires flooding the whole network and can be costly in terms of message overhead. Thus, we follow a distributed approach for this purpose.
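The two travel-distance metrics defined in the problem statement, TMN and MMI, can be computed directly from the per-node movement distances Mi. A small illustrative sketch (the distances below are hypothetical, in units of the radio range r):

```python
# Per-node movement distances Mi for one hypothetical recovery.
# `cascaded` shares the load over three nodes; `direct` sends a
# single node the whole way.
cascaded = {"A3": 1.0, "A5": 1.0, "A8": 1.0}
direct = {"A8": 3.0}

def tmn(moves):
    """Total Movement distance of all the Nodes: sum of Mi over i in S."""
    return sum(moves.values())

def mmi(moves):
    """Maximum of the Movement distance of all Individual nodes: max of Mi."""
    return max(moves.values())

print(tmn(cascaded), mmi(cascaded))  # 3.0 1.0
print(tmn(direct), mmi(direct))      # 3.0 3.0
```

The two scenarios have equal TMN, but the cascaded one caps the burden on any single node, which is exactly the fairness argument made above.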
Our approach is based on the concept of the CDS of the network. As every node can reach the nodes in a CDS, the connectivity of the network can be maintained as long as the CDS is connected. We use a distributed algorithm [20] in order to determine the CDS of a given network G. Note that this algorithm requires only local information: a node needs to know only its two-hop neighbors, which can be obtained by transmitting two messages. As a result of running Dai's algorithm [20], each node will know whether it is a dominator (an element of the CDS) or a dominatee (one hop away from a dominator). When this information is available, the determination of a cut-vertex is handled as follows: 1) If a node u is a dominatee, then it cannot be a cut-vertex since its absence will not violate the connectivity of the network. For instance, node A1 is a dominatee and cannot be a cut-vertex in Figure 1a. 2) If a node u is a dominator, then there is a high

probability that u will be a cut-vertex. Here, there can be two cases: a) In the first case, u may have at least one dominatee, v, as its neighbor. If v does not have any other neighbors, u will declare itself a cut-vertex. This is shown in Figure 1a, where A2 is a dominator with at least one dominatee (i.e., A1) which does not have any other neighbors. Thus, A2 will be a cut-vertex. b) In the second case, all the neighbors of u will be either dominators or dominatees which have some neighbors. Under such circumstances, determination of the cut-vertex is challenging since the dominators/dominatees can be connected via some other nodes within the network. Thus, the failure of u may not cause any partitioning. In order to decide whether u is a cut-vertex, one of the neighbors of u should start a local depth-first search (DFS) to look for all the remaining neighbors of u. If all of them can be reached, this indicates that there are alternative paths which can be used during the failure of u to maintain the connectivity of the network. Therefore, u will not be a cut-vertex. Otherwise, it will declare itself a cut-vertex. For instance, in Figure 1b, A3 is a dominator whose neighbors A2 and A4 both have neighbors. Therefore, a DFS is required to check whether A3 is a cut-vertex. It is important to note that for the cases of 2b (i.e., Figure 1b), we may have an increased message overhead depending on the topology of the network. Especially for applications where message transmission is a concern (due to security reasons, network size, etc.), the local DFS can be partially or completely eliminated.

Figure 1. In a), node A1 is a dominatee and cannot be a cut-vertex. A2 is a dominator and has a dominatee A1 which is not connected. Thus, A2 is a cut-vertex. A3 is also a cut-vertex in a) but will not be a cut-vertex in b).
Specifically, for complete elimination, any dominator which falls into category 2b will be assumed to be a cut-vertex without performing any local DFS. For partial elimination, DFS will only be performed for the dominators which fall into category 2b but do not have any dominatees within 2 hops. If DFS is used, the approach will be referred to as PADRA+ hereafter.
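The classification rules above (case 1, case 2a, and case 2b) can be sketched as a single decision function. This is an illustrative simplification, not the paper's distributed protocol: the function name and its inputs are assumptions, and the result of the local DFS is passed in precomputed rather than run by a neighbor.

```python
def classify_cut_vertex(role, dominatee_neighbors, neighbor_degree, dfs_ok):
    """
    role: 'dominatee' or 'dominator', from the CDS construction.
    dominatee_neighbors: ids of this node's dominatee neighbors.
    neighbor_degree: id -> number of neighbors that node has besides us.
    dfs_ok: True if a local DFS reaches all neighbors without this node;
            only consulted in case 2b.
    """
    if role == 'dominatee':
        return False              # case 1: a dominatee is never a cut-vertex
    for v in dominatee_neighbors:
        if neighbor_degree[v] == 0:
            return True           # case 2a: an otherwise isolated dominatee
    return not dfs_ok             # case 2b: fall back on the local DFS result

# Figure 1a: A2 is a dominator whose dominatee A1 has no other neighbors.
print(classify_cut_vertex('dominator', ['A1'], {'A1': 0}, dfs_ok=True))   # True
# A1 itself is a dominatee and can never be a cut-vertex.
print(classify_cut_vertex('dominatee', [], {}, dfs_ok=True))              # False
```

Skipping the `dfs_ok` check (always treating case 2b as a cut-vertex) corresponds to the "complete elimination" variant, at the cost of false alarms.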


3.4 Handling Partitioning: Greedy Approach
3.4.1 In-advance designation of a failure handler
PADRA achieves optimized recovery from a node failure by preplanning the failure-handling process. The rationale for such a proactive approach is that the neighbors of a cut-vertex Ai will not be able to communicate after the failure of Ai. To restore the connectivity of the network, for each cut-vertex Ai, PADRA identifies a failure handler (FH) within the network that will start the recovery process when Ai fails. Right after the network is deployed, the CDS is constructed and cut-vertices are


identified as explained in Section 3.3. Each cut-vertex, Ai, then picks an FH among its neighbors, denoted as Ai^FH. When Ai fails, Ai^FH will initiate the recovery process. In this way, the network recovery time is minimized as each node in the network will know how to react to a failure before it happens. How to pick the FH is explained in connection with the recovery process next.
3.4.2 Recovery Process
When a node Ai fails (i.e., no heartbeat messages can be heard within a certain period of time), its FH, Ai^FH, will initiate the recovery process. We would like to note here that in some cases this can be a false alarm caused by Ai not being able to send the heartbeat message. In such a case, the node has not failed and recovery would not be needed. However, delaying the heartbeat message indicates a problem either in the channel or at the node due to high load. Therefore, the recovery process can still help to alleviate these problems without incurring any partitioning in the network. Further, Ai and Ai^FH can talk to each other about the type of the problem and can take further actions. Obviously, in the case of the existence of other wireless devices in the environment which can potentially interfere with Ai, the unnecessary movement of Ai^FH is inevitable. To prevent such circumstances, interference/jamming issues should be carefully considered during the deployment phase of the network. These details are beyond the scope of this paper. The idea in our recovery is to find the closest dominatee and use it as a replacement for the failed node so that the connectivity is preserved. In order to minimize the movement distance, the closest dominatee in terms of distance to the failed cut-vertex node should be determined. We propose a greedy approach for determining the closest dominatee for a cut-vertex node. Note that if Ai^FH is itself a dominatee, the recovery is simplified immensely, since it can move to the location of Ai and keep the network connected.
Clearly, the best FH choice for Ai is a dominatee. If the neighbors of Ai are all dominators, Ai prefers the closest one that has a dominatee among its neighbors (Ai will know that from its 2-hop neighbor list). If none of the neighbors of Ai has a dominatee, Ai just picks the closest node as its FH. Obviously, the quality of the FH selection improves with the consideration of 3-hop neighbors, 4-hop neighbors, etc., at the expense of increased message overhead. The basic idea of our recovery mechanism is as follows: If there is a dominatee among the neighbors of the cut-vertex, it will be designated as the node to replace the cut-vertex upon failure, since it will be the closest dominatee to the failed node. Otherwise, the cut-vertex node designates its closest neighbor to handle its failure. In case of such a failure, the closest neighbor will apply the same idea in order to find the closest dominatee to itself. That is, it picks its closest neighbor, and this continues until a dominatee is hit. For example, in Figure 2a, for node A2, the closest neighbor will

be A3 if the distance between A2 and A3, denoted as |A2A3|, is the minimum among all the other links of A2. Once A2 determines its closest neighbor, it sends a message to A3 and designates it as the node to handle its failure. When it actually fails, A3 will pick A5 and A5 will find A8 as the closest dominatee to A3. Note that with this approach only the nodes along the path to the dominatee transmit and receive messages, which reduces the message traffic significantly.

Figure 2. A sample execution of PADRA. Black nodes are dominators and white nodes are dominatees. a) Node A2 fails. Its FH A3 starts the recovery process. b) A3 replaces A2, A5 replaces A3 and finally A8 replaces A5.
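The FH preference order described above (closest dominatee neighbor; otherwise closest dominator neighbor that itself has a dominatee; otherwise simply the closest neighbor) can be sketched as follows. All data structures, names and coordinates are illustrative assumptions, written with full local knowledge rather than the paper's 2-hop message exchange.

```python
import math

def pick_failure_handler(cut_vertex, pos, role, neighbors):
    """Greedy FH choice using only 2-hop information, per the stated
    preference order.  pos maps id -> (x, y); role maps id -> 'dominator'
    or 'dominatee'; neighbors maps id -> list of neighbor ids."""
    def dist(a, b):
        return math.dist(pos[a], pos[b])

    cands = neighbors[cut_vertex]
    # 1) Closest dominatee neighbor, if any.
    dominatees = [n for n in cands if role[n] == 'dominatee']
    if dominatees:
        return min(dominatees, key=lambda n: dist(cut_vertex, n))
    # 2) Closest dominator neighbor that has a dominatee neighbor.
    with_dominatee = [n for n in cands
                      if any(role[m] == 'dominatee' for m in neighbors[n])]
    if with_dominatee:
        return min(with_dominatee, key=lambda n: dist(cut_vertex, n))
    # 3) Fall back to the closest neighbor.
    return min(cands, key=lambda n: dist(cut_vertex, n))

# All of A2's neighbors are dominators with no dominatee in reach, so the
# closest neighbor (A3, at distance 1 versus A4 at distance 2) is chosen.
pos = {'A2': (0, 0), 'A3': (1, 0), 'A4': (0, 2)}
role = {'A2': 'dominator', 'A3': 'dominator', 'A4': 'dominator'}
neighbors = {'A2': ['A3', 'A4'], 'A3': ['A2'], 'A4': ['A2']}
print(pick_failure_handler('A2', pos, role, neighbors))  # A3
```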


3.4.3 Relocation of the closest dominatee

Figure 3. TMN - MMI example.

Moving the closest dominatee directly to the location of the failed cut-vertex will definitely restore the connectivity with the minimum TMN. However, since the movement distance can be very large, MMI will be unacceptably high. In other words, the approach does not provide fairness and may deplete the moving nodes' energy rather quickly as compared to other nodes in the network. This is apparent in Figure 3. Here, if node A3 moves directly to the location of the failed node A1, TMN and thus MMI will be 2r, where r is the radio range of the nodes. However, as will be shown shortly, by sacrificing from TMN one can come up with a better MMI (i.e., r). A possible solution to further reduce MMI here could be to move the partition as a block toward the failed node, with the closest neighbor of the failed node as the leading node in this movement. While all the nodes will move the same distance of r with this approach, it will significantly increase TMN since all the nodes in that partition will be moving. Therefore, we propose a hybrid solution which combines the advantages of both approaches. The idea is to use cascaded movements from the closest dominatee to the failed node in order to maintain a maximum of r units of movement for the individual nodes (i.e., MMI) and at the same time minimize the TMN by decreasing the number of moving nodes. The approach can then be formulated as an optimization problem as follows: min Σ_{i∈S} Mi, subject to Mi ≤ r ∀i ∈ S. It can easily be shown that when the cascaded movement idea is used and MMI is kept at most at r, TMN will be the sum of all the distances on the shortest path from the dominatee to the failed node. The approach works as follows: In case of a failure, the closest dominatee will start a cascaded movement


toward the location of the failed node. That is, A3 will replace A2, A5 will replace A3 and finally A8 will replace A5, as shown in Figure 2b. The idea is to share the load of the movement overhead among the nodes on the path to the failed cut-vertex node in order to extend the lifetime of each node and thus the whole network. Note that this approach may not work when the network contains a cycle of dominators. We now describe how we address this problem.
3.4.4 Handling Cycles
Since the existence of a cycle may cause a series of replacements which will never be able to pick a dominatee node, a loop-avoidance mechanism should be defined to stop the replacements of the nodes. Therefore, we introduce an extra confirmation message, ACK, to be received before a node Ai can start to move. The idea is as follows: In order to replace itself, node Ai should pick a node Aj which has not moved before. If node Aj has moved before, it will send a negative acknowledgment back to Ai and will not move anymore. Node Ai will understand that this indicates a cycle and thus a new neighbor other than Aj should be picked. If there are no other neighbors, then Ai will move once and no further movements will be performed so that the cycle can be broken. An example is given in Figure 4.
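The cascaded relocation of Section 3.4.3 can be sketched numerically: each node on the path moves one hop into the slot its predecessor vacates, so no individual node travels more than one inter-node distance. Positions and node names below are illustrative, mirroring the Figure 2b chain.

```python
import math

def cascaded_moves(path, pos):
    """Each node on `path` (FH first, closest dominatee last) shifts one
    hop toward the failed node: path[0] takes the failed node's slot,
    path[1] takes path[0]'s old slot, and so on.  Distances use the
    pre-failure positions.  Returns per-node travel distances."""
    targets = [pos['failed']] + [pos[n] for n in path[:-1]]
    return {node: math.dist(pos[node], target)
            for node, target in zip(path, targets)}

# A chain on a line, one radio range r = 1.0 apart:
# A3 -> A2 (failed), A5 -> A3, A8 -> A5.
pos = {'failed': (0, 0), 'A3': (1, 0), 'A5': (2, 0), 'A8': (3, 0)}
moves = cascaded_moves(['A3', 'A5', 'A8'], pos)
print(sum(moves.values()), max(moves.values()))  # TMN 3.0, MMI 1.0
```

TMN equals the sum of the distances along the path, while MMI stays capped at one hop, matching the optimization formulation above.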

Figure 4. a) Initial network b) Cycle detection c) Cycle elimination.
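The NACK-based loop avoidance can be sketched as follows. This models only the bookkeeping, not the messaging: a centralized `moved` set stands in for the per-node "already moved" state, and a node whose candidates have all refused simply stops the cascade, breaking the cycle. All names are illustrative.

```python
def replace_with_cycle_check(start, neighbor_order):
    """neighbor_order: node -> its neighbors sorted by distance.
    A node that has already moved answers a replacement request with a
    NACK; the requester then tries its next-closest neighbor or, failing
    that, stops so the cycle is broken.  Returns the chain of movers."""
    moved = {start}          # nodes that have already moved (would NACK)
    chain = [start]
    node = start
    while True:
        # Skip neighbors that would NACK because they moved before.
        nxt = next((n for n in neighbor_order.get(node, ())
                    if n not in moved), None)
        if nxt is None:
            return chain     # no fresh neighbor: move once and stop
        chain.append(nxt)
        moved.add(nxt)
        node = nxt

# A hypothetical A2-A5-A7 dominator cycle: the cascade visits each node
# once and then stops instead of looping forever.
order = {'A2': ['A5'], 'A5': ['A7'], 'A7': ['A2']}
print(replace_with_cycle_check('A2', order))  # ['A2', 'A5', 'A7']
```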

Algorithm 1: Recovery(FH, i)

4

1

10

Due to space constraints we skip the detailed pseudocode for designating a FH for a node. The algorithm, named Recovery as shown in Algorithm 1, is run on the FH of the failed node i. Move(.) function is called

2

Algorithm 2: ClosestDominatee(FH)

9

A6

A4

3.4.5 Detailed Pseudocode

1

The algorithm that we proposed in Section 3.4 is a greedy CDS-based approach. However, in some circumstances selecting the closest neighbor dominator may not always provide the optimal solution in terms of TMN. We argue that we can find the optimal solution at the expense of increased message complexity by using a dynamic programming approach. The idea is to find the least cost path to the closest dominatee. Therefore, the FH node starts a search process among the sub-trees of its neighbors in order to find the closest dominatee. Basically, each sub-tree returns its cost of reaching a dominatee to the FH. The FH picks the least cost dominatee among these options rather than using a greedy approach, as shown in Algorithm 2. With this approach, node i in

8

A8

A4 (b)

3.5 Handling Partitioning: Dynamic Programming

7

A6

A3

A3

and broadcasts an ’ARRIVED’ message and updates its CDS accordingly.

// FH → failure handler designated upfront for node if i f ails then if isDominatee(F H) = true then M ove(F H, i, null) else if isDominator(F H) = true then // Check if FH has a dominatee neighbors if ∃j ∈ N (F H) ∧ isDominatee(j) = T rue then M ove(F H, i, j) else j ← ClosestN eighbor(F H) M ove(F H, i, j) // Since FH replaced i, j needs to replace FH end

i

when a node is moving. In that case, the node first informs its predecessor node to replace itself by sending a ’LEAVING’ message. It then moves to the new location

12 13

if i is a dominatee then cost ← 0 else if i is a dominator and |Ei | > 0 then cost ← min(Dist(Dominatee(i))) else Broadcast(i,0 CLOSEST _DOM IN AT EE 0 ) forall j ∈ N (i) do cost ← Dist(i, j) + ClosestDominatee(j) Receive(i,0 DOM IN AT EE_F OU N D 0 ) putintoM ap(i, cost, map) end minindex ← getIndexOf M inCost(map) return minindex

line 8 of Procedure Recovery(FH, i) in Algorithm 1 will start a recursive call to all of its neighbors asking for a dominatee as seen in Algorithm 2. Each caller can compute the minimum cost of finding a dominatee node by using the formula given in Equation 1.  minj∈Ei {dij } Ei 6= 0 ci = (1) minj∈Di {dij + cj } Ei = 0, where i is a dominator, ci is the total travel distance to replace i, dij is the Euclidean distance between i and j, and Ei and Di are the number of dominatees and dominators of node i, respectively. The FH will start replacing the node with the ID of minindex which will be done recursively until the dominatee is replaced. With dynamic programming, there is no risk of getting cycles and thus we eliminate the ’ACK’ message. We will refer this approach as PADRA-DP hereafter to differentiate it from PADRA using greedy approach.
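The cost recursion of Equation 1 can be sketched in a few lines. This is an illustrative sketch only: the topology, distances and helper names are assumptions for the example, not the paper's implementation.

```python
# Sketch of Equation 1: c_i is the distance to the nearest dominatee
# neighbor if one exists (E_i nonempty); otherwise the minimum over
# dominator neighbors of d_ij + c_j, evaluated recursively.

def closest_dominatee_cost(i, dominatees, dominators, dist, visited=None):
    """Return c_i as defined by Equation 1."""
    visited = (visited or set()) | {i}
    if dominatees.get(i):                       # E_i != empty set
        return min(dist[i][j] for j in dominatees[i])
    children = [j for j in dominators.get(i, []) if j not in visited]
    if not children:
        return float("inf")                     # no dominatee reachable
    return min(dist[i][j] +
               closest_dominatee_cost(j, dominatees, dominators, dist, visited)
               for j in children)

# FH "A4" has no dominatee neighbor; its two dominator neighbors lead to
# dominatees at different total distances, and the recursion picks the
# cheapest path (through A6) rather than the greedy nearest hop (A5).
dist = {"A4": {"A5": 1.0, "A6": 2.0}, "A5": {"D1": 3.0}, "A6": {"D2": 1.5}}
dominators = {"A4": ["A5", "A6"]}
dominatees = {"A5": ["D1"], "A6": ["D2"]}
print(closest_dominatee_cost("A4", dominatees, dominators, dist))  # 3.5
```

Note how the greedy choice (A5 at distance 1.0) would yield a total of 4.0, while the DP answer through A6 costs 3.5, illustrating why PADRA-DP can beat the greedy variant in TMN.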

4 Algorithm Analysis

4.1 Travel Distance Analysis
When the TMN is considered, there are three factors which affect the performance of the different

Table 1. Factors affecting the TMN performance

Algorithm   False Alarms in Cut-vertex Det.   Selection of FH   Closest Dominatee Det.
PADRA       Yes                               Greedy            Greedy
PADRA+      No                                Greedy            Greedy
PADRA-DP    Yes                               Greedy            DP
Optimal     No                                Optimal           Optimal

algorithms, as shown in Table 1. These are the number of false alarms in cut-vertex determination, the selection of the FH, and the determination of the closest dominatee. As summarized in Table 1, the variants of PADRA do not provide the optimal (i.e., the minimum) TMN. While PADRA+ ensures zero false alarms in determining the cut-vertices, it utilizes a greedy approach in the selection of the FH, as do PADRA and PADRA-DP. Basically, we only consider 2-hop neighbor information in the selection of the FH. If a dominatee does not exist within the 2-hop neighborhood, then the selection is based on a greedy approach which designates the closest neighbor as the FH. The optimal solution would consider all the neighbors and then pick the one that will trigger the least number of movements. Similarly, the determination of the closest dominatee is based on a greedy approach in PADRA and PADRA+. While PADRA-DP utilizes a dynamic programming approach, it suffers from false alarms and poor FH selection. Obviously, the optimal solution will outperform all the approaches in terms of TMN on average, but it would require either a centralized approach or excessive and unnecessary flooding of the whole network by all the nodes in advance. Nonetheless, we show that in the worst case the TMN for the optimal solution is the same as for the variants of PADRA.

Theorem 1: The worst case TMN for PADRA, PADRA+, PADRA-DP and the optimal solution is O(nr).

Proof: The worst case topology for TMN is depicted in Figure 5 for PADRA, PADRA+ and PADRA-DP. In this topology, assuming A4 fails, in the worst case the FH would be A5, which triggers movement until An is hit. Note that A3 cannot be picked since it will not trigger the worst number of movements. Thus, the TMN for PADRA, PADRA+ and PADRA-DP is given by Σ_{i=1}^{n−4} r = (n − 4)r = O(nr), assuming that the distance between the nodes is equal to the maximum value of the transmission range, r.
For the optimal solution, however, A3 will be picked as the FH and thus TMN_Optimal = 3r. The worst case for the optimal solution would be the case where the node in the middle of the line topology in Figure 5 (i.e., A_j such that j = (n+1)/2) fails. In this case, whether the FH is selected as A_{j−1} or A_{j+1}, TMN_Optimal will be Σ_{i=1}^{(n+1)/2−1} r = ((n−1)/2) r, which is also O(nr).

Figure 5. Worst Case Topology for TMN with dominatees at two ends.
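The worst-case expressions in Theorem 1 can be checked numerically on a small line topology. The values of n and r below are assumptions for illustration only.

```python
# Numerical check of the worst-case TMN expressions from Theorem 1
# for an n-node line topology with inter-node spacing r.

n, r = 11, 100.0   # n odd, so the middle node A_{(n+1)/2} is well defined

tmn_padra = (n - 4) * r                  # cascade from A5 all the way to An
tmn_optimal = 3 * r                      # optimal FH choice (A3) when A4 fails
tmn_optimal_worst = ((n - 1) // 2) * r   # middle node A_{(n+1)/2} fails

print(tmn_padra, tmn_optimal, tmn_optimal_worst)  # 700.0 300.0 500.0
```

Both worst cases grow linearly in n, consistent with the O(nr) bound.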

Table 2. Type and Count of Messages Used

Algorithm   CDS   CV Determination   FH Desig. (CV)   FH Desig. (NCV)   Replace. (CV)   Replace. (NCV)
PADRA       4     None               1                None              3k              None
PADRA+      4     T + H + 1          1                None              3k              None
PADRA-DP    4     None               1                None              2T + 2k         None

4.2 Message Complexity Analysis
Before calculating the message complexity based on the worst case topology, we list the type and number of messages sent in general in Table 2. For each approach we summarize how many messages are needed at each step of the algorithm. These steps are as follows:

1) CDS determination. Each node sends 4 messages in order to determine whether it is a dominator or a dominatee [20].

2) Cut-vertex determination. Each node determines whether it is a cut-vertex or not. For dominatees there is no cost in terms of messaging. This is also true for dominators in PADRA and PADRA-DP, since each node just checks the 2-hop neighborhood table set up in step 1 and decides whether it is a cut-vertex or not. In PADRA+, a DFS should be performed in one of the sub-trees led by the closest neighbor of the node. Given that H denotes the number of neighbors, T the number of nodes in the sub-tree, and k the number of hops to the closest dominatee, we can calculate the message complexity as follows: the node sends one broadcast message and receives H replies from its neighbors. T messages are needed to reach the nodes within the sub-tree led by the closest neighbor. Thus, a total of (T + H + 1) messages will be needed. Finally, when a failure happens, 2k messages (i.e., a 'LEAVING' and an 'ARRIVED' message per hop) are sent to restore the connectivity.

3) FH determination. In each approach, every cut-vertex determines an FH for itself. This is again done by accessing the 2-hop neighbor table. Once the node decides, it informs the FH with a message. Obviously, a node which is not a cut-vertex does not designate an FH and thus no messages are sent.

4) Finding the closest dominatee and replacement. The FH node determines the closest dominatee by a greedy search. Each time, the closest neighbor is replaced until a dominatee is found. This requires 3 messages (including the ACKs) at each node, and thus a total of 3k messages is needed, where k is the number of hops to the closest dominatee. In the case of PADRA-DP, dynamic programming is used and thus an additional 2T messages are needed to search the sub-tree and get the optimal result. In addition, since there will be no cycles, 2 messages per node will be enough for the replacements. Therefore, a total of 2T + 2k messages will be needed. The overall message types and counts are provided in Table 2.

The worst case behavior of PADRA, PADRA+ and
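The per-step counts above can be combined into per-failure totals. The function below is a hypothetical sketch under one reading of Table 2 (CDS cost of 4 per node, one FH-designation message, and the per-variant replacement costs), with T, H and k supplied as inputs rather than derived from a topology.

```python
# Hypothetical per-failure message totals implied by Table 2;
# T = sub-tree size, H = neighbor count, k = hops to closest dominatee.

def messages(algorithm, n, T, H, k):
    """Total messages for one cut-vertex failure handled by each variant."""
    cds = 4 * n                      # 4 messages per node for CDS [20]
    if algorithm == "PADRA":
        return cds + 1 + 3 * k       # FH designation + 3 messages per hop
    if algorithm == "PADRA+":
        return cds + (T + H + 1) + 1 + 3 * k   # extra DFS for CV check
    if algorithm == "PADRA-DP":
        return cds + 1 + 2 * T + 2 * k         # sub-tree search, no ACKs
    raise ValueError(algorithm)

print(messages("PADRA", 50, T=10, H=3, k=4))     # 213
print(messages("PADRA+", 50, T=10, H=3, k=4))    # 227
print(messages("PADRA-DP", 50, T=10, H=3, k=4))  # 229
```

The CDS-construction term dominates for all variants; the variants differ only in the cut-vertex-check and replacement terms.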

8

PADRA-DP can again be observed when the topology is a line, as shown in Figure 5. Based on such a topology, we introduce the following theorems:

Theorem 2: The worst-case message complexity of PADRA is O(n).

Proof: Let us assume that A3 failed and A4 was the failure handling node. Then A4 will start a replacement process until An is hit. This means a total of n − 3 nodes will be replaced. At each replacement a request and an ACK message are used. Thus, 2(n − 3) messages will be sent to restore the connectivity. Adding 4n messages for CDS and n for closest node designation, the total number of messages in the worst case for PADRA will be 7n − 6, which is O(n).

Theorem 3: The worst-case message complexity of PADRA-DP is O(n).

Proof: Let us again assume that A3 failed and A4, with respect to Figure 5, was the failure handling node. Then A4 will start a replacement process based on dynamic programming to determine the closest dominatee. This means a total of (n − 3) messages will be transmitted and come back, totaling 2n − 6 messages. Once the closest dominatee and the path to that dominatee are available at node A4, it will start a replacement by following the next hop on the path to the closest dominatee. Thus, (n − 3) messages will be sent to restore the connectivity. Adding 4n messages for CDS and n for PFH designation, the total number of messages in the worst case for PADRA-DP will be 8n − 9, which is O(n).

Theorem 4: The worst-case message complexity of PADRA+ is O(n²).

Proof: In PADRA+, a DFS is performed for each cut-vertex, from A3 to An−2, again with respect to Figure 5. Thus, the number of nodes performing a DFS will be (n − 4). For A3, this will cost traversing each node until An in the worst case when the DFS is performed through A4. Thus, given T = n − 3 and H = 1, a total of (n − 1) messages will be sent. The total number of messages for An−2 is (n − 1) as well, if we assume that An−3 will start the DFS. As a result, only two nodes will introduce a total of (n − 1) messages; the rest will introduce even fewer messages. For instance, for A4 and A5, the total messaging costs are (n − 2) and (n − 3), respectively. Therefore, the total messaging cost will be (n − 1) + (n − 2) + ... + (n − (n−4)/2) + ... + (n − 2) + (n − 1). This can be expressed as 2[Σ_{i=1}^{(n−4)/2} (n − i)], which reduces to (3n² − 10n − 8)/4 and thus gives a worst-case message complexity of O(n²).

Theorem 5: The total number of messages sent in PADRA+ in the worst case is less than that of the optimal solution.

Proof: In the optimal cascading case, each node needs to know the whole topology. This will require (n − 1) broadcasts for A1 and An and (n − 2) broadcasts for the rest, totaling (n² − 2n + 2), with two more messages for replacement. Obviously, (n² − 2n + 2) > (3n² − 10n − 8)/4. Thus, PADRA+ has a smaller messaging cost when compared
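The closed form used in Theorem 4 can be verified against the direct sum for a range of even values of n − 4:

```python
# Brute-force check of the closed form from Theorem 4:
# 2 * sum_{i=1}^{(n-4)/2} (n - i) == (3n^2 - 10n - 8) / 4  for even n.

for n in range(6, 40, 2):                       # keep (n - 4) even
    direct = 2 * sum(n - i for i in range(1, (n - 4) // 2 + 1))
    closed = (3 * n * n - 10 * n - 8) // 4
    assert direct == closed, (n, direct, closed)
print("closed form verified")
```

For example, with n = 8 the direct sum 2[(8 − 1) + (8 − 2)] = 26 matches (3·64 − 80 − 8)/4 = 26.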

to the optimal solution.

4.3 Time Complexity Analysis
The period of time it takes to restore the connectivity is also a key concern, since the network will be disconnected during this transition time. The main factor here is the time it takes for a node to reach its final destination, which depends on the speed of the node. Therefore, MMI will directly affect the time for the network to be recovered. Messaging is also a concern, but it brings minimal overhead when compared to movement.

Theorem 6: The worst-case time complexity of PADRA, PADRA+ and PADRA-DP is O(r/s + (p + t)n), where s is the speed of the nodes, p the propagation delay for distance r, and t the transmission delay.

Proof: Assuming a failure occurred, the FH will send a message to its closest neighbor, get an ACK and move to replace the failed node. Similarly, the node receiving this message will do the same thing and move to replace the FH. Therefore, the replacements can be done in parallel. That is, while the first node is moving, its predecessor will also start moving, and so on. In this way, as soon as the message is received by a dominatee, it will take at most r/s time to replace its successor node, where r is the distance represented by the transmission range and s is the speed of the node. The time it takes the message to reach the dominatee can be computed as follows: at each node 3 messages are sent, so the time per hop is 3(p + t), where p is the propagation delay and t is the transmission delay. Since there can be at most (n − 4) hops, the total messaging time will be 3(n − 4)(p + t). Adding the movement time, the total will be r/s + 3(n − 4)(p + t), which is O(r/s + (p + t)n). For PADRA-DP, the messaging delay is higher. First the closest dominatee is found in 2(n − 4)(p + t) (i.e., 2T) time. Then the replacements are done in (n − 4)(p + t) time, since no ACK is needed. The total will be r/s + 3(n − 4)(p + t), which is again O(r/s + (p + t)n).

5 Multiple Simultaneous Node Failures

PADRA assumes that only one node fails at a time and that no other nodes fail until the connectivity is restored after a particular cut-vertex failure. However, there may be situations (e.g., flooding, thunderstorms, etc.) where two or more cut-vertices fail at around the same time, which requires the simultaneous execution of PADRA by the FHs of the failed nodes. Such simultaneous execution may fail to restore the connectivity, since the nodes do not have up-to-date state information about their neighbors and race conditions may occur. In other words, multiple FHs may compete to access the same set of nodes to replace their failed nodes. As a result of such competition, even if a First-Come-First-Served mechanism is applied, some of the nodes can get stuck and have to wait for the others to finish the replacements. Even worse, deadlock situations may occur when two nodes wait for each other. This requires extra messaging and may delay the recovery


process. Therefore, a mechanism which handles race conditions, updates the network state appropriately and provides parallel replacements is needed. In this section, we explain in detail how PADRA should be modified to handle simultaneous failures of multiple nodes. We refer to this approach as Multiple PADRA (MPADRA) hereafter.

5.1 MPADRA Overview
Depending on the locations of the failed nodes, the recovery process may introduce race conditions where the same node, say Ai, may be requested to replace multiple nodes, say Aj and Ak. In that case, a mutual exclusion mechanism is needed to make sure that only one of the nodes among Aj and Ak can use Ai for replacing itself. Otherwise, Ai will not know which one to replace. In addition, even if it replaces one, the other will not be able to proceed in the cascaded relocation process, and thus the connectivity will not be restored even though a number of replacements have already been made. One possible solution here is to wait until one of the failures is fixed and continue from where the recovery process stopped. However, this not only requires an alerting framework to inform the node about the end of the recovery process but also causes unnecessary delay in restoring the connectivity. Such delay may not be tolerated in mission-critical wireless sensor/actor networks. We present a solution which is based on two phases: in the first phase, an FH reserves the nodes to be replaced before the replacements are made, as opposed to PADRA. The goal of this reservation is similar to that of the RSVP protocol [21], where the nodes on a path are reserved for a certain connection. In our case, we strive to lock all the nodes on the path to the closest dominatee so that no other recovery process can use those nodes for replacement. Nevertheless, during this reservation phase race conditions occur if a node receives a reservation message from two nodes at similar times.
In that case, the node which sent the message first has the priority and is able to lock that node for itself. The other node backs off and looks for an alternative dominatee. Note here that such a reservation may force the other FH to reserve a dominatee over a longer path. As our approach is not centralized and only local information is available at the nodes, this is inevitable. Moreover, depending on the topology, such a dominatee may not be found at all. In that case, the only way to restore the connectivity is to use a secondary FH (SFH), the second closest node, which will look for another dominatee on another path. Basically, the primary FH (PFH) will time out and the SFH will take over the recovery process. Once the nodes to be replaced are reserved for the FHs, in the second phase they apply the cascaded motion as in PADRA, update the CDS and unlock the nodes that had been locked for replacement purposes. We now explain how MPADRA handles possible multiple failure scenarios.
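The first-come-first-locked reservation rule of Phase I can be sketched as a small state machine. The Node class and names below are illustrative assumptions, not the paper's implementation; in the paper an accepted 'RESERVE' is answered by silence, which the sketch models as an explicit "ACK" for clarity.

```python
# Minimal sketch of the Phase-I locking rule: a node accepts the first
# 'RESERVE' it sees and NACKs later ones until it is released.

class Node:
    def __init__(self, name):
        self.name, self.locked_by = name, None

    def on_reserve(self, requester):
        """First 'RESERVE' locks the node; later ones are NACKed."""
        if self.locked_by is None:
            self.locked_by = requester   # first come, first locked
            return "ACK"
        return "NACK"                    # already reserved: requester backs off

    def on_release(self):
        """'RELEASE' returns the node to the unlocked state (Figure 6a)."""
        self.locked_by = None

shared = Node("A7")
print(shared.on_reserve("u_PFH"))  # ACK
print(shared.on_reserve("v_PFH"))  # NACK: v_PFH must try another dominatee
shared.on_release()
print(shared.on_reserve("v_PFH"))  # ACK
```

A NACKed requester then tries an alternative neighbor, and if none exists its SFH eventually times out and takes over, as described next.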

5.2 MPADRA Details
Let us assume two nodes, namely u and v, fail at approximately the same time. We introduce the notation in Table 3 for representing the PFH, SFH, closest dominatee and the path to that closest dominatee for such nodes.

Table 3. Notation for the Nodes Involved in MPADRA

Node   PFH     SFH     Closest Dom. (CD)   Path to CD
u      u_PFH   u_SFH   u_CD                P(u_PFH, u_CD)
v      v_PFH   v_SFH   v_CD                P(v_PFH, v_CD)

MPADRA works in two phases, as mentioned above.

Phase I: Path Reservation: As soon as a PFH, u_PFH, which is a 1-hop neighbor of the failed node u, detects that u has failed, it starts the first phase of the recovery process by sending a 'RESERVE' message to its closest dominatee. If there is no such dominatee, it sends it to its closest dominator. A node receiving a 'RESERVE' message will either accept the request (i.e., not respond) or send a NACK (negative acknowledgment) message, depending on its state. If the node has not already been reserved by another node (i.e., it is unlocked), it will not send any message but change its state to locked (see Figure 6a). Otherwise, it will send a NACK message

Figure 6. (a) The states of a node in MPADRA; (b) Case 1: u and v execute in parallel without facing any race condition.

back to the originator of the 'RESERVE' message. Once a node is locked, it applies the same procedure and tries to lock a dominatee, if any. This process continues until a dominatee has been locked. In the meantime, the other PFH, v_PFH, also tries to lock a dominatee in the same way. During this first phase, depending on the locations of the failed nodes u and v, the following cases are possible:

1) The paths to the closest dominatees of u_PFH and v_PFH do not share any common links (i.e., P(u_PFH, u_CD) and P(v_PFH, v_CD)), as seen in Figure 6b. This is the simplest case, which does not require any special treatment. Basically, both u_PFH and v_PFH will independently and simultaneously execute Phase I of MPADRA and be able to reserve u_CD and v_CD (and any intermediate nodes on their paths).

2) P(u_PFH, u_CD) and P(v_PFH, v_CD) share some common links/nodes, as seen in Figure 7. In this case, there are race conditions where u_PFH and v_PFH compete to reserve a node. For instance, in Figure 7a, while u_PFH is trying to reserve u_CD, v_PFH is also trying to reserve u_PFH and v_CD (= u_CD). Therefore, whichever makes the reservation and locks the nodes first will be able to recover its failed node. If u fails first, u_PFH and u_CD are locked. If v_PFH then requests u_PFH, it will not be able to reserve it. Thus, it will back off and try

Figure 7. Case 2: Situations where paths for recovery have common nodes/links.

the second closest neighbor y, through which it can reach and reserve dominatee x. However, if such an option is not available, then v_PFH will get stuck and will not be able to reserve a dominatee. This indicates that the PFH will fail to restore the connectivity and thus the SFH should be involved to start the recovery process. In this case, the nodes which are locked (if any) should be unlocked, starting backwards from the node which got stuck. To do that, the node which got stuck starts sending a 'RELEASE' message back to the node that tried to reserve it. Any node receiving a 'RELEASE' message in the locked state goes back to the unlocked state, as shown in Figure 6a. Given that there are cases where the PFH fails to restore the connectivity, we define a time-out value for the SFHs. For instance, in Figure 7a, node v_SFH, which is the SFH of v, will wait for time τ to see v_PFH replace v. If such a replacement is not done within τ, then v_SFH will assume that v_PFH failed to restore the connectivity, and it starts a new recovery process for the failure of v. Note that this also applies to the failure of u. If v fails first and thus v_PFH locks nodes u_PFH and u_CD, then u_PFH will not be able to reserve any node for handling the failure of u. In that case, u_SFH times out and starts a new recovery process. The selection of the value of τ can be based on an estimate of the time it takes a PFH node to replace the failed node. The time to replace the failed node will be r/s. The round trip time required for the reservation of the path will be, in the worst case, 2(n − 2)(p + t). Then τ should be set to at least the sum of the travel time and the round trip time: τ > r/s + 2(n − 2)(p + t). Note that Figures 7b, c and d illustrate special cases of 2) where the PFHs are not able to restore the connectivity. In Figure 7b, P(u_PFH, u_CD) passes through v and P(v_PFH, v_CD) passes through u. This is a situation where u_PFH will fail to reserve v_PFH and vice versa. Therefore, both SFHs, u_SFH and v_SFH, need to get involved and recover the failed nodes u and v, respectively. Similarly, in Figure 7c, the PFHs of u and v are the same (i.e., u_PFH = v_PFH) and thus a path cannot be reserved for both u and v. The failures are then handled by the SFHs of u and v. In some cases both the node and its corresponding PFH fail at the same time, as shown in Figure 7d. In this case, again, the SFHs will time out and handle the failures. Finally, we would like to note that SFHs may get stuck in reserving a path when the number of failed nodes is more than two. For such cases, we adapt and use the

idea of the exponential back-off mechanism of the traditional Ethernet protocol. That is, an SFH will back off and retry reserving the nodes on the path after a certain amount of time, with the hope that some of the failures have already been fixed. If it still cannot reserve the path on the second try, its timer is doubled. In this way, eventually all the partitions will be recovered.

Phase II: Replacements: Once the paths are reserved successfully (i.e., a dominatee is locked) for each PFH (or SFH), the replacements can be done safely without running into race conditions. The replacements are done starting from the dominatee node. However, some parallelism can be exploited, as each node knows where to go. Basically, the dominatee node broadcasts a 'LEAVING' message and starts moving toward its new location. As soon as such a 'LEAVING' message is received at the node to be replaced, it also broadcasts a 'LEAVING' message and starts moving. As a result, every node receiving the 'LEAVING' message on the path starts moving to its new location. As soon as a node reaches its new location, it broadcasts an 'ARRIVED' message to its neighbors for updating the network state, as in PADRA. The connectivity is restored when the PFH replaces the failed node.

Detailed Pseudo-code: The pseudo-code for MPADRA is given in Algorithm 3. This is a generic code

Algorithm 3: Recovery(invoker, i)

    if state == unlocked then
        state ← locked
        if isDominatee(i) == true then
            Move(i, invoker)
        else if isDominator(i) == true then
            // Check if i has a dominatee among its neighbors
            if ∃j ∈ N(i) ∧ isDominatee(j) == true then
                Unicast(i, j, 'RESERVE')   // j runs Recovery(i, j)
            else if ∃j ∈ N(i) ∧ isDominator(j) == true then
                Unicast(i, j, 'RESERVE')   // j runs Recovery(i, j)
            else if ∀j ∈ N(i): j is FAILED or LOCKED then
                // i is stuck
                if SFH(i) == true then
                    Back off and double timer
                else
                    Unicast(i, invoker, 'RELEASE')
    else if state == locked and 'RESERVE' msg is received then
        Unicast(i, invoker, 'NACK')        // already reserved
    else if state == locked and 'LEAVING' msg is received then
        Move(i, invoker)
    else if state == locked and 'NACK' msg is received then
        // try other dominatees and dominators; if there is no available neighbor:
        state ← unlocked                   // reservation failed
        Unicast(i, invoker, 'RELEASE')

which will work at any node during the execution of MPADRA. Algorithm 3 will be invoked at a PFH if


it detects a failure, at a node which receives a 'RESERVE'/'NACK'/'LEAVING' message, or at an SFH which has timed out after the failure of its invoker. Here, i is the node that runs the procedure. Initially, invoker is null, since the failed node is dead, and i = PFH/SFH. We skip the details of the algorithm due to space constraints.

5.3 MPADRA Analysis

Theorem 7: The TMN in the worst case is r(n − 4) in MPADRA.

Proof: Assuming a line topology as in Figure 5, in the worst case two nodes in the middle fail and their PFHs are on opposite sides. Then each will introduce a TMN of r((n − 2)/2 − 1), totaling r(n − 4) for restoring the connectivity.

Theorem 8: The worst-case message complexity of MPADRA is O(n).

Proof: Assuming the line topology in Figure 7d, the worst case will be two independent recovery processes running in parallel. One of the SFHs starting the reservation phase will need to send one 'RESERVE' message. If (n − 2)/2 nodes are involved, a total of (n − 2)/2 messages will be needed to reserve a path, assuming no race conditions. The replacements will require (n − 2) messages (i.e., a 'LEAVING' and an 'ARRIVED' message for each moving node). Finally, (n − 2)/2 messages to release the locks (i.e., 'RELEASE') will be required. The total will be 2(n − 2) for one SFH. For two SFHs, the number of messages required is 4n − 8. If we add the cost of CDS (4n), PFH selection (n) and SFH selection (n), the total adds up to 10n − 8, which is O(n).

Theorem 9: The worst-case time complexity of MPADRA is O(r/s + (p + t)n).

Proof: Assuming one of the PFHs fails to restore the connectivity, the corresponding SFH will need to wait for τ = r/s + 2(n − 2)(p + t) before it times out, as explained before. Once this happens, it needs to reserve the path, which requires (n − 2)(p + t) time, and then start the replacements, which requires another (n − 2)(p + t) time to inform the SFH to replace the failed node. The time it takes to reach the new location will be at most r/s. Thus, the total time needed will be at most 2r/s + 4(n − 2)(p + t), which is O(r/s + (p + t)n).
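The SFH timeout τ from Section 5.2 and the doubling back-off can be sketched as follows. All parameter values are assumptions chosen for illustration.

```python
# Sketch of the SFH timeout (tau > r/s + 2(n-2)(p+t)) and the
# Ethernet-style back-off whose timer doubles after each failed attempt.

def sfh_timeout(r, s, n, p, t):
    """Travel time r/s plus the worst-case round-trip reservation time."""
    return r / s + 2 * (n - 2) * (p + t)

def retry_schedule(tau, retries):
    """Back-off delays: the timer doubles after each failed reservation."""
    return [tau * (2 ** i) for i in range(retries)]

tau = sfh_timeout(r=100.0, s=1.0, n=20, p=0.01, t=0.01)
print(round(tau, 2))          # 100.72
print(retry_schedule(tau, 3))
```

Note that the travel term r/s dominates for realistic node speeds, matching the observation that messaging overhead is minimal compared to movement.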

6 Experimental Evaluation

6.1 Experiment Setup and Performance Metrics
In the experiments, we considered a forest monitoring application where movable robots with fire extinguishers are deployed among sensors. Each robot collects data from sensors and needs to be in communication with the other robots. We created connected networks of these robots consisting of a varying number of nodes randomly placed in an area of interest of size 700m × 700m. Each simulation is run on 30 different network topologies and the average performance is reported. The dominatees and dominators are determined by running the distributed algorithm in [20]. For each topology, one of

the cut-vertices is picked to fail, in such a way that there are no dominatees among the neighbors of the cut-vertex. We used the TMN, the total number of messages and the number of false alarms in cut-vertex detection as the performance metrics. We compared PADRA, PADRA-DP and PADRA+ with DARA [16] and with the optimal cascaded movement solution, which provides the least travel distance.

6.2 Performance Evaluation of PADRA
False Alarms for Cut-vertices: We first checked the effectiveness of our cut-vertex detection algorithm. Mainly, we looked at the false alarm rate when PADRA is used. In order to perform this experiment, we created random topologies, and for each topology we counted the number of cut-vertices found by PADRA and PADRA+. The results in Figure 8a show that PADRA can detect all the nodes which really are cut-vertices, but it also falsely identifies additional nodes as cut-vertices although they are not. However, this false-alarm ratio is not very high, particularly when the node transmission range and the number of nodes are smaller. In other words, if the degree of connectivity is high and the network contains cycles, then the number of false alarms will be higher for PADRA. For instance, in Figure 8a, for 60 nodes and a 50m transmission range, the false alarm ratio is almost negligible (i.e., 1.5%). However, the false alarm ratio is around 15% when the transmission range is 200m. Based on these observations, we can conclude that in our target application of forest monitoring, PADRA+ can be employed, as the transmission range of the robots will be higher when compared to normal sensors. In addition, the messaging cost/delay is not of concern, since the nodes are assumed to be more powerful and the determination of cut-vertices is done upfront. PADRA, on the other hand, can be used in sensor networks where the transmission range is smaller and the messaging energy cost can be significant.

TMN: We evaluated the movement performance of our approach by varying the number of nodes (20 to 100). Figure 8b shows that the different versions of PADRA performed very close to optimal cascading while significantly outperforming DARA. Our approach maintains similar TMNs even when the network size grows, as seen in Figure 8b, indicating its scalability. The reason for such good scalability is the termination of the replacements when a dominatee is hit. Since the dominatees can be anywhere in the network (i.e., not just at the leaf nodes), there is a high probability of reaching one earlier than a leaf node, and this is independent of the network size. Note that this is not the case in DARA, which needs to look for a leaf node to stop, making its performance even worse as the network size grows. Optimal cascading performs slightly better than PADRA and PADRA+ as it determines the shortest path to the closest dominatee each time. This is also the case with PADRA-DP, since it does not employ a greedy approach for determining the closest dominatee.

Figure 8. (a) % of nodes which are falsely identified as cut-vertices in PADRA, (b) TMN with varying number of nodes (r = 100m), (c) TMN with varying node transmission range (# of nodes is 40), (d) TMN with varying node transmission range (# of nodes is 40), (e) Total number of messages with varying number of nodes, (f) Total movement distance of the nodes in MPADRA (r = 100m), (g) Total movement distance of the nodes with varying radio range in MPADRA (# of nodes is 60), (h) Total coverage change in MPADRA.

However, we note that the performance of PADRA-DP is comparable to that of PADRA+. This can be attributed to the type of the failed node. If the node is identified as a cut-vertex in both approaches, then PADRA-DP will reduce the travel distance compared to PADRA+ as a result of using dynamic programming rather than a greedy search. However, if the failed node is falsely identified as a cut-vertex (i.e., a false alarm) in PADRA-DP, then PADRA+ will not replace any nodes and thus will be better in terms of travel distance. We also conducted a similar experiment by varying the transmission range (50 to 200m), as seen in Figure 8c. Again, PADRA outperformed DARA for the same reason mentioned above. Note, however, that as the transmission range increases, surprisingly, the travel distance also increases. PADRA, PADRA+ and PADRA-DP keep the same rate of increase as the optimal solution, while DARA performs worse at higher transmission ranges. One explanation for this increase is the larger transmission range and thus the longer travel distance from one node to another. In addition, cut-vertices become rare at higher transmission ranges, since the network connectivity is improved. Therefore, if there is a cut-vertex, it usually connects blocks that are far apart, which eventually increases the travel distance. Since the path for DARA replacements is longer, its travel distance gets even worse with increased transmission range. For the same reason of encountering fewer cut-vertices as the transmission range increases, the false alarms for PADRA-DP will be higher, which makes PADRA+ slightly outperform PADRA-DP. When comparing PADRA with PADRA+, the latter performs better as it minimizes the error in identifying the cut-vertices. However, PADRA+ achieves this by doubling the message cost, as we discuss next. As far as PADRA-DP is concerned, although it does introduce false alarms in identifying cut-vertices, this disadvantage can be compensated for through the use of shortest paths to dominatees.

Total Number of Messages: We also compared the number of messages sent when determining the cut-vertices, designating primary failure handlers and replacing the failed nodes. The simulation results confirmed that our approach requires significantly fewer messages for the whole failure handling process when compared to optimal cascading, as seen in Figure 8d. Even PADRA+, we observed, is much better on average, although it doubled the number of messages compared to PADRA, as shown in Figure 8e. We also observed that the number of messages in PADRA-DP is significantly less than that of PADRA+. This result is consistent with Theorems 3 and 4 proved in Section 4.

6.3 Performance Evaluation of MPADRA
We performed the same experiments in order to assess the performance of MPADRA. This time, two random nodes are selected to fail simultaneously. These nodes are picked among the cut-vertices, and PADRA is used to determine those cut-vertices. Note that there may be cases where the simultaneous failure of two non-cut-vertices also causes partitioning in the network. Since we are using PADRA, it treats all the dominators as cut-vertices and thus any picked pair can be handled. Clearly, if the two failed nodes are both dominatees, then no action is needed, as no partitioning will occur. If only one of them is a dominator, then running PADRA will be enough. If both are dominators, then they are treated as cut-vertices and MPADRA is executed. For the experiments, we picked every pair of cut-vertices in a topology and tried 30 different topologies. We compared the performance to the optimal solution in terms of TMN. We also assessed the message overhead and the change in network coverage during these experiments. The coverage range of a node


is assumed to be 50m.

TMN: We varied both the number of nodes and the radio range. The results in Figures 8f and 8g show that MPADRA restores the connectivity of the network successfully for every pair of failed nodes and maintains the same TMN performance ratio to the optimal solution even as the network scales. As expected, the movement distance of the optimal solution is slightly better. However, the optimal solution is a centralized approach and requires a large amount of information exchange, as will be explained later. We observe that the distributed handling of double failures costs more in terms of TMN than handling a single node failure. This is not surprising, since it is similar to running PADRA twice to restore connectivity; MPADRA, on the other hand, has the ability to handle the failures in parallel. There are two more observations from the graphs. First, the TMN decreases or stays stable with increasing number of nodes and radio range. This is due to improved connectivity and thus closer replacement candidates for the failed nodes. Second, with increasing number of nodes and radio range, the performance gap between MPADRA and the optimal solution starts to decrease, since there are more dominatees to pick from when the connectivity is improved. This reduces the chance that a path reservation fails in our approach, and thus the travel distance decreases.

Total messages: We also counted the number of messages sent during connectivity restoration in MPADRA in order to verify its consistency with the analytical results. First, we increased the number of nodes and counted the messages. The number of messages for each node count is shown in Table 4a. These results are consistent with the O(n) message complexity proved in Theorem 9: the ratio of the number of messages to the number of nodes stays roughly constant, indicating linear message complexity.
When the radio range is increased with a fixed number of nodes, the number of messages does not change significantly, as seen in Table 4b. There is only a minor decrease as the radio range gets larger. This is due to being able to find closer nodes for replacement; as a result, the path used to send reservation and replacement messages gets shorter, which saves some messages. Note that we have not made a comparison to the number of messages of the optimal solution. The optimal solution is only possible through a central server that knows all the nodes and can communicate with them remotely. That is, each PFH would have to contact the central server, which then tells the appropriate nodes when and where to move. A distributed solution is not possible even if we assume that each node knows the whole topology (i.e., with O(n²) total message complexity). This is because each node would need to flood the network before it starts its replacements. However, if the network is partitioned, this may introduce further partitioning problems, as the PFHs will not be able to

coordinate. Therefore, the message complexity of the optimal solution is not applicable here.

Total coverage: To assess the effect of movements on coverage, we recorded the initial and final network coverage for different numbers of nodes, as seen in Figure 8h. The results show that the coverage change due to movements is very small in MPADRA. This is not surprising, since the topology change is minimal (i.e., only dominatee nodes are removed from the topology). MPADRA is suitable for our targeted forest monitoring application, since the actors can only know their local neighborhood. However, in target tracking applications where actors can communicate through an unmanned aerial vehicle (UAV) intermittently, the optimal solution can be computed at the UAV and communicated to the appropriate actors. In addition, we note that the effect on coverage is negligible with two failures. However, when many failures are considered, a topology adjustment in terms of coverage using [14] may become necessary if the initial network is more than 1-covered.

Table 4. Number of messages in MPADRA for (a) varying number of nodes, (b) varying radio range

(a)
Nodes            20     40     60     80     100
Avg # of Msgs    85.7   169.7  249.2  325.3  404.7

(b)
Range            50     100    150    200
Avg # of Msgs    251.1  249.2  245.8  244.3
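The near-linear growth reported in Table 4(a) can be checked with a quick least-squares fit over the published counts; the small residuals are consistent with the O(n) bound of Theorem 9.

```python
# Message counts from Table 4(a); a least-squares line fit checks
# that the growth is consistent with a linear, O(n), message complexity.
nodes = [20, 40, 60, 80, 100]
msgs = [85.7, 169.7, 249.2, 325.3, 404.7]

n = len(nodes)
mean_x = sum(nodes) / n
mean_y = sum(msgs) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(nodes, msgs)) / \
        sum((x - mean_x) ** 2 for x in nodes)
intercept = mean_y - slope * mean_x

# Residuals stay small relative to the counts, i.e. growth is near-linear.
residuals = [y - (slope * x + intercept) for x, y in zip(nodes, msgs)]
print(f"slope={slope:.2f} msgs/node, max residual={max(abs(r) for r in residuals):.2f}")
```

The fit gives roughly 4 messages per node with residuals below 3 messages, so the messages-per-node ratio is essentially constant across the tested network sizes.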

7 Conclusion and Future Work

In this paper, we presented a local, distributed and movement-efficient protocol, PADRA, to handle the failure of any node in a connected WSAN. As it is unknown before a failure occurs whether it will cause a partitioning, we provided a technique based on the CDS of the network that can decide whether a node is a cut-vertex before the failure happens. Basically, if a node determines that it is a cut-vertex, its closest dominatee/neighbor is delegated to perform failure recovery on its behalf. The failure recovery is done by determining the closest dominatee node and moving it toward the failed node's position in a cascaded manner, so that the movement load is shared among the nodes that sit on the path to that closest dominatee. We further extended PADRA to handle multiple simultaneous failures in a distributed manner. The proposed approach, MPADRA, first reserves the nodes on the path to the closest dominatee before the replacements are performed. This is needed since a node may be requested to move by two failure handlers simultaneously, and thus disconnections and race conditions can occur. We maintained a state machine at each node in order to eliminate such race conditions. Once the nodes are reserved for handling a certain failure, the replacements can be done safely.
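The reserve-then-cascade mechanism summarized above can be sketched as follows. This is a simplified illustration, not the paper's actual protocol: `Node`, `reserve_path` and `cascade` are hypothetical names, and the per-node state machine is reduced to a single reservation flag.

```python
import math

class Node:
    """Minimal stand-in for a movable actor (illustrative only)."""
    def __init__(self, nid, pos, dominatee=False):
        self.nid, self.pos, self.dominatee = nid, pos, dominatee
        self.reserved_by = None  # MPADRA-style reservation flag

def reserve_path(path, handler_id):
    """Reserve every node on the path, or roll back entirely —
    the mutual exclusion between concurrent failure handlers."""
    for node in path:
        if node.reserved_by not in (None, handler_id):
            for n in path:                # back off: release what we grabbed
                if n.reserved_by == handler_id:
                    n.reserved_by = None
            return False
        node.reserved_by = handler_id
    return True

def cascade(failed_pos, path):
    """Each node shifts to its predecessor's position, so the total
    movement is shared along the path to the closest dominatee."""
    moved = 0.0
    target = failed_pos
    for node in path:
        moved += math.dist(node.pos, target)
        node.pos, target = target, node.pos   # move, then vacate own spot
    return moved

# Failed cut-vertex at (0, 0); reserved path ends at a dominatee.
path = [Node(1, (10.0, 0.0)), Node(2, (20.0, 0.0)),
        Node(3, (30.0, 0.0), dominatee=True)]
assert reserve_path(path, handler_id=99)
total = cascade((0.0, 0.0), path)
print(f"total movement shared over {len(path)} nodes: {total:.1f} m")
```

In this toy run each node travels only one hop (10m) instead of the dominatee travelling the full 30m, and a second handler attempting `reserve_path` on the same nodes would fail and back off, mirroring the race-condition avoidance described above.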


We analyzed the performance of PADRA and MPADRA both analytically and experimentally. Simulation results confirmed that PADRA performed very close to the optimal solution in terms of travel distance while keeping the approach local and thus minimizing the message complexity. These results were consistent with the analytical results derived. In addition, our approach outperformed DARA, which requires two-hop neighborhood knowledge at each node, in terms of travel distance. Simulation results for MPADRA confirmed that it can restore connectivity in a distributed manner. While the centralized optimal solution performs better in terms of travel distance, the message complexity of MPADRA is linear and it does not require access to the whole topology information. As future work, we plan to work on heuristics to improve the TMN performance of MPADRA. We also plan to look at coverage issues in conjunction with connectivity, where the movement of a subset of nodes would cause sensing coverage holes for some sensing modalities. We plan to study a new approach that can handle both connectivity and coverage at the same time at the expense of slightly increased movement and messaging cost. Finally, we will test the performance of the approach in a real set-up consisting of a network of mobile robots. A small prototype network consisting of 3 mobile PDX Robots [22] has already been created. We plan to extend this network by adding more sensors to test all the protocols proposed in this paper.

References

[1] G. T. Sibley and M. H. A. Rahimi, "Robomote: A tiny mobile robot platform for large-scale ad-hoc sensor networks," in IEEE International Conference on Robotics and Automation, 2002.
[2] M. B. McMickell, B. Goodwine, and L. A. Montestruque, "Micabot: A robotic platform for large-scale distributed robotics," in ICRA, 2003, pp. 1600–1605.
[3] I. F. Akyildiz and I. H. Kasimoglu, "Wireless sensor and actor networks: Research challenges," Elsevier Ad Hoc Networks Journal, vol. 2, 2004, pp. 351–367.
[4] M. Mysorewala, D. Popa, V. Giordano, and F. Lewis, "Deployment algorithms and in-door experimental vehicles for studying mobile wireless sensor networks," in Proceedings of 2nd ACIS-SAWN, Las Vegas, Nevada, June 2006.
[5] M. Younis and K. Akkaya, "Strategies and techniques for node placement in wireless sensor networks: A survey," Elsevier Ad Hoc Networks Journal, vol. 6, no. 4, 2008, pp. 621–655.
[6] R. C. Shah, S. Roy, S. Jain, and W. Brunette, "Data mules: Modeling a three-tier architecture for sparse sensor networks," in IEEE SNPA'03, Anchorage, AK, May 2003.
[7] A. Kansal, A. A. Somasundara, D. D. Jea, M. B. Srivastava, and D. Estrin, "Intelligent fluid infrastructure for embedded networks," in MobiSys'04, Boston, MA, June 2004.
[8] E. Lloyd and G. Xue, "Relay node placement in wireless sensor networks," IEEE Transactions on Computers, vol. 56, no. 1, 2007, pp. 134–138.
[9] D. Goldenberg, J. Lin, A. S. Morse, B. Rosen, and Y. R. Yang, "Towards mobility as a network control primitive," in MobiHoc'04, Tokyo, Japan, 2004.
[10] K. Akkaya and M. Younis, "Coverage and latency aware actor placement mechanisms in wireless sensor and actor networks," International Journal of Sensor Networks, special issue on Coverage Problems in Sensor Networks, to appear.
[11] M. Ma and Y. Yang, "Adaptive triangular deployment algorithm for unattended mobile sensor networks," IEEE Transactions on Computers, vol. 56, no. 7, 2007, pp. 946–958.

[12] K. Akkaya and M. Younis, "Coverage-aware and connectivity-constrained actor positioning in wireless sensor and actor networks," in IEEE IPCCC'07, New Orleans, Louisiana, April 2007.
[13] W. Wang, V. Srinivasan, and K. C. Chua, "Using mobile relays to prolong the lifetime of wireless sensor networks," in MobiCom'05, Cologne, Germany, 2005.
[14] J. Wu and S. Yang, "SMART: A scan-based movement-assisted sensor deployment method in wireless sensor networks," in IEEE INFOCOM, Miami, FL, March 2005.
[15] G. Wang, G. Cao, T. La Porta, and W. Zhang, "Sensor relocation in mobile sensor networks," in Proceedings of INFOCOM, Miami, FL, March 2005.
[16] A. Abbasi, K. Akkaya, and M. Younis, "A distributed connectivity restoration algorithm in wireless sensor and actor networks," in IEEE LCN'07, Dublin, Ireland, Oct. 2007.
[17] P. Basu and J. Redi, "Movement control algorithms for realization of fault-tolerant ad hoc robot networks," IEEE Network, no. 4, 2004, pp. 36–44.
[18] K. Akkaya, F. Senel, A. Thimmapuram, and S. Uludag, "Distributed recovery of actor failures in wireless sensor and actor networks," in IEEE WCNC, Las Vegas, NV, Sept. 2008.
[19] S. D. et al., "Localized movement control for fault tolerance of mobile robot networks," in Proceedings of the First IFIP International Conference on WSANs, Albacete, Spain, Sept. 2007.
[20] F. Dai and J. Wu, "An extended localized algorithm for connected dominating set formation in ad hoc wireless networks," IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 10, 2004.
[21] R. Braden, L. Zhang, S. Berson, S. Herzog, and S. Jamin, "Resource ReSerVation Protocol (RSVP) – Version 1 Functional Specification," RFC 2205, September 1997.
[22] "General purpose robot, Pioneer DX," http://www.mobilerobots.com, Nov. 2008.

Kemal Akkaya received his PhD in Computer Science from the University of Maryland Baltimore County in 2005.
Currently, he is an assistant professor in the Department of Computer Science at Southern Illinois University Carbondale. His research interests include energy aware routing, security and quality of service issues in ad hoc wireless and sensor networks.


Fatih Senel received his MS degree in Computer Science from Southern Illinois University Carbondale in 2008. He is currently a PhD student in the department of Computer Science at University of Maryland Baltimore County. His research interests include clustering, topology control and fault-tolerance in wireless sensor/actor networks.

Aravind Thimmapuram received his MS degree in Computer Science from Southern Illinois University Carbondale in 2007. He is currently with Object Technology Solutions, Inc. His research interests include relocation and faulttolerance in wireless sensor/actor networks.

Suleyman Uludag is an assistant professor at the University of Michigan - Flint. He received his Ph.D. from DePaul University in 2007. His research interests include guaranteed and stochastic routing in wired, wireless mesh and sensor networks, topology aggregation and channel assignment in wireless mesh networks.
