Fault Detection in MANETs

David Kidston and Li Li
Communications Research Centre Canada, Ottawa, Canada
{david.kidston, li.li}@crc.ca

Walee Al Mamun and Hanan Lutfiyya
Department of Computer Science, The University of Western Ontario, London, Canada
{walee, hanan}@csd.uwo.ca

Abstract- Node faults may be frequent in a mobile ad hoc network (MANET). Most work on fault detection and localization for MANETs essentially uses changes in topology to identify faults. Most of this work does not distinguish between a functioning node that has moved out of range of all other nodes and a node that has crashed, for example because its radio transmitter malfunctioned or its battery stopped operating. This paper introduces a novel protocol that makes this distinction. Nodes forward their neighbourhood observations to a cluster head. The cluster head uses these observations to determine a list of suspected crashed nodes and sends this list to all the other nodes, whose replies are used to filter out the nodes that have moved out of range. The cross-layer implementation of the protocol is effective in distinguishing between crashed nodes and nodes that have moved out of range.

Keywords- cross-layer, data dissemination, fault detection, MANET

I. INTRODUCTION

A mobile ad hoc network (MANET) is an autonomous system of mobile nodes connected by wireless links [4]. Wireless links provide lower bandwidth and higher error rates than the links of fixed networks. MANETs can, however, be deployed quickly and without large infrastructure investments in order to complete assigned tasks, e.g., military operations, emergency situations, and sensory environments. The nodes may be resource constrained, have limited battery power, and, because of mobility, must continuously monitor and react to changes in their transmission neighbourhood.

At a given point of time t, the topology of the MANET can be described as a directed graph Gt = (Vt,Et) where Vt is the set of nodes and Et is the set of links at time t. For any two nodes u,v ∈ Vt, (u,v) is in Et if the transmitter of u can reach v. The nodes u and v are then said to be one hop neighbours.

The topology of a MANET may change because a node has crashed, but it may also change because a node has moved such that it has different neighbours or has moved out of range. A node that has crashed and a node that has moved out of range have the same impact on the topology: the node is no longer in Vt. This makes it difficult to tell the two cases apart. The distinction is needed in some environments, e.g., tactical networks. A node that often moves out of range may be viewed suspiciously and thus there may be reluctance to assign tasks to that node.

978-1-4673-0269-2/12/$31.00 © 2012 IEEE

There is considerable work in fault detection that essentially uses changes in topology to identify faults. However, this work is limited. Most current algorithms either falsely determine that a node that has moved out of range has crashed, or restrict the topology changes they can handle (e.g., [6], [2], [7], [1], [3]). Most of this work assumes a static topology or no mobility during diagnosis.

This paper proposes a novel protocol for distinguishing between crashes and nodes that have moved out of transmission range. The management model assumed is based on those found in existing work (e.g., [5]) that shows the effectiveness of having a coordinator node or cluster head. We chose this model as a starting point because of the need for a single node to make decisions. The protocol allows topology changes to occur at any time and tolerates messages being lost in transmission. It builds on an innovative approach to data dissemination that uses spanning trees instead of flooding [4].

The remainder of the paper is organized as follows. Section II describes our detection scheme. Section III describes the implementation. Section IV presents the results of experiments. Finally, the paper concludes with a summary.

II. DETECTION PROTOCOL

This section describes a protocol that is used to distinguish between node crashes and nodes moving out of range. The protocol assumes that each node runs the same algorithm. The algorithm is executed periodically in fixed time intervals. Each run is referred to as an iteration. The algorithm consists of the following steps:

1) Each node determines its neighbours.
2) The neighbourhood information is sent to the cluster head.
3) The cluster head uses this information to determine the topology.
4) The cluster head uses the topology information collected in the previous iteration and the current iteration to determine a set of nodes that are suspected to have crashed.
5) This set of nodes suspected of crashing is sent to all other nodes.
6) A node sends a message to the cluster head if it believes that a node suspected of crashing has not actually crashed.
7) The cluster head uses these messages to decide whether a node has crashed or moved out of range.

The rest of this section describes the design and implementation of each of these steps.

A) Neighbourhood Discovery

The messages associated with neighbourhood discovery are the DISCOVER and RESPONSE messages. Every T time units a node sends a DISCOVER message to all of its one hop neighbours.

The DISCOVER message contains the following information about the sender node: (node identifier, node address). Its purpose is to find one hop neighbours. The sender of the DISCOVER message waits for a bounded amount of time for responses. N(u)t denotes the set of neighbours found for node u in response to the DISCOVER message sent by node u at time t. Every node that receives the DISCOVER message adds the sender to its own neighbour list and replies with a RESPONSE message. When node u, the sender of the DISCOVER message, receives a RESPONSE message, it checks whether the responder, node v, is currently considered a neighbour. If not, v is added to N(u)t. Once the sender of the DISCOVER message has waited for the specified amount of time, it assumes that it has received all of the RESPONSE messages and that no more neighbouring nodes will reply.

B) Sending Neighbourhood Information to Cluster Head

The information gathered in subsection II.A is sent to the cluster head through a spanning tree whose root is the cluster head. A spanning tree is used because all nodes have to send data, which would make sending each report directly to the cluster head inefficient. The spanning tree is constructed using the approach described and analyzed in [4]. After a pre-defined period of time a leaf node sends its neighbourhood information to its parent using a REPORT message. The REPORT message consists of the one-hop neighbours of the sending node. A parent node takes the data received from its child nodes and sends this data, along with its own neighbourhood information, to its parent node. This process ends at the cluster head, which determines the topology from the neighbourhood sets it has received. The amount of time that a node waits for its child nodes to send neighbourhood information is bounded.
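As an illustration of the reporting step, the following Python sketch merges neighbourhood sets up a spanning tree and assembles the topology at the root. It is a minimal sketch under stated assumptions, not the authors' implementation: recursion stands in for REPORT messages, timeouts and message loss are ignored, and the names `collect_report` and `build_topology` are hypothetical. It also assumes that a node counts as visible if it either reported or was listed as a neighbour, which the text does not specify.

```python
from typing import Dict, List, Set, Tuple

# A report maps each node identifier to its set of one-hop neighbours.
Report = Dict[str, Set[str]]


def collect_report(node: str,
                   children: Dict[str, List[str]],
                   neighbours: Dict[str, Set[str]]) -> Report:
    """Return the REPORT a node would send to its parent.

    The report holds the node's own neighbour set plus everything
    received from its children. Recursion replaces message passing
    (an assumption made for this sketch).
    """
    report: Report = {node: set(neighbours.get(node, set()))}
    for child in children.get(node, []):
        report.update(collect_report(child, children, neighbours))
    return report


def build_topology(report: Report) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Derive (V, E) at the cluster head from the merged report."""
    edges = {(u, v) for u, nbrs in report.items() for v in nbrs}
    # Assumption: nodes that reported or were heard by a reporter are visible.
    nodes = set(report) | {v for _, v in edges}
    return nodes, edges


# Hypothetical four-node cluster; "h" is the cluster head (tree root).
children = {"h": ["a", "b"], "a": ["c"]}
neighbours = {"h": {"a", "b"}, "a": {"h", "c"}, "b": {"h"}, "c": {"a"}}
nodes, edges = build_topology(collect_report("h", children, neighbours))
```

Each directed pair (u, v) in `edges` means that u listed v as a one-hop neighbour, which matches the directed-graph view of the topology used in the introduction.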
If a parent node u does not receive data from a child node v within the specified period of time (because node v has moved or a message was lost), then u uses the data it had from the previous iteration. Thus the data used by the cluster head to determine the topology may be out of date. As a result, multiple iterations may be needed to detect a crashed node or to determine that a node has moved out of range.

C) Detecting Non-Visible Nodes

The cluster head maintains Gt = (Vt,Et) and Gt-1 = (Vt-1,Et-1). The set Dt = Vt-1\Vt contains the nodes that are in Gt-1 but not in Gt. A node u can be in Dt for any of the following reasons: node u has moved out of range, node u has crashed, or a message was lost when a child sent information to its parent in the spanning tree. Message loss occurs since MANETs typically use an unreliable transport protocol, e.g., UDP. Regardless of the reason, node u is considered to be not visible and is placed in the set crashedSet. This set represents the nodes that are suspected of crashing. The cluster head sends crashedSet, consisting of all non-visible nodes, to all of the nodes within the cluster through the spanning tree. If desired

the cluster head can send the network topology down the tree so that all nodes are aware of movements within the network.

D) Node Movement Trends

When a node u receives crashedSet from the cluster head in iteration t, it determines the movement trend of each node v ∈ N(u)t. Movement trends are derived from signal-to-noise ratio (SNR) values. The trends determined in iteration t are used in iteration t+1 to decide whether a node v ∈ crashedSet has crashed or moved out of range, as explained below. A node v may be moving towards node u, moving away from node u, or may have stopped moving (indicating that v has crashed).

SNR values are collected every second. For each pair of nodes u and v, SNRu,v denotes the sequence of the last n SNR values sampled by node u for its neighbour v. $SNR_{u,v}^{n-1}$ is the most recent SNR value sampled by node u for neighbour v and $SNR_{u,v}^{0}$ is the least recent sampled value. SNRu,v is a sliding window of values. If node v has stopped transmitting or has moved out of range then it cannot be detected and $SNR_{u,v}^{n-1}$ is assigned the nil value.

SNR values are used to determine node movement trends. A node v is considered to be moving closer to u if

$$SNR_{u,v}^{i} > SNR_{u,v}^{i-1} \quad \forall\, 0 < i < n \qquad (1)$$

A node v is considered to be moving away if

$$SNR_{u,v}^{i} < SNR_{u,v}^{i-1} \quad \forall\, 0 < i < n \qquad (2)$$

It may be the case that the n sampled SNR values do not monotonically increase or decrease. If this is the case then only the last two samples are used to determine node movement. Special consideration must be made when there exists an m such that

$$SNR_{u,v}^{i} \neq nil \quad \forall\, 0 \le i < m \quad \text{and} \quad SNR_{u,v}^{i} = nil \quad \forall\, m \le i < n \qquad (4)$$

This suggests that node v has moved out of range or has crashed, since u can no longer detect a signal from node v. To make this distinction we choose a value, $\sigma$, that represents a small SNR value. If the following is true then node v is considered to be moving away from node u:

$$0 < SNR_{u,v}^{m-1} < \sigma \qquad (5)$$

Otherwise node v is assumed to have crashed. If there exist an l and an m such that

$$SNR_{u,v}^{i} = nil \quad \forall\, l \le i < m \quad \text{and} \quad SNR_{u,v}^{i} \neq nil \quad \forall\, m \le i < n \qquad (6)$$

then the movement trend is based on $SNR_{u,v}^{m} \cdots SNR_{u,v}^{n-1}$. This allows us to consider cases where a node has moved out of transmission range but has moved back in.
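The trend rules in conditions (1), (2), (4), (5) and (6) can be read as a small classification function over the sliding window. The Python sketch below is one possible reading, not the authors' code: the function name `classify_trend`, the string labels, the use of `None` for nil, and the "unknown" result for cases the text leaves open (no samples, a single valid sample, two equal samples) are all assumptions. It also treats any trailing run of nil values like condition (4), even if earlier samples contain nil.

```python
from typing import List, Optional

NIL = None  # stands for the nil value of an undetected sample


def classify_trend(window: List[Optional[float]], sigma: float) -> str:
    """Classify neighbour v from u's SNR window (oldest sample first).

    Returns "towards", "away", "crashed" or "unknown".
    """
    n = len(window)
    # m = index just past the last non-nil sample.
    m = n
    while m > 0 and window[m - 1] is NIL:
        m -= 1
    if m == 0:
        return "unknown"  # no sample at all (case not covered by the text)
    if m < n:
        # Trailing nil values, conditions (4)/(5): a weak last signal
        # suggests the node drifted out of range, otherwise a crash.
        return "away" if 0 < window[m - 1] < sigma else "crashed"
    # Most recent sample is valid: use the trailing run of non-nil
    # samples, as in condition (6).
    start = n
    while start > 0 and window[start - 1] is not NIL:
        start -= 1
    run = window[start:]
    if len(run) < 2:
        return "unknown"
    pairs = list(zip(run, run[1:]))
    if all(b > a for a, b in pairs):
        return "towards"   # condition (1)
    if all(b < a for a, b in pairs):
        return "away"      # condition (2)
    # Not monotone: fall back to the last two samples.
    prev, last = run[-2], run[-1]
    if last > prev:
        return "towards"
    if last < prev:
        return "away"
    return "unknown"       # equal samples (assumption: no trend)
```

For example, with a hypothetical threshold of 5.0, the window `[10.0, 3.0, None, None]` is classified as moving away because the last valid sample is weak, while `[10.0, 8.0, None, None]` is classified as crashed because the signal disappeared while still strong.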

Each node u maintains two sets: MO(u)t and MI(u)t. The set MO(u)t contains the nodes that node u has determined are moving away from u in the t-th iteration, while MI(u)t contains the nodes that u has detected are moving towards it. The nodes in MO(u)t-1 and MI(u)t-1 are checked to determine whether a non-visible node was moving or has crashed. In either case a message is sent to the cluster head. Messages provide the direction of movement. The sets from iteration t-1 are checked because a node is suspected of having crashed when it appears in Dt, i.e., when it was visible in iteration t-1 but not in iteration t. If a node v is in neither MO(u)t-1 nor MI(u)t-1 then no message is sent to the cluster head, since node u cannot provide any information indicating that the suspected node has not actually crashed.

E) Determining Crashed or Moving Nodes

The cluster head uses the messages to make the final decision about the status of each non-visible node. For node v ∈ crashedSet, m(v)i denotes a message from the ith node and m(v)i.d represents the direction. Let Mv represent the set of messages received for node v. The following analysis takes place:

Case 1: If Mv is empty then no node has provided additional information suggesting that node v has not crashed. Node v remains in crashedSet.
Case 2: If m(v)i.d = out for all m(v)i in Mv then crashedSet = crashedSet \ {v}. Since node v is considered out of range, v can be placed in another set.
Case 3: If there exists an i such that m(v)i.d = out and there exists a j such that m(v)j.d = in then we assume that the node has crashed. No adjustments are made to crashedSet. This assumption may not be correct but we choose to be pessimistic.

F) Other Adjustments to crashedSet

If a node u was in Gt but not in Gt+1 then it is added to crashedSet. If node u recovers from a crash at some later point (e.g., in Gt+l) then u is removed from crashedSet.
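The cluster head logic of subsections C, E and F can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the message representation (a dictionary from node identifier to a list of "in"/"out" directions) are hypothetical, and the case where every message says "in" is not covered by the text, so the sketch pessimistically keeps such a node in crashedSet.

```python
from typing import Dict, List, Set, Tuple


def non_visible(prev_nodes: Set[str], curr_nodes: Set[str]) -> Set[str]:
    """D_t: nodes present in the previous topology but not the current one."""
    return prev_nodes - curr_nodes


def update_crashed_set(crashed: Set[str],
                       prev_nodes: Set[str],
                       curr_nodes: Set[str]) -> Set[str]:
    """Add newly non-visible nodes and drop nodes that have reappeared."""
    return (crashed | non_visible(prev_nodes, curr_nodes)) - curr_nodes


def resolve(crashed: Set[str],
            messages: Dict[str, List[str]]) -> Tuple[Set[str], Set[str]]:
    """Split crashedSet into (still suspected, considered out of range)."""
    still_suspected: Set[str] = set()
    out_of_range: Set[str] = set()
    for v in crashed:
        directions = messages.get(v, [])
        if directions and all(d == "out" for d in directions):
            out_of_range.add(v)      # Case 2: every report says "out"
        else:
            # Case 1 (no messages), Case 3 (mixed directions), and the
            # all-"in" case, which is an assumption of this sketch.
            still_suspected.add(v)
    return still_suspected, out_of_range


# Hypothetical iteration: c, d and e disappear from the topology.
prev_nodes = {"a", "b", "c", "d", "e"}
curr_nodes = {"a", "b"}
crashed = update_crashed_set(set(), prev_nodes, curr_nodes)
messages = {"c": ["out", "out"], "d": ["out", "in"]}
suspected, moved = resolve(crashed, messages)
```

In this example node c is reported as moving out by every neighbour that saw it and is removed from the suspects, node d receives conflicting reports and stays suspected, and node e stays suspected because nobody reported a trend for it.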
G) Iterations

It is possible that a message is lost since UDP is the typical transport protocol used in MANETs. As a result more than one iteration may be needed to detect a crash or to verify that a node has moved out of range. If a node is in crashedSet for a specified number of iterations then it is assumed that the node has crashed.

III. IMPLEMENTATION NOTES

For simulation, we have developed and evaluated two implementations of the data dissemination protocol [4]. We briefly describe the cross-layer implementation. The experiments presented were based on the cross-layer implementation.

The implementation of neighbourhood discovery can make use of underlying routing protocols. For example, AODV and OLSR both provide neighbourhood discovery. With AODV, each node periodically sends a HELLO message, e.g., every second, to its one hop neighbours. This represents a DISCOVER message. Receiving HELLO messages from the one hop neighbours represents the RESPONSE messages. The monitoring of SNR values for neighbouring nodes does not incur additional overhead since this is already done by network protocols.

The sending of neighbourhood information focuses on the differences in neighbours between the previous iteration and the current iteration. This minimizes message sizes when there are few changes. If the topology is to be sent from the cluster head to all nodes then only differences are sent.

IV. SIMULATION AND PERFORMANCE MEASUREMENTS

In this section, we analyse the performance of the data dispersion protocol based on a Qualnet simulation of the two implementations. The experimental parameters used in these simulations are presented in Table 1.

Parameter                      Value(s)
Antenna Type                   Omni directional
Channel Frequency              2.4 GHz
Pathloss Model                 Two ray
Maximum Propagation Range      400 m
Radio Type                     802.11b
Data Rate                      2 Mbps
Routing Protocol               AODV
AODV Hello Message Interval    1 second
Scenario Area                  1500 x 1500 m
Node Count                     10, 30, 50, 100
Number of runs                 15
Mobility rate                  variable
Percentage of Nodes Moving     10%

Table 1: Simulation Parameters

Experiments were run with 10, 30, 50 and 100 nodes; in each case 10% of the nodes are mobile (random waypoint model). The mobility rate differed from node to node. Nodes were placed within close proximity for the 30 node scenario and more dispersed for the 10 and 50 node scenarios. Each experiment was repeated 15 times and the average outcome is given in the results. There were six iterations.
Detection time is measured using the number of iterations.

A) Detecting single node crash/moving out of range

This section analyzes protocol behaviour for a single node. We show the results of two experiments. In one experiment a single node crashed and in another experiment a single node moved out of range. We were able to detect both the crash and the move out of range with 100% accuracy for all networks. The table below shows the detection time in number of iterations.

Number of nodes    Crashed detection time    Moved out of range detection time
100 nodes          3 iterations              3 iterations
50 nodes           2 iterations              1 iteration
30 nodes           1 iteration               1 iteration
10 nodes           1 iteration               1 iteration

B) Multiple Nodes

In another set of experiments we allowed 10% of the total nodes to move out of range or crash. We recorded the detection time and the accuracy of detection for these experiments. We repeated these experiments for the 10, 30, 50 and 100 node scenarios. Of the 10% of nodes selected, roughly half crashed and half moved out of range. The following table shows the detection time.

Number of nodes    Crashed detection time    Moved out of range detection time
100 nodes          3 iterations              4 iterations
50 nodes           2 iterations              3 iterations
30 nodes           1 iteration               2 iterations
10 nodes           1 iteration               1 iteration

The following table shows the detection accuracy for 10, 30, 50 and 100 nodes when 10% of the nodes were allowed to move out of range or crash. The second column shows the percentage of nodes whose movement (out of network range) or crash was correctly detected.

Number of nodes    % of correct detection    % of incorrect detection
100 nodes          70                        30
50 nodes           80                        20
30 nodes           100                       0
10 nodes           100                       0

The following table shows the percentage of correct detection broken down by the type of event. The second column shows the percentage of correct detection of nodes moving out of range and the third column shows the percentage of correct detection of crashes. Each percentage is calculated by comparing the number of nodes suspected (of crashing or of moving out of range) with the total number of nodes that actually crashed or moved out of range.


Number of nodes    % of moving out of range detection    % of crash detection
100 nodes          60                                    80
50 nodes           100                                   50
30 nodes           100                                   100
10 nodes           100                                   100

V. CONCLUSIONS AND FUTURE WORK

In this paper we have proposed a novel cross-layer protocol that distinguishes between a crashed node and a node that has moved out of range. In simulations where several nodes crashed or moved out of range, detection accuracy was 100% for the 10 and 30 node networks, 80% for the 50 node network and 70% for the 100 node network. While our current scheme considers only a single k-cluster, global node disconnection detection requires inter-cluster communication. We are also investigating the use of other routing schemes as a substitute for AODV in carrying failure information, as well as modifying our current AODV implementation. We are investigating how this scheme can be used to detect and respond to other types of failure, including distinguishing between node disconnection and other types of node failure, using cross-layer information and policy-based response. We are also interested in applying the use of SNR values to gossip approaches.

REFERENCES

[1] S. Chessa and P. Santi, "Comparison-based system-level fault diagnosis in ad hoc networks", Proceedings of IEEE Symposium on Reliable Distributed Systems, 2001, pp. 257-266.
[2] M. Elhadef, A. Boukerche and H. Elkadiki, "Performance Analysis of a Distributed Comparison-Based Self-Diagnosis Protocol for Wireless Ad-Hoc Networks", MSWiM'06, 2006.
[3] M. Elhadef and A. Boukerche, "A Failure Detection Service for Large-Scale Dependable Wireless Ad-Hoc and Sensor Networks", The Second International Conference on Availability, Reliability, and Security, 2007.
[4] D. Kidston, L. Li, W. Al Mamun and H. Lutfiyya, "Cross-Layer Cluster-Based Data Dissemination for Failure Detection in MANETs", to appear in the 7th International Conference on Network and Service Management (CNSM), 2011.
[5] K. Phanse and L. DaSilva, "Addressing the Requirements of QoS Management for Wireless Ad-Hoc Networks", Computer Communications, Volume 28, number 12, July 2003, pp. 1264-1273.
[6] T. Shen et al., "Neptune: Scalable Replica Management and Programming Support for Cluster-Based Network Services", Proceedings of Usenix Symposium on Internet Technologies and Systems, 2001, pp. 197-208.
[7] N. Sridhar, "Decentralized Local Failure Detection in Dynamic Distributed Systems", Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems, pp. 143-154, 2006.

2012 IEEE Network Operations and Management Symposium (NOMS): Short Papers
