International Journal of Computers and Applications, Vol. 31, No. 4, 2009

A NOVEL HIERARCHICAL CLUSTERING APPROACH FOR DIAGNOSING LARGE-SCALE WIRELESS ADHOC SYSTEMS
P.M. Khilar∗ and S. Mahapatra∗∗

Abstract

and diagnosis is important for the smooth functioning of such systems and has become a main building block in the design of fault-tolerant wireless adhoc systems. Existing diagnosis algorithms for wireless distributed systems incur high diagnosis overhead, in particular high diagnostic latency and message complexity [1–3]. They assume a single initiator node that starts the diagnosis process, creating a centralized bottleneck: if the initiator fails, all other nodes wait for diagnosis information and suffer starvation. We assume a dynamic fault environment that allows nodes to change their status during execution of the algorithm, which is particularly suitable for systems in which mobile nodes move from one place to another while the diagnosis procedure runs. Moreover, the diagnosability of the network is limited by its connectivity [14, 15], which means that if a portion of the network is faulty the entire system may end in a catastrophic situation. This work proposes a new distributed diagnosis approach for large-scale self-diagnosable wireless distributed systems consisting of both MANET and WSN nodes. Our goal is to allow every active node to learn the status of every other node in the system. The passive sensor nodes are tested-only nodes and need not maintain the global diagnostic view of the entire network. If the testing and diagnosis process is correct, our algorithm achieves the diagnosis within the least possible time. The main contributions of this paper are: (i) a hierarchical clustering method that resolves the scalability issue; (ii) static and dynamic fault environments in which both static and mobile nodes in different clusters are subject to crash and value faults; (iii) many (but not all) initiator nodes, which avoids the centralized bottleneck of a single initiator node; and (iv) since not all nodes are initiators, a significant reduction in diagnosis overhead compared to traditional distributed diagnosis techniques in which every node must learn the status of every other node. It is noted that in practical systems there may be many passive, tested-only nodes that are diagnosed but do not keep any diagnosis information about the entire network. The paper is organized as follows: Section 2 describes the system and fault models for adhoc networks in which the nodes either remain stationary

We propose a scalable distributed diagnosis approach for large-scale self-diagnosable wireless adhoc networks that form an arbitrary network topology. The diagnosis strategy assumes multiple initiators of the diagnosis process, in contrast to a centralized single-initiator bottleneck, and also avoids a costly distributed diagnosis algorithm in which every node is an initiator of the diagnosis process.

Key results of this paper include a realistic testing mechanism and fault models, an efficient and scalable distributed diagnosis algorithm using clustering, and a global diagnosis strategy covering all the nodes. The proposed approach has been evaluated analytically as well as through simulation. The diagnosis latency and message complexity of the algorithm were found to be O(ΔTx + lcTf + max(Tout1, Tout2)) and O(ncCs), respectively.

The results show that diagnosis performance is better with the proposed clustering approach than with non-clustering approaches.

Key Words: Adhoc networks, clustering, heartbeat and comparison-based testing, distributed diagnosis, diagnosability, fault tolerance, diagnosis latency, message complexity

1. Introduction
Wireless adhoc networks such as mobile adhoc networks (MANETs) and wireless sensor networks (WSNs) are fully autonomous distributed systems and find use in many applications, such as battlefield, emergency, and rescue operations, as well as environment monitoring. Since they are deployed in harsh and hostile environments, the nodes in the underlying network are subject to crash and value (incorrect computation) faults. A crash-faulty node may not respond because it has moved out of range, is physically damaged, or has suffered battery failure, whereas a value-faulty node may produce erroneous results while processing a data packet. Therefore, fault detection

∗ Department of CSE, NIT, Rourkela-769008; e-mail: khilarpm@yahoo.com
∗∗ Department of E & ECE, IIT, Kharagpur-721302; e-mail: [email protected]
Recommended by Dr. A. Nayak (paper no. 202-2513)


of range. Value-faulty nodes produce and communicate erroneous values while processing the data packet [16]. The crash fault model captures not only nodes that are faulty due to physical damage or battery depletion but also mobile nodes that are out of range; static nodes, however, do not suffer crash faults due to being out of range. Both static and mobile nodes are subject to value faults. The value fault model captures not only erroneous values produced by nodes but also the case in which the remaining energy of a node drops below a certain threshold. Faults are permanent, i.e., nodes remain faulty until they are repaired and/or replaced. New faults may occur during the execution of the diagnosis protocol (i.e., in addition to those already present before the start of the diagnosis session), which constitutes a dynamic fault environment. The dynamic fault model captures cluster heads and initiator nodes that become faulty during execution of the diagnosis algorithm.

or are allowed to move from one place to another. Section 3 describes the distributed diagnosis algorithm using a clustering approach for wireless adhoc systems. The correctness proof and complexity analysis of the algorithm are presented in Section 4, where its cost is computed in terms of diagnosis latency and message complexity. The simulation results are shown in Section 5, followed by concluding remarks in Section 6.

2. Preliminaries
In this section, we describe the system model, fault model, diagnostic model, and the basic definitions with the different notations and their meaning.

2.1 System Model
Consider a system of n wireless nodes (static or mobile) that communicate via radio transceivers. The wireless nodes are either switches or endpoints. In a MANET or WSN, not all deployed devices are identical, and hence we treat some nodes as switches and others as endpoints such as sensor nodes. Each switch is equipped with transceivers and processing logic, whereas each endpoint has only sensing and transceiver logic, without any processing logic as such. A synchronous system is assumed, in which the execution time of diagnosis tasks and the communication delay are bounded. Generic parameters are assumed for executing the diagnosis tasks: the send initiation time and the propagation time of the heartbeat and diagnostic messages. Each node is assigned a unique identifier that can be encoded using log n bits. Each node knows its identifier and the identifiers of its neighbours. We assume that all transmissions from any node u are omni-directional; thus, any node in its neighbourhood, i.e., within its transmitting range, can receive any message sent by u. The neighbourhood of node u is denoted by N(u). The topology of the system can be described as a directed graph G = (V, E), called the communication graph, where V is the set of nodes and E is the set of edges connecting nodes. Given any u, v ∈ V, the directed edge (u, v) ∈ E if and only if v ∈ N(u). We assume that there exists a collision-free, link-level message delivery protocol providing the following services: a MAC protocol is executed to resolve contention, and the protocol provides two communication primitives, 1-hop reliable broadcast (1_hB) and Selective Send (SelSend). Primitive 1_hB(m) delivers a message m to all the nodes in the sender's neighbourhood, while primitive SelSend(v, m) delivers m only to neighbour node v. The communication graph is connected. The network is divided into a number of clusters at various layers in the hierarchy. Every cluster in the network may have one or more mobile nodes along with a set of stationary nodes.
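The neighbourhood relation and the two link-level primitives can be illustrated with a short sketch. This is only a toy model under our own naming (Node, oneHopBroadcast, selSend); the paper does not prescribe an implementation of these primitives.

```cpp
#include <cstdio>
#include <set>
#include <string>

// Minimal sketch of the system model: each node u knows its own identifier
// and the identifiers of its neighbours N(u); the directed communication
// graph G = (V, E) has an edge (u, v) iff v is in N(u).
struct Node {
    int id;                    // unique identifier, encodable in log n bits
    std::set<int> neighbours;  // N(u): nodes within transmission range
};

// 1-hop reliable broadcast: deliver message m to every neighbour of u.
void oneHopBroadcast(const Node& u, const std::string& m) {
    for (int v : u.neighbours)
        std::printf("1_hB: %d -> %d : %s\n", u.id, v, m.c_str());
}

// Selective send: deliver m only to the chosen neighbour v of u.
void selSend(const Node& u, int v, const std::string& m) {
    if (u.neighbours.count(v))
        std::printf("SelSend: %d -> %d : %s\n", u.id, v, m.c_str());
}

int main() {
    Node u{1, {2, 3, 5}};
    oneHopBroadcast(u, "init-heartbeat");   // reaches all of N(u)
    selSend(u, 3, "local-diagnostic");      // reaches only neighbour 3
    return 0;
}
```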

2.3 Diagnostic Model
The diagnostic model used to identify the crash-faulty and value-faulty nodes is based on an adaptive heartbeat and comparison-based technique, as follows. A timeout mechanism is used to detect a crash fault: if a testee node does not respond within a certain timeout period Tout1 or Tout2, we say the node is crash faulty. To detect a value fault, we assume that each wireless node has an estimated value (it may be an energy value or the value of a diagnosis task) already known to it. The observed value obtained from a testee node is compared with the estimated value and their difference is computed; if the deviation is greater than a certain threshold θ, we say that the node is value faulty. The timeout mechanism can be implemented using an interrupt at the tester nodes such that it generates a high observed remaining-energy value. The main reason for adopting this fault model is that it reduces to the state-of-the-art diagnosis mechanism called comparison-based diagnosis [1]. The algorithm assumes that the diagnosis process is initiated by a set of fault-free initiator nodes at the highest layer of clusters, in response to an explicit request of the external operator. There are two types of messages exchanged during the diagnosis execution: fixed-size heartbeat messages and variable-size diagnostic messages. The heartbeat messages are of two types: the initiation heartbeat message and the response heartbeat message. The format of the heartbeat message sent by a node u is (u, v, diagnostic value, message code), where v is the identifier of the node from which u received the first heartbeat message (u itself if u is the initiator), the diagnostic value is the result of the testing task, and the message code is used to identify the type of heartbeat message. Heartbeat messages with the comparison-based mechanism are used to detect the crash-faulty and value-faulty nodes. The diagnostic messages exchanged during the execution of the algorithm are of two types: the local diagnostic message and the global diagnostic message. The format of a diagnostic message sent by a node u is (u, Fu, message code), where Fu is the set of identifiers of the nodes currently diagnosed as faulty by

2.2 Fault Model
The wirelessly connected nodes in the network are subject to crash and value faults. Crash-faulty nodes are unable to communicate with the rest of the system, due either to physical damage, battery depletion, or being out

immediately above. It is assumed that the sensor nodes in the lowest layer of the hierarchy are tested-only nodes and thus need not record the global diagnostic view. Connectivity between adjacent clusters is provided by virtual gateways, which are essentially pairs of peer switches in the neighbouring clusters.

node u and the message code identifies the type of message. The local diagnostic messages carry the identities of the faulty nodes within a subtree rooted at a cluster head, whereas the global diagnostic messages carry the identities of the faulty nodes of the entire network. A variable diagnostic message size was chosen to accommodate the faults that occur during the diagnosis process as well as the faults that were present before the start of a diagnosis session. Since the messages keep only the status of the faulty nodes, the message size is reduced. To maintain the status of the nodes of the entire network, each cluster head maintains a vector known as Status_Table[u], which stores the status of each node u in the network. An entry of 0 or 1 in this vector indicates that node u is fault-free or faulty, respectively, and an entry of x indicates that the status of node u is unknown. The corresponding entry in this vector is updated as and when any event occurs at that node.
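A minimal sketch of these message formats, the Status_Table, and the comparison test is given below. The field and type names are ours; only the fields themselves and the threshold θ follow the text.

```cpp
#include <cmath>
#include <map>
#include <set>

enum MsgCode { INIT_HEARTBEAT, RESPONSE_HEARTBEAT, LOCAL_DIAG, GLOBAL_DIAG };

struct HeartbeatMsg {        // fixed-size heartbeat message (u, v, value, code)
    int u;                   // sender
    int v;                   // node from which u received its first heartbeat
    double diagnosticValue;  // result of the testing task
    MsgCode code;
};

struct DiagnosticMsg {       // variable-size diagnostic message (u, Fu, code)
    int u;                   // sender
    std::set<int> Fu;        // nodes currently diagnosed as faulty by u
    MsgCode code;
};

// Status_Table kept at each cluster head: 0 = fault-free, 1 = faulty,
// and -1 stands in for the unknown entry 'x'.
using StatusTable = std::map<int, int>;

// Comparison-based value test: a responding testee is value faulty when its
// observed value deviates from the estimated value by more than theta.
bool isValueFaulty(double observed, double estimated, double theta) {
    return std::fabs(observed - estimated) > theta;
}

int main() {
    StatusTable st;
    st[7] = isValueFaulty(4.9, 5.0, 0.5) ? 1 : 0;  // node 7 passes the test
    return 0;
}
```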

The cluster formation consists of the discovery of a BFS tree followed by the forming of clusters. The cluster creation phase of the protocol is invoked when the existing clustering falls below a quality threshold. To achieve reduced diagnosis latency and message complexity, clusters of almost equal size are constructed at all layers of the hierarchy except the highest layer. The clustering and diagnosis approach allows mobile nodes to leave an old cluster and join a new cluster dynamically during execution of the diagnosis algorithm. When a mobile node enters another cluster, it may increase the size of that cluster, for which splitting of the cluster is necessary. We consider the case in which all nodes in the network have the same maximum transmission range R, i.e., the underlying communication graph is an embedding of the vertices in the plane such that (u, v) is an edge if and only if d(u, v) ≤ R.

2.4 Definitions and Notations
We present the following definitions from the diagnosis literature for completeness of the paper.
Definition 1. The connectivity k(G) of a graph G is the minimum number of nodes whose removal results in a disconnected or trivial graph.
Definition 2. The cluster connectivity kc is the minimum number of nodes in a cluster whose removal leaves the cluster disconnected.
Definition 3. A wireless network described by the communication graph G is said to be t-diagnosable if correct diagnosis is always possible provided the number of faults does not exceed t. The largest integer t for which G is t-diagnosable is called the diagnosability of G, denoted tG.
Definition 4. Any adjacent pair of clusters is connected by virtual gateways, so that the union of all clusters covers the entire network.
Definition 5. A diagnosis session is the time elapsed from the initiation of the diagnosis session until all other nodes receive the global diagnostic message for the entire network.

3.1 Clustering

The clustering algorithm can be initiated by any node in the network (the initiator becomes the root of the BFS tree). There are two parts to cluster creation: tree discovery and cluster formation. The messages for cluster formation are piggybacked on the messages of the tree discovery component. Each node u transmits a tree discovery beacon, which indicates its shortest hop distance to the root r. The beacon contains the following fields: {src-Id, parent-Id, root-Id, root-seq-no, root-distance}. If any neighbour v of u, on receiving this beacon, discovers a shorter path to the root through u, it updates its hop distance to the root appropriately and chooses u as its parent in the tree. The parent-Id field is initially NULL and changes as the appropriate parent in the BFS tree is discovered. The root-distance field reflects the distance in hops from the root of the tree. The root-Id is used to distinguish between multiple simultaneous initiators of the cluster creation phase, of which only one instance is allowed to proceed. The root-seq-no is used to distinguish between multiple instances of the cluster creation phase initiated by the same root node at different time instants. To create the clusters on the BFS tree, each node needs to discover its subtree size and the adjacency information of each of its children in the BFS tree. A cluster formation message is piggybacked onto the tree discovery beacon by each node and has the following fields: {subtree-size, node adjacency}. All clusters will have sizes between 1 and Cs. Any pair of clusters has only one common vertex. If the number of initiators exceeds Cs, nodes from more than one cluster are used as initiators. The clustering algorithm proceeds by finding a (rooted) BFS spanning tree of the connected graph.
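The beacon fields and the rule a node applies on receiving a beacon can be sketched as follows. The field names follow the text; the function onBeacon and the TreeState record are illustrative assumptions, not the paper's code.

```cpp
#include <climits>
#include <cstdio>

struct Beacon {
    int srcId;         // sender of the beacon
    int parentId;      // sender's parent in the BFS tree (-1 while NULL)
    int rootId;        // distinguishes simultaneous initiators
    int rootSeqNo;     // distinguishes repeated runs by the same root
    int rootDistance;  // sender's hop distance to the root
    int subtreeSize;   // piggybacked cluster-formation field
};

struct TreeState {
    int myId = 0;
    int parentId = -1;           // parent not yet discovered
    int rootDistance = INT_MAX;  // hop distance to the root
};

// On receiving a beacon from neighbour b.srcId, node v adopts that neighbour
// as its parent if the beacon reveals a shorter path to the root.
void onBeacon(TreeState& v, const Beacon& b) {
    if (b.rootDistance != INT_MAX && b.rootDistance + 1 < v.rootDistance) {
        v.rootDistance = b.rootDistance + 1;
        v.parentId = b.srcId;
        std::printf("node %d: parent <- %d, distance <- %d\n",
                    v.myId, v.parentId, v.rootDistance);
    }
}

int main() {
    TreeState v;
    v.myId = 4;
    Beacon b{2, 1, 1, 0, 1, 3};  // neighbour 2 is one hop from the root
    onBeacon(v, b);              // v adopts node 2 as its parent
    return 0;
}
```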

3. The Proposed Hierarchical Diagnosis Approach
We introduce the diagnosis approach for wireless adhoc networks using a hierarchical clustering method. The proposed approach is divided into the following phases: (i) clustering, (ii) electing a leader in each cluster, and (iii) testing and diagnosis. Clustering or partitioning methods have been suggested to provide scalable solutions to the diagnosis problem in many large networking systems [4–13]. The nodes in wireless systems are expected to autonomously group themselves into clusters, each of which functions as a multi-hop packet radio network. A hierarchical structure allows the clusters to be organized into different layers. Initially, all nodes in the network are joined to the lowest layer. Some nodes in the upper layers are the representatives of the clusters at the immediately lower layer. The representatives of the clusters in a layer join the layer

3.2 Electing Cluster Head

sage arrives but does not match the actual value, the node is considered value faulty. Thus the cluster head, if one exists, records a fault event in the Status_Table, irrespective of whether the fault is a crash or a value fault. If a node is found to be fault-free, diagnosis information about the entire subtree rooted at that node is received from it. The proposed algorithm assumes 100% test coverage.
2. Backward diagnosis propagation. Each cluster head v achieves its local diagnosis independently (i.e., it knows the state of all of its neighbours), based on the heartbeat messages received. Node v combines all of the received diagnoses and its own local diagnosis into a unique diagnostic message m containing the identities of all the faulty nodes adjacent to at least one fault-free node in the subtree rooted at v, and selectively sends m backwards to its parent in the spanning tree. Note that the diagnosis algorithm executed at node v captures both crash-faulty and value-faulty node status. This process is repeated for each of the cluster heads in the hierarchy until the initiator nodes achieve the complete diagnosis of the nodes in the spanning trees rooted at them. Note that the spanning trees formed by the initiator nodes are independent of one another. To preserve this property, when a node moves from one cluster to another, it becomes a member of the latter cluster, and the cluster head of the destination cluster assigns a new node-id to the migrating node and considers it a member of its own cluster.
3. Exchange of diagnosis information among initiator nodes and global diagnosis broadcast. Once an initiator receives diagnostic messages from all its children in the spanning tree, it combines the diagnostic information of these messages with its own diagnosis to obtain a global diagnosis of the system, i.e., the identities of all the faulty nodes in the system. Note that when a child node does not receive a heartbeat message, or receives an erroneous message, from its cluster head (which may be an initiator at the top level of the hierarchy), it concludes that the cluster head is faulty and elects another cluster head from the nodes in that cluster. To implement this, the child nodes maintain a second timeout Tout2. Therefore, all cluster heads and initiator nodes remain fault-free. This property of the algorithm satisfies the dynamic fault environment in the sense that when a cluster head becomes faulty, the child nodes detect this and elect another cluster head. This ensures that the cluster heads and initiator nodes in the higher layers of the hierarchy (i.e., all layers except the lowest) are fault-free. The fault-free initiator nodes maintain the status of the entire network. The initiator nodes exchange their diagnosis information among themselves and therefore achieve the global diagnosis of the entire network, i.e., the identities of all the faulty nodes in the system. The global diagnosis is then disseminated downward in the spanning tree using a broadcast protocol. While executing the above three phases, the wireless network may be subjected to events such as a new node

The node having the lowest id within a cluster is elected as the cluster head. The main advantage of electing a leader using node ids is that these ids are exchanged for diagnosis purposes anyway. Each fault-free node in the cluster sends its node id to all other nodes in the cluster. The fault-free node ids of a cluster are arranged in ascending order, and the first node is the cluster head. The decisions about the identity of the cluster head, or representative, can thus be made locally and consistently. The assignment of the representative task is dynamic, in that the identities of the representatives may change after each diagnosis session due to failures or repairs. A simple election algorithm would be to assign the representative task to the fault-free node with the highest node identification, assuming that the node identifications are unique and orderable. This is possible as long as each member of a cluster knows the structure of the hierarchy at the next higher level. Since every cluster may have one or more mobile nodes along with static sensor nodes, the leader election procedure may identify a mobile node as the cluster head.

3.3 Testing and Hierarchical Diagnosis
Once the cluster creation phase generates a set of clusters, the cluster diagnosis phase is invoked to perform small incremental changes to the clusters to handle crash and value faults. The algorithm allows efficient diagnosis in heterogeneous (static and mobile node) systems in which some nodes are mobile, moving from one cluster to another, while the sensor nodes are stationary. At the end of cluster creation, if a small cluster of size < Cs exists (there can be only one such cluster), it is merged with another cluster of size < Cs. The cluster size Cs and the number of layers lc are two variables that can be used to tune the clustering such that a balance is achieved between the two. The algorithm in each cluster is designed such that the hierarchical structure of the clusters remains constant. If all the nodes in a cluster are faulty, no cluster head remains for that cluster; therefore the nearest cluster head above the faulty nodes records the status information of all the nodes below it in the tree. The distributed diagnosis procedure with respect to each initiator is divided into three phases.
1. Forward dissemination of heartbeat messages. Each initiator node initiates testing by sending heartbeat (test) initiation messages top-down on the spanning tree rooted at the initiator node to its neighbours using the 1_hB primitive. Heartbeat messages are propagated throughout the network, along a tree spanning all fault-free nodes corresponding to each initiator. Once a node has propagated the message, it waits for the heartbeat messages of its children, diagnosing as fault-free the sender of any received message that has passed the comparison test. After a timeout Tout1, nodes that did not reply with their heartbeat message are diagnosed as crash faulty. If the mes-

joins, triggering reclustering, or an existing node leaves. The clustering, leader election, and diagnosis procedures execute in a coordinated manner to handle these events.
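A minimal sketch of one diagnosis round as seen by a cluster head follows: collect the replies to its initiation heartbeats, build the local faulty set (phase 1), and merge the sets reported by its children for backward propagation (phase 2). The names localDiagnosis, mergeForParent, and Reply are our own; Tout1 and theta correspond to the timeout and threshold described in the text.

```cpp
#include <cmath>
#include <map>
#include <set>
#include <vector>

struct Reply {
    bool arrivedBeforeTout1;  // false: no reply within Tout1
    double observedValue;     // value carried in the reply heartbeat
};

// Phase 1 at the head: timeout => crash faulty, deviation > theta => value faulty.
std::set<int> localDiagnosis(const std::map<int, Reply>& replies,
                             double estimatedValue, double theta) {
    std::set<int> Fv;
    for (const auto& kv : replies) {
        const Reply& r = kv.second;
        if (!r.arrivedBeforeTout1)
            Fv.insert(kv.first);                                   // crash fault
        else if (std::fabs(r.observedValue - estimatedValue) > theta)
            Fv.insert(kv.first);                                   // value fault
    }
    return Fv;
}

// Phase 2: merge the children's local diagnostic sets with the head's own set;
// the union would then be selectively sent (SelSend) to the head's parent.
std::set<int> mergeForParent(std::set<int> own,
                             const std::vector<std::set<int>>& fromChildren) {
    for (const auto& s : fromChildren)
        own.insert(s.begin(), s.end());
    return own;
}

int main() {
    std::map<int, Reply> replies{{5, {true, 5.1}}, {6, {false, 0.0}}};
    std::set<int> Fv = localDiagnosis(replies, 5.0, 0.5);  // node 6 is crash faulty
    std::set<int> up = mergeForParent(Fv, {{9}});          // add a child's report
    return 0;
}
```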

Proof. Assume that node u is connected to the network. Node u may be either fault-free or faulty. If node u is fault-free, then whether it is static or mobile it will receive a heartbeat message in finite time, according to Lemma 1. On receiving a heartbeat message, it reports its status to its cluster head by sending a reply heartbeat message. Since we consider a diagnosable adhoc network (i.e., the diagnosability bound in each cluster is preserved), node u will always have a fault-free cluster head in its cluster at any point in time, which will correctly diagnose its state. Now, if node u is faulty, four scenarios may be considered. The first scenario is that u is a static value-faulty node. In this case, the same reasoning as above applies, i.e., u will receive a heartbeat message and will reply by sending a reply heartbeat message carrying an erroneous result within a bounded time, allowing its cluster head to diagnose its faulty state. If u is a static crash-faulty node (second scenario), then its cluster head will be able to diagnose its state once a timeout occurs. The third scenario is that node u is a dynamic value-faulty node. In this case, node u may not receive the initiation heartbeat message from its original cluster head, since it has moved to some other cluster; however, it will receive a heartbeat message from the cluster head of the cluster it has joined. It will then send an erroneous reply heartbeat message to its new cluster head, which can diagnose it as value faulty. The last scenario is that node u is a dynamic crash-faulty node, and hence the state of u will be correctly identified once the timer in the cluster head times out. It is not possible to diagnose a cluster if the number of faulty nodes in the cluster is greater than the number tolerable, i.e., the diagnosability bound. One method of handling a cluster whose diagnosability bound has been exceeded would be to assume that all nodes in that cluster are faulty; these nodes would then be put into clusters whose diagnosability would not be affected by the addition of faulty nodes. If the nodes in a cluster concluded that their diagnosability was insufficient to continue, then reconfiguration would be necessary. This process would be initiated by simply not electing a cluster representative. The reconfiguration would consist of the combination and reclustering of all the nodes represented by the partition at the next higher layer. This process may be performed recursively until all the nodes in the system are included in the reconfiguration. We now prove the second property, i.e., that all nodes are correctly diagnosed in a finite time.

4. The Correctness Proof and Complexity Analysis
The cost associated with the execution of a diagnosis protocol is evaluated in terms of diagnosis latency and message complexity, where the diagnosis latency is defined as the elapsed time between the inception and the end of the diagnosis protocol, and the message complexity is defined as the total number of messages exchanged during diagnosis. In the following, we first prove that the algorithm is correct and then analyse its time and message complexity. For a distributed self-diagnosis algorithm to be correct, it should guarantee that at the end of each diagnosis session each fault-free node correctly diagnoses not only the state of all nodes within its cluster, but also that of all the nodes in the system. In the following we prove that our algorithm is correct and complete. The correctness proof is based on the following two main properties: partial correctness (the fault status of any node connected to the network is correctly diagnosed by at least one fault-free node) and complete correctness (local diagnostic views generated by fault-free nodes are correctly received by any other fault-free node in a finite time). The first property certifies that cluster heads know the fault status of all nodes within the cluster that are connected to the network at the end of the local testing/diagnosis phase. The second property ensures that after the cluster heads propagate their partial diagnostic views, i.e., at the end of the dissemination of the global diagnostic message, a correct diagnosis is achieved in a finite time. The first property is proved in Lemma 2, while Lemma 3 proves the second. Assume the network topology is time varying, and let the communication graph G be the connected communication graph of the system at time t during the diagnosis session, with tstart ≤ t ≤ tend, where tstart and tend are the starting and end times of a diagnosis session.
Lemma 1. If a diagnosis session has been initiated, then the leaf node in the lowest layer of the hierarchy that receives the first heartbeat message will do so in at most Tx + (lc − 1) · 1_hB_delay time, where Tx is the send initiation time of the heartbeat message and 1_hB_delay is the one-hop reliable broadcast delay within a cluster.
Proof. The initiator node generates a heartbeat message in at most Tx. This heartbeat message reaches the cluster head of the immediate cluster in the next layer of the hierarchy. The leaf node in the spanning tree rooted at the initiator node will receive this message in at most Tx + (lc − 1) · 1_hB_delay. It follows that within at most Tx + (lc − 1) · 1_hB_delay, all the nodes in the spanning tree will have received the heartbeat message sent by the initiator node.

Corollary 1. If a diagnosis session has been started, then the last cluster head node to transmit its local diagnostic view will do so in at most Tx + (lc − 1) · 1_hB_delay + max(Tout1, Tout2). The proof of this corollary follows trivially from that of Lemma 1, given that after receiving its first heartbeat message a node transmits its heartbeat message to all its children, which are leaf nodes, and then waits at most Tout1 before starting preparation of the local diagnostic message. In case any parent node, including the initiator node, fails,

Lemma 2. If a diagnosis session has been initiated, then the fault status of each node connected to the network during the diagnosis session is correctly diagnosed by at least one fault-free node.

Case I. G is a single connected component: It follows that the tree created during the first phase of the protocol spans all fault-free nodes. Consider any leaf node in the tree. Once the leaf node has its own diagnosis (in time at most Tout1), it sends a diagnostic message to its parent in the tree. Eventually, this message is aggregated with other diagnostic messages and sent to the upper levels of the tree. Hence, in finite time the initiator receives the local diagnoses of all the nodes and broadcasts the complete diagnosis downward in the tree, which is received in finite time by any fault-free node. The thesis follows by observing that, since f < k(G), every faulty node must be adjacent to at least one fault-free node, and hence it is diagnosed by at least one fault-free node.

the children nodes will be able to identify this through the timeout Tout2. Hence, the last cluster head to compute its local diagnostic view will do so in Tx + (lc − 1) · 1_hB_delay + max(Tout1, Tout2).
Lemma 3. Let G be the graph representing the adhoc system during a diagnosis session. If G is connected and the total number of faulty nodes in each cluster is at most δc = kc − 1, then the local diagnostic view generated by a cluster head is correctly received by any other fault-free node connected to the system during the diagnosis session in a finite time.
Proof. If a cluster does not exceed its diagnosability bound δc, its cluster head will acquire the status of its children nodes within the timeout Tout1 maintained at the cluster head. The cluster head prepares, within a bounded time, a local diagnostic message containing the fault status of only the faulty nodes in the cluster. The local diagnostic message is transmitted bottom-up in the hierarchy, from cluster head to cluster head, to reach the initiator node in the spanning tree using the selective send (SelSend) primitive. If a cluster exceeds its diagnosability bound or all its nodes are faulty, the nested clusters in the higher layers of the hierarchy determine its status within the timeout Tout1 maintained at the cluster head. If a cluster head becomes faulty, this is detected by its children, which elect another node in the cluster as cluster head within the timeout Tout2 maintained at the children nodes. Although this duration varies with the size of the local diagnostic message, it is bounded. The diagnosis tasks executed at the top layer of the hierarchy exchange the local diagnostic messages. Note that if an initiator node is faulty, it will be detected by its fault-free children, which elect another initiator dynamically in the same diagnosis session. Therefore, the initiator nodes are always fault-free and obtain the global diagnostic message within a bounded time. Once the complete message is prepared by an initiator node, it is broadcast to all the nodes in its spanning tree. It is easy to observe that the complete diagnostic messages sent by the respective initiators reach all the leaf nodes within a bounded time, according to Lemma 1. The thesis follows: all the nodes achieve the diagnosis of the entire graph within a finite time.

Case II. G is divided into a number of clusters: If G is partitioned into connected clusters, the diagnosis algorithm in each cluster will achieve the diagnosis, and the algorithm is correct within each cluster. The same two cases, Case I and Case II, then apply within each cluster. The algorithm is trivial when a cluster consists of a single connected component of cluster size Cs.
We now evaluate the performance of our algorithm in the worst case by analysing its time and message complexity. The time complexity refers to the duration of a diagnosis session, whereas the message complexity refers to the total number of 1-hop reliable broadcasts transmitted during a diagnosis session.
Theorem 3. If lc < Δ, then the proposed hierarchical clustered diagnosis algorithm has time complexity O(ΔTx + lcTf + max(Tout1, Tout2)), where Δ is the diameter of the network and Tf is an upper bound on the time to send any message from the initiator nodes to the leaf nodes or from the leaf nodes to the initiator nodes.
Proof. Any fault-free node propagates the heartbeat message originated by any initiator in time at most Tx. This means that in time at most ΔTx the furthest node in the network receives the heartbeat message. Hence, the last diagnostic message is generated in time at most ΔTx + max(Tout1, Tout2), and its diagnostic information, possibly aggregated with other diagnostic information by nodes in the upper levels of the tree, is received by the initiator in time at most ΔTx + lcTf + max(Tout1, Tout2) + Txcg, where Txcg is the time needed to exchange the diagnosis information among all the initiators at the top layer of the hierarchy. The initiators broadcast the complete diagnosis in time at most ΔTx + lcTf + max(Tout1, Tout2) + Txcg. This follows by observing that any fault-free node receives the complete diagnosis in time at most (lc − 1)Tf after it has been sent. To determine the savings of the clustering method, we compare its message count with that of the single-level (non-clustering) method. The following theorems evaluate the message complexity of the proposed diagnosis algorithm for the clustering and non-clustering diagnosis approaches.
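For readability, the latency bound of Theorem 3 can be read as a sum of the successive stages of a session; the decomposition below simply restates the terms used in the proof, with the initiator-exchange time Txcg treated as a constant absorbed into the O-bound.

\[
T_{\mathrm{diag}} \;\le\;
  \underbrace{\Delta T_x}_{\text{heartbeat dissemination}}
+ \underbrace{\max(T_{out1},\,T_{out2})}_{\text{timeout wait}}
+ \underbrace{l_c T_f}_{\text{aggregation over } l_c \text{ layers}}
+ \underbrace{T_{xcg}}_{\text{initiator exchange}}
\;=\; O\!\left(\Delta T_x + l_c T_f + \max(T_{out1}, T_{out2})\right)
\]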

Theorem 1. Let G be the graph representing the system during a diagnosis session. If G is connected and the total number of faulty nodes in each cluster is at most δc = kc − 1, then every fault-free node connected to the system during a diagnosis session correctly diagnoses the state of all nodes in the system in a finite time. The proof of Theorem 1 follows trivially from Lemma 2 and Lemma 3.
Theorem 2. Let G = (V, E) be the communication graph of a wireless system. If G is connected and f ≤ k(G), then the diagnosis algorithm is correct.
Proof. Let G′ be the graph induced on G by the node set V − F. Since f ≤ k(G), G′ is either a single connected component or is partitioned into a number of connected components. The two cases are analysed as follows:

Theorem 4. The message complexity of the hierarchical cluster-based diagnosis approach is O(ncCs).

Proof. The worst-case scenario is when all nodes are fault-free. In this case, the initiator sends one heartbeat message to all nodes. Any other node sends one heartbeat message and one diagnostic message. Finally, the initiator broadcasts the global diagnosis in the tree rooted at the initiator. This task can be accomplished by sending nc global diagnostic messages downward in the spanning tree, where nc is the number of nodes in the spanning tree rooted at an initiator. Hence, the total number of messages exchanged is (1 + 2(nc − 1) + (nc − 1)) · Ninit. This expression equals (3nc − 2)Ninit = O(ncCs), since the maximum number of initiators is Cs. It can be observed that nc = n/Ninit.
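The count in the proof can be written out per initiator tree (nc nodes) and then multiplied by the number of initiators Ninit; this is only a restatement of the terms above.

\[
\underbrace{1}_{\text{initiation heartbeat}}
+ \underbrace{2(n_c - 1)}_{\text{heartbeat + diagnostic per non-initiator}}
+ \underbrace{(n_c - 1)}_{\text{global-diagnosis broadcast}}
= 3n_c - 2,
\qquad
(3n_c - 2)\,N_{init} = O(n_c C_s)\ \ \text{since } N_{init} \le C_s .
\]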

randomly introduced such that every node has at least k neighbours, to ensure k-connectivity. The experiments were conducted with varying network and cluster sizes. In the experiments reported here, we chose the average time gap between successive initiations of diagnosis sessions by an initiator node to be 60 units. The send initiation time Tx, the diagnosis task execution time, and the propagation time were assumed to follow a uniform distribution between a lower and an upper limit, as follows: Tx: 0.002 units; 1_hB_delay: 0.008 to 0.08 units; diagnosis task execution time: 0.01 to 0.05 units; clustering task execution time: 0.01 units; diagnosis sessions: 1000 simulation runs. The diagnostic latency and the number of messages exchanged to achieve diagnosis, for varying network and cluster sizes and for both the clustering and non-clustering approaches, are shown in Figs. 1 and 2, respectively. The diagnostic latency of the clustering approach is significantly reduced compared to the non-clustering approach, because the diagnosis proceeds in parallel over the clusters of the wireless network. Also, the number of messages exchanged in the clustering scheme is lower than in the non-clustering scheme, owing to the larger number of hops between any pair of nodes in the non-clustering scheme.

Theorem 5. The percentage of diagnosability improvement of the proposed diagnosis algorithm over the non-clustering diagnosis approach is (α − β)/100.
Proof. The node connectivity of each cluster is k(G). It can be observed that the diagnosability of non-clustering algorithms is the connectivity k(G) of the system. Let β = k(G) − 1 be the diagnosability of the system, assuming any portion of the system may be subjected to k(G) faults simultaneously. Therefore, non-clustering algorithms can tolerate β faults in the entire network. The cluster-based diagnosis algorithm proposed here allows the diagnosis process to proceed independently in each of the clusters. Clustering the whole system into almost equal-sized clusters allows each cluster to diagnose up to (Cs − 1) faults, since the cluster remains completely connected, i.e., there is a connection between every pair of nodes in the cluster. Therefore, the local diagnosability of each cluster is (Cs − 1). This leads to a global diagnosability of up to nc(Cs − 1)j faults, where (Cs − 1)j is the diagnosability of cluster j and nc is the total number of clusters. Let the global diagnosability of the clustering method be α = nc(Cs − 1)j. Therefore, while each cluster guarantees that at least (Cs − 1) faults can be diagnosed (taking this as the local diagnosability of each cluster), clustering can improve the global diagnosability of the whole system by (α − β) > 0. Although the local diagnosability of each cluster may differ depending on the type of system-diagnosis algorithm used in each cluster, in this paper the local diagnosability of every cluster remains the same, since the same algorithm is used in each cluster and the nodes are homogeneous. Therefore, the percentage of diagnosability improvement of the clustering approach is (α − β)/100, which proves the theorem. Clustering also improves diagnosability by parallelizing the diagnosis process and improving the diagnosability within each cluster, which results in a minimum cost in terms of the number of messages required to complete the diagnosis.
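In the notation of Theorem 5, the two diagnosability figures being compared are simply (restating the quantities defined in the proof, with the per-cluster index suppressed):

\[
\beta = k(G) - 1 \quad\text{(flat, non-clustering diagnosability)}, \qquad
\alpha = n_c\,(C_s - 1) \quad\text{(clustered diagnosability)}, \qquad
\alpha - \beta > 0 .
\]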

Figure 1. Average diagnostic latency versus n.

Figure 2. Number of messages versus n.

6. Conclusions
We proposed a hierarchical clustering method for large-scale self-diagnosable wireless distributed systems, exploiting the topological properties of these classes of distributed systems. The analysis shows that the diagnosis latency and message complexity of the algorithm are O(ΔTx + lcTf + max(Tout1, Tout2)) and O(ncCs), respec-

5. Simulation Results
A simulator was developed in C++ to evaluate the diagnosis latency and message complexity of the proposed distributed diagnosis approach. Graphs were randomly generated for a given n and k. The links were

tively. The multi-cluster approach is shown to provide the following advantages: (i) improved scalability, (ii) reduced message overhead, (iii) reduced diagnosis time, and (iv) improved diagnosability. The proposed clustering allows the customization of the diagnosis process for the many environments in a large distributed system.

References
[15] S. Rangarajan, A.T. Dahbura, & E.A. Ziegler, A distributed system-level diagnosis algorithm for arbitrary network topologies, IEEE Transactions on Computers, 44(2), 1995, 312–334.
[16] K.-F. Ssu, C.H. Chou, H.C. Jian, & W.-T. Hu, Detection and diagnosis of data inconsistency failures in wireless sensor networks, Computer Networks, 50, 2006, 1247–1260.

Biographies
P.M. Khilar was born in India on July 10, 1968, and graduated (first class with distinction) from Mysore University in Computer Science & Engineering in 1990. He received his M.Tech degree (first rank) in Computer Science from NIT, Rourkela, India, in 1999. He obtained his Ph.D degree from the Indian Institute of Technology, Kharagpur, India, in 2009. At present he is a senior faculty member in the Department of Computer Science and Engineering of NIT, Rourkela, India. His current research interests include parallel and distributed processing, fault-tolerant computing, and ad-hoc and sensor networks.

[1] S. Chessa & P. Santi, Comparison-based system-level fault diagnosis in ad hoc networks, Proc. SRDS'01, New Orleans, LA, October 2001.
[2] S. Chessa & P. Santi, Crash faults identification in wireless sensor networks, Computer Communications, 2002.
[3] M. Barborak, M. Malek, & A. Dahbura, The consensus problem in fault-tolerant computing, ACM Computing Surveys, 25(2), 1993, 171–220.
[4] M. Barborak & M. Malek, Partitioning for efficient consensus, Proc. IEEE 26th Hawaii International Conf. on System Sciences, 1993, 438–446.
[5] O. Benkahla, C. Aktouf, & C. Robach, System-diagnosis of cluster-based parallel architectures, Proc. IEEE Conf. on Parallel and Distributed Processing '96, 1996, 305–309.
[6] E.P. Duarte, Jr. & T. Nanya, A hierarchical adaptive distributed system-level diagnosis algorithm, IEEE Transactions on Computers, 47(1), 1998, 34–45.
[7] E.P. Duarte, Jr. & T. Nanya, Multi-cluster adaptive distributed system-level diagnosis algorithms, IEICE Technical Report FTS, 1995, 95–73.
[8] A. Brawerman & E.P. Duarte, Jr., An isochronous testing strategy for hierarchical adaptive distributed system-level diagnosis, Journal of Electronic Testing: Theory and Applications, 17, 2001, 185–195.
[9] M.S. Su, K. Thulasiraman, & A. Das, A scalable on-line multi-level distributed network fault detection/monitoring system based on the SNMP protocol, IEEE GLOBECOM 2002, Nov. 2002.
[10] M.S. Su, K. Thulasiraman, & V. Goel, The multi-level paradigm for distributed fault detection in network with unreliable processors, Proc. 2003 International Symp. on Circuits and Systems (ISCAS'03), 3(5), 2003, 862–865.
[11] P. Chandrapal & P. Kumar, A scalable multi-level distributed system-level diagnosis, Proc. ICDCIT, Springer-Verlag, Berlin, Heidelberg, 2005, LNCS, 192–202.
[12] G. Jeon & Y. Cho, A partitioning method for efficient system-level diagnosis, The Journal of Systems and Software, 63, 2002, 1–16.
[13] S. Chutani, A framework for fault diagnosis of networks, Ph.D. Dissertation, University of North Carolina, Chapel Hill, USA, 1996.
[14] M. Stahl, R. Buskens, & R.P. Bianchini, On-line diagnosis in general topology networks, Workshop on Fault-Tolerant Parallel and Distributed Systems, July 1992.

S. Mahapatra was born in India on February 27, 1969, and graduated from Sambalpur University in Electronics and Telecommunications Engineering in 1990. He received his M.Tech. and Ph.D. degrees from IIT, Kharagpur, in 1992 and 1997, respectively. Presently he is working as an Associate Professor in the Department of E & ECE, IIT, Kharagpur, India. His current research interests include parallel and distributed processing, lossless data compression, and optical networking.

