A Survey on Fault Management in Wireless Sensor Networks
M. Yu, H. Mokhtar, and M. Merabti
School of Computing & Mathematical Sciences, Liverpool John Moores University
[email protected],
[email protected],
[email protected]
Abstract— The recent growth of interest in Wireless Sensor Networks (WSNs) has provided a new paradigm for designing wireless environmental monitoring applications in the 21st century. However, the nature of these applications and of the network operating environment places heavy demands on sensor network systems to maintain a high quality of service. One challenge is to design efficient fault management solutions that recover network systems from various unexpected failures. In this paper, we address this challenge by surveying existing fault management approaches in WSNs. We divide the fault management process into three phases: fault detection, fault diagnosis, and fault recovery, and classify existing approaches according to these phases. We also summarize the existing management architectures adopted to support fault management in WSNs. Finally, we outline future challenges for fault management in WSNs.
I. INTRODUCTION
Technologies for miniaturizing low-cost electronic devices have fuelled the vision of deploying large-scale, dense networks of sensor nodes to extract useful information from harsh environments. However, such small dimensions also place strong restrictions on the hardware and software capabilities of a sensor node, in terms of its processing power, memory storage, and energy supply. Among these, the limited power supply is one of the most important constraints. Sensor nodes usually carry a limited, generally irreplaceable battery, yet are expected to operate autonomously for periods ranging from days to years. For these reasons, faults are likely to occur more frequently and unexpectedly in sensor networks than in traditional networks. There is therefore a trade-off between prolonging the network lifetime by conserving the energy of individual nodes, and maintaining a high quality of network service by implementing complex fault management schemes in the network. Moreover, sensor networks are liable to suffer faults due to harsh environmental conditions, such as rain or fire. In such cases, faults may also include node misbehaviours, ranging from simple crash faults, where a node becomes temporarily inactive, to faults where a node behaves arbitrarily or maliciously. For all of these reasons, fault management in WSNs needs to be tackled differently and treated with extra care.

ISBN: 1-9025-6016-7 © 2007 PGNet

The technical core of this paper is to investigate the current state of fault management in WSNs and its future prospects. To this end, we survey existing approaches and provide a comprehensive overview of fault management in WSNs. Existing fault management approaches take various forms: architectures [2-4], protocols [5, 6], detection algorithms [7-10], and detection decision fusion algorithms [11-16]. We classify these approaches into the three phases of an appropriate fault management process: fault detection, fault diagnosis, and fault recovery. The rest of this paper is organized as follows: Section II surveys existing fault detection approaches in WSNs; Section III discusses fault diagnosis approaches; Section IV examines existing fault recovery approaches; Section V classifies the management architectures used to support fault management in WSNs. Finally, we outline a few potential future challenges for fault management in WSNs.

II. FAULT DETECTION
Fault detection is the first phase of fault management, in which an unexpected failure should be properly identified by the network system. We classify existing failure detection approaches in WSNs into two types: centralized and distributed.

A. Centralized Approach

The centralized approach is a common solution for identifying and localizing the causes of failures, or suspicious nodes, in WSNs. Usually, a geographically or logically central sensor node (a base station [5, 17, 18], central controller or manager [4], or sink [19]) takes responsibility for monitoring and tracing failed or misbehaving nodes in the network. Most of these approaches assume that the central node has unlimited
resources (e.g. energy) and is able to execute a wide range of fault management tasks. They also assume that network lifetime can be extended if complex management work and message transmission are shifted onto the central node. The central node normally adopts an active detection model, retrieving the state of the network and of individual sensor nodes by periodically injecting requests (or queries) into the network. It analyzes this information to identify and localize failed or suspicious nodes. More specifically, Sympathy [19] uses a message-flooding approach to pool event data and current state (metrics) from sensor nodes. To minimize the number of messages nodes must send, and so conserve node energy, a Sympathy node can selectively transmit only important events to the Sympathy sink node. Staddon et al. [18] instead append network topology information (i.e. each node's neighbour list) to routing update messages rather than collecting it separately. The base station can thus construct the entire network topology by integrating the pieces of topology information embedded in route update messages, which are normally forwarded during the execution of well-known route-discovery protocols. Once the base station knows the network topology, failed nodes can be efficiently traced using a simple divide-and-conquer strategy. Note that this approach assumes each node has a unique identification number and that the base station can transmit messages directly to any node in the network. Moreover, some common routing protocols (e.g. SPINS [5]) can also detect failed or misbehaving nodes through route discovery and update; this typically requires nodes to send additional messages and is consequently very expensive. In [17], the base station uses marked packets (containing geographical information about source and destination locations, etc.) to probe sensors.
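The divide-and-conquer tracing performed by the base station in [18] amounts to a recursive halving of the suspect node set. The sketch below is illustrative rather than the authors' implementation: `is_responsive` is a hypothetical primitive standing in for one round of base-station probing of a node subset.

```python
def trace_failed_nodes(nodes, is_responsive):
    """Narrow down unresponsive nodes by recursively probing subsets.

    `is_responsive(subset)` is a stand-in for a base-station probe that
    returns True when every node in the subset replies; each call costs
    one round of probe messages, so f failed nodes among n are isolated
    in O(f log n) probing rounds.
    """
    if is_responsive(nodes):
        return []           # no failed node anywhere in this subset
    if len(nodes) == 1:
        return list(nodes)  # isolated a single failed node
    mid = len(nodes) // 2
    return (trace_failed_nodes(nodes[:mid], is_responsive) +
            trace_failed_nodes(nodes[mid:], is_responsive))

# Example: nodes 3 and 7 have silently failed
failed = {3, 7}
print(trace_failed_nodes(list(range(10)),
                         lambda s: not (set(s) & failed)))  # [3, 7]
```

The recursion prunes whole responsive subtrees with a single probe, which is why the strategy scales to large topologies once the base station holds the full neighbour map.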
The scheme of [17] relies on nodes' responses to identify and isolate suspicious nodes on the routing paths when excessive packet drops or compromised data are detected. In addition, the central manager in WinMS [4] provides a centralized approach for preventing potential failures by comparing the current and historical states of sensor nodes against overall network information models (i.e. a topology map and an energy map). In summary, the centralized approach can identify network faults efficiently and accurately. However, resource-constrained sensor networks cannot always afford to collect all sensor measurements and states periodically in a centralized manner. A distinctive problem of this approach is that the central node easily becomes a single point of data traffic concentration in the network, since it is responsible for all fault detection and fault management. This causes a high volume of message traffic and quick energy depletion in certain regions of the network, especially at the nodes closer to the base station.
These nodes carry the extra burden of forwarding communication messages from other nodes. The centralized approach therefore becomes extremely inefficient and expensive in large-scale sensor networks. In addition, the multi-hop communication it requires increases the base station's response delay to faults occurring in the network. We therefore have to seek a localized and more computationally efficient fault detection model.

B. Distributed Approach

The distributed approach encourages local decision-making, distributing fault management evenly across the network. Its goal is to allow a node to make certain levels of decision before communicating with the central node: the more decisions a sensor node can make locally, the less information needs to be delivered to the central node. In other words, the control centre should not be informed unless a fault has actually occurred in the network. Examples of such developments are: node self-detection and self-correction of hardware malfunctions (i.e. sensor, battery, RF transceiver) [20, 21]; failure detection via neighbour coordination [7-10, 22]; use of a WATCHDOG to detect misbehaving neighbours [6]; and use of group (or cluster) technology to distribute fault detection across the network [2, 23]. Others use decision fusion centres (i.e. several fusion nodes across the network) to make the final decisions on suspicious nodes [11, 12, 14, 16].

1) Node Self-detection: Harte et al. [21] propose a self-detection model that monitors malfunctions of a sensor node's physical components via both a hardware and a software interface. The hardware interface consists of a flexible circuit using accelerometers that acts as a sensing layer around the node to detect its orientation and impacts on it. Several software components (e.g. ADCC, TimerC) from the TinyOS operating system are adopted to sample the sensor node's readings.
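At its core, node self-detection reduces to comparing sampled readings against a pre-defined fault model. The following minimal sketch is an assumption-laden illustration, not the mechanism of [20, 21]: the valid range and the stuck-at run length are hypothetical thresholds.

```python
def self_check(samples, lo, hi, stuck_run=5):
    """Flag a sensor as faulty if its readings leave the valid range
    [lo, hi], or stay identical for `stuck_run` consecutive samples
    (a simple stuck-at fault model). Thresholds are illustrative.
    """
    if any(s < lo or s > hi for s in samples):
        return "out-of-range"
    run = 1
    for prev, cur in zip(samples, samples[1:]):
        run = run + 1 if cur == prev else 1
        if run >= stuck_run:
            return "stuck-at"
    return "ok"

print(self_check([21.0, 21.2, 20.9, 21.1], lo=-10, hi=60))  # ok
print(self_check([21.0] * 6, lo=-10, hi=60))                # stuck-at
```

A check of this kind costs only a few comparisons per sample, which is why self-detection is attractive on nodes with tight energy budgets.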
Self-detection of node failure in [20] is relatively straightforward: the node simply observes the binary outputs of its sensors and compares them with pre-defined fault models.

2) Neighbour Coordination: Failure detection via neighbour coordination is another example of distributing fault management. Nodes coordinate with their neighbours to detect and identify network faults (i.e. suspicious nodes or abnormal sensor readings) before consulting the central node. In most cases, the central node is not made aware of a failure unless the coordinated diagnosis concludes, with high confidence, that something is wrong. This design reduces network communication, and subsequently conserves node energy. Existing implementations of this design vary according to the needs of each approach. For example, in the decentralized fault diagnosis system of [22], a sensor node executes a localized diagnosis algorithm in
steps to identify the causes of a fault. The diagnostic information is usually derived from the lower communication layers (e.g. the MAC and network layers). In addition, a node can also query diagnostic information from its neighbours (within one-hop communication range). This allows the decentralized diagnostic framework to scale easily to much larger and denser sensor networks if required. Alternatively, suspicious (or failed) nodes can be identified by comparing a node's sensor readings with its neighbours' median readings. With this motivation, Ding et al. [9] developed a localized algorithm to identify suspicious nodes whose sensor readings differ greatly from those of their neighbours. Although this algorithm works for large sensor networks, the probability of sensor faults needs to be small: if half of a sensor's neighbours are faulty and the number of neighbours is even, the algorithm cannot detect faults as efficiently as expected. In addition, this approach requires each sensor node to be aware of its physical location, by being equipped with an expensive GPS unit or a GPS-less positioning technique. Chen et al. [10] improved on this kind of approach by removing the need for physical position information. Their algorithm can still successfully identify suspicious nodes even when half of the neighbours are faulty: it chooses a known-good sensor in the network and uses its readings to diagnose the status of other sensors, and this information can be further propagated through the entire network to diagnose all other sensors as good or faulty. Hsin and Liu [7, 8] address the accuracy of failure detection via a two-phase neighbour coordination scheme, in which a sensor node always consults its neighbours to verify a diagnosis before sending out a failure alarm.
In this scheme, a sensor node uses the first phase to wait for neighbours to update their information about the suspicious node, and the second phase to consult neighbours in order to reach a more accurate decision. Since neighbours monitor each other, the monitoring effect propagates across the network, so the central node only needs to monitor a potentially very small subset of sensors. This approach requires that the network is pre-configured, that each sensor has a unique ID, and that the central node knows the existence and ID of every sensor. A similar approach is taken in [6], where a node can listen in on its neighbours using a WATCHDOG: it detects failed or misbehaving neighbours when data packets are not properly forwarded by the neighbour it is currently routing through. However, this process may be slow and error-prone, because constrained sensor nodes cannot be expected to constantly police all their neighbours, and a node may consequently end up routing to a new neighbour that has also failed.

3) Clustering Approach: Clustering [24] has become an emerging technique for building scalable and energy-balanced applications for WSNs. Tai et al. [2] derive
an efficient failure detection solution using a cluster-based communication hierarchy to achieve scalability, completeness, and accuracy simultaneously. They split the entire network into clusters and distribute fault management into each individual region, as in Fig. 1. Intra-cluster heartbeat diffusion is adopted to identify failed nodes in each cluster: the cluster head actively exchanges heartbeat messages with its one-hop neighbours (within the same transmission radius) and, by analyzing the collected heartbeat information, identifies failed nodes according to a pre-defined failure detection rule. Further, when a failure is detected, the locally detected failure information can be propagated to all clusters, making local failure detection aware of changes in network conditions or in the overall objectives.
Fig. 1 Intra-Cluster heartbeat diffusion and Inter-Cluster information propagation [2]
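At the cluster head, a heartbeat-based rule of this kind reduces to a per-member timeout check. The sketch below is a simplified stand-in for the pre-defined failure detection rule of [2]; the missed-interval threshold is an assumed example, not taken from the paper.

```python
def detect_failed(last_heartbeat, now, interval, missed=3):
    """Cluster-head rule: a member that has missed `missed` consecutive
    heartbeat intervals is declared failed.

    `last_heartbeat` maps member node IDs to the timestamp of their most
    recently received heartbeat; `interval` is the heartbeat period.
    """
    return sorted(node for node, t in last_heartbeat.items()
                  if now - t > missed * interval)

beats = {"n1": 100.0, "n2": 97.0, "n3": 84.0}
print(detect_failed(beats, now=100.0, interval=5.0))  # ['n3']
```

In the full scheme, any node IDs returned here would then be diffused to the other clusters via the inter-cluster propagation shown in Fig. 1.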
Ruiz et al. [23] adopt event-driven detection via a manager-agent model supported by the MANNA management architecture [3]. In this approach, agents execute on cluster heads, which have more resources than common nodes. A manager is located externally to the WSN, where it has a global vision of the network and can perform complex management tasks and analyses that would not be possible inside the network. Every node checks its energy level and sends a message to the manager or agent whenever there is a state change. The manager uses this information to build a topology map and a network energy model for monitoring and detecting potential failures of the network in future. To detect node failures, the manager commands the agents to execute the failure detection management service by sending GET operations to retrieve node states. If it does not hear from a node, the manager consults the energy map to check whether the node has any residual energy; if so, the manager declares a failure and sends a notification to the observer. The scheme has the drawback of possibly producing false diagnostics. For instance, common nodes may be disconnected from their cluster head and thus unable to receive the GET operation from the manager, or GET and GET-RESPONSE packets may be lost due to noise. These conditions may mislead the manager into making incorrect fault detections. Furthermore, the random distribution and limited transmission range of common nodes and cluster heads provide no guarantee that
every common node can be connected to a cluster head. In addition, the transmission costs of network state polling are not considered in this approach.

4) Distributed Detection: The basic idea of distributed detection is to have each node make a decision on faults (typically a binary decision on abnormal sensor readings). This approach is especially energy-efficient and ideal for data-centric sensor applications. However, various research challenges remain in achieving a better balance between fault detection accuracy and the energy usage of the network. The efficiency of such failure detection schemes is usually measured in terms of node communication costs, precision, detection accuracy, and the number of faulty sensor nodes tolerable in the network. One suggested technique is fusion sensor coordination. In Clouqueur's work [15] on collaborative target detection, fusion sensors (acting as manager nodes) coordinate with each other to guarantee that they all obtain the same global information about the network before making a decision, since faulty nodes may send inconsistent information to different manager nodes. Wang et al. [16] look into the design of a fault-tolerant fusion rule for wireless sensor networks.
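A common primitive underlying these fusion schemes is a vote over the nodes' binary local decisions. The majority-vote sketch below is a deliberately simplified stand-in for the optimized fusion rules of [15, 16]: it tolerates any minority of faulty reports, but ignores the per-sensor reliability weighting that the real rules exploit.

```python
def fuse(decisions):
    """Majority vote over binary local decisions (1 = event detected).

    Returns 1 only if a strict majority reports the event, so fewer than
    half of the reporting nodes being faulty cannot flip the outcome.
    """
    return int(sum(decisions) * 2 > len(decisions))

print(fuse([1, 1, 0, 1, 0]))  # 1: event reported despite two dissenters
print(fuse([0, 0, 1, 0, 0]))  # 0: a single faulty '1' is outvoted
```

Ties resolve to 0 here (no alarm), a conservative choice; an optimized rule would instead weight each decision by the sensor's estimated fault probability.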
The classification approach of [16] improves fault-tolerant capability in the network while also reducing the overall computation time and memory requirements at the fusion sensors. With large-scale sensor networks in mind, this approach also adopts clustering to aggregate nodes into groups, reducing the power consumed by long-distance data transmission towards the base station. Each node makes a local decision according to an off-line optimized local decision rule, based on the fault-tolerant fusion rule of its particular cluster; the final decision is made at the cluster head by employing that fault-tolerant fusion rule.

III. FAULT DIAGNOSIS

Fault diagnosis is the stage at which the causes of a detected fault are properly identified and distinguished from other irrelevant or spurious alarms. The accuracy and correctness of detected faults have already been partly reviewed and achieved
in the fault detection phase, as in [2, 8, 12, 15]. However, there is still no comprehensive model or description of faults in sensor networks to support the network system in accurate fault diagnosis. Most existing approaches address fault models only at the individual node level (including hardware component malfunctions). In particular, both [10, 20] assume that the system software (including sensor application software) is already fault-tolerant; they focus on hardware-level faults of a sensor node, especially sensors and actuators, which are most prone to malfunction. Koushanfar et al. [20] adopt two fault models: the first relates to sensors that produce binary outputs, the second to sensors with continuous (analog) or multilevel digital outputs. In [15], Clouqueur et al. only consider faulty nodes caused by harsh environmental conditions; in their work, faulty nodes are assumed to send inconsistent and arbitrary values to other nodes during the information sharing phase. Ding et al. [9] model an "event" (or abnormal behaviour of a sensor node) by real numbers such as sensor readings instead of a 0/1 decision model. As a result, their algorithm is generic, as long as the thresholds and real-valued event descriptions can be specified from the fault tolerance requirements of the various sensor applications. Note that, apart from fault tolerance models for the hardware components (e.g. sensors, actuators) or abnormal behaviours (i.e. arbitrary sensor readings) of a single node, there is still a need to develop fault models at the network and distributed system level, which vary with the network's type and environment.

IV. FAULT RECOVERY

Ideally, the fault recovery phase is the stage at which the sensor network is restructured or reconfigured in such a way that failures or faulty nodes do not further impact network performance.
Most existing approaches in WSNs isolate failed (or misbehaving) nodes directly at the network communication layer (e.g. the routing layer). For example, in the technique of Marti et al. [6], once a faulty neighbour is detected, a node chooses a new neighbour to route through. Staddon et al. [18] proposed two approaches for restoring network routing paths around silent nodes (i.e. failed nodes) detected in each routing update epoch. Instead of taking passive recovery actions as above, proactive action is also considered a novel way to prevent potential faults in the network. In WinMS [4], the central manager detects network regions with weak health (e.g. low battery power) by comparing the current network state (including individual nodes) with a historical network information model (e.g. an energy map or topology map), and takes proactive action by instructing nodes in that area to send data less frequently in order to conserve energy. Koushanfar et al. [20] suggested a heterogeneous back-up scheme for tolerating and healing the
hardware malfunctions of a sensor node. They argue that a single type of hardware resource can back up different types of resources; the key idea is to adapt application algorithms and/or the operating system to match the available hardware and the application's needs. Their current approach focuses on five primary types of resource: computing, storage, communication, sensing, and actuating, which can replace each other via suitable changes in system and application software. Although this solution is not directly relevant to fault recovery at the level of network system management, it provides a useful vision for future designs that reconfigure or update the management functionality of a node when its management responsibilities change due to faults in the network. Currently, one of the obvious ways to update the management functionality of a node is the use of mobile code technology.

V. FAULT MANAGEMENT ARCHITECTURE

An efficient architecture is usually required to support the execution of management activities. Generally, these architectures can be classified into centralized, distributed, and hierarchical models.

--Centralized model, designed for centralized fault detection approaches. A central controller is responsible for fault maintenance of the entire network. To construct a global view of the network, the central controller keeps nodes' states up to date through message exchange, and aggregates these data into information models such as metrics [19], a topology model [18], topology and energy maps [4], WSN models [23], or a cluster topology model [2]. The central node identifies faulty or suspicious nodes by comparing current node states against these historical information models. This is seen as an accurate way to identify faults; however, the centralized architecture becomes less efficient and more energy-expensive in large-scale sensor networks.
The message flooding of this approach may greatly deplete node energy because of frequent in-network message exchanges.

--Distributed model, splits the entire network into several sub-regions and distributes fault management tasks evenly among them. Each region employs a central manager, responsible for monitoring and detecting failures in its region, which can also communicate directly with other managers in a coordinated fashion for fault detection. As a result, the central controller of the overall network only needs to monitor a very small number of sensors. This design conserves node energy by reducing in-network communication, and also improves the system's response time to events occurring in the network. Many existing approaches have adopted clustering to group nodes into regions. Clustering
allows sensors to efficiently coordinate their local interactions in order to achieve global goals. In particular, localized clustering [16] can contribute to more scalable behaviour as the number of nodes increases, improved robustness, and more efficient resource utilization for many distributed sensor coordination tasks. As mentioned in Section II, cluster-based failure detection [2] enables the cluster head (CH) to efficiently identify suspicious nodes within a small region of the network by exchanging heartbeat messages. In addition, a gateway node, which is a one-hop neighbour of the CHs of two different clusters, provides the communication 'bridge' through which two CHs propagate fault reports across the network; with high probability, multiple gateway candidates will be available for robust cluster connectivity. In MANNA [25], Ruiz et al. also propose a simple network management protocol, MannaNMP, to support sub-network (cluster) managers in exchanging management information; the protocol is designed to take account of the specific characteristics of WSNs.
Fig. 2. Hierarchical architecture of mobile agent-based policy management [1]
--Hierarchical model, a hybrid of the centralized and distributed approaches. It uses intermediate managers to distribute management functions; these intermediate managers do not communicate directly with each other. Each manager is responsible for managing the nodes in its sub-network and reports to its higher-level central manager. In the three-layered hierarchical sensor network structure of [13], clustering is adopted to reduce the number of sensor nodes monitoring events in the network, on the premise that only a tiny fraction of the sensors can accurately detect a target event, while the measurements of most sensors (those far away from the target) are pure noise. Sensors send data about detected events only to their corresponding cluster heads instead of transmitting to a far-away central fusion centre. The cluster head makes a decision about fault events within its sub-region, and the cluster heads' decisions are then transmitted to the fusion centre to indicate whether there is a target or event in specific sub-regions. To accomplish fault management objectives in a reliable and energy-efficient way, Ying et al. [1] propose a hierarchical mobile agent-based policy management architecture (as in Fig. 2) for sensor networks, in which a Policy Manager (PM) sits at the highest level, a Local Policy Agent (LPA) manages each sensor node, and a Cluster Policy Agent (CPA) acts as an intermediate
component between PM and LPA. The management commands are always propagated from the PM to CPAs to LPAs.
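The strictly top-down dissemination in this hierarchy can be sketched as a two-level fan-out, with no lateral CPA-to-CPA traffic. The names and the dictionary-based topology below are illustrative assumptions, not API details from [1].

```python
def propagate(policy, pm_to_cpas, cpa_to_lpas):
    """Disseminate a management command down a PM -> CPA -> LPA
    hierarchy. Returns the set of LPAs that received the policy.

    `pm_to_cpas` lists the CPAs known to the PM; `cpa_to_lpas` maps each
    CPA to the LPAs of the sensor nodes in its cluster.
    """
    delivered = set()
    for cpa in pm_to_cpas:                    # PM forwards to each cluster agent
        for lpa in cpa_to_lpas.get(cpa, ()):  # CPA relays to its node agents
            delivered.add(lpa)
    return delivered

topology = {"cpa1": ["lpa1", "lpa2"], "cpa2": ["lpa3"]}
print(sorted(propagate("sleep-schedule", ["cpa1", "cpa2"], topology)))
# ['lpa1', 'lpa2', 'lpa3']
```

Because commands only ever flow downwards, a CPA failure silently disconnects its whole cluster, which is one reason such hierarchies still need the failure detection mechanisms of Section II.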
VI. CONCLUSION
Wireless sensor networks have gradually emerged as a cutting-edge technology for developing new wireless applications for pervasive computing in the 21st century. One of the challenges in realising this vision is robust fault management. In this paper, we provided a comprehensive overview of fault management in WSNs and surveyed existing approaches, classifying them into three phases: fault detection, fault diagnosis, and fault recovery. As we have seen, the fundamental requirement of self-managed WSN solutions is an efficient management architecture. However, there is a trade-off between the complexity of the architecture designed for fault management in WSNs and the resource consumption of the sensor network. More specifically, reconfiguring the fault management functionality of a sensor node is one main challenge when node management responsibilities change. So far, mobile code technology has been proposed as an efficient scheme: it selectively chooses and updates the parts of the code needed to support adaptation and reconfiguration of node functionality. To achieve this, sensor nodes in the network have to distribute and forward the on-demand mobile code to the destination nodes, and these actions consume the nodes' limited resources, such as battery energy. Thus, there is still a need for alternative solutions that support the reconfiguration of sensor node functionality in an energy-efficient way. Moreover, self-manageable WSNs are also required to have a certain wisdom that clearly identifies the various faults in the network. In order to distinguish between different faults and subsequently take the appropriate actions, the development of theoretical and realistic fault models for WSNs becomes another key requirement.
REFERENCES
[1] Z. Ying, X. Debao. Mobile Agent-based Policy Management for Wireless Sensor Networks. in WCNM 2005.
[2] Ann T. Tai, Kam S. Tso, William H. Sanders. Cluster-Based Failure Detection Service for Large-Scale Ad Hoc Wireless Network Applications. in Dependable Systems and Networks (DSN '04), 2004.
[3] Linnyer Beatrys Ruiz, Jose Marcos S. Nogueira, Antonio A.F. Loureiro. MANNA: A Management Architecture for Wireless Sensor Networks. IEEE Communications Magazine, 2003. 41(2): p. 116-125.
[4] Winnie Louis Lee, Amitava Datta, Rachel Cardell-Oliver. WinMS: Wireless Sensor Network-Management System, An Adaptive Policy-Based Management for Wireless Sensor Networks. 2006, UWA, Australia.
[5] A. Perrig, R. Szewczyk, V. Wen, D. Culler, J.D. Tygar. SPINS: Security Protocols for Sensor Networks. in ACM MobiCom '01, 2001. Rome, Italy: ACM Press.
[6] Sergio Marti, T.J. Giuli, Kevin Lai, Mary Baker. Mitigating Routing Misbehavior in Mobile Ad Hoc Networks. in 6th International Conference on Mobile Computing and Networking, 2000. Boston, Massachusetts, USA: ACM.
[7] Chihfan Hsin, Mingyan Liu. A Distributed Monitoring Mechanism for Wireless Sensor Networks. in 3rd Workshop on Wireless Security, 2002: ACM Press.
[8] Chihfan Hsin, Mingyan Liu. Self-monitoring of Wireless Sensor Networks. Computer Communications, 2005. 29: p. 462-478.
[9] Min Ding, Dechang Chen, Kai Xing, Xiuzhen Cheng. Localized Fault-Tolerant Event Boundary Detection in Sensor Networks. in INFOCOM 2005.
[10] Jinran Chen, Shubha Kher, Arun Somani. Distributed Fault Detection of Wireless Sensor Networks. in DIWANS '06, 2006. Los Angeles, USA: ACM Press.
[11] Xuanwen Luo, Ming Dong, Yinlun Huang. Optimal Fault-Tolerance Event Detection in Wireless Sensor Networks.
[12] Xuanwen Luo, Ming Dong, Yinlun Huang. On Distributed Fault-Tolerant Detection in Wireless Sensor Networks. IEEE Transactions on Computers, 2006.
[13] Ruixin Niu, Pramod K. Varshney. Distributed Detection and Fusion in a Large Wireless Sensor Network of Random Size. EURASIP Journal on Wireless Communications and Networking, 2005. 4: p. 462-472.
[14] Ruixin Niu, Pramod K. Varshney, Michael Moore, Dale Klamer. Decision Fusion in a Wireless Sensor Network with a Large Number of Sensors. in IF04, 2004. Stockholm, Sweden.
[15] Thomas Clouqueur, Kewal K. Saluja, Parameswaran Ramanathan. Fault Tolerance in Collaborative Sensor Networks for Target Detection. IEEE Transactions on Computers, 2004. 53(3): p. 320-333.
[16] Tsang-Yi Wang, Yunghsiang S. Han, Pramod K. Varshney, Po-Ning Chen. Distributed Fault-Tolerant Classification in Wireless Sensor Networks. IEEE Journal on Selected Areas in Communications, 2005. 23(4): p. 724-734.
[17] Sapon Tanachaiwiwat, Pinalkumar Dave, Rohan Bhindwale, Ahmed Helmy. Secure Locations: Routing on Trust and Isolating Compromised Sensors in Location-Aware Sensor Networks.
[18] Jessica Staddon, Dirk Balfanz, Glenn Durfee. Efficient Tracing of Failed Nodes in Sensor Networks. in First ACM International Workshop on Wireless Sensor Networks and Applications, 2002. Atlanta, GA, USA: ACM.
[19] Nithya Ramanathan, Kevin Chang, Rahul Kapur, Lewis Girod, Eddie Kohler, Deborah Estrin. Sympathy for the Sensor Network Debugger. in 3rd ACM Conference on Embedded Networked Sensor Systems, 2005. San Diego, USA: ACM Press.
[20] Farinaz Koushanfar, Miodrag Potkonjak, Alberto Sangiovanni-Vincentelli. Fault Tolerance Techniques for Wireless Ad Hoc Sensor Networks. 2000.
[21] S. Harte, A. Rahman, K. M. Razeeb. Fault Tolerance in Sensor Networks Using Self-Diagnosing Sensor Nodes. in IE2005, 2005: IEEE.
[22] Anmol Sheth, Carl Hartung, Richard Han. A Decentralized Fault Diagnosis System for Wireless Sensor Networks. in 2nd Conference on Mobile Ad Hoc and Sensor Systems, 2005. Washington, USA.
[23] Linnyer Beatrys Ruiz, Isabela G. Siqueira, Leonardo B. e Oliveira, Hao Chi Wong, Jose Marcos S. Nogueira, Antonio A.F. Loureiro. Fault Management in Event-Driven Wireless Sensor Networks. in International Workshop on Modeling Analysis and Simulation of Wireless and Mobile Systems, 2004. Venice, Italy: ACM Press.
[24] Deborah Estrin, Ramesh Govindan, John Heidemann, Satish Kumar. Next Century Challenges: Scalable Coordination in Sensor Networks. in ACM/IEEE International Conference on Mobile Computing and Networking, 1999.
[25] Linnyer Beatrys Ruiz, Isabela G. Siqueira, Leonardo B. e Oliveira, Hao Chi Wong, Jose Marcos S. Nogueira, Antonio A.F. Loureiro. On the Design of a Self-Managed Wireless Sensor Network. IEEE Communications Magazine, 2005: p. 95-102.
in First ACM International Workshop on Wireless Sensor Networks and Applications. 2002. Altanta, GA, USA: ACM. NIthya Ramanathan, Kevin Chang, Rahul Kapur, Lewis Girod, Eddie Kohler, Deborah Estrin. Sympathy for the Sensor Network Debugger. in 3rd Embedded networked sensor systems. 2005. San Diego, USA: ACM Press. Farinaz koushanfar, Miodrag Potkonjak, Alberto SangiovanniVincentelli. Fault Tolerance Techniques for Wireless Ad Hoc Sensor Networks. 2000. S Harte, A Rahman, K M Razeeb. Fault Tolerance In Sensor Networks using Self-Diagnosing Sensor Nodes. in IE2005. 2005: IEEE. Anmol Sheth, Carl Hartung, Richard Han. A Decentralized Fault Diagnosis System for Wireless Sensor Networks. in 2nd Mobile Ad Hoc and Sensor Systems. 2005. Washington, USA. Linnyer Beatrys Ruiz, Isabela G.Siqueira, Leondardon B. e Oliveria, Hao Chi Wong, Jose Marcos S. Nogueira, Antonio A.F. Loureiro. Fault management in event-driven wireless sensor networks. in International Workshop on Modeling Analysis and Simulation of Wireless and Mobile Systems. 2004. Venice, Italy: ACM Press. Deborah Estrin, Ramesh Govindan, John Heidemann, Satish Kumar. Next Century Challenges: Scalable Coordination in Sensor Networks. in ACM/IEEE International Conference on Mobile Computing and networking. 1999. Linnyer Beatrys Ruiz, Isabela G.Siqueira, Leondardon B. e Oliveria, Hao Chi Wong, Jose Marcos S. Nogueira, Antonio A.F. Loureiro, On the Design of a Self-Managed Wireless Sensor Network. IEEE Communications Magazine, 2005: p. 95-102.