Implementing an Autonomic Architecture for Fault-Tolerance in a Wireless Sensor Network Testbed for At-Scale Experimentation
Mukundan Sridharan, Sandip Bapat, Rajiv Ramnath, Anish Arora
The Ohio State University, Department of Computer Science and Engineering, Columbus, OH 43210
1-614-292-5813
{sridhara,bapat,ramnath,anish}@cse.ohio-state.edu
ABSTRACT
The wireless sensor networking (WSN) community has increasingly grown to rely on experimentation with large-scale testbeds as a means of verifying protocols, middleware and applications. These testbeds need to be highly available in order to support this community, but are themselves complex, and complex to manage, being prone to faults in hardware, software specification and software implementation. In this paper we report on our experience in designing Kansei, a WSN testbed for experimentation at scale, to be autonomic, i.e. self-healing and self-managing. We implement autonomic management in Kansei through an architecture that consists of a hierarchy of self-contained components, extended with detectors for discovering faults and correctors for subsequent stabilization. We find that our invariant-based architecture is well suited for large complex systems with an unpredictable fault model, and that its fault-monitoring framework can be extended to include user programs.
Categories and Subject Descriptors
C.4 [Performance of Systems]: Fault tolerance; Reliability, availability, and serviceability.
General Terms Reliability, Management, Experimentation, Design, Theory.
Keywords
Autonomic Software Design, Detectors and Correctors, Fault-Tolerance, Wireless Sensor Testbed.
1. INTRODUCTION
With research on wireless sensor networks having increasingly adopted experimentation with actual state-of-the-art hardware and software platforms, shared large-scale testbeds have become the preferred basis for experimentation and testing before deployment in the field, and an integral part of the wireless community. However, while testbeds need to be highly available to support the research community, they are themselves complex systems prone to faults in hardware, software specification and software implementation. Further, because they must support wide-ranging experimentation, testbeds should expose as much of the capability of the underlying hardware and software components as possible, so that users may push the envelope of the state of the field by running experiments at the limits of the underlying system. Thus any testbed design has the conflicting requirements of allowing maximum control by users, while also ensuring that the testbed remains a stable platform for experimentation for other users of this shared infrastructure and returns to a stable state at the end of an experiment. Note also that testbeds cannot, for example, simply reboot the used portion after every job: this could take a considerable amount of time and, for the low-end devices typically found on a testbed, has proved to be damaging. Also, a long-running testbed simply cannot rely on human management for its day-to-day operations. Thus a WSN testbed must be feature-rich and flexible, yet stable and self-managing (i.e. autonomic). In this paper we present our approach to implementing autonomic capability in Kansei, a large-scale WSN testbed, through the use of detectors for recognizing faults and, on detection, correctors to achieve stabilization. The validation of the design is implicit in the fact that the architecture is able to manage a truly large-scale testbed like Kansei with minimal human intervention and a very high availability rate (typically above 85%).
Figure 1: Kansei Testbed

Specifically, we discuss the design and implementation of the autonomic architecture of Kansei. We begin with a description of Kansei (Section 2), followed by a structured system specification and a fault model for the testbed (Section 3), which is based on a
classification of the actual faults encountered on the testbed. Next (in Section 4), we describe the theory behind detectors and correctors and detail a fault-tolerant autonomic architecture and implementation for the system, as derived from the system specification and fault model. Then (in Section 5) we detail examples of the detectors and correctors implemented in Kansei. We end the paper with related research (Section 6) and conclusions and future work (Section 7).
2. KANSEI OVERVIEW
Kansei is a state-of-the-art wireless sensor testbed that makes sensor and wireless experimentation fast and convenient. Kansei services are offered free of charge to wireless sensor network researchers around the globe. The Kansei [4] testbed currently consists of four kinds of hardware devices: 100 eXtreme Scale Motes (XSMs), 100 Tmotes, 100 Stargates, and 3 Linux-based personal computers; an Imote2 layer and a SunSPOT layer are planned for addition. Figure 1 shows a picture of the Stargate-XSM-Tmote devices arranged in a grid. The XSMs and Tmotes are mote platforms which run, for instance, TinyOS [5], a lightweight, event-based operating system that implements a networking stack and a sensor interface. Each mote integrates a variety of sensors, including a photocell, a temperature sensor, four passive infrared (PIR) sensors, a two-axis magnetometer, and a microphone.
Figure 2: A Kansei node: XSM, Tmote and Imote2 attached to a Stargate

Figure 2 pictures a single Kansei node with an XSM, a Tmote and an Imote2 attached to a Stargate. The Stargate is an expandable single-board computer with Intel's 400-MHz PXA255 CPU running the Linux operating system. It has several interfaces, including RS-232, 10/100 Ethernet, USB, and 802.11a/b. Stargates also serve as an integration point for mote-level devices. Stargates are connected through high-speed network switches to an Ethernet back-channel, which provides high-bandwidth connectivity for management commands, data injection, and extraction. The PC cluster connected to the Ethernet backbone runs the code for visualization, simulation, and diagnostic analysis. A separate PC serves as experiment manager and provides remote (Web-based) access to Kansei. Jobs consisting of a combination of applications developed for the above platforms can be uploaded to the devices and the results downloaded remotely through the web interface. In this paper we concentrate on the software architecture of Kansei. For more details on Kansei devices and features, readers are directed to [6].
3. KANSEI SYSTEM SPECIFICATION AND FAULT MODEL
3.1 Kansei System Model
The Kansei system model, S, consists of:
• Four hardware layer specifications, Hxsm, Htmote, Hstargate and Hpc, corresponding to the four device types on the testbed. Each specification consists of parameters that specify the resources provided, such as the CPU, radio, radio frequencies and channels, memory size and external interfaces.
• Each hardware layer has an associated software services specification, covering services such as logging, security, and data injection.
• For each hardware layer H, Kansei also specifies a topology H.t and maintains an up-to-date status H.s for the devices in that layer.
3.2 Kansei Operation Specification
Given the Kansei system model S:
• Kansei takes as input a user program P and a configuration C.
• If C is valid under S, then P is executed under S, in accordance with C. The program P is run for a specified time C.t, on a network topology C.n, to create the output C.o. As P is executed under S, a set of user-specified invariants C.i is monitored.
• After time C.t the nodes in C.n are restored to the normal configuration.
A sketch of one possible representation of such a configuration follows.
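To make the operation specification concrete, the sketch below shows one way the configuration C and its fields (C.t, C.n, C.o, C.i) could be represented; the JobConfig class and its field names are illustrative assumptions, not Kansei's actual data model.

```python
# Illustrative sketch only: JobConfig and its field names are assumptions,
# not Kansei's actual data model.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class JobConfig:
    """A user job configuration C from the operation specification."""
    duration_s: int                      # C.t: how long the program P runs
    topology: List[str]                  # C.n: device ids the job is deployed on
    output_files: List[str]              # C.o: files returned on completion
    invariants: Dict[str, Callable[[dict], bool]] = field(default_factory=dict)  # C.i

def is_valid(config: JobConfig, available_devices: List[str]) -> bool:
    """C is valid under S only if every requested device exists in the testbed."""
    return config.duration_s > 0 and all(d in available_devices for d in config.topology)
```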
3.3 Fault Model
The traditional way of defining a fault model and the fault-tolerant behavior of a distributed system is to define the class of likely faults and then show that the system stabilizes under these faults. This approach is impractical for complex systems such as Kansei, where we are forced to consider both hardware and software failures as well as faults created by user programs. Given such a system and its intended use, it is impossible to fully specify every possible fault that could occur. We therefore take the alternate, invariant-perturbation based approach of specifying the ideal behavior of the system, with a fault being any deviation from this specification. Given the Kansei system specification S, we derive specifications and invariants for individual components. Any violation of these invariants is considered a fault that the system must be designed to detect, and correct if possible. Note that when such a model for faults is used, the fault tolerance of the system depends on the completeness of the invariant specification. This is, in fact, a strength of the architecture in that it allows for the gradual addition of invariants as more faults are discovered, without requiring a redesign of the system components themselves.
Note that achieving the stated objective of an operation under every type of fault in the system is not always possible. For example, if the node hardware fails when a user program is running on the node, it is impossible for the user program to run to completion. So, depending on what is achievable under a particular fault, we classify the kind of fault-tolerance as:
• Masking fault tolerance: In the presence of faults the system/component never violates the specification and eventually resumes/completes correct operation (i.e. faults in the system are not visible to the users).
• Non-masking fault tolerance: In the presence of faults the system/component violates the specification, but when the fault stops, the system/component eventually resumes correct operation (i.e. the user might observe the fault, but the system corrects itself and completes the job).
• Fail-safe fault tolerance: In the presence of faults the system/component violates the invariant. When the fault stops, the system/component might not resume correct operation (i.e. the system detects the fault, but is not able to correct it).
Further, the faults on Kansei can themselves be classified as:
• Transient faults: These are not permanent and will stop occurring in finite time. Once a transient fault has terminated, the system should correct itself and run to completion.
• Fail-stop faults (permanent faults), where the system should be marked as incorrect within finite time. A recoverable sub-class of these faults consists of crash-restart faults. A hardware failure is a non-recoverable fail-stop fault, while a process crash is a crash-restart fault.
3.4 System Invariants
Having adopted an invariant-perturbation fault model, it is important to define the system invariants carefully so as to capture the faults. We classify the invariants definable for the system as follows:
• Job invariants: Invariants related to the correct deployment, logging and clean-up of jobs.
• Health invariants: Invariants related to the health of the hardware and software components.
• Resource invariants: Invariants related to detecting resource conflicts and incorrect use.
• User program invariants: Invariants specified by the user to be monitored during execution.
We discuss each class of invariants in detail in Section 5.
4. A FAULT-TOLERANT ARCHITECTURE USING DETECTORS AND CORRECTORS
In this section we first outline the theory of detectors and correctors for designing fault-tolerant software and then describe the Kansei architecture in detail. Our design approach for fault-tolerance consists of dividing the system into multiple autonomous components that are self-managing, and implementing "detectors and correctors" for these components, thereby making them autonomic. We draw heavily from the theory of fault-tolerant component design using detectors and correctors proposed in [7] and [8]. Our approach is directly based on the framework suggested in [9]. Our architectural approach may be summarized as follows:
1. The Kansei Specification defines the overall expected behavior of the system. The autonomous components and modules of Kansei cooperate to achieve this specification.
2. Each component has a well-defined sub-specification and a set of invariants based on that sub-specification.
3. Fault-tolerance is achieved by implementing detectors and correctors for these invariants, which depend on the services provided by a "Trusted Base" (Section 4.1). Once a corrector corrects the system to a legal state, any future correct execution will result in a legal system state.
4. Detectors can be scheduled to run periodically or based on specific events (e.g. one fault could trigger a detector for a related fault). Knowledge of the fault for which a detector has been designed can help tune the detector's frequency, thereby increasing the efficiency of the system.
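The list above can be read as a small component contract: every invariant pairs a detector predicate with a corrector action. A minimal sketch of that contract follows; the class and method names are our own illustration, not the actual Kansei code.

```python
# Minimal sketch of the detector/corrector contract described above.
# Class and method names are illustrative, not taken from the Kansei code base.
from abc import ABC, abstractmethod

class Invariant(ABC):
    @abstractmethod
    def holds(self) -> bool:
        """Detector: return True if the component state satisfies the invariant."""

    @abstractmethod
    def correct(self) -> None:
        """Corrector: drive the component back to a legal state."""

def stabilize(invariants: list[Invariant]) -> None:
    """One detection pass: run every detector and trigger correctors on violations."""
    for inv in invariants:
        if not inv.holds():
            inv.correct()
```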
4.1 The Trusted Base
The architecture guarantees that as long as the detectors and correctors themselves are not corrupted, the system stabilizes to a legal state. Thus, in order to prevent faulty components from corrupting the detectors and correctors, these are isolated and implemented on top of what is called the "Trusted Base" [9]. The Base incorporates a "Trusted Store" and provides scheduling services so that detectors and correctors are invoked randomly and infinitely often; this is known as angelic scheduling [9]. More specifically, the Trusted Base services consist of:
1. Trusted Read and Write: For detectors to detect faults, they have to reliably read the state of a component, however corrupt that state might be. Similarly, for correctors to correct the state, they should be able to correctly write to a memory location, however corrupt that location might be. These services are generally available in any modern operating system.
2. Trusted Schedule: In order to schedule detectors and correctors in a manner that cannot be affected by, or predicted by, the rest of the system’s components, a way of reliably and randomly scheduling processes is needed. We approximate angelic scheduling using a system-based “cron” daemon (available in any standard Unix system), scheduled independently of the system components.
3. Trusted Store: Detectors and correctors need reliable storage for private state that cannot be corrupted by other components. On the Kansei server, the trusted store is implemented using a relational database. On the Stargates, the trusted store is provided by a root-level process operating on a protected file area.
This approach guarantees that faults generated locally are handled properly. However, in a distributed system it is still possible for a component to be corrupted by a remote method invocation, where one faulty component spreads faults to other components. In [9], the authors prove that if the detectors and correctors are executed in a synchronized way across the system, the system will eventually stabilize once the faults stop occurring. In our system we have observed that fault propagation is minimal; hence, we schedule a synchronized run of all our detectors just twice every day.
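A rough sketch of how these Trusted Base services might fit together: detector observations are written to a protected store, reads of external state go through the operating system, and the detection pass is launched by the system cron daemon independently of the components being checked. The database path, table name, monitored process name and crontab entry below are all assumptions for illustration.

```python
# trusted_store.py -- illustrative sketch of a trusted store plus one detection pass.
# The database path, table name, process name and crontab entry are assumptions.
import sqlite3, subprocess, time

STORE = "/var/lib/kansei/trusted_store.db"   # root-owned, protected file area

def record(component: str, healthy: bool) -> None:
    """Trusted write: persist the detector's observation outside the component."""
    with sqlite3.connect(STORE) as db:
        db.execute("CREATE TABLE IF NOT EXISTS health(ts REAL, component TEXT, ok INTEGER)")
        db.execute("INSERT INTO health VALUES (?, ?, ?)", (time.time(), component, int(healthy)))

def process_alive(name: str) -> bool:
    """Trusted read of external state: is the monitored process running?"""
    return subprocess.run(["pgrep", "-x", name], capture_output=True).returncode == 0

if __name__ == "__main__":
    record("stargate_manager", process_alive("sgmanager"))

# Angelic scheduling is approximated with the system cron daemon, e.g. in /etc/crontab:
#   */10 * * * *  root  python3 /usr/local/bin/trusted_store.py
```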
4.2 Kansei Software Architecture
Our system design consists of three main components: a "Kansei Director" (KD) that manages the overall job scheduling, a "PC Manager" (PM) that manages the PC cluster, and a "Stargate Manager" (SM) on each Stargate which manages itself and the attached mote devices. Each of these components has its own detectors and correctors. The KD includes a sub-component, "Chowkidar", a distributed, autonomic, self-stabilizing component that continuously monitors the health of all devices [11].
Figure 3 shows the overall Kansei architecture. Functionally, there is a master-slave relationship between the Kansei Director (KD) and the Stargate Manager (SM), in that the KD sends commands and job details to the SM. The KD also acts as the user interface.
There are two different regions of autonomic behavior in Kansei. The first is the entire testbed, which manages itself: it deploys a job, cleans it up, returns results, finds faulty devices, tolerates faults and heals itself. The other regions are those associated with Kansei components such as the Stargate Manager, PC Manager or Health Monitor, each with its own detectors and correctors. In contrast to the master-slave functional view, the components are in fact independent autonomic (albeit cooperating) entities with respect to fault-tolerant behavior. Such a system of hierarchical components, each with its own fault-tolerance, is called multi-tolerant. We now provide details about the Kansei Director and the Stargate Manager.
Figure 3: Kansei Software Architecture (the Kansei Director, with its Web Interface, Database, Kansei Scheduler, Chowkidar Health Monitor and Detector/Corrector, connects over Ethernet to the Stargate Managers on the Stargate Array and the PC Manager on the Linux PC cluster, each with its own Detector/Corrector)
4.2.1 Kansei Director
The Kansei Director consists of (a) the Web Interface, (b) the Database, (c) the Scheduler, (d) the Chowkidar Health Monitor and (e) the Detector-Corrector subsystem. For completeness we briefly describe the other components of the Director here, while elaborating on the detectors and correctors in Section 5.
• Web Interface: This enables a user to submit jobs for any combination of the hardware and to get results along with debugging information.
• Database: This provides secure storage of state and controlled sharing of job state across the different processes of the Kansei Director. It also provides the Trusted Store for the Detector-Corrector subsystem.
• Kansei Scheduler: This process queries the database periodically for pending jobs, allocates resources and schedules them as needed. For each job to be scheduled it creates a job manifest (i.e. a job configuration, as shown in Figure 4) and sends it to the appropriate Stargate Manager, along with the necessary files.
• Chowkidar: This is the health-monitoring component of Kansei. Conceptually this is a detector for the hardware devices in the testbed: it checks and reports on the hardware status of devices.
• Detector/Corrector subsystem: This ensures the safe running of the Kansei Director; it is a separate process that checks the invariants of the Director. Any violation triggers the appropriate corrector(s).
Assuming that the operating system itself is not compromised, the detector/corrector process will restore the state of the system to a legal state. Notifications of job terminations are sent to the relevant stakeholders: the administrator and the testbed users.
4.2.2 Stargate Manager
The Stargate Manager complements the Kansei Director by autonomically managing the Stargate and the attached mote devices. Figure 4 shows the architecture of a Stargate Manager and its interactions with the Kansei Director. For scheduling jobs correctly, the Stargates and the Kansei Director must be time-synchronized. To satisfy this requirement each Stargate runs an NTPD client, whereas the Kansei Director runs an NTPD server.
In order to schedule a job on a Stargate, the Director creates a "Job Manifest" which specifies the executable files to run on each device, how long to run the job, and what files to return upon job completion. This job manifest is sent to the SM along with the other necessary executables and support files. The SM reads the manifest and schedules the job accordingly. The manager stores information about current jobs in a "Job Table". Start and Stop commands from the KD are acknowledged by the SM. If a fault results in job termination, an appropriate error message is sent back to the KD. A separate detector and corrector sub-component monitors the state of the SM.
Figure 4: Stargate Manager Architecture (the Kansei Director exchanges START/ACK_START, STOP/ACK_STOP and RET_FILES messages with the Stargate Manager, which maintains a Job Table of entries such as Job_id, User_name, End_time and Return_files; a sample Job Manifest reads: job_id=439, job_time=2006-08-27 12:46:00, user=kuser1, xsm_file=kansei-1XyphY, xsm_erase=0, telos_file=, stargate_cmd=hmon3 localhost 9000 10.11.0.214, return_files=datalog.txt,datalog-telos.txt,t2_log.txt,job.log)
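Since the manifest in Figure 4 is a flat list of key=value pairs, a Stargate Manager could parse it along the following lines; parse_manifest is a sketch under that assumption, not the actual SM implementation.

```python
# Sketch of parsing a Figure 4 style job manifest (one key=value pair per line).
# parse_manifest is illustrative; it is not the actual Stargate Manager code.
def parse_manifest(text: str) -> dict:
    manifest = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue                       # skip blank or malformed lines
        key, _, value = line.partition("=")
        manifest[key] = value
    return manifest

sample = """job_id=439
job_time=2006-08-27 12:46:00
user=kuser1
return_files=datalog.txt,job.log"""

job = parse_manifest(sample)
return_files = job["return_files"].split(",")   # files to ship back to the Director
```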
5. INVARIANTS, DETECTORS AND CORRECTORS FOR KANSEI
In this section, we give examples of each class of invariants introduced in Section 3.4 and their corresponding detectors and correctors. Note that this is not an exhaustive list of all the invariants. For every invariant there is an associated detector that deals with the faults associated with that invariant; in this sense an invariant is synonymous with its associated detector. At the end of this section, we provide a list of actual faults from our testbed, together with their corresponding fault type and invariant. Table 1 summarizes the invariants, the expected behavior of the overall system under their violation, and the behavior of the component where the fault occurs. All detectors are scheduled to run in an ongoing but random manner, with frequencies that depend upon the invariant.
5.1 Invariants and Correctors for Job Control
This set of invariants ensures that jobs submitted by users are deployed, executed, logged and cleaned up properly. An interesting point to note is that a significant number of faults under these invariants tend to be fail-stop faults for which only fail-safe tolerance can be provided.
• Pre-deployment Invariant (KD): All devices selected for a job should be in the "ready" or "failed" state, and fewer than 10% of the selected devices should be "failed". The detector for this invariant is scheduled before every job deployment.
Corrector Action: Mark the Scheduler process as faulty (and start a database consistency check). This fault means that the scheduler accepted a job for a topology which does not have enough "ready" devices.
• Job table consistency Invariant (KD): If no job is "running", then none of the devices should be "busy". A sketch of this check appears at the end of this subsection.
Corrector Action: Send a "CLEAR_ALL_JOBS" message to the devices violating the invariant and set their status to "ready".
• Job status Invariant (SM): All processes of a job should be alive for the entire period specified in the job configuration.
Corrector Action: Kill the corresponding job and return a "JOB_ERROR" message to the KD.
• Job termination Invariant (SM): For all jobs in the job table, the end time should be greater than the current time.
Corrector Action: Terminate the job violating the invariant and return "JOB_TERMINATED" to the KD.
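To illustrate a job-control detector, the job table consistency invariant above can be phrased as two queries against the Director's database, with the corrector issuing the "CLEAR_ALL_JOBS" message described above; the table and column names and the send callback are assumptions of this sketch.

```python
# Illustrative job-table consistency detector; table and column names are assumptions.
import sqlite3

def job_table_consistent(db_path: str) -> bool:
    """Invariant: if no job is 'running', then no device may be 'busy'."""
    with sqlite3.connect(db_path) as db:
        running = db.execute("SELECT COUNT(*) FROM jobs WHERE status = 'running'").fetchone()[0]
        busy    = db.execute("SELECT COUNT(*) FROM devices WHERE status = 'busy'").fetchone()[0]
    return running > 0 or busy == 0

def correct(db_path: str, send) -> None:
    """Corrector: clear stale jobs on the offending devices and mark them ready."""
    with sqlite3.connect(db_path) as db:
        stale = [row[0] for row in db.execute("SELECT id FROM devices WHERE status = 'busy'")]
        for dev in stale:
            send(dev, "CLEAR_ALL_JOBS")                      # assumed messaging hook
            db.execute("UPDATE devices SET status = 'ready' WHERE id = ?", (dev,))
```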
5.2 Invariants and Correctors for System Health
• 802.11 Radio invariant (SM): The previous radio cell of any device should be the same as its current radio cell.
Corrector Action: Initiate a radio-leader-election process. After leader election, restart the radios starting with the leader as center and moving outwards in concentric circles until a cell is found.
• Hardware health invariants (SM): All hardware devices should be up at all times.
Corrector Action: Notify the Kansei Director about the failed device.
• Software health invariants (KD & SM): All software components of the testbed should be alive and running the same version of the software.
Corrector Action: Update the software components and restart.
• TimeSync Invariant (SM): The time difference between a Stargate and the Kansei Director should not be more than a threshold.
Corrector Action: Restart NTPD.
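As a concrete illustration of a health detector, the TimeSync invariant could be checked by comparing the Stargate clock against the Director's and restarting the NTP daemon on violation; the helper for obtaining the Director's time, the threshold, and the restart command are assumptions in this sketch, not the actual Stargate Manager code.

```python
# Sketch of a TimeSync detector/corrector for the Stargate Manager.
# director_time is an assumed input (the Director's clock); the threshold
# and restart command are illustrative.
import subprocess, time

MAX_SKEW_S = 5.0

def timesync_ok(director_time: float) -> bool:
    """Invariant: |Stargate clock - Director clock| must stay below the threshold."""
    return abs(time.time() - director_time) < MAX_SKEW_S

def correct_timesync() -> None:
    """Corrector action from the text above: restart the NTP daemon."""
    subprocess.run(["/etc/init.d/ntpd", "restart"], check=False)
```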
Invariant Violated | Overall System Behavior | Individual Component/Program Behavior
Job pre-deployment | Non-masking tolerance: Notify user about bad configuration | Non-masking tolerance: Trigger database consistency check
Job table consistency | Non-masking tolerance: Kill all jobs violating the invariant | Non-masking tolerance: Kill all jobs violating the invariant
Job status | Fail-safe tolerance: Notify user about failed jobs | Fail-safe tolerance: Kill jobs violating the invariant
Job termination | Masking tolerance: Clean up job violating the invariant | Masking tolerance: Clean up the job violating the invariant
802.11 Radio | Non-masking tolerance: Trigger leader election process | Non-masking tolerance: Restart radio
Hardware health | Non-masking tolerance | Fail-safe tolerance: Notify Health Monitor of status within finite time
Software health | Masking tolerance: Update and restart failed component | Masking tolerance: Update and restart the failed component
TimeSync | Masking tolerance | Masking tolerance: Restart NTPD
Disk space | Non-masking tolerance: Kill erring jobs | Fail-safe tolerance: Kill erring jobs
Device access | Non-masking tolerance: Kill erring job | Fail-safe tolerance: Kill erring job
Frequency access | Non-masking tolerance: Kill erring job | Fail-safe tolerance: Kill erring job
User specified | Fail-safe tolerance | Fail-safe tolerance

Table 1: Kansei invariants and stabilization actions
5.3 Invariants and Correctors for Resource Monitoring
• Disk Space Invariant (SM): No job should consume more disk space than requested in the job configuration.
Corrector Action: Kill the corresponding job. Send a “DISK OVERUSE” error to KD.
• Device Access Invariant (SM): No two processes (even from the same job) should access a hardware device simultaneously.
Corrector Action: Kill all processes accessing the device and send an error message to the KD.
• Frequency Access Invariant (KD): No two jobs should use the same radio frequency/channel at the same time.
Corrector Action: Kill all jobs that violate the invariant and send an error to the KD.
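As an example of a resource-monitoring detector, the disk-space invariant above could be approximated by summing a job's working directory and comparing it against the quota in the job configuration; the per-job directory layout is an assumption, and only the "DISK OVERUSE" report matches the corrector described earlier.

```python
# Illustrative disk-space detector; the per-job directory layout is an assumption.
import os

def job_disk_usage(job_dir: str) -> int:
    """Total bytes used under the job's working directory."""
    total = 0
    for root, _, files in os.walk(job_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                total += os.path.getsize(path)
    return total

def disk_invariant_ok(job_dir: str, quota_bytes: int) -> bool:
    return job_disk_usage(job_dir) <= quota_bytes
# On violation the corrector would kill the job and report "DISK OVERUSE" to the KD.
```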
5.4 User Specified Invariants
User-specified invariants usually do not have any associated correctors. These invariants are monitored by Kansei and any violations are reported to the user along with the job logs. They are useful in judging the fidelity of the output produced by Kansei. Some examples of user-specified invariants are:
• At least 95% of the devices should successfully complete programming and start execution.
• At least 90% of the devices should successfully complete executing the user program.
• Each device must have at least 5 neighbors within 5 feet.
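User-specified invariants of this kind are threshold predicates over per-device job status, so they could be expressed roughly as below; the status-record format is an assumption, and, matching Kansei's behavior for this class, violations are only reported, not corrected.

```python
# Sketch of evaluating user-specified invariants over per-device job status.
# The status-record format (dict of device id -> state string) is an assumption.
def fraction_in_state(status: dict, state: str) -> float:
    return sum(1 for s in status.values() if s == state) / max(len(status), 1)

def check_user_invariants(status: dict) -> list[str]:
    """Return violation reports to attach to the job logs (no corrector is run)."""
    reports = []
    if fraction_in_state(status, "programmed") < 0.95:
        reports.append("fewer than 95% of devices completed programming")
    if fraction_in_state(status, "completed") < 0.90:
        reports.append("fewer than 90% of devices completed the user program")
    return reports
```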
5.5 Kansei Faults Observed
Table 2 classifies actual faults observed on the testbed and lists the corresponding invariant. Again, this table is not exhaustive. It should be noted that while some invariants were derived directly from the specification, others were added to the system after analysis of a particular fault. The 802.11 radio invariant, for example, was added after we observed node radios switching between two different cells with the same "ESSID" when operating in Ad-hoc mode. This highlights the iterative addition of invariants and correctors.

Fault | Invariant Violated | Fault Type
802.11 switching between two different cells with the same ESSID in Ad-hoc mode | 802.11 radio | Fail-stop, recoverable
Hardware component failure (motes, hubs, wires, Ethernet cards) | Hardware health | Fail-stop, non-recoverable
XSM failure due to pin race conditions (two processes accessing the serial port simultaneously) | Device access | Fail-stop, non-recoverable
Non-uniform software layer (due to an upgrade/network error) | Software health | Fail-stop, recoverable
Resource conflict between jobs | Device access | Transient
Disk overflow at a Stargate | Disk space | Transient
Database connection dies at the Director | Software health | Fail-stop, recoverable
Unclean cleaning of jobs (some job files remain after a job is finished) | Job termination | Fail-stop, recoverable
Time mismatch between nodes (NTPD failure) | TimeSync | Fail-stop, recoverable
Health Monitor reports bad status (false positive: node down, reported alive) | Job status | Transient
State corruption in components | Software health | Transient
Disk overflow on the Kansei Director | Software health | Fail-stop, recoverable

Table 2: Kansei faults and invariant violated

6. RELATED WORK
A formal definition of state-predicate based detectors and correctors is given by Arora and Kulkarni, who illustrate and delimit the role of this type of detection and correction in the design of fault-tolerance [7, 8]. This invariant-based approach to incremental correctness contrasts with, and is complementary to, approaches based on N-version programming [10] (where N copies of the same program are run) and recovery blocks [12]. N-version programming can capture transient errors in state, but does not capture faults due to a faulty specification or incorrect upgrades, since all copies of a program would suffer from such faults. Also related is work on software rejuvenation [13], which assumes no knowledge of component invariants and simply terminates a component gracefully and restarts it in a clean state, as a way of proactively compensating for transients. The observation by Gray [14] that most faults in complex computer systems are soft/transient/Heisenbugs, in that they will likely not recur if the component is immediately reinitialized, is very helpful, and we deal with many software component faults by restarting the faulty component.
7. CONCLUSION AND FUTURE WORK
In this paper we presented our autonomic testbed design using a detector/corrector framework. From our experience with the implementation, we conclude that our framework is not only robust, but also very practical to implement on current-generation computing platforms. We find that our architecture is particularly well suited for large complex systems where the faults are unpredictable. We also find that it not only improves availability and performance, but also eases debugging.
We are in the process of adding Imote2 and SunSPOT hardware layers and expanding the Tmote layer to 1000 nodes. As the hardware and software features of the testbed evolve, we expect to deal with an increasing number of fault types. Our architecture allows us to handle this extension elegantly, through the incremental addition of invariants, detectors and correctors. We also continue to explore more complex detector and corrector algorithms, and are studying systematic methods for analyzing system specifications to identify invariants with maximal coverage. Finally, we are exploring recovery using distributed snapshots of jobs, to allow pausing, resuming, moving and rolling back jobs without killing already scheduled jobs.
8. REFERENCES
[1] J. Heidemann, N. Bulusu, J. Elson, C. Intanagonwiwat, K. Lan, Y. Xu, W. Ye, D. Estrin, and R. Govindan, "Effects of detail in wireless network simulation", SCS Multiconference on Distributed Simulation, 2001.
[2] M. Takai, J. Martin, and R. Bagrodia, "Effects of wireless physical layer modeling in mobile ad hoc networks", Proceedings of MobiHoc, 2001.
[3] K. Pawlikowski, H.-D. J. Jeong, and J.-S. R. Lee, "On credibility of simulation studies of telecommunication networks", IEEE Communications Magazine, vol. 40, 2002.
[4] "Kansei: A Sensor Testbed for At-Scale Experiments", http://ceti.cse.ohio-state.edu/kansei/.
[5] "TinyOS: An Operating System for Wireless Embedded Sensor Network", http://tinyos.net.
[6] E. Ertin, A. Arora, R. Ramnath, V. Naik, S. Bapat, and M. S. et al., "Kansei: A Testbed for Sensing at Scale", 5th Intl. Conf. on Information Processing in Sensor Networks, 2006.
[7] A. Arora and S. S. Kulkarni, "Detectors and correctors: A theory of fault-tolerance components", International Conference on Distributed Computing Systems, 1998.
[8] A. Arora and S. S. Kulkarni, "Component-based design of multi-tolerance", IEEE Transactions on Software Engineering, vol. 24, 1998.
[9] A. Arora and M. Theimer, "On modeling and tolerating incorrect software", Technical Report MSR-TR-2003-27, Microsoft Research, 2003.
[10] A. Avizienis, "The N-Version Approach to Fault-Tolerant Software", IEEE Transactions on Software Engineering, no. 12, 1985.
[11] S. Bapat, W. Leal, T. Kwon, P. Wei, and A. Arora, "Chowkidar: A health monitor for Wireless Sensor Networks", TridentCom, 2007.
[12] B. Randell, "System Structure for Software Fault Tolerance", IEEE Transactions on Software Engineering, no. 2, 1975.
[13] Y. Huang, C. Kintala, N. Kolettis, and N. Fulton, "Software Rejuvenation: Analysis, Module and Applications", IEEE Intl. Symposium on Fault-Tolerant Computing, 1995.
[14] J. Gray, "Why Do Computers Stop and What Can We Do About It", 6th International Conference on Reliability and Distributed Databases, 1987.