
IEEE TRANSACTIONS ON RELIABILITY, VOL. 51, NO. 1, MARCH 2002

The EFTOS Approach to Dependability in Embedded Supercomputing

Geert Deconinck, Senior Member, IEEE, Vincenzo De Florio, Theodora A. Varvarigou, and Evangelos Verentziotis, Member, IEEE

Abstract—Industrial embedded supercomputing applications benefit from a systematic approach to fault tolerance. The EFTOS framework provides a flexible and adaptable set of fault-tolerance tools from which the application developer can choose to make an embedded application on a parallel or distributed system more dependable. A high-level description (Recovery Language) helps the developer specify the fault-tolerance strategies of the application as a second application layer; this separates functional from fault-tolerance aspects of an application, thus shortening the development cycle and improving maintainability. The framework incorporates a backbone (to hook a set of fault-tolerance tools onto, and to coordinate the fault-tolerance actions) and a presentation layer (to monitor and test the fault tolerance behavior). A practical implementation is described with its performance evaluation, using an industrial case study from the energy-transport area, as well as an analytic deduction of the appropriateness of fault-tolerance techniques for various application profiles.

Index Terms—Application recovery, distributed system, embedded system, fault-tolerant communication, maintainability, performance, software-based fault tolerance, stable memory.

ACRONYMS (see footnote 1)

API      application programmer interface
CCT      channel control thread
DIR      detection/isolation/recovery
EFTOS    embedded fault-tolerant supercomputing (acronym of the European ESPRIT project 21 012)
FT       fault tolerance
FT_comm  fault-tolerant communication
HVS      high voltage substation
MIA      manager-is-alive
MIMD     multiple-instruction, multiple-data
OS       operating system
RINT     recovery interpreter
RL       recovery language
RTOS     real-time OS
SM       stable memory
TAIA     this-agent-is-alive
TEIF     this-entity-is-faulty
TMR      triple modular redundancy

Footnote 1: The singular and plural of an acronym are always spelled the same.

Manuscript received November 15, 1999; revised September 9, 2000. This work was supported in part by ESPRIT-projects 21012 (EFTOS), 28620 (TIRAN).
G. Deconinck and V. De Florio are with the Department Elektrotechniek (ESAT), K.U. Leuven, B-3001 Leuven, Belgium (e-mail: {Geert.Deconinck; Vincenzo.DeFlorio}@esat.kuleuven.ac.be).
T. A. Varvarigou and E. Verentziotis are with the Department of Electrical and Computer Engineering, N.T.U.A., GR-15733 Zografou, Greece (e-mail: {Dora; Verentz}@telecom.ntua.gr).
Publisher Item Identifier S 0018-9529(02)02608-8.

I. INTRODUCTION

EMBEDDED supercomputing is an enabling technology for an increasing number of industrial applications in domains such as process control and signal processing. In these mission-critical applications, system failures are connected with high costs, or even with hazards for the environment. The road, therefore, to high performance goes inevitably through dependability. FT provides the means of making systems more dependable by the adoption of several techniques. It can be implemented at two levels:

1) FT transparent to the application. This is done at the hardware and OS-level and is often addressed by hardware redundancy and OS-level error detection, isolation, and recovery mechanisms. The advantage of this approach is that application developers do not have to worry about FT. FT mechanisms are maintained and upgraded by the platform providers. Hence, the same treatment is applied to all applications. A disadvantage is that special-application requirements are not considered; hence not all dependability problems can be solved, as indicated by the end-to-end argument [1].

2) FT at the application level. Application developers build ad hoc solutions, which are based on application-specific properties and are integrated in the application code. The advantage of this approach is that FT solutions are tailored to the needs of the application; often, better performance is obtained. Moreover, the ability to evaluate the effect of each mechanism leads to better predictability. On the other hand, this approach increases the code in size and complexity, which: a) can lead to the introduction of additional faults (bugs); b) increases the development costs; and c) makes the code difficult to maintain, document, and upgrade. Furthermore, the FT components are not reusable; the wheel is often reinvented for each application.
Both methods have limitations which, seen collectively, point to the features of an alternative and more efficient approach that satisfies the following requirements:
• Adaptability to the needs of different applications;
• Portability across different platforms combining commercial off-the-shelf and custom components; thus, solutions should be incorporated neither in the hardware or the OS nor in the application, but as middleware;
• Maintainability, which allows for correction and evolution;
• Usability by nonspecialists in FT.

0018–9529/02$17.00 © 2002 IEEE

DECONINCK et al.: DEPENDABILITY IN EMBEDDED SUPERCOMPUTING


The EFTOS approach toward dependability covers exactly these needs and aims to provide robust, reusable, efficient, and cost-effective solutions to FT. It is between the two aforementioned levels so that the advantages of both can be combined.

A. The EFTOS Approach

The EFTOS approach consists of a framework of software-based FT solutions, organized into several levels, that can be integrated according to the needs of an application [2]. The framework consists of the following entities:
• Basic tools for error detection, isolation, recovery, and fault masking, available in a library as parametric functions.
• A backbone to extract information on the application topology and progress, and to coordinate FT actions, implemented as a distributed application.
• A high-level language (RL) to separately specify these FT actions, viz., to indicate recovery strategies by detailing the isolation, reconfiguration, and recovery actions that need to follow the detection of an error.
• Administration and presentation tools for monitoring and testing, via software fault injection.

Part of this framework is implemented as middleware (a layer between the application and the target platform), and as a library. Hence, application developers can benefit from a variety of FT functions from which they can select and combine what is necessary to obtain the dependability required for their system (see footnote 2). This integrated approach, as well as the possibility to express recovery strategies, are the key-points of EFTOS. Its main benefits include:
• the FT modules are re-usable and maintainable;
• the application programmer can focus on developing the application and on incorporating FT strategies at a higher level (separation of concerns), thus reducing costs and time-to-market;
• the level of FT can be chosen according to the actual project requirements, by adapting entities from the library.

Target applications are those for which the dependability can be improved at relatively low cost by software-based FT solutions; more specifically, EFTOS targets mission-critical embedded applications with soft real-time requirements on a MIMD homogeneous, distributed-memory, message-passing, multiprocessor system. The system is built from several interconnected high performance computing nodes; hence this domain is called: embedded supercomputing. The industrial embedded applications from which the specifications of the EFTOS framework have been derived, like mail-sorters [3] and controllers of high-voltage substations [4], clearly illustrate the lack of a standard approach toward FT and the inappropriateness of nonreusable ad hoc solutions. This market is lacking solutions usable at the various levels of automation, ranging from workstations involved in system supervision, to medium and small controllers applied to local control, command, and monitoring. These automation systems are often embedded in the plant, and they operate in highly disturbed environments with high demands on performance and dependability; yet, they call for a tradeoff against costs. Hence, the EFTOS framework was designed to fill this gap for developers and users of embedded supercomputing applications.

B. Related Work

In recent years, several research initiatives have investigated FT in a distributed environment. These initiatives vary from • proposals for generic architectures for dependable distributed computing [5]–[7] and predictably dependable computing systems [8], to • applied software FT solutions in massively parallel systems [9] and meta-level architectures for the development of adaptively dependable systems [10]. Six representative approaches are given. • GUARDS [5] provides an architecture for real-time dependable applications that can be instantiated according to the application needs to use a specific degree of redundancy on a dedicated platform. It can be adapted in three dimensions: there is redundancy in channels, in lanes, and in integrity levels. It targets generic components and has upgradability as a design goal. • CHAMELEON [11] provides software-based FT solutions to make distributed systems more dependable. It uses a set of basic tools (armors) and a distributed component, hierarchically structured into managers and nonmanagers. • MARS (Maintainable Real-Time System) [8] consists of processors that are interconnected via a proprietary bus, and uses active replication. Several hardware FT measures have been used to meet hard real-time constraints. This design requires very specific hardware and a specific OS (suited for time-triggered applications), which makes it difficult to use in arbitrary industrial environments. • HORUS [12] provides an object-oriented communication protocol composition framework that can flexibly stack protocol layers on top of each other at run time, according to FT needs. Its focus, however, is on group communication rather than FT. • Delta-4 [6], [7] proposes an open, fault-tolerant, distributed system architecture, based on multiplication of modules. Delta-4 has several intrinsic advantages, e.g., it does not require clock synchronization, and it allows for nondeterminism within passively replicated processes. 
Two variants of the architecture focus on Delta-4’s portability and performance. Both variants, however, rely on fail-silent hardware and on an atomic multicast protocol. • FTMPS (FT in off-the-shelf Massively Parallel Systems) [9] provides long-running number-crunching applications with FT measures. The solutions were purely software solutions, based mainly on checkpointing and reconfiguration or remapping. Neither real-time aspects nor industrial environments are considered.

Footnote 2: Dependability is a general term; here, it is used as availability, integrity, and/or maintainability.

Three kinds of extended research on the development of distributed RTOS with fault-tolerant behavior are given. • CHORUS [13], a family of open micro-kernel-based OS components, addresses high availability by providing on-the-fly dynamic software-reconfiguration and hot
restart of system and application software. No real FT provisions are provided, but the system allows for building good FT measures on top. • CHIMERA multiprocessor RTOS [14] is designed to support the development of dynamically reconfigurable software for robotics and automation systems. CHIMERA has elaborate error-detection and error-handling capabilities, and prominently features “global error handling” and “deadline failure handling” mechanisms. The system is built on the VME-bus system and intrinsically uses hardware timers to provide its timing options. The supported processors are only of the MC680xx series. • SoHaR Distributed FT technology [15] maintains overall control, inhibits unacceptable or erratic outputs, and preserves critical-process state data. It lets real-time application developers meet reliability and availability requirements by providing hardware FT through “dual redundancy in processors” and “quadruple redundancy in networks,” and software FT through the capability to switch between two versions of the software via distributed recovery blocks. This solution also strongly relies on hardware techniques to accomplish the demanded level of FT, which implies custom platforms. Another alternative is the separation of functional and nonfunctional (dependability) concerns. For instance, in object-oriented environments, the approach of meta-objects and reflection allows intercepting calls to functions, allowing for FT management, e.g., replication techniques [16]–[19]. This allows for good maintainability, because the functional behavior of the application can be changed without modifying the FT aspects, or vice versa. Other approaches like ROAFTS [20] or FRIENDS [19] require the application to be object-oriented. Complex models and frameworks have emerged which aim at the dependability evaluation of FT systems [21]–[24]. 
Other research, e.g., [8], [25]–[28], shows the suitability and advantages of software-based FT solutions to improve the dependability of distributed applications. In general, many of these referenced approaches target system requirements that are mostly outside the scope of EFTOS (at least for its current implementation), for example: distributed systems, real-time systems, CORBA compliant tools, and dependable distributed objects. The target architecture of EFTOS is: a MIMD, homogeneous, distributed-memory, message-passing, multi-processor system for embedded parallel applications with soft real-time requirements. For example, Chameleon [11]:
• targets a networked environment, consisting of heterogeneous nodes. The configuration is not considered to be fixed, and new nodes can dynamically join the environment.
• concurrently supports several applications, with various levels of availability requirements.
• dynamically adapts to changing availability requirements.
• deals with software faults only by voting.
• uses very general execution strategies that employ certain FT techniques, while EFTOS allows a free and flexible combination of several customizable techniques (at the cost, of course, of transparency).


For another example, AQUA [29]:
• targets object-oriented distributed systems.
• tries to make distributed-objects dependable.
• assumes requirements that change during execution.
• is based on group-communication services.
• is based on replication.

This paper shows how the EFTOS approach provides a software-based FT solution for the above-defined target application domain.
• Section II describes the various elements that compose the EFTOS FT framework.
• Section III shows how the developer can specify the recovery strategies that have to be executed by the fault-tolerant application.
• Section IV presents results from practical experience with the EFTOS framework, including an industrial case study, and an analytic deduction of the appropriateness of its tools.
• Section V summarizes the strengths and weaknesses of the EFTOS approach, and proposes further research.

This paper is not a formal analysis of the EFTOS framework. It presents qualitatively the functionality of the framework, referring the reader to other cited sources for a more formal analysis.

II. OVERVIEW OF THE EFTOS FRAMEWORK AND ITS COMPONENTS

A. Framework Architecture

Fig. 1 is a view of the functional layering of the EFTOS framework. As an example, a sample collection of EFTOS elements is placed at each layer. Five layers constitute the framework:

1) The base layer is an adaptation layer to the underlying OS and provides additional services (e.g., remote thread creation, recollection of information messages) to upper layers. This adaptation layer differs across target platforms, depending on which services are provided by the underlying OS. See Section II-B.

2) Layer #2 is the detection/isolation/recovery layer, where all basic FT tools for error detection, isolation, and recovery are positioned. These tools do not interact and can be either coordinated by the upper layer (the DIR net), or tied directly to the application. See Section II-C.

3) Layer #3 is the control layer, where the DIR net is positioned. The DIR net is the backbone of the EFTOS framework; it coordinates all the FT actions among the involved nodes and applies consistent recovery strategies. Its structure and functionality are discussed in Section II-D.

4) Layer #4 is the application layer, which combines the user code with high-level FT mechanisms (e.g., SM, FT_comm). These mechanisms depend on the application structure and on all other lower-level tools and mechanisms and are therefore best positioned at this level.


Fig. 1. EFTOS framework—functional view.

5) Layer #5 (top layer) is the presentation layer, where the EFTOS monitoring and fault-injection services are located. This layer presents a human-interpretable status overview of the system on which the EFTOS library is currently running. The tools developed at this layer can also inject faults into the system for testing purposes. The role of the presentation layer in the framework is discussed in Section II-F.

These 5 layers are intended to be used together as a set of middleware. They are detailed in the following subsections. Their interaction is described in a scenario in Section II-G.

B. The Adaptation Layer

Adaptation is required between the framework and the underlying OS. To this end, several adaptation layers are readily available, which can be used as they are, with no adaptation effort from the user. These layers were developed for use with the following OS: EPX, WinNT, and TEX, representing a variety of widely used OS with different principles. The selection of these OS was dictated directly by the common requirements of the industrial users that helped to determine the EFTOS specifications. However, since these OS have features similar to many other OS that are commercially used, the existing adaptation layers can still be used on those OS, with particular limitations. This adaptation layer provides, for the rest of the framework, a uniform interface to the underlying services.

C. Basic FT Tools

The basic tools are software implementations of well-known FT techniques, grouped in a library of adaptable functions.

Fig. 2. The Atomic Action tool embraces two levels of control: Local within the Atomic Action thread, and global in agreement with the other Atomic Action threads.

These basic tools can be used on their own, or as cooperating entities attached to the DIR net backbone. The framework includes a set of ready-to-use tools as examples:
• The watchdog-timer, which uses user-driven time-stamps to detect processing errors on local or remote nodes.
• The trap-handler, which performs exception handling in a user-transparent way under control of the DIR net.
• The atomic-action tool, which provides atomicity for distributed actions: a protocol assures that all nonfaulty components (cooperating in the atomic action) make the same decision whether or not to update their state. For this, the atomic-action tool allows storing the predefined state of all entities and restoring it in case of an error [30]. Fig. 2 shows the architecture of the atomic action tool.


Fig. 3. An architectural view of the DIR net.

• Assertions that additionally inform the backbone about failed run-time consistency checks, out-of-range parameters for the application, failed invariants, etc.
• The distributed voting-mechanism [31] featuring majority, median, plurality, weighted average, and consensus voting.

User-defined tools, expressing specific application needs, can be developed by the user and then easily integrated into the framework. For example, the user can define one or more trap handling functions, each of which can be specialized on one class of traps. The only thing these functions must return is whether the trap has been handled or not. To accomplish this tool development, the user is supported by detailed documentation (analyzing each tool’s functionality, API description, interrelationships with other framework elements, dependencies and limitations) and by several functions to incorporate these user-defined tools.

D. Backbone

The DIR net is the backbone of the EFTOS framework [32]. It gathers information on the topology of the application, and coordinates actions to provide resilience against faults, preventing them from resulting in failures. As Fig. 3 shows, the DIR net is structured as a hierarchical network, thus allowing coordinated distributed actions. It has 4 building blocks, each with its specific task:
• Normal Agent: It is connected to the manager and all backup agents. It has only a node-local view of the application and interacts with the local basic FT tools. Normal agents assist the DIR net manager, one agent on each node.
• Backup Agent: It is connected to all normal agent modules. It uses buffered communication, and receives all information the normal agents gather. This way, a backup agent has a global view of the application in a database (application topology and integration of FT tools and mechanisms, error history, and application status). A backup agent, however, does not take any decisions unless its role is changed to a manager role upon manager failure detection.
• Manager: It is the main module in the DIR net, and is functionally a backup agent that, however, has been elected to become the manager. As such, contrary to regular backup agents, it coordinates all decisions and allows global recovery actions to be executed. The ability to do distributed recovery is an advantage over several other systems.
• “I’m alive” thread: It detects complete node, agent, or manager failures. It is present on every node.

The basic FT tools from the library, which can be started by the application or autonomously, connect to the backbone. When one of them detects an error, it passes the necessary information (type of error, location, and identification) to the local DIR net agent. The DIR net agent, in turn, warns the manager and the backup agents. The “diagnosis engine of the DIR net” analyzes incoming error messages to derive the affected components and the nature of the fault (permanent or transient). This allows isolating affected entities by disabling the inter-process communication involved. Recovery actions can then be initiated by the DIR net to bring the application back into a consistent state. These recovery actions are described using a high level language (RL), as explained in Section III. Recovery actions include:
• (re)starting a single task or node or a set of tasks or nodes;
• (re)setting communication channels;
• releasing resources that were assigned to failed entities;
• application-specific actions.

The DIR net has several built-in mechanisms to provide self-FT. Information redundancy safeguards the backbone’s database against corruption; the “I’m alive” mechanism allows detecting failures of DIR net components.
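The failure detection performed by the “I’m alive” threads, together with the promotion of a backup agent to the manager role, can be illustrated with a minimal Python sketch. It is not the EFTOS implementation: the class `AliveMonitor`, the function `elect_manager`, the time-out value, and the lowest-identifier election policy are all assumptions made for illustration, since the text does not specify how the election is carried out.

```python
from dataclasses import dataclass, field


@dataclass
class AliveMonitor:
    """Records the time of the last "I'm alive" signal per node (hypothetical helper)."""
    timeout: float
    last_seen: dict = field(default_factory=dict)

    def signal(self, node, now):
        # Called whenever a node's "I'm alive" thread reports in.
        self.last_seen[node] = now

    def suspected(self, now):
        # Nodes whose last signal is older than the time-out are suspected to have failed.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


def elect_manager(backup_agents, failed):
    """Assumed policy: the surviving backup agent with the lowest identifier becomes manager."""
    alive = [b for b in backup_agents if b not in failed]
    return min(alive) if alive else None
```

In this sketch, time is passed in explicitly so that the behavior is deterministic; a caller would feed `signal` from the heartbeat messages and periodically pass the result of `suspected` to `elect_manager` when the current manager is among the suspected nodes.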
The backbone is built in such a way that its components are re-established upon restarting or rebooting of a specific node and they are re-integrated with the rest of the backbone. Furthermore, a new manager is elected whenever the one in charge fails. The backbone can connect to an operator module; this module provides an interface between the operator and the DIR net to perform user-driven recovery actions and to visualize the behavior of the fault-tolerant application.

E. Integrated Mechanisms at the Application Layer

The integrated mechanisms combine several basic tools into more elaborate FT techniques. The 3 ready-to-use examples are: fault-tolerant communication, the distributed memory mechanism, and the SM module, which are described in Sections II-E1 to II-E3. They are situated at the application layer, because they rely on basic tools and the backbone from the underlying layers; analogous to an application, they integrate these lower level aspects to create dependable mechanisms.

1) Fault-Tolerant Communication: Two versions of fault-tolerant communication have been implemented: one uses channels for synchronous blocking communication and the other uses mailboxes for asynchronous nonblocking communication [33].


In a system with “synchronous blocking communication,” problems arise when links (communication channels) or communicating threads are in an erroneous state (e.g., broken links, threads in infinite loops); hence the threads remain blocked. Under these circumstances, no communication can be initiated or completed. Two methods might solve these situations:
• The status of the link and the communication partner are tested prior to initiating communication to avoid blocking.
• Communication is established normally, but time-out mechanisms are used for exiting situations of potential deadlock or livelock.
Both approaches can also be used in combination.

The “Message Delivery Time-out mechanism” detects if a message cannot be delivered within a time-out period specified by the application. It incorporates a simple acknowledgment protocol. For each communication channel, a CCT thread is created to handle time-outs and to trigger isolation and recovery actions; it also passes information or control to the DIR net (Fig. 4). The implemented protocol is transparent to the user. Thus the application thread (sender or receiver, in Fig. 4) makes a simple call to the appropriate function of the library, and the fault-tolerant transaction of the requested communication is completely taken care of by the CCT threads. In case of error, the CCT threads inform the DIR net, which, in turn, assumes control of the faulty situation by taking isolation and recovery actions (e.g., inactivate links, restart/stop faulty threads). The result of these actions is communicated to the CCT threads, which either return normally (on successful error handling) or return a time-out error to the application threads they connect to (to inform them that the requested communication failed).

“Asynchronous nonblocking communication” is based on the mailbox concept: the sending thread does not block while sending a message.
Instead, the message is stored in a buffer or mailbox, from where it is retrieved by the receiving thread. The “Message Delivery Time-out mechanism” detects when the delivery to the receiver’s mailbox cannot be completed within a time-out period (specified by the application) or when the receiver does not retrieve the message from the mailbox within a time-out period. If this occurs, the backbone is informed. The possible recovery actions associated with the fault-tolerant asynchronous communication are:
1) Leaving the mail in the mailbox (no recovery).
2) Deleting the mail when the mailbox is full.
3) Resetting the receiver via an interrupt signal.
4) Resetting both the sender and the receiver.
5) Performing an application-dependent recovery action.
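The mailbox time-out detection and the five recovery actions listed above can be sketched in Python as follows. This is an illustrative model only, not the EFTOS library: the names `Mailbox` and `Recovery`, the fixed capacity, and the explicit `now` argument are assumptions. The sketch also assumes that "deleting the mail when the mailbox is full" removes the oldest message, which the text does not state, and it models a reset of the receiver or of both parties only through its effect on the mailbox.

```python
from collections import deque
from enum import Enum


class Recovery(Enum):
    LEAVE = 1             # leave the mail in the mailbox (no recovery)
    DELETE_WHEN_FULL = 2  # delete the mail when the mailbox is full
    RESET_RECEIVER = 3    # reset the receiver; the entire mailbox is cleared
    RESET_BOTH = 4        # reset sender and receiver; the entire mailbox is cleared
    APPLICATION = 5       # application-dependent action (not modeled here)


class Mailbox:
    def __init__(self, capacity, timeout):
        self.capacity = capacity
        self.timeout = timeout
        self.slots = deque()  # entries are (post_time, message)

    def post(self, message, now):
        """Non-blocking send; False means the delivery could not be completed."""
        if len(self.slots) >= self.capacity:
            return False
        self.slots.append((now, message))
        return True

    def fetch(self):
        """Retrieve the oldest message, or None when the mailbox is empty."""
        return self.slots.popleft()[1] if self.slots else None

    def timed_out(self, now):
        """True when the oldest message has waited longer than the time-out."""
        return bool(self.slots) and now - self.slots[0][0] > self.timeout

    def recover(self, action):
        if action is Recovery.DELETE_WHEN_FULL:
            if len(self.slots) >= self.capacity:
                self.slots.popleft()  # assumption: the oldest mail is dropped
        elif action in (Recovery.RESET_RECEIVER, Recovery.RESET_BOTH):
            self.slots.clear()
        # LEAVE and APPLICATION do not touch the mailbox in this sketch.
```

In a fuller model, a failed `post` or a true `timed_out` would be reported to the backbone, which would then choose the action according to the recovery strategy.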

In actions #3 and #4, the entire mailbox is cleared. The recovery strategy (specified in RL and executed by the DIR net) determines which action is to be executed.

2) Distributed Memory Mechanism: Many embedded, distributed systems typically exhibit cyclic behavior: in each cycle, data are read from the status variables and then stored at the end of the cycle, to be used in the next cycle. These data can, for instance, describe the actual state of the system, for which the integrity is required to guarantee the correct deterministic behavior of a finite state machine. The “distributed memory mechanism” allows for protection of these data via replication in the

Fig. 4. Architecture for fault-tolerant synchronous communication and the corresponding protocol.

memory of remote nodes at the write statements (scattering). In this context, a read statement implies gathering the contents of the involved memory cells and executing a vote among all replicas. The distributed memory mechanism (located at the application layer) manages these tasks for the user, and allows masking transient or permanent faults in memory. All technicalities concerning the setup and the internal communication of the mechanism are hidden from the user module: the user sees only the local memory handler, and communication is hidden behind function calls. The whole memory module behaves as a handler object that receives commands or requests, as in an object-oriented approach.

3) SM Module: The SM module executes a strategy based on time redundancy in order to stabilize input and status data for a finite state machine. It hereby complements and incorporates the spatial redundancy of the distributed memory mechanism. Assume a cyclic application, e.g., describing a finite state machine, which calculates its new state based on the previous state and input data. To ensure that the results are correct, in spite of transient faults affecting the input samples or corrupted computations, the application executes the calculations repeatedly. At the beginning of each cycle, the system state is reset to a known, correct starting point, hence eliminating the effects of possible

Fig. 5. State transition diagram of the central part of the SM module.

transient faults during the previous cycle. The consecutive results are compared, and if they have been stabilized (repeated a predefined number of times), then the application switches to the new state. The SM module implements this stabilization of the user data, and provides the application with the stable (previously stabilized) data. To that end, the SM module uses two memory banks: one (future) to stabilize the data and another (current) to provide stable data. The latter is possible by using the distributed memory mechanism within the SM module. If the data have been stabilized, both banks change their function. The bank switch is done transparently to the application. This internal state of the SM module also has to be recalculated at the beginning of each cycle, and the same level of protection has to be applied to it. Fig. 5 shows the state transition diagram of the central part of the SM module. A prototype of this SM module in a case study of “electric substation automation” allows tolerating
• permanent faults in memory, and
• transient faults affecting computation, input and memory devices.
Transient faults lead to extra cycles before data are stabilized; permanent faults can be masked or lead to reconfiguration [4].

F. Presentation Layer

The presentation layer visualizes information from the backbone via a monitor. This provides the operator with an interface to access status information of the fault-tolerant application, including the topology of the application, the location of the DIR net manager, agents, detection and recovery tools, and the actions carried out by these modules and by the application itself, including error detection, isolation, and recovery.
This macroscopic, quasi real-time view of the system is provided to the operator in a hierarchical way; i.e., the overall data stream is organized and can be browsed via “layers”:
• at the highest layer, only the logic structure of the application is displayed: the nodes that are used, the DIR net roles played by each node, their overall status;
• at a medium layer, a concise description of the events pertaining to each particular node is available;
• at the lowest layer, a deeper description of each particular event is supplied, on demand, to the user.

Such a monitor can be used during development or operation to evaluate the fault-tolerant strategies, because it allows injecting software faults into the system [34]. The monitors use the http/cgi protocols, popularized via WWW technology, or Tcl/Tk with socket communication.

G. A Typical Scenario

The separated layers of the EFTOS approach are intended to be used as middleware. Although they are clearly separated and have a well-defined API, they are rarely used stand-alone; they obtain their power through integration. Consider a typical scenario as follows. As soon as the application starts on the various nodes of the system, the DIR net components connect to each other and start to exchange data and signals. At this level, several transparent error-detection mechanisms are started, e.g., a trap-handling tool is activated by the first application process on each node. These tools forward their detection information to the control backbone. The same transparent forwarding can be attached to system functions that activate or terminate tasks or communication channels. In addition, a set of heartbeat messages monitors the availability of the entire backbone. The basic FT tools of the user library, which can be started by the application or autonomously, connect to the backbone. When an error is detected, local fault containment is executed (associated with, and started by, the error detection mechanisms themselves). Alternatively, fault masking can be active via, e.g., voting. Thereafter, the information concerning the detected error is passed to the DIR net (type of error, location, identification). The DIR net manager interprets the predefined recovery scripts to see if any matching conditions are fulfilled.
If any are fulfilled, then these (possibly) distributed recovery actions are executed by the DIR net manager and agents. If no condition is fulfilled, a default action (e.g., the shutdown of a
node) is executed to allow for graceful degradation of the application. III. OBTAINING A FAULT-TOLERANT APPLICATION WITH THE FRAMEWORK COMPONENTS Following the EFTOS approach, a user application is made fault-tolerant in 2 steps. • The developer integrates the functions from the library with the application. This requires a function call to the basic tools or mechanisms (e.g., to set the parameters for a watchdog timer); other functions can be added without source code intrusion (e.g., informing the backbone when a new task is started). • The developer has to specify separately the FT actions to be taken when an error is detected. This is explained in the next paragraph. The user specifies the recovery strategy by using RL, a scripting language. It allows defining which FT actions must be executed when specific errors are detected. Examples include: stopping a certain thread, restarting it, starting an alternative task, and resetting a specific node. This RL implements a sort of meta-level representation of the FT aspects of a user application. Within the language, it is possible to, for example: • work with logical groups of threads (e.g., application tasks, which are logically dependent, can be warned or reset with a single action); • indicate generalities (all tasks on a node, a faulty task, the nonfaulty tasks, etc.); • specify default actions (if no other rules apply). The recovery strategy can be made dependent on the actual state or progress of the application (as it is possible for the application to inform the backbone of its current state). The human-readable script is translated into a compact code that is interpreted and executed at run time by the backbone, as shown in Fig. 6. The RL serves 2 purposes: • to enhance the configurability of an EFTOS application (with role configuration for the backbone and with setup of a default recovery action); • to express the user’s recovery strategy. 
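The rule-matching behavior described above (specific rules first, a default action if nothing matches, logical groups of tasks, and dependence on the application state) can be illustrated with a minimal sketch. This is not RL syntax and not the EFTOS interpreter; the names `ErrorReport`, `Rule`, `recover`, the group `acquisition`, and the error kinds are hypothetical and serve only to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical error report: mirrors the "type of error, location,
# identification" information that detection tools pass to the backbone.
@dataclass(frozen=True)
class ErrorReport:
    kind: str   # illustrative error type, e.g. "watchdog_timeout"
    node: int   # node on which the error was detected
    task: str   # logical identifier of the affected task

# Hypothetical recovery rule: a guard plus the actions to run when it matches.
@dataclass
class Rule:
    condition: Callable[[ErrorReport, str], bool]   # (error, app state) -> match?
    actions: Callable[[ErrorReport], List[str]]     # error -> list of actions

def recover(rules: List[Rule],
            default: Callable[[ErrorReport], List[str]],
            error: ErrorReport,
            app_state: str) -> List[str]:
    """Return the actions of the first matching rule, else the default actions."""
    for rule in rules:
        if rule.condition(error, app_state):
            return rule.actions(error)
    return default(error)

# A logical group of dependent tasks that is reset with a single action.
GROUPS: Dict[str, List[str]] = {"acquisition": ["reader", "filter"]}

rules = [
    # Any watchdog timeout inside the group restarts the whole group.
    Rule(lambda e, s: e.kind == "watchdog_timeout" and e.task in GROUPS["acquisition"],
         lambda e: [f"restart task {t}" for t in GROUPS["acquisition"]]),
    # State-dependent rule: only applies while the application is starting up.
    Rule(lambda e, s: e.kind == "value_fault" and s == "startup",
         lambda e: [f"start alternative task for {e.task}"]),
]

def default(e: ErrorReport) -> List[str]:
    # Default action for graceful degradation when no rule applies.
    return [f"shutdown node {e.node}"]

if __name__ == "__main__":
    print(recover(rules, default, ErrorReport("watchdog_timeout", 2, "filter"), "running"))
    print(recover(rules, default, ErrorReport("bus_error", 3, "logger"), "running"))
```

Keeping the rules as data, separate from the application functions, is what allows the recovery strategy to change without touching the functional code.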
Fig. 6. A global view of the use of RL: the user supplies RL source code, which is translated into binary r-codes; these codes are then interpreted at run time by the DIR net.

Adapting these recovery strategies allows changing the FT profile (recovery strategies) of the application, without major modifications to the application code, or vice versa [35]. This separation of concerns (functional aspects versus FT aspects) hence allows for better maintainability of the code and offers a way to master complexity. The application must incorporate provisions to ensure consistency when the recovery strategies (specified in RL) are executed. Checkpointing tools have not been developed in the EFTOS approach described here, because the applications from the case studies did not require them. When stateless applications are involved, ensuring consistency can be straightforward, as it is sufficient that all of the communicating processes restart from their initial state. When application processes do have a state, the library can contain tools to periodically save a checkpoint of the representative state; this state is then restored in the appropriate processes when the application is restarted. To this end, the RL recovery strategies can include signals to be sent to the processes, such that they are aware that they are restarted (rather than started for the first time). A consistent state could also be saved via other (non-EFTOS) tools for checkpointing; again, RL can then coordinate application recovery [36]. Alternatively, if one had to develop checkpointing tools, they would be included at the level of the basic tools.

The integration of EFTOS mechanisms in an application does not require that the application is rewritten from scratch in an appropriate form. It is often possible to integrate the error detection mechanisms with minimal code intrusion. However, for the recovery steps it is important that the processes can cope with a partial restarting of the application if such a strategy is specified in RL. For some application types, this requires no changes to the application structure (e.g., farmer-worker applications, stateless applications); for other existing applications this can be less obvious, and a partial redevelopment can be necessary. When the application is developed from scratch, integrating the EFTOS approach requires less effort than integrating all FT steps (from detection to recovery) into the application. It is generally very difficult to specify/interpret the FT requirements of an application, and this is definitely the job of the application developer. It is not the aim of the EFTOS approach to replace the developer by completely automating the process of making applications dependable. Rather, the EFTOS approach provides tools to help the developer incorporate FT in the application, once the individual needs have been recognized, in an easy and efficient way without the need to code these FT mechanisms from scratch.
This two-step tailoring of the framework for an application (elaboration of recovery strategies in RL and integration of adapted mechanisms) minimizes code intrusion and favors a small object code (which is beneficial for embedded applications). Only the necessary parts of the framework are integrated into the application, while unneeded parts are not incorporated. For these steps, the developer is assisted by a handbook to select the various techniques. This handbook is essentially a hypertext document that provides information at various levels:

Fig. 7. Implementation environments for the EFTOS framework.

• At the highest level, it provides a qualitative description of how the mechanisms improve the dependability of the application, and provides design patterns to select the most appropriate elements according to the embedded application and its environment. • At a middle level, it classifies tools according to their typical use in applications (via examples), and the involved overhead, both quantitatively and qualitatively. • At the lowest level, it provides a programmer’s reference manual explaining the use and features of the elements. As such, the application developer is able to trade-off the resulting enhanced application dependability against performance or resource consumption. IV. EXPERIENCE The EFTOS framework has been used for three application environments on different platforms (see Fig. 7). • A Parsytec CC system [37] with an EPX/nK environment on top of a 5-node PowerPC-based parallel system for an image-processing application, part of a postal automation system. • A Parsytec CCi system [38] with an EPX/WinNT environment on top of a Pentium Pro multiprocessor for a surface-inspection application in the steel industry. • An ENEL proprietary system with the TXT TEX environment [39] on top of a DEC-Alpha based distributed system for a sequence controller from energy transport. Above and below the common-core functions, an adaptation layer was required to link the application and computing platforms to the EFTOS API. A. Case Study Results An HVS sequence controller from ENEL (main Italian electricity supplier and third largest worldwide) has been used for an EFTOS case study. An HVS is a semi-automatic, remotely controlled node of interconnection and energy transformation

between the high-voltage transport network and the medium-voltage distribution network. Many of the older, dedicated computer systems used in such HVS are at the end of their life cycle, and their performance is too poor for today’s application management or for the addition of further functionality. Commercial off-the-shelf, parallel or distributed, high-performance computing systems are considered an important alternative. The sequence controller is the part of the HVS automation system located on the plant. It controls the execution of the operator commands and the evolution of the system. It executes a diagnostic activity to identify erroneous behavior, reporting alarms to the local or remote operator. This sequence controller is mission-critical (availability of the electric power is essential: partial or total unavailability has significant economic drawbacks; increased availability results in lower costs) rather than safety-critical (barriers to catastrophic failures are provided by additional protection relays). The level of required dependability varies for the different functions of the controller. For example, the acceptable unavailability ranges
• from 30–60 minutes/year for core functions: A(1 year) ≈ 0.9999;
• to 3–5 days/year for the lower level functions: A(1 year) ≈ 0.99.
Electricity itself is the main source of errors for the HVS automation system, although a first level of hardware protection mechanisms is adopted (electric dischargers, optical decouplers, filters, capacitors, etc.). Electrical and electromagnetic interference with the computer circuits (low voltage) and with the relay command circuit (medium voltage) might pass these barriers, and computation, memory, or I/O could be affected. While such faults cannot be avoided, they should not impair the process output, because this controller drives the actuators in the field. Besides the application of fault prevention, removal, and forecasting techniques, various FT techniques are integrated.
In particular, the following have been used:
• time-redundancy, to overcome the effect of transient disturbances [41],
• cyclic-restart, to force system evolution and to confine consequences of a fault within a single cycle,
• watchdog, to allow the sequence controller to release control of the plant in case of unrecoverable problems,
• hardware-redundancy with voting and correction mechanisms, to allow fault masking, etc.

TABLE I EFTOS MECHANISMS INTEGRATED IN THE ENEL CASE STUDY

Hardware and software solutions based on dedicated boards have been used to implement most of these FT techniques up to now. However, other solutions are needed to answer many pressing needs, e.g., • high maintenance costs of existing solutions, • limited performance of the existing solutions, and increased demand of functionality of the sequence controller, • reusability of FT solutions for various applications, • portability of automation applications on different platforms. Software-based FT solutions using a framework approach can provide the required flexibility, reusability, and portability at a reasonable cost. Embedded distributed platforms offer higher performance, provide redundancy on conventional hardware, and answer to the physical distribution of automation functions on the field. As an example of the EFTOS approach, a software module has been designed, implementing the SM [4], as a mechanism combining physical with temporal redundancy (and with several protocols) to recover from transient faults affecting memory or computation (see Section II-E-3). With respect to a previous solution relying on dedicated hardware boards, this software implementation of the SM module has the advantage of flexibility and maintainability. For example, it is possible to set the size of the stabilizing memory with a parameter, to select the number of redundant copies in the physical and temporal domain, and to modify the voting algorithm. The developer can adjust the allocation of the physically distributed copies to the available resources. All these parameters—and the actions to be taken when an error has been detected—are described in RL. The additional flexibility offered by these recovery strategies allows, e.g., reconfiguration by re-allocating the distributed copies to nonfailed components in case a permanent fault occurred. 
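The combination of physical and temporal redundancy with a configurable voter can be illustrated with a short sketch. It is only one possible reading of the idea, assuming that a value counts as stable when a strict majority of the physical copies agrees in each of the last few cycles and the per-cycle results coincide; the function names `majority_vote` and `stabilise` and the parameter `temporal_depth` are hypothetical and do not reproduce the EFTOS SM protocols.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence

def majority_vote(copies: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value held by a strict majority of the copies, else None."""
    if not copies:
        return None
    value, count = Counter(copies).most_common(1)[0]
    return value if count > len(copies) // 2 else None

def stabilise(cycles: List[Sequence[Hashable]],
              temporal_depth: int) -> Optional[Hashable]:
    """Combine spatial and temporal redundancy (assumed scheme).

    `cycles` holds, per execution cycle, the values read from the
    physically distributed copies. A value is returned only if the
    spatial vote succeeds in each of the last `temporal_depth` cycles
    and all those votes agree; otherwise None signals "not stable".
    """
    if temporal_depth <= 0 or len(cycles) < temporal_depth:
        return None
    votes = [majority_vote(c) for c in cycles[-temporal_depth:]]
    first = votes[0]
    if first is not None and all(v == first for v in votes):
        return first
    return None

if __name__ == "__main__":
    # One copy is corrupted by a transient fault in the second cycle;
    # the spatial vote masks it and the value remains stable.
    history = [[7, 7, 7], [7, 3, 7], [7, 7, 7]]
    print(stabilise(history, temporal_depth=2))
```

In this sketch the number of copies, the temporal depth, and the voting function are ordinary parameters, which mirrors the configurability that the software module offers over a dedicated board.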
The recovery strategies are not hard-coded in the application, but are specified in RL at a higher level and executed by the backbone that interacts with the modules. This improves the maintainability of the application. Because the interface to the dedicated board and to the software module is identical, the complexity for the developer is equivalent in both implementations. In a similar way, the ENEL pilot application allows using different restart mechanisms (driven by RL): • hardware restart (signal from external board); • process-level restart (restart of task by kernel, triggered in RL), node-level restart (handled by DIR net), and application-component restart (by partitioning the application tasks). As a third example, the ENEL application uses flexible watchdog techniques: a hardware watchdog on a dedicated external board, an on-board watchdog (that can be queried and activated by the DIR net), and a software watchdog from the

basic tools (executed on a remote node). RL allows selecting the appropriate configuration and set-up of the parameters. This EFTOS framework approach, applied to single subsystems (the SM) as well as to the entire pilot application, will shorten the development cycle and improve the maintainability of the application. In the traditional methodology with dedicated hardware solutions, every change in the environment (larger system, additional functionality) resulted in a different implementation of the dedicated solution; moreover, every application is embedded in a different environment. Both factors affected the implementation. By using RL in the EFTOS approach, the configuration of the elements and of the recovery strategies themselves can be adapted, e.g., due to changing environments or requirements, without major modifications to the application code. Analogously, if the functional aspects of the application need to be modified, this does not necessarily interfere with the FT strategies. This allows specialists in FT to cooperate with specialists in the target application (this is easier than requiring people specialized in both the application and dependability). The flexibility is further ensured by the modularity and openness of the framework, such that additional detection, isolation, or recovery mechanisms can be added if required. Table I lists all EFTOS mechanisms that have been integrated in the ENEL case study. The entire middleware comprises about 10 lines of C source code (which is effectively small with respect to the target applications). The coding effort for integrating the mechanisms into the application code amounted to adapting about 1000 lines of source code, and the recovery scripts written in RL contain about 200 lines. The integration took about 1 person-month for an application expert who was assisted by an EFTOS expert.
An internal ENEL study showed that both the original (dedicated) solution, and the software-based solution in the pilot application with the EFTOS approach, have met the necessary dependability requirements, where the latter provides more flexibility. The development-cost reduction introduced by the framework usage has been evaluated in an internal ENEL-study. The development of the FT layer of the sequence controller, as a tailored instance of the EFTOS framework, allowed a 30%–35% reduction of analysis, design, and implementation costs with respect to the effort to develop the previous solution. (This layer took about 50% of the development costs of the entire application.) The cost reduction was limited by the need to analyze the framework impact on the entire FT strategy and by the integration of the framework mechanisms with existing ones.

TABLE II COVERAGE OF SEVERAL EFTOS FRAMEWORK ELEMENTS: WHICH ERRORS CAN BE DETECTED, AND HOW IS RECOVERY POSSIBLE

B. Overhead Considerations and Dependability Improvement Measurements on the various platforms gave the time and resource overheads. These concern the elements of the framework in stand-alone configuration, integrated framework measurements, and application measurements. For the tested industrial target applications, time overhead is typically 5%–10% in the fault-free case. Recovery times are determined by the time required to reset/reboot the system, the reintegration of the fresh resources, and the application recovery. The size of the entire middleware ranges from 100 KB to 650 KB for the given development platforms. The increased dependability (availability, integrity, or maintainability) depends on the application itself and on the environment (type and frequency of faults). Table II summarizes the coverage of several of the EFTOS framework elements. 1) Performance of the Backbone: The performance of the backbone depends first on its basic gossiping service (used to forward information internally [42]). The algorithm of mutual suspicion (for the self-FT of the backbone) is based on the values of the deadlines of a number of time-outs: the MIA, TAIA, TEIF time-outs. Consider the case of a crashed-node hosting the manager. The error-detection latency is a contribution of the time required to • detect that a MIA message is missing (equal to the deadline of a MIA time-out) • plus the time required to detect that a TEIF message is missing, • plus some minor delay due to communication and execution overheads. Performance also depends on the communication network—the faster the network is, and the larger its throughput, the shorter can be the deadlines of these time-outs without congesting the network. The choice of which deadlines to adopt is important (these deadlines can be set in RL): • Too-short deadlines can congest the network and bring many periods of instability. 
These periods are characterized by system partitioning, in which correct nodes are disconnected from the rest of the system for the entire duration of the period. This requires further overhead for managing the subsequent partition merging. Further reducing the deadlines exacerbates the instability and results in chaotic behavior. • Too-long deadlines translate into fewer messages, which demand fewer resources from the communication network, but also imply a larger error-detection latency, and consequently a larger delay for initiating recovery.

Fig. 8. Time to detect that an agent is faulty and to remove it from the DIR net.

In general, a proper trade-off between communication-overhead and detection-latency is required. Both for the prototype realized on a Parsytec Xplorer and for that running on Windows NT, the same values are being used: • A deadline of 1 second for the time-outs for sending a MIA or TAIA message. • A deadline of 2 seconds for the time-outs corresponding to declaring a missing MIA or TAIA message, thus initiating a “suspicion period.” • A deadline of 0.9 seconds for declaring a missing TEIF message, and thus initiating the graceful degradation of the backbone. Fig. 8 shows the results for a version running on 2 Windows-NT nodes. Times are expressed in seconds. 47 runs are reported. The deadline of the TEIF time-out is 0.9 second in this case. Communication and execution overheads come to a maximum of about 1.2 seconds. Values are related to a backbone running on 2 Windows NT workstations connected via a 10 MB/sec Ethernet, and both based on a Pentium processor at 133 MHz. 2) Performance of the Recovery Interpreter: To estimate the time required for error recovery, including the overhead of the RINT recovery interpreter (the part of the backbone that interprets the recovery strategies written in RL), a test application was run consecutively 120 times, injecting a value fault on 1 of the versions. The system replied executing an RL script for the recovery of a TMR-and-a-spare system: it “switches the failed task off” and “switches a spare in,” thus restoring fault-free TMR. The times reported in Fig. 9 relate to the prototype running on the Parsytec Xplorer (based on PowerPC 601 processors with a clock frequency of 66 MHz).
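The time-out values quoted above for the backbone (2 s to declare a MIA or TAIA message missing, 0.9 s to declare a TEIF message missing) can be turned into a small sketch of a suspicion timer and a worst-case latency estimate. The class `SuspicionTimer` and the function `worst_case_latency` are hypothetical names, and the three-state model is a simplification of the mutual-suspicion algorithm rather than its implementation. Reusing the measured maximum overhead of about 1.2 s as the "minor delay" term is likewise an assumption, since that figure was obtained on the 2-node Windows NT configuration.

```python
class SuspicionTimer:
    """Simplified view of how one backbone component judges a peer.

    alive     -> a heartbeat (MIA/TAIA) arrived within the deadline
    suspected -> the heartbeat is missing; the "suspicion period" runs
    faulty    -> no TEIF arrived within its deadline either
    """

    def __init__(self, missing_deadline_s: float = 2.0,
                 teif_deadline_s: float = 0.9) -> None:
        self.missing_deadline_s = missing_deadline_s
        self.teif_deadline_s = teif_deadline_s
        self.last_heartbeat_s = 0.0

    def heartbeat(self, now_s: float) -> None:
        # Record the arrival time of the latest MIA/TAIA message.
        self.last_heartbeat_s = now_s

    def state(self, now_s: float) -> str:
        silent_s = now_s - self.last_heartbeat_s
        if silent_s < self.missing_deadline_s:
            return "alive"
        if silent_s < self.missing_deadline_s + self.teif_deadline_s:
            return "suspected"
        return "faulty"


def worst_case_latency(missing_deadline_s: float,
                       teif_deadline_s: float,
                       overhead_s: float) -> float:
    """Sum of the two deadlines plus communication/execution overhead."""
    return missing_deadline_s + teif_deadline_s + overhead_s


if __name__ == "__main__":
    timer = SuspicionTimer()
    timer.heartbeat(10.0)
    for t in (11.0, 12.5, 13.0):
        print(t, timer.state(t))
    # Roughly 4.1 s with the values quoted in the text.
    print(worst_case_latency(2.0, 0.9, 1.2))
```

Shortening either deadline lowers the latency bound directly, which is the benefit that has to be weighed against the extra message traffic and the risk of instability.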

TABLE III TIME OVERHEAD OF THE VOTING-FARM, FOR 1 TO 4 NODE SYSTEMS (1 VOTER/NODE)

TABLE IV TIME OVERHEAD OF THE FT_comm MECHANISM (SYNCHRONOUS COMMUNICATION CASE) COMPARED TO PROCESSING TIME (ORDER OF 1 s) AND WAIT TIME (ORDER OF 100 ms)

Fig. 9. Recovery times.

3) Performance of the Distributed Voting Tool: This section estimates the voting delay of the distributed voting tool experimentally. Overheads are expressed in terms of system resources (e.g., threads, memory occupation). All measurements were made by running a restoring organ consisting of several processing nodes. The executable file was obtained on a Parsytec CC system with the “ancc” C compiler using the “-O” optimization flag. The application was executed in 4 runs, each of which was repeated 50 times, increasing the number of voters from 1 to 4, with 1 voter per node. Wall-clock times were collected. Table III gives the averages and standard deviations as a function of the number of voters. The system requires:
• threads to be spawned,
• inter-process communication descriptors for the communication between each user module and its local voter.
The network of voters needs additional remote communication descriptors.

4) Performance of the FT_comm Mechanisms: The FT_comm mechanisms spawn 2 threads in the synchronous communication case (the sender and receiver CCT) and 1 thread in the asynchronous case (the monitoring thread). Time overhead depends on the time-out values used by these threads. These values must be tuned to the particular system constraints and the real-time requirements of the application. For a given image-processing application (size approximately 1 MB), the FT_comm mechanism added 50 KB. Table IV shows the time-overhead values. These measurements were obtained by running a generic application on top of the EPX OS with and without the FT_comm library; they depict the
• time to create a communication link,
• time to transfer a message through a link,
• time to abolish a link.
The connect-times and break-times are of minor importance because these functions are performed only once. The transfer-time becomes more important as the application becomes more communication-intensive.
For example, in applications where communications consume much more time than the CPU calculations, and they are perfectly synchronized, the total run time (Table IV) almost quadruples by using this functionality. However, such applications are exceptional and are not within the scope of this study. The applications that can benefit from the FT_comm library are those where communication-time is comparable to CPU-time (e.g., 50%–200% of the CPU time). This communication time should include the wait-time due to imperfect synchronization. Except for special cases (e.g., the parallel solution of 2-D differential equations), this wait-time is much larger than the transfer-time. For example, in the real applications for which this library was designed, the average wait-time is of the order of 100 ms. For such applications, the total overhead is comparatively small.

C. Appropriateness of Framework Elements

Notation:
$T_p$: processing time (useful CPU time) of the application
$T_w$: waiting-time for synchronous communication due to imperfect synchronization
$T_t$: transfer-time for synchronous communication
$T$: “total application run-time” in the fault-free case
$T_f$: average period of fault occurrence
$T_{rb}$: rebooting-time
$T_{rc}$: recovery-time for a given recovery tool
$T_{noFT}$: “total application run-time” when faults occur, and FT is not incorporated (only rebooting)
$T_{FT}$: “total application run-time” when faults occur, and FT is incorporated
$\alpha$: overhead of FT_comm library with respect to $T_t$
$O_{rb}$: overhead due to rebooting when FT is not incorporated
$O_{rc}$: overhead due to recovery when FT is incorporated.

An important aspect of EFTOS is the selection of the framework elements to be integrated with the application. If overhead measurements are known for a given implementation on a given platform, and a fault frequency is assumed, then the appropriateness of the selected mechanisms can be analyzed. This allows estimating, in advance, the time benefits of the adoption of a mechanism; this analysis considers only time-overhead. Other advantages stemming from the use of the framework approach, or of its elements, are not included in this deduction, e.g., flexibility, maintainability, ease of use, reusability; these properties are much harder to quantify.

As an example, the synchronous communication FT from Section II-D-1 is analyzed. The total application run-time when no fault occurs is:

$$T = T_p + T_w + T_t \tag{1}$$

It is assumed that faults occur randomly at a rate $1/T_f$. If no fault-tolerant synchronous communication is used, and if a fault occurs, then assume that the user or the system initiates recovery (e.g., rebooting), which takes time $T_{rb}$. Also, it is optimistically assumed that, after rebooting, the application returns to the latest point before the failure and does not have to re-execute tasks; alternatively, $T_{rb}$ needs to include the application recovery. The average total application run time when faults occur is then:

$$T_{noFT} = T \left( 1 + \frac{T_{rb}}{T_f} \right) \tag{2}$$

When the fault-tolerant communication library is used, there is overhead equal to $\alpha T_t$ (proportional to $T_t$) in the fault-free case. Each fault activates a recovery tool that takes time $T_{rc}$. The average total application run time (all errors are detected) is then

$$T_{FT} = T + \alpha T_t + T \, \frac{T_{rc}}{T_f} \tag{3}$$

It is assumed that the incorporation of the library does not alter the processing-time nor the waiting-time. This assumption is valid in applications where $\alpha T_t$ is small, because then the extra time due to the detection and recovery tools of the library does not appreciably alter the profile of the processes’ states versus time; hence it does not appreciably alter the synchronization between processes. For common practical applications, the relations $T_{rc} \ll T_{rb} \ll T_f$ hold. The first inequality is valid because
• $T_{rc}$ is of the order of milliseconds (time taken to run a recovery routine),
• $T_{rb}$ is of the order of seconds or more (time to reboot a system).
The second inequality is valid because it can be assumed that faults occur not too often (e.g., $T_f$ of the order of hours or days). The total overhead due to rebooting the system (when the fault-tolerant communication library is not used) is

$$O_{rb} = T_{noFT} - T = T \, \frac{T_{rb}}{T_f} \tag{4}$$

The total overhead due to the detection and recovery tools of the library is

$$O_{rc} = T_{FT} - T = \alpha T_t + T \, \frac{T_{rc}}{T_f} \tag{5}$$

As an example, for a given implementation in which the communication times and processing times are about the same, the overhead in the fault-free case is about 5% [33].

Using the fault-tolerant communication library results in a shorter application execution time if $O_{rc} < O_{rb}$. Combine this inequality with (4) and (5); then

$$\alpha T_t + T \, \frac{T_{rc}}{T_f} < T \, \frac{T_{rb}}{T_f} \tag{6}$$

$$\frac{T_t}{T} < \frac{T_{rb} - T_{rc}}{\alpha \, T_f} \approx \frac{T_{rb}}{\alpha \, T_f} \tag{7}$$

Equation (7) shows that using the library is not advantageous when the ratio of “communication transferring-time” to “total application-time” is greater than the upper bound on the r.h.s. of (7). Continuing this example, write $B$ for the r.h.s. of (7); from this inequality and (1), it follows that

$$\frac{T_p + T_w}{T_t} > \frac{1}{B} - 1$$

This means that if the processing-time is not large enough compared to the transferring-time, then the ratio of total waiting-time to transferring-time must be large enough for the library to be effective when in use.

Similar reasoning is applied to the other elements of the EFTOS framework. Let there be $n$ categories of FT tools, each with an overhead $O_i$ in the fault-free case and a recovery time $T_{rc,i}$; then the average total application run-time is increased by

$$O_i + T \, \frac{T_{rc,i}}{T_{f,i}}$$

for each category $i$ of tools that is used, or by

$$T \, \frac{T_{rb}}{T_{f,i}}$$

when “rebooting of the system” performs better than the corresponding tool for a given implementation (category $i$ not used). The average total application run time, when faults of category $i$ occur randomly at a rate $1/T_{f,i}$, thus becomes:

$$T_{FT} = T + \sum_{i=1}^{n} \left[ \delta_i \left( O_i + T \, \frac{T_{rc,i}}{T_{f,i}} \right) + (1 - \delta_i) \, T \, \frac{T_{rb}}{T_{f,i}} \right] \tag{8}$$

In (8), $\{\delta_i\}$ is a set of binary parameters enabling only those mechanisms whose corresponding overhead is less than the overhead caused by system reboot:

$$\delta_i = \begin{cases} 1 & \text{if } O_i + T \, \dfrac{T_{rc,i}}{T_{f,i}} < T \, \dfrac{T_{rb}}{T_{f,i}} \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

Equations (8) and (9) allow incorporating only those FT tools that increase the availability of the application for “a given implementation of the tools” and “a given fault pattern.” Indeed, (9) selects only those categories for which the time overhead associated with the FT actions is smaller than some standard action (rebooting) that is performed when the category is not used.
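The selection rule described in this section can be tried out numerically with a short sketch. It assumes, as a simplification, that the expected number of faults of a category during one run equals the fault-free run time divided by the average fault period of that category, and that a tool is worth integrating only when its fault-free overhead plus its expected recovery time is smaller than the expected rebooting time. The names `ToolCategory` and `select_tools`, and all numeric values in the usage example, are hypothetical and are not measurements from the case studies.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ToolCategory:
    name: str
    fault_free_overhead_s: float  # extra run time when no fault occurs
    recovery_time_s: float        # time the tool needs to recover from one fault
    fault_period_s: float         # average time between faults of this category

def overhead_with_tool(c: ToolCategory, run_time_s: float) -> float:
    """Fault-free overhead plus expected time spent in the recovery tool."""
    expected_faults = run_time_s / c.fault_period_s
    return c.fault_free_overhead_s + expected_faults * c.recovery_time_s

def overhead_with_reboot(c: ToolCategory, run_time_s: float,
                         reboot_time_s: float) -> float:
    """Expected time spent rebooting when the tool is not integrated."""
    expected_faults = run_time_s / c.fault_period_s
    return expected_faults * reboot_time_s

def select_tools(categories: List[ToolCategory], run_time_s: float,
                 reboot_time_s: float) -> Tuple[Dict[str, bool], float]:
    """Return which categories to integrate and the resulting average run time."""
    selected: Dict[str, bool] = {}
    total_s = run_time_s
    for c in categories:
        with_tool = overhead_with_tool(c, run_time_s)
        with_reboot = overhead_with_reboot(c, run_time_s, reboot_time_s)
        use_tool = with_tool < with_reboot   # binary enabling parameter
        selected[c.name] = use_tool
        total_s += with_tool if use_tool else with_reboot
    return selected, total_s

if __name__ == "__main__":
    categories = [
        # Frequent faults, cheap tool: integrating it pays off.
        ToolCategory("communication", 5.0, 0.01, 1800.0),
        # Rare faults, expensive tool: rebooting is cheaper on average.
        ToolCategory("memory", 200.0, 0.01, 36000.0),
    ]
    print(select_tools(categories, run_time_s=3600.0, reboot_time_s=60.0))
```

Such a calculation covers time overhead only; flexibility, maintainability, and reusability are outside its scope, as noted above.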

V. DISCUSSION The EFTOS approach, with • basic FT tools and mechanisms; • a backbone; and • a high-level language (RL) for specifying recovery strategies; allows for flexible integration of FT into embedded applications with soft real-time requirements on homogeneous message-passing multiprocessors. The application developer integrates only those parts of the framework that are required; those elements can be adapted, and be used in a stand-alone configuration or coupled to the backbone. If “coupled to the backbone,” then the “existence of a kind of second application layer, RL, devoted to specifying recovery strategies” permits “separation of design concerns.” This systematic framework-approach shortens the development cycle and improves the maintainability of the fault-tolerant application. Hence, the FT strategy can be adapted, e.g., due to changing environments or requirements, without major modifications to the application code. On the other hand, if the functional aspects of the application need to be modified, this does not interfere with the FT strategies [35]. The flexibility is further ensured by the modularity and openness of the framework, such that • additional detection, isolation, or recovery mechanisms can be added, and • portability to other environments is facilitated. This results in cost-effectiveness, because the tailoring of FT functions from a library or framework is easier than their development from scratch of ad hoc solutions for each application. Our future research concentrates on • support for validation and verification of the dependability requirements of the application, during the life cycle of an application—from design, through development and implementation, to operation; • the integration with a) hard real-time applications and b) interoperability with emerging technologies, standards, and products (e.g., Java, POSIX, CORBA).

ACKNOWLEDGMENT The authors thank the project partners (especially O. Botti) for their contributions to the design and development efforts, and the reviewers of this paper for their useful comments.

REFERENCES [1] J. H. Saltzer, D. P. Reed, and D. D. Clark, “End-to-end arguments in system design,” ACM Trans. Computer Syst., vol. 2, no. 4, pp. 277–288, 1984.

[2] G. Deconinck, T. Varvarigou, and O. Botti et al., “(Reusable software solutions for more fault-tolerant) industrial embedded HPC applications,” Int. J. Supercomputer (ASFRA BV, Edam, The Netherlands), vol. 69, no. 3/4, pp. 23–44, 1997.

[3] G. Deconinck, R. Lauwereins, and N. vom Schemm, “Fault tolerance requirements in postal automation: A case study,” in Proc. 4th IFAC Workshop on Algorithms and Architectures for Real-Time Control (AARTC’1997), A. E. Ruano and P. J. Fleming, Eds. New York: Pergamon, 1997, pp. 155–160.

[4] G. Deconinck, O. Botti, and F. Cassinari et al., “Stable memory in substation automation: A case study,” in Digest of Papers, 28th Ann. Int. Symp. Fault-Tolerant Computing (FTCS-28): IEEE Computer Soc. Press, 1998, pp. 452–457.

[5] D. Powell, J. Arlat, and L. Beus-Dukic et al., “GUARDS: A generic upgradable architecture for real-time dependable systems,” IEEE Trans. Parallel and Distributed Syst., vol. 10, pp. 580–597, June 1999.

[6] D. Powell, Ed., Delta-4: A Generic Architecture for Dependable Distributed Computing. New York: Springer-Verlag, 1991.

[7] P. A. Barrett et al., “The Delta-4 extra performance architecture (XPA),” in Proc. 20th Fault-Tolerant Computing Symp., 1990, pp. 481–488.

[8] B. Randell, J.-C. Laprie, H. Kopetz, and B. Littlewood, Eds., ESPRIT Basic Research Series: Predictably Dependable Computing Systems. New York: Springer-Verlag, 1995.

[9] G. Deconinck, J. Vounckx, and R. Cuyvers et al., “Fault tolerance in massively parallel systems,” Transputer Communications, vol. 2, no. 4, pp. 241–257, Dec. 1994.

[10] G. Agha and D. C. Sturman, “A methodology for adapting to patterns of faults,” in Foundations of Ultradependability, G. Koob, Ed.: Kluwer Academic, 1994, vol. 1.

[11] Z. T. Kalbarczyk, R. K. Iyer, S. Bagchi, and K. Whisnant, “Chameleon: A software infrastructure for adaptive fault tolerance,” IEEE Trans. Parallel and Distributed Syst., vol. 10, pp. 560–579, June 1999.

[12] R. van Renesse, K. P. Birman, and S. Maffeis, “Horus: A flexible group communication system,” Comm. ACM, vol. 39, no. 4, pp. 76–83, 1996.

[13] [Online]. Available: http://www.sun.com/chorusos/

[14] D. B. Stewart, D. E. Schmitz, and P. K. Khosla, “The Chimera II real-time operating system for advanced sensor-based robotic applications,” IEEE Trans. Systems, Man, Cybernetics, vol. 22, pp. 1282–1295, Nov./Dec. 1992.

[15] M. Hecht, J. Agron, H. Hecht, and K. H. Kim. A distributed fault tolerant architecture for nuclear reactor and other critical process control applications. Presented at Proc. IEEE 21st Fault Tolerant Computing Symp. [Online]. Available: http://www.sohar.com/sdft.html

[16] G. Kiczales, J. des Rivières, and D. G. Bobrow, The Art of the Metaobject Protocol: MIT Press, 1991.

[17] H. Masuhara, S. Matsuoka, T. Watanabe, and A. Yonezawa, “Object-oriented concurrent reflective languages can be implemented efficiently,” in Proc. Conf. Object-Oriented Programming Systems, Languages, and Applications (OOPSLA-1992), 1992, pp. 127–144.

[18] J.-C. Fabre, V. Nicomette, and T. Pérennou et al., “Implementing fault-tolerant applications using reflective object-oriented programming,” in Proc. 25th Int. Symp. on Fault-Tolerant Computing (FTCS’25): IEEE Computer Soc. Press, 1995, pp. 489–498.

[19] J.-C. Fabre and T. Pérennou, “A metaobject architecture for fault-tolerant distributed systems: The FRIENDS approach,” IEEE Trans. Computers (Special Issue on Dependability of Computing Systems), pp. 78–95, Jan. 1998.

[20] K. H. Kim, “ROAFTS: A middleware architecture for real-time object-oriented adaptive fault tolerance support,” in Proc. HASE 1998 (IEEE CS 1998 High-Assurance Systems Engineering Symp.), pp. 50–57.

[21] J. Arlat, K. Kanoun, and J.-C. Laprie, “Dependability modeling and evaluation of software fault-tolerant systems,” IEEE Trans. Computers, vol. 39, pp. 504–513, Apr. 1990.

[22] R. Geist and K. Trivedi, “Reliability estimation of fault-tolerant systems: Tools and techniques,” IEEE Computer, pp. 52–61, July 1990.

[23] “Special issue on fault tolerance,” IEEE Trans. Reliability, vol. 42, June 1993.

[24] J. Arlat, A. Costes, J.-C. Laprie, and D. Powell, “Fault injection and dependability evaluation of fault-tolerant systems,” IEEE Trans. Computers, vol. 42, pp. 913–923, Aug. 1993.

[25] D. B. Stewart, R. A. Volpe, and P. K. Khosla, “Integration of real-time software modules for reconfigurable sensor-based control systems,” in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS 1992), 1992, pp. 325–332.

[26] Y. Huang and C. M. R. Kintala, “Software fault tolerance in the application layer,” in Software Fault Tolerance, M. Lyu, Ed. New York: Wiley, 1995.

[27] M. R. Lyu, Ed., Handbook of Software Reliability Engineering: McGraw-Hill, 1995.

[28] Y. M. Wang, Y. Huang, and K. P. Vo et al., “Checkpointing and its applications,” in Proc. 25th Int. Symp. Fault-Tolerant Computing (FTCS’25), 1995.

[29] M. Cukier, J. Ren, and C. Sabnis et al., “AQUA: An adaptive architecture that provides dependable distributed objects,” in Proc. 17th Symp. Reliable and Distributed Systems, 1998.

[30] W. Rosseel, V. De Florio, and G. Deconinck et al., “Novel atomic action protocol for parallel systems with communication faults,” in Proc. 16th IASTED Int’l Conf. Applied Informatics (AI’1998): IASTED/ACTA Press, pp. 344–347.

[31] V. De Florio, G. Deconinck, and R. Lauwereins, “Software tool combining fault masking with user-defined recovery strategies,” IEE Proc.–Software Special Issue on Dependable Computing Systems (IEE), vol. 145, no. 6, pp. 203–211, 1998.

[32] G. Deconinck, V. De Florio, R. Lauwereins, and R. Belmans, “A software library, a control backbone and user-specified recovery strategies to enhance the dependability of embedded systems,” in Proc. 25th EUROMICRO Conf. (EuroMicro’1999), Workshop on Dependable Computing Systems: IEEE Computer Soc. Press, 1999, vol. II, pp. 98–104.

[33] G. Efthivoulidis, E. Verentziotis, and A. Meliones et al., “Fault-tolerant communication in embedded supercomputing,” IEEE Micro Special Issue on Fault Tolerance, vol. 18, no. 5, Sept.–Oct. 1998.

[34] V. De Florio, G. Deconinck, and M. Truyens et al., “A hypermedia distributed application for monitoring and fault-injection in embedded fault-tolerant parallel programs,” in Proc. 6th Euromicro Conf. Parallel and Distributed Processing (PDP’1998): IEEE Computer Soc. Press, 1998, pp. 349–355.

[35] V. De Florio, G. Deconinck, and R. Lauwereins, “Recovery languages: An effective structure for software fault tolerance,” in Fast Abstract at 9th Int’l Symp. Software Reliability Engineering (ISSRE’98), R. Chillarege and T. Illgen, Eds., 1998, pp. 39–40.

[36] J. S. Plank, M. Beck, G. Kingsley, and K. Li, “Libckpt: Transparent checkpointing under UNIX,” in Usenix Winter 1995 Technical Conf., 1995, pp. 213–223.

[37] Anon., “Parsytec CC series – Cognitive Computing,” Parsytec GmbH, 1995.

[38] Anon., “Parsytec CCi Series,” Parsytec GmbH, 1997.

[39] F. Cassinari and E. Birindelli, “TEX User Manual,” TXT Ingegneria Informatica, 1997.

[40] O. Botti and M. Cesana et al., “SISPAR – Simulation of high-voltage substations on parallel architectures,” in Proc. Int. Conf. High-Performance Computing and Networking (HPCN’1996), ser. Lecture Notes Comp. Sc. 1067, H. Liddell et al., Eds.: Springer, 1996, pp. 935–937.

[41] R. Gargiuli and P. G. Mirandola et al., “ENEL approach to computer supervisory remote control of electric power distribution network,” in Proc. 6th IEE Int. Conf. Electricity Distribution (CIRED’1981), 1981, pp. 187–192.

[42] V. De Florio, G. Deconinck, and R. Lauwereins, “A novel distributed algorithm for high-throughput and scalable gossiping,” in Proc. 8th Int. Conf. High Performance Computing and Networking (HPCN Europe 2000), ser. Lecture Notes in Computer Science, M. Bubak et al., Eds.: Springer-Verlag, 2000, vol. 1823, pp. 313–322.

Geert Deconinck received the M.Sc. degree (1991) in electrical engineering and the Ph.D. degree (1996) in applied sciences from the Katholieke Universiteit Leuven (K.U. Leuven), Belgium. He is a Postdoctoral Fellow of the Fund for Scientific Research – Flanders (Belgium) (F.W.O.). He works in the Application-driven Configuring of Computing Architectures (ACCA) research group of the Department of Electrical Engineering (ESAT), K.U. Leuven, Belgium, where he has also been a visiting professor since 1999. His research interests include the design, analysis, and assessment of software-based fault-tolerance solutions to meet real-time, dependability, and cost constraints for embedded applications on parallel and distributed systems. In this field, he has authored and coauthored more than 50 publications in international journals and conference proceedings. In 1995–1997, he received a grant from the Flemish Institute for the Promotion of Scientific-Technological Research in Industry (IWT). Dr. Deconinck is an ASQ-certified reliability engineer, a senior member of the IEEE, and a member of the IEEE Reliability Society and Computer Society and of the Royal Flemish Engineering Society (KVIV).

Vincenzo De Florio received the Laurea degree (1987) in computer science from the University of Bari, Italy, and the Ph.D. degree (2000) in engineering from the University of Leuven, Belgium. From 1990 to 1996 he was with the Tecnopolis Research Center and from 1992 to 1995 with the School for Advanced Studies in Industrial and Applied Mathematics, Italy, where he was a lecturer and researcher; his main research interest was parallel computing. From 1996 to 2000 he was with the ACCA Division at the ESAT Department of the University of Leuven, Belgium, where he took part in the two ESPRIT projects 21 012 EFTOS and 28 620 TIRAN. Since November 2000, he has been a Postdoctoral Researcher with the ACCA Division. His current main research interests include FT structuring techniques for adaptable distributed applications.

Theodora A. Varvarigou received the B.Tech. degree (1988) from the National Technical University of Athens (NTUA), Greece, the M.S. degree (1989) in electrical engineering, the M.S. degree (1991) in computer science from Stanford University, CA, and the Ph.D. degree (1991) also from Stanford University. She was with AT&T Bell Labs, NJ, between 1991 and 1995. Between 1995 and 1997, she worked as an Assistant Professor at the Technical University of Crete, Chania, Greece. Since 1997 she has been an Assistant Professor with the NTUA. Her research interests include parallel algorithms and architectures, fault-tolerant computation, and parallel scheduling on multiprocessor systems.

Evangelos Verentziotis received the B.S. degree (1995) in electrical engineering from the Hellenic Air Force Academy, and the Ph.D. degree (2000) from the National Technical University of Athens (NTUA). He is a Research Associate with the Computer Science Division of the Electrical and Computer Engineering Department of NTUA and has been involved in many national, European Union (ESPRIT), and multinational military projects. His research interests include distributed high-performance computing, parallel algorithms and architectures, and fault-tolerant computing. Dr. Verentziotis is a member of the IEEE and the IEEE Computer Society.