Separating Recovery Strategies from Application Functionality: Experiences with a Framework Approach

Geert Deconinck, K.U.Leuven, Leuven
Vincenzo De Florio, K.U.Leuven, Leuven
Oliver Botti, CESI, Milan

0-7803-6615-8/01/$10.00 © 2001 IEEE. 2001 PROCEEDINGS Annual RELIABILITY and MAINTAINABILITY Symposium

Key Words: software-implemented fault tolerance, dependable computing, software tool, software maintainability, adaptable framework approach, recovery strategies.

SUMMARY & CONCLUSIONS

Industry-oriented fault tolerance solutions for embedded distributed systems should be based on adaptable, reusable elements. Software-implemented fault tolerance can provide such flexibility via the presented framework approach. It consists of 1) a library of fault tolerance functions, 2) a backbone coordinating these functions, and 3) a language expressing configuration and recovery. This language is a sort of ancillary application layer, separating recovery aspects from functional ones. Such a framework approach allows for a flexible combination of the available hardware redundancy with software-implemented fault tolerance. This increases the availability and reliability of the application at a justifiable cost thanks to the re-usability of the library elements in different target systems. It also increases the maintainability due to the separation of the functional behavior from the recovery strategies that are executed when an error is detected - as the modifications to functional and non-functional behavior are, to some extent, independent and hence less complex. Practical experience is reported from the integration of this framework approach in an automation system for electricity distribution. This case study illustrates the power of software-based fault tolerance solutions and of the configuration-and-recovery language ARIEL to allow flexibility and adaptability to changes in the environment.

1. INTRODUCTION

Costs are an important driving force for distributed embedded systems, also when fault tolerance is concerned. This calls for open, flexible and configurable solutions that are able to answer in a cost-effective way to a variety of dependability requirements. Such solutions can be based on pre-built and reusable modules, adaptable for a wide range of applications and reusable in different environments. In this context, this article presents a framework approach that provides fault tolerance capabilities to embedded systems by exploiting the systems' distributed hardware and by separating the functional behavior from the actions to be executed when an error is detected. (We call these actions the recovery strategies.)

This conceptual framework consists of the following three entities (Refs. 1, 2, 3):

• A library of basic tools: this library provides basic elements for error detection, localization, containment, recovery and fault masking. The tools are software-based implementations of well-known fault tolerance mechanisms, grouped in a library of adaptable, parametric functions. These basic tools can be used on their own, or as co-operating entities attached to a backbone (see below). Examples include watchdogs, voting units, support for acceptance tests, replicated memory, etc.

• A control backbone: this backbone is a distributed application that extracts information about the application's topology, its progress and its status. It maintains this information in a replicated database and it coordinates the fault tolerance actions at run-time via the interpretation of user-defined recovery strategies. The backbone functions as a sort of middleware. It is hierarchically structured to maintain a consistent system view and contains self-testing and self-healing mechanisms.

• A high-level configuration-and-recovery language (ARIEL): this language is used to configure the basic tools and to express the recovery strategies. The application developer specifies these configurations and recovery actions via a descriptive language called ARIEL. For configuration purposes, ARIEL is able to set parameters and properties of the basic tools. For expressing recovery strategies - i.e. indicating fault tolerance strategies by detailing localization, containment and recovery actions to be executed when an error is detected - ARIEL allows building queries on the database of the backbone and attaching actions to these queries. These actions allow, e.g., to start, terminate, isolate or inform an entity. Such an entity can be a node, a single task, a group of tasks, etc. As such, it is possible to start a standby task, to reset a node or link, to generate synchronization signals to order a reconfiguration, etc. Several ARIEL templates support fault tolerance strategies based on standby sparing, recovery blocks, N-modular redundancy, etc.

The innovative aspects of this approach do not come from the implementation of the library of well-known fault tolerance tools, but rather from the combination with the backbone that executes predefined actions as described in ARIEL when an error is detected. Such (possibly distributed) recovery strategies specified in ARIEL let the user separately address the (non-functional) aspects of application recovery from those concerning the (functional) behavior that the application should have in the absence of faults. Furthermore, separating these aspects allows the modification of the recovery strategy with only a limited number of changes in the application code, and vice-versa (i.e. the application functionality can be changed without adjusting the fault tolerance strategy). This results in a better maintainability of the application (assuming a reliable interface and an orthogonal division of application functionality from fault tolerance strategies).

Following this framework approach, increasing the dependability of an application implies the configuration and integration of basic fault tolerance tools from the library into the application and writing a recovery script in ARIEL, i.e. describing the recovery actions to be executed when an error is detected. This script is translated into a compact code that is executed at run-time by the backbone when an error is detected. It matches well to a number of coarse-grained local and distributed fault tolerance mechanisms.

The framework development is further guided by the use of semi-formal techniques to support requirement specification and by modeling for predictive evaluation, together with intensive testing and evaluation on pilot applications. The assessment considers real-time, dependability and cost requirements. (These aspects are not discussed in this paper.)

The target system is a distributed embedded system, which is assumed to obey the timed asynchronous distributed system model (Ref. 4) - this realistic model supposes that all services (communication, computation) are limited in time and that the nodes have access to a local clock with a bounded drift rate. A message-passing library is required, offering asynchronous, non-blocking multicast primitives (third party tools may provide this).
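To make the "local clock with a bounded drift rate" assumption concrete, the following minimal sketch shows how a timeout measured on such a clock relates to real time. It assumes one common formalisation of bounded drift, in which a local clock advances between (1 - rho) and (1 + rho) times real time; the function names and the symbol rho are illustrative and are not taken from the framework.

```python
def local_wait_for(real_delay_s: float, drift_rate: float) -> float:
    """Local-clock duration that spans at least `real_delay_s` of real time.

    Assumes the local clock advances at most (1 + drift_rate) times as fast
    as real time, so a fast clock must count proportionally longer.
    """
    if not 0.0 <= drift_rate < 1.0:
        raise ValueError("drift_rate must lie in [0, 1)")
    return real_delay_s * (1.0 + drift_rate)


def real_time_bounds(local_elapsed_s: float, drift_rate: float) -> tuple:
    """(shortest, longest) real time that may have passed while the local
    clock advanced by `local_elapsed_s`, under the same drift assumption."""
    if not 0.0 <= drift_rate < 1.0:
        raise ValueError("drift_rate must lie in [0, 1)")
    shortest = local_elapsed_s / (1.0 + drift_rate)  # clock ran fast
    longest = local_elapsed_s / (1.0 - drift_rate)   # clock ran slow
    return shortest, longest
```

A timeout-based detector built on such a clock would typically wait `local_wait_for(d, rho)` local time units before declaring that a real-time deadline `d` has passed, which avoids raising an error early on a node whose clock runs fast.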
This paper is not meant to present a formal analysis of the framework approach; it rather focuses on the ARIEL part (section 2) and provides a view from industry on the relevance of the framework approach for an embedded distributed automation application (section 3). Indeed, this framework approach has partially been developed within an industry-driven European project (ESPRIT 28620 TIRAN - Tailorable fault tolerance framework for embedded applications). There, the involved partners target several commercial real-time operating systems - Windows CE, VxWorks, Virtuoso (Ref. 5), TEX (Ref. 6) - and exploit the approach in different pilot applications. For the project, portability is a design goal, although a pragmatic approach has been taken: the critical parts are developed or adapted specifically for each target platform (e.g. for reliable communication, or for performance-critical functions).

2. THE FRAMEWORK APPROACH AND ARIEL

The basic tools and the backbone of the framework approach are described in detail in other papers (Refs. 1, 2, 3). Here, the focus is on the third component, ARIEL. There are two major tasks devoted to the description language ARIEL: the configuration of the instantiation of the framework, and the description of the recovery strategies.

2.1 ARIEL as Configuration Support Tool

Within ARIEL the developer writes the configuration of the parametric basic tools and their integration in the application. A translator processes these ARIEL descriptions and issues header files defining configured objects and symbolic constants. These header files are to be compiled with the application. Three examples of tools configured in ARIEL follow.

(Example 1) A software-implemented watchdog timer can be configured in ARIEL by indicating the heartbeat period, the task to be guarded and the task to be informed when an exception occurs:

    WATCHDOG task10
      WATCHES task14
      HEARTBEAT 100 ms
      ON ERROR WARN task18
    END WATCHDOG

(Example 2) ARIEL can be used to implement transparent task replication, and to indicate how voting is to be handled. The voting algorithm and the metric for comparison of the objects can be selected. Within ARIEL, one can include a timeout for a slow or missing voting party, and either choose to continue as soon as two of the three inputs are received or to wait until all three inputs are received or the timeout has elapsed (which is the default option).

    REPLICATED task10 IS task101, task102, task103
      MULTICAST IS ATOMIC
      METHOD IS MODULAR REDUNDANCY
        VOTING ALGORITHM IS MAJORITY
        METRIC "int-cmp"
        TIMEOUT 1000 ms
      END METHOD
      ON SUCCESS task20
      ON ERROR task30
    END REPLICATED

(Example 3) For retry blocks, the input and the state of the calling task are transparently recorded in a recovery cache before the retry block is entered. These are restored when the acceptance test fails in order to re-execute the task based on the same input.

    RETRY task10
      TIMEOUT 100 ms
      ACCEPTANCE TEST task20
      RETRIES 3
      ON ERROR task30
    END RETRY

Other ARIEL templates have been created to handle recovery blocks, N-modular redundancy, exception handling, and other well-known fault tolerance techniques.

2.2 ARIEL as Recovery Language

The second usage of ARIEL is as an ancillary application layer, to describe the recovery strategies to be executed when an error is detected. These strategies are specified at development time as described in the examples hereafter. In essence, the language allows querying the database of the backbone for the state of entities of the application, and attaching run-time actions to be executed on these entities if the condition is fulfilled:

    IF [ -FAULTY task1 OR -REBOOTED node1 ] THEN action
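The query-then-act pattern just described can be pictured with a small sketch. It is only an illustration under stated assumptions: `BackboneDB`, `Rule` and `execute` are hypothetical names, the database is a plain in-memory object rather than the replicated database of the backbone, and the rule is written directly in Python instead of being translated from ARIEL.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class BackboneDB:
    """Hypothetical stand-in for the backbone's database of entity states."""
    faulty: Set[str] = field(default_factory=set)
    rebooted: Set[str] = field(default_factory=set)
    running: Set[str] = field(default_factory=set)
    warnings: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Rule:
    guard: Callable[[BackboneDB], bool]  # the IF [ ... ] query
    actions: List[Tuple[str, str]]       # (verb, entity) pairs after THEN


def execute(db: BackboneDB, rules: List[Rule]) -> List[Tuple[str, str]]:
    """Evaluate each guard against the database and run the attached actions."""
    performed = []
    for rule in rules:
        if not rule.guard(db):
            continue
        for verb, entity in rule.actions:
            if verb == "STOP":
                db.running.discard(entity)
            elif verb == "START":
                db.running.add(entity)
            elif verb == "WARN":
                db.warnings.append((entity, "error detected"))
            else:
                raise ValueError(f"unknown action {verb!r}")
            performed.append((verb, entity))
    return performed


# A rule in the spirit of: IF [ FAULTY task1 OR REBOOTED node1 ] THEN actions
rule = Rule(
    guard=lambda db: "task1" in db.faulty or "node1" in db.rebooted,
    actions=[("STOP", "task1"), ("START", "task4"), ("WARN", "task2")],
)
db = BackboneDB(faulty={"task1"}, running={"task1", "task2", "task3"})
performed = execute(db, [rule])
```

Keeping the guard and the action list as data, separate from the application tasks, mirrors the point of the recovery language: the strategy can be swapped by changing the rules, without touching the functional code.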

Such an entity can be a single task, a node, a group of tasks, or the set of tasks that fulfil or do not fulfil a given query. As such, one can query to see if an entity has been found in error, is running, has been restarted/rebooted, etc. The actions allow, e.g., to start, terminate, isolate or inform an entity. It is also possible to start an alternative task, to reset a node or link, to generate synchronization signals so as to order a reconfiguration, etc. These actions can directly interact with the database, e.g., to clear erroneous conditions. The recovery strategy can be made dependent on the actual state or progress of the application, because it is possible for the application processes to inform the backbone of their current state; this information can then be queried in the recovery scripts.

Before run-time, the translator processes these ARIEL scripts and produces binary recovery code to be compiled with the application. At run-time, the backbone will execute these strategies devoted to error processing. These strategies are switched in either asynchronously - when an error is detected within the system by one of the basic error detection tools - or synchronously - when the user signals particular dynamic conditions like, e.g., a failed assertion, or when the control flow runs into user-defined breakpoints. As an example, consider the following script:

    IF [ -FAULTY task1 ]
    THEN
      STOP task1
      START task4
      WARN task2, task3
    FI

This script is a part of a three-and-a-spare system (a triple-modular redundant task with a standby component that can take over in case one of the 3 replicas fails). Three such rules describe the complete three-and-a-spare system. Some meta-characters allow the writing of more powerful scripts. One meta-character refers to any entity in the system. Meta-character `$' can be used to refer, in an action, to an entity mentioned in the query. For instance, `STOP $2' means "stop the entity fulfilling the condition in the second part of the query". Meta-character `@' refers to the entity fulfilling the query, while meta-character `~' refers to those entities not fulfilling the query.

The power of ARIEL is the ability to describe distributed recovery strategies. It will be the backbone that takes care of passing the necessary information to other nodes in the system and of initiating the recovery actions. For instance, the example from Figure 1 shows that a different ARIEL script can allow the application to behave as a three-and-a-spare system or as a gracefully degrading set of tasks. In the latter case, the voting function is transparently modified from a 2-out-of-3 majority voting to a 2-out-of-2 duplication.

    /* 3-and-a-spare */
    IF [ -FAULTY group1 ]
    THEN
      STOP task@
      START task4
      WARN task~
    FI

    /* graceful degradation */
    IF [ -FAULTY group1 ]
    THEN
      STOP task@
      WARN task~
    FI

Figure 1. Changing recovery strategies from 3-and-a-spare to graceful degradation.

This approach lets the developer separately address the (non-functional) aspects of application recovery from those pertaining to the (functional) behavior that the application should have in the absence of faults. Furthermore, the user is made able to tackle more easily and effectively any of these two fronts, e.g., to modify the fault tolerance strategy with only a few or no modifications at all in the application, or vice-versa, resulting in a better maintainability of the application.

3. CASE STUDY: A PRIMARY SUBSTATION AUTOMATION SYSTEM

As an example, we describe the integration of the framework approach in an industrial application provided by CESI, a private company which provides R&D and measurement services for the energy field, partially owned by ENEL, the main Italian electricity provider. Its internal market covers production, transport and distribution of energy. In particular, the pilot application concerns the Primary Substation Automation System (PSAS) (Ref. 7). PSAS requires protection, command and control, monitoring and supervision capabilities and it is representative for many applications with strict dependability requirements in the energy field. In the short term, the renewal of about 2000 PSAS, distributed over the Italian territory, is planned (Ref. 8).

3.1 The pilot application description

The energy distribution network provides the connection between the High Voltage (HV) lines coming from generator plants and the HV Substations on one hand and the customers on the other. It is a meshed network and handles Medium and Low Voltage (MV/LV) (in the range 12-15-20 kV). Its nodes of interconnection are primary substations (PS) - connected to HV lines and transforming and distributing energy to secondary substations and to MV customers - and secondary substations that transform and distribute energy to LV customers. A substation can be controlled locally and/or remotely,

Figure 2: PS electrical schema (gray lines) and logical PSAS architecture (black lines).

As indicated on Figure 2, a primary substation (gray lines) consists of switches, insulators, bus bars, transformers, capacitors and other electrical components. The PSAS logical architecture (black lines) consists of a controller (LCL - Local Control Level) and a number of Peripheral Units (PU) distributed on the plant. (The inherent redundancy of the architecture is not shown.) Each PU is associated with a component of the plant and provides for this component: command and control, diagnostic information, data collection, primary and secondary protection levels, and, possibly, an interface to a local operator. The LCL provides functionality centralized at the PS level for command and control, monitoring, PU supervision, additional protection, and interfacing to remote control systems and to local and remote operators. This LCL has been selected as a pilot application for the framework approach. It executes on a distributed system consisting of an industrial PC, running the real-time operating system VxWorks, and up to three DEC-alpha based dedicated boards, hosted in an ENEL rack and running the TEX kernel (Ref. 6). It is powered from the electricity network and has backup batteries to withstand power failures.

3.2 Embedding of the framework approach within the company strategy

The target application is hierarchically organized. The different automation tasks (protection, command, control, monitoring) require different degrees of dependability depending on the criticality (impact of faults and errors, possibility and cost of confinement, etc.). The major source of faults in the system is electromagnetic interference (EMI) (Ref. 7) caused by the electric process itself (opening and closing of HV connections), in spite of the attention paid to design for electromagnetic compatibility and the use of physical barriers.

Based on requirements of operational divisions working with these automation applications, the company has developed a systematic approach to dependability, involving the whole life cycle of the application and concerning fault prevention, fault removal, fault forecasting and fault tolerance. Fault tolerance plays a central role and is obtained through a layered system organization separating the application from the platform (hardware + operating system + middleware) that provides fault tolerance. Formal and semi-formal languages are applied - favored by the availability of in-house developed tools (Refs. 9, 10) - to the specification, design and code generation of the automation system. Qualitative and quantitative analysis of functional, real-time and dependability aspects is supported by these tools.

The dependability requirements of this and other automation systems in the energy field have been traditionally fulfilled by dedicated hardware and software fault tolerance solutions. Today, the evolution towards a new generation of automation systems demands a reduction of development and maintenance costs and requires the use of lower cost hardware and software components from the market. This trend has a direct impact on the selection of the target platforms for automation, where, e.g., industrial CompactPCI-based PCs with commercial real-time operating systems are pushed as alternatives to previously adopted dedicated boards or customized Programmable Logic Controllers.

The migration away from traditional architectures imposes the adoption of new fault tolerance strategies to cope with dependability requirements, especially if the target platform does not offer fault tolerance capabilities off-the-shelf. The use of a distributed architecture is then the key issue for this migration, since this provides redundancy based on conventional hardware modules. This redundancy provides the opportunity to obtain a level of dependability that was once only provided by complex, dedicated architectures. The lack of built-in fault tolerance capabilities at each node can be compensated in this case with higher level, distributed fault tolerance strategies fully exploiting the redundancy.

The availability of a flexible software-based solution, like the framework approach, that exploits a distributed architecture based on commodity hardware, offers cost reduction, reusability, and considerable possibilities for extendibility (e.g., openness to the adoption of emerging technologies).

As an example of the framework approach presented in section 2, a software module has been designed, implementing the so called stabilizing memory (Ref. 11), as a mechanism combining physical with temporal redundancy (and with several protocols) to recover from transient faults affecting memory or computation. With respect to a previous solution relying on dedicated hardware boards, this software implementation of the stable memory module has the advantage of flexibility and maintainability. For instance, it is possible to set the size of the stabilizing memory with a parameter, to select the number of redundant copies in the physical and temporal domain, etc. The developer can also modify the allocation of the physically distributed copies to the available resources. All these parameters - and the actions to be taken in case an error has been detected - are described in ARIEL. The additional flexibility offered by these recovery strategies allows, e.g., for reconfiguration by re-allocating the distributed copies to non-failed components in case a permanent fault occurred. The recovery strategies are not hard-coded in the application, but are specified in ARIEL at a higher level and executed by the backbone that interacts with the modules. This improves the maintainability of the application. As the interface to the dedicated board and to the software module is identical, the complexity for the developer is equivalent in both implementations.

This framework approach, applied to a single subsystem (the stabilizing memory) as well as to the entire PSAS application, was confirmed to shorten the development cycle and to improve the maintainability of the application. In the traditional methodology with dedicated hardware solutions, every change in the environment (larger system, additional functionality) resulted in a different implementation of the dedicated solution; also every PSAS is embedded in a different environment, which had an impact on the implementation. By using ARIEL in the framework approach, the configuration of the different elements and the recovery strategies themselves can be adapted, e.g., due to changing environments or requirements, without major modifications to the application code. Analogously, if the functional aspects of the application need to be modified, this does not necessarily interfere with the fault tolerance strategies.
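The idea of combining physical with temporal redundancy in the stabilizing memory described above can be illustrated with a deliberately simplified sketch. It is not the module used in the pilot application: the class name, the parameters `copies` and `stable_after`, and the single-process layout are assumptions made for illustration, and the protocols mentioned in the text are omitted.

```python
from collections import Counter


class StabilizingMemory:
    """Toy model: a value is committed to all redundant copies only after it
    has been written identically in `stable_after` consecutive cycles."""

    def __init__(self, copies: int = 3, stable_after: int = 2):
        self.copies = [None] * copies      # physical redundancy
        self.stable_after = stable_after   # temporal redundancy
        self._candidate = None
        self._streak = 0

    def write(self, value) -> None:
        """Count consecutive identical writes; commit once the value is stable."""
        if self._streak and value == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = value, 1
        if self._streak >= self.stable_after:
            self.copies = [value] * len(self.copies)

    def read(self):
        """Majority vote over the copies, repairing any copy that disagrees."""
        value, votes = Counter(self.copies).most_common(1)[0]
        if votes <= len(self.copies) // 2:
            raise RuntimeError("no majority among redundant copies")
        self.copies = [value] * len(self.copies)
        return value
```

In this sketch a single glitched write never reaches the stored copies, and a single corrupted copy is outvoted and repaired on the next read. Exposing the number of copies and the stabilization depth as constructor arguments echoes the parametric configuration that the text attributes to ARIEL.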
Separating these concerns allows specialists in fault tolerance to collaborate with specialists in the target application (this is easier than requiring persons specialized both in the application and in dependability). The flexibility is further ensured by the modularity and openness of the framework, such that additional detection, isolation or recovery mechanisms may be added if required. The framework has been ported to three real-time operating systems and a well-defined API (application programmer interface) of the library functions facilitates portability to other environments. This also results in cost-effectiveness.

4. RELATED AND FUTURE WORK

As this approach is industry-driven, it implements several well-known techniques for fault tolerance in a pragmatic way. In the research literature, one can find a lot of work on libraries for fault tolerance techniques: for example, Refs. 12, 13, 14, 15 have shown the suitability and advantages of software-based fault tolerance solutions to improve the dependability of distributed applications. Also the middleware approach towards fault tolerance recently gained much support (Refs. 16, 17, 18). The concept of de-coupling the functional application aspects from the non-functional ones concerning fault tolerance is also present in the reflective / meta-object approach, where a call to a method in an object-oriented application is trapped in order to transparently implement some fault tolerance function (Refs. 19, 20). Our framework approach combines the advantages of software-implemented fault tolerance, via a library of functions, with the flexibility and maintainability of the meta-object approach (but without requiring object orientation) by specifying recovery actions as a sort of ancillary application layer. In addition to the meta-object approach, the ARIEL language allows for a concise description of distributed actions.

We have argued that this three-tiered framework approach, comprising a user library of basic fault tolerance mechanisms, a control backbone and a high-level language for describing recovery strategies, allows for flexible integration of fault tolerance into embedded distributed applications. For the involved project partners, the framework approach has shown the feasibility of separating non-functional (recovery) aspects from the functional ones.

The framework approach has been designed for those distributed applications for which dependability can be improved by adding software-implemented fault tolerance solutions as a sort of middleware. In addition to this, the framework approach may need to be complemented by other techniques on lower levels (hardware or operating system), e.g., to be able to meet hard real-time requirements, and/or by application-specific mechanisms.

Currently, research concentrates on extending the framework approach with support for validation and verification of the dependability requirements of the application through modeling and fault injection. Development is further driven by the interaction among the industrial end-users and the framework designers and developers. The authors are investigating ways to integrate changing recovery strategies (depending on the dynamic environment) into the framework approach, for example for embedded applications that are distributed locally (intra-site) and globally (inter-site).
ACKNOWLEDGMENT

This project has been partially supported by ESPRIT-project 28620 (TIRAN) and by the Fund for Scientific Research - Flanders (Belgium, F.W.O.) through the Postdoctoral Fellowship for Geert Deconinck.

REFERENCES

1. G. Deconinck, V. De Florio, R. Lauwereins, R. Belmans, "A Software Library, a Control Backbone and User-Specified Recovery Strategies to Enhance the Dependability of Embedded Systems", Proc. 25th EUROMICRO Conf. (EuroMicro'99) (IEEE Comp. Soc. Press, Los Alamitos, CA), Workshop on Dependable Computing Systems, Milan, Italy, Sep. 8-10, 1999, Vol. II, pp. 98-104.
2. V. De Florio, G. Deconinck, R. Lauwereins, "Recovery Languages: an Effective Structure for Software Fault Tolerance", Fast abstract at 9th Int. Symp. on Software Reliability Engineering (ISSRE'98) (R. Chillarege and T. Illgen, New York), Paderborn, Germany, Nov. 1998, pp. 39-40.
3. V. De Florio, G. Deconinck, R. Lauwereins, "Software Tool Combining Fault Masking with User-Defined Recovery Strategies", IEE Proc. - Software, Special issue on dependable computing systems (IEE, London, UK), Vol. 145, No. 6, Dec. 1998, pp. 203-211.
4. F. Cristian, C. Fetzer, "The Timed Asynchronous Distributed System Model", IEEE Trans. on Parallel and Distributed Systems, Vol. 10, No. 6, Jun. 1999, pp. 642-657.
5. Anon., "Virtuoso 4.0 User Manual", Eonic Systems, Aarschot, Belgium, 1999, URL: http://www.eonic.com.
6. Anon., "TEX User Manual", TXT Ingegneria Informatica, Milan, Italy, 1997, URL: http://www.txt.it.
7. R. Gargiuli, P.G. Mirandola, et al., "ENEL Approach to Computer Supervisory Remote Control of Electric Power Distribution Network", Proc. 6th IEE Int. Conf. on Electricity Distribution (CIRED'81), Brighton (UK), 1981, pp. 187-192.
8. R. Meda, A. Bertani, P. Colombo, S. D'Imporzano, P. Perna, "Sistema di Protezione e Controllo della Cabina Primaria", ENEL internal draft, Feb. 1999; in Italian.
9. F. Maestri, R. Meda, G.L. Redaelli, "Un ambiente di sviluppo di funzioni applicative strutturate per sistemi di automazione di impianti ENEL", technical report ANIPLA "La progettazione del software di controllo per l'automazione di impianti", Milan, Italy (also published in Automazione e Strumentazione, Dec. 1997); in Italian.
10. A. Moro, "Traduttore delle reti ASFA", Master thesis, Politecnico di Milano, Italy, 1998; in Italian.
11. G. Deconinck, O. Botti, F. Cassinari, V. De Florio, R. Lauwereins, "Stable Memory in Substation Automation: a Case Study", Digest of Papers of 28th Annual Int. Symp. on Fault-Tolerant Computing (FTCS-28) (IEEE Comp. Soc. Press, Los Alamitos, CA), Munich, Germany, Jun. 1998, pp. 452-457.
12. Y. Huang, C.M.R. Kintala, "Software Fault Tolerance in the Application Layer", chapter of "Software Fault Tolerance", M. Lyu (Ed.), John Wiley & Sons, Mar. 1995.
13. M.R. Lyu (Ed.), "Handbook of Software Reliability Engineering", McGraw-Hill, New York, 1995.
14. B. Randell, J.-C. Laprie, H. Kopetz, B. Littlewood (Eds.), "ESPRIT Basic Research Series: Predictably Dependable Computing Systems", Springer-Verlag, Berlin, 1995.
15. Y.M. Wang, Y. Huang, K.P. Vo, P.Y. Chung, C. Kintala, "Checkpointing and its Applications", Proc. 25th Int. Symp. on Fault-Tolerant Computing (FTCS-25) (IEEE Comp. Soc. Press, Los Alamitos, CA), Pasadena, CA, Jun. 1995.
16. D. Powell, J. Arlat, L. Beus-Dukic, A. Bondavalli, P. Coppola, A. Fantechi, E. Jenn, C. Rabejac, A. Wellings, "GUARDS: A Generic Upgradable Architecture for Real-Time Dependable Systems", IEEE Trans. on Parallel and Distributed Systems, Vol. 10, No. 6, Jun. 1999, pp. 580-597.
17. Z.T. Kalbarczyk, R.K. Iyer, S. Bagchi, K. Whisnant, "Chameleon: A Software Infrastructure for Adaptive Fault Tolerance", IEEE Trans. on Parallel and Distributed Systems, Vol. 10, No. 6, Jun. 1999, pp. 560-579.
18. K.H. Kim, "ROAFTS: A Middleware Architecture for Real-time Object-oriented Adaptive Fault Tolerance Support", Proc. HASE '98 (IEEE CS 1998 High-Assurance Systems Engineering Symp.), Washington, D.C., Nov. 1998, pp. 50-57.
19. J.C. Fabre, V. Nicomette, T. Perennou, R.J. Stroud, Z. Wu, "Implementing Fault-tolerant Applications using Reflective Object-oriented Programming", Proc. 25th Int. Symp. on Fault-Tolerant Computing (FTCS-25) (IEEE Comp. Soc. Press, Los Alamitos, CA), Pasadena, CA, Jun. 1995, pp. 489-498.
20. G. Kiczales, J. des Rivieres, D.G. Bobrow, "The Art of the Metaobject Protocol", The MIT Press, Cambridge, MA, 1991.

BIOGRAPHIES

Geert Deconinck, PhD
K.U.Leuven
Dept. Electrical Engineering (ESAT), Division ACCA
Kard. Mercierlaan 94
B-3001 Leuven, BELGIUM
Internet (e-mail): Geert.Deconinck@esat.kuleuven.ac.be

Geert Deconinck is a postdoctoral fellow of the Fund for Scientific Research - Flanders (Belgium). He works in the research group ACCA (Application-driven Configuring of Computing Architectures) of the ESAT-Department of Electrical Engineering at the Katholieke Universiteit Leuven (K.U.Leuven), Belgium, where he is also a visiting professor. His research interests include the design, analysis and assessment of software-based fault tolerance solutions to meet real-time, dependability and cost constraints for embedded applications on parallel and distributed systems. In this field, he has authored and co-authored about 50 publications in international journals and conference proceedings. He received his M.Sc. in Electrical Engineering and his Ph.D. in Applied Sciences from the K.U.Leuven in 1991 and 1996 respectively. In 1995-1997, he received a grant from the Flemish Institute for the Promotion of Scientific-Technological Research in Industry (IWT). He is a senior member of the IEEE, a member of the Royal Flemish Engineering Society and of the IEEE Reliability Society.

Vincenzo De Florio
K.U.Leuven
Dept. Electrical Engineering (ESAT), Division ACCA
Kard. Mercierlaan 94
B-3001 Leuven, BELGIUM
Internet (e-mail): Vincenzo.DeFlorio@esat.kuleuven.ac.be

Vincenzo De Florio received his Laurea degree from the University of Bari, Italy. He was researcher and tutor at the School for Advanced Studies in Industrial and Applied Mathematics in Tecnopolis Novus Ortus science park (Italy) until 1996. Then he joined the research group ACCA at the department of Electrical Engineering (ESAT) at the K.U.Leuven, where he is currently finalizing his Ph.D. in Applied Sciences. His research interests include software fault tolerance algorithms and methods for parallel and distributed applications. He is author or co-author of over 25 papers published in international journals or conference proceedings.

Oliver Botti
CESI S.p.A.
Networks and Plants Automation Group
Via R. Rubattino 54
I-20134 Milano, ITALY
Internet (e-mail): Botti.Oliver@cesi.it

Oliver Botti graduated in Computer Science at the University of Milan in 1991. He joined ENEL R&D in 1992 as a researcher working in internal and co-operative projects in the field of Software Engineering, addressing most of the steps of system life-cycle, from formal specification to design, development and evaluation. He has been Project Manager of several ESPRIT projects concerning HPCN, Performance Evaluation and Fault Tolerance issues. At CESI he is now responsible for two large R&D projects addressing 1) the development of techniques and tools for fault prevention and removal, covering the whole life-cycle of automation systems, and 2) the development of novel techniques and tools for embedding fault tolerance capabilities in dependable automation applications. He is author or co-author of over 30 papers published in international journals or conference proceedings.


Theme

Applications & Trends for Using Reliability & Maintainability Tools
The International Symposium on Product Quality & Integrity

To cite this Proceedings as a reference: Proc. Ann. Reliability & Maintainability Symp., 2001. The Symposium does not copyright this Proceedings itself, but uses one of its cosponsors (IEEE) to perform this service and associated administration. Copyright © 2001 by the Institute of Electrical & Electronics Engineers Inc. All rights reserved.

Copyright & Reprint Permission

Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limit of US copyright law for private use of patrons those articles in this Proceedings that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through: Copyright Clearance Center • 222 Rosewood Drive • Danvers, Massachusetts 01923 USA. For other copying, reprint, or republication permission, write to: IEEE Copyrights Manager • IEEE Operations Center • PO Box 1331 (445 Hoes Lane) Piscataway, New Jersey 08855-1331 USA.

Printed in the United States of America.
ISSN 0149-144X
ISBN: 0-7803-6615-8 (softbound)
ISBN: 0-7803-6616-6 (microfiche)
Library of Congress Catalog Card Number: 78-132873
IEEE Catalog Number: 01CH37179
