
The TIRAN Approach to Reusing Software Implemented Fault Tolerance

O. Botti
ENEL S.p.A., R&D, Via Volta 1, I-20093 Cologno Monzese (Milano), Italy
Tel: +39-02-7224-5553, Fax: +39-02-7224-5465, e-mail: botti@pea.enel.it

V. De Florio, G. Deconinck, R. Lauwereins
K.U.Leuven, Dept. Elektrotechniek (ESAT), Kard. Mercierlaan 94, B-3001 Heverlee, Belgium

F. Cassinari
TXT Ingegneria Informatica, Via Socrate 41, I-20128 Milano, Italy

S. Donatelli, A. Bobbio
Università di Torino, Dip. di Informatica, C.so Svizzera 185, I-10149 Torino, Italy

A. Klein, H. Kufner, E. Thurner
Siemens AG, Dept. ZT SE 2, Otto-Hahn-Ring 6, D-81730 München, Germany

E. Verhulst
Eonic Systems, Nieuwlandlaan 9, B-3200 Aarschot, Belgium

Abstract

Available solutions for fault tolerance in embedded automation are often based on strong customisation, have impacts on the whole life-cycle, and require highly specialised design teams, thus making dependable embedded systems costly and difficult to develop and maintain. The TIRAN project (ESPRIT Project 28620 TIRAN, "Tailorable fault tolerance frameworks for embedded applications") develops a framework which provides fault tolerance capabilities to automation systems, with the goal of allowing portable, reusable and cost-effective solutions. Application developers are allowed to select, configure and integrate in their own environment a variety of software-based functions for error detection, confinement and recovery provided by the framework.

1 Industrial Motivations and Answers to the Market Needs

Market investigations with users and producers of automation systems have recognised the benefits offered by dependable systems, not only in the classical area of safety-critical tasks, but also in mission-critical ones, due to the high economic impact of failures. Moreover, dependable embedded systems were judged costly and difficult to develop and maintain, as they generally need strong customisation, which impacts the whole life-cycle and requires highly specialised design teams. New proposals for fault tolerance (FT) should be based on pre-built and easily reusable solutions, customisable for different applications and portable across commercial and proprietary real-time operating systems (RTOS).

The TIRAN project addresses these urgent needs raised by the market of real-time embedded automation systems. Recent estimates (Embedded Software Association) show that the embedded software tools market is growing at about 30% per year and totalled about 800 M US$ in 1998. Other investigations suggest that at least 15% of all real-time applications are candidates for the inclusion of fault tolerance.

TIRAN is a 24-month ESPRIT project that has now reached the stage of full definition of the proposed solution. The solution will be developed on different platforms to allow an estimation of its reusability, and will be used in three different pilot applications to study its efficacy in making the applications more dependable. This paper describes the TIRAN solution and compares it with other approaches from the literature and/or from previous ESPRIT projects.

The TIRAN solution is built around a software framework which provides fault tolerance capabilities to automation systems. Application developers are allowed to select, configure and integrate in their own environment a variety of software-based FT functions for fault masking, error detection, isolation and recovery, available in the framework. The framework includes techniques, tools and documentation to support the developers when combining the selected functions into their own fault tolerance strategies to obtain the required level of dependability.

The framework is basically a library of software elements which are parametric and support an easy configuration process. It is designed with portability in mind: standard APIs are defined towards the application environment and towards the platform (e.g. RT-POSIX), most of the functions are intended to be platform and application independent, and the cost of adapting the remaining ones is minimised. The adoption of FT functions from this framework significantly reduces the development costs of a new dependable system.

The Consortium will develop one single generic framework and port it to different commercial RTOS (VxWorks, Windows CE) and microkernels (Virtuoso, TEX), to be exploited in the pilot applications provided by the Consortium end-users and running on their corresponding target platforms. The use of formal techniques to support requirement specification and predictive evaluation, together with intensive testing on pilot applications, aims at guaranteeing the correctness of the framework and at quantifying the fulfilment of real-time, dependability and cost requirements, while also providing valuable guidelines for the configuration process to the different users.

The paper introduces the target industrial environment (Sect. 2) and the user requirements driving the project activities (Sect. 3). The TIRAN approach is sketched in Sect. 4, and its positioning with respect to emerging FT trends is given in Sect. 5. Finally, the current status of the project is outlined in Sect. 6.

1066-6192/99 $10.00 © 1999 IEEE

2 The Industrial Environment: Pilot Applications, Exploitation and Market Access

The TIRAN Consortium includes large European suppliers and end-users of mission-critical systems in their application fields (energy and transport automation), who play an active role in technology development and in the international standardisation process on the subject. The Consortium involves the R&D Departments of ENEL (the main Italian electricity supplier, the 3rd largest world-wide) and of SIEMENS (the leading German company in the field of electrical engineering and electronics, one of the largest world-wide), EONIC Systems (B) (supplier of very high performance and application-specific small RTOS for DSP processors and embedded ASIC cores), TXT Informatica (I) (provider of systems and services in many fields of real-time automation), the University of Leuven (B) and the University of Turin (I), particularly active on the R&D frontiers of fault tolerance and performance evaluation.

The project results, driven by industrial users' requirements and market demand, will be integrated with an existing off-the-shelf tool supplied by its producer (Virtuoso of Eonic Systems), will be ported onto commercial and proprietary RTOS (Windows CE, VxWorks, TEX), will represent the basis for services offered by a system integrator (TXT), and are to be tested and adopted by two large end-user companies (ENEL, SIEMENS) within their application fields. Pilot applications from Energy Distribution (ENEL, primary substation automation system) and Industrial Automation (SIEMENS, airfield light control system) bring a wide variety of requirements to the framework development and allow a deep validation of project results. Broad exploitation of results and market access are guaranteed in different sectors by the involved users, the technology providers and by third party association with market leaders. In addition, system providers (EONIC, TXT) bring to the project their typical customers' requirements from the fields of Space, Avionics, Automotive and Biomedical.

3 User Requirements

The first activity of the project has been the collection and classification of the fault tolerance requirements of the partners, and their harmonisation in terms of the features expected from the framework, their portability on different platforms, their performance and their configurability (see Table 1). Requirements have been split into functional and non-functional ones, where "functional" here means "what is the functionality of the framework".

3.1 Functional Requirements

Requirements collected in this area have identified the FT features the TIRAN framework shall support. The framework shall provide support for error detection. The types of errors taken into consideration include communication errors (such as corrupted messages, late or non-delivered messages, incorrect delivery order), processing errors and memory faults. A detected error shall be contained to avoid affecting other parts of the system; isolation mechanisms that shall be supported by the framework include output commitment, termination of faulty communication channels, processes or nodes, communication path reconfiguration and memory protection. In addition, support for fault masking shall be provided by the framework: replication, voting and stable memory, applying both spatial and temporal redundancy.

An important feature for obtaining an FT system is the possibility to restore or replace faulty components. The use of a "recovery language" will allow users to define their own recovery strategy: following this strategy, the running system can check which kind of error has occurred in which part of the system, and start the most appropriate recovery actions, such as terminating or restarting a specific process or node, activating spare processes on different nodes, warning other processes about what has happened, or reintegrating a recovered node.

Since the typical use of such FT features is on embedded systems, it is important to provide mechanisms for system monitoring. Requirements for system monitoring include a communication mechanism with a host computer and the possibility to check the status of the embedded system, perform specific actions and simulate or inject faults for testing purposes.

Error processing / Fault treatment    Error detection
                                      Fault containment / error isolation
                                      Fault masking
                                      Error recovery
                                      Target monitoring
                                      Fault injection
Portability                           Platform and RTOS independence
Flexibility                           Configurability and scalability
Performance                           Hard real-time
                                      Soft real-time

Table 1. Classification of user requirements.

3.2 Non-functional Requirements

The proposed FT framework is intended as a software solution, independent from hardware and operating system, scalable on a wide range of different target topologies, easily configurable, and open for interoperability with other commercial or user-defined tools. From these goals, a set of additional requirements follows.

Concerning portability of solutions: to allow wide exploitability, the FT framework must be easily portable to different platforms, eliminating or reducing as much as possible the adaptation work. To obtain such platform independence, different approaches have been adopted: adherence to appropriate standards (e.g. POSIX and its real-time extension); encapsulation of the operating system interface in a proper "adaptation layer" hiding the operating system peculiarities and providing to the upper level a stable and homogeneous interface. To concretely drive the implementation of the framework in a portable way, different pilot platforms have been selected on which a final evaluation and assessment will be performed. The selected platforms cover a wide range of embedded systems, including the MOSAIC20 DSP platform with Virtuoso RTOS, Siemens SMP16 with Windows CE, PowerPC and embedded PC with VxWorks, and Digital Alpha with TEX RTOS.

As far as flexibility requirements are concerned, the framework shall be easily adaptable to different user needs in terms of:

• usage of the framework components in a stand-alone way or combined together;
• allocation of each component on different nodes;
• scalability from a single node to distributed, heterogeneous or differently connected nodes;
• parameterisation of the provided mechanisms in terms of data structures, size and number of replicas.

The adoption of the proposed FT solution should not affect the responsiveness of the system, at least in the absence of faults, i.e. the percentage of FT overhead must be negligible with respect to the application size and response time. Particular attention should be paid to those components that are positioned close to the fault occurrence, i.e. masking, detection and isolation mechanisms, while a greater amount of overhead can be accepted for those mechanisms related to recovery actions, depending on the application-specific real-time requirements.
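The "adaptation layer" mentioned above under portability can be pictured with a small sketch. It is only an illustration under stated assumptions, not TIRAN code: the names PlatformAdapter, HostAdapter, SimulatedAdapter and Watchdog, as well as the two-method interface, are hypothetical. The point it shows is that an FT mechanism written only against the adapter interface can be moved to another platform by supplying a different adapter.

```python
from abc import ABC, abstractmethod
import time


class PlatformAdapter(ABC):
    """Hypothetical adaptation layer: the only platform interface FT code uses."""

    @abstractmethod
    def now(self) -> float:
        """Return the current time in seconds."""

    @abstractmethod
    def notify(self, event: str) -> None:
        """Report an FT event to the rest of the system."""


class HostAdapter(PlatformAdapter):
    """Adapter for a general-purpose host, built on the standard library."""

    def now(self) -> float:
        return time.monotonic()

    def notify(self, event: str) -> None:
        print(f"[FT] {event}")


class SimulatedAdapter(PlatformAdapter):
    """Adapter with a manually advanced clock, convenient for testing."""

    def __init__(self) -> None:
        self.clock = 0.0
        self.events = []

    def now(self) -> float:
        return self.clock

    def advance(self, seconds: float) -> None:
        self.clock += seconds

    def notify(self, event: str) -> None:
        self.events.append(event)


class Watchdog:
    """Platform-independent watchdog written only against PlatformAdapter."""

    def __init__(self, platform: PlatformAdapter, timeout: float) -> None:
        self._platform = platform
        self._timeout = timeout
        self._last_kick = platform.now()

    def kick(self) -> None:
        """Called by the watched task to signal that it is still alive."""
        self._last_kick = self._platform.now()

    def check(self) -> bool:
        """Return True (and notify) if the watched task missed its deadline."""
        late = self._platform.now() - self._last_kick > self._timeout
        if late:
            self._platform.notify("watchdog expired")
        return late
```

The Watchdog class never touches the operating system directly, so the same code runs with HostAdapter on a workstation and with SimulatedAdapter in a test; a port to a real RTOS would, under this assumption, only require a new adapter class.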

4 The Framework Approach

TIRAN bases its fault tolerance strategy on the concept of "framework", which translates into the conjoint use of a layered system of fault tolerance mechanisms arranged into a library and of a configuration tool by means of which the user expresses a number of recovery strategies. This is combined with a formal assessment of the performance of the framework approach, centred around modelling and measurements associated with fault injection. These three elements are described below.

4.1 The FT Library

This part describes the generic fault tolerance entities of the TIRAN framework, i.e. those in the library. These are to be accompanied by system-specific, kernel-specific and application-specific mechanisms. The system is made of three "layers": a basic layer, with entities related to error detection, fault containment, recovery and fault masking mechanisms; a control layer, constituting a sort of "backbone" component for a coherent coordination of the entities in the basic layer; and a monitoring and fault injection layer, introduced for testing purposes and substantially given by a hypermedia distributed application. The framework will be accompanied by a handbook providing:

• a characterisation of the library components in terms of FT coverage and performance overhead,
• strategies to be applied in assembling the mechanisms to meet dependability requirements,
• reports produced by validation and assessment carried out on the framework and its applications.

4.1.1 Basic Layer.

This part of the library will host basic mechanisms for error detection, fault containment, error recovery and fault masking. The following list enumerates the mechanisms that will be considered for inclusion in the TIRAN framework, either by implementing them or by providing a suitable interface, public or for private use of the framework, to specific services supplied at driver, kernel or hardware level (the latter are marked with "(‡)"):

• Error detection mechanisms
  - communication errors and mechanisms
    * corrupted messages: CRC, parity checking, checksums, retransmission, loopback testing (‡)
    * late or non-delivered messages: time-out (‡)
    * messages in disorder: checking packet numbers (‡)
  - processing errors and mechanisms
    * late or dead processes: watchdogs
    * corrupted processes: assertions, plausibility checks, range checks, history checks, authority checks (‡)
    * runtime errors (numerical exceptions): exception processing (‡)
    * memory violations, stack overflows: exception processing (‡)
  - memory faults
    * transient errors (bit errors): error detecting codes, parity bits (‡)
  - mechanisms for diagnosis of latent errors
    * periodic tests: testing status of communication partners, memory scrubbing, monitoring of error history
• Isolation mechanisms
    * output commitment: distributed synchronisation, voting, output delay
    * termination of a faulty process (‡)
    * disconnection of a faulty node (‡)
• Recovery mechanisms
    * re-initialisation or establishment of new communication channels (‡)
    * re-initialisation or establishment of a new process (‡)
    * re-initialisation of a node (reloading of kernel and/or application) (‡)
    * integration of a recovered node in the system (‡)
    * restoring the state in application: checkpoint and rollback, distributed atomic actions
• Fault masking
    * data replication: distributed memory
    * voting
    * stable memory (conjoint use of temporal and spatial redundancy)
    * error correcting codes
• Others
    * distributed synchronisation

These basic mechanisms rely on support from the kernel and the hardware, e.g. to allow memory protection or to have fault-tolerant communications.
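Two of the communication-error entries in the list above (CRC for corrupted messages, packet numbers for messages in disorder) can be illustrated with a minimal sketch. It is not taken from the TIRAN library: the frame layout (a 4-byte sequence number, the payload, a 4-byte CRC-32) and the names make_frame and Receiver are assumptions chosen for the example. The CRC is computed with zlib.crc32 from the Python standard library.

```python
import struct
import zlib

HEADER = struct.Struct(">I")   # 4-byte big-endian sequence number (assumed layout)
TRAILER = struct.Struct(">I")  # 4-byte big-endian CRC-32 over header and payload


def make_frame(seq: int, payload: bytes) -> bytes:
    """Build a frame: sequence number, payload, CRC-32 of both."""
    body = HEADER.pack(seq) + payload
    return body + TRAILER.pack(zlib.crc32(body))


class Receiver:
    """Detects corrupted frames (CRC) and out-of-order frames (packet numbers)."""

    def __init__(self) -> None:
        self.expected_seq = 0

    def accept(self, frame: bytes):
        """Return (status, payload); payload is None unless status is 'ok'."""
        if len(frame) < HEADER.size + TRAILER.size:
            return ("corrupted", None)
        body, trailer = frame[:-TRAILER.size], frame[-TRAILER.size:]
        (crc,) = TRAILER.unpack(trailer)
        if zlib.crc32(body) != crc:
            return ("corrupted", None)        # CRC mismatch: message damaged
        (seq,) = HEADER.unpack(body[:HEADER.size])
        if seq != self.expected_seq:
            return ("out_of_order", None)     # unexpected packet number
        self.expected_seq += 1
        return ("ok", body[HEADER.size:])
```

In this sketch the receiver only detects the error; what happens next (retransmission, channel termination, notification of a control layer) is left to the isolation and recovery mechanisms.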

4.1.2 The TIRAN Control Layer.

This layer is substantially given by a distributed tool which will play the role of backbone for the mechanisms in the basic layer, collecting all relevant information concerning the current structure and status of the user application and of the basic TIRAN tools. This tool will also coordinate recovery and interpret user-defined recovery strategies. A prototypal implementation of this component, called DIR net (standing for "Detection, Isolation, Recovery net"), has been developed within the ESPRIT project EFTOS (ESPRIT Project 21012 EFTOS, "Embedded fault tolerant supercomputing") and discussed in [4]. This layer could also include a special component which, by means of, e.g., the count-and-threshold scheme, could be used to diagnose the nature of the witnessed faults, namely whether a given fault is temporary or permanent.
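The count-and-threshold scheme is only named above, so the following sketch follows one common formulation of it, often called alpha-count, rather than any TIRAN component. A score is raised by one on every detected error and multiplied by a decay factor on every error-free observation; the fault is treated as permanent once the score exceeds a threshold. The class name and the values of decay and threshold are assumptions made for the example.

```python
class AlphaCount:
    """Count-and-threshold filter for one monitored component (illustrative)."""

    def __init__(self, decay: float = 0.5, threshold: float = 3.0) -> None:
        if not 0.0 <= decay <= 1.0:
            raise ValueError("decay must lie in [0, 1]")
        self.decay = decay
        self.threshold = threshold
        self.score = 0.0

    def record(self, error_detected: bool) -> bool:
        """Update the score; return True if the fault is now judged permanent."""
        if error_detected:
            self.score += 1.0          # each error adds one unit of suspicion
        else:
            self.score *= self.decay   # error-free rounds let suspicion fade
        return self.score > self.threshold
```

With these assumed parameters, isolated errors separated by correct rounds never cross the threshold, while four consecutive errors do, which is the intended distinction between temporary and permanent faults.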

4.1.3 Monitoring and Fault Injection.

A special tool will serve as a monitor and software fault injector for debugging and testing purposes on embedded targets. The monitor will describe the structure of the system (user application plus TIRAN framework components) and will report in a hypertextual structure all the events taking place at the basic layer and at the control layer. This component will also be responsible for managing software fault injection services. These will include, e.g., task terminations, node reboots, generation of illegal instructions (divide by zero, illegal memory access, etc.), and actions resulting in value failures and violations of real-time constraints.
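Some of the software fault injection services listed above (task termination, divide by zero, value failures) can be imitated at application level with a small wrapper. This is a sketch under assumptions, not the TIRAN tool: the class FaultInjector, the exception TaskTerminated, the fault names and the idea of a per-invocation injection plan are all hypothetical.

```python
class TaskTerminated(Exception):
    """Raised to simulate the abrupt termination of a task."""


class FaultInjector:
    """Wraps a task and injects faults at planned invocation numbers."""

    KINDS = {"terminate", "divide_by_zero", "value"}

    def __init__(self, task, plan) -> None:
        unknown = set(plan.values()) - self.KINDS
        if unknown:
            raise ValueError(f"unknown fault kinds: {sorted(unknown)}")
        self._task = task
        self._plan = dict(plan)   # invocation index -> fault kind
        self._calls = 0
        self.log = []             # (index, kind) of every injected fault

    def __call__(self, *args):
        index = self._calls
        self._calls += 1
        fault = self._plan.get(index)
        if fault is not None:
            self.log.append((index, fault))
        if fault == "terminate":
            raise TaskTerminated(f"injected at call {index}")
        if fault == "divide_by_zero":
            return 1 // 0         # triggers a genuine ZeroDivisionError
        result = self._task(*args)
        if fault == "value":
            return result + 1     # value failure: deliver a wrong result
        return result
```

A deterministic plan is used instead of random injection so that a test campaign is repeatable; the log plays the role of the event report that a monitor would display.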

4.2 Configuration

While the overhead of including detection mechanisms in an application can be minimal, our experience is that a non-trivial, end-to-end strategy for recovering from faults, based on the pure library approach, does require large modifications to that application. For instance, the availability of a perfect-coverage trap handler, capable of intercepting any processor or communication exception, gives the developer only a powerful fault tolerance atom, which must nevertheless be wisely integrated into a coherent, system-wide recovery strategy in order to actually enhance the resilience of the user application. Such a strategy usually results in the intrusion of a large percentage of non-functional code. This problem can be reduced, if not solved, by physically separating the specification of the functional code from that of the code for recovering from an erroneous condition. This is proposed via the use of a suitable "recovery language", by means of which the user can express strategies for recovery and reconfiguration. Special support for design diversity (e.g., by means of recovery blocks) is under consideration. Further description of this approach and of its impact on transparency is given in Section 5.1.

4.3 Framework Validation and Evaluation

To prove the efficacy of the proposed framework it is important to assess the performance of the generic library and of its portings running on the different supported platforms. To reach this goal, TIRAN will use a probabilistic approach based on two techniques: modelling, and measurements associated with fault injection. A thorough study of available techniques for the validation and evaluation of dependability attributes has been done as part of the European project SQUALE (Security, Safety and Quality Evaluation for Dependable Systems) [2].

A probabilistic analysis has the advantages of making it possible to treat the (intrinsically) probabilistic nature of faults, and of permitting the abstraction, in terms of probability, of a reality too complex to be treated and described in full detail; it has the disadvantage of providing only probabilistic results (e.g. the probability that a certain output will be produced before a given delay). Modelling makes it possible to perform a quantitative analysis before the final system is actually built, and can therefore also be used as a development aid, to evaluate alternative solutions and to perform sensitivity analysis on different ranges of parameters. The models of the delivered version of the framework will be used to compute quantitative indices, providing the framework handbook with this characterisation.

Probabilistic modelling will be carried out using stochastic Petri nets [1]. Separate models for the library functions, the faults and the hardware characteristics will be developed and freely combined to study different systems [6]. Simulation will be used when the type of stochastic delays considered, or the size of the resulting model, does not allow an analytical solution. Measurements of the system using software and physical fault injection on the pilot target environments will complement the predictions, providing a deep characterisation of the framework behaviour. Petri net models of the library functions can also be used for validation, to prove 'classical' properties like the absence of deadlocks and livelocks, as well as for proving properties expressed in a temporal logic language [13].
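The interplay between an analytical solution and simulation described above can be shown on a deliberately tiny model. The sketch below does not use stochastic Petri nets; it substitutes the simplest possible stand-in, a two-state (up/down) Markov model with exponentially distributed times to failure and to repair, whose steady-state availability is repair_rate / (failure_rate + repair_rate). The function names and the rates used in the usage note are assumptions for illustration only.

```python
import random


def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Analytical availability of a two-state (up/down) Markov model."""
    return repair_rate / (failure_rate + repair_rate)


def simulated_availability(failure_rate: float, repair_rate: float,
                           horizon: float, seed: int = 0) -> float:
    """Estimate availability by simulating alternating up and down periods."""
    rng = random.Random(seed)
    clock = 0.0
    up_time = 0.0
    while clock < horizon:
        # Time until the next failure, truncated at the end of the horizon.
        up = min(rng.expovariate(failure_rate), horizon - clock)
        up_time += up
        clock += up
        if clock >= horizon:
            break
        # Time needed to repair, also truncated at the end of the horizon.
        clock += min(rng.expovariate(repair_rate), horizon - clock)
    return up_time / horizon
```

For example, with an assumed failure rate of 0.01 and repair rate of 1.0 per time unit, the analytical value is 1/1.01, and a long simulation run should approach it; for models too large or too irregular for a closed form, only the simulation route remains.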

5 Comparison with Other Works

This section positions TIRAN with respect to other emerging fault tolerance approaches and research projects. In order to compare different approaches we selected a number of attributes, i.e., positive aspects and qualities of each approach, and assessed each approach with respect to these attributes. In this way we can position each approach and research project within a single spectrum (see Fig. 1).

5.1 Relevant Attributes for Positioning

Five attributes have been selected, namely:

• Efficiency of the approach, measured by response times as far as error detection, isolation and recovery are concerned,
• Transparency, or the separation of functional from dependability concerns, measured by the lines of code (LOC) to be added or modified in an existing application in order to adopt the approach,
• Portability, measured by the effort or LOC to be added or modified in order to support the approach on a new platform,
• Cost of adoption for the target application developer, both at development and at maintenance time,
• Flexibility, i.e. the effectiveness of the approach with respect to different fault classes, different fault tolerance mechanisms, and different end-user systems.

Four orthogonal approaches have been selected and will be assessed via the above attributes:

• System approach, i.e., the embedding of fault tolerance mechanisms into the hardware and/or the operating system.
• Library approach, or the embedding of fault tolerance tools into an application-level collection of functions.
• Conjoint use of metaobject protocols and of reflection, or modifying the semantics of an object-oriented language for a transparent adoption of fault tolerance methods.
• The Recovery Language Approach, viz., the conjoint use of a middleware backbone and a sort of secondary application layer for embedding and orchestrating recovery and configuration strategies and for managing design diversity.

Five ranking levels have been defined to measure the different approaches, ranging from low to high: low, low-to-medium, medium, medium-to-high, high. The attributes and approaches constitute, in a sense, a base for discussing current fault tolerance trends in more formal terms. TIRAN is finally assessed with respect to this base.

System Approach. Embedding fault tolerance in the hardware and/or in the operating system is typically characterised by:

• High efficiency, which makes it suitable if not mandatory for fulfilling hard real-time requirements,
• High transparency, as it does not involve the application level at all,
• Good portability of the functional part of the user application, bad portability of the features offered by the approach (ranking: medium),
• No development or maintenance costs (for the customer); typically higher costs for acquiring the target system with respect to general-purpose systems (ranking: low),
• Bad flexibility (a prefixed number and quality of dependability services are compulsorily offered to all customers; furthermore this approach does not include any fault tolerance support within the target application, so, as argued in [12], it cannot reach full end-to-end fault tolerance). Moreover, this approach does not support means for tolerating design faults (ranking: low).

Project GUARDS [11] also makes use of this approach.

Library Approach. This approach consists in embedding a number of fault tolerance mechanisms and tools in the form of a library. The approach is characterised by:

• Medium efficiency (which makes it unsuitable for matching hard real-time requirements in its pure form),
• Medium transparency (the application programmer does not have to "re-invent the fault-tolerant wheel each time", though he/she is responsible for embedding the function calls of the library into the application, from detection-level tools through to complete recovery),
• High portability (in principle),
• Low-to-medium development and maintenance costs,
• Medium flexibility (as it lacks any system support).

Examples of this approach can be found in Project Isis [3]. Tools like, e.g., HATS [8], are purely based on the library approach.

Metaobject Protocols and Reflection. An elegant approach is given by metaobject protocols (MOPs) [9]. The idea is to open the implementation of an object-oriented language so that the developer can adopt and program different, custom semantics, adjusting the language to the needs of the user and to the requirements of the environment. The transparent addition and management of temporal and spatial redundancy is a typical example of this approach. The key concept behind MOPs is computational reflection, or the causal connection between a system and a meta-level description representing structural and computational aspects of that system. This causality translates into the possibility to adjust dynamically the structure and the operation of a system, e.g., to perform reconfiguration and error recovery. The basic object-oriented feature of inheritance can be used to enhance the reusability of the fault tolerance mechanisms developed with this approach. It is characterised by:

• Medium efficiency (MOPs have been questioned because of the inherent overhead of managing a different semantics; this overhead has been found to be negligible at least in some cases, see e.g. [10, 9]),
• High transparency (full separation of functional concerns from dependability concerns),
• Medium-to-high portability (requires custom languages, though often with a strong resemblance to existing ones, like C++ or Lisp),
• Medium development and maintenance costs,
• Perhaps medium flexibility: no experimental or analytical evidence exists that allows evaluating the practicality and the generality of this approach. In other words, it is still an open question whether MOPs and reflection represent a practical solution towards the integration of any of the existing fault tolerance mechanisms into a given user application.

Reflection and MOPs work at language level, giving means to modify the semantics of basic object-oriented language constructs like object creation and deletion, calling and termination of class methods, and the like. This means looking at the problem from a microscopic point of view. Some of the fault tolerance techniques developed so far fit perfectly with this approach; for instance, object redundancy can be easily managed in this way. Some other techniques, specifically those which might be described as "the most coarse-grained ones", such as distributed recovery blocks, or those typical of digital systems (e.g. standby sparing), appear to be less suited for being efficiently implemented via MOPs and reflection. An architecture including this approach is Friends [7].

The Recovery Language Approach.
This is the approach developed in the EFTOS framework [5]: a layered software system including a set of detection tools (Dtools), a control backbone coordinating detection, isolation and recovery (the DIR net), and a virtual machine executing recovery strategies, written by the user in a custom scripting language called Ariel. This language constitutes a sort of secondary application layer, devoted to the execution of system-wide recovery strategies, which comes into action whenever an error is detected by one of the basic fault tolerance elements of the EFTOS framework or of the underlying run-time systems. Only a small number of changes need to be made to the original user application, as many of these fault tolerance methods can be adopted transparently; the bulk of the strategy to cope with errors is expressed in a separately defined, user-provided "recovery script". Such a script is written in what we call a "recovery language", i.e., a custom language devoted to the management of the fault tolerance strategies of the user and allowing the expression of scenarios of isolation, reconfiguration and recovery, to be executed on meta-entities of the application with physical or logical counterparts (processing nodes vs. tasks or user-defined groups of tasks).

Figure 1. A ranking of the different orthogonal approaches according to the five attributes defined in Section 5.1.

Such a strategy lets the user address the (non-functional) aspects of fault-tolerant computing separately from those pertaining to the (functional) behaviour that the user application should have in the absence of faults. Furthermore, the user can tackle either of these two fronts more easily and effectively, e.g., modify the fault tolerance strategy with few or no modifications in the functional part, or vice versa, resulting in better maintainability of the fault-tolerant application. Different developers, with specific knowledge, could act in each of these two application layers; e.g. a team with FT skills could arrange a secondary application layer suitable for a wide class of applications. This approach is characterised by:

• Medium efficiency (as, in its pure form, it lacks hardware support for detection, isolation, and for managing redundancy),
• Medium-to-high transparency (full transparency is not reached, though there is a partial separation of design concerns, at least as far as recovery and reconfiguration are concerned),
• High portability (in principle),
• Medium development costs and low maintenance costs (ranking: low-to-medium),
• Medium flexibility (in its pure form).
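The separation between functional code and recovery strategy described above can be sketched as follows. The sketch does not reproduce Ariel, whose syntax is not shown in this paper; instead the strategy is written as plain data and interpreted by a few lines of code. The names SystemView, STRATEGY and run_recovery, the entity names and the set of actions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SystemView:
    """What a control backbone might know: status of each monitored entity."""
    status: dict = field(default_factory=dict)    # entity -> status string
    actions: list = field(default_factory=list)   # log of executed recovery actions

    def restart(self, entity: str) -> None:
        self.actions.append(("restart", entity))
        self.status[entity] = "ok"

    def isolate(self, entity: str) -> None:
        self.actions.append(("isolate", entity))
        self.status[entity] = "isolated"

    def start_spare(self, entity: str, node: str) -> None:
        self.actions.append(("start_spare", entity, node))


# The recovery strategy lives apart from the functional code.
# Each rule is (entity, error kind, list of actions to execute).
STRATEGY = [
    ("task1", "transient", [("restart",)]),
    ("task1", "permanent", [("isolate",), ("start_spare", "node2")]),
]


def run_recovery(view: SystemView, strategy, entity: str, error_kind: str) -> bool:
    """Execute the actions of the first rule matching the detected error."""
    for rule_entity, rule_kind, actions in strategy:
        if rule_entity == entity and rule_kind == error_kind:
            for name, *args in actions:
                getattr(view, name)(entity, *args)
            return True
    return False   # no rule matched: the error is left to other mechanisms
```

Changing the reaction to a permanent fault of task1 then means editing STRATEGY only; the functional code of task1 and the interpreter stay untouched, which is the maintainability argument made above.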

TIRAN Positioning. TIRAN can be considered as a mixture of the library and of the recovery language approach for managing confinement-level, detection-level and isolation-level tasks, plus specific hooks to the underlying run-time system. The approach does not require object orientation and can be effectively used to add fault tolerance features to existing distributed or parallel applications. Thanks to specific system support for detection, isolation and fault masking, one of the goals of TIRAN is also to be able to fulfil real-time requirements. Furthermore, specific support for managing design diversity could be included in a specific extension of the recovery language. The approach allows the integration within the generic library of custom FT basic mechanisms (for masking, detection, isolation, recovery) available at the system or software library level: this can have a positive impact on flexibility and efficiency without reducing portability. A goal of TIRAN is therefore to inherit or increase as much as possible the best performing attributes of the system and of the recovery language approach, and to reach a high degree of flexibility such that a large portion of the spectrum of fault classes can be effectively tackled.

6 Current Status and Outlook

Previous experience within the EFTOS project [5] showed, by running industrial prototypes, the feasibility of the TIRAN framework approach and pointed out the major improvements required. TIRAN started with a revision of the industrial requirements to allow a wide usability of results and to promote the development of an industrial product. A suitable general architecture was identified to satisfy a wide range of performance requirements, from soft to hard real-time. Relevant novelties concern flexibility and portability, the adoption of market standards, the integration of the framework within commercially available solutions supplied by their producers, as well as the adoption of formal validation techniques. A first TIRAN "concept demonstrator" has been built and the development of the FT framework has started. The first prototype was issued in October 1999. Ports to VxWorks and Windows CE are scheduled for March 2000, while the final release of the complete product and all its ports is planned for October 2000.

Acknowledgments. This project is partly sponsored by the ESPRIT IV project 28620 "TIRAN" and by COF/96/11. Geert Deconinck is a Postdoctoral Fellow of the Fund for Scientific Research - Flanders (Belgium) (FWO). Rudy Lauwereins is a Senior Research Associate of FWO. The first Author wishes to thank Mr Piergiorgio Mirandola and Mr Roberto Meda for the fruitful discussions transferring years of experience on fault tolerance in the Energy field.

References

[1] M. Ajmone Marsan, G. Balbo, G. Conte, S. Donatelli, and G. Franceschinis. Modelling with Generalized Stochastic Petri Nets. J. Wiley, New York, N.Y., USA, 1994.

[2] Anonymous. SQUALE - Security, safety and QUALity Evaluation for dependable systems. Technical Report, European Commission project ACTS/AC097. Available at URL: http://www.laas.research.ec.org/squale.

[3] K. J. Birman. Replication and fault tolerance in the Isis system. ACM Operating Systems Review, 19(5):79-86, 1985.

[4] G. Deconinck, M. Truyens, V. De Florio, W. Rosseel, R. Lauwereins, and R. Belmans. A framework backbone for software fault tolerance in embedded parallel applications. In Proc. of the 7th Euromicro Workshop on Parallel and Distributed Processing (PDP'99), pages 189-195, Funchal, Portugal, February 1999. IEEE Comp. Soc. Press.

[5] G. Deconinck, T. Varvarigou, O. Botti, V. De Florio, A. Kontizas, M. Truyens, W. Rosseel, R. Lauwereins, F. Cassinari, S. Graeber, and U. Knaak. (Reusable software solutions for more fault-tolerant) Industrial embedded HPC applications. Supercomputer, XIII(69):23-44, 1997.

[6] S. Donatelli and G. Franceschinis. The PSR methodology: integrating hardware and software models. LNCS, N. 1091. Springer Verlag, Osaka, Japan, June 1996.

[7] J.-C. Fabre and T. Pérennou. FRIENDS: A flexible architecture for implementing fault tolerant and secure applications. In Proc. of the 2nd European Dependable Computing Conference (EDCC-2), Taormina, Italy, October 1996.

[8] Y. Huang and C. M. Kintala. Software fault tolerance in the application layer. In M. Lyu, editor, Software Fault Tolerance, chapter 10, pages 231-248. John Wiley & Sons, New York, 1995.

[9] G. Kiczales, J. des Rivieres, and D. G. Bobrow. The Art of the Metaobject Protocol. The MIT Press, Cambridge, MA, 1991.

[10] H. Masuhara, S. Matsuoka, T. Watanabe, and A. Yonezawa. Object-oriented concurrent reflective languages can be implemented efficiently. In Proc. of the Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA-92), pages 127-144, 1992.

[11] D. Powell. Preliminary definition of the GUARDS architecture. Technical Report 96277, LAAS-CNRS, January 1997.

[12] J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-end arguments in system design. ACM Trans. on Computer Systems, 2(4):277-288, 1984.

[13] K. Varpaaniemi, J. Halme, K. Hiekkanen, and T. Pyssysalo. Prod reference manual. Digital Systems Laboratory Report B 13, University of Technology, Helsinki, August 1995. Available at URL: http://saturn.tcs.hut.fi/pub/reports/B13.ps.Z.

Proceedings, 8th Euromicro Workshop on Parallel and Distributed Processing, January 19-21, 2000, Rhodos, Greece. IEEE Computer Society.


Short Presentations 3: Distributed Shared Memory Systems. Chair: D. Laforenza

The TIRAN Approach to Reusing Software Implemented Fault Tolerance, p. 325. O. Botti, V. De Florio, G. Deconinck, R. Lauwereins, F. Cassinari, S. Donatelli, A. Bobbio, A. Klein, H. Kufner, E. Thurner, E. Verhulst
