
The FTMPS-Project: Design and Implementation of Fault-Tolerance Techniques for Massively Parallel Systems¹

Johan Vounckx³, G. Deconinck³, R. Lauwereins³,², G. Viehöver⁴, R. Wagner⁴, H. Madeira⁵, J.G. Silva⁵, F. Balbach⁶, J. Altmann⁶, B. Bieker⁷, H. Willeke⁷

³ Katholieke Universiteit Leuven, ESAT, K. Mercierlaan 94, 3001 Heverlee, Belgium, tel: +32-16-22 09 31 - fax: +32-16-22 18 55 - johan.vounckx@esat.kuleuven.ac.be
⁴ Parsytec GmbH (D)
⁵ Universidade de Coimbra (P)
⁶ F.A. Universität Erlangen-Nürnberg (D)
⁷ Universität-GH Paderborn (D)

Abstract. The FTMPS-project provides a solution to the need for fault-tolerance in large systems. A complete fault-tolerance approach is developed and being implemented. The built-in hardware error-detection features combined with software error-detection techniques provide a high coverage of transient as well as permanent failures. Combined with the diagnosis software, the necessary information for the OSS (statistics and visualisation) and the possible reconfiguration is collected. Backward error recovery, based on checkpointing and rollback, is implemented.

1 Introduction

Fault-tolerance is now more than ever indispensable in computer systems: even if massively parallel systems (MPS) can meet the growing demand for computational power, execution times of weeks and months are still required for heavy number-crunching applications. Though the probability that a single chip fails decreases thanks to better technologies, the statistical chance that one element in these large systems breaks down during run time is not at all negligible. However, the failure of one component should not result in the failure of the complete system. Hence, the failures which will occur must be tolerated without influencing the results of the computations.

Most users of MPS for scientific and industrial applications do not require that these systems run continuously at their full capabilities. As long as the system is usable, produces correct results and gets back to full speed within an acceptable period, they accept reduced system performance due to failures of some components, or down-times due to preventive maintenance.

¹ Supported by the EC as ESPRIT-project 6731
² Senior Research Associate of the Belgian National Science Foundation

This need for correctness, continuity and availability motivates the fault-tolerance techniques for long-running number-crunching applications developed within ESPRIT-project 6731. The acronym FTMPS - A Practical Approach to Fault-Tolerance in Massively Parallel Systems - explains its primary objective: developing fault-tolerance concepts to manage failures in massively parallel computers which are immediately applicable to the market. Project partners are the KU Leuven (Belgium), the University of Coimbra (Portugal), the Universities of Paderborn and Erlangen-Nürnberg (both Germany), British Aerospace (UK) and Parsytec Computer GmbH (Germany).

In the next section we describe the target machine for which the FTMPS software is developed. In the third section we give the fault-tolerance concept followed within the project. The fourth section presents two additional developments of the project, which support the operator and enhance the applicability of the software to other MPS. Finally, we end this paper with a conclusion and an outlook on the future of this project.

2 Target Machine

In this section we describe the target architecture for our fault-tolerance techniques. To make the system software as hardware-independent as possible, a Unifying System Model (USM) is introduced. The applicability of the FTMPS approach is demonstrated on Parsytec's GC series, which is described in the second subsection.

2.1 Unifying System Model

In the USM, the massively parallel system consists of two logically divided parts. The D-net (data network) is used to execute user applications. It is divided into partitions (space-sharing), each of which corresponds to the set of nodes assigned to a single application. The C-net (control network) is used to execute the system (control) software. The global C-net is shared by all partitions. A local C-net is used by a single application. These divisions are necessary to provide fault isolation and to prevent one application from influencing another. By ensuring that the system software conforms to this USM, it is possible to obtain a high degree of architecture independence, applicable to a wide range of MPS.

The USM does not impose any network topology. However, each topology requires a different routing approach. In the FTMPS-project we focus on mesh-based architectures. Indeed, this topology covers a wide range of massively parallel systems [8], [7], [3] thanks to its expandability and routing simplicity [11].
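The division described above can be made concrete with a small data-structure sketch. It is only an illustration under our own assumptions: the names `Partition`, `Machine` and `add_partition`, and the use of coordinate tuples and strings as node identifiers, are hypothetical and not part of the FTMPS software. The sketch models the global C-net as shared and rejects any partition whose D-net nodes or local C-net nodes overlap with an existing partition, which is the fault-isolation property the USM relies on.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Partition:
    """One application's share of the machine (hypothetical model)."""
    name: str
    d_nodes: frozenset        # D-net nodes assigned to this application
    local_c_nodes: frozenset  # local C-net used only by this application


@dataclass
class Machine:
    """Global C-net shared by all partitions, plus the current partitions."""
    global_c_nodes: frozenset
    partitions: list = field(default_factory=list)

    def add_partition(self, new):
        # Space-sharing: no node may belong to two applications at once.
        for old in self.partitions:
            if new.d_nodes & old.d_nodes:
                raise ValueError(f"D-net overlap between {new.name} and {old.name}")
            if new.local_c_nodes & old.local_c_nodes:
                raise ValueError(f"local C-net overlap between {new.name} and {old.name}")
        self.partitions.append(new)


if __name__ == "__main__":
    machine = Machine(global_c_nodes=frozenset({"gc0"}))
    machine.add_partition(Partition("A", frozenset({(0, 0), (0, 1)}), frozenset({"c0"})))
    machine.add_partition(Partition("B", frozenset({(1, 0), (1, 1)}), frozenset({"c1"})))
    print([p.name for p in machine.partitions])
```

Nothing in the sketch depends on the network topology, in line with the statement that the USM does not impose one.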

2.2 Parsytec GC Series

The Parsytec GC series covers a wide range of (massively) parallel computers. The GCel [8] consists of a network of T805 transputers connected in a 2D-grid. Static routing switches and a control network allow space sharing by assigning a partition to each user. The built-in spare is already a first fault-tolerance means.

The GC/T9 [7] is based on the T9000 transputer and dedicated hardware routers that minimise communication delay. These routers must be programmed to set up the partitions. Again, spares are available. The hierarchical structure is 3D-grid based.

The nodes of the GC/PPC are connected in a 2D-mesh, like the GCel. Each node consists of two powerful PowerPCs for computational purposes and four T805s for communication. Static switches are provided to set up the user partitions. Spares are likewise integrated in the system.

3 Fault-Tolerance Techniques

In this section we focus on the fault-tolerance concept of the FTMPS-project. The goal of concurrent error detection is to inform the diagnosis software when a fault occurs. This fault is then analysed and classified. If necessary, the system is reconfigured and the application is restarted from a saved checkpoint. A statistics tool (Section 4.1) is provided to help diagnosis and reconfiguration. The following subsections describe each of the subtasks in more detail.

3.1 Error-detection

The error-detection is based on two complementary techniques. Structural error techniques are based on codes for memory and communication. Classically used techniques such as parity bits, error-detecting and error-correcting codes, and possibly checksums must be included in the system. Behaviour-based techniques detect errors in the subsystem formed by the processor, bus and memory. In this approach, information describing a particular aspect of the behaviour (e.g. the program control flow) is collected beforehand (e.g. at compile/assembly time). At run time this information is compared with the actual behaviour. An excellent survey of this family of error-detection techniques can be found in [5].
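As an illustration of the behaviour-based idea, the following minimal sketch checks the control flow of a program against a table of allowed transitions between basic blocks. It is not the FTMPS mechanism itself: the block names, the `ALLOWED` table and the classes are hypothetical, and the table that a real system would collect at compile/assembly time is simply written out by hand here.

```python
# Allowed successors of each basic block. In a real system this would be
# collected before execution; here it is a hand-written, hypothetical example.
ALLOWED = {
    "entry": {"loop"},
    "loop": {"body", "exit"},
    "body": {"loop"},
    "exit": set(),
}


class ControlFlowError(Exception):
    """Raised when the observed behaviour deviates from the recorded one."""


class ControlFlowChecker:
    def __init__(self, allowed, start):
        self.allowed = allowed
        self.current = start

    def enter(self, block):
        # Compare the actual transition with the pre-collected information.
        if block not in self.allowed[self.current]:
            raise ControlFlowError(f"illegal transition {self.current} -> {block}")
        self.current = block


def check_trace(allowed, trace):
    """Replay a sequence of executed blocks; return True if every step is legal."""
    checker = ControlFlowChecker(allowed, trace[0])
    for block in trace[1:]:
        checker.enter(block)
    return True


if __name__ == "__main__":
    print(check_trace(ALLOWED, ["entry", "loop", "body", "loop", "exit"]))
    try:
        check_trace(ALLOWED, ["entry", "body"])  # jump that skips the loop header
    except ControlFlowError as err:
        print("detected:", err)
```

A check of this kind detects erroneous jumps between blocks; errors that stay within a legal control-flow path would need one of the complementary structural techniques.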

3.2 Diagnosis

The next step is the evaluation of the data delivered by the fault-detection mechanisms in order to localise the faulty components and to classify the fault. The fault diagnosis software handles an error when an error message is received by the control processors (from an error-detection mechanism) or when a faulty component fails to send "I'm alive" messages. The algorithm consists of two parts running on the control processors. The first part supervises the application processors and the data net; the second part checks the control processors and control net. If a fault cannot be diagnosed by the control processors, the host diagnoses the fault. In addition, the host is the interface to the statistics database, which stores the diagnosis results and delivers the necessary information.

The result of the diagnosis is passed to the recovery algorithms which, if necessary, reconfigure the system and restart the application (Sections 3.3 and 3.4).
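The detection of missing "I'm alive" messages can be sketched as a simple timeout monitor. This is an assumption-laden illustration, not the FTMPS diagnosis algorithm: the class `AliveMonitor`, its methods and the timeout value are hypothetical, and time is passed in explicitly so that the example is deterministic.

```python
class AliveMonitor:
    """Tracks the last "I'm alive" message of each supervised node (hypothetical)."""

    def __init__(self, nodes, timeout):
        self.timeout = timeout
        # Every node is considered to have reported at time 0.
        self.last_seen = {node: 0.0 for node in nodes}

    def alive(self, node, now):
        """Record an "I'm alive" message received from `node` at time `now`."""
        self.last_seen[node] = now

    def suspects(self, now):
        """Return the nodes that have been silent for longer than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = AliveMonitor(["p0", "p1", "p2"], timeout=2.0)
    for node in ("p0", "p1", "p2"):
        monitor.alive(node, now=1.0)
    monitor.alive("p0", now=3.0)
    monitor.alive("p2", now=3.0)
    print(monitor.suspects(now=3.5))  # p1 has been silent for 2.5 time units
```

A silent node is only a suspect: as described above, localising and classifying the fault is a separate step of the diagnosis.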

3.3 Reconfiguration

The application needs a partition that appears undamaged. Hence, permanent errors require a reconfiguration of the system before the application may continue. This reconfiguration is user-transparent and consists of two parts.

Rerouting: when a component has failed permanently, the routing tables must be reprogrammed such that the failed entities are bypassed. The GC uses interval routing [9], which is very well suited for an MPS because of the compactness of its routing information. Adapting the routing tables to overcome failures while keeping the information compact is quite complicated. Solutions are shown in [10].

Remapping: when processors fail, they must be remapped onto spare processors to allow the application to continue. The repartitioning algorithm must find a partition with sufficient working processors for the application.
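To show why interval routing keeps routing information compact, the following sketch applies the basic scheme to a fault-free linear chain of nodes numbered 0 to n-1. It is a minimal illustration only: the helper names are hypothetical, the chain topology is chosen for simplicity, and the fault-tolerant adaptation of the tables discussed in [10] is not modelled. Each link of a node carries a single label, and a message is forwarded over the link whose cyclic interval, running from its own label up to the next label, contains the destination.

```python
from bisect import bisect_right


def chain_tables(n):
    """Build interval routing tables for a linear chain of n nodes.

    Each table is a list of (label, link) pairs sorted by label. The link
    with label L serves the cyclic interval from L up to the next label.
    """
    tables = {}
    for node in range(n):
        table = []
        if node > 0:
            table.append((0, "left"))           # destinations below this node
        if node < n - 1:
            table.append((node + 1, "right"))   # destinations above this node
        tables[node] = table
    return tables


def next_link(table, dest):
    """Select the link whose cyclic interval contains the destination."""
    labels = [label for label, _ in table]
    # Largest label <= dest; index -1 wraps around to the last interval.
    k = bisect_right(labels, dest) - 1
    return table[k][1]


def route(tables, src, dest):
    """Return the sequence of nodes visited from src to dest."""
    path = [src]
    node = src
    while node != dest:
        node += 1 if next_link(tables[node], dest) == "right" else -1
        path.append(node)
    return path


if __name__ == "__main__":
    tables = chain_tables(5)
    print(route(tables, 0, 3))
    print(route(tables, 4, 1))
```

Every node stores at most one label per link, independent of the number of nodes in the system, which is the compactness property referred to above.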

3.4 Application Recovery

The recovery software must also initiate the restart of the applications involved. To avoid restarting long-running applications from scratch, backward error recovery is integrated. It is based on checkpointing and rollback. This means that during normal execution, the status of the application program is saved to stable storage. This set of information about the status is called a checkpoint. In case of a failure, the application can be rolled back to a checkpoint and restarted from there.

Recovery of parallel applications has to cope with the consistency problem. After rollback the application must obviously restart from a consistent state. This means that not only the status of each processor on its own must be consistent, but also that the consistency among the checkpoints of different processors must be guaranteed. Since the different nodes communicate and no global time is available, problems arise. An overview of existing solutions may be found in [2].

In the FTMPS-project an application-wide checkpointing technique is used, which has the advantage that during normal fault-free operation overhead is produced only when saving the checkpoints. Fail-time-bounded behaviour is incorporated, which does not assume fail-stop but allows an error-detection latency. The user-driven implementation is a non-blocking approach which saves the minimal amount of information at the cost of a limited user involvement.
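The principle of checkpointing and rollback can be sketched for a single process as follows. This is a simplified illustration under explicit assumptions: the function names are hypothetical, an ordinary file stands in for stable storage, the failure is injected artificially at a chosen step, and the consistency problem among several processors described above is not addressed.

```python
import os
import pickle


def save_checkpoint(path, state):
    """Write the state to a temporary file, then replace the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # the previous checkpoint stays valid until this point


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the step numbers 0..total_steps-1, surviving one injected failure."""
    state = {"step": 0, "acc": 0}
    save_checkpoint(path, state)
    failed = False
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at and not failed:
            failed = True
            state = load_checkpoint(path)  # rollback: work since the checkpoint is lost
            continue
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]


if __name__ == "__main__":
    import tempfile
    ckpt = os.path.join(tempfile.mkdtemp(), "app.ckpt")
    print(run(ckpt, total_steps=10, checkpoint_every=3))
    print(run(ckpt, total_steps=10, checkpoint_every=3, fail_at=7))
```

Both runs produce the same result; the run with the injected failure only repeats the steps executed since the last checkpoint, which is the benefit over restarting from scratch.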

The user-transparent approach, in contrast, freezes the application and relieves the user of any involvement in checkpointing.

4 Applicability

This section describes two aspects that enhance the applicability of the project results: a system operator interface, which facilitates the use of the system and serves as backbone for the FTMPS software, and the portability of the software.

4.1 OSS (Operator Site Software)

For the maintenance staff, it is important to know which boards are damaged and have to be replaced. Manufacturers try to localise weak points in their hardware. For these reasons an automatic error log is integrated. Additionally, a software monitor [1], [4], [6] provides data about the system's workload in order to reveal failure/load relationships. New problems arise from the amount of data which has to be routed to and evaluated on the control host of an MPS.

The OSS supports both the other fault-tolerance software (providing history facts and system status) and the system operator (visualising statistics and status, and interfacing for interventions, e.g. in exceptional cases, system crashes, repairs, changes in the partitioning, upgrades, etc.). The OSS can be thought of as the front-end of the fault-tolerance software. The built-in visualisation features simplify the handling of massively parallel systems for the operator.

4.2 HIL (Hardware Independence Layer)

The Hardware Independence Layer (HIL) provides hardware component independence for the operating system and the FTMPS software. This enables portability of this software across different platforms containing different types of processors and communication hardware. The cost of incorporating this software on other or updated hardware is thereby reduced, as only the HIL must be adapted. In defining the HIL, a trade-off is made between minimising the necessary recoding and maximising the efficiency of a single implementation by taking advantage of specific hardware facilities.

For a programmer, the HIL should be transparent and should not affect the functionality of the services offered by the operating system and fault-tolerance extensions. Hence, the HIL provides a virtual processor interface to the system software, which may remain unmodified. The HIL therefore consists of a library of functions which execute only under control of the operating system.
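The idea of a virtual processor interface can be illustrated with a small sketch. All names in it are hypothetical and chosen for illustration only; the actual HIL functions are not specified in this paper. System software is written against an abstract interface, and only the class implementing that interface has to be rewritten for new hardware. A trivial in-memory implementation stands in for a real platform.

```python
from abc import ABC, abstractmethod
from collections import deque


class VirtualProcessor(ABC):
    """Hardware-independent view of a node (hypothetical interface)."""

    @abstractmethod
    def send(self, link: int, data: bytes) -> None:
        """Send data over the given link."""

    @abstractmethod
    def receive(self, link: int) -> bytes:
        """Return the next message waiting on the given link."""


class LoopbackProcessor(VirtualProcessor):
    """Stand-in implementation: messages sent on a link are read back from it."""

    def __init__(self):
        self._queues = {}

    def send(self, link, data):
        self._queues.setdefault(link, deque()).append(data)

    def receive(self, link):
        return self._queues[link].popleft()


def send_alive(cpu: VirtualProcessor, link: int, node_id: int) -> None:
    """System-software routine that uses only the virtual interface."""
    cpu.send(link, b"ALIVE:" + str(node_id).encode())


if __name__ == "__main__":
    cpu = LoopbackProcessor()
    send_alive(cpu, link=0, node_id=7)
    print(cpu.receive(0))
```

Porting in this picture means supplying another subclass of `VirtualProcessor`; routines such as `send_alive` stay unchanged, which mirrors the statement that only the HIL must be adapted.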

5 Conclusion

In this paper, we presented a specification of a global fault-tolerance approach for a wide range of massively parallel systems. The real need for fault-tolerance in large systems was described. We explained the target hardware on which this project is being implemented, together with the fault-tolerance software modules. The built-in hardware error-detection features combined with software error-detection techniques provide a high coverage of transient as well as permanent failures. Combined with the diagnosis software, the necessary information for the OSS (statistics and visualisation) and the possible reconfiguration is collected. From the application's point of view, backward error recovery, based on checkpointing and rollback, is implemented. Especially for long-running number-crunching applications this seems an advantageous approach.

By now the project has finished a first prototype demonstrating the applicability and flexibility of the developed fault-tolerance techniques.

References

1. Castillo F., Siewiorek D.P.: Workload, Performance, and Reliability of Digital Computer Systems. IEEE Proc. of FTCS-11, pp. 84-89, June 1981.
2. Deconinck G., Vounckx J., Lauwereins R., Peperstraete J.A.: Survey of Backward Error Recovery Techniques for Multicomputers Based on Checkpointing and Rollback. IASTED Intl. Conf. on Modelling and Simulation, Pittsburgh, PA, USA, May 1993, pp. 262-265.
3. Esser R., Knecht R.: Intel Paragon XP/S - Architecture and Software Environment. Proceedings of Supercomputer 93, Mannheim, June 1993.
4. Iyer R.K., Rossetti D.J.: A Measurement-Based Model for Workload Dependence of CPU Errors. IEEE Trans. on Computers, C35(6):511-519, June 1986.
5. Mahmood A.: Concurrent Error Detection Using Watchdog Processors - A Survey. IEEE Trans. on Computers, 37(2), 1990.
6. Maehle E., Obelöer W.: DELTA-T, a User-Transparent Software-Monitoring Tool for Multi-Transputer Systems. Proc. EUROMICRO 92, Microprocessing and Microprogramming, 32(9):245-252, Sep. 1992.
7. Parsytec GmbH: Technical Summary Parsytec GC, Version 1.0. Parsytec GmbH, 1991.
8. Tiedt F.: Parsytec GCel Supercomputer, Technical Report, Parsytec GmbH, 1991.
9. van Leeuwen J., Tan R.B.: Routing with Compact Routing Tables. Technical Report RUU-CS-83-16, Rijksuniversiteit Utrecht, Nov. 1983.
10. Vounckx J., Deconinck G., Cuyvers R., Lauwereins R., Peperstraete J.A.: Network Fault-Tolerance with Interval Routing Devices. Proc. of the 11th IASTED Int. Symp. Applied Informatics, pp. 293-296, Annecy, France, May 1993.
11. Vounckx J., Deconinck G., Cuyvers R., Lauwereins R., Peperstraete J.A.: Multiprocessor Routing Techniques. Deliverable O3.1.1/L of ESPRIT Project 6731, July 1993.

This article was processed by the author using the TeX macro package from Springer-Verlag.