First Conference on Fault Tolerant Systems, IIT Madras, December 20-22, 1995
Understanding Communication Faults in Parallel Computers
João Carreira, Diamantino Costa, Henrique Madeira, João Gabriel Silva
Departamento de Engenharia Informática, Universidade de Coimbra
Email: [email protected]

Abstract

This paper addresses the evaluation of the dependability properties of distributed memory parallel systems through fault injection. The most popular parallel computers are based on the distributed memory architecture, where loosely coupled processors communicate by message passing. Fault tolerance increasingly concerns manufacturers and end users of these systems, because the probability of a fault grows with the number of components, and parallel machines can have up to thousands of nodes and complex interconnection media. To validate fault tolerance in these systems, both the processing nodes and the communication subsystem should be taken into account. This paper focuses on the validation of communication subsystems and reports experiments conducted with the CSFI (Communication Software Fault Injector) tool on a commercial parallel machine with no fault handling mechanisms. Two sets of experiments were performed: one using the original applications, and another using the same applications in conjunction with an application level CRC mechanism for the messages. The outcome of the experiments was analysed focusing on those faults that caused the application to generate wrong results without any error being detected. These cases correspond to situations in which it would be virtually impossible to detect that the benchmark output was erroneous. The results show the effectiveness of the CRC as an error detection mechanism and emphasise the need for robust communication protocols in parallel machines in order to achieve confidence in the application results. They also suggest that the current quest for performance in the parallel computing industry can only be effective if it is accompanied by dependability.
1. Introduction
Parallel computing systems are increasingly used for running complex scientific programs and other computing-intensive applications. However, as the number of processors and other system components increases, the probability of a fault occurring increases as well. Faults in parallel systems can be especially harmful, as the crash of a single processor can abort an entire long-running application and force a restart from scratch. The consequences can be even worse if transient faults cause erroneous results to be generated in such long runs and mislead the users and application engineers.

For this reason, fault detection and recovery are becoming major concerns in the design of parallel machines. The nature of the faults that can occur in these machines is not yet fully understood. In shared memory parallel machines, faults can occur in processor circuitry, memories and support logic just as in traditional computers, but with a higher probability due to the increased number of such components. Disjoint memory parallel machines, on the other hand, have an additional source of errors: the communication subsystem. In this architecture, several nodes with traditional processor systems are interconnected by some communication media and communicate via message passing. Communication media are intrinsically very prone to electrical interference, which can cause errors in transmitted data that propagate into the processing nodes and finally cause computational errors or node failures. In fact, those errors can be as harmful as errors in processor circuitry.

This scenario is complicated by the fact that communication software for these machines is usually developed with performance as the only concern. Robust communication mechanisms are not implemented in most cases and error detection at the protocol level is weak. In the quest for performance and scalability, basic issues are simply forgotten.

This paper reports experiments conducted with the CSFI (Communication Software Fault Injector) tool on a commercial parallel machine with no built-in error detection mechanisms. CSFI is a comprehensive set of tools that covers all the steps required to inject communication faults, including the definition of sets of faults, automatic fault injection and the collection of results for statistical analysis [3]. Several experiments were performed using a set of parallel benchmarks representative of the patterns of communication and computation found in real scientific applications.

The goal of the experiments was to evaluate how well a CRC (Cyclic Redundancy Check) based error detection mechanism for the messages avoids the most serious consequence of faults: the generation of wrong application results. The structure of the paper is as follows: Section 2 discusses previous research in the area of fault injection. Section 3 briefly describes the target parallel system and the fault injection tool used in the experiments. Results collected from the fault injection experiments are analysed in Section 4, and finally Section 5 concludes the paper.
2. Related Research
This work is part of a broader research project¹ whose aim is to validate several fault tolerance mechanisms included in a commercial parallel system. The validation is performed using experimental methods (fault injection) and covers both the processing nodes and the communication subsystem. The technique used in both cases is software implemented fault injection (SWIFI), chosen for its advantages over the other known techniques [4]. In fact, SWIFI techniques [4, 6, 9, 10, 14, 15, 20] are increasingly used as an alternative to other methods such as physical fault injection [16, 17] or simulation [13] for injecting faults in computer systems. The injection of fault types specific to parallel and distributed systems has also been a major concern in several fault injectors such as DOCTOR [10], DEFINE [15], EFA [9] and CSFI [3]. These tools are able to inject faults in the communication subsystems of their target systems through software and have been used for several purposes, such as evaluating distributed diagnosis algorithms, the fault tolerant capability of algorithms, or the overall effect of communication faults on parallel applications. The tool used in this work has been presented in detail in [3].

¹ FTMPS: A Practical Approach to Fault Tolerant Massively Parallel Systems. Project partners are: Univ. Coimbra, Univ. Erlangen, Univ. Leuven, Univ. Luebeck, British Aerospace and Parsytec GmbH.
3. The Target System
The parallel system used in the experiments reported in this paper is a typical commercial system [1] from Parsytec, used by industry and academia. It includes four nodes, each with a PowerPC (MPC601) for computation and a T805 [23] devoted entirely to communication among nodes. The four serial links of each T805 operate at 20 Mbits/s. The operating system of this parallel machine is called Parix [1] and is a Unix-like operating system with extensions for parallel programming. The development system runs on a host computer (a SunSparc), and applications are prepared on and downloaded from this machine. The host is connected to the parallel system through an interface board (BBKS4) and a serial link to one of the transputers (Figure 1). The CSFI tool used in the fault injection experiments has software modules running on the host for experiment definition and control, and others running on the target system (both PowerPC and T805) for fault injection (see [3] for a detailed description).
[Figure 1: block diagram showing the HOST running dserver, the BBKS4 interface board, and the PowerXplorer with four nodes (0 to 3), each pairing a PPC with a T8.]
Figure 1. The Target System.
4. Experimental Results
Several experiments were conducted with CSFI using two parallel benchmarks: π Calculation and MATMULT. A short description of these applications is given below:

1. π Calculation (Linda). Computes an approximate value of π by numerically calculating the area under the curve 4/(1+X²). The area is partitioned into N strips by a Master program and each job is assigned a subset of the total strips. These jobs are carried out by Workers, which return their part of the total sum to the Master. The final calculated value of π is stored in a file by the Master.
2. MATMULT - Matrix Multiplication (Linda). A matrix multiplication program following the master/worker paradigm. Each worker enrolled in the computation is responsible for calculating a part of the result matrix (119x119 integers).

Note that both applications write their results to a file, so that the results generated under fault injection can be compared with the correct ones generated in a gold run. Each injected fault consisted of a single bit flip and could affect any byte transmitted on a link, which means that both the payload and the headers of the packets were corrupted.
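The strip partitioning used by the π Calculation benchmark can be illustrated with a minimal sequential sketch. It rests on several assumptions that are not in the paper: the midpoint rule is used for each strip, the area is taken over the interval [0, 1] (where the exact value is π), the Linda tuple space is replaced by plain function calls, and the names `strip_sum` and `master` are hypothetical.

```python
import math

def strip_sum(first, last, n_strips):
    """Worker job: area of strips first..last-1 under 4/(1+x^2) on [0, 1]."""
    width = 1.0 / n_strips
    total = 0.0
    for i in range(first, last):
        x = (i + 0.5) * width            # midpoint of strip i (assumed rule)
        total += 4.0 / (1.0 + x * x)
    return total * width

def master(n_strips, n_workers):
    """Master: split the strips into contiguous jobs and add the partial sums."""
    bounds = [n_strips * w // n_workers for w in range(n_workers + 1)]
    partials = [strip_sum(bounds[w], bounds[w + 1], n_strips)
                for w in range(n_workers)]
    return sum(partials)

# Four workers are used here only because the target machine has four nodes.
pi_estimate = master(n_strips=100_000, n_workers=4)
print(pi_estimate)
```

In the real benchmark each partial sum travels over a link as a message, so a corrupted payload changes the final value of π without necessarily producing any visible error.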
The outcome of the experiments was summarized in four main classes, as shown in Table 1. Errors of the second class (Wrong results) are the most dangerous ones, because the user will think that the results are correct: no error was reported and the size of the results file was plausible. An initial series of experiments was performed using the original applications and system, without any type of CRC check in the communication. A total of 1500 faults were injected in each application. The summary of results is shown in Figure 2.

System hung       The system hung after a fault was injected, without any error being reported to the outside.
Wrong results     The application terminated correctly and no error was reported to the outside, but the application results were corrupted.
Correct results   The application terminated correctly, no error was reported to the outside, and the application results were correct.
Error detected    An error was detected and reported to the outside, or the application terminated incorrectly (wrong exit code).

Table 1. Classification of the experiments outcome
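The classification in Table 1 can be expressed as a small decision function. This is only a sketch of the rules stated in the table, not the CSFI implementation: the function name, its parameters and the idea of passing the outputs as byte strings are all assumptions made for illustration.

```python
from enum import Enum

class Outcome(Enum):
    SYSTEM_HUNG = "System hung"
    WRONG_RESULTS = "Wrong results"
    CORRECT_RESULTS = "Correct results"
    ERROR_DETECTED = "Error detected"

def classify(finished, exit_code, error_reported, output, gold_output):
    """Map one fault injection run to a Table 1 class (hypothetical helper)."""
    # An explicit error report or a wrong exit code counts as a detected error.
    if error_reported or (finished and exit_code != 0):
        return Outcome.ERROR_DETECTED
    # No termination and nothing reported to the outside: the system hung.
    if not finished:
        return Outcome.SYSTEM_HUNG
    # Clean termination: compare the results file with the gold run.
    if output == gold_output:
        return Outcome.CORRECT_RESULTS
    return Outcome.WRONG_RESULTS

print(classify(True, 0, False, b"3.1416", b"3.1415").value)
```

The last branch is the dangerous one: it can only be reached by comparing against a gold run, which is exactly what a normal user does not have.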
[Figure 2: two pie charts, one per benchmark (π Calculation, MATMULT), with the slices System hung, Correct results, Wrong results and Error detected. Wrong results is 8% for π Calculation and 13% for MATMULT; the remaining slice labels are 18%, 10%, 42%, 35%, 16% and 58%.]
Figure 2. Summary of results for MATMULT and π Calculation under communication faults
As can be seen, the percentage of undetected errors (the Wrong results class) reached 13% in MATMULT and 8% in π Calculation. These results broadly agree with those collected for a T805 based system reported in [Carreira95]. It is worth stressing that these cases are very critical, as users would confidently use the erroneous outcome of their application. The errors detected correspond mainly to cases where system calls returned an error and caused the application to terminate. In a second phase of experiments we altered the PARIX communication primitives to include CRC checking, implemented at the application level. A CRC is calculated for the payload of the transmitted messages. Basically, the usual PARIX communication primitives, SendLink(..) and RecvLink(..), were replaced in the source code by two new primitives and the applications were recompiled. The Linda library was also recompiled with the new primitives, which are:
SendLinkCRC(..)
RecvLinkCRC(..)

These primitives have the same parameters as the original ones. SendLinkCRC() simply calculates the CRC for the given message and sends the message along with the CRC using the normal SendLink(). RecvLinkCRC(), in turn, receives the message using the original RecvLink(), calculates its CRC, compares it with the received CRC and terminates the application if they are not equal. No provision was made for retransmission of messages in case of a CRC error, as the aim of our work was only to assess the effectiveness of CRCs in messages through fault injection.

The overhead of the CRCs is highly dependent on the application. The size of the messages and the communication/computation ratio are among the factors that influence it most. Table 2 shows the execution times for the two benchmarks with and without CRCs. Note that the input parameters of the applications were chosen to keep the execution time low and so allow the large number of experiments needed to collect statistically meaningful results. For π Calculation the difference is almost negligible because the communication/computation ratio is low, whereas for MATMULT, where big data chunks are transferred, the execution time grows by a factor of about 1.7 (from roughly 11 s to roughly 19 s).
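The behaviour of the two CRC primitives can be sketched as follows. This is not the PARIX code: the `Link` class is a hypothetical in-memory stand-in for SendLink()/RecvLink(), the CRC-32 from Python's `zlib` is an assumption (the paper does not name the polynomial), and an exception stands in for terminating the application. The CRC travels as a separate message, as the paper describes for its software implementation.

```python
import zlib
from collections import deque

class Link:
    """Hypothetical in-memory link: a FIFO of messages."""
    def __init__(self):
        self.queue = deque()

    def send(self, data: bytes) -> None:     # plays the role of SendLink()
        self.queue.append(bytes(data))

    def recv(self) -> bytes:                 # plays the role of RecvLink()
        return self.queue.popleft()

class CRCError(Exception):
    """Raised where RecvLinkCRC would terminate the application."""

def send_link_crc(link: Link, payload: bytes) -> None:
    link.send(payload)                                    # the message itself
    link.send(zlib.crc32(payload).to_bytes(4, "big"))     # extra CRC message

def recv_link_crc(link: Link) -> bytes:
    payload = link.recv()
    received_crc = int.from_bytes(link.recv(), "big")
    if zlib.crc32(payload) != received_crc:
        raise CRCError("CRC mismatch on received message")
    return payload

link = Link()
send_link_crc(link, b"partial sum")
print(recv_link_crc(link))
```

Sending the CRC separately doubles the number of messages, which is one reason the measured overhead depends so strongly on the communication pattern of the application.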
                 π Calculation    MATMULT
with CRCs        20 s 400 ms      19 s 143 ms
without CRCs     19 s 900 ms      11 s 33 ms

Table 2. Application overhead of CRCs
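The relative overhead implied by Table 2 can be worked out directly from the execution times. The snippet below is only a worked calculation on the table values; the helper names are hypothetical.

```python
def seconds(s, ms):
    """Convert an 'X s Y ms' table entry to seconds."""
    return s + ms / 1000.0

# Execution times copied from Table 2.
times = {
    "pi":      {"with_crc": seconds(20, 400), "without_crc": seconds(19, 900)},
    "matmult": {"with_crc": seconds(19, 143), "without_crc": seconds(11, 33)},
}

def overhead_percent(entry):
    """Extra execution time with CRCs, relative to the run without CRCs."""
    return 100.0 * (entry["with_crc"] - entry["without_crc"]) / entry["without_crc"]

for name, entry in times.items():
    print(f"{name}: {overhead_percent(entry):.1f}% overhead")
```

This gives about 2.5% for π Calculation and about 73.5% for MATMULT.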
Obviously, the CRC overhead is not negligible. This was expected, because the CRCs were implemented in software and the CRC itself was sent in an extra message to avoid memory manipulations (memory allocation and copying). If the CRC were calculated by hardware, as is usually the case, the overhead would be much lower. In fact, hardware support for CRC calculation is nowadays provided in some processors such as the T9000 [11]. The same benchmarks, MATMULT and π Calculation, were recompiled using the CRC communication primitives described above, and the experiments were repeated assuming the same fault model: one-byte duration faults with single bit flips. The summary of results is shown in Figure 3.
[Figure 3: two pie charts, one per benchmark (π Calculation, MATMULT), with the slices System hung, Correct results, Wrong results and Error detected. Wrong results is 0% in both charts; the remaining slice labels are 10%, 5%, 53%, 37%, 79% and 16%.]
Figure 3. Summary of results for MATMULT and π Calculation under communication faults
The results show that the CRCs effectively detected all the errors that caused the generation of wrong results in the previous experiments. Using the CRC communication primitives, the percentage of undetected errors was reduced to zero in our experiments. Of course, if errors occurring in the T805 transputer responsible for communications are also counted as communication subsystem errors, those occurring before the calculation of the CRC will not be detected. Nevertheless, these results show the important role that a simple mechanism like this can play in communication systems.
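The zero rate of undetected errors is consistent with a known property of CRCs: any single bit flip in the protected data or in the CRC itself changes the check. The sketch below verifies this exhaustively for one message. It assumes CRC-32 from Python's `zlib` (the paper does not name the polynomial), appends the CRC to the payload in one frame for simplicity, and does not model flips in packet headers, which the CRC in the paper does not cover. The function names are hypothetical.

```python
import zlib

def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Return a copy of message with exactly one bit inverted."""
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

def undetected_single_bit_flips(payload: bytes) -> int:
    """Count single bit flips in payload+CRC that the receiver check misses."""
    frame = payload + zlib.crc32(payload).to_bytes(4, "big")
    undetected = 0
    for bit in range(len(frame) * 8):
        damaged = flip_bit(frame, bit)
        body = damaged[:-4]
        received_crc = int.from_bytes(damaged[-4:], "big")
        if zlib.crc32(body) == received_crc:   # receiver would accept the frame
            undetected += 1
    return undetected

missed = undetected_single_bit_flips(bytes(range(64)))
print(missed)
```

Every one of the 544 possible single bit flips in this 68-byte frame is caught, so the count printed is 0.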
5. Conclusions
This paper addressed the evaluation of the dependability properties of distributed memory parallel systems through fault injection. It focused particularly on the communication subsystem of these machines, as it is a potential source of errors that can be as important as errors occurring in processor circuitry. A tool named CSFI (Communication Software Fault Injector) was used to inject faults in a commercial parallel machine based on the PowerPC and the T805 transputer, with no built-in fault handling mechanisms. Two series of fault injection experiments were performed: one using the original applications, and another using an application level CRC mechanism in the messages. The results show that some faults injected in the original system were not detected by any means and caused the generation of wrong results in up to 20% of the cases. All these cases correspond to situations in which it would be virtually impossible to detect that the benchmark output was erroneous. On the other hand, the fault injection experiments using CRCs show that this simple error detection mechanism was very effective, as no faults caused the generation of wrong results without being noticed. These results emphasise the need for robust communication protocols in parallel machines in order to achieve confidence in the application results. A hardware CRC generator seems to be a good start, as it can be easily included without affecting performance. The results also suggest that the current quest for performance in the parallel computing industry can only be effective if it is accompanied by dependability. Finally, fault injection has proved to be a valuable instrument for the validation of communication subsystems.
References
[1] "PowerParix 1.3 Reference Manual and User Manual", Parsytec GmbH, 1994.
[2] J. Arlat et al., "Fault injection for dependability validation: a methodology and some applications", IEEE Trans. on Software Eng., Vol. 16, No. 2, Feb. 1990, pp. 166-182.
[3] João Carreira, Henrique Madeira, João Gabriel Silva, "Assessing the Effects of Communication Faults on Parallel Applications", to be presented at IPDS'95, International Computer and Dependability Symposium, Erlangen, Germany, April 1995.
[4] João Carreira, Henrique Madeira, João Gabriel Silva, "Xception: Software Fault Injection and Monitoring in Processor Functional Units", Preprints of the DCCA-5, Beckman Institute, Urbana Champaign, Illinois, USA, pp. 135-149, 27-29 September 1995.
[5] João Carreira, "Software Fault Injection in Parallel Systems", MSc thesis, University of Coimbra, Portugal, July 1995.
[6] R. Chillarege and N. Bowen, "Understanding Large Systems Failures - A Fault Injection Experiment", Proc. 19th Int. Symp. Fault-Tolerant Computing, Chicago, June 1989, pp. 356-363.
[7] Roy-Chowdhury and P. Banerjee, "A Fault-Tolerant Algorithm for Iterative Solution of the Laplace Equation", Proc. of the Int. Conference on Parallel Processing, 1993, pp. II-133 to III-140.
[8] E. Czeck and D. Siewiorek, "Effects of transient gate-level faults on program behavior", FTCS-20, Newcastle Upon Tyne, June 1990, pp. 236-243.
[9] K. Echtle, M. Leu, "The EFA Fault Injector for Fault-Tolerant Distributed System Testing", in Workshop on Fault-Tolerant Parallel and Dist. Systems, pp. 28-35, 1992.
[10] S. Han, H. Rosenberg, K. Shin, "DOCTOR: an Integrated Software Fault Injection Environment", Technical Report, University of Michigan, 1993.
[11] "The T9000 Transputer Hardware Reference Manual", INMOS Limited, 1993.
[12] R. Iyer and D. Rossetti, "A measurement-based model for workload dependence of CPU errors", IEEE Trans. on Computers, Vol. C-35, pp. 511-519, June 1986.
[13] E. Jenn, J. Arlat, M. Rimén, J. Ohlsson, and J. Karlsson, "Fault Injection into VHDL Models: The MEFISTO tool", Proc. of FTCS-24, pp. 336-344, Austin, TX, USA, 1994.
[14] G. Kanawati, N. Kanawati, and J. Abraham, "FERRARI: A Tool for the Validation of System Dependability Properties", FTCS-22, Digest of Papers, IEEE, 1992, pp. 336-344.
[15] Wei-lun Kao, R. K. Iyer, "DEFINE: A Distributed Fault Injection and Monitoring Environment", Workshop on Fault-Tolerant Parallel and Distributed Systems, June 1994.
[16] J. Karlsson, P. Lidén, P. Dahlgren, R. Johansson, and U. Gunneflo, "Using Heavy-ion Radiation to Validate Fault-Handling Mechanisms", IEEE Micro, Vol. 14, No. 1, pp. 8-32, 1994.
[17] H. Madeira and J. G. Silva, "Experimental Evaluation of the Fail-silent Behaviour in Computers without Error Masking", FTCS-24, June 1994.
[18] H. Madeira, M. Rela, F. Moreira, J. Silva, "RIFLE: A General Purpose Pin-Level Fault Injector", Proc. First European Dependable Computing Conference, pp. 199-216, Berlin, Germany, October 1994.
[19] M. Rimen, J. Ohlsson, J. Torin, "On Microprocessor Error Behaviour Modelling", Proc. of the 24th Int. Symp. on Fault-Tolerant Computing (FTCS-24), pp. 76-85, Austin, TX, USA, 1994.
[20] Z. Segall, T. Lin, "FIAT: Fault Injection Based Automated Testing Environment", in Proc. 18th Int. Symp. Fault-Tolerant Computing, June 1988, pp. 102-107.
[21] D. P. Siewiorek and Robert S. Swarz, The Theory and Practice of Reliable Design, Digital Press, Educational Services, Digital Equipment Corporation, Bedford, Massachusetts, 1982.
[22] J. G. Silva, J. Carreira, F. Moreira, "ParLin: From a Centralized Tuple Space to Adaptive Hashing", Transputer Applications and Systems'94, pp. 91-104, IOS Press, 1994.
[23] "Transputer Technical Notes", INMOS Limited, Prentice Hall International, ISBN 0-13-929126, 1989.