Experimental Evaluation of Computer-Based Railway Control Systems

A. M. Amendola, L. Impagliazzo, P. Marmo, F. Poli
Ansaldo-Cris, Via Argine 425, Napoli, Italy
Email: {amendola, leoimp, marmo, poli}@cr.ansaldo.it

Abstract

Several factors are creating pressure for the enhancement of methodologies and techniques for the validation of computer-based railway control systems. These factors are related to the introduction of new technologies and equipment, the design of interoperable railway networks in Europe, and the new, strong competition in the market of railway products. This paper presents LIVE (Low-Intrusion Validation Environment), the validation environment developed at Ansaldo-Cris to experimentally evaluate the dependability of the new families of computer-based railway control systems. The methodological framework for LIVE is summarized. LIVE integrates fault injection and software testing techniques to achieve an accurate and non-intrusive analysis of a system prototype. Such evaluation is needed to ensure full compliance with the new dependability standards emerging for railway apparatus. The test results of a trial application are presented. These results highlight the importance of the quality of the test set and its influence on the final evaluation of system dependability.
1 Introduction
Recent years have seen demand for a great increase in the reliability and performance of control systems for railway and metro lines. This demand has required the transition from relays to computer-based systems, stressing the need for the design and safety assessment of completely new systems. The safety of railway systems is based on the fail-safe behavior of their components. This concept is well established for relays, which are characterized by defined failure modes, but it is hardly applicable to modern control systems because of the unpredictable behavior of microprocessors in the presence of faults. The CENELEC norms [1, 2, 3], which are being approved as European standards, require: 1) the use of verification and validation processes in all phases of the life-cycle of the system, and 2) the demonstration of compliance with quantitative safety targets that depend on the criticality of the system functions. These recommendations (coupled with the availability, at low
cost, of microprocessor devices and with the strong competition in the railway field, which highly values performance and safety) have brought about the definition and development of new methodologies for safety design and assessment. This paper presents the validation techniques and the environment developed at Ansaldo-Cris for the experimental evaluation of the ATC (Automatic Train Control) of the People Mover system. The work is organized as follows: Section 2 briefly presents the characteristics of railway signaling systems and the proposed methodology for experimental validation. The environment used for the experimental validation of railway systems (LIVE) is described in Section 3. Section 4 presents some experimental results from a simple case study. Section 5 discusses the results of the work and future activities.
2 Validation of railway control systems
Modern railway control systems, such as the interlocking systems for large or medium railway stations and high-speed lines, and the ATC for metro lines [6, 9, 10], perform signaling and automation functions. While these systems may differ deeply in the fault tolerance mechanisms adopted, in general they are based on redundant distributed architectures that make use of microprocessor boards [11]. In the validation methodology of these systems, two phases are of primary importance:
• dependability testing (fault removal), aimed at the discovery and correction of design errors in the management of faults in hardware-software system prototypes;
• dependability evaluation (fault forecasting), combining the experimental results with analytical models in order to evaluate the MTBHE (Mean Time Between Hazardous Events) [3, 4]; this parameter characterizes how often the system is forecast to produce outputs that can lead to catastrophic effects for people or equipment (a first-order sketch of this relation is given at the end of this section).
The norms prescribe minimum levels for the MTBHE, depending on the criticality of the function performed by the system. The CENELEC norms do not specify the techniques that must be used in this methodology. They only
prescribe that the process has to address all the issues of the validation and that evidence of the adequacy of the methodology adopted has to be provided. To fulfill these requirements, an environment is needed for the experimental validation of the system. The requirements of such an environment are as follows:
• All the domains of an experiment (input domain, time domain, fault domain) have to be addressed.
• Every function of both basic and application software has to be tested with "representative" inputs to remove software defects. As an operational profile [8] is not easy to evaluate, a structural analysis is also required (code coverage, decision coverage, etc.).
• Time measures, verification of synchronization, and multiprocessor analysis are needed to analyze real-time performance in redundant architectures.
• Fault injection experiments have to be executed for fault removal and to evaluate fault coverage and latencies (fault forecasting). Fault injection (FI) requires defining accurate fault models and techniques that allow injecting faults in any hardware component (CPU, memories, I/O devices, serial links, etc.) with as little intrusion as possible.
These requirements and the previous tools described in the literature (see [5]) were considered in the development of the new environment presented below, which is characterized by:
• low or no intrusion of the adopted fault injection and monitoring techniques,
• inclusion of static and dynamic software analysis in the evaluation of test quality and overall safety,
• tool integration with an up-to-date, user-friendly WWW interface,
• portability, with minimum effort, to various computer-based control systems.
The environment described in the following section is being used for the validation of the ATC for the People Mover project [6, 7], in its prototype phase [5].
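For illustration only (this formula is not from the paper, whose analytical models are more detailed): if faults arrive at rate λ and the detection and tolerance mechanisms cover a fraction c of them, a first-order estimate treats the uncovered faults as hazardous, giving

    \lambda_{haz} = \lambda \, (1 - c), \qquad
    \mathrm{MTBHE} \approx \frac{1}{\lambda_{haz}} = \frac{1}{\lambda \, (1 - c)}

For example, λ = 10^-4 faults/hour with a measured coverage c = 0.999 would give MTBHE ≈ 10^7 hours. This is why the fault coverage measured by fault injection (Section 4) enters directly into the safety figures required by the norms.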
3 LIVE validation environment
An overview of the structure of the environment is presented in Fig. 1 (gray blocks refer to Ansaldo-Cris components). A set of central resources is controlled, via Ethernet and serial links, by a SUN workstation that acts as the server of the environment. These central resources and all features of the environment are available to a number of users on client machines. The following elements of LIVE are independent of the specific target system:
• FIB (Fault Injection Board) - a specific programmable board capable of performing fault injection by two techniques: forcing logic values on the CPU bus and serial links (hardware fault injection), or generating an interrupt to activate a
software procedure (hybrid fault injection). Such a procedure, previously loaded in the Target System RAM, is used to modify any software-visible register (CPU, memory, I/O).
• Software Monitor/Debugger - used to read and write the processor registers and memory locations. To minimize the intrusion, the Software Monitor/Debugger is only used before or after the experiment execution.
Fig. 1: Structure of the LIVE environment
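As an illustration of the hybrid technique, the following is a minimal sketch of an injection procedure of the kind the FIB interrupt could activate; the names (inj_addr, inj_mask) and the handler structure are assumptions, not the actual LIVE code.

    /* Hypothetical sketch of a hybrid fault-injection procedure (C).
     * The routine is preloaded in the Target System RAM and installed
     * on the interrupt raised by the FIB; when triggered, it corrupts
     * one software-visible location by flipping selected bits. */
    #include <stdint.h>

    volatile uint8_t *inj_addr;  /* target memory cell or memory-mapped register */
    uint8_t inj_mask;            /* bit pattern to flip */

    void fib_injection_isr(void) /* entered via the interrupt vector table */
    {
        *inj_addr ^= inj_mask;   /* inject the fault */
    }

Because the handler runs only for the few cycles of one exclusive-or, the intrusion on the target workload is kept minimal, which is the point of the hybrid approach.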
WD Detected: the fault causes the program to exceed its time limits (execution time > maximum program execution time). This situation is not dangerous because the system watchdog is able to detect the anomaly.
Failure: the parser processes its input and assigns a wrong value to the output; no error is detected.
As mentioned before, for these experiments the fault is injected in the memory image: one byte is selected (with uniform distribution) within the bytes of code (and constants) of the parser, and one (50%), two (30%), or three (20%) bits are flipped (a sketch of this fault model is given below).

            Illegal  Access  Addr.  Divis.  Other   Emul.  Emul.
            Instr.   Error   Error  By 0    Excep.  1111   1010
  Input_06    13%      3%     38%    34%      7%    0.7%   4.3%
  Input_10    13%      3%     34%    32%      7%      6%     5%

Table 4: Percentage of exceptions for different inputs

The impact of single or multiple bit flips on the fault injection results is shown in Figure 2 for Input_10 (values have been normalized over the number of faults injected in each class). A single bit-flip fault is more likely to be tolerated (the percentage of Not Activated is the same over the three classes, as it depends on the input expression), while a multiple bit-flip fault is more likely to generate an exception.
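A minimal sketch of the fault model just described, under the assumption that the memory image is available to the experiment driver as a byte array (names are illustrative):

    /* Select one byte uniformly within the parser's code and constants,
     * then flip one (50%), two (30%) or three (20%) distinct bits.
     * The modulo bias of rand() is ignored for the purposes of the sketch. */
    #include <stdint.h>
    #include <stdlib.h>

    void inject_bit_flips(uint8_t *image, size_t code_len)
    {
        size_t byte = (size_t)rand() % code_len;      /* uniform byte choice */
        int r = rand() % 100;
        int nbits = (r < 50) ? 1 : (r < 80) ? 2 : 3;  /* 50/30/20 split */

        uint8_t mask = 0;
        int set = 0;
        while (set < nbits) {                         /* nbits distinct bits */
            uint8_t bit = (uint8_t)(1u << (rand() % 8));
            if (!(mask & bit)) { mask |= bit; set++; }
        }
        image[byte] ^= mask;                          /* flip the bits */
    }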
Fig. 2: Percentage of detection wrt fault model (fault classes: All, One Bit, Two Bits, Three Bits; outcomes: Tol. or Ign., HW Detected, SW Detected, WD Detected, Failure)
4.3 Upper bounds for failure rate
In Table 3, the row of Input_1to6 is calculated from the results of the FI experiments of the six inputs in the RITC with these rules: 1) for each fault in the set, the results of all inputs in the RITC are considered; 2) the result for Input_1to6 on the considered fault is:
• Failure if there is a failure for at least one input,
• WD Detected if there are no failures and at least one WD Detected,
• SW Detected if there are neither failures nor WD Detected and at least one SW Detected,
• HW Detected if there are neither failures nor other detections and at least one HW Detected,
• Tolerated if there are no failures or detections but at least one Tolerated,
• Not Activated if all are Not Activated.
In other words, the most severe outcome over the six inputs is taken (a code sketch of this combination rule is given at the end of this subsection). With these rules, in order to be a representative input test case also for FI experiments, the RITC should show:
• a lower percentage of Not Activated than any correct input in the input domain; this ensures that no correct input activates faults beyond those activated by the RITC, since such a fault could lead to a Failure that is not counted among the RITC Failures;
• a higher percentage of Failure than any correct input in the input domain; this yields an upper bound for the failure rate of the system running the parser, independent of the specific input;
• values for the other percentages that are representative of (not too far from) the characteristics of the system running the parser in the presence of a fault, whatever the input is.
All these points seem to be confirmed by the results of the CITC. The analysis should be confirmed by FI experiments on other programs, which are scheduled as future work.
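A minimal sketch of the combination rule above; the enum order encodes the severity precedence, so the six rules reduce to taking a maximum (names are illustrative):

    /* Combine the per-input outcomes of one injected fault into the
     * Input_1to6 result: the enum is ordered by increasing severity,
     * so the rules above amount to keeping the most severe outcome. */
    typedef enum {
        NOT_ACTIVATED = 0,
        TOLERATED,
        HW_DETECTED,
        SW_DETECTED,
        WD_DETECTED,
        FAILURE
    } outcome_t;

    outcome_t combine(const outcome_t per_input[], int n)
    {
        outcome_t worst = NOT_ACTIVATED;
        for (int i = 0; i < n; i++)
            if (per_input[i] > worst)
                worst = per_input[i];  /* most severe outcome so far */
        return worst;
    }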
4.4 Evaluation of fault dormancy and latency
In Figure 3, the cumulative distribution function for fault dormancy (percentage of faults activated versus time) is presented. The shape of the curves can be explained as follows:
• At the beginning, the parser prepares the computation and then executes each operation, processing the input expression according to algebraic rules (sharp start).
• When an already computed operator is re-executed, only a few other faults are activated (slow rise, especially for complex inputs); for instance, expression Input_07 (Table 1) starts with many "+" operators, so the curve rises very slowly.
• Once the result has been computed, it is stored in memory in ASCII format (sharp end).
Fig. 3: Cumulative distribution function for fault dormancy (curves exp01, exp06, exp07, exp10; time in µs)
The curves are strictly dependent on the complexity of the input test case. To limit fault dormancy, reducing its dependency on the inputs, on-line testing and data refresh are used [4, 7]. The cumulative probability distribution for the error latency of the hardware (exp-- HD) and software (exp-- SD) error-detection mechanisms is shown in Figure 4. Values have been normalized over the number of activated faults for each input.
Fig. 4: Latency for hardware (HD) and software (SD) error-detection mechanisms (curves exp01, exp06, exp07, exp10 for HD and SD; time in µs)
Error latencies depend mainly on the error-detection mechanisms. Hardware error-detection mechanisms have much lower latency times than software mechanisms. As expected, no meaningful difference can be observed in the latency distributions or values for the considered inputs, so the RITC should also be able to evaluate error latencies.
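The curves in Figures 3 and 4 are empirical cumulative distributions. For reference, a minimal sketch of how such a curve can be computed from the measured times, assuming they have been collected in an array:

    /* Empirical CDF of fault dormancy (or error latency): sort the
     * measured times; the i-th smallest value is plotted against (i+1)/n. */
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    void print_cdf(double *t_us, size_t n)
    {
        qsort(t_us, n, sizeof *t_us, cmp_double);
        for (size_t i = 0; i < n; i++)
            printf("%.1f\t%.3f\n", t_us[i], (double)(i + 1) / (double)n);
    }

For dormancy the times are normalized over all injected faults that were activated, while for latency they are normalized over the activated faults of each input, as stated above.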
5 Conclusion
The requirements for an environment for the experimental evaluation of railway signaling systems have been identified. To this purpose, the forthcoming CENELEC norms (the industrial practice), as well as the methodologies and techniques produced by the fault tolerance community, have been considered. LIVE, a validation environment that fulfills these requirements, has been described, and the example of a parser procedure has shown the impact of the input domain on dependability figures. Finally, simple rules have been defined to evaluate whether an input test set can be considered representative of the input domain for fault injection experiments. LIVE is now being used to validate the new architectures for railway control systems developed by Ansaldo Trasporti.
6 References
[1] EN 50126, "The Specification and Demonstration of Reliability, Availability, Maintainability and Safety (RAMS) for Railway Applications".
[2] EN 50128, "Railway Applications: Software for Railway Control and Protection Systems".
[3] EN 50129, "Railway Applications: Safety Related Railway Control and Protection Systems".
[4] A. M. Amendola et al., "Architecture and Safety Requirements of the ACC Railway Interlocking System", Proceedings of IPDS '96, pp. 21-29, September 1996.
[5] R. K. Iyer, "Experimental Evaluations", FTCS-25 Special Issue, pp. 117-132, June 1995.
[6] G. Mongardi, "Dependable Computing for Railway Control Systems", in DCCA-3, Springer-Verlag, Wien-New York.
[7] F. Corno et al., "On-Line Testing of an Off-the-shelf Microprocessor Board for Safety-critical Applications", in EDCC-2, Springer-Verlag, Wien, 1996.
[8] Chen et al., "A Case Study to Investigate Sensitivity of Reliability Estimates to Errors in Operational Profiles", Proc. of the 10th Software Reliability Symp., pp. 276-281, Nov. 1994.
[9] A. Hachiga et al., "The Design Concepts and Operational Results of Fault-Tolerant Computer Systems for the Shinkansen Train Control", FTCS-23, pp. 78-87, June 1993.
[10] Hennebert, G. Guiho, "SACEM: A Fault Tolerant System for Train Speed Control", FTCS-23, pp. 624-628, June 1993.
[11] J. Arlat et al., "Dependability of Railway Control Systems", Panel at FTCS-26, June 1996.
[12] R. K. Iyer and D. J. Rossetti, "Effect of System Workload on Operating System Reliability: A Study on IBM 3081", IEEE Trans. Soft. Eng., Vol. SE-11, No. 12, pp. 1438-1448.