Grand Challenges for Computing Science: Techniques for Identifying the Causes of Computer System Failure

Chris Johnson, Department of Computing Science, University of Glasgow.
[email protected], http://www.dcs.gla.ac.uk/~johnson

Background: The UK holds a leading strategic position in the design of safety-critical systems, with internationally respected groups in Bristol, London, Newcastle and York, amongst others. The majority of this work has focussed on support for the design, operation and maintenance of complex, tightly coupled applications. In contrast, very little is known about the causes of computer-related failure in safety-critical systems. This is regrettable for a number of reasons. Previous research into constructive design techniques has been based on very limited empirical data, so our leading teams have been forced to rely upon subjective expertise. The lack of analytical techniques for software-related failure also creates a concern that increasing numbers of safety-related failures are being 'blamed' on software that, in fact, fulfils its functional requirements (Leveson 2002, Johnson 2002).

Problems of Software-Related Mishap Analysis: It is important to emphasise that the lack of work on the failure of safety-critical systems does not only have an impact upon UK research. It also concerns the regulators and agencies that operate safety-critical software systems. For example, I recently collaborated on a post hoc analysis of the mission interruption that affected the NASA/European Space Agency Solar and Heliospheric Observatory (SOHO). Brevity prevents a detailed analysis of the software-related failures that led to this interruption. To summarise, communications problems between the spacecraft designers, the mission operations staff and the scientific teams led to transpositions in a gyroscope reconfiguration sequence. However, the lack of explicit techniques for analysing the causes of software-related failures led the joint ESA/NASA board to rely on techniques that had been developed in the US nuclear industry over 20 years ago, including the Failure Event Tree illustrated in Figure 1.
[Figure 1 appears here as an image: the joint investigation board's Failure Event Tree for the June 1998 SOHO mission interruption. Its legend classifies nodes as mishap events, direct factors, indirect factors and outcome events.]
Figure 1: Failure Event Tree for the NASA/ESA Mission Interruption (Courtesy: NASA SOHO JIB)
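To make the notation in Figure 1 concrete, the following sketch shows one way of reading such a diagram as a directed graph of typed nodes. The node categories come from the figure's legend; the labels, their classification and the Python representation itself are purely illustrative and are not drawn from the investigation board's report. The sketch also anticipates the criticism developed below: every contributory factor, including omissions such as reviews that never took place, has to be coerced into a discrete node that can be chained to other events.

```python
# Illustrative reading of the Failure Event Tree notation in Figure 1.
# Node categories are taken from the figure's legend; the labels below are
# abbreviated paraphrases for illustration only, not the board's exact wording.
from dataclasses import dataclass, field
from enum import Enum, auto


class NodeKind(Enum):
    MISHAP_EVENT = auto()
    DIRECT_FACTOR = auto()
    INDIRECT_FACTOR = auto()
    OUTCOME_EVENT = auto()


@dataclass
class Node:
    label: str
    kind: NodeKind
    contributes_to: list["Node"] = field(default_factory=list)


# In this style of diagram everything must be expressed as something that
# "happened"; omissions (e.g. the review that never took place) can only be
# forced into the same mould as pseudo-events.
no_review = Node("No independent review of the modified command sequence",
                 NodeKind.INDIRECT_FACTOR)
transposition = Node("Transposition in the gyroscope reconfiguration sequence",
                     NodeKind.DIRECT_FACTOR)
attitude_loss = Node("Loss of attitude control and telemetry",
                     NodeKind.OUTCOME_EVENT)

no_review.contributes_to.append(transposition)
transposition.contributes_to.append(attitude_loss)


def trace(node: Node, depth: int = 0) -> None:
    """Walk the tree from a factor towards the outcomes it contributes to."""
    print("  " * depth + f"[{node.kind.name}] {node.label}")
    for successor in node.contributes_to:
        trace(successor, depth + 1)


if __name__ == "__main__":
    trace(no_review)
```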
The flow-charting approach illustrated in this diagram is cumbersome and ill suited to accidents that involve complex systems. Prof. Leveson (MIT) has recently argued that such techniques bias investigations towards failures of commission, which can be represented as events, rather than omissions, such as the lack of code inspections. She has proposed alternative models of software failure that extend elements of control theory and which, therefore, view accidents as a failure to maintain proper constraints between system components. However, an initial analysis by a joint team from Glasgow and NASA has shown that, as yet, this approach cannot easily deal with rapid software reconfigurations of the type that occurred during the SOHO mission.

Traditional accident analysis techniques have also been adapted for use with software-related failures. For example, Ladkin's Why-Because Analysis provides a formal approach to incident analysis based on the Lewis semantics for counterfactual arguments. Again, however, there has been very limited success in applying this approach to complex incidents where coding errors are symptomatic of deeper problems in the regulation and management of safety-critical systems development. The Glasgow-NASA analysis mentioned above has also found problems in interpreting the notion of strict rather than material implication used in this approach; the distinction is sketched below. The meta-level point is that there is no nationally or internationally recognised method for analysing the failure of safety-critical software.

Problems of Software-Related Incident Reporting: The lack of techniques to support the analysis of software-related failures does not only affect high-profile scientific instruments such as the SOHO observatory. It also affects the daily operation of hundreds of thousands of programmable devices. Many standards and regulations encourage or oblige manufacturers and device operators to report the failure of safety-critical devices. ICAO Annex 13 governs the aviation industry. MDA directives govern healthcare devices. IEC 61508 is widely used as a standard for the development of safety-critical applications that incorporate computer systems. It includes the recommendation that manufacturers should: "…implement procedures which ensure that hazardous Mishaps (or Mishaps with potential to create hazards) are analysed, and that recommendations are made to minimise the probability of a repeat occurrence." (IEC, paragraph 6.2.1).

In a recent project with Adelard and the Health and Safety Executive on guidelines for the reporting of software-related failures under IEC 61508, we conducted interviews across the UK process industries. These identified the problems that many companies have in distinguishing software failures from failures of associated hardware, including sensing devices and the equipment under control. In other work, I have found examples of staff adopting potentially dangerous coping strategies. For example, the nurses in an intensive care unit chose to 'reboot' a device with the patient still attached. The intention was to return to a state that they were familiar with and which they recognised from the manufacturer's manual. They did not report the incident because they were unsure whether the unknown state that the system had entered was actually a 'feature' of the system or the result of a software-related failure. It is difficult to overestimate the consequences of such under-reporting.
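As noted above, part of the difficulty in applying Why-Because Analysis to cases like SOHO lies in how its causal relation should be read. The following is a minimal sketch of the distinction, using standard Lewis notation; it is a rough gloss rather than Ladkin's exact formulation. A candidate cause $C$ of an effect $E$ passes the counterfactual test when

\[
  \lnot C \;\Box\!\!\rightarrow\; \lnot E
  \qquad \text{(in the closest possible worlds in which $C$ does not occur, $E$ does not occur),}
\]

which is a strict, possible-worlds reading of the conditional rather than the material implication

\[
  \lnot C \rightarrow \lnot E
  \qquad \text{(vacuously true for any $C$ that actually occurred, whatever its causal role).}
\]

The counterfactual reading forces investigators to judge what would have happened had, for example, the reconfiguration sequence been independently reviewed; it is this kind of judgement, rather than any formal manipulation, that appears to underlie the interpretation problems noted above.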
Designers cannot elicit sufficient information to validate the risk assessments that underpin their safety cases. This undermines the critical feedback loops that are assumed to support systems developed under the standards and regulations mentioned above. The problems of under-reporting are, of course, exacerbated by the lack of proper analytical techniques: those failures that are reported are often incorrectly attributed to operator error or local configuration issues.

Why is this a Grand Challenge? This challenge is slightly different from some of the others that have been proposed on the workshop web site. It is driven by a practical need to support the investigation and analysis of software-related failures. However, it also poses considerable technical challenges. For instance, it raises questions about the nature of causation and of 'blame' in software development. Who is ultimately responsible for coding errors? Programmers often point to problems in the risk assessments that guide the allocation of resources to the verification and validation of safety-critical code. Conversely, the safety managers who direct risk assessment point to the systemic causes of software-related failures and to the difficulties of anticipating complex interactions between integrated subsystems. At higher levels of managerial and regulatory responsibility, questions can be raised about the oversight that might reasonably be expected from individuals who often do not possess the detailed technical knowledge needed to accurately identify the hazards facing many complex, software-related systems. It is difficult to overestimate the challenge of developing appropriate analytical techniques that might be used to identify and correct the complex latent
and catalytic failures that are, typically, distributed between the different layers of responsibility, from coder to regulator.

Summary: The UK has a leading position in the development of constructive techniques for the design of safety-critical software systems. In contrast, there is a lack of techniques for the analysis of software-related failures. The challenge is to take our leading work in design and see whether elements of it can be applied to understanding the causes of failure in programmable systems.

Acknowledgements: The ideas in this draft benefited from initial discussions with C.M. Holloway (NASA Langley Research Center) and Nancy Leveson (MIT). The SOHO study was supported by a NASA/ICASE Software Engineering fellowship funded by NASA contract NAS1-97046, Task 212.

References
• IEC 61508 (2000), Functional safety of electrical/electronic/programmable electronic safety-related systems, International Electrotechnical Commission. See http://www.iec.ch/61508 for further details.
• C.W. Johnson (2002, in press), A Handbook for the Reporting of Incidents and Accidents, Springer-Verlag, London, UK.
• N. Leveson (2002), A Systems Model of Accidents. In J.H. Wiggins and S. Thomason (eds), Proceedings of the 20th International System Safety Conference, 476-486, International Systems Safety Society, Unionville, USA.
• NASA/ESA (1998), SOHO Mission Interruption Joint NASA/ESA Investigation Board Final Report. Available from http://sohowww.nasa.gov/whatsnew/SOHO_final_report.html
• NASA (2001), NASA Procedures and Guidelines for Mishap Reporting, Investigating and Recordkeeping, Safety and Risk Management Division, NASA Headquarters, NASA PG 8621.1, Washington DC, USA, http://www.hq.nasa.gov/office/codeq/doctree/safeheal.htm.