Grand Challenges for Computing Science: Techniques for Identifying the Causes of Computer System Failure

Chris Johnson, Department of Computing Science, University of Glasgow.
[email protected], http://www.dcs.gla.ac.uk/~johnson

Background: The UK holds a leading strategic position in the design of safety-critical systems, with internationally respected groups in Bristol, London, Newcastle and York, amongst others. The majority of this work has focussed on support for the design, operation and maintenance of complex, tightly coupled applications. In contrast, very little is known about the causes of computer-related failure in safety-critical systems. This is regrettable for a number of reasons. Previous research into constructive design techniques has been based on very limited empirical data, so our leading teams have been forced to rely upon subjective expertise. The lack of analytical techniques for software-related failure also creates a concern that increasing numbers of safety-related failures are being 'blamed' on software that, in fact, fulfils its functional requirements (Leveson 2002, Johnson 2002).

Problems of Software-Related Mishap Analysis: It is important to emphasise that the lack of work on the failure of safety-critical systems does not only have an impact upon UK research. It also concerns the regulators and agencies that operate safety-critical software systems. For example, I recently collaborated on a post hoc analysis of the mission interruption that affected the NASA/European Space Agency Solar and Heliospheric Observatory (SOHO). Brevity prevents a detailed analysis of the software-related failures that led to this interruption. To summarise, communications problems between the spacecraft designers, the mission operations staff and the scientific teams led to transpositions in a gyroscope reconfiguration sequence. However, the lack of explicit techniques for analysing the causes of software-related failures led the joint ESA/NASA board to rely on techniques that had been developed in the US nuclear industry over 20 years ago, including the Failure Event Tree illustrated in Figure 1.
[Figure 1 appears here as an image: the joint investigation board's Failure Event Tree for the June 1998 SOHO mission interruption. Its legend classifies nodes as mishap events, direct factors, indirect factors and outcome events.]
Figure 1: Failure Event Tree for the NASA/ESA Mission Interruption (Courtesy: NASA SOHO JIB)
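To make the notation in Figure 1 concrete, the following sketch shows one way of reading such a diagram as a directed graph of typed nodes. The node categories come from the figure's legend; the labels, their classification and the Python representation itself are purely illustrative and are not drawn from the investigation board's report. The sketch also anticipates the criticism developed below: every contributory factor, including omissions such as reviews that never took place, has to be coerced into a discrete node that can be chained to other events.

```python
# Illustrative reading of the Failure Event Tree notation in Figure 1.
# Node categories are taken from the figure's legend; the labels below are
# abbreviated paraphrases for illustration only, not the board's exact wording.
from dataclasses import dataclass, field
from enum import Enum, auto


class NodeKind(Enum):
    MISHAP_EVENT = auto()
    DIRECT_FACTOR = auto()
    INDIRECT_FACTOR = auto()
    OUTCOME_EVENT = auto()


@dataclass
class Node:
    label: str
    kind: NodeKind
    contributes_to: list["Node"] = field(default_factory=list)


# In this style of diagram everything must be expressed as something that
# "happened"; omissions (e.g. the review that never took place) can only be
# forced into the same mould as pseudo-events.
no_review = Node("No independent review of the modified command sequence",
                 NodeKind.INDIRECT_FACTOR)
transposition = Node("Transposition in the gyroscope reconfiguration sequence",
                     NodeKind.DIRECT_FACTOR)
attitude_loss = Node("Loss of attitude control and telemetry",
                     NodeKind.OUTCOME_EVENT)

no_review.contributes_to.append(transposition)
transposition.contributes_to.append(attitude_loss)


def trace(node: Node, depth: int = 0) -> None:
    """Walk the tree from a factor towards the outcomes it contributes to."""
    print("  " * depth + f"[{node.kind.name}] {node.label}")
    for successor in node.contributes_to:
        trace(successor, depth + 1)


if __name__ == "__main__":
    trace(no_review)
```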
The flow-charting approach illustrated in this diagram is cumbersome and ill suited to accidents that involve complex systems. Prof. Leveson (MIT) has recently argued that such techniques bias investigations towards failures of commission, which can be represented as events, rather than omissions, such as the lack of code inspections. She has proposed alternative models of software failure that extend elements of control theory and which, therefore, view accidents as a failure to maintain proper constraints between system components. However, an initial analysis by a joint team from Glasgow and NASA has shown that, as yet, this approach cannot easily deal with rapid software reconfigurations of the type that occurred during the SOHO mission.

Traditional accident analysis techniques have also been adapted for use with software-related failures. For example, Ladkin's Why-Because Analysis provides a formal approach to incident analysis based on the Lewis semantics for counterfactual arguments. Again, however, there has been very limited success in applying this approach to complex incidents where coding errors are symptomatic of deeper problems in the regulation and management of safety-critical systems development. The Glasgow-NASA analysis mentioned above has also found problems in interpreting the notion of strict rather than material implication used in this approach; the distinction is sketched below. The meta-level point is that there is no nationally or internationally recognised method for analysing the failure of safety-critical software.

Problems of Software-Related Incident Reporting: The lack of techniques to support the analysis of software-related failures does not only affect high-profile scientific instruments such as the SOHO observatory. It also affects the daily operation of hundreds of thousands of programmable devices. Many standards and regulations encourage or oblige manufacturers and device operators to report the failure of safety-critical devices. ICAO Annex 13 governs the aviation industry. MDA directives govern healthcare devices. IEC 61508 is widely used as a standard for the development of safety-critical applications that incorporate computer systems. It includes the recommendation that manufacturers should: "…implement procedures which ensure that hazardous Mishaps (or Mishaps with potential to create hazards) are analysed, and that recommendations are made to minimise the probability of a repeat occurrence." (IEC, paragraph 6.2.1).

In a recent project with Adelard and the Health and Safety Executive on guidelines for the reporting of software-related failures under IEC 61508, we conducted interviews across the UK process industries. These identified the problems that many companies have in distinguishing software failures from failures of associated hardware, including sensing devices and the equipment under control. In other work, I have found examples of staff adopting potentially dangerous coping strategies. For example, the nurses in an intensive care unit chose to 'reboot' a device with the patient still attached. The intention was to return to a state that they were familiar with and which they recognised from the manufacturer's manual. They did not report the incident because they were unsure whether the unknown state that the system had entered was actually a 'feature' of the system or the result of a software-related failure. It is difficult to overestimate the consequences of such under-reporting.
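As noted above, part of the difficulty in applying Why-Because Analysis to cases like SOHO lies in how its causal relation should be read. The following is a minimal sketch of the distinction, using standard Lewis notation; it is a rough gloss rather than Ladkin's exact formulation. A candidate cause $C$ of an effect $E$ passes the counterfactual test when

\[
  \lnot C \;\Box\!\!\rightarrow\; \lnot E
  \qquad \text{(in the closest possible worlds in which $C$ does not occur, $E$ does not occur),}
\]

which is a strict, possible-worlds reading of the conditional rather than the material implication

\[
  \lnot C \rightarrow \lnot E
  \qquad \text{(vacuously true for any $C$ that actually occurred, whatever its causal role).}
\]

The counterfactual reading forces investigators to judge what would have happened had, for example, the reconfiguration sequence been independently reviewed; it is this kind of judgement, rather than any formal manipulation, that appears to underlie the interpretation problems noted above.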
Designers cannot elicit sufficient information to validate the risk assessments that underpin their safety cases. This undermines the critical feedback loops that are assumed to support systems developed under the standards and regulations mentioned above. The problems of under-reporting are, of course, exacerbated by the lack of proper analytical techniques: those failures that are reported are often incorrectly attributed to operator error or local configuration issues.

Why is this a Grand Challenge? This challenge is slightly different from some of the others that have been proposed on the workshop web site. It is driven by a practical need to support the investigation and analysis of software-related failures. However, it also poses considerable technical challenges. For instance, it raises questions about the nature of causation and of 'blame' in software development. Who is ultimately responsible for coding errors? Programmers often point to problems in the risk assessments that guide the allocation of resources to the verification and validation of safety-critical code. Conversely, the safety managers who direct risk assessment point to the systemic causes of software-related failures and to the difficulties of anticipating complex interactions between integrated subsystems. At higher levels of managerial and regulatory responsibility, questions can be raised about the oversight that might reasonably be expected from individuals who often do not possess the detailed technical knowledge needed to accurately identify the hazards facing many complex, software-related systems. It is difficult to overestimate the challenge of developing appropriate analytical techniques that might be used to identify and correct the complex latent
and catalytic failures that are, typically, distributed between the different layers of responsibility, from coder to regulator.

Summary: The UK has a leading position in the development of constructive techniques for the design of safety-critical software systems. In contrast, there is a lack of techniques for the analysis of software-related failures. The challenge is to take our leading work in design and see whether elements of it can be applied to understanding the causes of failure in programmable systems.

Acknowledgements: The ideas in this draft benefited from initial discussions with C.M. Holloway (NASA Langley Research Center) and Nancy Leveson (MIT). The SOHO study was supported by a NASA/ICASE Software Engineering fellowship funded by NASA contract NAS1-97046, Task 212.

References
• IEC 61508 (2000), Functional safety of electrical/electronic/programmable electronic safety-related systems, International Electrotechnical Commission. See http://www.iec.ch/61508 for further details.
• C.W. Johnson (2002, in press), A Handbook for the Reporting of Incidents and Accidents, Springer-Verlag, London, UK.
• N. Leveson (2002), A Systems Model of Accidents. In J.H. Wiggins and S. Thomason (eds), Proceedings of the 20th International System Safety Conference, 476-486, International Systems Safety Society, Unionville, USA.
• NASA/ESA (1998), SOHO Mission Interruption Joint NASA/ESA Investigation Board Final Report. Available from http://sohowww.nasa.gov/whatsnew/SOHO_final_report.html
• NASA (2001), NASA Procedures and Guidelines for Mishap Reporting, Investigating and Recordkeeping, Safety and Risk Management Division, NASA Headquarters, NASA PG 8621.1, Washington DC, USA, http://www.hq.nasa.gov/office/codeq/doctree/safeheal.htm.