Failure in Safety-Critical Systems:
A HANDBOOK OF INCIDENT AND ACCIDENT REPORTING
Chris Johnson
Glasgow University Press
Glasgow University Press, Publicity Services, No. 2 The Square, University of Glasgow, Glasgow, G12 8QQ, Scotland.

Copyright © 2003 by C.W. Johnson. All rights reserved. No part of this manuscript may be reproduced in any form, by photostat, microform, retrieval system, or any other means without the prior written permission of the author. First printed October 2003. ISBN 0-85261-784-4.
Contents

1 Abnormal Incidents
   1.1 The Hazards
      1.1.1 The Likelihood of Injury and Disease
      1.1.2 The Costs of Failure
   1.2 Social and Organisational Influences
      1.2.1 Normal Accidents?
      1.2.2 The Culture of Incident Reporting
   1.3 Summary

2 Motivations for Incident Reporting
   2.1 The Strengths of Incident Reporting
   2.2 The Weaknesses of Incident Reporting
   2.3 Different Forms of Reporting Systems
      2.3.1 Open, Confidential or Anonymous?
      2.3.2 Scope and Level
   2.4 Summary

3 Sources of Failure
   3.1 Regulatory Failures
      3.1.1 Incident Reporting to Inform Regulatory Intervention
      3.1.2 The Impact of Incidents on Regulatory Organisations
   3.2 Managerial Failures
      3.2.1 The Role of Management in Latent and Catalytic Failures
      3.2.2 Safety Management Systems
   3.3 Hardware Failures
      3.3.1 Acquisition and Maintenance Effects on Incident Reporting
      3.3.2 Source, Duration and Extent
   3.4 Software Failures
      3.4.1 Failure Throughout the Lifecycle
      3.4.2 Problems in Forensic Software Engineering
   3.5 Human Failures
      3.5.1 Individual Characteristics and Performance Shaping Factors
      3.5.2 Slips, Lapses and Mistakes
   3.6 Team Factors
      3.6.1 Common Ground and Group Communication
      3.6.2 Situation Awareness and Crew Resource Management
   3.7 Summary

4 The Anatomy of Incident Reporting
   4.1 Different Roles
      4.1.1 Reporters
      4.1.2 Initial Receivers
      4.1.3 Incident Investigators
      4.1.4 Safety Managers
      4.1.5 Regulators
   4.2 Different Anatomies
      4.2.1 Simple Monitoring Architectures
      4.2.2 Regulated Monitoring Architectures
      4.2.3 Local Oversight Architectures
      4.2.4 Gatekeeper Architecture
      4.2.5 Devolved Architecture
   4.3 Summary

5 Detection and Notification
   5.1 `Incident Starvation' and the Problems of Under-Reporting
      5.1.1 Reporting Bias
      5.1.2 Mandatory Reporting
      5.1.3 Special Initiatives
   5.2 Encouraging the Detection of Incidents
      5.2.1 Automated Detection
      5.2.2 Manual Detection
   5.3 Form Contents
      5.3.1 Sample Incident Reporting Forms
      5.3.2 Providing Information to the Respondents
      5.3.3 Eliciting Information from Respondents
   5.4 Summary

6 Primary Response
   6.1 Safeguarding the System
      6.1.1 First, Do No Harm
      6.1.2 Incident and Emergency Management
   6.2 Acquiring Evidence
      6.2.1 Automated Logs and Physical Evidence
      6.2.2 Eye-Witness Statements
   6.3 Drafting A Preliminary Report
      6.3.1 Organisational and Managerial Barriers
      6.3.2 Technological Support
      6.3.3 Links to Subsequent Analysis
   6.4 Summary

7 Secondary Investigation
   7.1 Gathering Evidence about Causation
      7.1.1 Framing an Investigation
      7.1.2 Commissioning Expert Witnesses
      7.1.3 Replaying Automated Logs
   7.2 Gathering Evidence about Consequences
      7.2.1 Tracing Immediate and Long-Term Effects
      7.2.2 Detecting Mitigating Factors
      7.2.3 Identifying Related Incidents
   7.3 Summary

8 Computer-Based Simulation
   8.1 Why Bother with Reconstruction?
      8.1.1 Coordination
      8.1.2 Generalisation
      8.1.3 Resolving Ambiguity
   8.2 Types of Simulation
      8.2.1 Declarative Simulations
      8.2.2 Animated Simulations
      8.2.3 Subjunctive Simulations
      8.2.4 Hybrid Simulations
   8.3 Summary

9 Modelling Notations
   9.1 Reconstruction Techniques
      9.1.1 Graphical Time Lines
      9.1.2 Fault Trees
      9.1.3 Petri Nets
      9.1.4 Logic
   9.2 Requirements for Reconstructive Modelling
      9.2.1 Usability
      9.2.2 Expressiveness
   9.3 Summary

10 Causal Analysis
   10.1 Introduction
      10.1.1 Why Bother With Causal Analysis?
      10.1.2 Potential Pitfalls
      10.1.3 Loss of the Mars Climate Orbiter & Polar Lander
   10.2 Stage 1: Incident Modelling (Revisited)
      10.2.1 Events and Causal Factor Charting
      10.2.2 Barrier Analysis
      10.2.3 Change Analysis
   10.3 Stage 2: Causal Analysis
      10.3.1 Causal Factors Analysis
      10.3.2 Cause and Contextual Summaries
      10.3.3 Tier Analysis
      10.3.4 Non-Compliance Analysis
   10.4 Summary

11 Alternative Causal Analysis Techniques
   11.1 Event-Based Approaches
      11.1.1 Multilinear Events Sequencing (MES)
      11.1.2 Sequentially Timed and Events Plotting (STEP)
   11.2 Check-List Approaches
      11.2.1 Management Oversight and Risk Tree (MORT)
      11.2.2 Prevention and Recovery Information System for Monitoring and Analysis (PRISMA)
      11.2.3 Tripod
   11.3 Mathematical Models of Causation
      11.3.1 Why-Because Analysis (WBA)
      11.3.2 Partition Models for Probabilistic Causation
      11.3.3 Bayesian Approaches to Probabilistic Causation
   11.4 Comparisons
      11.4.1 Bottom-Up Case Studies
      11.4.2 Top-Down Criteria
      11.4.3 Experiments into Domain Experts' Subjective Responses
      11.4.4 Experimental Applications of Causal Analysis Techniques
   11.5 Summary

12 Recommendations
   12.1 From Causal Findings to Recommendations
      12.1.1 Requirements for Causal Findings
      12.1.2 Scoping Recommendations
      12.1.3 Conflicting Recommendations
   12.2 Recommendation Techniques
      12.2.1 The `Perfectability' Approach
      12.2.2 Heuristics
      12.2.3 Enumerations and Recommendation Matrices
      12.2.4 Generic Accident Prevention Models
      12.2.5 Risk Assessment Techniques
   12.3 Process Issues
      12.3.1 Documentation
      12.3.2 Validation
      12.3.3 Implementation
      12.3.4 Tracking
   12.4 Summary

13 Feedback and the Presentation of Incident Reports
   13.1 The Challenges of Reporting Adverse Occurrences
      13.1.1 Different Reports for Different Incidents
      13.1.2 Different Reports for Different Audiences
      13.1.3 Confidentiality, Trust and the Media
   13.2 Guidelines for the Presentation of Incident Reports
      13.2.1 Reconstruction
      13.2.2 Analysis
      13.2.3 Recommendations
   13.3 Quality Assurance
      13.3.1 Verification
      13.3.2 Validation
   13.4 Electronic Presentation Techniques
      13.4.1 Limitations of Existing Approaches to Web-Based Reports
      13.4.2 Using Computer Simulations as an Interface to On-Line Accident Reports
      13.4.3 Using Time-lines as an Interface to Accident Reports
   13.5 Summary

14 Dissemination
   14.1 Problems of Dissemination
      14.1.1 Number and Range of Reports Published
      14.1.2 Tight Deadlines and Limited Resources
      14.1.3 Reaching the Intended Readership
   14.2 From Manual to Electronic Dissemination
      14.2.1 Anecdotes, Internet Rumours and Broadcast Media
      14.2.2 Paper Documents
      14.2.3 Fax and Telephone Notification
   14.3 Computer-Based Dissemination
      14.3.1 Infrastructure Issues
      14.3.2 Access Control
      14.3.3 Security and Encryption
      14.3.4 Accessibility
   14.4 Computer-Based Search and Retrieval
      14.4.1 Relational Data Bases
      14.4.2 Lexical Information Retrieval
      14.4.3 Case Based Retrieval
   14.5 Summary

15 Monitoring
   15.1 Outcome Measures
      15.1.1 Direct Feedback: Incident and Reporting Rates
      15.1.2 Indirect Feedback: Training and Operations
      15.1.3 Feed-forward: Risk Assessment and Systems Development
   15.2 Process Measures
      15.2.1 Submission Rates and Reporting Costs
      15.2.2 Investigator Performance
      15.2.3 Intervention Measures
   15.3 Acceptance Measures
      15.3.1 Safety Culture and Safety Climate?
      15.3.2 Probity and Equity
      15.3.3 Financial Support
   15.4 Monitoring Techniques
      15.4.1 Public Hearings, Focus Groups, Working Parties and Standing Committees
      15.4.2 Incident Sampling
      15.4.3 Sentinel Systems
      15.4.4 Observational Studies
      15.4.5 Statistical Analysis
      15.4.6 Electronic Visualisation
      15.4.7 Experimental Studies
   15.5 Summary

16 Conclusions
   16.1 Human Problems
      16.1.1 Reporting Biases
      16.1.2 Blame
      16.1.3 Analytical Bias
   16.2 Technical Problems
      16.2.1 Poor Investigatory and Analytical Procedures
      16.2.2 Inadequate Risk Assessments
      16.2.3 Causation and the Problems of Counter-Factual Reasoning
      16.2.4 Classification Problems
   16.3 Managerial Problems
      16.3.1 Unrealistic Expectations
      16.3.2 Reliance on Reminders and Quick Fixes
      16.3.3 Flaws in the Systemic View of Failure
   16.4 Summary
List of Figures

1.1 Components of Systems Failure
1.2 Process of Systems Failure
1.3 Normal and Abnormal States

2.1 Federal Railroad Administration Safety Iceberg
2.2 Heinrich Ratios for US Aviation (NTSB and ASRS Data)

3.1 Levels of Reporting and Monitoring in Safety Critical Applications
3.2 Costs Versus Maintenance Interval
3.3 Failure Probability Distribution for Hardware Devices
3.4 Cognitive Influences in Decision Making and Control
3.5 Cognitive Influences on Group Decision Making and Control
3.6 Influences on Group Performance

4.1 A Simple Monitoring Architecture
4.2 Regulated Monitoring Reporting System
4.3 Local Oversight Reporting System
4.4 Gatekeeper Reporting System
4.5 Devolved Reporting System

5.1 Generic Phases in Incident Reporting Systems
5.2 Accident and Incident Rates for Rejected Takeoff Overruns
5.3 CHIRP and ASRS Publications
5.4 Web Interface to the CHSIB Incident Collection
5.5 Incident Reporting Form for a UK Neonatal Intensive Care Unit [119]
5.6 ASRS Reporting Form for Air Traffic Control Incidents (January 2000)
5.7 The CIRS Reporting System [756]

6.1 Generic Phases in Incident Reporting Systems
6.2 US Army Preliminary Incident/Accident Checklist
6.3 Interview Participation Diagram
6.4 US Army Incident/Accident Reporting Procedures
6.5 US Army Preliminary Incident/Accident Telephone Reports
6.6 US Army Aviation and Missile Command Preliminary Incident Form

8.1 Imagemap Overview of the Herald of Free Enterprise
8.2 Imagemap Detail of the Herald of Free Enterprise
8.3 QuicktimeVR Simulation of a Boeing 757
8.4 QuicktimeVR Simulation of Lukas Spreaders
8.5 VRML Simulation of Building Site Incidents
8.6 NTSB Simulation of the Bus Accident (HWY-99-M-H017)
8.7 3 Dimensional Time-line Using DesktopVR
8.8 Overview of Perspective Wall Using DesktopVR
8.9 Detail of Perspective Wall Using DesktopVR
8.10 Graphical Modelling Using Boeing's EASY5 Tool
8.11 NTSB Simulated Crash Pulse Of School Bus and Truck Colliding
8.12 NTSB Simulation of Motor Vehicle Accident, Wagner Oklahoma
8.13 Biomechanical Models in NTSB Incident Simulations (1)
8.14 Biomechanical Models in NTSB Incident Simulations (2)
8.15 Multi-User Air Traffic Control (Datalink) Simulation
8.16 EUROCONTROL Proposals for ATM Incident Simulation
8.17 US National Crash Analysis Centre's Simulation of Ankle Injury in Automobile Accidents
8.18 Integration of MIIU Plans, Models, Maps and Photographs
8.19 NTSB Use of Simulations in Incident Reports

9.1 Graphical Time-line Showing Initial Regulatory Background
9.2 Graphical Time-line Showing Intermediate Regulatory Background
9.3 Graphical Time-line Showing Immediate Regulatory Background
9.4 Graphical Time-line of Events Surrounding the Allentown Explosion
9.5 Graphical Time-line of the Allentown Explosion
9.6 Two-Axis Time-line of the Allentown Explosion
9.7 Fault Tree Components
9.8 A Simple Fault Tree for Design
9.9 Simplified Fault Tree Representing Part of the Allentown Incident
9.10 Fault Tree Showing Events Leading to Allentown Explosion
9.11 Using Inhibit Gates to Represent Alternative Scenarios
9.12 Using House Events to Represent Alternative Scenarios
9.13 Fault Tree Showing Post-Explosion Events
9.14 Fault Tree Showing NTSB Conclusions about the Causes of the Explosion
9.15 Fault Tree Showing Conclusions about Injuries and Loss of Life
9.16 Fault Tree Showing Conclusions about Reliability of Excess Flow Valves
9.17 Petri Net of Initial Events in the Allentown Incident
9.18 A Petri Net With Multiple Tokens
9.19 A Petri Net Showing Catalytic Transition
9.20 A Petri Net Showing Conflict
9.21 A Petri Net With An Inhibitor Avoiding Conflict
9.22 A Sub-Net Showing Crew Interaction
9.23 A Sub-Net Showing Alternative Reasons for the Foreman's Decision
9.24 High-Level CAE Diagram for the Allentown Incident
9.25 Representing Counter Arguments in a CAE Diagram (1)
9.26 Representing Counter Arguments in a CAE Diagram (2)
9.27 Representing Counter Arguments in a CAE Diagram (3)
9.28 High-Level CAE Diagram Integrating Formal and Informal Material
9.29 Extended CAE Diagram Integrating Formal and Informal Material (1)
9.30 Extended CAE Diagram Integrating Formal and Informal Material (2)
9.31 Subjective Responses to Modelling Notations
9.32 Subjective Responses to Logic-Based Reconstruction: How Easy Did You Find It to Understand the Logic-Based Model?
9.33 Qualitative Assessments of CAE-Based Diagrams: How Easy Did You Find It to Understand the CAE Diagram?
9.34 Qualitative Assessments of Hybrid Approach
9.35 Allentown Fault Tree Showing Pre- and Post-Incident Events
9.36 Cross-Referencing Problems in Incident Reports
9.37 Using a Petri Net to Build a Coherent Model of Concurrent Events
9.38 Lack of Evidence, Imprecise Timings and Time-lines
9.39 Continuous Changes and Time-lines
9.40 Using Petri Nets to Represent Different Versions of Events
9.41 Annotating Petri Nets to Resolve Apparent Contradictions
9.42 Representing the Criticality of Distal Causes
9.43 Representing the Impact of Proximal Causes
9.44 Representing the Impact of Mitigating Factors
9.45 Representing Impact in a Causal Analysis

10.1 Overview of the Dept. of Energy's `Core' Techniques
10.2 Simplified Structure of an ECF Chart
10.3 Components of ECF Chart
10.4 High-Level ECF Chart for the Mars Climate Orbiter (MCO)
10.5 Angular Momentum Desaturation Events Affect MCO Navigation
10.6 High-Level ECF Chart for the Mars Polar Lander (MPL)
10.7 Premature MPL Engine Shut-Down and DS2 Battery Failure
10.8 Integrating the Products of Barrier Analysis into ECF Charts
10.9 Process Barriers Fail to Protect the Climate Orbiter
10.10 Process Barriers Fail to Protect the Climate Orbiter (2)
10.11 Process Barriers Fail to Protect the Climate Orbiter (3)
10.12 Technological Barriers Fail to Protect the Climate Orbiter
10.13 Technological Barriers Fail to Protect the Climate Orbiter (2)
10.14 Integrating Change Analysis into an ECF Chart
10.15 Representing Staffing Limitations within an ECF Chart
10.16 Representing Risk Management Issues within an ECF Chart
10.17 Representing Technological Issues within an ECF Chart (1)
10.18 Representing Technological Issues within an ECF Chart (2)
10.19 Using Change Analysis to Collate Contextual Conditions
10.20 Integrating Development Issues into an ECF Chart (1)
10.21 Integrating Development Issues into an ECF Chart (2)
10.22 Integrating Review Issues into an ECF Chart
10.23 An ECF Chart of the Deep Space 2 Mission Failure
10.24 An ECF Chart of the Polar Lander Mission Failure
10.25 An ECF Chart of the Climate Orbiter Mission Failure
10.26 NASA Headquarters' Office of Space Science [569]
10.27 JPL Space and Earth Sciences Programmes Directorate [569]

11.1 Abstract View of A Multilinear Events Sequence (MES) Diagram
11.2 An Initial Multilinear Events Sequence (MES) Diagram
11.3 A MES Flowchart showing Conditions in the Nanticoke Case Study
11.4 A MES Flowchart showing Causation in the Nanticoke Case Study
11.5 Causal Relationships in STEP Matrices
11.6 STEP Matrix for the Nanticoke Case Study
11.7 The Mini-MORT Diagram
11.8 A Causal Tree of the Nanticoke Case Study
11.9 The Eindhoven Classification Model [840]
11.10 Classification Model for the Medical Domain [844]
11.11 The Three Legs of Tripod
11.12 Tripod-Beta Event Analysis of the Nanticoke Incident (1)
11.13 Tripod-Beta Event Analysis of the Nanticoke Incident (2)
11.14 Why-Because Graph Showing Halon Discharge
11.15 Why-Because Graph for the Nanticoke Alarm
11.16 Overview of the Why-Because Graph for the Nanticoke Incident
11.17 Possible `Normative' Worlds for the Nanticoke Incident
11.18 Bayesian Network Model for the Nanticoke Fuel Source
11.19 Causal Tree from McElroy's Evaluation of PRIMA (1)
11.20 Causal Tree from McElroy's Evaluation of PRIMA (2)

13.1 Simplified Flowchart of Report Generation Based on [631]
13.2 Data and Claims in the Navimar Case Study
13.3 Qualification and Rebuttal in the Navimar Case Study
13.4 More Complex Applications of Toulmin's Model
13.5 Snowdon's Tool for Visualising Argument in Incident Reports (1)
13.6 Snowdon's Tool for Visualising Argument in Incident Reports (2)
13.7 MAIB On-Line Feedback Page
13.8 MAIB Safety Digest (HTML Version)
13.9 ATSB Incident Report (PDF Version)
13.10 NTSB Incident Report (PDF Version)
13.11 Douglas Melvin's Simulation Interface to Rail Incident Report (VRML Version)
13.12 James Farrel's Simulation Interface to Aviation Incident Report (VRML Version)
13.13 Peter Hamilton's Cross-Reference Visualisation (VRML Version)

14.1 Perceived `Ease of Learning' in a Regional Fire Brigade
14.2 The Heavy Rescue Vehicle Training Package
14.3 Overview of the MAUDE Relations
14.4 The MAUDE User Interface
14.5 Precision and Recall
14.6 Components of a Semantic Network
14.7 Semantic Network for an Example MAUDE Case
14.8 Using a Semantic Network to Model Stereotypes
14.9 US Naval Research Laboratory's Conversational Decision Aids Environment

15.1 Static Conventional Visualisation of SPAD Severity by Year
15.2 Static `Column' Visualisation of SPAD Severity by Year
15.3 Static `Radar' Visualisation of SPAD Severity by Year
15.4 Computer-Based Visualisation of SPAD Data
15.5 Signal Detail in SPAD Visualisation
15.6 Dynamic Queries in the Visualisation of SPAD Incidents
15.7 Eccentric Labelling in the Visualisation of SPAD Incidents
Preface

Incident reporting systems have been proposed as means of preserving safety in many industries, including aviation [308], chemical production [162], marine transportation [387], military acquisition [287] and operations [806], nuclear power production [382], railways [664] and healthcare [105]. Unfortunately, the lack of training material or other forms of guidance can make it very difficult for engineers and managers to set up and maintain reporting systems. These problems have been exacerbated by a proliferation of small-scale local initiatives, for example within individual departments in UK hospitals. This, in turn, has made it very difficult to collate national statistics for incidents within a single industry. There are, of course, exceptions to this. For example, the Aviation Safety Reporting System (ASRS) has established national reporting procedures throughout the US aviation industry. Similarly, the UK Health and Safety Executive have supported national initiatives to gather data on Reportable Injuries, Diseases and Dangerous Occurrences (RIDDOR). In contrast to the local schemes, these national systems face problems of scale. It can become difficult to search databases of 500,000 records to determine whether similar incidents have occurred in the past.

This book, therefore, addresses two needs. The first is to provide engineers and managers with a practical guide on how to set up and maintain an incident reporting system. The second is to provide guidance on how to cope with the problems of scale that affect successful local and national incident reporting systems.

In 1999, I was asked to help draft guidelines for incident reporting in air traffic control throughout Europe. The problems of drafting these guidelines led directly to this book. I am, therefore, grateful to Gilles le Gallo and Martine Blaize of EUROCONTROL for helping me to focus on the problems of international incident reporting systems. Roger Bartlett, safety manager at the Maastricht upper air space Air Traffic Control center, also provided valuable help during several stages in the writing of this book. Thanks are also due to Michael Holloway of NASA's Langley Research Center who encouraged me to analyze the mishap reporting procedures being developed within his organization. Mike O'Leary of British Airways and Neil Johnstone of Aer Lingus encouraged my early work on software development for incident reporting. Ludwig Benner, Peter Ladkin, Karsten Loer and Dmitri Zotov provided advice and critical guidance on the causal analysis sections. I would also like to thank Gordon Crick and Mark Bowell of the UK Health and Safety Executive, in particular, for their ideas on the future of national reporting systems. I would like to thank the University of Glasgow for supporting the sabbatical that helped me to finish this work.

Chris Johnson, Glasgow, 2003.
Chapter 1
Abnormal Incidents

Every day we place our trust in a myriad of complex, heterogeneous systems. For the most part, we do this without ever explicitly considering that these systems might fail. This trust is largely based upon pragmatics. No individual is able to personally check that their food and drink is free from contamination, that their train is adequately maintained and protected by appropriate signalling equipment, or that their domestic appliances continue to conform to the growing array of international safety regulations [278]. As a result we must place a degree of trust in the organisations who provide the services that we use and the products that we consume. We must also, indirectly, trust the regulatory framework that guides these organisations in their commercial practices. The behaviour of phobics provides us with a glimpse of what it might be like if we did not possess this trust. For instance, a fear of flying places us in a nineteenth century world in which it takes several days rather than a few hours to cross the Atlantic. The SS United States' record crossing took 3 days, 10 hours and 40 minutes in July 1952. Today, the scheduled crossings by Cunard's QEII take approximately 6 days. In some senses, therefore, trust and profit are the primary lubricants of the modern world economy. Of course, this trust is implicit and may in some cases be viewed as a form of complicit ignorance. We do not usually pause to consider the regulatory processes that ensure our evening meal is free of contamination or that our destination airport is adequately equipped.

From time to time our trust is shaken by failures in the infrastructure that we depend upon [70]. These incidents and accidents force us to question the safety of the systems that surround us. We begin to consider whether the benefits provided by particular services and products justify the risks that they involve. For example, the Valujet accident claimed the lives of a DC-9's passengers and crew when it crashed after takeoff from Miami. National Transportation Safety Board (NTSB) investigators found that SabreTech employees had improperly labelled oxygen canisters that were carried on the flight. These canisters created the necessary conditions for the fire, which in turn led to the crash. Prior to the accident, in the first quarter of 1996, Valujet reported a net income of $10.7 million. After the accident, in the final quarter of 1996, Valujet reported a loss of $20.6 million. These losses do not take into account the additional $262 million costs of settlements with the victims' relatives.

The UK Nuclear Installations Inspectorate's report into the falsification of pellet diameter data in the MOX demonstration facility at Sellafield also illustrates the consequences of losing international confidence [641]. In the wake of this document, Japan, Germany and Switzerland suspended their shipments to and from the facility. The United States' government initiated a review of BNFL's participation in a $4.4bn contract to decommission former nuclear facilities. US Energy Secretary Bill Richardson sent a team to England to meet with British investigators. British Nuclear Fuels issued a statement that they had nothing to hide and were confident that the US Department of Energy would satisfy itself on this point [106]. The Channel Tunnel fire provides another example of the commercial consequences of such adverse events.
In May 1997, the Channel Tunnel Safety Authority made 36 safety recommendations after finding that the fire had exposed weaknesses in underlying safety systems. Insufficient staff training had led to errors and delays in dealing with the fire. Eurotunnel, therefore, took steps to address these concerns by implementing the short-term recommendations and conducting further studies to consider those changes that involved longer-term infrastructure investment. However, the UK Consumer Association mirrored more general public anxiety when its representatives stated that it was `still worried' about evacuation procedures and the non-segregation of passengers from cars on the tourist shuttle trains [97]. The fire closed the train link between the United Kingdom and France for approximately six months and served to exacerbate Eurotunnel's 1995 loss of $925 million.

This book introduces the many different incident reporting techniques that are intended to reduce the frequency and mitigate the consequences of accidents, such as those described in previous paragraphs. The intention is that by learning more from `near misses' and minor incidents, these approaches can be used to avoid the losses associated with more serious mishaps. Similarly, if we can identify patterns of failure in these low consequence events we can also reduce the longer term costs associated with large numbers of minor mishaps. In order to justify why you should invest your time in reading the rest of this work it is important to provide some impression of the scale of the problems that we face. It is difficult to directly assess the negative impact that workplace accidents have upon safe and successful production [283]. Many low-criticality and `near miss' events are not reported even though they incur significant cumulative costs. In spite of such caveats, it is possible to use epidemiological surveys and reports from national healthcare systems to assess the effects of incidents and accidents on worker welfare.
1.1 The Hazards

Employment brings with it numerous economic and health benefits. It can even improve our life expectancy relative to those unfortunate enough not to find work. However, work exposes us to a range of occupational hazards. The World Health Organisation (WHO) estimate that there may be as many as 250 million occupational injuries each year, resulting in 330,000 fatalities [872]. If work-related diseases are included then this figure grows to 1.1 million deaths throughout the globe [873]. About the same number of people die from malaria each year. The following list summarises the main causes of occupational injury and disease.
Mechanical hazards. Many workplace injuries occur because of poorly designed or poorly screened equipment. Others occur because people work on, or with, unsafe structures. Badly maintained tools also create hazards that may end in injury. Musculo-skeletal disorders and repetitive strain injury are now the main cause of work-related disability in most of the developed world. The consequent economic losses can be as much as 5% of the gross national product in some countries [872]. The Occupational Safety and Health Administration's (OSHA) ergonomics programme has argued that musculo-skeletal disorders are the most prevalent, expensive and preventable workplace injuries in the United States. They are estimated to cost $15 billion in workers' compensation costs each year. Other hazards of the working environment include noise, vibration, radiation, extremes of heat and cold.
Chemical Hazards. Almost all industries involve exposure to chemical agents. The most obvious hazards arise from the intensive use of chemicals in the textile, healthcare, construction and manufacturing industries. However, people in most industries are exposed to cleaning chemicals. Others must handle petroleum derivatives and various fuel sources. Chemical hazards result in reproductive disorders, in various forms of cancer, respiratory problems and an increasing number of allergies. The WHO now ranks allergic skin diseases as one of the most prevalent occupational diseases [872]. These hazards can also lead to metal poisoning, damage to the central nervous system and liver problems caused by exposure to solvents and to various forms of pesticide poisoning.
Biological hazards. A wide range of biological agents contribute to workplace diseases and infections. Viruses, bacteria, parasites, fungi, moulds and organic dusts affect many different industries. Healthcare workers are at some risk from tuberculosis infections, Hepatitis B and C as well as AIDS. For agricultural workers, the inhalation of grain dust can cause asthma and bronchitis. Grain dust also contains mould spores that, if inhaled, can cause fatal disease [321].
Psychological Hazards. Absenteeism and reduced work performance are consequences of occupational stress. These problems have had an increasing impact over the last decade. In the United Kingdom, the cost to industry is estimated to be in excess of $6 billion with over 40 million working days lost each year [90]. There is considerable disagreement over the causes of such stress. People who work in the public sector or who are employed in the service industries seem to be most susceptible to psychological pressures from clients and customers. High workload, monotonous tasks, exposure to violence and isolated working have all been cited as contributory factors. The consequences include unstable personal relationships, sleep disturbances and depression. There can be physiological consequences including higher rates of coronary heart disease and hypertension. Post traumatic stress disorder is also increasingly recognised in workers who have been involved in, or witnessed, incidents and accidents.
This list describes some of the hazards that threaten workers' health and safety. Unfortunately, these items tell us little about the causes of these adverse events or about potential barriers. For example, an OSHA report describes the way in which a sheet metal worker was injured by a mechanical hazard:

"...employee #1 was working at station #18 (robot) of a Hitachi automatic welding line. She had been trained and was working on this line for about 2 months... The lifting arm then rises and a robot arm moves out from the operator's side of the welding line and performs its task. Then there is a few seconds delay between functions as the robot arm finishes welding, rises, returns to home and the lifting arm lowers to home, ready for the finished length of frame steel to move on and another to take it's place. During the course of this operation the welding line is shut down intermittently so that the welding tips on the robot arms can be lubricated, preventing material build up. This employee, without telling anyone else or shutting down the line, tried to perform the lubrication with the line still in automatic mode. She thought this could be done between the small amount of time it took all parts to complete their functions and return to home. The employee did not complete the task in time, as she had anticipated. Her right leg was located between the protruding rods on the lifting arm and the openings the rods rest in. Her leg was trapped. When other employees were alerted, they had trouble trying to switch the line to manual because the computer was trying to complete it's function and the lifting arm was trying to return to home. The result was that one employee used a crowbar to help relieve pressure on her leg and another used the cellenoid which enabled the lifting arm to rise. The employee received two puncture wounds in the thigh (requiring stitches) and abrasions to the lower leg. Management once again instructed employees working this line on the serious need to wait until all functions are complete, the line shut down and not in the automatic mode before attempting any maintenance." (OSHA Accident Report ID: 0352420).

It is possible to identify a number of factors that were intended to prevent this incident from occurring. Line management had trained the employees not to intervene until the robot welding cycle was complete. Lubrication was intended to be completed when the line was `shut down' rather than in automated mode. It is also possible to identify potential factors that might have been changed to prevent the accident from occurring. For example, physical barriers might have been introduced into the working environment so that employees were prevented from intervening during automated operations. Similarly, established working practices may in some way have encouraged such risk taking; the report comments that management `once again' instructed employees to wait until the line was shut down. These latent problems created the context in which the incident could occur [698]. The triggering event, or catalyst, was the employee's decision that she had enough time to lubricate the device. The lack of physical barriers then left her exposed to the potential hazard once she had decided to pursue this unsafe course of action. Observations about previously unsafe working practices in this operation may also have done little to dissuade her from this intervention.
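The distinction drawn above between latent problems and a triggering catalyst can be made concrete as a small data structure. The following Python sketch is illustrative only; the field names and the classification of the OSHA case are one interpretation of the narrative, not part of any official reporting schema.

    # Illustrative decomposition of the OSHA sheet-metal case into latent
    # conditions, a triggering event and failed or missing barriers. The
    # schema is hypothetical; it simply mirrors the analysis in the text.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class IncidentFactors:
        report_id: str
        latent_conditions: List[str] = field(default_factory=list)
        trigger: str = ""
        failed_or_missing_barriers: List[str] = field(default_factory=list)

    osha_0352420 = IncidentFactors(
        report_id="OSHA 0352420",
        latent_conditions=[
            "lubrication physically possible while line in automatic mode",
            "working practices tolerated intervention during operation",
        ],
        trigger="operator decided there was time to lubricate mid-cycle",
        failed_or_missing_barriers=[
            "no physical guard preventing access during automatic operation",
            "training relied on instruction rather than interlocks",
        ],
    )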
Figure 1.1 provides a high level view of the ways in which incidents and accidents are caused by catalytic failures and weakened defences.

Figure 1.1: Components of Systems Failure

The diagram on the left shows how the integration of working practices, working environment, line management and regulatory intervention together support a catalytic or triggering failure. Chapter 3 will provide a detailed analysis of the sources for such catalytic failures. For now, however, it is sufficient to observe that there are numerous potential causes, ranging from human error through stochastic equipment failures to deliberate violations of regulations and working practices. It should also be apparent that there may be catalytic failures of such magnitude that it would be impossible for any combination of the existing structures to support them for any length of time. In contrast, the diagram on the right of Figure 1.1 is intended to illustrate how weaknesses in the integration of system components can increase an application's vulnerability to such catalytic failures. For example, management might strive to satisfy the requirements specified by a regulator, but if those requirements are flawed then there is a danger that the system will be vulnerable to future incidents. These failures in the supporting infrastructure are liable to develop over a much longer timescale than the triggering events that place the system under more immediate stress.

The diagrams in Figure 1.1 sketch out one view of the way in which specific failures place stress on the underlying defences that protect us from the hazards that were listed in previous paragraphs. A limitation of these sketches is that they provide an extremely static impression of a system as it is stressed by catalytic failures. In contrast, Figure 1.2 provides a more process oriented view of the development of an occurrence or critical incident.

Figure 1.2: Process of Systems Failure

Initially, the system is in a `normal' state. Of course, this `normal' state need not itself be safe if there are flaws in the working practices and procedures that govern everyday operation. The system may survive through an incubation period in which any residual flaws are not exposed by catalytic failures. This phase represents a `disaster waiting to happen'. However, at some point such an event does cause the onset of an incident or accident. These failures may, in turn, expose further flaws that trigger incidents elsewhere in the same system or in other interrelated applications. After the onset of a failure, protection equipment and other operators may intervene to mitigate any consequences. In some cases, this may return the system to a nominal state in which no repair actions are taken. This has potentially dangerous implications because the flaws that were initially exposed by the triggering event may still reside in the system. Alternatively, a rescue and salvage period may be initiated in which previous shortcomings are addressed. In particular, a process of cultural readjustment is likely if the potential consequences of the failure have threatened the continued success of the organisation as a whole. For example, the following passage comes from a report that was submitted to the European Commission's Major Accident Reporting System (MARS) [229]:

"At 15:30 the crankcase of an URACA horizontal action 3 throw pump, used to boost liquid ammonia pressure from 300 psi to 3,400 psi, was punctured by fragments of the failed pump-ram crankshaft. The two operators investigating the previously reported noises from the pump were engulfed in ammonia and immediately overcome by fumes. Once the pump crankcase was broken, nothing could be done to prevent the release of the contents of the surge drum (10 tonnes were released in the first three minutes). The supply of ammonia from the ring main could only be stopped by switching off the supply pump locally. No one was able to do this as the two gas-tight suits available were preferentially used for search and rescue operations, and thus the release of ammonia continued. Ammonia fumes quickly began to enter the plant control room and the operators hardly had the time to sound the alarms and start the plant shut-down before they had to leave the building using 10 minutes escape breathing apparatus sets. During the search and rescue operation the fire authorities did not use the gas-tight suits and fumes entered the gaps around the face piece and caused injuries to 5 men. The ammonia cloud generated by the initial release drifted off-site and remained at a relatively low level." (MARS report 814).

A period of normal operation led to an incubation period in which the pump-ram crankshaft was beginning to fail and required maintenance. The trigger event involved the puncture of the pump's crankcase when the ram crankshaft eventually failed.
crankcase when the ram crankshaft eventually failed. This led to the onset of the incident in which two operators were immediately overcome. This then triggered a number of further, knock-on failures. For instance, the injuries to the firemen were caused because they did not use gas-tight suits during their response to the initial incident. In this case, only minimal mitigation was possible as operators did not have the gas-tight suits that were necessary in order to isolate the ammonia supply from the ring main. Those suits that were available were instead deployed to search and rescue operations. Many of the stages shown in Figure 1.2 are based on Turner's model for the development of a system failure [790]. The previous figure introduces a mitigation phase that was not part of this earlier model. This is specifically distinguished from Turner's rescue and salvage stage because it reflects the way in which operators often intervene to `cover up' a potential failure by taking immediate action to restore a nominal state. In many instances, individuals may not even be aware that such necessary intervention should be reported as a focus for potential safety improvements. As Leveson points out, human intervention routinely prevents the adverse consequences of many more occurrences than are ever recorded in accident and incident reports [486]. This also explains our introduction of a feedback loop between the mitigation and the situation normal phases. These features were not necessary in Turner's work because his focus was on accidents rather than incidents. Figure 1.2 also introduces a feedback loop between the onset and trigger phases. This is intended to capture the ways in which an initial failure can often have knock-on effects throughout a system. It is very important to capture these incidents because they are increasingly common as we move to more tightly integrated, heterogeneous application processes. Previous paragraphs have sketched a number of ways in which particular hazards contribute to occupational injuries. They have also introduced a number of high-level models that can be used to explain some of the complex ways in which background failures and triggering events combine to expose individuals to those hazards. The following sections build on this analysis by examining the likelihood of injury to individuals in particular countries and industries. We also look at the costs of these adverse events to individuals and to particular industries. The intention is to reiterate the importance of detecting potential injuries and illnesses before they occur.
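The process that Figure 1.2 describes can be summarised as a simple state machine. The following minimal Python sketch encodes the phases and the two feedback loops discussed above (mitigation back to the `normal' state, and onset back to the trigger phase). The state and event names are illustrative labels of our own; they are not terminology from Turner's model or from the MARS report.

    # Phases of Figure 1.2 as a transition table. Event and state names are
    # illustrative assumptions; the feedback loops follow the text above.
    TRANSITIONS = {
        ("normal", "flaw_introduced"): "incubation",
        ("incubation", "catalytic_failure"): "trigger",
        ("trigger", "incident_onset"): "onset",
        ("onset", "knock_on_failure"): "trigger",            # onset -> trigger loop
        ("onset", "operator_intervention"): "mitigation",
        ("mitigation", "nominal_state_restored"): "normal",  # flaws may remain
        ("mitigation", "shortcomings_addressed"): "rescue_and_salvage",
        ("rescue_and_salvage", "cultural_readjustment"): "normal",
    }

    def advance(state, event):
        """Return the next phase, or raise if the move is undefined."""
        try:
            return TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError("no transition from %r on %r" % (state, event))

    # Replaying the ammonia release against the model:
    state = "normal"
    for event in ["flaw_introduced", "catalytic_failure", "incident_onset",
                  "knock_on_failure", "incident_onset", "operator_intervention"]:
        state = advance(state, event)
        print(event, "->", state)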
1.1.1 The Likelihood of Injury and Disease

Work-place incidents and accidents are relatively rare. In the United Kingdom, approximately 1 in every 200 workers reports an occupational illness or injury resulting in more than three days of absence from employment every year [331]. OSHA estimates that the rate of work-related injuries and illnesses dropped from 7.1 per year for every 100 workers in 1997 to 6.7 in 1998 [652]. These figures reflect significant improvements over the last decade. For example, the OSHA statistics show that the number of work-related fatalities has almost halved since the agency was established by Congress in 1971. The Australian National Occupational Health and Safety Commission reports that the rate of fatality, permanent disability or temporary disability resulting in an absence from work of one week or more was 2.2 per 100 in 1997-8, 2.5 in 1996-7, 2.7 in 1995-6, 2.9 in 1994-5, 3.0 in 1993-4 and 2.8 in 1992-3 [44]. The following figures provide the same data per million hours worked: 13 in 1997-8, 14 in 1996-7, 16 in 1995-6, 16 in 1994-5, 17 in 1993-4, 19 in 1992-3. These statistics hide a variety of factors that continue to concern governments, regulators, managers, operators and the general public. The first cause for concern stems from demographic and structural changes in the workforce. Many countries continue to experience a rising number of workers. This is due both to an increasing population and to structural changes in the workforce, for instance increasing opportunities for women. In the United Kingdom, the 1% fall between 1998 and 1999 in the over-3-day injury rate is being offset by a (small) rise in the total number of injuries from 132,295 to 132,307 in 1999-2000 [331]. Similarly, the OSHA figures for injury and illness rates show a 40% decline since 1971. At the same time, however, U.S. employment has risen from 56 million workers at 3.5 million worksites to 105 million workers at nearly 6.9 million sites [652]. Population aging will also have an impact upon occupational injury statistics. Many industrialised countries are experiencing the twin effects of a falling birth rate and a rising life expectancy. This will increase pressure on the workforce for higher productivity and greater contributions to retirement provision. Recent estimates place the number of people aged 60 and over at 590 million worldwide.
By 2020, this number is projected to exceed 1,000 million [873]. Of this number, over 700 million older people will live in developing countries. These projections are not simply significant for the burdens that they will place on those in work. Older elements of the workforce are often the most likely to suffer fatal work-related injuries. In 1997-98, the highest rate of work-related fatalities in Australia occurred in the 55-plus age group with 1.3 deaths per 100 employees. They were followed by the 45-49 and 50-54 age groups with approximately 0.8 fatalities per 100 employees. The lowest number of fatalities occurred in workers under the age of 20 with 0.2 deaths per 100 employees. It can be difficult to interpret such statistics. For example, they seem to indicate that the rising risks associated with aging outweigh any beneficial effects from greater expertise across the workforce. Alternatively, the statistics may indicate that younger workers are more likely to survive injuries that would prove fatal to older colleagues. The UK rate of reportable injury is lower in men aged 16-19 than in all age groups except those above 55 [326]. However, the HSE report that the differences between age groups are not statistically significant when allowing for the higher accident rates of those occupations that are mainly performed by younger men. There is also data that contradicts the Australian experience. Young men, aged 16-24, face a 40% higher relative risk of all workplace injury than men aged 45-54 even after allowing for occupations and other job characteristics. The calculation of health and safety statistics has also been affected by social and economic change. Part-time work has important effects on the calculation of health and safety statistics per head of the working population [652, 326]. The rate of injury typically increases with the amount of time exposed to a workplace risk. However, it is possible to normalise the rate using an average number of weekly hours of work. The rate of all workplace injury in the UK is 8.0 per 100 for people working less than 16 hours per week. For people working between 16 and 29 hours per week it is 4.3, between 30 and 49 hours it is 3.8, between 50 and 59 it is 3.2, and people working 60 or more hours per week have an accident rate of 3.0 per 100 workers per annum. People who work a relatively low number of hours have substantially higher rates of all workplace and reportable injury than those working longer hours. The relatively high risk in workers with low hours remains after allowing for different occupational characteristics [326]. The growth of temporary work has similar implications for some economies. In the UK, the rate of injury to workers in their first 6 months is double that of their colleagues who have worked for at least a year. This relatively high risk for new workers remains after allowing for occupations and hours of work. 57% of temporary workers have been with their employer for less than 12 months. Table 1.1 shows that accident rates are not uniformly distributed across industry sectors. For example, the three-day rate for agriculture and fishing in the United Kingdom is 1.2 per 100 employees. The same rate for the services industries is approximately 0.4 per 100 workers.
Industry         UK          Germany     France   Spain        Italy
                 1993  1994  1993  1994  1993     1992  1993   1991
Agriculture       7.3   8.5   6.0   6.7   9.8      9.1   5.4   18.4
Utilities         0.5   0.6   3.1   4.3   5.6     12.5  10.1    4.4
Manufacturing     1.6   1.2   2.3   1.6   2.3      6.7   4.9    3.3
Construction      8.9   6.9   7.9   8.0  17.6     21.0  19.3   12.8
Transport         2.2   2.0   7.2   7.5   6.5     13.0  10.7   11.2
Other services    0.3   0.4   1.0   1.2   1.9      1.4   1.5    0.9
All industries    1.2   0.9   3.3   3.9   6.4      5.5   3.2    5.1

Table 1.1: Industry Fatality Rates in UK, Germany, France, Spain & Italy [324]

Accident rates also differ with gender. Positive employment practices are exposing increasing numbers of women to a greater variety of risks in the workplace. The overall Australian National Occupational Health and Safety Commission rate of 2.2 injuries and illnesses per 100 workers hides a considerable variance [44]. For males the rate was 2.9 per 100 workers whilst it was 1.3 for females.
In 1997-8, the industries with the highest number of male fatalities were Transport and Storage (66) and Manufacturing (64), while for females Accommodation, Cafes and Restaurants (4) and Property and Business Services (4) were the highest. The male fatalities were mainly employed as Plant and Machine Operators, and Drivers (91). Female fatalities were mainly employed as Managers and Administrators (5). These differences may decline with underlying changes in workplace demographics. However, UK statistics suggest some significant residual differences between the genders: "the rate of all workplace injury is over 75% higher in men than women, reflecting that men tend to be employed in higher risk occupations. After allowing for job characteristics, the relative risk of workplace injury is 20% higher in men compared with women. Job characteristics explain much of the higher rate of injury in men but not all because men still have an unexplained 20% higher relative risk" [326]. Table 1.1 illustrates how the rate of industrial injuries differs within Europe. Such differences are more marked when comparisons are extended throughout the globe. However, it is not always possible to find comparable data: "The evaluation of the global burden of occupational diseases and injuries is difficult. Reliable information for most developing countries is scarce, mainly due to serious limitations in the diagnosis of occupational illnesses and in the reporting systems. WHO estimates that in Latin America, for example, only between 1 and 4% of all occupational diseases are reported. Even in industrialised countries, the reporting systems are sometimes fragmented." [873] For example, the Australian statistics cited in previous paragraphs include some cases of coronary failure that would not have been included within the UK statistics. These problems are further exacerbated by the way in which local practices affect the completion of death certifications and other reporting instruments. For instance, the death of a worker might have been indirectly caused by a long-running coronary disease or by the immediate physical exertion that brings on a heart attack. It is important to emphasise that even if it were possible to implement a consistent global reporting system for workplace injuries, it would still not be possible to draw inferences about the number of incidents and accidents directly from that data. Many incidents still go unreported even when well-established reporting systems are available. A further limitation is that injury and fatality statistics tell us little or nothing about `near miss' incidents that narrowly avoided physical harm.
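Several of the rates quoted in this section use different denominators: cases per 100 workers per annum and cases per million hours worked. The following minimal Python sketch shows the conversion between the two. The case counts, workforce sizes and the 48-week working year are illustrative assumptions, not figures from the cited reports.

    def rate_per_100_workers(cases, workers):
        """Injuries or illnesses per 100 workers per annum."""
        return 100.0 * cases / workers

    def rate_per_million_hours(cases, workers, avg_weekly_hours, weeks_per_year=48):
        """The same cases normalised by estimated hours of exposure."""
        hours_worked = workers * avg_weekly_hours * weeks_per_year
        return 1_000_000.0 * cases / hours_worked

    # Illustrative figures only: the same number of cases in a part-time
    # and a full-time workforce of equal size.
    print(rate_per_100_workers(3000, 100_000))         # 3.0 per 100 workers
    print(rate_per_million_hours(3000, 100_000, 12))   # ~52.1 per million hours
    print(rate_per_million_hours(3000, 100_000, 40))   # ~15.6 per million hours

The per-head rates are identical, but the hourly rates differ by more than a factor of three; this is one reason why the part-time injury figures quoted above must be interpreted with care.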
1.1.2 The Costs of Failure

In 1996 the UK Health and Safety Executive estimated that workers and their families lost approximately $558 million per year in reduced income and additional expenditure from work-related injury and ill health [322]. They also estimated that the loss of welfare in the form of pain, grief and suffering to employees and their families was equivalent to a further $5.5 billion. These personal costs also have wider implications for employers, for the local economy and ultimately for national prosperity. The same study estimated that the direct cost to employers was approximately $2.5 billion a year; $0.9 billion for injuries and $1.6 billion for illness. In addition, the loss caused by avoidable accidental events that do not lead to injury was estimated at between $1.4 billion and $4.5 billion per year. This represents 4-8% of all UK industrial and commercial companies' gross trading profits. Employers also incur costs through regulatory intervention. These actions are intended to ensure that a disregard for health and safety will be punished whether or not an incident has occurred. Tables 1.2 and 1.3 summarise the penalties imposed by United States' Federal and State inspectors in the fiscal year 1999 [652]. Regulatory actions imposed a cost of $151,361,442 beyond the immediate financial losses incurred from incidents and accidents. These figures do not account for the numerous competitive disadvantages that are incurred when organisations are associated with high-profile failures [675].
Violations 646 50,567 1,816 226 408 23,533 77,196 Violations 441 57,010 2,162 785 46 82,120 202,962
Percent 0.8 66 2 0.3 0.01 30 Total
Type Willful Serious Repeat Failure to abate Unclassi ed Other
Penalties $24,460,318 $50,668,509 $8,291,014 $1,205,063 $3,740,082 $1,722,338 $90,087,324
Table 1.2: Federal Inspections Fiscal Year 1999 Percent Type Penalties 0.3 Willful $12,406,050 40 Serious $35,441,267 1.5 Repeat $4,326,620 0.5 Failure to abate $2,860,972 0.0002 Unclassi ed $2,607,900 40 Other $3,631,309 Total $61,274,118 Table 1.3: State Inspections Fiscal Year 1999
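The $151,361,442 regulatory cost quoted above is consistent with the sum of the two table totals. A minimal check in Python:

    federal = 24_460_318 + 50_668_509 + 8_291_014 + 1_205_063 + 3_740_082 + 1_722_338
    state   = 12_406_050 + 35_441_267 + 4_326_620 + 2_860_972 + 2_607_900 + 3_631_309
    assert federal == 90_087_324 and state == 61_274_118   # the table totals
    print(federal + state)   # 151361442, the figure cited in the text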
1.2 Social and Organisational Influences

These statistics illustrate the likelihood and consequences of occupational injuries. It is important, however, to emphasise that this data suffers from a number of biases. Many of the organisations that are responsible for collating the statistics are also responsible for ensuring that mishap frequencies are reduced over time. Problems of under-reporting can also complicate the interpretation of national figures. There is often a fear that some form of blame will attach itself to those organisations that return an occupational health reporting form. The OSHA record keeping guidelines stress that: "Recording an injury or illness under the OSHA system does not necessarily imply that management was at fault, that the worker was at fault, that a violation of an OSHA standard has occurred, or that the injury or illness is compensable under workers' compensation or other systems." [653] However, in many countries including the United States, organisations that have a higher reported rate of occupational illness or injury become the focus of increasing levels of regulatory inspection and intervention. This has a certain irony because, as OSHA acknowledge, relatively low levels of reported injuries and illnesses may be an indicator of poor health and safety management: "...during the initial phases of identifying and correcting hazards and implementing a safety and health program an employer may find that its reported rate increases. This may occur because, as an employer improves its program, worker awareness and thus reporting of injuries and illnesses may increase. Over time, however, the employer's ... rate should decline if the employer has put into place an effective program." [648] It is instructive to examine how our analysis relates to previous work on enhancing the safety of hazardous technologies. Two schools of thought can be identified; the first stems from the `normal accident' work of Perrow [675]; the second stems from the idea of `high reliability' organisations [718].
1.2.1 Normal Accidents?

Perrow argues that the characteristics of high-risk technologies make accidents inevitable, in spite of the effectiveness of conventional safety devices. These characteristics include complexity and
tight coupling. Complexity arises from our limited understanding of some transformation stages in modern processing industries. It stems from complex feedback loops in systems that rely on multiple, interacting controls. Complexity also stems from many common-mode interconnections between subsystems that cannot easily be isolated. More complex systems produce unexpected interactions and so can provoke incidents that are harder to rectify. Perrow also argues that tight coupling plays a greater role in the adverse consequences of many accidents than the complexity of modern technological systems. This arises because many applications are deliberately designed with narrow safety margins. For example, a tightly coupled system may only permit one method of achieving a goal. Access to additional equipment, raw materials and personnel is often limited. Any buffers and redundancy that are allowed in the system are deliberately designed only to meet a few specified contingencies. In contrast, Perrow argues that accidents can be avoided through loose coupling. This provides the time, resources and alternative paths to cope with a disturbance. There is evidence to contradict parts of Perrow's argument [710, 684]. Some `high reliability' organisations do seem to be able to sustain relatively low incident rates in spite of operating complex processes. Viller [847] identifies a number of key features that contribute to the perceived success of these organisations:
- The leadership in an organisation places a high priority on safety.
- Organisational learning takes place through a variety of means, including trial and error but also through simulation and hypothesis testing.
- High levels of redundancy exist even under external pressures to trim budgets.
- Authority and responsibility are decentralised and key individuals can intervene to tackle potential incidents. These actions are supported by continuous training and by organisational support for the maintenance of an appropriate safety culture.
These characteristics illustrate the important role that incident reporting plays for `high reliability' organisations. Such applications are an important means of supporting organisational learning. Table 1.4 summarises the main features of `Normal Accident' theory and `High Reliability' organisations.

High Reliability Organisations                            | Normal Accidents Theory
Accidents can be prevented through good organisational    | Accidents are inevitable in complex and tightly
design and management.                                    | coupled systems.
Safety is the priority organisational objective.          | Safety is one of a number of competing objectives.
Redundancy enhances safety: duplication and overlap can   | Redundancy often causes accidents: it creates
make a reliable system out of unreliable parts.           | interactive complexity and encourages risk taking.
Decentralised decision-making is needed to permit prompt  | De-centralised control is needed for complex systems
and flexible operating responses to surprises.            | but centralised control is needed for tight coupling.
A culture of reliability enhances safety by encouraging   | A military model of intense discipline and isolation
uniform and appropriate responses by operators.           | is incompatible with democratic values.
Continuous operations, training and simulations can       | Organisations cannot train for unimagined, highly
create and maintain high reliability operations.          | dangerous or politically unpalatable operations.
Trial and error learning from accidents can be effective  | Denial of responsibility, faulty reporting and
and can be supplemented by anticipation and simulations.  | reconstruction of history cripples learning efforts.

Table 1.4: Competing Perspectives on Safety with Hazardous Technologies [718]

Sagan [718] used both of these approaches to analyse the history of nuclear weapons safety. His conclusions lend weight to Perrow's pessimistic assessment that some accidents are inevitable. They are significant because they hold important implications for the interpretation both of incident and accident reports. For example, Sagan argues that much of the evidence put forward to support high reliability organisations is based on data that those organisations help to produce. Accounts of good safety records in military installations are often dependent on data supplied by the military. This is an important caveat to consider during the following pages in which we will present incident and accident statistics. We may not always be able to rely upon the accuracy of information that organisations use to publicise improvements in their own safety record. Sagan also argues that social pressures act as brakes on organisational learning. He identifies ways in which stories about previous failures have been altered and falsified. He then goes on to show how the persuasive effects of such pressures can help to convince the originators of such stories that they are, in fact, truthful accounts of incidents and accidents. This reaches extremes when failures are re-painted as notable successes.
1.2.2 The Culture of Incident Reporting

Sagan's work shows that a variety of factors can affect whether or not adverse events are investigated. These factors affect both individuals and groups within safety-critical organisations. The impact of cultural influences, of social and legal obligations, cannot be assessed without regard to individual differences. Chapter 3 will describe how subjective attitudes to risk taking and to the violation of rules can have a profound impact upon our behaviour. For now it is sufficient to observe that each of the following influences will affect individuals in a number of different ways. In some groups, it can be disloyal to admit that either you or your colleagues have made a mistake or have been involved in a `failure'. These concerns take a number of complex forms. For example,
individuals may be prepared to report failures. However, individuals may be reluctant to face the retribution of their colleagues should their identity become known. These fears are compounded if they do not trust the reporting organisation to ensure their anonymity. For this reason, NASA go to great lengths to publicise the rules that protect the identity of contributors to the US Aviation Safety Reporting System. Companies can support a good `safety culture' by investing in and publicising workplace reporting systems. A number of factors can, however, undermine these initiatives. The more active a company is in seeking out information about previous failures, the worse its safety record may appear. It can also be difficult to sustain the employee protection that encourages contributions when incidents have economic as well as safety implications. Individuals can be offered re-training after a first violation; re-employment may be required after a second or third. The social influence of a company's `safety culture' is reinforced by the legal framework that governs particular industries. This is most apparent in the regulations that govern what should and what should not be reported to national safety agencies. For example, the OSHA regulations follow Part 1904.12(c) of the Code of Federal Regulations. These require that employers record information about every occupational death; every nonfatal occupational illness; and those nonfatal occupational injuries which involve one or more of the following: loss of consciousness, restriction of work or motion, transfer to another job, or medical treatment (other than first aid) [653]. As we shall see, this focus on accidents rather than `near-miss' incidents reflects an ongoing debate about the scope of Federal regulation and enforcement in the United States. It is often argued that individuals will not contribute to reporting systems unless they are protected from self-incrimination through a `no blame' policy [700]. It is difficult for organisations to preserve this `no blame' approach if the information that they receive can subsequently be used during prosecutions. Conversely, a local culture of non-reporting can be reinforced or instigated by a fear of legal retribution if incidents are disclosed. These general concerns characterise a range of more detailed institutional arrangements. For example, some European Air Traffic Management
providers operate under a legal system in which all incidents must be reported to the police. In neighbouring countries, the same incidents are investigated by the service providers themselves and, typically, fall under an informal non-prosecution agreement with state attorneys. Other countries have more complex legal situations in which specific industry arrangements also fall under more general regional and national legislation. For example, the Utah Public Officers and Employees' Ethics Act and the Illinois Whistle Blower Protection Act are among a number of state instruments that have been passed to protect respondents. These local Acts provide for cases that are also covered by Federal statutes including the Federal False Claims Act or industry-specific provision for whistle blowers such as section 405 of the Surface Transportation Assistance Act. This has created some disagreement about whether state legislation preempts federal law in this area; several cases have been conducted in which claimants have filed both common law and statutory suits at the same time. Cases in Texas and Minnesota have shown that Federal statutes provide a base-line and not a ceiling for protection in certain states. Such legal complexity can deter potential contributors to reporting systems. There are other ways in which the legislative environment can affect reporting behaviour. For example, freedom of information and disclosure laws are increasing public access to the data that organisations hold. The relatives or representatives of people involved in an accident can potentially use these laws to gain access to information about previous incidents. In such circumstances, there is an opportunity for punitive damages to be sought if previous, similar incidents were reported but not acted upon. These concerns arose in the aftermath of the 1998 Tobacco Settlement with cigarette manufacturers in the United States. Prior to this settlement, states alleged that companies had conspired to withhold information about the adverse health effects of tobacco [580]. The legislative environment for accident and incident reporting is partly shaped by higher-level political and social concerns. For example, both developed and developing nations have sought to deregulate many of their industries in an attempt to encourage growth and competition. Recent initiatives to liberalise the Indian economy have highlighted this conflict between the need to secure economic development whilst also coordinating health and safety policy. The Central Labour Institute has developed national standards for the reporting of major accidents. However, the Directorate General of Factory Advice Services and the Labour Institutes have not developed similar guidelines for incident and occurrence reporting. The focus has been on developing education and training programmes that can target specific health and safety issues after industries have become established within a region [156]. Some occupational health and safety reporting systems have, however, been extended to explicitly collect data about both actual accidents and `near-miss' incidents. For example, employers in the UK are guided by the Reporting of Injuries, Diseases and Dangerous Occurrences Regulations (RIDDOR) 1995. These cover accidents which result in an employee or a self-employed person dying, suffering a major injury, or being absent from work or unable to do their normal duties for more than three days.
They also cover `dangerous occurrences' that do not result in injury but have the potential to do significant harm [320]. These include (a sketch of the overall reporting rule follows the list):
- The collapse, overturning or failure of load-bearing parts of lifts and lifting equipment.
- The accidental release of a biological agent likely to cause severe human illness.
- The accidental release of any substance which may damage health.
- The explosion, collapse or bursting of any closed vessel or associated pipework.
- An electrical short circuit or overload causing fire or explosion.
- An explosion or fire causing suspension of normal work for over 24 hours.
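A minimal sketch of this reporting rule in Python. The predicate and the dangerous-occurrence labels are simplified illustrations of the regulation as summarised above, not RIDDOR's own wording:

    # Simplified labels for the dangerous occurrences listed above.
    DANGEROUS_OCCURRENCES = {
        "lifting_equipment_collapse",
        "biological_agent_release",
        "harmful_substance_release",
        "closed_vessel_explosion",
        "electrical_fire_or_explosion",
        "fire_suspending_work_over_24h",
    }

    def riddor_reportable(fatality, major_injury, days_unfit, occurrence=None):
        """True if the event must be reported under the rule as sketched here."""
        if fatality or major_injury or days_unfit > 3:
            return True
        return occurrence in DANGEROUS_OCCURRENCES

    # A near miss with no injury is still reportable as a dangerous occurrence:
    print(riddor_reportable(False, False, 0, "closed_vessel_explosion"))  # True
    print(riddor_reportable(False, False, 2))                             # False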
Similarly, Singapore's Ministry of Manpower requires that both accidents and `dangerous occurrences' must be reported. Under the fourth schedule of the national Factory Act, these are events that may `under other circumstances' have resulted in injury or death [742]. The detailed supporting material that accompanies the act provides exhaustive guidance on the definition of such dangerous occurrences. These are taken to include incidents that involve the bursting of a revolving vessel, wheel, grindstone or grinding wheel.
Dangerous occurrences also range from an electrical short circuit or failure of electrical machinery, plant or apparatus, attended by explosion or fire or causing structural damage, to an explosion or failure of structure of a steam boiler or of a cast-iron vulcaniser. A duty to report incidents and accidents does not always imply that information about these occurrences will be successfully acted upon. This concern is at the heart of continuing attempts to impose a `duty to investigate' upon UK employers. At present, the UK regulatory framework is one in which formal accident investigation of the most serious incidents is undertaken by specially trained investigators. Employers are not, in general, obliged actively to find out what caused something to go wrong. Concern about this situation led to a 1998 discussion document that was published by the Health and Safety Commission (HSC). It was observed that: "At present, there is no law which explicitly requires employers to investigate the causes of workplace accidents. Many employers do undertake accident investigation when there has been an event in the workplace which has caused injury in order to ensure lessons are learnt, and although there is no explicit legal duty to investigate accidents there are duties under some health and safety law which may lead employers to undertake investigation. The objective of a duty to investigate accidents would be to ensure employers draw any appropriate lessons from them in the interests of taking action to prevent recurrence." [314] There are many organisational reasons why a body such as the HSC would support such an initiative. The first is the face-value argument that such a duty to investigate accidents and incidents would encourage employers to adopt a more pro-active approach to safety. The second is that such a duty would help to focus finite regulatory resources by following the deregulation initiated in the UK under the Robens Committee [709]. This group responded to the mass of complex regulations that had emerged from the plethora of nineteenth-century factory acts. As industries merged and emerged, it was difficult for employers to know which parts of each act actually applied to their business. As a result, the Robens Committee helped to propose what became the Health and Safety at Work Act (1974). Key sections of the Robens report [701] argued that: "We need a more effective self-regulating system... It calls for better systems of safety organisation, for more management initiatives, and for more involvement of work people themselves. The objectives of future policy must, therefore, include not only increasing the effectiveness of the state's contribution to safety and health at work but also, and more importantly, creating conditions for more effective self-regulation" [709] The same concerns over the need to target finite regulatory resources and the need to encourage pro-active intervention by other organisations also inspired attempts in the United States to establish OSHA's Cooperative Compliance Programme. This focused on the 12,000 employers that had the highest reported mishap rates. Those companies that agreed to participate and invest in safety management programs were to be offered a reduced likelihood of OSHA inspection. This was estimated to be a reduction from an absolute certainty of inspection down to approximately 30% [648]. This policy was intended to leverage OSHA resources by encouraging commercial investment in safety.
It was also intended to provide OSHA with a means of targeting finite inspection resources. However, employers' organisations claimed that it introduced new roles and responsibilities for the Federal organisation. The US Chamber of Commerce helped to present a case before the US Court of Appeals that succeeded in blocking OSHA's plans. The Assistant Secretary of Labour for Occupational Safety and Health argued: "The goal of the Cooperative Compliance Programme (CCP) is to use OSHA's limited resources to identify dangerous work sites and work in partnership with management and labour to find and fix hazards. America's taxpayers expect nothing less for their continued support and funding of OSHA. This lawsuit is frivolous; it has no merit and aims only to hinder our ability to protect working men and women from often life-threatening hazards. The CCP is an enforcement program, not a regulation. We are confident that our program is lawful. Attempts by the National Association of Manufacturers and the
U.S. Chamber of Commerce to throw up legal roadblocks will only ensure that the most dangerous work sites in America remain that way, putting untold numbers of workers at risk." [397] The CCP provides important insights into the regulatory environment in the United States. As a result of the legal action, OSHA was forced to build less formal partnerships with employers' organisations. The CCP is also instructive because OSHA produced detailed guidance on those measures that high-reporting organisations ought to introduce in order to address previous failures. Table 1.5 presents OSHA's guidelines [648] on how to assess the quality of accident investigation within an organisation. As can be seen, the investigation of `near-miss' incidents, or occurrences in HSE terms, characterises an organisation at the highest level of safety management.

1  No investigation of accidents, injuries, near misses, or other incidents is conducted.
2  Some investigation of incidents takes place, but root cause may not be identified, and correction may be inconsistent. Supervisors prepare injury reports for lost time cases.
3  OSHA-101 (report form) is completed for all recordable incidents. Reports are generally prepared with cause identification and corrective measures prescribed.
4  OSHA-recordable incidents are always investigated, and effective prevention is implemented. Reports and recommendations are available to workers. Quality and completeness of investigations are systematically reviewed by trained safety personnel.
5  All loss-producing accidents and near-misses are investigated for root causes by teams or individuals that include trained safety personnel and workers.

Table 1.5: OSHA Levels of Accident and Incident Investigation
Different reporting systems have different definitions of what should and what should not be reported. These distinctions reflect national and international agreements about the nature of incidents and accidents. For instance, Table 1.6 embodies International Civil Aviation Organisation (ICAO) and EUROCONTROL requirements for incident and accident reporting in Air Traffic Control. As can be seen, this covers both specific safety-related incidents, such as the loss of control in flight, and also failures to provide adequate air traffic management services.

Occurrence         Category   Definition of an Occurrence
Accidents          Mandatory  Mid-air collision, controlled flight into terrain, ground collision between aircraft, ground collision between aircraft and obstruction. Other accidents of special interest including loss of control in flight due to VORTEX or meteorological conditions.
Incidents          Mandatory  Loss of air separation, near controlled flight into terrain, runway incursion, inability to provide ATM services, breach in ATM system security.
Other occurrences  Voluntary  Anything which has serious safety implications but which is neither an accident nor an incident.

Table 1.6: Distinctions between Accidents and Incidents in Air Traffic Control

Table 1.6 provides domain-dependent definitions of incidents and accidents. Each row provides explicit examples of occurrences in Air Traffic Management. It could not easily be used in the chemical or healthcare industries. It can still be difficult to apply these consequence-based definitions of ATM incidents and accidents. For example, a loss of separation might be avoided if air crews spot each other and respond appropriately. Such an occurrence might be given a relatively low criticality assessment; no loss of separation occurred. However, it can also be argued that this incident ought to be treated as if an air proximity violation had occurred because air traffic control did not intervene to prevent it from happening. This approach is exploited within some European ATM service providers. Further problems complicate the use of consequence-based definitions of accidents and incidents, such as those illustrated in Table 1.6. Individuals may not be able to observe the consequences of the adverse events that they witness. For example, maintenance teams are often remote from the operational outcomes of their actions. As a result, organisations such as the UK Civil Aviation Authority approve specific lists of occurrences that must be reported. For instance, the Ground Occurrence Report Form E1022 is used for the notification of defects found during work on aircraft or aircraft components which are considered worthy of special attention [10]. In contrast to Table 1.6, the following list includes procedural errors and violations, such as incorrect assembly, as well as observations of potential component failure, such as overheating of primary or secondary structure:
- Defects in aircraft structure such as cracks in primary or secondary structure, structural corrosion or deformation greater than expected
- Failures or damage likely to weaken attachments of major structural items including flying controls, landing gear, power plants, windows, doors, galleys, seats and heavy items of equipment
- When any component part of the aircraft is missing, believed to have become detached in flight
- Overheating of primary or secondary structure
- Incorrect assembly
- Failure of any emergency equipment that would prevent or seriously impair its use
- Critical failures or malfunction of equipment used to test aircraft systems or aircraft units
- Any other occurrence or defect considered to require such notification.
The ICAO list of air traffic incidents relied upon an analysis of the potential consequences of any failure. In contrast, the CAA definition of ground maintenance incidents was built from a list of errors, violations and observations of potential failures. These differences can be explained in terms of the intended purpose of these definitions. In the former case, the list of ATM accidents and incidents was intended as a guideline for safety managers in national service providers. They are assumed to have the necessary investigative resources, analytical insights and reconstruction capabilities to assess potential outcomes once incidents have been reported. However, the CAA reporting procedures provide direct guidance for maintenance personnel. These individuals are not expected to anticipate the many different potential outcomes that can stem from the failures that they observe. Such criticality assessments must be performed by the line management who receive and interpret the information from incident reporting systems. These differences illustrate the difficulty of developing a priori definitions of accidents and incidents that ignore the purpose to which those definitions will be put. Some authors have constructed more general definitions of accidents and incidents. For instance, Perrow [675] proceeds by distinguishing between four levels of any system. Unlike most regulatory definitions, such as that illustrated in Table 1.6, Perrow does not focus directly on the likely consequences of a failure but rather looks at those portions of a system that were affected by an incident or accident:
1. Parts. The first level of any system represents the smallest components that are likely to be considered during an accident investigation. They might include objects such as a valve.

2. Unit. These are functionally related collections of parts. For example, a motor unit is built from several individual component parts.

3. Subsystem. These are composed from individual units. For example, the secondary coolant system of a nuclear reactor will contain a steam generator and a water return unit.

4. The plant or system. This is the highest level involved in an accident. Beyond this it is only profitable to think in terms of the impact of an accident on the environment.

In Perrow's terms, accidents only involve those failures that affect levels three and four of this hierarchy. Incidents disrupt components at levels one and two. This definition is critical for the normal accidents argument that Perrow proposes in his book. He argues that `engineered safety functions' cannot reliably be constructed to prevent some incidents from becoming accidents at levels three and four. Unfortunately, however, these distinctions raise a number of problems for our purposes. Definitions of incidents and accidents must serve the pragmatic role of helping individual workers to know what should, and what should not, be reported. It is unclear whether people would ever be able to make the distinctions between levels 2 and 3 that would be required under this scheme. There are further practical problems in applying such structural distinctions between accidents and incidents. As with consequential definitions, it may be difficult for any individual to determine the scope of any failure as it occurs. They may fail to realise that the failure of a level one valve will create knock-on effects that compromise an entire level four system. The social and cultural issues that were introduced in previous sections also affect the interpretation of accidents and incidents. This is illustrated in Figure 1.3. From the viewpoint of person A, the system is operating `abnormally'
as soon as it moves from state 1 to state 2.

Figure 1.3: Normal and Abnormal States

Person B holds different beliefs about what is, and what is not, normal. As a result, they only consider that an incident has occurred when the system moves from state 2 to state 3. The different viewpoints shown in this sketch can arise for a number of reasons. For example, Person A may have been trained to identify the transition between states 1
and 2 as potentially hazardous. Alternatively, person B may exhibit individual attitudes to risk that dispose them not to report hazardous incidents. Figure 1.3 can also illustrate how attitudes to hazards may change over time. For example, the figure on the left might represent an initial attitude when the system is first installed. Over time, dangerous working practices can become the norm. It can be difficult for many individuals to question such established working practices even if they violate recognised rules and regulations. Over time, these dangerous practices may themselves become sanctioned by procedures and regulations. This is illustrated by what Diane Vaughan has called "normalised deviance" in the events leading to the Challenger accident [846]. Under such circumstances, the figure on the right might represent the prevailing view of normal and abnormal states. The previous analysis helps to identify a number of possible approaches to the definition of what an incident actually is. These can be summarised as follows:
open definitions. This approach encourages personnel to report any failure as a safety-related incident. It is exploited by the Air Navigation Services Division of the Swedish Civil Aviation Administration. As a result, they receive several thousand reports per year ranging from the failure of lights or heating systems through to potential air proximity violations. The open approach to the definition of an incident avoids some of the problems with more restricted definitions, see below. However, it can also lead to a dilution of the safety reporting system with more general concerns. In the Air Navigation Services Division this approach is well supported by trained `Gatekeepers' who filter low-priority reports from more serious occurrences. The entire system is, however, dependent on the skill and insight of these personnel and their ability to perform a timely analysis of the initial reports.
closed definitions. Closed systems lie at the other extreme from open definitions such as those exploited by the Swedish Air Traffic Control organisation. These systems provide rigid definitions or enumerations for those incidents that are to be reported to the system. All staff are trained to recognise these high-priority occurrences and all other incidents are handled through alternative mechanisms. The difficulty with this approach is that the introduction of new equipment can have a profound impact upon the sorts of incidents that will occur. As a result, these enumerations must be revised over time. Otherwise, staff will not report new incidents but will instead continue to wait for occurrences that are now prevented by more secure defences.

consequential definitions. These represent a subset of the closed approach, described above. Incidents and accidents are distinguished either by their actual outcomes or by the probable worst-case consequences. For example, the US Army regulations distinguish between class A to D accidents whose consequences range from $1,000,000 or more (class A) to between $2,000 and $10,000 (class D) [806]. Class E incidents result in less than $2,000 damage but interrupt an operational or maintenance mission. Class F incidents relate to Foreign Object Damage and are restricted to aviation operations. As we have seen, the problem here is that it can be difficult for operators to predict the possible consequences of a failure without further investigation and analysis. As a result, these definitions tend to be applied by investigators and analysts after an initial warning or report has been generated. A short sketch after this list illustrates this style of classification.

structural definitions. This is a further example of a closed approach which has strong links to consequential definitions. The consequences of a failure are assessed for each of several layers of a system. Incidents affect the lower-level components whilst accidents involve the system as a whole. There are a number of practical problems in applying this as a guide for incident reporting. There are also theoretical problems when individual component faults may cause a fatality, for example through electrocution, even though the system as a whole continues to satisfy its functional requirements. A strict interpretation of such events would rank them as an incident and not an accident in Perrow's terms [675].

procedural definitions. This is another example of a closed approach. Rather than focusing on the anticipated outcome of a failure, procedural definitions look at violation of the prescribed methods. The problem here is that the individuals who witness violations may fail to recognise them as violations, especially if they have become part of standard working practices. Such problems also affect incident reporting systems that ask operators to comment on `anything unusual'.
pragmatic definitions. Some incident reporting systems take a particularly pragmatic approach to the definition of what should and what should not be reported. They are often characterised by the phrase `target the doable'. This characterises systems that have been established within larger organisations that may not, as a whole, support the recommendations of the scheme. Some of the pioneering attempts to establish incident reporting systems within the UK National Health Service deliberately focused on those occurrences that individual consultants felt that they could address; incidents stemming from wider acquisitions policy or even from other clinical departments were deliberately excluded.
special issues. Finally, some incident reporting systems deliberately focus on key issues. For instance, the European Turbulent Wake incident reporting system was established with help from the UK Meteorological Service in response to concerns about a number of occurrences involving commercial flights. Other systems are deliberately focused to elicit information from key personnel who may be under-represented in existing incident databases. For example, schemes have been initiated to encourage incident reporting from General Aviation and military pilots rather than commercial pilots. Other schemes have focused on eliciting information from medical and surgical staff rather than nursing personnel.
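The sketch promised under consequential definitions: a Python function that applies the US Army damage classes quoted above. The text gives only the class A threshold and the class D and E bands, so the boundary between classes B and C is deliberately left open here rather than invented.

    def army_class(cost_usd, interrupts_mission=False, foreign_object_damage=False):
        """Classify an occurrence by the cost bands quoted in the text."""
        if foreign_object_damage:
            return "F"                    # aviation foreign object damage
        if cost_usd >= 1_000_000:
            return "A"
        if cost_usd > 10_000:
            return "B or C"               # band not subdivided in the text
        if cost_usd >= 2_000:
            return "D"
        return "E" if interrupts_mission else "below reportable threshold"

    print(army_class(1_500_000))                     # A
    print(army_class(5_000))                         # D
    print(army_class(500, interrupts_mission=True))  # E

Note that the classification depends entirely on an estimated cost, which illustrates the difficulty raised above: the figure is rarely known when the initial report is filed.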
The preceding discussion should illustrate the difficulty of providing a single definition of accidents and incidents. These problems stem from the different ways in which different people must use these definitions. The person witnessing an adverse occurrence must know whether or not it is worthwhile reporting. Safety managers may apply different criteria when determining whether or not an incident report merits a full-scale investigation or whether it can be dealt with at a more local level. National authorities may apply further criteria when deciding whether national trends indicate a need for regulatory intervention. It is important to emphasise that the distinction between an incident and an accident is not firm and cannot be made a priori. The same set of events may be reclassified at several stages in the investigation and analysis of an occurrence. These must not be arbitrary decisions. Later chapters will stress the need to provide a documented justification for such changing assessments. However, there are often important pragmatic reasons for such actions. For example, a number of European air traffic control agencies have not reported any major accidents in recent years. As a result, some air traffic service providers have begun to treat certain `critical incidents' as if they were accidents, even though no loss of life or property has occurred. The intention is to rehearse internal procedures for dealing with more critical events when, and if, they do occur. Such decisions also focus attention and resources on the causes of these incidents. To summarise, simple distinctions between accidents and incidents ignore the underlying complexity that characterises the ways in which different national and international organisations treat technological failure. Different definitions are used, and may indeed be necessary, to support different stages of an organisation's response to incidents and accidents.
1.3 Summary

It is difficult to estimate the costs when human error, systems failure or managerial weakness threatens safety. Employers face a number of direct costs when their employees are injured. The UK Health and Safety Executive estimate that occupational injuries cost employers around 4-8% of their gross trading profit; currently approximately $6 billion. There are also indirect costs that accrue when regulators intervene. In the United States, Federal and State inspectors levied penalties for health and safety violations that totalled $151,361,442 for the fiscal year 1999. Incident or occurrence reporting systems enable companies to identify potential failures before they occur. They provide insights that can be used to guide risk assessment during subsequent development.
Incident reporting systems provide regulators with data that can be used to guide any necessary intervention. They help to prioritise health and safety initiatives and awareness-raising campaigns. They can also be used to address public concerns; for example, the creation of a national incident reporting system for UK railways followed shortly after the Ladbroke Grove and Southall accidents. At an international level, incident reporting systems provide means of ensuring that lessons are effectively shared across national boundaries. The following chapter introduces the challenges that must be addressed if these claimed benefits are to be realised.
Chapter 2
Motivations for Incident Reporting

This chapter explains why many organisations develop incident reporting systems. The intention is often to identify potential failures before an accident occurs. The higher frequency of less critical mishaps and near-miss events also supports statistical analysis that cannot reliably be performed on relatively infrequent accidents. Data and lessons from one system can be shared with the operators of other similar applications. The following pages also identify limitations that are often forgotten by the proponents of incident reporting systems. Many submissions do little more than remind their operators of hazards that are well understood but are difficult to avoid. The resources used by a reporting system might alternatively fund safety improvements. Managers of successful reporting systems can be overwhelmed by a mass of data about relatively trivial mishaps. Later sections go on to review issues of confidentiality and scope that help to determine whether the claimed benefits outweigh the perceived costs of operating these systems.
2.1 The Strengths of Incident Reporting

The US Academy of Science recommended that a nationwide mandatory reporting system should be established to improve patient safety [453]. They argued that this system should initially be based around hospitals but that eventually other `care settings' should be included. The International Civil Aviation Organisation has published detailed guidance on the manner in which reporting systems must be implemented within signatory states [384]: "(The assembly) urges contracting states to undertake every effort to enhance accident prevention measures, particularly in the areas of personnel training, information feedback and analysis and to implement voluntary and non-punitive reporting systems, so as to meet the new challenges in managing flight safety, posed by the anticipated growth and complexity of civil aviation." (Resolution A31-10: Improving accident prevention in civil aviation) "(The assembly) urges all Contracting States to ensure that their aircraft operators, providers of air navigation services and equipment, and maintenance organisations have the necessary procedures and policies for voluntary reporting of events that could affect aviation safety." (ICAO Resolution A32-15: ICAO Global Aviation Safety Plan) The US Coast Guard and the Maritime Administration have helped to establish a voluntary international maritime information safety system. This is intended to receive, analyse, and disseminate information about unsafe occurrences. They argue that these `non-accidents' or `problem events' provide an untapped source of data. They can be used as indicators of safety levels in the maritime community and provide the information necessary to prevent accidents before they happen [830]. The goals of the system are to reduce the frequency of marine casualties, to reduce the extent of injuries and property damage (including environmental damage), and to create a safer and more efficient shipping transportation system and mariner work environment.
The Council of the European Union had similar concerns when it drafted the 1996 directive on the control of major accident hazards. This has become more widely known as the Seveso II directive; it was named after the town in Italy where 2,000 people had to be treated following a release of tetrachlorodibenzoparadioxin (Dioxin) in 1976: "Whereas, in order to provide for an information exchange and to prevent future accidents of a similar nature, Member States should forward information to the Commission regarding major accidents occurring in their territory, so that the Commission can analyse the hazards involved, and operate a system for the distribution of information concerning, in particular, major accidents and the lessons learned from them; whereas this information exchange should also cover `near misses' which Member States regard as being of particular technical interest for preventing major accidents and limiting their consequences." [187] The Transportation Safety Board of Canada [622] identified a number of reasons to justify the creation of its own confidential incident reporting system. They argued that incident data will support the Board's studies on a wide range of safety-related matters including operating procedures, training, human performance and equipment suitability. The analysis of incident reports can also help to identify widespread safety deficiencies that might not have been detected from individual reports submitted to regional centres. Greater insights into national and international transportation safety issues can be gained by collating accident/incident reports and by comparing them with data from other agencies. These individual initiatives across a range of industries illustrate the increasing importance of incident reporting within safety management systems [444]. They can also be used to identify common arguments that justify the development and maintenance of incident reporting systems:

1. Incident reports help to find out why accidents DON'T occur. Many incident reporting forms identify the barriers that prevent adverse situations from developing into a major accident. These insights help analysts to strengthen those safeguards that have already proven to be effective barriers in `near miss' incidents.

2. The higher frequency of incidents permits quantitative analysis. It can be argued that many accidents stem from atypical situations. They, therefore, provide relatively little information about the nature of future failures. In contrast, the higher frequency of incidents provides greater insights into the relative proportions of particular classes of human `error', systems `failure', regulatory `weakness' etc.

3. They provide a reminder of hazards. Incident reports provide a means of monitoring potential problems as they recur during the lifetime of an application. The documentation of these problems increases the likelihood that recurrent failures will be noticed and acted upon.

4. Feedback keeps staff `in the loop'. Incident reporting schemes provide a means of encouraging staff participation in safety improvement. In a well-run system, they can see that their concerns are treated seriously and are acted upon by the organisation. Many reporting systems also produce newsletters that can be used to increase awareness about regional and national safety issues.

5. Data (and lessons) can be shared. Incident reporting systems provide the raw data for comparisons both within and between industries.
If common causes of incidents can be observed then, it is argued, common solutions can be found. However, in practice, the lack of national and international standards for incident reporting prevents designers and managers from gaining a clear view of the relative priorities of such safety improvements.

6. Incident reporting schemes are cheaper than the costs of an accident. The relatively low costs of managing an incident reporting scheme should be offset against the costs of failing to prevent an accident. This is a persuasive argument. However, there is also a concern that punitive damages may be levied if an organisation fails to act upon the causes of an incident that subsequently contribute towards an accident.
7. May be required to do it. The final argument in favour of incident reporting is that these schemes are increasingly being required by regulatory agencies as evidence of an appropriate safety culture. This point is illustrated by the ICAO resolutions A31-10 and A32-15 and by the EC Seveso II directive that were cited on previous pages.

Many of these arguments require little additional explanation. For example, it is sufficient to cite the relevant ICAO resolutions to demonstrate that member states should implement incident reporting systems. However, some of these apparent justifications for incident reporting are more controversial. For example, we have argued that the higher number of incidents can be used to drive statistical analyses of the problems that lead to a far smaller number of accidents. Heinrich's [340] pioneering studies in occupational health and safety suggested an approximate ratio of one accident involving major injury to thirty occurrences involving minor injuries to three hundred `near-miss' incidents. More recently, Bird [84] proposed a ratio of one accident, involving serious or disabling injuries, to ten minor injuries to thirty incidents involving property damage to six hundred incidents resulting in no visible damage. He based this on a statistical analysis of 1.5 million reported incidents. The work of Heinrich, Bird and their colleagues has led to the `Iceberg' model of incident data. Any accident is the pinnacle, or more properly the nadir, of a far larger number of incidents. The consequences of this form of analysis seem clear. Incident reports provide a far richer data source for organisational learning and the `control' of major accidents.
Figure 2.1: Federal Railroad Administration Safety Iceberg

Figure 2.1 illustrates a number of caveats that can be made about the Iceberg model. The central pyramid represents the results of Heinrich's initial study. On either side, the diagram presents the proportion of fatal to non-fatal injuries reported for different groups of workers in the US rail system based on Federal Railroad Administration data from 1997 to 2000. Direct railroad employees or `workers on duty' suffered a total of 119 fatalities and 33,738 injuries. Contractors experienced 31 fatalities and 1,466 injuries in the same period. The first problem is that the FRA has no reliable means of calculating the number of `near miss' incidents over this period. As a result, it is only possible to examine the relationship between work-related deaths and injuries. Workers had a Heinrich ratio of one fatality for every two hundred and eighty-four injuries. The ratio for contractors was one fatality to forty-seven injuries. Further problems arise when we interpret these ratios. They might show that contractors are less likely to be injured than `workers on duty'. An alternative way of expressing this is to say that contract staff are more likely to be killed than injured when compared to other employees. However, these ratios provide a very impoverished measure of probability. They do not capture the comparative risk exposure of either group. For example, the smaller number of fatal accidents to contractors may stem from a proportionately smaller number of workers. Contract workers are more likely than full-time, direct staff to be involved in high-severity incidents [874]. Alternatively, it can be argued that contractors are more reluctant to report work-related injuries than `directly' employed staff.
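The arithmetic behind these ratios is simple to reproduce. The following minimal sketch (in Python, using only the FRA counts quoted above; the function name is our own) shows how the two figures are derived:

    # Injury-to-fatality ratios from the FRA counts quoted above (1997-2000).
    def injuries_per_fatality(fatalities, injuries):
        return injuries / fatalities

    workers = injuries_per_fatality(119, 33_738)      # 'workers on duty'
    contractors = injuries_per_fatality(31, 1_466)    # contract staff

    print(f"Workers on duty: 1 : {workers:.0f}")      # 1 : 284
    print(f"Contractors:     1 : {contractors:.0f}")  # 1 : 47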
This line of analysis is important because it questions the reliability of the data that can be obtained to calculate Heinrich ratios. The argument that statistical data about incidents can be used to predict potential accidents is based on the premise that incidents are accidents in the making. It is assumed that incidents share the same root causes as more serious occurrences. Van der Schaaf [843, 840] provides preliminary data from the Dutch chemical industry to confirm this premise. Glauz, Bauer and Migletz [291] also found a correlation between traffic conflicts and accidents. Others have exploited a more qualitative approach by looking for common contributory factors in both incidents and accidents. For instance, Helmreich, Butler, Taggart, and Wilhelm [341] have attempted to show that poor Crew Resource Management (CRM) causes both incidents and accidents. They then use this analysis to propose a predictive Cockpit Management Attitudes Questionnaire that can assess individual attitudes towards crew communication, coordination, and leadership issues.

A great deal of safety-related research rests on the assumption that incidents are good predictors of potential accidents. Wright has recently challenged this view in her statistical analysis of the Scottish railways' Confidential Incident Reporting and Analysis System (CIRAS) [874]. This confidential system elicits information about less `critical' incidents. All accidents must, in contrast, be reported to a specialist unit within the UK Health and Safety Executive. Her work, therefore, focuses on `near misses' and unsafe acts near the base of the Iceberg model. A near-miss has the potential to lead to a more serious occurrence, for example:

"A Driver overshot a station platform by one and a half coach lengths. The Driver experiences wheelslip which may have been due to rail contamination. This did not result in any damage or injury." [874]

An unsafe act occurs when operator intervention actively undermines the safety of their system:

"A Driver stated that when requested by the Signaller to do a controlled stop to assess railhead conditions he carries out this procedure assuming exceptional conditions i.e., reduced speed rather than normal speed. A controlled stop test carried out in this manner would not indicate the braking capacity in normal conditions and lead to an incorrect assumption that normal working may be resumed." [874]

Wright was able to conduct follow-up interviews with the staff who had submitted a confidential form from a total collection of 165 reports. A causal analysis was conducted using guidelines in the systems classification handbook and was validated by inter-rater reliability trials [197]. Occurrences were first assessed to identify technical and human factors issues. If a human factors `failure' was identified then it was categorised as either proximal, distal or intermediate. Proximal factors include a range of human failures at the `sharp end'. Intermediate factors relate to training or communications failures between high-level management and front-line staff. Distal factors relate to organisational and managerial issues that are remote from the workplace. Table 2.1 provides a comparison of the high-level causes of the `near misses' and unsafe acts. The discrepancy between the number of reports and the total number of causal factors in this table can be explained by the fact that an incident can involve one or more causal factors.
Category       Near Miss (total 155)   Unsafe Acts (total 223)
Technical      20.7% (32)              1.3% (3)
Proximal       27.7% (43)              23.3% (52)
Intermediate   21.9% (34)              21.2% (47)
Distal         29.7% (46)              54.3% (121)
Table 2.1: Causal Comparison of CIRAS Incidents and Unsafe Acts

As can be seen, technical faults and failures seem to occur more frequently in near-miss events than in unsafe acts. Conversely, distal factors such as organisational and managerial problems seem
to occur more frequently as causal factors in unsafe acts. From this it follows that any analysis of `near miss' events might fail to predict the probable causes of actual incidents at the lower levels of the Iceberg model. These results can be explained in terms of the particular application area that Wright was studying. For example, near misses typically involved a failure to halt a train within the specified distance from a particular signal. These were often attributed to technical problems such as contaminated railheads. Unsafe acts were, in contrast, associated with the violations of company rules and procedures that govern driver behaviour on the UK railways. More work is required to confirm Wright's more general hypothesis that adverse events at the lower levels of the Iceberg model may provide poor predictors of accidents at the higher levels.
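One way to sharpen this comparison is a conventional test of independence between event type and causal category. The following sketch assumes only the raw counts from Table 2.1; it uses SciPy's standard chi-square test and is illustrative rather than a reconstruction of Wright's own analysis:

    # Test whether causal profiles differ between near misses and unsafe acts.
    from scipy.stats import chi2_contingency

    #             Technical  Proximal  Intermediate  Distal
    near_misses = [32,        43,       34,           46]   # total 155
    unsafe_acts = [ 3,        52,       47,          121]   # total 223

    chi2, p, dof, expected = chi2_contingency([near_misses, unsafe_acts])
    print(f"chi-square = {chi2:.1f}, dof = {dof}, p = {p:.2g}")
    # A small p-value indicates that the two causal profiles differ by more
    # than sampling variation, consistent with the pattern described above.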
2.2 The Weaknesses of Incident Reporting

The most obvious limitation of incident reporting systems is that they can be expensive both to set up and to maintain. For instance, Leape notes that the Aviation Safety Reporting System spends about $3 million annually to analyse approximately 30,000 reports. This equates to about $100 (£66) per case.

"These `near miss' situations are far simpler to analyse than actual accidents, thorough investigation of which would almost certainly cost far more. It would be interesting to know, for example, the cost per case of investigations reported to the confidential enquiries. However, if we applied the figure from the Aviation Safety Reporting System to the 850,000 adverse events that are estimated to occur annually in the UK National Health Service, the cost of investigation would be $50 million annually." [480]

For comparison, it has been estimated that the cost of clinical negligence to health authorities and NHS Trusts was approximately $200 million in 1995-1996. The NHS summarised accounts for 1996-2001 include provision totalling $80 million with contingent liabilities of $1.6 billion [89].

Even when incident reporting systems are successfully established and maintained, a number of problems can limit their effectiveness. For instance, there is in reality very little sharing of incident data. The European Confidential Aviation Safety Reporting Network ran between 1992 and 1999 with funding from the European Community. The network was intended to improve safety by passing on incident information to the aviation community. However, it was forced to close through lack of support from some sectors of the European aviation industry. Further problems limit the transfer of incident information between organisations within an industry. For instance, Boeing operate an extensive system for collecting information about maintenance problems in their aircraft. They have successfully encouraged the exchange of data with airline operators. Unfortunately, however, there has been little coordination between airlines and groups of airlines about the format that this data should take. These formats are proprietary in the sense that they have been tailored to meet the specific needs of the operating companies. As a result, when Boeing attempt to collate the data that is being shared they must face the considerable task of translating between each of these different formats. Any conclusions that are drawn from this data must also account for the different reporting cultures and reporting practices that exist within different operating groups [472].

Incident reporting systems may also fail to keep staff `in the loop'. Occasionally these systems develop into grandiose initiatives that fulfill the organisational ambitions of their proponents rather than directly addressing key safety issues. There is also a danger that incident reporting systems degenerate into reminders of failures that everyone knows exist but few people have the political or organisational incentives to address [409]. Similarly, they may recommend short-term fixes or expedients that fail to address the underlying causes of incidents. This is illustrated by the following report from NASA's Aviation Safety Reporting System (ASRS):

"Problem: on landing, gear was unlocked but up. Contributing factors: busy cockpit. [I] did not notice the gear down-and-locked light was not on. Discovered: Gear up was discovered on landing.
Corrective action: [I] was unable to hear gear warning horn because of new noise-cancelling headsets. I recommend removal of one ear-piece in
landing phase of flight to allow audible warning devices to be heard by pilot. The noise-cancelling headsets were tested by three people on the ground and all three noted that with the headsets active the gear warning horn was completely masked by the headsets." [62]

This illustrates the strengths and weaknesses of many incident report schemes. They provide first-hand insights into operational problems. They can also provide pragmatic remedies to the challenges that poorly designed equipment creates. However, there is also a danger that immediate remedies to individual incidents will fail to address the root cause of a problem. The noise-cancelling headphones were clearly not fit for purpose. The proposed remedy of removing one headphone provides a short-term fix for individual pilots. However, it does little to address the underlying problems for future product development.

Further problems limit the ways in which data can be shared between incident reporting schemes. Although some organisations have successfully exchanged information about the frequency of particular occurrences, there have been few attempts to ensure any consistency in their response to those incidents. This creates particular problems for the maritime and aviation industries where operators may read of different recommendations being made in different countries. The following excerpt comes from the Confidential Human Factors Incident Reporting Programme (CHIRP). CHIRP is the UK equivalent of the ASRS that was cited in the previous quotation. This excerpt offers a slightly different perspective on the problems of ambient noise in the cockpit:

"Fortunately, I have no incident to report. I would like, however, to highlight a common practice by some airlines, including my employer, which I feel is a significant risk to flight safety: namely the practice of not using flight deck intercom systems in favour of half wearing a headset over one ear for VHF comms, whilst using the other ear, unaided, for cockpit communications. And all this in what are often not so quiet flight decks. I cannot believe that we do not hear much better with two ears than with one, and many are the times when I, and other colleagues of mine, have had to ask for the other crew member to repeat things because of aircraft noise in one ear, and ATC in the other with the volume turned high enough not to miss a call. Not the best answer in a busy terminal area after a long flight, and an unnecessary increase in stress factors. Myself and others have raised this point several times to our training and safety departments, all of which has fallen, pardon the pun, onto deaf ears. The stock answer is that there is no written down SOP on intercoms, and common agreed practice rules. In reality, the guy in the right hand seat has no influence without things getting silly. As even single ear-piece headsets are not incompatible with intercoms, I would have thought a compromise would be mandatory use of full headset and intercom at the busy times, say below a given flight level, with the option for personal preferences in the cruise. Volumes for different communication channels could be adjusted to suit, and surrounding noise significantly reduced. This would preclude the need to speak louder than usual to be heard, to ask for repetitions, and generally improve the working environment. After all, if the CAA and other agencies have made intercoms mandatory in transport aircraft, it will be for a reason."
CHIRP Comment: "The use of headsets for the purpose of effective reception of RTF/intercom messages between flight crew members is not mandated. The certification requirement for an intercom system is to provide communication between all crew members in an emergency. The partial/full use of a headset in normal operations should be dependent on the ambient noise level on the flight deck. For this reason, some operators specify the headset policy by aircraft type and phase of flight, as the reporter suggests." [175]

The US ASRS article, cited above, argues that only one headset should be used during landing in order to help the crew hear cockpit warnings. In contrast, the CHIRP report condemns this practice as a threat to flight safety. This apparent contradiction is resolved by the second report, which
argues that the partial or full use of headsets should be determined by the level of ambient noise. However, this distinction is not made explicit in the first report. Such differences illustrate the inconsistencies that can arise between national incident reporting systems. They are also indicative of a need to improve communication between these systems if we are to achieve the benefits that are claimed for the exchange of incident data. The ASRS and CHIRP systems are run by `not for profit' organisations. The problems of data exchange are many times worse when companies may yield competitive advantage through the disclosure of incident information.

Incident reporting systems can provide important reminders about potential hazards. However, in extreme cases these reminders can seem more like glib repetitions of training procedures rather than pro-active safety recommendations. This problem is compounded by the tendency to simply remind staff of their failures rather than to address the root causes, such as poor design or `error inducing environments' [362]. Over time, the continued repetition of these reminder statements from incident reporting systems is symptomatic of deeper problems in the systems that users must operate:

"On pre-flight check I loaded the Flight Management Computer (FMC) with longitude WEST instead of EAST. Somehow the FMC accepted it (it should have refused it three times). During taxi I noticed that something was wrong, as I could not see the initial route and runway on the navigation map display, but I got distracted by ATC. After we were airborne, the senior cabin attendant came to the flight deck to tell us the cabin monitor (which shows the route on a screen to passengers) showed us in the Canaries instead of the Western Mediterranean! We continued the flight on raw data only to find out that the Heading was wrong by about 30-40 degrees. With a ceiling of 1,000 ft at our destination I could not wait to be on `terra firma'. Now I always check the Latitude/Longitude three times on initialisation!"

(Editorial note) "A simple but effective safeguard against `finger trouble' of the type described is for the pilot who does not enter the data to confirm that the information that he/she sees displayed is that which he/she would expect. Then, and only then, should the `Execute' function button be pressed." [176]

The CHIRP feedback is well intended. It also reiterates recommended practices that have formed part of Crew/Cockpit Resource Management (CRM) training for almost twenty years [410]. UK Aeronautical Information Circular (AIC) 143/1993 (Pink) states that all crew must have completed an approved CRM course before January 1995. Joint Airworthiness Requirement Operational Requirements (JAR OPS) sub-part N, 1.945(a)(10), 1.955(b)(6) and 1.965(e) extended similar requirements to all signatory states during 1998. There is a considerable body of human factors research that points to the dangers of any reliance on such reminders [699]. Effectiveness declines with each repetition that is made. It is depressing, therefore, that such data-entry problems continue to be a frequent topic in aviation reporting systems. These incidents are seldom the result of deliberate violations or aircrew negligence. They illustrate the usability problems that persist within Commercial Aviation and which cannot simply be `fixed' by training in cockpit coordination [410].
Incident reporting systems must go beyond repeated reminders to be `careful' if they are to preserve the confidence of those who contribute to them. The US ASRS recognise this by issuing two different forms of feedback in response to the reports that they receive. The Callback bulletin describes short-term fixes to immediate problems. In contrast, the DirectLine journal addresses more systemic causes of adverse events and `near miss' incidents even if it has a more limited audience than its sister publication. For instance, the following excerpt is taken from a DirectLine analysis of the causes of several mishaps involving Pre-Departure Clearances:

"The type of confusion experienced by this flight crew over their (Pre-Departure Clearance) PDC routing is potentially hazardous, as noted by a controller reporter to ASRS: `It has been my experience ... that several times per shift aircraft which have received PDCs with amended routings, have not picked up the amendment ... I have myself on numerous occasions had to have those aircraft make some very big turns to achieve sep-
aration.' (ACN # 233622). The sources consulted by ASRS suggested several potential solutions to this problem:

- Standardise PDC formats, so that pilots will know where to look for routing information and revisions.
- Show only one clearance line in a PDC, and insert any revisions into the clearance line.
- Make the revision section more visible by tagging it (`REVISION') or highlighting with asterisks or other eye-catching notation (*****).
- Provide flight crews with training in how to recognise PDC revisions." [56]

There are limits to the safety improvements that can be triggered through initiatives in publications such as DirectLine. Some mishaps can only be addressed through industry cooperation and regulatory intervention. Others require international agreements. For example, reporting systems have had a limited impact on workload in aviation. Similarly, usability problems continue to affect new generations of computer systems for airline operations. Data entry in flight management systems continues to be error prone many years after the problem was first identified. These `wicked problems' must be considered when ambitious proposals are made to extend aviation reporting into healthcare and other transportation modes.
2.3 Different Forms of Reporting Systems

There are several different types of reporting system. This section explains why concerns over retribution have led to anonymous and confidential schemes. It also explains how both national and local systems have been set up to ensure that recommendations do not simply degenerate into reminders about known problems.
2.3.1 Open, Confidential or Anonymous?

The FAA launched the Global Aviation Information Network (GAIN) initiative as an attempt to encourage national and commercial organisations to exchange occurrence data. The Office of System Safety that drove the GAIN proposal within the FAA identified four main barriers to the success of such a system. These can be summarised as follows:

"1. Punishment/Enforcement. First, potential information providers may be concerned that company management and/or regulatory authorities might use the information for punitive or enforcement purposes. In the US, significant progress has been made on this issue. Following the example of the UK, the FAA issued a policy statement in 1998 to the effect that information collected by airlines in their Flight Operations Quality Assurance (FOQA) programs, in which flight data recorder information is collected routinely, will not ordinarily be used against the airlines or pilots for enforcement purposes. In January 2000, the US President announced the creation of the Aviation Safety Action Programme (ASAP), in which airlines will collect reports from pilots, mechanics, dispatchers, and others about potential safety concerns, and made a commitment analogous to the FOQA commitment not to use the information for enforcement purposes. In April 2000, Congress enacted legislation that requires the FAA to issue a rule to develop procedures to protect air carriers and their employees from enforcement actions for violations that are discovered from voluntary reporting programs, such as FOQA and ASAP programs.

2. Public Access. Another problem in some countries is public access, including media access, to information that is held by government agencies in certain countries. This problem does not affect the ability of the aviation community to create GAIN, but it could affect the ability of government agencies in some countries to receive information from GAIN. Thus, in 1996 the FAA obtained legislation that requires the agency to protect voluntarily supplied aviation safety information from public disclosure. This
will not deprive the public of any information to which it would otherwise have access, because the agency would not otherwise receive the information; but on the other hand, there is a significant public benefit for the FAA to have the information because it helps the FAA prevent accidents and incidents. The FAA is now developing regulations to implement that legislation...

3. Criminal Sanctions. A problem in some countries is the fear of criminal prosecution for regulatory infractions. Such a fear would be an obvious obstacle to the flow of aviation safety information. This has not historically been a major problem in the U.S., but the trend from some recent accidents is troubling.

4. Civil Litigation. Probably the most significant problem, certainly in the U.S., is the concern that the information will be used against the reporter in accident litigation. Some have suggested that, as was done in relation to the public disclosure issue, the FAA should seek legislation from Congress to protect aviation safety information from disclosure in litigation. In comparison with the public disclosure issue, however, the chances of obtaining such legislation are probably very remote; and a failed attempt to obtain such legislation could exacerbate the situation further because these disclosure issues are now determined in court, case by case, and a judge who is considering this issue might conclude that a court should not give protection that Congress refused to give." [308]

Incident reporting systems have addressed these concerns in a number of different ways. For instance, it is possible to identify three different disclosure policies. Anonymous systems enable contributors to entirely hide their identity. Confidential systems allow the limited disclosure of identity but only to trusted parties. Finally, open systems reveal the identity of all contributors. The impact of the distinctions between open, confidential and anonymous systems cannot be over-emphasised. In anonymous systems, contributors may have greater confidence in their submission, safe in the knowledge that they can avoid potential `retribution'. However, there is a danger that spurious reports will be filed. This problem is exacerbated by the fact that it is difficult to substantiate anonymous reports to determine whether they really did occur in the manner described. Investigators cannot simply ask about an incident within a workgroup without the possibility of implicating the contributor. This would remove the protection of confidentiality and could destroy the trust that is fundamental to the success of such systems. The distinctions between open, anonymous and confidential systems are also blurred in many existing applications. For example, the Swedish Air Traffic Control organisation (Luftfartsverket Flygtrafiktjänsten) encourages the open contribution of incident reports. However, normal reporting procedures direct submissions through line supervisors. There is a danger that this might dissuade contributions about the performance of these supervisors. As a result, procedures exist for the confidential submission of incident reports via more senior personnel.
Trust and Technological Innovation

Distinctions between confidential, anonymous and open systems are intended to sustain the confidence and trust of potential participants. In a confidential system, contributors trust that only `responsible' parties will receive identification information. The implications of this for the operation of any reporting system are illustrated by the approach taken with the CIRAS system that covers UK railways. This receives paper-based forms from train drivers, maintenance engineers and other rail staff. A limited number of investigators are responsible for processing these forms. They will conduct follow-up interviews in person or over the telephone. These calls are not made to the contributor's workplace for obvious reasons. The original report form is then returned to the employee. No copies are made. Investigators type up a record of the incident and conduct a preliminary analysis. However, all identifying information is removed from the report before it is submitted for further analysis. From this point it is impossible to link a particular report to a particular employee. The records are held on a non-networked and `protected' database. This data itself is not revealed to industry management. However, anonymised reports are provided to managers every three months.
Incident reporting systems increasingly rely on computer-based applications. The Swedish Air Traffic Control system, mentioned above, is an example of this. Controllers at airfields in the more remote areas of Northern Sweden can receive rapid feedback on a report using this technology. However, electronic submission creates a number of novel and complex challenges for systems that attempt to preserve anonymity. These concerns are illustrated by the assurances that are provided to contributors on the Swiss Anaesthesia Critical Incident Reporting System. These include a commitment that they `will NOT save any technical data on the individual reports: no E-mail address and no IP-number (a number that accompanies each submitted document on the net)' [755]. The use of computer-based technology not only raises security problems in the maintenance of trust during the transmission and storage of electronic documents, it also offers new and more flexible ways of maintaining incident reporting systems. For example, the US Department of Energy's Computerised Accident/Incident Reporting System (CAIRS) exploits an access control mechanism to tailor the level of confidentiality that is afforded to particular readers of particular incident reports. The CAIRS database is used to collect and analyse reports of injuries, illnesses, and other accidents that are submitted to the Department of Energy by their staff or contractors. The following paragraphs provide a brief overview of the innovative way in which the confidentiality of information is tied to particular access rights.

"When you are granted access to CAIRS, you will be assigned an organisational jurisdiction. This jurisdiction may be for a specific organisation or for a complete contractor, area office, or field office. This jurisdiction assignment will determine the records that will be selected when the default organisation selection is utilised in many of the reports and logs. The default can be over-ridden by entering the desired organisation codes in the appropriate input boxes. CAIRS reports contain personal identifiers (names and social security numbers) and information regarding personal injury or illness. In order to prevent an unwarranted invasion of personal privacy, all personal identifiers are masked from the view of general users whenever any logs or reports are generated. The default registration for CAIRS does not provide access to any privacy information. If you require access to privacy information in order to perform your job function, you may apply for access to that information." [655]

It can be difficult to communicate the implications of such computer-based security measures to non-computer literate employees. There is a natural reluctance to believe in the integrity of such safeguards given continuing press coverage about the vulnerability of `secure' systems [1]. The ability to access this data over the web might compound such misgivings.
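The jurisdiction-based masking that the CAIRS excerpt describes can be pictured in a few lines of code. The sketch below is hypothetical: the record fields, type names and access rules are invented for illustration and do not reflect the actual CAIRS implementation.

    # Hypothetical sketch of jurisdiction-based masking of personal identifiers.
    from dataclasses import dataclass

    @dataclass
    class Report:
        organisation: str   # e.g. a contractor, area office or field office
        narrative: str
        name: str           # personal identifier
        ssn: str            # personal identifier

    @dataclass
    class User:
        jurisdiction: str
        privacy_access: bool  # granted only where the job function requires it

    def view(report, user):
        """Return only the fields that this user is permitted to see."""
        if report.organisation != user.jurisdiction:
            raise PermissionError("record is outside the user's jurisdiction")
        masked = not user.privacy_access
        return {
            "organisation": report.organisation,
            "narrative": report.narrative,
            # Personal identifiers are masked from the view of general users.
            "name": "***" if masked else report.name,
            "ssn": "***" if masked else report.ssn,
        }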
Workplace Retribution and Legal Sanction

At least two different classes of problem exist in more open systems. Later paragraphs will address the issues that arise when trying to integrate a pro-active safety culture into a punitive legal system. There is a natural reluctance to implicate oneself or one's colleagues when subsequent investigations might directly threaten their livelihood and wellbeing. The second set of problems arises from a justified fear of persecution by colleagues or employers. These fears are natural if, for example, the subject of a report is a person in a position of authority or if the report reflects badly upon such a person. These individuals are likely to have a strong influence upon the career prospects and promotion opportunities of their more junior colleagues. The long-term consequences of any actual or implied criticism can be extremely serious. Such concerns have long been apparent in the `cockpit gradient'; co-pilots have extreme difficulty in challenging even minor mistakes made by a Captain. Co-pilots have been known to remain silent even when their colleague's behaviour threatened the lives of everyone on board [733].

There are other reasons why individuals can be reluctant to contribute to incident reporting systems. There may be a fatalism that such an individual or group will suppress the report. If the report focuses less on higher management and more on their colleagues then the contributor may have concerns about appearing to be disloyal. In all of these cases, a natural reluctance can be
compounded by a feeling of self-doubt. It may not be clear to the reporter that an adverse event has occurred. Those involved in an incident may seek to excuse or cover up their behaviour. Junior staff can also be reluctant to appear `stupid' by raising concerns over unfamiliar equipment or procedures. As a result, they can remain silent about important safety concerns.

Many of the issues described above are illustrated by the events leading to the UK Bristol Royal Infirmary Inquiry. This focused on the procedures that were used to gain parental approval for child organ retention after autopsy. Concerns about these procedures were first identified following complaints that several complex cardiac surgical procedures continued to be conducted in spite of an unusually low recovery rate. The inquiry heard how Steve Bolsin, a member of staff within the unit, had attempted to draw attention to these problems by conducting a personal clinical audit. The following quotation comes from the hearings of this inquiry. The questions, labelled Q, were posed by the legal team to the Chief Executive of the United Bristol Healthcare NHS Trust. His answers are labelled with an A.

"Q. There was, was there, personal difficulty for a number of people in his overall conclusions being accepted?

A. That certainly seems to be the case from all the records that I have seen, yes.

Q. To what extent was that a reflection, would you say, of the absence of an institutionalised system of audit, properly monitored, and to what extent did you consider that was part of a club culture where someone who rocked the boat, in whatever capacity, might be, as it were, going against the `club'?

A. They could both be contributory factors. Clearly, if there was no thorough-going structure in place along the lines we have discussed, then that is not going to lead to a climate whereby individuals doing audit and then presenting it is necessarily going to be received positively. Also, of course, if data is produced that appears to be critical of certain individuals and has not been collected with their knowledge and they do not subscribe to the methodology, then it would be surprising if they did not feel a degree of resentment and rejection of what was put in front of them. And it is possible that if this was undertaken by someone relatively new to the organisation who was challenging senior figures in the organisation, that, yes, indeed, it may have cut across some of the cultural boundaries within the Trust." [435]

In the subsequent investigations, Steve Bolsin's intervention was widely praised. However, things become more complex if an individual's actions can be interpreted as either `whistle blowing' or `trouble making' depending on one's perspective. This dichotomy is illustrated by Mary Schiavo's criticisms of the FAA. She held the post of Inspector General in the US Department of Transportation. Following the Valujet crash, she told an American House of Representatives panel that she had made regular complaints to the FAA about what she felt were lax inspection practices in monitoring rapidly expanding airlines. Her comments and criticisms were widely reported in the media. However, her `whistle blowing' was, in turn, heavily criticised by the US Congress. They attacked her by asking why she had not first passed her concerns to the Congress before publicly airing her criticisms.
Under federal law, inspectors general are required to pass on to Congress within seven days any problems requiring immediate attention. She chose to resign from her post and subsequently published an account of her criticisms [729]. This dichotomy between constructive `whistle blowing' and destructive criticism of an employer can also be seen in the Paul van Buitenen case. He voiced concerns about fraud and mismanagement in the European Commission's $60 billion budget. When these criticisms were made public, the veracity of his claims and his motivation for making them were, in turn, heavily criticised by
individuals within the Commission. Although this incident did not have direct safety implications, his statements in a BBC interview provide a powerful illustration of the psychological pressures that affect such individuals:

"I did not realise the full consequences of what would happen. I did not even know the word whistle-blower - I did not know this phenomenon existed... It was completely strange for me to see the commission tackle me on my personality and my credibility and not on the contents of what was disclosed. Sometimes I had difficulty keeping the tears inside when I discovered what machinery was brought against me... I am withdrawing as of April 1st, I want to be an anonymous official again. I want to show I can still be loyal, I want to do a normal standard budget management job. I want to have a quiet family life and be a husband and a father to my children who still have to do three years at secondary school, and I cannot carry on carrying this on my own." [103]

A UK National Audit Office enquiry headed by Sir John Bourn subsequently found errors totalling about $3 billion in European pay-outs during 1998. van Buitenen's concerns are occasionally echoed in safety-related incident reporting systems. The provision of a reporting system is no guarantee of an appropriate safety culture in the companies that operate within an industry:

"At the start of the Winter heavy maintenance programme, the company railroaded into place a computerised maintenance and integrated engineering and stores, planning and labour recording system. No training was given on the operational system, only on a unit under test. Consequently we do not look at planes any more, just VDU screens, filling in fault report forms, trying to order parts the system does not recognise, as the stores system was not programmed with (aircraft type) components (the company wanted to build a database as equipment was needed)... The record had numerous faults, parts not recorded as being fitted, parts removed with no replacements, parts been fitted two or three times, parts removed by non-engineering staff, scheduled tasks not called-up by planning, incorrect trades doing scheduled tasks and certifying, and worst of all the record had been altered by non-certifying staff after the CRS signatories had closed the work. Quality Airworthiness Department were advised of these deficiencies and shown actual examples. We were advised by the management that these problems are being addressed but they are not, we still have exactly the same problems today. What am I to do without losing my job and career? In a closed community like aviation, troublemakers and stirrers do not keep jobs and the word is spread around..." [174]

The comments that aviation is a "closed community" and that "troublemakers and stirrers do not keep jobs" provide an important `reality-check' against some assertions about the benefits of incident reporting. These schemes have little impact on the underlying safety culture of many companies and organisations. O'Leary and Chappell argue that confidential incident reporting systems create a `vital awareness of safety problems' [660]. The key point is not, perhaps, that O'Leary and Chappell are wrong but that the beneficial effects of these systems are constrained by the managerial culture in which they operate.
Media Disclosure

Issues of confidentiality and disclosure do not simply reflect the need to protect an individual's identity from their co-workers. They can also stem from concerns about media intrusion. For example, recent amendments have been proposed for ICAO Annex 13 on Accident and Incident Investigation and Prevention. The revisions would provide pilots with automatic confidentiality in accident and incident investigations. They would also limit the disclosure of information following an incident or accident. These amendments are significant in two ways. Firstly, they would ensure that the media had no right to cockpit voice recordings. This is an important issue given public and professional reactions to the broadcasting of such recordings after fatal accidents. Secondly, they would increase the level of civil protection available to pilots. The intention is to encourage a `no-blame' approach to incident reporting. The concept is currently being tested in New Zealand civil courts.
If the ICAO adopts these amendments, it is likely that they will be ratified by all ICAO signatory nations as international law. Accident and incident investigators often have a complex relationship with the media [419]. Public disclosure of sensitive information can jeopardise an enquiry and can dissuade contributions about potential hazards. Media interest can also play a powerful role in establishing reporting systems and in encouraging investment in safety initiatives. Peter Majgård Nørbjerg's account of the new Danish Air Traffic Management reporting system reveals these two aspects of media involvement:

"Then, in 2000, in order to push for a change the Chairman of the Danish Air Traffic Controllers Association decided to be entirely open about the then current obstacles against reporting. During an interview on national television, she described frankly how the then current system was discouraging controllers from reporting. The journalist interviewing the ATCO chairman had picked up observations made by safety researchers that, as described above, Denmark had a much smaller number of occurrence reports than neighbouring Sweden. Responding to the interviewer's query why this was so, the ATCO chairman proclaimed that separation losses between aircraft went unreported simply due to the fact that controllers - for good reasons - feared for retribution and disclosure. Moreover, she pointed out, flight safety was suffering as a consequence of this! These statements, broadcasted on a prime time news program, had the immediate effect that the Transportation Subcommittee of the Danish Parliament asked representatives from the Danish Air Traffic Controllers Association to explain their case to the Committee. Following this work, the Committee spent several of their 2000-01 sessions exploring various pieces of international legislation on reporting and investigation of aviation incidents and accidents. As a result of this, in 2001 the Danish government proposed a law that would make non-punitive, strictly confidential reporting possible." [676]

The irony in this account is obvious. The media played a key role in motivating political intervention to establish the reporting system. One of the first acts in establishing the new scheme was to create a legislative framework that effectively protected contributors from media exposure.
Proportionate Blame...

Potential contributors often have a justified fear of retribution. They may be dissuaded from participating in a reporting system if they feel that their colleagues and managers will perceive them to be `whistle blowers'. Contributors can also be concerned about the legal consequences of submitting an incident report [83]. Leape points out that this reluctance is exacerbated by apparent inequities in the degree of blame that is associated with some adverse events. He also identifies a spectrum of blame that can lead from peer disapproval through to legal sanctions:

"...these punishments are usually calibrated to the gravity of the injury, not the gravity of the error. The nurse who administers a tenfold overdose of morphine that is fatal will be severely punished, but the same dosing error with a harmless drug may barely be noted. For a severe injury, loss of the right to practise or a malpractice suit may result. Moderate injuries may result in a reprimand or some restriction in practice. Punishment for less serious infractions are more varied: retraining, reassignment, or sometimes just shunning or other subtle forms of disapproval." [480]

This fear of retribution has been addressed by a number of regulatory organisations who have sought to ensure that any enforcement actions are guided by principles that are intended to protect individuals and companies. For example, the UK Health and Safety Executive is responsible for initiating prosecutions that relate to violations of health and safety law. These actions are often taken in response to the accidents and injuries that are reported under the RIDDOR scheme, introduced in Chapter 1. The Health and Safety Commission requires that individual HSE inspectors inform their actions by the principle of proportionality; the enforcement action must reflect the degree of risk. They must also endeavour for consistency in their enforcement actions; they must adopt a similar approach in similar circumstances to achieve similar ends. A further HSE principle concerns the
targeting of enforcement. Actions are focused on the people who are responsible for the risk and who are best placed to control it. Finally, there is a requirement that any legal or other enforcement actions should be transparent; the justifications and reasons for any decision to prosecute must be open to inspection. These guiding principles clearly distinguish regulatory actions from the informal retribution that often dissuades potential contributors from `whistle-blowing'. In order to achieve these principles, Health and Safety inspectors will exploit a range of enforcement actions:

"Enforcing authorities must seek to secure compliance with the law. Most of their dealings with those on whom the law places duties (employers, the self employed, employees and others) are informal - inspectors offer information, advice and support, both face to face and in writing. They may also use formal enforcement mechanisms, as set out in health and safety law, including improvement notices where a contravention needs to be remedied; prohibition notices where there is a risk of serious personal injury; withdrawal of approvals; variations of licences or conditions, or of exemptions; or ultimately prosecution. This statement applies to all dealings, formal or informal, between inspectors and duty holders - all contribute to securing compliance." [315]

The legal position of incident reporting systems is inevitably complicated by differences between national systems. The effects of this can be seen from the differing reporting practices in European air traffic control. Some service providers are compelled to report all incidents to the national police force or to state prosecutors who will launch an investigation if they believe that an offence has been committed. However, there is a concern in the European coordinating organisation, EUROCONTROL, that controllers and pilots will significantly downgrade the severity of the incidents that they report in such potentially punitive environments. Concerns over litigation can also prevent reports from being filed. Other states have reached agreements between air traffic management organisations and state prosecutors to protect staff who actively participate in the investigation of an occurrence. The Swedish experience of operating an open reporting system is that very few controllers have lost their licences as a result of filing an incident report within the last decade. The Luftfartsverket Flygtrafiktjänsten personnel who operate the system stress the need to protect the controller's trust in the non-punitive nature of the system. The overall safety improvements from the information that is gathered by a non-punitive system are believed to outweigh the disciplinary impact of punitive sanctions. These arguments have also motivated the Danish system, mentioned earlier in this chapter [676]. It is interesting to note that the same personnel who expect a non-punitive approach to protect their submissions often also expect more punitive actions to be taken against others who are perceived to have made mistakes, especially pilots.

Most companies and regulators operate `proportionate blame' systems. Annex 13 to the International Civil Aviation Organisation's International Standards and Recommended Practices provides the framework for accident and incident reporting in world aviation. This advocates a non-punitive approach to accident and incident reporting.
It might, therefore, seem strange that some countries continue to operate systems that directly inform the actions of state prosecutors. There is, however, a tension between the desire to ensure the trust of potential contributors and the need to avoid a system that is somehow `outside the law'. Ethical as well as judicial considerations clearly prevent any reporting system from being entirely non-punitive. For instance, action must be taken when reports describe drug or alcohol abuse. As a result, most systems reserve the right to pass on information about criminal acts to the relevant authorities. This is illustrated by the immunity caveats that are published for NASA and the FAA's Aviation Safety Reporting System (ASRS). Section 5 covers the `prohibition against the use of reports for enforcement purposes':
"a. Section 91.25 of the Federal Aviation Regulations (FAR) (14 CFR 91.25) prohibits the use of any reports submitted to NASA under the ASRS (or information derived therefrom) in any disciplinary action, except information concerning criminal offences or accidents which are covered under paragraphs 7a(1) and 7a(2).
b. When violation of the FAR comes to the attention of the FAA from a source other than a report filed with NASA under the ASRS, appropriate action will be taken. See paragraph 9.

c. The NASA ASRS security system is designed and operated by NASA to ensure confidentiality and anonymity of the reporter and all other parties involved in a reported occurrence or incident. The FAA will not seek, and NASA will not release or make available to the FAA, any report filed with NASA under the ASRS or any other information that might reveal the identity of any party involved in an occurrence or incident reported under the ASRS. There has been no breach of confidentiality in more than 20 years of the ASRS under NASA management." [59]
Section 7 of the regulations governing the ASRS describes the procedure for processing incident reports. Again, this process explicitly describes the way in which legal issues are considered before reports are anonymised:
"a. NASA procedures for processing Aviation Safety Reports ensure that the reports are initially screened for:

1. information concerning criminal offences, which will be referred promptly to the Department of Justice and the FAA;

2. information concerning accidents, which will be referred promptly to the National Transportation Safety Board (NTSB) and the FAA; and

Note: Reports discussing criminal activities or accidents are not de-identified prior to their referral to the agencies outlined above.

3. time-critical information which, after de-identification, will be promptly referred to the FAA and other interested parties.

b. Each Aviation Safety Report has a tear-off portion which contains the information that identifies the person submitting the report. This tear-off portion will be removed by NASA, time-stamped, and returned to the reporter as a receipt. This will provide the reporter with proof that he/she filed a report on a specific incident or occurrence. The identification strip section of the ASRS report form provides NASA program personnel with the means by which the reporter can be contacted in case additional information is sought in order to understand more completely the report's content. Except in the case of reports describing accidents or criminal activities, no copy of an ASRS form's identification strip is created or retained for ASRS files. Prompt return of identification strips is a primary element of the ASRS program's report de-identification process and ensures the reporter's anonymity." [59]
These quotations show that incident reporting systems must define their position with respect to the surrounding legislative and regulatory environment. They also illustrate the care that many organisations take to publish their position so that potential contributors understand the protection they are afforded. This does not necessarily imply that they respect the intention behind such protection. For instance, ASRS reporting forms are often colloquially referred to as `get out of gaol free cards' by some US pilots. The protection offered by confidential reporting systems has both positive and negative effects. `No blame' reporting is intended to encourage participation in the system. Protection from prosecution can, however, introduce bias if it has greater value for particular contributors. This can be illustrated by the Heinrich ratios for US Commercial and General Aviation. The bottom tier of the Iceberg can be assessed through contributions to NASA's ASRS. Table 2.2 shows that General Aviation and air traffic management personnel submitted fewer voluntary incident reports than the crews of commercial air carriers in 1997 and 2000. These years were chosen because the ASRS provide complete month-by-month submission statistics. Administrative problems have led to submission data being merged for some months in other years. Others, including cabin crew, mechanics and
military personnel, provide very few submissions. The relatively high level of commercial aircrew contributions can be explained in terms of the protection offered by ASRS submissions. Submission to the system turns an adverse event into a learning opportunity. In contrast, General Aviation pilots do not typically risk their livelihoods if their licences are revoked after an adverse event. There may also be less concern that others will witness and report an adverse event in General Aviation. They may, therefore, be less likely to submit information about adverse events they have been involved in. There is always the possibility in Commercial Aviation that other members of the
flight crew or air traffic managers will file a report even if you do not.

             Air Carrier       General Aviation   Air Traffic Managers   Others
             1997     2000     1997     2000      1997     2000          1997    2000
January      1,888    2,451    612      597       59       76            42      162
February     1,681    2,217    677      608       55       52            29      188
March        1,884    2,503    779      582       69       85            42      191
April        1,894    2,677    776      727       82       72            31      194
May          1,798    2,112    701      718       69       54            38      192
June         1,952    2,232    718      729       88       81            66      193
July         2,051    2,536    762      829       113      72            64      168
August       1,944    2,663    650      774       105      95            56      188
September    1,974    1,719    759      619       84       37            63      139
October      1,988    1,897    724      857       119      46            50      102
November     1,837    1,721    589      850       68       30            68      103
December     2,017    1,895    637      611       54       28            69      80
Total        22,908   26,623   8,384    8,501     965      728           618     1,900
Table 2.2: ASRS Contribution Rates 1997 and 2001 Table 2.3 presents NTSB data for accidents involving Commercial and General Aviation. In theory, this information can be used to calculate the Heinrich ratios that in turn illustrate the eects of `no blame' reporting on participation rates. Unfortunately, the ASRS and NTSB use dierent classi cation schemes. The NTSB classify Commercial operations using the 14 CFR 121 and 14 CFR 135 regulations. In broad terms, 14 CFR 135 refers to aviation operations conducted by commuter airlines. 14 CFR 121 refers to larger air carriers and cargo handlers. The 14 CFR 135 statistics are further divided into scheduled and unscheduled services. Table 2.3, prsents the NTSB accident data for scheduled services. The 14 CFR 135 gures in parentheses also include accidents involving on-demand unscheduled services, such as air taxis. In calculating the Heinrich ratios, we have taken the gures for both scheduled and unscheduled services.
         14 CFR 135            14 CFR 121         General Aviation
Year     All       Fatal       All      Fatal     All       Fatal
1997     16 (98)   5 (20)      49       4         1,845     350
1998      8 (85)   0 (17)      50       1         1,904     364
1999     13 (86)   5 (17)      51       2         1,906     340
2000     12 (92)   1 (23)      56       3         1,837     344
2001      7 (79)   2 (20)      45       6         1,726     325
2002      8 (66)   0 (17)      41       0         1,714     343
Table 2.3: NTSB Fatal and Non-Fatal Accident Totals

Figure 2.2 illustrates the Heinrich ratios for US Commercial and General Aviation in 1997 and 2000. The ratios were based on the number of incident submissions from Table 2.2. Table 2.3 provided the total number of fatal accidents. The number of non-fatal accidents was derived by subtracting the number of fatal accidents from the NTSB totals for all accidents. The General
Aviation classification is used in both the ASRS and NTSB statistical sources. The frequency of fatal commercial accidents was derived from the sum of accidents associated with 14 CFR 121 and 135 operations in the NTSB datasets. The figures in parentheses represent the total incident frequencies used in calculating the ratios.
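To make this derivation concrete, the following sketch recomputes the event-based ratios just described. It is offered as an illustration only: the figures are transcribed from Tables 2.2 and 2.3, the Commercial category pools 14 CFR 121 accidents with the parenthesised 14 CFR 135 totals as explained above, and the variable names are our own rather than part of either dataset.

    # Illustrative recomputation of the Figure 2.2 ratios (not part of the
    # original NTSB/ASRS analysis). Tuples hold (fatal accidents, all
    # accidents, ASRS submissions) transcribed from Tables 2.2 and 2.3.
    data = {
        ("General Aviation", 1997): (350, 1_845, 8_384),
        ("General Aviation", 2000): (344, 1_837, 8_501),
        # Commercial pools 14 CFR 121 with the parenthesised 14 CFR 135 totals.
        ("Commercial", 1997): (4 + 20, 49 + 98, 22_908),
        ("Commercial", 2000): (3 + 23, 56 + 92, 26_623),
    }

    for (sector, year), (fatal, all_accidents, reports) in data.items():
        non_fatal = all_accidents - fatal  # non-fatal = NTSB total minus fatal
        print(f"{sector}, {year}: 1 : {non_fatal / fatal:.1f} : {reports / fatal:.0f}")

Under these assumptions, the sketch reproduces the figures discussed below: roughly one fatal accident to four non-fatal accidents with 24-25 submissions in General Aviation, and one to five with 954 and 1,024 submissions in Commercial Aviation.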
Figure 2.2: Heinrich Ratios for US Aviation (NTSB and ASRS Data)

The proportion of injuries to deaths in Figure 2.2 is lower for both General and Commercial Aviation than would be expected from Heinrich's ratio of one death to thirty injuries. In the case of General Aviation there is one fatal accident for every four non-fatal accidents. In Commercial Aviation, the ratio is one to five. This is deceptive. The ratios in Figure 2.2 cannot be directly compared to Heinrich's results. The NTSB and ASRS data refer to accidents rather than the number of injuries. The difference between Heinrich's ratio and our data arises because a single accident in the NTSB data can yield multiple fatalities or injuries. The NTSB do, however, present fatality and injury numbers for 14 CFR 121 operations. From this we can derive ratios of 1(2) : 10(21) : 13,311(26,623) in 1997 and 1(83) : 0.1(9) : 276(22,908) in 2000. The numbers in parentheses are the total frequencies for fatalities, minor injuries and incident reports. Further caveats can also be raised about these revised 14 CFR 121 ratios because the ASRS submission statistics combine 14 CFR 121 and 135 operations. These anomalies illustrate the practical difficulties that are often ignored by proponents of the Heinrich ratio as a tool for safety management. They also illustrate a recurrent observation in this book: incident and accident statistics are often presented in incompatible formats. This makes it difficult to trace the relative frequency of adverse events and their outcomes over time. It is apparent, however, that the revised 14 CFR 121 ratios are very different from Heinrich's figures. In particular, the ratio of 1 death to 0.1 injuries seems at odds with the one to thirty ratio cited above. Fatal accidents are relatively rare in Commercial Aviation. Those that do occur often result in significant loss of life. Relatively few passengers and crew survive with minor injuries. These particular characteristics help to explain the apparent anomaly in the 1:0.1:276 ratio in 2000 for 14 CFR 121. Heinrich's original work mixed outcome frequencies, in terms of fatalities and injuries, with event frequencies based on observations of near misses. He did not attempt to estimate likely outcomes for near miss incidents. It can, therefore, be argued that the ratios in Figure 2.2 are more informative because they are based entirely on event frequencies. They do not include outcome information.

Figure 2.2 can be used to identify patterns in ASRS submission data. In General Aviation, there was one fatal accident for every 24 submissions in 1997 and one fatal accident for every 25 submissions in 2000. In Commercial Aviation, there were 954 ASRS submissions in 1997 and 1,024 in 2000 for each fatal accident. There are a number of possible explanations for these ratios. We can argue that there is a higher proportion of fatal accidents in General Aviation than in Commercial Aviation. This hypothesis is supported by the lower standards of training and equipment in General Aviation [82]. The higher rate of incident reports from Commercial Aviation in Figure 2.2 might be explained if these pilots had a greater incident exposure than in General Aviation. This is contradicted by the observation that General Aviation pilots accumulate significantly more flying hours than 14 CFR 121 and 14 CFR 135 operations combined. Table 2.4 presents NTSB statistics for flying hours and also accident rates per 100,000 hours in both Commercial and General Aviation [201]. To simplify the calculation of these rates we have excluded non-scheduled on-demand air taxis under 14 CFR 135. This is justified by the relatively low number of flying hours and incidents in this category.
         14 CFR 135              14 CFR 121               General Aviation
Year     Rate     Flying Hours   Rate     Flying Hours    Rate     Flying Hours
1997     1.628    982,764        0.309    15,838,109      7.19     25,591,000
1998     2.262    353,670        0.297    16,816,555      7.44     25,518,000
1999     3.793    342,731        0.291    17,555,208      6.40     29,713,000
2000     3.247    369,535        0.306    18,299,257      6.30     29,057,000
2001     2.330    300,432        0.231    17,752,447      6.28     27,451,000
2002     2.595    308,300        0.228    18,011,700      6.56     26,078,000

Table 2.4: NTSB Fatal and Non-Fatal Accident Rate Per 100,000 Flight Hours
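The rates in Table 2.4 follow from a single normalisation, sketched below for one entry. The function name is ours, and the sample figures are taken from Tables 2.3 and 2.4.

    # Illustrative sketch: an NTSB-style accident rate normalised per
    # 100,000 flight hours.
    def accident_rate(accidents: int, flying_hours: int) -> float:
        """Accidents per 100,000 flight hours."""
        return accidents / flying_hours * 100_000

    # 14 CFR 121 in 1997: 49 accidents over 15,838,109 hours.
    print(round(accident_rate(49, 15_838_109), 3))  # prints 0.309, as in Table 2.4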
The ratios in Figure 2.2 can also be explained in terms of a lower proportion of ASRS submissions from General Aviation than Commercial Aviation. Commercial pilots have more to lose from adverse events. The additional protection provided by the `no blame' environment of the ASRS approach encourages them to submit a report. This attitude partly arises from the professional and personal consequences of losing a licence that is essential to that person's job. Interviews with pilots have revealed that they are more likely to submit an ASRS report if they believed that someone else had also witnessed the incident. Given the NASA/FAA statement of protection, cited above, there is perhaps a tendency to use the ASRS as a form of confessional in which contribution implies repentance. Arguably this has reached the point where many ASRS incidents are of a relatively trivial nature and provide few safety-related insights. With less to lose, General Aviation pilots may be less inclined to contribute to the system. The difficulty in interpreting Heinrich ratios for US Commercial and General Aviation illustrates the confounding factors that must be considered when analysing reporting patterns. It seems likely that immunity policies affect contribution rates, but little work has been conducted to determine how they interact with risk exposure, with individual attitudes to risk etc. The lack of such information is a primary motivation in writing this book. Major policy decisions have been made, and continue to be made, on the basis of data supplied by national and international reporting systems. There are, however, many open questions about the reliability, or biases, that affect these information sources.
2.3.2 Scope and Level

There are many different types of reporting system. Local schemes may record incident information supplied by a few staff in a particular department. International systems have been developed by groups such as the International Maritime Organisation to support the exchange of information between many different multinational companies [387]. These differences in the coverage of a reporting system can be explained in terms of their scope and level. The level of a reporting system is used to distinguish between local, national and international initiatives. The scope of a system defines the groups who are expected to participate in the scheme.

The concept of coverage is a complex one. It is possible to distinguish between the theoretical and actual scope of a system. Although a system may be intended to cover several different groups, such as medical and nursing staff, it may in practice only receive contributions from some subset of those groups. Similarly, a national system may be biased towards contributions from a particular geographical area.

There are important differences between national and regional reporting systems. For example, it can be easier to guarantee anonymity in national systems. Reports that are submitted to local systems often contain sufficient details for others to infer the identity of individuals who are involved in an adverse event. National systems are more likely to be protected by legal guarantees of confidentiality. They are also more likely to have the resources to finance technological protection for contributors, such as that offered by the Department of Energy's CAIRS system [655]. They can also finance dedicated personnel to process reports. Key individuals, such as the `Gatekeepers' in the Swedish Air Traffic Control system, can be given the task of anonymizing information so that identities are hidden during any subsequent analysis. Steps may even be taken, as in the case of CIRAS, to ensure that these individuals are also prevented from retrieving identity information after the analysis is completed. All of these protection mechanisms are easier to sustain at a national level, where resources of time, money and personnel can be deployed to address the logistical problems that often threaten locally-based systems.

A host of problems threaten anonymity in local reporting systems. For instance, the individuals who are responsible for setting up and running such a system can have some difficulty in convincing staff that they will not divulge confidential information to management or to other members of staff. One common means of avoiding this problem is to operate completely anonymous systems in which no identification information is requested. This creates the opportunity for malicious reporting in which one person implicates another. It also creates difficulties in both analysing and interpreting the causes and effects of particular incidents.

One of the longest running medical incident reporting systems was established in the Intensive Care Unit of an Edinburgh hospital. This scheme can be used to illustrate the difficulty of preserving anonymity and confidentiality in local reporting systems. The unit has eight beds [121]. There are approximately three medical staff, one consultant, and up to eight nurses per shift on the ward. Given the relatively close-knit working environment of an intensive care unit, it is possible for other members of staff to narrow down the individuals who might have submitted a report about a particular procedure or task that they were involved in.
A key issue here is the trust that is placed in the person who is responsible for operating the system. The Edinburgh system was set up by David Wright, a consultant anaesthetist, who was heavily influenced by the earlier Australian Incident Monitoring Study (AIMS) [866]. This local system depends heavily upon his reputation and enthusiasm. He receives the reports and analyses them with the help of a senior nurse. The extent of his role is indicated by the fact that very few reports are submitted when he is not personally running the scheme.
The Paradox of Anonymity

There is a paradox in the effect that anonymity has on the value of a report at the local, national or international level. As part of the initiative to establish common guidelines for incident reporting in Air Traffic Control, interviews were conducted with controllers and other personnel in several European countries [423]. During these sessions, several contributors stressed the importance of anonymity. However, they also stressed the importance of knowing the context in which an incident occurred. This included both the location, which airport and which runway, as well as the time of
day, the operator's shift pattern etc. Without this information, they argued, a report would have little or no value to other operators. With that information, however, it would be relatively easy to narrow the potential contributor down to a few individuals. The paradox here is that anonymity is often essential to encourage the continued submission of incident reports. However, anonymity jeopardises the usefulness of a report for those who may benefit most from the lessons that it contains. In international schemes this paradox raises a number of deeper questions. A large number of local factors will influence the way in which an occurrence is dealt with. These include differences in national operating practices, in equipment and in workload. However, if a report were to be anonymized then much of this information would have to be omitted. It is not clear how much information about all of these issues ought to be provided and how much can be assumed about the reader's knowledge of regional and national differences. In aviation, this has been addressed by ICAO Annex 13, mentioned above. This specifies the minimum content for accident and incident reports. However, these guidelines are not always adhered to. Similar provisions do not currently exist to support the sharing of data in the medical domain or in, for instance, rail transportation. In local schemes, the context is already well established. The staff in the Edinburgh ICU system know that all reports refer to occurrences within that unit. As a result, much of the identifying information about that ICU can be retained in the reports. Much of this detail would have to be removed in a confidential national system in order to protect the individual hospital department. At the same time, however, there is an increased likelihood that those running local systems may be able to infer who contributed an anonymous report from their knowledge of the unit. The managers of the reporting system must ensure that similar inferences cannot easily be made by the co-workers who receive the recommendations that are generated from each contribution. This again leads to the danger that necessary information will be omitted.
`Targeting the Doable'

Local incident reporting systems must typically select their recommendations from a more limited set of remedial actions than national or international systems. For example, the FAA/NASA's ASRS is widely recognised to have a profound influence not just on US but also on global aviation policy. The same cannot be said for more local systems, which may only be able to influence the unit in which they are run. This is reflected in the more limited definition of an incident in some of these schemes. For example, the staff of the Yorkhill Hospital for Sick Children recently established a reporting system for incidents in a Neonatal Intensive Care Unit. This local system borrowed heavily from the existing schemes in Edinburgh and at various places in Australia [122, 121]. The agreed definition of an incident that fell within the scope of the system was printed on each of the forms: "A critical incident is an occurrence that might have led (or did lead) - if not discovered in time - to an undesirable outcome. Certain requirements need to be fulfilled: 1. It was caused by an error made by a member of staff, or by a failure of equipment; 2. It can be described in detail by a person who was involved in or who observed the incident; 3. It occurred while the patient was under our care; 4. It was clearly preventable. Complications that occur despite normal management are not critical incidents. But if in doubt, fill in a form." [122] The penultimate sentence illustrates a key point about local systems. Local schemes depend upon the good will, or at worst the passive acceptance, of higher levels of management. Such support can be jeopardised if the system is seen to move beyond constructive criticism. Many of the incidents reported to local schemes can only be avoided or mitigated through cooperation with other, external organisations. For example, van Vuuren's study of incident reporting in a UK Accident and Emergency unit found that forty-five per cent of the causes (42 out of a total of
93) of the 19 incidents that were studied had organisational causes. Of these, thirteen causes were external to the Department itself. This is due to the way in which an Accident and Emergency department depends on the specialist services of other departments, including radiology, biochemistry etc.: "Because the external factors are beyond the control of the investigated department, it is difficult to assess their real causes. It is of little use to hypothesise in detail about their origins and accompanying corrective actions of root causes in other departments... However, the majority of the external factors relate to the priorities of hospital management. The consequences of these priorities influence day to day practice in the A&E department, revolving mainly around staffing problems (not enough senior staff) and bed problems (lack of beds for A&E patients due to the continuous closing of beds on the wards). Although these external factors are beyond the control of the investigated department, their reporting is important to enable informal discussion between departments and to stimulate other departments to assess their own performance and its impact." [844] There are clear differences between van Vuuren's emphasis on collecting data, even if it cannot immediately be used to affect other departments, and the previous definition of an incident which `targets the doable'. The previous definition of a critical incident, arguably, illustrates the pragmatic approach that must be adopted during the establishment of an incident reporting system. Before the value of such a scheme has been widely accepted, it can prove difficult to get other groups to accept that their actions may lead to failures in the unit operating the system. Van Vuuren's argument that incident data can be used to enable informal discussions about common concerns will only be effective if other groups are willing to participate. National and international systems can often make recommendations that have a much wider impact than local systems. For instance, the recommendations that are obtained from the UK Royal College of Anaesthetists' system can be passed directly to other colleges for further consideration [715]. Similarly, the GAIN system is intended to support the dissemination of `best practice' across the world's airline operators and manufacturers [308]. It is also intended to support the dissemination of recommendations to air traffic service providers, airport managers etc. A number of limitations affect these large-scale systems. It can be difficult to encourage the active participation of all regions within a system. These systems can also become victims of their own success if it becomes difficult to identify common patterns of failure amongst a large number of submissions. Local, national and international systems provide different insights. For example, Section 2.1 described the potential benefits of incident reporting. These included the fact that they provide a reminder of hazards and that lessons can be shared. In a local system, these reminders may have greater local relevance than in a national scheme. In a national system, feedback often retains local features that were observed in the initial incident report. These features may not be appropriate for all participants. Alternatively, the incident must be abstracted to derive a generic account of the failure. In this case, the recipients must interpret the implications of the generic lesson in the context of their department or organisation.
This can lead to a strongly negative reaction to the system if the lessons seem inappropriate [408]. There is also a danger of ambiguity; the implications of a generic lesson can be misinterpreted. The following list reviews a number of further differences:
local systems can react relatively quickly to any report of an incident. As mentioned, the overheads of analysing and investigating a mishap can be substantially reduced because the individuals who run the system will have a good understanding of the context in which any failure occurred. These systems may only have a limited scope within a particular level of an organisation. Partly as a consequence of this, they often exploit ad hoc solutions to more serious problems. For instance, many hospital systems train their staff how to `make do and mend' with poorly designed equipment [418]. National and international systems typically have the greater influence necessary to change procedures and prohibit the use of particular devices.
national systems have correspondingly greater coverage. As a result, more reports may be received and better statistical data can be derived from them. This enables a closer relationship to be created between incident reporting and the subsequent risk assessments that drive future development and operational decision making. The ability to collate national data makes it more likely that such systems will be able to identify trends of common failures across many different sites. This is important because they can recognise the significance of what would otherwise appear as isolated failures. For instance, the lack of any effective central monitoring system has been identified as a reason why repeated problems with radiotherapy systems were not corrected sooner [487]. However, these systems introduce new problems of scale. There are considerable information processing challenges in identifying common trends in the 500,000 reports currently held by the ASRS. It can also be difficult to respond promptly when analysts must communicate with regional centres to establish the detailed causes of an adverse occurrence. Finally, it can be hard to ensure that local and regional agencies exploit consistent reporting procedures. This implies that similar incidents must be reported in a similar manner and that local or regional biases must be identified.
international systems enable states to share information about relatively rare mishaps. They can also be used to exchange insights into the success or failure of recommendations for common problems. For example, German Air Traffic Control (Deutsche Flugsicherung GmbH) currently operates several parallel approach runways. The increasing use of these configurations has encouraged them to share data with other organisations which operate similar approaches, such as the UK's National Air Traffic Services operation at Heathrow. International reporting systems enable states to identify potential problems before they introduce systems that are currently operated in other countries. It can, however, be difficult to ensure the active participation of several different countries. Individual states must trust other countries both to investigate and report on their incidents. Cultural and organisational problems also affect the successful operation of international systems. For example, there is often a reluctance to adopt forms and procedures that were not developed within a national system. Occasionally, there is a belief that some of the incidents which are covered by national systems simply `could not happen here' [423].
Large-scale systems often attract political criticism if they are perceived to threaten other national and international organisations. It is for this reason that recent attempts to develop medical incident reporting systems in the United States are at pains to consider the relationship between federal and state bodies: "Congress should: designate the Forum for Health Care Quality Measurement and Reporting as the entity responsible for promulgating and maintaining a core set of reporting standards to be used by states, including a nomenclature and taxonomy for reporting; require all health care organisations to report standardised information on a defined list of adverse events; provide funds and technical expertise for state governments to establish or adapt their current error reporting systems to collect the standardised information, analyse it and conduct follow-up action as needed with health care organisations. Should a state choose not to implement the mandatory reporting system, the Department of Health and Human Services should be designated as the responsible entity; and designate the Center for Patient Safety to: 1. convene states to share information and expertise, and to evaluate alternative approaches taken for implementing reporting programs, identify best practices for implementation and assess the impact of state programs; and 2. receive and analyse aggregate reports from States to identify persistent safety issues that require more intensive analysis and/or a broader-based response (e.g., designing prototype systems or requesting a response by agencies, manufacturers or others)."
[453]

The distinctions between local, national and international schemes often become blurred under systems such as that proposed by the US Institute of Medicine. Local initiatives report to State organisations that may then contribute to a Federal database. Such an integration will, however, change the nature of local systems. For instance, the need to ensure consistency in the information that is gathered nationally will force changes on the forms and procedures that are used locally. Recommendations that are issued at a national level may not easily be implemented under local conditions. For instance, recommendations relating to the use of more advanced equipment that has not yet been installed in all regions can serve to remind teams of what they are missing rather than forewarn them about the potential problems of equipment that they might receive in the future [409]. Similar comments can be made about initiatives to integrate national and international reporting systems [423]. The need to convert between national reporting formats and consistent international standards can lead to considerable tension. For instance, some European air traffic reporting systems operate a national system of severity assessment that must then be translated into the categories proposed by EUROCONTROL's ESARR 2 document [717]. This translation process must be transparent if all of the member states are to trust the reliability of the statistics produced from international initiatives.
2.4 Summary

This chapter has summarised the reasons why a range of government and commercial organisations have established incident reporting systems in the military, in transportation, in healthcare, in power generation etc. These initiatives have been justified in terms of the learning opportunities that can be derived from incident data, ideally before an accident takes place. This chapter has also looked at some of the problems associated with incident reporting. These include the difficulty of encouraging participation from a broad spectrum of contributors. For instance, we have calculated Heinrich ratios for fatal and minor accidents affecting US personnel. This reveals that contract staff may report fewer minor injuries than directly employed staff. The FRA have, therefore, encouraged greater monitoring of incidents involving contract workers. `No blame' reporting systems encourage greater participation. However, the Heinrich ratios for General and Commercial Aviation suggest that the protection offered to contributors can introduce biases. In particular, pilots are more likely to report an adverse event if their livelihood is at risk or if they are concerned that their actions may be reported by colleagues and co-workers. This book addresses the problems identified in this chapter. The aim is to present techniques that will help to realise the benefits that are claimed for incident reporting systems. Issues of anonymity, of legal disclosure, of retribution and blame, of scope and context must all be considered when developing an effective reporting scheme. It is also important to consider the sources of human error, system failure and managerial weakness that contribute to the incidents that are reported. This is the topic of the next chapter.
Chapter 3

Sources of Failure

Failures are typically triggered by catalytic events, such as operator error or hardware failure. These triggers exacerbate or stem from more latent problems, which are often the result of managerial and regulatory failure. In the most general sense, incident reporting systems provide a way of ensuring that such routine failures do not escalate in this manner. As a result, they must operate at several different levels in order to reduce the likelihood of latent failures and reduce the consequences of catalytic failures. Figure 3.1 provides an idealised view of this process.

Figure 3.1: Levels of Reporting and Monitoring in Safety-Critical Applications

This diagram is a simplification. Political and economic necessity often break this chain of monitoring behaviour. Simple terms, such as "regulator" and "management", hide a multitude of roles and responsibilities that often conflict with a duty to report [773]. However, the following paragraphs use the elements of Figure 3.1 both to introduce the sources of failure and to explain why incident reporting systems
have been introduced to identify and combat these sources of failure.
3.1 Regulatory Failures

Regulation is centred around control of the marketplace. Regulators intervene to ensure that certain social objectives are not sacrificed in the pursuit of profit. These objectives include improvements in safety, but they also include the protection of the environment, the preservation of consumer rights, the protection of competition in the face of monopolistic practices etc. For example, the Federal Railroad Administration's mission statement contains environmental and economic objectives as well as a concern for safety: "The Federal Railroad Administration promotes safe, environmentally sound, successful railroad transportation to meet current and future needs of all customers. We encourage policies and investment in infrastructure and technology to enable rail to realise its full potential." [239] A similar spectrum of objectives is revealed in the Federal Aviation Administration's strategic plan for 2000-2001 [201]. The first of their three objectives relates to safety: they will `by 2007, reduce U.S. aviation fatal accident rates by 80 percent from 1996 levels'. The second relates to security: to `prevent security incidents in the aviation system'. The final aim is to improve system efficiency: to `provide an aerospace transportation system that meets the needs of users and is efficient in the application of FAA and aerospace resources'.
3.1.1 Incident Reporting to Inform Regulatory Intervention

Regulatory authorities must satisfy a number of competing objectives. For example, it can be difficult both to promote business efficiency and to ensure that an industry meets particular safety criteria. In such circumstances, regulatory duties are often distributed amongst a number of agencies. For example, the US National Transportation Safety Board (NTSB) has a duty to investigate accidents and incidents in aviation as well as in road, rail and maritime transportation. All other regulatory activities in the field of aviation have been retained by the Federal Aviation Administration: "Congress (in enacting the Civil Aeronautics Act of 1938) is to provide for a Safety Board charged with the duty of investigating accidents... The Board is not permitted to exercise ... (other) regulatory or promotional functions. It will stand apart, to examine coldly and dispassionately, without embarrassment, fear, or favour, the results of the work of other people." (Edgar S. Gorrell, President, Air Transport Association, 1938 [482]). The NTSB investigates the causes of incidents and accidents whilst the FAA is responsible for enforcing the recommendations that stem from these investigations. This separation of roles is repeated in other industries. For example, the US Chemical Safety and Hazard Investigation Board operates under the Clean Air Act. Section 112 (r) (6) (G) prohibits the Board's conclusions, findings or recommendations from being used in any lawsuit arising from an investigation. In contrast, the US Occupational Safety and Health Act of 1970 established the Occupational Safety and Health Administration (OSHA) to `assure so far as possible every working man and woman in the Nation safe and healthful working conditions' through standards development, enforcement and compliance assistance. Although the distinction between investigatory and enforcement functions is apparent in many different industries, the precise allocation of responsibilities differs greatly from country to country. For instance, the UK Rail Regulator is charged with safeguarding passengers' interests within a `deregulated and competitive' transportation system. However, the monitoring and enforcement of safety regulations remains the responsibility of the Railway Inspectorate. This differs from the US system in which the Federal Railroad Administration takes a more pro-active role in launching safety initiatives. In the UK system, this role seems to rest more narrowly with the Railway Inspectorate, which is most directly comparable with the US NTSB.
We are interested in regulators for two reasons. Firstly, they are often responsible for setting up and maintaining the incident reporting systems that guide regulatory intervention. Secondly, regulators are ultimately responsible for many of the incidents that are reported by these systems. Similar failures that recur over time are not simply the responsibility of system operators or line managers; they also reflect a failure in the regulatory environment. Many regulators specifically have the task of ensuring that accidents and incidents do not recur. For instance, the US Chemical Safety and Hazard Investigation Board was deliberately created to respond to common incidents that were being addressed by 14 other Federal agencies, including the U.S. Environmental Protection Agency (EPA) and the U.S. Department of Labor's OSHA.
3.1.2 The Impact of Incidents on Regulatory Organisations

Regulators are increasingly being implicated in the causes of accidents and incidents [701]. In consequence, investigations often recommend changes in regulatory structure. The Cullen report into the Piper Alpha fire led to responsibility being moved from the Department of Energy's Safety Directorate to the Health and Safety Executive's Offshore Safety Division. Similarly, the Fennell report into the King's Cross fire was critical of the close relationship that had grown up between the Railways Inspectorate and the London Underground management. Prior to King's Cross, there had only been two Judicial Inquiries into UK railway accidents: the Tay Bridge disaster [357] and the Hixon Level Crossing accident [549]. These criticisms reveal some of the problems that face regulators who must monitor and intervene in complex production processes. These problems can be summarised as a lack of information; a lack of trained personnel; and a concern not to impose onerous constraints on trade. Many industries increasingly depend upon complex, heterogeneous application processes. Most regulatory agencies cannot assess the safety of such systems without considerable help from external designers and operators. It is no longer possible for many inspectors simply to demand relevant safety information. They typically rely on the company and its sub-contractors to indicate which information is considered to be relevant to safety-critical processes. Rapid technical development, deliberate obfuscation and the use of (often proprietary) technical terms can all make it difficult for inspectors to gain a coherent view of the processes that they help to regulate. The activities of many regulatory agencies are further constrained by personnel limitations. These constraints partly stem from financial and budgetary requirements. It can be difficult to train and retain staff who are skilled not only in the details of complex application processes but also in systems safety concepts. Even if it is possible to preserve a skilled core of regulators, it can be difficult to ensure that they continue to receive the up-to-date training that is necessary in many industries. Regulators must balance demands to improve the safety of complex application processes against the costs of implementing necessary changes within an industry. In 1999 Railtrack estimated that the cost of installing an Automatic Train Protection (ATP) system over the UK rail network was in the region of $2 billion [690]. This system uses trackside transmitters to continuously monitor the activity of trains, including their speed, number of carriages, braking capacity etc. The ATP system will sense if the driver fails to react to any line-side instructions, including signals passed at danger, and will start to reduce the speed of the train. The cost of installing the more limited Train Protection and Warning System (TPWS) was estimated by Railtrack to be in the order of $310 million. This system monitors the train before the key signals that protect junctions, single lines and `unusual' train movements. A sensor is attached to the train and this detects emissions from two radio loops that are laid before these key signals. TPWS uses information about the train's current speed and the radio information that is transmitted when a signal is at red to detect whether the train is liable to stop in front of that signal. The information available to the system and the possible interventions are, therefore, more limited than ATP.
The economic implications of regulatory intervention in favour of either ATP or TPWS are obvious. The Railway Safety Regulations (1999) require that ATP is fitted when `reasonably practicable'. The wording of this regulation reflects the sensitivity that many regulators must feel towards the balance between safety and the promotion of commercial and consumer interests. If regulators were to recommend ATP rather than TPWS, rail operators would have been faced with significant overheads that many felt could not be justified by the safety improvements. If they had
recommended TPWS rather than ATP, passenger groups such as Railwatch would have criticised the regulatory failure to introduce additional safeguards. The Rand report was commissioned by the NTSB as part of an investigation into future policy for accident and incident investigation. This document questioned the nature of regulation in many safety-critical industries: "The NTSB relies on teamwork to resolve accidents, naming parties to participate in the investigation that include manufacturers; operators; and, by law, the Federal Aviation Administration (FAA). This collaborative arrangement works well under most circumstances, leveraging NTSB resources and providing critical information relevant to the safety-related purpose of the NTSB investigation. However, the reliability of the party process has always had the potential to be compromised by the fact that the parties most likely to be named to assist in the investigation are also likely to be named defendants in related civil litigation. This inherent conflict of interest may jeopardise, or be perceived to jeopardise, the integrity of the NTSB investigation. Concern about the party process has grown as the potential losses resulting from a major crash, in terms of both liability and corporate reputation, have escalated, along with the importance of NTSB findings to the litigation of air crash cases. While parties will continue to play an important role in any major accident investigation, the NTSB must augment the party process by tapping additional sources of outside expertise needed to resolve the complex circumstances of a major airplane crash. The NTSB's own resources and facilities must also be enhanced if the agency's independence is to be assured." (Page xiv, [482]) A number of alternative models have been proposed. For instance, international panels can provide investigatory agencies with a source of independent advice. This approach is likely to be costly; such groups could only be convened in the aftermath of major accidents. In many industries, the dominance of large multi-national companies can make it difficult to identify members who are suitably qualified and totally independent. Alternatively, investigatory agencies can develop specialist in-house investigation teams. The additional expense associated with this approach can make it difficult to also provide adequate coverage of the broad range of technical areas that must be considered in many incidents and accidents.
3.2 Managerial Failures

By failing to adequately address previous mishaps, regulators are often implicated in the causes of subsequent incidents. In consequence, they often help to establish reporting schemes as a means of informing their intervention in particular markets. There are some similarities between regulatory intervention and the role of management in the operation of incident reporting systems. On the one hand, many organisations have set up incident reporting systems to identify potential weaknesses in production processes. On the other hand, many of the incidents that are reported by these schemes stem from managerial issues. Social and managerial barriers can prevent corrective actions from being taken even if a reporting system identifies a potential hazard. These barriers stem from the culture within an organisation. For example, Westrum identifies a pathological culture that `doesn't want to know' about safety-related issues [862]. In such an environment, management will shirk any responsibility for safety issues. The contributors to a reporting system can be regarded as whistle-blowers. Any failure to attain safety objectives is punished or concealed. In contrast, the bureaucratic culture listens to messengers, but responsibility is compartmentalised so that any failures lead to local repairs. Safety improvements are not effectively communicated between groups within the same organisation. New ideas can be seen as problems. They may even be viewed as a threat by some people within the organisation. Finally, the generative culture actively looks for safety improvements. Messengers are trained and rewarded and responsibility for failure is shared at many different levels within the organisation. Any failures also lead to far-reaching reforms and new ideas are welcomed. Westrum's categories of organisational culture mask the more complex reality of most commercial organisations. Accident and incident reports commonly reveal that elements of each of these
stereotypes operate side by side within the same organisation. This is illustrated by the Australian Transport Safety Bureau's (ATSB) report into a fire on the Aurora Australis [48]. The immediate cause of the incident was a split fuel line to the main engine. Diesel came into contact with turbochargers that were hotter than the auto-ignition temperature of the fuel. It can be argued that the ship's operators resembled Westrum's bureaucratic organisation. Information about the modifications was not passed to the surveyors and other regulatory authorities. It can also be argued that this incident illustrates a pathological culture; ad hoc consultations perhaps typify organisations that are reluctant to take responsibility for safety concerns: "Consultations between the company and Lloyd's Register and Wartsila on the use of
flexible hoses were ad hoc and no record of consultation or approval concerning their fitting was made by any party. No approval was sought from the Australian Maritime Safety Authority for the fitting of flexible hoses. Knowledge that the flexible hoses had been fitted under the floor plates was lost with the turn-over of engineers. The fact that other flexible hoses were fitted to the engines was well evident, but this did not alert either class or AMSA surveyors to the fact that the modifications were not approved." (Summary Conclusions, [48]) This same organisation also reveals generative behaviour. Persistent safety problems were recognised and addressed, even if those innovations were ultimately unsuccessful. For instance, the operators of the Aurora Australis made numerous attempts to balance safety concerns about the fuel pipes against the operational requirements of the research vessel: "At an early stage of the ship's life Wartsila Australia provided omega pipes to connect to the engines in an attempt to overcome the failures in the fuel oil pipework. This however did not solve the problem... When scientific research is being undertaken and dynamic positioning is in use, the isolation of noise and vibration from the hull is of importance. During these periods the main engines would not be in use. However the main generator sets are required and, to reduce vibration, the generator sets are flexibly mounted. For this reason, the generator sets were connected to the fuel system pipework with flexible hoses supplied by Wartsila. The subsequent approach in solving the problem on the main engines involved the fitting of sections of medium pressure hydraulic/pneumatic hose." (Page 33 - Engine Fuel Systems, [48]) Many investigators apply a form of hindsight bias when they criticise the organisational culture of those companies that suffer severe accidents. They have experienced a major failure and, therefore, these organisations must have a `pathological' attitude to safety. This is over-simplistic. The previous incident has illustrated the complex way in which many organisations respond to safety concerns. It is possible to identify several different `cultures' as individuals and groups address a range of problems that change over time.
3.2.1 The Role of Management in Latent and Catalytic Failures

Management play an important role in the latent causes of incidents and accidents. The distinction between latent and catalytic factors forms part of a more general classification introduced by Hollnagel [362]. He identifies effects, or phenotypes, as the starting point for any incident investigation. They are what can be observed in a system and include human actions as well as system failures. In contrast, causes, or genotypes, represent the categories that have brought about these effects. Causes are harder to observe than effects. Their identification typically involves a process of interpretation and reasoning. It is also useful to distinguish between proximal and distal causes [115]. In Hollnagel's terms, most incident reports focus on the proximal genotypes of failure. These include the `person' and `technology' related genotypes that are addressed later in this chapter. However, they also include `organisation related genotypes' that address the role of line management in the conditions leading to an adverse event: "This classification group relates to the antecedents that have to do with the organisation in a large sense, such as safety climate, social climate, reporting procedures, lines of
command and responsibility, quality control policy, etc." (Page 163, [362]). Hollnagel's classification of organisation genotypes reflects the increasing public and government interest in the distal causes of failure. He explicitly considers safety climate, social climate, reporting procedures, lines of command and responsibility, and quality control policy as contributory factors in the events leading to failure. Table 3.1 illustrates how this high-level categorisation can be refined into a checklist that might guide both the investigation of particular incidents and the development of future systems.
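As an illustration only, the categorisation of Table 3.1 (below) can be rendered as a simple machine-readable checklist. The data structure and the prompts are our own sketch rather than any part of Hollnagel's scheme; only the category names are taken from the table.

    # Illustrative sketch: Table 3.1 as a checklist that could prompt an
    # investigator for organisational genotypes. Category names follow
    # Hollnagel [362]; the structure and function are assumptions of this sketch.
    ORGANISATIONAL_GENOTYPES = {
        "Maintenance failure": ["Equipment not operational", "Indicators not working"],
        "Inadequate quality control": ["Inadequate procedures", "Inadequate reserves"],
        "Management problem": ["Unclear roles", "Dilution of responsibility",
                               "Unclear line of command"],
        "Design failure": ["Anthropometric mismatch",
                           "Inadequate Human-Machine Interface"],
        "Inadequate task allocation": ["Inadequate managerial rule",
                                       "Inadequate task planning",
                                       "Inadequate work procedure"],
        "Social pressure": ["Group think"],
    }

    def checklist():
        """Flatten the taxonomy into yes/no prompts for an investigation form."""
        return [f"Did the incident involve {specific.lower()} ({general.lower()})?"
                for general, specifics in ORGANISATIONAL_GENOTYPES.items()
                for specific in specifics]

    for prompt in checklist():
        print(prompt)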
3.2.2 Safety Management Systems

Management can recruit a number of techniques to help them combat the latent causes of incidents and accidents. For example, safety management systems help organisations to focus on "those elements, processes and interactions that facilitate risk control through the management process" [189]. The perceived success of this approach has led a number of regulators to support legislation that requires their use within certain industries, for example through the UK Offshore Installations (Safety Case) Regulations of 1992. The UK Health and Safety Executive publish guidance material on the development of safety management systems [319]. They emphasise a number of phases [189]:
developing policy, which sets out the organisation's general approach, goals and objectives towards safety issues;

organising, which is the process of establishing the structures, responsibilities and relationships that shape the total working environment;

planning, the organisational process which is used to determine the methods by which specific objectives should be set out and how resources are allocated;

implementation, which focuses on the practical management actions and the necessary employee behaviours that are required to achieve success;

measuring performance, which incorporates the process of gathering the necessary information to assess progress towards safety goals; and

auditing and reviewing performance, which is the review of all relevant information.
Incident reporting schemes offer a number of potential benefits within a safety management system. In particular, they can help to guide the allocation of finite resources to those areas of an application process that have proven to be most problematic in the past. In other words, incident reporting systems can focus risk assessment techniques using `real world' reliability data that can be radically different from the results of manufacturers' bench tests. Incident reporting systems can also be used to assess the performance of safety management activities. They can provide quantitative data that avoids subjective measures for nebulous concepts such as `safety culture'. Managerial performance can be assessed not simply in terms of a reduced frequency for particular incidents but also in terms of the reduced severity of the incidents that are reported. Chapter 15 will, however, discuss the methodological problems that arise when deriving quantitative data from incident reporting systems.
3.3 Hardware Failures

Public attention is increasingly being focussed on the role of regulatory authorities in the aftermath of accidents and incidents. This has increased interest in incident reporting techniques as a means of informing regulatory intervention. Managerial failures also play an important role in creating the conditions that lead to many of the failures that are described in occurrence submissions. In consequence, a number of regulatory authorities have advocated the use of incident reporting techniques to help identify potential managerial problems within a wider safety management system. The following section builds on this analysis and begins to look at the phenotypes and genotypes that relate to hardware failures. It can be argued that many of these failures stem from the distal causes of managerial failure. Stochastic failures can be predicted using probabilistic risk assessment.
Maintenance failure
  Equipment not operational: Equipment (controls, resources) does not function or is not available due to missing or inappropriate maintenance.
  Indicators not working: Indications (lights, signals) do not work properly due to missing maintenance.

Inadequate quality control
  Inadequate procedures: Equipment/function is not available due to inadequate quality control.
  Inadequate reserves: Lack of resources or supplies (e.g., inventory, back-up equipment etc.).

Management problem
  Unclear roles: People in the organisation are not clear about their roles and their duties.
  Dilution of responsibility: There is not a clear distribution of responsibility; this is particularly important in abnormal situations.
  Unclear line of command: The line of command is not well defined and the control of the situation may be lost.

Design failure
  Anthropometric mismatch: The working environment is inadequate and the cause is clearly a design failure.
  Inadequate Human-Machine Interface: The interface is inadequate and the cause is clearly a design failure.

Inadequate task allocation
  Inadequate managerial rule: The organisation of work is deficient due to the lack of clear rules or principles.
  Inadequate task planning: Task planning or scheduling is deficient.
  Inadequate work procedure: Procedures for how work should be carried out are inadequate.

Social pressure
  Group think: The individual's situation understanding is guided or controlled by the group.

Table 3.1: Hollnagel's Categories for Organisational Genotypes (general consequent, specific consequent and definition)
Design and requirements failures may be detected using appropriate validation techniques. However, many incidents defy this simplistic analysis of managerial genotypes as the root of all mishaps. Individual managers are subject to a range of economic, political and regulatory constraints that limit their opportunities to address potential hardware failures in many industries.
3.3.1 Acquisition and Maintenance Effects on Incident Reporting

Several factors affect the successful acquisition of hardware devices. Managers must have access to accurate reliability data. They must also be able to assess whether devices will be compatible with other process components. Compatibility can be assessed both in terms of device operating characteristics and in terms of maintenance patterns. This is important if managers are to optimise inspection and replacement policies. A number of further characteristics must also be considered. The operating temperatures, humidity performance, vibration tolerances etc. should exceed those of the chosen environment. Components must meet electromagnetic interference requirements. They should also satisfy frequency, waveform and signal requirements as well as maximum applied electrical stresses. The tolerance drift over the intended life of the device should not jeopardise the required accuracy of the component. Finally, the component must fall within the allocated cost budget and must usually be available during the service life of an application process. Many components fail to meet these requirements. Hardware failures have many different causes. The distal genotypes include design failures; the device may not perform the function that was intended by the designer. Hardware may also fail because of problems in requirements elicitation; the device may perform as intended but the designers' intentions were wrong. It can also fail because of implementation faults; the system design and requirements were correct but a component failed through manufacturing problems. A fault typically refers to a lower-level component malfunction whilst failures typically affect more complex hardware devices. There are also more proximal genotypes of hardware failure. In particular, a device may be operated beyond its tolerances. Similarly, inadequate maintenance can lead to hardware failures. A number of military requirements documents and civilian standards have been devised to address these forms of failure, such as US MIL-HDBK-470A (Designing and Developing Maintainable Products and Systems) or the FAA's Code of Federal Regulations (CFR) Chapter I, Part 43, on Maintenance, Preventive Maintenance, Rebuilding and Alteration. These standards advocate a number of activities that are intended to reduce the likelihood of hardware problems occurring or, if they do occur, to reduce the consequences of those failures. An important aspect of these activities is that they must continue to support the product throughout its operational life. Two key components of hardware acquisition and maintenance schemes are a preferred parts list and a Failure Reporting, Analysis and Corrective Action system (FRACAS). Preferred parts lists are intended to ensure that all components come from known or approved suppliers. These lists also avoid the need for the development and preparation of engineering justification for new parts and materials. They reduce the need for monitoring suppliers and inspecting/screening parts and materials. They can also avoid the acquisition of obsolete or sole-sourced parts. Failure Reporting, Analysis and Corrective Action systems provide individual organisations with a means of monitoring whether or not the components on a preferred parts list actually perform as expected when embedded within production processes in the eventual operating environment.
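A minimal sketch of the link between these two components follows. The part numbers and record layout are hypothetical; the point is simply that failure reports can be tallied per part and supplier, and checked against the preferred parts list, rather than pooled across all sources of a similar device.

    # Illustrative sketch (hypothetical part numbers and record layout): tallying
    # FRACAS failure reports per (part, supplier) and flagging any report that
    # names a combination outside the preferred parts list.
    from collections import Counter

    preferred_parts = {("valve-v1", "Supplier A"), ("valve-v1", "Supplier B"),
                       ("relay-r7", "Supplier C")}

    failure_reports = [  # (part, supplier) pairs extracted from FRACAS submissions
        ("valve-v1", "Supplier A"), ("valve-v1", "Supplier A"),
        ("valve-v1", "Supplier B"), ("relay-r7", "Supplier D"),
    ]

    for (part, supplier), n in Counter(failure_reports).items():
        status = ("approved" if (part, supplier) in preferred_parts
                  else "NOT on preferred parts list")
        print(f"{part} from {supplier}: {n} report(s), {status}")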
A continuing theme in this book will be that the use of safety-critical design and maintenance techniques, such as a preferred parts list, can have a profound impact on the practical issues involved in incident reporting. If a structured approach to hardware acquisition is not followed then it can be extremely difficult for engineers to effectively exploit the information that is submitted through a FRACAS. Engineers must assume that all components share similar failure modes even though they are manufactured by different suppliers. This can have considerable economic consequences if similar devices have different failure profiles, for example from different manufacturing conditions. Adequate devices may be continually replaced because of historic failure data that is based on similar but less reliable components. Conversely, it can be dangerous for engineers to assume that a failure stems from a particular supplier rather than from a wider class of similar devices. In order to support this inference, operators must analyse the different engineering justifications for each of
the different suppliers' components to ensure that faults are not shared between similar devices from different manufacturers. The practical consequences of miscalculating maintenance intervals are illustrated by work from the Norwegian classification society Det Norske Veritas [458]. They assume that failure rate increases with increasing maintenance interval; that maintenance cost is inversely proportional to the maintenance interval; and that expected total cost is the sum of the maintenance cost and the expected failure cost.
It is possible to challenge these simplifying assumptions; however, they are based on considerable practical experience. Figure 3.2, therefore, illustrates the way in which the costs of maintenance are reduced as maintenance intervals are increased. It also shows the expectation that the costs of any failure will rise with increased maintenance intervals.

Figure 3.2: Costs Versus Maintenance Interval

The importance of this diagram for incident reporting is that each of these curves is based on the maintenance intervals and costs for particular devices. If a less reliable device were used with the same maintenance intervals then the cost curves may be significantly higher; that is to say, they will be translated along the Y-axis. Conversely, the cost curves for more reliable devices will be significantly lower even though the maintenance intervals will be based on less reliable devices. In either case, the effective use of reliability data for preventive maintenance depends upon the monitoring of devices from different suppliers within the actual operating environment of particular production processes [27].
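The shape of Figure 3.2 can be reproduced with a minimal numerical sketch that follows the three assumptions quoted above. All of the constants are invented purely for illustration; only the functional forms matter.

    # Minimal sketch of the Figure 3.2 trade-off. Constants are hypothetical.
    C_MAINT = 50_000.0   # cost per maintenance action
    C_FAIL = 400_000.0   # cost per failure
    K = 1e-7             # assumed growth of failure rate per hour of interval

    def total_cost(interval_hours: float) -> float:
        maintenance = C_MAINT / interval_hours          # falls as the interval grows
        expected_failure = C_FAIL * K * interval_hours  # rises as the interval grows
        return maintenance + expected_failure

    intervals = [100.0 * i for i in range(1, 51)]       # 100 .. 5,000 hours
    best = min(intervals, key=total_cost)
    print(f"cost-minimising interval: {best:.0f} hours")

A less reliable device corresponds to a larger K, which raises the failure-cost curve and pulls the cost-minimising interval downwards; this is the translation along the Y-axis described above.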
3.3.2 Source, Duration and Extent

It is possible to identify a number of different types of hardware failure. In particular, they can be distinguished by their source, duration and extent [762]. Each of these failure types poses different challenges for the successful operation of incident reporting systems. The source of a failure refers to whether it is random or systematic. Component faults provide the primary cause of random hardware failures. All components have a finite chance of failing over a particular period of time. It is possible to build up statistical models that predict failure probabilities over the lifetime of similar devices. These probability distributions are usually depicted by the `bath tub' curve shown in Figure 3.3. Initially there is an installation or `burn-in' period when the component has a relatively high chance of failure.
Figure 3.3: Failure Probability Distribution for Hardware Devices

Over time, this declines for the useful life of the product until it begins to wear out. At this point, the likelihood of failure begins to increase. As can be seen from Figure 3.3, it is possible to abstract away from these lifecycle differences by using a mean failure rate. However, this has profound practical consequences for the operation of an incident reporting system. When a class of components is first deployed, FRACAS submissions will indicate a higher than anticipated failure rate. This need not imply that the mean is incorrect, simply that the components must still go through the `burn-in' period indicated in Figure 3.3.
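The gap between the instantaneous hazard and a single mean rate is easy to see if the bath tub is modelled as the sum of a decaying burn-in term, a constant useful-life term and a growing wear-out term. The parameters below are invented for illustration only.

    import math

    def hazard(t: float) -> float:
        """Hypothetical bath-tub hazard: burn-in decay + constant + wear-out."""
        burn_in = 5e-3 * math.exp(-t / 200.0)
        useful_life = 1e-4
        wear_out = 1e-4 * (t / 8_000.0) ** 4
        return burn_in + useful_life + wear_out

    ages = [float(t) for t in range(1, 10_001)]  # device age in hours
    mean_rate = sum(hazard(t) for t in ages) / len(ages)
    print(f"hazard at 1 hour: {hazard(1.0):.2e}; mean over life: {mean_rate:.2e}")
    # Early-life hazard sits well above the long-run mean, so a newly deployed
    # batch will look unreliable in FRACAS data even when the mean is correct.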
The second source of hardware problems relates to systematic failures. These stem from errors in the specification of the system and from errors in the hardware design. Systematic failures are more difficult to combat using incident reporting techniques. The causes of particular mishaps may lie months or even years before a problem is reported by a supplier or end-user. It is for this reason that initiatives such as US MIL-STD-882D: Standard Practice for System Safety focus on the quality control and inspection procedures that are used throughout the design and implementation lifecycle. If systematic faults are found in hardware, or in any other aspect of a safety-critical system, then this raises questions not just about the particular product that failed but also about every other product that was produced by that development process.

The duration of a failure can be classified as permanent, transient or intermittent. Intermittent problems occur and then recur over time. For instance, a faulty connection between two circuits may lead to an intermittent failure. Occasionally the connection may operate as anticipated. At other times it will fail to deliver the correct signal. Conversely, transient failures occur once but may not recur. For instance, a car's starter motor may generate electromagnetic interference that will not recur until another car starts in the same location. Finally, permanent failures persist over time. Physical damage to a hardware unit typically results in a permanent failure.

Each of these failure types poses different challenges for reporting systems. Transient failures can be particularly difficult to diagnose. They are, typically, reported as one-off incidents. This makes it very hard to reconstruct the operational and environmental factors that contributed to the failure. There is also a strong element of uncertainty in any response to a transient failure; it can often be very difficult for engineers to distinguish this class of failures from intermittent problems. The passage of time may convince engineers that a failure will not recur. This can be dangerous if the failure returns and proves to be intermittent rather than transient. Permanent failures can seem simple to identify, diagnose and rectify. However, `fail silent' components may leave few detectable traces of their failure until they are called upon to perform specific functions. Conversely, `fail noisy' components may generate so many confounding signals that it can be difficult for engineers to determine which device has failed. It is important to stress that in practice there will seldom be a one-to-one mapping between each possible failure mode for any particular device and the reports that are submitted about those failures. For example, if two different members of staff identify the same failure then managers will be faced with the difficult task of working out whether or not those two reports actually do refer to the same problem or to two different instances of a similar failure. In such circumstances, it can take considerable time and resources for staff to accurately diagnose the underlying causes. Intermittent failures are difficult to detect and resolve. Low-frequency, intermittent failures may only be identified by comparing incident reporting systems from many different end-user organisations. The reports that document these failures may be distributed not only in time but also in geographical location. Many safety-critical products operate in similar environments in many different parts of the globe. Chapter 15 will argue that recent advances in probabilistic information retrieval and case-based reasoning techniques for the first time provide effective tools for detecting and responding to this difficult class of failures. For now it is sufficient to observe that the identification of intermittent failures and trend information from incident reporting remains one of the biggest practical challenges to the effective use of these systems.

The final classification of failure types relates to the extent of their consequences. A localised fault may only affect a small sub-system. The consequences of a global fault can permeate throughout an entire system. Between these two extremes lie the majority of faults that may have effects that are initially localised but which, over time, will slowly spread throughout an application. In many instances it is possible to use incident reporting systems to chart the propagation of a failure over time. This provides valuable information not only about the failure itself but also about the reporting behaviour of the systems, teams and individuals who must monitor application processes. The following incident report from the US Food and Drug Administration's (FDA) Manufacturer and User Facility Device Experience database (MAUDE) provides a glimpse of the complex relationships between device suppliers and the technical support staff who must operate them. In this case, end-users made repeated attempts to fix problems that were created by the inadequate cooling of a patient monitor. The account of the problem clearly illustrates the end-user's sense of frustration both with the unreliability of the device and with the manufacturers' response:

"Monitors lose functions due to internal heat. Note: several of the units returned for repair have had 'fan upgrades to alleviate the temp problems'. However, they have failed while in use again and been returned for repair, again salesman has stated it is not a thermal problem it is a problem with X's circuit board. Spoke with X engineer, she stated that device has always been hot inside, running about 68C and the X product has been rated at only 70C. Third device transponder started to burn sent for repair. Shortly after the monitor began resetting itself for no reason, fourth device monitor, SPO2 failed and factory repaired 10/01, 3/02. Also repaired broken wire inside unit 12/01. Tech 3/02 said the symptoms required factory repair..."
([272], MDR TEXT KEY: 1370547)

This incident resulted in a series of follow-up reports. However, the manufacturers felt that the events described by the user could not be classified as safety-related: `None of the complaints reported by the user were described as incidents or even near incidents. The recent report sent to the FDA appears to be related to frustration by the end user regarding the product reliability'. The manufacturer further responded by describing the evaluation and test procedures that had been used for each of the faulty units. The first had involved the customer replacing a circuit board. This did not fix the problem and the unit was sent back to the factory. The power supply was replaced but no temperature-related failure was reproduced under testing by the manufacturers. A second device was also examined after a nurse had complained that the monitor had `spontaneously' been reset. The hospital biomedical technicians and manufacturers' representatives were unable to reproduce the transient failure and all functions were tested to conform to the manufacturers' specifications.
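The earlier observation that two reports may or may not describe the same underlying fault can be supported by simple tooling. The sketch below groups reports by device identity and a crude text-overlap score; the fields, threshold and scoring are hypothetical stand-ins for the retrieval techniques discussed in Chapter 15.

    def similarity(a: str, b: str) -> float:
        """Crude Jaccard overlap between two report narratives."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, len(wa | wb))

    def group_reports(reports, threshold=0.4):
        """Group (device, narrative) pairs that plausibly describe one fault."""
        groups = []
        for device, text in reports:
            for group in groups:
                first_device, first_text = group[0]
                if device == first_device and similarity(text, first_text) >= threshold:
                    group.append((device, text))
                    break
            else:
                groups.append([(device, text)])
        return groups

    reports = [
        ("monitor-A", "unit overheats and resets during use"),
        ("monitor-A", "unit resets during use when hot"),
        ("monitor-B", "display blank after power on"),
    ]
    print(len(group_reports(reports)))  # 2 groups: one shared fault, one distinct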
Manufacturers and suppliers are also often unable to determine the particular causes of reported mishaps. In the previous incident, the integrator/manufacturer believed that some of the problems might have stemmed from a printed circuit board made by another company. Tests determined that a board malfunction resulted in a failure to display patient pulse oximetry waveforms on the monitoring system. The problems did not end when the integrator replaced the faulty board. The customer again returned the unit with further complaints that the device would not change monitoring modes. The integrator determined that the connectors to the printed circuit board were not properly seated. However, the board must have been properly placed prior to dispatch in order for the unit to pass its quality acceptance test. It is possible that the connector was not seated completely during the initial repair and gradually became loose over time. This incident illustrates the confusion that can arise when hardware devices are developed by groups of suppliers. The marketing of the device may be done by an equipment integrator who out-sources components to sub-contractors. For example, one company might provide the patient monitoring systems while another supplies network technology. This market structure offers considerable flexibility and cost savings during development and manufacture. However, problems arise when incidents stem from subcomponents that are not directly manufactured by the companies that integrate the product. Complaints and incident reports must be propagated back along the supply chain to the organisations that are responsible for particular sub-systems.
3.4 Software Failures

Software is now a key component in most safety-critical systems. It is used to configure the displays that inform critical operating decisions, and it can detect and intervene to mitigate the consequences of potential failures. Even if it is not used directly within the control loops of an application, it typically plays a key role in the design and development practices that help to produce the underlying systems. The Rand report into the investigatory practices of the NTSB emphasised the new challenges that these developments are creating:

"As complexity grows, hidden design or equipment defects are problems of increasing concern. More and more, aircraft functions rely on software, and electronic systems are replacing many mechanical components. Accidents involving complex events multiply the number of potential failure scenarios and present investigators with new failure modes. The NTSB must be prepared to meet the challenges that the rapid growth in systems complexity poses by developing new investigative practices." [482]

The consequences of software-related incidents should not be underestimated. The failure of the London Ambulance Computer Aided Dispatch system is estimated to have cost between $1.1 and $1.5 million. Problems with the UK Taurus stock exchange program cost $75 to $300 million. The US CONFIRM system incurred losses in the region of $125 million [79]. Few of these mishaps were entirely due to software failure. They were the result of "interactions of technical and cognitive/organizational factors" rather than of technical factors alone [533].

There are important differences between hardware and software failures. As we have seen, hardware failures can be represented as probability distributions that represent the likelihood of failure over the lifetime of a device. The practical difficulties of fabrication and installation prevent designers from introducing completely reliable hardware. If hardware-related incidents exceed the frequency anticipated by the predicted failure probabilities then additional safeguards can be deployed to reduce the failure frequency or to mitigate the consequences of these failures. In contrast, software is deterministic. The same set of instructions should produce the same set of results each time they are executed. In consequence, if a software `bug' is eliminated then it should never recur. There are some important caveats, however. In the real world, software operates on stochastic devices. In other words, subtle changes in the underlying hardware, including electromagnetic interference, can cause the same set of instructions to have different results. In other applications, concurrent processes can appear to behave in a non-deterministic fashion as a result of subtle differences in the communications infrastructure [420]. Small differences in the mass of input provided by these systems may lead to radically different software behaviours.
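This apparent non-determinism can be demonstrated in a few lines of code. The sketch below is deterministic statement by statement, yet its output depends on how the scheduler interleaves the threads; it assumes only the Python standard library, and on some interpreters the race may manifest only rarely.

    import threading

    def run_once() -> int:
        """Four threads increment a shared counter without locking."""
        counter = 0

        def worker():
            nonlocal counter
            for _ in range(100_000):
                counter += 1  # read-modify-write is not atomic

        threads = [threading.Thread(target=worker) for _ in range(4)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return counter

    # Identical instructions and identical inputs, yet the set of observed
    # totals can vary from run to run as the scheduler interleaves threads.
    print({run_once() for _ in range(5)})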
The problem is not that the code itself is non-deterministic. Rather, it can be almost impossible for operators and maintenance engineers to detect and diagnose the particular set of input conditions that caused the software to react in the manner that is described within an incident report. The consequences of this should not be underestimated. In particular, it makes it difficult for engineers to distinguish between transient or intermittent hardware failures and software bugs arising from rare combinations of input conditions. It can also be difficult to ensure that bug fixes reach all end-users once a safety-critical product has been distributed. These practical difficulties are again illustrated by an incident report from the FDA's MAUDE system:

"For approximately three weeks user hasn't been able to archive patient treatments due to software error. (The) facility has attempted to have company fix system in person but has only been successful at having company try by modem but to no avail." ([272], Report Number 269987)

The introduction of bug fixes can also introduce new faults that must, in turn, be rectified by further modification.
3.4.1 Failure Throughout the Lifecycle

Jeffcott and Johnson [396] argue that many software failures stem from decisions that are taken by high-level management. They illustrate this argument as part of a study into the organisational roots of software failures in the UK National Health Service. For example, the inquiry into the failure of the London Ambulance Computer Aided Dispatch System criticised the initial tendering process that was used:

"Amongst the papers relating to the selection process there is no evidence of key questions being asked about why the Apricot bid, particularly the software cost, was substantially lower than other bidders. Neither is there evidence of serious investigation, other than the usual references, of Systems Options or any other of the potential suppliers' software development experience and abilities." ([772], page 18)

Such problems are typical of industries that are struggling to adapt management and procurement policies to the particular demands of software acquisition and development. They also illustrate the ways in which the various genotypes, such as managerial failure, help to create the conditions in which other forms of failure are more likely to manifest themselves.

The causes of software bugs can be traced back to the development stages where they were first introduced. For instance, the IEC 61508 development standard distinguishes between eleven lifecycle phases: initial conceptual design; the identification of the project scope; hazard and risk assessment; identification of overall safety requirements; resource allocation to meet safety requirements; planning of implementation and validation; system realisation; installation and commissioning; validation; operation and maintenance; and modification [420]. Software failures typically have their roots early in this development cycle. Many incidents stem from inadequate risk assessment. This is important in standards such as IEC 61508 that guide the allocation of software design resources in proportion to the predicted likelihood of a failure and its anticipated consequences. Errors during this risk assessment phase may result in unjustified attention being paid to minor aspects of software functionality whilst too little care may be taken with other more critical aspects of a design. Any code that is then developed will fail to ensure the overall safety of an application even though it runs in the manner anticipated by the programmer. Such problems are often caught during subsequent validation and verification. Those failures that do occur are, therefore, not only the result of an initial mistake or genotype. They also stem from failures in the multiple barriers that are intended to prevent faults from propagating into a final implementation. The IEC 61508 standard requires that the staff employed on each development task must be competent; they must understand the importance of their task within the overall development lifecycle; their work must be open to verification; it must be monitored by a safety management system; it must be well documented; and it must be integrated within a functional safety assessment. These requirements apply across all of the lifecycle phases and are intended to ensure that failures do not propagate into a final implementation.
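The proportionality argument can be illustrated with a toy risk matrix. IEC 61508 derives safety integrity levels through a more involved process; the categories, scoring and thresholds below are hypothetical and show only the shape of the allocation idea.

    # Toy risk matrix: map predicted likelihood and consequence of a failure
    # to an illustrative integrity level that steers design and review effort.
    # These are NOT the IEC 61508 tables; purely a sketch.
    CONSEQUENCE = ["minor", "serious", "critical", "catastrophic"]
    LIKELIHOOD = ["improbable", "remote", "occasional", "frequent"]

    def integrity_level(consequence: str, likelihood: str) -> int:
        """Return an illustrative integrity level in the range 1..4."""
        score = CONSEQUENCE.index(consequence) + LIKELIHOOD.index(likelihood)
        return min(4, max(1, score))

    # A mis-judged risk assessment shifts effort: rating a critical/occasional
    # function as serious/remote drops it from level 4 to level 2.
    print(integrity_level("critical", "occasional"), integrity_level("serious", "remote"))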
Managerial failures are an important precursor to other problems during software development, such as inadequate requirements capture [415]. This is significant because it has often been argued that the costs of fixing software bugs rise rapidly as development progresses. For example, Kotonya and Sommerville estimate that the costs of fixing a requirements error may be up to one hundred times the costs of fixing a simple programming error [459]. Such estimates have important implications for incident reporting. There can be insufficient resources to fix those software failures that are reported once a system is in operation. Many development organisations have introduced reporting schemes, such as NASA's Incidents, Surprises and Anomalies application, to elicit safety concerns well before software is deployed.

Requirements analysis helps to identify the functions that software should perform. It also helps to capture additional non-functional constraints, including usability and safety criteria. There are many reasons for the failure of requirements elicitation techniques. The following list provides a partial summary:
lack of stakeholder involvement. The end-users who arguably know most about day-to-day operation may not be sufficiently consulted. In consequence, software engineers can get a distorted view of an application process. Similarly, some sectors of plant management and operation may not be adequately consulted. This may bias software engineers towards considering the needs of one group of users.

incorrect environmental assumptions. A very common source of requirements problems stems from incorrect assumptions about the environment in which a software system will operate. Neumann's collection of computer-related risks contains numerous examples of variables that have fallen above or below their anticipated ranges during `normal' operation [627].

communications failures within development teams. Incorrect assumptions about operating environments often occur because software engineers must rely upon information provided by domain experts. Problems arise when these specialists must communicate technical expertise to people from other disciplines.

inadequate conflict management. It is easy to underestimate the impact that social dynamics can have upon requirements engineering. Different stakeholders can hold radically different views about the purpose and priorities of application software. Requirements capture will fail if it does not address and resolve the tensions that are created by these conflicts. In particular, they can result in inconsistent requirements, for example between speed and cost, that cannot be met by any potential design.

lack of `ecological' validity. It has increasingly been argued that requirements cannot simply be gathered by asking people about the intended role of software components [459]. In order to gain a deeper understanding of the way in which software must contribute to the overall operation of a system, it is important to carefully observe the day-to-day operation of that system.
As software engineering projects move from requirements elicitation towards installation and operation, they typically pass through a specification stage. This process identifies what a system must do in order to satisfy its requirements. It does not, however, consider the precise implementation details of how those requirements will be met. A similar array of problems affects this stage of software development:
inadequate resolution of ambiguity. There is no general agreement about the best means of expressing requirements for large-scale software engineering projects. Formal and semi-formal notations provide means of reducing the ambiguity that can arise when natural language terms are used in a requirements document. However, these mathematical and diagrammatic techniques suffer from other limitations.
inadequate peer review. Formal and semi-formal notations can be used to avoid the ambiguity and inconsistency of natural language. However, they may only be accessible to some of the people who are involved in the development process. In particular, they typically cannot be reviewed by the domain experts and stakeholders who must inform requirements elicitation.
lack of change management. Requirements will change over time as analysts consult more and more of the stakeholders involved in a system. These changes can result in `feature accretion'; the core application functionality may become obscured by a lengthening wish-list of less critical features.

lack of requirements maintenance. The constraints that software must satisfy will change during the lifetime of a system. Unless these changes trigger maintenance updates then software will continue to satisfy obsolete functional and non-functional requirements [434].
Errors in requirements elicitation and specification are more difficult to rectify than simple programming errors. There is, however, a bewildering array of potential pitfalls for the programmers of safety-critical systems. These include logical errors in calculations, such as attempting to divide a number by zero. They also include errors that relate to the handling of information within a program. For example, a variable may be used before it has been initialised with its intended value. The types of data that are represented within the program may not accurately match the full range of values that are provided as input to the program. The representations of these types may also differ between components of a program that are written by different teams or companies. The defences of strong typing that prevent such problems may be subverted or ignored. Valuable data may be over-written and then later accessed as though it still existed. A further class of problems relates to what is known as the flow of control. Instead of executing an intended sequence of instructions or of inspecting a particular memory location, an arbitrary jump may be introduced through an incorrect reference or instruction. Other problems relate to the way in which a particular piece of code eventually executes at run-time. For example, there are differences between the precision with which data is represented on different target processors. It is important not to underestimate the consequences of such coding errors. For example, the report into the London Ambulance Dispatch System failure records how such a bug caused the entire system to fail:

"The Inquiry Team has concluded that the system crash was caused by a minor programming error. In carrying out some work on the system some three weeks previously the Systems Options programmer had inadvertently left in the system a piece of program code that caused a small amount of memory within the file server to be used up and not released every time a vehicle mobilisation was generated by the system. Over a three week period these activities had gradually used up all available memory thus causing the system to crash. This programming error should not have occurred and was caused by carelessness and lack of quality assurance of program code changes." ([772], page 45)
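The defect pattern that the Inquiry describes, per-event state that is allocated but never released, is easy to reproduce. The names below are hypothetical; the report does not identify the actual code.

    # Sketch of a slow leak of the kind described in the LAS report: every
    # mobilisation appends to a module-level structure that is never cleared,
    # so memory use grows until the process fails weeks later.
    _mobilisation_log = []  # grows without bound

    def mobilise_vehicle(incident_id: str, vehicle_id: str) -> None:
        # ... dispatch logic elided ...
        _mobilisation_log.append({"incident": incident_id, "vehicle": vehicle_id})
        # Missing: any policy that trims or releases old entries.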
This quotation again illustrates the genotypes that lead to software failures. Errors can result from time and cost pressures; programmers may lack the resources that are necessary to ensure type consistency and other necessary properties across module interfaces. If programmers receive inadequate training then they may fail to recognise that they have made an error. These problems can, in turn, be compounded by the lack of adequate tool support during various stages of implementation and testing. Designers cannot be certain of eliminating all bugs from complex software systems. As a result, development resources must be allocated in proportion to the criticality of the code. If fewer resources are allocated to a module then there is, in theory, a higher likelihood that bugs will remain in that section of a program.

Further problems stem from the difficulty of performing static and dynamic tests on complex and embedded systems. Dynamic testing involves the execution of code. This is intuitively appealing and can provide relatively direct results. It is also fraught with problems. It can be difficult to accurately simulate the environment that software will execute in. For instance, the Lyons report spends several pages considering the reasons why the inertial reference system (SRI) was not fully tested before Ariane flight 501:

"When the project test philosophy was defined, the importance of having the SRIs in the loop was recognised and a decision was made (to incorporate them in the test). At a later stage of the programme (in 1992), this decision was changed. It was decided not to have the actual SRIs in the loop for the following reasons: the SRIs should be considered to be fully qualified at equipment level; the precision of the navigation software in the on-board computer depends critically on the precision of the SRI measurements. In the Functional Simulation Facility (ISF), this precision could not be achieved by electronics creating test signals; the simulation of failure modes is not possible with real equipment, but only with a model; the base period of the SRI is 1 millisecond whilst that of the simulation at the ISF is 6 milliseconds. This adds to the complexity of the interfacing electronics and may further reduce the precision of the simulation." (page 9, [505])

Even in simple cases there are so many different execution paths and possible inputs that they cannot all be tested through dynamic analysis. As a result, many organisations have turned to combinations of both dynamic and static forms of testing. Static analysis evaluates the software without executing it. This relies upon reasoning about an abstraction of the specific machine that is eventually constructed by running code on a particular processor. For instance, walkthroughs can be performed by analysing the changing values of different variables as each line of code is executed by hand. Of course, this becomes increasingly problematic if the code is distributed. Formal, mathematical techniques can be used to reason about the behaviour of such software. However, all of these approaches rely upon reasoning about abstractions of the eventual system. There continue to be both theoretical and practical difficulties in refining proofs about models of a system into assertions about the potential behaviour of software operating on particular processors. The key point in all of this is that both static and dynamic testing provide means of increasing our assurance about the quality of a particular piece of code. Neither provides absolute guarantees. As a result, it seems likely that incident reporting systems will continue to provide valuable information about the symptoms of software failure for some time to come.

Redundancy can be used to reduce the likelihood of software failures. Several different routines can be used to perform the same function. The results from these computations can be compared and a vote taken to establish agreement before execution proceeds. If one section of code calculates an erroneous value then its result can be overruled by comparison with the other results. Lack of redundancy can, therefore, be seen as a source of software failure. However, redundancy introduces complexity and can itself yield further implementation problems. It can also be difficult to ensure true diversity. For instance, programmers often resort to the same widely published solutions to common problems. If those solutions share common faults then these may be propagated into several versions of the redundant code. Even if redundancy is successfully deployed, it can raise a number of further technical problems for the successful detection and resolution of incidents. For instance, redundancy is compromised if a routine continually computes an erroneous result but is successfully over-ruled by other implementations. The system will be vulnerable to failures in any of the alternative implementations of that function. It is, therefore, critical to monitor and respond to recurrent failures in redundant code.
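A minimal voting harness makes the last point concrete: the vote must not only select a majority result but also record dissent, so that a channel that is repeatedly overruled shows up in incident data rather than being silently discarded. The three channel implementations below are hypothetical.

    from collections import Counter

    def vote(results):
        """Majority vote over redundant channels; also report dissent."""
        tally = Counter(results)
        value, count = tally.most_common(1)[0]
        if count <= len(results) // 2:
            raise RuntimeError("no majority among redundant channels")
        return value, len(results) - count

    # Three nominally diverse implementations of the same squaring function.
    def channel_a(x): return x * x
    def channel_b(x): return x ** 2
    def channel_c(x): return x * x + 1e-9  # deliberately faulty channel

    value, dissent = vote([f(4.0) for f in (channel_a, channel_b, channel_c)])
    print(value, dissent)  # 16.0 with 1 dissenter: log the dissent, do not just drop it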
Poor documentation can prevent technical staff from installing and configuring safety-critical applications. It can prevent end-users from responding appropriately to system prompts and directives. These problems can, in turn, compound the results of previous software failures if users cannot intervene in a timely fashion. Inadequate documentation can also be a cause of implementation errors in safety-critical programs. It is hard for programmers to correctly use their colleagues' work if they cannot understand the interfaces between modules. This problem also affects engineers who must maintain legacy systems. In particular, programmers often have to understand not simply what a piece of code does but also WHY it does it in a particular manner. This is critical if maintenance engineers are to justify their response to the problems identified by incident reporting systems. It is also important if engineers are to determine whether or not code can be deactivated or reused when it is ported between applications. There are close connections between these specific documentation issues, the problems of dynamic testing and the managerial causes of software failure:

"Strong project management might also have minimised another difficulty experienced by the development. The developers, in their eagerness to please users, often put through software changes `on the fly' thus circumventing the official Project Issue Report (PIR) procedures whereby all such changes should be controlled.
These `on the fly' changes also reduced the effectiveness of the testing procedures as previously tested software would be amended without the knowledge of the project group. Such changes could, and did, introduce further bugs." [772]

As mentioned, changes in the operating environment can invalidate the assumptions that were documented during any initial requirements engineering. Modifications that are introduced in response to those changes can, in turn, introduce further faults. Any one of these genotypes can lead to the incidents of software failure that are increasingly being documented by reporting systems [420].
3.4.2 Problems in Forensic Software Engineering

Many well-established techniques support the design and implementation of safety-critical systems. Unfortunately, very few support the investigation and analysis of software failure. These problems often manifest themselves in the recommendations that are made following such failures. In particular, many current standards advocate the importance of process measures as an indication of quality during safety-critical systems development. This means that regulators and quality assurance offices focus on whether appropriate practices have been followed during the various stages of the development process. They do not attempt to directly assess the quality of the final product itself. This avoids the many problems that arise when attempting to define appropriate measures of software quality [486]. However, this approach creates tremendous problems for the maintenance of incident reporting systems. The identification of a software fault throws doubt not only on the code that led to the failure but also on the entire development process that produced that code. At worst, all of the other code cut by that team, or by any other teams practising the same development techniques, may be under suspicion. Readers can obtain a flavour of this in the closing pages of the Lyons report into the Ariane 5 failure. The developers must:

"Review all flight software (including embedded software), and in particular: Identify all implicit assumptions made by the code and its justification documents on the values of quantities provided by the equipment. Check these assumptions against the restrictions on use of the equipment." [505]

Unfortunately, this citation does not identify any tools or techniques that might be used to `identify all implicit assumptions' in thousands of lines of code. Such comments perhaps reveal some confusion about the practical problems involved in software development. This is illustrated by a citation from the report into the London Ambulance Computer Aided Dispatch system; previous sections have identified a number of reasons why software cannot be totally reliable:

"A critical system such as this, as pointed out earlier, amongst other prerequisites must have totally reliable software. This implies that quality assurance procedures must be formalised and extensive. Although Systems Options Ltd (SO) had a part-time QA resource it was clearly not fully effective and, more importantly, not independent." (Paragraph 3083, [772])

Software-related incidents typically stem from more systemic problems. Bugs are often the result of inadequate funding or skill shortages. These failures are rooted in project management, including the risk assessment techniques that help to identify the criticality of particular sections of code. Many complex software failures also involve interactions between faulty and correct subsystems. They can stem from detailed interaction between hardware and software components. The nature of such incidents is illustrated by the following report from the FAA's Aviation Safety Reporting System, in which an erroneous TCAS II advisory interacted with the Ground Proximity Warning System:

"Climbing through 1,200 feet [on departure] we had a TCAS II Resolution Advisory (RA) and a command to descend at maximum rate (1,500 to 2,000 feet per minute). [The flight crew followed the RA and began a descent.] At 500 feet AGL we leveled off, the TCAS II still saying to descend at maximum rate. With high terrain approaching, we started a maximum rate climb.
TCAS II showed a Traffic Advisory (TA) without an altitude ahead of us, and an RA [at] plus 200 feet behind us... Had we followed the
TCAS directions we would definitely have crashed. If the weather had been low IFR, I feel we would have crashed following the TCAS II directions. At one point we had TCAS II saying `Descend Maximum Rate,' and the GPWS (Ground Proximity Warning System) saying `Pull Up, Pull Up.' [The] ATC [Controller] said he showed no traffic conflict at any time." [546]

There are a number of reasons why traditional software engineering techniques cannot easily be applied to analyse the causes and consequences of software-related failures. Most existing techniques address the problems of complexity by functional decomposition [486]. This assumes that by improving the reliability of individual components it is possible to improve the safety of an entire system. Such a decomposition often fails to account for interactions between subsystems. For example, the previous incident was caused by a software failure but resolved by operator intervention. Any redesign of the TCAS system must, therefore, ensure the reliability of the software and preserve the crews' ability to identify potential TCAS failures. A number of further problems complicate the use of traditional software engineering techniques to analyse incidents involving programmable systems. At one level, a failure can be caused because error-handling routines failed to deal with a particular condition. At another level, however, analysts might argue that the fault lay with the code that initially generated the exception. Both of these problems might, in turn, be associated with poor testing or flawed requirements capture. Questions can also be asked about the quality of training that programmers and designers receive. These different levels of causal analysis stretch back to operational management and to the contractors who develop and maintain application software. This multi-level analysis of the causes of software failure has important consequences. Existing software engineering techniques are heavily biased towards the requirements engineering, implementation and testing of safety-critical systems. There has been relatively little work into how different management practices contribute to, or compound, failures at more than one of these levels [396]. Leveson argues that:

"...in general, it is a mistake to patch just one causal factor (such as the software) and assume that future accidents will be eliminated. Accidents are unlikely to occur in exactly the same way again. If we patch only the symptoms and ignore the deeper underlying cause of one accident, we are unlikely to have much effect on future accidents. The series of accidents involving the Therac-25 is a good example of exactly this problem: Fixing each individual software flaw as it was found did not solve the safety problems of the device." (page 551, [486])

An alternative approach is to build on the way that standards, such as IEC 61508, advocate the use of different techniques to address different development issues [879]. A range of different experts can be brought in to look at each different aspect of an incident. Management experts might focus on the organisational causes of failure. Human factors specialists would use human factors techniques to investigate the role that operator behaviour played in an incident, and so on. There are several objections to this approach. The cost of multidisciplinary investigations restricts them to high-risk mishaps. It can also be difficult to reconcile the views of individual team members from a range of different disciplines.
Lekberg has shown that the previous background of investigators will bias their interpretation of an incident [484]. Analysts are also most likely to find the causal factors that are best identified using the tools and techniques with which they are familiar. In the case of software engineering, this might result in analysts identifying those causal factors that relate most strongly to requirements capture, to implementation or to testing rather than to the overall management of a software project. There is also a danger that such a multidisciplinary approach will suffer from problems that are similar to traditional techniques based on functional decomposition. If each expert focuses on their particular aspect of an incident then they may neglect the interactions between system components. Further problems complicate the analysis of software failures. For example, simulation is an important tool in many incident investigations. Several hypotheses about the sinking of the MV Estonia were dismissed through testing models in a specially adapted tank [227]. Unfortunately, incident investigators must often account for software behaviours in circumstances that cannot easily
be recreated. The same physical laws that convinced the sub-contractors not to test the Ariane 5's inertial reference systems in the Functional Simulation Facility also frustrate attempts to simulate the incident [505]. Similarly, it can be difficult to recreate the exact circumstances which help to shape operator intervention. This is a general problem for the simulation of complex systems. However, it is particularly severe for software systems that support synchronous interaction between teams of users and their highly distributed systems [415]. These issues form the focus of the next section.
3.5 Human Failures

Human failure plays a significant role in incidents and accidents. For instance, Van Cott cites studies which find that 85% of all incidents involving automobiles are caused by human error, as are 70% of all incidents in U.S. nuclear power plants, 65% in worldwide jet cargo transport and 31% in petrochemical plants [185]. Similarly, Nagel argues that humans are implicated as `causal factors' in more than half of all aircraft accidents. Within this figure, he argues they are involved in nine out of ten incidents involving general aviation [557]. These estimates can be misleading. Even those incidents that involve periodic hardware failures can be ascribed to human failure in the maintenance cycle. Failures that involve adverse meteorological conditions are caused by poor judgement in exposing the system to the risks associated with poor weather. It can be argued that all accidents and incidents are ultimately the responsibility of the regulatory authorities who must monitor and intervene to guarantee the safety of an industry. It is, therefore, perhaps better to distinguish between the proximal and distal impact of human error in the causation of adverse events. For instance, Heinrich claimed that up to 88% of all accidents stem from dangerous acts by individual workers [340].
3.5.1 Individual Characteristics and Performance Shaping Factors

Reason [699] and Wickens [863] provide sustained introductions to diverse forms of human error. In contrast, this section provides an introductory overview.

Pr(B ∧ A | C) > Pr(B | C) · Pr(A | C)    (11.10)
The key point to understanding this formula is that causes do not make their effects probable. They simply make them more probable than they otherwise would have been. We can also say that a is negatively causally related to b when the probability of A and B given C is less than the probability of B given C multiplied by the probability of A given C [313]. For instance, the probability of an engineer failing to obtain a fuel-tight seal, b, and of that engineer reporting the problem associated with the sealing surface, a, are together less than the independent probabilities of the engineer reporting the problem multiplied by the probability of the engineer failing to obtain the seal. This follows because the fact that the engineer reported the maintenance problem makes it less likely that they will be satisfied by any subsequent attempt to form a seal on the damaged surface:
Pr(B ∧ A | C) < Pr(B | C) · Pr(A | C)    (11.11)
From this line of argument, we can say that a's cause b's under circumstances C if a's precede b's in the temporal sequence leading to an incident and it is the case that the probability of B and A in C is greater than the probability of B given that we know ¬A and C. Or we can say that a's cause b's under C if a's precede b's in the temporal sequence leading to an incident and it is the case that the probability of B and A in C is greater than the probability of B given only C:
Pr(B ∧ A | C) > Pr(B | ¬A ∧ C)  ∨  Pr(B | A ∧ C) > Pr(B | C)    (11.12)
Unfortunately, this formalisation leads to further problems. For example, it may be that a precedes b and that Pr(B ∧ A | C) > Pr(B | ¬A ∧ C) but that a and b are effects of the same cause. One way to avoid this is to examine the events prior to a to determine whether there is another event that might `screen off' or account for both a and b. Further models have been developed to formalise this approach [765] and these, in turn, have been further criticised [313]. The key point here is to provide an impression of the complexity that must be addressed by any attempt to exploit probabilistic models of causation as a means of supporting incident analysis. The initial appeal of an alternative to deterministic models rapidly fades as one considers the complexity of an alternative formulation.
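These relevance tests are straightforward to evaluate once event frequencies are available. The sketch below estimates the probabilities in (11.10)-(11.12) from a table of incident records; the records are invented for illustration and are all assumed to share the background context C.

    # Estimate the probabilistic-relevance tests from hypothetical incident
    # records. Each record states whether candidate cause A and effect B were
    # observed; all records are assumed to share the background context C.
    records = [
        {"A": True, "B": True}, {"A": True, "B": True}, {"A": True, "B": False},
        {"A": False, "B": True}, {"A": False, "B": False}, {"A": False, "B": False},
    ]

    def pr(pred):
        return sum(1 for r in records if pred(r)) / len(records)

    p_a = pr(lambda r: r["A"])
    p_b = pr(lambda r: r["B"])
    p_ab = pr(lambda r: r["A"] and r["B"])
    p_b_given_a = p_ab / p_a
    p_b_given_not_a = pr(lambda r: (not r["A"]) and r["B"]) / (1 - p_a)

    print(p_ab > p_a * p_b)               # (11.10): positive relevance holds
    print(p_b_given_a > p_b_given_not_a)  # (11.12): B is more likely when A occurs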
One important source of additional complexity is that causal factors may both promote and confound particular effects. In this refinement, some factor that causes a's to occur can have an independent negative influence on the occurrence of b's. For instance, the probability that the Nanticoke fire would lead to the loss of the vessel was increased by the lack of effective monitoring when the initial fire developed on the port side of the engine room. This lack of monitoring might have been a result of having both the Chief Engineer and the Mechanical Assistant in the Control Room during the fire drill. However, the same circumstances that interfered with their monitoring responsibilities may also have reduced the probability that the fire would jeopardise the safety of the vessel because both crewmembers could initiate the eventual response to the incident. Similarly, the increased probability of b from a by one causal path can be offset by negative influences from a along another causal path. For example, the fire drill procedures may have made it more likely that the vessel would be seriously damaged by distracting members of the crew from their normal activities. The same drills may have made it less likely that the vessel would be seriously damaged because members of the crew were already prepared to respond to the general alarm that was sounded by the Chief Engineer. The importance of these mitigating factors has been repeatedly emphasised in recent studies of incident reporting systems [842]. Unfortunately, these factors are not adequately represented within many deterministic causal analysis techniques. The proponents of probabilistic theories of causation have responded to these observations by revising the previous formulations to include a partition S_j of all relevant factors apart from A and C. From this it follows that a's cause b's in circumstances C if and only if:
∀j: Pr(B | A ∧ S_j ∧ C) > Pr(B | S_j ∧ C)    (11.13)
{S_j} is a partition of all relevant factors excluding A and C. These factors represent the negative or positive causal factors, c_1, ..., c_m, that must be held fixed in order to observe the causal effect of a. We require that any element, d, of a subset in S_j is in c_i if and only if it is a cause of b or ¬b, other than a, and it is not caused by a. For instance, a hot manifold is liable to have a negligible impact on an existing fire. We can, therefore, include a factor, c_i, in each subset to require that a fire must not have already started in order for a hot manifold, a, to ignite a fuel source, b. Each of the factors in c_1, ..., c_m must be represented in each subset. Each factor must also either be present or absent; there may or may not be an existing fire. This results in 2^m possible combinations of present or absent factors. Some combinations of the factors c_1, ..., c_m will be impossible. Hence some combinations of c_i can be excluded from S_j. For example, it is difficult to foresee a situation in which the engine room is flooded with Halon gas and the fire continues to burn. Yet both of these factors could prevent us from observing an ignition caused by a hot manifold. Other combinations may result in b being assigned a probability of 1 or 0 regardless of a. For instance, if the engine room were flooded with Halon then the fire should not ignite irrespective of the state of the exhaust manifold. As mentioned, these impossible combinations and combinations that determine b are omitted from S_j. All the remaining combinations of causal factors must be explicitly considered as potential test conditions and are elements of S_j. In other words, a's must cause b's in every situation described by S_j. Some proponents of this partition theory dispense with any explicit representation of the context, C [153]. This approach relies entirely upon the partitioning represented by S_j. This is misleading. Causal relations may change from one context to another. For instance, the effects of a fuel leak may depend upon the pressure at which the fuel escapes. This, in turn, may depend upon the size and configuration of a generator. The meta-level point here is that we would like causal relations to hold over a variety of circumstances; these are characterised by S_j. We cannot, however, expect to identify causal relations that are not relativised to some background context [313]. A number of objections have inspired further elaborations to this partition model of causation [221]. In terms of this book, however, we are less interested in the details of these reformulations than we are in determining whether these models might support the causal analysis of incidents. The abstract model, outlined above, provides a structure for the analysis of incidents in the same way that Why-Because graphs and the associated proof templates provided by Ladkin and Loer also provide a structure for causal analysis. For example, we can apply the partition model to the Nanticoke example by identifying candidate causal relations. Investigators can use their domain
expertise to determine those relations that are then subjected to a more formal analysis; this equates to stage 2 of the method proposed for Hempel's model given above. For instance, previous sections have argued that it is difficult to determine the ignition source for the Nanticoke fire. This causes problems for deterministic causal models. We might, therefore, exploit the partition model to represent a causal relationship between an ignition event, b, and the fuel oil coming into contact with an exhaust manifold, a. As mentioned, C represents all state descriptions for the system under consideration. We might, therefore, informally argue that C represents the state of any merchant vessel that relied upon diesel generators. This context might be narrowed if the formalisation of the incident is intended only to apply to a restricted subset of these ships. In contrast, it might be extended if the formalisation also captures important properties of other vessels, such as military ships that employ diesel generators. Irrespective of the precise interpretation, it is important that analysts explicitly identify this context because it helps other investigators to understand the scope of the model. We can then go on to identify other causal factors that might be represented in subsets of the form c_1, ..., c_m ∈ S_j. Recall that d is in c_1, ..., c_m if and only if it is a cause of b or ¬b, other than a, and it is not caused by a:
c_1 represents `the room floods with Halon';
c_2 represents `fuel is sprayed at pressure';
c_3 represents `shielding protects the manifold'.

As mentioned, individual factors may either be present or absent during particular incidents. There are, therefore, 2^3 potential elements of S_j. In the following, the omission of a factor from any set implies that it is absent. The first subset represents a situation in which all of the previous causal factors are present: the room floods with Halon, the fuel is sprayed at pressure and shielding protects the manifold. The second of the subsets indicates that all of the factors are true except for the last one; the shielding does not protect the manifold.
{c_1, c_2, c_3}, {c_1, c_2}, {c_1, c_3}, {c_2, c_3}, {c_1}, {c_2}, {c_3}, {}

We can, however, reduce the number of combinations that we need to consider in order to establish a causal relation between a and b. As mentioned, some combinations of these causal factors are impossible. Other combinations may entirely determine the effect irrespective of the putative cause. For example, we can ignore any subset that contains c_1. If the room floods with Halon then the fire will not ignite, b, whatever happens to the fuel and the manifold, a. Conversely, we can insist that all subsets must include c_2. If the fuel is not sprayed at pressure then the fire will not ignite even if the fuel oil comes into contact with an exhaust manifold, as the manifold may not reach the
flash-point of the fuel. In order to establish causality, we must, however, consider whether a increases the probability of b taking all other combinations of the causal factors, c_i, into account:
{c_2, c_3}, {c_2}

In other words, in order for a causal relation to hold between an ignition event, b, and the fuel oil coming into contact with an exhaust manifold, a, we must show that the effect would still be more likely if fuel is sprayed at pressure whether or not shielding protected the manifold. The shielding might reduce the absolute probability of the ignition but may not necessarily reduce it to zero, as a Halon deployment might. We must, therefore, show that the cause still increases the probability of the effect in both of these conditions.
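The enumeration and pruning steps of this worked example can be rendered as a short program. The admissibility rules follow the argument above; the conditional probabilities are invented solely to make the test executable, since the partition model does not itself supply them.

    # Toy rendering of the partition test (11.13) for the Nanticoke example.
    # We ask whether a (`fuel contacts hot manifold') raises the probability
    # of b (`ignition') in every admissible combination of relevant factors.
    from itertools import combinations

    FACTORS = ("fuel_at_pressure", "halon_flood", "shielding")

    def admissible(subset: frozenset) -> bool:
        # Halon flooding determines the effect outright; fuel pressure is required.
        return "halon_flood" not in subset and "fuel_at_pressure" in subset

    # Hypothetical Pr(b | a present/absent, S_j) values, for illustration only.
    PR_IGNITION = {
        (frozenset({"fuel_at_pressure"}), True): 0.30,
        (frozenset({"fuel_at_pressure"}), False): 0.05,
        (frozenset({"fuel_at_pressure", "shielding"}), True): 0.10,
        (frozenset({"fuel_at_pressure", "shielding"}), False): 0.05,
    }

    subsets = [frozenset(c) for r in range(len(FACTORS) + 1)
               for c in combinations(FACTORS, r)]
    partition = [s for s in subsets if admissible(s)]

    is_cause = all(PR_IGNITION[(s, True)] > PR_IGNITION[(s, False)] for s in partition)
    print(sorted(map(sorted, partition)), is_cause)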
factors represented by ci. For example, we could have extended the set of relevant factors, ci, to include `fuel oil comes into contact with an indicator tap'. We can also use the same techniques to represent and reason about the impact of mitigating factors. This again was problematic in deterministic techniques. In the previous example, we had to demonstrate that an ignition was more likely to occur whether or not the manifold was protected by shielding. We also showed how the same approach can represent barriers, such as Halon deployment, that prevent an effect from occurring. It is important to stress that these arguments about the probability of an ignition must be validated [195]. The partition model helps here because analysts can explicitly represent the anticipated impact of contributory causes, of mitigating events and of potential barriers. In contrast, many deterministic techniques consider these issues as secondary to the process of enumerating those failures that led to an incident. There are, however, a number of practical concerns that arise during the application of the partition model of non-deterministic causation. All of the relevant factors, ci, in the previous example were carefully chosen to be events. This satisfies the requirement that `d is in {c1, ..., cm} if and only if it is a cause of b or ¬b, other than a, and it is not caused by a'. Previous informal examples in this section have argued that a hot manifold would not have ignited the fire if a fire had already been burning. Ideally, we would like to extend ci to include an appropriate state predicate so that we can explicitly represent and reason about such a situation. Alternatively, we could refine the relatively abstract view of the context, C, that was introduced in this example. Further concerns stem from the problems of applying an abstract model of causation to support incident analysis. It is entirely possible that the previous example reflects mistakes in our interpretation of the theoretical work of Cartwright [153] and Hausman [313]. Further work is, therefore, needed to determine appropriate formulations and interpretations of these non-deterministic models. This brief example does, however, demonstrate that probabilistic approaches can avoid some of the problems that uncertainty creates for the deterministic techniques that have been presented in previous sections. There are also a number of more theoretical concerns about the utility of partition models. The formula (11.13) ensures that a increases the probability of b irrespective of the values assigned to these other relevant factors. This provides a definition of causation in which the mitigating effects of these relevant factors must not offset the increased probability of an associated effect. This might seem a reasonable criterion for causality. It does, however, lead to a number of philosophical problems. For instance, it might be argued that the crew's failure to regularly inspect the engine room is a potential cause of major fires such as that on board the Nanticoke. It might equally be argued that, under certain circumstances, regular inspection of the engine room might lead to a major fire. For example, operators might miss an automated warning in the control room that indicated a potential problem elsewhere in the engine room [621]. Under the system described above, such circumstances would prevent investigators from arguing that lack of inspection is a cause of major fires!
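The probability-raising test behind formula (11.13) is straightforward to mechanise once estimates are available. The following is a minimal sketch in Python; the probability figures and the encoding of the factors are our own illustrative assumptions, not values drawn from the Nanticoke investigation.

    # Hypothetical probability estimates for illustration only; a real
    # analysis would require validated figures from the investigation.
    # Key: (fuel contact present?, background factors present) -> Pr(ignition).
    pr_ignition = {
        (True,  frozenset({'c2', 'c3'})): 0.30,
        (False, frozenset({'c2', 'c3'})): 0.05,
        (True,  frozenset({'c2'})):       0.60,
        (False, frozenset({'c2'})):       0.10,
    }

    # Admissible background contexts: c1 (Halon flood) is excluded because
    # it determines the effect on its own, and c2 (fuel sprayed at
    # pressure) must be present or the effect cannot occur at all.
    admissible = [frozenset({'c2', 'c3'}), frozenset({'c2'})]

    def raises_probability(contexts):
        """Partition-model test: the putative cause must raise the
        probability of the effect in every admissible context."""
        return all(pr_ignition[(True, s)] > pr_ignition[(False, s)]
                   for s in contexts)

    print(raises_probability(admissible))  # True under these estimates

Under these invented estimates the test succeeds; if any admissible context reversed the inequality, the causal claim would fail.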
This is a general problem; there are contexts in which "smoking lowers one's probability of getting lung cancer, drinking makes driving safer and not wearing seat-belts makes one less likely to suffer injury or death" [313]. As before, there are a number of refinements of the basic model outlined in (11.13). It remains to be seen whether any of these extensions might provide an adequate framework for the causal analysis of safety-critical incidents. It might seem far-fetched that probabilistic models of causation could yield pragmatic tools for incident analysis. Against this, one might argue that Lewis' possible worlds semantics for counterfactual reasoning would have appeared equally arcane before the development of WBA.
11.3.3 Bayesian Approaches to Probabilistic Causation

There are a number of alternative semantic interpretations for the Pr function introduced in the previous section [150]. In particular, Pr may be viewed either as a measure of confirmation or as a measure of frequency. The former interpretation resembles the Bayesian view; probability is contingent upon the observation of certain evidence. The latter resembles the manner in which engineers derive reliability figures. Estimates of pump failures are derived from the maintenance records of plant components. This distinction has been the subject of some controversy. For instance, Carnap argued: "... for most, perhaps for practically all, of those authors on probability who do not
accept a frequency conception the following holds. i. Their theories are objectivist (and) are usually only preliminary remarks not affecting their actual working method. ii. The objective concept which they mean is similar to (the frequency view of) probability." [150] Brevity prevents a detailed explanation of the contrasting positions in this debate. It is, however, possible to illustrate the common origin of these two different approaches. Both the partition models and Bayesian views exploit conditional probabilities. These also formed the foundation for the treatment of probabilistic causality in the previous chapter. As before, we use the following form to denote that the probability of the event B given the event A in some context C is x:
Pr(B | A ∧ C) = x    (11.14)
From this we can derive the following formula, which states that the conditional probability of B given A in C, multiplied by the probability of A in C, is equivalent to the probability of A and B in C. In other words, the probability of both A and B being true in a given context is equivalent to the probability of A being true multiplied by the probability that B is true given A:
Pr(B | A ∧ C) · Pr(A | C) = Pr(A ∧ B | C)    (11.15)
We can use this axiom of probability calculus to derive Bayes' theorem:
Pr(B | A ∧ C) · Pr(A | C) = Pr(B ∧ A | C)    (Commutative Law applied to ∧ in (11.15))    (11.16)

Pr(B | A ∧ C) · Pr(A | C) = Pr(A | B ∧ C) · Pr(B | C)    (Substitution of RHS using (11.15))    (11.17)

Pr(B | A ∧ C) = (Pr(A | B ∧ C) · Pr(B | C)) / Pr(A | C)    (Divide by Pr(A | C))    (11.18)
The key point about Bayes' theorem is that it helps us to reason about the manner in which our belief in some evidence affects our belief in some hypothesis. In the previous formula, our belief in B is affected by the evidence that we gather for A. It should be emphasised that (11.18) combines three different types of probability. The term Pr(A | C) represents the prior probability that A is true without any additional evidence. The term Pr(B | A ∧ C) represents the posterior probability that B is true having observed A. The remaining term, Pr(A | B ∧ C), represents the likelihood of observing the evidence A if the hypothesis B were true. We can also reformulate (11.18) to determine the likelihood of a potential `cause' [200]. The following formula considers the probability of a given hypothesis, B, in relation to a number of alternative hypotheses, Bi, where B and the Bi are mutually exclusive and exhaustive:
Pr(B | A ∧ C) = (Pr(A | B ∧ C) · Pr(B | C)) / (Pr(A | B ∧ C) · Pr(B | C) + Σi Pr(A | Bi ∧ C) · Pr(Bi | C))    (11.19)
The previous formula can be used to assess the likelihood of a cause, B, given that a potential effect, A, has been observed. This has clear applications in the causal analysis of accidents and incidents. In particular, (11.19) provides a means of using information about previous incidents to guide the causal analysis of future occurrences. In the Nanticoke case study, investigators might be interested to determine the likelihood that reported damage to an engine room had been caused by the pressurised release of fuel from a filter. The first step would involve an analysis of previous incidents. This might reveal that fuel from a
filter was identified as a cause in 2% of previous mishaps. Lubrication oil might account for 1%. Other fuel sources might together account for a further 3% of all incidents:
Pr(filter fire | C) = 0.02    (11.20)
Pr(lube fire | C) = 0.01    (11.21)
Pr(other fire | C) = 0.03    (11.22)
In order to gain more evidence, investigators might try to determine how likely it is that one of these fires would cause serious damage to an engine room. Further analysis might reveal that thirty per cent of previous incidents involving the ignition of filter fuel caused significant damage to an engine room. Twenty per cent of lube fires and fifty per cent of fires involving other fuel sources might have similar consequences:
Pr(engine room damage | filter fire ∧ C) = 0.3    (11.23)
Pr(engine room damage | lube fire ∧ C) = 0.2    (11.24)
Pr(engine room damage | other fire ∧ C) = 0.5    (11.25)
We can now integrate these observations into (11.19) to calculate the probability that a filter fuel fire was a cause given that serious engine room damage has been reported. The following calculation suggests that there is a twenty-six per cent chance that such a filter fire had this effect:
Pr(filter fire | engine room damage ∧ C)
  = Pr(engine room damage | filter fire ∧ C) · Pr(filter fire | C) /
    (Pr(engine room damage | filter fire ∧ C) · Pr(filter fire | C)
     + Pr(engine room damage | lube fire ∧ C) · Pr(lube fire | C)
     + Pr(engine room damage | other fire ∧ C) · Pr(other fire | C))    (11.26)
  = (0.3)·(0.02) / ((0.3)·(0.02) + (0.2)·(0.01) + (0.5)·(0.03))    (11.27)
  = 0.26    (11.28)
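This calculation can also be checked mechanically. The following Python sketch reproduces (11.26)-(11.28); the variable names are ours and the figures remain the hypothetical estimates introduced above rather than validated incident data.

    # Hypothetical priors for each fire source, from (11.20)-(11.22).
    priors = {'filter fire': 0.02, 'lube fire': 0.01, 'other fire': 0.03}

    # Probability of serious engine room damage given each source,
    # from (11.23)-(11.25).
    likelihoods = {'filter fire': 0.3, 'lube fire': 0.2, 'other fire': 0.5}

    def posterior(hypothesis):
        """Apply Bayes' theorem (11.19) over a mutually exclusive and
        exhaustive set of candidate causes."""
        evidence = sum(likelihoods[h] * priors[h] for h in priors)
        return likelihoods[hypothesis] * priors[hypothesis] / evidence

    print(round(posterior('filter fire'), 2))  # 0.26, as in (11.28)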
A number of caveats can be raised against this application of Bayes' theorem. Many concerns centre on our use of evidence about previous mishaps to guide the causal analysis of new incidents. The previous calculations relied upon investigators correctly identifying when a fire had caused `significant damage' to an engine room. There is a danger that different investigators will have different interpretations of such terms. Chapter 15 describes how Bayesian techniques can account for the false positives and negatives that result from these different interpretations. For now it is sufficient to observe that our analysis of previous incident frequencies might bias the causal analysis of future incidents. For instance, we have made the assumption that these incidents occurred in comparable contexts, C. There may be innovative design features, such as new forms of barriers and protection devices, that would invalidate our use of previous frequencies to characterise future failures. Dembski argues that it is seldom possible to have any confidence in prior probabilities [200]. Such figures can only be trusted in a limited number of application domains. For instance, estimates of the likelihood of an illness within the general population can be validated by extensive epidemiological studies. It is difficult to conduct similar studies into the causes of safety-critical accidents and incidents. In spite of initiatives to share incident data across national boundaries, there are few data sources that validate the assumptions represented in (11.23), (11.24) and (11.25). This book has identified a number of different biases that affect the use of data from incident reporting systems. For example, Chapter 5 referred to the relatively low participation rates that affect many incident reporting schemes. This makes it difficult for us to estimate the true frequency of lube fires or filter fires. These incidents may also be extremely rare occurrences. It can, therefore, be very difficult for investigators to derive the information that is required in order to apply (11.19).
In the absence of data sources to validate prior probabilities, investigators typically rely upon a variant of the indifference principle. This states that, given a set of mutually exclusive and exhaustive possibilities, the possibilities are considered to be equi-probable unless there is a reason to think otherwise. This would lead us to assign the same probabilities to fires being caused by filter fuel, by lube oil and by all other sources. Unfortunately, the pragmatic approach suggested by the indifference principle can lead to a number of paradoxes [369]. Objections have also been raised against any method that enables investigators to move from conditional probabilities, such as Pr(A | Bi ∧ C), to their `inverse' likelihoods, Pr(Bi | A ∧ C) [200]. The use of subjective probabilities provides a possible solution to the lack of frequential data that might otherwise support a Bayesian approach to the causal analysis of safety-critical incidents. Subjective probabilities are estimates that individual investigators or groups of investigators might make about the probability of an event. For example, a subjective probability might be an individual's assessment of the chances that lube oil could start a fire that might cause serious damage to an engine room. One standard means of estimating this probability is to ask people to make a choice between two or more lotteries. This technique is usually applied to situations in which it is possible to associate financial rewards with particular outcomes. Von Neumann and Morgenstern provide a detailed justification for the general applicability of this approach [626]. Certain changes must, however, be made in order to explain how these lotteries might support the causal analysis of adverse incidents.

1. I might be offered a situation in which there is a certainty that if a lube oil fire occurs in the next twelve months then it will result in major damage to an engine room;

2. alternatively, I might be offered a form of `gamble'. This requires that I select a token at random from a jar. This jar contains N tokens that are marked to denote that there is no serious damage to an engine room during the next twelve months. The remaining 100-N tokens are marked to denote that there is such an incident.

I will prefer option (2) if every token indicates that engine rooms remain undamaged, i.e. N=100. Conversely, I will prefer option (1) if every token indicates the opposite outcome, i.e. N=0. This requires additional explanation. Recall from option (1) that engine room damage will occur if there is a lube oil fire. In option (2), if N=0 then there is a certainty that engine room damage will occur. This explains the preference for (1): the individual makes a subjective assessment of the likelihood of the lube fire and then must trade this off against the potential for there not to be engine room damage in (2). There will, therefore, be a value of N for which the two situations are equally attractive. At such a position of indifference, N/100 is my estimate of the probability that a lube oil fire will cause serious damage to an engine room in the next year. Jensen notes that "for subjective probabilities defined through such ball drawing gambles the fundamental rule can also be proved" [398]. This fundamental rule is given by formula (11.15), which provided the foundation for Bayes' theorem (11.18).
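The search for the point of indifference can be pictured as a simple bisection over N. The sketch below is illustrative only: the preference function stands in for a real investigator, and its threshold is an invented figure rather than an elicited one.

    def elicit_probability(prefers_gamble, tolerance=1):
        """Bisection search for the indifference point in the token
        lottery. prefers_gamble(n) returns True if the investigator
        prefers drawing from a jar with n 'no damage' tokens (of 100)
        to the certainty described in option (1)."""
        low, high = 0, 100     # option (1) preferred at N=0, option (2) at N=100
        while high - low > tolerance:
            mid = (low + high) // 2
            if prefers_gamble(mid):
                high = mid     # gamble still preferred; try fewer tokens
            else:
                low = mid      # option (1) preferred; try more tokens
        return high / 100      # N/100 approximates the subjective probability

    # Simulated investigator whose underlying estimate is 0.2:
    print(elicit_probability(lambda n: n / 100 > 0.2))  # 0.21, close to 0.2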
A number of problems affect the use of subjective probabilities. An individual's preference between the two previous options is not simply determined by their subjective estimate of the probability of a lube oil fire. It can also be affected by their attitudes towards risk and uncertainty. For example, one person might view a 20% chance of avoiding engine room damage as an attractive gamble. They might, therefore, be willing to accept N=20. Another individual might be very unwilling to accept this same gamble and might, therefore, prefer option (1). These differences may say very little about the individuals' views of the exact likelihood of a lube fire resulting in major engine room damage. In contrast, they may reveal more about their attitudes to uncertainty. The first individual may choose the gamble because they have more information about the likelihood of engine room damage than they do about the lube fire in (1). Individual preferences are also affected by attitude to risk [689]. Experimental evidence has shown that different individuals associate very different levels of utility or value with particular probabilities. A risk-averse individual might view a 20% gamble as a very unattractive option whereas a risk-preferring individual might be more inclined to accept the risk given the potential rewards. In spite of the problems of deriving both frequentist and subjective probabilities, Bayesian inference has been used to reason about the dependability of hardware [86, 294] and software systems
[497]. In particular, a growing number of researchers have begun to apply Bayesian networks as a means of representing probabilistic notions of causation. This approach is based on the concept of contingent probability that, as we have seen, can also arguably be used to provide insights into the likelihood of certain causes. Figure 11.18 presents a Bayesian network model for one aspect of the Nanticoke incident. Investigators initially observed horizontal soot patterns on top of valve covers 4, 5 and 6 and a shadowing on the aft surfaces of these structures. These observations indicate that the fire originated on the port side of the engine, forward of cylinder head number 1. These effects might have been caused by a fire fed from one of two potential sources. This is indicated in Figure 11.18 by the two arrows pointing into the node labelled `horizontal soot patterns...'. The arrows point from a cause towards the effect. The + symbols indicate that the cause makes the effect more likely. Conversely, a barrier might be labelled by a - symbol if it made an effect less likely. As can be seen, the two potential causes of the horizontal soot patterns include a lube oil leak from under the valve cover near cylinder head number 6. These effects might also have been caused by a fuel oil leak from one of the filters. Further investigations revealed that the valve covers were intact and in place after the fire. This increases the certainty that the fire started from a filter leak rather than a lube oil leak under the valve covers. Another way of looking at Figure 11.18 is to argue that leaks from either the filter or from lube oil are consistent with the horizontal soot patterns. Only a fuel oil leak from the filter is consistent with the valve covers being intact after the fire.
Figure 11.18: Bayesian Network Model for the Nanticoke Fuel Source

Before continuing to apply Bayesian networks to support our causal analysis of the Nanticoke incident, it is important to observe that some authors have argued that these diagrams must not be used as causal models. In contrast, they should only be used to model the manner in which information propagates between events. This caution stems from doubts over methods that enable investigators to move from conditional probabilities to their `inverse' likelihoods, mentioned in previous paragraphs [200]. This point of view also implies further constraints on the use of Bayesian networks. For instance, it is important not to model interfering actions within a network of information propagation. Jensen provides a more complete introduction to these potential pitfalls [398].
                 valve covers ok   ¬ valve covers ok
filter fire           1.0                 0.0
¬ filter fire         0.98                0.02
Table 11.12: Conditional Probabilities for the Bayesian Analysis of the Nanticoke Incident (1)

The quantitative analysis of Figure 11.18 begins with either a frequentist or subjective estimate of the likelihood of each cause. Recall that Pr(filter fire | C) = 0.02 and that Pr(lube fire | C) = 0.01. We can use these prior probabilities and the information contained in Figure 11.18 to derive the
conditional probabilities for Pr(valve covers ok | filter fire). These are shown in Table 11.12. If we know that there was a filter fire then it is certain that the valve covers would be intact; this represents a simplifying assumption that can be revised in subsequent analysis. If there was not a filter fire then there is a 0.98 probability that the valve covers would be intact but a 0.02 probability that they would not. The conditional probabilities shown in Table 11.12 are represented in matrix form throughout the remainder of this analysis. We can calculate the prior probability that the valve covers are intact using formula (11.15):
Pr(valve covers ok | filter fire) · Pr(filter fire) = Pr(valve covers ok ∧ filter fire)    (11.29)
The following calculation introduces the conditional probabilities in Table 11.12:
Pr(valve covers ok ∧ filter fire)

  = | 1.0 × 0.02    0.98 × 0.98 |    (11.30)
    | 0.0 × 0.02    0.02 × 0.98 |

  = | 0.02    0.9604 |    (11.31)
    | 0.0     0.0196 |
In order to derive the prior probability Pr(valve covers ok) from Pr(valve covers ok ∧ filter fire) we have to use a procedure called marginalisation. This is characterised as follows:
Pr(A) = Σ_B Pr(A, B)    (11.32)
This can be applied to the matrix in (11.31) to derive Pr(valve covers ok) = (0.9804, 0.0196). In other words, the prior probability that the valve covers are intact is just over 98%.
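The joint-probability and marginalisation steps in (11.29)-(11.32) are easily mechanised. The following is a minimal Python sketch using the illustrative priors and the conditionals from Table 11.12; as before, the figures are assumptions rather than validated estimates.

    # Conditional probabilities from Table 11.12: Pr(valve state | fire source).
    # Columns: filter fire, no filter fire. Rows: covers ok, covers not ok.
    conditional = [[1.0, 0.98],
                   [0.0, 0.02]]

    # Illustrative priors: Pr(filter fire) = 0.02, Pr(no filter fire) = 0.98.
    prior = [0.02, 0.98]

    # Joint distribution via (11.29): multiply each column by its prior.
    joint = [[conditional[row][col] * prior[col] for col in range(2)]
             for row in range(2)]

    # Marginalisation (11.32): sum out the fire-source variable in each row.
    marginal = [sum(row) for row in joint]

    print(marginal)  # ≈ [0.9804, 0.0196]: Pr(valve covers ok) just over 98%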
Jensen provides more details on both the theoretical underpinning and the practical application of Bayesian networks [398]. The key point is that the underlying calculus provides investigators with a sophisticated analytical toolkit that can be used to supplement the less formal reasoning supported by the Bayesian network illustrated in Figure 11.18. The calculus can be used to derive prior and contingent probabilities depending on the nature of the information that is provided. Unfortunately, as can be seen from the preceding example, the application of these techniques can be complicated even for specialists who have considerable expertise in Bayesian analysis. For this reason, most practical applications of the approach rely upon the support of automated tools such as Hugin [399]. The previous calculations also relied upon the adaptation of models that were first developed to support medical diagnosis. This introduces the possibility that errors may have been introduced into the calculations as a result of attempting to reformulate the models to yield particular insights into the Nanticoke case study. To summarise, the final two sections of this chapter have looked beyond the well-understood deterministic models of causation that have been embodied within incident and accident analysis techniques. The intention has been to determine whether investigators might benefit from recent developments in the theory and application of probabilistic models of causation. We have seen how this area promises many potential benefits. For example, the partition model and Bayesian approaches can deal with the uncertainty that characterises the initial stages of an investigation. The importance of this should not be underestimated. Given the increasing complexity and coupling of modern, safety-critical systems, it is inevitable that investigators will find it more and more difficult to determine a unique cause for many adverse incidents. The Rand report into the National Transportation Safety Board (NTSB) repeatedly points to the increasing length of time that must be spent before analysts can identify the causes of many recent failures [482]. It is difficult to assess the true potential of these techniques because they have not been widely applied to support the causal analysis of adverse occurrences. In their current form there is little chance that they will be accessible to many investigators. Tool support must be provided. Methods
and procedures must also be developed to help investigators learn how to apply these techniques without necessarily requiring a full understanding of the underlying theories that support the analysis. The use of Why-Because graphs as a central feature of WBA provides a useful prototype in this respect. The previous analysis has, however, identified several key issues that must be addressed before these more applied techniques will yield tangible benefits. In particular, there must be some means of assessing prior probabilities if investigators are to exploit Bayesian techniques for analysing causality through contingent probabilities. Dembski summarises this argument as follows: "Bayesian conceptions of probability invariably face the problem of how to assign prior probabilities. Only in special cases can prior probabilities be assigned with any degree of confidence (e.g., medical tests). So long as the priors remain suspect, so does any application of Bayes' theorem. On the other hand, when the priors are well-defined, Bayes' theorem works just fine, as does the Bayesian conception of probability. To sum up then, there is no magic bullet for assigning probabilities" [200]. There may not be any general-purpose magic bullet but the previous pages have, at least, identified two potential solutions that might work as a means of assigning priors within the specialist domain of incident investigation. Firstly, we have shown how subjective probabilities can be derived using the lottery-based procedures of Von Neumann and Morgenstern [626] or of March and Simon [513]. In general, these are difficult to apply because individual attitudes to risk make it difficult to interpret the expressed preferences that support inferences about subjective probabilities. We are not, however, dealing with a general population. Investigators are, typically, trained in the fundamentals of reliability and risk assessment. There is some prospect, therefore, that this method might yield better results than the more general studies of decision making under conditions of economic uncertainty. The second, perhaps obvious, point is that we are not attempting to assign prior probabilities in complete ignorance about the nature of previous failures. In many ways, the entire purpose of an incident reporting system is to provide precisely the sorts of quantitative information that are necessary in order to calculate the priors of Bayesian inference! It is paradoxical, therefore, to deny the usefulness of this data in a book that is devoted to the potential benefits of incident reporting. Unfortunately, as we have seen, we cannot trust the statistics that are extracted from national and international systems. Previous chapters have cited various estimates for the under-reporting of adverse occurrences. For instance, the Royal College of Anaesthetists estimates that only 30% of adverse medical incidents are voluntarily reported [715]; Barach and Small estimate that this figure lies somewhere between 50 and 95% [66]. Chapters 5 and 15 describe techniques that can be used to assess the extent of this problem. For example, workplace monitoring can be used to identify the proportion of adverse incidents that occur within a given time period in a representative team. The results of this analysis can then be compared with incident submission rates from a similar workgroup. This is not a panacea.
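As a simple illustration of this workplace-monitoring comparison, the sketch below estimates a reporting rate and uses it to rescale a raw report count; every figure here is invented for the purposes of the example.

    # Invented figures: a monitored team experiences 40 incidents in the
    # observation period, of which only 14 are formally reported.
    observed_incidents = 40
    reported_incidents = 14
    reporting_rate = reported_incidents / observed_incidents  # 0.35

    # Raw count from the reporting system for a comparable workgroup.
    reported_filter_fires = 7

    # Adjusted estimate of the true number of filter fires, assuming the
    # monitored team's reporting rate transfers to this workgroup.
    estimated_filter_fires = reported_filter_fires / reporting_rate
    print(round(estimated_filter_fires))  # 20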
Even if we can assess the contribution rate within a reporting system, there is still no guarantee that we can trust the data that has been gathered about an incident. Consider the Nanticoke case study: if we wanted to gather data about the prior probability of fuel from a filter being involved in a fire, we would have to be sure that previous incidents were correctly analysed and indexed to indicate that this had indeed been a factor in those incidents. The reliability of data about prior probabilities would be compromised if other investigators incorrectly diagnosed an incident as a filter fire when it was not. The data would also yield incorrect priors if investigators failed to diagnose this fuel source when it had contributed to an incident. Chapter 15 describes a statistical technique that can be used to identify and correct for these potential biases. For now it is sufficient to observe that this approach will only work if investigators have already performed a causal analysis of previous incidents. From this it follows that the application of Bayesian techniques may ultimately depend upon, and support, the use of more deterministic analysis.
11.4 Comparisons

Previous sections have reviewed a number of different techniques that can be used to support the causal analysis of safety-critical incidents. The diversity of these techniques makes it important that
investigators and their managers have some means of assessing the support offered by these different approaches. Unfortunately, a range of practical, theoretical and also ethical issues complicate any attempt to perform comparative evaluations of causal analysis techniques:
the costs of learning new techniques. Considerable training is required before investigators can apply some of the causal analysis techniques that we have considered. A significant level of investment would be needed to sponsor the evaluation of mathematical approaches unless investigators already had an appropriate training in the use of logic or of statistical reasoning. Similarly, it is easy to underestimate the problems associated with the independent application of Tier Analysis. Previous sections have emphasised the political and social pressures that affect the attribution of root causes to different levels within complex commercial organisations. Any investment in the evaluation of these techniques would carry the significant risk that they might not benefit the sponsoring organisation.

the costs of applying new techniques. The investment that is required in order to train investigators to use particular analysis techniques must be supplemented to meet the costs associated with applying those techniques. This book has argued that computer-controlled automation supports increasingly complex application processes [675]. At the same time, incident investigations have been further complicated by the increasing appreciation that organisational, technical and human factors contribute to the causes of many `failures'. These two influences have complicated the tasks associated with incident investigation. They are taking longer to complete and increasingly require the participation of multidisciplinary teams of investigators [482]. These increasing costs have not, to date, justified the allocation of resources to determine whether certain causal analysis techniques help to control the overall expenditure on incident investigations.

practice effects and the problems of fatigue. Empirical test-retest procedures provide a means of reducing the costs associated with the `live' use of analysis techniques within multidisciplinary investigation teams. Investigators are presented with an example of a causal analysis technique being applied to a particular case study incident. The relative merits of that particular technique are assessed by asking investigators to answer comprehension questions, to complete attitude statements about the perceived merits of the approach and by timing investigators during these various tasks. The same procedure is then, typically, repeated for a number of further causal analysis techniques after a short break. This creates several experimental problems. For example, investigators can use the insights that were gathered from the first analysis technique to help answer questions about the second. One would, therefore, expect that the quality of the analysis might improve. On the other hand, investigators will also suffer from increasing fatigue as the evaluation proceeds. This, in turn, will impair their performance. These practice and fatigue effects can be addressed by counter-balancing. Different analysis techniques are applied in a different order by different investigators. One group might be presented with a STEP analysis and then a MORT analysis. This order would be reversed for another group. Such studies do not, however, provide any insights into the application of particular techniques over prolonged periods of time.

the problems of assessing learning effects. The test-retest procedures, described in the previous paragraph, do not provide information about the long-term support that may be provided by a causal analysis technique. These studies also often yield subjective results that are strongly in favour of techniques that are similar to those with which investigators are already familiar. These potential biases create many problems. For instance, the results of a test-retest validation may simply indicate `superficial' preferences based on a brief exposure to a relatively simple case study. These results may not be replicated if investigators actually had to apply a technique during a `live' investigation. For example, we have described the results of an evaluation conducted using off-shore oil workers in which techniques that achieved the lowest subjective satisfaction ratings also yielded the highest comprehension and analysis scores [403]! Similarly, innovative techniques can often be undervalued if they provide significant long-term benefits that are not readily apparent during a cursory inspection.
the difficulty of finding `realistic' examples. Test-retest techniques reduce the costs associated with the validation of causal analysis techniques. The investment associated with training investigators is avoided because they, typically, are not required to apply the techniques themselves. The costs associated with applying the technique are, therefore, also avoided. Investigators are only committed to an initial assessment of existing case studies. This raises further concerns. In particular, the choice of case study may influence the investigators' responses. This is a significant issue because, as we have seen, techniques that focus on the managerial and organisational causes of failure may provide few insights into the failure of technical barriers. The test-retest procedure must, therefore, be replicated with several different case studies to provide a representative sample of the potential incidents that must be addressed. This, in turn, raises concerns that the individual preparing the case studies may also introduce potential biases that reflect their own experience in applying particular techniques. Some of these problems are addressed by accepting the costs associated with longitudinal studies of investigators applying causal analysis techniques. Given that high-consequence incidents will be rare events, even this approach provides no guarantee that investigators will meet a sufficient range of potential failures.
the difficulty of ensuring the participation of investigators. Many of the previous problems relate to the difficulty of identifying an appropriate experimental procedure that can be used to support comparisons between causal analysis techniques. These issues often play a secondary role to the practical difficulties that are associated with ensuring the `enthusiastic' participation of investigators in these studies. As we have seen, investigatory `methodologies' are often intended to improve the accuracy of investigations by imposing standard techniques [73]. They constrain an individual's actions in response to a particular incident. It is, therefore, essential to encourage the support and participation of investigators in the evaluation process. Any technique that under-values the existing skill and expertise of investigation teams is unlikely to be accepted. Similarly, the techniques that are being assessed must be adequately supported by necessary training material that is pitched at a level that can be understood by its potential users. Above all, the comparative evaluation of a causal analysis technique must not be misinterpreted as a comparative evaluation of incident investigators.
the ethical issues that stem from studying the causal analysis of incidents. We have been involved in several studies that have performed empirical comparisons of different causal analysis techniques. These evaluations often raise a host of ethical questions for the organisations that are involved. If new techniques are introduced for a trial period then many industries require that these approaches should be at least as `good' as existing approaches. This creates an impasse because such reassurances cannot be offered until after the evaluation has been conducted. This often forces investigators to continue to apply existing techniques at the same time as a more innovative technique is being trialled. At first sight, this replicated approach seems to offer many benefits. Investigators can compare the results that are obtained from each technique. It can, however, lead to more complex ethical issues. For instance, the application of novel causal analysis techniques can help to identify causal factors that had not previously been considered. In extreme cases, it may directly contradict the findings of the existing technique. Under such circumstances, it can be difficult to ignore the insights provided by the new approach when the consequences might be to jeopardise the future safety of an application process.

The following pages build on this analysis. They provide a brief summary of several notable attempts that have been made to evaluate the utility of causal analysis techniques. As will be seen, the individuals and groups who have conducted these pioneering studies often describe them as `first steps' or `approximations' to more sustained validation exercises.
11.4.1 Bottom-Up Case Studies

Different causal analysis techniques offer different levels of support for the analysis of different causal factors. For instance, MORT provides considerable support for the analysis of managerial and
organisational failure. In contrast, early versions of this technique arguably lack the necessary guidance for the technical analysis of hardware and software failures. ECF analysis, in contrast, lacks any causal focus and, therefore, offers a broader scope. We have seen, however, that investigators must recruit supplementary tier analysis and non-compliance analysis to focus on particular human factors, managerial and organisational causes of an incident. It is important to emphasise that the scope of causal analysis techniques is not static. Van Vuuren perceives a cycle in different industries [844]. A focus on individual blame and on isolated forms of equipment failure leads on to a focus on the organisational causes of incidents. This change in focus has altered the `status quo' of safety-related research and led to a number of innovative tools for causal analysis, including Tripod and PRISMA. Unfortunately, there has been a tendency for some organisations to accept that organisational failure is the end point in this process. In this Whig interpretation, accident and incident investigation has culminated in an acceptance of `systemic' failure as the primary cause of adverse incidents. Causal analysis techniques that identify the organisational precursors to systemic failures must, therefore, be chosen over those that focus more narrowly on the technical and human factors causes of incidents and accidents. This argument raises a number of concerns. Firstly, it is unlikely that our view of incidents and accidents will remain unchanged over the next decade. The increasing development of incident reporting systems is likely to provide us with access to failure data on a scale that has never before been possible. In particular, the computer-based tools that are described in Chapter 15 already enable investigators to search through millions of reports to identify trends and causal factors that were not anticipated from the exhaustive, manual analysis of local data sources [410]. The current focus on organisational and managerial issues may, therefore, be superseded as we learn more about the causes of failure. Secondly, the focus on organisational issues is not an end in itself. We know remarkably little about the organisational and managerial causes of failure [444, 701, 839]. From this it follows that current techniques that specifically address these issues may actually fail to identify important causal factors. Indeed, many of this new generation of techniques have been attacked as premature. Researchers have pointed to particular theoretical weaknesses that are perceived to create practical problems for the investigators who must apply them: "The distinction between active and latent failure is the most important one in order to understand the difference in impact of different kinds of human failure. However, in his discussion Reason only focuses on the human contributions at different stages during accident causation, without providing insight into whether these human contributions result in technical, human or organisational problems. The eleven General Failure Types that are listed for Tripod are... a combination of technical, human and organisational factors, and are also a combination of causes/symptoms and root causes. For example, hardware problems are likely to be caused by incorrect design and the category organisation refers to failures that can cause problems in communication, goal setting, etc. This might be acceptable for an audit tool, however, it is not for incident analysis.
Although claiming to focus on management decisions, no definition of management or organisational failure is provided. The lack of knowledge of how to model organisational failure in the area of safety-related research states the importance of a bottom-up approach, using empirical incident data as a main input for new models and theories to be developed." [844] These criticisms undervalue the pioneering role that Tripod played in re-focusing attention on the managerial and organisational factors that compromise barriers and create the context for latent failures. Van Vuuren does, however, make an important point when he urges that any evaluation of incident investigation techniques should be based on empirical data, derived from bottom-up investigations. He exploited this approach to assess the utility of the PRISMA technique. A series of case studies was conducted to demonstrate that this approach might support the causal analysis of incidents in a wide range of different domains. He also sought to validate PRISMA by applying it to different case studies within the same domain. For instance, he developed variants of the Eindhoven Classification Model to analyse incidents reported in the steel industry. He began by looking at coke production. Coke is a solid substance that remains after gases have been extracted from coal and is
primarily used as fuel for blast furnaces. The company that he studied had an annual production of approximately five million tons of pig-iron. This required more than two million tons of coke from two different plants. His study focussed on one of these plants, which employed 300 people in a `traditional hierarchical organisation'. His study of fifty-two incidents revealed the distribution of causal factors illustrated in Table 11.13. The coke plant lies at the beginning of the steel-making process. It provides fuel for the blast furnaces that produce pig-iron. He, therefore, conducted a second case study involving a plant that transformed pig-iron from the blast furnaces into steel. Table 11.14 presents the causal classification that was obtained for twenty-six incidents that were analysed using PRISMA in this second case study.
                    Organisational  Technical  Human  Unclassifiable  Total
No. of root causes       111           67       126        13          317
Percentage               35%           21%      40%         4%         100%

Table 11.13: Distribution of root causes in Coke Production [844]
                    Organisational  Technical  Human  Unclassifiable  Total
No. of root causes        73           46        57         5          181
Percentage               40%           25%      32%         3%         100%
Table 11.14: Distribution of root causes in Steel Production [844]

As mentioned, Van Vuuren was anxious to determine whether PRISMA could be successfully applied to a range of different domains. He, therefore, studied the application of the technique within both an Accident and Emergency and an Anaesthesia department. These different areas within the same healthcare organisation raised different issues in the application of a causal analysis technique. The work of the Accident and Emergency department fluctuated from hour to hour and was mainly staffed by junior doctors. In contrast, the Anaesthesia department provided well-planned and highly technical working conditions. It was mainly run by experienced anaesthetists. The insights gained from applying PRISMA within these institutions were also compared with those from its application in an institution for the care of the mentally ill. This institution had experienced nine incidents over a twelve-month period that resulted in the death of eight of its residents and one near miss in which the resident involved was barely saved from drowning in the swimming pool at the institution. The direct causes of death varied between three cases of asphyxiation, three traffic accidents outside the main location of the institution and two drownings while taking a bath. The results of the causal analysis are summarised in Table 11.15.
                    Organisational  Technical  Human  Patient related  Unclassifiable  Total
No. of root causes        29            3        24         11               4          71
Percentage               41%            4%      34%         15%              6%        100%
Table 11.15: Distribution of root causes in Mental Health Study [844]

Van Vuuren's work is important because it illustrates the use of a bottom-up approach to the validation of causal analysis techniques [844]. He provides direct, first-hand insights into the strengths and weaknesses of the PRISMA approach in a range of different application domains. This approach can be contrasted with the highly-theoretical comparisons that have been made by the proponents
of other techniques. Unfortunately, Van Vuuren's results cannot easily be applied to guide any decision between the different techniques that have been introduced in previous paragraphs. We simply lack the necessary data to make such a comparison. Techniques such as MORT have been widely applied in a range of different industries but there have been few recent attempts to systematically collate and publish the experience gained from the application of this approach. Other techniques, such as WBA and the statistical partition approaches, are relatively new and have only been validated against a small number of incidents and accidents. Van Vuuren's approach is also limited as a basis for comparisons between causal analysis techniques. He was involved in the analysis of the case studies. It can, therefore, be difficult to assess how important his interventions were in the adoption and adaptation of the PRISMA technique. It must also be recognised that the case studies were not simply intended to provide insights into the relative utility of this approach compared to other causal analysis techniques. As can be seen, the results in Tables 11.13, 11.14 and 11.15 provide no insights into how easy or difficult it was to apply PRISMA. Nor do they suggest that the findings of one investigation would be consistent with those of a previous study of the same incident. Van Vuuren was not primarily interested in the criteria that make one causal analysis technique `better' than another. The primary motive was to learn more about the nature of organisational failure in several different industries. In contrast, Benner has applied a set of requirements to assess the utility of a wide range of investigatory methodologies.
11.4.2 Top-Down Criteria

The previous paragraphs have illustrated the diverse range of causal analysis techniques that might be recruited to support incident investigation. This diversity is also reflected within investigatory organisations. Benner conducted a pioneering study into the practices of seventeen US Federal agencies: Consumer Product Safety Commission; Department of Agriculture; Department of the Air Force; Department of the Army; Department of Energy; Department of Labour; Mine Safety and Health Administration - Department of Labour; Occupational Safety and Health Administration (OSHA); Coast Guard; Department of Transportation; Federal Highways Administration - Department of Transportation; General Services Administration; Library of Congress; NASA; National Institute of Occupational Safety and Health; NTSB; Navy Department; Nuclear Regulatory Commission; National Materials Advisory Board - Panel on Grain Elevator Explosions [73]. He identified fourteen different accident models: the event process model; the energy flow process model; fault tree model; Haddon matrix model; all-cause model; mathematical models; abnormality models; personal models; epidemiological models; pentagon explosion model; stochastic variable model; violations model; single event and cause factors; and a chain of events model. The term `accident model' was used to refer to "the perceived nature of the accident phenomenon". Benner reports that these models were often implicit within the policies and publications of the organisations that he studied. He, therefore, had to exploit a broad range of analytical techniques to identify the investigators' and managers' views about the nature of accidents and incidents. Benner's study also revealed that these different models supported seventeen different investigation methodologies: event analysis; MORT; Fault Tree Analysis; NTSB board and inter-organisational study groups; Gantt charting; inter-organisational multidisciplinary groups; personal judgement; investigator board with intra-organisational groups; Baker police systems; epidemiological techniques; Kipling's what, when, who, where, why and how; statistical data gathering; compliance inspection; closed-end flowcharts; find chain of events; fact-finding and legal approach; and `complete the forms'. The term `accident methodology' refers to "the system of concepts, principles and procedures for investigating accidents" [73]. Benner's findings have a number of important consequences. He argues that the choice of accident methodology may determine an organisation's accident model. For instance, the application of the MORT technique would naturally lead to a focus on managerial issues. The use of Gantt charts would, similarly, suggest an accident model that centres on processes and events. Benner also observed the opposite effect; accident models can predispose organisations towards particular methodologies. An enthusiasm for epidemiological models leads to the development and application of an epidemiological methodology. He also identifies a third scenario in which an analysis method determines the accident model and investigation methodology but neither the model nor
the investigatory methodology particularly influences the other. One interpretation of this might be situations in which organisations enthusiastically impose analytical techniques upon their investigators without considering whether those techniques are widely accepted as being consistent with the investigators' perception of an accident or incident. A number of objections can be raised both to Benner's approach and to his analysis. For example, he used interviews to extract implicit views about models and methodologies. The findings of these meetings were supported by an analysis of documents and statutes. Previous sections in this book have enumerated the many different biases that can make it difficult to interpret these forms of evidence. Conversely, the distinction between model and methodology can become very blurred. The relatively broad definition of the term `methodology' seems to imply that it contains elements of an accident model. The relationship between these two concepts is discussed but it is not the focus of Benner's work [73]. His investigation looks beyond the causal analysis that is the focus of this chapter; however, this work does identify a number of general problems: "Little guidance exists in the accident investigation field to help managers or investigators choose the best available accident models and accident investigation methodologies for their investigation... No comprehensive lists of choices, criteria for the evaluation or selection, or measures of performance (have) emerged to help accident investigators or managers choose the `best' accident model and investigative methodology." [73] In order to address this problem, Benner proposed a series of criteria that might be used to guide a comparative evaluation of both accident models and their associated methodologies. A three-point rating scheme was applied in which 0 was awarded if the model/methodology was not likely to satisfy the criterion because of some inherent shortcoming, 1 was awarded if the model/methodology could satisfy the criterion with some modification and 2 indicated that the model/methodology was likely to satisfy the criterion. Benner applied this scheme without any weightings to differentiate the relative importance of different criteria. He also acknowledges that the procedure was flawed: "undoubtedly, ratings contained some author bias". The contribution of this work, arguably, rests on the criteria that helped to guide his evaluation of accident models and methodologies. The following list summarises Benner's conditions for models that reflect the perceived nature of accident phenomena. As will be seen, these criteria cannot be directly applied to assess the relative merits of causal analysis techniques. Most of the requirements support reconstructive modelling and simulation. Benner's methodological requirements have greater relevance for the content of this chapter. The model criteria are presented here, however, for the sake of completeness. This also provides an overview of Benner's more general comparison of investigatory techniques. It should be noted that we have redrafted some of the following criteria to reflect our focus on incident analysis rather than accident investigations:

1. realistic. This criterion focuses on the expressiveness of an incident model. Benner argues that it must capture the sequential and concurrent aspects of an adverse occurrence. It must also capture the `risk-taking' nature of work processes.

2. definitive.
Any model must describe the information sources that must be safe-guarded and examined in the aftermath of an incident. Ideally, the model must be composed from `definitive descriptive building blocks' that enable investigators to set more focussed objectives during the investigatory process.

3. satisfying. The model must fit well with the investigatory agency's wider objectives, including any statutory obligations. It should not compromise the agency's `credibility' or the technical quality of its work.

4. comprehensive. The model must capture both the initiating events and the consequences of an incident. It must capture all significant events. It must avoid ambiguity or gaps in understanding.

5. disciplining. The incident model must provide a rigorous framework that both directs and helps to synchronise the activities of individual investigators. It should also provide a structure for the validation of their work.
6. consistent. This criterion urges that the incident model must be `theoretically consistent' with the investigatory agency's safety programme.

7. direct. The model must help investigators to identify corrective actions that can be applied in a prompt and effective manner. It should not be necessary to construct lengthy narrative histories before an immediate response can be coordinated.

8. functional. Accident models must be linked to the performance of worker tasks and to workflows. They should enable others to see how the performance of these tasks contributed to, mitigated or exacerbated the consequences of an incident.

9. noncausal. "Models must be free of incident cause or causal factor concepts, addressing instead full descriptions of incident phenomenon, showing interactions among all parties and things rather than oversimplification; models must avoid technically unsupportable fault-finding and placing of blame" [73].

10. visible. Models must help people to see relevant aspects of an incident. This should include interactions between individuals and systems. These representations must be accessible to investigators and to members of the general public who may themselves be `victims' of an incident.

These criteria illustrate the way in which Benner's accident or incident models can be seen as models or templates for the incident reconstructions that have been described, for instance, in Chapters 8 and 10. The recommendation that models must capture the `initiating events and the consequences of an incident' was a recurring theme of the earlier sections in this book. There are, however, some important differences between the approach developed in his paper and the perspective adopted in this book. For instance, Benner's criteria intend that accident models should be `noncausal'. In contrast, we have argued that investigators cannot avoid forming initial hypotheses about the causes of an incident during the early stages of an investigation. Investigators must be encouraged to revise these working hypotheses as their work develops [850]. Benner's concept of an accident or incident model helps to determine what investigators consider to be relevant when analysing a particular `failure'. In consequence, although his model requirements focus on primary and secondary investigations, they indirectly determine the information that will be available to any causal analysis. In addition to the model criteria, listed above, Benner proposes the following list of methodological requirements:

1. encouragement. This criterion argues that methodologies must encourage the participation of the different parties affected by an investigation. Individual views must also be recognised and protected within such an approach.

2. independence. It is also important that methodologies should avoid `blame'. The role of management, supervisors and employees must be recognised within the methodology.

3. initiatives. Personal initiatives must also be supported. The methodology should produce evidence about previous failures that promotes intervention and shows what is needed to control future risks in the workplace.

4. discovery. Methodologies must support the timely discovery of information about an incident. It should also be clear when the discovery of such information may be delayed until a credible sample has been established or until "causality requirements are met" [73].

5. competence. This criterion argues that methodologies must leverage employees' competence. For example, they should be supported by training.
This, in turn, should support the detection, diagnosis, control and mitigation of risk.

6. standards. Methodologies must provide credible and persuasive evidence for setting or reinforcing safety standards. They must also enable investigators to document and monitor the effectiveness of those standards over time.
7. enforcement. This criterion is intended to help ensure that a methodology can be used to identify potential violations. The methodology must explore deviations from expected norms; compliance problems must be identified.

8. regional responsibility. In the US context, methodologies must help individual States to ensure that incident reports provide consistent and reliable accounts of accidents and incidents. More generally, methodologies must identify the role that regional organisations can play in identifying safety objectives from individual reports.

9. accuracy. Methodologies must validate the products of an investigatory process. They must assess the technical `completeness, validity, logic and relevance' of these outputs.

10. closed loop. Methodologies must close the loop with design practices. Previous risk assessments often contain information about anticipated failure modes. These can be assessed against what actually did happen during an incident. In turn, future safety assessments can be informed by the results of previous investigations.

Benner identified the personal bias that affected his analysis. The detailed scores from his investigation are, therefore, less interesting than the criteria that drove the evaluation. In passing, it is worth noting that models which tended to represent accidents as processes were rated most highly according to the criteria listed above. These included the event process model and the energy flow process model. Elements of both of these techniques were incorporated into Benner's work on P-Theory and the STEP method mentioned previously. MORT was also ranked in the top three places according to these criteria. Similar findings were reported for the methodologies that were examined. Events analysis was rated most highly. MORT was ranked in second place, assuming that it incorporated the ECF extensions described in Chapter 10 [430]. Many of the techniques that were assessed by Benner continue to have a profound impact upon existing investigatory practice. For instance, MORT is still advocated as a primary analysis technique by the US Department of Energy almost two decades after it was originally developed. Many aspects of accident and incident investigation have, however, changed since Benner first published his analysis. In particular, there has been a growing recognition of the organisational causes of adverse occurrences [701]. Software-related failures also play a more significant role in many investigations [411]. The following paragraphs, therefore, use Benner's criteria to structure an evaluation of the causal analysis techniques that have been presented in previous pages.
Encouragement This criterion was intended to ensure that methodologies encourage participation in an investigation. It is difficult, however, for many people to follow the detailed application of some of the causal analysis techniques that have been presented in this chapter. This might act as a disincentive to participation during certain stages of a causal analysis. For instance, it can be hard to follow some of the statistical techniques if people are unfamiliar with their mathematical underpinnings. Similarly, the Explanatory Logic that supports WBA can be difficult to communicate to those without a training in formal logic. Fortunately, the proponents of mathematical techniques for causal analysis have recognised these objections. WBA is supported by a diagrammatic form that provides important benefits for the validation of any proof. Similarly, individuals can participate in the application of Bayesian techniques, for instance through the procedures of subjective risk assessment, without understanding all of the formulae that an investigator may employ during the more formal aspects of the analysis. These communication issues also reveal tensions within Benner's criteria. Mathematically-based techniques typically benefit from well-defined syntax and semantics. They provide proof rules that offer objective means of establishing the completeness and consistency of an analysis. These strengths support the accuracy criterion, assessed below, but are achieved at the expense of potentially discouraging some participation in the application of the analysis techniques. Conversely, the accessibility of Tripod, of ECF, MES and of STEP all encourage wider participation in the analysis.
There is, however, a strong subjective component to the forms of analysis that are supported by these techniques. There is also a danger that participation without strong managerial control can compromise the findings of any analysis.
Independence This criterion is intended to ensure that any methodology addresses the `full scope' of an incident. It should consider the role of management, supervisors and employees without connotations of guilt or blame. Some techniques, including Tier analysis and MORT, provide explicit encouragement for investigators to consider these different aspects of an incident. As we have seen, however, there is a strong contrast between causal analysis techniques that offer checklist support and those that expect investigators to scope their analysis. It would be perfectly possible to apply Tier analysis to an incident but for the analyst to overlook particular sections of a management structure. In contrast, the MORT diagram provides explicit guidance on the roles that contribute to an incident or accident. The weakness of this approach is that analysts can find it difficult to match the details of a particular situation to the pre-defined scope of a checklist-based approach. This criterion introduces further tensions between Benner's requirements. For example, techniques that encourage an independent assessment of the various contributions to an incident or accident can also lead to the tensions and conflict that discourage widespread participation. Many analysis techniques almost inevitably create conflict as a by-product of their application within the investigatory process. For instance, there are many reasons why non-compliance analysis creates resentment. It gains its analytical purchase from the observation that many incidents and accidents stem from individual and team-based `failures' to follow recognised standards. Chapter 3 has also explained how observations in the health care and aviation domains have shown that operators routinely violate the myriad of minor regulations that govern their working lives. These violations do not have any apparent adverse consequences and often help individuals to optimise their behaviour to particular aspects of their working environments. In extreme examples, they may even be necessary to preserve the safety of an application process. Non-compliance analysis should reveal these violations. Investigators must, therefore, be careful to pitch their recommendations at a level that is intended to achieve the greatest safety improvements without unnecessarily alienating other participants in the investigatory process. Other causal analysis techniques raise similar issues. For example, Tier analysis is unlikely to promote harmony. Investigators successively associate root causes with higher and higher levels within an organisation. As mentioned, this encourages the independent analysis of the many different parties who can contribute to an adverse occurrence. However, it can lead to strong feelings of guilt, blame and anxiety as root causes are successively associated with particular levels in a management structure.
Initiatives This criterion is intended to ensure that any accident or incident methodology provides adequate evidence to encourage the focussed actions that address risks in a specific workplace. The previous sections in this book have not explicitly considered the means by which such recommendations for action can be derived from the products of any causal analysis. This is the central topic of Chapter 12. We have, however, considered how some analysis techniques can be used to derive particular recommendations. For instance, our analysis of PRISMA included a description of Classification/Action Matrices. These enable investigators simply to `read off' an associated action from a table once the causes of an incident have been determined by previous stages in the analysis; a sketch of this lookup follows below. MORT offers similar support. Table 11.7 presented the `Stage 2' analysis form proposed by the US Department of Energy. This encourages analysts to enumerate the different ways in which an incident can be explained by the particular failures that appear in the branches of a MORT diagram. The frequency with which specific items in the why branch are identified helps to establish priorities for action. These, in turn, help to justify the initiatives and interventions that are recommended by Benner's criteria. As we shall see in Chapter 15, similar summaries can be derived by collating the causal analysis of several incidents.
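The `read off' style of support that a Classification/Action Matrix provides can be pictured as a simple lookup table. The following Python sketch is illustrative only: the category codes follow the Eindhoven Classification Model labels used later in this chapter, but the particular actions paired with each code are hypothetical rather than those published for PRISMA.

```python
# Hypothetical rows: PRISMA's Classification/Action Matrices pair
# classification codes with management actions, but these particular
# pairings are invented for illustration.
ACTION_MATRIX = {
    "OP":  "Review and redraft the relevant operating procedures",
    "MP":  "Raise production/safety conflicts with senior management",
    "HR5": "Improve task planning through briefing and supervision",
}

def recommended_action(root_cause_code):
    """Read off the action that the matrix associates with a classified cause."""
    return ACTION_MATRIX.get(root_cause_code, "No predefined action: refer for review")

print(recommended_action("MP"))  # the action follows mechanically from the code
```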
As before, it is possible to highlight differences between the checklist and `free form' approaches. Techniques such as PRISMA encourage a consistent approach to the recommendations and initiatives that are motivated by a causal analysis. Investigators have limited scope to alter the actions that are recommended by particular cells within a Classification/Action Matrix. Conversely, less structured techniques enable investigators to tailor their response to the particular circumstances of an incident. The consequence of this is that, without additional methodological support, it is likely that different investigators will initiate different responses to very similar incidents.
Discovery This criterion requires that an incident methodology should encourage a `timely discovery process'. We have shown in previous paragraphs that there are tensions between Benner's criteria; for example, encouragement can conflict with accuracy. This criterion illustrates a form of internal conflict. For example, checklist approaches are likely to provide relatively rapid insights because the items that they present can help to structure a causal analysis. In contrast, techniques such as ECF analysis or Bayesian approaches to statistical causality are likely to take considerably longer given that investigators lack this guidance. On the other hand, the application of `raw' checklists is unlikely to yield entirely innovative insights. Investigators will be guided by the items that have already been identified by the author of the checklist. Free-form techniques arguably place fewer constraints on the discovery process. As mentioned, these methodology criteria were originally intended to support a comparison of accident investigation techniques. The causal analysis of safety-critical incidents creates new challenges. For instance, the statistical analysis of a body of incident reports can be used to yield new insights that might not emerge from the study of individual mishaps. The techniques that we have summarised in this chapter each pose different problems for this form of discovery through data mining. For instance, the subjective nature of ECF analysis can make it very difficult to ensure that different investigatory teams will identify the same root causes for the same incident. This creates problems because any attempt to retrieve incidents with similar root causes will miss those records that have been mis-classified; the sketch below illustrates the point. It will also yield reports that are potentially irrelevant from the perspective of the person issuing the query if they cannot follow the justification for a particular classification. In contrast, PRISMA's use of the Eindhoven Classification Model is intended to reduce inter-analyst variation by providing a high-level process to guide the allocation of particular categories of causal factors. Problems stem from the use of causal trees prior to the use of the classification model. Subtle changes to the structure of the tree will have a significant impact on the number and nature of the root causes that are identified for each incident. This, in turn, can have a profound impact on the discovery of causal factors through the analysis of incident databases. The argument in the previous paragraph assumes that causal analysis techniques have a measurable impact upon both the speed of an investigation and the likelihood that any investigation will yield new discoveries. As we shall see, some initial evaluations have shown that the investigators' background has a more profound impact upon these issues than their application of a particular technique. The discovery of particular causes can be determined by the investigator's familiarity with the corresponding, more general, classes of cause. Individuals with human factors expertise are more likely to diagnose human factors causes [484].
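The retrieval problem created by mis-classification can be illustrated in a few lines of Python. The records and codes below are hypothetical; the point is simply that a query over assigned root causes is blind to records whose analysts chose a different classification for the same underlying failure.

```python
# Hypothetical incident records; 'root_causes' holds classification codes.
INCIDENTS = [
    {"id": "INC-01", "root_causes": {"MP", "HR5"}},
    {"id": "INC-02", "root_causes": {"HR4", "HR5"}},
    {"id": "INC-03", "root_causes": {"OP"}},  # same failure, classified differently
]

def incidents_with_cause(records, code):
    """Retrieve the incidents whose causal analysis assigned the given code."""
    return [record["id"] for record in records if code in record["root_causes"]]

# A query for planning-related failures (HR5) silently misses INC-03.
print(incidents_with_cause(INCIDENTS, "HR5"))  # ['INC-01', 'INC-02']
```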
Competence The competence criterion requires that any methodology must help employees to increase the effectiveness of their work. This implies that they must be able to use a causal analysis technique in a cost-effective manner. Appropriate training must enable individuals to detect, diagnose, control and ameliorate potential risks. This criterion has clear implications for the more mathematical techniques that we have examined. The Explanatory Logic of WBA will only deliver the intended benefits of precision and rigour if those who apply it are correctly trained in its many different features. The partition approach to probabilistic causation provides a more pathological example of this. It is unclear precisely what aspects of the existing theories might actually be recruited to support incident investigation.
The development of appropriate training courses, therefore, lies in the future. Conversely, these approaches offer means of objectively assessing the competence of individuals in the application of a causal analysis technique. Mathematical rules govern the use of statistical data and the steps of formal proofs. Tests can be devised to determine whether or not individuals understand and can apply these mathematical concepts. Such evaluations are more difficult to devise for less formal approaches where the rules that govern `correct' transformations are less clearly defined. Techniques such as MORT and Tripod have already been widely adopted by commercial and industrial organisations [430, 701]. Training courses and commercial software can also be recruited to improve employee competence in the application of these techniques. PRISMA arguably rests halfway between these more commercial techniques and the more novel mathematical approaches to causal analysis. As we have seen, there is limited evidence that this approach can be used in a range of different contexts within several industries. There is, as yet, relatively little guidance on how investigators might be trained to exploit this approach. It is important to emphasise that the lack of training materials does not represent a fundamental objection to any of the techniques that have been considered in this book. Our experience in training investigators to conduct various forms of causal analysis has shown that most organisations tend to develop their own training materials. It is important that any causal analysis technique supports both the organisation's priorities and the reporting obligations that are imposed upon it by statutory and regulatory bodies.
Standards The standards criterion requires that incident methodologies must enable investigators to identify potential flaws in their work. They must also provide a comprehensive, credible and persuasive basis for the advocacy of appropriate interventions. This criterion served as a prerequisite for the inclusion of causal analysis techniques in this book. It is possible, however, to identify a number of distinct approaches that are intended to improve the standard of incident investigation within the different techniques that we have analysed. Arguably the weakest claim that is made for causal analysis techniques is that they provide intermediate representations, typically figures and graphs, that can be exposed to peer review during the investigatory process. ECF charts, non-compliance tables and MES flowcharts all help to explicitly represent stages of an analysis. This can help to identify potential contradictions or inconsistencies that might otherwise have been masked by the implicit assumptions that are often made by different members of an investigatory team. Many of the techniques that we have studied also provide particular heuristics or rules of thumb that provide a basis for more complex forms of analysis. MES, STEP, ECF, MORT and WBA all exploit variants of counterfactual reasoning. This approach offers considerable benefits, not simply because it encourages a consistent approach to the identification of causal factors. Counterfactual reasoning also provides investigators with a common means of identifying potential counter-measures. Recall that the counterfactual question can be expressed as `A is a necessary causal factor of B if and only if it is the case that if A had not occurred then B would not have occurred either'. We might, therefore, consider preventing an incident, B, by devising means of avoiding A; the sketch below illustrates this screening test. As we have seen, however, causal asymmetry implies that investigators must be circumspect in exploiting techniques which advocate this style of analysis. Further questions arise both from the cognitive problems of reliably applying counterfactual reasoning [124] and from the practical problems of validating hypothetical reasoning about the failure of barriers, described in Chapter 10. This criterion also urges that causal analysis techniques should be assessed to determine whether they provide a comprehensive, credible and persuasive basis for the advocacy of appropriate interventions. The interpretation of a `comprehensive' approach depends upon the scope of the techniques; this was addressed in the previous section. It is more difficult to assess the credibility and persuasiveness of an approach. We are unaware of any studies that have directly addressed this issue as part of an evaluation of causal analysis techniques. Similar studies in other domains have, however, indicated that the application of a particular method may be less important than the identity of the individual or group who apply the technique [278].
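The counterfactual test lends itself to a direct, if idealised, implementation. The sketch below assumes that the analyst can supply a model of the incident as a predicate over sets of events; the function and event names are hypothetical, and the example deliberately ignores the problems of causal asymmetry and hypothetical reasoning noted above.

```python
def necessary_causal_factors(events, outcome_occurs):
    """Counterfactual screening: an event is a necessary causal factor if,
    had it not occurred, the outcome would not have occurred either.

    outcome_occurs is an analyst-supplied model (hypothetical name) that
    reports whether the outcome still happens under a given set of events.
    Assumes the outcome actually occurred with the full event set.
    """
    return [event for event in events
            if not outcome_occurs(set(events) - {event})]

# Toy model: the incident occurs only if the barrier fails AND the slip occurs.
events = ["barrier removed", "operator slip", "alarm sounded"]
occurs = lambda present: "barrier removed" in present and "operator slip" in present
print(necessary_causal_factors(events, occurs))
# ['barrier removed', 'operator slip'] -- 'alarm sounded' fails the test
```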
Enforcement Benner's criteria require that an incident methodology should reveal expectations about the norms of behaviour. This, in turn, helps investigators to identify non-compliance. It should also help to identify instances in which the enforcement of standards has been insufficient to prevent violations. As with the other criteria, each of the techniques that we have assessed can be argued to offer different levels of support for these aspects of `enforcement'. For example, non-compliance analysis was integrated into our use of ECF in Chapter 10. This technique is deliberately intended to identify violations and to expose deviations from expected norms. The Explanatory Logic that supports the formal components of WBA also includes deontic operators that explicitly capture notions of obligation and permission. These have been used to represent and reason about the particular consequences of non-compliance with standards and procedures [118, 469]. A number of caveats can be made about this criterion and its application to causal analysis techniques. Firstly, it can be difficult to distinguish a wilful violation from a slip or a lapse. The distinction often depends upon the analyst's ability to identify the intention of an individual or group. None of the causal analysis techniques that we have investigated supports the inference of intention from observations of behaviour. The PARDIA components of WBA provide a possible exception to this criticism. Unfortunately, there are relatively few examples of this technique being used to analyse complex operator behaviours. It remains to be seen whether this approach might be used to enable analysts to reason about the detailed cognitive causes of violation and non-compliance. A second caveat to Benner's enforcement criterion is that numerous practical difficulties frustrate attempts to chart the differences that exist between everyday working practices and the recommendations of standards and regulations. Chapter 10 showed that it was extremely difficult for executives, managers and supervisors to keep track of the dozens of procedures and guidelines that had been drafted to guide the development of the Mars Polar Lander and Climate Orbiter projects. Previous paragraphs have also noted the high frequency of apparently minor violations that have been noted as characteristic of expert performance within particular domains, especially aviation [672]. The `enforcement' criterion, therefore, represents a class of requirements that are currently not adequately met by any causal analysis technique. These criteria can be contrasted with other requirements, such as the need to provide `standards' for causal analysis, which are arguably satisfied by all of the techniques that we have examined.
Regional responsibility This criterion was initially drafted to ensure that individual States are encouraged to use a methodology and to take responsibility for its application within the context of U.S. Health and Safety "mandates" [73]. In contrast, we argue that causal analysis techniques must consider the different roles and objectives that distinguish regional organisations from the local groups that typically implement reporting systems. Some causal analysis techniques seem to be better suited to regional organisations. For instance, Chapter 10 shows how Tier analysis associates root causes with higher levels in an organisational hierarchy. This process is likely to create conflicts between local investigators and senior members of a management organisation. Regional investigators are more likely to possess the competence and independence that are necessary to resist the pressures created by such conflicts. Other techniques, such as WBA, can be so costly in terms of the time and skills that are required to exploit them that regional and national groups must be involved in their application. In contrast, the methods that might be derived from probabilistic models of causality are likely to benefit from the information contained in large-scale datasets. Regional organisations may, therefore, be best placed to offer advice and support in the application of these techniques. The aims and objectives of national and regional organisations are likely to be quite different from those of the local teams who are responsible for conducting an incident investigation. Regional organisations are more concerned to derive a coherent overview from a collection of incident reports than they are to understand the detailed causal factors that led to one out of several thousand, or hundreds of thousands, of incidents [444]. An immediate concern to mitigate the local effects of an incident is part of a wider concern to ensure that lessons are propagated throughout an industry. It is, therefore, important that regional organisations should understand and approve of the causal analysis techniques that are used to provide data for these aggregate data sources.
They may also impose training requirements to ensure competence in the application of those techniques [423]. A number of further complications can, however, frustrate regional and national initiatives to exploit the products of causal analysis techniques. For instance, regional bodies are often anxious to ensure that investigators exploit similar techniques so that accurate comparisons can be made between individual reports from different areas of their jurisdiction. It is likely, however, that there will be pronounced local distortions even if different geographical regions all share the same causal analysis techniques. A succession of similar incidents, or an accident with notably severe consequences, can sensitise investigators to certain causes. This effect may be more pronounced for those who are most closely associated with previous incidents [410]. In consequence, very different results can be obtained when the same incidents are reclassified by investigators from different regions and even from different teams. These issues complicate attempts to share data across European boundaries in the aviation domain [423] and across State boundaries within US healthcare [453]. They are also largely orthogonal to the problems of identifying appropriate causal analysis techniques.
Accuracy This criterion is similar to aspects of the `standards' requirement that was introduced in the previous paragraphs. Accuracy is intended to ensure that incident methodologies can be tested for completeness, consistency and relevance. All three of these concepts are relevant to the causal analysis of safety-critical incidents. The first two directly motivated the application of formal proof techniques to support semi-formal argumentation in WBA. As mentioned, however, it can be more difficult to validate techniques that exploit precise mathematical concepts of consistency and completeness. A further caveat is that the use of formal techniques does not guarantee an error-free analysis [21]. It does, however, provide objective rules for identifying and rectifying these problems. The other techniques that we have presented, such as STEP, MES and MORT, provide fewer rules that might be applied to assess the accuracy of a causal analysis. Instead, as mentioned, they rely upon diagrams and tables that are open for peer review. The less formal processes involved in achieving group consensus are intended to provide greater confidence than the formal manipulation of abstractions whose precise interpretation can defy even skilled mathematicians. A similar debate between informal, semi-formal and formal methods has dominated areas of computing science research for several decades [32]. The detailed comparison of the strengths and weaknesses of these different approaches lies outside the scope of this book. In particular, such comparisons have little value unless they can be substantiated by detailed empirical evidence. Later sections will briefly summarise the preliminary results that have been obtained in this area [529, 552]. Unfortunately, these findings provide limited insights into the application of particular approaches. They do not support more general conclusions about comparative benefits and weaknesses. In consequence, investigators must make their own subjective assessments of the claims that proponents make for the `accuracy' of their causal analysis techniques.
Closed Loop This final criterion requires that incident methodologies should be tightly integrated into other aspects of systems design and implementation. The data from incident reporting systems should inform future risk assessments. Information from past risk assessments, or more precisely the arguments that support those assessments, can also help to guide a causal analysis providing that it does not prejudice investigators' hypotheses. We have not suggested how any of the causal analysis techniques that we have examined might satisfy such requirements. There are, however, many similarities between non-deterministic causal models and the techniques that support quantitative approaches to reliability and risk assessment. The prior probabilities of Bayesian analysis can be derived from the estimates that are embodied in risk assessments, especially when reliable data cannot be derived directly from an incident reporting system; the sketch below illustrates this form of update.
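One way to picture this closed loop is as a Bayesian update in which the prior is elicited from a risk assessment and the evidence comes from incident reports. The Beta-Binomial model and all of the numbers in the following sketch are illustrative assumptions, not a procedure described in the text.

```python
# Prior from a (hypothetical) risk assessment: roughly a 1% failure likelihood.
prior_alpha, prior_beta = 1.0, 99.0

# Evidence from a (hypothetical) incident reporting system.
failures, demands = 4, 120

# Conjugate update: the posterior is Beta(alpha + failures, beta + successes).
posterior_alpha = prior_alpha + failures
posterior_beta = prior_beta + (demands - failures)

posterior_mean = posterior_alpha / (posterior_alpha + posterior_beta)
print(f"Posterior failure probability: {posterior_mean:.3f}")  # approximately 0.023
```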
Previous paragraphs have, however, provided an example of a risk assessment technique being used to guide the causal analysis of an adverse incident.
Chapter 10 described how NASA's mishap investigation board identified a problem in the software that was designed to automatically re-establish communications links if the up-link was lost during the Polar Lander's Entry, Descent and Landing phase. This bug was not detected before launch or during the cruise phase of the flight. A Fault Tree analysis did, however, identify this possible failure mode after the Polar Lander had been lost. Chapter 12 will return to this issue in greater detail. The relationship between risk assessment, design and incident reporting is often only considered as an afterthought by many organisations. In consequence, subjective assessments of component and system reliability are often exploited by one group within a company while others in the same organisation continue to collate data about the actual performance of those systems [414]. More generally, this same situation can arise when design groups are unwilling to approach other organisations within the same industry who have previous experience of the incidents that can arise from the operation of particular application processes. In consequence, development resources are often allocated to perceived hazards that cannot be justified by data about previous failures. The previous paragraphs have shown how Benner's methodological criteria can be used to structure a comparison of causal analysis techniques. Some of the criteria, such as `accuracy' and `standards', are relevant objectives for all of the approaches that we have examined. Other requirements, such as `encouragement', are less important for particular techniques. WBA and techniques derived from partition models of probabilistic causality focus more on `accuracy' and `competence'. The main weakness with this approach is that Benner fails to provide any objective procedures that might be used to determine whether or not a particular methodology satisfies a particular criterion. It is, therefore, possible for others to argue against our analysis. For example, it might be suggested that partition methods can encourage greater participation. Certainly, the diagrammatic forms of WBA do help to increase access to some aspects of this technique. There have, however, been no empirical studies to investigate the communication issues that might complicate the use of these formal and semi-formal techniques within the same approach. The following sections, therefore, briefly describe the limited number of studies that have been conducted in this area. These studies do not provide a firm basis for the comparative evaluation of causal analysis techniques. They focus on a limited subset of the available approaches. They also concentrate on incidents within particular industries. These studies do, however, illustrate the manner in which empirical evidence might be recruited to support assertions about the relative merits of these techniques.
11.4.3 Experiments into Domain Experts' Subjective Responses

Both Van Vuuren's bottom-up analysis of the PRISMA approach and Benner's application of model and methodology criteria were driven by the direct involvement of the individuals who were responsible for conducting the tests. Van Vuuren participated in the analysis that is summarised in Tables 11.13, 11.14 and 11.15. Benner performed the ratings that were derived from the lists of criteria presented in the previous section. This level of personal involvement in the validation of causal analysis techniques should not be surprising. Previous sections have summarised the practical, theoretical and ethical issues that complicate the evaluation of different causal analysis techniques. Many researchers, therefore, side-step the problems of investigator training and recruitment by conducting subjective studies based on their own application of alternative techniques. In contrast, Munson builds on the work of Benner [73] and Van Vuuren [844] by recruiting a panel of experts to validate his application of causal analysis techniques [552]. Munson began by applying a number of analysis techniques to examine a canyon fire that had previously been investigated by the U.S. Fire Service. In particular, he applied Fault Tree analysis, STEP and Barrier analysis to assess the causal factors that contributed to this incident. He then recruited his panel by choosing wildland firefighting experts rather than `professional' investigators; this "most accurately emulates real world situations where investigators may have some investigative experience but their primary occupation and training is not in these techniques" [552]. None of the evaluators had any prior experience with accident analysis techniques. This helped to avoid any preference for, or experience of, existing approaches. Each member of the panel had a minimum of fifteen years' experience in wildland fire suppression and was qualified to `Strike Team Leader' level. Individuals were selected on a `first come' basis.
Munson acknowledges that this may have introduced certain biases; however, he endeavoured to ensure that none of the panel consulted each other about their ratings. He was also aware that the panel members may have held preconceived ideas about the causes of the case study; "since they were evaluating the investigation methods and not the reinvestigation of the fire, bias should have been reduced" [552]. The members of the panel were asked to compare Munson's Fault Tree analysis, STEP analysis and Barrier analysis of the canyon fire by rating each technique against a number of criteria. These requirements were based on a subset of the Benner criteria [73]. As can be seen, some of these requirements apply more to reconstruction and modelling than they do to causal analysis. This can be justified by the mutual dependencies that we have stressed in previous chapters. Munson's criteria can be summarised as follows:

1. Realistic. Any analysis must capture the sequential, concurrent, and interactive nature of the flow of events over time.
2. Comprehensive. Any analysis must identify the beginning and the end of an accident sequence and there must not be any gaps in the investigator's understanding of an incident.

3. Systematic. Any analysis must be supported by a logical and disciplined method that allows for peer review and mutual support by all of the members of an investigation team.

4. Consistent. The method must be consistent and it should be possible for investigators to verify that any conclusions are correct from the information that is available.

5. Visible. Any analysis must discover and present the events and interactions throughout an accident sequence so that colleagues can understand the manner in which they contribute to an incident.

6. Easy to learn. It should be possible for investigators to learn how to apply a technique by attending a one-week course. This criterion reflects Munson's focus on the firefighting community and he acknowledges that it should not be considered when attempting to assess the `best' analysis technique.

The experts were asked to use a ranking system that was similar to that described in the previous section; "The rating scale follows Benner's [73] approach in that until a more comprehensive scale is developed to better differentiate levels of compliance to the criterion, a more simple direct measurement scale is appropriate" [552]. For each method, they were asked to rate each criterion. A score of zero was used to denote that they did not believe that the approach met the criterion. A score of one was used to indicate that they believed that the approach addressed the criterion, but not completely, and that improvement would be required. A score of two was to be awarded if the analysis technique satisfied the criterion. No weightings were applied to the results of this process because no criterion was perceived to have more significance than any other. The results from summing the individual scores showed that STEP received the highest rating: 52 from a possible 60 (87%). Fault Tree Analysis received 51 out of 60 (85%). Barrier Analysis obtained 42 out of 60 (70%); the arithmetic behind these percentages is reconstructed in the sketch below. STEP was rated as the most `comprehensive' (100%) and most `consistent' (100%). Both Fault Tree Analysis and STEP were rated as the `easiest to use' (90%). Barrier analysis was rated the most `realistic' technique (90%). Fault Tree Analysis was rated as the most `systematic' method (100%). It was also the most `visible' (90%). Two evaluators rated it as the best overall approach. Two rated STEP the highest. One assigned equal value to both STEP and Fault Tree Analysis. Barrier Analysis was not assigned the highest rating by any of the evaluators.
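The arithmetic behind these headline figures is simple to reconstruct. The panel size of five is inferred from the best-overall votes reported above; everything else follows from the published totals.

```python
# Each of 5 evaluators scores 6 criteria on a 0-2 scale: a 60-point maximum.
EVALUATORS, CRITERIA, MAX_ITEM_SCORE = 5, 6, 2
maximum = EVALUATORS * CRITERIA * MAX_ITEM_SCORE  # 60

totals = {"STEP": 52, "Fault Tree Analysis": 51, "Barrier Analysis": 42}
for method, total in totals.items():
    print(f"{method}: {total}/{maximum} = {total / maximum:.0%}")
# STEP: 52/60 = 87%, Fault Tree Analysis: 51/60 = 85%, Barrier Analysis: 42/60 = 70%
```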
Munson also analysed his data to assess the level of agreement between his panel of assessors. Multivariate techniques were not used; "the number of evaluators and criteria were considered too small and would not constitute any meaningful insight" [552]. Instead, Perreault and Leigh's [674] index was used to assess inter-rater reliability. Indexes above 0.85 are considered to indicate a high degree of consensus. Levels below 0.80 require further analysis. Munson provides the following
equation for the reliability index. F is the frequency of agreements between the evaluators, N is the total number of judgements and k is the number of categories:
\[
I_r = \left[\left(\frac{F}{N} - \frac{1}{k}\right)\frac{k}{k-1}\right]^{0.5} \tag{11.33}
\]
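Equation 11.33 translates directly into code. The following sketch includes a conventional floor of zero when agreement is no better than chance, which is implicit in the index's definition rather than stated in the text; the figures in the final call are illustrative and merely show values consistent with the indexes that Munson reports.

```python
import math

def perreault_leigh_index(agreements, judgements, categories):
    """Perreault and Leigh's inter-rater reliability index (Equation 11.33).

    agreements -- F, the frequency of agreements between the evaluators
    judgements -- N, the total number of judgements
    categories -- k, the number of rating categories
    """
    proportion = agreements / judgements
    if proportion <= 1.0 / categories:
        return 0.0  # agreement is no better than chance
    return math.sqrt((proportion - 1.0 / categories) * (categories / (categories - 1.0)))

# Illustrative only: 24 agreements in 30 judgements over k=3 rating categories
# yields an index close to the 0.84 reported for the STEP evaluation.
print(round(perreault_leigh_index(24, 30, 3), 2))  # 0.84
```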
The panel's evaluation of the six criteria for the STEP method yielded an index of 0.84 [674]. Fault Tree Analysis received 0.86 over all of the criteria. Barrier Analysis achieved a reliability index of 0.79. The inter-rater reliability for all methods was 0.84. As can be seen, only the Fault Tree assessment indicated a high degree of consensus, but all other measures fell into the acceptable region identified by Perreault and Leigh [674]. If we look at levels of agreement about individual criteria, it is possible to see some interesting patterns. For example, there was little agreement about whether or not Fault Tree analysis was a `realistic' technique (0.63). STEP received the highest rating for `comprehensiveness' and achieved an index of 1.0 for inter-rater reliability. Fault Tree analysis and Barrier analysis achieved a similar level of consensus but at a lower overall rating for the `comprehensiveness' of the techniques. Munson provides a more sustained analysis of these results [552]. The experts were each asked to provide additional comments about the applicability of each method. Munson cites a number of the responses that were provided. Ironically, some of these comments reveal the experts' lack of understanding of the technical underpinnings of the methods that they were asked to evaluate. The attitudes to Fault Tree analysis are particularly revealing in this context, given the key role that fault trees play within many areas of reliability and risk assessment: "One evaluator liked the way Fault Tree Analysis visually presented complex events and the way it showed accidents as a chain-of-events as opposed to a single random occurrence... They thought that this method might be better at uncovering managerial/administrative latent factors contributing to the incident than the other two methods. In contrast, one evaluator responded that the STEP method appeared more stringent in revealing underlying human causal factors. They commented that STEP (and Control/Barriers Analysis) provided an approach that was more likely to distinguish more abstract human factors from hard factual data considerations and therefore be better at raising questions into human error causes... All evaluators expressed concern that Control/Barriers Analysis was inadequate in determining causal factors when applied to wildland firefighting. It had strengths in identifying needed and/or compromised barriers at an administrative level but the dynamic and highly variable aspect of the firefighting environment made its application to investigations inadequate" [552]. Munson concludes that STEP is the most `desirable' method for the investigation of wildland firefighter entrapments. The small differences between the scores for this technique and for Fault Tree analysis suggest, however, that there are unlikely to be strong differences between these two techniques. Both were rated more highly than Barrier Analysis. A number of questions can be raised both about the methods that Munson used and about the results that he obtained from them. Firstly, Munson was not qualified in accident or incident investigation when he undertook this study. The manner in which he applied the three techniques need not, therefore, have reflected the manner in which they might have supported an active investigation by trained personnel. Secondly, a number of caveats and criticisms can be made about his application of particular techniques.
For example, the Fault Tree analysis of the canyon fire breaks some of the syntactic conventions that are normally associated with this approach, see Chapter 10. Paradoxically, these differences aid the presentation of Munson's analysis. They also make it difficult to be sure that the results from this study could be extended to the more general class of Fault Trees that obey these syntactic conventions. Thirdly, this study focuses on experts who represent only a very small cross-section of the community who are involved in accident and incident investigations. This is a universal weakness shared by all previous validation studies that we have encountered. Chapter 4 has shown that incident and accident reporting systems involve individual workers, supervisors, investigators, safety managers, regulators and so on. Benner's original `encouragement' criterion captures some aspects of this diversity. However, experimental validations focus on the utility of causal analysis techniques for investigators or, as in this case, domain experts.
Regulators might take a very different view. Fourthly, a number of minor caveats can be raised about the choice of statistical techniques that were used to analyse the data from this study. Multivariate analysis might have been applied more systematically. This could have yielded results that are easier to interpret than the piecemeal application of Perreault and Leigh's index. Finally, Munson's study specifically addresses the firefighting domain. Several of the criteria were specifically tailored to reflect the working and training practices of this application area. Further studies are required to replicate this work in other domains. It is important to balance these criticisms and caveats against the major contribution that has been made by Munson's work. The opening paragraphs of this section reviewed the many pragmatic, theoretical and ethical barriers that complicate research in this area. His approach successfully addresses many of these potential problems. Munson shows that it is possible to provide further evidence to support Benner's subjective analysis.
11.4.4 Experimental Applications of Causal Analysis Techniques

Previous sections have described a number of different approaches to the validation and comparative evaluation of causal analysis techniques. Van Vuuren adopted a bottom-up approach by applying PRISMA to support a number of incident reporting systems within particular industries. Benner adopted a much more top-down approach when he developed and applied a set of criteria in a subjective evaluation of accident models and methodologies. Munson used this approach as the foundation for an expert panel's evaluation of causal analysis techniques for firefighting incidents. A limitation of Benner's approach is that it was based upon the subjective analysis of the researcher. Munson avoided this by recruiting a panel of experts. They did not, however, apply any of the methods themselves and only provided subjective ratings based on a case study that was developed by Munson. Van Vuuren's study involved investigators in the application of the PRISMA technique. He, however, played a prominent role in coaching the use of this approach; "guidance was necessary to pinpoint these mistakes or lapses and by doing this to improve the quality of the causal trees and stimulate the learning process regarding how to build causal trees" [844]. This intervention was entirely necessary given the ethical issues that surround the validation of incident investigation techniques using `live' data. The closing sections of this chapter describe an alternative approach. McElroy attempted to integrate elements of Munson's more controlled experimental technique and Van Vuuren's concern to involve potential end-users in the application of particular approaches [529]. McElroy's evaluation began with a sustained application of the PRISMA technique. He used a variant of the Eindhoven Classification Model to identify the root causes of more than one hundred aviation incidents from the ASRS dataset. This yielded approximately 320 root causes, the majority of which related to human factors issues. In order to validate his results, he recruited a number of experts to repeat his analysis of selected incidents from the study. The intention was then to compare the causal trees that they produced, and the resulting root cause classifications, with McElroy's findings from the initial analysis. He rejected Munson's approach of recruiting domain experts, such as pilots or air traffic controllers. This was partly motivated by pragmatic concerns, such as the difficulty of securing access to participants for the study. It was also motivated by the difficulties that Munson and Benner had foreseen in training domain experts to apply novel analysis techniques, rather than simply requiring them to comment on the use of the approach by someone else. In contrast, McElroy recruited participants who had specific expertise or training in incident and accident analysis. This approach also raised problems; he found it difficult to secure the involvement of participants with similar expertise and training. Both of these factors are significant given Lekberg's results, which show that an investigator's training will influence their causal analysis of safety-critical incidents [484]. In the end he was only able to assess the application of the technique by two participants. In consequence, his findings cannot be used to support general conclusions about the PRISMA technique.
They do, however, provide a glimpse of some of the individual differences that might affect the application of causal analysis techniques by incident investigators. As mentioned, McElroy provided his participants with short synopses of incidents that had previously been submitted to the ASRS. The following paragraph provides a brief extract from the summary that McElroy used in his study:
ACCESSION NUMBER : 412640 DATE OF OCCURRENCE : 9808 NARRATIVE : DEPARTING NEWPORT ARPT, AT THE TIME OF DEP, THE W HALF OF THE ARPT WAS STARTING TO FOG IN. I HOVER-TAXIED TO THE FAR E END OF THE ARPT AND WAS ABLE TO TAKE OFF IN BLUE SKIES AND UNLIMITED VISIBILITY. THIS ARPT IS SET UP FOR A CTL ZONE WHEN THE VISIBILITY IS LESS THAN 3 MI AND A 1000 FT CEILING. THERE WAS ANOTHER HELI IN THE PATTERN WHOM I WAS IN RADIO CONTACT WITH. HE GAVE ME PERMISSION TO TAKE OFF FIRST AND THEN HE WENT IN AND LANDED. ALL OF THIS WAS DONE VFR ON THE E END OF THE FIELD WHILE THE W END WAS FOGGED IN. THE STANDARD FOR THE OTHER ARPTS WITH CTL TWRS HAS BEEN IF I WAS INSIDE OF THEIR CTL ZONE AND IT WAS IN EFFECT, THEY HAVE ALLOWED ME TO WORK INSIDE THE CTL ZONE WITHOUT A SPECIAL VFR IF I WAS IN THE STANDARD VFR CONDITIONS. ALL I NEEDED TO DO WAS MAKE RPTS OF MY LOCATIONS WHILE WORKING IN THEIR AIRSPACE. AS LONG AS I WAS VFR, I DID NOT NEED A SPECIAL VFR TO BE INSIDE THE AIRSPACE. MY POINT TO ALL OF THIS IS THAT IT IS NOT TAUGHT TO NEW STUDENTS THIS WAY SO IT BECOMES MORE LIKE JUST A STORY WHEN AN OLDER PLT DOES SOMETHING LIKE THIS. IT IS LEGAL TO DO BUT NOT GOOD FOR STUDENTS TO SEE. NOT SURE OF HOW OR WHERE TO MAKE A POINT OF THIS, OR IF MAYBE IT IS NOT A RELATIVE POINT TO MAKE AT ALL. HOPE THIS IS NOT TOO CONFUSING, AND THANK YOU FOR YOUR TIME.

The first participant produced the Causal Tree shown in Figure 11.19.

Figure 11.19: Causal Tree from McElroy's Evaluation of PRISMA (1)

McElroy's study focussed more on the application of this diagram to support PRISMA's root cause analysis. A number of insights can, however, be derived from this initial stage of his evaluation. The tree took several hours to construct but, as can be seen, it is essentially a sketch of the incident. It includes inferences and judgements that cannot directly be supported from the synopsis. For instance, one node is annotated to denote that the helicopter pilot took off illegally, happy that he was flying under visual flight rules. Nowhere does the report state that the pilot was `happy' with the state of affairs. Similarly, the causal tree refers to the manoeuvre as `illegal' although the pilot believed that their actions were `legal' within the control zone of the airport tower. This ambiguity reflects a lack of contextual information and the participant's limited domain knowledge. It was not, however, addressed in McElroy's analysis. A key point here is that although this evaluation ran for several hours, the participants never had the opportunity to move beyond this high-level sketching to the more meticulous analysis that would be needed to demonstrate the sufficiency and correctness of a causal `explanation'. One might, therefore, infer that such an experimental procedure would have to be significantly revised if it were to be used to assess the utility of one of the mathematical techniques that we have described. As mentioned, McElroy's aim was to determine whether participants who were trained in incident analysis would confirm his own application of PRISMA. The first participant was, therefore, asked to use their diagram in Figure 11.19 to drive the categorisation of root causes using a variant of the Eindhoven Classification Model. They identified the following list of potential causes:
Operating procedures. This is represented by the node labelled OP in the Eindhoven Classification Model of Figure 11.9. The participant identified that the incident was the result of inadequate procedures.
Management priorities. This is represented by MP in the Eindhoven Classification Model. The participant identified that top or middle management placed pressure on the operator to deviate from recommended or accepted practice.
Permit. This is represented by HR2 in the Eindhoven Classification Model. The participant identified that the operator failed to obtain a permit or licence for activities where extra risk was involved.
Planning. This is shown as HR5 in the Eindhoven Classification Model. The participant identified that the activity was not fully planned. Appropriate methods were not identified, nor were they carried out in a well-defined manner.
Unclassifiable behaviour. This is shown as X in the Eindhoven Classification Model. The participant also identified that some of the causal factors denoted in Figure 11.19 could not be classified using the model.
In contrast, McElroy's analysis only identified management priorities and planning as causal factors in this incident. The other three causes identified by the participant did not appear in the initial analysis. In addition, McElroy's analysis identified Goal? (HK2) as a potential cause that was not recognised by the first participant. This root cause categorisation denotes that the operator failed to identify appropriate goals or priorities for their actions. The set operations sketched below summarise this pattern of agreement and disagreement.
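Because the Eindhoven codes form small sets, the agreement between the two analyses can be summarised with elementary set operations. The codes below are exactly those reported in the surrounding text; only the presentation is new.

```python
# Eindhoven Classification Model codes, as reported above.
participant_1 = {"OP", "MP", "HR2", "HR5", "X"}  # first participant
mcelroy = {"MP", "HR5", "HK2"}                   # McElroy's initial analysis

print("Agreed:", sorted(participant_1 & mcelroy))            # ['HR5', 'MP']
print("Participant only:", sorted(participant_1 - mcelroy))  # ['HR2', 'OP', 'X']
print("McElroy only:", sorted(mcelroy - participant_1))      # ['HK2']
```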
This comparison raises several issues. Firstly, the study tells us what categories the participant felt were important to the causes of the case study. It does not tell us why they believed them to be important. This matters because, although both McElroy and the first participant identified planning as a causal factor, it is entirely possible that they had entirely different reasons for making this categorisation. Conversely, we do not know the reasons why they differed over specific elements in their causal analysis. Secondly, it is difficult to determine the justification for some of the reported conclusions made by both McElroy and the participant. Although the previous quotation is an abbreviated form of the incident report that was supplied during the study, there is no explicit indication that management priorities had caused the pilot to behave in the manner that they reported. This illustrates the more general point that we have made repeatedly in this book: it is not sufficient simply to present a causal analysis without providing a detailed justification of the reasons supporting that analysis. As mentioned, the second evaluation focussed on an incident from the ASRS' air traffic dataset:

ACCESSION NUMBER : 425120 DATE OF OCCURRENCE : 9901 NARRATIVE : WX WAS SUNNY BUT COLD, A DAY OR 2 AFTER A SNOW/ICE STORM. SABRELINER WAS TAXIING OUT FOR IFR DEP. ATC OBSERVED THE FUSELAGE WAS COVERED WITH SNOW AND ICE. ATC ADVISED THE PLT 'IT APPEARS THERE'S A LARGE AMOUNT OF SNOW AND ICE ON THE TOP OF YOUR ACFT.' THE PLT STATED 'IT'S NOT A LOT, IT'S A LITTLE, AND IT WILL BLOW OFF WHEN WE DEPART.' ON TKOF ROLL, ICE WAS OBSERVED PEELING OFF THE FUSELAGE. THIS CONTINUED AS THE ACFT CLBED OUT. ICE WAS OBSERVED FALLING ON OR NEAR A HWY JUST OFF THE DEP END OF THE RWY. THE ACFT WAS SWITCHED TO DEP, BUT A FEW MINS LATER RETURNED FOR LNDG. AS THE ACFT TAXIED IN, SIGNIFICANT ICE FORMATION WAS OBSERVED ON THE ELEVATORS. THE ACFT TAXIED TO AN FBO AND WAS DEICED BEFORE TAXIING BACK OUT FOR DEP. I SPOKE WITH THE FBO LATER. THEY SAID THEY HAD SEEN THE PLT CLRING SNOW AND ICE OFF THE ACFT BEFORE HE FIRST DEPARTED. HOWEVER, THE UPPER SURFACE OF THE ELEVATORS WAS TOO HIGH FOR THE PLT TO SEE FROM THE GND.

The second participant produced the Causal Tree shown in Figure 11.20 for this incident report. McElroy's analysis again focussed on the causal factors that were identified using a variant of PRISMA's Eindhoven Classification Model. As before, however, this diagram yields several insights into the assessment of causal analysis techniques. There is a far greater level of detail in this tree than in Figure 11.19. There is insufficient evidence to determine whether this is an artifact of individual differences between the participants or whether it stems from differences in the two incidents that they studied. As with many aspects of McElroy's work, it provides tantalising hints of deeper issues. He did not counter-balance the study so that each participant was asked to analyse each incident. This had been an intention behind the study but he ran out of time. Rather than rush the participants to perform a partial study of two incidents, he chose to allow them more time with a single incident. Both Figures 11.19 and 11.20 are sketches. They record the participants' initial thoughts about the incidents. They follow the high-level structure proposed by the causal tree approach; left branches represent the `failure' side while the right branch denotes `recovery' events. There are also examples in both trees where the participants either deliberately neglected the syntax of their trees or else did not follow the syntactic rules that were presented. In Figure 11.19, there is a minor violation with an AND gate that includes a single event. It can be argued that this represents a stylistic issue rather than a violation of any explicit syntactic rule. In this case, however, it is uncertain how to interpret the relationship between the events `Helicopter pilot did not get a special visual flight rule clearance' and `The helicopter was still able to take off without contact from air traffic control'.
Figure 11.20 raises more questions. The event labelled `No checklist or protocol' is linked to two events without any intervening gate. The event labelled `Pilot dismiss ATC concerns' is provided as an input to two different AND gates. Such constructions break the independence assumptions that are important for the analysis of more `conventional' fault trees. These rules were, almost certainly, not presented to the participants in McElroy's study. Such annotations are, therefore, of considerable interest because they illustrate ways in which users shape the notation to represent the course of an incident. In future studies, it would be important to know what was intended by the event labelled `No checklist or protocol'. This would enable us to determine whether the notation fails to support a necessary feature or whether the training failed to convey significant syntactic constructs to the participants. Given that the participants were unlikely to derive a reliability assessment from Figure 11.20, it can be argued that the independence assumption has no value for the practical application of causal trees. As with the first participant in the helicopter case study, the second participant was also asked to use their causal tree to drive the categorisation process that is supported by the Eindhoven Classification Model.
Figure 11.20: Causal Tree from McElroy's Evaluation of PRISMA (2)

The following list summarises the categories of root causes that were identified by the second participant:
Operating procedures. This is represented by the node labelled OP in the Eindhoven Classification Model of Figure 11.9. The participant identified that the incident was the result of inadequate procedures. This category was also identified by the first participant for the helicopter case study.
System Status. This is shown as HK1 in the Eindhoven Classification Model. The participant identified that the operator did not have an accurate knowledge of the "state and dynamics" of the system at key points during the incident [529].
Permit. This is represented by HR2 in the Eindhoven Classification Model. The participant identified that the operator failed to obtain a permit or licence for activities where extra risk was involved. This category was also identified by the first participant for the helicopter case study.
Checks. This is represented by HR4 in the Eindhoven Classification Model. The participant indicated that the operator had failed to conduct sufficient checks on the local system state to ensure that it complied with the expected conditions.
Planning. This is shown as HR5 in the Eindhoven Classification Model. The participant identified that the activity was not fully planned. Appropriate methods were not identified
nor were they carried out in a well-defined manner. This category was also identified by the first participant for the helicopter case study.
Unclassifiable behaviour. This is shown as X in the Eindhoven Classification Model. The participant also identified that some of the causal factors denoted in Figure 11.20 could not be classified using the model. This category was also identified by the first participant for the helicopter case study.
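As flagged above, the reuse of the Pilot dismiss ATC concerns event as an input to two different AND gates undermines the arithmetic of conventional fault tree analysis. The following sketch uses purely hypothetical probabilities to show the size of the effect; it is an illustration, not part of McElroy's study.

# Hypothetical probabilities: why a shared event breaks the independence
# assumption used when quantifying conventional fault trees.
p_dismiss = 0.1    # P(pilot dismisses ATC concerns) - assumed value
p_other_1 = 0.2    # P(other input to the first AND gate) - assumed value
p_other_2 = 0.3    # P(other input to the second AND gate) - assumed value

# Naive calculation: treat the two AND gates as independent and combine
# them under an OR gate using P(A or B) = P(A) + P(B) - P(A)P(B).
p_gate_1 = p_dismiss * p_other_1
p_gate_2 = p_dismiss * p_other_2
p_naive = p_gate_1 + p_gate_2 - p_gate_1 * p_gate_2

# Correct calculation: condition on the shared event so that its
# probability is not counted twice.
p_correct = p_dismiss * (p_other_1 + p_other_2 - p_other_1 * p_other_2)

print(f"top event, independence assumed: {p_naive:.4f}")    # 0.0494
print(f"top event, shared event handled: {p_correct:.4f}")  # 0.0440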
McElroy's initial analysis had also identified Checks (HR4), Planning (HR5) and Unclassifiable behaviour (X) as root causes for this incident. The other categories were omitted. In addition, McElroy also identified License (HR1) as a causal factor. He argued that the operator in question must be qualified to do the job. He also identified Management Priorities (MP) as an issue in this incident: top or middle management placed pressure on the operator to deviate from recommended or accepted practice. As noted in previous sections, it is difficult to reconstruct the thought processes that either of the participants used to justify their categorisation. McElroy notes in several places that the participants lacked the additional information that would have supported hypotheses about, for instance, the organisational causes of an incident. These are intriguing results. They perhaps reflect the participants' suspicions that there must have been organisational causes to explain the operators' behaviour. If this is true then perhaps we are experiencing the consequences of the recent emphasis on the managerial and organisational causes of failure: these will be diagnosed as potential causes even when investigators are not provided with sufficient evidence to confirm them. A number of methodological criticisms can be raised about McElroy's study. As mentioned, the lack of alternative data sources often forced the participants to make inferences and assumptions about potential causal factors. This led to causal trees and root cause classifications that resembled rough `sketches' of an incident. These criticisms can be addressed by acknowledging the severe time constraints that affected McElroy's work. They can also be countered by arguing that these high-level interpretations may resemble the level of detail that can be derived from an initial analysis of an incident report prior to a secondary investigation. They also provide an accurate impression of the `rough' forms of causal analysis that can be conducted for contributions to anonymous incident reporting systems. In such circumstances, investigators are also constrained by the information sources that are available to them without compromising the identity of the contributor. Further objections can be raised about the lack of empirical support for McElroy's work. He does not attempt to quantify agreement between his own causal analysis and those of the other participants. Given the limited data that he was able to obtain, this should not be surprising. He does not, however, speculate on measures that might be used. There is a vast range of techniques that can be used to represent and compare the structure of arbitrary trees [450, 451]. These algorithms might be used to detect patterns within the structure of causal trees. For example, Lekberg argues that an investigator's educational background can bias their causal analysis [484]. Similarity metrics, for example those based on vector representations, might be used to determine whether investigators from similar educational backgrounds produce measurably similar tree structures for the same incidents.
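It is worth illustrating what such a measure might look like. The sketch below is one crude possibility rather than a technique drawn from the literature cited above: each causal tree is reduced to the set of parent-child label pairs that it contains, and two trees are compared through the Jaccard overlap of those sets. The trees and node labels are hypothetical.

def edges(tree):
    # Collect (parent, child) label pairs from a nested (label, children) tree.
    label, children = tree
    pairs = set()
    for child in children:
        pairs.add((label, child[0]))
        pairs |= edges(child)
    return pairs

def similarity(tree_a, tree_b):
    # Jaccard overlap of the two trees' edge sets (1.0 = identical structure).
    a, b = edges(tree_a), edges(tree_b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Two investigators' sketches of the same incident (hypothetical labels).
tree_1 = ("incident", [("no deicing", [("pilot dismissed ATC", [])]),
                       ("ice on elevators", [])])
tree_2 = ("incident", [("no deicing", [("no checklist", [])]),
                       ("ice on elevators", [])])

print(f"structural similarity: {similarity(tree_1, tree_2):.2f}")  # 0.50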
There are certain ironies about McElroy's approach. He framed his study as an experimental comparison between his own analysis and that of participants who were trained in incident analysis. He controlled the materials that were available to the participants and gave them the same training in the PRISMA technique. Having established these conditions, he lacked the resources to perform the additional tests that would have thrown light on many important issues. For instance, he did not counter-balance the incidents that were presented to the participants. This makes it difficult to determine whether any observed effects stem from the participant or from the incident being studied. Similarly, McElroy only obtained access to two trained participants. Such a sample is inadequate to support any general conclusions. It should be stressed, however, that McElroy views his study as an initial experiment. It was intended to act as a marker for future studies that might attempt to assess whether investigators can use a causal analysis technique, rather than just assessing their subjective attitudes towards someone else's application of an approach, as Munson had done [552]. This section has summarised recent attempts to assess the strengths and weaknesses of different causal analysis techniques. We have seen that these studies have only provided preliminary results.
The main conclusion from all of the work that we have cited in this section is that further research is needed to validate the many benefits that are claimed for the approaches that have been summarised in this chapter. It is, however, also possible to identify a number of more specific conclusions that might guide the future validation of causal analysis techniques:

Consider a broad range of stakeholders. Previous studies have almost exclusively focussed on the investigators' assessment of causal analysis techniques. This is natural given that they are likely to determine whether or not a particular approach can be applied in the field. It should not be forgotten, however, that there are many other groups and organisations that must participate in, or validate, incident investigations. For instance, Chapter 3 discussed the role that regulators play in many reporting systems. A technique that satisfies investigators but does not convince regulatory bodies is unlikely to be acceptable. Similarly, it is important that any potential causal analysis technique should also be acceptable to those who must pay for its application. If this is not the case then there will be continued and increasing pressure either to reject the approach or to `cut corners' in order to save resources.
Consider longitudinal factors as well as short-term effects. All of the studies that we have presented are based around relatively short-term assessments of particular techniques. In particular, Munson and McElroy's evaluations took place over several hours. They do not, therefore, provide much evidence about the long-term benefits that might be provided by the consistent application of causal analysis techniques. There are also a range of detailed issues that are difficult to examine without more sustained studies. For instance, it is often necessary for investigators to revisit an analysis at some time after it was originally conducted. They may want to determine whether or not a subsequent incident has the same causal factors. In such circumstances, it is important not simply to identify the results of a causal analysis. It is equally important to understand the reasons why a particular decision was reached.

Consider the range of incidents in an evaluation. It can be difficult to ensure that any assessment presents its participants with an adequate range of incidents. If this is not done then the utility of a causal analysis technique may be demonstrated for a sample of incidents that does not reflect the problems that are reported to the sponsoring organisation. There are further aspects to this issue. It may not be sufficient to base an evaluation on an accurate sample of current incidents. Given the overheads associated with training staff and financing the implementation of a new causal analysis technique, it is likely that any approach will be used for a significant period of time. If this is the case then any validation must also consider whether incidents will change during the `lifetime' of an analysis technique. For example, Chapter 3 has argued that the increasing integration of computer-controlled production systems is posing new challenges in forensic software engineering. None of the techniques presented here, with the possible exception of WBA, explicitly addresses these challenges [411].
Consider the impact of individual or team-based investigation. The studies of Munson, Benner and McElroy focussed on the performance and subjective assessments of individual investigators. Munson even went out of his way to prevent `collusion' between the participants in his study. In contrast, Van Vuuren's evaluation involved teams of engineers, domain specialists, managers and safety experts. This reflects his intention to assess the application of this technique without the usual experimental controls that were accepted by Munson and McElroy. It is difficult, however, to determine whether team composition had any effect on the causal analyses reported by Van Vuuren. His published work says remarkably little about these issues. Work in other areas of group working has indicated that such factors can have a profound impact upon the successful application of design and analysis techniques [489, 556]. For example, the ability to use drawings and sketches as a medium of negotiation and arbitration can have a powerful effect during group confrontations. Attention may be focussed more on the shared artifact and less on the individuals involved in the discussion. We do not know whether these effects are also apparent in the application of causal analysis techniques.
Consider causal analysis in relation to other phases of investigation. Benner reiterates the importance of evaluating any analytical technique within the context of a wider investigation [73].
Analysis techniques are unlikely to yield sufficient explanations if investigators have not been able to elicit necessary information about the course of an incident. This argument was first made in the context of accident investigations. Unfortunately, cost limitations and the constraints of confidentiality/anonymity can prevent investigators from obtaining all of the data that they may need to complete a causal analysis. All of the techniques introduced in this chapter, with the exception of PRISMA, were developed to support accident investigations. These are, in one sense, information-rich environments. In contrast, the particular characteristics of incident reporting systems may make relevant information very difficult to obtain. Any assessment must not, therefore, provide participants with information that they would not otherwise have available during the application of a particular technique.
Consider which stage of an investigation is being assessed. As we have seen, McElroy's initial evaluation of the application of an analysis technique produced results that were compatible with the early stages of an investigation. The participants produced trees that `sketched' the outline of an incident. They did not produce polished artifacts that might provide consistent and sufficient causal explanations. Techniques that are intended to provide such quality control must, therefore, be validated in a way that enables investigators to achieve some degree of competence in the more `advanced' use of the approach.
This is not an exhaustive list. Previous attempts to validate particular approaches have done little more than signpost areas for further work. It is equally important, however, not to underestimate the small number of pioneering studies that have begun to validate the claimed benefits of causal analysis techniques.
11.5 Summary

This chapter has reviewed a broad range of techniques that can be used to support the causal analysis of safety-critical incidents. The opening sections built on our application of ECF analysis in Chapter 10 by introducing alternative event-based techniques: the related approaches of Multilinear Event Sequencing (MES) and Sequentially Timed Events Plotting (STEP) were presented. These techniques all encourage analysts to use semi-formal, graphical or tabular notations to construct causal models of the events that lead to particular incidents. These notations provide great flexibility; investigators have considerable freedom in the manner in which they construct a causal model. Counterfactual reasoning is then, typically, applied to identify root causes from the candidate causal factors that are represented in a semi-formal model.
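A minimal sketch of this counterfactual test may help. The events and causal paths below are hypothetical simplifications of the icing incident discussed earlier; a candidate factor is treated as a root cause only if the incident could not have occurred without it.

# The incident occurs if every event in at least one causal path occurs.
causal_paths = [
    {"snow on airframe", "pilot declines deicing", "takeoff"},
    {"snow on airframe", "no authority to stop departure", "takeoff"},
]

def incident_occurs(events):
    return any(path <= events for path in causal_paths)

all_events = set().union(*causal_paths)
for candidate in sorted(all_events):
    # Counterfactual question: would the incident still have occurred
    # if this candidate event had not taken place?
    if not incident_occurs(all_events - {candidate}):
        print(f"necessary (candidate root cause): {candidate}")
    else:
        print(f"not necessary on its own: {candidate}")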
Unfortunately, the flexibility offered by these approaches can also be a weakness. There are few guarantees that different investigators will derive the same results using the same approach. Similarly, it is unlikely that the same investigator will be able to replicate the details of their analysis at a later date. Event-based techniques were, therefore, contrasted with approaches that exploit checklists. These techniques provide investigators with a restricted choice of causal factors. The Management Oversight and Risk Tree (MORT), the Prevention and Recovery Information System for Monitoring and Analysis (PRISMA) and Tripod all exploit variants of this underlying idea. The enumeration of causal factors guides and prompts investigators. It can also help to encourage consistency in an analysis. This is particularly important if national or regional comparisons are to be made between the causal factors of incidents that occur at a local level. Aggregated statistics would be unreliable if different investigators identified different causal factors behind the same incident. Of course, the price of consistency is that it may be difficult to identify an appropriate causal factor from the list of choices that are offered by these techniques. MORT and PRISMA address this potential caveat by encouraging investigators to extend the basic enumerations to reflect regional or domain-dependent variations in the incidents that are reported. A further limitation of checklist approaches is that it can be difficult to check whether a particular analysis provides a consistent or sufficient explanation of a safety-critical incident. This chapter, therefore, introduced a range of formal causal analysis techniques. These approaches exploit mathematical systems of reasoning and argument to provide clear and concise rules about what can and what cannot be concluded about the causes of an incident.
In particular, we have introduced WBA, partition techniques for non-deterministic causation and Bayesian approaches to subjective, probabilistic causation. Although these techniques are not widely used, they offer a number of potential benefits. They avoid many of the limitations that others have identified for the existing techniques introduced in previous paragraphs [453, 482]. The rules that govern the application of these techniques provide objective criteria for verifying that an analysis is correct. The importance of ensuring the consistency and completeness of any analysis is also increasingly significant given the rising cost of litigation in the aftermath of adverse occurrences. The modular approach supported by WBA and partition methods provides one means of addressing the increasing complexity of many safety-critical incidents. These benefits will only be realised if we can develop techniques that enable non-formalists to participate in their application. At present, it can be difficult for those without an extensive background in mathematics to understand the strengths and the limitations of a particular formal analysis. Fortunately, many of the underlying mathematical models that support these causal analysis techniques can also be incorporated into software tools. There is also considerable potential for developing graphical and tabular representations that can be used to communicate the more formal aspects of a causal analysis.
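To give a flavour of the Bayesian approach, the following sketch updates an investigator's subjective belief in a causal hypothesis using Bayes' rule. All of the probabilities are hypothetical; they merely illustrate how evidence revises a prior degree of belief.

# Prior belief that deicing procedures were inadequate - assumed value.
prior = 0.3
# Likelihood of the evidence (ice seen peeling off on the takeoff roll)
# under each hypothesis - both assumed values.
p_evidence_if_true = 0.9
p_evidence_if_false = 0.2

p_evidence = prior * p_evidence_if_true + (1 - prior) * p_evidence_if_false
posterior = prior * p_evidence_if_true / p_evidence

print(f"posterior belief in the causal hypothesis: {posterior:.2f}")  # 0.66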
This chapter went on to describe attempts to validate some of the causal analysis techniques that we have described. Van Vuuren conducted bottom-up studies that involved the implementation of the PRISMA approach within several different industries [844]. He was then able to perform a comparative analysis of the different roles that organisational factors played in a variety of contexts. He did not, however, perform a detailed analysis of investigators' experiences in applying the causal analysis technique. In contrast, Benner provided a generic set of criteria that can be applied in a top-down manner to assess different accident models and methodologies [73]. By extension, these same criteria might also be applied to assess different approaches to causal analysis. He relied largely upon his own subjective assessments. Munson, therefore, recruited an expert panel of firefighters to apply similar criteria to a case study that had been analysed using Fault Trees, STEP and Barrier Analysis [552]. He was able to replicate results that suggested there were strong subjective preferences for STEP and Fault Trees over Barrier Analysis. Unfortunately, this study did not demonstrate that potential investigators might be able to apply any of these techniques themselves. McElroy, therefore, combined elements of Van Vuuren and Munson's approaches when he asked a panel to apply the causal trees and Eindhoven Classification Model of the PRISMA technique [529]. This study revealed striking differences between the manner in which some people have proposed that causal analysis techniques should be used and the way in which investigators might actually use them in the early stages of an investigation. Rather than performing a detailed and careful analysis of the causal factors leading to an incident, the participants used these techniques to sketch high-level causes. They were less concerned with the consistency and sufficiency of an argument than with providing a clear overview of the incident itself. This, in part, reflects the important point that causal analysis techniques may have to play a variety of different roles during different stages of an investigation. Our analysis of previous attempts to validate causal analysis techniques has revealed how little we know about the comparative strengths and weaknesses of these approaches. We know from recent reports that current techniques are failing to support investigators' tasks in many industries [482, 453]. This is another area that requires considerable further applied research so that practitioners can have greater confidence in the methods that are being proposed. The importance of this point cannot be overstated. Several research teams are currently developing `systemic' approaches to the causal analysis of incidents and accidents. These techniques are intended to address the challenges that are posed by the failure of increasingly complex, tightly coupled systems. Unfortunately, less attention has been paid to the problem of demonstrating the practical benefits of these techniques than is currently being invested in their development. It is worth emphasising that increasing complexity is one of several challenges that must be addressed by novel analysis techniques. They must also be proof against the sources of bias that influence the findings of many reports. Ultimately, it is not enough to show that an analysis technique can `correctly' identify the causes of an incident. It must also be demonstrated that it cannot easily be used to identify `incorrect' causes. This is a significant burden given the many different forms of bias that might affect a causal analysis:
1. Author bias. This arises when individuals are reluctant to accept the findings of any causal analysis that they have not themselves been involved in.

2. Confidence bias. This arises when individuals unwittingly place the greatest store in causal analyses that are performed by individuals who express the greatest confidence in the results of their techniques. Previous work on eye-witness testimonies and expert judgements has shown that it may be better to place greatest trust in those who do not exhibit this form of over-confidence [223, 759].

3. Hindsight bias. This form of bias arises when investigators criticise individuals and groups on the basis of information that may not have been available to those participants at the time of an incident. More generally, it can be seen as the tendency to search for human error rather than deeper, organisational causes in the aftermath of a failure.

4. Judgement bias. This form of bias arises when investigators perceive the need to reach a decision within a constrained time period. The quality of the causal analysis is less important than the need to make a decision and act upon it.

5. Political bias. This arises when a judgement or hypothesis from a high-status member commands influence because others respect that status rather than the value of the judgement itself. This can be paraphrased as `pressure from above'.

6. Sponsor bias. This form of bias arises when a causal analysis indirectly affects the prosperity or reputation of the organisation that an investigator manages or is responsible for. This can be paraphrased as `pressure from below'.

7. Professional bias. This arises when an investigator's colleagues favour particular outcomes from a causal analysis. The investigator may find themselves excluded from professional society if the causal analysis does not sustain particular professional practices. This can be paraphrased as `pressure from beside'.

8. Recognition bias. This form of bias arises when investigators have a limited vocabulary of causal factors. They actively attempt to make any incident `fit' one of those factors irrespective of the complexity of the circumstances that characterise the incident.

9. Confirmation bias. This arises when investigators attempt to interpret any causal analysis as supporting particular hypotheses that existed before the analysis was completed. In other words, the analysis is simply conducted to confirm their initial ideas.

10. Frequency bias. This form of bias occurs when investigators become familiar with certain causal factors because they are observed most often. Any subsequent incident is, therefore, likely to be classified according to one of these common categories irrespective of whether the incident was actually caused by those factors [394].

11. Recency bias. This form of bias occurs when the causal analysis of an incident is heavily influenced by the analysis of previous incidents.

12. Weapon bias. This form of bias occurs when causal analyses focus on issues that have a particular `sensational' appeal. For example, investigators may be biased either to include or to exclude factors that have previously been the focus of press speculation. Alternatively, they may become fixated on the primary causes of an incident rather than the secondary causes that may determine its severity. For example, an investigation may focus on the driver behaviour that led to a collision rather than the failure of a safety-belt to prevent injury to the driver. This is a variant of the weapon focus that is described by studies of eye-witness observations of crime scenes [758].
The elements of this list illustrate the point that the success or failure of a causal analysis technique is, typically, determined by the context in which it is applied. For example, investigators can (ab)use causal analysis techniques by constructing causal chains that support particular, pre-determined conclusions. Such practices can only be discouraged by peer review during the various stages of a particular technique and by offering investigators a degree of protection against the sources of bias listed above. It should also be emphasised that causal analysis techniques are only one component of an incident reporting system. We cannot, therefore, assess the success or failure of such a system simply in terms of the sufficiency and completeness of the causal analyses that it produces. Such a validation must consider the success or failure of the recommendations that are justified by any causal analysis. These issues are addressed in the next chapter.
Chapter 12
Recommendations

Chapter 6 described how operators must often make immediate recommendations in the aftermath of an incident. These are intended to preserve the short-term safety of application processes. Unfortunately, these immediate actions often exacerbate the consequences that they are intended to mitigate. Numerous potential problems can prevent an effective response to an incident. Inadequate training, poor situation awareness, time pressure, the lack of necessary information, inadequate system support and pressure to preserve levels of service can all impair operators' attempts to rectify an adverse situation. Chapter 6 also described a number of incident and emergency management techniques that are intended to reduce the impact of these factors. This chapter looks beyond the short-term recommendations that are made in the aftermath of an incident. Instead, the intention is to examine the range of techniques that have been developed to identify potential remedies for the various causes that can be extracted using the approaches introduced in Chapters 10 and 11.
12.1 From Causal Findings to Recommendations

A number of problems make it difficult to identify recommendations that reduce the likelihood or mitigate the consequences of future failures. The following paragraphs briefly summarise these problems. For example, there is a danger that investigators will continue to rely upon previous recommendations even though they have not proved to be effective in the past. Many authors have identified a form of `conservatism' that affects large and complex organisations. It can take significant periods of time for new solutions to be adopted even when there is a considerable body of evidence that indicates the efficacy of alternative remedies. There are several variations on this problem. Many accidents occur because previous incidents have resulted in recommendations that do not adequately address the causes of those incidents. Previous incidents provoke a range of different recommendations that reduce particular types of failure but which do not target underlying safety problems. Often these piecemeal recommendations avoid the expense or political involvement that are eventually committed in response to a subsequent accident. This can be illustrated by the US Central Command's investigation into a `friendly fire' accident at Udairi Range in Kuwait. Previous incidents had resulted in procedures that were intended to ensure that crews were prevented from deploying their weapons if there was any danger of them mistaking observation posts for potential targets. Four previous incidents involving similar close-support operations had led to a number of local remedies being taken to minimise any potential confusion. Range procedures for fixed-wing aircraft to ground operations were changed to restrict delivery of ordnance to within two kilometers of Bedouin camps. The target was also altered to decrease the chances of any confusion. A tower was also constructed to help distinguish an observation post. The rooftop of the tower was painted white with a red cross. All of these physical changes failed, however, to address the overall problem of ensuring that crews did not inadvertently deploy their weapons at an observation post. The report concluded that `despite four documented incidents in the past eight months, and attempts to improve conditions, observation posts and targets remain hard for pilots to see day or night' [824].
It is important that investigators should avoid arbitrary or inconsistent recommendations. Similar causal factors should be addressed by similar remedies. Of course, this creates tensions with the previous guidelines. The introduction of innovative solutions inevitably creates inconsistencies. The key point is, however, that there should be some justification for treating apparently similar incidents in a different manner. These justifications should be documented together with any recommendations so that other investigators, line managers and regulators can follow the reasons why a particular remedy was proposed. Different organisations have proposed radically different approaches to the influence of financial or budgetary constraints on the identification of particular recommendations. Some organisations, such as the US Army, have argued that `the board should not allow the recommendation to be overly influenced by existing budgetary, material, or personnel restrictions' [806]. Other incident reporting systems, such as the local hospital systems that have been mentioned in previous chapters, accept a more limited horizon in which any recommendations must `target the doable' [119]. The key point here is that investigators must ensure that their recommendations are consistent with the scope of the system. In the Army system, incident reporting is more open-ended, with the implicit acknowledgement that significant resources may be allocated if investment is warranted by a particular incident. In the local system, incident reporting is constrained to maximise the finite resources of the volunteer staff who run these systems. It is easy to criticise this constrained approach by recommending a more ambitious scope for the recommendations of a reporting system. It should be noted, however, that these systems have continued to introduce safety innovations for over a decade, without the national resources that are now being devoted to clinical incident reporting. It is clearly important that any potential remedies must not introduce the possibility of new forms of failure into a system. Of course, it is easier to state such a requirement than to achieve it. The implementation of particular recommendations can introduce new forms of working that may have subtle effects. Given the relatively low frequency of many adverse occurrences, it may only be possible to witness the safety-related consequences of those effects many months after particular recommendations have been introduced. The debate surrounding the concept of risk homeostasis provides numerous examples of such recommendations. Users may offset the perceived safety benefits of new regulations and devices against particular performance improvements. Cyclists who are compelled to wear safety-helmets may cycle faster than those who are not. Motorists who are provided with advanced braking systems may delay deceleration maneuvers [371, 370]. Others have conducted studies that reject the existence or the magnitude of such effects [532, 865]. The controversial nature of such studies not only indicates the difficulty of validating such effects, it also indicates the difficulty of ensuring that particular recommendations do not have any undesirable side-effects. As mentioned above, the relatively low frequency of many adverse occurrences makes it difficult to determine whether or not recommendations have any palliative effect upon the safety of a complex application. In consequence, the impact of many recommendations must be measured through indirect means.
For instance, it can be difficult to determine whether training operators in the potential causes and consequences of poor situation awareness can reduce the number of incidents that stem from this particular human factors problem. Simulator studies can, however, be used under restricted experimental conditions to show the short-term benefits of this training [864]. If such results cannot be obtained then there is a danger that the justification for any recommendation will be challenged. The support for particular remedies can be eroded as the salience of an incident fades over time. Further support for a recommendation can be elicited by repeating measurements that demonstrate the benefits of a particular approach. Unfortunately, these indirect measures can also be used to justify particular recommendations even when there is more direct evidence that casts doubt on the usefulness of a particular approach [410]. Previous sections have argued that incidents seldom recur in exactly the same manner. Future failures are often characterised by subtle variations in the causal factors that have been identified in previous failures. It is, therefore, important that recommendations are proof against these small differences. Similarly, recommendations must be applicable within a range of local working environments. They must protect against similar failure modes even though individual facilities may
exploit different technical systems and working practices. These problems are, typically, addressed by drafting guidelines at a relatively high level of abstraction. Safety managers must then interpret these recommendations in order to identify the particular remedies that might prevent future failures. If guidelines are too context-specific then it can be difficult for safety managers to identify those lessons that might usefully be transferred to their own working environment. There is, of course, a tension between identifying recommendations that protect a diverse range of systems and drafting recommendations that provide safety managers with the level of detail that is necessary to guide subsequent intervention. If recommendations are drafted at a high level of abstraction then there is a danger that different managers will choose to interpret those recommendations in ways that reflect arbitrary preferences rather than the local operating conditions mentioned above. This is illustrated by a recent US Army safety alert. Previous incidents had led to ground precautionary messages that recommended military personnel not to use commercial heaters in unventilated areas: `use of unflued or unvented heaters is inherently dangerous because they vent exhaust containing carbon monoxide into living spaces' [815]. However, many soldiers chose to disregard this generic warning and continued to use heaters inside tents. These were not well ventilated; the fabric was not intended to `breathe', and several soldiers died as a result. Recommendations are, typically, made by an investigator to the statutory body that commissions their work. These statutory organisations must then either accept the recommendations or explain the reasons why they choose to reject them. If a recommendation is accepted then the statutory or regulatory body must ensure that it is implemented. This division of responsibilities is apparent in the aviation [423], maritime [833] and nuclear industries [204]. There are, however, occasions when investigators must identify who will be expected to satisfy a requirement. For instance, the US Army requires that `each recommendation will be directed at the level of command / leadership having proponency for and is best capable of implementing the actions contained in the recommendation' [806]. Such recommendations encapsulate good practice. If investigators do not identify suitable proponents for a recommendation then there is a danger that it may eventually be passed back and forth between a number of different organisations [444]. These potential difficulties make it important that any recommendations are well supported by the products of causal analysis. If this is not the case then it can be difficult for investigators to justify why particular remedies were, or were not, advocated in the aftermath of an incident. The following section, therefore, identifies a number of requirements that are intended to ensure that the results of a causal analysis can be used to support subsequent interventions. A number of factors complicate the task of extracting appropriate recommendations from the findings of a causal analysis. For example, it can be difficult for investigators to assess the relative priorities of particular causal factors so that resources can be targeted towards effective forms of intervention. This task is further complicated because regional factors can reduce the impact of particular recommendations or, in extreme cases, can even negate any beneficial effects.
Similarly, it can be difficult to identify recommendations that offer long-term benefits rather than immediate or short-term palliatives. Finally, all of these problems are compounded by the difficulty of ensuring agreement amongst the diverse, multi-disciplinary groups that must concur with the recommendations that are produced by an investigation.
12.1.1 Requirements for Causal Findings

Many organisations deliberately `target the doable' by restricting the focus of their recommendations to changes that affect the teams which support and implement a reporting system. In general, however, recommendations affect many different groups within complex organisations. One consequence of this is that representatives of these diverse interests must participate in the identification of remedial actions. At the very least, they must consent to their implementation. This can create a number of pragmatic concerns. For instance, recommendations may be drafted to address causal findings that were identified using one of the analysis techniques introduced in previous chapters. The products of some of these techniques, including Why-Because Analysis (WBA) and non-deterministic models of causation, cannot easily be understood by non-mathematicians. It is, therefore, important that the findings of any causal analysis are translated into a form that is readily accessible to those who must participate in, and consent to, the identification of recommendations in the aftermath of an incident.
The following paragraphs summarise a number of further requirements that support the use of causal analysis techniques to guide remedial actions.
Summarise the nature of the incident

In order to understand the significance of the causal findings that guide particular recommendations, investigators must summarise the course of a safety-critical incident. The US Army's Accident Investigation and Reporting Procedures Handbook requires an explanation of `when and where the error, material failure, or environmental factor occurred in the context of the accident sequence of events; e.g., during preflight, while driving, etc' [806]. Information is also required to identify the individuals who were involved in an incident by their duty position or name. Components must be unambiguously denoted by a part or national stock number. Any contributory environmental factors must also be described. These requirements are often codified in the fields of incident reporting forms. Operators are required to state the `national stock number' of a failed component. They may also be asked to provide information about potential environmental factors, and so on. As we have seen, however, the information that is provided by an initial report can be supplemented by subsequent reconstructions. Chapters 8 and 9 introduced a number of techniques, including computer-based simulation, that can be used to model the course of an incident. These techniques underpin the causal analyses that were described in Chapters 10 and 11.
Summarise the causal findings

It is important that the products of any causal analysis are accessible to those without any formal training in the techniques that were used to derive them in the first place. This may seem like an unrealistic requirement given the underlying complexity of some causal models. It is, however, a prerequisite for ensuring broad participation in the identification and implementation of any subsequent recommendations. If this requirement is not satisfied then there is a danger that other investigators, safety managers, regulators or line managers will misinterpret the findings of any STEP analysis, PRISMA categorisation or Tripod modelling. It is for this reason that WBA employs a graphical form that can include natural language annotations in addition to the clausal forms of the more formal analysis. The US Army summarises the requirement to provide the following details:

"For human error, identify the task or function the individual was performing and an explanation of how it was performed improperly. The error could be one of commission or omission; e.g. individual performed the wrong task or individual incorrectly performed the right task. In the case of material failure, identify the mode of failure; e.g. corroded, burst, twisted, decayed, etc. Identification of the directive (i.e. maintenance / technical manual, SOP, etc.) or common practice governing the performance of the task or function. An explanation of the consequences of the error, material failure, or environmental effect. An error may directly result in damage to equipment or injury to personnel, or it may indirectly lead to the same end result. A material failure may have an immediate effect on equipment or its performance, or it may create circumstances that cause errors resulting in further damage / injury. Identification of the reasons (failed control mechanisms) the human, material, or environmental conditions caused or contributed to the incident. A brief explanation of how each reason contributed to the error, material failure, or environmental factor." [806]

This quotation is interesting for a number of reasons. In particular, it provides an abbreviated checklist of the causal factors that must be considered when analysing particular types of failure. For instance, any analysis must consider the particular mode that characterised a material failure. Such high-level guidance provides a lightweight means of combining the benefits of checklist approaches, such as MORT, with the more open forms of causal analysis encouraged by STEP and MES. The previous quotation also urges investigators to consider the control mechanisms that caused or contributed to an incident. This is interesting because it acts as a reminder to consider critical aspects of an analysis even if investigators choose not to exploit barrier analysis or the related concepts in Tripod.
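These requirements can also be read as a specification for the fields that any record of a causal finding should contain. The following sketch shows one possible encoding; the class and field names are hypothetical illustrations rather than part of the Army handbook.

from dataclasses import dataclass, field

@dataclass
class CausalFinding:
    task: str                  # task or function being performed
    error_mode: str            # e.g. "commission" or "omission"
    failure_mode: str          # for material failures: corroded, burst, ...
    governing_directive: str   # manual, SOP or common practice
    consequences: str          # direct or indirect damage / injury
    failed_controls: list[str] = field(default_factory=list)

finding = CausalFinding(
    task="pre-departure contamination check",
    error_mode="omission",
    failure_mode="",           # not a material failure in this example
    governing_directive="operator's cold-weather operations SOP",
    consequences="departure with contaminated flight surfaces",
    failed_controls=["ATC advisory", "pilot's visual check from the ground"],
)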
Explain the significance of causal findings

Chapter 7 introduced contextual details, contributory factors and root causes. Subsequent chapters have described a range of further distinctions that have been introduced both by researchers and by practitioners. These include concepts such as proximal and distal causes [701], particular and general causes [508], deterministic and stochastic causes [313] and so on. Irrespective of the precise causal model that is adopted, it is important that investigators provide some indication of the perceived importance of any particular causal finding. For instance, the US Army recommends that findings are categorised as `Found; Primary Cause, Found; Contributing, Found; Increasing Severity of Damage/Injuries, or Found; Not Contributing' [806]. As before, this recommendation acts as an important reminder to incident investigators. For example, the previous chapter briefly summarised the potential impact of weapon bias. Investigators can become fixated on the primary cause of an incident at the expense of secondary failures that increased the severity of any outcome. By explicitly reminding investigators to consider these factors, these guidelines encourage analysts to look beyond the driver behaviour that leads to a collision. They are, for instance, encouraged to identify the reasons why a safety-belt failed or why the emergency response was delayed. The same guidelines also encourage investigators to separate the presentation of primary causes from contributory factors by noting that `THE FINDING(S) LISTED BELOW DID NOT DIRECTLY CONTRIBUTE TO THE CAUSAL FACTORS INVOLVED IN THIS INCIDENT; HOWEVER, IT (THEY) DID CONTRIBUTE TO THE (SEVERITY OF INJURIES) OR (INCIDENT DAMAGES)' [806]. This quotation shows that it is important to consider not only the information that must be identified by any causal analysis but also the format in which that information is transmitted. The Army handbook requires that such `contributing' causes can easily be distinguished from the `primary' causes that directly led to an incident.
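The four categories named by the Army handbook can be captured as a simple enumeration, so that every finding carries an explicit statement of its perceived significance. A minimal sketch follows; only the category strings are taken from the handbook, while the type names are hypothetical.

from enum import Enum

class FindingCategory(Enum):
    PRIMARY_CAUSE = "Found; Primary Cause"
    CONTRIBUTING = "Found; Contributing"
    INCREASING_SEVERITY = "Found; Increasing Severity of Damage/Injuries"
    NOT_CONTRIBUTING = "Found; Not Contributing"

# Tagging findings makes it easy to separate primary causes from factors
# that only increased the severity of the outcome.
findings = {
    "driver inattention": FindingCategory.PRIMARY_CAUSE,
    "safety-belt failure": FindingCategory.INCREASING_SEVERITY,
}
primary = [f for f, c in findings.items() if c is FindingCategory.PRIMARY_CAUSE]
print(primary)  # ['driver inattention']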
Justify excluded factors

Not only is it important to explain the significance of those causal factors that did contribute to an incident, it is also necessary to explain why particular `causes' did NOT contribute to an adverse occurrence. Investigators must not only explain why recommendations address particular aspects of a system, they must also explain why those recommendations did NOT address other aspects of the system. These excluded causes fall into two categories. Firstly, there are those factors that did not cause or exacerbate this incident but which have the potential to cause future failures if left uncorrected. As before, the products of this form of causal analysis must be clearly distinguished from `primary' and `contributory' causes: `the findings and recommendations fitting this category will be separated from those that caused the incident or those that did not cause the incident but contributed to the severity of injuries / damage' [806]. There is, however, a second class of excluded `causal' factors that must also be considered in the findings of any causal analysis. These are the factors that might have caused, or exacerbated, an incident but which were considered not to be relevant to this or future failures. Without such justifications it is impossible for other investigators, for managers and for regulators to distinguish between those factors that were considered but rejected and those that were never considered in the first place.
Summarise the evidence that supports or weakens each finding

This book has repeatedly argued that investigators and analysts must justify and document the reasons why particular decisions are taken at each stage of their work. Without this additional information it can be difficult for other investigators, for regulators and for other statutory bodies to understand why an investigation proceeded in a particular manner. It can be difficult to follow the reasons why a secondary investigation was not initiated. It may be difficult to identify the factors that led investigators to commit resources for computer-based simulations in one incident and not another. Similarly, it can be hard to understand why resources were not allocated to support a detailed causal analysis. The US Army handbook recognises the need to justify the outcome of a causal analysis: "Each cause-related finding must be substantiated." [806] The cursory nature of this requirement is, perhaps, indicative of a wider failure to recognise the importance of such justifications. All too often, individuals and groups must endeavour to `re-live' their decision-making processes during the course of subsequent litigation.
A number of techniques can be used to document the justifications that support particular causal arguments. For instance, Chapter 9 introduced Conclusions, Analysis, Evidence (CAE) diagrams. These provide a means of linking the evidence that can be obtained in the aftermath of an incident to the arguments for and against a conclusion. These graphical structures are intended to provide a high-level overview of the justifications that support particular causal findings.
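The structure of a CAE diagram can also be sketched as a simple linked record: each conclusion points to the arguments for and against it, and each argument points back to its items of evidence. The classes and example content below are hypothetical illustrations of the idea, not a tool described in Chapter 9.

from dataclasses import dataclass, field

@dataclass
class Evidence:
    description: str

@dataclass
class Argument:
    claim: str
    supports: bool             # True if the argument favours the conclusion
    evidence: list[Evidence] = field(default_factory=list)

@dataclass
class Conclusion:
    statement: str
    arguments: list[Argument] = field(default_factory=list)

icing = Conclusion(
    statement="The aircraft departed with contaminated flight surfaces",
    arguments=[
        Argument("Ice was seen peeling off during the takeoff roll", True,
                 [Evidence("ATC observation recorded in the report")]),
        Argument("The pilot had cleared some snow before departure", False,
                 [Evidence("FBO statement cited in the report")]),
    ],
)
supporting = sum(a.supports for a in icing.arguments)
print(f"{icing.statement}: {supporting} supporting argument(s)")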
12.1.2 Scoping Recommendations

Causal findings help to guide the drafting of appropriate recommendations. The US Army handbook, cited in previous paragraphs, illustrates this relationship by advising that each finding is printed next to the remedy that has `the best potential' for avoiding or mitigating the consequences of future incidents [806]. As we have seen, however, it can be difficult to identify appropriate recommendations. In particular, interventions must be pitched at the correct level. They must be detailed enough to avoid ambiguity. They must present the organisational, human factors and systems details that are necessary if future incidents are to be avoided. They must not, however, be so specific that they fail to capture similar incidents that share some but not all of the causes of previous incidents. The following paragraphs briefly describe some of the more detailed issues that must be considered when attempting to identify an appropriate scope for the recommendations in an incident report.
By time...

Previous sections have identified important differences between the short-term recommendations that are made in the immediate aftermath of an incident and the longer-term remedies that are, typically, the outcome of more considered investigations. Immediate instructions to alter operating practices may be supplemented by regulatory intervention to ensure that those changes are backed by appropriate sanctions. It is important to recognise, however, that very few recommendations ever provide indefinite `protection' against future failures. In military systems this is best illustrated by the continuing problem of `friendly fire' incidents.
                World War II   Korea        Vietnam      Desert Storm/Shield
                1942-1945      1950-1953    1965-1972    1990-1991
Accidents       56%            44%          54%          75%
Friendly Fire   1%             1%           1%           5%
Enemy action    43%            55%          45%          20%
Table 12.1: Battle and non-battle casualties in the US Army [798].

Table 12.1 presents US Army figures for the changing impact of friendly fire incidents on army casualties in major combat operations since 1942 [798]. Such incidents, however, have a far longer history. One of the most (in)famous occurred in April 1863, when Robert E. Lee's Army of Northern Virginia attempted to halt the Union Army of the Potomac's advance across the Rappahannock River near Chancellorsville. Lee left a small force to contain Major General Joseph Hooker and sent the remainder of his strength with `Stonewall' Jackson to attack the Union flank. Jackson achieved considerable success and pushed ahead with a scouting party. As the party returned, they were mistaken for Union cavalry. Jackson was wounded and died soon after from complications that followed the amputation of his left arm [23]. Such incidents stem from a lack of situation awareness, often involving scouting parties and other advanced units. They also stem from the development of weapons that are effective at a range greater than that at which combatants can easily distinguish friend from foe.
Such incidents were often seen as the result of undue recklessness, or bravery, on the part of the individuals involved. In the years following the Civil War, greater emphasis was placed on the development of communications systems and protocols that were intended to improve combatants' understanding of their combat situation. For instance, rules of engagement were drafted to identify situations in which it was `safe' to engage a potential enemy. An example of such procedures can be found in the Rules of Engagement-Southeast Asia (U), JCSM-118-65, 19 February 1965 (Declassified 21 June 1988, NARA), which removed the US military's restriction against pursuit of Vietcong into Communist China. These required that hostile vessels could only be attacked in Vietnamese (RVN) or Thai territorial waters if they had been `attacking or acting in a manner which indicates with reasonable certainty an intent to attack U.S./friendly forces or installations, including the unauthorised landing of troops or material on friendly territory' or `engaged in direct support of attacks against RVN or Thailand'. Unfortunately, these new tactics and tools were not always successful in eliminating the problem of friendly fire. For example, the US Naval Institute published an account of an engagement during the Vietnam war. A B-57 from the 8th Bombardment Squadron attacked a US patrol boat after it had dropped its bombs on watercraft just north of the demarcation zone. Coordination between the 7th Air Force and the naval forces was particularly poor. The Commander-in-Chief, Pacific, later observed that `this incident is an apparent lack of tactical coordination between operational commanders'. The 7th Air Force investigation concluded that the vessel did not know the `correct MAROPS challenge/response for air to surface'. The patrol boat had `two means of identifying themselves to aircraft', using their running lights or by radio communications, but `the vessel did neither' [380]. `Friendly fire' accounted for some five percent of American casualties during Operation Desert Storm in 1991 [798]. These incidents often had similar causes to those in previous conflicts. Many stemmed from communications problems. Deployment information was not passed along the chain of command. Other incidents again revealed the disparity between the range and effectiveness of modern weapons systems when compared to battlefield communications equipment. Following the Gulf War, several initiatives started amongst allied armies to lessen the number of these incidents in future conflicts [747]. As can be seen, some of these initiatives focussed on new technologies. Others, however, have more direct parallels with the techniques that were proposed in previous conflicts:
"Systems that align with the weapon or weapon sight and are pointed at the intended target. The system `interrogates' the target: a reply identifies it as friendly, otherwise it is identified as unknown.
`Don't shoot me' systems use the Global Positioning System and other similar data sources. An interrogation is sent in all directions containing the targeted position. Friendlies present in that position return a `don't shoot me' response.
Situational awareness systems rely on periodic updates of position data to help users locate friendly forces.
Non-cooperative target recognition systems compute a signature using acoustic and thermal signals, radio emissions, and other possible data sources. The system compares the signature in its library database to characterise the target as potentially a friend, foe or neutral." [22]
A number of reasons explain the way in which similar hazards recur over time even though recommendations provide some immediate protection against a particular failure. For example, previous sections have cited research into risk homeostasis that examines whether or not users will sacrifice safety improvements in order to achieve other objectives. Car drivers may rely on advanced braking systems to save them from hazardous situations. A number of other potential problems can prevent previous recommendations from continuing to protect application processes. For instance, operators and managers may forget the importance of previous remedies as incidents and accidents fade from the `group memory' [633, 635]. This process of `forgetting' should not be underestimated given the relatively low frequency of many adverse occurrences. Organisational factors also intervene to increase the speed at which previous recommendations can be lost to those whose actions must be guided by them. The recommendations from less severe incidents may be lost more quickly than those of the `friendly fire' examples cited previously.
The Canadian armed forces have one of the most advanced health and safety infrastructures of any military organisation. Their computer-based General Accident Information System automates the submission and partial filtering of incident and accident reports [148]. These reports are summarised in Safety Digests that are similar to the Aviation Safety Reporting System's DirectLine publication, introduced in Chapter 5. They provide key personnel with `first-hand' accounts of previous incidents. They also communicate the recommendations from investigations in a relatively accessible manner. Such feedback does not always have the strong, long-term remedial effect that might be expected. For instance, there have been several incidents involving the transfer of fuel under pressure between various types of bowser [137]. These led military safety managers to stress that it is the "duty" of military personnel and civilian sub-contractors to refuse to engage in operations that jeopardise safety during peacetime [135]. Unfortunately, these recommendations have not had the impact that might have been hoped for:

"Training DND military/civilian personnel performing the transfer operation and those in the chain of command didn't have experience or training to safely conduct this non-standard fuel transfer. Basic/advanced fuel handling training for National Defence military/civilian personnel requires further in-depth evaluation. In the interim, 19 Wing is conducting enhanced local training. The applicable engineering publications governing the safe handling of fuels are outdated, not accessible to all personnel and appear to be technically inferior to industry standards and other Air Forces' publications." [140]

This quotation provides several detailed reasons why many of the recommendations from incident reports have a relatively short `shelf life'. Personnel may not have been provided with access to the initial information. They may have joined an organisation, or have been re-deployed within an organisation, well after the findings from an incident were published. Staff may also be employed by sub-contractors who were not informed of the recommendations that were identified from previous incidents. Conversely, the organisation itself may have fallen behind best practice in an industry. The previous quotation identifies that military procedures failed to meet civilian standards. The long-term effectiveness of particular recommendations can also be undermined by changes in working practices. These need not reflect deliberate neglect or a failure to communicate the importance of adopting particular remedies in the aftermath of previous incidents. In contrast, these changes can be forced upon personnel by the introduction of new technologies. Some recommendations that ensure safe fuel transfer from bowsers can also be applied to other fuel storage mechanisms, such as bladders [135]. For instance, it is important to ensure that hoses are hydrostatically tested in both situations. Other recommendations cannot be directly transferred in this way. For example, previous bowser fires have established the importance of using industry-approved flow rates for particular fuel types [140]. This recommendation has some relevance for bladder devices. However, the particular properties of bladder devices require that recommendations from previous bowser fires must be carefully reinterpreted if they are to protect operators using these containers.
Personnel must ensure that fuel is pumped at the bladder's pressure rating rather than at its maximum filling speed.

Changes in the operating environment can also undermine previous remedies. For instance, many military organisations responded to `friendly fire' incidents by implementing protocols, such as terms of engagement, that guide personnel on the actions to be taken before engaging a potential target. Battlefield communications systems have also been developed to help distinguish friend from foe. These remedies are tailored to meet the specific requirements of particular military organisations. It can be difficult to extend the same techniques to support joint operations by allied forces. For instance, there may not be the political support that is necessary to agree upon common terms of engagement. It is also rare for allied forces to share the same core communications technologies. One consequence of this is that joint operations often result in a large number of friendly fire incidents. Remedies that reduce incidents in particular scenarios may, therefore, not provide protection under changed operational circumstances. Changing working practices, changing operational contexts and changing technologies create considerable problems for investigators who must ensure that their recommendations continue to
protect the safety of a system and its operators. Several techniques have been proposed to reduce the impact of such changes on the remedies that are advocated by incident reporting systems. For example, investigators can explicitly specify the shelf-life of a recommendation. Any remedial actions need only be implemented until an end-date that is specified in the incident report. The regulatory or statutory body is then responsible for explicitly renewing any recommendation after its initial period of enforcement has expired. This approach has obvious disadvantages for any regulatory body that must constantly review a mass of relatively low-priority recommendations. An alternative approach is to require that organisations periodically update their safety cases to ensure that they conform to recommendations that have been made since their previous appraisal. This review also provides an opportunity for companies to argue that previous recommendations may no longer hold given new working conditions or technological innovations. This approach also suffers from a number of limitations. For example, it can be difficult to identify an appropriate renewal period. Alternatively, companies may be required to revise a safety case whenever new working practices or environmental conditions are introduced. Further technical difficulties complicate the task of updating a safety case, see for example [434].

Some incidents raise a variety of more `pathological' temporal issues that exacerbate the problems of drafting and implementing appropriate recommendations. For example, the Singaporean army has made a number of recommendations that have reduced the number of heat-related injuries reported in recent years:

"During the first two days of heat exposure, light activities would be appropriate. By the third day of heat exposure, 3 kilometer runs at the pace of the slowest participant are feasible. Significant acclimatisation can be attained in 4 to 5 days. Full heat acclimatisation takes 7 to 14 days with carefully supervised exercise for 2 to 3 hours daily in the heat. The intensity of exercise should be gradually increased each day, working up to an appropriate physical training schedule adapted for the environment." [741]

As mentioned, these recommendations have encouraged a general decline in heat-related injuries within the Singaporean defence forces since 1987. If we follow the argument that has been presented in previous paragraphs then it might be argued that greater concern should be devoted to other, potentially more pressing, safety recommendations. A number of factors have, however, combined to increase the salience of these recommendations. Since 1995 the army has continued to report approximately 3.5 cases per 1,000 soldiers. These cases are not evenly distributed across all units. Training schools continue to suffer the highest incidence of heat-related injuries as new soldiers transition from civilian life. The political and social impact of these incidents is exacerbated by the Singaporean army's continued use of enlistment. In consequence, the `shelf-life' or duration of a recommendation can be determined by a range of factors that may have relatively little to do with the relative frequency of particular incidents.

The previous example can be used to illustrate a number of further problems that complicate the task of drafting appropriate recommendations. For instance, the previous remedies are increasingly important at particular times in the year.
The Singaporean army reports the highest number of heat injuries in April and May. This reflects increases in heat and humidity during those months. As we have seen, the salience of particular recommendations can decline when they are not perceived to be important to an operator's immediate task. In consequence, safety managers must make particular efforts to reinforce the importance of these guidelines during March and early April. The complexity of drafting appropriate recommendations is further complicated by the bimodal distribution of these incidents within the day. Peaks occur in the reporting patterns from 08:00 to 09:59 hrs and from 16:00 to 17:59 hrs. These peaks straddle the interval between 11:30 and 15:30 hrs during which formal physical training is prohibited under Singaporean army regulations. Further complexity is introduced by the time-limits that determine appropriate mitigating actions. If an individual's heat exposure is less than 90 minutes then they should be offered plain, cool water during a recovery period. If the heat exposure exceeds 90 minutes then they should be offered a "cool, suitably flavoured carbohydrate-electrolyte beverage" with no more than 8%, or 2 table spoons, of sugar per litre [741]. If the soldier's heat exposure exceeds 240 minutes then they should be provided with a flavoured "carbohydrate-electrolyte beverage supplemented with one tea spoon of salt per litre".
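These duration thresholds amount to a simple decision rule. The following sketch, written in Python, is illustrative only: the function name is an assumption, and the treatment of exposures of exactly 90 or 240 minutes is a guess, since the guidance quoted above only distinguishes `less than' from `exceeds'.

def recovery_fluid(exposure_minutes):
    """Suggest a recovery beverage for the heat-exposure rule quoted from [741]."""
    if exposure_minutes < 90:
        return "plain, cool water"
    if exposure_minutes <= 240:
        # No more than 8% sugar (roughly 2 tablespoons per litre).
        return ("cool, suitably flavoured carbohydrate-electrolyte "
                "beverage (no more than 8% sugar)")
    # Beyond 240 minutes the guidance also calls for salt replacement.
    return ("flavoured carbohydrate-electrolyte beverage supplemented "
            "with one teaspoon of salt per litre")

for minutes in (60, 120, 300):
    print(minutes, "min ->", recovery_fluid(minutes))

Encoding the rule in this way makes the boundary cases explicit; the prose alone does not say which beverage applies at exactly 90 minutes.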
By place... The previous section has argued that it can be difficult to draft recommendations that continue to have a medium or long-term effect on the safety of an application. Memories fade, working practices change and technology is seldom stable beyond the short-term. In consequence, regulatory or statutory intervention may be required to ensure continued compliance. Investigators may also be forced to draft their recommendations so that they are `future proof' against these changes. Unfortunately, remedies that avoid reference to particular technologies and working practices are likely to be of little practical benefit. Lack of detail encourages ambiguity and safety managers may find it difficult to know how to implement remedial actions.

These problems are compounded by the need to ensure that recommendations can be implemented in many different working environments that are often distributed across many different geographical locations. This is an increasing problem given recent initiatives to increase the coverage of national and international reporting systems. For instance, the initiatives of individual airlines led to the development of the United States' Aviation Safety Reporting System (ASRS) in 1976. In the last five years this has, in turn, motivated attempts to establish a Global Aviation Information Network [308]. Similarly, medical reporting systems that were initially established in individual units within individual hospitals are now being extended to regional and national systems. For example, the UK's Royal College of Anaesthetists has introduced guidelines to encourage recommended practice in incident reporting within their specialism [715]. Both UK Government [633] and Presidential initiatives [453] have advocated the expansion of these systems beyond the local and regional levels.

It is important to emphasise that national and international initiatives to expand the geographical coverage of incident reporting systems do not remove the need to draft recommendations that focus on particular local needs. For example, the Canadian Commander of the National Support Element Services Platoon and of Camp Black Bear in Bosnia-Herzegovina reported that the following actions had been taken to address previous safety recommendations:

"First of all, we replaced three propane gas ranges in the kitchen and took steps to replace one tilting frying pan and procure another. The ranges in use were extremely old and beat up. In fact, the burners were cracked and the wire insulation was torn. Moreover, the plates inside the stoves had been removed, leaving the propane gas tubing unprotected. Hence, this equipment posed a serious risk to the people working in the food section. Our chief cook, Sgt Elément, is exceedingly proud of his new equipment." [143]

[Figure: new safety equipment in the kitchen at Camp Black Bear.]

The importance of such local recommendations and actions cannot be exaggerated. They provide immediate and direct feedback to the individuals and groups who contribute to incident reporting systems. This is particularly significant for work groups that perceive themselves to be isolated from administrative centres. The Canadian units in Bosnia-Herzegovina provide a good example of groups who most need to be reassured that their potential problems are receiving prompt and direct attention. There are also less obvious reasons for ensuring that recommendations address particular local concerns.
If remedies are couched in abstract terms that can be applied to many different contexts, they often lose the impact of more direct and locally relevant recommendations. For instance, the previous actions might instead have addressed a requirement to `review the safety of cooking appliances in all military camps'. Such generic recommendations can often be lost amongst the plethora of similar high-level guidance that is issued from `lessons learned' systems. The introduction of particular local details can, arguably, provide more salient reminders even though the exact circumstances are not replicated in other working environments. This is an important feature of the anecdotes and `war stories' that provide a critical learning resource for workers in a vast range of occupations. This analysis was confirmed during the interviews that helped to form the EUROCONTROL guidelines for incident reporting in air traffic management [423]. Many controllers specifically asked that details about specific airports and shift patterns should be left in both the causal analysis and the recommendations associated with individual incidents. They argued that this increased the perceived relevance of the analysis. These local details helped them to re-interpret
particular recommendations within the often different context of their own working environment. The controllers who were interviewed continued to support these arguments even after it was suggested that such local details might compromise the anonymity of some reports. It remains to be seen whether the same opinions would be expressed by those individuals who are involved in an incident.

It is possible to identify a paradox that affects the drafting and implementation of recommendations from incident reporting systems. The effectiveness of such systems depends upon identifying remedies that can be applied well beyond the scope of the application or working group that first identified a potential problem. On the other hand, drafting recommendations that can be applied beyond the context of a particular working group often implies that investigators must strip out the contextual details that help operators understand the significance of an incident. There is also a danger that, by expanding the scope of a recommendation, investigators will propose remedies for problems that do not exist beyond the boundaries of a local system. This is a significant concern given that each recommendation is likely to carry significant costs in terms of the time and money that may be required to implement it. There is a further danger that, by addressing these spurious recommendations, working groups may divert resources away from more critical remedies.

Many organisations, therefore, impose triggering conditions that must be satisfied before an incident is addressed by both local and regional recommendations. For example, the Canadian forces in the Former Republic of Yugoslavia investigated a total of 250 different topics during initial investigations into incidents during 1996. The `Lessons Learned Information Warehouse' of the Army Lessons Learned Centre identified 142 possible areas of study but only 34 of these subjects were common to the majority of operations and rotations. These were identified using the following heuristics. Firstly, if an issue was identified during the early stages of the army's involvement in the region but ceased to be reported in the latter stages then it was assumed that appropriate changes had been made and that no further recommendations need be made. This is a strong assumption. In other contexts, it would be necessary to seek further reassurance that the problems raised in such initial reports had been rectified; the last heuristic represents such an approach. This is a further illustration of the temporal problems that arise when investigators attempt to assess the `shelf-life' of a recommendation. Secondly, if an observation was made twice in the latter operations then it was retained as an issue for further investigation. The military reports do not state whether an issue would be retained if these reports were submitted by the same units on two different occasions. As with the previous heuristic, additional requirements might be necessary in other contexts to ensure that such generic issues did stem from more than one independent source. Finally, if an observation was made three times at any stage of the operations then it was retained as an issue. This increases the likelihood that high-frequency reports, even in the early stages of the operation, will be addressed during subsequent investigations [134]. A minimal sketch of these three heuristics appears below.
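The sketch, again in Python, represents each issue as a list of `early' or `late' stage tags, one tag per report; this representation and all of the names used are assumptions for the purposes of illustration, since the Canadian sources [134] describe the heuristics only in prose.

from collections import Counter

def retain_issue(stage_tags):
    """Decide whether an issue is retained for higher-level analysis.

    stage_tags -- one 'early' or 'late' entry per report of the issue.
    """
    counts = Counter(stage_tags)
    # Third heuristic: observed three times at any stage -> retain.
    if len(stage_tags) >= 3:
        return True
    # Second heuristic: observed twice in the latter operations -> retain.
    if counts["late"] >= 2:
        return True
    # First heuristic (and the default): issues that fail both tests,
    # e.g. those reported only in the early stages, are assumed to have
    # been corrected and are dropped.
    return False

print(retain_issue(["early", "early"]))          # False: assumed rectified
print(retain_issue(["late", "late"]))            # True: second heuristic
print(retain_issue(["early", "early", "late"]))  # True: third heuristic

Note that the sketch does not capture the open question raised above: whether two latter-stage reports from the same unit should count as independent observations.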
As mentioned, these criteria were used by the Canadian military as a means of filtering the 250 topics that were identified by individual reports from the units in the Former Republic of Yugoslavia. Only those issues that satisfied these various criteria were passed to the next level of analysis. In other words, they ceased to be considered as isolated examples that could be addressed by local recommendations. They were, in contrast, considered as more generic issues that required national or regional remedies. This distinction can be illustrated by the contrast between the general health and safety issues that affected Camp Black Bear in the previous quotations and the following issues that emerged from this higher-level analysis of more generic issues:
"Operations - Maps (Issue 26, page A-13): Although units had adequate map coverage, the two map scales in use (1:50,000 and 1:100,000) did not coincide. Reporting of a grid on one scale produced an error in plotting on the other due to a difference in data.
Operations - Mines (Issue 32, page A-16):
Mines were the single largest producer of casualties on operations in the Former Republic of Yugoslavia. All [reports] indicated that units conducted extensive mine awareness training prior to and during the operation. Despite this training, the vast majority of mine incidents were directly attributable to a lack of situational awareness, understanding risks and recognising
the indicators of a mined area." [134]

Many organisations operate similar filtering processes to those implemented by the Canadian defence forces. Incidents that are reported at a local level are collated and then assessed to determine whether they have regional or national significance. If an issue is considered to be sufficiently serious according to the filtering criteria, or the subjective judgement of the gatekeeper [423], then more generic recommendations are drafted. This, typically, implies a process of abstraction that strips out the contextual details that have been noted in the previous citations. This can be illustrated by the US Army's response to repeated incidents involving poor situation awareness: `many platoons continue to experience difficulty with situational awareness because they do not have a system in place to properly battle track and manage information' [800]. The perceived importance of this continuing problem ensures that any recommendation must be directed to all battlefield personnel rather than those who are engaged in regional operations. This contrasts with the previous quotation that focussed on incidents involving maps and mines that were specifically encountered by the Canadian element of the UN forces in the Former Republic of Yugoslavia.

Incidents involving poor situation awareness cannot be viewed as local issues because similar problems led to the friendly fire incidents mentioned earlier. An analysis of previous incidents determined that battle tracking in platoon command posts failed to provide squad leaders with necessary details about enemy locations, friendly unit dispositions and the current state of combat operations in their area. Squad leaders, in turn, rarely provided sufficient detail for platoon leaders to gain a clear understanding of the significance and context of their objectives. The US Army responded to repeated reports of similar incidents by directly considering situation awareness issues within its national training programmes:

1. "The platoons must provide brigade combat teams with the information necessary to have resolution of location, current status and missions of the Military Police units.

2. Military police platoons should be considered during the brigade combat team's clearance of fires drills. The platoon command post must track the current brigade operation to the resolution necessary to provide squad leaders with information to plan and conduct operations.

3. Prevent fratricide. The platoon command post must also disseminate and provide feedback on the Commander's Priority Intelligence Requirements and Critical Information Requirements. Platoon leaders must require squad leaders to submit timely situation reports and route reconnaissance reports." [800]

As can be seen, these recommendations have moved away from the specific problems encountered by particular units. They have also abstracted away from the more regional problems that are associated with a particular theatre of operations. In contrast, the recommendations are expressed in a generic manner that might be used to inform battlefield operations in any anticipated conflict. It is important to emphasise that investigators must be aware of the different strengths and weaknesses of recommendations that are pitched at a national rather than a local level. The benefits of extending the scope of any remedy are obvious. However, the sense of engagement that stems from addressing specific local concerns is difficult to obtain from this more generic approach.
The previous recommendations could be aimed at any combat platoon in the US Army. The insights gained from the analysis of particular incidents are distilled into a format that resembles standard training manuals and that lacks the immediacy of more local approaches. This effect is more readily apparent in recommendations that are intended to avoid administrative or financial `incidents'. For instance, the following quotation is taken from the US Office of the Assistant Secretary of the Army's proposals to avoid problems in the acquisitions process:

"People who show active hostility to changes are easy to spot and deal with. If they express their dissatisfaction honestly and openly, then their objections can be addressed. The agreement may be stronger for having resolved the points troubling such a person. Conflict resolution, after all, is one of the primary reasons for forming a partnering agreement. More difficult to deal with are those individuals who pay lip service to the partnering agreement while they quietly work against it. Their hostility is expressed with
subtlety through stubbornness, procrastination, and inefficiency. While the agreement encourages actively working to find solutions to problems, a passive-aggressive person does nothing to further the process. Quite the contrary, they do whatever they can to wreck it. Reassign such a person to a job where they cannot block progress." [801]

Such advice is deliberately pitched at a very high level of abstraction. In consequence, it can appear to be little more than `common sense'. It can also have potentially adverse consequences if offered to personnel who are faced with more direct and apparently pressing problems in their working lives [839].

The previous paragraphs have argued that there is a tendency for national and regional reporting systems to remove the local and contextual details that often increase the immediacy of particular incident reports. In consequence, recommendations can be seen as abstract requirements that have little relevance to more immediate problems. At worst, they can be resented as unwarranted impositions by external agencies that are intent on hindering the normal working practices of local teams. It is important to emphasise that this process of abstraction is not a necessary result of attempts to increase the scope of an incident reporting system. It is still possible to implement national and international systems that provide focussed information about detailed incidents. Unfortunately, this raises a number of fresh problems that must be addressed by investigatory and regulatory authorities. In particular, given that national and international systems may generate a large number of potential recommendations, it can be difficult to ensure that particular members of staff can easily access all relevant recommendations.

Several techniques have been developed to address this problem. Journals such as the ASRS' DirectLine or the Canadian National Defence forces' Safety Digest can publish information about individual incidents in `key areas' that are selected by the staff who are responsible for running the system. These publications are then distributed to appropriate members of staff. Unfortunately, this approach can be extremely costly. Paper-based publications must, typically, be distributed to many different regions. It also relies upon investigators to identify a subset of incidents that should be publicised at a national level. The difficulty of this task increases in proportion to the scale of the reporting system. Electronic publication techniques provide alternative means of providing key members of staff with access to the recommendations that affect their particular tasks. This avoids the problems associated with making an explicit decision to publicise only a small number of the insights that can be gained from a national reporting system. With appropriate tool support, this approach also avoids some of the overheads associated with updating and distributing paper-based journals. It has been successfully exploited by a range of armed forces [801, 148]. Preliminary steps have also been taken to extend this approach as a means of encouraging international cooperation. The ABCA Coalition Operations Lessons Learned Database is a notable example. This database was established in 1999 as a joint venture between the American, British, Canadian and Australian armed forces.
It was intended to "identify and resolve those key standardisation issues which would affect the ability of a military force, comprising two or more of the ABCA nations, to operate effectively and to the maximum ability of its combat power" [799]. The password-protected web-site provided users with the ability to perform full-text searches. They could also browse a full listing of documents by country of origin. Chapter 15 will describe some of the technological limitations that reduce the utility of this approach and will introduce a number of further solutions to these problems. For now it is sufficient to observe that there may be few guarantees that any particular member of staff will be able to access all of the recommendations that are relevant to their working tasks using computer-based systems.

It is important not to underestimate the problems that arise when attempting to draft recommendations that might usefully be applied across national boundaries. Previous chapters have argued that the increasing scope of an incident reporting system can result in a process of abstraction that hides contextual information. Cultural differences have the paradoxical effect of focusing international exchange almost exclusively on detailed technical issues. For example, it is difficult to translate previous advice on US Army acquisitions policies into cultures in which `apparent acceptance and covert opposition' are acceptable and even anticipated forms of disagreement [878]. It is for these reasons that the exchange of safety-related recommendations can yield deep insights into the alliances that exist between national organisations. Cultural similarities arguably explain the
United Kingdom's participation in the ABCA coalition rather than a coalition with other European defence forces.

The previous analysis has argued that the effectiveness of an incident reporting system can be increased if investigators increase the scope of its recommendations. This, typically, involves abstracting away from local, contextual details so that lessons can be applied by operators working in different regions and even different countries. It is important to stress, however, that there are some situations in which there is a deliberate policy not to exchange information about safety-related incidents. For example, the South African National Defence Force is still adjusting to the changes that were introduced when it was first made subject to both the Machinery and Occupational Safety Act, 1983 and the Occupational Health and Safety Act, 1994. For the first time, the Department of Defence has an explicit obligation to demonstrate compliance with the law, "or with the spirit of the law", in health and safety matters [708]. Prior to the end of apartheid, there was a deliberate political motivation to promote self-sufficiency. This implied a willingness to learn from the mistakes of others but did not imply a willingness on the part of many governments to share those lessons. Even in the post-apartheid era, there are limits to the free exchange of information in military and strategic matters. This was implied in the recent White Paper on South African Defence Related Industries:

"It is neither affordable nor necessary to strive for complete self-sufficiency in armaments production and all the technologies to support it. However, the South African National Defence Force requires that in certain strategic areas, limited self-sufficiency must be retained and maintained and that in others, the South African National Defence Force needs to remain an informed buyer and user of equipment" [26]

Similar tensions exist in the wider commercial and industrial environment. Organisations must balance their need to learn from the mistakes of others against the potential consequences of disclosing information about their own past failures and successes. Sharing the recommendations that emerge from such incidents may result in the loss of the competitive advantage that could otherwise be obtained from these insights. These tensions increase as recommendations are passed across geographical and organisational boundaries. Individual operators may see the benefits of sharing their insights with their fellow workers. Management may be less motivated to share those recommendations with commercial rivals. Ultimately, national political and strategic interests can intervene to prevent the exchange of insights from past failures.
By function... The previous section has identified some of the problems that arise when investigators draft recommendations that must be applied by colleagues who are not part of the working group that reported an incident. It can be difficult to draft generic remedies that can be applied by groups in other areas. A lack of specific details can remove the directness that characterises many local incident reports. In consequence, particular recommendations can appear to be impositions from external agencies that cannot easily inform the daily working lives of their recipients. These problems, which complicate the exchange of information within national boundaries, are further exacerbated by the cultural differences that exist between the partners of international systems.

These issues can be partly resolved if investigators ensure that recommendations are drafted to target specific functional issues. Less emphasis is placed on generalising from a particular incident so that it can inform a wide range of tasks that are performed throughout an organisation. Greater emphasis is placed on ensuring that similar incidents do not affect the future performance of the particular task that was affected by a previous incident. At the lowest level, task-based recommendations can be drafted to support particular working groups within particular units or factories. For example, the following excerpts are taken from the Picatinny Arsenal newsletter published for technicians working on the US 155mm M109A6 self-propelled howitzer, known as the Paladin:

"An inoperable driver's hatch stop inhibits the ability to properly secure the driver's hatch cover, forces non-operational vehicle status as prescribed in the Paladin's Operator's
Manual (TM 9-2350-314-10, Feb 1999, Page 2-70, Item 77), and could cause injury if not corrected. Yet, some Paladin personnel improperly use unauthorised field fixes to correct the problem and by doing so promote a potentially dangerous situation. Typically, problems begin when the Grooved, Headless, Pin (NSN 5315-00-584-1731) breaks (usually when the hatch cover is inadvertently swung open with more than necessary force). Then, rather than correcting the problem with authorised parts, a makeshift solution is applied to connect the hatch stop to its shaft. Poorly fitted cotter pins, nails, and similar devices have been used in place of the Fan Impeller. After continued use, damage to the hatch stop usually occurs causing the hatch stop assembly to become totally inoperable. Units using a Lessons Learned approach to the problem generally maintained a small number of the inexpensive Grooved Pins on hand ($1.78, April 00 Fedlog). When pins were damaged, broken, or became loose, they were quickly replaced. This practice precluded unnecessary damage and replacement of parts, but more importantly a higher degree of operational safety was maintained." [802]

These recommendations illustrate a number of important strengths that can be derived from task-based, local incident reporting. For example, this guidance assumes a high degree of common understanding about the nature of the systems being maintained. Although reference is made to the operator's manual and part identifiers, a range of technical terms such as `fan impeller' or `grooved pins' can be used without further elaboration. These recommendations dispense with the additional contextual information that is necessary for recommendations that have a wider scope beyond local working groups. Similarly, there is little need to expand on the details of previous violations. It is sufficient to summarise the `makeshift solutions' for readers to understand the nature of the incidents that are being addressed.

The previous quotation also illustrates some of the weaknesses that limit the utility of such task-based approaches to incident reporting. In particular, the recommendations tend to focus on `cheap fixes' rather than the large-scale investment that may be required to address more systemic failures. Elsewhere, we have reviewed the way in which many aviation and medical incident reporting systems will repeatedly remind staff to `do better' rather than invest resources in addressing the conditions that led to particular failures [409]. The tendency to rely upon short-term measures is particularly apparent when recommendations are targeted on the tasks or activities performed by individual groups of workers. We can define a task to be the activities that are required to achieve a particular set of goals [686]. The previous quotations, therefore, examined a very specific and detailed task from the perspective of US Army engineers at the Picatinny Arsenal. This task-focussed approach can also be applied to national and international objectives. When failures occur at this level, the proposed remedies tend to avoid the short-term solutions that typify more local initiatives.
This can be illustrated by the United Nations recommendations that were compiled from detailed incident reports and interviews with the personnel who contributed to the peace-keeping missions in Somalia:

"The evaluation noted many troop contributors' complaint that they were not sufficiently consulted during the formulation stage of the mandate and, thus, had varying perceptions and interpretations during its execution. Many participants in the exercise considered that the original UNOSOM mandate was formulated on political, humanitarian and military assessments, and was prepared, using insufficient information, by officers borrowed for short periods from Member Governments and other peace-keeping operations. Some participants observed that although it was well known that a crisis was unfolding in Somalia, its seriousness and magnitude in humanitarian terms were not fully appreciated." [624]

It is important to emphasise that even though international organisations may take a more systemic view of the causes of particular incidents, there is no guarantee that they will be able to solve the problems that complicate high-level tasks such as peace-keeping. Sadly, this point is reinforced by the United Nations' Department of Peace-keeping Operations review of the Rwanda missions. It was argued that many of the problems and incidents reported by the peace-keeping forces stemmed from a `fundamental misunderstanding' of the nature of the conflict [625]. Analysis of individual incidents raised concerns
that `the internal political conflicts within the Government of Rwanda, and the mounting evidence of politically motivated assassinations and human rights violations in the country, were ignored or not explored'.

Previous sections have argued that task-focussed recommendations often lead to short-term fixes that ignore more systemic problems. The United Nations recommendations have, however, provided a counter-example. This apparent contradiction can be explained by the very different nature of the tasks that we have considered. Clearly, there are considerable differences between maintaining a driver's hatch and coordinating international peace-keeping operations. It might, therefore, be argued that it is the combination of task-focussed recommendations within a local reporting system that tends to lead to short-term remedies. Task-focussed recommendations at a national or international level are less susceptible to this problem. Previous studies have shown, however, that large-scale systems are far from immune [409]. There is still an understandable tendency to recommend improved training rather than reassess acquisitions policy.

A number of further limitations affect recommendations that reflect this task-focussed approach to incident reporting. In particular, there is a danger that investigators will fail to consider the importance of particular incidents within the context of a larger operation or production process. For instance, the previous recommendation does not consider the possible acquisition or training problems that led staff to adopt `makeshift solutions' in the first place. An alternative approach is to embed task-specific recommendations within longitudinal accounts of particular operations. This approach weaves together the findings from a number of different incidents. Any individual may only be directly involved in a small number of the tasks that contribute to the overall operation. However, this longitudinal approach enables them to see how incidents that occur earlier in a process have `knock-on' effects for their own tasks. It also demonstrates the effects that potential failures in their tasks can have upon the subsequent activities of their colleagues.

This `process-based' approach can be illustrated by the US Army Engineer Group's analysis of bridging operations. This draws together diverse recommendations from many different stages in a particular bridging operation. Initially, a small `S3' group compared the tools that the 1st Cavalry Division and the 937th Engineer Group would need to plan and control the operation. This planning exercise identified a number of limitations with current synchronisation techniques and new tools were developed based on `off-the-shelf' software. Task-focussed approaches to incident reporting might simply have presented these recommendations as isolated guidance on the synchronisation of river crossings. This approach is widely adopted in other areas of the US military [802]. In contrast, the engineers of the 937th extended their analysis to integrate it with the recommendations that emerged from the subsequent execution of their plans with the 1st Cavalry Division.
Although the tools enabled the engineers to calculate crossing times and schedules for both rafting and bridging, the eventual joint plans did not adequately address some of the fundamental problems that exacerbate the execution of such crossings:

"A bypassed Orangeland special-forces team on the near shore observed and directed accurate artillery and close air support to destroy the bridge. During the after-action review, it was determined that the critical friendly zones had not been set properly and that the high-to-medium-altitude air defence coverage was inadequate. This action demonstrated that clearing the near and far-shore lodgements is a tenuous and difficult task. One lone member of an opposing force with a radio is the most dangerous person in the crossing area. In an effort to take advantage of the surprise created by the virtually unopposed crossing at Kaw, the division accepted risk by not absolutely ensuring that the crossing site was secure from observed indirect fire before beginning bridging operations. This allowed the division to quickly cross two mechanised task forces but left the ribbon bridge at Kaw vulnerable." [463]

Chapter 3 introduced Perrow's argument that technological failures are unavoidable given that designers are forming increasingly complex interconnections between component systems. From this it follows that even if one organisation has implemented a particular recommendation, there is no guarantee that others will have met the same requirements. This has important implications because failures from one area of a system can propagate through an application to affect later processes that
might, themselves, meet the most stringent safety standards. This process-based approach to the presentation of recommendations, therefore, provides eloquent reminders of the mutual dependencies that exist between the component tasks of complex systems.

It is important to note, however, that there is still a strong functional bias to the engineers' recommendations. They focus on the particular challenges of the bridging operation and are written from the perspective of those who are tasked to construct and maintain the river crossing. The recommendations do not address the wider strategic significance of the crossings within the exercise as a whole. For example, there is only a cursory description of the problems that the other units in the 1st Cavalry Division faced in exploiting the opportunities, or addressing the threats, that were created at the various crossing points. In other words, the recommendations reflect the functional preoccupations of the engineering group. They propose solutions to incidents that jeopardised their particular tasks. Incidents that occurred elsewhere on the battlefield are not considered.

Again, it is important to stress that this example represents a far more general trend. For example, the US Army maintains a number of incident reporting systems that form part of `Lessons Learned' initiatives. These are organised along functional lines. In addition to the general Center for Army Lessons Learned there is a Center for Engineer Lessons Learned. As we have seen, there is a Contracting Lessons Learned Centre (http://acqnet.sarda.army.mil/acqinfo/lsnlrn/index.htm) and a Medical Lessons Learned unit. The Marine Corps also reflects these functional distinctions by operating separate systems for their combat personnel and for their maintenance staff. The obvious criticism to make of these systems is that there may be important lessons that cross functional boundaries. Some of these are addressed by the Joint Center for Lessons Learned, which covers joint forces operations. Other issues that cut across functional boundaries are captured by the US Army Safety Centre. This publishes the safety notices that have been mentioned in previous chapters. It should be stressed that the objectives of these systems are quite different. For example, the reporting systems maintained by the Safety Centre do not issue the sorts of functional recommendations, for instance on fording tactics, that might appear in the systems operated by the combat engineers. It is important to stress, however, that safety-related incidents and recommendations appear in all of these systems.

This section has argued that the process of drafting effective recommendations is complicated by the geographical scope, the timing and the functional focus of the proposed remedies. Some incidents provide universal insights that can be applied across many different workgroups in particular geographical regions. The Singaporean guidelines on heat injury provide an example of this form of recommendation. Other remedies, such as the proper insertion procedures for the Paladin hatch pin, relate to specific workers performing specific tasks in a few locations throughout the world. It should be stressed that these are not the only distinctions that complicate the drafting of recommendations from incident reports. For example, there are some notable situations in which potential remedies for previous incidents will not be acceptable to both genders.
This is illustrated by the guidance provided in the US Army's `Female Soldier Readiness' [845]. These distinctions have important consequences and investigators must carefully consider their impact on any potential recommendations. For example, a mailshot about the dangers of heat exhaustion may have limited benefits for US Army personnel working at Fort Wainwright in Alaska. Similarly, information about the Paladin hatch mechanisms is of little interest to combat engineers engaged in bridge construction. These geographical and functional distinctions have a profound impact upon many reporting systems. For example, the following list summarises the current titles published in the US Army Safety Centre's leadership guides. As can be seen, some documents provide recommendations that apply to particular geographical regions, including Southwest Asia, Korea, Iraq and the Caribbean. Others relate to particular functions, such as force protection and civilian work force management:
Annual Training Leader's Safety Guide
Back Injury Prevention Leader's Safety Guide
Caribbean Risk Management Leader's Guide
Civilian Work Force Leader's Safety Guide
Desert Shield Leader's Safety Guide
Force Protection Leader's Guide
Korea Leaders Safety Guide
Operation Provide Promise Risk Management Leader's Guide
Operation Support Hope Risk Management Leader's Guide
Redeployment & Port Operations Leader's Safety Guide
Southwest Asia Leaders Safety Guide
This list reflects one of the simplest ways in which information can be structured so that users can identify which recommendations are most relevant to their everyday tasks in a particular working environment; a small illustrative sketch of such region- and function-based tailoring appears at the end of this subsection. Many organisations have developed far more elaborate means of collating and disseminating recommendations. For instance, previous paragraphs have briefly described the US Army's plethora of `lessons learned' systems. This mixed approach to the dissemination and implementation of recommendations will be the focus of Chapter 14. For now it is sufficient to realise that these diverse information sources provide means of tailoring the presentation and dissemination of recommendations so that they support particular user groups. Unfortunately, they can also create artificial barriers that prevent the free exchange of information about similar incidents between different groups in the same organisation. If any reader believes that these problems are unique to the US Army, it is worth considering the functional and geographical distinctions that are appearing within many healthcare systems. In the UK, different medical specialisms are developing their own guidelines on incident reporting. The Royal College of Anaesthetists' guidelines are, arguably, the best known [715]. At the same time, individual hospitals, NHS trusts and national schemes are all being developed in parallel. This is hardly a situation designed to inspire confidence in the free exchange of information across different organisational boundaries [633].
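The geographical and functional tailoring reflected in the previous list of guide titles can also be made explicit in software. The following Python sketch is an illustration only: the data structure, the field names and the matching rule are assumptions rather than any system's actual implementation.

from dataclasses import dataclass

@dataclass
class Guide:
    title: str
    regions: tuple      # e.g. ("Korea",); empty means all regions
    functions: tuple    # e.g. ("force protection",); empty means all functions

GUIDES = [
    Guide("Korea Leaders Safety Guide", ("Korea",), ()),
    Guide("Caribbean Risk Management Leader's Guide", ("Caribbean",), ()),
    Guide("Force Protection Leader's Guide", (), ("force protection",)),
    Guide("Civilian Work Force Leader's Safety Guide", (), ("civilian work force",)),
]

def relevant(region, function):
    """Select the guides that match a reader's region or function."""
    return [g.title for g in GUIDES
            if (not g.regions or region in g.regions)
            and (not g.functions or function in g.functions)]

# A force-protection officer stationed in Korea would receive:
print(relevant("Korea", "force protection"))

Empty tags mark a guide as universally applicable; a reader's region and function then select the subset of recommendations that are worth distributing to them.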
12.1.3 Conflicting Recommendations

Previous sections have argued that the task of identifying appropriate recommendations is complicated because any remedies may have to support a range of tasks that are performed by different operators in many geographical regions. It is further complicated by the ways in which working practices, procedures and technological systems change over time. Recommendations may, therefore, have to be continually updated if they are to continue to support the safe operation of complex applications. The task of drafting recommendations is further complicated by potential disagreements between investigators, safety managers and national organisations. It is possible to identify at least three different forms of such potential conflict. Firstly, investigators may disagree about the remedies that are appropriate for superficially similar incidents. Secondly, investigators and their managers may disagree over the recommendations that emerge from a particular incident. Finally, safety managers can disagree over the interpretation of particular recommendations. The following paragraphs describe these problems in further detail and provide case studies to illustrate their impact on a number of incident reporting systems.
Different Recommendations from Similar Incidents

Different recommendations are often proposed for incidents that have strong apparent similarities [409]. For example, the US Army's `Countermeasure' provides military personnel with feedback about a range of safety-related incidents. The following incidents appeared in successive issues of this journal. Together they describe two fatalities that resulted from tank drivers using excessive speed during hazardous maneuvers. In the first incident, rapid lateral momentum over a steep slope helped to overturn a seventy-ton M1A1. The recommendations paid particular attention to the position of the crew during this incident:

"Once again, human error became a contributing factor in the loss of a soldier. Leaders must ensure that they and their crew members are positioned correctly in their
vehicles and are taking advantage of all safety features. The nametag defilade position increases your ability to lower yourself safely inside the vehicle and prevents excessive exposure of body parts to the elements outside. Seatbelts (if provided), guards, clothing and securing equipment enhance your survivability if your vehicle should happen to invert or strike a solid object." [816]

In the second incident, the driver of an M551A1 inadvertently drove into their excavated fighting position so that their vehicle also overturned. Again, the fatality resulted from crush injuries sustained by a soldier who was standing in the hatch, above the nametag defilade, in the vehicle. In contrast to the previous incident, however, the recommendations did not address the US Army's requirement that all personnel must assume a correct, secured position within any combat vehicle [809].

This apparent difference between the recommendations from two similar incidents can be explained in a number of ways. Firstly, although these incidents resulted in similar outcomes and shared several causes, there were also important differences. In the former case, the incident occurred during a daytime exercise. In the second case, the personnel were operating using night vision devices. The recommendations, therefore, focused on the additional requirements for working with limited visibility rather than the requirement to obey seating regulations for combat vehicles. Such differences are not the only reasons why superficially similar incidents might elicit very different remedies. As we have seen in Chapter 11, there are a host of individual and social biases that can affect the analysis of individual failures. These biases make it likely that different investigators will identify different causes for similar incidents. Such problems are further compounded when recommendations are identified in an ad hoc manner without the support of any shared methodology. Later sections in this chapter will describe techniques that have been specifically developed to reduce such apparent differences between the analyses of similar incidents.

A number of further reasons help to explain why investigators derive different insights from similar incidents. New evidence can encourage analysts to revise previous advice. Investigators may also change their recommendations to focus operator attention on particular causes of subsequent incidents. This provides an important communication tool. Over time, the hope is that the readers of Countermeasure and similar journals will learn to recognise the diverse causes of many safety-critical failures. The changing emphasis of particular recommendations can also reflect changes in particular forms of risk assessment. They may signify a decision to focus more on limiting the consequences of an incident than on reducing incident frequencies. For example, previous speed-related collisions involving combat vehicles had led the Army Safety Centre to reinforce the importance of enforcing recommended speed limits. Subsequent articles focussed more on protective measures that might mitigate the consequences of any collision if a speed-related incident should occur.

It is, therefore, possible to distinguish between inadvertent and deliberate differences between the recommendations that are derived from superficially similar incidents. Inadvertent differences stem from the managerial problems of ensuring consistency between the remedies that are proposed for complex events.
Investigators often rely on ad hoc methods and do not share the common techniques that might encourage greater agreement. Later sections in this chapter introduce a number of techniques that are deliberately intended to ensure that similar recommendations are derived from similar incidents. However, the large number of incidents that must be investigated by national and international systems makes it unlikely that such tools will ever provide an adequate solution to this problem. In consequence, Chapter 15 describes a range of search and retrieval software that can be recruited to improve quality control in this domain. As we have seen, some investigators may deliberately introduce differences between the recommendations that are intended to resolve similar incidents. These differences may stem from individual or group biases that can compromise the value of any subsequent remedial actions. Alternatively, deliberate differences may reflect a policy of gradually exposing operators to the underlying complexity of the causes that characterise many incidents. These differences may also reflect the previous success of a system in addressing some of the causes of similar failures. Conversely, they may reflect an apparent failure to address the causes of an incident. Investigators may subsequently focus attention on mitigating the consequences of particular failures. Whatever the justification, it is important that analysts consider presenting the reasons for such apparent differences. For example,
the issue of Countermeasure dealing with the second incident, described above, deliberately informs its readers that the incidents in the special edition present the diverse range of causal factors that contribute to `night vision' incidents. This justifies and explains why different recommendations are made after each incident is described. Unfortunately, such contextual explanations are often omitted so that readers have no means of distinguishing deliberate differences with benign explanations from inadvertent differences, or from differences that are due to the deliberate bias of particular investigators.
Debate Between Investigators and Higher-Level Administration

Previous paragraphs have argued that it is difficult for investigators to ensure that their recommendations are consistent with those of their colleagues. These problems are exacerbated when analysts deliberately choose to emphasise certain aspects of an incident in their findings. It is also difficult to over-estimate the problems that arise from the sheer scale of many national and international systems. Analysts must ensure consistency between thousands of different reports. As we have seen, differences can arise between recommendations for similar failures. They can also stem from different interpretations of the same incident.

One important potential source of dispute stems from the nature of the recommendation `process' itself. It should be apparent from the use of the term `recommendation' that these findings are usually recommended by investigatory organisations to a supervising body. For instance, the findings of US Coast Guard reports are typically passed from an individual investigating officer via the Officer in Charge of Marine Inspection to the Commander of the relevant Coast Guard District. Australian Military Boards of Inquiry present their findings to the Minister of Defence and the Federal Government. This process of recommending corrective actions creates the opportunity for disagreement. The Australian Minister of Defence may reject some of the findings made by a Board of Inquiry. Similarly, the Commander of a Coast Guard District may present his reasons for choosing not to implement the findings of an investigating officer.

More elaborate mechanisms are also used to approve the recommendations from accident and incident investigations. For instance, the Investigating Officer's report into the terrorist actions against USS Cole was endorsed by the Commander of US Naval Forces Central Command, by the Chief of Naval Operations and by the Commander in Chief of the US Atlantic Fleet. They `must approve findings of fact, opinions and recommendations' [836]. Each of these endorsements occurred in a specified order. The Commander of US Naval Forces Central Command provided the initial endorsement, the Commander in Chief of the US Atlantic Fleet was second and the Chief of Naval Operations was last. Subsequent reviewers could comment not only on the report itself but also on the opinions of their colleagues. Most of the comments supported the findings of the investigation. For example, the Chief of Naval Operations stated that "after carefully considering the investigation and the endorsements, I concur with the conclusion of the Commander in Chief, US Atlantic Fleet, to take no punitive action against the Commanding Officer or any of his crew for this tragedy" [836]. There were, however, some disagreements over particular recommendations. There were also disagreements between the endorsing officers. For example, the Commander in Chief of the US Atlantic Fleet observed that:

"The Investigating Officer and the First Endorser fault the Commanding Officer, USS Cole for deviating from the Force Protection Plan he had submitted to his superiors in the chain of command. The Investigating Officer states that had these measures been activated, the attack `could possibly' have been prevented. I disagree with this opinion, given that those measures would have been inadequate against attackers who were willing to, and actually did, commit suicide to accomplish their attack.
I specifically find that the decisions and actions of the Commanding Officer were reasonable under the circumstances." [837]

Other organisations can be commissioned by the ultimate recipients of incident reports to monitor the recommendations that are proposed. For instance, the United States' General Accounting Office was commissioned by members of the Senate to review training-related deaths. The resulting analysis was not only critical of the recommendations for improving training safety but also uncovered problems with the basis on which those recommendations had been made:
"Our analysis revealed that six deaths categorised by the services as resulting from natural causes occurred under circumstances that could be related to training activities. These were primarily cardiac arrests that occurred during or shortly after the service members had performed required physical training exercises. A typical example of these was a Marine who died from cardiac arrest after completing a required physical fitness regimen. Although he had just completed 5 pull-ups, 80 sit-ups, and a 3-mile run, his death was not considered to be a training death, but rather was classified as a natural cause death." [286]

The Department of Defence responded, in turn, to defend the processes that had been used to investigate particular incidents and the recommendations that had been derived from them. In the final report, the General Accounting Office continued the dialogue by countering these comments with further points about the need to trace whether the recommendations that had been proposed were effectively monitored within individual units.

The complex nature of many incidents often creates situations in which organisations, such as the Department of Defence and the General Accounting Office, hold opposing views about recommendations to avoid future incidents. These conflicts can be difficult to arbitrate. Regulatory or governmental bodies often cannot resolve the differences that exist between the various parties that are involved in the analysis of safety-critical incidents. This point can be illustrated by the Canadian Army's Lessons Learned Centre investigation into their involvement in the NATO Implementation Force and Stabilization Force in Bosnia-Herzegovina (Operation Palladium). They analysed the individual incident reports that had been received during the initial stages of their involvement and made a systematic response to the recommendations that had emerged. The following quotation illustrates how it can be impossible to comply with the competing recommendations that can be made by the different parties who are involved in the analysis of specific incidents:

"(Reports from Units) ...Many units stated that the standard first aid training package (a holdover from Warrior training) lacks realism and that training should be oriented to treating injuries that would be sustained in combat. Many agreed that IV and morphine training were essential components to this training... `During the six months in theatre, no soldier had to give artificial respiration, treat a fracture or do a Heimlich manoeuvre. However, our soldiers did give first aid to 17 bullet-wound cases, 3 shrapnel-wound cases and 7 minefield cases (foot or leg amputated).' As the threat level dropped for latter rotations, unit comments on the need for IV and morphine training waned; there seems to be much debate on the usefulness and dangers of teaching this subject. All unit medical staff strongly recommended that it not be completed because of the inherent dangers that administering IVs or morphine entails...
(Army Lessons Learned Centre Observation) ...This issue can only be resolved at the highest levels of command in the Canadian Forces and a balance between operational imperatives and medical caution must be found." [129]
This quotation provides a detailed example of how it can be necessary to mediate conflicting recommendations. In this instance, the Army Lessons Learned Centre must arbitrate between operational requests for training in the application of morphine and the unit medical staff's concerns about the dangers of such instruction. This example shows how particular recommendations often form part of a more complex dialogue between investigatory bodies and the organisations who are responsible for implementing safety policy. The previous quotation also demonstrates that the political and organisational context of incident reporting systems has a strong influence on the response to particular recommendations. The Canadian Army's Lessons Learned Centre could not reconcile recommendations to expand the scope of trauma training with the medical advice against such an expansion. The fact that they felt uncomfortable making a policy decision about this matter provides an eloquent insight into the scope of the reporting system and the role of the Centre within the wider organisation. This is not a criticism of the unit. It would have been far worse if a particular recommendation had been adopted that compromised the reputation of the system or alienated groups who had contributed to the `lessons learned' process that is promoted by incident reporting.
It is, perhaps, unsurprising that the Lessons Learned Centre should pass such policy decisions to a higher level of authority. There can, however, also be disagreement at a governmental level. For example, the UK Defence Select Committee examines the expenditure, administration and policy of the Ministry of Defence on behalf of the House of Commons. As part of this duty, it monitors incidents and accidents within the armed forces. The following quotation is taken from the Defence Committee's report into the UK involvement in Kosovo. The first paragraph expresses the Committee's concern about a number of incidents involving Sea Harrier missile configurations. The second paragraph presents the government's response to the Committee's recommendations. The Committee's request for further monitoring is parried by the Government's observation that the problem is not as bad as had been anticipated:
"(Committee's recommendation): The resort to cannibalising front-line aircraft in order to keep up the deployed Sea Harriers' availability is clearly a matter to be taken up by the new joint Task Force Harrier's command. We expect to be kept informed of any continuing incidents of damage to the Sea Harrier's fuselage-mounted missiles. (Paras 153 and 176).
(Government response 57): The Joint Force Harrier is addressing these issues, and the Committee will be kept informed of developments. The problem of AMRAAM carriage in certain Sea Harrier weapons configurations is the subject of continuing in-service trials work, but trials since the potential problem was first identified, together with a longer period of time carrying the missiles, have shown the damage to be much less than feared, and containable within current stock levels and maintenance routines." [792]

This quotation again illustrates the way in which the response to particular recommendations can provide useful insights into the political and organisational context of many incident reporting systems. In this case, the government accepts the Committee's request to be informed of subsequent damage to the fuselage-mounted missiles. This acceptance is, however, placed in the context of continuing work on the platform and of the relatively small number of incidents that have been observed. The quotation, therefore, captures the Committee's inquisitorial role and the Government's concern to counter any comments that might be interpreted as politically damaging.

Previous sections argued that investigators must justify any differences between the recommendations that are drawn from similar incidents. Statutory or governmental bodies might also be required to explain why they support particular recommendations and reject others. This was illustrated by the detailed justifications that the US General Accounting Office provided in their rejection of US Army recommendations for training-related deaths. There are, however, situations in which governmental and regulatory bodies are forced to mediate between conflicting recommendations. The Canadian Army's Lessons Learned Centre could not resolve the apparent contradiction between advice for and against specific training in trauma medication. Under such circumstances, particular recommendations must be referred to a higher policy-making body if the position of the regulatory agency is not to be compromised. Even at the highest levels, however, it is important that governmental organisations explicitly justify their response to particular recommendations. For example, the UK Government accepted the Defence Select Committee's request for further information about incidents involving fuselage-mounted missiles. It was also careful to explain its response in terms of the most recent evidence about the frequency of such incidents. These explanatory comments can equally be interpreted as political prudence. This underlines a meta-level point: the response to particular recommendations often provides eloquent insights into the political and organisational context of an incident reporting system.
Correctives and Extensions From Safety Managers

The previous section described how differences arise between investigators and the regulatory or governmental organisations that receive their recommendations. Most of the examples cited above focus on high-consequence failures rather than the higher-frequency, lower-severity incidents that are the focus of this book. Similar differences of opinion can, however, be identified over the recommendations that are derived from these failures. These disputes can often be seen in the correspondence
that takes place after a report has been published. For instance, most military incidents and accidents are not directly related to either combat or combat training. A large proportion of work-related injuries stem from slips, trips and falls. Others are related to the road traffic incidents that affect the wider community. For this reason, the Canadian National Defence Forces' Safety Digest reported a number of recommendations that were based on several detailed studies of previous incidents [139]. The main proposition in this summary was that car buyers should balance the fuel economy of a vehicle against potential safety concerns. The report argued that the fatality rate for passenger cars increases by 1.1% for every 100-lb decrease in vehicle weight and that, in an accident between a Sport Utility Vehicle (SUV) and a car, the occupants of the car are four times as likely to die. Subsequent editions of the Safety Digest carried dissenting opinions from readers who disagreed with the recommendations drawn from previous incidents:

"(The report) infers that `bigger' is `better' for vehicle safety, encouraging readers to buy large automobiles, SUVs, or trucks by feeding their fears. I don't dispute the fact that the larger the vehicle, the higher the chances of occupant survivability in crashes. However, following (this) logic, our family car should be a multi-wheeled armoured fighting vehicle." [141]

The respondent cited studies in which front-wheel drive vehicles with good quality snow tires had outperformed all-season tire-equipped SUVs. They pointed to the problems of risk homeostasis and of decreased perception of risk in larger vehicles. Finally, the correspondent argued that the recommendations from the study of previous incidents should have focussed on motivating `drivers to be more alert, attentive and polite, to practise defensive driving techniques, and to avoid distractions (such as cell phones) and road rage' [141]. This dialogue illustrates the way in which publications, such as the Safety Digest, can elicit useful correctives to the recommendations that can be drawn from previous incidents.

Similar responses have addressed more fundamental misconceptions in safety recommendations. For instance, an article about the lessons learned from previous incidents involving electrical systems provoked correspondence that can be interpreted in one of two ways. Either the original recommendations failed to consider the root causes of those failures, as suggested by the respondent, or the respondent had misunderstood the original recommendations:

"(The report) may leave the erroneous impression that they have discovered new procedures to prevent these types of accidents. The simple fact is that management, supervisors and employees were in violation of numerous existing rules, regulations and safe work practices. Like so many others, this accident was the result of a chain of events which, if carefully examined, often includes all levels - workers, supervisors and management... I have reviewed thousands of accident reports ranging from minor to serious and yes, some fatalities. The vast majority of these reports identify the employee as the cause of the accident. However, study after study has shown that the root cause of accidents is usually somewhere in the management chain.
Unless management creates a safety culture based on risk management and unless supervisors instill this workplace ethos in their workers: `In my shop everyone works safely, knows and follows the rules, and has the right to stop unsafe acts,' and then enforces this view consistently, we will never break the chain and accidents will continue to occur." [136]

As mentioned, this correspondence might indicate that the original recommendations did not take a broad enough view of the causes of previous incidents. If subsequent enquiries concurred with this view then additional actions might be taken to ensure that investigators and safety managers looked beyond the immediate causes of electrical incidents. Alternatively, it might be concluded that the respondent had misunderstood the intention behind the original report. In such circumstances, depending on the nature of the reporting system, actions might be taken to redraft the recommendations so that future misunderstandings might be avoided. The previous response not only illustrates how disagreements can emerge over high-level issues to do with the recommendations in an incident report; it also demonstrates the way in which such feedback can challenge more detailed technical advice. The correspondent challenged `the recommendation that an electrical cane could have been used to effect rescue' during a particular incident [136]. Untrained personnel must not approach any
closer than 3.0 metres for voltages between 425V and 12,000V. The national recommendation for trained personnel is no closer than 0.9 of a metre. The respondent concluded that `the electrical cane shown (in the report) would clearly not be suitable for untrained personnel and only marginal for trained personnel in such a scenario' [136].

Both road safety and the precautions to be taken following electrical incidents are generic in the sense that they affect a broad range of industries. Incident reporting systems also reveal how particular `failures' can trigger more specialised debates that relate to particular safety issues. For example, the Canadian National Defence Forces' Safety Digest described a series of incidents that stemmed from the need or desire to directly observe particular forms of explosion. One incident occurred during a basic Engineering Officer training exercise. After a number of demonstrations by a tutor, each student prepared and destroyed a piece of ordnance. A student was injured when a fragment shattered a bunker viewport. The subsequent investigation found that the viewports were constructed using four-ply laminated glass. It was designed to withstand a blast equivalent to the detonation of 100 kg of TNT at 130 metres distance with less than 2% glass loss to the inside of the structure [144]. In this case, the glazing performed as designed. Unfortunately, some of the 2% of glass lost to the inside of the bunker lodged in the eye of a student. The recommendations from this analysis focussed on two areas. The first concerned the use of a sacrificial layer of polycarbonate material on the inside of the glazing, not simply to protect against scratches and damage on the external surface. The material would be `easy to replace when scratched, discoloured or UV degraded, and would provide a failsafe final protection for the viewer's eyes' [144]. The second recommendation focussed on the use of periscopes. The offset of the glass elements prevented fragment impact from translating to the viewing side of the optics; "one type of offset viewblock that is in plentiful supply is NSN 6650-12-171-9741 periscope, tank." [144]

These recommendations helped to trigger a more general discussion about the technologies that might help to reduce injuries caused by the use of viewports to observe explosions. One correspondent argued that the introduction of sacrificial layers compromised the utility of viewports in other applications. The increasing thickness of the glass `precluded observation'. In consequence, they recommended the use of video technology:

"I have seen technology advance to the point where miniature cameras now can be positioned in strategic locations with minimal exposure to blast and fragment impact. Should a lens suffer a direct hit, the replacement cost would be minimal. Lessons learned involving the Coyote vehicle in Kosovo revealed that crews used their digital video cameras to obtain a colour picture rather than relying on the vehicle's integral observation system with its limited monochrome rendering. Closed-circuit TV or a variation thereof permits easy zooming in from a safe distance. It would also be possible to view the demolition site on a number of screens and to record the process for other purposes, including training, slow-motion analysis, replays, and engineering. I have seen video camera lenses smaller than the tip of a pen (using fibre optics) for underwater or high-risk areas (pipeline)." [145]
Such debates can help to increase confidence in particular recommendations. Dissenting opinions and alternative views can be addressed in subsequent publications, either by revising previous recommendations or by rebutting the assertions made in critical commentaries on proposed remedies. There is a danger, however, that the results of such dialogues will be lost in many reporting systems. This would happen if the dissenting opinions were not explicitly considered during any subsequent policy decisions. It can be difficult to ensure that such dialogues are both reconstructed and reviewed before any corrective actions are taken. For instance, there is currently no means of reconstructing the thread of commentaries on previous incidents involving the direct observation of explosions. In consequence, safety managers must manually search previous numbers of the Safety Digest to ensure that they have extracted all relevant information. Search tools are available; however, Chapter 15 will describe how these might be extended with more advanced facilities that support the regeneration of threads of debate following from safety-critical incidents.
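To make the idea of `regenerating a thread' more concrete, the following fragment sketches one way in which such a facility might work, assuming only that each published item records the identifiers of the earlier reports or letters that it comments upon. The item identifiers and fields are hypothetical; this is an illustrative sketch rather than a description of any existing Safety Digest tool.

    from collections import defaultdict

    # Hypothetical records: each digest item carries an identifier and the
    # identifiers of the earlier items that it comments upon.
    items = [
        {"id": "viewport-incident-report", "cites": []},
        {"id": "sacrificial-layer-letter", "cites": ["viewport-incident-report"]},
        {"id": "video-technology-letter",  "cites": ["sacrificial-layer-letter"]},
    ]

    # Index replies by the item that they respond to.
    replies = defaultdict(list)
    for item in items:
        for target in item["cites"]:
            replies[target].append(item["id"])

    def print_thread(root, depth=0):
        """Depth-first walk that regenerates the thread of debate from a report."""
        print("  " * depth + root)
        for child in replies[root]:
            print_thread(child, depth + 1)

    print_thread("viewport-incident-report")

Even this trivial index would allow a safety manager to recover the chain of correctives described above without manually searching each issue; the practical difficulty, of course, lies in capturing the citation links in the first place.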
For now, it is sufficient to note that some organisations have devised procedures and mechanisms that are intended to explicitly introduce such debate into the production of incident reports. For instance, sub-regulation 16(3) of the Australian Navigation (Marine Casualty) Regulations requires that, if a report, or part of a report, relates to a person's affairs to a material extent, the inspector must, if it is reasonable to do so, give that person a copy of the report or the relevant part of the report. Sub-regulation 16(4) provides that such a person may provide written comments or information relating to the report. The net effect of these regulations is to ensure that dissenting opinions are frequently published as an appendix to the recommendations in the investigators' `official' report.

It is important to mention that the dialogues that are elicited by particular recommendations do not only play a positive role in challenging the proposed remedies for particular types of incident. They can also elicit praise that both motivates the continued operation of an incident reporting system and can encourage others to contribute their concerns. There may also be other, more specific, safety contributions. For instance, one respondent to the Canadian National Defence Forces' Safety Digest expressed `delight' at a report about explosives safety. They then went on to express their disappointment that there had not been any subsequent articles on explosives incidents in the five months since the report had been published; `Is the world of ammunition and explosives so safe that there is nothing else to write about?' [146]. The correspondent praises the previous article on the causes of explosives incidents. They are also concerned by the relatively low frequency of reports in this area that are summarised in the Canadian National Defence Forces' Safety Digest. This response, therefore, reflects pro-active attitudes both to the underlying safety issues and to the operation of the reporting system. Although such measured reactions are quite rare, they often indicate that an incident reporting system is in good health. If recommendations are challenged then, at the very least, there is direct evidence that they are being read by the intended audience. If respondents notice that certain types of incidents are under-represented then this can provide evidence of reporting bias. Such responses can also provide valuable feedback about the mechanisms that are used to publicise those recommendations that are derived from previous incidents.
The Dangers of Ambiguity...

Previous sections have argued that the task of drafting appropriate recommendations is complicated by the various correctives that can be issued to address perceived shortcomings in the remedies that are proposed in the aftermath of particular incidents. We have described how investigators often issue different recommendations for similar incidents. Such inconsistencies can be intended. For example, new remedies may be proposed if previous recommendations have proved to be ineffective. Differences between recommendations can also be unintended. Investigators may not be aware that an incident forms part of a wider pattern of similar failures. Previous sections have also described how regulatory bodies and higher levels of management issue correctives to the recommendations that are proposed by incident investigators. These correctives may directly contradict particular findings. They may also change the emphasis that is placed on particular remedies. Finally, we have argued that well-run reporting systems often elicit debates about the utility of particular recommendations. Operators and managers may propose ways in which previous remedies might be extended or tailored to meet changing operational requirements. They may also directly contradict the recommendations that have been proposed to address future failures.

The task of drafting effective recommendations is further complicated by the difficulty of ensuring that they can be clearly understood and acted upon by their intended audience. Chapter 14 will describe a range of paper and computer-based techniques that can be used to support the effective communication of particular recommendations. For now, however, it is important to emphasise that there must be stringent quality control procedures to help ensure that the advice that is presented to operational units is unambiguous. This raises an important issue. We have already argued that recommendations must, typically, be expressed at a high level of abstraction if they are to inform the safety of a wide range of different applications. Unfortunately, this also creates the opportunity for ambiguity, as individual managers have to interpret those recommendations within the context of their own working environment. Peer review and limited field testing can be used to increase confidence that others can correctly interpret the actions that are necessary to implement particular recommendations. If such additional support is not elicited then there is a danger that specific recommendations will be rejected as inapplicable or, conversely, that generic recommendations will
result in a range of potentially inappropriate remedies. At the very least, scrupulous peer review should help to identify `gross level' inconsistencies. For instance, the US Army Safety Centre reported an incident in which a soldier fell while attempting to negotiate an `inverted rope descent' [813]. The subsequent investigation identified discrepancies between the recommended practices for the construction and use of the obstacle. For example, previous training-related incidents had led to the development of standard FM 21-20. This requires that the obstacle should include a platform at the top of the tower for the instructor and the student. A safety net should also be provided. This standard also requires that the obstacle should be constructed to reflect the Corps of Engineers drawing 28-13-95. Unfortunately, this diagram does not include a safety net or a platform. The incident investigators, therefore, concluded that `confusion exists concerning the proper design and construction of this obstacle'. Following the incident, the army had to suspend the use of their inverted rope descent obstacles until platforms and safety nets had been provided in accordance with FM 21-20. The 28-13-95 diagram was also revised to remove any potential inconsistency.

The previous incident shows how particular failures often expose inconsistent recommendations. Fortunately, many of these problems can be identified before an incident occurs. For example, the US General Accounting Office was requested to monitor the implementation of recommendations following Army Ranger training incidents [288]. They identified a range of problems, not simply in the implementation of those recommendations but also in the way in which those recommendations had been drafted in the first place. For example, one recommendation required that the Army develop `safety cells' at each of the three Ranger training bases. These were to include individuals who had served long enough at that base to have developed considerable experience in each geographic training area, so that they understood the potential impact of weather and other local factors on training safety. Safety cells were also to help officers in charge of training to make go/no-go decisions. However, the National Defence Authorisation Act that embodied these provisions did not establish specific criteria on the make-up of a safety cell. The General Accounting Office concluded that the approach chosen by the Army `represents little change from the safety oversight practice that was in place' at the time of the incidents [288]. They also found more specific failures that related to the implementation of previous recommendations rather than to potential ambiguity in the proposals themselves. For example, the Army Safety Program recommended that safety inspections be conducted on an annual basis. The Fort Benning Installation Safety Office failed to conduct any inspections of training operations safety at the Brigade or its battalions between March 1993 and March 1996.

Chapter 15 addresses the problems and the benefits of monitoring incident reporting systems. It is important to stress, however, that inspections such as that performed by the US General Accounting Office on Ranger training can satisfy several objectives. These inspections can be used to expose deliberate failures to implement particular recommendations. They can identify inadvertent neglect; situations in which staff did not know that particular recommendations had been made.
These audits also help to recognise genuine difficulties in the interpretation and implementation of remedial actions. Arguably the most significant benefit of such monitoring is that it can be used to institutionalise procedures that help to ensure compliance with key recommendations. For example, the Ranger investigation found that inspections by the Infantry Center, Brigade, and the Fort Benning Safety Office did not monitor compliance with safety controls. In particular, they failed to check that training officers set up minimum air and land evacuation systems before daily training. They also failed to monitor whether instructors adhered to rules prohibiting deviations from planned swamp training routes. The General Accounting Office report concluded that:

"The inspections are focused instead on checklists of procedural matters, such as whether accidents are reported and whether files of safety regulations and risk assessments are maintained. If the important corrective actions are to become institutionalised, we believe that formal Army inspections will have to be expanded to include testing or observing to determine whether they are working effectively." [288]

The previous paragraphs have argued that monitoring programs can be used to detect potential ambiguity in the recommendations that are issued by incident investigators. They can also assess whether or not those recommendations are being acted upon. This approach does, however, suffer
from a number of limitations. Unfortunately, the US General Accounting Office's review of Ranger training only provided a very limited snapshot of one particular area of activity. It ran from September through November 1998 `in accordance with generally accepted government auditing standards' [288]. It involved briefings from Brigade officials. Inspectors observed training exercises and reviewed safety procedures at each battalion's facilities. To determine the level of compliance, they interviewed Brigade officials. They also reviewed Army and Infantry Center inspection regulations, procedures, and records. Personnel were deployed to the Department of the Army headquarters, Army Infantry Center, Ranger Training Brigade headquarters, and the Ranger training battalions at Fort Benning and Dahlonega, Georgia, and at Eglin Air Force Base, Florida. The extensive nature of such investigations helped to improve the quality of the eventual report. It also, however, contributed significantly to the costs associated with ensuring compliance. Such techniques cannot easily be applied to support local incident reporting systems where funds may be very tightly controlled. Conversely, they cannot easily be applied to monitor the implementation of recommendations throughout large-scale national systems.

For instance, the Modification Work Order (MWO) program was intended to ensure that safety alerts and other maintenance notices were consistently implemented across the US Army [287]. The objective was to enhance fielded weapon systems and other equipment by correcting `any identified operational and safety problems'. The implementation of this program was complicated by the number of advisories that it had to track. For example, the US Army approved 95 Modification Work Orders for its Apache helicopter between 1986 and 1997. The implementation of this program was further complicated by the diverse nature of these recommendations. For example, one procedure introduced a driver's thermal viewer, a battlefield combat identification system, a global positioning receiver and a digital compass system into Bradley Fighting Vehicles. The introduction and integration of such relatively sophisticated equipment poses considerable logistical challenges. The MWO program was also intended to monitor less complex modifications. For example, early versions of the Army's High Mobility Multipurpose Wheeled Vehicles utilised a two-point seatbelt restraint system. This did not contain the inertial stopping device that is a standard feature of most civilian vehicles [814]. In consequence, users must remember to remove all of the slack from the retractor and to tighten the seatbelt. This procedure was described and recommended in a safety advisory (TM 9-2320-280-10). Modification Work Order 9-2320-280-35-2 then recommended the installation of a three-point seatbelt system. A centralised database was developed to record the progress of different maintenance recommendations. Queries could be issued by Army headquarters officials and Army Materiel Command officials to ensure that individual units met the timescales and objectives that were recommended in safety notices. Unfortunately, the centralised database was discontinued following a structural reorganisation in 1990.
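The essence of such a database is a simple status record per modification, unit and item of equipment. The sketch below is a minimal illustration of the kind of compliance query that headquarters officials could issue; the field names and unit identifiers are invented for the example and do not reflect the Army's actual schema.

    from datetime import date

    # Hypothetical status records for Modification Work Orders (MWOs): one
    # row per order, unit and item of equipment.
    mwo_status = [
        {"mwo": "9-2320-280-35-2", "unit": "Unit A", "equipment": "HMMWV",
         "due": date(1997, 6, 30), "completed": True},
        {"mwo": "9-2320-280-35-2", "unit": "Unit B", "equipment": "HMMWV",
         "due": date(1997, 6, 30), "completed": False},
    ]

    def overdue(records, today):
        """The kind of query a headquarters official might issue: which
        units have not completed a safety modification by its due date?"""
        return [r for r in records if not r["completed"] and r["due"] < today]

    for r in overdue(mwo_status, date(1998, 1, 1)):
        print(f"{r['unit']}: MWO {r['mwo']} on {r['equipment']} overdue since {r['due']}")

The paragraphs that follow describe what was lost when this centralised view was abandoned: without it, neither headquarters nor field units could answer even this simple query.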
Control over modification installation funding was transferred from the headquarters level to the individual program sponsors who are responsible for major weapon systems, such as the Abrams tank, or for product centres that support particular pieces of equipment, such as the Squad Automatic Weapon. The result of this decentralisation was that `Army headquarters and Army Materiel Command officials do not have an adequate overview of the status of equipment modifications across the force, funding requirements, logistical support requirements, and information needed for deployment decisions' [814]. This lack of information also affected field units. It was difficult for maintenance personnel to know which modifications should have been made to particular items of equipment. Similarly, it was difficult to determine which modifications had actually been made. For instance, depot personnel at Anniston Army Depot, Alabama, had to visually inspect 32 National Guard trucks because they had no way of knowing whether two authorised modifications had been made when the vehicles arrived.

The difficulties associated with tracking modification recommendations also had knock-on effects. Engineers did not always receive necessary technical information. A General Accounting Office report described how division maintenance personnel did not receive revisions to the supply parts manual for the fuel subsystem on Apache attack helicopters. The aircraft were then grounded and the maintenance team wasted many hours troubleshooting because the old manual did not provide necessary information about a new fuel transfer valve [287]. The lack of an adequate monitoring system created a number of additional logistical problems. For example, it was difficult for engineers to coordinate the implementation of multiple modifications to individual pieces of equipment. In
consequence, the same item might be repeatedly removed from service while multiple modification orders were completed. Maintenance teams did not receive adequate notice of modifications. Some items of equipment did not always work together after modifications. This loss of integration further delayed other maintenance procedures and reduced operational capability. For instance, modified parts were removed from Huey utility helicopters and non-modified parts were then reinstalled, because there were no modified parts in stock when the new parts broke. Such practices further exacerbated the problems that were created when responsibility for the database was distributed from headquarters control. The configuration of equipment was not always accurately portrayed in the database used by the maintenance personnel and Army headquarters officials.

A number of recommendations were made as a result of the General Accounting Office report. These included steps to ensure that program sponsors and supply system personnel supported modification orders by providing appropriate spare parts after the initial order had been implemented. The report also recommended that personnel should update technical information whenever a modification order was being performed. Old spare parts were to be `promptly' phased out and new items were to be added to the unit's supply system. One of the ironies of incident reporting is that the Accounting Office does not propose monitoring mechanisms to ensure that its recommendations about monitoring practices are effectively implemented!

This section has shown the difficulties of ensuring that the recipients of particular recommendations can unambiguously determine their meaning. It has also illustrated the technical and logistical problems of ensuring that safety recommendations are implemented in a uniform manner across complex organisations. Companies that lack the technological and financial infrastructure of the US Army are likely to experience even greater problems in ensuring that recommendations are successfully implemented. Chapter 15 will describe a number of tools that can be used to address these problems. In contrast, the following sections present techniques that are intended to help investigators identify the recommendations that are intended to combat future failures.
12.2 Recommendation Techniques

A range of techniques have been proposed to help investigators determine the best means of reducing the likelihood, or of mitigating the consequences, of safety-critical failures. Many of these approaches address the problems that were identified in previous sections. For example, some techniques provide methodological support so that the analysis of similar incidents should yield similar findings. They provide a template for any analysis so that disputes can be mediated by reference to the approved technique. Ambiguity can be resolved by encouraging a consistent interpretation of the recommendations that are derived from the approved system. The following paragraphs briefly introduce a number of different approaches. These are used to identify potential recommendations from an explosives incident that took place during a nighttime training exercise.

The intention was that two maneuver platoons would lead supporting engineer squads across the line of departure. These elements would be followed by a third maneuver platoon. The two lead platoons were to occupy support-by-fire positions. The engineers and the third maneuver platoon were then to occupy `hide' positions some twenty-five meters from a breaching obstacle. This was to be a triple-strand concertina wire barricade. The breach exercise was rehearsed a number of times. There was a daytime walkthrough without weapons, munitions or explosives. This was followed by a `dry fire' exercise in which the plan was rehearsed with weapons but without munitions or explosives. A team leader and two team members would use 1.5 meter sections of M1A2 Bangalore torpedo to breach the concertina obstacle. The team leader would then pass elements of the initiation system to the team members. They were to tie in the torpedoes to the detonating cords. The initiation system `consisted of a ring main (detonating cord about 3 to 4 feet formed into a loop) with two M14 firing systems (approximately 4 feet of time fuse with blasting cap affixed to one end) taped to the ring main' [818]. At the opposite end of the M14 firing systems was an M81 fuse igniter that had been attached before the start of the operation. The intention was that the team leader would give each team member one of the M81 fuse igniters. On his command, they were then to pull their M81 and initiate the charge. The
breaching team were then to retreat to their original hiding place. The detonation was to act as a further signal for a marking team to use chemical lights to help the following platoons locate the breach.

The actual exercise began when the breaching team approached the concertina objective. The two team members successfully placed their Bangalore torpedoes on either side of a potential breach site. The leader then handed the initiation system to them so that they could tie in the Bangalore detonating cord lines. The team leader then handed one of the two M81 igniters to the team member on the left side of the breach. The team leader departed from the original plan when he placed the second M81 on the ground between the two team members. Instead of handing it over, he handed a bag containing approximately eight meters of detonating cord and an extra M14 initiation system to the team member on the right-hand side of the intended breach. The team leader then radioed the platoon leader to inform them of his intention to fire the charges. The left-side team member picked up the M81 fuse igniter that had been left on the ground. He also had the original M81 that had been given to him by the team leader. The right-hand team member held the two M81s from the bag. The team members pulled the M81 fuse igniters on the leader's order `three, two, one, PULL'. A Battalion S3 (operations, planning, and training officer) observed the burning fuses and the added charge in the bag, which had been placed to the right of the Bangalore torpedoes. He asked about the additional charge but did not receive any reply. The demolition team and the S3 then moved back approximately twenty-five meters to separate hiding locations. As intended, the detonation acted as a signal for the marking team and a security team to rush towards the intended site of the breach. A second, larger, detonation occurred some three to five seconds after the first. Both of the approaching teams were caught by the resulting blast. The initial detonation had been caused by the additional charge in the bag that had been handed to the team member on the right of the breach. The second explosion was caused by the Bangalore torpedoes.

Chapters 10 and 11 have introduced a number of analysis techniques that can be used to identify the causal factors in this incident. For instance, ECF charts might be used to reconstruct the
flow of events leading to the failure. Counterfactual reasoning can then be applied to distinguish causal from contextual factors: a factor is judged to be causal if the incident would have been avoided, or significantly mitigated, had that factor been absent.
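A minimal sketch of this counterfactual test is shown below, assuming a deliberately simple representation in which an analyst has already recorded a judgement for each factor. The Factor type and the example judgements are illustrative assumptions rather than part of ECF analysis as published.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Factor:
        description: str
        # Analyst's counterfactual judgement: would the incident still have
        # occurred had this factor been absent? ("yes", "no" or "possibly")
        incident_without_it: str

    def classify(factors: List[Factor]) -> Tuple[List[Factor], List[Factor]]:
        """Factors whose absence would have avoided (or possibly avoided)
        the incident are causal; the remainder are treated as contextual."""
        causal = [f for f in factors if f.incident_without_it in ("no", "possibly")]
        contextual = [f for f in factors if f.incident_without_it == "yes"]
        return causal, contextual

    factors = [
        Factor("Excess demolition material was not handed in", "no"),
        Factor("No command-directed observer at the breaching site", "possibly"),
        Factor("The exercise was conducted at night", "yes"),
    ]
    causal, contextual = classify(factors)
    for f in causal:
        print("Causal:", f.description)
    for f in contextual:
        print("Contextual:", f.description)

The judgements themselves remain a matter for the analyst; the value of making the test explicit is that each entry in Table 12.2 can then carry its justification.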
Table 12.2 illustrates the results of such an analysis. This tabular form is based on the ECF summaries shown in Tables 10.16 and 10.17. Only causal factors are shown; contributory factors are omitted for the sake of brevity. As might be anticipated, the results of this analysis are similar to the causal findings produced by the US Army Technical Centre for Explosives Safety [818]. The original reports do not, however, state whether any particular analytical techniques were used to support the causal analysis of this incident. The justifications associated with the causal factors in Table 12.2 must, therefore, be inferred from the supporting documentation.

Cause: The breaching team leader failed to turn in excess demolition material to the ammunition supply point.
Justification: The incident would not have happened if the bag containing the additional M14 initiation system and detonating cord had been handed in.

Cause: Excess demolition material was not tied into the firing charge.
Justification: The incident would not have happened if all charges had been detonated together.

Cause: Addition of the second charge was not planned, practiced or communicated to the other participants.
Justification: The incident might not have occurred if the marking and security teams had been aware of the second charge.

Cause: There was no appointed command-directed observer/controller at the breaching site.
Justification: The incident might not have occurred if a controller had been monitoring the use of the second charge. They might have intervened to prevent the separate detonation of this material.

Cause: Breaching team members failed to question or stop the deviated and unpracticed operation.
Justification: The incident might not have occurred if team members had questioned the use of the M14 initiation system and detonating cord in the bag.

Cause: Battalion S3 (operations, planning, and training officer) recognised but failed to stop the deviated and unpracticed operation.
Justification: The incident might not have occurred if they had intervened more directly when their question about the bag went unanswered.

Cause: Marking team leader took up a hide position closer than the authorised 50 meters to the breaching site.
Justification: The consequences of the incident might have been significantly reduced if they had been further from the detonation site.

Cause: Marking team leader was unable to distinguish between the initial (smaller) detonating cord detonation and the larger Bangalore detonation.
Justification: The incident might have been avoided if the marking team leader had been able to recognise that the initial detonation was not large enough to have been the Bangalore torpedoes.

Table 12.2: Causal Summary for Bangalore Torpedo Incident.

The following paragraphs illustrate a range of techniques that can be used to identify particular recommendations once investigators have conducted an initial causal analysis. As will be apparent, there is a considerable imbalance between the number of techniques that might help to identify the causes of an incident and the number of approaches that support the identification of particular recommendations. A cynical explanation for this might be that there is a far greater interest in diagnosing the causes of managerial failure or human error than there is in devising means of addressing such incidents [408]. Alternatively, it can be argued that the identification of recommendations depends so much on the context of an incident, and upon the expertise of the investigator, that there is little hope of developing appropriate recommendation techniques. However, ad hoc approaches have resulted in inconsistent recommendations for similar incidents. We have also seen ambiguous guidelines that have contributed to subsequent accidents.

The following pages introduce five distinct types of recommendation technique. These distinctions reflect important differences in the role that the particular approaches play within the reporting system as a whole. Some techniques embody the idea that recommendations are imposed upon those who are to `blame' for an incident. Other techniques reject this approach and provide more general heuristics that are intended to link recommendations more directly to the products of causal analysis techniques, such as ECF analysis. This opens up the scope of potential recommendations; operator failure and human error are no longer the sole focus of any subsequent analysis. Other techniques have built upon this link between recommendations and causal analysis by explicitly specifying what actions should be taken whenever particular causal factors are identified. A further class of techniques exploits accident prevention models to identify potential remedies. For instance, barrier analysis approaches look beyond the `source' of an incident to analyse the defences that fail to mitigate the consequences of particular failures. Unfortunately, a number of practical problems can complicate these broader approaches. Financial and technical constraints can prevent commercial organisations from implementing all of the recommendations that might prevent the causes of an incident and might provide additional protection against the adverse consequences of those failures. A final group of techniques, therefore, exploits concepts from risk assessment to help identify and prioritise the interventions that might safeguard future operations:

1. recommendations based on blame or accountability. These recommendation techniques help investigators to remedy the failings of groups or individuals who are `to blame' for an incident or accident. The intention is to `put their house in order'. As we shall see, these recommendation techniques are consistent with legal approaches to accident and incident prevention. Prosecution is perceived to have a deterrent effect on future violations. In consequence, recommendations may include an element of retribution or atonement in addition to any particular actions that are intended to have a more direct effect on the prevention of future failures.

2. recommendation heuristics. A second class of recommendation techniques takes a broader
view both of the causes of incidents and of the potential recommendations that can be used to combat future failures. These techniques draft high-level heuristics that are designed to help investigators derive appropriate remedies from the findings of any causal analysis. They provide guidelines such as `ensure that a recommendation is proposed for each causal factor that has been identified during the previous stages of analysis'. Other heuristics describe appropriate implementation strategies. For instance, it might be recommended that `an individual or organisation is associated with the implementation of any recommendation'. Unfortunately, such ad hoc heuristics provide few guarantees that individual investigators will propose similar remedies for similar failures. There is a danger that inconsistent recommendations will be made within the same reporting system.

3. navigational techniques (enumerations, lists and matrices). Instead of focusing on notions such as retribution or blame, a further class of techniques is specifically intended to improve the consistency of particular recommendations. These approaches often enumerate the interventions that investigators should approve in the aftermath of particular failures. For instance, a list of recommendations may be identified for each class of causal factors; a minimal sketch of this idea appears after this list. One consequence of this is that the utility of these techniques is often determined by the quality of the causal analysis that guides their application.

4. generic accident prevention models. It can be difficult to enumerate appropriate recommendations for classes of incidents that are still to occur. The dynamism and complexity of many working environments can prevent investigators from identifying effective interventions from pre-defined lists. In consequence, a further class of techniques provides general guidance about ways of improving the barriers and defences that may have been compromised during an incident. Investigators must then interpret this general information within the specific context of their system in order to draft recommendations that will preserve the future safety of an application process. Accident prevention models, including barrier analysis, have been extended to consider mitigating factors. This is important because investigators can use these extended models not simply to consider ways of addressing the causes of complex failures; they can also use them to consider ways of controlling the consequences of incidents whose causes cannot be either predicted or eliminated [675, 313].

5. risk assessment techniques. A number of problems complicate the application of ad hoc approaches and of techniques that rely upon accident prevention models to identify incident recommendations. In particular, they provide little guidance on whether particular recommendations ought to have a higher priority than other potential interventions. This is important given the finite resources that many commercial organisations must allocate to meet any necessary safety improvements. It can be argued that such priority assessments are the concern of the regulatory organisations that approve the implementation of investigators' findings. Such a precise division of responsibilities cannot, however, be sustained in more local systems. In consequence, the closing paragraphs of this section consider ways in which risk assessment techniques can be used to identify the priority of particular recommendations.
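The navigational approach in item 3, combined with the risk-based prioritisation in item 5, can be made concrete with a small sketch. The causal classes, candidate remedies and scoring scheme below are all invented for illustration; actual matrices would be drawn from an organisation's approved taxonomy.

    # Illustrative recommendation 'matrix': candidate remedies enumerated
    # for each class of causal factor, ranked by a coarse risk score.
    RECOMMENDATION_MATRIX = {
        "procedure violation": ["reinforce turn-in procedures",
                                "review supervision of exercises"],
        "missing oversight":   ["appoint an observer/controller for hazardous drills"],
        "inadequate barrier":  ["increase separation distances",
                                "add a sacrificial protective layer"],
    }

    def recommend(causal_findings):
        """Look up candidate remedies for each classified causal factor and
        rank them by likelihood x severity (each judged on a 1-5 scale)."""
        candidates = []
        for factor, likelihood, severity in causal_findings:
            for remedy in RECOMMENDATION_MATRIX.get(factor, []):
                candidates.append((likelihood * severity, remedy, factor))
        return sorted(candidates, reverse=True)

    findings = [("procedure violation", 4, 5), ("missing oversight", 3, 5)]
    for score, remedy, factor in recommend(findings):
        print(f"priority {score}: {remedy} (addresses {factor})")

As the discussion below makes clear, the value of such a matrix rests entirely on the quality of the causal classification that feeds it.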
Subsequent chapters consider the regulatory use of these approaches to monitor the overall performance of incident reporting systems. The following paragraphs assess the strengths and weaknesses of these different approaches in greater detail. Subsequent sections then examine the problems of validating the particular remedies that are identified by such recommendation techniques.
12.2.1 The `Perfectability' Approach

The simplest recommendation technique is to urge operators to do better in the future. In this view, it can be argued that `if a system demonstrates its underlying reliability by operating without an incident for a prolonged period of time, and given that no physical systems have failed, then any subsequent failure must be due to operator error'. Such human failures can be corrected by reminding users of their responsibility for an incident. Changes can be made to training procedures and
recommended working practices to help ensure that an incident does not recur. Such recommendations are based on the idea that it is possible to avoid future incidents by perfecting previous human `errors'.

This `perfectability' approach has numerous advantages beyond its apparent simplicity. For instance, reminders are often the cheapest form of remedial action [479]. One consequence of electronic communication facilities is that there is almost no marginal cost associated with sending safety-related emails to members of staff. Of course, such reminders can impair the effectiveness of a reporting system if staff are alienated by repeated warnings about well-known topics [409]. On the other hand, reminders can also be issued more quickly than almost any other safety recommendation. This offers considerable advantages over the length of time that is typically required to implement the redesign of key system components.

Elements of the perfectability approach can be identified in most military regulations. For instance, punitive actions are often prescribed as appropriate remedies when personnel disregard the permits and mandatory obligations that are imposed upon them:

"the revocation ... of permits to conduct nuclear activities or hold ionizing radiation sources (and) the removal of inventory of ionizing radiation emitting devices from an organisation; and disciplinary or administrative action under the National Defence Act in the case of Canadian Force members and the application of all available administrative measures in the case of Department of National Defence employees. Director General Nuclear Safety may also recommend criminal prosecution." [132]

Similar injunctions have been drafted to ensure that personnel take necessary safety precautions during more mundane activities. For example, the Canadian military stipulates the circumstances in which individuals must wear safety helmets and goggles when operating snowmobiles. Departures from these regulations can be interpreted as instances of individual negligence or of willful violation [133].

Previous researchers have focussed almost exclusively on the use of punishments to `perfect' operator behaviour in the aftermath of incidents and accidents [700]. It is important to recognise, however, that many organisations operate more complex systems in which rewards may also be offered for notably good performance during near-miss occurrences. For example, the US Army operates a range of individual awards that recognise notably good performance in avoiding incidents and accidents. These include the Chief of Staff Award for Excellence in Safety Plaque, the United States Army Safety Guardian Award, the Army Aviation Broken Wing Award, the Director of Army Safety Special Award of Excellence Plaque, the United States Army Certificate of Achievement in Safety and the United States Army Certificate of Merit for Safety [797]. Such recognition need take little account of the context that may have created the need for individuals to display such acts of bravery and initiative.

Table 12.3 illustrates how the perfectability approach can be applied to the Bangalore case study. As can be seen, each cause describes a failure on the part of an individual. Recommendations are then drafted to ensure that those individuals learn from their apparent mistakes. For instance, the breaching team leader failed to turn in excess materials during the exercise. They should, therefore, be trained in the importance of following such turn-in procedures.
Similarly, the marking team leader failed to distinguish between the smaller initial explosion of the detonating cord and the main charge provided by the Bangalore torpedoes. They should, therefore, receive training that might help them discriminate between such different types of detonation.

Table 12.3 deliberately provides an extreme example of the `perfectability' approach. It illustrates some of the practical problems that arise during the application of this approach. For example, each recommendation is focussed on a particular individual. The recommendations, therefore, do not draw out more general lessons. They neglect the opportunities that an incident might provide for revising the training of all personnel involved in breaching and marking exercises. Further problems stem from the limited effectiveness that such individual recommendations might have in the aftermath of comparatively serious incidents. It is highly unlikely that the individuals involved in this incident would need to be reminded of their individual shortcomings, given the consequences of their `errors'.
12.2.
567
RECOMMENDATION TECHNIQUES
Cause The breaching team leader failed to turn in excess demo material to the ammunition supply point. The breaching team leader did not ensure that any excess demolition material was tied into the ring charge.
The breaching team leader's addition of the second charge was not planned, practiced or communicated to the other participants. There was no appointed command-directed observer/controller at the breaching site. Breaching team members failed to question or stop the deviated and unpracticed operation. Battalion S3 (operations, planning, and training oÆcer) recognised but failed to stop the deviated and unpracticed operation. Marking team leader took up hide position closer than the authorised 50 meters to the breaching site. Marking team leader was unable to distinguish between the initial (smaller) detonating cord detonation and the larger Bangalore detonation.
Individual Recommendation The breaching team leader should be reminded of the proper procedures for the turn-in of excess munitions. The breaching team leader should be reminded of the proper procedures for the disposal of excess munitions. The use of a `last-shot' to dispose of excess munitions is a dangerous practice and creates the opportunity for such failures. The breaching team leader and the exercise safety oÆcer must be reminded of their responsibility to consider the consequences of and communicate necessary information about any unplanned changes to an exercise. The oÆcer in charge of the exercise should be reminded of the need to appoint observers to intervene during potentially hazardous training operations. Breaching team members must be reminded of their duty to immediately stop any unsafe life threatening act. Battalion S3 must receive additional training to ensure that they intervene if similar situations arise in future training exercises. The marking team leaders must be reminded to follow the required distance regulations speci ed in IAW FM 5-250. The marking team leader must be trained to a point where they can distinguish between such dierent types of detonation.
Table 12.3: `Perfectability' Recommendations for the Bangalore Torpedo Incident. applied to all individuals who perform similar tasks to the breaching team leader; `all personnel must be reminded of the proper procedure for turning-in excess munitions'. This more general approach leads to further problems. For instance, some reporting systems provide participants with apparently random reminders about particular safety procedures. It can be diÆcult for individuals to follow the justi cation for these reminders if they are not kept closely informed of the incidents that motivate safety managers to reinforce these particular guidelines. Chapter 14 will describe techniques that can be used to address these potential problems. For now, however, it is suÆcient to emphasise that a host of further problems complicate the application of the perfective approach to drafting incident recommendations. In particular, the perfective approach often relies upon demonstrating that individuals have in some way contravened regulations and procedures that they ought to have followed. This creates problems when an incident is not covered by any applicable regulation. It then becomes diÆcult to argue that individual operators should have intervened to mitigate a potential failure. One ad hoc solution is to continually redraft procedures in a (probably) futile attempt to codify appropriate behaviour in all possible situation. For instance, the US Army Safety Policies and Procedures for Firing Ammunition for Training, Target Practice and Combat contains a requirement
that: "Accidents caused by firing or evidence that would indicate that the safety provisions of this regulation are inadequate will be reported by letter. The letter must give all pertinent information on the alleged inadequacy of the regulation" [795].

Further problems affect the punitive measures that are associated with the perfective approach. For example, it can be difficult to know exactly what sanctions can be applied to address particular errors and violations. These measures can be influenced by local practices within particular organisations but they are ultimately governed by legislation. Chief Justice Lamer of the Supreme Court of Canada explained in R. v. Genereux in 1992:

"The purpose of a separate system of military tribunals is to allow the Armed Forces to deal with matters that pertain directly to the discipline, efficiency and morale of the military. The safety and well-being of Canadians depends considerably on the willingness and readiness of a force of men and women to defend against threats to the nation's security. To maintain the Armed Forces in a state of readiness, the military must be in a position to enforce internal discipline effectively and efficiently. Breaches of military discipline must be dealt with speedily and, frequently, punished more severely than would be the case if a civilian engaged in such conduct. As a result, the military has its own Code of Service Discipline to allow it to meet its particular disciplinary needs. In addition, special service tribunals, rather than the ordinary courts, have been given jurisdiction to punish breaches of the Code of Service Discipline. Recourse to the ordinary criminal courts would, as a general rule, be inadequate to serve the particular disciplinary needs of the military. There is thus a need for separate tribunals to enforce special disciplinary standards in the military." [130]

The Chief Justice refers to the importance of punishing breaches of the Code of Service Discipline in order to preserve the `safety and well-being' of the Armed Forces. This may seem to be a relatively clear-cut decision. There are, however, situations in which legal barriers prevent the application of the `perfectability' approach even though organisations might want to impose particular sanctions. For instance, a former Sergeant in the Canadian Army found himself as a defendant in a standing court martial when he refused to receive an anthrax vaccination while deployed in Kuwait [142]. His opposition was described as `unsafe and hazardous'. The case was, however, stopped when the defence cited the Canadian Charter of Rights and Freedoms. This is one of several similar cases in which individuals have used legal arguments to defend themselves against punitive sanctions. Such defences must be provided because there is a danger that superiors may apply `perfective' sanctions for personal rather than professional reasons. Article 138 of the Uniform Code of Military Justice, section 938 of title 10, United States Code, provides one such defence. This article enables a member of the US Armed Forces to seek redress for grievances against a commanding officer and, if redress is denied, to file a formal complaint against that officer. The Judge Advocate General of the Army will then review and take final action on such `Article 138' complaints. In more serious cases, sanctions cannot simply be applied by commanding officers. They must be supported by legal argumentation in courts martial.
Even here, however, there are checks and balances that prevent the arbitrary application of the `perfective' approach. For example, two of the six appeals currently recorded by the Canadian Judge Advocate General relate to military personnel challenging sanctions that were imposed following safety-related incidents. Such incidents illustrate the more general, pragmatic problems that make it difficult to identify appropriate recommendations within the `perfective' approach. Training can be ineffective if it is not supported by practical demonstrations and almost constant reminders of the importance of key safety topics. These constant reminders can alienate staff unless properly motivated by concrete, `real-world' examples. Conversely, organisations can administer more punitive sanctions. These legal sanctions are bounded by the civil law and, in the context of our case study, by military law. The development of human rights legislation and of case law that stresses the importance of performance shaping factors
as well as individual violations has helped to `draw the teeth' of the perfective approach in many application domains.

There are a number of theoretical reasons why the perfective approach offers dubious support for investigators and regulators. Recommendations that are intended to perfect operator behaviour often lead to a vicious cycle in which employers become increasingly frustrated by recurring incidents. Reason terms this process the `blame cycle'. This cycle is based on the notion that operators exercise free will in the performance of their daily tasks [701]. They are assumed to be free to choose between right and wrong, between error-free and error-prone paths of interaction. Any incidents and accidents that do occur are, therefore, partly the result of voluntary actions on the part of the operator. As we have seen, employers and regulators who adopt the `perfectability' approach are likely to respond to such failures by reminding individuals of their responsibilities and duties. Retraining may be used to reinforce key safety information. Warnings about the consequences of violation are, typically, reiterated after particular incidents. Unfortunately, these recommendations and remedial actions may not address the underlying causes, or performance shaping factors, that created the context in which an `error' occurred. In consequence, it is likely that there will be future incidents. When these occur, employers and regulators are increasingly likely to resort to additional sanctions and punishments for what they interpret to be willful violations of publicised procedures. Their response to recurrent incidents can be driven by the `fundamental attribution error' that we have met several times in previous chapters [701]. This describes situations in which we ascribe the failures of others to personal characteristics, such as neglect or incompetence, when in similar circumstances we might justify our own mistakes by pointing to contextual factors, such as the level of automated support or time pressures. If punitive sanctions are introduced then they can have the paradoxical effect of making future incidents more likely. They may increase the level of stress in the workplace or may increase a sense of alienation between employees and their supervisors. In either case, future incidents are likely unless the underlying causes are addressed, and so the cycle continues.

To summarise, the `perfective' approach drafts recommendations that are intended to avoid any recurrence of particular individual errors. This approach is limited because recommendations often address the causes of catalytic failures rather than the causes of more deep-seated managerial and organisational problems. Reason [701], Hollnagel [361], Perrow [675] and Leape [479] have done much to challenge previous applications of this perfective approach. They draw upon a wealth of evidence to suggest that punitive sanctions, individual retraining and constant reminders may have little long-term effect on the future safety of complex, technological systems. There is, however, a need for balance. Any consideration of the context in which an incident occurs must not obscure individual responsibility for certain adverse occurrences. For example, it is possible for risk-preferring individuals to alter their behaviour when responding to particular situations in their working environment [368].
It then becomes difficult to distinguish between situations in which those individuals fail to recognise the potential danger inherent in a particular situation, for example because they did not receive adequate training, and situations in which they deliberately choose to accept higher risks in the face of adequate training. There is, therefore, a tension between the need to recognise the impact of contextual or performance shaping factors and the importance of an operator's responsibility for their actions. Many organisations have drafted guidelines that recognise this tension. For instance, the US Air Force's guidance on Safety Investigations and Reports contains the following advice about the drafting of recommendations:

"5.10.1.5. Write recommendations that have a definitive closing action. Do not recommend sweeping or general recommendations that cannot be closed by the action agency. Vague recommendations addressing the importance of simply doing one's job properly are also inappropriate. However, recommendations to place CAUTIONS and WARNINGS in Technical Order guidance relating the adverse consequences of not doing one's job properly may be appropriate. Recommendations for specific action such as refresher training, implementing in-process inspections, etc. to ensure job duties are being properly performed may also be appropriate since they are specific, and can be closed." [794]

This reflects the tension that exists between the impact of more recent ideas about the organisational
roots of many incidents and the `perfective' notions of free will and individual responsibility. The USAF guidelines reject `perfective' recommendations that simply urge individuals to do their job properly. They do, however, accept that it may be necessary to warn operators about the consequences of failing to do their job properly.
12.2.2 Heuristics

Most incident reporting systems provide only limited guidance about the techniques that investigators might use to derive conclusions from the results of a causal analysis. The NASA procedures and guidelines (NPG 8621.1) that structured the analysis in Chapter 10 recommend seven different causal analysis techniques. In contrast, they offer no suggestions about techniques that might be used to identify potential remedies once causes have been determined [571]. There are good reasons for this reticence. As has been mentioned, a relatively large number of techniques have been proposed to support causal analysis while only a handful have been developed to help structure the identification of recommendations. Those techniques that have been developed are not widely known and tend only to be applied within particular industries, such as chemical process engineering. This contrasts with a technique such as MORT, which has been more widely applied and is known throughout many different safety-critical domains.

There are further reasons why some organisations fail to identify appropriate recommendation techniques. Many organisations have failed to propose specific techniques to support the process of identifying recommendations because there is a natural concern that such an approach might unnecessarily constrain the skill and judgement of investigators. A particularly important issue here is that considerable domain knowledge is needed when identifying appropriate remedies. Such expertise cannot easily be synthesised within recommendation techniques. This can be contrasted with causal analysis, where it is possible to identify broad categories of failure that contribute to many different incidents. It is possible to challenge these diverse arguments. For instance, the lack of consistency between the recommendations of many investigators in the same industry seems to demonstrate that many do not currently share the same, necessary level of expertise. Similarly, as we shall see, some recommendation techniques have succeeded in identifying generic remedies that can be applied to particular causes in a broad range of industries. Finally, management may lack the will or the commitment necessary to ensure that investigators follow approved methods when proposing particular recommendations.

As mentioned, incident investigators tend to be highly skilled in primary and secondary investigation. Considerable expertise is required in order to direct the causal analysis of safety-critical incidents. In consequence, investigators wield considerable power and influence within investigatory and regulatory organisations. New techniques, whether they support causal analysis or the identification of recommendations, can be perceived as a threat to their existing skills and expertise [686]. Many statutory bodies also fail to perform any quality control over the work of their investigators. This leads to a paradox: investigatory and regulatory organisations do not follow the standardised working practices that they enforce on others.

Many organisations do provide high-level guidance to their investigators. For instance, the Canadian Army's safety program includes a five-step guide to accident and incident investigation [131]. These steps are: visit the accident scene; conduct interviews; gather and record evidence; evaluate the evidence and draw conclusions; make recommendations. The following high-level advice is offered to support the final stage of this process:

"Recommendations: 31.
Once the cause factors have been identified, the investigator(s) recommend(s) preventive measures be taken based on the findings of the investigation. The basic aims when developing preventive measures are as follows: treat the cause and not the effect; ensure that the measures will enhance and not restrict overall operational effectiveness; ensure preventive measures eliminate or control all causes. 32. Simply recommending that the individual(s) involved be briefed contributes little. It merely indicates fault finding. If human factors (inaction or action - human error) is a
cause, revising job procedures, training of all employees doing similar tasks and publicity of the accident, to name a few, would be more meaningful and certainly more productive. 33. If shortcomings in equipment, facilities or other resources are causes, then modifications, substitution or acquisition would be valid recommendations." [131]

This quotation illustrates the importance of eliminating or controlling all causes. Many organisations, therefore, require that investigators explicitly list the remedies that are proposed next to each cause of the incident. This enables colleagues to ensure that each cause is considered in an eventual report. A number of theoretical objections can be raised against this pragmatic objective. For example, the subjective nature of many causal analysis techniques provides few guarantees that this approach will address all of the causes that might possibly be identified in the aftermath of an incident or accident. The previous quotation also stresses the overall objective of operational effectiveness. A number of caveats can be raised about this requirement. For example, the guidance does not provide a clear definition of `operational effectiveness'. In practice, therefore, staff may find particular problems in resolving the conflict that often arises between safety concerns and more efficient operational techniques. Paragraph 32 makes the important point that re-briefing soldiers should not be seen as a recommendation. Previous work has noted the tendency of many incident reporting systems to rely upon issuing dozens of similar warning messages [409]. Such `remedies' provide cheap fixes and may neglect underlying safety issues; they typify the `perfective' approach to issuing recommendations that was introduced in previous paragraphs.

Other organisations have issued more detailed guidance that is intended to help investigators derive particular recommendations from the findings of a causal analysis. For instance, the US Air Force's involvement in aviation incidents has led to the publication of extensive guidance on incident and accident reporting [794]. The following paragraphs use the USAF guidelines to identify a number of high-level recommendation heuristics.
Heuristic 1: Match Recommendations to Each Causal Factor

The USAF guidelines include the generic requirement that `all mishap investigations should include recommendations to prevent future mishaps'. As in the Canadian Army guidance, investigators are urged to match recommendations to each causal finding, although exceptions are permitted if they are explicitly justified. Recommendations can also be made against non-causal findings. For example, an investigation may identify alternative ways in which an incident might have occurred. It is, therefore, important to draft recommendations that address both the causal chain that led to an incident as well as any other potential failures that might also have been identified.
Heuristic 2: Assign Action Agencies for All Recommendations

Investigators must clearly identify an agency that will be responsible for ensuring that a recommendation is implemented. Safety management groups should not routinely be tasked to implement particular remedies. Instead, investigators should identify the groups that manage the resources that are necessary to implement a recommendation. Investigators should also confirm that they have correctly identified a responsible authority, providing that this does not compromise their work, for instance by fueling rumours about the potential recommendations.
Heuristic 3: Recommendations Correct Deficiencies

Rather than requiring that an agency should implement a particular solution, investigators should draft recommendations to correct deficiencies. For example, investigators might avoid proposals to `move the right engine fire push-button to the right side of the cockpit'. It would be better to recommend that `changes should be made to the engine fire push-buttons to help preclude engine shutdown errors' [794]. This second approach goes beyond a simple instruction and helps to provide the rationale behind a particular recommendation. There are further justifications for this heuristic. The time pressures that affect many incident investigations can often prevent investigators from identifying all of the potential ways in which a problem might be addressed. Investigators may also
lack the necessary, detailed domain knowledge that is shared by particular system operators. Those operators might, therefore, be able to devise better solutions than those recommended by an investigator in the immediate aftermath of an incident.
Heuristic 4: Recommendations Support Actions NOT Studies

Investigators should be encouraged to draft recommendations that support particular actions. If there is insufficient information upon which to base those actions then studies can be advocated, but only as part of the process of implementing the higher-level recommendation. If investigators simply recommend that a study is conducted then there may be no guarantee that any actions will be based on the findings of such an enquiry. Similarly, if a recommendation refers to tests that are incomplete when the report is prepared then investigators must identify potential remedies that are contingent upon the outcome of such studies. These different recommendations must be explained and investigators should make explicit reference to the tests. They should also explain the reasons why a report was issued before the analysis was completed.
Heuristic 5: Recommendations Follow Implementation Paths

It is important that any recommendations take into account the correct procedures and paths for ensuring that corrective actions are implemented effectively. Part of this requirement can be satisfied by ensuring that the recommendation identifies an appropriate implementation agency. There may also be other constraints depending on the nature of the recommendation and the organisation in which the incident occurred. For example, investigating officers who recommend changes to military documentation may be required to initiate those changes themselves. This involves submitting revision requests on the appropriate forms to the relevant office. For example, the USAF guidelines describe the use of the Technical Order System, or AF Form 847, Recommendation for Change of Publication (Flight Publications), according to AFI 11-215, Flight Manual Procedures, `as applicable' [794].
Heuristic 6: Recommendations Acknowledge Minority Opinions

In multi-party investigations, different investigators can have different degrees of influence on the drafting of recommendations. Problems arise when these `primary' analysts disagree with the remedies proposed by their colleagues. Alternatively, investigators may hold equal influence but be divided into majority and minority opinions. In such circumstances, it is important that the dissenting opinions are voiced. Majority groups or primary investigators must justify their decision not to recommend certain courses of action.

The USAF guidelines are unusual in providing detailed heuristics for the identification of particular recommendations. Those heuristics are relatively informal: no explanation is provided of how they were drafted, and the reader is not informed of any validation that might confirm the utility of this guidance. They do, however, reflect the pragmatic concerns that are commonly voiced by incident investigators [850]. The US Army's Army Accident Investigation and Reporting Procedures Handbook contains less detailed advice [806]. It does, however, summarise many of the points made in the equivalent USAF publication:

"Recommendations. Each finding will be followed by recommendations having the best potential for correcting or eliminating the reasons for the error, material failure, or environmental factor that caused or contributed to the incident. Recommendations will not focus on organisational steps addressing an individual's failure in a particular case. To be effective at preventing incidents in the future, recommendations must be stated in broader terms. The board should not allow the recommendation to be overly influenced by existing budgetary, material, or personnel restrictions. In developing the recommendations, the board should view each recommendation in terms of its potential effectiveness. Each recommendation will be directed at the level of command/leadership having proponency for, and is best capable of implementing, the actions contained in the recommendation." [806]
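Taken together, the six heuristics read like a checklist that could be applied to each draft recommendation before a report is released. The following sketch illustrates that reading; it is not part of either service's guidance, the field and function names are hypothetical, and the checks simply paraphrase Heuristics 1 to 5 (Heuristic 6 is recorded as data rather than checked).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DraftRecommendation:
    """A draft recommendation with the fields that the six heuristics imply."""
    text: str
    findings_addressed: List[str]                # Heuristic 1: causal/non-causal findings
    action_agency: Optional[str] = None          # Heuristic 2: responsible agency
    deficiency_corrected: Optional[str] = None   # Heuristic 3: the deficiency, not a fix
    is_study_only: bool = False                  # Heuristic 4: studies alone discouraged
    implementation_path: Optional[str] = None    # Heuristic 5: e.g. the relevant form/office
    minority_opinions: List[str] = field(default_factory=list)  # Heuristic 6

def heuristic_violations(rec: DraftRecommendation) -> List[str]:
    """Return the heuristics that a draft recommendation appears to violate."""
    violations = []
    if not rec.findings_addressed:
        violations.append("H1: not matched to any causal or non-causal finding")
    if not rec.action_agency:
        violations.append("H2: no action agency assigned")
    if not rec.deficiency_corrected:
        violations.append("H3: no statement of the deficiency to be corrected")
    if rec.is_study_only:
        violations.append("H4: recommends a study rather than an action")
    if not rec.implementation_path:
        violations.append("H5: no implementation path identified")
    return violations
```

Such a representation cannot, of course, capture the judgement involved in applying the heuristics; at best it ensures that an investigator has considered each of them before a recommendation is released.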
There are also similarities between the US Army handbook and the guidelines issued by the Canadian Army. Both emphasise the `effectiveness' of any recommendations. There are also differences. For instance, the US Army explicitly states that investigators need not be `overly influenced' by existing budgetary constraints. All three of the organisational guidelines in this section emphasise the importance of directing recommendations at a responsible authority. However, the previous quotation not only stresses the need to identify an appropriate agency, it also stresses the need to specify an appropriate level of command within that organisation.

These guidelines are informal. They gather together ad hoc requirements that are intended to improve the quality of recommendations that are produced in the aftermath of safety-critical incidents. They are `ad hoc' because they have not been integrated into a systematic method or process. Investigators must endeavour to ensure that they obey these guidelines as they develop individual recommendations. It is important to emphasise, however, that these comments should not be interpreted as overt criticisms. Informal guidelines provide important pragmatic advice that is essential given the relative lack of well-developed methods in this area.

Table 12.4 shows how the US Army Technical Centre for Explosive Safety's recommendations from our case study incident can be mapped onto the causal factors that were identified during the causal analysis. As can be seen, this summary explicitly identifies the responsible agency that was charged with implementing each recommendation. The tabular form also illustrates the relationship between recommendations and causal factors. As can be seen, some causal factors are not explicitly addressed. Similarly, some recommendations are not associated with an implementation agency. Table 12.5 also records a recommendation that was made in the incident report but which cannot easily be associated with any of the particular causal factors that were identified from this incident.

This analysis shows how a simple tabular form can be used, together with Army guidelines, as a form of quality control for the recommendations that are made in incident reports. Investigators might be asked to ensure that a recommendation is associated with each of the causal factors. For example, Table 12.4 does not explicitly denote any recommendation that might have helped to avoid situations in which excess demolition material is not tied into a firing charge. It can be argued that this cause is addressed by the previous entry describing how excess material must be turned in. If this analysis were accepted then Table 12.4 should be revised to explicitly associate this recommendation with both causes. Alternatively, it can be argued that this approach would not provide any `defence in depth'. If excess munitions were not handed in then there is still a danger that the independent firing of charges might cause the same confusion that led to this incident. Under such circumstances, Table 12.4 should be revised by introducing an additional recommendation specifically addressing the detonation of excess material as part of another charge. As mentioned, the case study incident report does not identify recommendations for each cause, nor does it identify responsible authorities for the implementation and monitoring of each recommendation. It is not surprising that our case study does not conform to the US Army guidelines [806].
The recommendations that are cited in Tables 12.4 and 12.5 were derived from material that was used to publicise the remedies that were advocated in the main report. They were not directly taken from the report itself. The example does, however, illustrate the application of these informal guidelines to assess the recommendations that were publicised in the US Army Technical Centre for Explosive Safety's account of the incident. It can also be argued that many of the principles that are proposed in the army guidelines ought to have been carried forward into the accounts that are used to disseminate information about this failure to other engineers throughout that organisation.

Cause: The breaching team leader failed to turn in excess demo material to the ammunition supply point.
Recommendation: Training and safety briefings must present and stress proper procedures for disposal/turn-in of excess munitions and/or explosives. Introduction of left-over demolition materials into the last shot has been a long-standing accepted procedure. Such action violates the requirement to turn in all excess explosives.
Agency: Training and briefing officers.

Cause: Excess demolition material was not tied into the firing charge.
Recommendation: None given.
Agency: None given.

Cause: Addition of the second charge was not planned, practiced or communicated to the other participants.
Recommendation: None given.
Agency: None given.

Cause: There was no appointed command-directed observer/controller at the breaching site.
Recommendation: None given.
Agency: None given.

Cause: Breaching team members failed to question or stop the deviated and unpracticed operation.
Recommendation: All personnel must have confidence in their authority to immediately stop any unsafe, life-threatening act and exercise it accordingly.
Agency: All personnel.

Cause: Battalion S3 (operations, planning, and training officer) recognised but failed to stop the deviated and unpracticed operation.
Recommendation: All personnel must have confidence in their authority to immediately stop any unsafe, life-threatening act and exercise it accordingly.
Agency: All personnel.

Cause: Marking team leader took up a hide position closer than the authorised 50 meters to the breaching site.
Recommendation: Inadequate personnel hide distance (approximately 25 meters): the required distance (according to IAW FM 5-250) would have been 100 meters for a missile-proof shelter, 200 meters for a defilade position with overhead cover, or 50 meters for a command-waiver-authorised defilade position.
Agency: Unspecified.

Cause: Marking team leader was unable to distinguish between the initial (smaller) detonating cord detonation and the larger Bangalore detonation.
Recommendation: None given.
Agency: None given.

Table 12.4: Guideline Recommendations for Bangalore Torpedo Incident.

Cause: All personnel, to include command-directed observer/controllers and the safety representative, failed to identify violation of the waiver-authorised minimum safe hide separation distance during walk-through and dry-fire iterations.
Recommendation: The Phase 1 and Phase 2 walk-through exercise, prior to live-fire operation, is specifically designed to validate the safe execution for all elements of the live-fire exercise. A thorough, detailed review of all aspects of the operations should have identified the violation of the 50-meter safe hide distance. Had the marking team and security members been properly distanced from the breaching site, their survivability from injury would have been greatly increased.
Agency: Unspecified.

Table 12.5: Additional Recommendation for Bangalore Torpedo Incident.
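This kind of quality control lends itself to simple tool support: if the cause, recommendation and agency columns are held in a data structure, the completeness checks described above can be run automatically. The sketch below is a minimal illustration of that idea rather than any Army procedure; the row contents are abbreviated from Tables 12.4 and 12.5, and the function name is hypothetical.

```python
# Each causal factor maps to the (recommendation, agency) pairs recorded
# against it; the rows are abbreviated from Tables 12.4 and 12.5.
findings = {
    "Excess demolition material not turned in": [
        ("Stress turn-in procedures in training and safety briefings",
         "Training and briefing officers")],
    "Excess demolition material not tied into the firing charge": [],
    "Hide position closer than the authorised 50 meters": [
        ("Enforce the hide distances specified in IAW FM 5-250", None)],
}

def audit(findings):
    """Flag causes with no recommendation and recommendations with no agency."""
    problems = []
    for cause, entries in findings.items():
        if not entries:
            problems.append("no recommendation addresses: " + cause)
        for recommendation, agency in entries:
            if agency is None:
                problems.append("no agency assigned to: " + recommendation)
    return problems

for problem in audit(findings):
    print(problem)
```

Run over the full tables, such a check would immediately flag the gaps discussed above; it cannot, however, judge whether the recommendations that are present actually address their associated causes.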
12.2.3 Enumerations and Recommendation Matrices

The heuristics that were introduced in the previous section leave considerable scope for individual investigators. They provide guidance about the general form of particular recommendations, for instance by stressing the importance of identifying appropriate implementation paths. They do not directly help investigators to identify appropriate remedies for particular causal factors. In contrast, enumerated approaches list the possible recommendations that might be made in response to particular incidents. For example, the incident involving the Bangalore torpedo was analysed according to the US Army's Accident Investigation and Reporting pamphlet PAM 385-40. This
enumerates potential root causes. It also provides a list of recommendations that must be considered when drafting the findings from any investigation:

"Code: 01, Key Word/Explanation: Improve school training. The improvement recommended should be directed toward the content or amount of school training needed to correct the accident causing error. For example: a. Provide school training for the person who made the error due to not being school trained. b. Improve the content of a school training program to better cover the task in which the error was made. c. Expand the amount of school training given on the task in which the error was made.
Code: 02, Key Word/Explanation: Improve unit training.
The improvement recommended should be directed toward the content or amount of unit training needed to correct the accident causing error. For example: a. Provide unit training for the person who made the error due to not being unit trained. b. Improve the content of unit training to better cover the task in which the error was made. c. Expand the amount of unit training given on the task in which the error was made.
Code: 03, Key Word/Explanation: Revise procedures for operation under normal or abnormal/emergency conditions.
The changes recommended should be directed toward changing existing procedures or including new ones. If the change is to an AR, TM, FM, Soldiers Manual, or other Army publication, tell the date when Department of the Army Form 2028 was submitted.
Code: 04, Key Word/Explanation: Ensure personnel are ready to perform.
The purpose of this recommendation is to encourage supervisors to make sure that their people are capable of performing a job before making an assignment. They should consider training, experience, physical condition, and psychophysiological state (e.g., fatigue, haste, excessive motivation, overconfidence, effects of alcohol/drugs).
Code: 05, Key Word/Explanation: Inform personnel of problems and remedies. This recommendation should be used when it is necessary to relay accident related information to people at unit, installation, major Army Command, or Department of the Army levels.
Code: 06, Key Word/Explanation: Positive command action.
The purpose of this corrective action is to recommend that the supervisor take action to encourage proper performance and discourage improper performance by his people.
Code: 07, Key Word/Explanation: Provide personnel resources required for the job.
This recommendation is intended to prevent an accident caused by not enough qualified people being assigned to perform the job safely.
Code: 08, Key Word/Explanation: Redesign (or provide) equipment or materiel.
This recommendation is made when equipment or materiel caused or contributed to an accident because: a. The required equipment or materiel was not available. b. The equipment or materiel used was not properly designed.
Code: 09, Key Word/Explanation: Improve (or provide) facilities or services.
This recommendation is made when facilities or services lead to an accident because: a. The required facilities or services were not available. b. The facilities or services used were inadequate.
Code: 10, Key Word/Explanation: Improve quality control.
This recommendation is directed primarily toward the improvement of training, manufacturing, and maintenance operations where poor quality products (personnel or materiel) have led to accidents.
Code: 11, Key Word/Explanation: Perform studies to get solution to root cause.
This recommendation should be made when corrective actions cannot be determined without special study. Such studies can range from informal efforts at unit level to highly technical research projects performed by Department of the Army level agencies." [796]

This enumeration illustrates some of the problems that arise when attempting to guide the drafting of recommendations in the aftermath of accidents and incidents. As we have seen, the US Air Force guidelines specifically urge investigators not to draft recommendations that involve additional studies. Heuristic 4 in the previous section was that `recommendations support actions not studies' [794]. In contrast, the US Army guidance includes code 11, which explicitly covers recommendations to perform studies which can identify solutions to `root causes'. Such inconsistencies are unsurprising given that very few studies have addressed the problems of deriving appropriate recommendations from the outcome of causal analysis techniques.

Table 12.6 illustrates the way in which the PAM 385-40 guidelines can be applied to the Bangalore torpedo case study. As can be seen, each causal factor is addressed by one or more of the recommendations proposed by the army guidance material. PAM 385-40 does not specify the way in which an investigator might identify a particular recommendation for any particular causal factor. This is left to the skill and expertise of the analyst. The specific entries in Table 12.6 must, therefore, be validated by peer review. For this reason, it might also be appropriate to introduce an additional column that explains why a recommendation code was associated with each causal factor. For example, a positive command action might address the unplanned addition of the second charge because the "supervisor (would) take action to encourage proper performance and discourage improper performance by his people".

This example illustrates the pervasive nature of the `perfective' approach to incident reporting. The US Army guidelines contain several recommendations that reflect this corrective attitude towards operator involvement in accidents and incidents: 01 (improve school training); 02 (improve unit training); 03 (revise procedures); 04 (ensure personnel are ready to perform); 05 (inform personnel of problems and remedies) and 06 (positive command action). None of the proposed recommendations addresses the organisational and managerial problems that have been stressed by recent research into the causes of accidents and incidents. Similarly, the proposed recommendations only capture a limited subset of the performance shaping factors that have been considered in previous chapters. These are partially covered by recommendation 04, which encourages supervisors to ensure that their teams are properly trained and in an adequate `psycho-physiological state'. Such objections can be addressed by extending the list of proposed recommendations. Additional codes can direct investigators towards recommendations that improve communications between different levels in an organisation or between regulators and line management. Unfortunately, the piecemeal introduction of new recommendation codes raises a number of further questions. For example, previous chapters have argued that the nature of incidents will change over time as new equipment and methods of operation are introduced into complex working environments. This argument has been used to stress the problems of identifying the generic causal factors that drive
checklist approaches such as MORT (see Chapter 11). Similar problems arise when investigators attempt to enumerate the recommendations that might be used to address these causal factors. It can be difficult to identify appropriate responses to future incidents. If these could be predicted with any confidence then safety managers would deploy such remedies beforehand, in order to prevent incidents from occurring in the first place!

Cause: The breaching team leader failed to turn in excess demo material to the ammunition supply point.
Recommendation: Code 01 (improve school training); Code 06 (positive command action).

Cause: Excess demolition material was not tied into the firing charge.
Recommendation: Code 06 (positive command action); Code 10 (improve quality control).

Cause: Addition of the second charge was not planned, practiced or communicated to the other participants.
Recommendation: Code 02 (improve unit training); Code 06 (positive command action).

Cause: There was no appointed command-directed observer/controller at the breaching site.
Recommendation: Code 07 (provide personnel resources required for the job).

Cause: Breaching team members failed to question or stop the deviated and unpracticed operation.
Recommendation: Code 01 (improve school training); Code 02 (improve unit training).

Cause: Battalion S3 (operations, planning, and training officer) recognised but failed to stop the deviated and unpracticed operation.
Recommendation: Code 01 (improve school training); Code 06 (positive command action).

Cause: Marking team leader took up a hide position closer than the authorised 50 meters to the breaching site.
Recommendation: Code 01 (improve school training); Code 06 (positive command action).

Cause: Marking team leader was unable to distinguish between the initial (smaller) detonating cord detonation and the larger Bangalore detonation.
Recommendation: Code 01 (improve school training); Code 04 (ensure personnel are ready to perform).

Table 12.6: PAM 385-40 Recommendations for the Bangalore Torpedo Incident.

A number of further problems complicate the use of enumerations. Lists of approved recommendations can guide investigators towards effective remedies. There is equally a danger that they may bias analysts towards ineffective or even dangerous interventions. Chapter 15 will introduce a number of monitoring techniques that can be used to identify such potential problems. It is important to emphasise, however, that the elements in an enumeration must be carefully validated if they are not to advocate ineffective solutions. These problems are exacerbated by the delays that can arise before the publication of revised recommendation lists. In more ad hoc approaches, individual investigators can tailor their interventions to reflect local conditions and personal observations about effective remedies for particular root causes. Such practices can be constrained when analysts must select recommendations from an enumerated list of approved interventions.

PAM 385-40 enumerates the recommendations that US Army investigators must consider when drafting their reports. As mentioned previously, it does not prescribe which particular remedies should be proposed for particular causal factors. This is both a strength and a weakness of this application of a checklist or enumerated approach. The technique relies upon the skill and insight of the investigator to determine whether or not any of the eleven recommendations can be applied. This provides a degree of flexibility that can be important for organisations that are faced with diverse failures in many different geographical and functional areas. This flexibility creates problems. As we have seen, subjective factors and individual biases might affect an investigator's decision to propose one of these recommendations. Any potential inconsistency is reduced by selecting a remedy from
the enumeration. There are, however, no guarantees that any two investigators will agree on the same recommendations from that list for any particular incident. More directed approaches address these potential problems by providing guidance on which recommendations can best be used to address particular causal factors. It is important to distinguish between recommendation techniques that simply list proposed remedies and those that provide more direct guidance about when to apply particular remedies. The previous section has illustrated the US Army's use of simple enumerations in PAM 385-40. In contrast, Chapter 11 has introduced the use of more directed approaches. For instance, Table 12.7 reproduces the Classification/Action matrices that form part of the PRISMA causal analysis technique. As can be seen, incidents that involve a failure in knowledge transfer within an organisation might result in recommendations to revise training and coaching practices. Failures that stem from operating procedures are addressed by revising procedures and protocols.
Organisational causal factor and associated action:
External factors (O-EX): inter-departmental communication.
Knowledge transfer (O-K): training and coaching.
Operating procedures (O-P): procedures and protocols.
Management priorities (O-M): bottom-up communication.
Culture (O-C): maximise reflexivity.

Table 12.7: Example PRISMA Classification/Action Matrix (2) [844].

The approach is more `directed' than the enumeration presented in the previous section because investigators can identify appropriate recommendations by reading down the column that is associated with each causal factor. Conversely, if other participants in the investigatory process propose a particular recommendation then analysts can read along the rows of the Classification/Action matrix to determine whether this would be consistent with previous findings. Table 12.7 only associates a single recommended action with each causal factor. It is important to stress that this need not be the case in all application domains. For instance, problems involving knowledge transfer might be addressed by revised training procedures and by changes in protocols and procedures. Conversely, there may be situations in which `cultural factors', such as deliberate violations of procedures, cannot simply be addressed by the `maximise reflexivity' recommendation proposed in Van Vuuren's Classification/Action matrix. In such circumstances, investigators may not be able to directly read off an appropriate recommendation from such a table. Most of the proponents of this approach confirm this analysis by arguing that these matrices are intended as guidelines that can be broken after careful deliberation rather than rules that should be followed in all circumstances. In consequence, these matrices can only be relied upon to increase the consistency of the recommendations made by investigators. They are unlikely to ensure absolute agreement.

Table 12.7 was originally developed by Van Vuuren to help identify recommendations within healthcare applications [844]. The precise nature of recommendation tables is determined by the context in which they are applied. For example, the causal factors that are represented as columns in the table must reflect the causal factors that are likely to be identified within a particular application domain. Conversely, the recommendations that form each row of the matrix must capture appropriate remedies for those causes. In terms of our case study, the rows of the matrix can be
directly derived from the enumeration provided by PAM 385-40 [796]. Fortunately, the same document also provides an enumeration of potential causal factors. For instance, Table B5 lists `System inadequacies/readiness shortcomings/root causes'. These can be incorporated into the matrix in a similar fashion to the recommendations that were enumerated in the previous section. Tables 12.8, 12.9, 12.10 and 12.11 are directly derived from the causal codes and recommendation codes that are given in the US Army's guidance on incident and accident investigation [796]. The crosses represent the only additional information that has been introduced into the matrices. They are used to denote the recommendations that might be made given that particular causal factors have been diagnosed. For example, Table 12.8 shows that if an incident had been caused by inadequate supervision by higher command, investigators might consider recommendations that are intended to ensure that personnel were adequately prepared for the tasks that they were presented with (recommendation code 04). Additional recommendations might be drafted to increase the personnel available in an operation (07), to improve facilities (09) or to improve quality control on maintenance and support services (10). Conversely, Table 12.9 can be used to deduce that recommendations to perform more studies (recommendation code 11) might be proposed if there is evidence of inadequate school training (cause code 05) or inadequate unit training (cause code 06).

Not only can the drafting of recommendation matrices help investigators to move from a causal analysis to the findings of an incident report, it can also help to identify potential flaws in the guidance that is provided to investigators. For instance, Table 12.9 lists the causes that PAM 385-40 associates with training failures. These include `habit interference' (cause code 08). This occurs when `a person makes an accident causing error because task performance was interfered with either in the way he usually performs similar tasks or the way he performs the same tasks under different operating conditions or with different equipment' [796]. As can be seen from the recommendation matrix, it is difficult to identify one of the approved recommendation codes that might be associated with this potential cause. Improved training, possibly following the principles of Crew Resource Management programmes, might address this problem. There is, however, considerable controversy about the effectiveness of such recommendations [410].

The recommendation matrices that we have derived from the PAM 385-40 codes can be applied to the Bangalore torpedo incident. For example, previous sections have argued that the Battalion S3 recognised but failed to question or stop the unrehearsed detonation of the excess munitions. This could have been caused by several factors. For instance, it might be argued that it stemmed from environmental factors such as the timescale available to complete the operation or the difficulty of communicating effectively with personnel during a night-time exercise (cause code 21). In such circumstances, investigators might use Table 12.11 to guide their analysis towards particular recommendations. For instance, investigators might advocate that additional measures be taken to ensure that S3s are prepared, in terms of individual training and safety briefings, to ensure that such departures are prevented from occurring (recommendation code 04).
Alternatively, investigators might stress the importance of positive command actions on the part of S3s in similar circumstances (recommendation code 06). It is important to recognise, however, that analysts must continue to exercise their skill and judgement in the application of recommendation matrices. For example, Table 12.11 advocates recommendations to improve facilities (recommendation code 09) and to perform more studies (recommendation code 11) in response to the environmental causes (cause code 21) mentioned above. It is difficult to identify ways in which such measures might help to avoid the recurrence of our case study. Investigators might, therefore, argue that they need not draft recommendations to cover all of the potential remedies that are identified in these matrices. The sufficiency of the proposed solutions can be judged by peer review with other investigators. The proposed remedies will, in most cases, also be assessed by regulators and safety managers when they eventually receive the investigators' report.

Table 12.8: Recommendation Matrix for Leadership Failures. Rows: recommendation codes 01-11; columns: leader-failure cause codes 01-04 (inadequate supervision by higher command, by a staff officer, by unit command, and by a direct supervisor, NCO, platoon leader or instructor). Crosses mark the recommendations to be considered for each cause.

Table 12.9: Recommendation Matrix for Training Failures. Rows: recommendation codes 01-11; columns: training-failure cause codes 05-08 (inadequate school training, inadequate unit/on-the-job training, inadequate experience, and habit interference).

Table 12.10: Recommendation Matrix for Standards Failures. Rows: recommendation codes 01-11; columns: standards-failure cause codes 09-14 (inadequate written procedures, inadequate facilities, inadequate equipment, insufficient personnel, inadequate quality control, and inadequate maintenance).

Table 12.11: Recommendation Matrix for Individual Failures. Rows: recommendation codes 01-11; columns: individual-failure cause codes 15-21 (fear and anger, complacency, lack of confidence, haste and attitude, self-induced fatigue, alcohol, drugs and illness, and environment).

It might also be argued that this `failure' was caused by a variant of habit interference. The S3 had become habituated to personnel following the approved plan and so failed to identify that the use of excess munitions departed from the approved procedure. Conversely, departures from approved plans might have become so commonplace that the S3 did not interpret the use of the excess munitions as anything `out of the ordinary'. This analysis raises a number of problems for our application of the recommendation matrices. The causal taxonomy afforded by PAM 385-40 does
not distinguish between these very different causes. In consequence, it can be difficult to identify recommendations that might be used to combat these problems. Further problems arise because, even if we could unambiguously ascribe the S3's actions to an habituation error, there are no specific recommendations associated with this causal factor. In consequence, investigators are free to identify any remedy that is considered appropriate for such an error.

The allocation of recommendations to causal factors in Tables 12.8, 12.9, 12.10 and 12.11 is arbitrary in the sense that it is based on an initial analysis of PAM 385-40. In practice, additional validation would be required before investigators could use such matrices. As we have mentioned, there can be profound consequences if safety managers propose inappropriate or ineffective remedies for the particular causes of adverse incidents. A particular concern is that we have derived these tables from the US Army's published procedures and guidance documents. There are strong differences between these sources and similar publications that guide civilian forms of incident reporting. For instance, the influence of the `perfective' approach is arguably greater in systems where military discipline and the chain of command are guiding principles. Having raised this caveat, it is important to acknowledge that the recommendation matrices in Tables 12.8, 12.9, 12.10 and 12.11 are still vulnerable to the criticisms raised by Leape [479] and Reason [701]. The focus on individual error and leadership failures obscures the organisational and managerial factors that have been identified in many previous accidents.

A number of problems complicate the use of navigational techniques that are intended to guide investigators towards particular recommendations from lists of approved interventions. For instance, it can be difficult to predetermine a range of appropriate remedies for incidents that have not yet occurred. In consequence, it is unlikely that investigators will be able to identify potential recommendations for all of the incidents that they might encounter. Similarly, the complex nature of many failures can make it difficult to ensure that approved recommendations address all of the detailed causes of a particular incident. Further problems can arise when approved recommendations do not provide sufficient detail for investigators to implement them in the aftermath of a particular incident. For instance, the previous analysis of the Bangalore torpedo case study identified the following causal factor: `Battalion S3 (operations, planning and training officer) recognised but failed to stop the deviated and unpracticed operation'. PAM 385-40 codes can be used to classify this cause. For example, Table 12.11 identifies a range of individual categories that might be used. These include a lack of confidence (code 17), undue haste (code 18) or problems with fatigue (code 19). As can be seen, however, these are at a more detailed level than the observation that was derived from the US Army's causal analysis. Investigators must, therefore, extend the initial investigation to ease the mapping between the products of the investigation and the classification provided by PAM 385-40. The same problem occurs in reverse when the matrix approach is extended to identify `recommended' intervention techniques. Table 12.11 proposes improved school (code 01) or unit training (code 02).
Investigators are also encouraged to draft recommendations that ensure a more positive command action (code 06) or that personnel are ready (code 04). At first sight, this might seem to encourage the consistency that has been advocated in previous sections. Such an impression can be misleading. Even if investigators can agree upon a common recommendation code for a causal classification, there is no guarantee that a high-level remedy such as `improve unit training' will result in similar interventions at an operational level. There are many different ways in which training might be `improved'; the efficacy of such interventions depends entirely upon which techniques are recommended and whether or not they are successfully implemented at the unit level.
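A minimal sketch may help to make the matrix approach concrete. The following Python fragment encodes only the cause and recommendation codes quoted above; the mapping between them is a hypothetical illustration rather than a transcription of PAM 385-40, and any operational matrix would require the validation discussed earlier.

CAUSE_CODES = {
    17: "lack of confidence",
    18: "undue haste",
    19: "fatigue",
}

RECOMMENDATION_CODES = {
    "01": "improve school training",
    "02": "improve unit training",
    "04": "ensure personnel are ready",
    "06": "ensure more positive command action",
}

# Hypothetical matrix: each individual-failure cause code is associated
# with the recommendation codes that investigators should consider.
RECOMMENDATION_MATRIX = {
    17: ("01", "02", "04"),
    18: ("02", "06"),
    19: ("04", "06"),
}

def candidate_recommendations(cause_code):
    """Return the approved interventions linked to a causal code."""
    return [RECOMMENDATION_CODES[code]
            for code in RECOMMENDATION_MATRIX.get(cause_code, ())]

for code, label in sorted(CAUSE_CODES.items()):
    print(code, label, "->", candidate_recommendations(code))

Even a validated version of such a lookup only narrows the search: as the text notes, a code such as `improve unit training' still leaves open which training techniques are actually adopted at the unit level.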
12.2.4 Generic Accident Prevention Models

A number of alternative recommendation techniques explicitly acknowledge the problems of first classifying causes and then using such a classification to identify recommended interventions. These techniques exploit a higher level of abstraction than that embodied within the guidance of PAM 385-40. Investigators are then encouraged to introduce additional `contextual' details into these abstractions. They are expected to exploit their skill and experience to identify the more detailed interventions that are intended to combat future failures. For instance, Haddon identified ten strategies for accident or
incident prevention [299]. These strategies are associated either with the source of the energy that is transferred during an incident, with the barriers that protect the system, or with the target that is potentially exposed to the energy release. They, therefore, have close links to the form of barrier analysis that was introduced in Chapter 10 and that was developed in Haddon's earlier work [298]. These strategies are mentioned now because they have also been proposed as a high-level framework for the identification of recommendations in incident reports [444].
Energy source.
1. Prevent the build-up of energy; this will help to ensure that the conditions for an unwanted release do not slowly accumulate over time.

2. Modify the qualities of the energy; this will help to ensure that appropriate control measures are identified to help prevent any unwarranted releases.

3. Limit the amount of energy; this will minimise the consequences of any uncontrolled release and may make that release easier to control.

4. Prevent the uncontrolled release of energy.

5. Modify the rate and distribution of released energy; this will help to ensure that any unwanted release is stopped at source as soon as possible.
Barriers.
6. Separate the energy source from the target either in time or space; this helps to mitigate the consequences of any energy release.

7. Use physical barriers to separate the energy source and the target.
Target.
8. Increase the resistance of the target to any potential energy flows.

9. Limit any knock-on or consequent damage following any initial energy loss.

10. Stabilise the situation and initiate repairs as soon as possible in case of compound failures.

As mentioned, the components of Haddon's model have been used to provide a high-level framework that is designed to help investigators identify potential recommendations in the aftermath of incidents and accidents [444]. The particular nature of those recommendations will vary from industry to industry and even from incident to incident. The intention is, therefore, not to provide an explicit enumeration of potential remedies. Instead, the components of the model provide an abstract model of those areas in which an investigator might focus any remedial actions. Table 12.12 shows how Kjellen's [444] application of Haddon's high-level strategies can be used to structure the identification of recommendations. In this case, we have applied Kjellen's approach to identify potential interventions following the Bangalore Torpedo incident. This example illustrates both the strengths and the weaknesses of the general approach. Haddon's more general model of accident prevention strategies provides a number of high-level prompts that can guide an initial consideration of potential recommendations. The model is based on the notions of barrier analysis, introduced in Chapter 10, and so it avoids some of the myopia associated with `perfective' approaches. The focus both on causal factors, such as the build-up of energy, and on mitigating factors, such as the resilience of the target, ensures that investigators do not simply focus on the products of a causal analysis when considering the recommendations for an incident report. Table 12.12 also illustrates some of the potential problems that can complicate Kjellen's application of Haddon's strategies. Although this approach provides important general guidance, it can be difficult to determine what high-level concepts such as `prevent the build-up of energy' actually mean in the context of a particular incident. Further problems arise when there are clear conflicts between the potential recommendations that might be derived from Haddon's strategy and the operational objectives that govern particular application domains. For instance, Table 12.12 suggests that the rate and distribution of the energy hazard might be altered by increasing the size of the breach that the explosives were used against. This would potentially distribute the forces acting on any particular individual who might be caught in a blast during a training exercise. Any intended reduction in the severity of an incident would, however, have to be offset against the potential tactical problems of alerting the enemy to a failed attempt on their position.
Hazard/Energy Source
1. Prevent build-up: Avoid use of explosives in night-time exercises.
2. Modify qualities: Limit the power of explosives used in night-time exercises.
3. Limit the amount: Limit the quantity of explosives issued to all personnel in a night-time exercise.
4. Prevent release: Limit the number of detonation devices issued in night-time exercises and have procedures for approving release of additional devices only when needed.
5. Modify rate and distribution: Not applicable; possibly increase the size of the breach area to distribute force?

Barriers
6. Separate source and target: Ensure marking does not proceed until permission to proceed is received from the breaching team.
7. Physical barriers: Prevent any detonation without explicit confirmation from a member of the marking team.

Vulnerable Target
8. Increase resilience: Ensure the marking team carry additional protective equipment.
9. Limit damage: Paramedic teams in immediate vicinity.
10. Rehabilitation/initiate repairs: Depends on type of injury.

Table 12.12: Applying Haddon's Ten Strategies to the Bangalore Torpedo Case Study
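Frameworks such as this can be reduced to a simple prompt list. The following Python sketch groups the ten strategies as in the list above and prints one question per strategy; the wording of the prompts and the function itself are illustrative assumptions, not part of Kjellen's published method.

HADDON_STRATEGIES = {
    "Energy source": [
        "Prevent the build-up of energy",
        "Modify the qualities of the energy",
        "Limit the amount of energy",
        "Prevent the uncontrolled release of energy",
        "Modify the rate and distribution of released energy",
    ],
    "Barriers": [
        "Separate the energy source from the target in time or space",
        "Use physical barriers to separate the source and the target",
    ],
    "Target": [
        "Increase the resistance of the target to energy flows",
        "Limit any knock-on or consequent damage",
        "Stabilise the situation and initiate repairs",
    ],
}

def prompt_investigator(incident):
    """Print one prompt per strategy; the investigator supplies the
    domain-specific recommendation, as in Table 12.12."""
    number = 1
    for group, strategies in HADDON_STRATEGIES.items():
        print(f"-- {group} --")
        for strategy in strategies:
            print(f"{number}. {strategy}: what would this mean for {incident}?")
            number += 1

prompt_investigator("the Bangalore Torpedo incident")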
It is also important to remember that mission objectives should not be seen narrowly in terms of the short-term outcome from a particular training exercise:

"Regardless of the training situation, leaders and soldiers must also understand that training exercises are just that: training. Under no circumstances should safety be overlooked to achieve a training objective. It is the safety-oriented process that will assist the unit in achieving the mission successfully. Another accident demonstrates the importance of maintaining focus on the objective safely. The unit was engaged in a challenging river crossing operation when the decision was made to float downstream. Even though current readings had not taken place, a safety boat was not on standby, and an exercise participant was not wearing a flotation device, the squad decided to proceed with the mission anyway. Unfortunately, the river's current was strong enough that it pulled all the team's elements under an anchored barge. Some of the team members survived, but two of them did not. Again, the mission was part of a training exercise. Now we can look back and think of all actions we could have taken to prevent this unfortunate accident; however, now it is too late for the unfortunate participants. Again, leaders must re-emphasise that when encountering an unsafe situation, the mission must now become safety." [807]

Such complex trade-offs between safety and mission objectives should not be surprising. The opening sections of this chapter argued that they are inevitable given that investigators may recommend changes in current operating practices. The key point is, however, that any potential recommendations that appear to fit well with Haddon's accident prevention strategies must also be carefully validated to ensure that they do not result in unintended consequences that might ultimately increase the likelihood of other incidents. As mentioned, Haddon's strategies provide a high-level framework that Kjellen has used to guide the identification of potential recommendations following incidents and accidents. Some elements of
this approach have been developed more than others. For example, barrier analysis is based around strategies 6 and 7 in Table 12.12. Chapter 10 has already referred to its widespread application as part of many causal analysis techniques. Not only can barrier analysis be used to help identify the failure of protection mechanisms, it can also be used to identify potential interventions that might avoid future failures. Before providing an example of barrier analysis as a recommendation technique, it is important to emphasise a number of underlying differences between this approach and the others that have been introduced in previous sections. As we have seen, perfective approaches place sanctions on those individuals and groups who are deemed to be responsible for particular failures. The enumeration and matrix approaches that we have analysed typically focus on identifying corrective actions for the causes of incidents and accidents. In contrast, barrier analysis typically helps to identify interventions in the accident `process' that are intended to eliminate or reduce harmful outcomes. It is possible to object that this approach does not address the root causes that provide the `starting point' for any failure. On the other hand, barrier analysis is supported by the analysis of causal asymmetry that was introduced in Chapter 11. As we have seen, Hausman has argued that it is infeasible to perform `backwards reasoning' as a reliable means of identifying particular causes from a set of effects [313]. Perrow confirms this when he argues that it is impossible to anticipate the many different causes of technological failure [675]. It, therefore, makes great sense to attempt to control or mitigate those failures that do occur rather than try to eliminate them entirely.

It is possible to identify a vast range of different barriers that might be recommended in the aftermath of an incident. Physical barriers restrict access to hazardous areas; they constrain the flow of energy from a source towards the target. Organisational barriers, such as permit to work schemes, rely upon procedural mechanisms and surveillance activities to achieve similar ends. Barriers may also be active, in other words dependent on the actions of operators or systems, or they may be passive. Passive barriers are inherent within a design and are independent of any initiating actions once they are deployed. They must, however, be monitored in case operational demands erode the protection that they afford to the user. For example, the doors of safety cages can be damaged in order to provide greater access to a working area. Kjellen also argues that barriers can be classified as either technical, organisational or social/individual in nature [444]. He provides a detailed list of such barriers that can provide the basis of a checklist approach to the identification of particular recommendations. Unlike some of the previous enumerations, the intention is not to provide a detailed, exhaustive, domain-specific list. Instead, these high-level barriers are domain independent and analysts must again apply their skill and experience to interpret them within the context of a particular incident. For example, the following list builds upon what Kjellen calls social and individual barriers. These are intended to prevent future incidents by changing the `safety culture' in a working environment:
1. education, training and experience of personnel;
2. feedback on causes and consequences of previous incidents;
3. motivational campaigns, safety meetings and awareness raising initiatives;
4. feedback rewards for `safe' performance and punishments for some violations;
5. use of automated and peer monitoring systems to assess safety performance.
As with Haddon's original strategies, investigators must translate these high-level barriers into the specific measures that are recommended in the aftermath of an incident. It is entirely possible that this process of interpretation might result in ineffective or even dangerous proposals. For example, there is no guarantee that a motivational campaign will have any effect upon individual behaviour. Similarly, reward and punishment systems can have negative effects if they alienate staff and create workplace conflict [701]. In consequence, it is also important that investigators consider means of validating the implementation of their recommendations. This is addressed in greater detail in Chapter 15. The following list extends the previous analysis to summarise a range of organisational barriers to future incidents [444]. As can be seen, these relate to the procedural mechanisms that are intended to promote the safe operation of application processes:
6. ensure sufficient numbers of staff;
7. monitor implementation and efficacy of all barriers;
8. ensure correct levels of expertise and training;
9. provide adequate reference documentation to support training;
10. provide adequate documentation for emergency procedures;
11. rehearse emergency procedures;
12. ensure maintenance is effective and timely;
13. exploit a `permit to work' system if maintenance is itself dangerous;
14. ensure adequate exchange of information and staff briefings.
It is possible to identify a number of common features between the elements of this barrier analysis and previous recommendation techniques. For instance, `8. ensure correct levels of expertise and training' is similar to recommendation code 01 `improve school training' and 02 `improve unit training' in PAM 385-40. Similarly, `6. ensure sufficient numbers of staff' is similar to recommendation code 07 `provide personnel resources required for job'. Other organisational barriers have not been proposed by more ad hoc approaches to the enumeration of recommendations. For example, the army schemes that were described in previous sections have had relatively little to say about the maintenance activities addressed by items 12 and 13 in the previous list. The following list again extends Kjellen's application of barrier analysis to summarise a number of technical barriers that might prevent the recurrence of previous incidents. As before, these are intended to provide generic recommendations that might be proposed in the aftermath of many different incidents; a consolidated sketch of the full checklist follows the list:
15. eliminate or reduce hazards in the design of equipment;
16. introduce physical barriers to minimise personnel's exposure to hazards;
17. ensure that personnel wear protective equipment whenever necessary;
18. ensure that emergency and first aid equipment is provided;
19. design the workplace to support operators (noise, ventilation etc.);
20. minimise the use, transportation and handling of hazardous materials.
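The three lists above lend themselves to a simple machine-readable checklist. The following Python sketch consolidates the social/individual barriers (1-5), organisational barriers (6-14) and technical barriers (15-20) into one numbered dictionary; the abbreviated wording and the data structure itself are assumptions made for illustration.

BARRIERS = {
    # Social and individual barriers (1-5).
    1: "education, training and experience of personnel",
    2: "feedback on causes and consequences of previous incidents",
    3: "motivational campaigns, safety meetings, awareness raising",
    4: "rewards for safe performance, punishments for some violations",
    5: "automated and peer monitoring of safety performance",
    # Organisational barriers (6-14).
    6: "sufficient numbers of staff",
    7: "monitoring of implementation and efficacy of all barriers",
    8: "correct levels of expertise and training",
    9: "adequate reference documentation to support training",
    10: "adequate documentation for emergency procedures",
    11: "rehearsed emergency procedures",
    12: "effective and timely maintenance",
    13: "permit-to-work system where maintenance is itself dangerous",
    14: "adequate exchange of information and staff briefings",
    # Technical barriers (15-20).
    15: "eliminate or reduce hazards in the design of equipment",
    16: "physical barriers to minimise exposure to hazards",
    17: "protective equipment worn whenever necessary",
    18: "emergency and first aid equipment provided",
    19: "workplace designed to support operators (noise, ventilation)",
    20: "minimise use, transportation and handling of hazardous materials",
}

# Investigators might step through the checklist for each incident:
for number, barrier in BARRIERS.items():
    print(f"{number:2}. Could this incident have been prevented by {barrier}?")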
As with the social and organisational barriers, each of these barriers can satisfy a dual role. They can be used to guide the initial design of a safety-critical application. For example, an injunction to `introduce physical barriers to minimise personnel's exposure to hazards' can be used to guide the development of a design. The products of barrier analysis can also be used to identify recommendations in the aftermath of incidents and accidents. For instance, the same injunction might be proposed as a potential remedy in the aftermath of an incident or accident. Table 12.13 builds on this analysis and uses our extended version of Kjellen's barriers to identify potential recommendations from the Bangalore Torpedo case study. As can be seen, our application of barrier analysis identifies a number of potential recommendations that might be used to inform the drafting of an incident report following the Bangalore Torpedo case study. Potential remedies are again described at a high level of abstraction and must be refined to include the domain details that characterise this particular incident. For example, a requirement to `ensure effective use of automated and peer monitoring systems' must be translated into particular procedures that can be implemented within the army's command structure. Further validation would then be required to ensure that the particular steps which were taken in the aftermath of an incident actually satisfied this high-level recommendation. These observations are similar to those that were made about the application of more general models of accident prevention, illustrated by Table 12.12.
Cause: The breaching team leader failed to turn in excess demo material to the ammunition supply point.
Barriers: 1. Education, training and experience of personnel. 2. Feedback on causes and consequences of previous incidents.

Cause: Excess demolition material was not tied into the firing charge.
Barriers: 15. Eliminate or reduce hazards in the design of equipment.

Cause: Addition of the second charge was not planned, practiced or communicated to the other participants.
Barriers: 14. Ensure adequate exchange of information and staff briefings.

Cause: There was no appointed command-directed observer/controller at the breaching site.
Barriers: 6. Ensure sufficient numbers of staff. 8. Ensure correct levels of expertise and training. 14. Ensure adequate exchange of information and staff briefings.

Cause: Breaching team members failed to question or stop the deviated and unpracticed operation.
Barriers: 5. Ensure effective use of automated and peer monitoring systems.

Cause: Battalion S3 (operations, planning, and training officer) recognised but failed to stop the deviated and unpracticed operation.
Barriers: 5. Ensure effective use of automated and peer monitoring systems. 14. Ensure adequate exchange of information and staff briefings.

Cause: Marking team leader took up hide position closer than the authorised 50 meters to the breaching site.
Barriers: 2. Feedback on causes and consequences of previous incidents. 4. Feedback rewards for `safe' performance and punishments for some violations.

Cause: Marking team leader unable to distinguish between the initial (smaller) detonating cord detonation and the larger Bangalore detonation.
Barriers: 1. Education, training and experience of personnel. 8. Ensure correct levels of expertise and training.
Table 12.13: Barrier Analysis of Recommendations from Bangalore Torpedo Incident.
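Recorded in this form, the mapping can be checked mechanically. The following Python sketch encodes the cause-to-barrier pairings of Table 12.13, using the barrier numbers from the checklist above, and flags the checklist items that no cause reaches; the abbreviated cause descriptions and the whole fragment are illustrative assumptions about how such an analysis might be recorded.

# Barrier numbers follow the reconstruction of Table 12.13 above.
CAUSE_BARRIERS = {
    "excess demo material not turned in": [1, 2],
    "excess material not tied into firing charge": [15],
    "second charge not planned or communicated": [14],
    "no observer/controller at breaching site": [6, 8, 14],
    "breaching team failed to stop operation": [5],
    "S3 recognised but failed to stop operation": [5, 14],
    "hide position closer than authorised 50 metres": [2, 4],
    "detonations could not be distinguished": [1, 8],
}

cited = {barrier for barriers in CAUSE_BARRIERS.values() for barrier in barriers}
uncited = sorted(set(range(1, 21)) - cited)
# Flags, among others, the emergency-response items 10, 11 and 18
# whose absence from the analysis is discussed below.
print("Checklist items not addressed by any cause:", uncited)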
This should not be surprising as both approaches share a common root in Haddon's work on incident causation [298, 299]. There are also more worrying similarities. For example, the recommendations identified in Table 12.13 are similar to many of the interventions that were identified using ad hoc heuristics and enumerations, such as those illustrated in Table 12.6. It can be argued that these similarities reflect particular properties of our case study; the Bangalore Torpedo incident may not provide a suitable example to demonstrate the differences between these contrasting recommendation techniques. Alternatively, it can be argued that there are very few differences between the application of accident prevention models and more ad hoc techniques. There is, however, a third explanation. Table 12.13 illustrates some of the problems that can arise when recommendation techniques are driven directly by causal analysis. For instance, previous sections have argued that investigators must not only focus on the causes of an incident but also on those barriers and controls that help to mitigate its consequences. Unfortunately, Table 12.13 focuses only on remedies for the causes of the incident. It does not consider the performance of triage and evacuation procedures in the aftermath of the incident. This reflects the balance of detail that was provided in the initial US Army report [818]. In consequence, our analysis does not consider certain recommendations: `10. provide adequate documentation for emergency procedures'; `11. rehearse emergency procedures' or `18. ensure that emergency and first aid equipment is provided'. If the initial causal analysis of the incident had also been extended to include mitigating factors, as was done in Chapter 10, then this would have exposed important differences between the recommendations identified in Table 12.6 and those proposed in Table 12.13.
12.2.5 Risk Assessment Techniques

This section began by describing perfective techniques. These approaches focus almost exclusively on exhortations for operators to `do better' in order to avoid previous failures. Subsequent sections identified a range of techniques that broadened the scope of this analysis. For example, US Army and Air Force heuristics urge investigators to identify recommendations that address each of the causal factors identified during previous stages in an investigation [794, 796]. This more general approach has become embodied within recommendation matrices. Accident prevention models further broaden the scope of any recommendations that are identified in the aftermath of an incident. Not only do they address individual causal factors, they have also been extended to identify recommendations that are intended to strengthen system defences. These include the mitigating factors that can help to control the adverse consequences of particular failures. The broadening scope of recommendations is appropriate because it reflects a growing recognition that most incidents involve complex interactions between people, systems and the environment in which they interact [675]. It does, however, create a host of practical problems. In particular, it can be difficult for the recipients of an incident report to determine how best to allocate finite resources to support the implementation of all of the diverse recommendations that might be made by investigators. The Canadian report into Operation Assurance illustrates the scale of this problem [128]. This summarised approximately fifty-one recommendations that were made as a result of incidents that occurred during relief efforts in Rwanda. These recommendations included specific measures to improve training at the highest level within the joint forces:

"Canadian Forces must `educate leaders and staffs at the most senior levels in both strategic and operational level doctrine processes'. This should be done as a teaching seminar either prior to or in concert with a major command post or computer assisted exercise. The objectives of the exercise should include education and validation with regards to joint doctrine and validation of Joint Forces Headquarters. Thereafter, education/review must be conducted on a routine basis." [128]

It also included more detailed recommendations, for example about the amount of notice that personnel should be given prior to being deployed in remote locations. Similar observations can be made about detailed investigations into single incidents. Apparently simple incidents, such as the misuse of commercial heaters in tents, can generate tens of recommendations that range from
improved training of personnel through to changes in the monitoring of standards throughout the chain of command [815]. The diverse nature of many recommendations and the sheer volume of remedial actions that can be proposed in the aftermath of an incident can create considerable problems for the recipients of these reports. Risk assessment techniques provide investigators, regulators and end-users with means of prioritising the recommendations that are made in the aftermath of an adverse occurrence. Previous sections have noted that the particular details of who performs this prioritisation vary between different industries. For instance, in the Air Traffic Management domain there are well-specified procedures that govern the reporting of recommendations by investigators back to safety managers who then prioritise their findings [423]. In local incident reporting systems, for example within UK hospitals, the same individuals may identify and prioritise potential recommendations [119]. This informal process is intended to be highly cost-effective in terms of the resources required to perform the analysis. Although it can be carefully tuned to the local working conditions of the units that operate the incident reporting system, this approach is also open to the many subjective biases that can distort risk assessments [2].

A number of organisations have recognised the key relationship between incident reporting and risk analysis. For example, the US Army Safety Program recently devoted an issue of their Countermeasure magazine to `Accident Investigation: The Other Side of Risk Management' [807]. Of course, this relationship is more complex than the prioritisation of recommendations. For example, the US Army identifies five stages in a risk management process [805, 798]. These can be summarised as follows:

1. Identify hazards. Incident reporting provides important guidance in this initial stage of any risk assessment because it provides information about previous failures. Data can be collated from other operational units both within the same organisation and from national and international groups operating similar processes. There is, however, no guarantee that previous incidents will provide good information about future failures involving novel production techniques.

2. Assess hazards. This second stage of risk management is intended to assess the impact of each hazard in terms of potential loss and cost based on probability and severity. More will be said about this in the following paragraphs. However, for now it is sufficient to observe that incident reporting systems not only provide information about previous types of hazard, they can also be used to identify the likely consequences of future failures based upon previous outcomes.

3. Develop controls and make risk decisions. As control measures are developed, risks must be re-evaluated until they are reduced to a level that is `as low as reasonably practicable'. This ALARP principle is controversial because it implies that it is possible to identify situations in which the perceived benefits of reducing the risk of a particular failure any further are outweighed by the potential costs of implementing such a risk reduction. Chapter 15 will describe how incident reporting systems can be used to support this aspect of risk management. In theory, it should be possible to demonstrate the effectiveness of particular recommendations by monitoring falls in the frequency and severity of future incidents.
This is not always possible given the problems of ensuring the uniform implementation of recommendations and the relatively low frequency of many safety-critical incidents.

4. Implement controls. The fourth stage in any risk management program is to implement the controls that are intended to achieve the intended risk reduction. Again, incident reporting systems provide an important source of information on the potential benefits of particular forms of control or barrier. Data from other plants can be used to determine whether the introduction of these measures can create the opportunity for further types of incident, for instance during the installation and `burn-in' of new equipment.
5. Supervise and evaluate. Finally, it may be necessary to monitor not simply the performance of any recommended controls but also to ensure that personnel continue to follow recommended practices. Similarly, it is important to ensure that recommended safety equipment is maintained and operated in a manner that ensures that it is available on demand. It is important to remember the problems associated with gathering reliable evidence for violations from incident reporting systems. It can be difficult to obtain evidence about conformance to recommendations, especially if they have been implemented following previous incidents.

As can be seen, several dependencies exist between risk management and the identification of recommendations in the aftermath of an incident. Risk management techniques can be used to determine the relative priority of particular recommendations. If a particular hazard is thought to be very unlikely, or if it implies only marginal consequences for the safe operation of an application, then the recommendation may be assigned a relatively low level of priority. Conversely, if an associated hazard is predicted to have a high frequency or a relatively large impact on safe and successful production then recommendations to address that hazard will be assigned a high level of priority. Incident reporting systems can then be used to assess the efficacy of those recommendations that are rated particularly highly using such risk management techniques. If similar incidents continue to occur then the effectiveness of a recommendation may be questioned. Conversely, if incidents occur from hazards that were assigned a low relative priority then the effectiveness of the risk management system can be questioned. In order to understand the role of risk management techniques in the identification and prioritisation of recommendations it is first necessary to describe the underlying components of a risk management system in greater detail. The fundamental concept behind this approach to the development of safety-critical systems is that:
Risk = Frequency × Cost
This formula provides a means of assessing the potential effectiveness of any recommendation in terms of reductions in the costs or consequences of an incident. It can also be used to prioritise recommendations in the aftermath of an incident. As mentioned in previous sections, US Army [796] and Air Force [794] guidelines argue that each recommendation must be clearly associated with the results of a causal analysis. The US Army defines a hazard to be `any real or potential condition that can cause injury, illness, or death of personnel or damage to or loss of equipment, property or mission degradation' [805]. In consequence, the same frequency and consequence that are associated with a hazard can also be associated with the causes of an incident or accident. This can inform the identification of recommendations in one of two ways:

1. if recommendations have already been identified using one of the techniques described in previous sections then risk assessment techniques can be used to identify the priority of each proposed remedy in terms of the risk associated with the cause that it is intended to address;

2. if recommendations have not already been identified then a risk assessment can be performed for each cause. The results of this analysis help to establish a partial ordering that can be used to allocate finite investigatory resources. Greatest attention should be paid to finding appropriate recommendations for those causes that are assumed to pose the greatest continuing threat to a system.

Unfortunately, a number of factors complicate the application of the previous formula to guide the prioritisation of recommendations. The previous formula is a simplification. Subsequent paragraphs will introduce concepts such as risk exposure that must also be considered when attempting to assess the priority of particular recommendations. If such factors are not taken into account then it is possible to assign relatively low priorities to recommendations that could have a relatively large impact upon the risk of future incidents, because investigators fail to accurately assess the potential frequency of a particular causal factor or hazard. It can be argued from the previous formula that the risk associated with a hazard will fall if either its frequency is reduced or the costs
associated with that hazard fall. This, of course, assumes that such reductions are not offset by a corresponding rise in the other component of the equation. For instance, Bainbridge has argued that the implementation of many safety recommendations reduces the frequency of a particular cause but can also increase the consequences of those hazards when they do occur [65]. This is one of several `ironies of automation'; the relatively low frequency of certain failures can leave operators unprepared to intervene in adverse incidents. Further problems stem from attempts to derive numerical estimates for the consequent cost of a particular hazard. The amount of money that must be spent in the aftermath of previous incidents can prove to be an extremely poor indication of what might have to be paid in the future. The increasing use of litigation within certain healthcare systems has resulted in massive changes in the scale of compensation that must now be paid following many adverse incidents [453, 633]. There are well-established mechanisms for calculating the potential liability associated with fatalities and personal injuries. It can, however, be difficult to predict the potential scale of such injuries that might result from future incidents. The costs associated with air collisions can vary greatly depending on the numbers of ground fatalities that are factored into any calculation. It is also difficult to predict the punitive elements that can be introduced during the settlement of claims within some legal systems. In consequence, most organisations avoid precise numerical assessments for the potential costs associated with particular hazards. Instead, they rely upon subjective bands that are described using keywords. This is an approach that complements the use of such terms within HAZOPS [27]. For instance, the US Army encourages risk managers to consider a number of basic categories that can be used to describe the consequences associated with a particular hazard [805]. The costs of an incident are assessed in terms of the expected degree of injury, property damage or other `mission-impairing' factors:

1. Catastrophic: death or permanent total disability, system loss, major damage, significant property damage, mission failure.

2. Critical: permanent partial disability, temporary total disability in excess of 3 months, major system damage, significant property damage, significant mission degradation.

3. Marginal: minor injury, lost workday accident, minor system damage, minor property damage, some mission degradation.

4. Negligible: first aid or minor medical treatment, minor system impairment, little/no impact on mission accomplishment.

A number of caveats can be raised about the interpretation of these different categories. The relatively low costs associated with near-miss incidents can persuade organisations to underestimate the consequences of a potential accident. It is for this reason that some organisations have argued that the cost component of the risk management equation, given above, should only be calculated in terms of the worst plausible outcome of an incident. If pilots were able to narrowly avert a collision then safety managers might assess the costs of such an occurrence in terms of the potential loss of both aircraft. This appears to be a rational and well-considered approach. Problems arise, however, when investigators must determine what `worst plausible outcome' actually means for specific incidents. This issue was addressed in more detail in Chapter 4.
Chapter 11 notes the difficulty of obtaining quantitative data about incident frequencies. Data from bench trials and experimental observations often cannot be replicated in complex, working environments. Conversely, observational data and information from automated logging systems can be difficult to calibrate and interpret. Incident reporting systems provide a partial solution to these problems. They provide information about `actual' incidents in `real' working environments. Forms can be designed to elicit the information that is necessary to interpret observations about adverse events. In confidential systems it is possible to gather additional information about the context in which failures occur. As we have seen, however, participation bias and relatively low submission rates create significant problems for the use of incident reporting data as a `raw' source for risk management. The US Army [805], therefore, also provides guide-words that describe the frequency of a potential hazard:
1. Occurs often, continuously experienced.

2. Occurs several times.

3. Occurs sporadically.

4. Unlikely, but could occur at some time.

5. Can assume it will not occur.

A number of further problems complicate the application of this approach to risk management. Previous paragraphs briefly mentioned that the risk associated with a particular hazard is partially determined by the length of exposure to that hazard. This must take into account both the cumulative duration of any exposure and the summative effect of individual exposures across different operational units. These issues were not considered in previous formulae. One means of addressing this omission is to refine the subjective categories that are used to describe the potential frequency of a hazard. This is the approach that is advocated by the US Army's guidance on Risk Management, illustrated in Table 12.14 [798]. One consequence of adopting this approach is that it introduces additional complexity and sacrifices the superficial simplicity that is an important strength of the initial frequency definitions.

As mentioned, the previous risk assessment formula can be applied together with the previous definitions of consequence and frequency to estimate the risk that is associated with particular hazards. In practical terms this is accomplished using a risk assessment matrix similar to that presented in Table 12.15. The use of such matrices has important consequences for the use of risk analysis to drive the prioritisation of incident recommendations. As can be seen, Table 12.15 supports the high-level classification of hazards into Extremely high, High, Moderate and Low risks. Such distinctions are unlikely to provide a total ordering over the many different recommendations that are made in the aftermath of safety-related incidents. In consequence, even if analysts do resort to the use of risk assessment techniques to supplement an incident investigation, they will still have to exploit a range of additional techniques to rank individual recommendations within these gross categories.

Table 12.15 can be used in conjunction with the US Army guidance on frequency and consequence assessment to prioritise the recommendations that were identified by the investigation into the Bangalore Torpedo case study. This process begins by performing a risk assessment of the causal factors that were identified in the aftermath of this incident. This approach is justified by the Army guidance that points to the close relationship between the hazards that are considered in any risk assessment and the causes of previous accidents [805]. Table 12.16 illustrates the results of such an analysis. As can be seen, a frequency and criticality level are associated with each of the causal factors. The subjective nature of these assessments makes it important that investigators also document the justification for the allocation of particular levels to each of the causal factors. For instance, the breaching team leader's failure to turn in excess demo material was classified as a likely occurrence on the basis of comments made by the investigating officer: "Introduction of left over demolition materials into the last shot has been a longstanding accepted procedure. Such action violates the requirement to turn in all excess explosives..." [818]. Similarly, the breaching team members' failure to question or stop the deviated and unpracticed operation was assessed as being unlikely.
This was based on an analysis of previous exercises in which phase one and phase two walkthroughs established the pattern for an operation and helped personnel to question deviations from the planned actions. Such justifications might be explicitly included within risk assessment documents such as Table 12.16. Previous paragraphs have briefly described the problems of assessing the likely consequence of a particular hazard in the aftermath of an adverse occurrence. It might be argued that there were no consequences from any of the particular causes of a `near-miss' incident. In contrast, if we apply the `worst plausible outcome' assumption then almost every cause can have potentially catastrophic outcomes. This dilemma can be illustrated by assigning a criticality level to the observation that there was `no appointed command-directed observer/controller at the breaching site'. It is difficult to argue that the lack of an observer led to mission failure, `death or permanent total disability', unless we know the context in which this hazard occurred.
FREQUENT (A): Occurs very often, continuously experienced.
Single item: Occurs very often in service life. Expected to occur several times over duration of a specific mission or operation. Always occurs.
Fleet or inventory of items: Occurs continuously during a specific mission or operation, or over a service life.
Individual soldier: Occurs very often in career. Expected to occur several times during mission or operation. Always occurs.
All soldiers exposed: Occurs continuously during a specific mission or operation.

LIKELY (B): Occurs several times.
Single item: Occurs several times in service life. Expected to occur during a specific mission or operation.
Fleet or inventory of items: Occurs at a high rate, but experienced intermittently (regular intervals, generally often).
Individual soldier: Occurs several times in career. Expected to occur during a specific mission or operation.
All soldiers exposed: Occurs at a high rate, but experienced intermittently.

OCCASIONAL (C): Occurs sporadically.
Single item: Occurs some time in service life. May occur about as often as not during a specific mission or operation.
Fleet or inventory of items: Occurs several times in service life.
Individual soldier: Occurs some time in career. May occur during a specific mission or operation, but not often.
All soldiers exposed: Occurs sporadically (irregularly, sparsely, or sometimes).

SELDOM (D): Remotely possible; could occur at some time.
Single item: Occurs in service life, but only remotely possible. Not expected to occur during a specific mission or operation.
Fleet or inventory of items: Occurs as isolated incidents. Possible to occur some time in service life, but rarely. Usually does not occur.
Individual soldier: Occurs as isolated incident during a career. Remotely possible, but not expected to occur during a specific mission or operation.
All soldiers exposed: Occurs rarely within exposed population as isolated incidents.

UNLIKELY (E): Can assume will not occur, but not impossible.
Single item: Occurrence not impossible, but can assume will almost never occur in service life. Can assume will not occur during a specific mission or operation.
Fleet or inventory of items: Occurs very rarely (almost never or improbable). Incidents may occur over service life.
Individual soldier: Occurrence not impossible, but may assume will not occur in career or during a specific mission or operation.
All soldiers exposed: Occurs very rarely, but not impossible.

Table 12.14: US Army Guidance on Hazard Probability [798].
                 A. Frequent     B. Likely       C. Occasional  D. Seldom  E. Unlikely
1. Catastrophic  Extremely high  Extremely high  High           High       Moderate
2. Critical      Extremely high  High            High           Moderate   Low
3. Marginal      High            Moderate        Moderate       Low        Low
4. Negligible    Moderate        Low             Low            Low        Low

Table 12.15: Risk Assessment Matrix.

Cause: The breaching team leader failed to turn in excess demo material to the ammunition supply point.
Frequency: B. Likely. Consequence: 1. Catastrophic. Risk Assessment: Extremely high.

Cause: Excess demolition material was not tied into the firing charge.
Frequency: D. Seldom. Consequence: 4. Negligible. Risk Assessment: Low.

Cause: Addition of the second charge was not planned, practiced or communicated to the other participants.
Frequency: D. Seldom. Consequence: 1. Catastrophic. Risk Assessment: High.

Cause: There was no appointed command-directed observer/controller at the breaching site.
Frequency: B. Likely. Consequence: 1. Catastrophic. Risk Assessment: Extremely high.

Cause: Breaching team members failed to question or stop the deviated and unpracticed operation.
Frequency: E. Unlikely. Consequence: 4. Negligible. Risk Assessment: Low.

Cause: Battalion S3 (operations, planning, and training officer) recognised but failed to stop the deviated and unpracticed operation.
Frequency: E. Unlikely. Consequence: 1. Catastrophic. Risk Assessment: Moderate.

Cause: Marking team leader took up hide position closer than the authorised 50 meters to the breaching site.
Frequency: B. Likely. Consequence: 2. Critical. Risk Assessment: High.

Cause: Marking team leader unable to distinguish between the initial (smaller) detonating cord detonation and the larger Bangalore detonation.
Frequency: D. Seldom. Consequence: 4. Negligible. Risk Assessment: Low.

Table 12.16: Risk Analysis for Bangalore Torpedo Incident.
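The lookup embodied in Table 12.15 is simple enough to mechanise. The following Python sketch encodes the matrix and applies it to two of the causal factors from Table 12.16; the function name and the worked example are illustrative assumptions rather than part of the US Army guidance.

# Risk matrix from Table 12.15, keyed by (frequency, consequence).
RISK_MATRIX = {
    ("A", 1): "Extremely high", ("A", 2): "Extremely high", ("A", 3): "High",     ("A", 4): "Moderate",
    ("B", 1): "Extremely high", ("B", 2): "High",           ("B", 3): "Moderate", ("B", 4): "Low",
    ("C", 1): "High",           ("C", 2): "High",           ("C", 3): "Moderate", ("C", 4): "Low",
    ("D", 1): "High",           ("D", 2): "Moderate",       ("D", 3): "Low",      ("D", 4): "Low",
    ("E", 1): "Moderate",       ("E", 2): "Low",            ("E", 3): "Low",      ("E", 4): "Low",
}

def assess(frequency, consequence):
    """Frequency A (frequent) to E (unlikely); consequence 1
    (catastrophic) to 4 (negligible)."""
    return RISK_MATRIX[(frequency, consequence)]

# Two of the Bangalore Torpedo causal factors from Table 12.16:
print(assess("B", 1))  # failure to turn in excess demo material -> Extremely high
print(assess("E", 4))  # breaching team failed to question -> Low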
If excess material was being used in an unscheduled procedure then the lack of an observer can have catastrophic consequences. In other contexts the consequences are much less severe. This illustrates the need to provide additional guidance for investigators who must determine the potential future consequences of such causal factors in a variety of different contexts. Ideally, we would like a rule or form of argument that plays a similar role to counterfactual reasoning in many causal analysis techniques [470]. Without such a decision procedure, investigators must continue to rely upon their expertise and judgement when determining the consequence of future hazards. As before, it is important that others can follow the justifications that support such judgements. For example, Table 12.16 assigns negligible consequences to the `breaching team members failed to question or stop the deviated and unpracticed operation' because this last line of defence should not be relied upon given the stress levels and distractions associated with night-time operations. Of course, other investigators might argue that the consequences of breaching such a final barrier are critical or catastrophic. The key point here is that, by documenting the justifications for such an allocation, it is then possible for other analysts to validate the reasons for prioritising the recommendations that are intended to address particular causes of an incident.

We have argued that the priority of a recommendation can be determined by assessing the risk of the causes that it is intended to address. This depends upon the recognition that the causes of incidents provide valuable information about the hazards that threaten the future operation of
safety-critical systems [805]. It is important to emphasise, however, that although all causes can be thought of as hazards, it is not the case that all hazards are causes. In particular, there may be potential failures that have not yet contributed to particular incidents. Investigators must consider this issue when assessing the priority of a recommendation. For example, our case study did not involve a friendly fire incident. The introduction of a controller/observer at key positions during a night exercise might also help to reduce the risks associated with this other form of hazard. Hence it can be argued that the priority of this recommendation ought to be increased to reflect the additional perceived benefit to be derived from such an intervention. It is also important to reiterate the argument that was made in the closing sections of Chapter 11. Causal asymmetries imply that there may be a number of different alternative causes for any particular incident. In consequence, investigators must consider the relative importance of recommendations that will not simply address the causes of a particular incident. They must also prioritise recommendations that address alternative causes that might have resulted in the same or similar failures. For instance, there are numerous ways in which the marking party might have suffered similar injuries given that they were too close to the site of the breach when the Bangalore Torpedoes were deployed. A comprehensive risk analysis would, therefore, consider these different causal paths when determining the relative priority of recommendations that might ensure conformance to the distance requirements in FM 5-250 [818].

The US Army promotes a five-stage process of risk analysis: identify hazards; assess hazards; develop controls and make risk decisions; implement controls; supervise and evaluate [798]. Previous paragraphs have described how, in the context of incident reporting, causal analysis techniques can be used to identify the particular hazards that lead to an incident or accident. Hazard assessment techniques can then be used to derive a partial ordering that prioritises those causes. The third step in the process is to identify `controls and make risk decisions'. This stage can be implemented using the recommendation techniques that have been introduced in this chapter. For example, the US Army's FM 100-14 advocates an approach that has much in common with barrier analysis:

"After assessing each hazard, leaders develop one or more controls that either eliminate the hazard or reduce the risk (probability and/or severity) of a hazardous incident. When developing controls, they consider the reason for the hazard, not just the hazard itself. Controls can take many forms, but fall into three basic categories: educational controls, physical controls, and avoidance. Educational controls are based on the knowledge and skills of the units and individuals. Effective control is implemented through individual and collective training that ensures performance to standard. Physical controls take the form of barriers and guards or signs to warn individuals and units that a hazard exists. Additionally, special controller or oversight personnel responsible for locating specific hazards fall into this category. Avoidance controls are applied when leaders take positive action to prevent contact with an identified hazard." [798]

To summarise, there are two ways in which risk assessment techniques can be used to prioritise the recommendations from incident reports.
Firstly, they can be used to rank the causes of an incident. Resources can then be deployed to focus on the generation of recommendations that address those causes that pose the highest risk to the continued safety of an application. Secondly, recommendations might be identified for all causes without predetermining the relative importance of particular causes. Once those recommendations have been identified, investigators can rank them by performing a post hoc risk analysis on the causes that are associated with those recommendations. This has the advantage of enabling investigators to increase the importance of recommendations that are perceived to address more than one cause. In our case study, Table 12.4 showed how the recommendation that `All personnel must have confidence in their authority to immediately stop any unsafe life threatening act and exercise it accordingly' was proposed by the Army investigation to address both the Battalion S3's and the breaching team members' failure to stop the `deviated and unpracticed' operation. Post hoc risk assessments can take this into account. This is arguably less likely if recommendations are only identified after a risk assessment has been performed on the causes of an incident.
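The post hoc approach lends itself to a simple rule: a recommendation inherits the risk level of the riskiest cause that it addresses. The following Python sketch applies this rule to the example just given; the ordering of the risk bands and the helper function are assumptions made for illustration rather than prescribed practice.

RISK_ORDER = ["Low", "Moderate", "High", "Extremely high"]

def priority(cause_risks):
    """A recommendation is as urgent as the riskiest cause it addresses."""
    return max(cause_risks, key=RISK_ORDER.index)

# From Table 12.16, the S3's failure to stop the operation was assessed
# as Moderate risk and the breaching team's failure as Low risk; the
# shared recommendation from Table 12.4 therefore ranks as Moderate.
print(priority(["Moderate", "Low"]))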
A number of further problems complicate the use of risk assessment techniques to prioritise recommendations. The US Army [796] and Air Force [794] guidelines argue that recommendations should be associated with individual causal factors. The US Army's FM 100-14 goes on to argue that the causal factors in accidents and incidents help to identify the hazards that drive risk assessments. We have extended this argument by using these risk assessment techniques to derive priorities for the recommendations that are associated with particular causal factors. This creates problems because incidents are not, typically, the result of individual causal factors. They are, instead, the result of complex conjunctions of causes. This is emphasised by the differences between the previous formula, Risk = Frequency × Consequence, and the more complex formulations of the partition models that were introduced in Chapter 11. The observation that incidents stem from causal complexes rather than individual causal factors has important implications for the use of risk assessment techniques to prioritise proposed interventions. If recommendation techniques focus on singular, particular causes rather than combinations of causes then investigators may fail to address systemic issues. For instance, Table 12.3 summarises several recommendations that advocate improved training as a potential remedy for the Bangalore Torpedo incident. The combined effect of such individual recommendations might encourage investigators to consider a more systematic reappraisal of training procedures. Similarly, proposals to improve the `safety culture' within an organisation have the potential to address many different hazards [342]. We have argued that the priority of a recommendation is determined by the risk associated with the cause or hazard that it is intended to address. This creates problems if the proposed recommendation only has a negligible effect upon a high-risk hazard. From this it follows that the
priority of a recommendation is determined by the reduction that it achieves in the risk of an associated hazard or hazards. There are, however, considerable practical difficulties involved in assessing the
likely impact of a particular recommendation. This is acknowledged within the US Army guidance: "risk management is the recognition that decision making occurs under conditions of uncertainty" [798]. Uncertainty stems from several layers of subjunctive reasoning. The investigator must assess the likely probability of a hazard recurring, then they must assess the likely consequences of that hazard. Finally, they must assess the potential impact that any recommendation will have on their predictions about the frequency and consequence of future incidents!

A number of important consequences stem from the notion that the priority of a recommendation can be determined by the expected reduction in the risk of a particular hazard. In particular, investigators may have to accept that the residual risk after any recommendations have been implemented remains so high that an operation or task should not be permitted to continue. For example, incident data was used to justify permanently suspending the use of the 1370-L956, flash artillery simulator, M110, during any training activity in the US Army. The 1370-L956 was "identified as contributing to numerous serious injuries of our military members during training activities and was permanently suspended from future use with units directed to turn in all unused assets" [817]. As might be expected, the overall residual risk associated with a system is determined by the maximum risk associated with a particular hazard and not the average of those risks:

"If one hazard has high risk, the overall residual risk of the mission is high, no matter how many moderate or low risk hazards are present... The commander must compare and balance the risk against mission expectations. He alone decides if controls are sufficient and acceptable and whether to accept the resulting residual risk. If he determines the risk level is too high, he directs the development of additional controls or alternate controls..." [798]

Previous paragraphs have introduced the US Army's five-stage process of risk analysis: identify hazards; assess hazards; develop controls and make risk decisions; implement controls; supervise and evaluate. They have also described how the first three stages can be used to prioritise recommendations in terms of the difference between an initial risk assessment and the residual risk associated with both the particular causes of an incident and the more general hazards that an incident helps to identify. Of course, the residual risk that motivates the promotion of a particular recommendation will only be achieved if the remedial actions are effectively implemented. The incidents that have been described in previous chapters of this book provide some idea of how difficult it can be to ensure such conformance.
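The quoted rule, that one high-risk hazard dominates any number of lesser ones, is the same maximum operation used for post hoc ranking above, applied at the level of the whole mission. A minimal sketch, with illustrative risk bands and example values:

RISK_ORDER = ["Low", "Moderate", "High", "Extremely high"]

def overall_residual_risk(hazard_risks):
    """FM 100-14: overall residual risk is the maximum across hazards,
    never the average."""
    return max(hazard_risks, key=RISK_ORDER.index)

# One High hazard makes the whole mission High, regardless of the rest.
print(overall_residual_risk(["Low", "Low", "Moderate", "High"]))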
The problems of implementing recommendations can be exacerbated by the organisational and institutional boundaries that exist between investigatory and regulatory authorities. As mentioned in Chapter 4, these distinctions help to preserve the investigators' independence from those who are partly responsible for promoting an industry. One consequence of this is that the powers to ensure compliance, typically, rest with the regulators rather than the investigatory agencies. There have been notable instances in which this has resulted in recommendations not being policed or enforced in the aftermath of previous incidents [193]. Such situations have been rectified by creating a clear distinction between the roles of economic regulation and the policing of safety requirements. The following section builds on this analysis by investigating the processes that support the implementation of particular recommendations.
12.3 Process Issues

The previous section investigated a number of recommendation techniques including the `perfectability' approach, high-level heuristics, navigation techniques including enumerations and recommendation matrices, generic accident prevention or barrier models and risk assessment techniques. These approaches are intended to help investigators identify interventions that will either mitigate the consequences of failure or will reduce the likelihood of similar incidents occurring in the future. It is important to recognise, however, that such a list of recommendations is not an end in itself. Recommendations must be validated and then presented to regulatory bodies and safety managers, who may challenge the utility of particular proposals. The following paragraphs, therefore, analyse these additional stages that must be passed before a proposed intervention is adopted and then implemented.
12.3.1 Documentation

It is important that investigators document the recommendations that are intended to address potential problems in existing systems. This is essential if others are to implement any proposed interventions. This does not simply involve drafting guidelines to describe the proposed recommendation. In most reporting systems, investigators must also document the reasons that motivate particular findings. This is important if regulators, safety managers and other personnel are to understand the motivation for intervening in existing working practices. It is possible to identify a range of additional information that must be provided to support particular recommendations:
what causes or hazards does the recommendation address?
The opening sections of this chapter cited army and air force guidelines which require that recommendations are closely tied to particular causal factors. This is intended to ensure that as much as possible is learned from an incident; every cause should be addressed by at least one recommendation. Later sections have extended this argument by identifying recommendations, such as improvements in `safety culture' or in training practices, that may address many different causes of a particular incident. Finally, it has been argued that incident investigations can uncover potential hazards that were not involved in a previous incident but which have the potential to jeopardise future safety. It is important for each of these cases that investigators explicitly identify the hazard that a recommendation is intended to address. Without such information it will be difficult for others to assess whether or not a proposed intervention provides sufficient protection against future failures.
what is the significance of the cause or hazard that a recommendation addresses?
As we shall see, recommendations are often passed to regulators or safety managers who must then guide the allocation of finite resources to ensure that they are implemented. From this it follows that investigators must help others to determine how to maximise their use of these resources. Risk assessment techniques have been proposed as a potential means of assessing the importance of a recommendation [798]. This can be derived from the risk associated with the hazard that a proposed intervention is intended to address. Unfortunately, a number of
problems complicate the application of this technique in `real world' systems. In consequence, a great deal of subjective judgement, of skill and expertise, is required in order to assess the significance of a particular recommendation. Unless such judgements are documented, however, there are few guarantees that resources will not be diverted towards relatively trivial changes whilst more significant recommendations are neglected.
what are the intended consequences of the recommendation?
who will implement and monitor each recommendation?
Ideally, we would like to document measures that can determine whether or not a recommendation has been successfully implemented. This is easier with some recommendations than others. For instance, it is relatively straightforward to initiate plant inspections as a means of determining whether or not process components have been replaced. It can be more difficult for investigators to schedule inspections that might be necessary to determine whether a particular change has been made in a training regime. This often involves complex scheduling of site visits that can alert operators to a forthcoming inspection. There are further problems. It is generally much easier to determine whether or not a change has been made in an application process. It can be far more difficult to demonstrate that any change has had an anticipated impact upon the overall safety of a system. As we have seen, poor submission rates and reporting bias can prevent reliable conclusions being drawn from raw incident data. Investigators should, therefore, consider how to demonstrate the effectiveness of any funds that are invested in the implementation of particular recommendations. The US Army and Air Force heuristics urge investigators to identify the individuals or groups who are responsible for implementing particular recommendations [794, 796]. Investigators must not specify how to implement a recommendation. This is important because investigators may lack the local expertise that is necessary to determine how best to implement a particular improvement. Similarly, the design and coordination of any changes might take far longer than the period of time that can be devoted to a particular investigation. Instead, incident reports must document what a recommendation is intended to achieve and why that objective is important. It is clearly important, however, to determine who is responsible for implementing any proposed intervention. This individual must determine how to realise a recommendation from the investigators' description of what a recommendation must achieve and why it must achieve it. If they misconstrue the investigators' intentions or if they lack the resources to implement necessary changes then there is a danger that past failures will recur as future incidents.
what is the time-frame for any recommendation?
The implementation of recommendations can be delayed by resource limitations, lack of managerial guidance, deliberate obstruction and so on. Ultimately, this can leave any system exposed to repeat failures if proposed changes are not introduced in time. In consequence, it is important that investigators specify when a recommendation should be implemented. There is a danger that this maximum time period will be seen as a target and not as an upper boundary for any remedial actions. Many investigators, therefore, provide detailed guidance on the phased introduction of particular recommendations. It is also important to monitor the implementation of key changes beyond the immediate aftermath of an incident. If this is not done then there is a danger that organisations will gradually forget previous lessons. In consequence, it is also important to consider how the monitoring of a particular recommendation might be incorporated into more routine activities.
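Taken together, the items in this list define the minimum supporting information that should accompany a documented recommendation. The sketch below renders them as a simple record; it is a hypothetical illustration, and the field names are ours rather than those of any of the guidelines cited in this chapter.

```python
# Hypothetical record of the supporting information that, the preceding
# paragraphs argue, should accompany every documented recommendation.

from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class RecommendationRecord:
    # What causes or hazards does the recommendation address?
    hazards_addressed: List[str]
    # What is the significance of those causes or hazards? This should record
    # the risk assessment and the subjective judgements behind it.
    significance: str
    # What are the intended consequences: what is to be achieved, not how.
    intended_outcome: str
    # Who will implement and monitor the recommendation?
    responsible_party: str
    monitoring_party: str
    # The time-frame: an upper bound, not a target, possibly with phased
    # milestones and provision for longer-term monitoring.
    deadline: date
    milestones: List[date] = field(default_factory=list)
```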
The US Army's Accident Investigation Handbook illustrates the way in which organisations can provide detailed guidance on the approved format for the presentation of recommendations [803]. This handbook separates its advice into three causal categories: human error; material failure or malfunction; and environmental factors. There are small differences in the information that is to be recorded for recommendations that address hazards in each of these different sections. For example, the handbook requires that investigators document a range of information describing human `errors'. This includes a single sentence about what happened. This is then followed by a brief description of
the context in which the incident occurred, for example "while conducting night convoy operations using blackout drive lights". Investigators must also identify the individual involved in the `error' by describing their duty position, such as the OH-58D pilot-in-command or the driver of the M998 High Mobility Multipurpose Wheeled Vehicle. As can be seen, such a requirement affords a degree of anonymity. Investigators must then identify the task error of omission or commission that motivates particular recommendations. These are classified according to Army standards. In particular, the accident investigation handbook recommends the error codes that are presented in PAM 385-40 [796]. These codes were used in recommendation matrices, such as Tables 12.8, 12.9, 12.10 and 12.11, that were presented earlier in this chapter. The example cited in the accident handbook is that the operator "exceeded the posted speed limit of 40 MPH by attempting to drive at 60 MPH in violation of Camp Swampy Reg 190-5 (Code 40)" [803]. This discussion of what happened then motivates an explanation of the consequences of the error. It may directly or indirectly result in damaged equipment or injury. For example, a road traffic accident may involve substantial damage to a vehicle and its driver. It can also involve injury to third parties, such as pedestrians and other drivers, as well as damage to other vehicles or objects in the vicinity of the incident. After having described the context in which an error occurred and having explained the consequences of that failure, investigators must document the reasons why it happened. In other words, they must record the findings of any causal analysis. As before, these causes must refer to the predefined lists that are provided in PAM 385-40 [796]. These are supported by a free-text description of the reasons why an error occurred: "the driver's actions were a result of a lack of self-discipline and improper supervision by the senior occupant... the driver had a history of speeding" [803]. The documented `causes' of an error help to motivate the subsequent section of the report that details the particular recommendations which are made in the aftermath of the incident. These are intended to answer the question, `What to do about it?'. Previous sections have already described how the US Army relies upon an enumerated list of recommendations that are published in PAM 385-40. The Accident Investigation Handbook, therefore, suggests that investigators consult this document before drafting their recommendations. It is important to note, however, that recommendations should not be addressed at the task error itself but at the system deficiencies that led to the error. This approach is advocated in the handbook and explicitly encouraged in PAM 385-40 by including relatively few recommendation codes that might support a `perfectability' approach. Recommendations must be addressed to unit level (company, troop, battalion), higher level (brigade, division, corps) or to Army level. The following format is recommended: "RECOMMENDATION (1, 2, 3, etc.): a. Unit Level Action: Commander, (unit): Brief all unit personnel on the facts and circumstances surrounding this accident. Emphasis should be placed on how human limitations combined with less than optimum systems and high task loading allow situations that contribute to undetected hover drifts. b. Higher Level Action: None. c. Army Level Action:
(1) Commander, U.S. Army Training and Doctrine Command:
(a) Validate requirements for automatic hover systems for all aircraft to assist in reducing task overloading. (b) Validate OH-58D crew coordination requirements, especially in Tasks 1067, 1114, 1140, 1147, and 1148 in TC 1-209, to ensure safe compliance with the requirement for both crew-members to simultaneously direct their attention inside the aircraft, especially in aircraft without automatic hover systems. (c) Validate requirements for night vision systems with greater fields of view and resolution. (d) Increase, within the flight-training program, emphasis on situational awareness and spatial disorientation.
(2) Program Executive Officer, Aviation, field upgrades to OH-58D aircraft which allow the use of the hover bob-up mode symbology in the LCD unit, even with weapons displayed in the LCD unit, and allow for adjusting the ODA intensity during low light ambient conditions. (3) Commander, U.S. Army Safety Center, disseminate/publish the facts and circumstances surrounding this accident as appropriate." [803]
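The handbook's format for human `error' findings can be summarised as a record with a fixed set of fields. The sketch below is a loose paraphrase of the fields described above; the names are ours, not the handbook's, and the codes would in practice be drawn from PAM 385-40.

```python
# Loose paraphrase of the fields that the US Army handbook requires for a
# human 'error' finding; field names are illustrative, not the handbook's.

from dataclasses import dataclass
from typing import List


@dataclass
class HumanErrorFinding:
    what_happened: str          # single sentence describing the error
    context: str                # e.g. "while conducting night convoy operations..."
    duty_position: str          # identifies the individual by role, preserving anonymity
    error_code: int             # task error of omission/commission, per PAM 385-40
    consequences: str           # direct and indirect damage or injury
    causes: List[int]           # causal codes, again drawn from PAM 385-40
    cause_narrative: str        # free-text explanation of why the error occurred
    recommendations: List[str]  # addressed at system deficiencies, not the error itself
```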
As mentioned, the US Army guidelines provide similar advice on how to document recommendations for other categories of failure, including equipment problems and environmental issues. In the case of material failures or malfunctions, investigators must explain what happened in a similar fashion to that described for human error. Such failures are defined to occur when a piece of equipment "did not operate as intended or designed, which contributed to or caused the incident". Investigators are encouraged to search for human errors or mistakes, such as a failure to follow Army standards/procedures, design criteria or manufacturing processes, that may have caused the material failure. As before, it is important to document the results of any causal analysis. This is again used to identify appropriate recommendations using PAM 385-40. Environmental recommendations follow a similar format. They are presented at the end of an analysis of the failure that describes what happened and why it happened in the manner that it did. The US Army guidelines also suggest that investigators can determine if an environmental factor should be assessed by asking `did this factor adversely influence human and/or equipment performance; was the environmental element unknown or unavoidable at the time of the accident/injury/occupational illness?'. The explanation of why an environmental factor affected safe and successful operation often draws upon a range of disciplines. Microbursts provide an example of such a factor. They have been cited as causal factors in several recent incidents involving military aircraft. These environmental events cannot be predicted with present meteorological equipment. They are also invisible to aircraft crew-members. Such incidents show how investigators are constrained in the range of recommendations that might counter the adverse effects of many environmental factors. For example, the US Army's investigation handbook includes the following example of an Army level recommendation to deal with microburst incidents: `Commander, U.S. Army Safety Center, disseminate/publish the facts and circumstances surrounding this accident as appropriate'. In contrast, more detailed proposals are directed at unit Commanders:
"(a) Coordinate through the Commander, U.S. Air Force, 1st Weather Group, Fort McPherson, Georgia, to establish a pro-active interface with several groups sponsoring research into the area of windshear. These groups include NASA, the National Technical Information Service (NTIS), the Federal Aviation Administration (FAA), the American Meteorological Society, the Langley Research Center, and the National Center for Atmospheric Research.
(b) Inform all aviation personnel assigned to Fort Rucker, Alabama, that severe weather in the form of microbursts can occur from isolated thunderstorms or rainshowers and cumulus clouds that give the impression of simple rainshower clouds." [803]
A final section of the guidelines focuses on the documentation of recommendations that address non-causal factors. The US Army handbook focuses narrowly on "findings that did not cause or contribute to the cause of the accident but contributed to the severity of injury or accident". An example would include a driver's failure to wear a seatbelt. This would not have caused a collision but would have significantly affected the injuries that the soldier sustained should a collision occur. This narrow definition of non-contributory factors might, however, be revised following the arguments that have been made in previous sections. For instance, non-causal factors should be extended to include hazards that have been detected during the previous analysis but that did not contribute to the particular incident under investigation. The Army handbook recommends that these non-contributory factors should each be recorded in a single paragraph; `they are recorded to inform the command of problems that, if not corrected, could adversely affect the safety of future operations'. Recommendations that address these potential hazards are documented after recommendations that deal with human `errors', material failures and environmental factors.
This section has argued that investigatory organisations must publish guidelines that support the documentation of particular recommendations. It is important to identify those hazards or causes that are addressed by particular findings. This helps to ensure that important lessons are not overlooked if potential hazards are not addressed by particular recommendations. Investigators must also document the perceived significance or importance of those hazards that are addressed by a recommendation. This information is necessary if others are to determine the best allocation of finite resources when implementing several, possibly conflicting, findings. Investigators must document the intended consequences of a recommendation. They must explain what it is intended to achieve rather than how it is intended to achieve it. This provides a degree of flexibility to engineers who must determine the best means of implementing a particular recommendation. The documentation of recommendations must determine who is responsible for ensuring that a finding is acted upon. They should also be provided with documents that describe a potential timescale for their actions. The importance of these documentation requirements varies from organisation to organisation. For instance, in local reporting systems the investigator may also be responsible for implementing any recommendations. In such contexts, much of this information may be superfluous except for auditing purposes. Many larger organisations, including the US Army, draft regulations and guidelines to ensure that most of this information is documented. These requirements are intended to ensure that the recipients of particular recommendations have sufficient information for them to validate any proposed changes in working practices.
12.3.2 Validation

Previous sections have focussed on techniques that investigators can use to draft recommendations that avoid or mitigate future failures. Such techniques only provide a partial solution to the problems of incident reporting. A number of additional issues must be addressed before particular recommendations can be introduced to support the operation of safety-critical systems. For instance, there is a danger that valuable resources will be allocated to ineffective remedies. Some recommendations have been motivated by organisational politics and managerial ambition rather than a concern to address the causes of previous failures. There is also a danger that by addressing one set of problems, recommendations will inadvertently introduce other potential problems into an application. It is, therefore, important that recommendations are validated before they are implemented. The way in which recommendations are validated can differ greatly between reporting systems. Many local systems rely upon informal meetings between the colleagues who are responsible for running the system. Large-scale systems often validate recommendations at several different levels within an organisation. Investigators may pass on the initial findings to their immediate superiors. They perform an initial check and then pass a revised version of the recommendations to their superiors, and so on. Some incident reporting systems also encourage dialogues between investigatory bodies, regulatory organisations and system management. These joint meetings help to ensure that each party understands the implications of a particular recommendation. Chapter 9 has described how these dialogues can, occasionally, introduce unacceptable delays into the implementation of important safety measures, such as Excess Flow Valves into gas service lines [588]. The US Army's Accident Investigation and Reporting Procedures Handbook contains detailed guidance on the different review procedures that are to be implemented at different levels within the command structure [806]. Reports about high-consequence incidents are validated at a local review, by installation-level safety managers, by an approving authority appointed to represent the Major Army Commands and by the US Army Safety Centre. The initial review is normally conducted by the commander of the unit or by the commander of the supervisor directly responsible for the operation involved in the incident. They must review the report and provide written feedback about whether or not they concur with the findings and the recommendations. They must ensure that any evidential data is circulated within the unit so that it can be used to inform future decision making. They are also responsible for ensuring that any immediate actions are implemented at a local level. The local reviewing officer then passes the report through the designated chain of command to the `approving authority', see below. There is a danger that incidents and accidents may form part of a wider pattern within a particular
installation. Similarly, there is a danger that particular recommendations that are intended to protect the operation of particular processes will have knock-on effects for the safety of other workers elsewhere in an installation. The installation-level safety manager's review is intended to identify any of these issues. The US Army reporting forms (DA 2397-R-series form, DA Form 2397-AB-R, DA Form 285, or DA Form 285-AB-R) contain special sections that are intended to help safety managers identify these potential problems. Safety managers must review the data in these sections, not so much to validate particular recommendations, but to ensure that as much as possible can be learned from an incident. If primary and secondary investigators have missed previous incidents or patterns of systemic failure then this stage of validation is intended to identify them. The `approving authority' provides a further level of review within the US Army procedures. Major Army Commands appoint these representatives to accept or reject each finding and recommendation made by an investigation board. This takes place after the reports have been amended by local reviewing officials, using the procedures described above. In addition, the Safety Office of the Major Army Command ensures that the report is complete with respect to the Army guidelines [806]. Major Army Command-level recommendations will be tracked using a computerised tracking system. At this stage, the approving authority will also be concerned to identify any additional recommendations that might be made to `higher headquarters'. Finally, the US Army Safety Centre reviews all reports to ensure that they conform to regulatory and technical requirements. They are also responsible for maintaining the automated tracking system that the Major Army Commands use to track the implementation of particular recommendations. The Safety Centre is also responsible for disseminating information about the implementation of accepted recommendations to the relevant elements within the Army command structure. It is important to emphasise that such elaborate validation procedures can create a number of potential problems. In particular, the responsibility for validating and implementing particular recommendations can become lost between the various exchanges that take place at different levels within the command structure. The opportunity for administrative delays is, therefore, acknowledged by guidelines that are intended to keep investigators and contributors notified about the course of the validation and implementation process: "Acknowledgements: upon receipt of written notification of recommendations, the responsible Department of the Army-level organisation will provide an initial response to the US Army Safety Centre within 60 calendar days as to corrective action(s) initiated or planned. Interim and follow-up reports are required every 90 days after initial response until the action(s) is closed. Return non-concurrence or rebuttals: all Department of the Army-level recommendations not accepted or implemented by the responsible command, organisation, agency, or activity will be returned to the Commander, US Army Safety Centre, with supporting rationale within 60 calendar days after initial notification." [806] Local reporting systems provide a strong contrast to the elaborate procedures and mechanisms that are exploited by large organisations such as the US Army. Peer review is often the only form of validation that is used to assess potential recommendations.
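The deadlines in this quotation lend themselves to exactly the kind of automated tracking that the Safety Centre maintains. The sketch below is a minimal illustration built around the 60- and 90-day periods quoted above; the class and its methods are our own invention, not a description of the Army's actual system.

```python
# Minimal sketch of deadline tracking for a single recommendation, using
# the 60/90 calendar-day periods quoted from the US Army procedures above.
# The structure is illustrative only.

from datetime import date, timedelta
from typing import Optional


class TrackedRecommendation:
    def __init__(self, notified_on: date):
        self.notified_on = notified_on
        self.initial_response_on: Optional[date] = None
        self.closed = False

    def initial_response_due(self) -> date:
        # An initial response on corrective actions is due within
        # 60 calendar days of written notification.
        return self.notified_on + timedelta(days=60)

    def record_initial_response(self, on: date) -> None:
        self.initial_response_on = on

    def next_follow_up_due(self, today: date) -> Optional[date]:
        # Interim and follow-up reports are required every 90 days after
        # the initial response, until the action is closed.
        if self.closed or self.initial_response_on is None:
            return None
        due = self.initial_response_on + timedelta(days=90)
        while due < today:
            due += timedelta(days=90)
        return due
```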
These are often ad hoc, undocumented and informal. For example, many hospital-based systems hold monthly meetings between clinical and nursing staff. These discussions are, typically, unminuted. They focus on ensuring the rapid implementation of changes, provided that there is general agreement about the utility of a particular proposal. There are, however, increasing pressures for such local initiatives to follow more documented processes [633, 453]. The importance of clinical audit within the medical domain and the wider public concern over high-profile accidents has led to a requirement that individuals and organisations explain why particular recommendations are not implemented. In consequence, the following paragraphs concentrate on the more formal mechanisms that have been exploited by large-scale systems. These may, of course, have to be scaled down to meet the more constrained budgets and scope of local systems. Both ad hoc and more formal validation procedures must determine whether or not to accept particular recommendations. If a proposal is accepted then the review panel implicitly accepts a degree of responsibility for the proposed intervention. It is, therefore, important that they agree
both with the form and the purpose of a recommendation. In consequence, many review bodies have introduced further distinctions beyond a simple accept or reject decision based on the recommendation that they have been asked to review. For example, the following quotation is part of a letter from the Commander in Chief of the US Army's Central Command. This letter reviews the recommendations that were made in the aftermath of an incident on a firing range in Kuwait. Rather than simply accepting the recommendations outright, the review approves of the intention behind the proposal but modifies it and also clarifies that the modification should not bias the implementation of the recommendation: "d. Recommendation 1403 provides, That appropriate administrative action be taken against the Ground Forward Air Controller. The recommendation is modified, as follows; That administrative or disciplinary action, as appropriate, be considered with regard to the Ground Forward Air Controller. The recommendation, as modified, is approved. My modification does not in any way reflect my view as to what action may or may not be appropriate. It is intended to assure the appropriate Service official of his or her complete discretion in the matter." [824] These distinctions can be summarised as follows:

Accept. Given the investment in time and money that is often made to support incident investigation, it might be expected that most review boards will concur with the findings of an inquiry. Unfortunately, this can be surprisingly rare. As we have seen, some guidelines explicitly argue that investigators should not consider the costs associated with the implementation of their recommendations. These considerations often prevent regulatory organisations from sanctioning the implementation of particular interventions. A host of other issues can prevent review boards from accepting the findings of incident investigators. For example, the members of these boards typically do not take part in an initial investigation. It can, therefore, be hard for them to follow the detailed causal arguments that motivate particular recommendations. Review boards, therefore, often request further clarification or additional forms of evidence before they will accept many proposed interventions.
Accept with provisos.
Most review boards do not immediately accept all of the recommendations that are proposed by investigators. Instead, they may request additional evidence to support a causal analysis. Alternatively, review boards may propose alternative causal explanations that, if proven, would support other forms of intervention. Even if a recommendation is accepted, review panels may advise that its implementation is delayed or staged. Such amendments can be motivated by the financial constraints mentioned above. They can also reflect the pragmatic problems of ensuring conformance to any proposed changes in working practices and equipment. These provisos are typical of reporting systems in which investigators are independent from any regulatory function. They also characterise more local systems in which investigators must secure the support of higher levels of management before any commitment can be made towards increased investment. In such circumstances, review boards can accept recommendations `subject to approval' from upper management.
Reject. Review boards, typically, exploit one of several standard `forms' of argument when attacking investigators' recommendations. The first line of attack rejects the arguments that investigators make during the causal analysis of an incident. For example, review boards can use variants of the counterfactual arguments proposed during a causal analysis by suggesting that an accident would still have occurred even if particular recommendations were implemented. Alternatively, it might be argued that proposed interventions only address the specific causes of an incident but fail to address more general failures. A second line of attack can be based around the risk assessment techniques that were introduced in previous paragraphs. It can be argued that the expected frequency or consequences of any future incident would be too low to justify the expenditure that is required to implement the investigators' recommendations.
Reject with provisos. Review boards must exercise a considerable degree of caution when rejecting the recommendations in an incident report. They run the risk of alienating the investigators who constructed such documents. A board that rejects a recommendation may be praised for its careful use of resources if an incident does not recur within a given time period; but such rejections also create a form of implicit responsibility should an incident recur. If an incident does recur then it can be argued that the failure might have been avoided if the board had only approved the proposed intervention. It is, therefore, particularly important that review bodies document their reasons for rejecting a recommendation. In practice, this often leads to partial rejections or a refusal to implement a particular finding until some other condition is satisfied. This condition may involve eliciting additional evidence. It might also involve a commitment to perform additional studies should further incidents be reported.
Referral. Given the potential consequences of rejecting a recommendation and the possible costs associated with implementing some proposed interventions, it is hardly surprising that many review boards defer to another authority rather than reach a premature decision. Often validation exercises result in panels deciding that they are not competent to reach particular decisions. Alternatively, they may accept the high-level arguments associated with a particular recommendation but refer to another body who must then develop a more detailed implementation plan. This is an interesting strategy because that body then assumes partial responsibility should the costs exceed expectations or the implemented remedy fail to prevent future incidents.
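These five outcomes form a small taxonomy that a reporting system could record explicitly alongside each reviewed recommendation. The following sketch is a hypothetical encoding of the categories just described; the check that every non-acceptance carries a written justification reflects the argument made above about implicit responsibility.

```python
# Hypothetical encoding of the review outcomes described above. Recording
# the outcome together with its justification supports later audit of why
# a recommendation was, or was not, implemented.

from enum import Enum, auto


class ReviewOutcome(Enum):
    ACCEPT = auto()
    ACCEPT_WITH_PROVISOS = auto()
    REJECT = auto()
    REJECT_WITH_PROVISOS = auto()
    REFERRAL = auto()


def record_review(outcome: ReviewOutcome, justification: str) -> dict:
    # Any outcome other than an outright acceptance should be supported by
    # a documented justification; an unexplained rejection creates implicit
    # responsibility should a similar incident recur.
    if outcome is not ReviewOutcome.ACCEPT and not justification:
        raise ValueError("a non-acceptance must be justified in writing")
    return {"outcome": outcome.name, "justification": justification}
```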
The list above illustrates the range of outcomes that validating bodies might consider when assessing a recommendation. It is remarkably rare for a review panel to accept every recommendation without some caveat or proviso. Most validation exercises accept some proposals, reject a few recommendations and request that the remaining proposals be amended in some form. It is important to note, however, that a number of comments can be made about these general remarks. For example, many incident and accident reporting systems exploit a hierarchical validation process where review committees at a lower level in an organisation review the investigators' proposals before they are validated at a higher level. At each stage in this validation process it becomes less and less likely that higher authorities will reject a recommendation that has been accepted at a lower level. A cynical interpretation of this process might be that political and organisational pressures can help to mould recommendations into an acceptable format before they are presented to the highest levels within an organisation. A more favourable view is that upper management are less likely to question the detailed operational decisions of their subordinates. A recent incident involving an Australian Army cadet helps to illustrate how different individuals and groups play different roles in the validation of particular recommendations. This incident occurred when a regional cadet unit were completing an exercise in which they had to swim to retrieve an object from a boat that was some twenty meters from the shoreline of a dam. The cadets were wearing their army fatigues and boots. Several of them became entangled in weed beneath the surface of the water. One cadet became exhausted and went under the water approximately seven meters from the shoreline. Efforts to rescue him were unsuccessful. Arguably the highest level of validation for the Board of Inquiry's findings came from the Hon. Bruce Scott MP, Australian Minister for Veterans' Affairs, and from Dr Brendan Nelson, Parliamentary Secretary to the Minister for Defence. They concluded that the Board "conducted a thorough and open investigation into the circumstances" surrounding the incident [731]. They agreed with the Board's finding that the "swimming activity was not authorised by (the) Army and that there was inadequate supervision or monitoring of the Army Cadet Corps activity". In consequence, all swimming activities conducted in areas other than supervised swimming pools were immediately suspended, and will remain suspended until a new policy on swimming activities is issued. They also initiated a review of the Australian Services' Cadet Scheme policy on safety, risk analysis and activity clearance by the Defence Safety Management Agency. Such actions illustrate the way in which a final stage of validation is usually performed by
organisations that exercise budgetary or political control over the implementation of particular recommendations. Their approval is required in order to sanction the investment that may be required to support large-scale change. They must also provide the political support that is often necessary to implement what are often unpopular `systemic' changes to established working practices [701]. It is important to note, however, that such press statements and ministerial announcements represent the final stage in a range of more detailed validation activities. For example, the Australian Army's Board of Inquiry into the previous incident initially presented its findings to the Chief of Staff, Headquarters Training Command. He then issued a detailed appraisal of their findings. These illustrate the different forms of response that were sketched in previous paragraphs. For example, some of the Board's findings were accepted without comment: "I accept the Board of Inquiry finding that Cadet Sperling drowned as a result of a combination of factors namely, the amount of weed in the water, the depth of water, the wearing of GP boots (with socks) and Disruptive Pattern Camouflage Uniform (DPCU) clothing whilst in the water and the absence of safety devices (such as flotation vests) and inadequate safety precautions for the swimming activity. These factors contributed to Cadet Sperling's drowning. The wearing of GP boots and DPCUs whilst swimming or treading water is a difficult activity for persons of average physical fitness. A swimming activity undertaken by cadets as young as 13 years with unknown fitness levels and unknown medical conditions in the circumstances existing on 18 Nov 00 at the Bjelke Peterson Dam, was inherently dangerous." [33] This acceptance illustrates the way in which validating bodies do not simply consider the recommendations that are issued by investigators. Review boards, typically, begin by assessing the evidence, the course of events and the causal analysis that are presented in the opening sections of most reports. For example, the Chief of Staff disagreed with the Board's analysis of one of the causal factors that was cited as a contributory factor in the incident: "I do not accept the finding of the Board of Inquiry that Corporal (Army Cadet Corps) was not fully qualified as an instructor of cadets in accordance with the Army Cadet Corps Policy Manual. Corporal (Army Cadet Corps) had completed the Instructor of Cadets Course and First Aid Course in compliance with the Army Cadet Corps Policy Manual and was qualified as an Instructor of Cadets." [33] Such validation actions illustrate the importance of explicitly documenting the causal findings that support particular recommendations. Without such analysis, it can be difficult to determine which recommendations might be affected by the review board's rebuttal of the investigators' analysis. It is for this reason that Tables 12.6, 12.8, 12.9, 12.10 and 12.11 were introduced to provide a bridge between the products of a causal analysis and the interventions that are intended to safeguard future operation. Such documentation can help investigators to determine whether or not a recommendation must be abandoned after such a rebuttal. If, for example, a recommendation is supported by several lines of causal analysis then it may still be retained even though one line of argument has been challenged.
If reviewers accept that incident investigators have identified a cause of the incident then they may continue their validation by asking whether or not that cause is `adequately' addressed by the proposed recommendation. At first sight, this might seem to be a relatively trivial task that should be based around an engineering assessment of whether or not an incident is likely to recur if a recommendation is implemented. As we have seen, however, such subjunctive reasoning is fraught with problems. Many of these relate to the psychological processes involved in reasoning about alternative possible futures without the support of some underlying model of formal reasoning [401]. Other problems stem from the way in which some recommendations are not intended to entirely avoid future incidents but to control or mitigate their consequences. The effectiveness of these measures often depends upon the nature of any future incident and this, in turn, may depend upon other defences functioning in the manner intended. As we have seen, however, many incidents stem from the failure of these `defences in depth' [701]. Further problems arise when recommendations have
social or political consequences that can prevent review bodies from adopting them. For example, the Chief of Staff, Headquarters Training Command could not accept one of the recommendations that would have had considerable implications for the size of the Australian Army's Cadet force: "I do not accept the Board of Inquiry recommendation (Reference A para 268(d)) that cadets suffering from asthma should be required to comply with Army recruiting standards" [33]. Such a recommendation would reduce the likelihood of future incidents. It would also sacrifice some of the wider objectives that motivate the Army and the Department of Defence to run the Cadet Force. Previous sections have explained why it can be relatively rare for validating bodies to accept the recommendations of incident investigators without raising caveats and objections. There are, however, examples of proposed interventions that are accepted in this way. It is important that the review board explicitly documents the extent of their agreement so that there can be no subsequent disagreement about what was intended by their approval for particular measures. For example, the following review paraphrases the Board of Inquiry's recommendation and uses their paragraph reference scheme, Reference A para 268(f), to make sure that the reader can trace their agreement back to the original proposal: "I accept the Board of Inquiry recommendation (Reference A para 268(f)) that the Application for Activity Approval be forwarded through the cadet unit's foster unit with the provision for comment and then on to the respective Regional Training Center for consideration. On approval or rejection, a copy of the Activity Approval Form should be returned via the foster unit who should then confirm the availability of requested equipment and other support. The revised arrangements are to be incorporated into the Army Cadet Corps Policy Manual. Action: COMD Army Cadet Corps by 14 Mar 01." [33] Previous paragraphs have described how review bodies can respond in several different ways to the recommendations that are proposed in incident reports. They may accept them, reject them or request modifications. They may also defer comment and request additional evidence or support from others at different levels within an organisation. The following list uses the previous analysis to derive a list of requirements that might guide the validation of recommendations in incident reporting systems:

1. Clearly identify each stage of the review process. There are increasing pressures, especially within certain sectors of the healthcare and transportation industries, to ensure that recommendations are not dismissed without due consideration. One consequence of this is that any proposals must be subjected to a clear and coherent review process if they are not to be implemented. From this it follows that each party in an investigation must understand the nature and extent of each validation. In particular, it is important that time limits be associated with each stage of a review so that investigators, regulators and contributors can track the progress of a report towards implementation.

2. Establish that the report is complete. Given that many incident reporting systems cover diverse geographical and functional areas, it is likely that some reports may omit important details about an incident. If such reports are dismissed late in the review process then there is a danger that important insights will be ignored.
It is, therefore, important that an initial validation establishes that a report is complete, so that any consequent recommendations will not be immediately dismissed. For instance, checks may be conducted to ensure that all relevant evidence is available and is cited correctly. Other forms of integrity check can also be carried out. For instance, if the US Air Force guidelines are followed then each recommendation must clearly identify an initial implementation route.

3. Validate the evidence. Review boards must ensure that evidence is cited in a consistent manner and that all of the necessary data about an incident has been presented in an incident report. There is an increasing recognition that complex incidents often stem from interactions between systems
failures, human `error', managerial problems and so on. Less `severe' incidents often cannot command the resources that are required to fund multi-disciplinary investigations. It can, therefore, be difficult for investigators to identify all of the information that might be relevant to an incident. This is especially true when individuals are unaware of similar incidents in other units or regions. In consequence, review boards must satisfy themselves not only that the relevant information has been collected but that it is also presented in a fair and impartial manner within the body of the incident report. Chapter 14 describes some of the pragmatic problems that can arise when attempting to satisfy such an abstract requirement.

4. Validate the causal analysis. Chapters 10 and 11 have described a range of techniques that support the causal analysis of adverse incidents. These approaches provide procedures to guide the analysis of adverse occurrences. They also depend upon a range of subjective decisions that must be validated. Even within the formal systems of reasoning, investigators must identify those elements of an incident that are to be represented within the abstractions of a formal logic. It is also important to emphasise that none of these techniques is `error proof'. The correctness of any causal reasoning must, therefore, also be verified. Any omissions or errors at this stage in the analysis can result in recommendations that fail to address the causes of an incident.

5. Validate each recommendation. This chapter has reviewed a range of heuristics that can be used to validate particular recommendations. For example, investigators may lack the time and the experience necessary to identify the best means of implementing particular recommendations. It is, therefore, important that any proposals should focus on what is to be achieved rather than the particular mechanisms that will be used. Similarly, we have argued that clear timescales must be associated with each recommendation so that their implementation is not indefinitely delayed. Proposed interventions should focus on specific actions rather than on additional studies that may or may not identify potential safeguards. It is important that review bodies consider these various heuristics when validating particular recommendations. Clearly, there may be instances in which some of these guidelines cannot be satisfied; for instance, if it would be dangerous to impose additional requirements without further investigations. Validation authorities must, however, satisfy themselves that there are indeed good reasons for violating these recommendation heuristics. This analysis must also consider any priorities that are associated with any proposed interventions. The risk analysis techniques, described in previous sections, often depend upon subjective assessments both of frequency and consequence that can have a profound impact upon any subsequent resource allocation.

6. Document the reasons for any rebuttal. There can be profound implications if a review body decides not to accept a particular recommendation. If a similar incident occurs in the future then they may be blamed for opposing a necessary safety improvement. It is, therefore, essential that some auditable justification should be recorded to support such decisions. This argument applies to the rebuttal of particular recommendations. It is also important to document any challenge to the evidence and any causal analysis in an incident report.
For example, if a line of analysis is questioned then it is important to check whether any associated recommendations are supported by alternative causal arguments. If a recommendation is dismissed without such an additional check then there is a danger that a potential cause of future incidents will not be addressed by any proposed safeguards.

7. Validate implementation plans. The next section will identify some of the problems that can frustrate the implementation of recommendations once they have been approved by validating bodies. It is important, therefore, that review organisations should consider these potential barriers when assessing particular recommendations. If a recommendation requires resources that cannot be made available at a local level then the validating authorities must provide some means of ensuring that additional resources are provided. If such resources cannot be found then they must either recommend
that a proposal be redrafted or, in extreme cases, that production should be halted until some remedy is identified. This validation activity does not simply focus on the staff and equipment that may be necessary to perform any changes to an application. It also focuses on the key personnel who must supervise those changes. In particular, the implementation of particular recommendations should not impose additional burdens that may result in other forms of failure being introduced into a system.

8. Initiate recommendation tracking. The problems that frustrate the implementation of potential recommendations have motivated many organisations to create automated tracking systems. These enable safety managers to request and review reports from individual units as they are scheduled to adopt any changes in their working practices. These tracking systems are often integrated into the final stages of validation. Once a recommendation has been approved for implementation then an entry is created in the tracking system. This is tailored to reflect the timetable and monitoring responsibilities that have been proposed by investigators and approved by successive reviews.

The validation of particular recommendations provides no guarantee that they will ever be implemented. The complexity of many safety-critical applications can provide numerous barriers to the introduction of process improvements. It can be difficult to ensure that key personnel understand what they must do in order to avoid future incidents. Similarly, it can take months and even years before obsolete components are removed from a system. Even within the best resourced systems, engineers are often found to retain stocks of spare parts that have been condemned in previous incident reports [806]. The following section, therefore, briefly considers some of the challenges that must be addressed when investigators and safety managers must implement particular recommendations.
12.3.3 Implementation

The implementation of recommendations involves the development and monitoring of a corrective action plan [571]. These plans are prepared by individuals who are, typically, appointed by the most senior validation board. These `implementation officers' may or may not have been involved in the initial incident investigation. Their action plan must explain how they propose to address all of the recommendations that have been accepted following amendment and clarification. Each item in the action plan must address the following questions:
What causes are addressed?
In order for managers and operators to understand the importance of a corrective action, information should be included about those causes of previous incidents that are to be addressed by a particular intervention. NASA explicitly recommend that portions of a recommendation matrix should be included with an action plan [571]. This may, however, prove to be too cumbersome a requirement for smaller-scale systems.
What is to be done?
The recommendations that are validated by review boards should describe what is to be achieved without describing how any particular requirement will be satisfied. Hence, this information can be directly derived from the final version of a recommendation that is approved by any review board.
How is it to be done?
It is important that managers and operators can plan how to satisfy a particular recommendation. As mentioned above, this detailed information need not form part of the documented proposal that is validated by review boards. It must, however, be documented in an action plan that can be approved prior to implementation.
Who is responsible?
The proposed action plan must clearly identify who is to implement any intervention. This can involve a detailed consideration of which branch of an organisation or subcontractor is responsible for ensuring that a corrective action is completed.
What are the wider consequences of any corrective action?
The corrective action plan must consider any wider implications that result from the implementation of a particular recommendation. Previous sections have mentioned how some interventions can increase the risk of other forms of incident. Such trade-offs may have to be accepted if the benefits of preventing other forms of failure are perceived to outweigh this collateral risk. Corrective action plans must also review any wider process changes that may be necessary following the implementation of a recommendation.
How will the corrective actions be tracked?
It is important to ensure that corrective actions are implemented correctly if they are to have the intended impact upon overall system safety. An action plan must, therefore, consider how any interventions will be tracked. This analysis should ideally provide for interim status reports and for documentation to confirm the completion and closure of corrective actions.
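Read together, these questions define the minimum content of each item in a corrective action plan. The sketch below renders them as a simple record; the field names are hypothetical and are not drawn from the NASA or Canadian Forces documents cited in this section.

```python
# Hypothetical structure for a single item in a corrective action plan,
# following the questions listed above. Field names are illustrative.

from dataclasses import dataclass, field
from typing import List


@dataclass
class CorrectiveAction:
    causes_addressed: List[str]  # which causes of previous incidents are targeted
    objective: str               # what is to be done, from the validated recommendation
    method: str                  # how it is to be done, decided at implementation rather than review
    responsible: str             # branch of the organisation or subcontractor completing the action
    wider_consequences: str      # trade-offs and knock-on process changes that were considered
    tracking: List[str] = field(default_factory=list)  # interim status reports and closure records
```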
The Canadian Forces provide an example of such action plans being used to direct the implementation of particular recommendations, known as needs assessments [149]. They encourage the development of specific implementation programmes that are intended to meet these needs assessments. In addition to the high-level requirements mentioned in the previous list, there is also a concern to ensure that any action plan considers an appropriate range of potential implementation mechanisms. Implementation officers `border on the negligent' if they only propose solutions that involve additional training. Improved tools, procedures and job-aids provide alternative solutions to inadequate knowledge or skills. It is ironic, however, that if these planned changes are not implemented then future incidents are likely to be reported as training failures. As with the approval process that is used to validate individual recommendations, implementation officers must identify a timetable both for the drafting and the approval of an action plan. For example, it might be specified that these actions should be completed within 30 working days of a validation panel accepting a particular recommendation unless a written justification is provided for extending the deadline. As mentioned, implementation plans are often not developed by investigators. It is important, however, that any action plan should be passed to them so that they can provide high-level feedback about whether or not the proposed intervention will fulfill their particular recommendations. Copies of an action plan may also be passed by the validating panel to safety managers and to regulators for further review. Their comments must be considered by the validation panel within the timescales described above. If an implementation plan is rejected by the validation panel then it is returned to the responsible organisation for revision and resubmission. As before, a timescale for resubmission must be developed to ensure that potential safety improvements are introduced as soon as possible. It is important to emphasise that this process of working out how to implement a particular recommendation can help to uncover further recommendations that might not have been considered during an initial investigation. For example, `cook-off' incidents occur when the heat that is generated by a gun causes premature firing of ammunition. A series of incidents persuaded the US Army to focus on the M60 machine gun. During a more detailed analysis of potential solutions to this problem, it was realised that `cook-off' incidents also affect a range of other weapons that had not been considered during the initial analysis [820]. If a plan is accepted then the implementation officer must initiate the proposed corrective actions, for instance by putting out any proposed work to tender or by disseminating relevant safety information. In larger organisations, these actions will, typically, be performed in close collaboration with safety management. In smaller organisations, an action plan may simply be approved by higher management and then be initiated by the staff running the reporting system. In either case, audit actions are often introduced so that review bodies can determine whether corrective actions have been implemented and whether they can be shown to produce the desired effects. Previous sections have mentioned the difficulties of measuring safety improvements when adverse incidents are likely to be rare events. A range of further problems complicate these audit activities.
For example, the individuals and groups who are responsible for executing an action plan may discover that certain actions are unnecessary or unwise. In such circumstances, the implementation officer must seek approval to alter the implementation plan. Such changes must be well-documented and validated
by a review board and by safety management before they can be accepted. It is important to determine who is responsible for monitoring compliance with particular safety recommendations. In smaller-scale systems, this is likely to be the same person who is responsible for ensuring the implementation of any corrective actions. In larger-scale systems, this monitoring function is more likely to be performed by an independent safety manager who must report any concerns about non-compliance to the validating panel. This feedback is necessary for several different reasons. For example, it can be difficult for validating bodies to identify whether or not a proposed intervention will be effective unless they are informed about the success or failure of previous initiatives. Similarly, review bodies may be able to act if they identify patterns of non-compliance within particular geographical areas or functional units. Chapter 15 will discuss the problems of interpreting and acting on such feedback in greater detail.

The implementation officer uses the responses from any monitoring, together with any independent analysis from safety managers, to determine whether or not it is possible to close a corrective action. Some organisations require approval from the validation or review body [571]. This approval can be obtained once the implementation officer submits a final incident review. This review includes the investigators' incident report, the corrective action implementation plan and a list of any additional lessons that have been learned from an adverse occurrence. The review should also document any significant departures from the approved implementation plan as well as any non-compliance concerns that had to be addressed. Final review documents should be archived for future reference. This is increasingly done using electronic databases and information retrieval systems. Such tools enable investigators and safety managers to automate the search tasks that can be used to identify previous recommendations for similar incidents. Chapter 14 considers a range of potential technologies that can be used to support these tasks.

The US Air Force provide a specific example of the generic final review document mentioned in the previous paragraph. Their Air Force Instruction (AF 91-204) sets mandatory standards for incident and accident reporting [794]. This refers to a memorandum of final evaluation. The Headquarters Safety Centre must draft one of these documents for each high-criticality incident that is reported to them. This is an important caveat; clearly, the extensive implementation procedures mentioned in previous paragraphs might place too high a burden on organisations responding to low-criticality events. If individual operators and investigators felt that the procedural burdens outweighed the potential benefits from a particular recommendation then there might be a tendency to suppress or limit the number of proposed interventions. The Air Force, therefore, is careful to specify when these procedures must be followed. For instance, a Memorandum of Final Evaluation must be prepared for Class A and B incident reports even when the requirement to produce a formal report has been waived. Class A mishaps include `failures' that incur costs of $1 million or more. This classification also covers fatalities, permanent injuries or the loss of an aircraft. Class B mishaps include `failures' costing between $200,000 and $1 million. These events may result in permanent partial disability or hospitalisation.
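The cost thresholds described above lend themselves to a simple classification rule. The Python sketch below is only a rough paraphrase of the Class A and Class B criteria summarised in this section; the parameter names are assumptions, and the regulations themselves draw many further distinctions that are omitted here.

    from typing import Optional

    def classify_mishap(cost_usd: float,
                        fatality: bool = False,
                        aircraft_destroyed: bool = False,
                        permanent_total_disability: bool = False,
                        permanent_partial_disability: bool = False,
                        hospitalisation: bool = False) -> Optional[str]:
        # Class A: costs of $1 million or more, a fatality, a permanent
        # injury, or the loss of an aircraft.
        if (cost_usd >= 1_000_000 or fatality or aircraft_destroyed
                or permanent_total_disability):
            return "Class A"
        # Class B: costs between $200,000 and $1 million, permanent
        # partial disability or hospitalisation.
        if (cost_usd >= 200_000 or permanent_partial_disability
                or hospitalisation):
            return "Class B"
        return None  # below the reporting thresholds sketched here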
The Memorandum of Final Evaluation collates input from various sources, including the Major Commands that convene an investigation, the commander of the mishap wing, statements from individuals and groups who are cited in an incident final report and so on. It is intended to provide an overall assessment both of the incident report and of any subsequent responses to the investigators' findings. The US Air Force procedures also state that the Headquarters Chief of Safety must publish these memoranda using an electronic database (AUTODIN) and the Defence Messaging System. At this point, the memoranda become the "official Air Force position on findings, causes and recommendations" that relate to the incident [794]. The Headquarters Chief of Safety, therefore, explicitly validates the recommendations that are embodied within the memorandum through this act of publication via these information systems. Any associated actions become active and must be executed by the named agencies that are associated with each recommendation. Suspense dates are also associated with these actions. Action agencies must report on completed actions, or on progress toward completed actions, by that date. All agencies and organisations within the Air Force are required to review each Memorandum of Final Evaluation to determine whether any of the deficiencies leading to the mishap apply to their commands. This involves a filtering process in which each memorandum is forwarded by a receiving officer
to the technical units that might be affected by any particular recommendation that is contained within it. The directors of these units review the memoranda to determine whether or not they are applicable to their systems and working practices. If they are, then changes are initiated at this level.

The incident reporting process does not finish with the local implementation of any recommendations in a Memorandum of Final Evaluation. Mishap Review Panels must be established within individual commands to ensure that recommendations continue to be addressed. The regulations require that these panels meet at least once every six months. These meetings are intended to ensure that preventive actions are implemented and that all parties review the status of open recommendations. Recommendations must remain open until Headquarters Safety Officers agree either that all recommended changes to publications have been made and the updated versions issued, that the recommended modifications have been completed on all applicable systems, or that all recommended studies and evaluations have been completed and that actions on all validated requirements have been closed. It is possible for recommendations to be closed if they are considered to be impracticable within existing operational constraints or cost parameters. Similarly, a recommendation can also be closed if an item is removed from service. Such actions must again be validated at a central level so that the outcome is recorded in the electronic information systems mentioned in previous paragraphs.

As mentioned above, these various reporting procedures apply to major incidents and accidents. A less formal approach is permitted for less serious mishaps. For example, an incident description can be drafted instead of the more formal incident report. It is important to note, however, that these descriptions must still be validated at the Major Commands level; "While (these) mishaps are not catastrophic, they are serious enough to require reporting on an individual basis and recommendations resulting from them require effective management". These less critical mishaps are not tracked by the Memorandum of Final Evaluation process described above. The Air Force, therefore, introduces additional requirements to ensure that lessons are learned from the analysis of these incidents. The final description, mentioned above, must outline all of the local actions that were taken after an incident. As we have seen throughout this chapter, these remedial actions must be explicitly related to the causal findings that they are intended to address. These documents must also report any actions that are planned but not yet completed. Estimated completion dates must also be provided. These reports are also intended to provide local units with an opportunity for eliciting central support should it prove necessary in order to implement a particular recommendation.
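The closure conditions set out above amount to a small decision rule: a recommendation may close through one of the normal routes, or exceptionally, and in every case only after central validation. The Python sketch below paraphrases that logic; the flag names are assumptions for illustration, not fields of the Air Force's actual information systems.

    def may_close(rec: dict) -> bool:
        # Normal closure: publications updated and issued, or modifications
        # completed on all applicable systems, or all studies completed and
        # actions on validated requirements closed.
        normal = (rec.get("publications_updated_and_issued", False)
                  or rec.get("modifications_complete_on_all_systems", False)
                  or (rec.get("studies_and_evaluations_complete", False)
                      and rec.get("validated_requirement_actions_closed", False)))
        # Exceptional closure: judged impracticable within operational or
        # cost constraints, or the item has been removed from service.
        exceptional = (rec.get("judged_impracticable", False)
                       or rec.get("item_removed_from_service", False))
        # Either route still requires validation at a central level.
        return (normal or exceptional) and rec.get("centrally_validated", False)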
12.3.4 Tracking

This book focuses on two different levels of tracking or monitoring within incident reporting systems. The first of these activities ensures that operators and managers conform to the individual recommendations that are made in the aftermath of incidents and accidents. We refer to this as recommendation `tracking'. The second of these activities ensures that incident reporting systems as a whole are having their intended effect on the safety of an application process. We refer to this as the `monitoring' of a reporting system. This section provides a brief overview of recommendation tracking. Chapter 15 provides a more detailed analysis of system monitoring.

Previous pages have described how implementation action plans must be developed if high-level recommendations are to protect the future safety of complex application processes. We have also described how electronic databases and messaging systems have been used both by the US Army and Air Force to track outstanding action plans until they are closed. Such systems provide a particular example of more general techniques that have been developed to help implementation officers track the progress towards achieving particular recommendations. These approaches must address a number of problems that can limit the effectiveness of any implementation plan. For instance, intended recipients may not receive a plan. Tracking systems must determine whether or not all appropriate personnel have access to the information that is necessary in order for them to implement a particular plan. This might seem to be a trivial requirement given the sophisticated communications infrastructure that supports many complex organisations. As we shall see, many incidents recur because these communications systems are not completely reliable. For instance, paper-based instructions are frequently lost or destroyed. This creates particular problems when
information must be passed between different shifts or teams of co-workers. Electronic information systems often suffer from usability problems that can prevent staff from accessing the information that is necessary for them to revise previous working practices. Technical problems and server loading can also prevent users from accessing necessary information. Further problems stem from the difficulty of keeping up with the number of implementation plans that affect the many different items of equipment that particular members of staff may be responsible for. For instance, the US Army issued at least eight revision requests for the M9 Armoured Combat Earthmover manuals in a single month in 2000: TM5-2350-262-10, TM5-2350-262-10HR, LO5-2350-262-12, TM5-2350-262-20-1 & 2, TM5-2350-262-20-3, TM5-2350-262-34, TM5-2350-262-24P, TM5-2815-240-34&P [810]. These were published in paper form and disseminated via the Army Electronic Product Support Bulletin Board (http://aeps.ria.army.mil/). In addition to these sources, Armoured Combat Earthmover operators also had to monitor at least two separate web sites (http://ncc.navfac.navy.mil and http://www.tacom.army.mil/dsa/) that contained further information about modifications and revised operating procedures for their vehicles.

The difficulty of following all of the implementation plans and revised regulations that affect particular tasks can also be illustrated by the US Army's explosives safety policy. Between September and December 1999, the Office of the Director of Army Safety and the Office of the Deputy Chief of Staff, Logistics issued revised guidance on loading Bradley Fighting Vehicles, on the Storage of Operational, Training and Ceremonial Ammunition in Arms Rooms and on Explosives Safety Site Plans for Ranges. Each of these involved major changes in the way that safety managers and operating units conducted many `routine' tasks. For instance, the revised guidance on loading ammunition into the Bradley Fighting Vehicles gave the following explosives safety guidance: "If a BFV is uploaded with only 25mm ammunition and other small arms ammunition, with the hatches and ramp closed, then that BFV is considered heavy armour. The heavy armour qualification allows such a BFV to have reduced quantity distance separations. Uploading with TOW missiles or other high explosives items removes the allowed reduction in quantity distance." [819]

These revised policies and procedures were published via the US Army's Explosives Safety Website (http://www.dac.army.mil/es/). However, the recipients of these revised guidelines were also warned that they were minimum guidelines and that, even if they followed them, they might still be in contravention of more restrictive practice regulations enforced by individual Major Army Commands; "before personnel act on these policies, personnel should check with their MACOM safety offices to see if MACOM policy mirrors Army policy" [819]. This duplication of authority creates considerable problems for the operators and managers of complex, safety-critical systems. This interaction between local requirements and the recommendations from central incident reporting systems complicates the problems of ensuring conformance with safety requirements. Tracking must, therefore, assess whether operational units meet the minimum recommendations proposed by an implementation plan. It must also determine whether those units meet the more stringent requirements that are often imposed when local units seek to enforce those recommendations.
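The resulting obligation, to satisfy both the central guideline and any stricter local policy, can be expressed as taking the most restrictive of the applicable limits. The toy Python example below assumes a maximum-speed style restriction and invented unit names; it is a sketch of the conformance check, not of any Army system.

    def effective_limit(central_limit: float,
                        local_limits: dict,
                        unit: str) -> float:
        # A unit must meet the central minimum guideline and any more
        # restrictive local (e.g. MACOM) policy, so the effective limit
        # is the most restrictive of the two.
        local = local_limits.get(unit)
        return central_limit if local is None else min(central_limit, local)

    # Hypothetical example: a 45 mph central restriction combined with a
    # stricter 40 mph local policy for one command.
    local_policy = {"unit-a": 40.0}
    assert effective_limit(45.0, local_policy, "unit-a") == 40.0
    assert effective_limit(45.0, local_policy, "unit-b") == 45.0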
A further purpose of tracking is to ensure that operators and managers receive correct information about revised operating procedures. Many reporting systems translate the recommendations that are embodied within implementation plans into more accessible formats. For example, the US Army publishes information about such changes in its Countermeasure magazine. Very rarely, mistakes can enter into a recommendation as it is translated between an implementation plan and the story that is disseminated through these publications. Such errors have important safety implications if they are not detected, either by feedback from the recipients of this information or through careful tracking by the operators of the reporting system: "Thanks to all the sharp-eyed readers who noticed that we published the incorrect maximum allowable speed for the M939A2 trucks in last month's Countermeasure. In the article The Rest of the Story on page 12, the correct sentence should read, `...the board checked the Army Electronic Product Support Bulletin Board via the Internet website http://aeps.ria.army.mil/ and discovered that there are two safety messages (GPM 96-04, 131807Z and SOUM 98-07, 081917Z) restricting the maximum allowable speed for
M939A2 trucks to 40 mph (not 45 mph as previously stated) until antilock brakes and radial tires are retrofitted. We're sorry for this error." [808]

Even if the intended recipients of an implementation plan successfully receive information about revised working practices or material changes, there is no guarantee that they will act upon them. Chapter 2 has described the problems of identifying the reasons that motivate non-compliance with safety instructions. Some incidents are due to deliberate violations; operators may not understand the safety implications of a failure to comply with particular instructions. Other incidents stem from the operators' failure to understand the procedures that are required of them. These problems can be illustrated by a recent incident in which the right-side track of an M113 Armoured Personnel Carrier snapped. This prevented the driver from steering effectively. It also prevented any braking maneuvers, which increased the vehicle's pull to the left. Subsequent examination showed that the pin on one block had worn through the metal parts that held it within the adjacent track block. A deep gouge in the hull and significant wear patterns on various track parts indicated a history of improper track maintenance. The M113 crew had all of the necessary tools and manuals to identify the problems. However, neither they nor the platoon leadership nor the company commander ensured the proper implementation of preventive maintenance procedures (PMCS) and revised operating regulations (DA PAM 738-750).

It is important to emphasise that such incidents often stem from multiple failures in the dissemination of safety-related information. An individual's failure to act on a particular implementation plan can have consequences that are compounded by their lack of information about other safety issues. For instance, previous incidents had also resulted in a maximum speed limit of 25 miles per hour being imposed on tracked vehicles, such as the M113, for the type of road that the crew was driving on. The driver did not know these limits, and neither the vehicle commander nor the squad leader traveling behind him took any action to make him slow down; "excessive speed contributed to the track failure and to the rate of turn of the M113, which resulted in roll-over" [804]. Such incidents are important not simply because they reveal the problems of ensuring compliance with the recommendations that have been made following previous incidents. They also illustrate particular problems in the dissemination of information about the associated implementation plans. It is, therefore, important that the managers of incident reporting systems track the analysis of future incidents in order to assess whether or not previous recommendations are being disseminated and acted upon by operational units.

Previous paragraphs have described how the recipients of an implementation plan may either explicitly refuse to revise their procedures or may neglect to follow their requirements. In other situations, personnel may be motivated to comply with an implementation plan but may lack the necessary resources to follow its provisions. Necessary resources can include the time and skills necessary to perform new procedures. They also include any new components that are identified in particular recommendations. Finally, the recipients of an implementation plan may lack the financial resources that might otherwise be used to make up any shortfall in other resources.
Ideally, such problems will have been considered and addressed during the development of an implementation plan. It would, however, be unrealistic to assume that such preparations would obviate the need to track the recipients' ability to satisfy the recommendations in these plans. There are situations in which the tracking of particular recommendations can reveal concerns about the effectiveness of an implementation plan. During 2000-2001, the US Army introduced an Improved Physical Fitness Uniform (IPFU). This was intended to offer improved comfort during exercise. It was also intended to reduce accidents and incidents through the incorporation of reflective material into the uniform. Many of the personnel who were issued with these uniforms were clearly motivated to conform with these joint requirements; to increase personal comfort and to ensure visibility during exercise. The Safety Centre, therefore, received several enquiries about the effectiveness of the improved uniform's reflectivity. Subsequent investigations found that the uniforms met their intended specification and that the comments were not triggered by either a design or a production defect. In consequence, the uniforms were not recalled in response to the end-users' concerns. Instead, the Safety Centre emphasised that the uniform was not intended to be a replacement for a luminous safety vest [811]. Such incidents are instructive because they contrast strongly with the use of implementation tracking to detect violations. In this case, staff were concerned to meet the
recommendations that informed the development of the improved uniforms. They felt, however, that the improved designs did not offer the necessary degree of protection. The US Army Safety Centre's response is also instructive. Instead of recalling the uniforms, their analysis of the end-users' comments revealed additional safety concerns. Personnel were potentially relying on the protection offered by the uniform's reflectivity rather than wearing a safety vest.

Implementation officers must track whether or not these validation and dissemination processes have introduced undue delays into the implementation of safety recommendations. This can be determined if a number of similar incidents occur before necessary changes are made to working practices or to process components. Such tracking activities can also reveal a converse problem in which recipients receive warnings well before they can act upon them. This occurs, for example, when advisories are issued for equipment that has not yet been received by its potential operators. In such circumstances, there may be an assumption that such warnings do not apply to their current tasks and hence they may be ignored. This can be illustrated by the findings of an incident involving one of the US Army's M939A2 wheeled vehicles on a public road [812]. Weather and road conditions were good and the vehicle obeyed the planned convoy speed of 50 miles per hour. In spite of this, the driver of an M939A2 failed to prevent the trailer that he was towing from `fish-tailing' as he started to descend a steep hill. One of the tires on the trailer blew and the truck rolled off the road. The subsequent investigation determined that the tires were well maintained and showed no defects. Witness statements and expert testimony confirmed that the vehicle was not exceeding the approved speed limit. The investigation board's maintenance expert asked if the unit was aware of any Safety-of-Use Messages or Ground Precautionary Messages on the vehicle. At first, unit personnel said no. They had only recently received their first two M939A2 trucks as replacements for older models. "At that point, the board checked the Army Electronic Product Support Bulletin Board via the Internet website http://aeps.ria.army.mil/ and discovered that there are two safety messages (GPM 96-04, 131807Z and SOUM 98-07, 081917Z) restricting the maximum allowable speed for M939A2 trucks to 45 mph until antilock brakes and radial tires are retrofitted. Further interviews with unit maintenance personnel determined that they had seen the messages when they came out. However, since the unit did not, at that time, have any M939A2 trucks, they did not inform the chain of command. The lesson here is whenever your unit receives new equipment, it is good practice to check all relevant Safety-of-Use Messages and Ground Precautionary Messages to ensure that you and your personnel operate the equipment safely." [812]

Such incidents illustrate the problems that can arise when attempting to ensure that implementation plans continue to be followed in the aftermath of previous failures. As we have seen, many modern organisations are characterised by their ability to change in response to their environment, to market opportunities and to technological innovation. This has several important consequences for those who must track the implementation of safety policies. New devices will be introduced into new working contexts.
Those devices may be subject to previous recommendations that must be communicated to the operators who must employ them within these new contexts. Similarly, new devices may interact with other components or working procedures that were themselves covered by existing recommendations. These changes can force revisions to existing guidelines and procedures. It is also important to stress that many organisations benefit from a dynamic workforce that moves between different production processes and regional areas. These workers carry their skills and expertise with them. There is considerable potential for them to apply procedures and regulations that were appropriate in their previous working context but which can be potentially disastrous in their new environment. In consequence, safety managers must typically find ways of ensuring that implementation plans do not simply provide short-term or local fixes for previous incidents. Tracking must continue until they are satisfied that revised procedures and components are seamlessly integrated into existing working practices throughout an organisation. As those procedures and components change, it may be necessary to revise previous recommendations and again track any consequent changes to ensure the continued safety of an application process.
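The lesson drawn from the M939A2 case above, that every applicable advisory should be retrieved whenever new equipment arrives, is essentially a lookup over an indexed message archive. The Python sketch below illustrates the idea; the data layout and the sample identifiers are assumptions modelled loosely on that example.

    from typing import Dict, List, Set

    def advisories_to_review(new_equipment: Set[str],
                             advisories: Dict[str, List[str]]) -> List[str]:
        # Return every safety message that applies to any newly received
        # equipment type, including messages issued before it arrived.
        return [msg_id for msg_id, applies_to in advisories.items()
                if new_equipment.intersection(applies_to)]

    # Hypothetical archive keyed by message identifier.
    archive = {
        "GPM 96-04": ["M939A2"],
        "SOUM 98-07": ["M939A2"],
        "GPM 97-12": ["M113"],
    }
    print(advisories_to_review({"M939A2"}, archive))
    # -> ['GPM 96-04', 'SOUM 98-07']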
12.4 Summary

This chapter has argued that recommendations are made in response to the causal factors that are identified by incident investigators. Some organisations, including the US Air Force [794], have argued that each recommendation must be related to a causal factor and that every causal factor must be associated with a recommendation. If a causal factor is not addressed then there is a possibility that a potential lesson will not be learned from a previous failure. If recommendations are not associated with causal factors then there is a danger that spurious requirements may be imposed for reasons that are unconnected with a particular incident. We have, however, pointed to alternative systems in which recommendations can be derived from a collection of causal factors. This often happens when investigators identify an incident as part of a wider pattern of previous failures.

This chapter has also identified a range of techniques that have been developed to help investigators derive the recommendations that are intended to prevent the recurrence of future failures or the `realisation' of near-miss incidents. The `perfectability' approach is arguably the simplest of these techniques. Given that many accidents and incidents are not the result of equipment failure, this approach focuses almost exclusively on the human causes of an incident. Recommendations are intended to perfect the performance of the fallible operators. An increasing number of researchers and practitioners have spoken out against this technique by arguing that investigators must focus on the context in which an error occurred [701, 342]. Instead they propose a more organisational view of failure that focuses recommendations on `safety culture'. They have certainly provided useful correctives to the `perfectability' approach. However, the backlash against `perfectability' has often neglected the pragmatics of situations in which operators and managers assume some responsibility for their actions.

Subsequent sections reviewed the use of heuristics to guide the development of recommendations. These heuristics guide investigators away from interventions that are explicitly intended to rectify specific instances of human error. They also provide useful guidance on the presentation and format of potential recommendations. For instance, we have cited heuristics that encourage investigators not to propose additional studies. Such recommendations often defer actions that are then not taken when the results of additional research are not acted upon. Similarly, other heuristics are intended to ensure that investigators consider what a recommendation is intended to achieve and who must implement it. A limitation with the heuristic approach is that it leaves considerable scope for individual differences to affect the detailed interventions that are proposed in the aftermath of an incident. Enumerations and recommendation matrices have been developed to ensure some degree of consistency between the findings of different investigators. For example, US Army publications provide lists of commonly recognised causal factors. The same documents also enumerate potential recommendations [796]. These can be linked into matrices so that investigators can identify a number of potential recommendations that might be used to address a particular cause. Unfortunately, this approach only provides high-level guidance about potential interventions.
The entries in a recommendation matrix tend to be extremely abstract so that they can be applied to the wide range of incidents and accidents that might be reported to complex and diverse organisations, such as the US Army. In consequence, a number of more detailed accident prevention models have been developed. These are generic only in the sense that they provide a high-level framework for the drafting of proposed recommendations. The intention is that investigators can refine them to a far greater level of detail than is, typically, achieved in recommendation matrices. The barrier model has been described in previous chapters as a causal analysis technique. The same approach can also be used to guide the identification of proposed recommendations. An important limitation with all of the approaches that have been summarised in the previous paragraphs is that they can be used to identify recommendations but not to assess their relative importance or priority. This is a significant issue for the safety managers who have to justify the allocation of finite resources in the aftermath of an incident or accident. In particular, they must ensure that the greatest attention is devoted to those hazards that are most likely to recur and which pose the greatest threat to the safety of an application. A number of proposals have been
made to address these problems. Most of these attempt to synthesise incident analysis and risk assessment techniques. Many practical and theoretical problems are raised by this synthesis. We have illustrated those problems using the US Army's five-stage process of risk analysis: identify hazards; assess hazards; develop controls and make risk decisions; implement controls; supervise and evaluate. Previous paragraphs have described how the first three stages can be used to prioritise recommendations in terms of the difference between an initial risk assessment and the residual risk associated with both the particular causes of an incident and the more general hazards that an incident helps to identify. The residual risk that motivates the promotion of a particular recommendation will only be achieved if the remedial actions are effectively implemented.

The closing sections of this chapter have shown how difficult it can be to validate the claims that are implicit within a risk assessment and how hard it is to ensure conformance with recommended interventions. For example, we have briefly examined the problems of documenting recommendations so that others can understand precisely what is intended and why it should be proposed. We have also looked at the difficulties of ensuring that accepted recommendations are implemented in good time across the many different operating units of complex organisations. These closing sections have also stressed the importance of tracking recommendations. It is important to obtain feedback about how remedial actions are being implemented throughout an organisation. We have argued that implementation officers must guard against non-compliance and the deliberate violation of proposed interventions. Equally, they must ensure that the relevant personnel are provided with access to the information that is necessary to implement a recommendation. They must also ensure that this information is presented in an accessible format that is easily understood by those who must use it.

The following chapter examines these presentation issues in more detail. It not only considers how individual operators can be informed about the recommendations that are intended to avoid future incidents. It also addresses the more general problems of structure, format and dissemination that must be addressed when drafting incident reports. In contrast, Chapter 15 considers some of the problems that arise when investigators and safety managers must gain an overview of the many previous incidents that can motivate sustained interventions in safety-critical applications.
Chapter 13
Feedback and the Presentation of Incident Reports

This book has argued that incident reporting systems can play a prominent role in the detection, reduction and mitigation of failure in safety-critical systems. Previous chapters have reviewed a number of elicitation techniques. These are intended to encourage operators to provide information about near-miss incidents and about the failures that affect their everyday tasks. We have also identified the primary and secondary investigation techniques that must be used to recover necessary information about these incidents. This information can be used to reconstruct the events leading to failure. These models, in turn, help to drive causal analysis techniques. Finally, we have described how each cause of a failure must be considered when drafting the recommendations that are intended to avoid, or mitigate the consequences of, future failures. None of this investment in the analysis of adverse occurrences and near-miss incidents would provide any benefits at all if the findings from an investigation could not be communicated back to the many different groups who have a stake in the continued safety of an application process.
13.1 The Challenges of Reporting Adverse Occurrences

A number of problems complicate the publication of information about near-miss incidents and adverse occurrences. Investigators must ensure that documentation conforms both to national and to international regulatory requirements. These constraints are better developed in some industries than they are in others. For example, the Appendix to ICAO Annex 13 contains detailed guidance on the format to be adopted by incident reports within the aviation industry [384]. A title must be followed by a synopsis. The synopsis is followed by the body of the report, which must contain information under the following headings: factual information; analysis; conclusions and safety recommendations. Further guidance is provided about the information that should be presented under each of these sub-headings. Annex 13 also provides detailed instructions on the procedures to be adopted when disseminating the final report into an incident. This approach can be contrasted with the guidelines provided by the International Maritime Organisation's (IMO) code for the Investigation of Marine Casualties and Incidents adopted under Assembly Resolution A.849 [387]. This provides detailed guidelines on the investigatory and consultative process that must precede the publication of any report. It says almost nothing about the format and content of any subsequent documentation. The lack of national or international guidelines provides investigators with considerable flexibility when they must document their findings about particular failures. It also creates considerable uncertainty amongst those safety managers who must ensure `best practice' in the operation and maintenance of reporting systems. This uncertainty is well illustrated by the UK National Health Service; risk managers are responding to calls to introduce incident reporting systems without guidance on the form that those systems should take. In consequence, a vast range of local initiatives
have been started to develop appropriate formats that might be used to disseminate information about adverse occurrences. The result is that hospitals have developed diverse approaches that both reflect local needs and make it very difficult to identify potential similarities between related incidents in different trusts. Further problems are created by the development of different national standards that can cut across these local initiatives. For example, the Royal College of Anaesthetists has taken a leading role within the UK National Health Service by issuing detailed guidance on how to gather data about critical incidents [715]. Unfortunately, the lack of national guidance in other areas of the healthcare system has resulted in standardised formats being used within certain areas of healthcare but not within others.

National and international standards are intended to support the exchange of information about previous failures. The recipients of these documents can have confidence that they will contain the information that is necessary to inform any subsequent intervention. Later paragraphs will return to the problems of encouraging this dissemination of incident reports between organisations that are often seen as `natural' competitors. For now, however, it is important to recognise the pragmatic problems that arise when attempting to draft minimum requirements for the formatting of incident reports. The problems that arise when attempting to apply causal taxonomies and recommendation matrices illustrate how hard it can be to anticipate the nature of future failures. Similarly, it can be very difficult to predict what form an incident report should take beyond the generic and extremely abstract categories that are proposed by the ICAO. There is a tension between the need to encourage consistency between reporting formats and the importance of allowing some flexibility in the reporting of individual incidents.

The diverse nature of near-miss incidents and adverse occurrences has many further consequences for the drafting and dissemination of incident reports. As we have seen, natural language is most often used to describe the sequence of events leading up to a potential failure. The same medium is used to represent the detailed causal analysis that will, eventually, support particular recommendations. Natural language has the benefits of accessibility and flexibility. No specialist training is required to understand it. It can also be used to capture diverse aspects of an incident and its causes. Unfortunately, it can also be ambiguous and vague about key aspects of an incident. It can be difficult to follow the large number of concurrent events that often characterise technological failure. Detailed timing issues are not well represented and it can be difficult to form coherent natural language accounts from the individual analyses of multi-disciplinary experts. The flexibility of natural language can be used to capture many different aspects of an incident. This flexibility is also a weakness because it supports the variety of interpretations that can lead to potential ambiguities. Subsequent sections will explore each of these issues in more detail.
13.1.1 Different Reports for Different Incidents

Previous chapters have emphasised the diverse nature of incidents within many industries. At one extreme, they include low-consequence near-misses that border on process improvements rather than safety issues. At the other extreme, reports provide information about high-consequence failures that cannot easily be distinguished from accidents rather than incidents. This diversity has an important effect upon the nature of the documents that are used to disseminate the findings of an investigation. For example, high-consequence failures are typically reported using a highly structured format in which reconstruction is followed by analysis, analysis is followed by recommendations and so on. This formal style of presentation can be illustrated by the US Coast Guard's table of contents for a report into the loss of a fishing vessel [833]. What we have termed the reconstruction of the incident is contained within the `finding of fact' section. The causal analysis is partly contained within these pages but is focussed on the `Conclusions' section:
Executive Summary ..... 3
Finding of fact ..... 4
  Hearing witnesses ..... 6
  Background of People Key to the Investigation ..... 6
  Description of the Fishery ..... 8
  Chronology ..... 8
  Description of the Vessel ..... 28
  Details of Vessel Surveys and Repairs ..... 40
  Stability Information ..... 50
  Vessel Management ..... 66
  External Environment ..... 74
Conclusions ..... 77
Recommendations ..... 96
A similar format can be seen in the Australian Transport Safety Bureau's (ATSB) Marine Safety Investigation reports [52]. As with the previous US Coast Guard example, the table of contents reflects the detailed investigation that was conducted in the aftermath of the incident. The reconstruction of the incident is contained within the `Narrative' sections. Causal analysis is presented under `Comment and analysis'. There are some differences between this report and the one described in the previous paragraph. Rather than presenting specific recommendations, the ATSB investigators identified contributing factors in the `Conclusions' section. The lack of proposed interventions in part reflects the nature of the incident. The report's conclusions identified specific procedural problems that contributed to the incorrect loading of this particular vessel. It can, therefore, be argued that the wider publication of such specific recommendations would have had marginal benefits for a more general audience. It also reflects recent initiatives by the ATSB to move away from a `perfective' approach towards a more `contextual' form of analysis:
Summary ..... 1
Sources of information ..... 2
Narrative ..... 5
  Sun Breeze ..... 5
  The incident ..... 5
  Loading at Bunbury ..... 6
  Sailing from Bunbury, the list and subsequent events ..... 9
Comment and analysis ..... 15
  Evidence ..... 15
  The charter party ..... 15
  Stability of the vessel ..... 15
  Stability at first departure ..... 16
  Righting lever and heeling arm curves ..... 17
  Sun Breeze: Lightship KG ..... 19
  The Class Society ..... 20
  Cargo stowage ..... 21
  Notice of intention to load timber deck cargo ..... 22
  Stowage factors, cargo weights and the master's responsibility ..... 23
  Deck cargo lashings ..... 23
  Two scenarios for the incident ..... 25
Conclusions ..... 27
Submissions ..... 31
Details of Sun Breeze ..... 33
Attachment 1: Stability terminology and principles ..... 35
The task of providing feedback from incident reporting systems is complicated by the different formats that are used to disseminate information about different types of incident. For example, the previous tables describe the formal structure that is typically associated with incidents that either did result, or might have resulted, in high-consequence failures. The level of detail included in the analysis is indicative of the resources that have been invested in the investigation. In contrast, many less
`critical' incidents are summarised by less formal reports. There are other reasons for exploiting a range of formats. For instance, in many industries it can be difficult to persuade operators and managers to read what are perceived to be long and complex documents about previous incidents that may, or may not, have particular relevance for their daily activities. In consequence, investigators often publish abbreviated accounts in a more `accessible' format. They summarise the events leading to the failure and provide a brief causal analysis in two or three paragraphs. For example, the UK Marine Accident Investigation Branch (MAIB) uses its Safety Digest articles to provide a brief overview of previous incidents. It is possible to identify sentences that relate to the reconstruction of an incident, to the findings of a causal analysis and to particular recommendations. The formal distinctions that are reflected in the section headings of the more exhaustive documents, illustrated by the US and Australian reports, are not used in these summaries: "The ro-ro cargo/passenger ferry SATURN was completing berthing operations alongside a pier at Gourock. Prior to rigging the gangway, it was normal practice for a seaman to throw the safety net ashore from the gangway gateway, which was normally secured in the open position by hooks but, on this occasion, was not. As the net was thrown ashore, part of it became entangled in one of the gates which caused it to close and knock the seaman off balance. He was caught in the net and fell overboard, landing heavily on a pier timber before falling in the water. The seaman surfaced and, with the assistance of another crew member, managed to hold onto a pier timber. Both were recovered from the water by a fast rescue craft." [514]

This account provides a `vignette' or `failure scenario'. It describes an incident in an extremely compact manner. Minimal information is provided about the more detailed contributory factors that are considered in more formal reports. The previous summary does not explain the reasons why the gate was not secured. In contrast, it provides readers with a direct account of the catalytic events that led to the incident and, most importantly, it illustrates the potential consequences of such incidents. Such accounts have strong similarities with the `war stories' or anecdotes that are an important means of exchanging safety-related information within teams of operators. This is an important strength of such immediate accounts. It can be argued, however, that the lack of more sustained analysis may limit any long-term effect on system safety.

The previous paragraph provides a relatively simple example of the use of incident vignettes. Several regulatory agencies have developed variations on this approach. Investigators can use these techniques to inform readers about specific safety issues. For example, there is a danger that short vignettes will focus on the specific events that led to a particular incident. It can then be difficult for readers to identify the more general safety issues that affect the operation or activity involved. There is even a danger that the readers of such incident scenarios will forget other safety issues by focusing on the specific failure described in the report. In consequence, some incident reporting systems use vignettes as a form of hook that is used to motivate readers to consider more general safety issues.
This can be illustrated by a US Coast Guard report into a particular incident involving a group of sea kayakers: "(they) unexpectedly encountered strong currents that resulted in three kayakers being separated from the group and set out to sea. While their friends were set offshore, the main group was able to land their kayaks on a small island. Because a member of the group now ashore carried a signal mirror, the group was able to attract the attention of persons on the mainland, who in turn notified the Coast Guard. Based upon information from persons ashore, an intensive 5 hour effort was launched that eventually located and recovered the missing kayakers. This incident underscores the need for proper planning and signaling equipment, and revealed some of the inherent difficulties in mounting open water searches for objects as small as sea kayaks." [829]

The final sentence in this quotation presents the particular conclusion or finding that can be drawn from this specific incident. It also illustrates the way in which such recommendations can reveal a great deal about the intended readership of the report. The vignette is clearly not intended for the members of the rescue service. If this were the case then some additional detail should be provided
about the "inherent difficulties in mounting open water searches...". In contrast, the recommendation is clearly intended for kayaking enthusiasts and recreational sailors. After drawing this specific conclusion, the Coast Guard report goes on to remind the reader of a number of more general safety precautions that should be followed when kayaking. The investigators place the specific recommendations about how to prevent this particular incident within the wider context of voyage planning and preparation. This is an extremely powerful technique. The particular circumstances of the incident act as a direct and clear example of the potential consequences of failing to follow safety information. It is doubtful whether the list of safety recommendations would have had the same effect if it had been presented without the incident as a preface: "Voyage planning: When planning a voyage, no matter how short or simple you intend it to be, take a few minutes to leave a float plan, including departure/arrival times, number of people and color of kayaks with a responsible friend. If it's a spur of the moment trip, write a plan just before you go and leave it in an envelope marked "FLOAT PLAN" on the dashboard of your vehicle. Make sure to always monitor the weather before and during your trip. Know your limitations: You alone are the best judge of your own physical limitations, the capabilities of your kayak, and most importantly, your ability to operate your craft and gear. Respect the indiscriminate power of the sea along the exposed Maine coast, and carefully avoid operating in restricted visibility, including fog, rain, and darkness..." [829]

We are currently working on a number of studies that intend to determine whether or not such presentation techniques have an impact upon decision making and risk-taking behaviour. A host of methodological problems affect such investigations. It is difficult to identify a procedure to demonstrate that individuals would be more likely to follow the safety guidance if they had been informed about previous incidents. These issues will be addressed more directly in the closing sections of this chapter when we look at the problems of validating the `effectiveness' of incident reports.
13.1.2 Different Reports for Different Audiences

It is important to emphasise the diverse nature of those groups that have an interest in the findings of an incident investigation. Other investigators must read the reports of their colleagues to encourage consistent analyses and common recommendations for similar incidents. This also helps to sensitise individuals to emerging trends within an industry. Designers and developers may also be concerned to read incident reports in order to ensure that previous mistakes are not replicated in future systems. The operators of the application processes that are described in an incident report must also be able to access the recommendations that emerge from previous failures. This not only helps them to understand any proposed revisions to their working practices, it also helps to disseminate information about the consequences of previous failures and the potential for future incidents. These reports must also be disseminated to the managerial staff who supervise end-user activities. In particular, safety managers must be informed of any recommendations. They are often required to ensure the implementation of proposed changes.

Regulators have an interest in tracking individual incidents. This is important if they are to monitor the safety record of individual firms. Such information helps to guide the dissemination of best practice between companies in the same marketplace. It is also important for regulators to monitor the changing nature of incidents across an industry if they are to identify potential patterns of failure. These comments apply to national regulators. There have also been a number of international attempts to compare incident data from different countries, such as the IMO's work to collate incident reports. National regulators and international bodies are not the only groups that are interested to learn about the insights provided by incident reports. The general public are often concerned to read the findings in these documents. This interest is often motivated by concerns over personal safety issues, including consumer protection and healthcare provision. As we shall see, many investigation agencies have responded to this concern by placing information about past failures on publicly
accessible web-sites. This wider interest is also being driven by an increasing willingness to engage in litigation. In consequence, legal practices are often concerned to follow the incident reports in several industries. It is difficult to determine whether this public concern has been created by media interest or whether media interest has been fuelled by the engagement of this wider audience. In either case, it is important to acknowledge that all forms of the broadcast media and publishing have an active interest in reports of previous incidents and accidents.

The diverse nature of the potential readership of an incident report creates problems for those who must draft incident reports. Different reporting formats offer different levels of support for particular tasks. For example, operators are likely to require precise summaries and detailed guidance on how to meet particular recommendations. Safety managers are likely to require more information about what a recommendation is intended to achieve and how to demonstrate conformance with its particular requirements. Investigators and lawyers are concerned to understand the reasons why certain causes were identified. They may also be concerned to ensure that recommendations provide appropriate defences against any recurrence of specific causal factors. Designers and regulators will, typically, require a higher degree of technical detail than system operators. Those responsible for the implementation of future systems must also be able to generalise from specific failures to anticipate whether similar problems might affect proposed designs.

Table 13.1 illustrates the range and diversity of reports that can be generated by a single institution. This summarises the reporting activities conducted by the Hong Kong Marine Department [366]. Many of these publication requirements relate to more serious incidents and accidents. It is important, however, to see the presentation and dissemination of less critical incident reports within this wider context of regulatory and judicial requirements. The range of documents that must be produced in the aftermath of an adverse occurrence or near-miss also imposes considerable logistical demands upon such organisations. For instance, primary and secondary investigations may generate interim reports that are intended to warn operators and supervisors of any short-term actions that might help to avoid or mitigate any recurrence of an incident. These documents are, typically, superseded by the final incident report that presents the outcome of the reconstruction, causal analysis and recommendation techniques that have been described in previous chapters. As we have seen, these reports trigger implementation advisories of various forms. These guide operators and managers on the actions that must be taken to fulfill particular recommendations. Finally, statistical summaries can be derived from databases of individual incident reports. These summaries may motivate issue-based reports that investigate a number of similar incidents.

It is difficult to overestimate the logistical challenge that is posed by the production and dissemination of these different documents to the diverse groups that have an interest in an adverse occurrence or near-miss incident. For example, the recipients of an initial notification about short-term corrective actions must be informed of any longer-term measures. If this is not the case then groups and individuals may continue to exploit stop-gap measures to prevent the recurrence of previous failures.
Safety managers must have access to updated statistical information if they are to determine whether or not a newly reported incident indeed forms part of a wider pattern. If this seems to be the case then they may have to obtain access to information about on-going enquiries into these related failures. Similarly, it is important that the people who contribute incident reports should receive updated information about the various levels of intervention that have been triggered by their observations. These distribution problems must, typically, be solved within short time-limits. It is important for information to be disseminated quickly in the aftermath of an incident. There is an obvious need to provide guidance on any short-term corrective actions. There is also a need to prevent any rumours that might be generated in the aftermath of an incident. Even in anonymous systems, it can be necessary to warn operators about the potential for future failure and to publicise the results of any secondary investigation. Conversely, it is important that any preliminary publications should not be premature. Unsubstantiated speculation can create confusion when subsequent incident reports are forced to contradict previous statements about the potential causes of an incident. These complexities not only affect the safety managers and incident investigators who must combat the causes of future incidents. They also affect the tasks of press officers and media relations officers. It is important that initial releases about an incident should not affect the results of any subsequent investigation.
Investigation Type: Informal Inquiry
Summary of Process: Carried out by an investigation board into less serious accidents. The Director of Marine accepts or rejects the report and institutes follow-up action.
Reporting Requirement: No formal report is prepared but findings may be included in a `Summaries' page on a web-site.

Investigation Type: Preliminary Inquiry (Hong Kong Registered)
Summary of Process: The Director of Marine appoints professional officer(s) to conduct the investigation. The report by the appointed officer is submitted to the Secretary for Economic Services with the Director of Marine's observations and action. The Secretary for Economic Services accepts or rejects the report.
Reporting Requirement: If a Marine Court is not ordered, the Preliminary Inquiry report will usually be published. If a Marine Court is appointed then it reports to the Chief Executive who can approve its publication and may accept or reject the findings/recommendations of the court.

Investigation Type: Preliminary Inquiry (Pilotage)
Summary of Process: For minor incidents, the Pilotage Authority appoints a Board of Discipline. The Board of Discipline may then recommend a caution, a written warning, a downgrade on the pilot's licence or that a Board of Investigation be held. For serious incidents, the Pilotage Authority commissions a Preliminary Inquiry. This can result in a Board of Investigation.
Reporting Requirement: The Board of Investigation submits its report to the Pilotage Authority. The Pilotage Authority decides on the report's recommendations and decides whether the report should be published.

Investigation Type: Local Marine Inquiry (In Hong Kong waters)
Summary of Process: The Director of Marine orders a Local Marine Inquiry for incidents occurring in Hong Kong waters, to be conducted by professional officer(s). The appointed officers submit a report to the Director of Marine. The Director accepts or rejects the findings/recommendations.
Reporting Requirement: Findings are published as a report or as a summary on the Department web-site.

Investigation Type: Industrial Accident
Summary of Process: The Marine Industrial Safety Section investigates incidents involving repairs to any vessel, the break up of a vessel, cargo handling etc.
Reporting Requirement: The Investigating Officer submits a report to the Director of Marine for serious or fatal accidents only. The Director of Marine accepts or rejects the findings/recommendations.

Investigation Type: Conduct of Fitness Inquiry
Summary of Process: The Director of Marine can initiate an inquiry into the conduct of a Hong Kong certified officer for "unfitness, misconduct, incompetence or negligence whether or not an accident has occurred" [366]. The inquiry is conducted by a judicial officer. The person conducting the inquiry may cancel or suspend a certificate of competency/licence or censure the holder.
Reporting Requirement: No formal report is prepared in most cases. A report is made to the Director of Marine.
Table 13.1: Accident and Incident Reports by the Hong Kong Marine Department
It is also important not to provoke immediate calls for action without careful consideration of the justification and potential risks associated with precipitate intervention. As we have seen, a number of different reports can be made about the same incident. Preliminary reports must be revised in the light of a secondary investigation. Final reports are informed by any subsequent causal analysis but their recommendations must be revised as regulators reassess the utility of any interventions. Figure 13.1 presents an annotated flow-chart of the procedures that support the New Zealand Transport Accident Investigation Commission's analysis of maritime incidents [631]. The rectangles that are drawn with a double line are used to denote the various publication activities that form part of a single incident investigation. These include the formal delivery of the final report. They also include the distribution of preliminary drafts to the various individuals and groups that have an interest in the outcome of any investigation.

Figure 13.1: Simplified Flowchart of Report Generation Based on [631]

Figure 13.1 extends the New Zealand process model with timing information. The investigation should begin within twenty-four hours of an incident being reported. The safety commission that oversees all investigations should receive a preliminary report within three days and so on. The details of such estimates depend on the nature of the incident being investigated. As we have seen, high-risk incidents may justify the allocation of additional investigatory resources. In general, however, such diagrams are useful because they provide a working schedule for investigators and regulators. They also provide an important overview for operators and even for the general public who may be keen to receive feedback about the course of an investigation. In order to validate such timescales, it is important for investigators to assess the amount of time that must be spent at each stage of the analysis. Few organisations take this as far as the US Federal Railroad Administration, which has estimated that it should take two hours to write an employee confidential letter, five and a half hours to review each employee statement, five hours to devise a monthly list of injuries and illness and so on [233]. The key point is, however, that if timescales are published then there must be some means of determining whether or not they are met. If they are routinely missed then either additional resources must be committed to an investigation or more realistic timescales must be published for the various participants in an investigation.
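Published timescales are only useful if there is some means of checking them. The following minimal sketch, in Python, compares logged investigation milestones against published targets; the milestone names are hypothetical and only the twenty-four hour and three-day figures are drawn from the New Zealand example above.

```python
from datetime import datetime, timedelta

# Hypothetical targets, measured from the time an incident is reported.
# The 24-hour and 3-day figures follow the New Zealand example above;
# the 90-day final report target is illustrative only.
TARGETS = {
    "investigation_started": timedelta(hours=24),
    "preliminary_report": timedelta(days=3),
    "final_report": timedelta(days=90),
}

def check_schedule(reported_at, milestones):
    """Return (milestone, overrun) pairs for every missed target.

    `milestones` maps a milestone name to the datetime at which it was
    completed, or None if it is still outstanding.
    """
    missed = []
    for name, target in TARGETS.items():
        deadline = reported_at + target
        completed = milestones.get(name)
        reference = completed if completed is not None else datetime.now()
        if reference > deadline:
            missed.append((name, reference - deadline))
    return missed

# Usage: an investigation whose preliminary report arrived a day late.
reported = datetime(2003, 6, 1, 9, 0)
log = {
    "investigation_started": datetime(2003, 6, 1, 20, 0),
    "preliminary_report": datetime(2003, 6, 5, 9, 0),
    "final_report": None,
}
for name, overrun in check_schedule(reported, log):
    print(f"{name} missed its target by {overrun}")
```

Even this crude form of monitoring makes it possible to tell whether missed targets stem from under-resourcing or from unrealistic published timescales.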
13.1.3 Confidentiality, Trust and the Media

A host of social and contextual concerns also affect the dissemination of information about incidents. Confidentiality is arguably the most important of these issues. Many organisations are concerned to ensure that reports are only disseminated to those groups that are perceived to have a `legitimate' interest in their contents. Operators and managers are encouraged to read incident reports while strenuous efforts are made to prevent the press, lawyers and even regulators from accessing the same documents. The sensitive nature of many incidents has also created situations in which organisations are willing to sacrifice some of the potential benefits of a reporting system in order to ensure that information about previous failures is not disclosed to these `unauthorised' sources. The ultimate examples of this sort of behaviour involve companies destroying incident databases to ensure that lawyers cannot detect examples of previous failures that might indicate negligence in failing to prevent subsequent incidents. Less extreme measures include the use of computer-based access control mechanisms that restrict the documents that a user of the system can view without specific permissions. Some industries have also suffered from a variant of the confidentiality and security concerns mentioned in the previous paragraph. They have become the victims of `spoof' incident reports that are intended to undermine public confidence in their products [98, 104, 107]. These are often created by disaffected employees, by competitors or by individuals with moral and political objections to particular industries. Within an organisation it is possible to exploit a range of technologies to ensure that an incident report has been produced by an authorised individual or group. For instance, electronic watermarking embeds a code within a file. This code is difficult to alter without corrupting the contents of the report but can easily be read by authenticating software to ensure the provenance of the document. If the watermark code is derived from the date at which the file was last edited then this approach can also be used to detect cases in which a report has subsequently been edited or `tampered with'.
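The provenance-checking idea can be approximated, for illustration, by a keyed digest over a report's content and its last-edited date. This is a minimal sketch of the principle rather than true steganographic watermarking; the key and report text are, of course, hypothetical.

```python
import hmac
import hashlib

# A minimal sketch: the provenance code is a keyed digest over the report
# body and its last-edited date. True watermarking would embed this code
# invisibly within the document itself; here it is carried alongside it.
SECRET_KEY = b"held-by-the-issuing-organisation"  # hypothetical key

def watermark(report_text: str, last_edited: str) -> str:
    payload = (last_edited + "\n" + report_text).encode("utf-8")
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()

def verify(report_text: str, last_edited: str, code: str) -> bool:
    return hmac.compare_digest(watermark(report_text, last_edited), code)

report = "Final report: unloading boom incident..."  # hypothetical content
code = watermark(report, "2003-10-01")

# An unaltered report verifies; any subsequent edit invalidates the code,
# which is how tampering after the recorded edit date can be detected.
assert verify(report, "2003-10-01", code)
assert not verify(report + " (amended)", "2003-10-01", code)
```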
It is less easy to deal with spoof reports that originate from outside an organisation. In particular, it can be difficult for safety managers and public relations staff to respond to requests for information about incident reports that they know nothing about. This creates particular problems given the concerns to preserve confidentiality, mentioned above. Denials that an incident report has been produced can be interpreted as an attempt to cover up potentially damaging information. Many incidents involve more than one organisation. Air Traffic Management incidents often stem from the interaction between different national systems. Similarly, maritime incidents can involve ships that are registered by different States. Each can independently produce reports into the same incident. In consequence, national and international regulators typically require some form of coordination that is intended to encourage agreement before a final report is disseminated to its intended recipients. This is illustrated by items 4 and 5 in the following section from the Marine Accident Investigators' International Forum's Code for the Investigation of Marine Casualties and Incidents. This code was adopted by IMO Assembly Resolution A.849(20). The State conducting an investigation should invite other "substantially interested" States to:

1. "question witnesses;
2. view and examine evidence and take copies of documentation;
3. produce witnesses or other evidence;
4. make submissions in respect of the evidence, comment on and have their views properly reflected in the final report;
5. and be provided with transcripts, statements and the final report relating to the investigation." [387]

Unfortunately, such high-level guidelines provide little direct help if States are forced to resolve any differences over the analysis of an incident. The difficulties of drafting and disseminating incident reports are further complicated when these documents contain commercially sensitive information. Previous sections have argued that previous failures provide important learning opportunities. There may be clear commercial benefits to be gained from not sharing these lessons with rivals in the same market place. Similar concerns can be seen within military `lessons learned' systems where new insights about previous failures can provide direct operational benefits. In consequence, national and international initiatives often rely upon regulatory intervention to ensure that safety-related information is disseminated as widely as possible. This laudable aim raises further questions about the format and presentation of the information that is to be shared. Participation in such schemes often implies that local systems have to conform to the minimum data standards that ensure the consistency and integrity of the common dataset. At best, these national and international presentation requirements can be integrated into existing local formats. There are, however, instances when these wider requirements are perceived to impose unnecessary additional burdens [423]. It can also be argued that these requirements reduce the effectiveness of local systems if they prevent investigators from tailoring the presentation of particular information to their immediate audience. In consequence, many systems maintain multiple versions of an incident report. An internal version can be developed to provide readers with detailed information about the particular local circumstances that contributed to a failure. There may also be a more generic account that is provided to national and international regulators.
These accounts supplement the aggregated statistical data about incident frequencies that are described in Chapter 15. Several problems can arise from attempts to maintain `separate' accounts of the same incident. Firstly, it can be costly to support the production, distribution and maintenance of these different versions of a report. This can involve the duplication of validation activities to ensure that each account conforms to different local and national requirements. There are also additional costs associated with the archiving and retrieval of each report. As we shall see, this can involve the development of two separate but linked information management systems. Secondly, it can be difficult to ensure that these separate accounts are consistent.
Even if different accounts are maintained for the best of reasons, there may still be a suspicion that internal reports are `cleaned-up' or `sanitised' before being distributed more widely. This can be illustrated by incident reporting across European Air Traffic Management systems. In one example, the manager of a national reporting system knew that a colleague in a neighbouring country had been involved in the analysis of a high-criticality air proximity violation. He was then surprised to see that they did not report any high-criticality incidents in their annual returns to EUROCONTROL. Their colleague later demonstrated that the incident did not fulfill the requirements that EUROCONTROL publish for such high-criticality mishaps. National safety managers had increased the level of criticality associated with the event because it was perceived to offer a number of key insights for the systems operating in that country. Such examples illustrate how inconsistencies between local and national or international reporting systems can arise from the best of intentions. There are other instances in which they reflect the deliberate `manipulation' of safety-related information. This section has briefly introduced some of the problems that complicate the presentation of information about previous failures. Chapter 15 looks at the complexities that arise when investigators must conduct statistical analyses of aggregate incident data. The remainder of this chapter, in contrast, looks at some of the existing and proposed solutions to those problems that affect individual incident reports. The analysis is structured around three generic issues that affect all reporting systems:
how to structure the presentation of an incident report?
As we have seen, there are national and international guidelines on the information that should be included within an incident report. These guidelines are not, however, available for many industries. When they are available, for example within the field of aviation incident reporting [384], they typically only provide high-level guidance about what sections should be included. They do not provide the detailed advice that is necessary when investigators begin to draft detailed accounts of previous incidents. This lack of guidance has resulted in a number of poorly formatted reports in which readers have to refer to information that is distributed across dozens of pages of analysis in order to gain a coherent overview of a particular mishap;
how to ensure the effective dissemination of incident reports? Previous sections have described the tension between the need to distribute incident reports to the many different groups and individuals who can make use of them and the need not to jeopardise confidentiality. There is also a concern to restrict `unauthorised' media intrusion. Other systems avoid these tensions by deliberately adopting an open distribution policy. This can create problems if contributors are reluctant to submit reports that can be seen by a broad audience. Irrespective of the overall dissemination policy that is adopted, investigators face considerable logistical problems in issuing and updating information about previous incidents. Increasingly, electronic information systems are being used to reduce the costs associated with paper-based distribution. These systems are often Internet based and come with a host of implementation issues that must be considered before such applications can effectively replace more traditional techniques. They do, however, offer considerable benefits in terms of monitoring the rate at which incident reports are accessed by their intended recipients; a sketch of this form of monitoring is given after this list;
how to validate the presentation and dissemination of incident reports?
Many incident reporting systems have been criticised because too much attention goes into the elicitation of data and too little goes into the effective application of that data to avoid future incidents [701]. It is, therefore, critical that some means be found of validating the particular presentation and distribution techniques that are used to disseminate the lessons of previous failures. This creates a host of problems. For example, Chapter 5 has described the Hawthorne effect that can bias the results that are obtained when users know that their actions are being observed [686]. In a similar manner, direct questions about the utility of incident publications can elicit responses that may not provide accurate information about their true value.
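The monitoring benefit mentioned in the second issue above can be illustrated by a minimal sketch that tallies a hypothetical access log by report and recipient group; in practice such a log would be derived from the web-server that hosts the reports.

```python
from collections import Counter

# Hypothetical access log: (report identifier, recipient group) pairs.
access_log = [
    ("report-2003-017", "operators"),
    ("report-2003-017", "operators"),
    ("report-2003-017", "safety managers"),
    ("report-2003-018", "regulators"),
]
intended_groups = {"operators", "safety managers", "regulators"}

# Tally how often each group has actually opened each report.
counts = Counter(access_log)
for (report, group), n in sorted(counts.items()):
    print(f"{report}: accessed {n} time(s) by {group}")

# Flag groups that have never accessed a report, rather than assuming
# that dissemination implies readership.
for report in {r for r, _ in access_log}:
    silent = intended_groups - {g for r, g in access_log if r == report}
    if silent:
        print(f"{report}: no recorded access by {sorted(silent)}")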
The importance of ensuring the effective dissemination of incident reports should not be underestimated. Unless employees are provided with authoritative information about previous failures, informal networks can grow up to exchange `war stories'.
These `war stories' provide important learning opportunities because they often encapsulate users' experiences during adverse occurrences. Unfortunately, they often over-dramatise particular incidents [749]. They can also recommend potentially unsafe interventions that contravene accepted working practices. These informal accounts are also dangerous because they exist as a form of `distributed knowledge' that lies outside the standard safety management procedures. There is no guarantee that all staff will be told the relevant anecdotes [342]. Nor is there any certainty that appropriate actions will be taken to resolve the underlying failures that lead to the adverse incidents that are described in these accounts. It is important to identify the intended readers of a report before investigators decide upon an appropriate structure or format for the information that is to be presented. As mentioned in the previous paragraphs, the different recipients of an incident report will have different information requirements. Tables 13.2, 13.3, 13.4 and 13.5 illustrate how information-needs grids can be drawn up to support this analysis. A separate tabular form is produced for each participant in the investigatory process. As can be seen, we have initially focussed on regulators, executive officers or board members, safety managers and operators. In local systems, some of these tables might be omitted. Some of the duties associated with safety managers might instead be allocated to system operators and hence the information needs would have to be revised appropriately. In larger, more formal systems, it would be necessary to introduce additional tables. For example, we have already described important distinctions between the information that is required by national and international regulators. Similarly, distinctions between different types of operator might result in additional tables being introduced to reflect their differing information requirements.
Regulators

Initial Report
  Reconstruction: An initial report of the events leading to the incident and an indication of the additional information sources to be analysed.
  Causal Analysis: A summary of the likely causes based on the initial report together with a preliminary account of any similar incidents. Several causal hypotheses are likely at this stage.
  Recommendations: Any immediate measures to be taken in the aftermath of the incident.

Final Report
  Reconstruction: A detailed description of what happened during the incident together with explicit justification of that account, citing the evidence to support each hypothesised event in the reconstruction.
  Causal Analysis: A documented causal analysis using one of the recommended techniques introduced in previous chapters of this book. This analysis may be presented in natural language but should be supported by an appendix documenting semi-formal or formal reasoning.
  Recommendations: A detailed analysis of the proposed recommendations describing what, when, who and how they are to be implemented (see Chapter 12).

Annual Summary
  Reconstruction: Reconstruction information may be omitted in a statistical summary or annual report. It should, however, be possible for regulators to work back from the items in the summary to the more detailed final report.
  Causal Analysis: If a causal taxonomy is used (see Chapter 11) then the codes or identifiers for each causal factor should be included in the statistical returns.
  Recommendations: If a recommendation taxonomy is used (see Chapter 12), the statistical analysis should include information about the correlation of those codes to causal factors.
Table 13.2: Generic Information-Needs Table for Regulators

Each information-needs table identifies the different documents that are used to disseminate information about an incident to a particular participant group. For instance, Table 13.2 denotes that regulators should receive an initial notification in the aftermath of an incident.
They should also receive a copy of the final report and an annual summary of data about all incidents that have occurred. Of course, this is not an exhaustive list. Additional interim reports may be required in some industries. Similarly, Chapter 12 has argued that closer regulatory intervention can be required to monitor and validate the implementation of particular recommendations. These caveats illustrate the generic nature of the information contained in Tables 13.2, 13.3, 13.4 and 13.5. The rows in each table must be tailored to reflect the information needs of the particular participants that are being considered. The columns of each table identify the information that should be included within each of the documents that are issued to a particular participant. The generic information-needs tables in this chapter reflect the distinctions that have been used to structure previous chapters. Information about the reconstruction of events is followed by a causal analysis. This, in turn, supports the presentation of recommendations. Again, however, additional columns can be introduced to reflect the more detailed information requirements that are specified in some industries. For example, Tables 13.2 to 13.5 might be extended to explicitly denote whether each document should contain information about mitigating factors or about the failure of particular barriers. Other document-specific information can also be included within these tabular forms. For example, Chapter 12 has argued that it is essential to devise a timescale for the production and delivery of information in the aftermath of an incident or accident. If this is not done then there is a danger that important safety measures will be delayed. There is also a danger that potential contributors will be disillusioned by the lack of progress in addressing safety concerns. Such timetable information can be introduced as an additional column within an information-needs table. We have not done this because such refinements can jeopardise the tractability of the tabular format. This problem might be addressed by drawing up a different information-needs table for each document that will be provided to participants in the investigatory process. Information-needs tables are intended to help investigators identify what information must be provided to each of the participants in an investigation. Each row of the table can be used to summarise the information that they must receive. It can also be used to explain the reasons why it is necessary to provide this information to regulators, executive officers, safety managers, operators and so on. The information that is contained in each of these tables can be used in a number of ways. For example, the simplest approach is to view each row as a specification of the information needs for a single document that is to be provided to a particular group of recipients. This technique would result in a final report being produced for regulators that was quite different from the version of the final report that is presented to executive officers. The former would focus more on the generic insights derived from the incident while the latter might provide board members with more detailed information about particular local factors. Of course, any proposed differences between these versions of a final report would have to be approved by the intended recipients. Previous sections have mentioned the suspicions that can be aroused when the internal version of an incident report is different from that delivered to a regulatory organisation.
An alternative application of information-needs tables is to use them to derive requirements for single documents that are intended to support different participant groups. This is done by identifying similar information needs that might be addressed within a single publication. For example, there are strong similarities between the information that `final reports' are intended to provide to regulators, board members and safety managers. By collating the respective requirements into a single table, it is possible to construct a checklist that can be used to determine whether any proposed report satisfies the individual requirements of each group. If a draft report does not, for example, provide safety managers with enough information about the potential need for additional data logging techniques, then it can be redrafted to support these potential recipients. Alternatively, investigators might choose to split off this participant group and draft a separate report to satisfy their particular information needs. No matter which approach is taken, the underlying motivation for these tables is that they focus the investigator's attention on the recipients' information needs. If these are not considered early in the drafting of a report then there is a danger that individuals and groups may be denied important feedback about the course of an incident investigation. Conversely, there is a danger that some participants may be deluged by a large volume of apparently irrelevant information. A simple sketch of this checklist application follows.
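In the sketch below, each participant group's `final report' row is treated as a set of required items and the merged requirements are audited against a draft's sections. The item names are hypothetical abbreviations of the entries in Tables 13.2 to 13.4.

```python
# Hypothetical abbreviations of each group's final-report requirements.
FINAL_REPORT_NEEDS = {
    "regulators": {"justified reconstruction", "documented causal analysis",
                   "recommendation implementation plan"},
    "executives": {"executive summary", "causal analysis summary",
                   "resource justification"},
    "safety managers": {"justified reconstruction", "logging requirements",
                        "prior similar incidents", "recommendation intent"},
}

def audit_draft(draft_sections, groups=FINAL_REPORT_NEEDS):
    """Return the requirements each group would find missing in a draft."""
    provided = set(draft_sections)
    return {group: needs - provided
            for group, needs in groups.items() if needs - provided}

# Usage: a draft that satisfies regulators but not the other two groups.
draft = ["executive summary", "justified reconstruction",
         "documented causal analysis", "recommendation implementation plan"]
for group, missing in audit_draft(draft).items():
    print(f"{group} still need: {sorted(missing)}")
```

Groups whose residual needs cannot sensibly be folded into the shared document are candidates for the split-off reports described above.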
Executive Officers (Board)

Initial Report
  Reconstruction: Executive officers within the organisation must be informed as quickly as possible about the course of events leading to an incident or accident. They must provide any necessary resources to support a primary investigation and must coordinate any response to the media.
  Causal Analysis: The causal analysis must summarise the preliminary findings but should stress any areas of uncertainty to ensure that precipitate action is avoided.
  Recommendations: Initial recommendations should be presented in the form of a risk assessment or cost-benefit trade-off. The most plausible worst case costs and consequences of potential future failures should be summarised. The potential interventions should be identified together with any potential adverse `side-effects' and their likelihood of preventing recurrence in the short to medium term.

Final Report
  Reconstruction: This should provide an executive summary of the events leading to an incident together with all of the information that will be provided to the regulator so that high levels of management can respond to questions from the regulator if necessary.
  Causal Analysis: The products of a causal analysis should be summarised together with references to the methods used and documentation that was produced. This is important if strategic decisions are to be justified by a detailed understanding of the mechanisms that led to previous failures.
  Recommendations: The final report to executive officers must include a detailed list of recommendations. They must justify the allocation of resources that are required to investigate how to achieve the objectives that are specified in the recommendations section of any report.

Implementation Updates
  Reconstruction: Refer back to the final report unless new evidence has been obtained.
  Causal Analysis: Refer back to the final report unless new evidence has been obtained.
  Recommendations: Detailed information must be provided about attempts to validate the successful implementation of particular recommendations. This should include information from the statistical analysis of incidents and accidents that will be provided to the regulators, see Table 13.2, but will be provided to management on a more frequent basis.
Table 13.3: Generic Information-Needs Table for Executive Officers
Safety Managers

Initial Report
  Reconstruction: The safety manager is responsible for overseeing the secondary investigation of an incident. They must be able to determine what is initially thought to have happened so that they can direct further investigation.
  Causal Analysis: Initial reports from a primary investigation provide partial insights into the causes of an incident. They are, however, critical if safety managers are to allocate appropriate analytical resources. For instance, an initial causal analysis may indicate a need to provide human factors expertise or to consult with equipment suppliers.
  Recommendations: Safety managers must safeguard systems in the aftermath of an incident. They must, therefore, consider ways of implementing those recommendations that they consider to be warranted.

Final Report
  Reconstruction: Safety managers will not only need to know the evidence that supports elements of a reconstruction, they also need to determine whether any additional logging or tracking equipment might be required to gather additional evidence about future incidents.
  Causal Analysis: Safety managers must be able to determine whether or not similar causal factors have contributed to previous incidents. They may also need to assess the effectiveness of the analysis performed by their investigators; this may imply greater access to the supporting analytical documentation than is required by other parties to an investigation.
  Recommendations: The safety manager coordinates the implementation of recommendations and so must be able to unambiguously determine the intentions behind particular proposed interventions. They must then initiate the process of determining how to achieve the recommended objectives. The safety manager will be responsible for producing the implementation reports that are passed to executive officers, see Table 13.3.
Table 13.4: Generic Information-Needs Table for Safety Managers

Each table is intended to tailor the provision of information to the particular needs of each recipient rather than allowing the provision of information to be determined by ad hoc requests. After investigators have identified the information needs that are to be satisfied by particular reports, it is then necessary to consider the most appropriate mode or format of presentation. The `mode' of presentation refers to the medium of transmission. Most incident reports are printed, although an increasing number are being published using electronic media. Some reports continue to be delivered orally, especially in the immediate aftermath of an incident when participants must focus their attention on mitigating actions. For example, Section 67 of the Hong Kong Shipping and Port Control Ordinance requires the owner, agent or master of a vessel to file an oral or written report within twenty-four hours of an incident occurring [365]. The format of a report refers to the content, structure and layout of information that is delivered by a particular mode. For instance, a written report may have to be submitted using an approved form. Alternatively, national regulations may simply specify the information that is to be provided without imposing any particular requirements on the particular form of presentation. For instance, sections 80 and 81 of the Hong Kong Merchant Shipping (Safety) Ordinance state that the Director of Marine must be informed of any notifiable incident. This report does not have to be in a `prescribed format'; however, the form M.O. 822 "Report of Shipping Casualty" is `recommended' [365].
Operators

Initial Report
  Reconstruction: Even in confidential systems, other operators may be aware that an incident has occurred. It is, therefore, often important to briefly summarise the events that led to the incident so that short-term action can be taken to prevent future recurrence.
  Causal Analysis: An initial causal analysis may also be issued with the intended beneficial effect of reducing unnecessary speculation. This may, however, be premature in most cases.
  Recommendations: The recommendations that are made in the initial aftermath of an incident must be limited by the information that is available to investigators and supervisors. It is important that no changes should be made that increase the likelihood of other forms of failure.

Implementation Report
  Reconstruction: It is important that operators receive feedback about the eventual outcomes of any investigation. This feedback is the single most critical factor in eliciting future contributions to most reporting systems. The feedback must provide a detailed account of what happened; this inevitably involves some compromise with the need to preserve confidentiality. Detailed accounts of what happened can be used to provide operators with the `bigger picture' of events that they may not have witnessed during an incident.
  Causal Analysis: Operators often form their own view of the causal factors behind an incident. These views may be modified or contradicted by the outcome of an official report. It is, therefore, important to justify particular findings. This must, typically, be done without the use of the semi-formal or formal techniques that supported an analysis, given that most operators will have no experience of those techniques.
  Recommendations: It is essential that operators understand the implications that particular recommendations have upon their working practices. They must not only be informed of what they must do and why, they must also be informed of the consequences of non-compliance and of proposed validation activities.
Table 13.5: Generic Information-Needs Table for Operators

Previous chapters have reviewed a range of different formats that can be used to support the presentation of information about particular aspects of an incident. For example, Chapter 8 has described how plans and maps can be used to supplement event-based reconstructions of the events that contribute to particular failures and near-miss incidents. The same chapter also examined a range of photorealistic and model-based virtual reality techniques that have been specifically developed to support the electronic dissemination of information about particular events. In contrast, Chapters 10 and 11 have presented a number of formal and semi-formal approaches to causal analysis. These can play an important role in justifying the findings that are made in many final reports. Current documents have often been criticised because they lack any detailed justification of their causal findings [469, 426]. Why-Because graphs, ECF charts and MORT tables might all be used to format the presentation of information within particular reports. Similarly, Chapter 12 has introduced recommendation matrices, barrier summaries and risk analysis matrices that can all be used to document proposed interventions. There is a tendency to satisfy the information requirements that are identified in Tables 13.2 to 13.5 using the products of those techniques that were used during the course of an investigation. For example, Table 13.2 argues that regulators must be provided with a detailed causal analysis as part of a final report. If investigators had themselves used ECF analysis to identify any causal factors then it would be relatively straightforward to use ECF charts within the body of their submission to national or international regulators.
This would, however, ignore the prime injunction to consider the recipient before drafting and disseminating any incident report. Within some industries, it may be entirely appropriate to exploit this semi-formal technique both to direct and to document a causal analysis. In most industries, however, it would not be appropriate to expect that regulators would be familiar with this approach. In consequence, investigators must first ask whether or not the recipient of a document might be able to use any proposed form. If the answer is no, or might be no, then that form must typically be supplemented by the use of natural language descriptions. In many situations, it can be difficult for investigators to determine whether or not participants might exploit the semi-formal and formal techniques that have been explicitly developed to support incident analysis. Similarly, it can be difficult to determine whether electronic presentation techniques provide an adequate alternative to more conventional modes. The closing sections of this chapter, therefore, describe validation techniques which might demonstrate that MORT, ECF charts and similar notations can satisfy recipients' information needs.
13.2 Guidelines for the Presentation of Incident Reports

The previous paragraphs have argued that both the mode and the format of incident reports must be tailored to the information needs of the intended recipients. If those recipients have limited access to computers then there are few benefits to be gained from attempts to exploit electronic presentation techniques. Conversely, if the intended recipients' usual mode of working requires on-line support then it can be particularly frustrating for them to search through, and maintain, archives of paper-based incident reports.
13.2.1 Reconstruction

Chapter 8 has considered the use of computer-based simulation techniques to support incident reconstruction. Chapter 9 has also described how investigators can exploit a range of graphical and textual notations to model the events leading to an adverse occurrence. This section looks at how investigators can communicate the products of this analysis to the wider audiences that were identified in the previous paragraphs of this chapter. In particular, we focus on the use of prose descriptions to describe the events that lead to near-miss incidents and adverse occurrences. This decision is motivated by the fact that Chapters 8 and 9 provide a detailed overview of graphical techniques. Subsequent sections in this chapter will also focus on recent advances in the use of computer-based techniques to support these prose reconstructions. A number of constraints limit the extent to which investigators can tailor the presentation of incident information to support the particular needs of the intended recipients. In particular, they must ensure that each report satisfies any applicable national or international requirements. This can be non-trivial. For example, the AUSREP and REEFREP Australian maritime reporting systems both exploit a message format for submitting initial reports that complies with IMO Resolution A.648(16) of 19 October 1989. The initial reports feed into systems that comply with the International Convention for the Safety of Life at Sea (SOLAS) Chapter V regulation 8-1, adopted by the IMO in 1996. The format of the reports derived from these systems must comply with the more recent IMO investigatory code [387]. National and international requirements often provide detailed guidance on the information that must be included in any reconstruction of an adverse occurrence or near-miss incident. For example, the reporting of maritime incidents in the UK is covered by the Merchant Shipping (Accident Reporting and Investigation) Regulations 1999. These require that the master of a vessel must send a report to the Chief Inspector of the MAIB using the quickest means at their disposal [347]. In any event, the report must arrive within twenty-four hours of the incident taking place. These initial reports must contain: the name of the ship and the vessel number or IMO identification; the name and address of the owners; the name of the master, skipper or person in charge; and the date and time of the accident. The report must also state where the vessel was from and where it was bound; the position at which the accident occurred; the part of the ship where the incident occurred, if on board; the prevailing weather conditions; the name and port of any other ship involved;
and the names, addresses and gender of people killed or injured. Finally, the initial report must also provide brief details of the incident, the extent of damage and whether it caused any pollution or hazard to navigation. There can often be several different sets of applicable national and international regulations covering the dissemination of incident reports. It is possible to identify two different types of information requirement that are specified in these documents: declarative data and incident chronologies. Declarative data provides the contextual information about a system and its working environment. These details often `set the scene' for incident chronologies. Textual and graphical time-lines focus less on the state of the system or environment prior to an incident. Instead, they focus more directly on the events leading to particular failures. The previous paragraph illustrated how the MAIB maintains relatively high-level requirements for this declarative and `procedural' information. Other organisations have far more detailed requirements for the information that must be provided when reconstructing an incident. For example, the following list summarises the Marine Accident Investigators' International Forum [519] requirements that have been adopted by the IMO [387].

1. Particulars of voyage: the port at which the voyage commenced and the port at which it was to have ended, with dates; details of cargo and draughts (forward, aft and midships) and any list; last port and date of departure and port bound for at time of occurrence; any incident during the voyage that may have a material bearing on the incident, or unusual occurrence, whether or not it appears to be relevant to the incident; plan view of ship's layout including cargo spaces, slop tanks, details of cargo, bunkers, fresh water and ballast and consumption.

2. Particulars of personnel involved in incident: full name, age, capacity on board and details of injury; description of accident; person supervising activity; first aid or other action on board; Certificate of Competency/Licence: grade, date of issue, issuing country/authority and any other Certificates of Competency held; time spent on the vessel concerned, experience on similar vessels, experience on other types of vessels, experience in current capacity and experience in other ranks; number of hours spent on duty on that day and on the previous days; number of hours of sleep in the 96 hours prior to the incident; any other factors, on board or personal, that may have affected sleep; whether a smoker, and if so, quantity; normal alcohol habit together with information about any alcohol consumption immediately prior to the incident or in the previous 24 hours; whether under prescribed medication and any ingested non-prescribed drugs; and records of drug and alcohol tests.

3. Particulars of sea state, weather and tide: direction and force of wind; direction and state of sea and swell; atmospheric conditions and visibility; state and height of tide, in particular, the direction and strength of tidal and other currents, bearing in mind local conditions.
4. Particulars of the incident: type of incident together with date, time and place information; details of the incident and of the events leading up to it and following it; details of the performance of relevant equipment with special regard to any malfunction; persons on the bridge and in the engine room and the location of the master and chief engineer; mode of steering (auto or manual); extracts from all relevant ship and, if applicable, shore documents including details of entries in official, bridge, scrap/rough and engine-room log books, data log printouts, computer printouts, course and engine speed recorders, radar logs, etc.; details of communications made between the vessel and radio stations, SAR centres and control centres, etc., with transcripts of tape recordings where available; details of any injuries/fatalities; and voyage data recorder information (if fitted) for analysis.

5. Assistance after the incident: if assistance was summoned, what form and by what means; if assistance was offered or given, by whom and of what nature, and whether it was effective and competent; if assistance was offered and refused, the reason for refusal.
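A minimal sketch suggests how such requirements might be proceduralised: each required particular becomes a field in a record, and a draft can be audited for completeness before submission. Only a small, hypothetical subset of the items listed above is modelled here.

```python
from dataclasses import dataclass, fields
from typing import Optional

# A hypothetical subset of the MAIF/IMO "particulars of the incident";
# None marks an item that has not yet been supplied.
@dataclass
class IncidentParticulars:
    incident_type: Optional[str] = None
    date_time: Optional[str] = None
    place: Optional[str] = None
    steering_mode: Optional[str] = None        # auto or manual
    persons_on_bridge: Optional[str] = None
    equipment_performance: Optional[str] = None

def missing_items(record) -> list:
    """List the required particulars that have not yet been supplied."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]

# Usage: a partially completed draft is flagged before submission.
draft = IncidentParticulars(incident_type="striking",
                            date_time="1999-06-09 02:40",
                            steering_mode="manual")
print("Still required:", missing_items(draft))
```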
It is important to emphasise that this is a partial list that is intended to be applicable to all types of incidents. Additional guidelines describe the more detailed information that must also be included when reports are submitted after groundings, collisions, fires etc. Such guidelines help to identify the information that should be included when documenting the events leading to an incident. They do not, however, provide detailed guidance on the format or mode of submission. As we have seen, the UK regulations simply require that masters make their initial report by the fastest means possible. The IMO's guidelines recognise that both initial and final reports can be submitted in a range of formats to be determined by the legal requirements of each State. The IMO's requirements not only affect the initial information that must be reported in the aftermath of an incident. They are often used to guide the presentation of subsequent documents, including the final report. This can be illustrated by the Transportation Safety Board of Canada's report into the striking of a dock by an unloading boom from a bulk carrier [786]. The reconstruction of the incident begins with a section that presents background `factual' information in a tabular format. This is illustrated in Table 13.6. This format illustrates how information requirements, such as those proposed by the IMO, can be proceduralised. The rows of the table act as a prompt to ensure that investigators provide the necessary information.

ALGOBAY
Port of Registry: Sault Ste. Marie, Ontario
Flag: Canada
Official Number: 372053
Type: Self-unloading Bulk Carrier
Gross Tons: 21,891
Length: 222.51 m
Draught: Forward: 8.05 m; Aft: 8.11 m
Built: 1978, Collingwood, Ontario
Propulsion: Two 10-cylinder Crossley Pielstick (10PC2-3V-400) diesel engines; 7,870 kW total. Single controllable-pitch propeller and bow thruster.
Number of Crew: 24
Registered Owner: Algoma Central Marine, Sault Ste. Marie, Ontario
Table 13.6: Canadian TSB Tabular Preface to a Maritime Incident Report [786]

Such tabular summaries provide the contextual details that are required by national regulations, such as the Merchant Shipping (Accident Reporting and Investigation) Regulations 1999, and international guidelines, such as the IMO investigation code. They provide an overview of the technical details that are necessary in order for the reader to understand the nature of the vessel or vessels that were involved in an adverse occurrence. They also act as useful points of reference or aide-memoires that the reader can refer to as they consider a report. Rather than having to look through dense paragraphs of prose, it is possible to go back to this initial table as a reference point for information about the vessel and her crew. The example illustrated in Table 13.6 is relatively brief. Extended versions of these summary tables have been used in incident reports for other modes of transport. For example, the UK Air Accident Investigation Branch exploits a similar approach to record the registered owner of the aircraft involved in an incident, the operator, the aircraft type, the nationality of the operators, their registration, and the place, date and time of the incident [15]. They use a similar tabular format to summarise information about the operators who are involved in an incident. These tables include the sex and age of the individual, the status of any operating licence, their rating, medical certification, the start of any relevant duty period, the time of their previous rest period and so on. Such tabular forms help to structure the presentation of information that must be present if an incident report is to conform to particular industry guidelines. They are declarative representations.
The facts that they describe, typically, hold throughout an incident. They cannot easily be used to describe more dynamic aspects of an incident. Table 13.6, therefore, illustrates a common means of prefacing more detailed reconstructions of a near-miss incident or adverse occurrence. Natural language descriptions are used in most reports to describe the way in which particular events contributed to an incident. This can be illustrated by the paragraphs that follow Table 13.6. A section entitled `History of the Voyage' follows this summary. It describes how the vessel departed Superior, Wisconsin, at 17:10 eastern daylight time on 7 June 1999 carrying a cargo of 26,137 tons of coal. The opening sentences of the narrative, therefore, satisfy more of the information requirements specified in the IMO's code. Unlike the contextual information that is summarised in the tabular format, illustrated in Table 13.6, this information is integrated into the description of the incident. As mentioned, it can often be difficult to locate such necessary information in large sections of prose. It is, therefore, important that check-lists be developed so that investigators can ensure that a report satisfies national and international requirements prior to publication. After having presented the contextual information summarised above, the report goes on to describe the events immediately preceding the incident. As with most incident reports, a discussion of more latent causes is postponed until the analysis sections. The narrative is often presented in as simple a form as possible. There is an attempt to minimise any additional commentary on the significance of particular events. This too is postponed until the subsequent sections of analysis. This is an important strength of many incident reports because the readers' interpretation of particular events is not biased by a premature commentary. Equally, however, it can be difficult for readers to determine which events in a narrative will turn out to have a critical significance for the causal analysis and which are introduced to provide additional background. For instance, the following quotation presents the subsequent paragraphs of the `History of the Voyage' section in the Transportation Safety Board of Canada's report. It is difficult to determine whether the unscheduled maintenance will or will not play a significant role in the course of the incident until readers complete the remaining paragraphs:

"On the morning following departure, as the vessel crossed Lake Superior, the chief engineer, with the authorisation of the master and shore management, shut down the port main engine for unscheduled repairs. At 17:45, the vessel advised Vessel Traffic Services that the passage through the locks at Sault Ste. Marie and the St. Marys River would have to be conducted with one engine. Permission was granted by the United States Coast Guard and Sault Ste. Marie harbour master. Before arriving at the Sault locks, the vessel had developed a one and one-half degree list to port. Concerned with low water levels in the St. Marys River and the 10 cm of extra draught the list would have created, the master ordered the second officer to lift the unloading boom and slew it to starboard as the vessel departed Poe Lock at 0225 on June 9. In doing this, it was hoped that ballast water remaining on board in the No. 3 port ballast tank would be displaced to starboard and the list corrected. As the boom was being lifted from its saddle, it began swinging to port.
Despite attempts by the second officer to check its movement with the slewing controls, the boom accelerated outwards until it contacted the front of the accommodation at an angle of 90 degrees to the vessel." [786]

At first sight, it would appear that the unscheduled maintenance is no more than a contextual detail. Investigators mention it in the report because it represents an unusual event that took place before the problems with the boom. In subsequent paragraphs, however, the reader learns that as soon as the boom began to hit objects on the shoreline the master put the engine to `full astern'. Unfortunately, this overloaded the single remaining engine. The watchkeeping engineer, therefore, informed the master that he would have to reduce power. The master eventually arrested the vessel's forward motion by ordering the lowering of the stern and both bow anchors. This illustrates the manner in which the unscheduled maintenance removed a potential barrier, the main engine, that might have prevented further damage once the incident had begun. This argument is not explicitly presented in the incident report until the analysis section. In consequence, it is very easy for readers to overlook the significance of the maintenance event even as they read about the problems of putting the remaining engine `full astern'.
As mentioned, this technique avoids prejudicing the reader. By separating the reconstruction and the analysis, individuals are encouraged to form their own hypotheses before reading the investigators' interpretation. On the other hand, this approach can be deeply frustrating. During recent interviews, a safety manager referred to the `Perry Mason' or `Agatha Christie' style of incident reporting. The reader never understands the true significance of a particular event until they read the final pages of a report. As a result, they must re-read the report several times in order to identify the way in which the elements of a reconstruction contribute to a causal hypothesis.
A Structural Analysis of Incident Reconstructions

There are a number of different ways in which investigators can structure their presentation of the events leading to an incident. For example, they can use prose accounts to summarise the elements of graphical time-lines such as those introduced in Chapter 9. This approach describes events in the order in which they occurred during an incident. Alternatively, investigators can exploit a more thematic approach in which the time-line leading to a failure is described from one particular perspective.
Subsequent paragraphs then go back to the beginning of an incident to present the flow of events from another perspective. These different approaches have a number of strengths and weaknesses. For example, readers can easily trace the ordering of events if they are described in the order in which they occurred. This can, however, create a false impression. The reader is presented with a global view of all the events that occurred across a diverse system at each point in time. The participants in an incident, typically, would not be in such a fortunate position. Further problems affect the use of more thematic approaches. If investigators describe the course of events from particular perspectives then it can be difficult for readers to piece together an overview of the concurrent failures that often characterise many incidents. The following list, therefore, summarises these different techniques for structuring the presentation of a prose reconstruction:
a single chronology;
As mentioned, this approach simply reconstructs the time-line leading to an incident. Each significant event is described in chronological order. This provides a relatively simple overview of an incident. An important benefit of this approach is that readers can quickly scan the text to identify what events occurred at any particular point in time. This scanning is facilitated by using time-stamps as marginal notes that can act as indices into particular paragraphs:

08:00  repairs were begun; however, while removing the cylinder head from the engine, it was discovered that a cylinder head stud was broken and would have to be replaced.
08:30 (est.)  The chief engineer, who had previous experience with this type of repair, informed the master of the situation and revised his estimated completion time upwards to 24 hours.
...
20:00  the head tunnelman reported to the chief engineer that he had finished draining and cleaning the hydraulic system. The chief engineer indicated that, because the head tunnelman was unfamiliar with the procedure to bleed the air from the hydraulic system, he decided to wait until daylight the following morning.

The limitation of this approach is that events from many diverse areas of an application process can be listed next to each timestamp. This can create problems because these entries will not be uniformly distributed over time. This implies that, for any particular system, there may be relevant information scattered across many dozens of paragraphs. This imposes considerable burdens upon readers who want to piece together what happened to a particular subsystem or operator; the sketch at the end of this list illustrates one way of easing that burden.
a single chronology with backtracking;
A number of further problems affect the use of single chronologies to structure the presentation of an incident. As we have seen, catalytic and latent events are not uniformly distributed across the time-line of an incident. Typically, initial failures may lie dormant until a number
of triggering conditions defeat any remaining barriers. This creates problems because it can be difficult for readers to gain an overview of an incident. An initial description of the latent failures can quickly become swamped by the mass of details that typically accompany any presentation of catalytic failures. Many investigators have responded to this problem by starting the reconstruction of an incident with a brief overview or summary of the events leading to a failure or near miss. Subsequent paragraphs then go back to the start of the catalytic failures to examine those events in greater detail. The Transportation Safety Board of Canada's report exploits this approach. The initial summary, presented in previous paragraphs, is followed by a more detailed reconstruction of the engineering and maintenance activities during the incident: "At 0800 on June 8 the repairs were begun; however, while removing the cylinder head from the engine, it was discovered that a cylinder head stud was broken and would have to be replaced. The chief engineer, who had previous experience with this type of repair, informed the master of the situation and revised his estimated completion time upwards to 24 hours. As the work would be conducted near the running starboard main engine, the chief engineer suggested that the vessel be stopped in Lake Superior for the duration of the repair. In consultation with the master and chief engineer, the company engineering superintendent decided that the vessel should proceed towards Sault Ste. Marie in case further shore support was needed for the repairs. During the previous sailing season, the vessel had operated for several months on one engine, including during passages through the American locks at Sault Ste. Marie..." [786] This is intended to provide readers with the contextual framework, including latent failures and mitigating factors, that is necessary to understand the significance of particular catalytic events. Unfortunately, this approach also suffers from a number of limitations. In particular, the use of a more detailed chronology for catalytic failures can reduce the amount of attention that is paid to latent failures. This potential bias is often countered by reports that devote most of the subsequent analysis sections to the longer-term causes of an adverse occurrence or near-miss incident.
multiple thematic chronologies;
Previous chapters have attempted to distinguish between incident reconstruction, which explains what happened, and causal analysis, which explains why an incident occurred in the manner that it did. These distinctions have been maintained because they are, typically, reflected in the structure of most incident reports. An initial discussion of the events leading to a failure is then followed by a discussion of the causes of those events. These distinctions can, however, become blurred in some reports. For example, some accounts are structured around several different chronologies. These time-lines each reflect a particular analytical approach to the incident. For example, an account of the human factors failures may precede a description of the events that contributed to any system failures. The subsequent analytical sections in the incident report are then used to 'weave' together the individual events that are included in these different chronologies. An explanation of systems failure may be given in terms of human factors issues, or vice versa. This approach has much to recommend it. The individual chronologies can be used to demonstrate that analysts have considered a suitable range of potential causal factors before performing their analysis. The causal analysis, in turn, provides an explicit means of unifying these disparate accounts. There are, however, a number of problems with this approach. It often makes little sense to provide a chronology of operator actions without also considering the system behaviours that the operators were responding to and were helping to direct. There are logistical problems in ensuring consistency between these multiple chronologies. It can also be expensive to recruit and retain the necessary technical expertise to construct these different perspectives, especially in small-scale local systems.
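As a purely illustrative sketch, the thematic structure can be pictured as a single pool of events, each tagged with the analytical perspective to which it belongs, from which one chronology is generated per theme. The themes and events below are invented for the example; the point is simply that each thematic account restarts from the beginning of the incident.

from collections import defaultdict

# Each event carries a time-stamp, a theme tag and a description.
# The themes (human factors, system behaviour) are invented examples.
events = [
    ("08:05", "human factors", "Operator silences low-pressure alarm."),
    ("08:05", "system", "Hydraulic pressure falls below alarm threshold."),
    ("08:20", "system", "Stand-by pump fails to start."),
    ("08:25", "human factors", "Operator requests engineering support."),
]

# Split the single time-line into one chronology per analytical theme,
# preserving chronological order within each theme.
themes = defaultdict(list)
for stamp, theme, text in sorted(events):
    themes[theme].append((stamp, text))

for theme, chronology in themes.items():
    print("--", theme, "perspective --")
    for stamp, text in chronology:
        print(stamp, text)

The subsequent causal analysis must then 'weave' these per-theme accounts back together; it is in that weaving that the consistency problems discussed below can arise.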
multiple location-based chronologies;
A variation on the previous approach is to present different chronologies that record the events taking place within particular locations or subsystems during an incident. For instance, the
report can describe the events on the bridge before describing what happened in the engine room. This approach provides readers with some impression of what happened to individuals and systems within that particular location. It avoids the false 'global' view that can often make readers wonder why operators did not intervene to rectify what seems, to them, an 'obvious' problem. There is, however, no guarantee that readers will avoid this potential pitfall even if location-based chronologies are used to structure a report. Each successive account contributes to their understanding of an incident. The cumulative insights that can be obtained from reading each of these accounts would clearly not have been available to operators or line managers. There are also more pragmatic problems. It can be difficult to ensure that these different accounts are consistent with each other, especially when materials and other forms of communication pass between different locations. A common flaw in this form of incident report is to find that a message has been sent from one location but that its receipt is never mentioned in subsequent descriptions. In such circumstances, the reader cannot easily determine whether the message was never received or whether it did arrive and the investigator simply omitted to mention its receipt in the reconstruction. A sketch of a simple mechanical cross-check for this flaw is given at the end of this section.

This is a partial list; investigators have used a number of hybrid techniques that draw on several of its elements. Many reports also combine textual chronologies with some of the graphical and diagrammatic reconstruction techniques that were introduced in Chapters 8 and 9. Subsequent sections of this chapter will describe how these combined approaches have recently been used to support the on-line publication of incident reports. For now, however, the key point is that investigators must consider the consequences that prose chronologies have upon the intended recipients of a report. If an extended single chronology is used then investigators can help readers to navigate a reconstruction by providing timestamps as marginal indices and by using different paragraphs to describe concurrent events in different areas of the system. If investigators do not consider the potential weaknesses of these formats then there is a danger that the resulting document will fail to support the various user groups that have been identified in previous paragraphs.

The task of reconstructing an incident does not simply depend upon the chronology that is developed. Investigators must also determine what to include and what to omit from any reconstruction. The following list summarises potential guidelines that might be used for determining what information should be included in a reconstruction:

1. is the information required by regulators? Bodies such as the IMO enumerate the information that must be provided in many incident reports. These requirements typically focus on declarative data about the type of system that was involved in an incident. It can be difficult to identify a suitable format with which to present this information. The statistical nature of much of this data lends itself to tabular formats rather than the prose descriptions that are used in other sections. These requirements also focus on gathering information about the catalytic events leading to a failure.

2. is the information necessary to understand catalytic events? Chapter 10 has shown how ECF analysis can proceed by reconstructing the flow of events back from the point at which energy was 'transferred'.
Conversely, investigators might use P-Theory to work forward from the first event that deviated from the normal pattern of operation. In either case, there is a focus on the catalytic events that contributed to an incident. It may seem relatively straightforward to present this material. The previous paragraphs have, however, summarised the problems that arise when concurrent interactions contribute to the course of an incident.

3. is the information necessary to understand latent events? We have also argued that it is important to understand the longer-term factors that contribute to an incident. For example, Chapter 3 described how systems may not be in a 'normative' state for many years; working practices can evolve to remove important barriers. This creates considerable problems for the investigators who must determine how best to present this material. If a linear chronology is used then it may not be easy for readers to understand how an apparently insignificant event contributed to an eventual incident. The
significance of that description may only emerge many dozens of pages later. Alternatively, if backtracking is used then the latent events can be described together with the more immediate 'triggering' conditions. As we have seen, however, such techniques can provide a perspective that was not available to operators at the time of any failure.

4. is the information necessary to support the narrative of other events? Some events are included not because they are essential to the readers' understanding of an incident but because they link other, more important events. These events can create considerable confusion. For instance, many investigators maintain the distinction between analysis and reconstruction by separating them into different chapters of a report. In consequence, it can be difficult for readers to distinguish between these 'filler' observations and latent or catalytic events. Such observations can, therefore, help to spark alternative causal hypotheses that must be explicitly rejected in any subsequent analysis if readers are to be satisfied by the investigators' interpretation of events.

5. is the information necessary to eliminate certain hypotheses? One means of restricting the number of putative hypotheses that might be evoked by any reconstruction is to explicitly provide information about events or conditions that were not apparent during an incident. For example, investigators often begin a reconstruction by providing information about the prevailing weather conditions. If they were bad then readers are informed of a potential cause of any failure. This information is, however, also included for incidents that occur under favourable conditions so that readers can better interpret the subsequent chronology. Such techniques must be used with care. It can be argued that, in seeking to inform the readers' interpretation of key events, investigators may be introducing an unwarranted bias into their accounts of an incident.

6. does the information describe a failed barrier? Investigators often decide to provide information about the protection mechanisms that were intended to protect a system and its operators. These events often stand out from reconstructions because they describe how cross-checks were not made. They may also describe decisions that were contradicted or countermanded to achieve particular operational goals. As with the previous items in this list, the presentation of such information can create certain problems for those who must draft incident reports. Many chronologies describe these checks without explicitly stating that they represented an opportunity to protect the system from a potential failure. This is an appropriate approach because the reconstruction of what happened is separated from the causal analysis of why it happened in that way. Equally, however, it can lead to disjointed accounts in which catalytic events are interrupted by accounts of apparently insignificant conversations between key personnel. It may then take many pages before readers learn that these conversations might have prevented or mitigated the consequences of the incident.

This is a partial list. Chapters 8 and 9 present further requirements for the information that must be considered in any reconstruction. These requirements can also inform the presentation and dissemination of information to the readers of incident reports. For instance, investigators must extend the scope of any reconstruction to include remedial or mitigating actions.
We have not extended the list to explicitly include these items because they have already been addressed in the previous chapters. In contrast, the items in this list describe the issues that must be considered when presenting particular elements of a reconstruction. For instance, it can be difficult for readers to understand the role that a latent failure played in an incident if it is included in a reconstruction without any supporting explanation. The elements in this list also help to highlight a number of more general issues. For example, investigators may decide to separate analysis from reconstruction in the manner recommended by the previous chapters of this book. This does not, however, imply that readers will simply switch off their analytical skills as they read a reconstruction and then switch them back on again as they start the section labelled 'analysis' or 'findings'. For example, it is tempting to interpret any situation in which an individual questions the safety of an operation as a failed barrier. A reader who forms this belief while reading the reconstruction of an incident may
retain this impression even if subsequent sections do not consider the event any further and if the operation has little influence on the outcome of an incident.
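The consistency flaw noted above for location-based chronologies, a message recorded as sent in one account whose receipt never appears in any other, lends itself to a simple mechanical cross-check. The following Python sketch assumes a hypothetical representation in which communication events carry a direction and a message identifier; it illustrates the check itself, not any tool that investigators are known to use.

# Hypothetical location-based chronologies. Communication events carry
# a direction ('sent' or 'received') and a message identifier.
chronologies = {
    "bridge": [
        ("19:40", "sent", "msg-1", "Bridge requests status of repairs."),
    ],
    "engine room": [
        ("19:55", "sent", "msg-2", "Engine room reports repair under way."),
        # Note: no ('received', 'msg-1', ...) entry appears here.
    ],
}

def unmatched_messages(chrons):
    """Identify messages whose despatch is recorded in one location but
    whose receipt is never recorded in any account."""
    sent = {msg for events in chrons.values()
            for _, direction, msg, _ in events if direction == "sent"}
    received = {msg for events in chrons.values()
                for _, direction, msg, _ in events if direction == "received"}
    return sent - received

# Flags msg-1 and msg-2: the reader cannot tell whether these messages
# were lost in transit or whether their receipt was simply omitted.
print(unmatched_messages(chronologies))

Such a check cannot, of course, distinguish a genuine communications failure from an omission in the report; it can only draw the investigator's attention to accounts that leave the question open.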
A Case Study in the Literary Criticism of Incident Reports

The following paragraphs use a report that was drafted by the ATSB to illustrate some of the issues raised in the previous paragraphs. This case study was chosen because the investigators exploit a simple single chronology with limited backtracking. The report, therefore, has a relatively simple narrative structure. As will be seen, the form of analysis that is applied to this incident report resembles the techniques that are used in literary criticism. This is entirely intentional. It is important to recognise the prose techniques that investigators often implicitly exploit when drafting their reconstructions of adverse occurrences. As we shall see, these techniques often have an important impact upon the readers of an accident report. The report concerns a stability problem that affected a merchant vessel, the Sun Breeze. The reconstruction begins with a textual summary of the declarative information (see Table 13.6) that was provided in previous incident reports: "The Panama flag Sun Breeze is an 11,478 tonne deadweight general cargo vessel, owned by NT Shipping SA. It was built in 1998 by Miura Shipbuilding Co Ltd of Japan and is classed with ClassNK (Nippon Kaiji Kyokai). The vessel was delivered to the owner on 9 February 1999, six months prior to the incident. Sun Breeze is 109.30 m in length overall, has a beam of 19.8 m and a summer draught of 9.264 m. Propulsive power is delivered by a seven-cylinder Akasaka diesel engine developing 5,390 kW driving a single fixed-pitch propeller and providing a service speed of 13.5 knots..." [52] This declarative summary continues by describing the dimensions of the vessel's holds and its ballast capacity. It also describes the previous expertise of the crew. For example, the master 'had sailed as mate for five years on general cargo ships, log carriers and bulk carriers and had 22 years experience in command of various vessels, mainly bulk carriers' [52]. The reconstruction then goes on to describe how 'no untoward incidents' were reported on previous voyages between Japan and Indonesia, Singapore, Malaysia and Thailand. The report then focuses on the chronology of events leading to the incident. The master received a fax from the charterers of his vessel on 2 August. His voyage instructions were to load a minimum of 10,000 m³, 'up to the vessel's full capacity', of jarrah and karri... The instructions also described how there were approximately 650 packs of 4.2 metre lengths of timber, 650 packs of 3 metre lengths, 20 packs of 4.5 metre lengths and 820 packs of 6 metre lengths. The master calculated that this would require some 14,354 m³ without any allowance for broken stowage [52]. Previous paragraphs have considered how the Sun Breeze case study satisfies the declarative requirements imposed by the IMO. The opening pages of the narrative describe the vessel, its crew, the cargo and the nature of the proposed voyage. The report then goes on to trace latent communication problems between the master and the company that was chartering his vessel. He tried to find out whether they wanted to load the minimum agreed figure of 10,000 m³; if so, the entire cargo could be loaded underdeck. If not, did they want to load the 14,354 m³ of cargo in the instructions? The shippers, acting on this latter presumption, had prepared to load about 15,000 m³ of cargo. This would exceed the vessel's bale capacity. The charterers replied by noting that the cargo had been fixed with them on the basis of lump sum freight and that the vessel had been accepted. This illustrates the way in which the ATSB investigators use the preliminary paragraphs to reconstruct the situation that was faced by the master of the Sun Breeze before he commenced the loading of his vessel. This is important because it enables the reader to follow the way in which the context for an incident developed from its initial stages. Readers are informed of the relationship between the charterer and the master. He voiced his concerns but was ultimately reminded of the contract that he was expected to honour. His concerns were also partly addressed by the development of a loading plan that met the charterer's requirements. It is also important to note that, although the opening sections of the chronology provide important information about the context in which the incident occurred, they only hint at the events that eventually threatened the safety of the vessel. It would surprise the reader to find that the incident did not involve the way in which the cargo was
stowed on-board the Sun Breeze. Equally, however, the opening paragraphs provide few cues about the eventual nature of the incident. This narrative technique avoids the problems associated with reconstructions that provide information that could not have been available to the participants in an incident. The report follows a simple, single-path chronology. The background events, mentioned above, are followed by an account of the more immediate events that led to the incident. The Sun Breeze arrived at Bunbury at 23:24 on 15 August 1999. Subsequent paragraphs describe how loading commenced at 09:00 on 16 August. It is important to emphasise, however, that many incident reports exploit a number of different chronologies. For example, the ATSB's reconstruction goes on to develop a multiple, location-based chronology. The report surveys the operations to load the cargo. While this was going on, the report also describes how the third mate took various steps to improve on these operations. Previous paragraphs have mentioned the burdens that this can impose upon readers. It can be difficult to reconstruct the time-line of events that led to an incident if individuals must piece together the partial orderings of multiple chronologies. This problem is often apparent in the marginal notes that are made by readers. For example, printed incident reports often contain informal sketches and time-lines that individuals have made to help them keep track of the parallel events that are described in the sequential accounts of prose reconstructions. Previous paragraphs have argued that readers cannot simply disengage their critical and analytical faculties as they read the reconstruction section of an incident report. In consequence, there is a continual temptation to filter and analyse the information that is presented in any account. This can be illustrated by the following paragraph from the ATSB's report. At one level, it describes how the ship's master managed to convince the harbour master that he knew how to address any stability concerns. At another level, the same prose can be interpreted as describing the failure of a barrier that might have prevented the incident. Previous paragraphs in this chapter have described a number of reasons why investigators might include particular information in an incident report. It is important to recognise that a careful reading of such documents will yield not only the basic event structure of any reconstruction but also the investigators' intentions behind the presentation of particular information: "Between 06:00 and 09:00 that morning, after loading had been completed, the harbour master noted that the ship initially had a list to starboard of about 4 degrees. It then had a port list of about 4 degrees before becoming upright. He became concerned about the vessel's stability... The harbour master went on board Sun Breeze at about 15:00 to discuss the vessel's stability with the master. The ship was upright then and the harbour master recalled the master saying that he was going to transfer fuel oil from high tanks to low tanks to provide additional stability. The master gave the harbour master a copy of the stability calculation for the vessel's departure condition, indicating that the GM, after correction for free surface effects in tanks, was 47 cm. The harbour master asked the master how he knew what the weight of the cargo was and whether the packs of timber were marked with weights. The master said that the packs were not marked with weights and that he had estimated cargo weights by draught survey. The master's reply gave the harbour master the impression that the master knew what he was doing." [52] Investigators have clear intentions when they introduce such narratives into incident reconstructions. Here, they describe how a potential barrier, provided by the harbour master's stability checks, was circumvented. The question that this paragraph raises is whether or not the intended audience for a particular report would be able to identify this intention as they read the narrative reconstruction. The previous sections of this book have introduced accident and incident models that provide semantics for terms such as 'barrier', 'target', 'event', 'condition' and 'catalytic failure'. These concepts form part of a vocabulary that many readers of an incident report will not have acquired. This has important consequences. Some readers will be able to interpret the intention behind particular elements in a reconstruction. Other readers of the same document may view them as random observations that seem to contribute little to the overall report. These individuals may require considerable help in the subsequent analysis if they are to filter the mass of contextual information and
'filler' events to identify the key aspects of an incident. The closing sections of this chapter will describe a range of validation techniques that can be used to determine whether or not such differences can jeopardise the effective communication of safety information within an incident report. As we shall see, however, these techniques have not been widely applied and there is little empirical evidence about individual differences in the interpretation and analysis of incident reports. One of the key problems in reading an incident reconstruction is that it can be difficult to distinguish key events from the mass of background detail that investigators often include in their accounts. As mentioned, these background details include contextual information that is necessary to establish the circumstances in which an incident occurred. They also include the less important 'filler' events that link together other, more significant aspects of an incident. This can be illustrated by the ATSB's description of the events that occurred immediately after the Sun Breeze left Bunbury. The first sentence can be described as a filler: 'When the harbour master disembarked at 18:15, he returned ashore and drove back to the wharf where Sun Breeze had been berthed, watching the ship'. The second sentence provides important information that must have been supported by interviews conducted after the incident had occurred: 'he had some lingering concerns about the vessel's stability but, when there seemed to be no problems as the vessel proceeded outbound, he returned home'. Readers face a number of further problems in anticipating what will and what will not turn out to be key elements of any reconstruction. For example, the same paragraph in the report continues: 'the vessel was being set to the east by the tide and he adjusted the course to 335 degrees, using about 5 or 10 degrees of port rudder to do so'. It is difficult to determine how important the bearings will be for the subsequent analysis of the incident. An initial reading cannot determine the significance to attach to the change of course. One consequence of this is that readers of an incident report often have to read reconstruction sections several times, after having read a subsequent analysis, in order to understand how particular events contributed to an eventual incident [749]. The previous paragraphs in this section have presented a structural analysis of the ATSB report. For instance, we have shown how this document exploits several different chronological structures. A single linear thread can branch into consecutive accounts that each depict parallel events in different locations. It has been argued that this can impose significant burdens on the reader of an incident report, who must piece together these accounts in order to derive an overview of the events leading to a near miss or adverse occurrence. Similarly, we have identified some of the problems that can arise from the usual practice of presenting a reconstruction before any causal analysis. Some readers may be forced to re-read narrative accounts several times before they can place key events within the time-line of an incident. We have also argued that it can be difficult to entirely separate analysis from reconstruction. A careful reading of an incident reconstruction will not only provide information about the 'flow' of events, it can also reveal the investigators' intentions behind the presentation of particular events.

This structural analysis should not obscure the importance of the prose that is used in any reconstruction. Investigators must tailor their use of language so that readers can clearly follow the flow of events. It is important that the prose should not over-dramatise the incident by adding literary effects that do not contribute to the exposition. The ATSB avoid this potential pitfall and provide a valuable example of the precise and concise use of language to describe what must have been an 'extreme' situation: "The 3rd mate changed back to manual steering and ordered 10 degrees of port helm to bring the vessel back on course. At this time, the vessel started listing to port. The mate, who was on the bridge at the time, told the 3rd mate to telephone the master. The master's phone was busy so the 2nd mate went below to call him. When the mate was on his way to the bridge to take over the watch, he noticed that the vessel was taking a port list. He thought that the vessel might have taken a 15 degree list but, by the time he got to the bridge, the vessel was coming upright again. He telephoned the master asking him to come to the bridge, after which the vessel took a starboard list. When the master returned to the bridge the list was about 15 degrees or 20 degrees to starboard. He stopped the engine. The rudder was amidships but the vessel was still turning to starboard. The list continued to increase as the vessel turned slowly. The ship attained a maximum list of about 30 degrees to starboard before it reduced to about 25 degrees.
At about this time, lashings on the cargo on the no. 1 hatch top released when securing clips opened and nine packs of timber were lost over the side." [52] The prose that describes subsequent events also exhibits this sparse but effective style. It also provides further examples of the way in which a single chronology will branch at key moments during an incident. In particular, the report uses consecutive parallel chronologies to describe the remedial actions on shore and on-board the vessel. The basic structure is further complicated by the use of consecutive time segments. In other words, an initial paragraph describes how a distress broadcast was received by a volunteer group who, in turn, triggered the response on-shore. A subsequent paragraph then describes the crew's actions while the police and harbour master were coordinating their efforts. A further paragraph then resumes the account of the shore-side activities. This technique is similar to the way in which film directors frequently cut between parallel streams of 'action'. It is, however, a difficult technique to sustain throughout the many pages of prose narrative that are presented in many incident reports. As in many films, the different strands of activity that we have termed 'chronologies' are brought together by joint efforts to resolve the incident: the harbour master drove to the pilot boat and eventually helped to coordinate the use of the tug Capel to disembark some of the Sun Breeze's crew. A single chronology is then resumed as the report describes how the master reduced the list to about 5 degrees by altering the ballast in the vessel's tanks. A surveyor joined the Sun Breeze and performed more detailed stability calculations. These identified problems both in the previous assumptions that had been used by the crew and in the factors that had been included in their calculation. The vessel eventually berthed again at Bunbury, where the cargo was secured and some of the packs were removed; 'the vessel sailed at 2005 on 25 August for the discharge port in China, arriving there at 0600 on 10 September 1999 without further incident' [52].

Less Detailed Reconstructions...

The directness of the prose style that is used in the ATSB report is even more important for incident reports that summarise less critical failures. These documents may be limited to a few brief paragraphs that must not only reconstruct the events that led to an incident but must also document any analysis and summarise the subsequent recommendations. Space limitations are not the only constraint that complicates the task of drafting these less formal accounts. The national and international requirements, listed in previous pages, do not apply to the less 'critical' failures that may only be reported to internal company schemes. Greater diversity is permitted for incidents which are perceived to pose a relatively low risk from any potential recurrence. A report into a minor workplace injury need not record the position of the captain and the chief engineer, as required under the IMO guidelines that were cited in previous paragraphs. Similarly, the information that is required in any reconstruction is also tailored according to the audience. The exhaustive lists of information requirements compiled for the UK MAIB and the IMO are intended to ensure that national and international regulators can access necessary details. This amount of contextual information is often irrelevant to the masters, operators and employees who must endeavour to avoid future incidents. Investigatory agencies, therefore, do not include all of the information that is provided in a final report to them when they disseminate accounts of an incident or accident to the industry that they protect. This point can be illustrated by the summary reports that various agencies have published to disseminate information about previous incidents. These documents strip out much of the contextual information and what we have described as 'filler' details to focus on particular hazards. As can be seen, four sentences are used to reconstruct the near-miss incidents. The investigators also summarise the recommendation to ensure that fuel containers are separated from potential ignition sources and secured to prevent shifting: "Portable space heaters are frequently utilised on-board vessels in this fishery to warm divers and to keep sea urchins from freezing. Coast Guard personnel have observed, at sea, a number of fishing vessels using these portable heaters while also carrying portable fuel containers, including those used to carry gasoline for outboard engines. If gasoline or other flammable liquids are carried on-board a vessel it is critically important to ensure
that these items are well separated from any potential ignition sources and secured or lashed in place to prevent shifting. Accidental spillage of any flammable liquid, especially gasoline, in the vicinity of an open flame source can result in a catastrophic fire that will quickly engulf a vessel." [827] Such brevity and directness is achieved at the cost of much of the detail that is provided in the previous example of the Sun Breeze report. Relatively little is said about the range of heaters and storage devices that were observed on the vessels. Nor do the Coast Guard state whether or not they saw any particularly hazardous uses of space heaters and fuel storage containers. It can be argued, however, that such details are irrelevant to the intended readers of this account. The Coast Guard have identified a generic problem based on their observation of previous incidents. The recommendation is also suitably generic so that individuals can apply the advice to guide their daily operations. It might appear from the preceding discussion that the level of detail that is provided in any reconstruction should simply reflect the nature of the incidents that are being reported on. The Sun Breeze incident was inherently more complex, involving potentially greater risks from any recurrence, than the 'simpler' problems observed by the US Coast Guard. The ATSB could have summarised their incident in a few lines: 'various communication failures contributed to the incorrect loading of the cargo; this ultimately compromised the stability of the vessel'. Such a summary would, however, strip out necessary information so that readers would have little opportunity to gain the many insights that the investigators derived from their more detailed reconstruction of this incident. Conversely, 20-30 pages could have been devoted to the reconstruction of previous incidents involving the storage of flammable substances on-board oyster boats. It is far from certain that such a detailed analysis would contribute much beyond the existing summary [827]. It is, however, too superficial to argue that the nature of an incident determines the depth of any reconstruction. A number of additional factors must be considered when investigators decide how much detail should be introduced into a reconstruction. In particular, Chapter 2 identified the tension that exists between the need to provide sufficient contextual information for operators to understand the ways in which an incident occurred and the potential problems that can arise if a report threatens the anonymity or confidentiality of a submission. If too many details are provided about the context in which an incident occurred then the readers of a report may be able to infer the identity of the person who originally instigated a subsequent investigation. Conversely, if too few details are provided then operators may feel that any recommendations are not properly grounded in the detailed operational circumstances of their working practices [423]. The elements of the following list present some of the issues that help to determine the amount of detail that must be introduced into any reconstruction:

1. what are the boundaries of trust and confidentiality? There may be significant constraints upon the amount of detail that an investigator can include within any reconstruction. Participation rates in any incident reporting system can be threatened if previous assurances of anonymity are compromised by an investigator's publication of particular items of information. This can lead to a difficult ethical decision in which investigators might choose not to release important information about a potential hazard in order to safeguard the longer-term future of the reporting system [444].

2. how serious is the incident? If investigators are not constrained by bounds of confidentiality then the level of detail in a report can be determined by an assessment of the potential seriousness of an incident. Chapters 11 and 12 have described how risk assessment techniques can be used to assess the potential threat that might be posed by the recurrence of an incident. This can be estimated in terms of the probability and consequence associated with each of the hazards that contributed to the incident. Investigators must also account for the risk associated with those hazards that were identified during an investigation but which did not actually occur during a near miss. In general, these techniques are only likely to provide subjective estimates that cannot easily be validated until some time after a report has been published and disseminated.
3. how complex is the incident? Chapter 1 reviewed Perrow's argument that the increasing coupling of complex application processes is producing new generations of technological hazards. These hazards are being generated faster than techniques are being developed to reduce or mitigate them [675]. If this is correct then it will be increasingly difficult to summarise an adverse occurrence or near-miss incident in only a few sentences or paragraphs. The complexity and coupling of application processes defy attempts to summarise their failure. It is certainly true that incidents involving the use of high-technology systems, in general, require greater explanation than those that involve less advanced systems. It is difficult, however, to be certain whether this is a result of the inherent complexity of high-technology systems or of the lack of familiarity that many potential readers have with their design and operation.

4. how much of the context can be assumed? Investigators can often omit contextual information if they assume that such information is already widely known amongst their intended readership. This provides a partial explanation for the hypothesised differences between reports into the failure of high and low technology systems. Investigators need not provide additional information about the nature and use of fuel containers because they can assume that most readers will be familiar with these components. On the other hand, greater detail must be provided for similar reports into the failure of navigational radar systems. Such assumptions can, however, be ill-founded. As we have seen, it can be difficult to anticipate the many diverse groups and individuals who have an interest in any particular incident report. In consequence, investigators can make unwarranted assumptions about readers' knowledge of particular working practices.

5. how significant are the recommended changes? Investigators may be forced to provide a detailed reconstruction of an incident or incidents in order to demonstrate that particular recommendations are justified by previous failures. If they propose radical changes to particular working practices, or considerable investment in additional plant, then safety managers must be confident that those recommendations are warranted by a detailed analysis of a near miss or adverse occurrence. In particular, it can be important to identify those plants, systems or production processes that have been affected by previous failures. If this is not done then operators can claim exemptions from particular recommendations on the grounds that any subsequent analysis is not based on evidence from their production processes.

6. can details be introduced to achieve particular effects? The introductory sections of this chapter argued that investigators must consider the information requirements of the various readers of an incident report. Any proposed document must support the different needs of regulators, safety managers, operators and so on. These requirements clearly have an effect on the level of detail that is required in any reconstruction. As we have seen, safety managers will require considerable attention to detail if they are to accept radical reforms. Conversely, system operators may require less detailed information in order to motivate them to implement specific changes to their daily routines [864]. Not only can investigators vary the level of detail in incident reconstructions to support the different end users of a report, they can also adjust these details to achieve specific effects upon those readers. For example, the US Coast Guard place their warnings about the dangers of fuel canisters in the context of the oyster fleet even though the same warning might be applied to everyone at sea. It can be argued that a generic or abstract warning, 'never put fuel canisters next to an ignition source', would have lacked the immediacy and directness of a report that places the analysis and recommendation within the context of a brief reconstruction of previous near misses involving particular vessels that were observed by particular Coast Guard officers.

The previous list identifies a number of factors that can influence the level of detail that investigators provide in the reconstruction of a near miss or adverse occurrence. Several of these concerns can be illustrated by a case study from the UK MAIB's Safety Digest. This provides a relatively brief account of a particular set of failures. The location of the incident and the types of vessel involved
are all clearly identified in the reconstruction. Anonymity is less important than disseminating this contextual information which, as we have seen, can contribute to the authenticity and 'directness' of a report. As we shall see, however, this detailed example is then used to make some very generic points about the nature of system failure: "The Veesea Eagle, a 622gt standby vessel, was on station in the North Sea. Early one morning, the superintendent received a call saying that No 1 generator had failed due to an exhaust pipe failure. Shortly afterwards, he received a further report saying that because of a damaged piston, No 2 generator had also failed. Back on board, the harbour generator was started to enable repairs to No 2 generator to be undertaken. While the company arranged for a replacement standby vessel, the chief engineer started to replace the damaged piston. Meanwhile, the harbour generator also failed. Although the main engine was still functioning, and steering was available by using the independently driven Azimuth Thruster, the company decided to tow the vessel back to port for repairs." [517] This example illustrates how relatively short reports can be well situated within particular locations and operating environments. The vessel is clearly identified as the Veesea Eagle, and its purpose and methods of operation can be inferred from its role as a standby vessel in the North Sea oil fields. The previous list, however, makes the point that such assumptions can be dangerous. Some potential readers may lack the prior knowledge that is necessary to infer the style of operation from the first sentence of the report. This MAIB report is instructive for a number of further reasons. At one level, the previous quotation describes a 'freak' collection of failures that successively deprived the operators of their reserve power sources. The author of this report, however, takes considerable care to use this particular incident to make several more generic points about the nature of incidents and accidents. They achieve this using a relatively simple technique that avoids much of the jargon that often weakens more academic work on accident models. These points are illustrated by the findings that follow the reconstruction cited above:

1. No 1 generator: had been running successfully following a complete overhaul, including a new crankshaft, earlier in the year. After the vessel re-entered service, the chief engineer adjusted the fuel timing to improve performance. When he left, the relieving chief engineer also adjusted the fuel timing, but had not been told about the last adjustment. The result of the latter was massive afterburning damage to a piston head, cylinder head, and exhaust trunking.

2. No 2 generator: had also been running successfully, when a piston failed for no apparent reason.

3. Harbour generator: failed because of a lack of lubricating oil and it wasn't monitored. [517]

There is no mention of a Swiss cheese model, or of dominoes, or of latent and catalytic causes. The analysis does, however, provide clear examples of the ways in which different types of failure combine to breach redundant defences. It is tempting to argue that the format and layout of the report reflect the investigator's intention to explain how each defence was breached. This implicit approach might, in time, encourage readers to recognise the value of this style of analysis. Such an interpretation is not as unrealistic as it might sound, given that readers are often expected to infer the more general lessons that can be derived from such specific reconstructions. Unfortunately, there is no way of knowing whether this was a deliberate intention on the part of the investigator or whether they simply described the reasons why each power source failed on demand.
13.2.2 Analysis

The previous paragraphs illustrated the way in which the MAIB presented the particular findings that were derived from the Veesea Eagle. The following paragraphs build on this analysis to identify a number of more general issues that must inform the presentation of any causal analysis in
incident reports. This chapter draws upon case study reports that have been published by maritime investigation authorities in countries ranging from Australia to Japan, from Hong Kong to Sweden. Before reviewing the presentation of causal findings in these reports, it is important to emphasise that the case studies reflect existing reporting practices within the maritime domain. A number of important characteristics differentiate these practices from those that hold in other areas of incident reporting, for example within commercial aviation or the medical industries. Maritime incident reporting systems have some advantages over these other systems. In particular, as we have seen, the IMO provides a structure for ensuring some degree of consistency between the incident reporting systems that are operated in member states. Equally, there are a number of features of the maritime industry that create problems which are less apparent in the aviation domain. For example, many maritime occupations are characterised by a plethora of relatively small companies. The structure of the in-shore fishing industries provides a strong contrast with that of the global market in commercial passenger aviation. In consequence, although we focus on particular case studies that are drawn from the maritime industries, the following analysis will also occasionally digress to look at wider issues that affect the presentation of causal analyses in other domains. Very few maritime incident investigations exploit any of the semi-formal or formal causal analysis techniques that have been introduced in Chapters 10 and 11. This is not to say that causal techniques have not been proposed or developed for these industries. For example, Pate-Cornell has demonstrated how a range of existing techniques might have been applied to the Piper Alpha accident [665]. Wagenaar and Groeneweg's paper entitled 'Accidents At Sea: Multiple Causes And Impossible Consequences' exploits a variant of the fault tree notation introduced in Chapter 9 [852]. Unfortunately, this pioneering work has had little or no impact on the reports that are disseminated by investigation agencies. The situation is best summarised by recent proposals for a research program to 'tailor' causal analysis techniques so that they can match the requirements of the maritime industries. The US Subcommittee on Coordinated Research and Development Strategies for Human Performance to Improve Marine Operations and Safety, under the auspices of the National Research Council, helped to draft a report advancing a 'prevention through people' program [835]. This report proposed that, in order for the Coast Guard to interpret information about previous successes and failures, 'a methodology using root cause (or factor) analysis tailored for the marine industry and taking legal issues into account, could be developed'. Such 'a system of analysis' would 'trace the chain of events and identify root causes or factors of accidents in maritime transportation system'. The general lack of established causal analysis techniques within the maritime industry is also illustrated by very similar European research initiatives. For instance, the European Commission recently funded the Casualty Analysis Methodology for Maritime Operations (CASMET) project [154]. This was intended to facilitate 'the development of a common methodology for the investigation of maritime accidents and the reporting of hazardous incidents, to improve the understanding of human elements as related to accidents and account for these aspects in the common methodology'. Although the project developed a high-level architecture for such a method, it did not have the impact on regulatory organisations that was anticipated. As we shall see, the CASMET criticisms of previous practices apply as readily today as they did when the project was launched in 1998. The project argued that 'present schemes are rooted in a compliance culture, in which the competence and focus are by tradition oriented towards guilt-finding' [154]. Even those countries that have established independent investigation units will still initiate legal proceedings if an investigation reveals a violation. They further argue that this inhibits individuals from contributing to an incident reporting system. This quotation reflects the results of a survey that the CASMET project conducted into existing incident and accident investigation techniques within the European maritime industries. The emphasis was on identifying situations in which individuals and organisations failed to comply with regulations. Few attempts were made to perform any form of deeper causal analysis to understand why such failures occurred. Their analysis raises important issues but it is slightly superficial. Chapter 5 has analysed the limitations of 'no blame' systems in comparison to the 'proportionate blame' approaches that have been adopted within areas of the aviation and healthcare communities. There are a number of further reasons why maritime incident reports are not guided by the causal analysis techniques that have been applied in other domains. One important consideration is that statutory requirements often focus on the findings that are derived
from a causal analysis rather than on the process that is used to produce them. The use of particular techniques is, typically, less important than ensuring that any incident is investigated and reported on within particular time limits. For instance, the UK Merchant Shipping (Accident Reporting and Investigation) Regulations 1999 simply state that the master will 'provide the Chief Inspector with a report giving the findings of such examination and stating any measures taken or proposed to prevent a recurrence' within fourteen days of a serious injury [347]. Such requirements do not simply reflect a concern with identifying violations; they also reflect the need to address important safety concerns within a specified time limit. They also reflect a pragmatic desire to maximise limited investigatory resources. These are particularly stretched by the transient and dynamic nature of the maritime industry. It is an obvious point, but the location in which an incident occurs does not remain in the same position as it might in many other industries. Indeed, it may move outside the territorial waters of the nation in which the incident occurred. This gives rise to the National Research Council's concern that any causal analysis technique must recognise the particular legal concerns that characterise the maritime industry [835]. The complexity of addressing these legal issues often results in a high level of ambiguity in the international recommendations that are intended to guide both the conduct and the publication of any causal analysis. For instance, the IMO's Code for the Investigation of Marine Casualties and Incidents simply defines a cause to mean 'actions, omissions, events, existing or pre-existing conditions or a combination thereof, which led to the casualty or incident' [387]. While this indicates a relatively broad view of causation, it provides little concrete guidance to investigators who must first identify appropriate causal analysis techniques and then determine how best to publish the findings of such techniques. In contrast, the IMO code continues by identifying the high-level, organisational processes that might support any analysis. It avoids imposing any constraints on the techniques that might be used: 'marine casualty or incident safety investigation means a process held either in public or in camera conducted for the purpose of casualty prevention which includes the gathering and analysis of information, the drawing of conclusions, including the identification of the circumstances and the determination of causes and contributing factors and, when appropriate, the making of safety recommendations' [387]. The lack of specific guidance on appropriate causal analysis techniques, or on presentation formats for the findings of such approaches, does not imply that international regulatory organisations are insensitive to the problems of analysing incident and accident reports. The 20th session of the IMO assembly adopted resolution A.850(20) on the 'human element vision, principles and goals' for the Organisation. This stressed that ships' crews, shore-based management, regulatory bodies, recognised organisations, shipyards and even legislators contribute to the causes of near-miss incidents and adverse occurrences. It was argued that "effective remedial action following maritime casualties requires a sound understanding of human element involvement in accident causation... gained by a thorough investigation and systematic analysis of casualties for contributory factors and the causal chain of events". It was also argued that the "dissemination of information through effective communication is essential to sound management and operational decisions". Unfortunately, the resolution did not identify appropriate means for reconstructing such causal chains, nor did it describe effective means of dissemination. In contrast, the resolution identified a number of high-level goals that mirror the more general objectives of the IMO's investigation code. Item (e) describes the high-level objective of disseminating the lessons learned from maritime incident investigations within the wider context of safety management within these industries:
"(a) to have in place a structured approach for the proper consideration of human element issues for use in the development of regulations and guidelines by all committees and subcommittees;
(b) to conduct a comprehensive review of selected existing IMO instruments from the human element perspective;
(c) to promote and communicate, through human element principles, a maritime safety culture and heightened marine environment awareness;
(d) to provide a framework to encourage the development of non-regulatory solutions and their assessment based upon human element principles;
(e) to have in place a system to discover and to disseminate to maritime interests studies, research and other relevant information on the human element, including findings from marine and non-marine incident investigations; and
(f) to provide material to educate seafarers so as to increase their knowledge and awareness of the impact of human element issues on safe ship operations, to help them do the right thing." [388]

The lack of specific guidance provided by national and international organisations is understandable. Their main concern is often to ensure that incident and accident reporting systems are established in the first place. Unless incident and accident reporting systems are perceived to meet particular local needs then there is a danger that they will be under-resourced or abandoned. The imposition of detailed requirements for causal techniques or presentation formats might jeopardise the ability of individual agencies to tailor their systems to meet those local needs [423]. In consequence, different nations have adopted radically different approaches to incident analysis even though they are signatories to the same IMO resolutions. These differences have a significant impact on the manner in which the findings of a causal analysis are both presented and disseminated. For instance, the Japanese maritime incident reporting system is based upon a judicial model. "The inquiry takes an adversarial form pitting the parties concerned against each other. `Open court', `Oral pleadings', `Inquiry by evidence' and `Free impression' are employed in the inquiry. Moreover the independence of the Judge's authority in exercising inquiry is laid down by the Marine Accidents Inquiry Law. The examinee, counselor and commissioner may file an appeal with the High Marine Accidents Inquiry Agency within seven days after the pronouncement when he is dissatisfied with a judgement pronounced at a Local Marine Accidents Inquiry Agency. A collegiate body of five judges conducts the inquiry in the second instance through the same procedure of the first instance. When a judgement delivered by the High Marine Accidents Inquiry Agency is not satisfactory, it is possible to file litigation with the Tokyo High Court within 30 days after the delivery to revoke the judgement." [393]

The Swedish reporting system is far less adversarial. The purpose of maritime investigations is to provide `a complete picture of the event'. The intention is to understand why an accident or incident occurred so that effective preventive measures can be taken to avoid future failures. The Swedish investigation board argue that `it is to be underlined that it is not the purpose of the investigation to establish or apportion blame or liability' [768]. Such contrasting approaches have important implications for the nature of the findings that are likely to emerge from any investigation. They can also have a profound impact upon the techniques that might be used to disseminate those findings. For instance, in the Japanese system the individuals who contribute to an incident will have direct disciplinary action taken against them by the investigatory board if they are found to be negligent. In contrast, the Swedish system arguably adopts a more contextual approach that seeks to understand the circumstances that contribute to a failure rather than simply punish any particular violation or error. Such distinctions are an over-simplification. They ignore many of the cultural influences that have a profound effect on the manner in which these systems are actually operated. They also ignore the detailed motivations that justify particular approaches.
These caveats are important because other researchers and analysts have often been too quick to identify a `blame culture' or a `perfective approach' in systems in which they are not involved. Although it is clear that the Japanese system exploits a judicial process, this does not mean that it ignores the contextual factors that lead to errors and violations. Although the Japanese system may ultimately prosecute individuals and groups, this is the very reason why a judicial process is exploited: "Marine accidents occur frequently not merely from human act or error but from a complexity of factors like working conditions, ship and engine structure, natural circumstances such as harbor and sea-lane, and weather and sea condition. On the other hand, material and circumstantial evidence is often scant. So it is sometimes difficult to grasp what actually happened and to investigate its cause. Since disciplinary punishment against the holder of a seaman and pilot's competency certificate may restrict
their activities and rights, a lawsuit-like procedure is used in marine accidents inquiry to ensure careful consideration and fairness." [393]

One cannot assess the way in which an incident reporting system is operated simply from the aims and mission statements that it espouses. Abstract comments about the multi-faceted nature of incidents and accidents need not result in practices that reflect a broad view of causation. It is, therefore, important to look beyond such statements to examine the findings that are produced by these different systems. For instance, the Japanese Marine Accident Inquiry Agency have published a number of the incident reports that they have submitted to the IMO. These include an analysis of a collision between a dry bulk carrier, Kenryu-Maru, and a cargo vessel, Hokkai-Maru [391]. The reconstruction exploits two location-based chronologies. The events leading to the collision are described first from the perspective of the Kenryu-Maru's crew and then from that of the Hokkai-Maru. The Kenryu-Maru reconstruction describes how the third mate found the echo of Hokkai-Maru on the radar but did not report it to the master. The Summary of Events section continues that `the master went down to his room without a concrete order that the third mate should report him when the visibility become worth'. Some time later, the third mate again established radar contact with the Hokkai-Maru `but he did not set eyes on HOKKAI-MARU carefully'. He changed course `degrees but he did not watch out for HOKKAI-MARU carefully on her radar to determine if a close-quarters situation is developing and/or risk of collision exits' nor did he use `sound signals in restricted visibility'. The report describes how the third mate ordered the vessel hard to starboard when he eventually recognised that the echo of the Hokkai-Maru was crossing their course. He made an attempt to telephone the master `but he could not dial as it was so dark'. The collision occurred shortly afterwards.

The style of reconstruction used by the Japanese Marine Accident Inquiry Agency in the Hokkai-Maru report is very different from that described in previous incident reports. There is a strong element of interpretation and analysis in the Summary of Events section. For instance, the reconstruction includes the statement that `the third mate did not know (about) this dangerous situation, as he did not watch carefully by her radar... he believed each ships could pass on the port each other'. Such sentences are, typically, to be found in the analysis sections of the reports produced by organisations such as the US National Transportation Safety Board (NTSB) or the UK MAIB. In contrast, the findings from the analysis of the Hokkai-Maru incident were presented to the IMO as follows:

"1. Principle findings: Both KENRYU-MARU and HOKKAI-MARU did not sound signals in the restricted visibility, proceed at a safe speed adapted to the prevailing circumstances, reduce her speed to the minimum at which she can be kept on her course and take all her way off if necessary, because they did not keep watch on each other by radars. Both masters did not ordered to report to them in case of restricted visibility. Also both duty officers did not report to their masters on restricted visibility.

2. Action taken: The masters of KENRYU-MARU and HOKKAI-MARU were inflicted reprimand as a disciplinary punishment. The duty officers of the both ships were inflicted suspensions of their certificates for one month as a disciplinary punishment.
3. Findings affecting international regulations: No reported" [391]

It can, therefore, be argued that the judicial nature of the Japanese maritime reporting system has a profound impact on the presentation of a reconstruction, on the interpretation of an adverse occurrence and on the findings that are drawn from an analysis. Similar collisions have led other organisations to look in detail at the precise communications that occurred between members of the crew immediately before the incident [826]. These findings have prompted further research into what has become known as Bridge Resource Management. Conversely, other incident reports have looked at the particular demands that are imposed on crew members who use radar for two potentially conflicting tasks. Such systems are typically used both for target correlation, for example to identify
the position of the Hokkai-Maru, and for navigation support [406]. It should be stressed that it is impossible to tell whether these broader issues were considered during the investigation and analysis of the incident. It is clear, however, that the findings that are published in the report focus on the responsibility of the individuals who were involved in the incident. Chapter 12 has argued that a perfective approach that focuses exclusively on individual responsibility and blame is unlikely to address the underlying causes of many incidents. This does not imply, however, that the Japanese approach will be counter-productive. We have already cited the objectives of the investigation agency. These clearly recognise the complex, multi-faceted nature of many incidents and accidents. It is, therefore, possible to distinguish the broader aims of the reporting system from the findings of judicial enquiries. It is entirely conceivable that other actions may be instigated within the transport ministry to address the broader problems that are illustrated by the Hokkai-Maru incident even though the investigation agency publishes a narrower view that reflects a focus on individual responsibility. Such a hybrid approach avoids situations in which operators view a reporting scheme as a means of avoiding disciplinary action for their actions.

It is, however, important not to base such wide-ranging observations upon the analysis of a single report. The Hokkai-Maru case study might not be typical of the other reports that are produced by this system. It can be argued that the investigator who drafted the findings in this report was unusual in their focus on individual blame. Alternatively, the crews' actions in this incident might genuinely be interpreted as negligent in some respect. A careful reading of the other submissions to the IMO does, however, confirm that this style of analysis is part of a wider pattern. The judicial nature of the investigatory and analytical processes results in findings that focus on individual responsibility for the causes of adverse occurrences and near-miss incidents. For instance, another report describes the grounding that ultimately led to the loss of the Bik Don, a cargo vessel. The findings of this report again focus on the role of the master rather than the circumstances that helped to influence his actions; `the grounding was caused from that the master did not check the circumstances of course, which he selected newly for short-cut, on charts or sailing directions' [392]. However, no disciplinary action was taken even though the vessel was lost and the crew had to be rescued by a patrol boat of the Japanese Maritime Safety Agency. There are a number of puzzling aspects to this incident report. It is unclear why the analysis identifies human error as the primary cause and yet the judicial system did not take any retributive action. This might reflect a recognition that the master had suffered enough from the loss of a vessel under his command. Alternatively, this might be an indication that the investigatory authorities had recognised some of the contextual issues that influence human error. There is some evidence for this in the report's `Summary of Events' which again contains detailed causal analysis: "He thought she could pass through south of the island safely because he sometimes saw the same type ships as her navigated that area. He had no experience to pass through south of Shirashima islands.
But he did not confirm about Meshima island shallow waters extended from Meshima island, and especially independent shallow water named Nakase that was apart 1,000 meters from the Meshima island shallow waters by checking charts or sailing directions before his deciding." [392]

This case study illustrates two key points about the presentation of any causal analysis. Firstly, readers can experience considerable practical problems if the interpretation of an incident is distributed throughout a report. It can be necessary to read and re-read many different chapters in order to piece together the reasons why particular causes were identified. Unless this can be done, it is difficult to justify or explain the recommendations that are proposed in the aftermath of an incident. In the previous example, the decision not to act against the master could only be explained by piecing together elements of the analysis that were presented in the Summary of Events and in the Principle Findings sections of the IMO report. Secondly, it is important that investigators explain the reasons why particular conclusions were reached and why other potential causes were excluded during an analysis. In this case, the incident report focuses exclusively on the actions of the master. It does not consider factors, such as external pressures to meet contract obligations, that may have motivated the navigational error. In contrast, the reader is simply informed of the finding without any of the intermediate analysis or documentation that might accompany the application of the techniques
described in Chapters 10 and 11, such as MORT or ECF analysis.

It is important to note that the IMO reports, cited above, provide only brief summaries of each adverse occurrence. The documents are only between five and six pages in length. The tractability of these documents might be significantly compromised if these additional details were included. This, in turn, can act as a powerful disincentive for many readers who are daunted by the task of carefully reading incident reports that often run to dozens of pages in length. There are, however, a number of techniques that might be used to address this problem. For instance, summary reports might explicitly refer to the more detailed documents that are required to justify a causal analysis. In the Japanese maritime system this might include summary transcripts of the judicial process that is used to examine causal hypotheses. ECF charts or other more formal documentation might be provided. Alternatively, the findings of a causal analysis might be justified in a natural language summary. Clearly, the choice of approach must depend on the investigatory process, the resources that are available to support any analysis and on the nature of each incident. Without this information, however, readers can be left with considerable doubts about the findings of many incident investigations [749].

The Japanese approach to the presentation of causal findings within IMO summary reports can be contrasted with the approach adopted by the Swedish Board of Accident Investigation. The high-level guidelines that describe their approach to causal analysis are very similar to those presented by the Japanese Marine Accident Inquiry Agency. Both stress the way in which incidents stem from multiple causes. As we have seen, the Japanese argue that "...marine accidents occur frequently not merely from human act or error but from a complexity of factors" [393]. The Swedish approach is summarised as follows: "An accident can normally not be attributed to one cause. Behind the event are often enough a series of causes. An investigation is therefore aimed at the consideration of different possible causes to the accident. Many of these will in the course of the investigation be eliminated as improbable. This elimination however means that the remaining possible causes gain strength and will develop into probable causes. Needless to say, both direct and indirect causes must be considered." [768]

There are also some interesting differences between the Japanese Marine Accident Inquiry Agency's guidelines and those published by the Swedish Board of Accident Investigation. As might be expected, the Japanese stress the need to protect those involved in an incident when the findings of any report `may restrict their activities and rights'. A judicial process is required because it can be `difficult to grasp what actually happened and to investigate its cause' [393]. The value of these defences cannot be overstated in proportionate blame systems. The judicial proceedings provide for the representation of the individuals concerned in an incident. This contrasts strongly with many Western systems in which the causal findings of an investigation can be reported, often without any detailed supporting analysis, and with only minimal participation of those involved in an incident. The Swedish system focuses less on issues of individual representation and protection. In contrast, it focuses on the more detailed problems that must be addressed by any causal analysis.
The Swedish guidelines make the important distinction between direct and indirect causes. This resembles the distinction that we have made between latent and catalytic failures. Their high-level guidelines also stress the need to eliminate certain hypotheses in order to identify potential causes. Previous sections have argued that it is important that readers can understand why particular hypotheses have been eliminated if they are to have confidence in the findings of any causal analysis [199, 198]. The following paragraphs analyse a case study report that follows the Swedish Board of Accident Investigation's guidelines. This can be contrasted with the case studies from the Japanese reports to the IMO, although it should be remembered that those are summary descriptions whereas the Swedish report provides a more detailed analysis. The incident in question resulted in a collision between the Swedish vessel, MT Tarnsjo, and the Russian vessel, MV Amur-2524 [767]. The Swedish report follows the formatting conventions that have been described in previous sections. A summary section is presented which includes an initial set of recommendations. This is then followed by factual information. The `Analysis' chapter is then presented before the conclusions and a re-statement of the recommendations. Our focus here is on the presentation of the analysis section. This begins with an analysis of the place and the time of the accident. There were "concurring statements
from the pilots and the crews, the accident took place approximately 100m west or west north west of Stromskars northern point" [767]. There was, however, some disagreement about whether the collision occurred at 18:28 or 18:30. This reinforces the point that reports do not simply focus on the causal analysis of an incident. They often also include an assessment or interpretation of the evidence upon which such an analysis is grounded.

The Tarnsjo report continues with an analysis of the speed at which each vessel approached the point of collision. This again illustrates the way in which the investigators analyse the various accounts of the incident to make inferences that will eventually support their causal findings: "The accounts of the course of events show that the Tarnsjo passed Toppvik about the same time that Amur-2524 passed Nybyholm. The distance between Toppvik and the collision site is approximately 2.8 M. Hence the Tarnsjo sailed about twice as far as Amur-2524. The Tarnsjo's average speed on the stretch becomes 10.5 or 9.4 knots... Toward the end of the stretch, her speed was reduced by the reversing and was approximately 2 knots at the collision. This means that her speed during the first part of the stretch must have been above the average. Taking the lower average speed of 7.6 knots and assuming the full astern manoeuvre was somewhat hampered by the ice and therefore began 0.3 M before the collision site, then the Tarnsjos speed before reversing becomes 10 knots. If one assumes instead that the reversing was fully effective as in the emergency stopping trial, then the same calculation gives a speed just before the reversing of close to 10 knots. Against this background the Board considers that Tarnsjos speed when entering the yaw off Tedaro light must have been considerably greater than the 6-7 knots stated by the pilot." [767]

As mentioned, the analysis of the evidence is used to support particular findings about the course of the incident. These, in turn, support any subsequent causal analysis. In this case, the Swedish board found nothing to show that the reversing procedure was less effective than what the stopping trial had revealed. From this they concluded that the reversing procedure was probably initiated too late and that this, combined with the high speed of the Tarnsjo, prevented the crew from stopping the vessel before the collision. Many investigators might have concluded their analysis with the finding that one of the crews had failed to apply the full astern manoeuvre in time to avoid the collision. In contrast, the Swedish report follows the distinction between direct and indirect causes made in the guidelines cited above. The investigators go on to analyse the communications that took place between the vessels prior to the incident. The analysis section explains that the pilots of the two vessels had met on at least two occasions prior to the incident, during which they discussed the passing manoeuvre. The report observed that "from then on they did not communicate nor give any information regarding positions or speeds; neither did they agree upon how they would handle the meeting, in other words, at what speed they would meet or if one vessel should heave to" [767]. The analysis then focuses on the role played by the officers of the watch in the interval immediately before the incident. As before, the investigators use the evidence that was identified during the reconstruction of the incident to support particular findings about the crews' behaviour.
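The arithmetic behind the Board's speed estimate is simple dead reckoning, and it is worth making explicit because the disputed collision time propagates directly into the derived speed. The 2.8 M leg is taken from the report; the 18:12 time of passing Toppvik used below is an assumed value, chosen only to illustrate how the two disputed collision times yield two different average speeds:

\[
v \;=\; \frac{d}{\Delta t}, \qquad
v_{18:28} \;=\; \frac{2.8\ \mathrm{M}}{(16/60)\ \mathrm{h}} \;=\; 10.5\ \text{knots}, \qquad
v_{18:30} \;=\; \frac{2.8\ \mathrm{M}}{(18/60)\ \mathrm{h}} \;\approx\; 9.3\ \text{knots}.
\]

These values are close to the 10.5 and 9.4 knots quoted in the report; the small discrepancy suggests that the Board worked from a slightly different fix or distance. Either figure lies well above the 6-7 knots stated by the pilot, which is why the inconsistency over the exact collision time could be left unresolved without weakening the finding.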
In this instance, it is argued that the Tarnsjo's chief officer and second officer "did not participate in the navigation nor did they follow what was happening other than temporarily during the passage through the Hjulsta bends" [767]. In consequence, they were both unsure about the vessel's speed before the collision. Neither could recall when and where the reversing procedure was started. The report describes a similar situation on the Amur-2524. The helmsmen and the master did not participate in the navigation once responsibility for this had been handed to the pilot. After having analysed the evidence that was obtained for the events immediately before the incident, the analysis section then goes on to consider more distal factors. It prefaces this by summarising the outcome of the investigation, which draws upon the analysis of the evidence that was presented in previous paragraphs: "The immediate cause of the collision was that the meeting was poorly planned. This in turn was mainly due to faulty communication between the pilots regarding how and where the meeting would take place. The Board however is obliged to note a circumstance that it has had reason to mention in several previous investigations; that is, the
tendency that vessels with a pilot on board are conned by the pilot alone without use of the other resources available on the bridge. The accident could have had far worse consequences had the Tarnsjo's stem collided with the Amur-2524 a little further aft and also penetrated the athwartships bulkhead to hold nr 3. If this had happened Amur-2524 would probably have capsized and sunk." [767]

This quotation not only illustrates the way in which the Swedish report gradually builds up a causal analysis from the investigators' interpretation of the available evidence. It also makes explicit the investigators' assessment of the worst plausible consequences of an incident. Chapters 5, 7 and 10 have all emphasised that such estimates must inform the allocation of resources to an investigation as well as the implementation of any recommendations. The Tarnsjo report also illustrates good practice in the way in which the investigators justify their assessment of this `worst plausible outcome'. Such justifications are particularly important if readers are to be convinced by this subjunctive form of reasoning. The report describes how the investigators' finding was reached without access to the Amur-2524's stability data under the relevant load conditions. In spite of this they argue that `if the timber had not entirely filled the hold but the logs, instead, had been able to move around or shift while floating, the vessels metacentre height could have been decreased by half a metre through the effect of the free water surface alone... (this) could have constituted a serious risk to the vessels stability' [767]. The stability correction that underpins such estimates is sketched at the close of this discussion.

A further strength of the Swedish report is that the analysis extends beyond the immediate and the distal causes of the incident to examine the response to the collision. This analysis extends the initial reconstruction because it does not simply discuss what was done, it also considers what was not done in the aftermath of the incident. The Sodertalje Traffic Information Centre was informed immediately but the Marine Rescue Control Centre was not informed until approximately ninety minutes after the collision. Similarly, the analysis goes on to argue that the crew of the Tarnsjo should have remained on-site to assist the Amur-2524 after they had determined that their own vessel had sustained only minor damage. Instead of illuminating the area and helping the Amur-2524 to break the surrounding ice, the "Tarnsjo's master chose to leave the site right after the accident and before the Amur-2524 began her attempt to reach shallow water" [767]. The Amur-2524's crew had to abandon this plan as it became clear that the vessel was more seriously damaged than had first been appreciated. The investigators acknowledge the inherent difficulty of performing an accurate damage assessment in the aftermath of a collision. The investigators use a form of counterfactual reasoning to argue that the Tarnsjo's haste contributed to Amur-2524's predicament: "The Tarnsjo or the tug could have escorted the Amur-2524 back to Hjulsta bridge where they could have obtained help to pump out the bilge and investigate the damage. Even if the vessel had been assisted by the tug during the journey, the Board estimates it would have been more difficult and risky to transfer the crew to the tug if the vessel had started to list or even capsized. The Board considers it remarkable that the Tarnsjo's master chose to proceed immediately following the accident without ensuring that the Amur-2524 was out of danger.
Evidently the vessel left the site without having fully ascertained the extent of the damage. He was aware that the Amur-2524 was going to try to reach shallow water, but did not wait to see that she accomplished this." [767]

This counterfactual argument illustrates the importance of looking beyond an initial adverse event to examine the response to an incident. In this case, the consequences for the Amur-2524 could have been far more serious than they actually were. It is interesting to note, however, that this line of analysis is not represented in the conclusions that are drawn by the report. The investigators argue simply that the immediate cause of the collision was that the meeting of the two vessels was "poorly planned" [767]. Contributory factors included insufficient communication between the pilots and ineffective use of the resources available on the bridge. It is also interesting to note the recommendations that were made on the basis of the extensive analysis that was performed by the investigators. The proposed interventions were summarised by a single word: None. This raises a number of questions. In particular, the lack of any recommendations suggests that similar incidents might recur in the future.
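The Board's half-metre estimate for the loss of metacentric height, quoted earlier, is consistent with the standard free-surface correction from basic ship stability theory. The following statement of that correction is included here only as background; it is not a reconstruction of the Board's actual calculation, for which the report provides no figures:

\[
\delta GM \;=\; \frac{\rho_l \, i}{\rho \, \nabla}
\]

where $i$ is the transverse second moment of area of the free surface in the hold, $\rho_l$ the density of the liquid on that surface, $\rho$ the density of the water in which the vessel floats, and $\nabla$ the vessel's displaced volume. A long, wide hold makes $i$ large, which is why a floating, shifting cargo of logs could plausibly erode half a metre of stability margin even without any other damage.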
Presentation Issues for Causal Analysis

The previous paragraphs in this section have used case studies drawn from the Japanese Marine Accident Inquiry Agency and the Swedish Board of Accident Investigation to illustrate a number of key points about the presentation of any analysis in an incident report. It has been argued that this analysis should consider not only the proximal and distal causes of an incident. The report must also present the findings of any analysis of mitigating actions in the aftermath of an incident. Similarly, it has been argued that incident reports should document the evidence and the analytical processes that support particular findings so that readers can understand why particular conclusions were reached. The following enumeration builds on these observations and presents a number of recommendations for the presentation of analytical material in incident reports; a brief sketch at the end of the enumeration illustrates how several of these recommendations might be supported in practice. It is important to emphasise that these guidelines are not appropriate for all systems. For example, the informal nature and limited resources of local systems can impose tight constraints both on the analysis that is performed and on the documentation of that analysis. Similarly, it can be difficult to summarise all of the procedures that might be used in order to support the findings of larger-scale investigations. Equally, however, it remains the case that many incident reports present findings that cannot easily be justified from the evidence that is presented in a report [749]. This can lead to skepticism about the benefits of participation and can encourage the creation of `war stories' that provide alternative accounts of events that were not revealed in the official report:

1. describe the process used to derive the findings. It is important that incident reports describe the steps that were taken to identify particular causes. For example, the IMO request information about the `form' of any investigation that is conducted for more serious classes of incident. The Japanese Marine Accident Inquiry Agency describe how "the Judges make the announcement of the judgement in the court after argument" [393]. Similarly, it is important that the readers of an incident report should be able to identify the validation activities that support particular causal findings. For instance, the Swedish Board of Accident Investigation ensures that `interested parties' have the right to follow an investigation. They can also request "further investigative measures that they deem necessary" [768]. This does not go as far as the right of reply that is granted under sub-regulations 16(3) and 16(4) of the Australian Navigation (Marine casualty) regulations. These provide `interested parties' with the opportunity to have their responses published together with the findings of the investigators' analysis. The key point here is that the investigatory process should be apparent to the readers of an incident report. Unless people are provided with this information, they cannot be assured that potential biases have not unduly influenced the results of an investigation. The suspicion that findings may be based upon an isolated subjective opinion can be sustained even when investigators have gone to great lengths to ensure the veracity of their findings [199].

2. document the process used to produce the findings. For many incidents, it may be sufficient simply to outline the processes that were used to identify and validate particular findings.
In other situations, however, it is also important that the readers of a report can assess the analytical procedures that were employed. This is important if, for instance, the possible consequences of a future recurrence are judged to be particularly severe. Similarly, it is important that readers can critically examine the conduct of any analysis if the proposed remedies impose considerable burdens upon the recipients of a report. Very few organisations that recommend techniques, such as the use of Fault Tree diagrams or ECF analysis, ever publish the documents that are intended to support their findings. This is worrying because it can be difficult to avoid making mistakes with these and similar techniques [529, 424]. In consequence, the readers of a report simply have to trust that investigators and the other members of investigation boards have correctly applied those methods that have been adopted. The role of the outside expert is a particular concern in this respect. Too often, investigation agencies have relied upon the advice of individuals as if their reputation and expertise were sufficient validation of their insights. This can be particularly frustrating for other readers in the same area of expertise who must struggle to interpret their
findings during subsequent design and redevelopment. Several years ago I wrote a paper entitled `Why Human Error Analysis Fails to Support Systems Development' [408]. This summarised the problems that I felt when the human factors analysis in an incident report used technical terms, such as `crew resource management' or `high workload', without showing any evidence or reasoning to support these particular findings. This created enormous problems for the safety managers and operational staff who then had to take concrete measures to prevent such failures from recurring in the future. It is insufficient simply to state that these problems occurred without documenting the detailed reasons why such a diagnosis is appropriate.

3. link evidence to arguments. A particular strength of the Swedish Board of Accident Investigation's approach is that the analysis of an incident is directly linked to an assessment of the evidence that is presented within any reconstruction. The Tarnsjo case study illustrates the way in which the investigation board is content to leave potential inconsistencies, for instance over the time of the collision, where they do not affect its findings. In other situations, the Board actively rejects eye-witness statements that are contradicted by other forms of evidence. For instance, observations about the progress of the Tarnsjo are used to infer the speed of the vessel. This is then used to support the main findings of the report that contradict the pilot's statements. In other situations, the Board clearly states where it could not obtain the evidence that might be necessary to directly support its findings. For instance, the investigators argue that the consequences for the MV Amur-2524 could easily have been far worse even though they could not obtain a detailed stability analysis for the vessel. Such transparency is rarely found. Most incident reports simply describe the results of any causal analysis. This leaves the reader to piece together the evidence that supports those findings from the previous reconstruction sections of the report. This creates many problems. Firstly, it increases the burdens on the reader who must also assimilate the complex mass of contextual information that increasingly characterises incident reporting involving high-technology applications. Secondly, there is a danger that readers will fail to identify the evidence that supports the investigators' findings. In pathological cases, this evidence may not even have been included within the report [426]. Finally, there is a danger that readers may infer causal relationships that were not intended by incident investigators. In such circumstances it is likely that particular items of evidence, such as witness testimonies, will be given undue significance. This becomes important if that evidence is subsequently challenged or is stronger than the evidence that the investigators used in deriving a particular finding.

4. justify the apportionment of blame. As we have seen, most reporting systems employ a proportionate approach in which deliberate violations and illegal acts are handled differently from human error. It can, however, be difficult to distinguish violations, slips, lapses or mistakes from observations of an incident. Such evidence rarely yields insights into operator intention. Investigators must, therefore, explain why they interpreted observations in a particular manner if readers are to understand the basis for their findings.
In other systems, such as the Japanese maritime case study, incident reports result in punitive actions. The importance of justifying those findings is correspondingly greater. This explains the use of a judicial investigation process. Even in `no blame' systems there is still an obligation to explain why the analysis of an incident focuses on certain causes. For instance, it is important to describe why systems failure might have been considered more significant than operator involvement. `No-blame' systems also assume additional burdens. By shifting attention away from system operators, they hope to identify the contextual and environmental factors that contribute to incidents and accidents. They must, therefore, look to the higher-level organisational and managerial influences that expose systems and their operators to such `adverse' conditions. Too often there is a tendency to describe environmental and contextual factors as part of a more general `safety culture' without ever addressing the detailed reasons why such a culture is permitted to exist within a particular organisation. Such approaches not only avoid blaming specific individuals, they also ignore
the importance of managerial responsibility for the consequences of particular incidents.

5. justify any exclusions. Not only must investigators justify their decision to focus on certain causal factors, it is also important that they document the reasons why other forms of analysis were excluded or eliminated. For instance, the NTSB routinely includes a section entitled `exclusions' within their analysis of incidents and accidents. In their maritime reports, this typically considers whether adverse weather conditions contributed to any failure. They may also document the investigators' findings that "the vessel's navigation, propulsion, and steering systems had no bearing on the cause of the fire" [606]. Subsequent paragraphs, typically, exclude inadequate training or qualification. They may also consider the potential influence of drugs or alcohol. In some cases this can lead to qualified exclusions. This can be illustrated by a recent investigation into an on-board fire. Tests indicated that the first officer had used marijuana in the weeks before the fire occurred; however, "based on witnesses descriptions of the first officers actions on the bridge during the emergency, no behavioral evidence indicated that he was impaired by drugs at the time of the fire" [606]. Such explicit exclusions help to address the alternative causal hypotheses that readers often form when analysing incident and accident reports. This is particularly important because empirical studies have shown that trained accident investigators will often sustain such theories in the face of contradictory evidence. For instance, Woodcock and Smiley describe a study involving incident investigators from the American Society of Safety Engineers. One of the fifteen participants continued to believe that drugs were a factor in an incident even though witness statements contradicted this belief [870].

6. clearly assess the `causal asymmetry' of the incident. Chapter 11 introduced the term `causal asymmetry'. This is used to describe the imbalance that exists between attempts to identify the effects of causes and attempts to identify the causes of given effects. It can be difficult or impossible to determine the precise causes of a given effect. Lack of evidence and the existence of multiple causal paths can prevent investigators from ever finding a single causal hypothesis. This is particularly true of incidents involving human intervention. As we have seen, it can be particularly difficult to identify those factors that influence particular operator behaviours. Previous chapters have described the problems that arise when attempting to identify the ignition source of maritime fires [621]. Two particular effects are at work here. Small-scale, local reporting systems often lack the investigatory resources to conduct the detailed analysis that may be required to unambiguously determine causal sequences. Large-scale national systems are, typically, only able to devote the necessary analytical resources to a small percentage of the most serious incidents that are reported. The opposite aspect of causal asymmetry is that investigators must trace a path between the cause that is identified in their findings and the observed effects that are recorded in the aftermath of an incident. These forms of analysis are, typically, constructed using the counterfactual arguments that have been described in Chapters 10 and 11. There are particular problems associated with these forms of argument.
In particular, readers often draw `incorrect' inferences about the potential consequences of any counterfactuals [124]. In practical terms, the problems of causal asymmetry in incident reports are best addressed by validation activities. These can be used to determine whether the potential readers of a report can reconstruct the causal arguments that are presented. These studies can also be used to determine whether readers are convinced by the exclusions that are intended to prune alternative causal hypotheses. These validation activities are described in later sections of this chapter.

7. present the analysis of catalytic, latent and mitigating factors. Chapters 10 and 11 have described the importance of analysing a range of causal factors. These include the triggering events that led directly to an incident. They also include the longer-term factors that contribute to a failure. We have also emphasised the importance of analysing the mitigating or exacerbating factors that can arise after an initial failure. Many investigators routinely consider these factors but then fail to document them within the resulting report. These omissions can be justified in the summary reports that support the statistical analyses
described in Chapter 15. Even there, however, it can be important for safety managers and regulators to gain a clear understanding of the broader context in which an incident occurs. Too often, incident databases simply record the frequency of particular catalytic failures [444]. It, therefore, comes as little surprise when the overall failure rate fails to decline over time. New catalysts replace those that have been addressed by previous recommendations and the latent causes of accidents and incidents remain unresolved. Such potential pitfalls can be avoided if information about latent and mitigating factors is propagated into the reports that document particular investigations.

8. examine the adequacy of interim recommendations. The analysis in an incident report may also consider whether or not any interim recommendations are sufficient to address those causes that are identified by a more detailed investigation. It might be argued that this form of analysis should be considered together with the recommendations in an incident report. The NTSB follows the practice adopted by a number of similar organisations when it includes this form of assessment in the analytical sections of its incident reports. Investigators categorise each interim recommendation proposed by the Safety Board as either acceptable or unacceptable and as open or closed. For instance, recommendation M-98-126 was drafted following a fire on-board a passenger vessel. This required operators to `institute a program to verify on a continuing basis that the laundry ventilation systems, including ducts and plenums, remain clean and clear of any combustible material that poses a fire hazard on your vessels' [606]. 21 of the 22 recipients of the recommendation replied to indicate that they had taken measures to comply with the proposed actions. As a result, the Board classified Safety Recommendation M-98-126 as a Closed Acceptable Action.
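The guidelines above repeatedly demand traceability: from findings back to evidence (guideline 3), from exclusions to their justifications (guideline 5), and from interim recommendations to their eventual status (guideline 8). The following sketch shows one minimal way in which a reporting system might record these links. The schema and field names are hypothetical and are not drawn from any of the agencies discussed in this chapter, although the status vocabulary follows the NTSB's open/closed, acceptable/unacceptable classification:

```python
# Hypothetical schema linking findings, evidence and recommendation
# status; illustrative only, not any agency's actual data model.
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Status(Enum):
    OPEN_ACCEPTABLE = "Open - Acceptable Action"
    OPEN_UNACCEPTABLE = "Open - Unacceptable Action"
    CLOSED_ACCEPTABLE = "Closed - Acceptable Action"
    CLOSED_UNACCEPTABLE = "Closed - Unacceptable Action"

@dataclass
class Evidence:
    ident: str            # e.g. a witness statement or stopping-trial result
    description: str

@dataclass
class Finding:
    statement: str
    evidence: List[Evidence]                 # items cited in the reconstruction
    exclusions: List[str] = field(default_factory=list)  # hypotheses ruled out, with reasons

@dataclass
class Recommendation:
    ident: str            # e.g. M-98-126 in the NTSB example above
    addresses: Finding
    status: Status

def unsupported(findings: List[Finding]) -> List[Finding]:
    """Findings that cite no evidence: the failure criticised in guideline 3."""
    return [f for f in findings if not f.evidence]
```

Even so simple a structure makes the pathological case noted under guideline 3, a finding whose supporting evidence never appears in the report, mechanically detectable.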
The previous paragraphs described how organisations such as the NTSB often include an assessment of interim recommendations within the analysis sections of their incident reports. We have not, however, considered how particular recommendations should be included within these documents. The following section, therefore, extends the previous analysis to consider the problems that must be addressed when presenting readers with proposals that are intended to avoid future incidents.

13.2.3 Recommendations

Previous sections have described how executive summaries are usually followed by incident reconstructions. These precede paragraphs of analysis that then support any recommendations that might be made in the aftermath of an incident. Previous sections have also stressed the diverse nature of the publications that are produced in order to publicise information about adverse occurrences and near-miss failures. This has important consequences. In particular, Chapter 12 assumed an investigatory process in which investigators presented their recommendations in a final report to a regulatory organisation. The regulators, in turn, were then assumed to provide feedback on whether or not each recommendation might be accepted for implementation throughout an industry. As we have seen, however, incident reports fulfill a more diverse set of roles. For example, interim reports can be directly targeted at the operators and managers who must implement particular recommendations. Conversely, statistical reports may not be published in any conventional sense. Instead, they often summarise the recommended actions so that investigators can survey previous responses to similar incidents in the future. Investigators must, therefore, tailor the presentation of particular recommendations to both the audience and the intended use of the document in which they appear. This point can be illustrated by the list of proposed outputs that are to be derived from the US Coast Guard's International Maritime Information Safety System (IMISS) [834]:
Alert Bulletins. These distribute interim recommendations as soon as possible after an incident has occurred. These recommendations may simply focus on the need to increase vigilance in order to detect potential failures that, as yet, cannot be prevented by more direct measures.
For Your Information Notices. Less critical incidents can result in notices that inform personnel of potential problems. The criticality of the information contained should be clearly
distinguished from the more immediate Alert Bulletins mentioned above. Any recommendations associated with these notices should be periodically reviewed to ensure that they do not provide stop-gap solutions to the latent conditions for more serious incidents in the future.
Monthly Safety Bulletins. These documents can present collections of `lessons learned, safety messages, areas for improvement, precautions and data trends' [834]. They play an important role in informing the wider maritime community of significant safety issues. The Notices and Alerts, mentioned above, can be distributed directly to the safety managers and representatives who must oversee the implementation of particular recommendations. In contrast, Monthly Bulletins often present more general recommendations directly to individual operators in the field.
Periodic Journal and Magazine Articles and Workshops. Individual incident reports can be analysed by a `data centre'. The centre's staff may then write articles that examine the effectiveness of recommendations in terms of safety analyses and trends. These should not only be published through in-house media. They should also be submitted to relevant safety journals or conferences. This provides an important means of obtaining external peer review for many of the activities that are conducted within a reporting system.
Publicly Available Database. Databases can be established to provide the public with access to summary information about previous adverse occurrences and near-miss incidents. Increasingly, these are provided over the Internet; this approach will be discussed in the following sections and in Chapter 15. An important benefit of these systems is that they can enable `interested parties' to follow the actions that have been recommended in the aftermath of previous incidents. If those actions have been unsuccessful in preventing recurrences then this too can be inferred from these collections.
Client Work and Research Services. Some incident reporting systems recoup some of the inherent expenditure that is involved in their maintenance by offering a range of additional commercial services. These can include advanced search and retrieval tasks over their datasets. They can also include services that support other investigatory bodies. For instance, such bodies may be anxious to identify any previous recommendations that have been made in response to similar incidents in other `jurisdictions'.
International Initiatives. Incident reports can trigger actions to change international agreements. This can involve the publication of requests to amend the recommendations that have been agreed to by members from many different states. Clearly, the scope and tone of such requests are liable to be quite different from recommendations that are directed at the operators and managers within a particular industry.
Measures of Success. The data derived about the causes of incidents and the proposed recommendations can be used to derive a number of high-level measures of the success of a reporting system. The problems that can arise from this approach are described in Chapter 15. For now it is sufficient to realise that such activities can identify the need to revise previous recommendations. The publication of this data can, therefore, trigger the publication of additional Alerts and Notices.
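The IMISS list shows that each output format pairs a different audience with a different urgency and review cycle. A reporting system could encode that pairing explicitly, for instance to route a newly analysed incident to the appropriate publication channel. The sketch below is hypothetical; the channel names echo the IMISS list but the field values and triage rule are illustrative assumptions, not part of the Coast Guard proposal:

```python
# Hypothetical routing of incident outputs to audiences; the channel
# names follow the IMISS list above, the details are illustrative.
from dataclasses import dataclass

@dataclass
class OutputChannel:
    audience: str
    cadence: str    # how soon, or how often, the document is issued

CHANNELS = {
    "alert_bulletin":   OutputChannel("safety managers and representatives",
                                      "as soon as possible"),
    "fyi_notice":       OutputChannel("operational personnel",
                                      "as incidents arise, with periodic review"),
    "monthly_bulletin": OutputChannel("the wider maritime community", "monthly"),
    "public_database":  OutputChannel("interested parties and the public",
                                      "continuously updated"),
}

def route(critical: bool) -> str:
    # Crude triage: critical incidents warrant an immediate alert,
    # less critical ones are folded into routine notices.
    return "alert_bulletin" if critical else "fyi_notice"
```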
Brevity prevents a full discussion of the particular techniques that are used to present incident recommendations in each of these different forms of report. Instead, the following pages draw upon examples of the most widespread formats, for example Information Notices and Alert Bulletins. These are used to identify general features that also characterise aspects of the more diverse formats. For example, the UK MAIB's Safety Digest is typically produced only two or three times per year. These digests are, however, intended to fulfill the same role as the `monthly bulletin' described in the previous list. They present case study incidents and then draw a range of general recommendations that are intended to inform the future actions of many different operators and managers within the maritime industries. This can be illustrated by two recent incidents involving fishing vessels [516]. The first incident began when a wooden fishing vessel touched bottom. This
caused damage to the vessel's planking. Shortly after this, a bilge alarm in the fish hold went off. The crew discovered that the hold was flooding. They contacted the coastguard and headed for the nearest port. Watertight bulkheads on either side of the hold restricted the flooding. The vessel made port safely and was eventually pumped dry by a fire brigade tender. In the second incident, a bilge alarm went off during a severe gale. The crew of the trawler found that their engine room was flooding from a fractured casing of the main engine driven cooling pump. A bilge pump was started from an auxiliary power source. This incident was particularly serious because the damaged casing meant that the main engine had to be slowed in order to reduce the rate of flooding. This compromised the vessel's ability to remain clear of a lee shore. The engine had to be shut down when water reached the main engine flywheel. The valve to the sea water inlet was closed to stop any further flooding and the engineers then investigated why the bilge pumping had had so little effect. Debris had accumulated inside the bilge line valve body and this prevented it from closing. Once it had been cleaned, the bilge pumping system functioned as intended. A salvage pump from a lifeboat was also used to lower the water level in the engine room so that the main engine could be restarted.

These incidents inspired a number of recommendations. It is interesting to note, however, that the MAIB refers to these as lessons. This is an important distinction because the term `lesson' can imply the sharing of insight into a particular problem. The term `recommendation' is reserved for the proposed actions that are identified as part of a formal enquiry into particular accidents. The Safety Digest, in contrast, identifies the following findings from the two case studies described above:

1. "Both cases illustrate the benefits of bilge alarms, functioning bilge pump systems, and watertight bulkheads to limit the severity of a flooding incident.

2. The second incident shows the importance of maintaining a vessel's bilges free of rubbish so it cannot be drawn into the bilge system.

3. Valves on a bilge system must be regularly checked for correct operation." [516]

This case study illustrates the way in which the readers of periodical bulletins can be presented with insights from recent incidents. As can be seen, these typically take the form of reminders. They emphasise the importance of particular safety-related activities, such as the maintenance of bilge lines and valves. They also illustrate the potential consequences of neglecting those activities. The previous example also illustrates the way in which incident reports can be used to emphasise the importance of particular items of safety equipment. This is particularly significant for highly-competitive industries in which there might be an economic temptation to avoid the installation and maintenance of devices that are not strictly necessary in order to achieve particular production quotas.

It is difficult to capture the diversity of recommendations that appear in incident reports. It can also be difficult to anticipate the particular focus that investigators will take when making particular proposals. Both of these points can be illustrated by the conclusions of a report by the Australian Maritime Safety Authority into livestock mortalities on board the MV Temburong [42]. This incident occurred when the deck generator supplying power to a livestock ventilation system failed.
The crew identified that the generator's diesel fuel supply had been contaminated with a heavier grade of intermediate fuel oil. The power for the ventilation system was then transferred to the engine room generators. These subsequently failed. The consequence was a complete loss of electrical power, including the shutdown of the ventilation system to the livestock spaces. This second failure was traced to water contamination of the engine room generators' diesel fuel supply. The report focuses on the adverse outcome of the incident; over 800 cattle died. Less attention is paid to the potential dangers posed to the crew and other mariners by the loss of power to the vessel. This focus is entirely justifiable given that the investigators associated a relatively low risk with the second, total loss of power. This assessment is, however, never made explicit within the report. The Australian Maritime Safety Authority do not present particular recommendations in this report. Instead, they enumerate a number of conclusions. This approach is instructive because it typifies incident reporting systems that separate the determination of what happened from the recommendation of potential interventions:
"14.1 The failure of both the primary and secondary sources of power was due to contamination of the ship's fuel supplies: the contamination of the fuel for the primary power source was water and the contamination in the case of the secondary power source was heavy fuel oil.
14.2 The situation was compounded by the failure of the ship's communications systems during the incident as a result of an unexplained battery failure.
14.3 The serious nature of the incident was exacerbated by inadequacies on the part of the ship's personnel in their immediate response to the incident and the absence of contemporary ship management practices. ...
14.6 On departure, the operation and configuration of the ship's generating equipment was in accord with the prescribed regulatory requirements. However, the manner of operation has raised the issue of whether the Marine Order requires redrafting so as to ensure that the meaning of a secondary power source is more clearly defined.
14.7 Whilst the fuel contamination situation was avoidable, the incident has nevertheless raised the general issue of the ability of livestock vessels to recover from a dead-ship situation. Whilst the Temburong satisfied class and flag rules in this regard, the system was shown to be particularly vulnerable when the well-being of the livestock is considered.
14.8 Following the restoration of power, the crew performed well in a most difficult situation to remove the dead cattle from the vessel." [42]

As can be seen, the findings from this investigation address a broad range of system failures. They consider problems in the crew's management; they address the failure of specific subsystems; they also consider regulatory failure and potential problems in the interpretation of Marine Orders. The previous quotation does not, however, present particular recommendations. These are left implicit within the findings that are derived from the causal analysis. Sentences such as `whilst the Temburong satisfied class and flag rules in this regard, the system was shown to be particularly vulnerable' imply that the rules or their interpretation may be inadequate to prevent future incidents. Similarly, the report questions `whether the Marine Order requires redrafting so as to ensure that the meaning of a secondary power source is more clearly defined'. This arguably implies that a recommendation be made to examine the sufficiency of the relevant Marine Order. Previous paragraphs have argued that this approach typifies incident reporting systems in which the identification of what happened is separated from the recommendation of remedial actions. It represents an extreme example of the heuristic, introduced in Chapter 12, that investigators should leave domain experts to determine how to avoid future failures. It can, however, lead to a number of potential pitfalls. In particular, it is essential to ensure that the organisations and individuals who subsequently identify potential recommendations are independent of those who are implicated in a report. Fortunately, the investigators in our case study avoid this pitfall by passing their report directly to the Australian Maritime Safety Authority who, in turn, are responsible for identifying means of avoiding similar incidents in the future [43]. The use of language in an incident report can reflect the effect that particular recommendations are intended to have upon their audience. The previous recommendations illustrate the focused use of particular recommendations to reinforce existing safety information.
They do not require significant additional expenditure, nor do they imply a major safety initiative on behalf of the industry involved. In contrast, very different language is used by the US Coast Guard when a Quality Action Team presents its recommendations into towing vessel incidents on American waterways [828]. They argued that the incidence of fatalities and injuries can be reduced by a program including prevention measures, the collection and dissemination of lessons learned, improved investigation and
data collection techniques, and the regular assessment of towing industry performance over time using a fatality rate model. In order to interpret the language used in these recommendations, it is necessary to understand the nature of the relationship between the Quality Action Team and the intended recipients of their proposals. The team were anxious to preserve a cordial working relationship between the Coast Guard and the American Waterways Operators Safety Partnership. In consequence, they were anxious to stress `non-regulatory solutions'. Under prevention, they argued that companies should implement a fall overboard prevention program. In particular, they should:

1. "Formulate and implement fall prevention work procedures consistent with vessel mission, crew complement, and geographic area of operation. Procedures should emphasise teamwork, communication, and safe work practices.
2. Ensure that all crewmembers receive initial and recurrent training in such procedures.
3. Assign responsibility for ensuring compliance with procedures to on-board supervision.
4. Enforce fall overboard prevention policies and consider the use of counseling, recurrent training, and discipline in the enforcement process.
5. Investigate all fall overboard incidents to determine what happened and how such incidents could be prevented in the future.
6. Inform all employees of fall overboard incidents and use lessons learned as part of a recurrent training program.
7. Modify fall prevention procedures as necessary based on investigation of fall overboard incidents." [828]

The detailed nature of these requirements arguably illustrates the way in which the Quality Action Team sought to ensure that the recipients of their report clearly understood the proposed remedies. This contrasts with the previous recommendations from the UK MAIB report, which were far less detailed in encouraging companies to follow what can be regarded as existing practices. The previous requirements can be contrasted with further recommendations that were made by the Coast Guard. These acknowledged the diverse circumstances that characterise towing operations. They, therefore, encouraged companies to select `best practices' from a list that was developed during the drafting of the report. This illustrates the cooperative, `non-regulatory' tone of the report. There is an interesting recursive element to the Coast Guard report. Its recommendations include the establishment of an incident reporting system that would be used to `publicise the findings and recommendations of the Towing Vessel Crew Fatalities Quality Action Team through the American Waterways Operators Letter, sector committee meetings, regional meetings, and the Interregion Safety Committee' [828]. The recommendations were also applied reflexively to members of the Quality Action Team's own parent organisation. They argued that the Coast Guard should publicise the recommendations through the Marine Safety Newsletter and `Industry Days'. Brevity prevents a more sustained analysis of a detailed and exhaustive set of recommendations to a complex problem. It is worth noting in passing, however, that this document, like many other products of incident reporting systems, breaks many of the guidelines that were presented in Chapter 12. For example, we have described the US Air Force's injunction not to recommend further studies [794]. Instead, investigatory and regulatory bodies should propose actions even if they are contingent upon on-going studies.
In contrast, the Coast Guard's report identifies numerous areas for further analysis. American Waterways Operators and the Coast Guard are urged to find out more about the factors influencing survivability following a fall overboard incident. They also recommend that the Quality Action Team's report be used to guide a more detailed analysis of fatalities in other segments of the barge and towing industry. Finally, they describe how further studies must be conducted to determine a means of measuring the impact that fall prevention programs, policies, and procedures have upon incident rates. The differences between this Coast Guard report and the heuristics that guide investigatory practice in the US Air Force can, in part, be explained by the
differences in the nature of the recommendations that they propose. These differences, in turn, stem from the diverse nature of the reports that are being produced. The Air Force procedures refer to the recommendations that are produced after specific incidents and accidents. Further studies imply a delay that may jeopardise the safety of existing operations. Hence, the focus is on providing a rapid and effective response to particular failures. In contrast, the Coast Guard report presents recommendations that are themselves the product of a study into a collection of incident reports. Given the complex nature of these occurrences, and previous problems in reducing the rate of towing incidents, it is hardly surprising that some of the recommendations should focus on the need for further analysis. As mentioned, the form and tone in which recommendations are presented can vary depending on the intended audience and the publication in which they appear. This can be illustrated by the recommendations from a more detailed investigation by the Transportation Safety Board of Canada into a `bottom contact' incident involving a bulk carrier [785]. This report cannot easily be categorised in terms of the previous enumeration of document types because it does not directly introduce any new recommendations. It does, however, validate the interim measures that were proposed in collaboration with the Canadian Coast Guard. It is important to emphasise, however, that the lack of any new recommendations does not imply that the report plays no role in the presentation of proposed interventions. On the contrary, it provides an important means of disseminating information about those actions that have already been initiated by other organisations. The main cause of the incident was identified as the bridge crew's lack of information about a `shallow spot' in the navigable channel of the Fraser River. This was attributed to inadequacies in the current system of monitoring navigable channels and producing depth information for vessels in that area. As mentioned, rather than making specific recommendations, the investigators described the `safety actions' that had been taken. The report concludes the Transportation Safety Board's investigation and, therefore, the reader can assume that these actions are considered to be sufficient to ensure that similar incidents will not occur in the future: "The Canadian Coast Guard advised that a formal Working Committee, with representatives from the Fraser River Pilots Association, Fraser River Port Authority and Coast Guard has been established and will be meeting quarterly to review channel conditions and the status of the channel monitoring and maintenance dredging program. A possibility of modelling the sedimentation process to determine various rates of in-fill associated with forecasted river flow/discharge will be explored." [785] It is clear from this quotation that these safety actions have a relatively limited geographical scope. They focus on the activities of the Fraser River Pilots Association and the Fraser River Port Authority. The local nature of such proposed interventions enables investigators to defer much of the detail about any recommendation. In contrast, previous paragraphs have described how incident investigations can motivate far more general proposals. These recommendations go beyond working practices in a particular geographical area to address industry-wide operating procedures throughout the globe.
The presentation and form of such recommendations can be very different from those in the interim and final reports that are generated after many incidents. This can be illustrated by the NTSB's recommendations following a series of fires on board passenger ships. The Board issued detailed proposals to the International Council of Cruise Lines and to cruise line companies regarding fires on board passenger ships. As might be expected, such an action was not taken without considerable resources being allocated to analysing the causes of many previous incidents. Although the letter was drafted in July 2000, the initial incidents that prompted this intervention occurred on board the Universe Explorer in 1996 and the Vistafjord in 1997 [613]. The NTSB issued safety recommendations to the U.S. Coast Guard in April 1997 that automatic smoke alarms should be installed to protect both crew berthing and passenger accommodation areas. These recommendations were, in turn, forwarded by the Coast Guard to the International Council of Cruise Lines and the International Chamber of Shipping. The NTSB's proposals were opposed on the grounds that such systems might generate false alarms and could create crowd control problems. Three further incidents in 1998, 1999 and 2000 prompted the Board to reiterate their recommendations. They requested that cruise line companies `without delay install automatic local-sounding smoke alarms
in crew accommodation areas on company passenger ships so that crews will receive immediate warning of the presence of smoke and will have the maximum available escape time during a fire' [613]. The same recommendation was also made for the installation, `without delay', of the same devices in passenger accommodation areas on company passenger ships. The International Council of Cruise Lines were requested to withdraw their `opposition to the amendment of the Safety of Life at Sea Convention chapter II-2 to require automatic local-sounding smoke alarms in crew accommodation spaces on board passenger ships and support a full discussion of the technical issues and any further U.S. Coast Guard actions on this matter before the IMO'. The same request was reiterated for their opposition to smoke alarms in passenger accommodation spaces. The recommendations concluded with an appeal that the Council should support a `full discussion of the technical issues involved and any further U.S. Coast Guard actions on this matter before the IMO' [613]. It is important not to overlook the colour or tone of the language used in this document. The note of exasperation or frustration reflects the Board's concern over this issue. Only time will tell whether this choice of language was an appropriate means of ensuring international agreement over these particular recommendations. The analytical procedures that were described in the previous section have a strong influence both on the recommendations that are derived from an investigation and on the manner in which any proposed interventions are described in a subsequent report. For instance, the Swedish Board of Accident Investigation's `no blame' approach is deliberately intended to help them issue recommendations as soon as possible after an incident has occurred: "...with the ongoing technical development it is also necessary to quickly get knowledge of shortcomings in an operation that can cause an accident or contribute to one" [768]. In more judicial systems, the focus is on identifying recommendations in a more deliberate fashion, given that the findings can have profound implications for the livelihoods of the individuals who are affected [393]. The majority of incident reporting systems lie between these two extremes. Proportionate blame approaches, such as that adopted by the US NTSB, will often issue interim recommendations that are then supplemented by any subsequent findings from a secondary investigation. The previous section has also described how reports assess these initial recommendations in terms of whether or not they were acceptable and, if so, whether their implementation can be declared closed. This two-stage process creates considerable problems for those who must publish and disseminate information about potential recommendations. They must first ensure that all of the intended recipients receive copies of initial advisories. It is important that these individuals and organisations both understand the intended response to such recommendations and have adequate resources to satisfy particular requirements. Regulators must then, typically, ensure that the same recipients of the initial advisories are provided with the updated recommendations that may be made in any particular incident report. Of course, these may also `countermand' or replace recommendations that were made as a result of incident reports that were issued many months or years before the present investigation.
Such dissemination activities require considerable logistical support. Investigatory organisations have, therefore, developed a range of databases to monitor and support the dissemination of incident recommendations. For example, Chapter 12 described some of the applications that support US Army reporting systems. Other organisations have developed bulletin boards that help the intended recipients of a recommendation to monitor amendments and revisions. Such resources are important when logistical barriers prevent regulators and investigators from guaranteeing the delivery of proposed interventions. For example, it can be difficult for the crews of many different merchant vessels to continually monitor the findings of incident reports that cover all of their possible ports of call. Later sections will focus on these dissemination problems in more detail. For now it is sufficient to stress that many of these systems fulfill a dual purpose. Not only do they help monitor the distribution and receipt of particular recommendations, they can also be used to monitor their implementation. These databases, typically, restrict access to such information. In contrast, the New Zealand Transport Accident Investigation Commission maintains a World Wide Web list of information about marine safety recommendations [630]. They recognise that "not all safety recommendations and responses are published in the Commission's Occurrence Reports because at the time of printing some recommendations may have been incomplete or some responses may not have been available".
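A tracking database of this kind need only hold a few fields to serve both purposes. The following Python fragment is a minimal sketch; the field names, status values and reference format are illustrative assumptions, not the schema of the Commission's, or any other, actual system:

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List

    class Status(Enum):
        """Lifecycle of a recommendation, from issue to closure."""
        ISSUED = "issued"            # advisory sent to recipients
        ACCEPTED = "accepted"        # recipient has agreed to act
        REJECTED = "rejected"        # recipient objected; reasons recorded
        IMPLEMENTED = "implemented"  # conformance evidence received
        SUPERSEDED = "superseded"    # replaced by a later recommendation

    @dataclass
    class Recommendation:
        """One tracked recommendation from an incident report."""
        ref: str                     # e.g. "00-211/1", report and item number (illustrative)
        report: str                  # source report identifier
        recipients: List[str]        # organisations that must respond
        status: Status = Status.ISSUED
        responses: List[str] = field(default_factory=list)

    def outstanding(recs: List[Recommendation]) -> List[Recommendation]:
        """Recommendations still awaiting acceptance or implementation."""
        open_states = {Status.ISSUED, Status.ACCEPTED}
        return [r for r in recs if r.status in open_states]

A record of this form supports both halves of the dual purpose mentioned above: the recipients list documents distribution, while the status and response fields track implementation.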
The bulletin board is accessed via a list of previous incidents. For example, the user would select a link labelled `Report 00-211, harbour tug Waka Kume, loss of control, Auckland Harbour, 19 November 2000' in order to view any revisions to the recommendations that were made in that particular report. An alternative approach is to publish revised recommendations in terms of particular safety-related themes, such as transfers between ships and helicopters. This approach is similar to that adopted by the Australian Maritime Safety Authority [43]. Assuming that the reader had selected the New Zealand link, cited above, they would be presented with information about the recommendations that were made in the initial report. The Commission recommended to the manager of marine services for the Ports of Auckland Limited that he revise their tug operator training manual to include detailed information about engine and control system failures. They also requested that he introduce regular, documented peer reviews to ensure that all tug operators complied with the relevant safety regulations. These reviews should also assess the training that the Ports of Auckland Limited provides for its skippers. They should be performed by independent experts in the operation of similar tugs. The New Zealand Transport Accident Investigation Commission's recommendation summary goes on to describe the manager's response to these proposals. This response illustrates the way in which bulletin boards can be used to inform `interested parties' that particular recommendations have been accepted and are being implemented. The manager confirmed that:

1. "Updates to the tug operator training manuals shall be carried out as follows: engine control system failures, this section shall be expanded; response to engine control system failures, this section shall be expanded; handling the tug with one azimuthing unit, this task shall be given greater emphasis; introduce a system of peer review, this is being done.
2. The requirement to review Ports of Auckland Ltd training skippers by independent experts is not practical. The trainer we originally used was from Canada; no other trainer exists in NZ. Many of our staff have gained good skill levels with these vessels and will be adequate for use in peer reviews." [629]

Unfortunately, further problems complicate the presentation of those recommendations that are made in the aftermath of an adverse occurrence or near-miss incident. In particular, incident reporting systems must not simply present the findings that are derived from isolated near-miss incidents or adverse occurrences. They must also consider interventions that address common features amongst a number of previous failures. The problems of presenting these findings from multiple incidents are assessed in later sections. In contrast, the following paragraphs focus more narrowly on guidelines for the presentation of recommendations that relate to single incidents.
Presentation Issues for Recommendations

As we have seen, previous incident reports have been weakened by factual omissions. This can prevent readers from gaining a clear impression of the events leading to an adverse occurrence or near-miss incident. Similarly, if parts of an analysis are omitted then it can be difficult to follow the conclusions that are derived from an incident investigation. The level of quality control that can be observed by reading large numbers of incident reports is, arguably, higher in the preliminary sections of these documents than in the concluding paragraphs that, typically, list any potential recommendations. One means of validating this claim is to look at the litigation that follows from many incidents and accidents. These proceedings, typically, focus on the proposed interventions in the aftermath of an adverse occurrence. Far fewer proceedings are initiated to question the evidence that is put forward in an incident report. Such differences can also be explained in terms of the consequences that recommendations have upon the future operation of safety-critical systems. Incident reconstructions are likely to be less contentious than any proposed interventions because they need not carry with them the costs that are associated with implementing any subsequent changes. It makes little practical difference whether one explains the focus on recommendations as being due to flaws in their presentation or to natural concerns over the cost implications of their implementation. In either case, it is clearly important that analysts devote considerable time and attention to ensuring that their findings are presented in a clear and coherent manner. As we
shall see, this can involve the publication of preliminary or interim reports to solicit comments on proposed interventions. It can also involve more restricted forms of pre-publication or consultation during which analysts validate their recommendations before disseminating them more widely. Such techniques can be used to ensure that the presentation of recommendations satisfies a number of detailed requirements:

1. Identify the recipients and draft recommendations accordingly. Chapter 12 has already described how recommendations should consider what must be achieved but not necessarily how those goals will be implemented. It has also been argued that the recipients of a recommendation must have the necessary resources to achieve particular objectives by the dates that are specified in an incident report. Unfortunately, many incident reports fail to follow these guidelines. For example, the following excerpt is taken from a report issued by the Hong Kong Marine Accident Investigation Section: "Non-compliance with the safety guidelines for the transport of motor vehicles/cycles is considered the principal cause of this accident. Had the guidelines been followed, there would have been no uncleaned fuel tanks of motorcycles containing residue of volatile hydrocarbon based substance and residual fuel in the engine assemblies, and as a result no accumulation of hydrocarbon vapour in the container." [364] http://www.info.gov.hk/mardep/dept/mai/elmi399.htm This illustrates how recommendations may often be implicit within the findings of an incident report: operating companies must follow the safety guidelines for the transport of motor vehicles. There are a number of problems with this implicit approach. Firstly, reminders to operators that they must `try harder' offer few guarantees about the future safety of an application process [409]. Secondly, the implicit nature of such recommendations can lead to a range of different interpretations of the proposed remedies. Readers might infer that the harbour authorities ought to initiate more frequent checks to ensure compliance with these regulations. Alternatively, improved training might be offered to the particular working groups that were involved in loading this vessel. In an ideal world, such alternative interpretations might encourage readers to initiate a range of interventions. In practice, however, there is a danger that any ambiguity can encourage individuals and organisations to pass on responsibility for implementing particular recommendations.

2. Consider the problems of non-compliance or opposition. The most observant readers of this work will have noticed a particular trend in the responses to many of the recommendations that have been proposed by the NTSB. In Chapter 9 we looked at the recommendations that were issued in response to a succession of gas explosions. The Office of Pipeline Safety in the Department of Transportation questioned NTSB initiatives to install Excess Flow Valves [588]. Similarly, previous sections in this chapter have described the reluctance of the International Council of Cruise Lines and the International Chamber of Shipping to introduce automatic smoke alarms into crew berthing and passenger accommodation areas [613]. One explanation for the opposition that is often provoked by NTSB recommendations is that their investigators focus primarily on the safety issues rather than the cost implications of implementing particular proposals.
As we have seen, other incident reporting systems take a more cautious approach when they `target the doable'. The reluctance to implement particular recommendations arguably provides a further illustration of the relationship that exists between corporations and federal agencies in the United States. In either case, such opposition illustrates the importance of explicitly considering what to do when the recipients of a report object to particular recommendations. As we have seen, some reporting agencies simply document the findings of an investigation without providing the reader with any idea of whether or not they were actually implemented. Any objections that block the implementation of a recommendation are then, typically, only revealed in subsequent reports that document the recurrence of similar incidents. Of course, this approach cannot be used to identify situations in which objections to a recommendation have not contributed to
subsequent incidents. This approach creates considerable problems both for incident investigators and for safety managers. It can be difficult for them to recreate the different arguments that support and oppose particular recommendations. As a reader of incident reports, it can often be frustrating to see investigatory agencies repeatedly advocate the same interventions without any explanation of why their recommendations are consistently not being implemented. Other reporting organisations have avoided these problems by publicising any dialogues that take place with `interested parties'. This can be illustrated by the way in which the IMO collated the arguments for and against particular recommendations in the aftermath of the Erika oil pollution incident off the French coast. They published responses from the IMO Maritime Safety Committee and the Marine Environment Protection Committee. These addressed recommendations and proposals made by the International Association of Classification Societies and a resolution by the European Parliament. Their views on changes to condition assessment schemes provoked a further response from the International Association of Ports and Harbours. Amendments were proposed by maritime organisations in Belgium, Brazil, the Bahamas, Germany, Greece, Japan and elsewhere. The Marine Environment Protection Committee then commented on these national submissions. Further statements were made by the International Chamber of Shipping and the International Association of Independent Tanker Owners. Such multi-national responses represent an extreme example. In most incident reports, it is possible to represent different attitudes to particular findings within a final report. This enables investigators to document the reasons why particular groups might oppose the implementation of any potential recommendations. This explicit statement of objections is very important. There have been many instances in which operating companies have denied objecting to recommendations that might have prevented incidents and accidents.

3. Define conformance and validation criteria. As with most guidelines, a number of problems must be addressed when attempting to use the items in this list to inform the presentation of recommendations derived from incident reports. For instance, we have argued that investigators should, if possible, avoid trying to specify the precise mechanisms and procedures that might be used to satisfy a particular finding. This creates problems because regulatory authorities, safety managers and operators must determine appropriate means of meeting particular recommendations. There is an obvious danger that incident investigators might conclude that any failure to prevent subsequent incidents indicates a failure to satisfactorily implement their recommendations, rather than concluding that their previous recommendations were inadequate. It is, therefore, important that if a report documents what must be done rather than how, then the report must also specify conformance criteria that can be used by operators and regulators to determine whether particular mechanisms have satisfied particular high-level objectives. Put another way, without such criteria investigators may continue to blame the inadequate implementation of previous recommendations rather than investigating whether those recommendations were adequate in the first place. Establishing conformance is non-trivial.
We might like to specify that any proposed changes will make it extremely unlikely that an incident will recur. As we have seen, however, many safety-critical incidents have an extremely low probability. If we consider an incident that occurs once in every 100,000 hours of operation, we may have to wait a considerable period of time before we can have any assurance that proposed changes have reduced this frequency. Basic probability theory suggests that even if we observe a system for 99,999 hours there is no guarantee that an incident will not occur twice in the remaining hour; a worked example is given at the end of this list of guidelines. These issues will be discussed in greater detail in Chapter 15. For now, the UK Coastguard Agency's Marine Pollution Control Unit's report into the Sea Empress incident illustrates some of the points mentioned in this paragraph [793]. The report does not focus on the causes that contributed to the grounding of the vessel and the subsequent release of 72,000 tonnes of crude oil into Milford Haven. Instead, it analyses the environmental response and makes a number of recommendations that are intended to improve the `clean-up' operation after future incidents. The Sea Empress was grounded in 1996; since that time there have been no comparable incidents that might be used to judge whether or not the recommendations have had their intended effect.
All that can be done is for the relevant agencies to test their preparedness in simulated exercises that are intended to demonstrate that they meet the recommendations outlined in the Coastguard report. The nature of these exercises can vary significantly from one local authority to another. This guideline, therefore, argues that incident reports should document acceptance criteria that can be used to determine whether or not particular mechanisms have actually satisfied a recommendation.

4. Link the recommendations to the analysis. The previous section has argued that incident reports must develop explicit links between the evidence that supports a reconstruction and the analysis that leads to certain causal findings. This is intended to reassure readers that those findings are grounded in the evidence that is obtained in the aftermath of an incident. Similarly, it is important that an incident report documents the relationship between particular recommendations and the products of a causal analysis. Such links create a clear path between the events that contributed to previous failures and the proposed interventions that are intended to prevent future recurrences. The importance of these connections can be illustrated by two recommendations from a report that was issued by the UK MAIB [515]. The incident began when the watchkeeping motorman on a roll-on, roll-off cargo vessel started to clean the top of an electrical cabinet. To gain access, he stood on top of a pipe and leant over a fuel oil booster pump. As he did this, he inadvertently activated the emergency stop button on the pump. The pump stopped and this caused a fall in the fuel pressure for the main engine. This, in turn, caused the main engine to stop, resulting in a `black out' when the generator breaker on the main switchboard tripped. The particular details of this incident are less significant in the context of this section than the recommendations that were identified. The first of these stressed that all emergency stop buttons should be fitted with bright protective covers that alert operators to their presence and function. The relationship between this recommendation and the details of the incident is relatively clear given the details that were provided in the reconstruction. In contrast, a further recommendation focussed on the "need for an active Safety Committee" even though the report tells us nothing at all about the activities of the safety committee of the vessel involved in the incident. A recommendation that "safety committees should examine all new regulations, operating requirements and the implications of new equipment to see whether any affect the risk profile" seems almost incidental to the occurrence being reported. Such proposals require further justification if readers are to be convinced that they might play a significant role in preventing future incidents [124]. Conversely, it is also important to justify a decision not to derive any recommendations from an incident. This can be illustrated by the conclusion to the Swedish Board of Accident Investigation's report into the collision between the MT Tarnsjo and the MV Amur-2524 mentioned in previous paragraphs [767]. This tersely summarises the proposed interventions as follows: "4. Recommendations. None". When investigators propose particular interventions, they must provide safety managers with a justification that will motivate them to implement any recommendations.
Conversely, if investigators decide not to propose any potential interventions then they must provide the authorities that commission the report, or which regulate the reporting system, with a justification for their decision that no recommendations need be drawn.

5. Explain why recommendations have been rejected or modified. The previous example provides an extreme case in which no recommendations were identified in the aftermath of what might have been a very serious incident. More generally, it is necessary to explain why particular recommendations were rejected, rather than justifying a decision to make no recommendations at all. This is important because readers often form particular hypotheses about potential interventions that might have avoided an incident. If investigators propose alternative remedies then there is a danger that operators and safety managers will choose to follow their own intuitions rather than the proposed alternative. As we have seen, this can have the knock-on effect of complicating any attempts to validate recommendations unless significant resources are used to assess the degree of non-compliance within an industry. The US Coast Guard avoid this problem by a relatively sophisticated process that
also helps to validate particular recommendations. The initial investigation restricts itself to an examination of the facts that are known about an incident. A subsequent section of `Findings' is then used to present the results of the analysis. A preliminary report, together with a letter containing proposed recommendations, is then sent to the Coast Guard Commandant. They review each recommendation, typically with the Commander of the district in which the incident occurred. They then draft a letter that is printed at the start of each report. Their response lists each recommendation and states whether or not they concur with the proposal. If they do not concur with the specific recommendation, they may agree with the intent and modify the proposal (a simple sketch of these possible outcomes follows this list). It is difficult to find any examples of investigators' recommendations that have been rejected. However, this style of report does present detailed arguments to explain why some recommendations are considered inappropriate. This, in turn, can help to justify the alternative recommendations that are proposed by the Commandant and his staff [831]. The Transportation Safety Board of Canada report into the Fraser River incident, mentioned in previous paragraphs [785], provides an example of an alternative approach. In this instance, rather than having an external auditor validate the recommendations made by the investigators, the investigators validate the interim actions that have been taken by local and regional officers in the aftermath of the incident. As mentioned previously, the report into this incident describes the investigation as closed without proposing any new recommendations. The investigators do not modify or revise the interim recommendations, presumably because they are considered to be sufficient to prevent any recurrence of a similar incident.

6. Distinguish between recommendations and lessons learned. This section began by listing the publications that are used to report on information submitted to the US Coast Guard's International Maritime Safety System. These range from formal alert bulletins that describe interim recommendations through to the informal information notices that summarise more general problems and which provide background information about potential hazards. This distinction between different forms of publication provides readers with important cues about the relative importance of particular recommendations. Alert bulletins typically describe particular actions that should be allocated an extremely high priority within daily working practices. In contrast, information notices may provide general advice that often simply describes `best practice' without requiring the rapid implementation of particular procedures. Such differences are apparent both in the medium of publication and in the style of prose that is used to convey these different types of information. For instance, the difference between formal alerts and information notices is intended to reflect a distinction between incident recommendations and more general `lessons learned'. The Coast Guard describe the style and format of lessons in the following terms: "Lessons learned: the information presented through the links below is mostly anecdotal and primarily intended for those who work on the many vessels that navigate our oceans and waterways" [832]. In contrast, incident recommendations follow the more structured format suggested by the previous guidelines in this section.
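Two of the guidelines above can be made more concrete. First, the probabilistic point raised under the third guideline. The following is a minimal sketch that assumes incidents arrive as a Poisson process with rate \lambda = 10^{-5} per operating hour, the once-in-100,000-hours figure used earlier; the model choice is an illustrative assumption rather than part of any report discussed here:

    % Assumption: incidents arrive as a Poisson process with rate
    % \lambda = 10^{-5} per operating hour (once in 100,000 hours).
    % The probability of observing no incident in T hours is:
    P(\text{no incident in } T \text{ hours}) = e^{-\lambda T}
    % For T = 99,999 hours this gives e^{-0.99999} \approx 0.37, so an
    % almost incident-free 100,000 hours is itself quite probable even
    % if nothing has improved. The process is memoryless, so that record
    % says nothing about the single remaining hour:
    P(k \geq 2 \text{ incidents in 1 hour}) = 1 - e^{-\lambda}(1 + \lambda)
      \approx \frac{\lambda^{2}}{2} = 5 \times 10^{-11}

The figure is vanishingly small but strictly positive: no finite period of incident-free observation can prove that a recommendation has removed the hazard, which is why the simulated exercises discussed under the third guideline are often the only practicable test. Second, the concur, concur-with-intent and do-not-concur outcomes described under the fifth guideline can be summarised in a small structure. The Python sketch below is an illustrative paraphrase of the process as described above, not the Coast Guard's actual format or tooling:

    from enum import Enum

    class Response(Enum):
        """Possible reviewer responses to a proposed recommendation."""
        CONCUR = "concur"
        CONCUR_WITH_INTENT = "concur with intent; proposal modified"
        DO_NOT_CONCUR = "do not concur; reasons documented"

    def letter_entry(recommendation: str, response: Response,
                     modified_proposal: str = "", reasons: str = "") -> dict:
        """Assemble one entry of the response letter printed at the start
        of a report: each recommendation is listed with the reviewer's
        position and, where relevant, a modified proposal or the reasons
        for rejection."""
        entry = {"recommendation": recommendation, "response": response.value}
        if response is Response.CONCUR_WITH_INTENT:
            entry["modified_proposal"] = modified_proposal
        if response is Response.DO_NOT_CONCUR:
            entry["reasons"] = reasons
        return entry

The point of the structure is that a rejection can never be silent: either a modified proposal or an explicit statement of reasons must accompany the response.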
This section has presented a number of high-level guidelines that are intended to help investigators draft reports into adverse incidents and near-miss occurrences. The focus has been on the more formal style of report that is typically produced in the aftermath of failures that carry relatively severe potential consequences. This is justified by the observation that many of these reports have been flawed by omissions and inconsistencies [426]. It is also possible to use subsets of these guidelines selectively to inform the presentation of less formal reports. In both cases, it is important that investigators consider means of validating the documents that they produce. The guidelines presented in this section are no more than heuristics or rules of thumb. They are the product of experience in generating incident reports. They may not, therefore, provide appropriate guidance for the vast range of incident reporting systems that are currently being implemented. It is, therefore, often necessary to conduct further studies to increase confidence that incident reports provide their readers with all necessary information in a format that supports the operation and maintenance of safety-critical applications.
13.3 Quality Assurance

This section presents a range of techniques that can be used to support both the validation and the verification of incident reports. Verification establishes that a document meets a certain number of technical requirements. These include the need to ensure that a report does not contain inconsistent information about the course of events leading to an incident. They also include the requirement to ensure that any recommendations do not rely upon contradictory lines of analysis. In contrast, validation techniques can be used to establish that a document actually satisfies a range of user requirements. For instance, it is important to determine whether or not the potential readers of an incident report gain a clear understanding of any proposed recommendations. Similarly, it should be possible to demonstrate that the recipients of a report will have confidence both in the reconstruction of events and in the analysis that is derived from any reconstruction. It can be argued that properties such as consistency and a lack of contradiction are basic user requirements, and hence that verification techniques form a subset of the more general validation methods. We retain a clear distinction between these different approaches by assuming that verification techniques often yield insights into a document without the direct involvement of the intended recipients. In contrast, validation techniques involve user testing and the observation of readers using incident reports to support particular activities.
13.3.1 Verification

It is possible to identify a range of properties that investigators might require of an incident report. A small subset of these can be summarised as follows:
consistency. It is important that an incident report should `get the facts correct'. This implies that the timing of events should be reported consistently throughout a document. If there is genuine uncertainty over the timing of a particular action then this should be made explicit. This form of temporal coherence should be supported by location coherence. The position of key individuals and systems should be consistently reported for any particular moment during an incident. In other words, people should not appear to be in two places at once. Similarly, technical details such as the serial number, type and operating characteristics of particular devices should not change throughout a report unless such changes are explained in the supporting prose. It might seem unnecessary to state these requirements. However, a number of previous studies have documented numerous violations of these apparently simple requirements. One of the best known instances involves an incident report in which inconsistent timings were given throughout the document because the emergency services disagreed over who reached the scene first. This disagreement was not made explicit in the document and the reader was offered no explanation as to why key events were given different times in different sections of the report [502]. A sketch of a simple mechanical check for this kind of coherence appears after this list.
lack of contradiction. Contradictions can be seen as a particular form of inconsistency. For instance, mathematical proofs of consistency can be used to identify potential contradictions. In incident reports, they occur when the same document asserts that some fact A is both true and not true. It is relatively rare to find factual contradictions within an incident report. It is more common to find that an event A is reported to occur at a range of different times than to find an assertion that A did not occur at a particular time. Contradictions are, however, more often found in the arguments that support particular findings in an incident report. They occur when the same evidence is used both to support and to weaken particular lines of analysis. As we shall see, the same events can be interpreted within the same report as evidence that operators were following standard operating procedures and yet were disregarding particular safety requirements. It is important that investigators identify such situations, not because they are in some sense `incorrect', but because they often require further explanation in order to convince readers that they do not reflect a deeper weakness within the interpretation and analysis of an incident.
limited use of rhetorical devices. Chapter 11 has described a range of biases that affect the interpretation of information about safety-critical incidents: author bias; confidence bias; hindsight bias; judgement bias; political bias; sponsor bias; professional bias; recognition bias; confirmation bias; frequency bias; recency bias; weapon bias and so on. These often stem from the external pressures that social and political groups can place upon investigators. There is a danger that these pressures can encourage individuals to support lines of reasoning that may not be directly supported by the available evidence. There is also a danger that this will result in reports that use rhetorical devices to persuade readers that alternative hypotheses should be discounted. As we shall see, it is impossible within any prose document to entirely avoid the use of rhetorical devices. It is, however, important that investigators are aware of the potential impact of these techniques. It is also possible to employ structured reading techniques to ensure that these devices are not used in a way that might mislead the intended audience about the course or causes of adverse occurrences and near-miss incidents.
structural simplicity. There are a number of structural problems that complicate the task of drafting an incident report. The standard formats mentioned in previous sections require that reconstructions and summaries of the events leading to an incident are presented before any sections that analyse the causes of an incident. Dozens of pages can, therefore, separate the presentation of evidence from the arguments that use particular facts about the course of events. Similarly, recommendations may be presented many pages after the lines of analysis that support them. Many investigators have addressed this problem by reiterating factual information within the analysis section. Analytical arguments may also be repeated before the presentation of any recommendations. This creates further problems because subsequent editing of either section can introduce the inconsistencies and contradictions mentioned above. In other reports, key facts can be omitted from a reconstruction and may only be presented before particular recommendations. Previous sections have referred to such practices as the `Perry Mason' or `Agatha Christie' approach to incident reporting. Readers cannot form a coherent picture of an incident until they have read the closing pages of the report.
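The temporal and location coherence requirements described in the consistency item above lend themselves to simple automated checks, provided the assertions in a report are first extracted into a structured form. The following Python fragment is a minimal sketch under that assumption; the event representation, the time granularity and the example data are all illustrative, not features of any existing verification tool:

    from collections import defaultdict
    from typing import List, NamedTuple

    class Event(NamedTuple):
        """One timed assertion extracted from an incident report."""
        time: int       # minutes after midnight, for simplicity
        actor: str      # person or system the assertion is about
        location: str   # where the report places that actor
        section: str    # part of the report the assertion came from

    def location_conflicts(events: List[Event]) -> List[tuple]:
        """Flag actors placed in two locations at the same moment."""
        seen = defaultdict(set)
        for e in events:
            seen[(e.actor, e.time)].add((e.location, e.section))
        return [(actor, time, places)
                for (actor, time), places in seen.items()
                if len({loc for loc, _ in places}) > 1]

    # Hypothetical example: one actor is placed in two locations at 00:12.
    report = [
        Event(12, "motorman", "engine room", "reconstruction"),
        Event(12, "motorman", "bridge", "analysis"),
        Event(15, "master", "bridge", "reconstruction"),
    ]
    for actor, time, places in location_conflicts(report):
        print(f"{actor} at t={time} appears in: {places}")

An analogous query over (actor, event) pairs with differing times would flag the kind of inconsistent timings described in the emergency services example above. The difficult step is the extraction of structured assertions from prose, not the check itself.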
This is a partial list. Investigators can identify a range of other requirements that should be satisfied by incident reports. For instance, they might wish to guarantee that sufficient warrants are provided to support the evidence that is presented in these documents. Warrants describe the backing or source that supports particular items of information. Brevity prevents a complete analysis of these diverse properties. Instead, the following pages briefly introduce a number of techniques that can be used to verify particular properties of incident reports. As mentioned, they can be distinguished from validation techniques because they can typically be performed by an analysis of the document prior to publication and they do not require direct access to the intended recipients of a particular report.
Analysis of Rhetorical Devices

Rhetorical devices, or tropes, represent common and traditional techniques of style and arrangement that can be used in prose to achieve a number of particular effects. These effects include emphasis, association, clarification and focus. They also involve the physical organisation of text through transition, disposition and arrangement. A further class of tropes deals with variety and decoration. It is difficult to write effective prose without making use of these different devices. It is also important to understand the particular effects that these devices can have upon the readers of an incident report. In particular, investigators must recognise when their use of particular rhetorical devices might have an unwanted or unwarranted impact upon their audience. There are many good textbooks that provide an overview of rhetorical techniques. Some, such as the primer by Corbett and Connors, also include progymnasmata, or classical composition exercises, that are intended to help writers use particular devices [183]. It is ironic that many of these techniques, which are traditionally intended to persuade others of your own opinion, might be abused in the context of incident reporting. They provide case studies in the use of language to create an effect that does not stem solely from the information that is contained in the sentence. Brevity prevents
a complete introduction to all of these techniques. Instead, the following list briefly summarises a selection of the most common tropes and provides examples of their use within a single incident report. These illustrations come from a report that was published by the Transportation Safety Board of Canada [784]. In this incident, the Navimar V, a pilot boat, came alongside a bulk carrier, the Navios Minerva. As she did so, she overtook a wave generated by the carrier and pitched onto its crest. The Navimar V then surged into the trough of the next wave and plunged into the sea. The submerged bow slowed the pilot boat but her momentum continued to pitch the vessel until she turned over. The incident resulted in serious injuries to one crew member and minor injuries to three more people, including two pilots who were using the accommodation ladder to board the Navimar V. The subsequent report into this incident can be used to illustrate the role that particular tropes or rhetorical devices can have upon the reader of an incident report:
Amplification
This technique involves the restatement of an idea or argument. It often also involves the introduction of additional details. For example, the report into the capsize of the Navimar V includes the following findings: "2. Under international regulations, the vessel was required to use a pilot ladder to transfer pilots. 3. Instructing Marine Communications and Traffic Services traffic regulating officers to tell foreign crews to use the accommodation ladder is a request made by the pilots but is contrary to international regulations." [784] The first finding notes that international regulations require the use of a pilot ladder. The second finding amplifies this observation by noting that the request to use an accommodation ladder is also forbidden. This technique can have the effect of drawing the reader's attention to a particular concept or idea. The amplification not only introduces new facts but also supports and reiterates the arguments that are introduced in previous sentences. This technique can create problems when the amplification of particular aspects of a previous assertion detracts from other arguments or items of information. The combined effect of these confirmatory statements can have a significant impact upon the reader of an incident report. It is, therefore, also important to provide sufficient evidence to establish the credibility of both an initial assertion and the subsequent amplification. For instance, the reader of the Navimar V report is referred to Chapter 5, Rule 17 of the 1974 International Convention for the Safety of Life at Sea (SOLAS). This stipulates that pilot ladders must be used for pilotage service.
Anaphora
This technique uses repetition at the beginning of successive phrases, clauses or sentences. It can create an impression of climax in which the repetition leads to a particularly important insight or conclusion. The Navimar V report uses this technique in the paragraphs of analysis that immediately precede the investigators' conclusions: "Because the bulk carrier was moving at a speed through the water of about 10 knots, the waves she generated were probably unusually large. During the transfer, the pilot boat was manoeuvred onto the crest of one of those waves which was just forward of the accommodation ladder platform. Because the speed of the wave was less than the speed of the pilot boat, the `NAVIMAR V' accelerated into the trough towards the next wave. Because the accommodation ladder was on the vessel's quarter, the bottom platform was not resting on the vessel's side and the pilot boat's bow had to be rested on the vessel's side to line up the after deck under the accommodation ladder." [784] This example illustrates the successive use of the phrase `Because the...' to build up a causal explanation of one aspect of the incident. It is reminiscent of the phrases that can be derived using the Why-Because Analysis (WBA) described in Chapter 10, although this technique was not used in this instance. The rhetorical device creates the impression of `building up a case'.
The investigator uses each successive sentence to `stack up' the evidence in a manner that supports the analysis. It is important to emphasise that such techniques are not of themselves either `good' or `bad'. Rhetorical devices can be used to convince us of well-justified conclusions or to support half-baked theories. It is important, however, to be sensitive to the effects that such techniques might have on the readers of an incident report. For instance, the previous citation can be interpreted as providing readers with a clear summary of the arguments that support the investigators' conclusions. It can also be interpreted in a more negative light. The repetition of such phrases may create an impression of certainty about the causes of an incident that might not be justified by the evidence. This particular rhetorical device leaves little room for suggesting alternative hypotheses as the argument is being presented.
Antithesis

This uses juxtaposition to contrast two ideas or concepts. This can be illustrated by the use of the term `rather than' in the following sentence: "The custom on the St. Lawrence River for a number of decades has been for pilots to use the accommodation ladder rather than the pilot ladder, unless exceptional conditions require a departure from that practice" [784]. Here the practice of using the pilot ladder is being juxtaposed with the use of the accommodation ladder. This technique is important because readers may make a number of additional inferences based upon such constructions. In this context, it is tempting to infer that the prevailing custom on the St Lawrence is some form of violation. We know that an incident has occurred, and the juxtaposition of existing practice with another procedure, such as the use of the pilot ladder, suggests ways in which those alternative practices might have avoided the failure. It can be argued that this analysis makes too many assumptions about the inferences that readers might draw from such an antithetical device. It is important to remember, however, that these additional inferences are very similar to those that have been identified by Byrne and Tasso's experimental studies of counterfactual reasoning [124].

Asyndeton

This device omits conjunctions between words, phrases and clauses. The technique creates an impression of `unpremeditated multiplicity' [307]. The document's creator can think of so many elements in the list that they hardly have time to introduce explicit conjunctions. This is illustrated by an enumeration describing the commitment of Transport Canada: "to the review and approval of construction plans, stability data, and the subsequent inspection of proposed or existing vessels, to ensure compliance with applicable regulatory requirements." [784] The omission of the final conjunction implies that the list is incomplete. There is also a sense in which this technique builds to a particular conclusion. The final term, `to ensure compliance with applicable regulatory requirements', may be the most important. Asyndeton creates precise and concise summaries. It, therefore, offers considerable stylistic benefits to more formal incident reports that can otherwise appear to be verbose examinations of complex failures. There are, however, dangers. The implied omission of closing conjuncts, as in the previous example, can lead to uncertainty if the reader is unsure of what other information might have been omitted from a list.
Conduplicatio
This technique relies upon the repetition of key words or phrases at, or very near, the beginning of subsequent sentences. This can be illustrated by the following quotation in which the investigators stress the effects of "modifications" on the trimming characteristics of the Navimar V: "The series of modifications carried out in 1993, 1996, and in the spring of 1997, prior to this occurrence, were collectively unsuccessful in eliminating the perceived shortcomings in the vessel's dynamic trimming characteristics. Further modifications, after refloating the vessel in 1998, were made to reduce once again the detrimental after trimming characteristics." [784]
Conduplicatio provides a focusing device because writers can use it to emphasise key features in preceding sentences. This helps to ensure that readers notice key concepts or ideas that they may have overlooked when they read the initial sentence. The previous example also illustrates the way in which particular passages can simultaneously exploit several different rhetorical techniques. Not only is conduplicatio used to emphasise the importance of "modifications" to the vessel. The term "trimming characteristics" is also emphasised by its use at the end of both sentences in the previous example. This technique focuses the readers' attention on the consequences of those modifications. Most investigators draft incident reports without ever being aware that they are exploiting such rhetorical devices. However, some of the material in this section is widely taught within courses on technical writing and composition. The intention behind these courses is to make writers more aware of the techniques that they can exploit in their documents. Previous paragraphs have stressed the dangers that can stem from the (ab)use of particular rhetorical techniques. They have also described how the inadvertent use of some devices can encourage readers to form hypotheses that were not intended by those who drafted an incident report. It is also important to acknowledge that some familiarity with the application of tropes might improve the prose that is often used to present these documents.
Diazeugma
This rhetorical device involves sentences that are constructed using a single subject and multiple verbs. It is frequently used in the reconstruction sections of incident reports and can help to describe a number of consecutive actions. Diazeugmas can provide an impression of rapid change over time. For instance, the Navimar V report describes the master's actions in extricating himself from the capsized vessel: "he saw light which he thought was coming from the surface and swam in that direction, but found himself in the engine compartment" [784]. As can be seen, this rhetorical device captures a sense of urgency in addition to the temporal information that may be implicit within such structures. There is, however, a danger that diazeugma can provide misleading information if the actions occur over a prolonged period of time. There is also a concern that the implied sequences may divert attention from intervening events. This problem can arise from the use of a single subject throughout the sentence. The actions of other subjects may be postponed to subsequent sentences even though they may have been interleaved with those of the initial subject. For instance, the passage cited above continues as follows: "In the meantime the deck-hand, who was wearing a flotation device, surfaced near the hull. One of the two relief pilots hurried to the bulk carrier's bridge to inform the bridge team of the situation. At 0012, he reported the accident to the Quebec MCTS centre" [784].
Chapter 9 describes the problems that can arise when readers must reconstruct partial timelines from such prose descriptions.
Expletive
This technique can be used to emphasise particular concepts by interrupting normal syntax. Examples include the use of `in fact' or `indeed'. The following excerpt drawn from the Navimar V report uses the expletive `moreover' as a preamble to a form of amplification. This again illustrates the way in which tropes can be combined to particular effect: "More recent pilot boats generally have a larger embarkation area forward of the wheel-house than aft of it, which makes it easier for the master to observe the transfer manoeuvre. Moreover, today's pilot boats operate at higher speeds during pilot transfers. Consequently, specific attention must be paid to their dynamic longitudinal trimming characteristics in the design stage, to ensure a safe operation throughout the vessel's displacement, transition, and full-planing modes." [784] Such techniques can be used in a variety of different ways. In this example, the expletive is simply used to focus attention on an additional factor that increases the importance of
considering the trim characteristics of pilot vessels. There are other instances in which reports have used expletives to cast doubt upon particular aspects of a testimony. For instance, the NTSB report into the loss of a clamming vessel contains the following sentence: "Mr Rubin also testified that he `absolutely' registered his emergency position indicating radio beacon with the National Oceanic and Atmospheric Administration" [831]. In this case the witness's own rhetorical use of the expletive `absolutely' is deliberately cited as a precursor to subsequent arguments questioning the truth of their statement. The witness's own emphasis, therefore, imperils the credibility of the rest of their evidence if readers believe the investigators' counterarguments about the registration of the beacon.
Procatalepsis
This technique enables an argument to develop by raising and then answering a possible objection. This is intended to avoid a situation in which the reader's attention is distracted from any subsequent argument by the doubts that might have arisen during their reading of the preceding prose. The case study report provides a relatively complex example of this technique: "it is difficult to see how a pilot boat could be completely immune to capsizing or plunging, but pilot boat design criteria must meet the needs of the industry and pilotage authorities" [784]. This illustrates the use of procatalepsis because it addresses the implicit objection that it is impossible to design a pilot boat that is completely immune to capsizing or plunging. The answer to this possible objection is that `design criteria must meet the needs of the industry and pilotage authorities'. As with previous techniques, this is not without its dangers. The purpose behind the use of procatalepsis is to enable investigators to continue with the main thrust of their argument. There is a danger that such brief comments may do little to address the underlying doubts of the reader. For instance, the Navimar V report does not consider the form that such criteria might take, nor does it address the problems of establishing consensus about the needs of industry and pilotage authorities.
This list presents a preliminary analysis of the rhetorical devices used in incident reports. It builds on initial work by Snowdon [749]. He has argued that it is possible to apply this style of analysis as a means of checking whether or not particular linguistic constructs are (ab)used to support bias in incident and accident reports. The objective of his work is to teach investigators to perform a detailed and critical reading-through of their reports prior to publication. The intention is not that they should be forced to learn the complex names and ideas associated with each trope. It is, however, intended that greater attention be paid to the effect that particular devices can have upon the readers of an incident report.
Logic

As mentioned, the previous list only provides a partial account of the many rhetorical devices that can be identified in incident and accident reports. Over sixty of these are identified by Harris' Handbook of Rhetorical Devices [307]. Brevity prevents a more sustained analysis. It is, however, worth pausing to consider one additional trope known as enthymeme. This is an informally-stated syllogism in which either a premise or the conclusion is omitted. This can be illustrated by the following quotation from the Navimar V case study: "since visibility was good, conduct of both the vessel and the pilot boat was carried out by visual observation during the approach of the two vessels and the transfer of the pilots" [784]. This is an enthymeme because it omits the premise that if visibility is good then such manoeuvres should be performed using visual observation. It is also possible to omit the conclusion in an enthymeme if it can be `generally' understood from the premises. For instance, a meeting was convened between Transport Canada, the Corporation des pilotes du Saint-Laurent central, the Corporation des pilotes du Bas Saint-Laurent and the Laurentian Pilotage Authority to decide upon an initial response to the capsize of the pilot vessel. It was agreed that "Transport Canada would issue a Ship Safety Bulletin if the parties came to a consensus, but such consensus was not reached" [784]. This omits the conclusion that no Ship Safety Bulletin was issued. It illustrates a potential danger that stems from the use of enthymemes. The previous quotation provides no guarantees that such a Bulletin was not issued for other reasons.
In logical terms the `if' in the preceding extract represents implication not bi-implication. Many readers of this extract may, however, make the inference that the publication was not issued and that this can be entirely explained in terms of the lack of agreement between the parties involved in the investigation. There are further forms of enthymeme. An initial premise can create a specific context that affects the readers' interpretation of subsequent, more general, premises. Readers may then apply the initial premise to the generalisations in order to infer a number of more particular conclusions. For instance, the following excerpt refers to a range of human factors issues that may have affected the Navimar V incident: "The level of care and skill required of a crew manoeuvring a pilot boat are significant factors in this occurrence. Even the most experienced master may suffer a moment's inattention. An emergency manoeuvre to correct the vessel's behaviour may be as harmful as poor vessel design. The human factor is also part of the operating system." [784] The initial sentence states that operator behaviour contributed to this particular incident. The following sentences provide generalised premises that do not refer directly to the circumstances surrounding the capsize of the Navimar V. The implicit conclusions that readers might identify from this enthymeme are that, in this particular incident, key personnel suffered from a moment's inattention, an emergency manoeuvre may have taken place and that human factors issues may have impaired the operation of the Navimar V. There are considerable dangers in the use of enthymemes within incident reports. As we have seen, they often rely upon readers inferring conclusions that are implicit within the premises that appear in the published account. Unfortunately, there are few guarantees that every reader will correctly form the implied syllogism. In particular, the distinction between implication and bi-implication can lead to numerous problems with a negated premise. Statements of the form ¬A and A ⇒ B do not enable us to conclude ¬B. However, we can conclude ¬B if we have a premise of the form A ⇔ B. These problems may appear to be of esoteric significance. They have, however, resulted in numerous objections to the accounts that are presented in incident reports [427]. One solution is to use formal logic as a means of verifying that incident reports present all of the information that readers require in order to form the syllogisms that are used by investigators [412]. It is important to emphasise that there are some important differences between this use of logic and that proposed by Ladkin and Loer [470], reviewed in Chapter 11. WBA uses causal logics to support the causal analysis that must be conducted before a report is drafted. In contrast, the techniques that I have developed are aimed more at improving the presentation and argument in incident reports. Philosophically these differences are important because WBA embodies an objective view of causation from Lewis' approach to counterfactual reasoning [491]. This creates some technical problems when there may be rival explanations for the same observed events; Chapter 11 describes proposals to resolve this by introducing weightings into Lewis' modal structures. In contrast, the use of logic to verify the content of incident reports can avoid making any strong assumptions about what actually was the cause of an incident.
This can be left up to the skill and expertise of the investigators. The use of logic in this context is simply intended to ensure that the account of an incident avoids the problems associated with the inappropriate use of enthymemes. The practical application of logic to support the verification of incident reports is very close to that described in Chapter 9. In this previous chapter, mathematically-based notations were used to reconstruct the events leading to an incident. Rather than constructing clauses from the primary evidence that is obtained in the aftermath of an incident, verification proceeds by building a formal model from the phrases in a report document. This can be illustrated by the previous excerpt from the Navimar V case study. It was agreed that "Transport Canada would issue a Ship Safety Bulletin if the parties came to a consensus, but such consensus was not reached" [784]. This can be formalised using the following clauses; note that some of the parties to the agreement have been omitted to simplify the exposition:
consensus(transport_canada, laurentian_pilotage_authority) ⇒ issue(transport_canada, ship_safety_bulletin).    (13.1)

¬consensus(transport_canada, laurentian_pilotage_authority).    (13.2)
The first clause states that if consensus is reached between Transport Canada and the Laurentian Pilotage Authority then Transport Canada issues a Ship Safety Bulletin. The second clause states that consensus was not reached. Unfortunately, the laws of first order logic do not enable us to make any inferences from these premises about whether or not a bulletin was issued. This can be illustrated by the following inference rule, which arguably represents the informal inferences that many readers would apply to the previous quotation. If we know that A is true and that if A is true then B is true, we can conclude that B is indeed true:
A, A ⇒ B ⊢ B    (13.3)
Unfortunately, this rule cannot be applied to clauses (13.1) and (13.2) because these take the form A ⇒ B, ¬A. In order to address any potential confusion, we would be forced to state explicitly that no bulletin was issued by Transport Canada:
¬issue(transport_canada, ship_safety_bulletin).    (13.4)
Alternatively, we could re-write the prose used in the incident report: Transport Canada would only issue a Ship Safety Bulletin if the parties came to a consensus, but such consensus was not reached. The introduction of the modifier `only' rules out other circumstances that might have led to the publication of the bulletin and which are not mentioned in that particular passage. This would result in the following formalisation, which includes the ⇔ operator (read as `if and only if'):
consensus(transport_canada, laurentian_pilotage_authority) ⇔ issue(transport_canada, ship_safety_bulletin).    (13.5)

¬consensus(transport_canada, laurentian_pilotage_authority).    (13.6)
The use of the ⇔ operator provides a number of additional inference rules that can be used to verify the informal reasoning process that has been described in previous paragraphs. One of these rules can be formalised as follows:
¬A, A ⇔ B ⊢ ¬B    (13.7)
The proof proceeds by applying (13.7) to clauses (13.5) and (13.6) to derive:
¬issue(transport_canada, ship_safety_bulletin).    (13.8)
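This reasoning can be checked mechanically. The following sketch, which is purely illustrative and no part of the published analysis, enumerates the truth assignments for the two propositions in order to confirm that (13.8) follows from the biconditional clauses (13.5) and (13.6) but not from the implicational clauses (13.1) and (13.2); the function and variable names are assumptions introduced here for exposition:

    # Illustrative sketch: does a conclusion hold in every truth
    # assignment (model) that satisfies the premises?
    from itertools import product

    def entails(premises, conclusion):
        """Semantic entailment over the two propositions of interest."""
        for consensus, issue in product([True, False], repeat=2):
            model = {'consensus': consensus, 'issue': issue}
            if all(p(model) for p in premises) and not conclusion(model):
                return False  # a counter-model exists
        return True

    def implies(a, b):
        return (not a) or b

    # Clauses (13.1) and (13.2): implication with a negated antecedent.
    implication_reading = [
        lambda m: implies(m['consensus'], m['issue']),   # (13.1)
        lambda m: not m['consensus'],                    # (13.2)
    ]
    # Clauses (13.5) and (13.6): the `if and only if' reading.
    biconditional_reading = [
        lambda m: m['consensus'] == m['issue'],          # (13.5)
        lambda m: not m['consensus'],                    # (13.6)
    ]
    no_bulletin = lambda m: not m['issue']               # goal (13.8)

    print(entails(implication_reading, no_bulletin))     # False
    print(entails(biconditional_reading, no_bulletin))   # True, by (13.7)

The counter-model in the first case is the assignment in which a bulletin is issued for some reason other than consensus; this is precisely the possibility that the original prose leaves open.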
The use of formal logics offers a number of additional benefits to the verification of incident reports. In particular, it can be used to strip out repetition when it is used as a rhetorical device. For example, the Navimar V case study includes the following phrases: "In the compulsory pilotage areas on the St. Lawrence River, most pilots use the accommodation ladder for access to vessels." (Section 1.12.2, [784]) "The custom on the St. Lawrence River for a number of decades has been for pilots to use the accommodation ladder rather than the pilot ladder, unless exceptional conditions require a departure from that practice." (Section 2.2, [784]) "Most St. Lawrence River pilots use the accommodation ladder to board vessels." (Section 3.1, [784]) Such repetition can have an adverse effect on the reader of an incident report. The recurrence of similar sentences reiterates particular observations. This indirectly lends additional weight to arguments even though each restatement of the information is based on the same evidence. There
may be insidious effects when, as in the previous examples, no evidence is cited to support particular assertions about existing practices on the St. Lawrence. The previous citations might be represented by the following clause:
size_of(pilot(P1), perform(P1, access_accommodation_ladder), N1) ∧ size_of(pilot(P2), perform(P2, access_pilot_ladder), N2) ∧ most(N1, N2).    (13.9)
As mentioned, such formalisations help to strip out the rhetorical effects of repetition. Logical conjunction is idempotent. In other words, A ∧ A ∧ A ∧ A is logically equivalent to A even though the rhetorical effect may be quite different. It is important to note, however, that we have had to rely upon a second order notation in order to formalise the notion of `most' in (13.9). This illustrates a limitation of our application of logic. A range of relatively complex mathematical concepts may be required in order to formalise the prose within an incident report. There are further limitations. For example, we have not provided the semantics for most. We might have resorted to the use of > but this would not have captured the true meaning of the investigators' remarks. Such a formalisation would evaluate to true if just one more pilot used the accommodation ladder rather than the pilot ladder. We might, therefore, specify that most(N1, N2) is true if N1 is twice as big as N2, three times as big as N2, four times as big as N2, and so on. The key point here is that the process of formalisation forces us to be precise about the meaning of the prose that is used within an incident report. This offers important safeguards during the verification of a particular document. For example, if we consider precise numerical values to support the definition of most we might then require that investigators provide statistical evidence to demonstrate that three, four or five times as many pilots use the accommodation ladder as use the pilot ladder.
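One way to make such a definition precise is to fix a numerical threshold. The following fragment is offered only as an illustration of the precision that formalisation demands; the threshold k and the predicate name are assumptions introduced here, not definitions drawn from the report:

    def most(n1, n2, k=3):
        """A hypothetical threshold semantics for most(N1, N2): true
        when N1 is at least k times N2. Setting k=1 would collapse to
        simple '>', which the text argues is too weak a reading."""
        return n1 >= k * n2

    # Investigators would then need counts of ladder use to support
    # the claim, for example:
    print(most(30, 10))   # True: three times as many
    print(most(11, 10))   # False: marginally more is not `most'

Any such threshold is, of course, only one possible reading; the point is that the formalisation forces this choice into the open.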
The previous example has illustrated not only how logic can be used to combat the rhetorical effects of repetition, it has also illustrated the level of precision that this approach promotes during the verification of an incident report. There are further benefits, especially when investigators consider variants of the enthymeme tropes mentioned above. An enthymeme involves the omission of a premise or conclusion from an argument. It is relatively rare to find incident reports that deliberately omit major facts from their account [426]. More frequently, evidence can be cited many pages away from the arguments that it supports. This creates problems because readers can easily overlook this confirmatory evidence and hence may not draw the conclusions that might otherwise have been derived about the course and causes of an incident. Similar problems can arise from the use of the structuring mechanisms that have been described in previous sections of this chapter. In particular, by stating the conclusions at the end of an incident report it is relatively easy for readers to forget or overlook the caveats and provisos that may have been used to circumscribe those findings in the previous sections of a report. For example, the Navimar report includes the following conclusion: "9. the pilot aboard the bulk carrier and the master of the pilot boat did not come to an agreement by radio communication on the time and position for the transfer" [784]. An initial reading of this finding might suggest that radio communication ought to have been made and that this might have helped to avoid the incident. Such an interpretation ignores some of the contextual factors that convinced both of these experienced mariners that such a course of action was unnecessary. For instance, the report makes the following observations ten or more pages before the conclusions cited above: "Since neither the master of the pilot boat nor the pilot on board the bulk carrier was expecting any problems with the transfer manoeuvre, they did not see any point in making contact by radiotelephone to determine when the pilot boat should come alongside and transfer the pilots, nor were they required to do so by regulations." (Section 1.9.2, [784]) In order to interpret finding 9, cited above, correctly, readers must remember that the pilot and master were not required to make radio contact and that both considered such contact to be unnecessary given the prevailing conditions at the time of the transfer. This is a non-trivial task. As we have seen, most incident reports contain a mass of contextual detail. For example, the Navimar V report provides details of previous pilot transfers that did not play any direct role in this particular incident.
It can be difficult for readers to identify those details that will be significant to their understanding of the conclusions in an incident report and those that simply add circumstantial information. Logic can be used to strip out this contextual information. We have pioneered a style of analysis that is similar to the WBA described in Chapter 9. Rather than starting with a temporal sequence of events leading to an incident, we start with the conclusions in an incident report. For example, the conclusion about the lack of communication can be represented by the following clause:
¬message(pilot_bulk_carrier, master_pilot_boat, transfer_details).    (13.10)
The verification process then proceeds by a careful reading of the incident report to identify any previous information that relates to this conclusion. The previous citation might be formalised as part of this analysis in the following manner:
weather(visibility_good) ∧ ¬required_message(pilot_bulk_carrier, master_pilot_boat, transfer_details) ⇒ ¬message(pilot_bulk_carrier, master_pilot_boat, transfer_details).    (13.11)

Ideally, we would like to apply a formal proof rule, such as (13.3), to show that the conclusion was supported by available evidence. In order to do this we must first demonstrate that the visibility was good and that there was no requirement for the master and the pilot to communicate the details of the transfer. The report contains detailed meteorological information: "since visibility was good, conduct of both the vessel and the pilot boat was carried out by visual observation during the approach of the two vessels and the transfer of the pilots" (Section 1.8.2, [784]). Much less information is presented about the regulatory requirements, which might otherwise have required that the communication take place. This example illustrates the way in which logic can be used to focus on particular aspects of an incident report. The evidence that supports particular conclusions can be described in a precise and concise manner. This directs further analysis of an incident report. Investigators must identify those passages that provide the evidence to support these conclusions. Such extracts can, in turn, be translated into a logic notation to complete the formal proof in a manner similar to that illustrated in previous paragraphs. A number of caveats can be raised about our use of logic to verify particular properties of an incident report. In particular, our formalisations have relied upon relatively simple variants of first order logic. These lack the sophistication of the more complex, causal techniques that have been derived from Lewis' work on counterfactual arguments [491]. This issue is discussed at greater length in Chapter 9. As we shall see, however, there are more fundamental objections against the use of any logic formalism to analyse the arguments that are put forward within an incident or accident report [775]. For now, however, it is sufficient to briefly summarise a number of additional benefits that can be derived from this technique. In particular, it is possible to demonstrate inconsistencies between two or more accounts of the same incident [427]. For instance, if one report omits a particular piece of evidence then we must consider not only the impact of that omission itself but we must also account for the loss of any inferences that may depend upon that evidence. In the previous example, if a rival report failed to provide any information about the prevailing weather conditions then readers might doubt clause (13.11) as an `explanation' for the lack of communication. It is also possible to extend the formal model of an incident report to analyse any proposed recommendations. For instance, in previous work we have used formal proof techniques to show that proposed interventions following a rail collision need not prevent the recurrence of a similar incident in the future [427].
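The backward style of verification described here can also be mimicked mechanically: encode the facts and implications recovered from the report and test whether the formalised conclusion is derivable. The sketch below is offered under the assumption that all clauses are ground facts or simple rules, and it uses shortened atom names corresponding to (13.10) and (13.11):

    # Illustrative sketch: forward chaining over clauses recovered
    # from the report, to test whether conclusion (13.10) is derivable.
    facts = {
        'weather(visibility_good)',              # Section 1.8.2
        'not_required_message(pilot, master)',   # Section 1.9.2
    }
    rules = [
        # (13.11): good visibility and the absence of any regulatory
        # requirement together explain the lack of radio contact.
        ({'weather(visibility_good)',
          'not_required_message(pilot, master)'},
         'not_message(pilot, master)'),
    ]

    def derivable(goal, facts, rules):
        """Apply every rule whose body is satisfied, to a fixed point."""
        known = set(facts)
        changed = True
        while changed:
            changed = False
            for body, head in rules:
                if body <= known and head not in known:
                    known.add(head)
                    changed = True
        return goal in known

    print(derivable('not_message(pilot, master)', facts, rules))  # True

Deleting either fact, such as the meteorological evidence, causes the check to fail; this mirrors the observation that a rival report omitting the weather conditions would undermine (13.11) as an explanation.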
Toulmin

Toulmin's `The Uses of Argument' [775] can be seen as a measured attack against the use of formal logic as a primary means of understanding rational argument. This work, therefore, has considerable importance for any attempt to use logic as a means of verifying the correctness and consistency of arguments within incident reports. One aspect of Toulmin's attack was that many arguments do not follow the formal rules and conventions that are, typically, used to construct logics. An example of this is the use of warrant, or an appeal to the soundness criteria that are applied within a particular
field of argumentation. These criteria differ between fields or domains. The style of warrant that might be acceptable within a court of law might, therefore, not be acceptable in a clinical environment. Similarly, the clinical arguments that support a particular diagnosis are unlikely to exploit the same soundness criteria that might establish validity within the domain of literary criticism. The abstractions of a formal logic are unlikely to capture the argumentation conventions that characterise particular domains. Toulmin urges us to question the notions of objective or `universal' truth. The truth of a statement relies upon the acceptance of a set of rules or procedures that are accepted within a domain of discourse. This initial analysis of Toulmin's work can be applied to the verification of incident reports. For instance, it is possible to identify certain norms and conventions within particular reporting systems. These norms and conventions help to define what is and what is not an acceptable argument about the causes of an incident. For example, eye-witness testimony typically provides insufficient grounds for a causal analysis unless it is supported by physical evidence. Conversely, the data from a logging system is unlikely to provide a sufficient basis for any argument about the systemic causes of an adverse occurrence or near-miss incident. One approach to the verification of incident reports would be to enumerate a list of these conventions that are often implicit assumptions within an investigation team. Any report could then be checked against this list to confirm that it met the appropriate argumentation conventions. One of the strong features of this field-dependent aspect of Toulmin's work is that it reflects the differing practices that are apparent between different reporting systems. What might be acceptable as a valid argument about causation for a local investigation into a low-risk incident might not be acceptable within a full-scale investigation of a high-risk incident. Similarly, the standards that might be applied to the argument in an incident report within the nuclear industry might be quite different from those that would be acceptable within a catering business. This domain-dependent approach contrasts strongly with logic-based techniques that rely upon formal notions of correctness. The underlying proof procedures remain the same irrespective of the domain under investigation. This has important practical consequences. The high costs that can be associated with the application of formal modelling techniques can prevent them from being used to verify the correctness of reports into low-risk incidents. The complexity of higher-risk failures can also force investigators to construct abstract models of the information that is contained in accident reports. These models often fail to capture important details of an incident or accident. In either case, problems are apparent because of the inflexibility of logic-based techniques. They cannot simply be tailored to reflect the different forms of argumentation that are exploited within different contexts. Some people would argue that this is a strong benefit of a formal approach; it avoids the imprecision and inconsistencies that can arise from domain dependent argumentation procedures. The conflict between logic-based models and Toulmin's ideas of domain-dependent discourse has had a profound impact upon the theory of argumentation. Fortunately, many of the consequences of Toulmin's ideas do not apply within the domain of incident reporting.
In particular, large-scale reporting systems often encourage their investigators to adopt a model of argument in their reports that closely mirrors aspects of formal proof. Evidence is presented in a reconstruction section, arguments are then developed within an analysis section, and conclusions are then presented on the basis of the arguments. It can, therefore, be argued that the domain dependent procedures of incident and accident investigation are similar to our previous application of proof procedures such as Modus Ponens. This, in turn, explains why so many people have proposed logic based techniques as appropriate means of verifying the products of incident investigations [469, 412]. Toulmin acknowledges that people rely upon both domain-dependent and domain-independent procedures to establish the validity of an argument. In contrast to the field-dependent issues mentioned in the previous paragraphs, most of Toulmin's work focuses on domain-independent procedures. The simplest of these procedures consists of a claim that is supported by some data. A claim is an assertion, for example about a cause of an incident. There are three types of claim:

1. Claims of fact. A claim of fact is supported by citing data, such as the results of simulator studies or of data recorders. This data must be sufficient, accurate, recent and appropriate.

2. Claims of value. These claims represent moral or aesthetic judgements which are not factual and cannot be directly supported by data alone. They can be supported by citing unbiased
and qualified authorities. A claim of value can also be supported by arguing that it produces good results or that negative results may be obtained if it is ignored.

3. Claims of policy. A policy claim is supported by showing that a procedure or regulation is both feasible and positive. Such arguments tend to rely upon a combination of fact and of value.

Most incident reports rely upon claims of fact. Arguments about the causes of an incident must be grounded in the evidence that can be obtained by primary and secondary investigations. We are also often concerned with claims of policy. For instance, there may be little a priori evidence that a particular recommendation will avoid or mitigate future incidents. The best that can be done is to follow the advice of relevant experts based on data from similar systems. It is more rare for an incident report to be concerned with value claims, except in circumstances where moral decisions must be made, for instance, over the amount of money that might be invested to avoid future fatalities. As mentioned, incident reports are primarily concerned with claims of fact. The rest of this section, therefore, focuses on the arguments that can be used to support this form of claim.
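For readers who prefer a schematic summary, the three types of claim can be recorded in a simple enumeration. This is an illustrative aid only; neither the names nor the structure are drawn from Toulmin or from any reporting system:

    from enum import Enum

    class ClaimType(Enum):
        """Toulmin's three types of claim, as summarised above."""
        FACT = 'supported by citing sufficient, accurate, recent, appropriate data'
        VALUE = 'supported by unbiased, qualified authorities or by outcomes'
        POLICY = 'supported by showing a procedure or regulation is feasible and positive'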
Figure 13.2: Data and Claims in the Navimar Case Study.

As mentioned, data can be used to support particular claims of fact. Data represents a body of evidence that can be used to determine whether or not a claim is valid. This concept is not straightforward because there can be further arguments about the validity of an item of evidence. It is for this reason that data, or grounds, refers to the part of an argument that is not in dispute. Toulmin avoids some of the practical issues that can arise when the different parties to an incident investigation cannot agree about what is and what is not acceptable evidence: "Of course we may not get the challenger even to agree about the correctness of these facts, and in that case we have to clear his objection out of the way by a preliminary argument: only when this prior issue or `lemma', as geometers would call it, has been dealt with, are we in a position to return to the original argument. But this complication we need only mention: supposing the lemma to have been disposed of, our question is how to set the original argument out most fully and explicitly". [775]
Chapters 6 and 7 describe techniques that are intended to encourage agreement over the reliability and accuracy of particular items of evidence. In contrast, Figure 13.2 illustrates how Toulmin's approach can be applied to the Navimar case study. As can be seen, the Transportation Safety Board of Canada report argues that the incident was caused when the pilot boat pitched onto a wave crest and surged into the trough of the next wave. This claim of fact is supported in the incident report by evidence of a similar, previous incident in 1997. Figure 13.2 also introduces two further components of the Toulmin model of argumentation. A warrant describes the assumptions that help to connect a claim with the grounds or data that supports it. This illustrates a superficial relationship between Toulmin and the syllogisms that have been introduced in previous sections [305]. The grounds and warrant can be thought of as premises; the claim represents the conclusion to be drawn. Figure 13.2 also illustrates the notion of backing. This helps to establish a warrant; backing can also be a claim of fact or value. In this example, the relationship between the previous incident and the conclusion of the report is never made explicit in the incident report. In consequence, readers have to infer the reason why this data might support the overall conclusions. Vessels that have suffered previous incidents are more likely to suffer future recurrences of similar incidents. This warrant might be supported by statistical studies to indicate that vessels which are involved in one incident are then more likely to be involved in another similar incident in the future.
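The components introduced so far can be captured in a small data structure. The sketch below is one hypothetical encoding of Figure 13.2, not a published schema; the class and field names are assumptions made for illustration:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Argument:
        """One node in a Toulmin-style analysis of an incident report."""
        claim: str                     # the assertion being advanced
        data: List[str]                # grounds cited in the report
        warrant: Optional[str] = None  # assumption linking data to claim
        backing: Optional[str] = None  # support for the warrant

    # Figure 13.2, re-expressed: the claim of fact and its grounds.
    fig_13_2 = Argument(
        claim='The pilot boat pitched onto a wave crest and surged '
              'into the trough of the next wave.',
        data=['Evidence of a similar, previous incident in 1997.'],
        warrant='Vessels that have suffered previous incidents are '
                'more likely to suffer recurrences of similar incidents.',
        backing=None,  # never made explicit in the published report
    )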
Figure 13.3: Qualification and Rebuttal in the Navimar Case Study.

An argument is valid if the warrant and any associated data provide adequate support for the claim. It is possible, however, that an investigator's colleagues might raise objections to a particular argument. Alternatively, the writers of an incident report may themselves have doubts about the scope and applicability of their analysis. Finally, the readers of an incident report might question the argument that is embodied within such a document. Toulmin's domain independent model can be used to capture these alternate positions that challenge or modify an initial position. Figure 13.3 uses the Navimar case study to illustrate such an extension. As can be seen, a qualification node has been introduced to record the observation that a previous incident can induce greater caution amongst the crew of some vessels. This qualifier affects the previous warrant that represents the implicit argument that vessels involved in previous incidents will be more likely to be involved in future incidents. The claim can also be qualified in a similar fashion, if it has been challenged and its truth is in doubt. This use of a qualifier does not deny that there may be a relationship between the
vessels involved in an incident and previous involvement in similar occurrences. Instead, it argues that this effect may not apply to all vessels. Figure 13.3 also illustrates the use of a rebuttal to challenge the argument that was first sketched in Figure 13.2. The suggestion that vessels are more likely to be involved in an adverse occurrence if they have been involved in previous incidents is contradicted in this case by arguing that previous modifications were effective in addressing the causes of the problem. This rebuttal is supported by the lack of evidence to indicate that there was a continuing problem. The crew also continued to risk their lives by operating the vessel in what can be a difficult and hazardous environment. Such counterarguments can be challenged, for instance by arguing that economic pressures often force individuals to operate equipment that they know to have safety problems. Similarly, the lack of evidence about further incidents between spring 1997 and the incident need not provide direct assurance that the problem would not recur. The vessel might not have faced similar operating conditions. Such arguments against a rebuttal could be incorporated into the structures of Figures 13.2 and 13.3. The resulting graphs can quickly become intractable. A number of researchers have, therefore, developed tools that can be used to support the use of Toulmin's techniques to map out argument structures [503, 748]. Much of this work has been inspired not by Toulmin's initial interest in studying the structure of argument but by a more prosaic interest in improving the support that is provided for particular decisions. This practical application of Toulmin's model can be illustrated by Locker's recent work on improving `Business and administrative communication' [498]. Locker suggests that business writers decide how much of Toulmin's model they should use by analysing the reader and the situation. Writers should make both their claim and the data explicit unless they are sure that the reader will act without questioning a decision. The warrant should be included in most cases and the backing should be made explicit. It is also important for effective communication that any rebuttals should be addressed by counter-claims, as suggested above. Authors should also be careful to explicitly address any limiting or qualifying claims. Locker's normative application of the Toulmin model suggests how this approach might be used to support the presentation of incident and accident reports. Given the importance of effective communication in this context, we might require that this approach be used to make explicit the association between data and the claims that are made in a report. If data is not presented in the document then the claim is unsupported. This application of the technique is similar to the manner in which logic might be used to identify enthymemes in other forms of syllogism. It can also be argued that investigators should explicitly document the warrant that links data and evidence within an incident report. This is important because, as we have seen, incident reports are often read by many diverse communities including operators, managers, regulators, engineers and so on. It is, therefore, difficult to make strong assumptions about the background knowledge that is required in order to infer the relationships that exist between data and particular causal claims. This is illustrated by the Navimar case study that has been introduced in the previous paragraphs.
The Transportation Safety Board report never makes explicit the relationship between evidence of a previous incident and the overall conclusion of the enquiry. We have had to infer an appropriate warrant in Figure 13.2. It is entirely possible that we have made a mistake. Investigators may have had entirely different reasons for introducing the events in 1997. Unfortunately, we have no way of telling whether this argument is correct or not from the report into the subsequent capsize. This requirement to make explicit data, claims and warrant goes beyond Locker's requirements for effective business communication. These differences should not be surprising, given that Toulmin acknowledges the importance of domain independent and domain dependent requirements for effective argument. The `standards of proof' are potentially higher in the case of incident investigation than they might be amongst more general business applications. Other aspects of good practice will be common across these different domains. For example, Locker argues that effective communication relies upon writers explicitly addressing any proposed rebuttals of an initial argument. This is equally important within the field of incident reporting. For instance, the Navimar report comments that `it was reported that adding ballast improved but did not completely eliminate the boat's unsatisfactory trimming behaviour' [784]. This partly addresses the rebuttal in Figure 13.3. It does not, however, explicitly provide backing for such a counter-argument beyond the rather vague reference to previous reports.
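Extending the same hypothetical structure with qualifier and rebuttal fields allows the content of Figure 13.3 to be recorded alongside that of Figure 13.2. As before, the field names are assumptions made for illustration:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class QualifiedArgument(Argument):  # builds on the Argument sketch above
        qualifier: Optional[str] = None      # limits the force of the warrant
        rebuttals: List[str] = field(default_factory=list)

    fig_13_3 = QualifiedArgument(
        claim=fig_13_2.claim,
        data=fig_13_2.data,
        warrant=fig_13_2.warrant,
        qualifier='A previous incident can induce greater caution '
                  'amongst the crew of some vessels.',
        rebuttals=['The previous modifications were effective in '
                   'addressing the causes of the problem.'],
    )

Counter-claims against a rebuttal, such as the economic pressures mentioned above, would in turn be attached to the rebuttal entries, which is precisely why the resulting graphs can quickly become intractable.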
Figure 13.4 provides a slightly more complex example of the application of Toulmin's model to the Navimar case study. In this instance, data about the layout of the pilot vessel is used to support a claim that the use of the accommodation ladder rather than the pilot ladder contributed to the incident. This argument is supported by the warrant that the use of the accommodation ladder complicated the task of keeping the pilot boat's transfer deck in position because the boat cannot rest parallel to the vessel's side during the transfer. The difficulty of performing such manoeuvres is recognised by international regulations requiring that pilot ladders be used for pilotage transfers. The argument is also supported by the warrant that the master had to divide his attention between completing this relatively complex manoeuvre with the vessel in front of him and the task of looking aft to ensure that the after deck lined up under the accommodation ladder. This warrant is not backed by any particular citations in the incident report. There are no direct observations or accounts of the difficulty of this task. In contrast, the investigators acknowledge that "there is every indication that the crew were well rested and highly experienced" [784]. The omission of any backing provides a further example of an enthymeme trope. If we apply Toulmin's model in the normative manner proposed above then it can be argued that more information ought to be introduced into the report to support this warrant. For instance, an analysis of the ergonomics of the bridge design might provide sufficient detail for operators to determine whether similar problems might affect not simply transfer tasks but other pilot operations as well.
Figure 13.4: More Complex Applications of Toulmin's Model.

Figure 13.4 also provides a further illustration of a rebuttal that readers might form from the
information that is presented in the Navimar report. There are a number of drawbacks that affect the use of pilot ladders. Some of these potential problems relate to significant safety concerns, especially if hybrid pilot and accommodation ladders are joined together. This rebuttal is supported by the observation that it was common practice to use accommodation ladders in the pilotage areas of the St. Lawrence River. This had reached such an extent that the Marine Communications and Traffic Service centre explicitly informed foreign crews of this practice. Previous paragraphs have argued that it is important for investigators to address such rebuttals if readers are to have confidence in the findings of an incident report. This caveat is, therefore, addressed by two counter claims. Firstly, the freeboard of the Navios Minerva was less than nine metres. This obviated the need for a combination ladder of the type mentioned above. The rebuttal is also addressed by the counter claim that the Marine Communications and Traffic Service instruction was in accordance with neither international nor Canadian regulations. The safety concerns mentioned in the initial rebuttal cannot avoid the conclusion that common practice was in violation of the recommended rules and regulations that had been issued to the crews. A number of further comments can be made about the use of Toulmin's model in Figure 13.4. It is possible to use the resulting graphs to trace the location of information in various sections of the report. Although most of the information that supports the rebuttal appears together in Section 2.2, some of the material can be found in Section 1.12.2. This is important because this information provides evidence that can be used to contradict the initial rebuttal. Similarly, the backing for part of the warrant, in Section 2.2, supporting the argument in Figure 13.3 is to be found in Section 1.12.1. This forms part of a more general pattern that can be observed through the application of Toulmin's model to incident reports. The warrant that outlines a particular line of support for an argument typically appears many pages after the backing that supports it. This separation arises from the policy of separating the arguments and analysis that explain the significance of key events from the reconstructions that first describe the context in which an incident occurs. This approach helps to avoid any confusion between what is known and what is inferred about a near miss incident or adverse occurrence. A further consequence is that readers may only learn the significance of particular items of information after they have finished reading the report. In consequence, it is often necessary to read and re-read such documents several times in order to follow the complex argument that may be distributed across hundreds of pages of prose. Snowdon has argued that these problems might be reduced if, instead of using Toulmin to check an argument in an incident report, the readers of a report were provided with diagrams such as those shown in Figures 13.2, 13.3 and 13.4 [748]. These could be printed inside an incident report to provide readers with a `roadmap' of the various arguments that are being proposed by an investigator. This approach might also increase confidence in any conclusions by explicitly indicating the counter-arguments that might be deployed against particular rebuttals.
Figure 13.4 illustrates the relatively complex diagrams that can emerge from the application of Toulmin's techniques to incident reports. It also illustrates some of the problems that arise in the practical use of this approach. Toulmin's focus was on explaining the domain dependent and domain independent components of argument structures. His purpose was "to raise problems, not to solve them; to draw attention to a field of inquiry, rather than to survey it fully; and to provoke discussion rather than to serve as a systematic treatise" [775]. His model was never intended to be used as a tool to support the development of incident reports. One consequence of this is that it can be difficult to categorise the paragraphs within an incident report. For example, it can be argued that the rebuttal in Figure 13.4 might be reclassified as a form of qualifier. It does not directly contradict the argument that the decision to use the accommodation ladder contributed to the incident. Instead, it explains why many operators chose not to use pilot ladders. This could be interpreted as a qualifier because it refers to previous instances in which the use of the accommodation ladder had not resulted in an adverse outcome. Further problems complicate this application of the Toulmin model. For instance, we have constructed our analysis at the level of individual paragraphs within the Navimar report. Rather than translate the original prose in a manner that might support the classification of those paragraphs within the Toulmin approach, we have chosen to retain verbatim quotations within our analysis. The drawback to this application of the model is that some of the paragraphs may themselves contain more detailed argument structures. For example, the backing
for the rebuttal in Figure 13.4 contains a claim that most pilots use the accommodation ladder in the St. Lawrence River pilotage areas. This might, in turn, be supported by additional data. The introduction of this data would then need to be supported by an appropriate warrant and so on. The previous objections relate to the difficulty of applying Toulmin's model to the complex prose and argument structures that are used in many incident reports. It is possible to reverse these objections by arguing that this very complexity increases the importance of any techniques that might be able to identify potential errors of omission and commission. The principal benefit of this approach is that it provides a graphical representation of the various positions within an incident report. These diagrams can then form a focus for subsequent discussion amongst an investigation team prior to publication. The very accessibility of these diagrams helps to ensure that any disagreements about the classification of particular sentences and paragraphs can be checked by an investigator's colleagues. It also forms a strong contrast with the use of more formal logic-based approaches. The accessibility of the Toulmin model, however, comes at the price of far weaker concepts of proof or correctness. The adequacy of an argument can only be assessed in terms of the domain dependent procedures that are accepted within an investigation team. These procedures guide the normative application of Toulmin's model, proposed for business communication by Locker [498] and sketched for incident reporting in previous paragraphs. Problems can arise when those procedures that are acceptable within one domain of argument are questioned or rejected by other groups who employ different standards of `correctness'. A number of authors have proposed further extensions to the Toulmin model of argumentation. For instance, the initial proposals provided backing for the propositions that are captured in a warrant. They did not provide similar support for the data that backs a claim. As mentioned above, data is assumed to be accepted. If it is questioned then it must be addressed by a secondary argument. In contrast, Ver Linden has argued that the Toulmin model should be expanded to include `verifiers' for data [494]. Explicit verifiers involve a further argument, which concludes that the data in the original argument is correct. These include citations as well as references to common knowledge and to personal background. Implicit verifiers stem from the observation that arguers often do not express a clear rationale for accepting data. Instead, people provide a range of cues that are intended to convince the recipient that data is correct. Ver Linden argues that such "sincerity cues probably differ from culture to culture and in the general American culture they include the use of eye contact, tone of voice, and facial expression and other signs of emotion appropriate to the subject, as well as language that emphasises the arguer's sincerity". Implicit verifiers are suggested by the person supporting the claim. In contrast, inferred verifiers are supplied by the receiver without explicit suggestion by the claimant. For instance, the reader of an incident report may believe in an assertion simply because it has been made by a national transportation safety board. It is important to emphasise that this does not imply that the arguers are unaware of the likelihood that recipients will form such inferences.
For instance, a national transportation safety board might make an assertion and state it as a fact without citing any source. In such circumstances, they rely on the belief that their reputation will carry a particular weight with the intended audience. This analysis has important implications for the presentation and analysis of incident reports. Data should be supported by explicit verifiers rather than the suggestive expressions of belief provided by implicit verifiers or any reliance on reputation to support the use of inferred verifiers. Ver Linden's analysis not only applies to the backing for data, it can also be applied to the backing that supports warrants and rebuttals. For instance, the backing for the rebuttal in Figure 13.4 clearly relies upon an inferred verifier because no evidence is supplied to support the observation that most pilots use accommodation ladders. The previous paragraph focused on practical extensions to the Toulmin model. A number of other authors have raised more theoretical objections to this approach. For instance, Freeman introduces the notion of `gappiness' between a warrant and some data [280]. Warrants can be thought of as inference rules that allow us to move from data to a claim. They are only necessary because the reader senses a gap between the data and the claim that it supports. Freeman's revision has disturbing implications. It can be difficult to predict where readers might sense a `gap' in an underlying argument. Investigators might, therefore, attempt to exhaustively address all of the possible doubts that a sceptical reader might have about an incident report. Such an approach raises
further problems. Many supporting arguments would be unnecessary. The exhaustive approach would address gaps that would never occur to many of the readers of an incident report. For instance, it might be necessary to justify the reference to SOLAS in Figure 13.4. The importance and relevance of such agreements would be self-evident to all domain experts. The introduction of additional warrants would support a minority of readers but it would also increase the size and scope of incident reports. To summarise, Toulmin argues that it is possible to identify domain dependent procedures that help to define convincing arguments within a particular context. Freeman argues that, for structural reasons, some of those procedures relate to the individual reader's perception of gaps between claims and data. This implies that normative techniques, such as those proposed by Locker [498], may fail to identify the individual information needs of particular readers. There are currently two practical means that can be used to address Freeman's more theoretical caveats. One technique involves the use of user-testing and experimental analysis to determine whether or not domain dependent procedures are sufficient for a broad cross-section of the intended readership of an incident report. This approach is described in more detail in the next section. It will not address the individual information needs identified by Freeman. It can, however, increase confidence that the argument in an incident report provides sufficient backing to convince a specified proportion of its intended audience.
Figure 13.5: Snowdon's Tool for Visualising Argument in Incident Reports (1).

Freeman's caveats about the individual perception of `gappiness' in argument structures can also be addressed by tool support. These techniques enable readers to view the argument that supports an incident report at a number of different levels of granularity. For example, Snowdon has developed a tool that is based on the Conclusion, Analysis and Evidence structures that were introduced in Chapter 9. This is a simplification of the full Toulmin model that is specifically intended to support the analysis of incident and accident reports [415]. When the reader of an incident report initially
uses the tool, they are presented with a simple overview of the highest-level argument. This typically consists of a node that lists the conclusions of the report. By clicking on one of those conclusions they can expand their view of the argumentation in the report to look at the evidence or data that supports a conclusion. By clicking on that data, they can expand their view of the warrant, or analysis, that connects the evidence to the conclusion. This interactive process continues until the user reaches the bottom level in the system. These leaf nodes represent the paragraphs of prose that have been written by the investigators. Figure 13.5 illustrates a partial expansion of the argument structure in an aviation incident report. Figure 13.6 represents the end result of continuing to ask for more information about the investigator's argument. The reader is free to continue to ask for more justification until they reach the sections of prose written by the investigators.
Figure 13.6: Snowdon's Tool for Visualising Argument in Incident Reports (2).

Snowdon's tool provides an alternative means both of verifying the argument that is presented in an incident report and of presenting the contents of the report to end users. Investigators can explore the graphical hierarchy to ensure that sufficient backing has been provided for key arguments. Readers can also use the graphical interface to rapidly fill any gaps between data and claims. The key benefit, from their perspective, is that they need only request additional information about those areas of an argument that they perceive to require additional support. Extensions to the system can also log those areas of a report where users repeatedly ask for additional warrants. Such information can help to guide the presentation of future incident reports.
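The expand-on-demand interaction that underpins this approach can be illustrated with a short sketch. The following Python fragment is a minimal illustration and not Snowdon's implementation: the class names, node structure and example prose are all assumptions made for this sketch.

```python
# Illustrative sketch only: a tree of Conclusion/Analysis/Evidence nodes
# that readers expand on demand, in the spirit of the tool described above.

class ArgumentNode:
    def __init__(self, kind, text, children=None):
        self.kind = kind               # 'conclusion', 'analysis' or 'evidence'
        self.text = text               # prose drafted by the investigators
        self.children = children or []
        self.expanded = False          # leaf paragraphs have no children

    def expand(self):
        """Reveal the backing for this node, as a reader's click would."""
        self.expanded = True
        return self.children

    def render(self, depth=0):
        print("  " * depth + f"[{self.kind}] {self.text}")
        if self.expanded:
            for child in self.children:
                child.render(depth + 1)

# A hypothetical fragment of an aviation-style report.
report = ArgumentNode("conclusion", "The crew lost situation awareness", [
    ArgumentNode("evidence", "CVR transcript shows no altitude call-outs", [
        ArgumentNode("analysis", "Call-outs are required by company SOPs"),
    ]),
])

report.render()                # initially only the conclusion is visible
report.expand()                # reader clicks on the conclusion...
report.children[0].expand()    # ...and then on the supporting evidence
report.render()
```

Logging which nodes readers expand most often would provide exactly the kind of record of repeated requests for warrants that is described above.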
13.3.2 Validation

There is no direct evidence that any of the techniques described in the previous section will contribute to a sustained improvement in the quality of incident reports within complex, real-world applications. These approaches are the product of research initiatives that have been commissioned by regulatory and other government organisations because of the perceived weaknesses in existing reporting techniques. A number of groups, including my own, are using several of the more recent verification techniques as part of an initial `field trial'. These studies are intended to yield the evidence that many will require before introducing such `leading edge' techniques. Until such evidence is available there remains a strong suspicion that the more elaborate approaches, such as the use of formal logics, may contribute little beyond what can be achieved through a careful reading of the prose in an incident report. Two principal objections can be raised to this argument:
1. Why do so many people criticise the quality of incident reports if careful reading is sufficient to identify the enthymemes and other rhetorical effects that reflect systematic biases and inappropriate assumptions about the potential readership of these documents?

2. Careful reading is, typically, conducted by other investigators. Those investigators often do not reflect the broad range of skills and expertise that characterise the intended audience of an incident report. This is significant because, as Freeman suggests [280], the background of the reader can help to determine the sufficiency of an argument. In other words, members of an investigatory organisation may fail to identify the `gaps' that will be identified by the eventual readership of an incident report.

The following sections explore the substance of these objections. Firstly, we assess empirical work to determine the nature of existing criticisms against the presentation of incident reports. Subsequent sections then describe a range of techniques that have been used to identify particular weaknesses in individual reports. These validation techniques differ from the approaches that were described in the previous section because they focus on end-user testing to assess the utility of these documents. In contrast, verification techniques look more narrowly at whether particular documents satisfy a range of technical properties that can, typically, be established without direct user testing.
User Testing

It is surprising how little research has actually been conducted into the presentation of incident and accident reports. The format described in the previous sections of this chapter has remained largely unaltered for at least one hundred and thirty years [357]. Most of the previous studies commissioned by investigatory and regulatory organisations have focussed on recurrence rates to demonstrate that a reporting scheme has had the desired effect. Relatively little work has been conducted to determine whether any observed improvements might be increased by changing the presentation and dissemination of incident reports. This omission led Snowdon to survey a number of safety managers in an attempt to identify particular attitudes towards incident reporting practices. This study raised considerable practical challenges. In particular, it proved difficult to obtain responses by contacting individuals in their workplace. The sensitive nature of their job can be argued to create a justifiable nervousness in replying to surveys that address their relationship with investigatory and regulatory organisations. He was, therefore, forced to issue questionnaires to a random sample of safety managers at a trade conference. Steps were taken to ensure that the delegates were registered participants at the meeting but assurances were also provided to protect the respondents' anonymity. The average age of respondents was 43 years. The average experience in a field relevant to safety management was 13 years. These constraints limit the generalisations that can be made from the results of this work. Snowdon argues that it can only be seen as a pilot study but that his observations are strongly suggestive of the existing limitations with incident and accident reports.

  Question                                                             Positive responses (N=27)
  Do you use accident reports to inform you of design problems?                 19
  Do you only read reports that you feel are related to your areas?             17
  Do you read the whole of the report?                                          24
  Do you assess the conclusions by checking the evidence or other
  forms of analysis?                                                            22

Table 13.7: Summary of Results from Snowdon's Survey of Accident Reporting Practices [749]

Table 13.7 presents an overview of the results from part of Snowdon's survey. As can be seen, he focuses on attitudes towards the larger-scale documents that report on accidents and high-consequence incidents. It is interesting to note the relatively high number of respondents who claim to read the entire report and who check the conclusions against either the evidence or the analysis that is presented in these documents. He also documents a number of responses to more
open questions about the nature of such reports [748]. Some of these responses provide additional evidence for the broad results that are summarised in Table 13.7. For example, one safety manager was able to illustrate their detailed knowledge of a report that had a direct impact upon their working life. The level of detail in the following response is highly indicative both of the checking that this individual had been motivated to perform and of a careful reading of the entire report: "They are poorly structured and are often not tailored to their audiences mixed ability to get an overall picture. Many scenarios are hidden in different parts of the report, e.g. in the Watford junction train accident in paragraph 143/4 it says a signal sighting committee should have been set up before the accident. After the accident this committee suggested that the removal of some trees would increase the sighting distance. In paragraph 171 it says that if the train had braked earlier, i.e. given the removal of the trees, there would not have been an accident." (Cited in [748]) Some of these responses reveal the continuing perception that many of these documents reflect a blame culture: "investigators usually assume that the victim is the sole cause even when the equipment has glaring design defects that entrap the user". Other responses are less easy to interpret. For instance, one safety manager identified the "tendency on the part of the reporter to write report to support his or her conclusions rather than openly evaluate all of the information/evidence that collected about the event." At one level, this seems to reflect an awareness of the confirmation bias that is described in Chapter 11. At another level, it is difficult to know how an investigator could consider every aspect of the evidence that is obtained from a primary and secondary investigation. As we have seen, the drafting of an incident or accident report inevitably involves a filtering or selection process. Snowdon's survey also provides some confirmation of the analysis that is presented in previous sections of this chapter. Ver Linden's [494] comments on the importance of `verifiers' are illustrated by the following comment: "they present their account as definitive without acknowledging missing evidence or contradictory opinions". Some of the comments also relate to the concept of `gappiness' proposed by Freeman [280]. One safety manager wrote of "conclusions draw that do not appear to be `in line' with facts presented; change in level of strength of assertions - report starts with `it may have been due to' and ends with `this was because'." Other feedback relates more strongly to the normative application of Toulmin's ideas proposed by Locker [498]. For example, the following quotation emphasises the importance of explicitly addressing both the qualifiers and rebuttals that can limit the scope of an argument: "A good report is one w(h)ere all possibilities are considered and a balanced view is taken. If the report is not able to be conclusive then so be it. All possible causes are listed." (Cited in [748]) As with the previous comment about accounting for every item of evidence, this response illustrates the high expectations that many safety managers have for the presentation of information in accident and incident reports. Chapter 11 describes how causal asymmetries prevent investigators from identifying all of the possible causes for any observed set of effects.
The results of previous studies into the theoretical and technical foundations of causation have, therefore, had little impact on the practical activities of safety managers. This is surprising given that some of the responses to the survey show a considerable level of knowledge about previous studies of accidents and incidents. These studies have clearly provided a vocabulary with which to voice their criticisms: "The report just contains direct causes of accident but information of underlying factors of accident is not available. Accident report should encompass not only direct causes but also proximal causes as well as distal causes. It means that failures mechanism should cover not only operative's failure but also management and organisational failure including design failure". (Cited in [748]) The main impression gained from an analysis of these various comments was that many safety managers are unhappy with the structure and format of the information that is presented in incident reports. Several comments related to the problems of using these documents to inform their
daily activities: "...the information I am interested in (human error contribution to the accident and cognitive factors in general) is dispersed over different sections of the report, and that related information might be `hidden' and I will have to spend a lot of time and effort trying to find it". It can be argued, however, that incident and accident reports are not intended to be used in this way. Their primary purpose is to trigger regulatory action by the bodies that commission individual reports. They are not intended to inform local initiatives by local safety managers. Such a response ignores the high level of interest in these documents that is reported by the safety managers that Snowdon surveyed. They seem to perceive these documents as important components of any safety management system; documents that should be, and are, read as part of their normal working activities. The negative reaction to existing reporting techniques can be summarised by the following response: "It is becoming less uncommon to find a report that reflects little effort at gathering evidence. Consequently the analysis and conclusions are shallow. Often when the analysis is shallow, the few facts available are over-emphasised, as if the writer knows the facts are insufficient and attempts to cover by reaching firm conclusions." (Cited in [748]) Unfortunately, there are theoretical and practical barriers that prevent investigators from addressing some of the criticisms that were voiced in Snowdon's survey. As we have seen, it is often impossible to accurately summarise all of the evidence that is collected in the aftermath of an adverse occurrence or near-miss incident. Similarly, it is infeasible to consider every possible cause of a failure. Freeman [280] provides a possible solution when he emphasises the subjective nature of the problems that many readers experience when they read complex documents. He argued that the notion of a `convincing' argument depends upon the information that a reader requires in order to bridge the divide between a claim and some evidence. It, therefore, follows that any claims about the sufficiency of an argument make little sense without additional validation to assess whether or not the intended audience can make the necessary connections. User testing provides one means of assessing whether or not particular groups of individuals are convinced by the argument in an incident report. This approach is described in a large number of introductory textbooks. The following pages, therefore, briefly summarise the main features of the available techniques. The interested reader is directed to [686] and [740] for a more sustained analysis. It is possible to distinguish between two different approaches to validation that can be applied to assess the quality of incident reports. Summative techniques can be used at the end of the production phase when a report is ready to be issued. In contrast, formative evaluation techniques can be used to determine whether particular sections of a report provide the feedback that safety managers and regulators need to complete their tasks. Formative evaluation helps to guide, or form, the decisions that must be made during the drafting of an incident report. The importance of this form of incremental user testing depends on the scale of the incident report. It also gives rise to an important paradox. Reports into `higher-risk' incidents are, typically, longer and more complex than those into lower-risk occurrences or near misses.
In consequence, there is a greater need to identify potential problems in the presentation of material about the incident. Any potential `gaps', omissions or inconsistencies should be identified well before the document is published. In contrast, it is precisely these documents that create the greatest concerns about security and media interest prior to publication. These concerns, mentioned in the opening sections of this chapter, create considerable practical problems when recruiting subjects to provide feedback on the information that is contained in a draft report. In contrast to formative evaluation, summative evaluation takes place immediately prior to the publication of an incident report. This approach can be problematic unless it is supported by other forms of quality control. It can be costly and time-consuming to make major structural changes to the argument that is presented in an incident report at this late stage in the investigation process. User testing can, occasionally, identify rebuttals that are not addressed either by existing arguments or by available evidence. This is particularly the case when investigators may not have the same degree of domain expertise as the individuals reading the report [248]. Summative evaluation can identify these potential doubts at a time when there are insufficient resources available to commission additional studies of the available evidence. User-testing methods have been widely applied to computer systems and to the documentation
that is intended to support them. As we shall see, some of these techniques can be applied to support the validation of incident reports. The following list identifies some of these approaches and indicates whether they offer the greatest benefits for summative or formative evaluations:
Scenario-Based Evaluation. Scenarios, or sample traces of interaction, can be used to drive both the drafting and evaluation of an incident report. This approach forces investigators to identify key requirements of an incident report. These requirements are summarised as brief descriptions of the sorts of prototypical tasks that different readers might want to perform with such a document. The resulting scenarios resemble a more detailed form of the descriptions that are presented in Table 13.1, which identified the reporting requirements that must be satisfied by the various accident and incident reports that are published by the Hong Kong Marine Department. As the drafting of a report progresses, investigators can ask themselves whether the document that they have prepared might be used to complete the tasks that are identified in each of the scenarios. The `evaluation' continues by showing a colleague what it would be like to use the document to achieve these identified tasks. This technique can be used at the very earliest stages of drafting a report and hence is a useful approach to formative evaluation. The problem with the use of scenarios is that it can focus designers' attention upon a small selection of tasks and users. For instance, a scenario might require that the Director of Marine Operations should be able to rapidly identify any regulations that were violated in the course of an incident. Such a scenario would not provide confidence that an engineer would be able to use the same document to rapidly identify detailed design recommendations. A further limitation is that it is difficult to derive empirical data from the use of scenario-based techniques. Investigators may be able to convince their colleagues that a draft report satisfies the proposed requirements. This need not increase confidence that others will reach the same conclusions.
Experimental Techniques. The main difference between the various approaches to evaluation is the degree to which investigators must constrain the reader's working environment. In experimental approaches, there is an attempt to introduce the empirical techniques of scientific disciplines. It is, therefore, important to identify the hypothesis that is to be tested. The next step is to devise an appropriate experimental method. Typically, this will involve focusing upon a particular aspect of the many tasks that might eventually be supported by an incident report. For example, a safety manager might be asked to reconstruct the events leading to an incident after having read a report for some specified amount of time. The reconstructions might then be examined to demonstrate that potential readers can more accurately recall these events in one version of the report than in another. In order to avoid any outside influences, tests will typically be conducted under laboratory conditions. Individuals are expected to read the draft report without the usual distractions of telephones, faxes, other readers and so on. The experimenter must not directly interact with the reader in case they bias the results. The intention is to derive some measurable observations that can be analysed using statistical techniques. (A sketch of this style of timing analysis follows this list of techniques.) There are a number of limitations with the experimental approach to evaluation. For instance, by excluding distractions it is extremely likely that investigators will create a false environment. This means that readers may be able to draw inferences from a report more quickly and with greater accuracy than might otherwise be obtained within a noisy, complex working environment. These techniques are not useful if investigators only require formative evaluation for half-formed hypotheses. It is little use attempting to gain measurable results if you are uncertain what it is that you are looking for.
Cooperative Evaluation Techniques. Laboratory-based evaluation techniques are useful in the final stages of summative evaluation. In contrast, cooperative evaluation techniques (sometimes referred to as `think-aloud' evaluation) are particularly useful during the formative stages of drafting an incident report. They are less clearly hypothesis driven and are an extremely good means of eliciting feedback on the initial drafts of a document. The approach is extremely simple. The experimenter sits with the reader while they work their way through a series of tasks with a potential report. This can occur in the reader's working context or in a quiet
room away from the `shop-floor'. The experimenter is free to talk to the reader but it is obviously important that they should not be too much of a distraction. If the reader requires help then the investigator should offer it and note down the context in which the problem arose. These requests for help represent the `gaps' identified by Freeman [280]. Additional evidence or warrants may be necessary to support the claims that are made in an incident report. The main point about this exercise is that the reader should vocalise their thoughts as they work with the draft. This low-cost technique is exceptionally good for providing rough and ready feedback. Readers can feel directly involved in the drafting of a final document. (A sketch of how such help requests might be logged also follows this list.) The limitations of cooperative evaluation are that it provides qualitative feedback rather than the measurable results of empirical science. In other words, the process produces opinions and not numbers. Cooperative evaluation also works badly if investigators are unaware of the political and other pressures that might bias a reader's responses.
Observational Techniques. There has been a sudden increase in interest in the use of observational techniques to help `evaluate' a range of computer-based systems [751], of management structures [396] and of safety-critical working practices [120]. This has largely been in response to the growing realisation that the laboratory techniques of experimental psychology cannot easily be used to investigate the problems that individuals can experience in real-world settings. Ethnomethodology requires that a neutral observer should enter the users' working lives in an unobtrusive manner. They should not have any prior hypotheses and should simply record what they see, although the recording process may itself bias the results. This approach provides a great deal of useful feedback during an initial requirements analysis. In complex situations, it may be difficult to form hypotheses about readers' tasks until investigators have a clear understanding of the working problems that face their users. This is precisely the situation that affects many incident reporting systems; the individuals who run these applications often have very limited information about the ways in which others in their organisation, or in other organisations, are using the information that they gather [748]. There is a natural concern to be seen to meet existing regulatory requirements. Consequently, it can be difficult to interpret an organisation's response when asked whether or not a particular report has guided their operating practices. Ethnography focuses not on what an organisation says that it does but on what key individuals and groups actually do in their everyday working lives. Unfortunately, this approach requires considerable skill and time. It is extremely difficult to enter a working context, observe working practices and yet not affect the users' behaviour in any way. There have been some notable examples of this work. For example, Harris has shown how these techniques can be used to improve our understanding of midwives' behaviour in a range of safety-critical applications [306]. Her work also illustrates the potential drawbacks of the approach; her observations are grounded in several years of experience observing their working environment.
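The experimental approach described above can be made more concrete with a short sketch. The figures below are invented for illustration; neither the data nor the analysis are drawn from any of the studies cited in this chapter.

```python
# Hypothetical completion times (seconds) for one recall task, measured
# under laboratory conditions with two versions of a draft report.
from statistics import mean, stdev
from math import sqrt

version_a = [268, 240, 310, 255, 290]   # original draft
version_b = [190, 205, 230, 180, 215]   # revised draft

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples."""
    na, nb = len(xs), len(ys)
    va, vb = stdev(xs) ** 2, stdev(ys) ** 2
    return (mean(xs) - mean(ys)) / sqrt(va / na + vb / nb)

print(f"A: mean={mean(version_a):.0f}s  B: mean={mean(version_b):.0f}s")
print(f"Welch t = {welch_t(version_a, version_b):.2f}")
```

Similarly, the help requests elicited during cooperative evaluation can be logged and aggregated so that repeatedly queried sections stand out as candidate `gaps' in Freeman's sense. Again, this is an illustrative sketch with invented session data, not a description of any existing tool.

```python
# Log (reader, report section, question) tuples during a think-aloud
# session; sections that attract repeated requests are candidates for
# additional warrants or evidence.
from collections import Counter

help_requests = []

def log_help(reader, section, question):
    help_requests.append((reader, section, question))

log_help("reader-1", "3.2 Analysis", "Why is SOLAS relevant here?")
log_help("reader-2", "3.2 Analysis", "What backs this warrant?")
log_help("reader-2", "5.1 Findings", "Where is the supporting evidence?")

gaps = Counter(section for _, section, _ in help_requests)
for section, count in gaps.most_common():
    print(f"{section}: {count} request(s) for additional support")
```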
As mentioned, these techniques have been widely applied to evaluate the usability and utility of computer-based systems and of more general forms of documentation [686]. They have not been widely used to validate the presentation of incident reports. It is for this reason that McGill conducted a series of initial investigations to determine whether some of these approaches might support such evaluations [530]. This work formed part of a wider study that was intended to determine whether the use of electronic presentation techniques supports or hinders the presentation of incident reports. These wider findings are discussed in the closing sections of this chapter. For now it is sufficient to observe that the tasks were designed to determine how well readers could use the published report to: identify key events that took place at the same time during an incident but in different areas of the system; discover the timing and location of key events; and identify whether or not an individual was involved in particular events. His study used laboratory-based, cooperative evaluation techniques but he also derived a number of high-level performance measures during these tasks. Brevity prevents a complete analysis of McGill's results. It is, however, possible to illustrate some of the findings. For instance, readers were relatively proficient at using paper-based reports to find answers to factual questions. Questions of the form `write down the time that the Chief Officer left the mess room to return to the bridge' were answered in an average of 4 minutes and 28 seconds using a relatively small sample of only five readers. In contrast, tasks that
involved the resolution of conflict or that related to causal hypotheses took far longer to resolve. For instance, none of the users were able to answer `Officer A gave conflicting evidence as to the time at which he left G deck to go to the mess room - write down the page reference where these conflicting statements are highlighted'. McGill's work, like that of Snowdon [748], is suggestive rather than conclusive. It provides an indication of some of the problems that readers face when attempting to use existing incident reports to perform a particular set of tasks. The study makes a number of recommendations about how those problems might be addressed for an individual incident report and hence indicates how lab-based studies might be used to support both formative and summative evaluation. These preliminary studies raise more questions than they answer. The subject groups are too small to provide results that can easily be generalised beyond the specific context of the particular evaluations. Previous studies have also been conducted away from the reader's working environment. Snowdon's study looked at safety managers' attitudes during the `atypical events' of a trade conference. McGill's study took place within a University laboratory. Neither of these environments approaches the ecological validity that is a focus for the ethnographic techniques that are summarised in previous paragraphs. McGill's work also identified significant problems in accounting for learning effects. In order to demonstrate that any changes to an incident report had addressed the problems identified in previous studies, he was forced to retest subjects. Unfortunately, the readers of the reports were able to use the knowledge gained with earlier versions of the report to perform some of the subsequent tasks. He, therefore, used elaborate counter-balancing to ensure that some of the readers were presented first with the revised version of the report and others with the initial version. Unfortunately, this leads to further problems where preliminary drafts provide greater support for some tasks than the subsequent versions. The complex nature of prose incident reports makes it very difficult to be certain about which particular aspects of a document actually support particular user tasks. The previous observation leads to further reservations. As mentioned, we know remarkably little about the diverse ways in which readers exploit incident reports. The same documents have been used to inform the drafting of recommendations that are intended to avoid future incidents; they have also been analysed as part of wider statistical summaries; they have been added to international databases that can be searched by interactive queries; and so on [413]. It is, therefore, difficult to identify appropriate tasks that might help to drive any experimental evaluation. Those studies that have been conducted have all commented on the difficulty of `pinning down' reporting organisations to a precise list of the tasks that these documents are intended to support.
Figure 13.7: MAIB On-Line Feedback Page
Even if we could agree upon an appropriate selection of tasks, it is difficult to envisage ways in which studies might move beyond the test-retest approach taken by McGill. This approach attempts to identify small differences between different formats by testing and retesting users with various versions of a report over a short period of time. It is likely, however, that any changes to the format of incident reports may be introduced as part of more systematic changes and that these changes could have consequences for the long-term application of the information that they contain. I am unaware of any longitudinal studies into the strengths and weaknesses of particular presentation styles for incident reports. This forms a strong contrast with the many studies that have been conducted into the usability and utility of design documents [94, 512]. None of the investigation agencies that I have contacted throughout the preparation of this book have reported the use of direct user-testing to improve the quality of their incident and accident reports. This does not imply that such organisations are not concerned to elicit feedback about the information that they provide. For instance, Figure 13.7 illustrates how the UK MAIB provides a web-based form to elicit feedback about its work. Unfortunately, there is no publicly available information about the insights that such systems provide about the presentation of incident reports. It is instructive to observe, however, that this feedback page asks respondents to indicate whether their information relates to "ship's officer/crew, shipping company, fishing skipper/crew, fishing company, leisure craft, insurance, training/education, legal, government or other". This diverse list again illustrates the problems of devising appropriate reports that might satisfy the different information requirements of all of these groups. The use of electronic feedback forms to elicit information from the intended readers of an incident report is a relatively recent innovation. It reflects a far wider move towards the use of electronic systems in the presentation and dissemination of information about near-miss incidents and adverse occurrences.
13.4 Electronic Presentation Techniques

Many of today's accident reports exploit the same presentation format and structure that was first used at the beginning of the twentieth century. This would not be a concern if the nature of accident investigation had not changed radically over the same time period [248]. Chapter 3 has described how the introduction of computer technology has rapidly increased the integration of heterogeneous production processes. This has now reached the point where isolated operator errors or single equipment failures have the potential to create complex and long-lasting knock-on effects within many different applications. Accident reports must now not only consider the immediate impact of an accident but also the way in which emergency and other support services responded in the aftermath of an incident. This was illustrated by the Allentown incident, described in Chapter 9. Technological change is only one of the factors that have changed the nature of accident reporting. There has, for example, been a move away from blaming the operator. This wider perspective has resulted in many regulatory organisations looking beyond catalytic events to examine the organisational issues that form the latent causes of many failures. These two factors, technological change and new approaches to primary and secondary investigation, have increased the length of many incident reports. The scale and complexity of such documents imposes clear burdens upon the engineers, managers and operators who must understand their implications for the maintenance, development and running of many safety-critical systems. Recently, however, a number of regulatory authorities have begun to use the web as a primary means of disseminating accident reports. The MAIB "are in the process of converting all reports into a format suitable for viewing and downloading from this web site" [518]. Electronic media are perceived to offer considerable support for the communication of these documents. It is difficult to obtain direct evidence that the changing nature of incident investigation and the complexity of technological failures have increased the length of incident reports. Tables 13.8 and 13.9 show how the number of pages required by the ATSB's marine occurrence reports rose between 1991 and 1999. These two years were chosen because 1991 is the earliest year for which it is possible to obtain a relatively complete collection of reports. The fact that it can be difficult to obtain particular documents is itself worth noting. 1999 is the most recent year for which a complete
collection can be obtained, given that some investigations that were started towards the end of 2000 have not published their findings. It was possible to obtain copies of 11 of the 13 reports that were published in 1991. These extended to an average of 28.5 pages; the standard deviation was 7.6 and the total number of pages was 314. In 1999, the average had risen to 35 pages per report over 10 incidents, with a standard deviation of 6.63.

  Report Number   Number of Pages
  27              27
  28              28
  29              Unavailable
  30              44
  31              23
  33              Unavailable
  42              21
  32              29
  34              18
  35              34
  36              23
  37              38
  38              29

Table 13.8: ATSB's 1991 Marine Report Page Counts

  Report Number   Number of Pages
  143             40
  144             27
  145             23
  146             35
  147             29
  148             37
  149             40
  150             42
  151             35
  152             42

Table 13.9: ATSB's 1999 Marine Report Page Counts
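The summary statistics quoted above can be cross-checked directly against the page counts in the two tables. The following sketch uses Python's statistics module with the sample standard deviation; the 1991 value comes out at 7.69 rather than the 7.6 quoted in the prose, which presumably reflects truncation rather than rounding.

```python
# Cross-check of the figures quoted in the text, computed from the page
# counts in Tables 13.8 and 13.9 ('Unavailable' reports are excluded).
from statistics import mean, stdev

pages_1991 = [27, 28, 44, 23, 21, 29, 18, 34, 23, 38, 29]   # 11 of 13 reports
pages_1999 = [40, 27, 23, 35, 29, 37, 40, 42, 35, 42]

print(f"1991: total={sum(pages_1991)}, "
      f"mean={mean(pages_1991):.1f}, sd={stdev(pages_1991):.2f}")
print(f"1999: mean={mean(pages_1999):.1f}, sd={stdev(pages_1999):.2f}")
# -> 1991: total=314, mean=28.5, sd=7.69
# -> 1999: mean=35.0, sd=6.63
```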
A certain degree of caution should be exercised over the interpretation of these figures. For instance, the style of presentation has changed radically over this period. In particular, the introduction of photographic evidence has considerably lengthened later reports. It can also be argued that this trend is atypical. The ATSB reports focus on high-risk incidents; they also represent the publications of a single national agency. Further evidence can be found to support the hypothesis that changes in the scope and complexity of incident investigations have had the knock-on effect of increasing the length of many incident reports. Table 13.10 illustrates how word counts can be used to strip out the formatting differences that distort the page counts shown in Tables 13.8 and 13.9. These word counts were derived from the less critical incidents that are reported in the UK MAIB's Safety Digest publication. Unfortunately, a number of further problems affect such an analysis. The 1996 volume examined approximately 54 incidents while the 2000 edition considered more than 110. It is for this reason that Table 13.10 considers the word counts for the first five incidents in the first number of each volume for each of the years.

  Volume and number                        Year   No. of Words, First 5 Incidents
  Safety Digest 1/00                       2000   2845
  Safety Digest 2/99 (1/99 unavailable)    1999   2658
  Safety Digest 1/98                       1998   4577
  Safety Digest 1/97                       1997   2720
  Safety Digest 1/96                       1996   2178

Table 13.10: UK MAIB Safety Digest Incident Word Counts

As can be seen, however, the results of this analysis provide only partial support for the general hypothesis of increased page counts. The sudden rise in 1998 cannot be explained by a change in the investigatory process nor by any sudden increase in the complexity of maritime incidents. In contrast, this sudden rise can be explained by the introduction of more detailed `lessons learned' summaries into individual incident reports. Recent initiatives to exploit electronic presentation and dissemination techniques cannot be explained simply in terms of the changing nature of incidents or new investigatory techniques. There are strong financial incentives that motivate the use of web-based systems to support incident reporting. There are considerable overheads involved in ensuring that investigatory and regulatory organisations have an `up to date' inventory of previous reports. The growth of the Internet also offers the possibility of disseminating reports to organisations and individuals who might not have taken the trouble to order and pay for paper-based versions of an incident report. Of course, there are obvious security concerns associated with such electronic techniques. There are also a host of additional benefits for the automated indexing and retrieval of large-scale incident collections. These retrieval issues will be discussed in the next chapter. In contrast, the following pages identify a range of techniques that are intended to support the electronic presentation of incident reports.
13.4.1 Limitations of Existing Approaches to Web-Based Reports

Two primary techniques have been used to support the electronic dissemination of incident reports: the Hypertext Mark-up Language (HTML) and Adobe's proprietary Portable Document Format (PDF). Figure 13.8 illustrates the use of HTML by the UK MAIB in their Safety Digest, mentioned above. Users can simply select hyperlinks to move between the pages that describe similar incidents.
Figure 13.8: MAIB Safety Digest (HTML Version)

This use of the web offers a number of benefits for the presentation of incident reports. It avoids the overheads of maintaining a catalogue of paper-based documents. It is important to note,
however, that very few agencies intend to entirely replace paper-based incident reports. For instance, the Transportation Safety Board of Canada includes an explicit disclaimer on their web site: "These documents are the final versions of occurrence investigation reports as approved by the Transportation Safety Board. The TSB assumes no responsibility for any discrepancies that may have been transmitted with the electronic versions. The printed versions of the documents stand as the official record." [788]. HTML offers a number of further benefits. No special software is needed beyond a browser, such as Netscape Navigator or Internet Explorer, which are now installed as a standard feature of most personal computers. The introduction of hyperlinks and on-line keyword search facilities also helps to reduce the navigation problems that frustrate the readers of paper-based documents. There are, however, a number of limitations. Previous research has identified both perceptual and cognitive problems associated with the on-screen reading of technical documents [657]. This explains the subjective difficulties that readers report when using large HTML documents. The improved comprehension that can be obtained through the appropriate use of hyperlinks as structuring tools can be jeopardised if on-line documents simply replicate the linear structure of paper-based reports [670]. It can take almost twice as long to read electronic copy [555]. Readers are also more prone to error when reading on-line documents [875]. HTML tags work well when they are interpreted and displayed by current web browsers. They do not work well when the same browsers are used to obtain printed output from HTML documents. Figures and photographs are often embedded as hypertext links in existing HTML reports. These will be missing in the printed version and readers must manually piece together their hardcopy. Many of these limitations can be avoided through the use of Adobe's proprietary PDF. It is for this reason that the MAIB publish their Safety Digest in both HTML and PDF formats. This exhaustive approach is, however, rare. Most agencies use either PDF or HTML. Figure 13.9 presents an excerpt from a PDF report into a marine incident published by the ATSB. The freely available PDF viewer integrates images and text to emulate the printed document on the screen. Readers can also obtain well-formatted, printed copies. This reduces the psychological and physiological problems of on-screen reading. However, these important benefits must be balanced against a number of problems. Firstly, it can be difficult for people to obtain and correctly install the most recent version of the PDF reader. This is important because these programmes are, typically, not a standard part of most browsers. Although PDF files are compressed, users can also experience significant delays in accessing these documents compared to HTML reports. Finally, it can be difficult to extract information once it has been encoded within a PDF document. This is a significant barrier if readers want to compile their own index of related incidents within an industry. The ATSB avoid this problem by providing extensive summaries of each incident in HTML format. Readers access the full PDF version of the report by clicking on a hyperlink within the HTML summary. This enables search and retrieval tools to index the HTML summary so that readers can easily find it using many of the existing search engines. They can then read and print the report in the PDF format, which is less easy to index because of Adobe's proprietary encoding.
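The hybrid strategy just described amounts to publishing a short, indexable HTML summary that carries a link to the full PDF. The sketch below is a hypothetical illustration of that workflow; the report identifier, file names and synopsis are invented and no real agency's system is implied.

```python
# Sketch: emit a search-engine-friendly HTML summary pointing at the PDF.
from pathlib import Path

def write_summary(report_id, title, synopsis, pdf_name, out_dir="www"):
    """Write a minimal HTML summary page that links to the full report."""
    html = f"""<!DOCTYPE html>
<html><head><title>{title}</title></head>
<body>
  <h1>{title}</h1>
  <p>{synopsis}</p>
  <p><a href="{pdf_name}">Full report (PDF)</a></p>
</body></html>"""
    Path(out_dir).mkdir(exist_ok=True)
    Path(out_dir, f"{report_id}.html").write_text(html)

write_summary("mair-143", "Grounding of a bulk carrier",
              "The vessel grounded on a falling tide; fatigue was a factor.",
              "mair-143.pdf")
```

Because the summary is plain HTML, it remains indexable even though the linked PDF is not.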
The hybrid use of both HTML and PDF by the ATSB and the MAIB does not address all of the problems that affect the electronic presentation of incident reports. These formats are, typically, used to reproduce the linear document structure that has been exploited since the beginning of the twentieth century. This imposes significant burdens upon the reader and may fail to exploit the full potential of web-based technology [158]. It can be argued that investigatory authorities have focussed upon the electronic dissemination of accident reports over the web. Few, if any, have considered the opportunities that this medium provides for the effective presentation of these documents. The following paragraphs, therefore, describe how visualisation techniques from other areas of human-computer interaction can be used to support the electronic presentation of incident reports.
Figure 13.9: ATSB Incident Report (PDF Version)
13.4.2 Using Computer Simulations as an Interface to On-Line Accident Reports

Chapter 8 describes a range of simulation techniques that are intended to help investigators gain a better overview of the events leading to an incident. It is surprising that this approach is not more widely integrated into the on-line presentation of accident reports. For instance, Figure 13.10 shows how the NTSB's PDF report into a rail collision uses a still image from a 3-D simulation to provide readers with an overview of the event. This simulation was undoubtedly available to accident investigators. It would have been relatively simple to provide access to it for other readers alongside the PDF report. Instead the reader has to piece together events from more than 40 pages of prose.
Figure 13.10: NTSB Incident Report (PDF Version)

Such examples, arguably, illustrate a missed opportunity to exploit novel means of presenting information about near-miss incidents and adverse occurrences. Several of my students have, therefore, begun to use simulations as an interface to on-line reports. Users are presented with animations of the events leading to an incident. Their browser also simultaneously presents a set of links to those sections of the existing report that deal with the stage of the accident that is currently being simulated. The links are automatically updated as the simulation progresses. Users can stop a simulation at any point during its execution. They can then use their browser to retrieve the relevant sections of the text-based report. Figures 13.11 and 13.12 illustrate two different applications of this approach.
Figure 13.11: Douglas Melvin's Simulation Interface to Rail Incident Report (VRML Version)

The system in Figure 13.11 uses the scripting facilities of the Virtual Reality Mark-up Language (VRML) to generate a scale model of part of the Channel Tunnel. The technical details of the VRML approach are briefly introduced in Chapter 8. A free helper application is integrated into the browser so that users can view the pseudo-3D images. The user can select areas of the image to replace the simulation with an HTML page from the report into the incident. The intention is that the simulation will provide a high-level overview of the events leading to the failure. Readers can then use the animation as a means of indexing into the analysis and more detailed reconstructions that are presented in a more conventional format.
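The synchronisation between the animation and the report links can be reduced to a simple lookup from the simulation clock to a set of section anchors. The following sketch illustrates the idea only; the times, anchors and file names are invented and this is not the students' actual implementation.

```python
# Sketch: map the simulation clock onto the report sections that should
# be offered as links at that stage of the incident.
import bisect

# (start_second, report anchors relevant from that point onwards)
timeline = [
    (0,   ["report.html#departure"]),
    (540, ["report.html#first-alarm", "report.html#bridge-response"]),
    (780, ["report.html#collision", "report.html#evacuation"]),
]
starts = [t for t, _ in timeline]

def links_at(sim_seconds):
    """Return the report links to display at a given simulation time."""
    i = bisect.bisect_right(starts, sim_seconds) - 1
    return timeline[max(i, 0)][1]

print(links_at(600))
# -> ['report.html#first-alarm', 'report.html#bridge-response']
```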
Figure 13.12: James Farrel's Simulation Interface to Aviation Incident Report (VRML Version)

The approach in Figure 13.12 exploits similar techniques but focuses on cockpit interaction during an air accident. A series of still images can be updated using the controls in the centre of the screen. On either side of the simulation are sections that present the transcripts both from the cockpit voice recorders and from the Air Traffic Controllers. The prose from the accident report is presented at the bottom of the screen. All sections of the interface are updated as the user moves through the simulation. Unfortunately, simulations cannot be used to support the presentation of all aspects of an incident report. Chapter 8 reviews the problems that arise when these techniques are used to model the distal causes of an incident. It is relatively easy to simulate the immediate events surrounding a particular occurrence; it is less easy to recreate the management processes and regulatory actions that create the context for an incident. Similarly, it can be difficult to represent near misses or errors of omission. Readers and `viewers' often fail to detect that something which ought to have happened has not, in fact, been shown in the simulation.
13.4.3 Using Time-lines as an Interface to Accident Reports

The previous simulations attempted to recreate the events leading to failure. It is also possible to exploit more abstract visualisations to provide readers with a better overview of an incident [502]. For example, we have used the fault tree syntax that was introduced in Chapter 9 as a gateway into an incident report. As shown in Figures 9.9, 9.10 and 9.12, these diagrams can be used to map out both the proximal and distal causes of an incident. They can also be annotated in various ways; for instance, events can show the time or interval during which they are assumed to have occurred. Imagemap techniques provide a means of using these diagrams to index directly into an incident report. Imagemaps work by associating the coordinates of particular areas on an image with web-based resources. The net effect is that if the user clicks on a node in a fault tree, the browser can be automatically updated to show those sections of prose in an incident report that are represented by the graphical component. However, such representations quickly suffer from problems of scale. This is a significant limitation if the scope and complexity of incident reports is increasing in the manner suggested by the opening paragraphs of this section. Again, the desktop virtual reality provided by VRML and similar languages can be exploited to ease some of these problems. We have developed the pseudo-3D time-line shown in Figure 8.7 and the perspective wall, shown in Figures 8.8 and 8.9, to provide novel means of interacting with incident reports. These visualisations enable readers to access reports over the web. Rather than having to scroll over large, two-dimensional imagemaps of graphical structures such as fault trees, users can `walk' into these structures along the Z-plane. The claimed benefits of this approach include a heightened sense of perspective and the ability to focus on particular aspects of the structure by choosing an appropriate viewpoint. As before, readers can use their mouse to select areas of the three-dimensional models. The system then automatically updates an area of the browser to present the relevant areas of a textual report.
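The imagemap mechanism described above amounts to a table of node coordinates and target anchors. A minimal sketch of generating the corresponding HTML map element is given below; the coordinates, anchors and names are all invented for illustration.

```python
# Sketch: generate an HTML <map> element that ties fault-tree node
# bounding boxes to the report sections that each node summarises.
nodes = [
    # (x1, y1, x2, y2, report anchor)
    (10,  10, 110, 50, "report.html#proximal-causes"),
    (140, 10, 240, 50, "report.html#distal-causes"),
]

def fault_tree_map(map_name="fault_tree"):
    areas = "\n".join(
        f'  <area shape="rect" coords="{x1},{y1},{x2},{y2}" href="{href}">'
        for x1, y1, x2, y2, href in nodes)
    return f'<map name="{map_name}">\n{areas}\n</map>'

print(fault_tree_map())
# The emitted element is referenced from the diagram via
# <img src="fault_tree.png" usemap="#fault_tree">.
```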
Figure 13.13: Peter Hamilton's Cross-Reference Visualisation (VRML Version)

Unfortunately, the models shown in Figures 8.7, 8.8 and 8.9 were all built manually. This requires considerable skill, expertise and patience. Together with Peter Hamilton, I have developed a similar range of three-dimensional interfaces that can be generated directly from a textual incident report. This approach is illustrated in Figure 13.13. A colour-coded index is presented on the left of the image. This refers to each of the bars in the three-dimensional representation shown on the right-hand side. The image on the right represents time advancing into the Z-plane. Each bar, and therefore each colour in the index, relates to a chapter in the incident report. Links are drawn between any two bars that refer to the same instant in time during the accident. It is, therefore, possible to `walk' into the structure on the right to see whether different chapters treat the same events from slightly different perspectives. The lower area of the browser is used to present the HTML version of the report. Selecting a bar at any point in the Z-plane will result in the image being updated to show any text relating to that moment in time in the selected chapter.
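The cross-links in such a visualisation can, in principle, be recovered automatically by scanning each chapter for references to the same instant. The sketch below illustrates one naive way of doing so; the chapter fragments are invented and the hh:mm pattern is merely an assumption about how times appear in the prose, not a description of Hamilton's tool.

```python
# Sketch: find instants mentioned in more than one chapter, which become
# candidate cross-reference links between the chapter bars.
import re
from collections import defaultdict

chapters = {
    "Ch.3 Narrative":   "At 18:23 the officer left the bridge...",
    "Ch.5 Analysis":    "The decision taken at 18:23 is examined below...",
    "Ch.7 Conclusions": "Events after 18:28 could not be reconstructed.",
}

mentions = defaultdict(list)          # time -> chapters mentioning it
for chapter, text in chapters.items():
    for t in re.findall(r"\b\d{2}:\d{2}\b", text):
        mentions[t].append(chapter)

for t, chs in sorted(mentions.items()):
    if len(chs) > 1:                  # candidate cross-reference link
        print(f"{t}: link {' <-> '.join(chs)}")
```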
Many of these tools satisfy two different sets of requirements. For instance, the perspective wall and the three-dimensional time-line were introduced in Chapter 8 as tools to help investigators reconstruct the events leading to an incident. In this chapter, we have also argued that these techniques might support the presentation of accident and incident reports. Similar comments can be made about the use of location-based simulations. For example, the QuicktimeVR images shown in Figure 8.4 have been integrated into several incident reports. Similarly, the imagemap shown in Figures 8.1 and 8.2 was initially intended as an alternative interface to an incident report rather than as an aid to incident investigators. These interfaces also support a number of additional tasks. For instance, Schofield has pioneered the use of similar simulations within a range of legal proceedings [730]. This application of forensic animation is still very much in its infancy. The Civil Procedure Rules (CPR) substantially changed the legislative framework that governed the use of these systems in English courts. These rules came into force in April 1999 and gave judges considerable powers to control both the conduct and preparation of trials [292]. However, there is insufficient precedent for the use of these techniques and so much of the focus of Schofield's work has been to build up collections of `case law' from other legal jurisdictions. It is clear that the cost associated with the more innovative visualisations proposed in this section can only be justified for high-risk incidents. On the other hand, all of the interfaces shown here were produced using mass-market software that is available without charge or under shareware license agreements. It is also important to stress that the development costs were minimal. They were developed over a period of approximately 1-2 weeks by final-year undergraduate students participating in my course on safety-critical systems. The intention behind this exercise was to see whether people who knew little about incident reporting, but who knew more about the computer-based presentation of information, might come up with some innovative presentation ideas. There are, however, a number of caveats that must be raised about the potential benefits of these innovative presentation techniques. These can be summarised as follows:

1. The problems of navigating in current desktopVR environments. Many users experience considerable difficulty in moving along the three-dimensional time-lines shown in the previous paragraphs. This problem has been widely reported in previous work on the validation of desktopVR [202]. Several solutions have been proposed. For instance, recent versions of our software have made extensive use both of virtual way-points and of sign-posts [637]. This approach takes users on a `conducted tour' of an incident. Users do not have to concentrate on achieving a particular orientation at a particular point in three-dimensional space in order to view particular events. They can simply click on a menu item that will take them to those events. Their viewpoint and orientation in the virtual world is automatically updated to provide them with the best view of the surrounding events. These navigation problems do not affect all accident simulations. For instance, the interface in Figure 13.12 uses animation techniques that enable the developer to treat the simulation as a `movie'.
They can specify the viewpoint and the sequence of events so that the reader need not navigate within a three-dimensional environment.

2. Unknown rhetorical effects. Many of the techniques that are presented in this chapter introduce new rhetorical techniques, or tropes. The introduction of a simulation can have a profound impact upon many readers. Initial field trials have indicated that people are more willing to believe the version of events shown in a simulation or model than they are a paper-based report. On the one hand this illustrates the importance of these new techniques. Investigation agencies can use them as powerful tools to convince readers about a particular view of an incident. On the other hand, these new techniques may persuade people to accept a simulated version of events that cannot be grounded in the complex evidence that characterises modern accidents.

3. The problems of validating novel interfaces. DesktopVR interfaces and simulations have a subjective appeal [406]. This should not be under-estimated. Investigatory and regulatory agencies have organisational reasons for being seen to embrace new technology. The Rand study
[482] and the Institute of Medicine report [453], cited in previous chapters, both criticised a range of perceived problems that have limited the effectiveness of more `traditional' approaches to incident reporting. This pressure to innovate can be reinforced by the strong subjective appeal, mentioned above. However, the attraction can quickly wane. There is a danger that the users of these novel interfaces will reject the additional facilities that they offer if those techniques are not perceived to support their diverse working tasks. Previous sections have described the very limited nature of those studies that have been conducted into the longitudinal use of conventional, paper-based incident reports. We have even less information about the potential long-term use of these more innovative approaches.

There have been some limited evaluations of these novel presentation techniques. For instance, the McGill study mentioned in previous sections was extended to compare the use of a VRML time-line with the use of a location-based imagemap. The imagemap presentation was introduced in Chapter 8 and is illustrated in Figures 8.1 and 8.2. As mentioned, the readers of the report use a web browser to view a cross-sectional diagram of the ship that was involved in the incident. By selecting areas of that image, they can access a time-line of the events that happened in particular locations. These events are accompanied by a brief summary of their importance within the wider causes of an incident. McGill's laboratory evaluation also considered the use of the VRML time-line illustrated in Figure 8.7 as a presentation format for incident reports. Readers used a web-based interface to `walk' or `fly' through an abstract map of the events leading to a particular adverse occurrence or near-miss incident. By clicking their mouse on certain areas of the structure, they accessed web pages that presented more detailed textual or graphical information. The participants in the study were asked to perform a number of tasks within a fixed period of time. In order to minimise the effects of fatigue, each participant was stopped when they had spent a maximum of five minutes on any individual task. The study was counter-balanced so that each task was performed using a different presentation technique and the same number of participants performed each task using a particular approach. McGill also altered the order in which particular tasks were performed to help minimise any effects of fatigue or of learning from answering previous questions using a different presentation technique. The five tasks were as follows:

1. Write down the time that the Chief Officer left the mess room to return to the bridge.

2. Write down the key events which happened on the bridge at approximately 18:23.

3. Officer A gave conflicting evidence as to the time at which he left deck G to go to the mess. Write down the page/paragraph reference where these conflicting statements are highlighted.

4. What events happened at 18:28?

5. Write down the time that crewmember B returned to his cabin.

Table 13.11 summarises the times that were obtained for five users performing these different tasks using cooperative evaluation under laboratory conditions. The small number of participants in the study meant that a very limited number of readers used each interface to perform each task. It can also be argued that his choice of questions may have produced results that are favourable towards particular presentation formats.
McGill concludes that the electronic versions offered significant benefits over the paper-based presentation of the incident report. This is revealed principally in the time taken to perform the specified tasks but also in a range of attitudinal questions that assessed the readers' subjective response to these systems. These subjective questions indicated a strong preference for the VRML time-line over the image map for most tasks. McGill did not compare subjective responses for the paper-based report because he assumed that the electronic formats would supplement rather than replace more traditional presentation techniques. The results summarised in Table 13.11 hide a number of factors that influenced the course of the study. For instance, subject 3 in Task 4 missed one of the events. McGill awarded a time penalty for each incorrect answer when presenting his results [530]. His decision to add thirty seconds to the time reported for task completion seems arbitrary. These penalties have not, therefore, been included in Table 13.11.
         Paper report               Image Map                  VRML Time-line
User 1   Task 1: Did not complete   Task 2: 52 sec.            Task 3: 35 sec.
         Task 4: 133 sec.           Task 5: 144 sec.
User 2   Task 2: 227 sec.           Task 3: 62 sec.            Task 1: 19 sec.
         Task 5: 173 sec.                                      Task 4: 45 sec.
User 3   Task 3: Did not complete   Task 1: 189 sec.           Task 2: 38 sec.
                                    Task 4: 179 sec.           Task 5: 27 sec.
User 4   Task 2: Did not complete   Task 1: 177 sec.           Task 3: 61 sec.
                                    Task 5: Did not complete   Task 4: 69 sec.
User 5   Task 1: 236 sec.           Task 2: 40 sec.            Task 3: 42 sec.
         Task 5: 254 sec.           Task 4: 273 sec.
Table 13.11: McGill's Timing Results for Tasks with Electronic Incident Reports

The preliminary results provided by this study must be supported by more sustained investigations into the potential errors that might arise from the use of such innovative presentation techniques. A number of further comments can be made about these results. For example, user 4 in task 5 insisted on abandoning the image map and resorted to the paper-based report after 3 minutes. Such situations typify experimental studies with users performing complex tasks over even relatively short periods of time. There are inevitable frustrations from being forced to use presentation formats that the reader might not accept if they were given a free choice. This emphasises the need for more sustained longitudinal studies into the `real world' use of electronic incident reports.
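McGill's thirty-second penalty rule is easy to reproduce. The sketch below recomputes mean completion times per format from the raw figures in Table 13.11, treating uncompleted tasks as missing values; the penalty handling is an assumption about how such an adjustment might be applied, since the exact calculation is not reproduced here.

```python
PENALTY = 30  # seconds added per incorrect answer, following McGill [530]

# Raw completion times (seconds) transcribed from Table 13.11;
# None marks a task that the participant did not complete.
results = {
    "paper report":   [None, 133, 227, 173, None, None, 236, 254],
    "image map":      [52, 144, 62, 189, 179, 177, None, 40, 273],
    "VRML time-line": [35, 19, 45, 38, 27, 61, 69, 42],
}

def mean_time(times, incorrect_answers=0):
    """Mean time over completed tasks, with a fixed penalty added for
    each incorrect answer before averaging."""
    completed = [t for t in times if t is not None]
    return (sum(completed) + incorrect_answers * PENALTY) / len(completed)

for fmt, times in results.items():
    n = sum(t is not None for t in times)
    print(f"{fmt}: mean {mean_time(times):.1f} sec over {n} completed tasks")
```

Even this crude summary reproduces the ordering that McGill reports: the VRML time-line comes out fastest, the image map second and the paper report slowest, although the unequal numbers of completed tasks mean the means should be read with caution.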
13.5 Summary

Incident reports provide a primary means of ensuring that the weaknesses of previous applications are not propagated into future systems. This chapter has described a number of the problems that complicate the task of drafting these documents. Incident reports may have to be tailored to meet the particular requirements of several diverse audiences. For instance, some reports are produced to be read by the regulators who must decide whether or not to act on their recommendations. Other forms of incident report are intended as case studies or scenarios that are to be read by operators and practitioners. Further factors complicate even these relatively simple distinctions. For instance, the format and presentation of reports that are to be read by operators within the same organisation as the investigator can be quite different from those that may be disseminated to a wider audience. The problems of addressing the needs of the intended readers of an incident report are also complicated by the way in which different types of report may be drafted for different types of occurrences. For example, incidents that have a high risk associated with any future recurrence may warrant greater detail than those that are assumed not to pose any future threat to a system. Different presentation techniques may also be necessary in order to draft reports at different stages of an investigation. The format of an interim report in the immediate aftermath of a near-miss incident or adverse occurrence is unlikely to support the more polished, final report that is required by international organisations, such as the IMO. The task of meeting these various requirements is also exacerbated by the need to preserve confidentiality in the face of growing media and public interest in technological failures.

Subsequent sections of this chapter have presented a number of detailed recommendations that are intended to support the presentation of incident reports. These recommendations are ordered in terms of the different sections that are, typically, used to structure more formal reports into high-risk incidents: reconstruction, then analysis, then recommendations. However, many of the detailed guidelines can also be applied to documents that analyse less critical failures. For example, any reconstruction should consider both the distal and the proximal events that contributed to an incident. It is also important to describe the evidence that supports a particular version of events. Similarly, the analysis section should not only describe the causes of an incident but also the methods
that were used to identify those causes. It is no longer sufficient in many industries to rely simply on the reputation of domain experts without the additional assurance of documented methods to support their conclusions [280]. The guidelines for drafting recommendations include a requirement that investigators should define conformance criteria so that regulators can determine whether or not particular changes have satisfied those recommendations. It is also important to explain the relationship between any proposed changes and the causes that were identified from any analysis of the incident.

It is important that investigators have some means of determining whether or not a particular report supports the tasks that must be performed by its intended audience. This can be done in one of two ways. Firstly, we have described a range of verification techniques that can be used to demonstrate that particular documents satisfy principles or properties that are based on the guidelines summarised in the previous paragraph. Careful reading of an incident report can also be used to identify a range of rhetorical techniques. These tropes are often effective in more general forms of communication but can lead to ambiguity and inconsistency within engineering documents. Such problems can have profound consequences for safety when they weaken the presentation of incident reports. Mathematical proof techniques provide one means of identifying where enthymemes weaken the use of syllogism in an incident report. This is particularly important because the omission of evidence not only withholds particular information from the reader but also prevents them from drawing necessary inferences that depend upon the missing information. We have also described how Toulmin's [775] model was produced as a response to criticisms of logic as a means of analysing argument. This approach rejects any a priori notions of truth or falsehood. Instead it links evidence and claims through the warrants and backing that support them. It also emphasises the importance of qualifiers and rebuttals in explaining why someone might hold a contrary opinion; a minimal sketch of this structure is given below. We have identified limitations that affect both the use of logic and of models of argumentation to support the verification of incident reports. Neither approach provides any guarantee that the properties which are established of any particular document will actually be significant to their intended readership. Subsequent sections, therefore, briefly described a range of user-testing techniques that can support the validation of particular formats. Surprisingly few of these techniques have ever been applied to determine whether incident reports actually support the activities of their end-users. The results of some preliminary studies have been described; however, these raise a host of methodological problems that complicate such validation activities. For example, it can be difficult to identify a representative set of tasks that must be supported by a particular incident report.

This lack of evidence about the utility of existing formats is worrying given that a number of factors are creating potential problems for their continued application. Existing paper-based reporting techniques have, however, remained relatively unchanged over the last century. During this time, the introduction of microprocessor technology has significantly increased the integration of heterogeneous processes. There have also been changes to the techniques that drive incident analysis.
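Before moving on, the Toulmin structure mentioned above can be made concrete. The fragment below models an argument from an incident report as a small data structure; the field names follow Toulmin's terminology, but the class, the enthymeme check and the example argument are illustrative assumptions rather than anything drawn from a real report.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ToulminArgument:
    claim: str                    # the conclusion being advanced
    data: List[str]               # the evidence offered in its support
    warrant: str                  # why the data are held to support the claim
    backing: str = ""             # support for the warrant itself
    qualifier: str = "presumably" # the strength attached to the claim
    rebuttals: List[str] = field(default_factory=list)  # contrary conditions

    def missing_elements(self) -> List[str]:
        """Report omitted elements; a suppressed warrant or an absent
        rebuttal is exactly the kind of enthymeme discussed above."""
        gaps = []
        if not self.data:
            gaps.append("no supporting data")
        if not self.warrant:
            gaps.append("unstated warrant")
        if not self.backing:
            gaps.append("no backing for the warrant")
        if not self.rebuttals:
            gaps.append("no rebuttals considered")
        return gaps

# A hypothetical argument of the kind a maritime report might contain.
argument = ToulminArgument(
    claim="Watchkeeper fatigue contributed to the grounding.",
    data=["hours-of-rest records", "bridge logbook entries"],
    warrant="Sustained sleep loss degrades monitoring performance.",
)
print(argument.missing_elements())
# -> ['no backing for the warrant', 'no rebuttals considered']
```

The point of such a representation is not automation for its own sake: making each slot explicit forces a drafter to notice when a warrant or rebuttal has been left unstated.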
There is an increasing awareness that contextual and organisational issues must be considered in addition to any narrow findings about operator `error'. The integration of safety-critical interfaces and the development of organisational approaches to incident analysis have created new challenges for the presentation of these reports. Their scope and complexity is increasing. At the same time, many investigatory organisations have begun to disseminate their incident reports over the World Wide Web. It is, therefore, possible to extend the application of advanced visualisation and navigation techniques from other areas of information technology to support the readers of these documents. Unfortunately, most agencies focus on the Web as a means of dissemination rather than communication. Paper-based reports are directly translated into HTML or Adobe's PDF. This creates a number of problems. The HTML approach suffers from the well-known problems of reading large on-line documents. The PDF approach prevents many search engines from effectively retrieving information about similar accidents. Finally, the direct translation of paper-based reports into on-line documents ignores the potential flexibility of electronic media. The closing sections of this chapter have described a number of alternative approaches that support the presentation of incident reports. These formats exploit the opportunities created by the electronic dissemination and communication of incident reports. They must, however, be treated as prototypes. Further evidence is required to demonstrate that they support the range of tasks that are currently achieved using paper-based
documents and their more orthodox counterparts on the web. The increasing use of electronic media to document information about adverse occurrences and near-miss incidents not only creates opportunities for new presentation formats. It also, for the first time, creates opportunities to provide automated tools that will help investigators find records of similar incidents in other organisations, in different industries and in other countries. It is far from certain that we will be able to exploit this opportunity. For example, the vast number of records that have been compiled in some industries now makes it practically impossible to accurately search and retrieve information about similar mishaps. The following chapter, therefore, describes a range of computer-based tools and techniques that have been specifically developed to support large-scale collections of incident reports.
Chapter 14
Dissemination

The previous chapter looked at the problems associated with the presentation of incident reports. It was argued that the format and structure of these documents must be tailored so that they support their intended recipients. It was also argued that care must be taken to ensure that rhetoric is not used to mask potential bias in an incident report. This chapter goes on to examine the problems that are associated with the dissemination of these documents. It is of little benefit ensuring that reports meet presentation guidelines if their intended recipients cannot access the information that they provide. There are significant problems associated with such dissemination activities; to give some indication of the scale involved, the FDA's Medical Bulletin that presents information about their MedWatch program is currently distributed to 1.2 million health professionals. Later sections analyse the ways in which many organisations are using electronic media to support the dissemination of incident reports. This approach offers many advantages. In particular, the development of the Internet and Web-based tools ensures that information can be rapidly transmitted to potential readers across the globe. There are, however, numerous problems. It can be difficult to ensure the security and integrity of information that is distributed in this way. It can also be difficult to help investigators search through the many thousands of incidents that are currently being collected in many of these systems. The closing sections of this chapter present a range of techniques that are intended to address these potential problems.
14.1 Problems of Dissemination

Chapters 12 and 13 have already described some of the problems that complicate the dissemination of information about adverse occurrences and near-miss incidents. For example, it can be difficult to ensure that information is made available in a prompt and timely fashion so that potential recurrences are avoided. It can also be difficult to ensure that safety recommendations reach all of the many different groups that might make use of this information. The following pages build on these previous chapters to analyse these barriers to dissemination in greater detail.
14.1.1 Number and Range of Reports Published

It is important not to underestimate the scale of the task that can be involved in the dissemination of incident reports. Even relatively small, local systems can generate significant amounts of information. For instance, one of the Intensive Care Units that we have studied generated a total of 111 recommendations between August 1995 and November 1998. 82 of these were `Remind Staff' statements. The 29 other recommendations concerned the creation of new procedures or changes to existing protocols (e.g. `produce guidelines for care of arterial lines'), or were equipment related (e.g. `obtain spare helium cylinder for aortic pump to be kept in ICU'). As one might expect, the task of keeping staff and management informed of recent incidents and recommendations is significantly more complex in national and international systems. This is illustrated by Table 14.1, which presents the total number of different reports that were published by the UK Medical Devices Agency (MDA) over the last five years.
                     1995   1996   1997   1998   1999   2000   2001 (-Aug)
Hazard Notices          6     12     16      2      6     13      2
Device Bulletins        5      7      4      6      3      5      4
Safety Notices         33     40     20     43     41     28     24
Device Alerts           -      -      -      -      -      8      5
Advice Notices          -      -      -      -      6      1      0
Pacemaker Notes         6      6      3      4      7      4      4
Total Reports          50     65     43     55     63     59     39
Total Incidents     4,298  4,330  5,852  6,125  6,860  7,352      -
Table 14.1: Annual Frequency of Publications by the UK MDA

It should be noted that the figures for 2001 are currently only available until August. As we have seen from the Maritime examples in the previous chapter, incident reporting agencies produce a range of different publications to disseminate their recommendations. In Table 14.1, Hazard Notices are published following death or serious injury, or where death or serious injury might have occurred [542]. A medical device must also be clearly implicated and immediate action must be necessary to prevent recurrence. Device Bulletins address more general management interests. They are derived from adverse incident investigations and consultation with manufacturers or users. They are also informed by device evaluations. In contrast, Safety Notices are triggered by less `serious' incidents. They are published in circumstances where the recipients' actions can improve safety or where it is necessary to repeat warnings about previous problems. Device Alerts are issued if there is the potential for death or serious injury, particularly through the long-term use of a medical device. Finally, Pacemaker Technical Notes publish information about implantable pacemakers, defibrillators and their associated accessories. For the purpose of comparison, Table 14.1 also contains the total number of adverse incidents that were reported in the MDA's annual reports [539]. As can be seen, there has been a gradual rise in the frequency of incident reports while the total number of publications has remained relatively stable. Such an analysis must, however, be qualified. The total incident frequencies are based on the MDA's reporting year. Hence the figure cited for 1996 is, in fact, that given for 1996-1997. However, the number of reports and associated publications provides some indication of the scale of the publication tasks that confront organisations such as the MDA.
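The internal consistency of these figures can be checked mechanically. The short sketch below recomputes the `Total Reports' row of Table 14.1 from the individual publication counts, under the assumption (made in the transcription above) that blank cells denote zero publications for that year.

```python
# Publication counts transcribed from Table 14.1; blank cells as zero.
YEARS = ["1995", "1996", "1997", "1998", "1999", "2000", "2001 (-Aug)"]
publications = {
    "Hazard Notices":   [6, 12, 16, 2, 6, 13, 2],
    "Device Bulletins": [5, 7, 4, 6, 3, 5, 4],
    "Safety Notices":   [33, 40, 20, 43, 41, 28, 24],
    "Device Alerts":    [0, 0, 0, 0, 0, 8, 5],
    "Advice Notices":   [0, 0, 0, 0, 6, 1, 0],
    "Pacemaker Notes":  [6, 6, 3, 4, 7, 4, 4],
}

# Sum each year's column across the six publication types.
totals = [sum(column) for column in zip(*publications.values())]
for year, total in zip(YEARS, totals):
    print(f"{year}: {total} reports")
# Output matches the published 'Total Reports' row: 50, 65, 43, 55, 63, 59, 39.
```

The same data also reproduce the 49 safety warnings for 2000 that are cited later in this chapter (13 Hazard Notices, 28 Safety Notices and 8 Device Alerts).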
                             1997   1998   1999   2000   2001 (-Aug)
Safety Alerts                  55     55     54     67     38
Drug Labeling                 239    519    512    505    241
Biologics Safety                3     24     10     29     14
Food and Applied Nutrition      -      2      2      2      2
Devices and Radiology           9     12      9      4      3
Table 14.2: Annual Frequency of Publications by the US FDA's MedWatch Program

The level of activity indicated in Table 14.1 is mirrored by the figures in Table 14.2. This presents publication figures for the US Food and Drug Administration's MedWatch initiative. This Safety Information and Adverse Event Reporting Programme is intended to serve `both healthcare professionals and the medical product-using public' [269]. It covers a broad range of medical products, `including prescription and over-the-counter drugs, biologics, dietary supplements, and medical devices'. It, therefore, has a slightly wider remit than that of the UK MDA. The primary MedWatch publication provides Safety Alerts about drugs, biologics, devices and dietary supplements. As can be seen in Table 14.2, the MedWatch programme also publishes information from several different
groups within the FDA. It publishes safety-related drug labeling change summaries that have been approved by the FDA Center for Drug Evaluation and Research. It also incorporates recalls, withdrawals and safety issues identified by the Center for Biologics Evaluation and Research. The program also publishes selected warnings and other safety information identified by the Center for Food Safety and Applied Nutrition. Finally, the MedWatch initiative incorporates safety alerts, public health advisories and notices from the Center for Devices and Radiological Health. We have not calculated totals for Table 14.2 as we did for Table 14.1 because of the inherent difficulty of calculating the frequency of recommendations to change drug labelling in the FDA adverse event reporting programme. Some drugs form the focus of several reports in the same year. Recommendations can be applied to a particular generic named product or to the different brands of that product. We have chosen to calculate frequencies on the basis of named drugs identified in the Center for Drug Evaluation and Research warnings.

It is also important to stress that the reports identified in Tables 14.1 and 14.2 only represent a small subset of the publications that the FDA and the MDA publish in response to adverse incidents. For instance, the MedWatch programme also disseminates articles that are intended to support the continuing education of healthcare professionals. These include information about post-marketing surveillance for adverse events after vaccination and techniques for assuring drug quality. The FDA also provides more consumer-oriented publications to encourage contributions from the general public. It uses incident information to address specific consumer concerns, for instance in special reports on drug development and the interaction between food and drugs. The wide scope of these publication activities is also illustrated by the User Facility Reporting Bulletins. This quarterly publication is specifically targeted at hospitals, nursing homes and `device user facilities'. Dissemination activities are not only focussed on the generation of specific incident reports or articles on more general issues. They also include the organisation of workshops, such as the 1998 meeting on `Minimising Medical Product Errors' [261]. The UK MDA hosts similar events, such as their annual conference which in 2001 will address the theme `Protecting Patients - Maintaining Standards' [545]. The MDA also holds more focussed study days. These provide staff training to address common problems with particular devices. For example, the MDA set up a recent meeting for nurses on best practice and the potential pitfalls in operating infusion systems [543].
14.1.2 Tight Deadlines and Limited Resources

The previous paragraphs illustrate the high frequency and the diversity of dissemination activities that are conducted by many incident reporting organisations. It is difficult to overstate the logistical challenges that such activities can impose upon finite resources. There is also increasing pressure in many sectors to increase the efficiency of many reporting bodies. For instance, one government measure estimates that the MDA managed to increase its output by 9% with a stable workforce between 2000-2001. These pressures can also be illustrated by some of the objectives being promoted by the MDA. For 2001-2002, it is intended that all Hazard Notices will be issued within 20 working days; 90% of Safety Notices will be issued within 60 days and 75% within 50 days. It is also intended to increase the number of adverse incident reports that will be published by a further 9% while at the same time making `efficiency' savings of 2% [539].

The results of tight financial constraints can also be seen in the manner in which the FDA has altered its publication policy in recent years [867]. Previous sections have mentioned the User Facility Reporting Bulletin; this publication is intended for hospitals, nursing homes and other end-user facilities. The initial twenty, quarterly issues of the Bulletin were printed in the conventional manner and were posted to any organisation that requested a copy. At its peak, 77,000 subscribers received copies of these documents that presented summarised reports based on recent incident reports. Budgetary restrictions forced the FDA to review this policy. In Issue 17, readers were asked to respond to a retention notice. If they did not respond then they were removed from the distribution list. It was hoped that the high initial administrative overhead associated with this initiative would yield longer-term savings in distribution costs. By 1999, however, Federal funding cuts prevented any distribution in paper form. The twenty-first issue of the Bulletin was, therefore, distributed through electronic means including an automated fax system. The FDA summarised
their feelings about this situation; `we regret the need to move to this new technology if it means that many of our current readers will no longer have access to the Bulletin' [867].

The joint pressures imposed by the need to disseminate safety information in a timely fashion and the need to meet tight financial objectives have resulted in a number of innovations in incident reporting. Many of these systems start with the premise that it is impossible to elicit and analyse voluntary incident reports across an entire industry. Even with mandatory reporting systems there will be problems of contribution bias that result in a very partial view of safety-related incidents. These problems stem partly from the cost and complexity of large-scale voluntary reporting systems. They also stem from the passive nature of most mandatory systems that simply expect contributors to meet their regulatory requirements when responding to an adverse occurrence. As we have seen, even if a potential contributor wants to meet a reporting obligation they may fail to recognise that a safety-related event has occurred. Most regulatory and investigatory organisations lack the resources necessary to train personnel across an industry to distinguish accurately between reportable and non-reportable events. Similarly, there are significant financial barriers that prevent routine inspections and audits to review compliance. Sentinel reporting systems provide an alternative solution that is intended to reduce the costs associated with incident reporting and, thereby, ensure that recommendations are disseminated in a timely fashion. This approach identifies a sample of all of the facilities to be monitored. The sites within the sample are then offered specialist training in both mandatory and voluntary incident reporting. The incidents that are reported by these sentinel sites can then form a focus for more general safety initiatives across an industry. These ideas are extremely persuasive to many governmental organisations. For instance, the FDA Modernisation Act (1997) required that the FDA make a report to Congress in late 1999 about progress towards such a sentinel system [262]. In September 1996, CODA Inc. was awarded a contract to conduct a study to evaluate the feasibility and effectiveness of a sentinel reporting system for adverse event reporting of medical device use. The explicit intention was to determine not whether a sentinel system could supplement passive, voluntary systems, such as MedWatch, but whether it could replace them entirely. The trial ran for twelve months and the final report emphasised the importance of feedback and dissemination in the success of any sentinel system. The CODA trial provided several different forms of feedback. These included a newsletter, faxes of safety notices and responses to questions presented by Study Coordinators. The individual reports that were received by the project were summarised, anonymised and then published in bimonthly newsletters for Study Coordinators. These coordinators acted as an efficient means of disseminating safety-related information within the sample sites. This project not only has important implications for the efficient dissemination of safety-related information. It also provides important insights about the practical problems that can arise when attempting to ensure the timely dissemination of incident recommendations and reports.
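The defining step of a sentinel scheme, selecting the monitored sample, is easy to illustrate. The sketch below draws a simple random sample of facilities; the facility names, the sample size and the use of unstratified sampling are all illustrative assumptions, since the details of how the CODA sites were chosen are not reproduced here.

```python
import random

def select_sentinel_sites(facilities, sample_size, seed=None):
    """Draw a simple random sample of facilities to act as sentinel
    sites. A production scheme would more plausibly stratify the sample
    by region, facility size or case mix before drawing it."""
    rng = random.Random(seed)  # seeded for a reproducible selection
    return rng.sample(facilities, sample_size)

# Hypothetical population of 200 user facilities.
facilities = [f"facility-{i:03d}" for i in range(1, 201)]
sentinels = select_sentinel_sites(facilities, sample_size=25, seed=42)
print(sentinels[:5])
```

The economic argument is visible even in this toy form: training and feedback costs scale with the 25 sentinel sites rather than with the 200 facilities in the population.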
Many reporting systems endeavour to ensure that operators, safety managers and regulators are provided with information about incidents according to a sliding timescale that reflects the perceived seriousness of the incident. This is the case with the MDA targets, cited above. It can, however, be extremely difficult to estimate the seriousness of an incident. Previous chapters have referred to the `worst plausible outcome' that is often invoked to support such assessments. The practical problems of applying such heuristics can be illustrated by the CODA pilot study. The project analysts determined that only 14% of the reports received would have been clearly covered by the existing mandatory systems. 56% of the reports described less serious incidents that fell within the voluntary reporting provisions. 30% of all submissions, or 96 reports, fell between these two categories; `the determination of serious patient injury according to FDA's definition was difficult to make'. Of these 96, 60% were submitted on voluntary forms. 25% of the reports clearly documenting serious patient injury also were submitted on voluntary forms. If these results provide an accurate impression of the true severity of the incidents then they indicate that analysts cannot accept the contributors' severity assessments at face value. Two senior nurse-analysts agreed to review all reports and classify their urgency using a scale of: very urgent, urgent, routine monitoring, well-known problem or not important. Approximately one-third (113) were classified as very urgent or urgent. Of these, only 19 were clearly mandatory reports. This is a significant concern given that distribution deadlines focus on a rapid response to mandatory reports.
The results of this analysis can be presented in another way. As mentioned, 14% of all reports clearly fell within the existing mandatory systems. About half of these, according to the nurses' analysis, needed only routine monitoring. The FDA cite the example of a `problem with a catheter in which there was medical intervention, but for which FDA already had taken action, so that additional reports would not make a very valuable contribution to the agency' [262]. However, 50 of the 175 reports that fell under voluntary reporting rules were rated as very urgent or urgent. This creates considerable problems for the prompt dissemination of safety-related information. Delays in a regulatory response do not simply stem from the time required to analyse a report and make recommendations. They also stem from the amount of time that it takes a contributor to actually file a report in the first place. Some of the contributors complained about the time limits that were recommended for reporting particular classes of incidents. In some cases, what contributors classed as less severe occurrences went unreported for more than ten days. The previous paragraphs have questioned the reliability of such severity assessments and so it seems likely that such delays may significantly undermine the prompt dissemination of alerts and warnings.
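The practical lesson of the CODA figures is that the analyst's urgency rating, rather than the mandatory/voluntary reporting route, ought to drive dissemination deadlines. The sketch below expresses that rule directly; the five-point scale is the one used by the nurse-analysts, but the working-day deadlines are invented for illustration.

```python
from typing import Optional

# Illustrative deadlines only; the scale follows the CODA nurse-analysts.
URGENCY_DEADLINES = {
    "very urgent": 5,
    "urgent": 20,
    "routine monitoring": 60,
    "well-known problem": 90,
    "not important": None,  # no dissemination deadline
}

def dissemination_deadline(urgency: str, mandatory: bool) -> Optional[int]:
    """Deadline in working days for circulating a warning. The mandatory
    flag is deliberately ignored: CODA's results suggest that the
    urgency rating, not the reporting route, should drive the response."""
    if urgency not in URGENCY_DEADLINES:
        raise ValueError(f"unknown urgency rating: {urgency!r}")
    return URGENCY_DEADLINES[urgency]

# An urgent report that arrived on a voluntary form still gets 20 days.
print(dissemination_deadline("urgent", mandatory=False))  # -> 20
```

Keying the deadline to urgency rather than to the reporting route would, on the CODA figures, have accelerated the handling of the 50 urgent reports that arrived under voluntary provisions.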
14.1.3 Reaching the Intended Readership

Chapter 13 argued that the task of drafting incident reports is complicated by the diverse readerships that these documents can attract. This section extends this argument. The diverse readership of these documents not only complicates the drafting of an incident report but also exacerbates their dissemination. There is an immediate need to ensure that individuals within a working unit are informed of any recommendations following an incident. Chapter 5 argued that such actions are essential to demonstrate that contributions are being acted on in a prompt manner. In particular, reports should be sent to the individuals who initially provided notification of an adverse occurrence. These requirements apply to the dissemination of incident reports in both large and small scale systems. National and international schemes face additional distribution problems. In particular, incident reports must be forwarded to other `at risk' centres. This is a non-trivial requirement because it can often be difficult to determine precisely which centres might be affected by any potential recurrence. Within these associated working groups, it is important to identify who will assume responsibility for ensuring the reports are read by individual members of staff. This can involve close liaison between the investigators who draft a report and safety managers or other senior staff distributed throughout their organisation.

It is possible to identify a number of different dimensions that characterise the distribution of incident reports. The following list summarises these different dimensions. Particular reporting systems may tailor the approach that they adopt according to the nature of the incident. They may also use hybrid combinations of these techniques. For example, a closed distribution policy might be exploited within the organisation that generated the report to ensure that information was not prematurely leaked to the media. However, a horizontal approach might also be used to ensure that key individuals in other companies are also made aware of a potential problem:
Closed distribution. This approach restricts the dissemination of incident reports to a few named individuals within an organisation. This creates considerable problems in ensuring that those individuals, and only those individuals, actually receive copies of a report. It is also important to note throughout this analysis that the receipt of a report does not imply that it will be either read or acted upon.

Horizontal distribution. This approach allows the dissemination of incident reports to other companies in the same industry. The distribution may be further targeted to those organisations that operate similar application processes.
Vertical distribution. This approach allows the dissemination of reports to companies within the same supply chain as the organisation that was notified about an incident. Reports can be passed down the supply chain to ensure that companies, which rely on the products and services of the contributor organisation, are alerted to a potential problem. Supply companies may also be informed if an incident occurs as the result of problems at previous stages in the supply chain.

Parallel distribution. This approach ensures that reports are distributed to companies in other industries that operate similar processes. For example, incidents involving the handling and preparation of nuclear materials can have implications in the defence, medical and power generation industries. It is for this reason that organisations such as the US Chemical Safety and Hazard Investigation Board were set up to span several related domains.
Open distribution. This approach allows the free distribution of incident reports. Increasingly, this approach is being adopted by regulatory organisations, including the FDA and MDA, and by independent research organisations, such as the NHS Centre for Reviews and Dissemination [191]. As we shall see, these open publication initiatives increasingly rely upon Internet-based distribution techniques.

The healthcare industry provides extreme examples of the problems associated with distributing incident reports to a diverse audience. In 2000, the MDA received 7,249 reports of adverse incidents involving medical devices. These resulted in 4,466 investigations after an initial risk assessment. 49 safety warnings were published; the MDA's annual report bases this figure on the sum of the numbers of Hazard Notices, Safety Notices and Device Alerts in Table 14.1 [539]. Safety Notices are primarily distributed through the Chief Executives of Health Authorities, NHS Trusts and Primary Care Trusts as well as the directors of Social Services in England. These individuals are responsible for ensuring that they are brought to "the attention of all who need to know or be aware of it" [535]. Each Trust appoints a liaison officer who ensures that notices are distributed to the `relevant managers'. Similarly, each local Health Authority appoints a liaison officer to ensure that notices are distributed to Chairs of Primary Care Groups, Registration Inspection Units, the Independent Healthcare Sector and representatives of the Armed Services. The MDA also requires that notices are sent to the Chief Executives of Primary Care Trusts who are then responsible for onward distribution to their staff. Social Services Liaison Officers play a similar role but are specifically requested to ensure distribution to Registration Inspection Units and Residential Care Homes. The distribution responsibilities of the individuals in the MDA hierarchy are presented in Table 14.3. The detailed responsibilities of each individual and group are, however, less important than the logistic challenges that must be addressed by the MDA when they issue a Safety Notice. For instance, any individual warning will only be sent to some portion of the total potential audience. Many Safety Notices are not relevant to the work of Social Services. In consequence, each published warning comes with a list of intended recipients. These are identified by the first level in the distribution hierarchy: Health Authorities, NHS Trusts, Primary Care Trusts and Social Services. Liaison officers are then responsible for ensuring that information is directed to those at the next level in the hierarchy. This selective distribution mechanism creates potential problems if, for example, a Social Services department fails to identify that a particular Safety Notice is relevant to their operations. The MDA, therefore, issues a quarterly checklist that is intended to help liaison officers ensure that they have received and recognised all applicable warnings.

The MDA distribution hierarchy illustrates a number of important issues that affect all reporting systems. There is a tension between the need to ensure that anyone with a potential interest in a Safety Notice receives a copy of the warning. This implies that Liaison Officers should err on the side of caution and disseminate them as widely as possible. On the other hand, this may result in a large number of potentially irrelevant documents being passed to personnel.
The salience of any subsequent report might then be reduced by the need to filter these less relevant warnings. These arguments, together with the expense associated with many forms of paper-based distribution, imply that Liaison Officers should target any distribution as tightly as possible. Later sections of this chapter will return to this tension when examining the generic problems of precision and recall in information retrieval systems.
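Those two measures can already be applied to the liaison officer's dilemma described above. In the fragment below, the `relevant' set is the staff who genuinely needed a notice and the `retrieved' set is the staff who actually received it; the counts are invented purely for illustration.

```python
def precision(relevant_received: int, total_received: int) -> float:
    """Fraction of recipients for whom the notice was actually relevant."""
    return relevant_received / total_received

def recall(relevant_received: int, total_relevant: int) -> float:
    """Fraction of the staff who needed the notice who actually got it."""
    return relevant_received / total_relevant

# Blanket distribution: everyone receives the notice, so recall is
# perfect but precision (and hence salience) suffers.
print(precision(40, 200), recall(40, 40))   # -> 0.2 1.0

# Tight targeting: fewer irrelevant copies, but some intended readers
# are missed.
print(precision(30, 35), recall(30, 40))    # -> ~0.857 0.75
```

The trade-off is exactly the one facing a Liaison Officer: widening distribution raises recall at the cost of precision, while tighter targeting does the reverse.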
Organisation                Liaison Officer          For onward distribution to
                            forwards to
-------------------------------------------------------------------------------
NHS Trust                   Appropriate Manager      Relevant staff, to include Medical
                                                     Directors, Nurse Executive Directors,
                                                     Directors of Anaesthetics, Directors
                                                     of Midwifery, Special Care Baby
                                                     Units/Pediatric Intensive Care,
                                                     Maternity Wards, Operating Theatres,
                                                     Ambulance NHS Trusts and Accident
                                                     and Emergency Units.

Health Authority            Primary Care             Directors of Primary Care, Local
                                                     Representatives Committees, Chief
                                                     Executives of Primary Care Groups,
                                                     Individual GP Practices, Dentists,
                                                     Opticians, Pharmacists.

                            Registration             Care in the Community, Homes (Group
                            Inspection Units         Homes), Nursing Homes, Managers of
                                                     independent sector establishments,
                                                     Private hospitals, Clinics and
                                                     hospices.

Social Services Department  In-house services        Residential Care Homes (elderly,
                                                     learning difficulties, mental health,
                                                     physical disabilities, respite care),
                                                     Day Centres, Home Care Services
                                                     (in-house and purchased), Occupational
                                                     Therapists, Children's Services,
                                                     Special Schools, Other appropriate
                                                     Local Authority departments (for
                                                     example Education departments for
                                                     equipment held in schools).

                            Registration             Any of the above services provided by
                            Inspection Unit          the independent sector.

Table 14.3: MDA's Distribution Hierarchy for Safety Notices

The success of the MDA distribution hierarchy relies on individual Liaison Officers. They exercise discretion in disseminating particular warnings to appropriate managers and directors. The significance of the Liaison Officer is also acknowledged by the MDA in a range of practical guidelines that are intended to ensure the integrity of these distribution mechanisms. For instance, healthcare organisations must identify a fax number and e-mail address for the primary receipt of Hazard Notices and Device Alerts. They must also arrange for someone to deputise in the Liaison Officer's absence. The Liaison Officer is responsible for ensuring that Hazard Notices and Device Alerts are distributed immediately after publication. Safety Notices can take a less immediate route, as described in previous paragraphs. Liaison Officers are also responsible for documenting the actions that are taken following the receipt of Hazard Notices, Device Alerts and Safety Notices. In particular, they must record the recipients of these various forms of incident report. The documentation should also record when the reports were issued and a signed assurance from the recipient that any required actions have been taken. Liaison Officers not only pass on Safety Notices to `appropriate' managers, they can also choose to distribute particular warnings to staff. For instance, such direct actions might be used to ensure that new employees or contract staff are brought up to date with existing warnings. These groups of workers create particular problems for the distribution of incident reports in many
different industries. Not only do they create the need for special procedures in the national system operated by the MDA, they also complicate the task of communicating recommendations from local systems. Changes in working procedures in individual hospital departments create significant training overheads for temporary `agency' staff who may be transferred between different units over a relatively short period of time. When such training is not explicitly provided then it is likely that communications problems will occur during shift hand-overs [342].

Previous sections have argued that Liaison Officers play an important role within the particular distribution mechanisms that are promoted by the UK MDA. Aspects of their role are generic; they characterise issues that must be addressed by all reporting systems. For example, the conflict between the need for wide distribution and the problems of overloading busy staff applies in all contexts. Many reporting systems must also ensure that new workers and contract staff are brought up to date. Similarly, there is a generic tension between enumerating the intended recipients of a report and allowing local discretion to determine who receives a report. This last issue can be illustrated by the way in which particular, critical reports constrain or guide the actions of Liaison Officers. For example, a recent Device Bulletin into patient injury from bed rails explicitly stated that it should be distributed to all staff involved in the procurement, use, prescription and maintenance of bed rails. Liaison officers were specifically directed to ensure that copies of the report were forwarded to `health and safety managers; loan store managers; MDA liaison officers (for onward distribution); nurses; occupational therapists; residential and nursing home managers; risk managers' [538]. In contrast, other Device Bulletins explicitly encourage Liaison Officers to adopt a far broader dissemination policy. A report into the (ab)use of single-use medical devices enumerated the intended recipients as all Chief Executives and managers of organisations where medical devices are used, all professionals who use medical devices, all providers of medical devices and all staff who reprocess medical devices [536].

Previous paragraphs have focussed on the problems of ensuring that incident reports are disseminated effectively within the organisations that participate in a reporting scheme. We have not, however, considered the additional problems that arise when any lessons must be shared between organisations that operate their own independent reporting systems. Legislation is, typically, used to provide regulators with the authority necessary to ensure that safety-related information is shared through national or industry-wide systems. Such legal requirements often fail to address the concerns that many companies might have about providing information to such reporting systems. There is a clear concern that commercially sensitive information will be distributed to competitors. The exchange of safety-related information often raises questions about confidentiality and trust:

"(The) FDA is keenly aware of and sensitive to the impacts of these new regulatory requirements on the pace of technological advancement and economic well-being of the medical device industry. At the same time, the agency is cognizant of the usefulness of information about the clinical performance of medical devices in fulfilling its public health mandate...
FDA may require the submission of certain proprietary information because it is necessary to fully evaluate the adverse event. Proprietary information will be kept confidential in accordance with Sec. 803.9, which prohibits public disclosure of trade secret or confidential commercial information." [252]

Less critical information, for instance about near-miss occurrences, may be retained within corporate reporting systems. Other organisations can then be prevented from deriving any insights that such reports might offer. The ability to overcome these barriers often depends upon the micro-economic characteristics of the particular industry. For instance, it can be difficult to encourage the altruistic sharing of incident reports in highly competitive industries. In other markets, especially those that are characterised by oligopolistic practices, it can be far easier to ensure the cooperation and participation of potential rivals. For example, the major train operating companies combined with the infrastructure provider to establish the CIRAS reporting system on Scottish railways [197]. This scheme has a lot in common with the CNORIS regional reporting system that has recently been established across Scottish NHS hospitals [417]. Another feature of these systems is, however, that the lessons are seldom disseminated beyond the small group of companies or organisations within the oligopoly.
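Returning to the distribution mechanics described earlier in this section, the fan-out through liaison officers in Table 14.3 can be sketched as a small tree walk. The organisation names are abbreviated from the table, the notice identifier is invented, and the audit log is simply printed rather than stored; this is a toy illustration of the structure, not the MDA's actual procedure.

```python
# Second-level recipients reached via each first-level liaison officer,
# abbreviated from Table 14.3 for illustration.
HIERARCHY = {
    "NHS Trust": ["Medical Directors", "Operating Theatres", "A&E Units"],
    "Health Authority": ["Primary Care Groups", "GP Practices", "Pharmacists"],
    "Primary Care Trust": ["Practice staff"],
    "Social Services": ["Residential Care Homes", "Day Centres"],
}

def distribute(notice: str, first_level: list) -> list:
    """Fan a notice out through liaison officers to the next level,
    recording every recipient so that receipt can later be audited."""
    log = []
    for org in first_level:
        log.append(f"{notice} -> {org} (liaison officer)")
        for recipient in HIERARCHY.get(org, []):
            log.append(f"{notice} -> {org} -> {recipient}")
    return log

# A hypothetical notice flagged as relevant only to trusts and health
# authorities, matching the selective lists that accompany each warning.
for entry in distribute("SN 2001(13)", ["NHS Trust", "Health Authority"]):
    print(entry)
```

Even this toy version shows why the quarterly checklist matters: a notice whose first-level list omits an organisation never reaches anyone below that branch of the tree.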
The increasing impact of a global economy has raised a number of difficult moral issues that were not initially considered by the early proponents of reporting systems. For example, there have been situations in which the operators of a non-punitive reporting system have identified failures by individuals who work in countries that operate punitive, legal approaches to adverse occurrences [423]. Such situations can create particular problems when individual employees may have contributed an incident report on the understanding that they were participating in a `no blame' system. Although these dilemmas are relatively rare, it is important to acknowledge the increasing exchange of data between different reporting systems. For instance, the 49 MDA warnings, cited in previous paragraphs, resulted in 32 notifications being issued to other European Union member states [539]. The direct distribution of reports by the MDA to other EU member states represents one of several approaches to the international dissemination of safety-related information. It effectively restricts the dissemination of information, in the first instance, to the other participants in the political and economic union. Other distribution mechanisms must be established on a country-by-country basis for the wider distribution of information, for example with the US FDA. The Global Aviation Information Network initiative represents an alternative approach to the dissemination of safety-related incident reports [308]. As the name suggests, the intention is to move beyond regional distribution to provide global access to this safety information. Similar initiatives can be seen in the work of the International Maritime Organisation (IMO) and the International Atomic Energy Authority. Such distribution mechanisms face immense practical and organisational barriers. The same issues of trust and confidentiality that complicate the exchange of information between commercial organisations also affect these wider mechanisms. There is also an additional layer of political and economic self-interest when incidents may affect the viability and reputation of national industries. These problems partly explain the halting nature of many of these initiatives. They are addressing the distribution problem by making information available to many national and regional organisations. However, they often fail to address the contribution problem because very few reports are ever received from some nations.
14.2 From Manual to Electronic Dissemination

Previous paragraphs have argued that the problems of disseminating information about adverse occurrences and near-miss incidents stem from the frequency and diverse range of publications; from tight publication deadlines and resource constraints; and from the difficulty of ensuring that the intended readership can access a copy of the report. These problems have been addressed in a number of ways. For example, the last section examined a number of distribution models that are intended to ease the logistics of disseminating incident reports. The hierarchical approach adopted by the MDA was used to illustrate the manner in which key individuals, such as Liaison Officers, often lie at the heart of such schemes. In contrast, this section moves on from these organisational techniques to look at the way in which different technologies can be recruited to address some of the problems that complicate the dissemination of incident reports.
14.2.1 Anecdotes, Internet Rumours and Broadcast Media

It is important not to overlook the way in which information about an incident can be disseminated by word of mouth. This can have very unfortunate consequences. For instance, the U.S. Food and Drug Administration's Center for Food Safety and Applied Nutrition describe how the company at the centre of an investigation first became aware of a potential problem through the circulation of rumours about their involvement [253]. They report that `the first news the dairy plant received that they were being investigated in relation to this outbreak was through rumour on the street'. The plant operators then demanded to know what was going on; `Apparently someone had heard someone else talking about the Yersinia outbreak and how it was connected to the dairy plant'. These informal accounts then had to be confirmed, with a consequent loss of confidence in the investigatory procedures that had prevented disclosure of the potential incidents before the rumour began.
Informal channels are often faster and, in some senses, more effective at disseminating information than more official channels. Rumours often circulate about the potential causes well before they are published by investigatory organisations. Very often, official reports into an adverse occurrence or near-miss come as little surprise to many of the individuals who work in an industry. The dissemination of safety information by word of mouth is not entirely negative. Many organisations, such as the FDA and the MDA, rely upon such informal measures given limited printing budgets and the vast audiences that they envisage for some warnings. Similarly, the use of anecdotes about previous failures provided an important training tool well before formal incident reporting systems were ever envisaged or implemented. There is a danger, however, that the information conveyed by these informal means will provide a partial or biased account of the information that is published by more official channels. Word of mouth accounts are likely to provide an incomplete view before the official report is distributed. This can also occur after the official publication of an incident report if individuals misunderstand or forget the main findings of an investigation. They may also be unconvinced by investigators' findings. In such circumstances, there is a tendency to develop alternative accounts that resolve uncertainties about the official report. These unauthorised reports are, typically, intended to gain the listeners' attention rather than to improve the safety of application processes. It is difficult to overstate the impact of such informal accounts. They can undermine the listeners' confidence in the investigatory agency even though they may retain significant doubts over the veracity of the alternative account [278].

In recent years, the informal dissemination of incident-related information has taken on a renewed importance. The growth of electronic communication media has provided significant opportunities for investigatory agencies to distribute `authorised' accounts. The same techniques also enable engineers, focus groups and members of the general public to rapidly exchange information about adverse occurrences and near-miss incidents. The recognition that e-mail, Internet chat rooms and bulletin boards can facilitate the `unauthorised' dissemination of such information has attracted significant attention from organisations, such as the FDA. The issues surrounding these informal communications are extremely complicated. For example, there is a concern that drugs companies and device manufacturers might exploit these communication media to actively promote their products. This resulted in a recent initiative to directly consider the position of the FDA towards `Internet Advertising and the Promotion of Medical Products' [256]. During this meeting, a representative of one pharmaceutical company argued that they had an obligation to make sure that the information available to the public was as accurate as possible. Given the lack of Internet moderation, however, it was impossible for companies to correct every misconception that might arise; `we do not correct every piece of graffiti that may be painted in some remote area of Australia or Alabama or Philadelphia, but we do respond where we feel this is significant and we need to clarify the issues'.
A representative of another drug company addressed rumours about adverse events more directly: "there may be a rumour that a certain product is going to be withdrawn at a certain time and if no one comes in and steps in who has authoritative information and says, `This is not true', that kind of rumour can absolutely snowball and can become uncontrollable if it is not quashed right when it starts" [256].

Companies are not the only organisations that can have a direct interest in refuting what can be termed Internet rumours. The FDA recently had to launch a sustained initiative to counter rumours about the safety of tampons [263]. The FDA identified three different versions of this rumour:

1. One Internet claim is that U.S. tampon manufacturers add asbestos to their products to promote excessive menstrual bleeding in order to sell more tampons. The FDA countered this rumour by stating that `asbestos is not, and never has been, used to make tampon fibers, according to FDA, which reviews the design and materials for all tampons sold in the United States' [274].

2. Another rumour alleged that some tampons contain dioxin. The FDA reiterated that `although past methods of chlorine bleaching of rayon's cellulose fibers could lead to tiny amounts of dioxin (amounts that posed no health risk to consumers), today, cellulose undergoes a chlorine-free bleaching process resulting in finished tampons that have no detectable level of dioxin'.
3. A final Internet rumour argued that rayon in tampons causes toxic shock syndrome (TSS) and could make a woman more susceptible to other infections and diseases. The FDA responded that `while there is a relationship between tampon use and toxic shock syndrome (about half of TSS cases today are associated with tampon use) there is no evidence that rayon tampons create a higher risk than cotton tampons with similar absorbency'.

In order to counter these various rumours, the FDA launched a coordinated distribution of information on the Internet and to the broadcast media. This response indicates the seriousness with which they regard the Internet as a distribution medium for alternative or `unofficial' accounts of particular incidents, in this case involving Toxic Shock Syndrome. Such actions do not, however, come without a price. They help to ensure that the public are aware of the scientific evidence in support of the FDA claims. They also inadvertently raise the profile of those Internet resources that disseminate the rumours in the first place. The FDA's response, therefore, adds a form of reflected legitimacy to the original arguments about the link between tampons and TSS. It is important to emphasise that our use of the term `rumour' is not intended to be pejorative. In many cases, the informal dissemination of information can provide a useful corrective to the partial view put forward by more `official' agencies. Such alternative sources of information must, however, support their claims and statements with appropriate warrants. In particular, it can be argued that these informal sources of information force official agencies to focus more directly on the issues and concerns that affect the general public. The Internet rumours about the relationship between tampons and TSS may have contained numerous statements that could not subsequently be supported; however, they did persuade the FDA to clarify the existing evidence on any potential links.

The previous case study illustrates some of the complex changes that are occurring in the manner in which information about adverse occurrences is being disseminated. Internet bulletin boards and chat rooms help to publicise rumours that are then picked up by the popular media. At this stage, regulatory authorities must often intervene to correct or balance these informal accounts. It is, however, insufficient simply to publish a response via an official web site, which is unlikely to attract many of the potential readers who have an interest in a particular topic. The regulatory agency is, therefore, compelled to exploit more traditional forms of the broadcast media to refute rumours that were primarily disseminated via the web and related technologies. This reactive use of the media represents a relatively recent innovation. More typically, investigatory agencies have used the press, radio and television in a more proactive manner to disseminate the findings of incident reports. As we have seen, this use of the media requires careful planning; there is a danger that the parties involved in an investigation may learn more about their involvement from the press than from more official channels. The FDA is similar to many national agencies in that it follows detailed guidelines on the use of the media to disseminate information. For example, media relations must be explicitly considered as part of the strategy documents that are prepared before each product recall. The dissemination of information in this manner must be treated extremely carefully.
It is important that the seriousness of any recall is communicated to the public. It is also important to avoid any form of panic or any adverse reaction that might unduly influence the long-term commercial success of the companies that may be involved in an incident. The sensitive nature of such recall notices is recognised in the FDA provision that the warnings may be released either by the FDA or by the recalling firm, depending on the circumstances surrounding the incident [257]. The political sensitivity of these issues is also illustrated by the central role that is played by the FDA's Division of Federal-State Relations during Class I recalls. This classification is used when a serious health hazard is expected and when the `depth' of the recall is anticipated to require action by retailers and consumers. The Federal-State Relations division is required to use e-mail to notify state and local officials of recalls that are associated with serious health hazards or where publicity is anticipated. These officials are then issued with enforcement papers that are prepared by the FDA Press Relations Staff. This mechanism illustrates the manner in which investigatory agencies may operate several parallel dissemination activities, each with very different intentions. In addition to the publication of incident reports, press releases are prepared to initiate actions by the public and by retailers. These may be distributed at press conferences, by direct contact with particular reporters and by releases to the Associated Press and United Press International wire services. Further distribution mechanisms must also ensure that individuals within relevant organisations are `well briefed' to respond to questions from the press.
It is also important to acknowledge the central role of press and media relations staff. Not only does this department warn other members of the organisation of media interest, it also ensures that their colleagues are adequately briefed to respond to that interest. Their ability to perform these tasks is dependent upon their being notified in the early stages of any incident investigation. FDA regulations require that the Press Relations Staff are notified by any unit when `publicity has occurred relating to the emergency condition, as well as pending requests for information from the media and/or public' [257]. The senior media relations staff then liaise directly with the officials closest to the scene to ascertain what information needs to be released and when it should be disseminated to best effect. It can, however, be difficult to ensure that such press releases will be given the prominence that is necessary in order to attract the public's attention to a potential hazard. Some warnings have a relatively high news value. The FDA's Consumer magazine often provides journalists with a valuable starting point for these incidents. For instance, a recent warning centred on a particular type of sweet, or candy, that had resulted in three children choking to death. Some of these products carried warning labels, suggesting that they should not be eaten by children or the elderly. Other labels warned of a choking hazard and advised that the sweets should be chewed thoroughly. Some were sold without any warning. This story attracted immediate and focussed media interest. Another warning, which was issued on the same day as the one described above, attracted far less media attention. This concerned the potential dangers of consuming a mislabeled poisonous plant called Autumn Monkshood [266]. The packages containing the plant were mistakenly labeled with the statement, `All parts of this plant are tasty in soup'. They should have indicated that consumption of the plant can lead to aconitine poisoning and that death could occur due to ventricular arrhythmias or direct paralysis of the heart. Simply releasing information to the media about potentially fatal incidents does not imply that all incidents will be equally newsworthy, nor that they will receive equal prominence in press, radio or television broadcasts. As we have seen, it can be difficult for regulatory and investigatory agencies to use the media as a means of disseminating safety information. This involves the coordination of press releases and conferences. It also involves the training of key staff, such as press liaison officers, and the use of electronic communications techniques to ensure that other members of staff are informed how to respond to media questions. Even if this infrastructure is established, there is no guarantee, without legal intervention, that a particular warning will receive the prominence that is necessary to attract public attention. Such problems are most often encountered by large-scale national systems. The issues that are raised by media dissemination of incident information are, typically, quite different for smaller scale systems. There can also be a strong contrast in media relations when incident information attracts `adverse' publicity. This is best illustrated by the phenomenon known as `doctor bashing' which has emerged in the aftermath of a number of incidents within the UK healthcare industries.
Many professionals find themselves faced by calls from the government and from the media to be increasingly open about potential incidents. For example, Alan Milburn, the UK Health Secretary, has argued that the "National Health Service needs to be more open when things go wrong so that it can learn to put them right" [111]. Together with this increased openness, "they would also have to be accountable for their errors and prepared to take responsibility". Some doctors have described such statements and the associated media publicity as `hysterical'. Recent BBC reports summed up this attitude by citing a General Practitioner from the North West of England: "Shame on the media for sensationalising and exaggerating incidents...shame on you for failing to report accurately adverse clinical events" [110]. Public and government pressure to increase the dissemination of information about medical incidents must overcome many doctors' fear of adverse or `sensational' press coverage. At present, many NHS trusts have still to face up to the consequences of this apparent conflict. They are reluctant to disclose information about previous incidents even to their own staff for fear that details might `leak' to the press. In this domain at least, we are a very long way from the culture of openness that the proponents of incident reporting systems envisage as a prerequisite for the effective implementation of their techniques. It is important not to view these tensions simply as the result of media interest in disseminating sensational accounts of adverse incidents. They reflect deeper trends in society.
The chairman of the British Medical Association's Junior Doctors' Committee saw this when he argued that "we have a more consumerist society... people are complaining more about everything... there is a lot of doctor-bashing in the press" [108]. Such quotations illustrate the way in which the media not only inform society, as in the case of FDA warnings, but also reflect the concerns of society. This section has focussed on the `informal' dissemination of information about adverse incidents. In particular, it has focussed on the way in which electronic and Internet-based communications have provided new means of distributing alternative accounts of near-misses and adverse occurrences. We have also described how regulatory organisations have used the same means to rebut these alternative reports. The conventional media is routinely used to support these initiatives. It can also be used to publicise more general safety warnings and can initiate investigations where other forms of reporting have failed to detect safety-related incidents. This more positive role must be balanced with the problems of media distortion that dissuade managers from disseminating the findings of incident reporting systems. There is a stark contrast between the use of the media to publicise necessary safety information and the fear of publicity in the aftermath of an adverse event.
14.2.2 Paper documents

The previous section has done little more than summarise the informal communication media that support the distribution of safety-related information. Similarly, we have only touched upon the complex issues that stem from the role of the media in incident reporting. These related topics deserve books in their own right; brevity, however, prevents a more sustained analysis in this volume. In contrast, the remainder of this chapter focuses on more `official' means of disseminating incident reports. In particular, the following section analyses the strengths and weaknesses of conventional paper-based publications as a means of disseminating safety-related information.

One of the most persuasive reasons for supporting the paper-based dissemination of incident reports is to meet regulatory obligations. The importance of this medium is clearly revealed in the various regulations that govern the relationship between the FDA, manufacturing companies and the end-users of healthcare products. The primary focus of these regulations is on the exchange of written or printed documentation. This emphasis is not the result of historical factors. It is not simply a default option that has been held over from previous versions of the regulations that were drafted in an age before electronic dissemination techniques became a practical alternative. As we have seen, the recommendations in some incident reports can have a legal force. Companies may be required to demonstrate that they have taken steps to meet particular requirements. This creates problems for the use of electronic media, where it can be very difficult to determine the authenticity of particular documents. It would be relatively easy to alter many of the reports that are currently hosted on regulatory and governmental web-sites. Later sections will describe a range of techniques, such as the use of electronic watermarks, that can increase a reader's confidence about the authenticity of the documents that are obtained over the Internet. Unfortunately, none of the existing incident reporting sites have adopted this technology. In consequence, paper versions continue to exist as the `gold standard' against which compliance is usually assessed. Copies obtained by other distribution mechanisms are, therefore, seen as in some way additional to this more traditional form of publication.

A further benefit of conventional, paper-based dissemination techniques is that regulatory agencies can exploit existing postal distribution services. A host of external companies can also be used to assist with the formatting, printing and mailing of these documents. The technology that is required to perform these tasks is well understood and is also liable to be readily available within most organisations. These are important considerations. Simplicity and familiarity help to reduce the likelihood of failures occurring in the distribution process, although as we have seen they are not absolute guarantees! Minimal staff training is needed before information can be disseminated in this way. It is for this reason that most small scale reporting systems initially exploit this approach. Typically, newsletters are duplicated using a photocopying machine and are then made available either in staff common areas or in a position that is close to a supply of reporting forms.
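Returning briefly to the authenticity problem raised above, the following sketch illustrates one simple integrity check that a reader might apply to a downloaded report. It is not the watermarking technique mentioned earlier: it assumes, hypothetically, that the regulator publishes a cryptographic digest of each report through some trusted channel, and the filename and digest shown are purely illustrative.

    # A minimal sketch of an integrity check for a downloaded incident report.
    # It assumes, hypothetically, that the regulator publishes a SHA-256
    # digest of each report alongside the authoritative paper version.
    import hashlib

    def sha256_of(path: str) -> str:
        """Return the SHA-256 digest of a downloaded report."""
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def verify(path: str, published_digest: str) -> bool:
        """True if the local copy matches the published digest."""
        return sha256_of(path) == published_digest

    # Hypothetical usage; both the filename and the digest are illustrative.
    print(verify("db2001-04.pdf",
                 "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"))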
Paper-based dissemination techniques simplify the task of distributing incident reports because they can exploit existing mechanisms, including staff distribution lists as well as both internal and state postal services. There are further advantages. No additional technology, such as a PC with an Internet connection or CD-ROM drive, is required before people can access safety-related information. This is a critical requirement for the dissemination of some incident reports. One participant at a recent FDA technical meeting was extremely irritated by the continual reference to web sites as a primary communication medium. He asked the others present whether they knew how many Americans could access the Internet or could understand English [260]. Such comments act as an important reminder that paper-based publications continue to have an important role in spite of the proliferation of alternative dissemination techniques. For the record, Table 14.4 provides the latest available figures from the 1990 US Census describing self-reported English ability. The total US population was reported as 230,445,777, of which some 198,600,798 individuals reported that they could only speak English. There were 31,844,979 who described themselves as being primarily non-English speakers. The self-reported figures for the standard of English amongst this community are shown in Table 14.4.

  Very Well      Well          Not Well      Not at All
  17,862,477     7,310,301     4,826,958     1,845,243

  Table 14.4: 1990 US Census Data for Self-Reported Ability in English

The proportion of the population who express problems in understanding English appears to be relatively small. However, there may be a significant proportion of the population who did not return a census form, and there is a concern that the proportion of non-English speakers might be relatively high in this community. There is also a natural tendency to over-estimate linguistic ability in such official instruments. Such factors motivate the provision of alternate language versions of safety-related information [822]. The 2000 census provided further insights into the growth of the Internet amongst the US population [823]. The census asked `Is there a personal computer or laptop in this household?'. The returns indicated that 54,000,000, or 51%, of households had one or more computers in August 2000. This was an increase from 42% in December 1998. Some 45,000,000, or 42%, of households had at least one member who used the Internet at home; this had risen from only 26% in 1998 and 18% in 1997. Such statistics reinforce the point that significant proportions of the population in what is arguably the world's most technologically advanced nation still do not have Internet access. This is liable to be less significant for incident reports that are targeted at commercial organisations, for which one might expect a higher percentage of Internet connectivity. The census statistics are, however, a salient reminder for more general reports and warnings, such as those issued by the FDA, that are deliberately intended for the general public. Paper-based dissemination techniques are also resilient to hardware failures. It is a relatively simple matter to find alternative printing facilities and postal services. It can be far more complex to introduce alternative web-servers or automatic fax routing services. The reliability of the distribution service is only one aspect of this issue. There can also be considerable problems in ensuring that the intended recipients of incident reports can successfully retrieve alternative formats. Postal services are seldom swamped by the volume of mail. The same cannot be said of web servers or even of fax-based distribution techniques.
If the intended recipient's fax machine is busy at the time when an automated distribution service attempts to send an incident report, critical information can be delayed by hours or even days. At peak times of the day, many requests can either fail entirely or be significantly delayed as users request incident reports from the FDA or MDA web-sites. One particular problem here is that many government web sites only make limited use of more advanced techniques, such as predictive cacheing or mirror sites [416]. Similarly, the servers that provide access to incident reports may also be used to provide access to other documents that attract a large volume of users throughout the day. There is a certain irony in the manner in which some incident reporting web sites also elicit user-feedback about the failure of the sites that are intended to provide access to other forms of incident reports. Even if readers can download a computer-based report, there is no guarantee that they possess the application software that may be required to view it. Chapter 13 described how most incident reporting sites exploit either HTML or PDF. The former supports the dissemination of web-based documents because no additional support is required beyond a browser.
Unfortunately, there is no guarantee that a document which is formatted in HTML will be faithfully reconstructed when printed. This is significant because the psychological literature points to numerous cognitive and perceptual problems associated with the on-screen reading of long and complex documents [875]. In consequence, many organisations exploit Adobe's proprietary PDF format. PDF readers can be downloaded for most platforms without any charge. Problems arise, however, when incident reports that have been prepared for viewing under one version of the reader cannot then be viewed using other versions. For instance, a recent MDA report into Blood Pressure Measurement Devices contained the following warning: "Adobe Acrobat v.4 is required to view on screen the content of the tables at p.9 + 16...Adobe Acrobat v.3 can view remainder of document and can print in full". Paper-based dissemination techniques avoid such problems, which present a considerable barrier for many users who might otherwise want to access these documents. There are further benefits to more traditional dissemination techniques. For instance, the physical nature of paper-based publications enables regulators to combine documents in a single mailshot. This is important because potential readers can skim these related items to see whether or not they are relevant to their particular tasks. This can be far more difficult to achieve from the hypertext labels that are, typically, used to encourage readers to access related items over the web [757].
The flexible nature of printed media can be illustrated by the way in which Incident Report Investigation Scheme news and safety alerts were directly inserted into printed copies of the Australian Therapeutic Goods Administration newsletter [45]. Similar techniques have been adopted by many different investigation schemes. Safety-related information is included in publications that are perceived to have a wider appeal. This is intended to ensure that more people will consider reading this information than if they had simply been sent a safety-related publication. There are also situations in which investigatory and regulatory organisations have no alternative but to use printed warnings. For example, the FDA took steps to ensure that printed warnings were distributed about the danger of infection from Vibrio vulnificus as a result of eating raw oysters [254]. The signs and symptoms of previous cases were described and the resulting warnings were posted at locations where the public might choose to buy or consume these products. The use of the Internet or of broadcast media provides less assurance that individuals who are about to consume raw oysters are aware of the potential risks. This incident also illustrates some of the limitations of paper-based dissemination techniques. Many of the cases of infection were identified in and around Los Angeles. The FDA soon discovered that, as noted above, a significant proportion of this community could not speak or read English at the level that was required to understand the signs that had been posted. The States of California, Florida and Louisiana only required oyster vendors to post signs in English. In consequence, the FDA supplemented these printed warnings with a 24-hour consumer `Seafood Hotline' that provided information in English and Spanish. There are a number of problems that limit the utility of paper-based dissemination techniques as a means of distributing the documents that are generated by incident reporting systems. The most obvious of these issues is the cost associated with both the printing and shipping of what can often be large amounts of paper. These costs can be assessed in purely financial terms. They are also increasingly being measured in terms of their wider ecological impact, especially for large scale reporting systems that can document many thousands of contributions each year. Many organisations attempt to defray the expenses that are associated with the generation and distribution of incident reports by charging readers who want to obtain copies of these documents. This raises a number of complex, ethical issues. For example, the cost of obtaining a copy of an incident report can act as a powerful disincentive to the dissemination of safety-related information. This should not be underestimated for state healthcare services, where any funds that are used to obtain such publications cannot then be spent on more direct forms of patient care. Some regulatory bodies, therefore, operate a tiered pricing policy. For example, the MDA do not make a charge for any of the Device Bulletins requested by members of the National Health Service. In contrast, Table 14.5 summarises the prices that must be paid to obtain copies of a number of recent MDA documents by those outside the national health system [540]. The costs illustrated in Table 14.5 do not simply reflect the overheads associated with the printing and shipping of these documents. They also, in part, reflect the costs of maintaining a catalogue of previous publications.
Device Bulletins - 2001

  Number        Title                                                Issue Date    Price
  DB 2001(04)   Advice on the Safe Use of Bed Rails                  July 2001     $15
  DB 2001(03)   Guidance on the Safe Transportation of Wheelchairs   June 2001     $25
  DB 2001(02)   MDA warning notices issued in 1995                   May 2001      Free
  DB 2001(01)   Adverse Incident Reports 2000                        March 2001    Free

Device Bulletins - 2000

  Number        Title                                                Issue Date    Price
  DB 2000(05)   Guidance on the Purchase, Operation and              October 2000  $25
                Maintenance of Benchtop Steam Sterilisers
  DB 2000(04)   Single-Use Medical Devices: Implications and         August 2000   $15
                Consequences of Reuse (Replaces DB9501)
  DB 2000(03)   Blood Pressure Measurement Devices - Mercury         July 2000     $15
                and Non-Mercury
  DB 2000(02)   Medical Devices and Equipment Management:            June 2000     $25
                Repair and Maintenance Provision
  DB 2000(01)   Adverse Incident Reports 1999 (Reviews adverse       March 2000    Free
                incidents reported during 1999 and describes
                MDA actions in response.)

  Table 14.5: Pricing Policy for Recent MDA Device Bulletins

This can prove to be particularly difficult with paper-based reports, given the storage that is required to hold the large numbers of publications that were described in the opening pages of this chapter. The MDA has published well over 300 different reports in the last five years. The logistics of supporting the paper-based distribution of such a catalogue have led many similar organisations to abandon such archival services. The Australian Institute of Health and Welfare now only provide the detailed back-up data and tables for many of their publications in electronic format [41].

A number of further limitations affect the use of paper-based dissemination techniques. The previous paragraphs have argued that such approaches do not suffer from the problems of server saturation and network loading that can affect electronic distribution mechanisms. Unfortunately, more conventional dissemination mechanisms can suffer from other forms of delay that can be far worse than those experienced with Internet retrieval tools. Even with relatively efficient administration procedures there can be a significant delay between the printing of a report and the time of its arrival with the intended readership. These delays are exacerbated when safety managers or members of the general public require access to archived information about previous incidents. For instance, the MDA promise to dispatch requested reports by the next working day if they are in stock [541]. If they are not currently in print then they will contact the person or organisation making the request within forty-eight hours. These delays can be exacerbated by the use of the UK's second-class postal service to dispatch the requested copies of the report. This reduces postage costs; however, it also introduces additional delays. The second class service aims to deliver by the third working day from when an item was posted. In the period from April to June 2001, 92.5% of second class `impressions' satisfied this target. This is an important statistic because it implies that even if there is a relatively long delay before any requested report can be delivered, the duration of this delay is relatively predictable. In the same period, the UK postal service achieved close to 100% reliability in terms of the number of items that were lost. The high volume of postal traffic does, however, mask the fact that Consignia received 223,495 complaints about lost items, 40,529 complaints about service delays and 37,256 complaints about mis-deliveries by the Royal Mail service from April to June 2001. The delays introduced by a reliance on the postal service or similar distribution mechanisms also create problems in updating incident reports. In consequence, organisations may be in the process of implementing initial recommendations at a time when these interim measures have already been revised in the final report.
Updating problems affect a wide range of the publications that are produced from incident reporting systems. For instance, the FDA explicitly intended that their Talk Papers, which are prepared by the Press Office to help personnel respond to questions from the public, are subject to change `as more information becomes available' [267]. Even when revisions are made over a longer time period, it is important not to underestimate the administrative burdens and the costs of ensuring that all interested parties receive new publications about adverse incidents. This point can be illustrated by the problems surrounding Temporomandibular Joint (TMJ) implants. These implants have been used in several dental procedures. They were initially introduced onto the market before a 1976 amendment that required manufacturers to demonstrate that such products were both safe and effective. TMJ implants were, therefore, exempt from the terms of the amendment. From 1984 to June 1998, the FDA received 434 adverse event reports relating to these devices; 58% of these incidents resulted in injury to the patient. In 1993, the Dental Products Advisory Panel reclassified TMJ implants into their highest risk category (III). All manufacturers of TMJ devices were then required to submit a Premarket Approval Application (PMA), demonstrating safety and effectiveness, when called for by the FDA. In December 1998, the FDA called for PMAs from all manufacturers of TMJ implants. This was followed up by the publication in 1999 of a consumer handbook entitled `TMJ Implants - A Consumer Informational Update'. In April 2001 this was updated to present further information about the changing pattern of incidents involving these devices. As can be seen, adverse occurrences led to the publication of reclassification information in 1993. This had to be disseminated to all device manufacturers. It was revised in 1998 when the Premarket Approval Applications were called for. This change has considerable implications; the FDA have to ensure that they contact all of the commercial organisations that might be affected by such a change. TMJ implants are relatively specialist devices and so only a handful of companies are involved in manufacturing them in the United States. It is important to recognise, however, that the Class III categorisation also applied to the sale of foreign imports. One solution to the potential problems that might arise in such circumstances is to use legal powers to require that all device manufacturers take measures to ensure that they are aware of any changes to the regulatory status of the devices that they produce. Such an approach is, however, infeasible for members of the general public and even for clinicians. It would clearly not be a productive use of FDA resources if their administrative staff had to answer repeated requests from concerned individuals who simply wanted to check whether or not they had received the most recent information about particular devices.
The web offers considerable benefits for the dissemination of updated information about adverse occurrences and near-miss incidents. A single web-site can act as a clearing house for information about particular products, such as TMJ implants; users can then access this page in order to see whether or not the information there has been updated. This approach raises interesting questions about the relationship between the reader and the regulatory or investigatory organisation that disseminates the information. In a conventional paper-based approach, a push model of distribution was used. The incident reporting organisation actively sent concerned individuals updated copies of information that they had registered an interest in. This enabled regulators to have a good idea about who read their reports. The overheads associated with this approach persuaded some organisations to adopt a pull model in which interested readers had to explicitly request particular documents. The dissemination of reports could then be targeted at those who actually wanted them rather than simply sending everyone a copy of every report. The administrative costs associated with such a scheme have persuaded many organisations to adopt the electronic variant of this approach, in which individuals are expected to pull updated reports from a web page of information. This removes many of the costs associated with the production and distribution of paper-based reports. It also prevents regulators and investigators from having any clear idea of who has read the incident reports and associated publications that they have produced. Web server logs can prove to be misleading, given the prevalence of cacheing and other mechanisms for storing local copies of frequently accessed information [416].
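The distinction between these two distribution models can be made concrete with a brief sketch. Everything in it is hypothetical: the subscriber registry, the report store and the addresses are illustrative only.

    # A minimal sketch of the push and pull models of dissemination described
    # above. The subscriber registry, report store and addresses are hypothetical.

    subscribers = {"safety-manager@example-hospital.org": {"DB 2001(04)"}}
    reports = {"DB 2001(04)": "Advice on the Safe Use of Bed Rails (revised)"}

    def push_updates() -> None:
        """Push model: the regulator mails each revision to registered readers
        and therefore knows exactly who was sent which report."""
        for address, interests in subscribers.items():
            for number in interests:
                print(f"mail {reports[number]!r} to {address}")

    def pull_report(number: str) -> str:
        """Pull model: the reader fetches the current revision on demand; the
        regulator learns little about who has actually read it."""
        return reports[number]

    push_updates()
    print(pull_report("DB 2001(04)"))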
14.2.3 Fax and Telephone Notification

Telephone and fax-based systems provide a compromise between the push-based approach of paper dissemination and the pull-based techniques of more recent, Internet-based approaches. In their simplest form, a pre-recorded message can be used to list all of the most recent updates and changes to paper-based documentation. This can help potential readers to identify the report that they want without consuming the regulator's finite administrative resources. It also enables frequent and rapid updates to be made to the information that is pre-recorded. Unfortunately, the linear nature of recorded speech can make this approach impractical for agencies that publish many different reports. A potential reader would have to listen to the recording for many minutes before hearing about a potential item of interest. The use of pre-recorded messages to provide an index of updates to incident reports still does not address many of the administrative and resource problems that can arise from the paper-based distribution of these documents. At some point, copies of the report have to be printed and shipped to the prospective readers. One solution to these problems is to use fax machines to distribute incident reports. This approach has numerous benefits. For instance, fax-servers can be pre-programmed with large sets of telephone numbers. They will then automatically ensure that a faxed document is sent to every number on the list. More advanced systems will suspend a call if the fax machine is busy and will then re-try the number later in the run. The use of fax machines can also help regulatory authorities to keep track of the recipients of particular documents. For example, the UK MDA's institutional Business Plan for 2001-2002 includes the objective to monitor `first time' fax failures when urgent safety warnings are issued to liaison officers. Of course, it is not possible to ensure that named individuals will have received and read a document that is sent in this manner. There can, however, be a reasonable degree of assurance that the fax has been received by the organisation associated with a particular fax number. The FDA has pioneered the development and use of a more refined version of the systems described in previous paragraphs. They have developed a fully automated `Facts on Demand' system. The user dials up the service and then hears a series of instructions. If, for example, they press `2' on their keypad then they can hear more detailed instructions on how to use the system. If they press `1' then they can choose to order a document. If they dial `INDX', or 4639, on the keypad then they can order an index of all documents on the system. If they choose to order an index, the system will call them back to fax a catalogue of publications. This currently runs to more than 50 pages; however, it avoids the problems associated with listening to a pre-recorded listing for several hours! Callers can then use this faxed index to find the identifier of the document that they want to retrieve. They must then call the system again, select the required option and then enter the document identifier. The system will then automatically fax them back with the required publication. The only technical requirement for the user of such a system is that they have access both to a fax machine and to a touch-tone telephone [264].
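The dialogue structure of such a fax-back service can be summarised in a short sketch. The menu layout, the second catalogue entry and the function names below are hypothetical; only the general shape of the interaction is taken from the description above.

    # A minimal sketch of a `Facts on Demand' style fax-back dialogue. The
    # menu layout and document catalogue are hypothetical.

    CATALOGUE = {
        "4639": "index-of-all-documents",   # `INDX' spelt out on the keypad
        "0267": "talk-paper-archive",       # illustrative identifier only
    }

    def handle_call(keys: str, caller_fax_number: str) -> str:
        """Map a sequence of keypad presses onto a fax-back action."""
        if keys.startswith("2"):
            return "play detailed instructions"
        if keys.startswith("1"):
            document = CATALOGUE.get(keys[1:])
            if document is None:
                return "ask the caller to re-enter a valid document identifier"
            # The service calls the user back, so a busy fax machine can be
            # handled by re-queueing the job rather than abandoning it.
            return f"queue {document} for fax-back to {caller_fax_number}"
        return "replay the main menu"

    print(handle_call("14639", "+1-555-0100"))   # order the 50+ page index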
Most incident reporting systems continue to use paper-based dissemination techniques. Technological approaches, such as that described above, provide additional facilities that build on these more traditional methods. This situation is gradually changing under increasing financial and administrative pressures. These influences can be seen behind the decision to move to the electronic publication of the FDA's User Facility Reporting Bulletin. In 1997, it was decided that it was no longer possible to print and mail this document to anyone who requested it:

"Time, technology, and budget restrictions have come together in the Food and Drug Administration. Ten years ago, our computer capability allowed us to communicate only within FDA. Now, with advanced computer technology we can globally communicate through the Internet and through Fax machines. As you would expect, Congressional budget cuts have affected all parts of government. FDA did not escape these cuts. In the search for ways to reduce our expenses, printing and mailing costs for distribution of publications in traditional paper form have come to be viewed as an extravagant expenditure... Now, budget restrictions prevent future distribution in paper form. We regret the need to move to this new technology if it means that many of our current readers will no longer have access to the Bulletin. We would like to remind you that you can also obtain copies through our Facts-on-Demand System or the World Wide Web." [867]
The concerns voiced in this quotation are understandable given the relatively low penetration of Internet connections into many areas of the US economy in 1997. Fax systems, such as `Facts on Demand', provided an alternative dissemination technique. It can, however, be argued that they are likely to be replaced as more and more companies invest in Internet technology. Computer-based dissemination will then become the primary means of distributing incident reports. Before this can happen, however, we will need to address security concerns and the legal status of electronic reports. We will also need to consider the consequences of providing electronic access to large collections of safety-critical incident reports.
14.3 Computer-Based Dissemination

There are many diverse reasons that motivate the increasing use of information technology to support incident reporting systems. These approaches offer the potential for almost instantaneous updates to be disseminated across large distances. As we shall see, computer-based systems also offer security and access control facilities that cannot easily be provided using paper-based dissemination techniques. The same technological infrastructure that supports the rapid dissemination of individual incident reports also offers mass access to historical data about previous incidents and accidents. There are further motivations for providing this form of access to incident databases:
supporting risk assessment. An important benefit of providing wider access to incident databases is that safety managers can review previous incidents to inform the introduction of new technologies or working practices. This information must be interpreted with care; contribution and reporting biases must be taken into account. Even so, incident databases have been widely used to support subjective estimates about the potential likelihood of future failures [423].

identifying trends. Databases can be placed on-line so that investigators and safety managers can find out whether or not a particular incident forms part of a more complex pattern of failure. This does not simply rely upon identifying similar causes of adverse occurrences and near misses. Patterns may also be seen in the mitigating factors that prevent an incident developing into a more serious failure. This is important if, for example, safety managers and regulators were to take action to strengthen the defences against future accidents.

monitoring the system. Regulators and safety managers can monitor incident data to determine whether particular targets are being achieved. Incident databases have been monitored to demonstrate reductions in particular types of incidents. They have also been used to support arguments about overall safety improvements. The following chapter will address the problems that affect this use of reporting systems. For instance, any fall in the number of contributions to a reporting system can reflect a lack of participation rather than an increased `level' of safety.

encouraging participation. If potential contributors can monitor previous contributions, they can be encouraged to participate in a reporting system. Information about previous incidents helps to indicate the types of events and near misses that fall within the scope of the system. Evidence of previous participation can also help to address concerns about retribution or accusations of `whistle blowing'.

transparency and the validation of safety initiatives. Wider access to incident data helps to validate any actions that are taken in the aftermath of adverse occurrences. For example, several reporting systems have used their incident data to draft a `hit' list of the most serious safety problems [36]. By providing access to the underlying data that supports such initiatives, manufacturers and operators can see the justifications for subsequent regulatory intervention.
information sharing. The development of on-line incident databases enables safety managers and regulators to see whether similar incidents have occurred in the past. This information technology provides further benefits. For the first time, it is becoming possible to extend the search to include the on-line databases of incidents in other countries and even in different industries. It is important not to underestimate the opportunities that this creates. For instance, it is possible to directly view incident data relating to the failure of medical devices in the USA prior to their approval for use in the UK. Conversely, it is becoming possible for the authorities in the USA to view elements of the submissions for approval of particular forms of drugs that are made to the UK authorities. The electronic indexing of all of this data can help investigators, regulators and safety-managers to search through a mass of information that would otherwise overwhelm their finite resources.
The FDA recently summarised the advantages of electronic over paper-based systems for incident reporting in the healthcare domain [271]. Firstly, they argued that automated databases enable readers to perform more advanced searches of information. This is important because individuals are likely to miss relevant information if they are expected to perform manual inspections of large paper-based data sets. Secondly, the FDA also argued that computer-based retrieval systems can be used to view a single collection of information from a number of different perspectives. For instance, it is possible to present summaries of all incidents that relate to particular issues. This might be done by issuing a request to show every incident that involves a software bug or the failure of particular infusion devices. Similarly, other requests might be issued across the same collection of reports to identify incidents that occurred in particular geographical locations or over a specified time period. Such different views can, in principle, be derived from paper-based documentation of adverse occurrences and near-miss incidents. The costs of obtaining such information are, however, likely to be prohibitive. The third justification that the FDA identified for the use of electronic information systems was that they support the analysis of trends and patterns. Many incident reporting systems are investing in new generations of `data mining' applications and `search engines' that can identify subtle correlations within a data-set over time. Finally, the FDA argued that electronic information systems avoid `initial and subsequent document misfiling that may result from human error' [271]. As we shall see, this particular benefit can be more than offset by the problems that many users experience when they attempt to use computer-based systems to retrieve particular incident reports. Many applications require the use of arcane command languages or pre-programmed queries that are often poorly understood by the people that must use the information that is returned to them.
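The kinds of query described above can be illustrated with a short sketch using Python's built-in sqlite3 module. The table layout, field names and the single record are hypothetical; real systems use their own, far richer, classification schemes.

    # A minimal sketch of query-based retrieval from an incident database.
    # The schema and the single record are hypothetical.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE incident (
        id INTEGER PRIMARY KEY, reported TEXT, region TEXT,
        device TEXT, cause TEXT, narrative TEXT)""")
    conn.execute(
        "INSERT INTO incident VALUES (1, '2001-06-14', 'Los Angeles', "
        "'infusion device', 'software bug', 'Dose miscalculated on restart.')")

    # One collection, several views: the same records can be summarised by
    # cause, by geographical location or by time period.
    by_cause = conn.execute(
        "SELECT id, device FROM incident WHERE cause = 'software bug'")
    by_place_and_time = conn.execute(
        "SELECT id, cause FROM incident WHERE region = 'Los Angeles' "
        "AND reported BETWEEN '2001-01-01' AND '2001-12-31'")
    print(by_cause.fetchall(), by_place_and_time.fetchall())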
14.3.1 Infrastructure Issues

It is important not to assume that all incident reporting systems are following a uniform path in the application of information technology. There is an enormous diversity of techniques. Some systems, such as the CIRAS rail application [197], deliberately avoid the use of computer networks. Security concerns have persuaded them to use stand-alone machines. This has profound consequences for the collation of distributed data. Paper forms are used throughout the Scottish rail network and extensive use is made of telephone interviewing, even though it can often be difficult to arrange times when contributors can be contacted in this manner. Other organisations, such as the Swedish Air Traffic Control Organisation, have deliberately created computer-based reporting systems that exploit the benefits of networked applications. Individuals can log onto the system from many diverse locations, both to submit an incident report and to monitor the progress of any subsequent investigation. The following pages, therefore, review the strengths and weaknesses of the technological infrastructures that have supported incident reporting systems. The term `technological infrastructure' is used here to refer to the means of distributing computer-based records of adverse occurrences and near-miss incidents. Subsequent sections examine issues that relate more to the retrieval of reports once they have been disseminated. In other words, this section looks at the techniques that investigatory bodies can use to push information out to end-users. Later sections look at the systems that end-users can exploit to search through that data and pull out information about particular incidents.
Stand-Alone Machines

Many incident reporting systems initially make very limited use of computer-based tools. Typically, they recruit mass-market desktop applications, such as spreadsheets and text editors, to help with managerial and logistical tasks. There are also more mature systems that have deliberately adopted the policy of not using more advanced computational tools. This decision can be justified in a number of ways:
security concerns. The most pressing reason not to exploit more advanced technology in general, and network connectivity in particular, is that many safety managers have concerns about their ability to maintain the security of commercially sensitive incident data. Even if those operating the system have satisfied themselves that precautions can be taken against potential threats, senior levels of management may intervene to prevent the use of local or wide area networks. Security concerns are often most significant when an independent reporting agency holds data on behalf of operating companies. In such circumstances, the integrity and privacy of that information is often a prerequisite for running the system in the first place;

cost issues. The declining costs associated with computer hardware have not been matched by similar reductions in connectivity charges within particular areas of the globe. In consequence, many small scale reporting systems may not be able to justify the additional expense associated with anything more advanced than a stand-alone machine. This is particularly important when the potential contributors to a reporting system are geographically distributed and may be involved in occupations that do not directly involve the use of information technology. For example, high wiring costs and legacy buildings have prevented many NHS trusts from providing direct network access to all of the wards in every hospital;

lack of relevant expertise. Costs not only stem from the physical infrastructure. They also relate to the additional technical expertise that is required to connect stand-alone machines to local and wide area networks. It is relatively simple to register a single machine with an Internet service provider. Most incident reporting systems, however, depend on gathering information from, and disseminating information to, large networks of contributors. Previous sections have also stressed the importance of maintaining connectivity within these networks to ensure that safety information is disseminated in a timely fashion. These factors combine to make it likely that any move beyond a stand-alone architecture will incur additional costs in terms of technical support to maintain the computer-based infrastructure;

`not invented here' syndrome. The previous justifications for the continued use of stand-alone machines are well considered and appropriate. There is, however, a further reason for the longevity of this simple architecture that does little credit to the systems that embody it. As mentioned, many incident reporting systems begin by using commercial, off-the-shelf packages to collate statistical data about previous incidents. As we shall see, many of the individuals who maintain the automated support are acutely aware of the problems and limitations that this software imposes on its users. There can, however, be a reluctance to move beyond these initial steps to elicit professional support from software engineers. This stems from a justifiable fear that the use of more advanced systems may imply a loss of control. In consequence, we see safety-critical data being held on mass-market systems whose licenses explicitly prohibit their use in such critical applications.
Stand-alone computers, without network connectivity, continue to play a significant role in many incident reporting systems. It is possible to identify a number of different modes of operation. They can be used simply to disseminate forms that can then be edited locally. Contributors can then print out the details of a particular incident and post the completed form back to a central agency. This is the approach advocated by the US Joint Commission on Accreditation of Healthcare Organizations' Sentinel Event system [431]. They request that all organisations transmit `root cause analysis, action plan, and other sentinel event-related information to the Joint Commission through the mail' rather than by electronic means. The UK MDA also offer a range of electronic forms in PDF and in Microsoft Word format that can be printed and posted back when they have been completed [537].
The dissemination of incident reporting forms that are then intended to be printed and posted back to central agencies creates something of a paradox. Organisations that use stand-alone machines to coordinate their reporting policies must first obtain copies of these documents before they can complete their submission in a secure manner. The most common means of doing this is to use another, networked machine to download the form and then copy this across to the isolated machine on a disk. This is clearly a protracted mechanism. It may also fail to achieve the level of security that many people believe it ought to. For example, stand-alone machines are equally vulnerable to the free distribution of passwords and the problems associated with unlocked offices [1]. A slightly more complex use of information technology is to distribute a suite of programs that not only helps with drafting incident reports but also helps with the investigation and analysis of near misses and adverse events. Many of these systems are not primarily intended to support the exchange of information between institutions but to ensure that clear and coherent procedures are adopted within an institution. Rather than supporting the dissemination of information about incidents, regulators disseminate software support for local reporting systems. A recent collaboration between the Harvard School of Public Health, the MEDSTAT Group, Mikalix & Company and the Center for Health Policy Studies illustrates this approach [3]. They devised the Computerised Needs-Oriented Quality Measurement Evaluation System (CONQUEST). This is intended to support general information about quality assurance in healthcare. It does, however, share much in common with many incident reporting systems. For instance, it was designed to help managers and clinicians derive answers to the following questions:

1. Did the clinician do the right thing at the right time?

2. Was effective care provided to each patient?

3. Was care provided safely and in an appropriate time frame for each patient?

4. Was the outcome as good as could be expected, given each patient's condition, personal characteristics, preferences, and the current state of medical science? [3]

The project began by devising a classification scheme for the information that it was to maintain. It then "became obvious that a computer database was the logical way to store and retrieve the data". A mass-market, single-user database application was then chosen as the implementation platform for this system. This decision reflects considerable expediency. It is possible to use these applications to quickly craft a working application that can be used to store information about several thousand incidents. This is sufficient for individual hospitals; however, it will not provide indefinite support as the number of incidents increases over time. Nor do these systems provide adequate support for the storage and retrieval of incident information on a national scale. Further problems complicate the use of stand-alone architectures to support local incident reporting systems. Many of the distributed software systems that run on these platforms offer means of tailoring the format of the data to meet local requirements.
In general, this is an excellent approach because it provides safety managers with a means of monitoring incident-related information that might have particular importance within their working context but which has been ignored at a national level. Unfortunately, one consequence of this flexibility is that local systems often develop electronic data formats and classification schemes that are entirely inconsistent with those used by other organisations. These problems even occur when organisations exploit the same version of incident reporting software. Local changes can, over time, partition incident data so that it is difficult to join the individual approaches into a coherent overview of incidents across an industry. Technically, it is possible to match variant fields in different systems providing that it is possible to identify relationships between this information. This will not, however, resolve the problems of missing or partial data.
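A sketch of such field matching is given below. The two record formats, and the mapping between them, are hypothetical; the point is simply that renaming fields cannot conjure up data that a local scheme never recorded.

    # A minimal sketch of matching variant fields between two locally-tailored
    # incident record formats. Both schemas and the mapping are hypothetical.

    FIELD_MAP = {"clinical_area": "ward", "event_type": "category"}  # theirs -> ours

    def harmonise(record: dict) -> dict:
        """Rename the fields we can relate; leave None where data is absent."""
        return {ours: record.get(theirs) for theirs, ours in FIELD_MAP.items()}

    # The renamed record still lacks a category: missing data stays missing.
    print(harmonise({"clinical_area": "ICU"}))  # {'ward': 'ICU', 'category': None}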
Electronic Mail

If incident information is only ever to be held and used locally then many of the previous caveats are of limited importance. In most cases, however, there is a need to support the exchange of information about near-misses and adverse occurrences. Arguably, the most common means of supporting such transmission is through the use of electronic mail. This generic term is usually applied to a range of software applications that exploit the Internet-based Simple Mail Transfer Protocol (SMTP). This enables users to send arbitrary messages between different accounts. The `Seafood Network' provides a good example of the effective use of this simplest form of electronic communication [687]. This ensures that any message sent to a central account, or list server, is automatically distributed to everyone who is registered with the service. The primary purpose of this "Internet based seafood network is to facilitate information exchange about the Hazard Analysis and Critical Control Point (HACCP) system of food safety control in the seafood industry". The HACCP system can be thought of as a form of incident reporting scheme that also disseminates more general safety-related information. There are, however, certain pitfalls that can arise from this relatively simple use of electronic dissemination techniques:

"It has become apparent that some network users may not realise that the e-mail address, [email protected], automatically distributes a message to over 400+ subscribers worldwide. By all means, if you want everyone on the seafood network to read your message, address it to [email protected] To communicate privately as a follow-up, please respond to the individual's e-mail address that is listed on the message."

As mentioned, SMTP supports the exchange of simple text messages. Some email applications also support Multipurpose Internet Mail Extensions (MIME). These enable senders to attach files of a particular type to their mail messages. This is useful if, for example, the sender of a mail message wanted to ensure that the recipient used CONQUEST, or a similar system, to open the file that they had attached to their email. MIME not only sends the file but also sends information, usually hidden from the user, about those programs that can be used to open the attachment. It is clear that the recipients of any message must be able to interpret its contents. In the case of standard SMTP mail, the human reader must be able to understand the contents of any message. In the case of a MIME attachment, the associated program must be able to interpret the data that is associated with the mail message. Some industries are more advanced than others in specifying the format that such transmissions should take. In particular, there are many reasons why this issue should be of particular interest to healthcare professionals. There must be clear standards for the transmission of information if doctors are to interpret patient-related information sent from their colleagues. Similarly, the increasing integration of testing equipment into some hospital networks has created situations in which results can be automatically mailed to a clinician. In consequence, many professional bodies have begun to collaborate on standards for the transmission of clinical information. For example, Health Level 7 is an initiative to develop a Standard Generalised Markup Language, similar to that used on the web, for healthcare documents [184]. These standards are not primarily intended to support incident reporting. They can, however, provide convenient templates for the transmission and dissemination of incident reports that are consistent with emerging practices in other areas of healthcare. This approach is also entirely consistent with attempts to develop causal taxonomies for incident reporting. The leaf nodes in techniques such as PRISMA, described in Chapter 11, might be introduced within such languages to record the results of particular investigations. At present, most reporting systems do not exploit such general standards in the electronic transmission of incident information. Instead, they rely upon a range of formats that are tailored to particular reporting systems and their associated software. One of the most advanced examples of this approach is provided by the Australian Incident Monitoring System [36]. Initially, like many reporting systems, this relied upon the paper-based submission of incident forms from hospitals and other healthcare organisations. As the system grew, it established a network of representatives within `healthcare units'. These representatives now collate the paper-based reports and enter them into a database that exploits the Structured Query Language (SQL). SQL can be viewed as a standard that describes the language that is used when forming requests for information.
The software embodies the Generic Occurrence Classification (GOC). This supports the categorisation and subsequent analysis of the incident records that are held within the system. As mentioned, this initial data entry is performed within the `health units'. At this stage, the system contains confidential information on those involved in the incident. It is, therefore, protected from legal discovery under Australian Commonwealth Quality Assurance legislation. For monitoring purposes, the Australian Patient Safety Foundation (APSF) then collates information from the individual units. Before this is done, all identifying information is removed from the individual reports. Current versions of the AIMS software enable individual units to email this information to the APSF system using the MIME techniques described above. An important aspect of this transmission is that the individual records are encrypted prior to transmission. Later sections will describe how SMTP mail services are insecure; such techniques are, therefore, a necessary precaution against unauthorised access to this information. As mentioned, the AIMS approach is both innovative and well-engineered. There are, however, a number of potential problems with the techniques that it exploits. It adopts a model that is very similar to the stand-alone architecture, which was described in previous paragraphs. The collated database of anonymised incident information is held by the APSF as a central resource. This not only secures the data, it also, potentially, acts as a bottleneck for other healthcare professionals who might have a legitimate interest in analysing the data. In particular, safety managers in individual health units cannot directly pose queries to determine whether an incident forms part of a wider pattern. They must go through the mediation of the APSF. Increasingly, incident reporting systems are adopting a more egalitarian model in which anonymised incident data is also distributed by electronic means. The move has been inspired by work in the aviation community and, in particular, by the metaphor of an information warehouse that has been promoted by the GAIN initiative [308]. Those who contribute data should also have direct access to the data that is contributed by peer organisations. This egalitarian approach poses considerable logistical problems, including the difficulty of ensuring the security of information transfers. As we shall see, a range of encryption techniques can be used during transfer. Password protection can also be used to restrict access. Neither of these techniques addresses the problem of ensuring the consistency of incident databases that may be replicated in each of the peer organisations. In other words, if the AIMS database were to be distributed more widely, there would be a danger that one hospital might be using a collection that was updated in 2000 while others used more recent versions of the database. This problem arises because information about new incidents must not only be sent to the central clearing house operated by the APSF, it must also be sent to all peer organisations. Email can be used for this, but there is no guarantee that every message will be acted upon and incorporated into the existing database. It is also likely that the size of many reporting systems would prevent any attempt to email out the entire collection at regular intervals.
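This consistency problem can be illustrated by the following sketch, which assumes a hypothetical version stamp on each replicated copy of a database; nothing in it corresponds to the actual AIMS software.

    # Each peer organisation holds a replica tagged with the version of the
    # collection that it last applied (the stamps are purely illustrative).
    CLEARING_HOUSE_VERSION = "2001-Q3"

    replicas = {"Hospital A": "2001-Q3",
                "Hospital B": "2000-Q1"}   # an emailed update was never applied

    for unit, version in replicas.items():
        if version != CLEARING_HOUSE_VERSION:
            print(f"{unit} is searching an out-of-date collection ({version})")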
It is for this reason that organisations such as the National Transportation Safety Board (NTSB) exploit a mixture of on-line updates and CD-ROM digests of their incident databases. Organisations can then either download each new set of reports as they become available or simply order a CD-ROM of the entire updated collection.
CD-ROMs

Compact Disk-Read Only Memory provides a relatively cheap means of disseminating incident information without requiring that recipients expose their machines to the security risks associated with a network connection. This technology also avoids the level of technical support that can be associated with network administration. The popularity of this format is based upon the storage capacity of the medium. Most CD-ROMs provide nearly 0.7 gigabytes of data; this is equivalent to almost 500 high-density floppy disks. The emerging successors to this format, such as the Digital Versatile Disc (DVD-ROM), expand this capacity to 8.5 gigabytes. CD-ROMs are also highly portable and lightweight, making the postal distribution of large amounts of data far cheaper in this format than as printed documentation. It is also possible to encrypt the information on a CD-ROM; this provides added protection against the problems that can arise if critical documents go missing in the postal system. All of these technical attributes make this format particularly well suited as a communication medium for the dissemination of incident databases.
The CD-ROM format has further advantages. In particular, it provides a maximum data rate of between 2.8 and 6 megabytes per second on a 40x drive. This might seem a relatively trivial, technical statistic. However, such speeds enable regulators and investigatory agencies to include multimedia resources, such as short audio and video clips, in addition to textual information and static images. Previous sections have described how CD-ROMs can be used to ensure consistency through periodic updates to the many different users of a reporting database. These databases do not, typically, make use of multimedia resources. In contrast, the additional facilities provided by the CD-ROM format tend to be exploited by some of the other publications that are generated by incident reporting systems. Hence, the sheer storage capacity of this medium provides a means of disseminating textual incident databases. The access speeds supported by CD-ROM enable regulators to disseminate multimedia training presentations that are intended to guide safety managers and operators who must follow particular recommendations in the aftermath of near miss incidents and adverse occurrences. The use of CD-ROMs to distribute information about adverse occurrences raises a number of further problems. In large, distributed organisations it may not be possible to provide all members of staff with access to personal computers that can play these disks. A number of innovative solutions have been devised to address this problem. One of these is illustrated by a staff training scheme that was developed by the Royal Adelaide Hospital. This scheme was closely tied to the Australian Incident Monitoring System, mentioned above. The hospital developed `Mobi-ed' units that resemble the information booths that are found in public areas such as shopping malls and airports; `the cabinet has solid wheels with brakes, a handle at the back for moving it around, and locks the computer behind two separate doors' [40]. They were based on standard desktop PCs. The booths also had the advantage that they could be left in common areas where a number of members of staff might have access to them during many different shift patterns. The units were deliberately moved between locations in the hospital; "this appearance and disappearance of the units encourages staff to check them out whenever they appear in their ward or work area" [40]. Multimedia training material was obtained to address specific training needs identified from incident reports. The Mobi-ed units also provided staff with access to an on-line tutorial about how and when to complete a submission to the AIMS reporting system. It is important to recognise that CD-ROMs are simply a storage technology that supports the distribution of incident databases and training material. A number of deeper questions can, however, be raised about the effectiveness of the material that is distributed using this technology. Chapter 15 will analyse the effectiveness of incident databases. The following paragraphs provide a brief appraisal of the multimedia training packages that are increasingly being developed by reporting agencies for dissemination on CD-ROM. We initially became interested in the effectiveness of this approach when developing materials for Strathclyde Regional Fire Brigade. Their training requirements are very similar to those of the healthcare professionals in the Royal Adelaide Hospital, mentioned above. Staff operate shift patterns and are geographically distributed across a number of sites.
There are also differences; for instance, the training activities of a particular watch will be disrupted if they are called out in response to a call from a member of the public. Such circumstances increase the appeal of CD-ROM based training. The intention was to develop a series of multimedia courses that supported key tasks which had caused particular concern during previous incidents. Fire-fighters could work through a course at their own pace. They could also suspend an activity and return to it after a call-out. There was also a perception that the use of computer-based technology might address the subjective concerns of staff who found conventional lectures to be ineffective. Figure 14.1 presents the results of a questionnaire that was issued to 27 staff within the Brigade.

Figure 14.1: Perceived `Ease of Learning' in a Regional Fire Brigade

At the start of the project, they did not have access to any computer-based training. They received information about safety-related issues through paper publications, lectures and videos as well as drill-based instruction. However, many fire-fighters viewed `real' incidents as an important means of acquiring new knowledge and reinforcing key skills. This finding has important health and safety implications. Incidents should reinforce training gained by other means. They are clearly not a satisfactory delivery mechanism for basic instruction. As mentioned, a series of CD-ROM based training packages was developed to address the perceptions identified in Figure 14.1. These applications were developed in collaboration with the Brigade
training officer, Bill West, and two multimedia developers who were employed by the Brigade, Brian Mathers and Alan Thompson. Figure 14.2 illustrates one of these tools.

Figure 14.2: The Heavy Rescue Vehicle Training Package

This tool exploited the desktop virtual reality techniques introduced in Chapter 8. Fire-fighters could use a mouse and keyboard to `walk' into a Heavy Rescue Vehicle. They could then look inside equipment lockers and obtain a brief tutorial on the design and use of particular rescue devices. Previous incidents had illustrated the difficulty of providing officers with enough time to train on this particular vehicle, given an operational requirement to keep it `on call' as much as possible. As mentioned, multimedia applications can be devised to address particular concerns that emerge in the aftermath of near miss incidents and adverse occurrences. The performance characteristics, in particular the access speeds of CD-ROMs, make this the favoured distribution medium for such materials, given network retrieval delays. It is important, however, both to understand and to assess the various motivations that can persuade organisations to invest in tools such as that illustrated in Figure 14.2. In particular, we have argued that the strong motivational appeal of computer-based systems can support staff who find it difficult to be motivated by more traditional forms of training. We were, however, concerned that the introduction of computer-based techniques should not compromise particular learning objectives. We, therefore, conducted an evaluation that contrasted a computer-based system with more traditional techniques. A matched subject design was adopted; each fire-fighter was paired with another officer of equivalent rank and each member of the pair was then randomly assigned to one of two groups. Both groups were given access to the same computer-based training package on techniques to support the effective application of foam to combat particular types of fire. The technology used to produce the interactive application was similar to that used in Figure 14.2. One group was then given a CD-ROM based Comprehension Tool. This guided the officers through a series of questions about the training material and provided immediate feedback if any problems were diagnosed [425].
The second group was given a pencil and paper test without any feedback about the accuracy of their responses. One week later, both groups were re-tested using the Comprehension Tool. It was hypothesised that the group that had previous access to the CD-ROM based self-assessment tool would achieve significantly higher scores than the group that had performed the pencil and paper test. A weakness in this experimental design is that learning effects might improve the results of the group that already had some experience with the Comprehension Tool. These effects were minimised by ensuring that both groups were entirely confident in the use of the tool before the second test began. Tables 14.6 and 14.7 present the results obtained for the two groups involved in this evaluation.

Rank                   Number   Comprehension Tool or Paper test?   Score 1   Score 2
Sub officer               1     Paper                                  72        76
Leading fire-fighter      2     Comprehension tool                     60        56
Leading fire-fighter      3     Comprehension tool                     80        68
Fire-fighter              4     Paper                                  36        44
Fire-fighter              5     Paper                                  88        84
Fire-fighter              6     Comprehension tool                     64        52
Fire-fighter              7     Paper                                  40        40
Fire-fighter              8     Comprehension tool                     52        32

Table 14.6: Results for the first group of Fire Fighters

Rank                   Number   Comprehension Tool or Paper test?   Score 1   Score 2
Station officer           1     Comprehension tool                     96        88*
Leading fire-fighter      2     Paper                                  80        60*
Leading fire-fighter      3     Paper                                  92        80
Fire-fighter              4     Comprehension tool                     68        48
Fire-fighter              5     Comprehension tool                     64        68
Fire-fighter              6     Paper                                  52        40*
Fire-fighter              7     Comprehension tool                     84        64
Fire-fighter              8     Paper                                  60        60

Table 14.7: Results for the second group of Fire Fighters (* post fire)

This evaluation forms part of a far wider attempt to validate the use of computer-based learning techniques. It does, however, provide a case study in the problems that can arise during these validation exercises. For instance, subtle differences in position 1 of the ranking schemes in Tables 14.6 and 14.7 complicate the task of making accurate comparisons. The station officer's performance is better than that of the sub-officer. These differences reflect the operating characteristics and composition of this organisation; they could not simply be changed for experimental expediency. Further problems arose because the second group of fire-fighters was called out while we administered the retest. It would have been unethical to prevent them from responding until after they had completed the evaluation! This study also provided some direct insights into the use of computer-based training techniques. Statistical t-tests failed to show any significant differences in the re-test scores between those who had access to the CD-ROM based tool and those who sat the pencil and paper test. We were unable to establish that the computer-based tool was any better than the more traditional techniques. Or, put another way, it was no worse than existing methods for the dissemination of safety-related information. The previous paragraphs are intended to correct the euphoria that often promotes the use of CD-ROM based technology for the publication of safety-related training materials. Often, managerial and political pressures encourage the use of `leading edge' technology without any careful analysis of whether this technology will support key learning objectives. Cost constraints can also limit many organisations' ability to disseminate the insights gained from previous incidents and accidents in any other format. Some regulatory and investigatory bodies have, however, continued to resist these pressures. For example, the State Training Teams of the FDA's Office of Regulatory Affairs support a `lending library' of courses that must be presented by "trained state/federal facilitators that have already completed the original satellite course" [265]. These trained mentors are supported by course videos, books, exams and answer key forms. It is interesting to note that these courses cover topics that are perceived to have a relatively high degree of importance in the FDA's regulatory role. For example, they include two courses on the investigation and reporting of adverse occurrences. This might imply that many of the issues addressed in previous chapters of this book cannot easily be taught using computer-based techniques.
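The retest comparison can be reconstructed from Tables 14.6 and 14.7. The following sketch assumes the matched-pair reading of those tables, in which each numbered pair contains one officer who used the Comprehension Tool and one who sat the paper test, and it includes the asterisked post-fire scores as reported; it uses the third-party scipy library to run the paired t-test.

    from scipy import stats

    # (pair, group 1 condition, group 1 retest score, group 2 retest score);
    # 'T' denotes the Comprehension tool and 'P' the pencil and paper test.
    pairs = [
        (1, "P", 76, 88), (2, "T", 56, 60), (3, "T", 68, 80), (4, "P", 44, 48),
        (5, "P", 84, 68), (6, "T", 52, 40), (7, "P", 40, 64), (8, "T", 32, 60),
    ]

    tool, paper = [], []
    for _, g1_condition, g1_score, g2_score in pairs:
        if g1_condition == "T":   # the group 2 member of this pair sat the paper test
            tool.append(g1_score); paper.append(g2_score)
        else:
            tool.append(g2_score); paper.append(g1_score)

    t, p = stats.ttest_rel(tool, paper)   # paired t-test across the matched pairs
    print(f"t = {t:.2f}, p = {p:.3f}")    # p > 0.05: no significant difference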
Local and Wide Area Networks

The previous section has identified some of the strengths and weaknesses of CD-ROM technology for disseminating the information that can be derived from incident reporting schemes. CD-ROMs can be used to distribute the multimedia training resources that are intended to address previous failures. It can, however, be difficult to demonstrate the effectiveness of these resources. In contrast, CD-ROMs offer numerous benefits for the distribution of incident databases. They are relatively cheap. They offer relatively high storage capacity together with a relatively compact, lightweight format that is rugged enough to survive most postal services. Data can also be encrypted to provide additional security should a CD-ROM be lost or intercepted. There are, however, a number of limitations with this use of CD-ROM technology. In particular, it can be difficult to use this approach to issue more immediate updates to safety-related information. We have already described the delays that can be introduced through the use of postal services to distribute physical media, such as CD-ROMs. In contrast, many organisations are increasingly using computer networks to support the more rapid dissemination of information about adverse occurrences and near miss incidents. The MDA provide an example of this in their Business Plan for 2001-2002. They express the intention to develop
closer links with the National Health Service and the Commission for Health Improvement with the objective of `improving the dissemination' of information about adverse events. This will be done by `electronic dissemination through our website and other Internet systems, so that healthcare professionals will increasingly have important safety information at their fingertips' [544]. It is convenient to identify two different sorts of incident data that can be accessed over computer networks. Firstly, incident databases help to collate information about large numbers of adverse occurrences and near misses. Secondly, incident libraries provide access to small numbers of analytical reports that may summarise the findings from many different incidents. In either case, these electronic documents must be stored in a particular format if they are to be disseminated across computer networks. Earlier sections of this chapter described the strengths and weaknesses of two of these formats. Hypertext Markup Language (HTML) documents can be viewed using standard browsers and are easily indexed by search engines, but cannot easily be printed. Adobe's Portable Document Format (PDF) avoids this problem, but most search engines have to be adapted to search this proprietary format for the keywords that are then used when users issue search requests. Incident data can also be stored in the file formats that are supported by commercial mass-market databases and spreadsheets. This approach tends to be associated with incident databases. They are used to provide access to summary data about large numbers of individual incidents. PDF and HTML are more commonly used to support the dissemination of analytical surveys and the longer reports that are contained in on-line `reading rooms'. The distinction between on-line libraries and incident databases is significant not simply because it influences file formats and retrieval techniques, but also because it reflects important distinctions in the policies that determine what is, and what is not, made available over computer networks. Private Databases and Public Libraries. Some organisations maintain private electronic databases that are not mounted on machines that are accessible to a wider audience. These same organisations may, however, provide wider access to the libraries of reports and recommendations that are derived from these private databases. This approach is currently being exploited by the MDA; `we plan to introduce web reporting facilities that will feed directly into the Adverse Incident Tracking System (AITS) software' [544]. AITS is intended to help the Agency keep all its main records in electronic form for `action and archiving'. It will also provide MDA staff with `flexible data analysis tools to identify trends and clusters of incidents and that will enable us to adopt a more pro-active approach to reducing adverse incidents'. It is not intended that other organisations should have access to this database. In contrast, electronic access will be provided to what the previous paragraphs have characterised as `libraries' of analytical overview documents. In passing, it should be stressed that the technical details of the AITS software have not been released, nor is there a detailed account of the full system development plan. It may very well be that the objectives outlined in the 2001-2002 Business Plan will be revised as AITS is implemented. Public Databases and Public Libraries.
Other organisations provide access both to `reading rooms' of analysis and to the databases of incidents that are used to inform these more analytical accounts. For example, the FDA's Manufacturer and User Facility Device Experience (MAUDE) database is freely available over the Internet [272]. It is comparable to AITS because it records voluntary reports of adverse events involving medical devices. An on-line search facility allows users to search the Center for Devices and Radiological Health's database of incident records. It is also possible to download the data in this collection over the Internet. These files are updated every three months. They are in a text format that enables safety managers and other potential readers to import them into a commercial database or word processor for further analysis. At the same time, the information in the MAUDE system is also used to inform more detailed incident investigations and surveys of common features across several adverse occurrences. The resulting reports are also available on-line in PDF format via an electronic `reading room' [270]. This open dissemination policy enables readers to examine the warnings that are contained in a particular safety issue or alert publication. They can then trace additional details about particular incidents, and related occurrences, using the MAUDE database. It is important to stress, however, that the provision of public databases and reading rooms need not imply that sponsor organisations do not also maintain more private systems that are not made available in the manner described above.
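A minimal sketch of this import route follows, assuming a hypothetical file name and a pipe-delimited layout with four fields; the actual MAUDE extracts define their own column sets, so the schema below is purely illustrative.

    import csv
    import sqlite3

    conn = sqlite3.connect("maude_local.db")
    conn.execute("CREATE TABLE IF NOT EXISTS reports "
                 "(report_key TEXT, event_type TEXT, device TEXT, narrative TEXT)")

    # Load one quarterly text extract into the local collection.
    with open("maude_quarterly.txt", encoding="latin-1") as f:   # hypothetical name
        for row in csv.reader(f, delimiter="|"):
            if len(row) >= 4:
                conn.execute("INSERT INTO reports VALUES (?, ?, ?, ?)", row[:4])
    conn.commit()

    # Safety managers can then pose their own queries over the local copy.
    for (key,) in conn.execute("SELECT report_key FROM reports "
                               "WHERE narrative LIKE ?", ("%infusion pump%",)):
        print(key)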
Private Databases and Private Summaries. Some incident reporting systems restrict access both to their database information and to the reports that are derived from them. This policy is reflected in the way in which the AIMS system restricts access to its central database and only provides feedback on comparative performance to the individual units that participate in the scheme. These private summaries can be distributed over networks, either using the e-mail systems that have been described in previous paragraphs or using more explicit forms of file transfer [186]. This approach is entirely understandable given the sensitive nature of incident reporting within individual hospitals. There are other circumstances in which computer networks have been developed to support incident investigation and analysis within an organisation. The intention has never been that the data should be made public, but that it should support specific tasks and objectives within the particular teams that must act upon incident data. This is illustrated by the U.S. Department of Health and Human Services' PulseNet system [259]. This system was established to distribute information generated from a molecular technique, pulsed-field gel electrophoresis (PFGE), that can be used to identify similarities between different samples of E. coli O157:H7. The PFGE technique was first used during a food-borne illness outbreak in 1993. It enabled laboratories in different locations to determine that they were fighting a common strain of bacteria. The lack of efficient computer networks to distribute the information from the independent PFGE tests prevented analysts from identifying these common features for the first week of the outbreak. Seven hundred people became ill and four children died in the outbreak. PulseNet is intended to reduce the interval taken to detect future incidents to approximately 48 hours. This system is not intended to support the public dissemination of information about such events. It is, however, intended to support the analytical and decision making tasks that are necessary in order to detect common features between apparently isolated incidents. It is important to emphasise that the increasing use of computer networks in incident reporting is only a small part of a wider move to integrate diverse information sources about potential hazards. This integration is intended to support decision making. In other words, the dissemination of incident-related information is not an end in itself. This is illustrated by the manner in which epidemiologists can use PulseNet to trace common features in E. coli outbreaks. It is also illustrated by recent attempts to integrate diverse Federal databases to support the FDA field officers who have to determine whether or not to admit medical devices into the United States of America. The intention behind this initiative is to provide officers with rapid access to the range of data that they require in order to reach a decision. This data includes information about any previous adverse events involving particular products. However, this is only one part of a more complex set of requirements. For example, officers will also have to access the FDA's Operational and Administrative System for Import Support (OASIS) and Customs' Automated Commercial System (ACS). These data sources can be used to identify whether a product violates particular regulations by virtue of its point of origin. They can also be used to determine whether or not previous samples conformed with regulations when explicitly tested by the FDA.
There is no intention that this integrated system should be widely accessible over public computer networks. It is, however, possible to access some of these information sources. For instance, it is possible to view the safety alerts that apply to particular products over the World Wide Web. This provides an example of the flexibility that such computer networks can offer for the provision of safety-related information. Users can choose whether to view warnings that are sorted by particular industries, by country of origin or by FDA reference number. Users can also conduct free-text searches over the database of import alerts. The FDA's import system contradicts the binary distinction between public and private distribution that was made in previous paragraphs. In practice, computer networks enable their users to make fine-grained decisions about who can and who cannot access incident information. In the case of the import system, full functionality is reserved for field officers. Individual elements of the entire system are, however, made available for the public to access over the Internet. The same techniques can also be used more generally to restrict access to particular information about previous incidents. These approaches implement access control policies. The most common approach is to erect a `firewall' that attempts to prevent access from anyone who is not within the local network that hosts the system. The following section discusses some of the consequences that such measures have for the implementation and maintenance of incident reporting systems. In particular, it is argued that compromises must often be made between restricting access and simplifying the procedures that
users must follow in order to obtain access to incident data.
14.3.2 Access Control

Security deals with the unauthorised use of, and access to, the hardware and software resources of a computer system. For example, unauthorised disclosure occurs when an individual or group can read information that they should not have access to. They, in turn, can then pass on information to other unauthorised parties. For instance, an unauthorised party might pass on information about an adverse occurrence to the press or broadcast media before that incident has been fully investigated. Unauthorised modification occurs when an unauthorised individual or group can alter information. They might have permission to `read' data items, but this does not automatically imply that they should also be able to modify data. It is, therefore, important to distinguish between different levels of permission. For example, an individual hospital contributing incident reports to a central database may have permission to access and modify its own reports. It might, in contrast, only be able to read reports from other hospitals without being able to modify them. Finally, unauthorised denial of service occurs when an individual or group can shut down a system without authority for taking such an action. Unauthorised denial of service is a general problem in computer security. For example, the propagation of viruses can deny other applications the computational resources that they require. I am unaware of any specific instances in which this form of attack has been a particular problem for incident reporting. It is important to stress, however, that unauthorised denial of service could have potentially profound consequences as incident reporting systems become more tightly integrated into complex decision support systems, such as the FDA's import application. The issue of security affects incident reporting systems in a number of ways. For example, the Central Cardiac Audit Database (CCAD) project identifies two main threats [184]. The first centres on the security of data during transmission. When data is transmitted across open networks, such as the Internet, it can be intercepted unless it is encrypted. The second set of security concerns centres on controlling access to incident information after it has been collated. This is important because it is often necessary to ensure that different individuals and groups have different degrees of access to sensitive information. Some may be denied access to particular records. Other groups may be entitled to read data without being able to modify or `write' it.
               Unit 1's Reports   Unit 2's Reports   Unit N's Reports
Administrator  read, write        read, write        read, write
Regulator      read               read               read
Unit 1         read, write        read               read
Unit 2         read               read, write        read
Unit N         read               read               read, write

Table 14.8: General Form for an Access Control Matrix

The distinction between `read' and `write' permissions has led to the development of access control policies. In their simplest form, these techniques implement a matrix that associates particular privileges with the users of a system and the objects that are held by that system. This is illustrated by Table 14.8. As can be seen, system administrators must be able to access and modify the reports that are submitted from all of the units that contribute to a reporting system. This requirement is, typically, enforced so that they can implement any revisions or updates that may subsequently prove to be necessary for the maintenance of the system. For example, they can automatically insert additional fields into the record of an incident. External regulators, in this instance, are provided with read-only access to all reports. Each of the contributing units can also read the reports from other contributors. They can also modify their own information. It is important to stress that the exact form of an access control matrix depends upon the nature of the reporting system. For example, some applications only provide read access to their contributors. Contributors cannot modify their own data, and all updates must be performed through an administrator who is entirely responsible for any `write' actions. This reflects elements of the AIMS approach. In this case, the access control
matrix would only contain write entries in the Administrator row. It is also important to emphasise that access is only granted if it is explicitly indicated in the matrix. By default, all other permissions are denied. In Table 14.8, the general public would not have any right to obtain or modify incident data. Access control matrices are explicitly embodied within many of the more sophisticated software applications that have been developed to support incident reporting schemes. When a user makes a request to access a particular item of information, the system identifies the row associated with that user in the matrix. It then looks along the columns until it finds an entry associated with the object of the request. If the user does not have the relevant permissions then the request is denied. Unfortunately, this approach is not a feature of single-user systems. In general, access control makes little sense when there is only one row in the matrix. This has important consequences for many reporting systems that continue to use mass-market, desktop applications to support the dissemination of incident information. Single-user spreadsheets and databases, typically, have no means of making the fine-grained distinctions implied by Table 14.8. In consequence, if a user is granted access to the system then they have complete permission to access all data. It is, typically, possible to apply locking techniques to the information that is held by these systems. This prevents unauthorised modification. However, this `all or nothing' approach is usually too restrictive for large-scale systems [186]. Access control matrices provide numerous benefits to incident reporting systems. They explicitly represent the security policy that is to be enforced during the distribution of potentially sensitive information. They are not, however, a panacea. As we have seen, it is entirely possible for individuals or groups to abuse their access permissions. For example, Unit 1 might pass on information about Unit 2 to a third party that is not entitled to this access, according to Table 14.8. In such circumstances, it is possible for system administrators to identify the potential sources of any `leak' by inspecting the entries in the column that is associated with any disclosed information. Table 14.8 makes a number of simplifying assumptions. For instance, we have not considered `grant privileges'. These enable particular users or groups to provide access permissions on certain objects. This is most often necessary when new Units join the system. The administrator would have to ensure that they were granted permission to read the contributions from all of the other Units. The entries associated with the system administrators would, therefore, be revised to read, write, grant. Paradoxically, the ability to grant access also implies the ability to remove or deny access permissions. For instance, if incident information were being leaked to a third party then administrators might take the decision to remove all read permissions except those that apply to the Unit that contributed particular reports.
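The matrix lookup described above can be sketched in a few lines of Python; the dictionary encoding and the check_access helper are illustrative assumptions rather than any reporting system's actual software.

    # A partial encoding of the access control matrix in Table 14.8.
    PERMISSIONS = {
        ("Administrator", "Unit 1's Reports"): {"read", "write"},
        ("Administrator", "Unit 2's Reports"): {"read", "write"},
        ("Regulator",     "Unit 1's Reports"): {"read"},
        ("Regulator",     "Unit 2's Reports"): {"read"},
        ("Unit 1",        "Unit 1's Reports"): {"read", "write"},
        ("Unit 1",        "Unit 2's Reports"): {"read"},
        ("Unit 2",        "Unit 1's Reports"): {"read"},
        ("Unit 2",        "Unit 2's Reports"): {"read", "write"},
    }

    def check_access(user: str, obj: str, action: str) -> bool:
        # Access is granted only if it is explicitly listed in the matrix;
        # by default, every other request is denied.
        return action in PERMISSIONS.get((user, obj), set())

    assert check_access("Unit 1", "Unit 1's Reports", "write")
    assert not check_access("Unit 1", "Unit 2's Reports", "write")
    assert not check_access("Public", "Unit 1's Reports", "read")  # default deny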
14.3.3 Security and Encryption

Access control matrices define the policy that is to be followed in the distribution and modification of incident information. In order to implement such a policy, most software systems rely upon encryption algorithms. As might be expected, these techniques take the original document, or plain text, and produce a cipher. Ideally, it should not be possible for an unauthorised person or group to derive the plain text from the cipher. One means of helping to prevent this is to create an encryption algorithm that relies not only on the plain text but also on an additional input parameter known as a key. This can be illustrated by Caesar's algorithm, which replaces each letter in the plain text with the next letter in the alphabet. The letter `a' would be replaced by `b' in the cipher, `c' would be replaced by `d' and so on. This is a relatively simple algorithm to guess and so we might require that the user also supplies a key. The key could be the number of places that each letter is offset. For example, in order to decipher the following phrase we must know that each letter has been shifted by 13 places in the alphabet:
Cipher:     VAPVQRAG ERCBEGVAT
Plain text: INCIDENT REPORTING
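A minimal sketch of this shift cipher follows, in Python; the function name is an illustrative choice. Applying a key of 13 to the cipher recovers the plain text shown above.

    def shift(text: str, key: int) -> str:
        # Shift each letter `key` places through the alphabet, leaving
        # spaces and punctuation untouched.
        return "".join(
            chr((ord(c) - ord("A") + key) % 26 + ord("A")) if c.isalpha() else c
            for c in text.upper()
        )

    print(shift("VAPVQRAG ERCBEGVAT", 13))   # -> INCIDENT REPORTING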
This simple algorithm is vulnerable to many different forms of attack. For instance, we can examine letter frequencies in the cipher to make guesses about the identity of particular letters. It does,
however, illustrate the key features of secret or private key encryption. In order for this approach to work, it is important that the key is never disclosed to unauthorised individuals. The previous example also illustrates further aspects of secret key encryption. For instance, this approach can be strengthened by choosing a suitably complex algorithm. Alternatively, it is often more convenient to choose a relatively simple algorithm but a very complex key. Without the key, even if the algorithm is well known, it can be extremely difficult for unauthorised individuals to decipher a message. It is for this reason that encryption software often emphasises the size of the keys that it supports, 64 or 128 bits for example. Secret key encryption is a feature of many national reporting systems. Regulatory and investigatory agencies provide each unit in the system with the encryption software and a password. This is then used to protect incident reports when they are transmitted across computer networks. This approach was adopted by the APSF's AIMS system when they moved from paper to electronic submissions [35]. One problem with secret key encryption is that it can be difficult to secure the dissemination of keys. They cannot be transmitted over the computer network because this would defeat attempts to secure transmission over that network. Conversely, it is impossible to secure the network until the secret key has been agreed upon. Public key encryption provides an alternative to secret key encryption. This relies upon algorithms that require different keys to encode and decode the plain text. Typically, users distribute a public key to anyone who might want to send them secure information. They can do this because they know that this key can only be used to encode data. Only they have access to the second, private key that is required in order to read any message. This is the approach adopted by the `Pretty Good Privacy' or PGP mechanisms that are widely available over the Internet. PGP is one of several systems that are recommended for transmission of data to the Central Cardiac Audit Database, mentioned earlier [184]. The PGP package provides a variety of utilities for the generation, management and use of encryption keys. It has the advantage of being low or no cost and is widely available. Secure/Multipurpose Internet Mail Extensions (S/MIME) is an application of public key encryption to the MIME technology that is widely used for the dissemination of incident data. This approach is supported by recent versions of Netscape Communicator and Internet Explorer. S/MIME applies encryption to individual files that are mailed from one machine to another. It is also possible to create secure links that encrypt information passing between two or more machines. This approach is deliberately designed to support more interactive forms of communication, such as web browsing, where information can be passing in both directions over the connection. The best known application of this approach is the Secure Sockets Layer (SSL) protocol. This applies public key encryption over an entire session rather than an individual item of mail. PGP, S/MIME and SSL have all been used to secure the data that is transmitted between the contributors of incident reports and regulatory or investigatory agencies [271, 184]. It is important to emphasise that these technologies do not provide any absolute guarantees about the security of any electronic communication. It is theoretically possible to break most implementations.
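The asymmetry between the two keys can be sketched as follows, using the third-party Python cryptography package as a stand-in for the PGP, S/MIME and SSL implementations named above; the choice of RSA with OAEP padding is an illustrative assumption.

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # The regulator generates a key pair and publishes only the public key.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)

    # Any contributor can encode a report with the public key, but only the
    # holder of the private key can recover the plain text.
    cipher = public_key.encrypt(b"INCIDENT REPORTING", oaep)
    assert private_key.decrypt(cipher, oaep) == b"INCIDENT REPORTING"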
The technical expertise and computational resources that are required make it extremely unlikely that this will occur, at least in the short term. These observations reiterate an important concept: it is seldom possible to achieve absolute security. Safety managers must adopt an informed approach to risk assessment. The degree of technological sophistication applied to secure incident data must be proportional to the sensitivity of that data. It is important, however, that these issues are explicitly considered as more and more reporting systems use computer-based networks as a cheap and effective means of disseminating information about near misses and adverse occurrences [378]. One way in which cryptography has been used by incident reporting systems is to support digital signatures. A digital signature is a means of encoding a message in such a manner that it authenticates the sender's identity. This is important if regulators and investigation agencies are to ensure that reports of an incident have not been sent for malicious reasons. Both private and secret key techniques can be used to implement digital signatures. The fact that the recipient can decode a message that was encoded using the secret key agreed with the sender might, at first sight, seem to be sufficient for a secret key implementation. No other person should know the secret key. Unfortunately, this is vulnerable for a number of reasons. This approach is vulnerable to a replay attack in which a previous message is saved and later resent by some unauthorised agent. Similarly,
some portion of a previous message may be cut and pasted to form a new message. This is feasible because it is possible to make inferences about the contents of a message even if it is impossible to completely decipher all of its contents. For these reasons, secret key implementations of digital signatures usually also encode characteristics of the entire message, such as the date when it was sent and the number of characters in the plain text. When the message has been decoded, the recipient can check this additional information to ensure the integrity of the content. This technique can also be used when the message is not, itself, encoded. A signature block can be encrypted at the end of the message. Again, the recipient can decode the signature block and use the techniques described above to establish its authenticity. This approach has given rise to a range of more elaborate techniques that support the concept of electronic watermarks. Public key implementations of a digital signature can be slightly more complicated. For instance, a contributor might encrypt a message using its secret key. It will then encrypt the results of this encryption using the public key of the regulator. The message is then sent over a computer-based network. The regulator first decodes the message using their secret key. Ideally, no other users can complete this first step, assuming that the regulator's secret key is not compromised. Next, the regulator can apply the contributor's public key to extract the plain text. The regulator knows that the contributor sent the message because only the contributor has access to their secret key. It might seem that such details are a long way removed from the practical issues that must be considered in the development and operation of incident reporting systems. The key point about digital signatures is that they enable organisations to transmit information in a secure manner that can be granted the same legal status as conventional, paper-based documents. For instance, in 1997 the FDA issued regulations that identified the criteria that would have to be met for the use `of electronic records, electronic signatures and handwritten signatures executed to electronic records as equivalent to paper records and handwritten signatures executed on paper' [271]. These regulations applied to all FDA program areas and were intended to support the widest possible use of electronic technology `compatible with FDA's responsibility to promote and protect public health'. The FDA requirements illustrate the importance of understanding some of the concepts that have been introduced in previous paragraphs. For instance, Section 11.70 requires that `electronic signatures and handwritten signatures executed to electronic records must be linked to their respective records so that signatures cannot be excised, copied, or otherwise transferred to falsify an electronic record by ordinary means'. Such concerns have motivated the development of the message-specific signature blocks mentioned above.
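A message-specific signature block of this kind can be sketched with Python's standard hmac module; the field layout and shared key are illustrative assumptions, not the format required by the FDA regulations.

    import datetime
    import hashlib
    import hmac

    SECRET_KEY = b"agreed-in-advance"   # shared by contributor and regulator

    def signature_block(message: str) -> str:
        # Bind the signature to the whole message together with the date it
        # was sent and the number of characters in the plain text, so that
        # replayed or cut-and-pasted fragments fail the recipient's check.
        stamp = f"{datetime.date.today().isoformat()}|{len(message)}"
        tag = hmac.new(SECRET_KEY, f"{stamp}|{message}".encode(), hashlib.sha256)
        return f"{stamp}|{tag.hexdigest()}"

    report = "INCIDENT REPORTING"
    block = signature_block(report)
    # The recipient recomputes the block from the received message and the
    # shared key; any mismatch reveals tampering, truncation or replay.
    assert hmac.compare_digest(block, signature_block(report))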
There are many reasons why incident reporting systems are forced to introduce some of these more advanced security measures. The FDA regulations reflect a concern to ensure that electronic reports of adverse occurrences have the same legal status as their paper-based counterparts. These more advanced security techniques can also be implemented in response to the concerns expressed by potential contributors. There is often a fear that an individual's identity will be revealed. Similarly, safety managers in the healthcare industry must also respect patient confidentiality. There is, however, usually a trade-off to be made between the security of a system and the ease with which it can be used by its operators. For example, the use of cryptographic techniques implies a considerable managerial overhead both on the part of systems administrators and the users who must remember the passwords that protect and sign their information. Recent studies have suggested that, for every user of a system, there is a request for support staff to reset a forgotten password every three to four months [1]. It is difficult to over-emphasise the human element in any secure system. For example, recent attempts to introduce public and private key cryptography into one national reporting system produced a series of responses from potential contributors. They commented that these human issues would compromise the most advanced technology. It is possible for an owner to `lend' a file, as a collaborative fraudulent gesture, or to unwittingly assist a fraudulent colleague in an `emergency'. The FDA has acknowledged that `such fraudulent activity is possible and that people determined to falsify records may find a means to do so despite whatever technology or preventive measures are in place' [271]. There are also more `mundane' threats to the security of incident reporting systems. Previous research has suggested that people are often very lax in their selection of passwords [880]. The use of recognisable words, of names of friends, of addresses makes the entire system vulnerable
to dictionary attacks. These take the form of repeated requests to access a system where a different word or phrase is supplied from an electronic dictionary in response to each password request. Eventually, this form of attack will succeed unless care has been taken in selecting a password. Further problems arise when passwords are distributed to friends and colleagues. More commonly, however, systems are compromised by simply leaving a machine connected to the network. Users issue the requested password and then leave the machine unattended. I recently saw an example of this on a hospital visit. A screen-saver was used with the warning `unauthorised access on this machine is prohibited'. Anyone who ignored this message could have directly accessed and altered the patient records for that ward while the nurse was away from her station. Many of the usability problems that affect secure systems can be reduced through the use of biometric authentication. These techniques avoid the need for users to remember arbitrary passwords. They include the use of fingerprints, retinal patterns, signatures and voice recognition [441]. This technology has not yet been widely used to secure the transmission of incident data. It is, however, likely that it will be used within future systems. Part of the reason for this is apparent in recent observations made by the Central Cardiac Audit Database project. They argue that such security techniques appear to be `an extremely expensive solution to a non-problem, but public fears about Internet security, mostly unfounded but encouraged by the popular media, would need to be allayed before widespread medical data transmission via the Internet would be acceptable' [184].
14.3.4 Accessibility

`Accessibility' can be thought of as the converse of access control. Just as it is important to ensure that unauthorised people are denied access to an incident report, it is equally important that authorised individuals can obtain necessary information. This implies that any computer-based resource should be evaluated to ensure that the human-computer interface does not embody inappropriate assumptions about the potential users of such a system. This is particularly important because the computer-based dissemination of incident reports can offer particular advantages to certain groups of users, providing that their requirements are considered during the early stages of systems development. People with visual disabilities can use a range of computer-based systems to access incident databases in a manner that cannot easily be supported using paper-based techniques. Unfortunately, the use of icons and complex menu structures can prevent many users from exploiting screen readers and similar devices. It is for this reason that many organisations publish minimum standards in this area, such as Federal Regulations Section 508 on the accessibility of electronic information. In order to satisfy these requirements, it is important that the developers of incident reporting systems provide some means for users to communicate any difficulties that they might have experienced in accessing their data. This can, however, create a recursive problem in which the users of the information resource cannot access the information resource in order to learn of alternative formats or other forms of help:

"The U.S. Department of Health and Human Services is committed to making its web sites accessible to all users. If you use assistive technology (such as a Braille reader, a screen reader, TTY, etc.) and the format of any material on our web sites interfere with your ability to access the information, please use the following points of contact for assistance. To enable us to respond in a manner most helpful to you, please indicate the nature of your accessibility problem, the preferred format in which to receive the material, the web address of the requested material, and your contact information..." [31]

Although Section 508 of the US Accessibility Act focuses on users with special needs, there is also a more general requirement to ensure that people can access incident information. In consequence, observational studies and laboratory-based evaluations may be conducted to ensure that users can operate computer-based information systems. This implies that designers must consider the previous expertise of their users and their ability to exploit particular human-computer interaction techniques. Brevity prevents a more detailed introduction to the design and evaluation of interactive
computer systems in general. Preece et al. provide a survey of techniques in this area [686]. In contrast, the following pages focus more narrowly on techniques that can be used to identify patterns of failure in large-scale collections of incident reports.
14.4 Computer-Based Search and Retrieval

Previous sections have considered the dissemination of information about individual incidents. In contrast, this section focuses more narrowly on the problems that arise when providing access to databases of previous incidents over computer networks. Recent technological innovations, often associated with mass-market applications of the World Wide Web, are creating new opportunities for rapidly searching large numbers of incident reports that can be held in many different countries across the globe. Before looking in more detail at these `leading edge' systems, it is first necessary to understand why organisations are exploiting computer-based dissemination techniques for their incident databases. The sheer scale of many reporting systems motivates the use of electronic dissemination techniques for incident databases. The number of incident reports that are submitted to a system can accumulate rapidly over a relatively short period of time, even in local or highly specialised systems. For instance, the Hyperbaric Incident Monitoring Study was started in 1992 to collect reports of incidents and near misses in hyperbaric medicine. Hyperbaric Oxygen Therapy is the most common form of this treatment. The patient enters a chamber that is filled with compressed air until a required pressure is reached. The patient breathes `pure' oxygen through a mask or a transparent hood. In addition to diving recompression, this technique has been used in the treatment of carbon monoxide poisoning, wound healing and post-radiation problems. This system is currently operated through the APSF, the same organisation that maintains AIMS. The Hyperbaric Incident Monitoring Study was launched internationally in 1996 and the associated forms have been translated into four different languages. By early 2001, there were some 900 reported incidents in the database [38]. This partly reflects the success of this system. It also creates considerable practical problems. The costs associated with maintaining even a relatively simple paper-based indexing system would be prohibitive. It would also be difficult for other organisations to access this data without replicating each paper record or posing a succession of questions to the staff who are responsible for maintaining the paper indexing system. It is feasible, but unlikely, that the Hyperbaric Incident Monitoring Study database could be implemented using paper-based techniques. In contrast, other national reporting databases could not be maintained without electronic support. Firstly, the sheer volume of reports makes it essential that some form of database be used to collate and search that data. Secondly, the large number of individuals and groups who might legitimately want to retrieve incident information increases the motivation to provide access to these databases over computer networks. For instance, the FDA's Adverse Event Reporting System (AERS) for adverse events involving drugs and therapeutic biological products contains more than 2 million reports. Table 14.9 provides a break-down of the number of incidents that are entered into the FDA's databases within a single year. The Center for Biologics Evaluation and Research's Error and Accidents Reporting System records incidents that occur in the manufacture of biological products. An error or accident is a deviation from the `good manufacturing practice' set down by FDA regulations.
The Drug Quality Reporting System receives reports of similar incidents that affect the manufacturing or shipping of prescription and over-the-counter drug products. These incidents can result in problems for the formulation, packaging, or labeling of these products. Post-marketing surveillance for vaccines is handled by the Vaccine Adverse Event Reporting System. Approximately 15 percent of these reports describe a serious event, defined as either fatal, life-threatening, or resulting in hospitalization or permanent disability. The Manufacturer and User Device Experience (MAUDE) database receives between 80,000 and 85,000 reports per year. The 1984 Medical Devices Reporting regulation required manufacturers to report device-related adverse events to the FDA. In 1990, the Safe Medical Devices Act extended this regulatory structure to include user facilities such as hospitals and nursing homes. Serious injuries that are device-related must be reported to their manufacturers. Fatalities must be reported both
to the manufacturer and directly to the FDA. The MAUDE database was established in 1995 to support the Safe Medical Devices Act and now contains more than 300,000 reports. Another 500,000 reports are collected in a pre-1995 database. Finally, Table 14.9 records that the FDA's risk-based summary reporting system receives some 30,000 reports per annum. Products that are approved for this summary reporting process are `well known' and have a `well-documented' adverse event history [276]. This approach involves the periodic submission of adverse event statistics in a tabular form that yields `economies for both the devices industry and FDA'.

Reporting System                                      | No. of reports per annum
Adverse Event Reporting System                        | 230,000
Biologics Error and Accidents Reporting System (CBER) | 13,000
Drug Quality Reporting System                         | 2,500
Vaccine Adverse Event Reporting System                | 12,000
Manufacturer and User Device Experience               | 80,000
Risk-Based Summary Reports                            | 30,000

Table 14.9: Annual Number of Incidents Included in FDA Databases

The volume of data that can be gathered by successful national systems justifies the use of information technology. However, this technology does not provide a panacea for the management of large-scale databases. For instance, the cost implications and the requirement to use specialist hardware and software often convince many public or Federal agencies to involve contractors in running these systems. The FDA's Drug Quality Reporting System database, mentioned above, is run in this manner. FDA staff interact with the system via an on-line interface that is intended to help them pose particular queries or questions; these are then relayed to the database, which is administered by the contract organisation. The management of the Vaccine Adverse Event Reporting System database is even more complex. It is jointly administered by the FDA's Center for Biologics' Division of Biostatistics and Epidemiology and the Centers for Disease Control and Prevention, Vaccine Safety Activity, National Immunization Program. Representatives of both agencies oversee data processing and database management that is again performed by a contractor. Such complex relationships occur in other industries. For instance, the ASRS is largely funded by the FAA. NASA manages the system, which is in turn operated under contract by the Battelle Memorial Institute. In most cases, these relationships have proven to be highly successful. There are, however, considerable problems in ensuring that technological requirements are accurately communicated between all of the stake-holders in such complex, interactive applications. In consequence, previous studies have revealed considerable frustration from users who feel that many incident databases no longer support all of the retrieval tasks that they must perform [472, 413].

Further problems affect the management of large-scale incident databases. For example, it is often convenient for national regulators to establish a number of different schemes that focus upon particular incidents. For instance, Table 14.9 summarises the different schemes that are operated by the FDA. This can create problems because the same incident can fall within the scope of more than one database. The problem can be exacerbated if different agencies also run apparently complementary reporting systems. For instance, the FDA has to monitor medication error reports that are forwarded by clinical staff to the United States Pharmacopeia and to the Institute for Safe Medication Practices. Some of these incidents are then incorporated into the FDA's own databases. Similarly, the FDA must also review the medical device reports that are submitted to the MEDWATCH programme in case they have any relevance for possible medication errors. In addition to all of the systems mentioned in Table 14.9, the FDA also maintains a central database for all reports involving a medication error or potential medication error. This contains some 7,000 reports. In total contrast to this amalgamated database, the FDA's Vaccine Adverse Event Reporting database is entirely `independent of other FDA spontaneous reporting systems' [276]. This diversity creates a flexible approach that is tailored to the various industries which are served by the FDA. It also creates considerable managerial and technical problems for those individuals who must support the
exchange of data both within and between the various systems. For instance, some of these systems provide public access to incident data: a web-based search engine can be used to find the detailed records held in the MAUDE system. Other applications are strictly confidential and no public access is provided. This can create problems if incident data is transferred from one system to the other. Confidential reports can be made public if they are transferred to the open system. Alternatively, if potentially relevant reports are not disclosed then valuable device-related safety information will be withheld and the MAUDE data will be incomplete. This would create doubts about the value of the system as a means of tracing more general patterns from information about previous incidents.

Further legal and ethical issues complicate the use of on-line systems to help search through incident databases. For example, a number of countries now have powerful disclosure laws that enable people to access databases in the aftermath of near-miss incidents and adverse occurrences. This provides an opportunity to search through a database and identify previous failures with similar causes. In subsequent litigation, it might then be argued that responsible organisations had not shown due care because they had failed to learn from those previous incidents. Several safety managers described their concerns over this scenario during the preparation of this book. One even commented that their company was considering deleting all records about previous incidents.

These concerns are partly motivated by the difficulties that many organisations have in storing and retrieving information about previous incidents. Incidents often recur not because individuals and organisations are unwilling to learn from previous failures but because they lack the necessary technological support to identify patterns amongst thousands of previous incident reports. For instance, many users of incident databases cannot accurately interpret the information that is provided by database systems. It is also difficult for safety managers to formulate the commands that are necessary to retrieve information about particular types of incident. The following sections describe these problems in greater detail and propose a number of technological solutions.
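To make this last difficulty concrete, consider the kind of retrieval command that a safety manager might need to construct. The following sketch is illustrative only: it assumes a simplified, hypothetical incident table rather than any of the schemas discussed in this chapter.

    -- Hypothetical schema: retrieve all device malfunctions reported in 2001.
    SELECT report_id, report_date, event_type
      FROM incident_report
     WHERE event_type = 'Malfunction'
       AND report_date BETWEEN '2001-01-01' AND '2001-12-31';

Even a query as simple as this presupposes that the user knows the table name, the admissible values of each field and the comparison operators that the query language provides.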
14.4.1 Relational Data Bases

There are two central tasks that users wish to perform with large-scale incident databases, and they are almost contradictory in terms of the software requirements that they impose. On the one hand, there is a managerial and regulatory need to produce statistics that provide an overview of how certain types of failure are reduced in response to particular actions. On the other hand, there is a more general requirement to identify common features amongst incident reports that should be addressed by those actions in the first place. The extraction of statistical information typically relies upon highly-typed data so that each incident can be classified as unambiguously belonging to particular categories. In contrast, the more analytical uses of incident reporting systems involve people exploring alternative hypotheses about the underlying causes of many failures. This, in turn, depends upon less directed forms of search. It is difficult to envisage how investigatory bodies could construct a classification scheme to reflect all of the possible causal and mitigating factors that might arise during the lifetime of a reporting system. Unfortunately, most schemes focus on the development of incident taxonomies to support statistical analysis. Relatively few support the more open analytical activities described above. This is reflected in the way in which most reporting systems currently rely upon relational databases.

As the name suggests, relational database techniques build on the concept of a relation. This can be thought of as a table that holds rows of similar values. For example, one table might be used to record values associated with the contributor of an incident report. The cells in the table might be used to store their name, their contact information, the date when they submitted a report and so on. Each row of the table would represent a different contributor. Of course, each contributor might make several incident reports and so another table would be used to hold this relation. The columns in this table might include the name or identifier of the person making the report, the date of the report, the plausible worst-case estimate of the severity of the incident and so on. Each row in this incident table would store information about a different report. Relational databases offer a number of important benefits for the engineering of incident databases. One of the most important of these is the relative simplicity of the underlying concept of a relation.
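The two relations described above can be sketched directly in SQL. The table and column names here are hypothetical; they simply illustrate the idea of one row per contributor and one row per report.

    -- One row for each person who contributes reports to the system.
    CREATE TABLE contributor (
        contributor_id  INTEGER PRIMARY KEY,
        name            TEXT NOT NULL,
        contact_details TEXT,
        date_submitted  DATE
    );

    -- One row for each incident report; several rows may share a contributor.
    CREATE TABLE incident_report (
        report_id           INTEGER PRIMARY KEY,
        contributor_id      INTEGER REFERENCES contributor(contributor_id),
        report_date         DATE,
        worst_case_severity TEXT
    );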
Figure 14.3: Overview of the MAUDE Relations
Unfortunately, the operational demands of many reporting systems have created the need for complex relational schemas. A schema can be thought of as a high-level model of the relationships between the different information fields that are held in the system. Schemas are often structured in terms of objects that would be recognisable to the users of the system. Hence, as we shall see, the MAUDE schema groups information about devices, patients and manufacturers. Figure 14.3 provides a slightly simplified overview of the schema that is used to structure the FDA's MAUDE data. This overview has been reverse-engineered from the technical information that was released to enable interested parties to make use of data released under the US Freedom of Information provisions [268]. It is also based on a study of the way in which information is retrieved by the FDA's on-line databases.

As can be seen from Figure 14.3, the data is structured around four different relations: a master event relation; a device relation; a patient relation and a text relation. For convenience, these are stored, and can be retrieved over the Internet, in four different files. The Master Event Data holds information about the person or group that reports an event. This is the most complex of the relations: it distinguishes between a total of seventy-two different items of information. It also illustrates previous comments about the way in which relatively simple ideas can quickly be compromised by the operational demands of an incident reporting system. The high-level structure of this file is illustrated by the top component of Figure 14.3. It is based around the idea of a nested relational schema because any entry in the Master Event Data relation is, itself, composed of more complex relations. These relations hold summary data about the nature of the event, about who has reported the event and about the devices that were involved. Section A of the meta-relation holds identification information relevant to the report. This is compulsory for all reports because, as we shall see, it acts as an index that can be used to cross-reference the information that is held in other relations within the MAUDE system. Section B holds information about the particular event that is described in the report and, again, should be included for all reports. Section E is only used if the report was submitted by a healthcare professional. Similarly, Section F only applies to reports filed by device distributors. Section G only applies if the report was completed by a device manufacturer. Finally, Section H is based around a relation that is used to structure information about the devices involved in an incident. This should be included for all incidents.

As mentioned, the precise information that is held about any particular report depends on the nature of the person or group who submitted the form. For instance, if a form was submitted by a device distributor then the master record will hold information both about the distributor and about the manufacturer that provided them with the device. In this instance, the Master Event relation would contain Sections A, B, F and H. If the report is filed by a healthcare professional then they might not be in a position to enter this information into the reporting form and hence it will be omitted from the database. The relation would, therefore, be composed from Sections A, B, E and H. Figure 14.3 is simplified by only showing the relations that would record a report from a manufacturer.
The FDA provide summary information about the other formats [268].

It is reasonable to ask why anyone should devote this level of attention to the manner in which data is stored within an incident database. It might be argued that such details relate solely to the implementation of particular systems and are of little interest to a more general audience. Such arguments neglect the consequences that these techniques can have for the end-users of incident databases. For instance, many local incident reporting systems adopt a more `naive' approach and simply `flatten out' the nested relations described in the previous paragraph. This would result in every entry in the system being provided with a cell for a healthcare professional's contact address even when the report was submitted by a distributor or manufacturer. The significance of this should be apparent if we recall that MAUDE receives between 80,000 and 85,000 submissions per year. Relatively minor changes to the relational schema can have a huge impact upon the time that is taken to search through or download an incident database.
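The cost of such flattening can be seen by contrasting the two designs in SQL. This is a minimal sketch with hypothetical names, not the actual FDA schema: the flattened table carries a column for every reporter type in every row, whereas the nested design only stores a section record when it applies.

    -- Flattened design: most cells are NULL for any given report.
    CREATE TABLE master_event_flat (
        mdr_report_key       INTEGER PRIMARY KEY,
        professional_contact TEXT,  -- only meaningful for Section E reports
        distributor_contact  TEXT,  -- only meaningful for Section F reports
        manufacturer_contact TEXT   -- only meaningful for Section G reports
    );

    -- Nested design: a narrow master relation plus one table per section,
    -- cross-referenced by the MDR report key.
    CREATE TABLE master_event (
        mdr_report_key INTEGER PRIMARY KEY
    );

    CREATE TABLE section_g_manufacturer (
        mdr_report_key       INTEGER REFERENCES master_event(mdr_report_key),
        manufacturer_contact TEXT
    );

At 80,000 to 85,000 submissions per year, the empty cells of the flattened design translate directly into wasted storage and slower searches.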
As mentioned, the Master Event component of the MAUDE database is composed from nested relations. One element of this more complex structure holds the information that is necessary to identify each incident unambiguously. Figure 14.3 illustrates this by the thick line that links the MDR report key across all of the other relations in the MAUDE system. The importance of this `key' information can be illustrated by the following example.

MDR Report Key | MDR Event Key | Report Number      | Source Code                                           | No. of Devices | No. of Patients | Date Received
(Generated)    | (Generated)   | (Generated)        | (Voluntary/ User facility/ Distributor/ Manufacturer) | (0..Max)       | (0..Max)        | (Date)
2339271        | 319405        | 2919016-2001-00002 | Manufacturer                                          | 1              | 1               | 06/22/2001
2339103        | 319248        | 2124823-2001-00010 | Manufacturer                                          | 1              | 1               | 05/20/2001

Table 14.10: The MAUDE Master-Event Relation (Section A)
Suppose that we wanted to find out how many patients had been injured by all of the devices produced by a particular manufacturer. We could begin by using Section G of the Master Event relation to list all of the MDR Report Keys associated with that manufacturer's name. Having retrieved the list of MDR Report Keys, we could then use Section A of the Master Event file to find out the number of patients reported to be affected in initial reports to the system. It is important to note, however, that the MDR Report Key is insufficient to identify unambiguously all information within the system. For instance, an incident might involve more than one device. In this case, Figure 14.3 shows how each entry in the MAUDE Device Data relation must be identified both by the MDR Report Key and by the Device Event Key.

Again, it is important to emphasise that these implementation details have a profound impact upon the users of an incident reporting database. If they are not taken into account during the early stages of development then it can be difficult to extract critical information about previous incidents. In the example outlined above, it might be difficult to extract information about the individual devices that are involved in a single incident. If this data is not clearly distinguished then it can, in turn, become difficult or impossible to trace the previous performance of the manufacturers of those devices. Too often, investigators and regulators have sub-contracted the implementation of incident reporting databases with the assumption that such relational structures are both obvious and easy to implement. Equally, sub-contractors have often failed to communicate the impact of these technical decisions on those who must operate incident databases. It is, therefore, hardly surprising that so many people express disappointment and frustration with the systems that they are then expected to use.

Table 14.10 provides more details about the report identification information that is held in the Master Event relation. As can be seen, each new entry is automatically assigned three reference keys: the MDR Report Key, the Event Key and a Report Number. The precise meaning and purpose of each of these values is difficult to infer from the FDA documentation that is provided with the MAUDE database. However, it is apparent that the MDR Report Key and the Event Key are used to index into other sources of information in the manner described above. The source code information helps to distinguish between the various groups and individuals who might submit a report to the system. Any contribution is either voluntary or is provided to meet the regulatory requirements on end-user facilities, device distributors or manufacturers. The second row of the table summarises the type of information that can be entered in each column. The third and fourth rows of Table 14.10 present the values that were entered into the MAUDE database to describe two recent software-related failures.

Table 14.11 characterises the nested relation inside the Master Event record that holds information about device manufacturers. Recall that this is only used if a report is submitted by a manufacturer. As can be seen, the name and address of the contributor is stored with the record. Table 14.11 is a slight simplification because the address component of this relation is itself a nested relation containing fields to store the manufacturer's street name, their city and state, their telephone number and so on. It might seem an obvious and trivial requirement to provide such a structure to record the contact information of a contributor.
However, the MAUDE structure shows considerable sophistication in the manner in which this data is handled.
MDR Report Key | Manufacturer's Name | Manufacturer's Address | Source Type                                                                                                               | Date Manufacturer Received
(Generated)    | (Text)              | (Address)              | (Other/ Foreign/ Study/ Literature/ Consumer/ Professional/ User facility/ Company rep./ Distributor/ Unknown/ Invalid data) | (Date)
2339271        | A. Maker            | Somewhere              | Professional, Other                                                                                                        | 05/30/2001
2339103        | Another Maker       | Somewhere Else         | Professional, User Facility                                                                                                | 05/18/2001

Table 14.11: The MAUDE Master-Manufacturer Relation (Section G)
For instance, two fields are associated with street address information. Many databases simplify this into a single field and subsequently have problems encoding information about apartments and offices that have `unconventional' addresses. These issues are not simply important for the technical operation of the incident database; they can also have significant consequences for the running of the system. It is clearly not desirable to have investigators search for a contributor to a confidential or anonymous system in order to conduct follow-up interviews.

Table 14.11 also includes an enumerated type, or list of values, that can be used to describe the source that first alerted the device manufacturer to the incident. This information is critical and yet is often omitted from incident databases. Chapters 3 and 6 have argued that it is important not simply to identify the causes of a near miss or adverse occurrence; it is also important to identify those barriers that prevented such incidents from developing into more critical failures. Any mechanism that alerts a contributor to a potential failure is an important component in the defences that protect the safety of future applications. Often this information is embedded within free-text descriptions of adverse events and it can prove extremely difficult to collate data about such defences. The MAUDE system avoids this problem firstly by prompting the contributor to provide this information by ticking an element in a list on the incident form and then by encoding their response within the `source type' field of the relation illustrated in Table 14.11. The elements of this type are instructive in their diversity: incidents may be detected by a healthcare consumer or a healthcare professional, and they can also be identified through literature reviews or other forms of field study. The final rows of Table 14.11 illustrate sample values for this relation.

Table 14.12 presents the final nested component of the Master Event relation. This is completed for all submissions and provides initial details about the devices that were involved in a near miss or adverse occurrence. Again, an examination of the components of this nested relation can be used to illustrate some of the points that have been made in the previous chapters of this book. For example, another enumerated type is used to categorise the immediate remedial actions that a manufacturer has taken to address any incident that has been brought to their attention. This is significant because many incident and accident analysis techniques focus directly on causal events rather than examining the critical actions that are taken in the aftermath of an adverse occurrence. In this instance, the manufacturers' actions are important because they may pre-empt further regulatory actions by the FDA, for instance if a device recall has already been issued. The care that has been taken in devising the FDA's relational schema is also illustrated by the use code in Table 14.12.
MDR Report Key | Made When? | Single Use? | Remedial Action                                                                                                          | Use Code                      | Correction No.                        | Event Type
(Generated)    | (Date)     | (Yes/ No)   | (Recall/ Repair/ Replace/ Relabelling/ Other/ Notification/ Inspection/ Monitor Patient/ Modification/ Adjustment/ Invalid data) | (Initial use/ Reuse/ Unknown) | (Previous FDA Notification Reference) | (Death/ Injury/ Malfunction/ Other)
2339271        | -          | No          | Notification                                                                                                             | Initial use                   | No                                    | Other
2339103        | 05/18/2001 | No          | -                                                                                                                        | Initial use                   | No                                    | Other

Table 14.12: The MAUDE Master-Device Relation (Section H)

Figure 3.3 in Chapter 3 illustrated the higher probability of failure that is associated with the initial period of operation for hardware systems. This occurs partly because of component variations but also because of the problems associated with setting up devices and of learning to operate them under particular working conditions. The use code in the Master Event relation, therefore, distinguishes between initial use and reuse of any particular device.

This field also illustrates a generic problem with incident reporting databases. Ideally, we would like to provide a meaningful value for every field in every relation. This would enable us to satisfy requests of the form `how many incident reports related to the initial use of a device?' or `how many reports were immediately resolved by the manufacturer issuing a recall?'. Unfortunately, lack of data can prevent investigators from entering all of the requested data into an incident database. For example, if a healthcare professional informs a manufacturer of an incident they may neglect to pass on the information that is necessary to complete the use code. In such a circumstance, the manufacturer would tick the `unknown' category and return the form to the FDA to be entered into the MAUDE database. This creates problems because, if we attempted to answer the question `how many incident reports related to the initial use of a device?', it would be unclear how to treat these `unknown' values. If they were excluded then this might result in a significant underestimate of the initial device set-up problems. If they were included then the converse problem would occur. Similar concerns can be raised about the `Remedial Action' field in Table 14.12. In this case, the FDA analysts can enter an `invalid data' category rather than `unknown'. This is worrying because manufacturers might, in fact, be exploiting a range of potentially valid remedial actions that are lost to the database simply because they do not fit easily within the categories of remedial action that are encoded within this component of the Master Event relation. These concerns lead to two key heuristics for the application of relational databases to incident reporting:
- Unknown data. If an `unknown' value is entered for a field then a caveat must be associated with any statistics that are derived from the data in that field. Ideally, this warning should provide information about the proportion of unknown values compared to those that are known for that field.

- Invalid data. If `invalid data' is entered for a field then analysts should also record a reason why this option was selected. System managers should then conduct periodic reviews to ensure that important information is not being omitted through poor form design or an incomplete relational schema.
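The first heuristic can be supported by a simple aggregate query. The following sketch assumes a hypothetical table, master_device, holding the use code of Table 14.12; it reports the proportion of `unknown' values alongside the statistic itself.

    -- Report initial-use incidents together with the caveat data.
    SELECT SUM(CASE WHEN use_code = 'Initial use' THEN 1 ELSE 0 END) AS initial_use,
           SUM(CASE WHEN use_code = 'Unknown'     THEN 1 ELSE 0 END) AS unknown_codes,
           COUNT(*)                                                  AS total_reports
      FROM master_device;

Publishing the unknown count next to the initial-use count lets readers judge for themselves how far the missing data might distort the statistic.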
It is important to emphasise that entirely excluding either of these categories would place severe constraints on data entry for incident reporting systems. In contrast, these heuristics are intended to ensure that relational schemas continue to offer the flexibility that is necessary when encoding
incomplete accounts of adverse occurrences and near misses. They are also intended to ensure that this flexibility does not compromise the integrity of the information that is derived from incident databases [224].

MDR Report Key | Device Event Key | Device Seq. No. | Age      | Device Available?              | Brand Name      | Generic Name       | Baseline id | ...
(Generated)    | (Generated)      | (1..Max)        | (0..Max) | (Yes/ No/ Returned/ No answer) | (Text)          | (Text)             | (Generated) | ...
2339271        | 328578           | 1               | 3        | No                             | The HNID        | Item Panel Product | K833027     | ...
2339103        | 328407           | 1               | 2        | No                             | Central Station | ...                | K954629     | ...

Table 14.13: The MAUDE Device Records

Table 14.13 illustrates how MAUDE holds further device information in a separate relation. This separation can be explained by the observation that the Master Event relation holds information that is derived from the incident report. The device relation, in contrast, can hold information that need not be available from the initial report. For instance, Table 14.13 includes a field that is intended to hold information about any baseline report that is associated with a device. Chapter 6 has described how baseline reports must be submitted in response to the first reportable incident involving a particular device. A baseline report provides basic device identification information including: brand name, device family designation, model number, catalogue number and any other device identification number. This information helps ensure clear, unambiguous device identification. From this it follows that, if the incident described in the Master Event relation is not the first occurrence to affect a device, then the baseline report summarised in the device relation will be based on previous information.

It is again important to emphasise that the relation shown in Table 14.13 simplifies the data that is actually held by the MAUDE system. The device relation is composed of 43 individual fields. These include information not only about the particular device that was involved in the incident but also about the product range or family that the device belongs to. Such details again emphasise the importance of considering each element of a relational schema in order to ensure that it captures all of the information that may subsequently help to identify patterns of failure in similar devices.

Table 14.13 illustrates a number of further, generic issues that affect relational schemas in many different incident databases. For instance, the device sequence number helps to distinguish between the different items of equipment that can be involved in any single incident. In order to refer to any particular device record, therefore, it may be necessary to supply both the MDR report key and the sequence number. Alternatively, a device event key can also be supplied to identify a device record unambiguously. Table 14.13 also captures some forensic information, including whether or not a particular device is available for examination. As with the use code in Table 14.12, this field permits the entry of an unknown value; in this case it is termed `no answer'. This illustrates a potential problem for the coders who enter the data into the system. They must be trained to distinguish between, or conversely to ignore, the subtle differences in terminology that are used to represent null values in these two different contexts.

A further relation holds information about the patients that were affected by a near miss or adverse occurrence. This is completed even if there were no long-term consequences for the individuals who were involved in an incident. Just as there can be several devices involved in an incident, there can also be more than one patient. In consequence, any individual patient record must be identified both by the MDR report key and by the patient sequence number. The sequence numbers that are associated with any incident can be inferred from Section A of the Master Event relation because this records the total number of patients that were affected by an incident. The integrity of the database therefore depends upon a number of assumptions:
1. The Master Event record must accurately record the total number of patients that were affected in an incident.

2. A different Patient Data record must be stored for each patient involved in an incident.

3. Each Patient Data record must include a unique Patient Sequence Number, and these must follow consecutively from 1 to the total number of patients stored in the Master Event record with the same MDR report key.

MDR Report Key | Patient Sequence No. | Date Received | Sequence Treatment        | Patient Outcome
(Generated)    | (0..Max)             | (Date)        | (Sequence-Treatment pair) | (Life threatening/ Hospitalization/ Disability/ Congenital Abnormality/ Required Intervention/ Other/ Unknown/ No information/ Not applicable/ Death/ Invalid data)
2339271        | 1                    | 06/22/2001    | -                         | Other
2339103        | 1                    | 05/20/2001    | -                         | Other

Table 14.14: The MAUDE Patient Records

If any of these integrity constraints are violated then there is no guarantee that an implementation of the database will be able to return the patient records of all individuals who may have been affected by a near miss or adverse occurrence. For instance, if a patient record were allocated a sequence number greater than the maximum number of patients noted in the Master Event record then doubts would be raised about the reliability of that data. It is also likely that the algorithms for assembling information about an incident might miss the additional patient record if they assumed that the Master Event record was correct.

There are a number of similar constraints that must be observed by those who maintain the MAUDE system. For example, the device sequence number might be related to the total number of devices in the same manner that the patient sequence number is related to the total number of patients. There are further examples. For instance, the Master Event relation, Section H, contains information about the nature of the adverse event. Analysts must enter whether the incident resulted in a death, an injury, a malfunction or some other outcome. Similarly, the individual patient records include a `patient outcome' field that distinguishes between the following categories: life threatening; hospitalization; disability; congenital abnormality; required intervention; other; unknown; no information; not applicable; death and invalid data. Clearly, the integrity of the database would be compromised if a patient record indicated that a fatality was associated with a particular MDR report key while the Master Event relation showed that the outcome was a malfunction. Fortunately, many database management systems provide explicit support for automating these consistency constraints. They will alert users to potential problems if they arise during data entry. It is, however, less easy for these systems to help users distinguish between the overlapping categories that the FDA have introduced for some fields. These include the `other', `unknown', `no information', `not applicable' and `invalid data' options, mentioned above. Later sections will describe the problems that coders have experienced in choosing between these different values when they complete report forms and enter them into incident databases.
Table 14.14 also includes information about the treatment that the patient received following an incident. Any individual patient may receive a number of different treatments. In consequence, the MAUDE relation includes a sequence-treatment pair. This simply associates a number with each of the treatments used on that particular individual. It would be possible to construct a further relation that holds more detailed information about each treatment. This could be indexed by the MDR report key, the patient sequence number and the sequence number of the treatment. The data that is released by the FDA from the MAUDE system does not do this. Instead, it adopts a compromise approach that resembles a compound attribute [224]. The more general point, however, is that the database records both the remedial actions that relate to individual devices, such as product recalls, and the treatments that are taken to counter any adverse consequences of an incident for the patients affected. In other words, MAUDE illustrates the broad approach that must be taken when considering what information to capture about the response to any incident.

Table 14.15 provides an overview of the final relation that is used to structure incident data in the FDA's MAUDE system. This relation is central to the success of the system and it represents a solution to a generic problem that affects all incident databases. The previous relations have provided a means of grouping or structuring related information. The patient relation holds information about an individual patient, the device record holds information about an item of equipment that is implicated in an incident and so on. Each element of information that might be placed within a field in one of these relations has an associated type. Most of these types are constrained. For instance, the event type in Section H of the Master Event record can only take the values: death; injury; malfunction or other. This helps to reduce coding problems. Analysts must only differentiate between a few values rather than the subtle differences that might exist between a larger range of potential values. Constrained types also help to provide numerical results for statistical analysis. It is relatively easy to sum the total number of incidents that were classified as resulting in a death. This would be far harder if analysts were able to enter any free-text value that they liked in the event type field. Analysts might use the terms `fatal' or `fatality', `dead' or `death' and so on. An automated system would then have to predict all of these potential values and recognise that they were equivalent when calculating any summary statistics. These problems are avoided by having a small range of admissible values associated with the various fields in a relational schema.

Similarly, the fields in the schema also define a minimum data-set that should be obtained about each incident. The previous paragraphs have described how this minimum data-set can depend upon the nature of the incident report. The information that is available for voluntary reports by a healthcare professional might be very different from that which is available following a mandatory report from a manufacturer. Similarly, we have also described how lack of evidence in the aftermath of an incident can prevent investigators from satisfying the minimum requirement implied by a relational schema. The key point is, however, that by providing `invalid data' or `unavailable' options in the database, investigators can be sure that analysts were prompted for this information when they entered incident data into the system.
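In SQL terms, such a constrained field can be declared directly. The following sketch uses hypothetical names; the point is that a constrained event type makes the subsequent statistics trivial to compute.

    -- Only the four admissible event types can be entered.
    CREATE TABLE master_event_section_h (
        mdr_report_key INTEGER PRIMARY KEY,
        event_type     TEXT NOT NULL
            CHECK (event_type IN ('Death', 'Injury', 'Malfunction', 'Other'))
    );

    -- Summing fatal incidents is then a single aggregate query.
    SELECT COUNT(*) AS fatal_incidents
      FROM master_event_section_h
     WHERE event_type = 'Death';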
Any omission should, in principle, be due to the constraints that characterise the aftermath of the incident rather than neglect on the part of the analyst.

Unfortunately, the strengths that the relational model derives from the explicit grouping of related fields of typed information can also be a significant weakness for many incident reporting systems. As we have seen, many near misses and adverse occurrences cannot easily be characterised using the relatively small number of fields that have been introduced in the previous pages. For example, the MAUDE relational schema offers almost no opportunity for analysts to enter contextual information about workplace factors, such as time pressure or staffing issues, that have been stressed in previous chapters. If a database only recorded the typed information mentioned above then subsequent investigators would derive a very biased view of the causes of previous incidents. In consequence, most relational systems also provide for the storage and retrieval of large textual accounts. For instance, Table 14.15 shows how there are two different types of text that can be associated with each MDR report key. Event descriptions summarise any additional information about the immediate course of a near miss or adverse occurrence that cannot be provided in the previous fields. The manufacturer narrative, in contrast, provides an opportunity for the producers of a device to respond to any incident reports. As can be seen in Table 14.15, this response can include information about subsequent studies into the cause of an incident. Such studies must describe the methods used and the results that were obtained. It might, therefore, be argued that a nested relation could be used to distinguish these approaches.
MDR Report Key | Text Key    | Text Type                                    | Patient Seq. No. | Report Date
(Generated)    | (Generated) | (Event description/ Manufacturer narrative)  | (0..Max)         | (Date)

2339271 | 1173556 | Event description | 1 | 06/22/2001
Text: User reported a clinical isolate was identified on Microscan HNID panel read by the walk-away instrument system as Neisseria Gonorrhoeae with 99% probability. The specimen was from a blood source from PT. Due to the unusual source for this organism the specimen was sent to the State Health Dept Reference Lab for confirmation. The State Reference Lab identified the organism as Neisseria Meningitis.

2339271 | 1173558 | Manufacturer narrative | 1 | 06/22/2001
Text: H.6 EVAL METHOD: obtained clinical isolate from customer and tested on products involved i.e. HNID panels and Microscan Walk-away Instrument system. Reviewed complaint history, performance evals, labeling and literature regarding reported issue. H.6 RESULTS: Biotype reported by user was duplicated by Microscan Technical Services lab. Atypical results suggest and footnotes indicate N. Gonorrhoeae identification required additional tests to confirm. Results from add'l tests should lead to a presumptive identification of N. Meningitis. Results of complaint history review revealed a very low complaint volume for this issue...

Table 14.15: The MAUDE Text Records
This would enable analysts to pose queries about which manufacturers had used a particular evaluation method in response to an incident. It is not, however, possible using MAUDE because a general text field is used rather than the more strongly typed approach embodied in the previous relations.

The examples in Table 14.15 illustrate the way in which several textual accounts can be associated with a single incident. As mentioned, there is both an event description and a manufacturer narrative for event report 2339271. It is also important to realise that more than one individual may be affected by an incident. In such circumstances, an event description can be associated with each person who was, or might have been, injured. The table, therefore, includes the patient sequence number associated with each text report. From this it follows that, in order to identify any particular report uniquely, analysts must supply the MDR report key, the text type and the patient sequence number. In some cases, manufacturers may make more than one response to an incident. If such multiple responses were admitted then analysts would also have to specify the date of the message that they were interested in retrieving. Such requirements appear to introduce unnecessary complexity into an incident database. It is important to remember, however, that there could be profound implications if it appeared that a subsequent response to an incident had in fact been made in the immediate aftermath of an adverse report. Unless the database supports such version control, investigators have no means of knowing when a narrative was introduced into the system.

This section has provided a relatively detailed analysis of the relational model that is used by the FDA to structure the data contained in their Manufacturer and User Facility Device Experience database. The level of detail in this analysis is justified by the observation that many reporting databases have failed to provide their expected benefits precisely because those who commissioned these systems failed to pay sufficient attention to these details. Conversely, the sub-contractors who are typically enlisted to implement these systems often fail to explain the consequences of a particular relational schema both for the queries that can be posed of the system and for the performance that can be obtained as the size of the system grows. Our analysis has also helped to identify a range of benefits that can be derived through an appropriate use of the relational model that is embodied in most incident databases:
- Analytical help in developing the relational schema. It can be argued that the process of developing the relational schemas that underlie many databases can help to identify key information requirements. This process is supported by a range of well-documented methods, including entity-relationship modelling and the analysis of normal forms [224]. In particular, these approaches can help to expose the integrity constraints that must be satisfied by any implementation. Although these techniques have their limitations and none are specifically intended to support the development of incident databases, they have the strong advantage that they are `industry standard' and hence widely understood. This introduces an important paradox because the reverse engineering of many incident databases has revealed important structural weaknesses, which suggest that many of these systems have been built without the benefit of these relatively simple engineering techniques [414].

- Analytical help in guiding incident classification. The development of a relational schema is intended to enable investigators, regulators and safety managers to classify incident reports so that they can be analysed and retrieved at a later date. The process of constructing a relational schema, therefore, forces people to consider the forms of analysis that any system must support. This leads to an important decision. Either the person submitting a form must indicate appropriate values from an incident classification, or the analysts must codify a less structured account into the fields that are included in a relational schema, or a hybrid model can be adopted where the contributor performs a `first pass' classification that is then refined by the investigator. No matter which approach is adopted, the key point is that the development of the incident database must have an impact on the manner in which data is both elicited and codified. It is, therefore, extremely difficult simply to bolt an existing database onto an incident reporting system, where either contributors or analysts will have to adjust their behaviour to support the values that are built into a relational schema. In such circumstances, rather than guiding analysts and contributors towards an appropriate classification, they can find that a database forces them to `squeeze' or `massage' an incident into inappropriate data structures.
- Efficiency. One of the key technical benefits behind the relational approach is that it helps to avoid the duplication of redundant information. Ideally, we might store information about a device manufacturer once. Similarly, an optimised system would only ever store a single record about any particular device; this would record the complete service and version history of that item. A link could then be made from an incident report to a device record and from there to the associated manufacturer, as the sketch after this list illustrates. An alternative model would be to duplicate manufacturer information each time a new incident record was created. This is not only wasteful in terms of the storage that is required; it can also significantly increase the amount of time required to collate incident information. The technical reasons for this relate to the search latencies that are associated with primary and secondary storage. Relational techniques can use indexing so that, once a common item of information is stored in main memory, those details need not be repeatedly fetched from slower secondary media [224].
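The following sketch, again with hypothetical names, illustrates the normalised design described in the final benefit above: manufacturer details are stored once and every device record simply refers to them by key.

    -- Each manufacturer is stored exactly once.
    CREATE TABLE manufacturer (
        manufacturer_id INTEGER PRIMARY KEY,
        name            TEXT NOT NULL,
        address         TEXT
    );

    -- Device records link to the manufacturer rather than duplicating it.
    CREATE TABLE device (
        device_id       INTEGER PRIMARY KEY,
        manufacturer_id INTEGER REFERENCES manufacturer(manufacturer_id),
        brand_name      TEXT
    );

    -- An index keeps the link inexpensive to traverse.
    CREATE INDEX device_manufacturer_idx ON device(manufacturer_id);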
This summary is partial; it is also important to stress that our analysis has identified a number of problems with the use of relational databases for incident reporting. For instance, many of these applications rely upon strong typing to distinguish clearly between the admissible values that can be entered into each field. This creates problems because, in the early stages of an investigation, it is often impossible to be certain about which values might hold. A good example is the difficulty of assessing the consequences of an incident on the basis of a clinical prognosis. This uncertainty results in a proliferation of `unknown' values that make it very difficult to interpret the accuracy of statistics derived from incident databases. Similarly, it can be very difficult for analysts to distinguish accurately and consistently between the numerous values that might be entered into particular fields within a relation. This can result in similar incidents being classified in a number of different ways within the same relational schema. It can also lead to `not applicable' values being used as a default. The previous discussion has also identified the potential vulnerabilities that can arise from the relationships that often exist between the components of a schema. In particular, problems can arise from the way in which MAUDE links the maximum number of patients and devices in the Master Event record to provide a range for patient and device sequence numbers. Automated support must be provided to ensure that consistency requirements between these linked values are maintained throughout the lifetime of the database. Again, this summary is only partial. The following sections expand on the problems that can affect the use of the relational model for incident reporting databases. Subsequent sections then review further computational techniques that can be used either to replace or to augment this approach.
Problems of Query Formation

Previous sections have described how the relational model can be used to reduce the storage requirements and increase the speed of queries that are performed on incident databases. It provides further advantages. For instance, search requests can be formulated using a relational algebra. The operators within these languages have a close relationship to the operators of set theory, such as union, intersection and set difference. This offers a number of benefits. Firstly, the components of the relational algebra should have a clear semantics or meaning. Users can apply the basic ideas of set theory to gain some understanding of the query languages that are supported by most relational databases. There are further benefits. As most implementations exploit set-theoretic ideas, it is possible to apply knowledge gained from one relational database system to help understand another. The mathematical underpinning of the approach supports skill transfer and a certain degree of vendor independence. Unfortunately, as we shall see, relatively few investigators or safety managers have acquired the requisite understanding of set theory or of relational algebra to exploit these potential benefits. The following sections provide a brief overview of the relational operators applied to an incident database. The intention is both to illustrate the potential application of this approach and to illustrate some of the complexity that can arise from the relational algebra.
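The correspondence with set theory is visible in SQL itself, which provides UNION, INTERSECT and EXCEPT operators. For illustration, assuming the device and patient relations of the previous sections were held as tables named master_device and patient (hypothetical names), the reports that appear in both could be retrieved as follows.

    -- Set intersection: report keys present in both relations.
    SELECT mdr_report_key FROM master_device
    INTERSECT
    SELECT mdr_report_key FROM patient;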
Before discussing the set operators mentioned above, it is necessary to introduce two additional elements of the relational algebra: SELECT and PROJECT. The SELECT operator is usually represented in the relational algebra by $\sigma$ and is applied in the following manner:

    $\sigma_{\langle selection\ condition \rangle}(\mathit{Relation})$          (14.1)

The selection condition is a Boolean expression that is usually formed from attribute names, operators and constants. The operators include $=$, $>$, $<$ and so on. For example, the following expression selects those entries in Table 14.12 whose use code indicates the initial use of a device:

    $\sigma_{\mathit{Use\ Code} = \mathit{Initial\ use}}(\mbox{Table 14.12})$          (14.19)
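The same selection can be written in SQL. The sketch below assumes that Table 14.12 is held as a table named master_device; the name is hypothetical.

    -- SQL equivalent of the selection in equation (14.19).
    SELECT *
      FROM master_device
     WHERE use_code = 'Initial use';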
SQL offers further benefits. In particular, it is possible to construct complex queries that select elements from several different relations or tables. The following example extracts the MDR report key, any remedial actions and the patient outcome for any incidents involving the CIU panel brand of devices. Notice that the patient outcome is derived from the data that is held in the MAUDE patient file, illustrated by Table 14.14, while the remedial action is associated with a particular device report, illustrated by Table 14.12:

    SELECT FROM Table 14.12, Table 14.14 WHERE