Safety-Directed System Monitoring Using Safety Cases

Yiannis Papadopoulos

Thesis submitted for the degree of Doctor of Philosophy

The University of York
Department of Computer Science

February 2000
This thesis is dedicated to my grandmother Veta Kanli, for her heroic struggle in life.
Abstract

Currently, the safety studies of a system (collectively known as the safety case) cease or reduce in their utility after system certification, and with that, a vast amount of knowledge about the failure (or safe) behaviour of the system is usually rendered useless. In this thesis, we argue that this knowledge could be usefully exploited in the context of an appropriate on-line safety monitoring scheme. As a practical application of our approach, we propose a safety monitor that operates on safety cases to support the on-line detection and control of hazardous failures in safety critical systems.

Firstly, we identify a number of problems encountered in the development of safety cases using classical safety analysis techniques, and propose a new safety analysis method which can guarantee the consistency and improve, to some extent, the completeness and correctness of the safety case. The new method enables the assessment of hierarchically described complex systems that may exhibit either static or dynamic behaviour or structure. The assessment process in the proposed method revolves around a hierarchical structural and behavioural model of the system under examination. The result of the assessment is a semi-mechanically synthesised, well-formed, layered safety case, which is composed of a collection of inter-related design models and safety analyses, and which enables automated checks that confirm the consistent integration of those models and analyses. We show that such a safety case can be mechanically transformed into an executable specification, upon which an automated monitor could operate in real time.

In the second part of the thesis, we develop the engine of the safety monitor: a set of generic mechanisms by which the monitor operates on such specifications in order to deliver a wide range of monitoring functions. We show that these functions span from the primary detection of the symptoms of disturbances, through on-line fault diagnosis, to the provision of corrective measures that minimise or remove the effects of failures. Finally, in the light of our study, we deal with some of the issues that arise from previous research in model-based diagnosis, and still concern other model-based approaches. More specifically, we discuss the extent to which the safety case can help to represent, and solve, the range of monitoring problems that are encountered in complex systems, and whether the proposed approach is likely to succeed in scaling up to large systems or systems with complex behaviour.
Contents

CHAPTER ONE  INTRODUCTION
1.1 Problem and Scope
1.2 Approaches to Safety Monitoring
    1.2.1 The Dominant Applied Approach
    1.2.2 Research on Safety Monitoring
1.3 The Thesis
    1.3.1 Early Work on Safety Cases
    1.3.2 Aims, and Approach of the Thesis to Safety Monitoring
    1.3.3 Approach of the Thesis to Safety Analysis
    1.3.4 Case Studies and Evaluation
    1.3.5 Structure and Presentation

CHAPTER TWO  APPROACHES TO SAFETY MONITORING
2.1 Introduction
2.2 Optimisation of Status Information and Alarms
    2.2.1 Detection of Sensor Failures
    2.2.2 State-Alarms Association
    2.2.3 Alarm Suppression Using Multiple Stage Alarm Relationships
    2.2.4 Identifying Alarm Patterns
    2.2.5 Organising Alarms Using Alarm Trees
    2.2.6 Synthesising Alarms Using Logical Inferences
    2.2.7 Function-Oriented Monitoring
    2.2.8 Hierarchical Presentation of Status Information
2.3 On-line Fault Diagnosis
    2.3.1 Rule-based Expert Systems
    2.3.2 Fault Propagation Models
    2.3.3 Qualitative Causal Process Graphs
    2.3.4 Models of Functional Decomposition
    2.3.5 Diagnosis from First Principles
    2.3.6 Qualitative Simulation
2.4 Failure Control and Correction
2.5 Discussion

CHAPTER THREE  HIERARCHICALLY PERFORMED HAZARD ORIGIN AND PROPAGATION STUDIES
3.1 Introduction
    3.1.1 The Brake-by-wire System
    3.1.2 Difficulties Arising from the Fragmentation of Classical Safety Analyses
3.2 HiP-HOPS
    3.2.1 Overview of the Method
    3.2.2 An Extended Functional Failure Analysis Process
    3.2.3 Hierarchical Modelling
    3.2.4 Assessment of Failure at Component Level
    3.2.5 An Algorithm for the Mechanical Synthesis of Fault Trees
    3.2.6 The Synthesis Algorithm
    3.2.7 Extending the Synthesis Algorithm
3.3 Case Study
    3.3.1 Functional Failure Analysis
    3.3.2 Hierarchical Model
    3.3.3 Interface Focused-FMEAs
    3.3.4 Fault Tree Synthesis
3.4 Discussion

CHAPTER FOUR  MODELLING AND SAFETY ANALYSIS OF SYSTEMS WITH DYNAMIC BEHAVIOUR OR STRUCTURE
4.1 Introduction
4.2 Dynamic Behaviour & Structure as Impediments in Safety Analysis
4.3 Representing Dynamic Behaviour or Structure
    4.3.1 Representing Dynamic Behaviour with Abstract Functional States and State Machines
    4.3.2 Modelling Structural Transformations
    4.3.3 Scale, Complexity and the Role of Hierarchical Modelling
    4.3.4 The Dynamic Model
    4.3.5 The Dynamic Model as a Form of Safety Analysis
4.4 Modelling and Safety Analysis Process
    4.4.1 The Analytical Stage of the Process
    4.4.2 The Synthetic Stage of the Process
    4.4.3 Two Ways to Apply HiP-HOPS in Dynamic Systems
4.5 Case Study
    4.5.1 Introduction to the Fuel System and Scope of the Case Study
    4.5.2 The Static Hierarchy
    4.5.3 Normal Behaviour of the Fuel System and its Representation in the Dynamic Model
    4.5.4 Analysis of the Engine Feed Cross-Feed Sub-system
    4.5.5 Synthesis of the Fuel System Mode-chart
4.6 Discussion

CHAPTER FIVE  THE SAFETY CASE AS A MODEL FOR ON-LINE SAFETY MONITORING
5.1 Introduction
    5.1.1 Sketching the Architecture of the Safety Monitor
    5.1.2 Sketching the Role of a Safety Monitor in the Life-cycle of a Safety Critical System
5.2 Primary Detection of Abnormal Conditions
    5.2.1 Selecting a Level for Performing the Primary Failure Detection
    5.2.2 Using Constraints to Detect the Symptoms of Failure
    5.2.3 Filtering Spurious Measurements and Transient Behaviour
    5.2.4 Monitoring Deviations from Parameter Trends and from Complex Relationships Among Process Parameters Over Time
    5.2.5 The Syntax of Monitoring Expressions
    5.2.6 The Architecture of the Event Monitor
    5.2.7 Evaluating Expressions that Contain Components with Unknown Truth Values
    5.2.8 Evaluating Timed Expressions
5.3 Fault Diagnosis
5.4 Assessment of the Effects of Failure on System Function and Provision of Corrective Measures
5.5 Validation of Sensory Data
5.6 Experiments
    5.6.1 Validating the Safety Case
    5.6.2 Detection of Failures
    5.6.3 Diagnosis and the Response of the Safety Monitor to Detected Disturbances
5.7 Discussion

CHAPTER SIX  CONCLUSIONS

BIBLIOGRAPHY
List of Tables

Table 2.1. Our view of effective safety monitoring
Table 3.1. IF-FMEAs of components of the simplified wheel node
Table 3.2. Main functional failures of a wheel braking function
Table 3.3. Sensor IF-FMEA
Table 4.1. Functional Failure Analysis of the EFCF sub-system
Table 4.2. IF-FMEAs of the basic components in the EFCF architecture
Table 5.1. The failure modes that can be injected into the fuel system (and its simulator)
List of Figures

Figure 1.1. Goals of advanced safety monitoring
Figure 2.1. Classification of alarm optimisation methods
Figure 2.2. Heating system & operational sequence [Papadopoulos, 1993]
Figure 2.3. Alarm pattern matching
Figure 2.4. Production rule and transfer pump
Figure 2.5. Checklist of essential items for take-off [Morgan et al, 1992]
Figure 2.6. A hydraulic failure reported by the MD-11 EIS [Morgan and Miller, 1992]
Figure 2.7. Families of diagnostic methods
Figure 2.8. An instance of a hypothetical BATTERY frame [Talatarian, 1992]
Figure 2.9. Cause Consequence Diagram (CCD)
Figure 2.10. Heat exchanger system and digraph
Figure 2.11. Automatically generated fault tree for heat exchanger
Figure 2.12. Typical structure of a GTST [Chen and Modarres, 1992]
Figure 2.13. Example of a multilevel flow model [Larsson, 1996]
Figure 2.14. Multiplier and adder circuit [Davis and Hamscher, 1992]
Figure 2.15. A model of the behaviour of an adder
Figure 2.16. Assumption based Truth Maintenance System [de Kleer and Williams, 1992]
Figure 2.17. Transformation of a differential equation into a set of qualitative constraints [Kuipers, 1987]
Figure 2.18. Quick Reference Handbook, MD11 [Billings, 1991]
Figure 2.19. Example GOST in initial configuration [Hill, 1993]
Figure 2.20. The GOST after the first corrective action [Hill, 1993]
Figure 2.21. The GOST after the second corrective action [Hill, 1993]
Figure 3.1. Architecture of the brake-by-wire system
Figure 3.2. Overview of design and safety analysis in HiP-HOPS
Figure 3.3. The standard FFA process
Figure 3.4. Example functional model
Figure 3.5. The proposed extended FFA process
Figure 3.6. Modelling notation
Figure 3.7. The hierarchical model editor in SAM
Figure 3.8. Model and fragment of the IF-FMEA of the pedal task
Figure 3.9. Component and its IF-FMEA table
Figure 3.10. Grammar for IF-FMEA expressions
Figure 3.11. The parse tree for the given example expression
Figure 3.12. The equivalent mini fault tree
Figure 3.13. The Fault Tree Synthesis algorithm
Figure 3.14. The two layers of the hierarchy of the simplified wheel node
Figure 3.15. The first level of the fault tree
Figure 3.16. Fault tree synthesis in progress
Figure 3.17. Depth first traversal and expansion of the fault tree
Figure 3.18. The final fault tree for the event loss of braking
Figure 3.19. Automatic Fault Tree Synthesis in SAM
Figure 3.20. A fragment of the table of events for a fault tree of the brake-by-wire system
Figure 3.21. An example fault tree and the tabular representation of its structure
Figure 3.22. The extended fault tree synthesis algorithm
Figure 3.23. Abstract functional model of the brake-by-wire system
Figure 3.24. Top level of the hierarchical model of the BBW system
Figure 3.25. Two successive levels in the hierarchy of the brake-by-wire model
Figure 3.26. Sensor and analogue to digital converter unit
Figure 3.27. Simplified model of the peak detection and removal task
Figure 3.28. Distant view of the fault tree for the event “Loss of braking”
Figure 3.29. Distant view of the fault tree for the event “Inadvertent braking”
Figure 4.1. Classification of systems according to their behaviour and structure
Figure 4.2. Describing dynamic behaviour using mode-charts
Figure 4.3. Portraying structural transformations in mode-charts
Figure 4.4. Two levels in the hierarchy of sub-systems participating in flight control
Figure 4.5. Relationship between the static hierarchy and the dynamic model of the system
Figure 4.6. Communication between mode-charts in the hierarchy of the dynamic model
Figure 4.7. Notation for mode-charts
Figure 4.8. A sub-system in the static hierarchy and its mode-chart
Figure 4.9. Modelling and safety analysis of dynamic systems
Figure 4.10. Synthesis of mode-charts from the bottom towards the top of the hierarchy
Figure 4.11. Incorporating modes in low-level analysis
Figure 4.12. The extended fault tree synthesis algorithm
Figure 4.13. Physical configuration of the aircraft fuel system
Figure 4.14. The first two levels in the static hierarchy of the fuel system
Figure 4.15. Further decomposition of the EFCF, CD and LWD subsystems
Figure 4.16. Distribution of fuel flows among sub-systems in normal operation
Figure 4.17. The control scheme of the fuel system
Figure 4.18. The effect of PI control on the line that feeds the port engine
Figure 4.19. Tank levels over a complete run of the fuel system
Figure 4.20. The static model of the fuel system and its dynamic model in normal conditions
Figure 4.21. Fault tree for the condition “no fuel provided to the starboard engine”
Figure 4.22. Part of the dynamic model of the EFCF sub-system
Figure 4.23. Updating the Functional Failure Analysis and IF-FMEAs of components during the analysis of the cross-feeding sub-mode
Figure 4.24. Fault tree for the condition “no fuel provided to the starboard engine” in the cross-feeding mode
Figure 4.25. The dynamic model after a partial analysis of the cross-feeding mode
Figure 4.26. The effect of EFCF failure on the fuel system mode-chart
Figure 4.27. The new control scheme for the fuel system in the cross-feed mode
Figure 4.28. Part of the dynamic model of the fuel system
Figure 5.1. The position and architecture of the safety monitor
Figure 5.2. The role of the safety monitor in the lifecycle of the system
Figure 5.3. Parameter values often exhibit a probabilistic distribution
Figure 5.4. The central tank of the fuel system
Figure 5.5. The grammar of monitoring expressions
Figure 5.6. The event monitor
Figure 5.7. A ring buffer
Figure 5.8. Basic logical operations between known and unknown truth values
Figure 5.9. Two symmetrical wing tanks of the fuel system
Figure 5.10. Fault tree for the event of fuel imbalance between two symmetrical tanks
Figure 5.11. The diagnostic algorithm
Figure 5.12. The architecture of the diagnostic engine
Figure 5.13. The event processing algorithm
Figure 5.14. The architecture of the event monitor
Figure 5.15. An event tree for sensor validation
Figure 5.16. The experimental platform
Figure 5.17. The two configurations of the central sub-system
Figure 5.18. The response to a transient sensor failure and a permanent pump failure
Figure 5.19. Detection of leaks
Figure 5.20. Distant view of the mode-chart of the engine feeding (EFCF) sub-system
Figure 5.21. Fragments of the architecture and the safety case of the left wing sub-system
Figure 5.22. Examples of formally specified corrective measures
Figure 5.23. Failure and recovery in the left wing sub-system
Figure 5.24. The control scheme of the fuel system after a permanent failure of the left wing
Figure 5.25. The effect of left wing failure on the mode-chart of the fuel system
Figure 5.26. Taking corrective measures at system level
Acknowledgements

Many people have accompanied me on the long road that led to this paragraph, and have helped me with their knowledge and friendship to complete this project. John McDermid supervised my studies. He created an environment that encouraged the autonomy of spirit and the development of ideas, he gave me support and advice throughout, and he has my deepest thanks. I am also indebted to all those who contributed to the development of the experimental platform upon which I evaluated the concepts and methods presented in this thesis. Jonathan Moffett has supported and advised me at critical stages of this work. Pete Kirkham and Steve Wilson have supplied the basic infrastructure upon which I developed the algorithms and safety analysis tools that facilitate the construction of safety cases. Anthony Moulds has engineered (and re-engineered for me) the hardware and low-level communication mechanisms of the model fuel system that I used in one of my case studies. John Firth, a colleague and friend, has contributed immensely to the work on monitoring and the fuel system, mainly with the implementation and optimisation of the algorithms presented in chapter five of the thesis. To all those colleagues, for the mornings (and sometimes evenings) that they spent in the department with me and for the quality of their work and friendship, I am deeply grateful. I am also thankful to Giuseppe Mauri, a fellow doctoral student and friend, with whom I have shared a common interest in the topic of safety analysis. Our collaboration over the last couple of years has helped me clarify some aspects of my work in this area. Thanks also go to Ralph Sasse and Guenter Heiner from DaimlerChrysler, and to Hermann Kopetz and his colleagues at the Technical University of Vienna, who provided me with their invaluable insight into the mechanics of brake-by-wire and time-triggered technologies.
Finally, my thanks to Ginny Wilson, who despite her busy schedule as our research group administrator, has always been wonderful in sorting out the practical difficulties that I encountered in my work. For the guitar, the pints, the coffee, the discussions, the gym, the friendship and all the things that spirited life in and out of work, thanks go to John Firth, Qi Shi, Giuseppe Mauri, Karsten Loer, Sunwoo Kim, Nathalie Foster, and my good friends (the eternal traveller and dreamer) with whom I have shared so much. I will always be grateful to Majella Kilkey for being brilliant company and a constant source of inspiration and of personal development throughout those years. There are no words that can express my thanks to my parents and my big sister who, not only have always encouraged me to study, but also kept sending me cigarettes from so far away; a thousand thanks also to my brother (in law, but not only) for, among other things, dealing with the bureaucracy concerning my military service; and my love to my little niece. During those years my heart has always been with them.
Author’s Declaration

Several colleagues whom I have acknowledged have contributed to the development of the experimental platform that I have used to evaluate the concepts, algorithms and methods presented in this thesis. I declare, though, that the material contained in this thesis represents original work undertaken solely by the author. The various aspects of the work covered in this material have been presented in a number of international conferences and scientific journals. The work on safety analysis (chapters three and four of the thesis) forms the basis of [Papadopoulos and McDermid, 1999(b,c&d)] and [Papadopoulos et al, forthcoming(a)], and contributes to [Papadopoulos and McDermid, 1998(b&c)] and [Papadopoulos and McDermid, 1999(a)]. The work on safety monitoring (chapter five) forms the basis of [Papadopoulos and McDermid, 1998(a)] and contributes to [Papadopoulos et al, forthcoming(b)].
Chapter One
Introduction
1.1 Problem and Scope

The term safety critical systems describes a class of engineered systems that pose hazards for people and the environment. During the development of such systems, rigorous design methods and analysis techniques are employed to minimise the likelihood of potential hazards. Such measures, however, are only partially effective in preventing the occurrence of hazardous failures. Operational experience from systems such as aeroplanes and nuclear power plants, for example, shows that unlikely or unanticipated failure scenarios still occur and that, in practice, such failures have often severely compromised the safety of such systems. One way of preventing failures from developing into serious disturbances is to use additional external devices that continuously monitor the health of the system or that of the controlled processes. The role of such safety monitoring devices is to detect conditions that signal potentially hazardous disturbances, and assist the operators of the system in the timely control of those disturbances. The traditional approach to safety monitoring has been to rely primarily on the abilities of operators to cope with a large amount of status information and alarms. Until the early 80s, monitoring aids were confined to hard-wired monitoring and alarm annunciation panels that provided direct indication of system statuses and alarms. Those conventional alarm systems have often failed to support the effective recognition and control of failures. A prominent example of such failure is the Three Mile Island nuclear reactor (TMI-2) accident in 1979 [Malone, 1980], where a large number of alarms raised at the early stages of the disturbance confused operators and led them to a misdiagnosis of the plant status. Similar problems have also been experienced in the aviation industry. Older aircraft, for example, were equipped with status monitoring systems that incorporated a multiplicity of instruments and pilot lights.
Those systems were also ineffective in raising the level of situational awareness that is required in conditions of failure. Indeed, several accidents have occurred because pilots ignored, disabled or responded late to the indications or warnings of the alarm system [Billings, 1991]. Those problems led to the realisation that it is unsafe to rely on operators (alone) to cope with an increasing amount of process information. They highlighted the need for improved process feedback and the importance of providing intelligent support in monitoring operations. Statistics have also consistently supported that view by showing that operator errors contribute significantly to accidents in safety critical systems. A large survey [Lloyd and Tye, 1992] of aircraft accidents that was conducted by the Civil Aviation Authority in 1992, for example, indicated that approximately 50% of those accidents were caused by improper crew operations. In several cases, crews simply failed to take a set of corrective measures that could have resolved the anomaly and prevented the accident. This survey clearly highlighted the importance of the human factor in the safety of a system. At the same time, though, it questioned the extent to which present state-of-the-art aircraft warning and advisory systems improve the operator's awareness and ability to respond correctly to complex failures. In view of the problems experienced with conventional monitoring systems, considerable work has been focused on the development of computerised monitors. A series of significant studies in the aerospace [1] and nuclear [2] industries re-assessed the problem of safety monitoring and created a vast body of literature on requirements for the effective accomplishment of the task. We have abstracted away from the detailed recommendations of individual studies, and we present here a structured set of goals, which, we believe, effectively summarises this body of research. Figure 1.1 illustrates these goals and the hierarchical relationships between them. The first of those goals is the optimisation of status monitoring and alarm annunciation.
We use the term optimisation to describe functions such as: the filtering of false and unimportant alarms; the organisation of alarms to reflect priority or causal relationships; and the integration of plant measurements and low-level alarm signals into high-level functional alarms. The rationale here is that fewer, and more informative, functional alarms could help operators to better understand the nature of disturbances.
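To make this kind of alarm integration concrete, the following minimal Python sketch (with entirely hypothetical signal names; the thesis prescribes no such implementation) synthesises high-level functional alarms from raw alarm bits, suppressing signals that are redundant or merely consequential at the functional level:

```python
# Sketch: integrating low-level alarm signals into high-level functional
# alarms. All signal names (pump_a_failed, etc.) are hypothetical.

def functional_alarm(alarms: dict) -> list:
    """Return high-level functional alarms synthesised from raw alarm bits."""
    messages = []
    # A "loss of coolant circulation" functional alarm fires only when BOTH
    # redundant pumps have failed; a single pump alarm is suppressed as
    # unimportant at this level.
    if alarms.get("pump_a_failed") and alarms.get("pump_b_failed"):
        messages.append("LOSS OF COOLANT CIRCULATION")
    # A low-flow alarm raised while a pump alarm is already active is a
    # causal consequence of that fault, so it is filtered rather than
    # annunciated separately.
    if alarms.get("flow_low") and not (
        alarms.get("pump_a_failed") or alarms.get("pump_b_failed")
    ):
        messages.append("LOW COOLANT FLOW - CAUSE UNKNOWN")
    return messages

print(functional_alarm({"pump_a_failed": True, "pump_b_failed": True,
                        "flow_low": True}))
# → ['LOSS OF COOLANT CIRCULATION']
```

Here three raw alarm bits collapse into a single functional alarm, illustrating how integration reduces the alarm load that reached operators at TMI-2.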
[1] For studies on requirements for safety monitoring in the aerospace sector, the reader is referred to [Randle et al, 1980], [Wiener and Curry, 1980], [Wiener, 1985], [Wiener, 1987] and [Billings, 1991].
[2] For similar studies in the nuclear and process industries, the reader is referred to [Long, 1980a/b], [Lees, 1983], [Rankin and Ames, 1983], [Roscoe and Weston, 1986] and [Kim, 1994].
Functional alarms, though, would generally describe the symptoms of a disturbance and not its underlying causes. Complex failures, however, are not always manifested with unique, clear and easy to identify symptoms. The reason is that causes and their effects do not necessarily have an exclusive one to one relationship. Indeed, different faults often exhibit identical or very similar symptoms. The treatment of those symptoms typically requires fault diagnosis, that is, the isolation of the underlying causal faults from a series of observable effects on the monitored process. Over recent years considerable research has been focused on the development of automated systems that can assist operators in this highly skilled and difficult task. The principal goal of this research is to explore ways of automating the early detection of escalating disturbances and the isolation of root failures from ambiguous anomalous symptoms. The detection and diagnosis of failures are followed by the final and most decisive phase of safety monitoring, the phase of failure control and correction. Here, the system or the operator has to take actions that will remove or minimise the hazardous effects of failure. In taking such actions, though, operators often need to understand first how low-level malfunctions affect the functionality of the system. Thus, there are two areas in which an automated monitor could provide help in the control and correction of failures: the assessment of the effects of failure, and the provision of guidance on appropriate corrective measures. An additional and potentially useful goal of advanced monitoring is prognosis, the prediction of future conditions from present observations and historical trends of process parameters [Lewantowski, 1997].
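The many-to-one relationship between faults and symptoms can be illustrated with a small sketch (hypothetical fault and symptom names, not drawn from the thesis): a diagnostic step narrows an observed symptom set down to the candidate root faults that are consistent with it.

```python
# Sketch: several faults can produce the same observable symptom, so
# diagnosis must isolate a set of candidate root faults from the observed
# effects. Fault and symptom names are hypothetical.

FAULT_EFFECTS = {
    "valve_stuck_closed": {"flow_low", "pressure_high"},
    "pump_worn":          {"flow_low", "vibration_high"},
    "sensor_drift":       {"flow_low"},
}

def candidate_faults(observed: set) -> list:
    """Return faults whose predicted effects include every observed symptom."""
    return sorted(fault for fault, effects in FAULT_EFFECTS.items()
                  if observed <= effects)

# A single symptom is ambiguous: all three faults remain candidates...
print(candidate_faults({"flow_low"}))
# ...while an additional symptom discriminates between them.
print(candidate_faults({"flow_low", "vibration_high"}))
```

A single shared symptom leaves all three faults as candidates; only a second, discriminating observation isolates the root cause, which is precisely why symptom treatment "typically requires fault diagnosis".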
[Figure 1.1 summarises the goals of advanced safety monitoring under three headings:
- Optimisation of status monitoring and alarm annunciation: filtering of false and unimportant alarms; organisation of alarms to reflect priority or causal relationships; provision of high level functional alarms.
- On-line fault diagnosis: early detection of escalating disturbances; isolation of root failures from observed anomalous symptoms.
- Failure control and correction: assessment, and provision to the operator, of the effects of failure; guidance on corrective measures that remove or minimise the effects of failure.]
Figure 1.1. Goals of advanced safety monitoring
In the context of mechanical components, and more generally components which are subject to wear, prognosis can be seen as the early identification of conditions that signify a potential failure [Pusey, 1998]. At system level, the objective is to make predictions about the propagation of a disturbance while the disturbance is in progress and still at an early stage. Such predictions could help, for example, to assess the severity of the disturbance and, hence, the risk associated with the current state of the plant. Prognosis is, therefore, a potentially useful tool for optimising decisions about the appropriate course of action during process disturbances. In this work, we deal with two aspects of the safety monitoring problem that are closely related to prognosis: the early detection and diagnosis of disturbances, and the assessment of the functional effects of low-level failures. We wish to clarify, though, that the issue of prognosis itself is outside the scope of the thesis.
1.2 Approaches to Safety Monitoring A number of different approaches to achieving those goals have been explored over the years. One approach has been particularly influential in the development of advanced computerised monitors for safety critical systems. At the centre of this approach lies the concept of the fault dictionary [Tzafestas, 1989].
1.2.1 The Dominant Applied Approach A fault dictionary contains a list of faults, the symptoms that these faults cause on the monitored process, other effects on the system and possible corrective measures. This list is typically compiled by first establishing a set of possible faults and then using simulation techniques to identify the effects of these faults. The list is then organised by symptom to provide a dictionary that relates each observed misbehaviour to one or more underlying faults. In real-time, this dictionary provides a basis for the detection of anomalous symptoms and the provision of diagnostic messages about underlying faults and possible corrective measures. A number of advanced industrial monitors have adopted a form of this approach to safety monitoring. Such systems include the MD-11 Electronic Instrument System [Bell, 1992] and the A320 Electronic Centralised Aircraft Monitor [Airbus, 1989]. Despite their practicability and prevalence in applied systems, however, fault dictionaries also have some serious drawbacks. Firstly, they cover only a limited set of anticipated faults (usually built from relatively simple component fault models). Secondly, they provide an oversimplified model of the actual relationships between faults and their effects, which does not allow in-depth diagnostic operations and other useful forms of reasoning about the nature and treatment of disturbances.
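To illustrate the concept, the following sketch shows how such a dictionary might be organised by symptom and queried in real time. This is a minimal sketch in Python; the symptom and fault names are invented for illustration and are not drawn from any of the systems cited above.

```python
# A fault dictionary organised by symptom: each observed symptom maps to
# candidate faults, their wider effects and possible corrective measures.
# (All names here are illustrative, not taken from a real system.)
FAULT_DICTIONARY = {
    "no_flow_line_A": [
        {"fault": "pump_A_failed", "effects": "loss of transfer function",
         "correction": "switch to standby pump"},
        {"fault": "suction_valve_closed", "effects": "pump cavitation risk",
         "correction": "open suction valve, check pump"},
    ],
    "low_pressure_line_A": [
        {"fault": "suction_valve_closed", "effects": "pump cavitation risk",
         "correction": "open suction valve, check pump"},
    ],
}

def diagnose(observed_symptoms):
    """Return the candidate faults consistent with every observed symptom."""
    candidates = None
    for symptom in observed_symptoms:
        faults = {entry["fault"] for entry in FAULT_DICTIONARY.get(symptom, [])}
        candidates = faults if candidates is None else candidates & faults
    return candidates or set()

print(diagnose(["no_flow_line_A", "low_pressure_line_A"]))
# {'suction_valve_closed'}
```

The intersection over symptoms captures the essence of the approach: a single symptom typically leaves an ambiguous set of candidate faults, while additional symptoms narrow it. The drawbacks discussed above are also visible here: only the enumerated faults can ever be diagnosed, and the flat symptom-to-fault mapping supports no deeper reasoning about the disturbance.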
1.2.2 Research on Safety Monitoring To address those limitations a number of approaches have been proposed. The majority of those approaches have emerged from advances in artificial intelligence (AI), and in particular from developments in the areas of expert systems and model-based diagnosis. Expert monitoring systems transferred AI technology from the area of medical diagnosis, where it had previously been applied with significant success. Among the first systems that reached the stage of a research prototype were NPPC [Underwood, 1982] and REACTOR [Nelson, 1982]. Both systems helped nuclear plant operators to determine the causes of abnormal events, by monitoring instrument readings and looking for deviations from normal operating conditions. Expert systems reason about such disturbances by operating on a knowledge base that contains production rules. These rules typically describe functional relationships between system entities in the context of process disturbances and causal relationships between failures and effects. Expert systems have demonstrated the ability to reason about complicated problems such as the diagnosis of root failures, the determination of the effects of failure and the generation of remedial measures. Their application has been very successful in relatively simple systems and processes. However, attempts at wider application to complex systems have shown that rule-based systems are prone to inconsistencies, incompleteness, long search times and lack of portability and maintainability [Chen and Modarres, 1992]. From a methodological point of view, many of these problems can be attributed to the limitations of the production rule as a model for knowledge representation.
Indeed, “if-then” statements offer analysts limited help in thinking about how to represent the knowledge about the system and its processes that the monitor would require for reasoning about the nature of disturbances. These problems underlined the need for more elaborate models for knowledge representation and created a strong interest in model-based monitoring systems. Model-based systems can solve advanced monitoring problems by operating on functional, logical, structural or behavioural models of the system or its processes. In doing so, they are more likely to be consistent, and they provide better diagnostic
coverage than expert systems, because the model building and model validation processes supply a way of systematically enumerating the required knowledge about the monitored process. A number of models (and algorithms) have been proposed over the years. These include various forms of goal trees, goal hierarchies, fault propagation models, flow models, system digraphs, other forms of semantic networks and qualitative simulation models. The application of model-based systems has been demonstrated successfully in a number of domains, from the diagnosis of micro-electronic devices to the monitoring of chemical plants. However, a number of open research issues still remain. Firstly, there is still a plethora of monitoring and diagnostic problems which are too hard to describe using current modelling technology. Secondly, model-based approaches are difficult to scale up to large systems or systems with complex behaviour. Indeed, scale and complexity often make it impossible to achieve consistent and representative models of the monitored process. Simplifying assumptions can usually help to reduce the complexity of the monitoring model. But the challenge here often lies in how to make the right simplifications that do not compromise the detection and diagnostic ability of the monitor. Closing our brief introductory reference to model-based approaches, we wish to point out that one possible way to address those issues would be to improve existing models and algorithms. Given the limitations of current model-based approaches, however, it also makes sense to explore new or alternative solutions to the problem of safety monitoring.
1.3 The Thesis In this thesis, we introduce the concept of the safety case as a model for safety monitoring. More specifically, we investigate the possibility of using the safety analyses of the system contained in a well-formed safety case for representing and solving a number of advanced monitoring problems that we have outlined in Figure 1.1.
1.3.1 Early Work on Safety Cases A safety case is a structured compilation of the results from the off-line safety assessment of the system. Over recent years, the production of such documents has increasingly become an obligation on developers of safety critical systems. Indeed, the aerospace [McDermid, 1994], railway [Edwards, 1995], nuclear [Fenelon et al, 1995] and process
[Lees, 1990] industries have already established certification procedures based on the development and assessment of such documents. Despite differences in content and presentation, these safety cases share a common purpose: to present a convincing argument that the system will operate in an acceptably safe manner. Research, however, indicates that this aim is not easily achievable and that the synthesis of a clear, consistent and complete safety argument, from a potentially vast amount of often inter-related safety evidence, is in practice an extremely difficult task. In an effort to assist the development and maintenance of safety cases, the ASAM projects at the University of York [McDermid, 1994] have developed a structured method and comprehensive tool support. The tool is called the Safety Argument Manager (SAM), and enables analysts to conduct a wide range of classical safety analyses and relate the results from individual studies to an overall safety argument about the system. This argument is typically developed as a hierarchy of lower-level safety arguments, which are substantiated by technical and procedural evidence from the safety assessment. Hierarchies of safety arguments are developed in SAM using a particular form of representation called the Goal Structuring Notation (GSN) [Wilson and McDermid, 1995], [Wilson et al, 1995]. One significant contribution of this work is that it has created the concept of an electronic safety case: a safety case which, rather than being a linear document, is a structured, electronically held set of safety arguments and technical evidence of safety. In this work, we draw from that earlier research on electronic safety cases. At the same time, though, we radically revise the form of the safety case as well as the methods that support its development. In addition, we explore the potential of extending the lifespan of such electronic safety cases into the operational phase of the lifecycle.
1.3.2 Aims, and Approach of the Thesis to Safety Monitoring Currently, safety cases cease or reduce in their utility after system certification, and with that, a vast amount of knowledge about the failure (or safe) behaviour of the system is usually rendered useless. We believe, and we wish to demonstrate, that an appropriately formulated electronic safety case can put this knowledge to the service of operators. As a practical application of our approach, we propose a safety monitor that uses safety cases to support the on-line detection and control of hazardous failures in safety critical systems.
Firstly, we identify appropriate methods for safety analysis and the development of the safety case. Secondly, we determine how to organise the knowledge contained in the safety case into an executable specification, in other words, we show how to develop the monitoring model. Finally, we define the engine of the safety monitor. This is a set of generic algorithms by which the safety monitor operates on monitoring models in order to deliver a range of failure detection and control functions in real-time. We explore the feasibility of this new approach to the problem of safety monitoring and we investigate potential benefits, and limiting factors to its application in complex systems. In the light of our study, we also deal with some of the issues that arise from previous research in model-based monitoring, and still concern other model-based approaches. The primary questions that this thesis will attempt to address are:
a) Is the safety case an appropriate model for representing and solving the range of problems that are encountered in safety monitoring of complex systems?
b) Is this approach likely to succeed in scaling up to large systems or systems with complex behaviour?
c) Finally, to what extent can the techniques that we employ for the development of the safety case help to achieve a consistent and representative monitoring model for the system?
1.3.3 Approach of the Thesis to Safety Analysis Safety cases are typically developed in two parts. The first part is based on the technical assessment of the system while the second part describes the quality and safety management procedures that were applied during the design and development processes. For obvious reasons, the focal point of this work is the technical part of the safety case, which is largely formed using the results from various classical safety analysis techniques. We use the term classical here to describe a number of well-established techniques that include, for example, Functional Failure Analysis (FFA) [SAE, 1996], Hazard and Operability Studies (HAZOPS) [Kletz, 1992], Failure Mode and Effects Analysis (FMEA) [Palady, 1995] and Fault Tree Analysis (FTA) [Vesely, 1981]. Those techniques have demonstrated their value over the years, and they are still widely practised in safety assessments. As the complexity of modern programmable electronic
systems increases, though, the application of those techniques is also becoming increasingly problematic. Classical techniques are typically applicable at different stages of the design lifecycle, and require models of the system that reflect different levels of abstraction in the design. In a complex assessment, though, the selective and fragmented application of different techniques has a number of negative implications for the quality of the results gained from the assessment. In chapters three and four of this thesis, we will examine those implications and we shall see that the union of classical safety analyses often fails to provide a coherent and complete picture of how the system and its constituent parts behave in conditions of failure. This, however, raises an additional important question concerning our research:
Given the problems with classical safety analysis, are classical techniques appropriate for the development of effective monitoring models?
We did not need to rely exclusively on our intuition or judgement, or indeed to experiment, in order to answer this question. Previous work on model-based diagnosis has confirmed that model-based monitoring can only be as good as the model [Davis and Hamscher, 1992]. This generalisation, we believe, is true for any model and we can, therefore, use it to draw immediately an important conclusion in relation to our work. That is, if we wish to achieve high quality safety monitoring, we also need to achieve high quality safety analyses. To address the problems encountered in the application of classical techniques, in this thesis we propose a new method for safety analysis. The new method effectively modifies and extends a number of classical safety analysis techniques to enable the assessment of a complex system from the functional level through to the low levels of its hardware and software implementation. It integrates design and safety analysis, and during the assessment, links a consistent hierarchical description of the system to the results of the safety studies. In the context of this work, we have extended the Safety Argument Manager (SAM) tool to provide support for the method and realise some of its potential for mechanical safety analysis and automated consistency checks on the safety case. The primary questions that this thesis will attempt to address with respect to this approach to safety analysis are as follows:
a) Can the method help us rationalise the assessment and gain a better understanding of the mechanics of failure and recovery in complex systems?
b) At the end of the assessment, do the results form a representative model of the behaviour of the system in conditions of failure, and if so, how has the method helped in improving the quality of this model?
1.3.4 Case Studies and Evaluation Two case studies provide the basis for the evaluation of the concepts and methods that we develop in this thesis. The first is a safety study of a distributed brake-by-wire system for cars, which provides a design concept for future brake-by-wire applications in the automotive industry. The result of this study is a safety case that we have constructed by analysing a detailed model of the hardware and software implementation of this system. This study will help us address the main questions concerning our approach to safety analysis. The second case study goes beyond the issue of safety analysis and demonstrates our approach to safety monitoring. The system that we examine here is a computer controlled laboratory model of an aircraft fuel system. This study is developed in two parts: the safety case and the safety monitor. The safety case re-addresses the safety assessment problem and explores some additional elements of our approach to safety analysis. More specifically, in this study, we show how to deal with the complications introduced in the analysis by the dynamic behaviour or structure of a system, and how to generate safety cases for systems with temporally variable functional profiles or dynamically changing configurations of components. The second part of the case study is devoted to an experimental safety monitor that we have developed and then used, on-line, in order to monitor the fuel system. The safety monitor can detect and diagnose faults that are injected into the system, by operating on the fuel system safety case. In responding to those anomalies, the monitor can also take corrective actions that we have specified in the safety case. The experiments we have conducted with the safety monitor on the fuel system provide the basis for the evaluation of our approach to safety monitoring.
1.3.5 Structure and Presentation The thesis is organised into four main chapters. In chapter two, we explore the literature on safety monitoring. We identify other safety monitoring methods and develop a critical review and a classification of those methods. In chapter three we describe our approach to safety analysis and the construction of the safety case. We develop the safety analysis method and discuss its application on the brake-by-wire system. In chapter four we enrich the method with a set of modelling and analysis concepts that, we believe, simplify and rationalise the assessment of dynamic systems. As a practical application of this approach, we discuss the safety assessment process and the safety case of the fuel system. In chapter five, we present the algorithms of the safety monitor and discuss the experiments that we have conducted on the fuel system. Drawing from those experiments, we also attempt an evaluation of our approach to the monitoring problem. Finally in chapter six, we conclude the thesis and discuss the contribution and limitations of this work.
Chapter Two
2. Approaches to Safety Monitoring
2.1 Introduction In this chapter we explore safety monitoring through a comprehensive review of the relevant literature. We identify and study other safety monitoring methods and determine the requirements that an effective new approach to the problem would need to address. The results from this study (on requirements and methods) provide: (a) a more refined view of the problem, (b) a classification and review of safety monitoring methods, and (c) the conceptual relationship of our work to other developments in this field. Our presentation is structured around the three principal dimensions of the problem that we have identified earlier: the optimisation of a potentially enormous volume of low-level status information and alarms, the diagnosis of complex failures, and the control and correction of those failures.
2.2 Optimisation of Status Information and Alarms During a disturbance in a complex system, the system monitor can potentially return to operators an enormous, and therefore potentially confusing, volume of anomalous plant measurements. In traditional alarm annunciation panels this phenomenon has often caused serious problems in the detection and control of disturbances. Indeed, in such systems, highly significant alarms often went undetected by operators as they were indiscriminately reported amongst a large number of other less important and even spurious alarms [Lees, 1983]. The experience from those conventional monitoring systems shows that a key issue in improving system monitoring is the filtering of false and unimportant alarms. False alarms are typically caused either by instrumentation failures, that is, failures of the sensor or the instrumentation circuit, or normal parameter transients that violate the acceptable operational limits of monitored variables. An equally important factor in the generation of false alarms is the confusion often made between alarms and
statuses [Domenico et al, 1989]. The status of an element of the system or the controlled process indicates the state that the element (component, sub-system or physical parameter) is in. An alarm, on the other hand, indicates that the element is in a particular state when it should be in a different one. Although status and alarm represent two fundamentally different types of process information, statuses are often directly translated into alarms. The problem of false alarms can arise whenever a particular status is not a genuine alarm in every context (e.g. mode) of the system operation. Another central issue in improving the process feedback appears to be the ability of the monitoring system to recognise distinct groups of alarms and their causal/temporal relationships during the disturbance. An initiating hazardous event is typically followed by a cascading sequence of alarms, which are raised by the monitor as the failure propagates in the system and gradually disturbs an increasing number of process variables. However, a static view of this set of alarms does not convey very useful information about the nature of the disturbance. The order in which alarms are added into this set, though, can progressively indicate both the source and future direction of the disturbance (see, for example, the work of [Dansak, 1982], [Roscoe, 1984]). Thus, the ability to organise alarms in a way that can reflect their temporal or causal relationships is an important factor in helping operators to understand better the nature of disturbances. An alternative, and equally useful, way to organise status information and alarms is via a monitoring model which relates low-level process indications to functional losses or malfunctions of the system (see for example the work of [Kim and Modarres, 1987], [Chen and Modarres, 1992], [Larsson, 1996]). Such a model can be used in real-time to provide the operator with a functional, as opposed to a purely architectural view of failures. 
This functional view is becoming increasingly more important as the complexity of safety critical systems increases and, almost inevitably, operators become less aware of the architectural details of the system. A number of methods have been proposed to address the various aspects of optimising the process feedback during a disturbance. Their aims span from simple filtering of false alarms to the provision of an intelligent interface between the plant and the operator that can organise low-level alarm signals and plant measurements into fewer and more meaningful alarm messages and warnings. Figure 2.1 identifies a number of such methods, and classifies them into three categories using as a criterion their primary
goal with respect to safety monitoring. A discussion of these methods follows in the remainder of this section.
2.2.1 Detection of Sensor Failures A primary cause of false and misleading alarms is sensor failures. Such alarms can be avoided using general techniques for signal validation and sensor failure detection. Sensor failures can be classified into two types with respect to the detection process: coarse (easy to detect) and subtle (more difficult to detect). Coarse failures cause indications that lie out of the normal measurement range and can be easily detected using simple limit checking techniques [Meijer, 1981]. Examples of such failures are a short circuit to the power supply or an open circuit. Subtle failures, on the other hand, cause indications that lie within the normal range of measurement and they are more difficult to detect. Such failures include biases, non-linear drifts, and stuck-at failures. One way to detect such failures is to employ a hardware redundancy scheme, for example, a triplex or quadruple configuration of components [Clark, 1978]. Such a system operates by comparing sensor outputs in a straightforward majority-voting scheme. The replicated sensors are usually spatially distributed to ensure protection against common cause failure. Although hardware replication offers a robust method of protection against sensor failures, at the same time it leads to increased weight, volume and cost. An alternative, and usually more cost-effective, way to detect subtle failures is by using analytical redundancy, in other words relationships between dissimilar process variables. State estimation [Isermann, 1984] and parity space [Upadhyaya, 1985], [Patton and Willcox, 1987], [Willcox, 1989] techniques employ control theory to derive a normally consistent set of such relationships, detect inconsistencies caused by sensor failures and locate these failures.
[Figure 2.1 classifies alarm optimisation methods under three goals:
- Filtering of false and unimportant alarms: sensor validation methods; state-alarms association; multiple stage alarm relationships.
- Organisation of alarms to reflect causal relationships: alarm patterns; alarm trees; logic and production rules.
- Function-oriented monitoring: electronic checklists; goal tree success tree.]
Figure 2.1. Classification of alarm optimisation methods
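The two basic detection schemes discussed in this section, limit checking for coarse failures and majority voting over replicated sensors for subtle ones, can be sketched as follows. This is a minimal illustrative sketch in Python; the measurement ranges and tolerances are assumptions, not values from any cited system.

```python
def limit_check(value, low, high):
    """Coarse failure detection: flag readings outside the physically
    possible measurement range (e.g. short to supply, open circuit)."""
    return low <= value <= high

def vote_2oo3(a, b, c, tolerance):
    """Triplex (2-out-of-3) majority voting: return a value agreed by at
    least two sensors, plus the index of any dissenting (suspect) sensor."""
    readings = [a, b, c]
    for i in range(3):
        j, k = [x for x in range(3) if x != i]
        if abs(readings[j] - readings[k]) <= tolerance:
            # sensors j and k agree; check whether sensor i dissents
            agreed = (readings[j] + readings[k]) / 2
            suspect = i if abs(readings[i] - agreed) > tolerance else None
            return agreed, suspect
    return None, None  # no two sensors agree: failure detected, not isolable

# A biased third sensor is outvoted and flagged as suspect.
value, suspect = vote_2oo3(10.0, 10.1, 14.0, tolerance=0.5)
```

The sketch also shows why hardware replication is costly: three sensors are consumed to tolerate a single subtle failure, which is the motivation for the analytical redundancy techniques cited above.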
Kim and Modarres have proposed an alternative method [Kim and Modarres, 1990a], in which diagnosis is achieved by parsing the structure of a sensor validation tree, and by examining in the process a number of sensor validation criteria specified in the tree structure. These criteria represent coherent relationships among process variables, which are violated in the context of sensor failures. Such relationships are formulated using deep knowledge about the system processes, for example, pump performance curves, mass and energy conservation equations and control equations.
2.2.2 State-Alarms Association Another cause of false and irrelevant alarms is the confusion often made between statuses and alarms [Domenico et al, 1989]. A seemingly abnormal status of a system parameter (e.g. high or low) is often directly translated to an alarm. This direct translation, however, is not always valid. To illuminate this point, we will consider a simple system in which a tank is used to heat a chemical to a specified temperature (see Figure 2.2). The operational sequence of the system is described in the GRAFCET notation [Oulton, 1992] and shows the four states of this system: idle, filling, heating and emptying. Initially the system is idle, and the operational sequence starts when the start cycle signal arrives. A process cycle is then executed as the system goes through the four states in a sequential manner and then the cycle is repeated until a stop signal arrives. Let us now attempt to interpret the meaning of a tank full signal in the three functional states of the system. In the filling state, the signal does not signify an alarm. On the contrary, in this state the signal is expected to cause a normal transition of the system from the filling state to the heating state.
[Figure 2.2 shows the heating system (a tank with inlet Pump A, outlet Pump B, a heater, a temperature sensor, and tank full and tank empty level sensors) together with its GRAFCET operational sequence through the four states: system idle; filling, with Pump A on, until tank full; heating, with the heater on, until temperature hi; and emptying, with Pump B on, until tank empty.]
Figure 2.2. Heating system & operational sequence [Papadopoulos, 1993]
In the heating state, not only does the presence of the signal not signify an alarm, but its absence should generate one. The tank is now isolated, and therefore, the tank full signal should remain hi as long as the system remains in that state. Finally, in the emptying state, the signal signifies a genuine alarm. If the tank full signal persists while the system is in this state, then either Pump B has failed or the tank outlet is blocked. This example indicates clearly that we need to interpret the status in the context of the system operation before we can decide if it is normal or not. One way we can achieve this is by determining which conditions signify a real alarm in each state or mode of the system. The monitor can then exploit mechanisms which track system states in real time to monitor only the relevant alarms in each state. The concept was successfully demonstrated in the Alarm Processing System [Domenico et al, 1989]. This system can interpret the process feedback in the evolving context of operation of a nuclear power station. The system has demonstrated that it significantly reduces the number of false and unimportant alarms.
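This state-dependent interpretation of the tank full status can be sketched as follows. The encoding of the alarm rules is our own illustrative construction, not that of the Alarm Processing System; only the states and signal meanings are taken from the heating system example above.

```python
# For each state of the heating system we record which value of the
# "tank full" status constitutes a genuine alarm (None: never an alarm).
ALARM_RULES = {
    "idle":     None,                   # status carries no alarm meaning
    "filling":  None,                   # expected: triggers transition to heating
    "heating":  ("tank full", False),   # absence of the signal is the alarm
    "emptying": ("tank full", True),    # persistence of the signal is the alarm
}

def interpret(state, tank_full):
    """Return an alarm message, or None if the status is normal in this state."""
    rule = ALARM_RULES[state]
    if rule is None:
        return None
    signal, alarm_value = rule
    if tank_full == alarm_value:
        return f"ALARM: {signal}={tank_full} is abnormal in state '{state}'"
    return None
```

For example, `interpret("filling", True)` returns no alarm, while the same status in the emptying state does. The same reading is thus translated into an alarm only in the states where it is genuinely abnormal, which is precisely the filtering effect demonstrated by the Alarm Processing System.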
2.2.3 Alarm Suppression Using Multiple Stage Alarm Relationships Multiple stage alarms are often used to indicate increasing levels of criticality in a particular disturbance. A typical example of this is the two-stage alarms employed in level monitoring systems to indicate unacceptably high and very high levels of content. When the actual level exceeds the very high point, both alarms are normally activated. This is obviously unnecessary, because the very-high level alarm implies the high level alarm. The level-precursor relationship method that has been employed in the Alarm Filtering System [Gorsberg and Wilkie, 1986] uses the hierarchical relationship between multiple stage alarms to suppress the precursor whenever a more important alarm is generated. Many alarms could be filtered using this method alone in a typical process plant.
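The suppression mechanism itself is simple and can be sketched in a few lines. This is an illustrative Python sketch; the alarm names and precursor pairs are invented, and the encoding is ours rather than that of the Alarm Filtering System.

```python
# Each multiple stage alarm may name the precursor alarm that it implies;
# an active alarm suppresses its (less important) precursor.
PRECURSOR = {
    "level_very_high": "level_high",   # very-high level implies high level
    "temp_very_high":  "temp_high",
}

def suppress_precursors(active_alarms):
    """Drop any active alarm that is the precursor of another active alarm."""
    implied = {PRECURSOR[a] for a in active_alarms if a in PRECURSOR}
    return [a for a in active_alarms if a not in implied]

print(suppress_precursors(["level_high", "level_very_high"]))
# ['level_very_high']
```

Only the most critical stage of each disturbance is reported, which is exactly the filtering effect of the level-precursor relationship method.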
2.2.4 Identifying Alarm Patterns An alarm pattern is a sequence of alarms which is typically activated following the occurrence of an initiating hazardous event in the system. Figure 2.3 illustrates, for example, a pattern consisting of three alarms in a particular temporal sequence. Alarm patterns carry a significantly higher information content than the set of alarms that compose them. A unique alarm pattern, for example, unambiguously points to the cause of the disturbance, and indicates the course of its future propagation.
[Figure 2.3 illustrates alarm pattern matching: following an initiating event, alarm 1, alarm 2 and alarm 3 are activated at characteristic times t1, t2 and t3, at which point the pattern is matched.]
Figure 2.3. Alarm pattern matching
The idea of organising alarms into patterns and then deriving diagnoses on-line from such patterns was first demonstrated in the Diagnosis of Multiple Alarms system at the Savannah River reactors [Dansak, 1982]. A similar approach was followed in the Nuclear Power Plant Alarm Prioritisation program [Roscoe, 1984], where the alternative term alarm signature was used to describe a sequence of alarms which unfold in a characteristic temporal sequence. A shortcoming of this approach is that the unique nature of the alarm signature is often defined by the characteristic time of activation of each alarm. The implication is that the pattern matching algorithm must know the timing characteristics of the expected patterns, a fact which creates substantial technical difficulties in the definition of alarm patterns for complex processes. An additional problem is that the timing of an alarm signature may not be fixed and may depend on several system parameters. In the cooling system of a nuclear power plant, for example, the timing of an alarm sequence will almost certainly depend on parameters such as the plant operating power level, the number of active coolant pumps, and the availability of service water systems [Kim, 1991].
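A simple pattern matching scheme of this kind, including the characteristic timing windows that make the approach difficult to apply in complex processes, can be sketched as follows. This is an illustrative Python sketch; the alarm names and timing figures are invented, and real signatures would need plant-dependent timing as noted above.

```python
# An alarm signature is a list of (alarm, earliest, latest) entries: each
# alarm must occur in this order, within the given time window (in seconds)
# after the initiating event. Timing figures here are purely illustrative.
SIGNATURE = [("alarm_1", 0.0, 2.0), ("alarm_2", 1.0, 4.0), ("alarm_3", 3.0, 8.0)]

def matches(signature, observed):
    """observed: list of (alarm, time since initiating event), in order
    of occurrence. Return True if it fits the signature."""
    if len(observed) != len(signature):
        return False
    last_t = -1.0
    for (name, t), (exp_name, lo, hi) in zip(observed, signature):
        # each alarm must match by name, fall in its window, and preserve order
        if name != exp_name or not (lo <= t <= hi) or t < last_t:
            return False
        last_t = t
    return True

print(matches(SIGNATURE, [("alarm_1", 0.5), ("alarm_2", 2.0), ("alarm_3", 5.0)]))
# True
```

The shortcoming discussed above is visible in the fixed `lo` and `hi` bounds: in a real plant these windows would depend on operating parameters such as power level or the number of active pumps, so a single static signature may fail to match a genuine instance of the disturbance.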
2.2.5 Organising Alarms Using Alarm Trees An alternative way to organise alarms is the alarm tree [Long, 1980a]. The tree is composed of nodes, which represent alarms, and arrows that interconnect nodes and denote cause-effect relationships. Active alarms at the lowest level of the tree are called primary cause alarms. Alarms which appear at higher levels of the tree (effect alarms) are classified in two categories: important (uninhibited) alarms and less important (inhibited) alarms. During a disturbance, the primary cause alarm and all active uninhibited alarms are displayed using the causal order with which they appear in the tree. At the same time, inhibited alarms are suppressed to reduce the amount of information returned to the operator. Beyond real process alarms based on measurements, the tree can also generate
messages based on such alarms. These messages, for example, may relate a number of alarms to indicate a possible diagnosis of a process fault. Such diagnostic messages are called deduced or synthesised alarms. Alarm trees have been extensively tried in the UK nuclear industry. Plant-wide applications of the method, however, have shown that alarm trees are difficult and costly to develop. It took ten man-years of effort, for example, to develop the alarm trees for the Oldbury nuclear power plant [Lees, 1983]. Operational experience has also indicated that large and complex alarm trees often deviate from the actual behaviour of the plant and that they do not appear to be very useful to operators. In evaluating the British experience, Meijer states that the “most valuable trees were relatively small, typically connecting three or four alarms” [Meijer, 1980].
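The display logic of an alarm tree can be sketched as follows. The tree, the alarm names and the choice of inhibited alarms are invented for illustration; the point is only the mechanism: active alarms are shown in causal order, and inhibited effect alarms are suppressed.

```python
CAUSES = {  # effect alarm -> alarms one level below it that can cause it
    "LOW_FLOW": ["PUMP_TRIP"],
    "LOW_PRESSURE": ["LOW_FLOW"],
    "PUMP_CAVITATION": ["LOW_FLOW"],
}
INHIBITED = {"PUMP_CAVITATION"}  # less important effect alarms

def display_list(active):
    """Return active alarms in causal order, suppressing inhibited ones.

    The primary cause alarm is an active alarm with no active cause below
    it; effect alarms follow in the order in which they propagate.
    """
    def causal_depth(a):
        below = [c for c in CAUSES.get(a, []) if c in active]
        return 0 if not below else 1 + max(causal_depth(c) for c in below)
    shown = [a for a in active if a not in INHIBITED]
    return sorted(shown, key=causal_depth)

print(display_list({"PUMP_TRIP", "LOW_FLOW", "LOW_PRESSURE", "PUMP_CAVITATION"}))
# ['PUMP_TRIP', 'LOW_FLOW', 'LOW_PRESSURE']
```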
2.2.6 Synthesising Alarms Using Logical Inferences A more flexible way of organising alarms and status information is by using logic and production rules [Chester et al, 1984]. A production rule is an IF-THEN statement that defines an implication relationship between a prerequisite and a conclusion. Production rules can be used to express logical relationships between low level process indications and more informative conclusions about the disturbance. Such rules can be used in real time for the synthesis of high level alarms. Rule-based monitors incorporate an inference engine which evaluates and decides how to chain rules, using data from the monitored process. As rules are chained and new conclusions are reached, the system effectively synthesises low-level plant data into higher level and more informative alarms. To illustrate the concept we will use the example of a pump, which delivers a substance from A to B (Figure 2.4). One of the faults that may occur in this system is an inadvertent closure of the suction valve while the pump is running. In such circumstances, the operator of a conventional system would see the status of the pump remain set to running and two alarms: no flow and low pressure at the suction valve.
IF   the pump state is running
     AND there is no flow in the pipe
     AND low pressure exists at the suction valve
THEN suction valve inadvertently closed

Figure 2.4. Production rule and transfer pump. [Figure: a pump delivers a substance from A to B through a suction valve and a delivery valve, with a flow sensor and a pressure sensor on the line.]
On the other hand, a rule-based system would call a simple rule to synthesise these low-level data and generate a single alarm that does not require further interpretation. This alarm (suction valve inadvertently closed) certainly conveys a denser and more useful message to the operator. It is worth noting that in generating this alarm, the system has performed a diagnostic function. Indeed, rule-based systems have been extensively applied in the area of fault diagnosis. The idea of using logic and production rules for alarm synthesis has been successfully applied in the Alarm Filtering System [Gorsberg and Wilkie, 1986], an experimental monitor for nuclear power plants.
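A forward-chaining synthesis step of this kind can be sketched as follows. The rule base and fact names are illustrative (the second rule is invented); the sketch fires every rule whose prerequisites hold and adds its conclusion as a new fact, so conclusions can in turn trigger further rules.

```python
# Production rules: (prerequisites, synthesised alarm).
RULES = [
    ({"pump_running", "no_flow", "low_suction_pressure"},
     "suction valve inadvertently closed"),
    ({"pump_running", "no_flow", "low_delivery_pressure"},
     "delivery line rupture"),  # hypothetical second rule
]

def synthesise(facts):
    """Forward-chain to a fixpoint: fire every rule whose prerequisites
    are all satisfied, adding its conclusion as a new fact."""
    facts = set(facts)
    fired = []
    changed = True
    while changed:
        changed = False
        for prereqs, conclusion in RULES:
            if prereqs <= facts and conclusion not in facts:  # subset test
                facts.add(conclusion)
                fired.append(conclusion)
                changed = True
    return fired

facts = {"pump_running", "no_flow", "low_suction_pressure"}
print(synthesise(facts))  # ['suction valve inadvertently closed']
```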
2.2.7 Function-Oriented Monitoring
The status monitoring and alarm organisation methods that we have examined so far share a fundamental principle. Monitoring is performed against a model of the behaviour of the system in conditions of failure. Indeed, the different models that we have discussed (e.g. alarm patterns, alarm trees) are founded on a common notion of an expected anomalous event, a signal which signifies equipment failure or a disturbed system variable. A radically different approach is to monitor the system using a model of its normal (fault-free) behaviour. The objective here is to detect discrepancies between the actual process feedback and the expected normal behaviour. Such discrepancies indicate potential problems in the system and, therefore, provide the basis for generating alarms and warnings.

A simple monitoring model of normal behaviour is the electronic checklist. An electronic checklist associates a critical function of the system with a checklist of normal conditions which ensure that the function can be carried out safely. In real time, the checklist enables monitoring of the function by evaluation of the necessary conditions. Electronic checklists have been widely used in the design of contemporary aircraft. Figure 2.5 illustrates an example from the Electronic Instrument System of the MD-11 aircraft [Morgan and Miller, 1992]. Before a take-off, the pilot has to check assorted status information to ensure that he can proceed safely.

Figure 2.5. Checklist of essential items for take-off [Morgan and Miller, 1992]. [Figure: the critical function “safe take-off” is associated with the checklist: stab not in green band; slats not in take-off position; flaps not in take-off position; parking brake on; spoilers not armed.]
The function “safe take-off” is, therefore, associated with a list of prerequisite conditions. If a take-off is attempted and these essential conditions are not present, the monitor generates a red alert. The checklist is then displayed to explain why the function cannot be carried out safely.

Although electronic checklists are useful, they are limited in scope. To enable more complex representations of normal behaviour, Kim and Modarres [Kim and Modarres, 1987] have proposed a much more powerful model called the Goal Tree Success Tree (GTST). According to the method, maintaining a critical function in a system can be seen as a target or a goal. In a nuclear power plant, for example, a basic goal is to “prevent radiation leak to the environment”. Such a broad objective, though, can be decomposed into a group of goals which represent lower-level safety functions that the system should maintain in order to achieve the basic objective. These goals are then analysed and reduced to sub-goals, and the decomposition process is repeated until sub-goals cannot be specified without reference to the system hardware. The result of this process is a Goal Tree, which shows the hierarchical implementation of critical functions in the system. A Success Tree is then attached to each of the terminal nodes in the Goal Tree. The Success Tree models the plant conditions which satisfy the goal that this terminal node represents.

The GTST contains only two types of logical connectives: AND and OR gates. Thus, a simple value-propagation algorithm [Kim and Modarres, 1990a], [Kim and Modarres, 1990b] can be used by the monitor to update nodes and provide functional alarms for those critical safety functions that the system fails to maintain. The GTST has been successfully applied in a number of pilot applications in the nuclear [Chen and Modarres, 1992], space [Pennings and Saussais, 1993], [Gerlinger and Pennings, 1993] and process industries [Wilikens et al, 1994].
Later on in this chapter we return to this model to explore how it has been employed for on-line fault diagnosis.
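The value propagation over AND/OR nodes can be sketched as a recursive evaluation. The goal names and plant conditions below are invented for illustration, not taken from a real GTST application:

```python
# A GTST node is either ("cond", reading_key) for a success-tree leaf,
# or ("AND"/"OR", goal_name, [children]).
def evaluate_goal(node, readings):
    """Recursively update a GTST node from plant readings; emit a
    functional alarm when a goal is no longer satisfied."""
    kind = node[0]
    if kind == "cond":                       # leaf: a monitored condition
        return readings[node[1]]
    children = [evaluate_goal(c, readings) for c in node[2]]
    satisfied = all(children) if kind == "AND" else any(children)
    if not satisfied:
        print(f"functional alarm: goal '{node[1]}' not maintained")
    return satisfied

gtst = ("AND", "maintain core cooling", [
    ("OR", "coolant supply available", [
        ("cond", "primary_pump_running"),
        ("cond", "backup_pump_running"),
    ]),
    ("cond", "coolant_level_ok"),
])

readings = {"primary_pump_running": False,
            "backup_pump_running": True,
            "coolant_level_ok": True}
print(evaluate_goal(gtst, readings))  # True: the backup pump keeps the goal satisfied
```

If both pumps stop, the OR node fails, the failure propagates up the AND node, and functional alarms are raised for each unsatisfied goal.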
2.2.8 Hierarchical Presentation of Status Information
An additional aspect of the problem of optimising status information is presentation. This aspect of the problem involves many issues, such as the design of appropriate media for input and display, user-interface design and other issues of man-machine interaction. Many of those issues are very much system, application and task dependent. Different interfaces, for example, would be required in a cockpit and a plant control room. We have identified, however, and we discuss here, a fundamental principle that
has been used in a number of advanced systems to enhance the presentation of status information. This is the concept of layered monitors or, in other words, the idea of hierarchical presentation of the process feedback.

Operational experience in safety critical systems shows that operators of such systems need much less information when systems are working than when they are malfunctioning [Billings, 1991]. To achieve this type of control over the amount of information provided to the operator, many contemporary systems employ a layered monitor. Under normal conditions, a main system display is used to present only a minimum amount of important data. When a subsystem malfunction is detected, more data is presented in a subsystem display, either automatically or on request.

This approach is used in systems such as the Electronic Instrument System of the Douglas MD-11 [Morgan and Miller, 1992] and the Electronic Centralised Aircraft Monitor of the Airbus A320 [Airbus, 1989], [British Airways, 1991]. In both systems, the aircraft monitor consists of two displays. The first is an Engine/Warning Display, which presents the status and health of the engines in terms of various parameters. The engine status is the most significant information during normal flight and it is always made available to the pilot. Part of this display is an area dedicated to aircraft alerting information. The second display is called the Subsystems Display and can present synoptics of the aircraft subsystems (Electrical, Hydraulic, Fuel, Environment etc.). A particular synoptic can be selected from a control panel by pressing the corresponding push-button. Part of this display is an area where the system generates reports about failures, provides warnings about consequences and recommends corrective measures. Figure 2.6 illustrates how the MD-11 electronic instrument system reports a hydraulic fluid loss in one of the three hydraulic systems.
Figure 2.6. The MD-11 electronic instrument system reporting a hydraulic fluid loss. [Figure: the Engine/Warning Display with an alert area showing “HYD SYS A FAILURE”; the Subsystems Display showing the hydraulic synoptic with the fluid quantities of systems A, B and C; a control panel with push-buttons for the ELE, HYD, FUEL and AIR synoptics. The alert lists the consequences: flight control effect reduced; use max. 35 flaps; GPWS OVRD if flaps less than 35.]

If the deviation of a new measurement from the average of the last k valid measurements is greater than ε,
then it is considered invalid and is discarded. The model of a software task that uses such a mechanism is illustrated in Figure 3.27. The task performs peak detection on the input value (m). When the input value does not violate the peak detection criterion, it is copied to the task output (o). In the opposite case, the output carries the average of the last k valid values (a). During the IF-FMEA analysis of the task, as we systematically examine the task output (o) for potential failure modes, we have to consider the possibility of the output being stuck at a certain value. Part of the examination process is to identify deviations of the input (m) that can cause the stuck-at failure at the output. An obvious case of such a deviation is the omission of input. For as long as there is an omission of input, the output will be stuck at a value defined by the average of the k last valid measurements. More importantly, the stuck-at failure may persist following the end of a temporary omission. Indeed, if the omission is long enough to create a deviation between the restored measurement and the last valid average which is greater than ε, then all new measurements will be discarded as invalid, i.e. we will have a persistent stuck-at failure.
Figure 3.27. Simplified model of the peak detection and removal task. [Figure: the task reads a measurement m and produces an output o; it maintains the average a = (Σi=1..k vmi)/k of the last k valid measurements and applies: if |m−a| ≤ ε then o = m, else o = a.]
Let us now assume that the task is part of the wheel node implementation. The task input is the pedal message arriving through the bus and the task output is the braking pressure applied to the wheel. Our analysis has shown that if there is a temporary omission of the pedal message at the early stages of braking (e.g. due to electromagnetic interference), the output might be permanently stuck at zero or at a low braking value which will cause a failure to brake. This problem was rectified in a redesign of the peak detection algorithm.
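The failure mechanism just described can be reproduced with a small sketch of the task of Figure 3.27. The parameter values and the input trace are invented for illustration; an omission is modelled as a `None` input.

```python
from collections import deque

class PeakDetectionTask:
    """Peak detection and removal as in Figure 3.27: pass the input
    through when it is within ε of the average of the last k valid
    values; otherwise output that average and discard the input."""
    def __init__(self, k=4, eps=5.0):
        self.k, self.eps = k, eps
        self.valid = deque(maxlen=k)   # the k most recent valid values

    def step(self, m):
        if not self.valid:             # no history yet: accept the first value
            self.valid.append(m)
            return m
        a = sum(self.valid) / len(self.valid)
        if m is None:                  # omission of input: hold the average
            return a
        if abs(m - a) <= self.eps:
            self.valid.append(m)
            return m
        return a                       # m rejected as a peak

task = PeakDetectionTask(k=4, eps=5.0)
# Braking demand ramps up, then the pedal message is lost for a while,
# then a much higher demand is restored.
inputs = [0, 2, 4, 6, None, None, None, 40, 45, 50]
outputs = [task.step(m) for m in inputs]
print(outputs)
# After the omission, the restored demand differs from the last valid
# average by more than ε, so every new value is rejected: the output is
# stuck at a low braking value, exactly the failure described above.
```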
3.3.4 Fault Tree Synthesis
Once we have determined the local failure behaviour of all components, we then proceed to the final stage of the analysis, where we examine the structure of the fault propagation process in the system. At this stage, we determine how the functional failures that we have identified in the exploratory FFA arise from combinations of the low-level component failure modes that we have identified in the IF-FMEAs. As we have explained, in HiP-HOPS this is achieved mechanically with the aid of an algorithm for the synthesis of fault trees. Indeed, in the course of this study we have mechanically generated, regenerated and evaluated a number of such fault trees for the brake-by-wire system. Figure 3.28 and Figure 3.29 illustrate, for example, the fault trees that we have synthesised for two of the main hazardous functional failure modes of the system: “loss of braking” and “inadvertent braking”. Using assumptions about component failure rates that we have made in the IF-FMEAs, SAM has calculated that the event “loss of braking” can occur with a probability of 8.08×10⁻⁷ f/h, and that the event “inadvertent braking” can occur with a probability of 5.02×10⁻⁷ f/h. We must point out that some of the assumptions that we have made about component failure rates may not be realistic. In addition, the design of the system has changed since we last analysed it. For those reasons, the numbers that we have calculated do not provide realistic failure rate predictions for the brake-by-wire system. They indicate, though, that HiP-HOPS can rationalise the development and maintenance of large fault trees, and, in that sense, can alleviate some of the problems currently encountered in the quantitative aspects of complex safety assessments.
Figure 3.28. Distant view of the fault tree for the event “Loss of braking”
Figure 3.29. Distant view of the fault tree for the event “Inadvertent braking”
The mechanical analysis of those fault trees pointed out weak areas of the design and focused our efforts on those areas. The minimal cut-set analysis, for example, pointed out single points of failure and areas of the design that contributed most to the overall failure probability of the system. These results initiated a number of useful design iterations and guided the revision of the fault tolerance strategies in the system and the allocation of additional redundant resources. It is equally important to point out that the synthesis algorithm could not have generated those fault trees if there were any inconsistencies in the hierarchical model or between the analyses. In that case, the algorithm would have simply pointed out the inconsistencies. Synthesised fault trees, therefore, link in a consistent manner the results from the various analyses to each other and back to the high-level functional failure analysis, and hence guarantee the consistency of the safety case.
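A minimal cut-set analysis of this kind can be sketched as follows. This is not the SAM algorithm, just a MOCUS-style expansion over a toy tree whose event names are invented; note how the single point of failure surfaces as a one-element cut set.

```python
from itertools import product

def cut_sets(node):
    """Expand an AND/OR fault tree into its cut sets.

    A node is either a basic event (a string), ("OR", ...children) or
    ("AND", ...children).
    """
    if isinstance(node, str):
        return [frozenset([node])]
    op, *children = node
    child_sets = [cut_sets(c) for c in children]
    if op == "OR":                 # any child's cut set suffices
        return [cs for sets in child_sets for cs in sets]
    # AND: every combination of one cut set per child, merged
    return [frozenset().union(*combo) for combo in product(*child_sets)]

def minimal(sets):
    """Keep only cut sets with no proper subset among the others."""
    return [s for s in sets if not any(t < s for t in sets)]

# Illustrative tree: loss of braking at one wheel.
tree = ("OR",
        "bus_failure",                                   # single point of failure
        ("AND", "pedal_sensor_a_fails", "pedal_sensor_b_fails"))
mcs = minimal(cut_sets(tree))
print(sorted(map(sorted, mcs)))
# [['bus_failure'], ['pedal_sensor_a_fails', 'pedal_sensor_b_fails']]
```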
3.4 Discussion
Classical safety analysis techniques are evolving to deal with the complexity of modern systems. However, two significant problems still arise in the assessment of complex systems using classical safety analysis. The first problem is inconsistencies in the results from the different safety studies of the system, while the second is a difficulty in relating the various analyses to one another and back to the functional hazard assessment.

In this chapter, we have shown one way to address these problems by extending, automating and integrating a number of classical techniques into a new method for safety analysis called HiP-HOPS. HiP-HOPS enables the assessment of a complex system from the functional level through to low levels of the hardware and software implementation of the system. The method integrates design and safety analysis, and in the process of the assessment, links a consistent hierarchical model of the system to the results from the safety studies. The method also harmonises hardware safety analysis with the hazard analysis of software architectures, and introduces a new algorithm for the synthesis of fault trees which mechanises and simplifies a large and difficult part of the analysis, the development of fault trees. We have described the method and demonstrated its application on a distributed brake-by-wire system for cars.

In the course of our presentation, we attempted to address two questions concerning our approach to safety analysis. Firstly, can the method help us rationalise and simplify safety assessment, and generate consistent safety cases? Secondly, can the results from the analysis help us improve the failure detection and recovery mechanisms of the system under examination and, if so, how? We believe that the brake-by-wire case study demonstrates positive results with regard to both those questions.
Firstly, the fault trees that we have mechanically generated guarantee consistency among the low-level safety analyses and between those analyses and the hierarchical model. Secondly, the results from the analysis have helped us to systematically improve the failure detection and control mechanisms of the brake-by-wire system. The functional failure analysis, for example, helped the design of mechanisms for the recovery from single and multiple wheel locking failures. Sensor IF-FMEAs have directly supported the design of hardware redundancy schemes and averaging or voting algorithms for the detection and masking of sensor failures. IF-FMEAs have also helped us analyse the pedal and wheel node architectures, and to identify subtle errors in the design of certain
software algorithms. Finally, the mechanical analysis of fault trees further helped us to identify weak areas of the design and focus our efforts on those areas. We believe that the proposed method improves the quality and consistency of the results from the safety assessment and, in that sense, it also improves the quality of the safety case as a potential on-line monitoring model.

At the same time, though, the method inherits some of the limitations of classical safety analysis. In the quantitative aspects of the assessment, for example, we still rely on component failure rates (λ), the validity and value of which are generally disputed [O’Connor, 1991], [Krishna and Shin, 1997]. We do not perceive this limitation, though, as a significant obstacle to our work on safety monitoring. As we shall show in chapter 5, it is the conditions that need to be monitored and the propagation of failure in the system - rather than quantitative failure rates - that we need to extract from the safety case in order to achieve an effective monitoring scheme.

An additional limitation of HiP-HOPS is that the method requires that the behaviour or configuration of the system remains stable within the period of operation covered by the analysis. There is, of course, a large number of continuous control systems, such as the brake-by-wire system, which deliver only a fixed set of functions using a static configuration of components. This category includes a class of fault tolerant systems which can maintain a fixed set of functions in the presence of failures by employing active redundancy schemes and forward recovery mechanisms. The safety analysis of such systems is a straightforward task, and it is carried out as we have described in this chapter. How can we analyse, though, a system that delivers different sets of functions in different phases of its operation?
And what if, to maintain a fixed functionality, the system changes configuration under normal conditions or as a result of recovery operations? What if, for example, in the event of a primary failure, a secondary subsystem (cold or hot spare) takes over the role of the failed primary subsystem? How can the method handle such changes in the behaviour or structure of a system, and what is the impact on the analyses and the safety case? These are the questions that we will attempt to address in the following chapter.
103
Chapter Four

4. Modelling and Safety Analysis of Systems with Dynamic Behaviour or Structure

4.1 Introduction
Often the behaviour or structure of a system during a certain time interval is different from that within other periods of operation. Such behavioural and structural changes can be observed in almost any large industrial plant and, indeed, become the norm as the scale and complexity of the system increases. Those changes, however, create difficulties in the application of the classical techniques that we have discussed and the safety analysis method that we have proposed so far. Our first aim in this chapter is to discuss those difficulties. Our second aim is to show an approach to representing, and reasoning about, dynamic behaviour or structure in safety analysis.

Before we address the problem, though, let us attach a more precise meaning to those two fundamental notions that concern us here, the behaviour and structure of a system. By behaviour we mean “the set of functions delivered by the system at any given time during its operation”. If this set remains fixed and stable, we say that the behaviour of the system is static. On the other hand, if this set is altered, for instance if functions are suspended or initiated during operation, we say that the system exhibits dynamic behaviour. By structure we mean “the set of active components in the system and the connections between them”. A component is considered active as long as it is engaged by the system in some form of useful operation11. Outside periods of activity the component is said to be passive and it typically remains switched off or idle, waiting to be energised or to be activated by the system. If the structure of the system remains fixed,

11 From those definitions, it can be easily deduced that the term structure here carries similar semantics with those often implied by the term configuration of the system. We introduce the term structure in this context, though, since we believe that it makes a clearer and more direct reference to the composition (as opposed to the behaviour) of the system.
we say that the system has a static structure. On the other hand, if the structure is altered, for instance if components are activated or de-activated during operation, then we say that the system has a dynamic structure. Using these notions of behaviour and structure as criteria, we can now classify systems into four broad categories (see Figure 4.1):
(a) those with dynamic behaviour and structure [DD],
(b) those with dynamic behaviour but static structure [DS],
(c) those with static behaviour but dynamic structure [SD], and
(d) those with static behaviour and structure [SS].
Figure 4.1 illustrates this categorisation and shows the application scope of our approach to safety analysis, which, at this point, is limited to systems with static behaviour and structure. Having this as a basis, we can now examine some cases of dynamic behaviour or structure and explain the difficulties that behavioural or structural changes introduce in safety analysis.
4.2 Dynamic Behaviour & Structure as Impediments in Safety Analysis
The utility, and often the necessity, of dynamic behaviour can be easily understood if we consider for a moment the world of large industrial systems. Manufacturing systems and process plants are among those systems that exhibit dynamic behaviour. They usually take material resources through a sequence of different stages of processing, and in each of those stages, they perform a different set of operations to achieve different transformations on the materials handled.
Figure 4.1. Classification of systems according to their behaviour and structure. [Figure: a 2×2 grid with Behaviour ([D]ynamic, [S]tatic) on one axis and Structure ([D]ynamic, [S]tatic) on the other, giving the four categories [DD], [DS], [SD] and [SS]; the current scope of our approach to safety analysis is marked as the [SS] category.]
Another class of systems which exhibit dynamic behaviour is represented by phased mission systems [Alam and Al-Saggaf, 1986]. These systems typically serve a mission composed of several distinct phases, each of which is characterised by different objectives. Typical examples of such systems are the on-board flight management systems of modern aircraft. The operation of those systems is often divided into modes that correspond to the main phases of a flight, usually take-off, ascent, cruise, approach and landing.

It must be said that although dynamic behaviour is a typical characteristic of large scale systems, it is not an exclusive characteristic of those systems. Indeed, there is a plethora of smaller scale multi-moded real-time systems12 that also have variable functional profiles. In many cases, the functional profile of the system is fixed during normal operation, but it changes under conditions of failure. Less important functions are sacrificed, for example, in favour of more critical ones. This phenomenon, often described as graceful degradation, represents another case of dynamic behaviour.

To achieve a variable functional profile in a system, we often need to employ different configurations of components. Thus, the requirement for dynamic behaviour often (but not necessarily) forces us to adopt a dynamic structure. But while this implication (dynamic behaviour ⇒ dynamic structure) seems almost natural, it is probably less obvious why systems that deliver only a fixed set of functions often require a dynamic structure. In many cases the explanation of this phenomenon lies in the ability of the system to tolerate faults. For example, there is a large number of fault tolerant systems that, in conditions of failure, reconfigure themselves and use cold or hot spares to maintain their normal functions.
A second case where static behaviour and dynamic structure coexist on the same platform is seen in systems that operate in variable and sometimes particularly stressing environmental conditions (systems on spacecraft, for example). Environmental conditions can cause dramatic variations of the component failure rates, and may force such systems to deploy additional redundant resources in order to function at a required level of safety and reliability [Bondavalli and Mura, 1999]. Behavioural or structural changes pose difficulties in safety analysis. The classical techniques that we have discussed, as well as the alternative method that we proposed in the preceding chapter, examine the function or architecture of a system at
12 Aircraft sub-systems, for example, such as the engine controller or the fuel system.
different levels of the design decomposition. Functional failure analysis, for instance, examines a set of functions delivered by the system, and helps to identify potential functional failure modes. HAZOP and IF-FMEA, on the other hand, can be used for the assessment of hardware and software architectures, the analysis of material, energy and data flows, and the identification of deviations of those parameters from their normal behaviour. As the behaviour or the structure of the system changes, though, functions and their failure modes, as well as flows among components of the architecture and the potential deviations of those flows, vary. Similarly, behavioural and structural changes affect the set of component failure modes that we need to examine in different states of the system and, ultimately, affect the composition and structure of fault trees.

Consider, for instance, the fuel system of an aircraft. In this system, there are usually a number of alternative ways of supplying the fuel to the aircraft engines. During operation, the system switches between different modes in which it uses different configurations of fuel resources, pumps and pipes to maintain the necessary fuel flow. Initially, for example, the system may supply the fuel from the wing tanks, and when these resources are depleted it may continue providing the fuel from a central tank. The system also incorporates complex control functions such as fuel sequencing and transfer among a number of separated tanks to maintain an optimal centre of gravity.

If we attempt to analyse such a system with a technique like HAZOP or IF-FMEA, we will soon be faced with the difficulties caused by the dynamic character of the system. Indeed, how precisely can we define a normal flow in this system, and what constitutes a deviation from this hypothetical normality, when flows change value and direction according to the system state, whilst remaining consistent with the design intention?
A second implication for safety analysis is that, as components are activated, deactivated or perform alternative fuel transfer functions in different operational states, the set of failure modes that may have an adverse effect on the system changes. Almost inevitably, the causes and propagation of failure in one state of the system are different from those in other states. In that case, how can such complex state dependencies be taken into account during the fault tree analysis, and how can they be represented in the structure of fault trees? To a certain degree, classical techniques enable us to represent the sensitivity of safety analyses to changes in the behaviour or structure of the system. In functional failure analysis, for instance, the analyst is allowed to specify conditions that are necessary for the manifestation of each functional failure. Those conditions are called
contributing factors [SAE, 1996], and they may refer to particular states or modes of the system. Similarly, certain HAZOP variations allow multiple interpretations of flow deviations, where the meaning, causes and effects of each deviation change depending on the conditions stated in a separate column which extends the basic HAZOP table [McDermid and Pumfrey, 1994]. If, for example, the intended flow in the system is A litres/min in a particular context and B litres/min in a different context, then the deviation “more flow” can have two different meanings (flow>A or flow>B) depending on context.

Fault trees also provide some basic mechanisms that can be used to represent state dependencies in the analysis of dynamic systems. One such mechanism is the external event, which is represented with the “house” symbol. The fault tree handbook [Vesely et al, 1981] states that the symbol “signifies an event that is normally expected to occur: e.g., a phase change in a dynamic system”. An alternative mechanism for describing state dependencies in classical fault trees is the inhibit-gate [ibid.]. The gate is represented by a hexagon that connects an input event to an output event. The output event is caused by the input event, but only if some qualifying condition (defined within an ellipse symbol at the right of the gate) is satisfied at the same time. The qualifying condition usually represents an adverse environmental condition (e.g. Temperature > Critical Temperature). The condition, though, can also be used to represent a state of the system. In that case the inhibit-gate restricts the scope of the branch that lies below the gate to the state declared as the qualifying condition.

Dugan, Bavuso and Boyd [Dugan et al, 1993] extend this classical fault tree model by introducing four dynamic fault tree gates: the functional-dependency gate, the cold-spare gate, the priority-and gate and the sequence-enforcing gate.
By translating this extended model into a Markov chain, the authors resolve the complications that the new gates introduce to the reliability evaluation of fault trees. At the same time, they argue that the extended fault tree is particularly suitable for modelling the fault propagation in dynamic fault tolerant systems with sequence dependent failure behaviour13.

13 A system the failure of which does not only depend on combinations of events but, often, also on the sequence in which these events occur.

It is true that, to a certain degree, this extended model enables analysts to take into account certain types of dynamic behaviour. But although a notation can be useful, it cannot in itself resolve the complications that dynamic behaviour or structure introduce in safety analysis. The notation merely provides the ability to represent certain dynamic aspects of the system. But it does not provide us with effective ways to identify complex dependencies between the state and the potential failure behaviour of the system, nor can it help us to determine correctly the effect that such dependencies have on the possible causes and propagation of failure.

Dynamic behaviour or structure complicates safety assessment and further exacerbates the fragmentation, inconsistency and incompleteness of classical safety analyses. But the real problem here, we believe, is not a lack of powerful notational constructs that can better reflect the dynamics of a system in safety analysis. The primary issue, in our view, is rather of a methodological nature: one of finding ways to rationalise the assessment and place the results in the correct context, when this context changes due to the dynamic character of the system.

In many ways this problem is analogous to the one that we addressed in the preceding chapter. There, we were faced with the complications caused in safety analysis by the complexity of a static structure. By organising this structure into a consistent hierarchical model, and then using this model as a basis for the various analyses, we were able to simplify the assessment and guarantee the consistency of results. Here we face similar difficulties in the analysis, but this time the difficulties are caused by the dynamic character of the system. It is possible, in our view, to overcome those difficulties by following a similar approach. That is, by organising our knowledge about the dynamics of the system into an appropriate model, and by using this model as a basis for structuring the process and the results from the assessment.

In the following sections we describe an approach to the analysis of dynamic systems which is founded on this idea. Our approach to this problem encompasses and extends the principles that we have developed in the preceding chapter. At the same time, it shows how these principles can be effectively applied in the context of a dynamic system.
The first element of this approach is a dynamic model, which identifies potential transformations in the behaviour or structure of the system during operation. The second element is a process for safety analysis, which is primarily based on repetitive application of the techniques that we described in the preceding chapter (i.e. HiP-HOPS). Before we discuss the safety analysis process, though, let us turn to the issue of modelling, and try to define what would constitute an appropriate form for the dynamic model that we require as a basis for the analysis. The definitions of behaviour and structure that we have given in this section will help us derive an appropriate form for this model.
4.3 Representing Dynamic Behaviour or Structure

According to the definitions given above, a system with dynamic behaviour is a system with a temporally variable functional profile or, alternatively, a system that delivers different functions at different phases of its operation. This direct association that we have made between the behaviour and the functionality of a system will allow us to represent dynamic behaviour in the framework of a functional model.
4.3.1 Representing Dynamic Behaviour with Abstract Functional States and State Machines

In this model, dynamic behaviour is expressed as a set of different functional states of the system and a set of transitions between those states. In each of those states, the system has a stable functional profile; in other words, it delivers a fixed set of functions. This set is altered, however, every time the system moves from one functional state to another. Thus, to describe the dynamic behaviour of a system using the concept of a functional state, not only would we have to identify all the possible functional states of the system, but we would also need to determine the conditions that cause transitions between those states.

It is evident that, in its general form, the model that we propose is an abstract state machine. The fundamental building block of this model is the functional state, an abstraction that embraces “a set of states in which the system maintains a stable functional profile”. For reasons of simplicity, from this point onwards, we will use the more familiar term mode as an alternative way of referencing what we have just defined as a functional state. Our definition of mode, therefore, is “an abstract functional state in which the system maintains a stable functional profile”.

System designers often use the term mode to describe a few qualitatively different phases of operation, in which the system functions (or should function) in obviously or visibly different ways. In general, though, mode is not a well-defined, monosemic term which signifies in every context a functional state of a system. Despite differences in interpretation, however, there appears to be a consensus that the term effectively summarises a group of lower-level states. This minimum consensus is reflected in the general definition of mode that Kopetz gives in [Kopetz, 1997]. Kopetz defines a mode as “a relaxed set of states in a real-time system” and gives an example of an aircraft
being either in a taxiing mode or in a flying mode. We must say that our definition of mode is different in that it embraces not the unary concept of an abstract state, but the more eclectic notion of an abstract functional state. From this definition, we can derive four general types of modes:

- Normal operational modes, in which the system delivers different sets of functions.
- Degraded modes, where the system safely delivers part of the intended functionality.
- Failed modes, in which there is complete loss of function or the system behaves in an unpredictable and hazardous manner.
- Temporary failed or degraded modes, in which the system has lost its normal functionality, but action can be taken to recover and restore a normal mode.
Using these types, it is possible to create abstract state machines (strictly speaking, mode-charts) that portray the dynamic behaviour of a system. Figure 4.2 gives an example of a mode-chart for a hypothetical system with two normal operational modes. At start up, the system enters its first normal operational mode, where it provides functions A and B.[14] It remains in this mode until event T1 causes a passage to a second normal mode, where function B is replaced by function C.
[Figure 4.2. Describing dynamic behaviour using mode-charts. The mode-chart shows a first normal mode (functions {A,B}) connected by the normal transition T1 to a second normal mode (functions {A,C}). Events T2 (B failed) and T3 (C failed) lead to a degraded mode (functions {A}, actions ∅). From there, T4 (A failed) leads to a temporary failed mode (functions ∅, actions {restart}); T5 (A restored) returns the system to the degraded mode, while T6 (restart failed) leads to a failed mode (functions ∅, actions ∅).]
[14] A function, in general, can be seen as an operation that the system performs on material, energy or data resources that are contained in the system, supplied to the system or drawn from the environment. A function of an aircraft fuel system, for example, is to supply the engines with fuel at a variable rate which is defined by the flight operations performed by the pilot. In the proposed model, normal functional transformations or functional deviations and the transitions that they cause in mode-charts can be verified by observable conditions on the variables manipulated by the system. As we shall see in section 5.2, these conditions can be formally specified in the mode-chart as expressions that accompany transition events. Those expressions define explicitly the precise conditions that verify transitions between states and, implicitly, the conditions that bound the current state of the system.
Events T2 (function B failed) or T3 (function C failed) cause a transition from normal operation to a permanently degraded mode. In this mode, the system has permanently lost part of its functionality. Indeed, it can be noticed that there are no actions[15] that can be taken to restore the lost functions, and therefore there is no return route to any of the normal operational modes. The occurrence of event T4 (function A failed), while the system is in degraded operation, will move the system into a temporary failed mode. The mode-chart suggests that it may be possible to revive function A by restarting the system. If this procedure fails, the system will go into a permanently failed mode.

The example illustrates how a mode-chart can portray the dynamic behaviour of a system both in normal conditions and under conditions of failure. The next question that we need to address is whether the same model could also represent significant transformations in the structure of the system. As we have explained earlier, such transformations may equally affect the potential failure scenarios in a system, and for this reason they must be treated separately in safety analysis.
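To make the semantics of the mode-chart concrete, the hypothetical system of Figure 4.2 can be sketched as a simple event-driven state machine. The following Python fragment is illustrative only; the names, event labels and flat dictionary encoding are our own assumptions, not a prescribed implementation.

```python
# Illustrative encoding of the mode-chart of Figure 4.2 (names assumed).
# Each mode carries a stable functional profile; events move the system
# between modes, gradually into the domain of failure.

MODES = {
    "first normal":     {"functions": {"A", "B"}},
    "second normal":    {"functions": {"A", "C"}},
    "degraded":         {"functions": {"A"}},
    "temporary failed": {"functions": set(), "actions": {"restart"}},
    "failed":           {"functions": set()},
}

# (current mode, event) -> next mode
TRANSITIONS = {
    ("first normal",     "T1 normal transition"): "second normal",
    ("first normal",     "T2 B failed"):          "degraded",
    ("second normal",    "T3 C failed"):          "degraded",
    ("degraded",         "T4 A failed"):          "temporary failed",
    ("temporary failed", "T5 A restored"):        "degraded",
    ("temporary failed", "T6 restart failed"):    "failed",
}

def step(mode: str, event: str) -> str:
    """Advance the mode-chart by one event; events with no matching
    transition leave the current mode unchanged."""
    return TRANSITIONS.get((mode, event), mode)

mode = "first normal"
for event in ["T2 B failed", "T4 A failed", "T6 restart failed"]:
    mode = step(mode, event)
print(mode)                      # the system ends in its permanently failed mode
print(MODES[mode]["functions"])  # no functions remain
```

A useful property of this flat encoding is that the absence of any transition out of a mode (here, "failed") directly expresses the lack of a return route noted in the text.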
4.3.2 Modelling Structural Transformations

At this point we must make a distinction between two types of structural transformations: those that are caused by modifications in the behaviour of the system and those that are not. If a modification in structure is the result of a change in the behaviour of the system, then this (dual) transformation is registered in the mode-chart as a transition to a new mode. Modifications in structure, though, often occur without any parallel alteration in the functionality of the system (recall, for example, the case of fault tolerant architectures). The purely functional model that we described in the preceding sub-section, however, cannot represent structural transformations of this type. In this model, the two configurations involved in any such transformation are unavoidably amalgamated within a single mode.
[15] The actions specified in a mode-chart are generally informal annotations. They describe suggested corrective measures which, if successfully applied, remove or minimise the effects of failures. However, in sections 5.2 and 5.6.1 we will show that the proposed model also allows the formal description of corrective actions, as well as of the conditions that verify the success of those actions. As we will see, the advantage of formalising the specification of corrective measures is that those measures can then be interpreted and automatically executed by an automated safety monitor that operates on the proposed model.
To address this problem and enable the representation of structural transformations that occur within the same mode, we introduce the notion of a sub-mode. We define a sub-mode as

“a distinguishable abstract state within a mode, in which a certain structural configuration of components is employed to deliver the mode functionality”.
According to this definition, within each primary normal or degraded mode we can define a number of sub-modes in which the system employs different configurations to deliver the functionality of the mode. These sub-modes can model the effect of recovery actions on the structure of fault tolerant systems, or normal changes in configuration that are essential to achieve the intended functionality.

Figure 4.3 illustrates the mode-chart of a hypothetical system that performs two different functions (A and B) in two different modes of operation. The figure also illustrates the composition of the system, in other words its three components (c1, c2, c3) and the connections between them (c1c3, c2c3). We can see that the system uses two different structural configurations to provide function B. Initially, it activates components c1 and c3 and connection c1c3. After an hour, it deactivates c1 and activates, for an hour, component c2 and connection c2c3. From this point onwards, the same structural transformation is cyclically repeated every two hours. This example demonstrates that, beyond changes in functionality, mode-charts can also portray structural transformations that occur within the same functional state of the system.
[Figure 4.3. Portraying structural transformations in mode-charts. The figure shows the system composition (components c1, c2, c3 and connections c1c3, c2c3) next to the system behaviour: a first normal mode (functions {A}) linked by T1 (perform function B) to a second normal mode (functions {B}). The second normal mode is decomposed into a first sub-mode (structure {c1,c3}, {c1c3}) and a second sub-mode (structure {c2,c3}, {c2c3}), with transitions T2 (after 1 hour) and T3 (after 1 hour) alternating between them.]
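The cyclic structural transformation of Figure 4.3 can be sketched in the same illustrative style. The encoding below (the names and the hour-based clock are assumptions made for illustration) shows how sub-modes preserve the functional profile {B} of the mode while the structural configuration alternates.

```python
# Illustrative sketch of the second normal mode of Figure 4.3: a fixed
# functional profile delivered by two alternating structural configurations.

MODE_FUNCTIONS = {"B"}  # the functional profile is stable across sub-modes

SUB_MODES = {
    "first sub-mode":  {"components": {"c1", "c3"}, "connections": {"c1c3"}},
    "second sub-mode": {"components": {"c2", "c3"}, "connections": {"c2c3"}},
}

def sub_mode_at(hours_in_mode: int) -> str:
    """The transformation repeats every two hours: the first configuration
    is active in even hours, the second in odd hours (T2/T3 in the figure)."""
    return "first sub-mode" if hours_in_mode % 2 == 0 else "second sub-mode"

for h in range(4):
    cfg = SUB_MODES[sub_mode_at(h)]
    print(h, MODE_FUNCTIONS, sorted(cfg["components"]), sorted(cfg["connections"]))
```

The point of the sketch is the separation of concerns: functions belong to the mode, configurations belong to its sub-modes, so a purely structural change never alters the functional profile.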
4.3.3 Scale, Complexity and the Role of Hierarchical Modelling

In the last two sub-sections, we introduced some basic mechanisms for representing the dynamic behaviour or structure of a system, and gave some small theoretical examples. The real question, however, is whether these mechanisms can be applied effectively in the context of a large and complex system. Such a system may perform a large and varied range of functions in the different phases of its operation. In an aircraft, for example, these functions span from critical navigation and communication functions through to the control of fuel resources, engines, aerodynamic surfaces, and secondary electrical and domestic facilities.

If we attempted to develop a dynamic model for such a system using the concepts that we have delineated in this section, we would encounter serious impediments. Firstly, we would have to identify a potentially enormous number of possible stable functional and structural configurations and their permutations (modes and sub-modes). Secondly, we would have to define all the possible transitions between those configurations. Eventually, we would have to compile this information in a very large and complex mode-chart for the system. The problem that we face here is essentially one of state explosion, which renders the task of developing a single mode-chart for a large and complex system intractable.

This problem, in our view, could be resolved by enabling a form of modelling in which the system is represented as a hierarchy of mode-charts. This type of modelling would enable us to rationalise the description of a complex system by breaking it down into smaller constituent parts. It would also enable aggregation of states between successive levels of the hierarchy, and in that sense, the reduction of the information passing from the lower levels towards the higher levels of the model.
The hierarchy of mode-charts, we believe, should reflect the hierarchy of command and control through which operators orchestrate the delivery of functions in complex systems. Modern automated plants are indeed implemented as hierarchies of sub-systems. The operation of those sub-systems is usually controlled by computers, which translate operator actions and commands from other sub-systems into control procedures and plant level operations.

Consider, for instance, the flight control system of an aircraft. This system is typically composed of a flight control computer and electromechanical or hydraulic sub-systems that move the control surfaces of the aircraft. The primary role of the system is to achieve and stabilise a new aircraft attitude (pitch and roll) every time that a new command is given by the pilot. Beyond that, the system is
responsible for maintaining the stability of the aircraft in the yaw axis. These functions are accomplished as the flight control computer translates the position of the pilot yoke into procedures that co-ordinate the operation of the power-plants that set the aircraft control surfaces in motion. In modern aircraft, more sophisticated layers of automation have been added. Flight Management Systems, for example, can fly aircraft at a certain speed and altitude keyed in by the pilot on the auto-pilot control panel. In more advanced modes of operation, such systems can interpret a flight plan and generate appropriate auto-pilot and engine thrust commands to guide the aircraft on a pre-planned route.

Figure 4.4 illustrates an abstract model of the main aircraft systems participating in flight control. At the first level of this model we see the flight management system, the power-plants that drive the control surfaces, the engine control computers and the engines. At the second level, we see the decomposition of the flight management system into a number of sub-systems that participate in the three modes of operation of this system (yoke control, auto-pilot and flight plan interpretation). Notice that this model is similar to the hierarchical models that we introduced in chapter three. Indeed, the model provides a static view of the design hierarchy, which is confined to the composition of the system and the flow of material, energy and data in the horizontal and vertical axes of the hierarchy. Despite its static character, the model records two important aspects of the design: the physical and logical decomposition of the system and the logical hierarchy of command and control. This static hierarchy, in our view, provides a natural frame around which we can construct a dynamic model of the system.

[Figure 4.4. Two levels in the hierarchy of sub-systems participating in flight control. The first level relates the Pilot, the Flight Management System, the power-plants of the aircraft control surfaces, the engine control computers (FADEC) and the engines through flows of position and acceleration settings, control type, thrust settings and energy. The second level decomposes the Flight Management System into the Manual Flight Controls (yoke position), Auto-pilot Controls (pilot settings), Flight Control Computer, Auto-pilot Controller and Flight Management Computer (interpretation of flight plan).]
4.3.4 The Dynamic Model

The dynamic model that we propose is a hierarchy of mode-charts, which is developed and structured around the physical and logical decomposition of the system. For each sub-system in the static hierarchy, the dynamic model defines a mode-chart that describes behavioural transformations in this sub-system. Figure 4.5 illustrates this relationship between the dynamic model and the static hierarchy that we introduced in chapter three. At the top of the dynamic model lies a mode-chart recording the main operational modes of the system and the transitions between them. At lower levels, mode-charts describe the dynamic operation of sub-systems. The figure also shows that modes can be expanded into nested state diagrams to allow multiple representations of dynamic operation at each level; for example, a basic diagram with functional transformations (modes) and more detailed diagrams that describe structural transformations (sub-modes) within modes.

An important question with respect to this model is how the different layers within the model relate to each other. To address this question, we need to look more closely into the conditions that can cause transitions between modes and sub-modes. We also need to examine the effects that such transitions may have across the hierarchy. Let us start this discussion by examining transitions between sub-modes. In our model, a transition of a system between two sub-modes reflects a structural transformation in the architecture of the system, as this is defined in the next level of the static hierarchy. At the lowest layer of the dynamic model, such transitions can be caused by malfunctions of basic components. In the system of Figure 4.5, for example, a transition of s2 between two sub-modes could be caused by a failure of c2 and the need to replace it with c3.

[Figure 4.5. Relationship between the static hierarchy and the dynamic model of the system. The static model (flow diagrams of the system s, its sub-systems s1, s2, s3 and their components c1, c2, c3) is shown alongside the dynamic model (mode-charts of the system and its sub-systems, with sub-modes nested within modes).]
In a similar fashion, transitions between sub-modes at higher layers of the dynamic model can be caused by malfunctions of subordinate sub-systems. Notice, though, that such failure events also represent transitions of those sub-systems into degraded or failed modes. Thus, a transition of s2 into a failed mode can cause a transition of s between two sub-modes. What does this observation tell us, though, about the relationship between the two mode-charts? Simply, that the mode-charts of s2 and s communicate.

The mechanism with which mode-charts communicate is illustrated in Figure 4.6. The figure illustrates the decomposition of a normal mode of system s into two sub-modes of operation. At a second level, it illustrates the mode-chart of sub-system s2. The mode-chart shows two normal operational modes, in which s2 contributes in different ways to the correct operation of the system (s). Certain conditions (T2 and T4) can move s2 from a normal mode into a temporary failed mode. In the general case, such conditions would represent the effects of component (c1, c2, c3) failures on the function of the given sub-system. Such conditions, for example, could be the top events of fault trees that determine the root causes and propagation of failure in the given mode. The figure also shows that within each temporary failed mode of s2, an action can be taken to reverse the adverse effects of failure. We can see that if normal function has not been restored as a result of those actions within a reasonable time interval, s2 moves into a permanently failed state.

At this point, notice how the system (s) is informed about the failure of s2. The condition s2 in failed mode, which is represented as a state in the mode-chart of s2, also appears as a transition event in the mode-chart of s. There, as we can see, it causes a transition of the system into its second sub-mode of operation. This, precisely, is the mechanism that allows the two mode-charts to communicate.
[Figure 4.6. Communication between mode-charts in the hierarchy of the dynamic model. A normal mode of system s (functions {A,..}) is decomposed into a first sub-mode (structure {s1,s2}, {s1s2}) and a second sub-mode (structure {s1,s3}, {s1s3}); the event “s2 in failed mode” triggers the transition between them. The mode-chart of sub-system s2 shows a first normal mode (functions {Aa}) linked by T1 (normal transition) to a second normal mode (functions {Ab}). Events T2 (Aa failed) and T4 (Ab failed) lead to temporary failed modes with corrective actions (action one, action two); T3 (Aa restored) and T5 (Ab restored) return s2 to normal operation, while T6 (Aa still failed) and T7 (Ab still failed) lead to a permanently failed mode (functions ∅, actions ∅).]
The mechanism enables a transition in one mode-chart to trigger other transitions at higher or lower layers of the dynamic model. This mechanism allows us to represent situations where failures of sub-systems may lead to losses of function at system level. It also allows us to represent situations where a change of function at system level should be followed by a number of necessary functional or structural transformations at lower levels. Take, as an example, the failure of an engine in a twin-engine aircraft. This event should cause a transition of the aircraft into a degraded mode of operation, whereby the objective is to fly with a single engine. In turn, the latter should cause a reconfiguration of the fuel system to ensure that fuel is supplied only to the functional engine.

The way in which mode-charts communicate such changes is always the same. A transition into a significant state in one mode-chart triggers an event that, by itself or in conjunction with other events, can cause transitions in other mode-charts. The protocol for this type of communication is a global broadcast of the significant event. The protocol allows events in one mode-chart to be referenced by any other mode-chart in the model. Assuming that the original event has occurred, its references are then simultaneously activated and may cause one or more transitions in other locations in the model. This mechanism is very generic and permits communication across both the vertical and the horizontal axes of the hierarchy. In practice, though, we can envisage situations where one may wish to restrict the scope of this communication. In systems that employ strictly hierarchical supervision and control schemes, for example, one may want to prevent direct horizontal communication, so that a mode-chart of a system can communicate only with the mode-chart of the parent system and the mode-charts of any child sub-systems in the static hierarchy.
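The broadcast protocol just described can be sketched as follows. This is a minimal illustration, not the monitor's actual engine: the `ModeChart` class, the event strings and the simple FIFO bus are all hypothetical, but the mechanism shown is the one described in the text, in which entry into a significant state raises an event that may trigger transitions in any other mode-chart.

```python
# Sketch of the global broadcast protocol between mode-charts.
# All names are hypothetical; the example reproduces the s / s2
# interaction of Figure 4.6.

class ModeChart:
    def __init__(self, name, initial, transitions, significant=()):
        self.name = name
        self.mode = initial
        self.transitions = transitions       # (mode, event) -> next mode
        self.significant = set(significant)  # states whose entry is broadcast

    def handle(self, event, bus):
        nxt = self.transitions.get((self.mode, event))
        if nxt is not None:
            self.mode = nxt
            if nxt in self.significant:
                # global broadcast: any mode-chart may reference this event
                bus.append(f"{self.name} in {nxt}")

def broadcast(charts, event):
    """Deliver an event to every chart, propagating any events that
    significant-state entries raise, until no further transitions fire."""
    bus = [event]
    while bus:
        e = bus.pop(0)
        for chart in charts:
            chart.handle(e, bus)

# Sub-system s2 fails; its failed mode is a significant state.
s2 = ModeChart("s2", "normal",
               {("normal", "Aa still failed"): "failed mode"},
               significant={"failed mode"})
# The parent system s references the event "s2 in failed mode".
s = ModeChart("s", "first sub-mode",
              {("first sub-mode", "s2 in failed mode"): "second sub-mode"})

broadcast([s2, s], "Aa still failed")
print(s2.mode, "/", s.mode)  # the failure of s2 triggers a sub-mode change in s
```

Restricting communication to a strict parent/child hierarchy, as the text suggests, would amount to replacing the single global bus with per-chart buses shared only along the vertical axis.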
Let us now introduce a notation that will enable us to describe a dynamic system in the form that we discussed in this section. Figure 4.7 illustrates the five primitive elements of the notation:

- Simple State: a sub-mode or a simple (not further decomposed) mode.
- Super State: a mode that is expanded into a diagram with sub-modes.
- Initial State: the entry mode or sub-mode in a mode-chart.
- Local Transition: a transition in a mode-chart caused by a local event (e.g. a low-level failure of a component in the architecture of a sub-system).
- Triggered Transition: a transition in a mode-chart (partly) triggered by an event that has occurred in a different mode-chart (e.g. by the transition of a lower-level sub-system into a failure mode).

Figure 4.7. Notation for mode-charts
According to the notation, a mode-chart of a system or a sub-system is composed of two types of states: simple states and super states. Simple states represent sub-modes or simple (in other words, not further decomposed) modes of the system. Super states, on the other hand, represent more complex modes that encapsulate structural transformations in the system, and which are themselves expanded into diagrams with sub-modes. One state in each mode-chart has a special status: it provides the default entry point to the mode-chart, and is therefore marked with a different symbol. The notation defines two types of transitions between modes and sub-modes. The first is a transition caused by a local event that occurs in the part of the system described by the mode-chart. The second type is a transition triggered by an event that has occurred in a different mode-chart.

Using this notation, it is possible to represent a dynamic system as a hierarchy of abstract state machines, structured around the physical and logical decomposition of the system. To enable the development of such models around the static hierarchies that we have introduced in the context of HiP-HOPS, we have extended our tool with a graphical editor that supports the proposed notation. In practice, whenever we wish to develop a mode-chart for a sub-system in the static hierarchy, we select the sub-system rendering and command the tool to create a new canvas, where we can draw the mode-chart (see, for example, Figure 4.8).
Figure 4.8. A sub-system in the static hierarchy and its mode-chart
The same technique is used for the decomposition of modes into diagrams with sub-modes. During the development of the model, the tool hyperlinks sub-systems and modes to their subordinate mode-charts. It also creates hyperlinks between any triggered transitions and the states that trigger those transitions. Those links allow easy navigation across the dynamic hierarchy and between the dynamic model and the static hierarchy.
4.3.5 The Dynamic Model as a Form of Safety Analysis

One significant attribute of the proposed model is that it is not confined to a description of the system in normal conditions of operation. It continues to examine behavioural and structural transformations after the system has stepped into the domain of failure, until the point where successive functional failures at lower levels cause transitions of the system to unacceptable modes of failure. In doing so, the model records the gradual transformation of lower-level failures into system malfunctions, taking into account the physical or logical dependencies between those failures and their chronological order.

This model has some conceptual similarities to an abstract fault tree. For instance, it records the logical relationship between the causes and effects of failures in a hierarchical structure. At the same time, the model is different from a classical fault tree because it situates the propagation of failure in a dynamic context. The precise relationship of this model to fault trees is discussed later in the thesis.[16]

What is important to register about the model at this point is that it interweaves the description of normal dynamic behaviour and the description of possible failure scenarios in a common hierarchical structure. In that sense, the model is a form of safety analysis. One possible way in which we could characterise it, from the point of view of safety analysis, is as a dynamic, hierarchical functional failure analysis of the system. This duality in the role of the dynamic model (as a basis for the analysis and a form of safety analysis) may seem paradoxical at first sight. It is explained in the following section, where we introduce a process for modelling and safety assessment of dynamic systems. There, we will see that, as a description of normal dynamic behaviour, the model enables us to extend the application of HiP-HOPS to dynamic systems.
At the same time, the model itself grows during the assessment process as yet another form of safety analysis.
[16] In the concluding chapter, we speculate that it may actually be possible to translate the hierarchy of mode-charts into a large dynamic fault tree, which conforms to the extended fault tree model proposed by Dugan, Bavuso and Boyd [Dugan et al, 1993].
4.4 Modelling and Safety Analysis Process

Figure 4.9 illustrates the proposed process diagrammatically. There are two distinctly different stages in this process: an analytical one and a synthetic one. In the analytical stage, we model the normal behaviour of the system and, using HiP-HOPS, we identify potential failure scenarios that violate this behaviour. This examination takes place at low levels of the design, where we can identify relatively simple and easy-to-analyse sub-systems. The results from the analysis help us to determine the potential degraded and failed modes of those sub-systems, as well as the conditions that cause transitions to such modes. At the synthetic stage of the process, we determine how logical combinations or sequences of such low-level failure modes propagate upwards and cause functional failures at higher levels of the design. At this stage, we synthesise the part of the dynamic model that describes the dynamic behaviour of the system in conditions of failure, and eventually arrive at the failure modes of the overall system, as well as the conditions that lead to such failure modes.
[Figure 4.9. Modelling and safety analysis of dynamic systems. The diagram relates system modelling to safety analysis. On the modelling side it shows the static model (flow diagrams of the system and its sub-systems) and the dynamic model (mode-charts of the system and its sub-systems in normal and failure conditions). On the analysis side it shows the safety analyses of functionally and structurally stable configurations (HiP-HOPS analysis of each sub-system in its different modes or sub-modes of operation, comprising FFAs and IF-FMEAs), the dependencies between modes or sub-modes and these static safety analyses, and the fault tree synthesis algorithm, which mechanically synthesises fault trees showing how the functional failures identified in the FFAs arise from the low-level component failure modes identified in the IF-FMEAs.]
4.4.1 The Analytical Stage of the Process

The analytical stage of the process starts with the physical and logical decomposition of the system into sub-systems and basic components (see, for example, the static model in Figure 4.9). For each composite element in the static model (i.e. the system or a sub-system), we then develop a mode-chart, which at this point records only the normal modes (and sub-modes) of operation of this element (see, for example, the dynamic model in Figure 4.9). The modelling process starts at the top of the hierarchy and moves towards its lower levels. The reason for this order is that the normal behaviour of a sub-system would typically depend on the normal behaviour of its parent system. Functional or structural transformations in the parent system, for instance, would normally trigger mode or sub-mode changes in sub-systems.

The architectural and behavioural decomposition of the system stops once we have identified sub-systems with relatively simple architectures and small numbers of normal operational modes (or sub-modes). Each simple state[17] of those sub-systems represents a functionally and structurally stable configuration of components, the failure behaviour of which can be determined using HiP-HOPS. One way to determine the failure behaviour of a system with more than one simple state is the repetitive application of HiP-HOPS in each of those states. In each state we know the precise configuration and the functions delivered by the sub-system. It is therefore possible to carry out a functional failure analysis of the system and apply IF-FMEA to its components. We can then mechanically synthesise the fault trees that describe the root causes and propagation of failure in this particular state (see, for example, the right part of Figure 4.9).
The results that we gain from this type of analysis help us to determine potential failure scenarios in each state and to incorporate this information in the mode-chart of each sub-system. Functional failure analysis, for example, identifies functional losses or malfunctions that cause transitions from normal states into degraded or failed modes of operation. Similarly, IF-FMEAs contain the conditions that signify the presence of such functional failures; deviations of physical parameters or data at the outputs of the sub-system, for example. By providing the symptoms of failure on process parameters, IF-FMEAs can help us identify the conditions that verify the transition of the sub-system
According to our notation (Figure 4.7), a simple state of a system is either a sub-mode or a nondecomposed mode of this system.
122
into one of its failure modes. Furthermore, the mechanically generated fault trees determine the root causes of such disturbances in the operation of the system. Once we have defined the failure modes of each sub-system, and analysed their symptoms and root causes, we can then assess the potential for local recovery from those failure modes. In this step, we examine if there are any actions that can be taken to restore the normal function of the sub-system. If such actions exist, then the current mode is characterised as a temporary degraded or failed mode. In the opposite case, it is marked as a permanently degraded or failed mode. A temporary failure mode represents a short-lived state that follows an initial failure where there is a potential for recovery but also an ambiguity about the effects of any backward recovery operations that may have been initiated. During the analysis of such modes, we need to define two sets of conditions: a)
conditions that verify the success of recovery operations, and therefore, signify a safe return of the sub-system to a normal mode (but possibly to a different submode) and
b)
conditions that verify the omission or failure of recovery operations, and therefore, formulate the prerequisites for a transition of the sub-system to a permanently degraded of failed mode.
Permanently failed modes represent states in which the sub-system has irrevocably lost all of its functions. There are no escape transitions from such modes, and therefore no need for further safety analysis. Permanently degraded modes, however, may represent states in which, although the sub-system has lost some functionality, it is still usefully employed by its parent system. Such modes are analysed in precisely the same way as normal modes, that is, via repetitive application of HiP-HOPS. The analysis of those modes helps us to determine conditions that move the sub-system even further into the domain of failure.

Finally, the analytical stage is completed once we have identified all the possible scenarios that may lead sub-systems into permanently failed modes. At the end of this stage, we have extended the mode-chart of every sub-system that lies at the lowest level of our decomposition with information about the dynamic behaviour of this sub-system in conditions of failure.
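The classification of modes produced by this stage (normal, temporary degraded or failed, permanently degraded, permanently failed) and the conditions attached to their transitions can be captured in a simple data structure. The sketch below is illustrative only; the pumping sub-system, its mode names and its trigger conditions are hypothetical and are not part of HiP-HOPS.

```python
from dataclasses import dataclass, field

@dataclass
class Transition:
    condition: str   # symptom that triggers the transition (e.g. a flow deviation)
    target: str      # destination mode

@dataclass
class Mode:
    name: str
    kind: str        # "normal", "temporary failed", "permanently degraded" or "permanently failed"
    transitions: list = field(default_factory=list)

# Hypothetical mode-chart fragment for a pumping sub-system
chart = {m.name: m for m in [
    Mode("pumping", "normal",
         [Transition("omission of output flow", "recovering")]),
    Mode("recovering", "temporary failed", [
        Transition("output flow restored", "pumping"),         # condition set (a): recovery succeeded
        Transition("no flow after retry period", "failed")]),  # condition set (b): recovery omitted/failed
    Mode("failed", "permanently failed", []),                  # no escape transitions
]}

# A permanently failed mode needs no further safety analysis
assert chart["failed"].transitions == []
# A temporary failure mode carries both sets of conditions, (a) and (b)
assert len(chart["recovering"].transitions) == 2
```

Note how the two escape transitions of the temporary mode encode exactly the two sets of conditions (a) and (b) identified above.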
4.4.2 The Synthetic Stage of the Process

In the synthetic stage, we determine the effects of the low-level failure modes that we recorded in the analytical stage on higher levels of the design. In other words, we use the mode-charts that we have derived using HiP-HOPS to complete other mode-charts in the dynamic model. The synthetic process starts at the bottom and moves, through succeeding layers, towards the top of the hierarchy. At each step of the process, we complete the mode-chart of a sub-system in the static hierarchy using information contained in the mode-charts that we have previously synthesised for lower-level sub-systems.

This concept is illustrated in Figure 4.10, which presents the static and dynamic model of a system (s). At the lowest level of the dynamic model, the figure illustrates the mode-charts of sub-systems s11, s12 and s13, which have been derived, for example, at the analytical stage of the process. At the synthetic stage of the process, we exploit this information to complete the mode-chart of the parent system s1. Having synthesised the mode-charts of s1, s2 and s3, we can then proceed in a similar fashion to synthesise the mode-chart of the overall system (s).

Let us now focus on a fundamental pattern that we repeat in the course of this inductive process. More specifically, let us examine the way in which we synthesise a mode-chart for a system given that we have previously derived the mode-charts of its subordinate sub-systems. The first step in this process is to examine the system-level effects of failure transitions that occur at sub-system level. The aim here is to identify potential functional losses or malfunctions of the system, and the combinations of sub-system failure modes that cause those functional failures.
[Diagram: the static hierarchy of system s over sub-systems s1, s2 and s3, with s1 further decomposed into s11, s12 and s13, shown side by side with the dynamic model of mode-charts at each level.]
Figure 4.10. Synthesis of mode-charts from the bottom towards the top of the hierarchy
The results from this analysis indicate the failure modes of the system and the events that trigger transitions to those modes. They can, therefore, be used for the synthesis of the system mode-chart. In this mode-chart, transitions are triggered by logical combinations of significant failure events that occur in the lower-level mode-charts.

Once we have defined the failure modes of the system, and identified their causes in the mode-charts of its sub-systems, we can then assess the potential for recovery in each of those modes. In this step, we examine whether there are any actions that can be taken to restore normal function at system level. Such actions, for example, may include the activation of a stand-by, redundant sub-system to replace a failed sub-system. If such actions exist, then the current mode is characterised as a temporary degraded or failed mode. In the opposite case, it is marked as a permanently degraded or failed mode. As we have already mentioned, a temporary failure mode represents a short-lived state of failure where there is ambiguity about the effects of any backward recovery operations that may have been initiated. When we examine such modes in non-terminal nodes of the dynamic hierarchy, we need to define two sets of conditions.

a) The first set includes those conditions that verify the success of recovery operations, and therefore signify a safe return of the system to a normal mode. These conditions would typically be represented by significant events that occur in sub-system mode-charts. If, for example, the response to a failure is to activate a standby sub-system, then the condition that verifies the success of the recovery operation is a transition of this sub-system from an idle state to a normal mode of operation.

b) The second set includes those conditions that verify the omission or failure of recovery operations, and therefore formulate the prerequisites for a transition of the system to a permanently degraded or failed mode. Once more, such conditions would typically be represented by significant events in the sub-system mode-charts. If, in the previous example, the redundant sub-system fails to start, then the condition that verifies the failure of the recovery operation is the direct transition of this sub-system from the idle state to a permanently failed mode.
As opposed to temporary failure modes, permanently failed or degraded modes represent states in which the system has irrevocably lost part of, or all, its functionality. There are no escape transitions from such modes directed to normal modes, and therefore, there is no need for further analysis in this direction. The transitions to such failure modes are
triggered by lower-level (sub-system) failure modes, the causes of which we have already analysed in the preceding step of the synthetic stage or in the analytical stage of the process.

At this point we have identified the potential temporary and permanent failure modes of the system and determined the conditions that trigger transitions between those modes. One remaining aspect of the analysis is the assessment of the effects that those failure modes have on higher levels of the design. But, as we have already explained, once we have synthesised the mode-charts at a certain level of the hierarchical model we can then proceed with the synthesis at the next level. The process that we have described in this sub-section is repeated until we reach the top of the hierarchy and construct a mode-chart for the overall system. At this point, the dynamic model identifies the potential functional failures at system level. In conjunction with the fault trees that we have derived in the analytical stage of the process, the model also defines the combinations or sequences of component failures and lower-level malfunctions that, instantly or gradually, lead the system to those high-level failure modes.
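The triggering pattern used in this stage, in which a system-level transition fires when a logical combination of significant sub-system events holds, can be sketched in a few lines of Python. The sub-system names, mode names and the standby example below are hypothetical, not taken from any particular model.

```python
# Current modes of the sub-systems (hypothetical names)
submodes = {"primary_pump": "failed", "standby_pump": "idle"}

def holds(event, submodes):
    """An event of the form 'subsystem=mode' holds if that sub-system is in that mode."""
    name, mode = event.split("=")
    return submodes[name] == mode

def triggered(conjuncts, submodes):
    """A system-level transition fires when all of its conjunct events hold."""
    return all(holds(e, submodes) for e in conjuncts)

# Entry to a temporary failure mode: the primary pump has failed,
# and the outcome of the standby activation is still pending
assert triggered(["primary_pump=failed"], submodes)

# Success of recovery: the standby sub-system reaches a normal mode
submodes["standby_pump"] = "pumping"
assert triggered(["primary_pump=failed", "standby_pump=pumping"], submodes)

# Failure of recovery: the standby goes straight from idle to a permanently failed mode
submodes["standby_pump"] = "failed"
assert triggered(["primary_pump=failed", "standby_pump=failed"], submodes)
```

The two final assertions correspond to the condition sets (a) and (b) defined above.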
4.4.3 Two Ways to Apply HiP-HOPS in Dynamic Systems

During the analytical stage of the process, HiP-HOPS is applied at the lower levels of the design, where we can identify relatively simple, but often still multi-moded, sub-systems. One way to analyse such systems using HiP-HOPS is by developing a separate set of analyses for each mode. Each of those sets will contain a functional failure analysis table for the system, one IF-FMEA table for each component and a set of fault trees for the given mode. One problem with this approach is that it potentially generates an unnecessarily large volume of partially repetitive analyses. A transition between modes does not necessarily imply a radical change in the functionality of the system. A sub-set of the functions delivered by the system often remains unaffected by mode changes. Similarly, components usually behave in exactly the same manner across different modes of operation. This implies that a large part of the functional failure analysis and the IF-FMEAs that we have derived in one mode will also remain valid in other modes.

To avoid the repetition of analyses across different modes, it is possible to build a single set of analyses for the system, in which every segment of the analysis that is relevant to more than one mode is clearly marked with references to those modes. One way to achieve this in HiP-HOPS is by adding an extra column to the FFA and IF-FMEA tables. For each functional failure of the system
or deviation of a component output, we can then specify in this column a list of modes in which the failure event is relevant. Figure 4.11 illustrates how the sensitivity of the analysis to the state of the system is represented in HiP-HOPS. The figure shows the static and dynamic model of a system (s). We can also see part of a hypothetical functional failure analysis for the system and segments of a hypothetical IF-FMEA for one of its components (c2). Notice that the FFA relates functional failures to deviations of system outputs (the symptoms column in the FFA table). The system delivers two different sets of functions ({A,B} and {A,C}) in two corresponding modes of operation. Function A is common to the two modes. This property is also reflected in the functional failure analysis, where the functional failure loss of A has been marked as being relevant to both modes. Similarly, the IF-FMEA of component c2 shows that an omission (O-out) or commission (C-out) of the output can occur in both operational modes. An interesting observation on the IF-FMEA is that output failure modes can have different interpretations in different modes. We can see, for example, that the interpretation of the value failure more-out changes as we move between the two modes, and as the expected value at the output of the component changes from set-point_one to set-point_two.
[Diagram: the static model of system s (components c1 and c2 linked by flow1, with input inp and outputs out and out2) and its mode-chart, with mode one delivering functions {A,B} and mode two delivering functions {A,C}. The accompanying tables, reconstructed from the figure, are:]

Functional Failure Analysis of s (excerpt):

  Functional Failure | Symptoms       | Relevant Modes
  Loss of A          | O-out          | one, two
  Loss of B          | O-out2         | one
  Loss of C          | O-out2         | two
  Loss of A and B    | O-out & O-out2 | one
  Loss of A and C    | O-out & O-out2 | two

IF-FMEA of c2 (excerpt):

  Output Failure Mode | Description              | Relevant Modes
  O-out               | Omission of the output   | one, two
  C-out               | Commission of the output | one, two
  more-out            | Value > set-point one    | one
  more-out            | Value > set-point two    | two
Figure 4.11. Incorporating modes in low-level analysis
By introducing modes in FFAs and IF-FMEAs, we clearly reduce the volume of the analyses produced during the analytical stage of the process. The implication, however, is that we also need to modify the fault tree synthesis algorithm so that we can still mechanically generate fault trees from such mode-sensitive FFAs and IF-FMEAs. The synthesis problem is now transformed into one of determining the causes of a given functional failure of the system in a certain mode of its operation. Figure 4.12 provides a pseudo-code representation of a modified fault tree synthesis algorithm that could generate such fault trees. Given that we wish to create a fault tree for a certain functional failure of the system in a certain mode of operation (Get(failure,system,mode)), the first step in the synthesis is to locate the failure in the FFA table. By parsing the failure logic that relates this failure to deviations of system outputs, we can then construct the first level of the fault tree (Parse(system.FFA.failure.expression)).
SynthesiseFaultTree(cnode, csys, cmode) {
    // Recursive, mode-sensitive fault tree synthesiser
    comp = FindComponent(cnode, csys);
    // Traverse the model and find the component (comp) which generates
    // the output failure (cnode)
    correct_row = Locate(comp.FMEA, cnode, cmode);
    // Find the row in the IF-FMEA of comp that gives the correct
    // interpretation of output failure cnode in the given mode
    Parse(correct_row);
    // Parse the IF-FMEA expression and expand cnode in the fault tree
    for each leafnode in the sub-tree of cnode
        if leafnode is not (a component malfunction or a deviation of a system input)
            SynthesiseFaultTree(leafnode, csys, cmode);
            // Expand each leaf-node by calling the recursive synthesiser
}

Main() {
    // Fault tree synthesis algorithm
    Get(failure, system, mode);
    // Get the functional failure for which to generate a fault tree,
    // the name of the system and the system mode
    Parse(system.FFA.failure.expression);
    // Find failure in the FFA table and parse the expression (in the
    // symptoms column) that relates this failure to deviations of
    // system outputs; generate the first level of the fault tree
    for each leafnode in the fault tree
        SynthesiseFaultTree(leafnode, system, mode);
        // For each leaf-node (i.e. each deviation of a system output)
        // call the mode-sensitive recursive synthesiser to create a branch
}
Figure 4.12. The extended fault tree synthesis algorithm
The lower levels of the fault tree are constructed as we ask the modified, mode-sensitive, recursive fault tree synthesiser to generate a sub-fault tree for each output deviation that contributes to the given functional failure (SynthesiseFaultTree(leafnode,system,mode)). The only difference between the modified version of the recursive fault tree synthesiser and the version that we presented in chapter three (Figure 3.12) is that, in the modified version, the search for failure events in IF-FMEAs is mode-sensitive. More specifically, each time we wish to locate an output failure mode of a component and its respective failure logic, we search for the row that gives the correct interpretation of the output failure in the given mode.

If the same functional failure can occur in n (n > 1) modes of operation, we can use the modified synthesis algorithm to generate n fault trees which record the potentially different root causes of failure in those modes. It is important to point out that these fault trees would be equal in number and identical to the fault trees that we would have derived had we opted for a straightforward repetitive application of HiP-HOPS in each mode. From the point of view of fault tree synthesis, therefore, the two approaches to the application of HiP-HOPS in dynamic systems are not only equivalent but actually generate identical results. In practice, of course, the extended syntax and algorithmic modifications that we have discussed here would benefit safety assessments by reducing the volume and increasing the conciseness of the analyses. Despite the considerable practical benefits, though, the implementation of the proposed modifications is considered peripheral to the aims of this thesis, and is therefore left as an open issue for further work.

As we have explained, the standard version of HiP-HOPS provides a sufficient (albeit not the most efficient) way to perform safety analysis at the lower levels of the dynamic model. It was therefore possible to use the techniques and tools that we developed in chapter three in order to experiment with the concepts that we have presented in this chapter and to study further our approach to the assessment of dynamic systems.
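As an illustration, a minimal executable rendering of the mode-sensitive synthesis of Figure 4.12 is sketched below in Python. The FFA and IF-FMEA tables, component names and deviations are invented for the example; the real algorithm operates on the full HiP-HOPS model and expression syntax.

```python
# Mode-sensitive FFA: functional failure -> (output deviation expression, relevant modes)
FFA = {"loss of A": ("O-out", {"one", "two"})}

# Mode-sensitive IF-FMEA rows: (component, output deviation, mode) -> list of causes.
# A cause is either a component malfunction (a leaf) or a further deviation to expand.
IFFMEA = {
    ("c2", "O-out", "one"): ["c2 stuck", "O-flow1"],
    ("c2", "O-out", "two"): ["c2 stuck"],
    ("c1", "O-flow1", "one"): ["c1 blocked"],
}

# Which component generates each deviation (stands in for FindComponent)
PRODUCER = {"O-out": "c2", "O-flow1": "c1"}

def synthesise(deviation, mode):
    """Expand a deviation into a nested cause tree, reading only rows for the given mode."""
    comp = PRODUCER[deviation]
    causes = IFFMEA[(comp, deviation, mode)]  # mode-sensitive row lookup (Locate)
    return {deviation: [c if c not in PRODUCER else synthesise(c, mode) for c in causes]}

def fault_tree(failure, mode):
    expression, modes = FFA[failure]
    assert mode in modes, "functional failure not relevant in this mode"
    return synthesise(expression, mode)

# The same functional failure has different root causes in the two modes
assert fault_tree("loss of A", "two") == {"O-out": ["c2 stuck"]}
assert fault_tree("loss of A", "one") == {"O-out": ["c2 stuck", {"O-flow1": ["c1 blocked"]}]}
```

The two assertions show the point made above: the two fault trees generated for the two modes are exactly those that repetitive application of standard HiP-HOPS would have produced.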
4.5 Case Study

Our case study is based on a collection of models and safety analyses that we derived during the safety assessment of an electromechanical and hydraulic model of an aircraft fuel system, which we have developed at the University of York. The result of this study is a safety case for this system, which we discuss in this section. Drawing from the analyses, we will demonstrate how the proposed method has helped us to rationalise the
assessment of a complex dynamic system, and to gain an understanding of the mechanics of failure in this system. Our analysis is deliberately based on a very basic control scheme with no safety monitoring or fault tolerance functions. This will allow us to demonstrate the way in which, using the method, we have finally arrived at a safety case which provides an extensive specification of potential failure detection and control mechanisms that improve the safety of the system.
4.5.1 Introduction to the Fuel System and Scope of the Case Study

The fuel system is a model of the fuel storage and fuel distribution system of a twin-engine aircraft. The physical configuration of the system is illustrated in Figure 4.13. The system consists of seven fuel tanks, which are identical in dimensions and connected with transparent polythene piping. The configuration represents a hypothetical design in which the fuel resources of the aircraft are symmetrically distributed across the longitudinal and lateral axes of the system. We can see, for instance, that there are tanks in the front, central and rear parts of the aircraft, as well as tanks in both aircraft wings. A series of DC pumps can transfer fuel through the system and towards the engines. The direction and speed of those pumps, and hence of the flows in the system, are computer controlled, while the maximum flow rate that can be achieved is in the region of 2 litres/min.
[Diagram: the seven-tank layout of the aircraft (front, central and rear tanks, plus two tanks per wing), with pumps and speed sensors (P), valves, flow meters (F), level sensors (L), the refuelling point, jettison points at the outer wing tanks, and the feed lines to the port and starboard engines through valves VL2, VC2 and VR2.]
Figure 4.13. Physical configuration of the aircraft fuel system
The figure also shows a number of valves which can be used to block or activate paths and to isolate or provide access to fuel resources. The position of those valves ranges from fully open to fully closed and is also software controlled. Finally, analogue sensors that measure the speed of pumps, fuel flows and fuel levels provide indications of the controlled parameters in the system.

Refuelling is performed by injection of fuel into the central tank and automatic distribution of fuel resources from the central tank to the rest of the system. During normal operation, the central valve (VC2) is closed and each engine is fed from a separate tank at a variable rate, which is defined by the current engine thrust setting. More specifically, the port engine is fed from the front tank and the starboard engine is fed from the rear tank. As fuel is consumed, more fuel is continuously transferred from the wing and central tanks to the front and rear tanks, and from there to the engines. Under normal operating conditions, fuel flows are controlled so that fuel resources are always symmetrically distributed in the system and the centre of gravity lies near the centre. This balanced distribution of fuel is essential for ensuring the stability of the aircraft during flight.

In conditions of failure, there are provisions for cross-feeding both engines from the same tank. If valve VL2 fails stuck closed, for example, it is possible to open valve VC2 and cross-feed both engines from the front tank. A number of failures (such as a leak in one of the tanks or a loss of feed to one of the engines) may also affect the balance of the system. In such cases, it is possible to transfer fuel from various sources to various destinations at various speeds so as to correct the imbalance.
As we will show, the results from the safety studies will help us to enumerate conditions that cause asymmetrical loading and, for each condition, to determine control actions that potentially restore the balanced distribution of fuel. We must note that the configuration of Figure 4.13 also allows the jettison of fuel to the atmosphere, which may occur from two points that are symmetrically placed at the outer wing tanks. Jettison would typically occur in an emergency landing situation, when there is a need to discard excess fuel in order to reduce the total aircraft mass down to a level that allows safe landing of the aircraft (the maximum landing mass).

The system is controlled by a central computer, which monitors parameters of the process, and controls valves and pumps to regulate the path and rate of flows in the system. It is apparent that this centralised computer architecture represents a single point of failure for the whole system. On the other hand, the architecture is much more
interesting at plant level. There, we can observe a number of redundant paths and redundant resources that potentially make the system fault tolerant at this level. For that reason, the safety study of the fuel system that we present in this section is focused on failures that occur at plant level and stops at the interface between the plant and the central computer [18]. Such failures include fuel leaks, pipe blockages, valve and pump malfunctions and corruption of control commands to those elements (caused, for example, by electromagnetic interference).

The second difference between this case study and the study of the brake-by-wire system is that here we assume a reliable sensory interface, where all sensor failures have been masked using some form of hardware or analytical redundancy (see also section 2.2.1). We can, therefore, consider such failures as highly improbable and can safely omit the analysis of those failures and their effects on the monitoring and control processes. We must clarify that this is not at all a precondition for the general application of the safety analysis techniques that we have proposed in this chapter. The reason that we make this assumption in this particular case study is that our aim is to generate a safety case which can also be used as a model for on-line safety monitoring. In chapter five, we will show that it is possible to operate on the safety case in order to detect, diagnose or assess the effects of failures. Prior to performing those tasks, though, we must apply appropriate techniques to ensure that we have detected sensor failures and located any unreliable sources of information about the state of the system. Thus, in our approach to safety monitoring at least, the detection of sensor failures and the validation of sensory data are logically separated from (and precede) other failure detection and control tasks.
After this brief introduction to the fuel system and the scope of our case study, let us move to the safety analysis of this system starting with a discussion of the model that we have developed in SAM, which represents the system as a hierarchy of subsystem architectures.
4.5.2 The Static Hierarchy

Figure 4.14 displays the first two levels in the static hierarchy of the fuel system. At the first level of the hierarchy, we can see the whole plant (FS) encapsulated in a box, which has one input, the flow from the refuelling line, and four outputs, the fuel flows to the

[18] As opposed to the study of the brake-by-wire system, which examines the software architecture of each programmable node in the system.
engines and the two jettison outlets. The second level of the hierarchy shows the decomposition of the plant (FS) into four sub-systems: the engine feed and cross-feed sub-system (EFCF), a sub-system which provides fuel to EFCF from a central deposit (CD), and two sub-systems providing fuel from deposits located at the two aircraft wings (LWD, RWD). The hardware architecture of those sub-systems is illustrated in Figure 4.15 [19].

Although some of the diagrams in Figure 4.14 and Figure 4.15 look more like piping and instrumentation diagrams, we must say that they still conform to the flow diagram notation of HiP-HOPS. Architectures are composed of sub-systems, components and flows. Shaded renderings represent sub-systems, while non-shaded renderings represent basic components, the failure behaviour of which is known and can be recorded in an IF-FMEA table. It can be noticed, though, that different types of components are represented with different renderings. We must clarify that this is a superficial distinction, which does not serve any useful purpose from the point of view of safety analysis; the only reason that the current implementation of HiP-HOPS supports this feature is to enable more familiar-looking representations of plant architectures.
[Diagram: the plant FS with input Refuelling_Line and outputs SEngine_Line, PEngine_Line, LWing_Jettison and RWing_Jettison, decomposed into the sub-systems LWD (Left Wing Deposit), CD (Central Deposit), RWD (Right Wing Deposit) and EFCF (Engine Feed and Cross-feed), connected by LWing_Line, RWing_Line, Central_Line_1 and Central_Line_2.]
Figure 4.14. The first two levels in the static hierarchy of the fuel system
[19] In fact, the figure displays three of the four sub-systems (EFCF, CD and LWD). The architecture of the right wing is identical to that of the left wing and is, therefore, intentionally omitted.
[Diagram: internal architectures of the EFCF, CD and LWD sub-systems. EFCF contains the rear tank (TL1) and front tank (TR1), valves VL1, VL2, VC2, VR1 and VR2, and pumps PL1 and PR1, together with their flow, level, pump-speed and valve-position sensors. CD contains the central tank (TC), valve VC and pumps PL3 and PR3. LWD contains the outer and inner wing tanks (TL2, TL3), valves VL3, VL4 and VLJ, and pumps PL4, PL5 and PLJ, with the corresponding sensors and the LWing_Jettison_Line. Legend: L = level sensor, F = flow sensor, S = pump speed sensor, P = valve position sensor.]
Figure 4.15. Further decomposition of the EFCF, CD and LWD subsystems
Let us now determine, for each composite element in the static model (i.e. the system or a sub-system), the normal behaviour of this element; in other words, let us determine its modes in normal conditions of operation. For simplicity, and without compromising the principle of the method, we restrict the scope of our analysis to a particular phase of operation in which the engines have already been started and there is a continuous (but variable) demand for fuel supply [20].
4.5.3 Normal Behaviour of the Fuel System and its Representation in the Dynamic Model

To define the modes of a system, we must first examine the behaviour of the system in normal conditions of operation. In such conditions, the fuel system has a stable functional profile. The two functions of the system are to ensure uninterrupted fuel supply to each engine at a variable rate x, defined by the engine thrust settings, and to guarantee that there is always a balanced distribution of fuel resources across the platform. One way to ensure the latter is to distribute the fuel demand x equally among the fuel tanks, so that the reduction of volume in each tank is always equal (within small bounds of error) to that in any other tank. Figure 4.16 illustrates the fuel flow rates (a, b, c) required from the wings and the central deposit in order to satisfy a given demand x by the engines and, at the same time, achieve the same rate y of volume reduction in tanks across the system.
[Diagram: the FS architecture annotated with the flow rates of normal operation: x to each engine line, a = 4x/7 from each wing deposit, b = x/7 on each of the two central lines, and a volume reduction rate of y = 2x/7 in every tank (2y per wing deposit).]
Figure 4.16. Distribution of fuel flows among sub-systems in normal operation
[20] The method could be applied in an identical way to describe and analyse the behaviour of the system in other phases of operation, such as refuelling.
The values of y, a, b and c (in relation to x) can be calculated by solving a set of simultaneous equations that can be easily derived from the topology of the system and from the relationship between volume reduction and input/output flows in each tank. These values are: y = 2x/7, a = 4x/7, b = x/7 and c = 4x/7.
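These values can be cross-checked with a few lines of code. The sketch below solves the balance relations by substitution, under the topology assumptions stated above (two tanks per wing deposit, a single central tank feeding two central lines, symmetry between the wings); exact rational arithmetic avoids rounding error.

```python
from fractions import Fraction

x = Fraction(1)  # engine demand (normalised)

# Balance relations derived from the topology:
y = 2 * x / 7    # 7y = 2x: total tank depletion equals total engine consumption
a = 2 * y        # each wing deposit (two tanks, each depleting at rate y) supplies a
b = y / 2        # the central tank feeds two lines at rate b each, so 2b = y
c = a            # symmetry between the two wings

assert (y, a, b, c) == (Fraction(2, 7), Fraction(4, 7), Fraction(1, 7), Fraction(4, 7))
# Cross-check the front tank balance: inflow (a + b) plus its own depletion y meets demand x
assert a + b + y == x
```

The final assertion is the per-tank balance that closes the set of simultaneous equations.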
From this calculation, it is possible to derive a control scheme that satisfies the two objectives of the fuel system (i.e. to provide fuel at the requested rate x and to ensure a balanced distribution of resources). This control scheme is diagrammatically illustrated in Figure 4.17. The figure shows that each pump in the system is controlled in a closed loop with one or more local flow meters, which provide a measurement of the actual flow generated by the pump. Each such loop aims to achieve a stable local flow of fuel in the part of the system controlled by the given pump. The level of this flow, in other words the set-point of the loop, changes proportionally to the current demand x. According to this control scheme, the pumps (PL1, PR1) located at the output of EFCF supply the engines with fuel at rate x. At the same time, the pumps (PL3, PR3) generating the output of the central deposit (CD) supply the front and rear tanks of EFCF at rate x/7, while the pumps located at the output of the wing deposits (LWD, RWD) supply the same tanks at rate 4x/7.
[Diagram: the closed-loop control scheme. Each pump is paired with local flow meters: the engine feed pumps PL1 and PR1 are driven to rate x, the central deposit pumps PL3 and PR3 to x/7, the wing deposit output pumps to 4x/7, while PL5 and PR5 are driven to the differentials (FL2-FL4)/2 and (FR2-FR4)/2 to balance the two tanks of each wing.]
Figure 4.17. The control scheme of the fuel system
Finally, to ensure that the fuel consumption from each wing is equally shared between the two tanks of the wing, pumps PL5 and PR5 constantly try to counterbalance any differential between the output flows from the two tanks.

The central element of this scheme is a control algorithm which is executed by the control computer on a cyclic basis. In each cycle of measurement, the algorithm determines, for each control loop, the difference between the set-point and the actual value of the measured variable (flow). Whenever there is a discrepancy between those values (caused either by a disturbance or by a change of the set-point), the algorithm adjusts the control output (pump setting) accordingly to correct the error. The algorithm is a well-known, general purpose, feedback-control algorithm which applies a proportional-integral (PI) correction factor to the control output. The theory underlying this type of control is well documented in the literature (see, for example, [Chard, 1983]). In mathematical terms, the algorithm relates the output of the pump to the current flow deviation according to the following relationship:

    P(t) = K { ε(t) + (1/T) ∫[0,t] ε(τ) dτ } + P(0)

where

    P(t)  is the calculated setting of the pump at time t after the deviation has been detected,
    K     is the proportional gain,
    ε(t)  is the error (the discrepancy between the set-point and the actual value of the flow) at time t,
    T     is the integral action constant, and
    P(0)  is the setting of the pump when the control action began.
When a deviation occurs, the proportional correction term (Kε(t)) causes a sudden change to the output of the pump and provides an initial response to the deviation. As the deviation decreases, the effect of the proportional term on the pump output diminishes, while, as the error accumulates over time, the effect of the integral term (K/T ∫ε(τ)dτ) increases. At the point where the deviation has been corrected, the proportional term becomes zero and the integral term stabilises at a value which maintains the measured variable at the new steady state. The values of the parameters K and T define the amplitude of the proportional and integral correction terms. They can be fine-tuned to ensure the stability of the control loop and reduce the number of oscillations that occur before the controlled variable reaches the steady state. A good review of analytical techniques for the optimisation of closed loop performance can be found in [Astrom and Hagglund, 1988]. However, the
application of such techniques was considered non-essential for the accomplishment of the aims of this case study. We have simply used a “trial and error” strategy to derive for each loop non-optimal parameter values, which, however, provide recovery of the control variable within a reasonably short period of time. Using this approach we were able to achieve stabilisation of flows in the system (more precisely a diminishing deviation |ε| of less than 3% around the set-point value) in a maximum period of 6 seconds. Figure 4.18 provides an example of the effect that the control algorithm has on system flows. It gives a graphical representation of the flow to the port engine after a step change in the set-point from zero to 80% of the maximum flow allowed on this line. It can be noticed that although small oscillations continue to occur until approximately 12 seconds after the step change, the deviation between the new set-point and the actual flow is reduced to less than 3% in less than six seconds. The overall control scheme is composed of a number of such loops, the role of which is to distribute the fuel demand (x) equally among the fuel tanks. Although the level of fuel in the system is controlled only indirectly, our experimental results show that, over a complete run of the system (from a fully refuelled system to an empty one), the level in each tank remains approximately equal to that in every other tank. This is illustrated in Figure 4.19, which shows the results of a typical such experiment, in the course of which the fuel demand (x) from the engines was changed twice. The figure plots the levels of fuel in the tanks over time, and shows that only small discrepancies between those levels develop (and slightly increase) with the progression of time.

[Figure 4.18 plot: flow/flow_max against time (sec); after a step change of the set-point to 0.8, the flow settles to within ±3% of the set-point in under 6 seconds.]
Figure 4.18. The effect of PI control on the line that feeds the port engine
[Figure 4.19 plot: normalised tank levels (level/level_max) over time; the two set-point changes are marked, and the levels of the front, central and remaining tanks stay approximately equal throughout the run.]

[Table fragment: "… > 1.03x, for more than 6 sec — Is not possible. Stop the supply to the engine (i.e. stop pump PL1)"; event F3 ("Less fuel provided to starboard engine", MORESEngine_Line, EFCF_NORMAL): inadequate fuel flow to the engine, engine flameout; the condition appears only if the control algorithm has failed to correct the problem; severity: critical; detected as "FL1>0.03 and FL1 < …".]

The probability P[x > high_limit_normalised] can be derived from tables for the standardised normal distribution using the values of … and high_limit_normalised (see also [Feller, 1971]).
Basic probability theory tells us that the latter could be minimised if there was a mechanism by which we could instruct the safety monitor to raise an alarm only if abnormal measurements persisted over a period of time28 and a number of successive readings. The mechanism that we propose for filtering spurious abnormal measurements is, precisely, a monitoring expression that fires only if it remains true over a period of time. To indicate that a certain expression (which is expressed as a logical combination of constraints) is bound by the above semantics, we use the following syntax:

T(expression, ∆t)

where:
T: an operator which, when applied to an expression, binds the expression to the above semantics.
expression: the evaluated expression, a logical combination of constraints.
∆t: the period (expressed in seconds) for which expression has to remain true in order for T(expression, ∆t) to be true. ∆t is an interval which always extends from time t−∆t in the past to the present time t.
The above mechanism can also be used for preventing false alarms arising from normal transient behaviour. Consider for example, that we wish to monitor a parameter in closed loop control and raise an alarm every time a discrepancy is detected between the current set-point and the actual value of the parameter. As we have seen in the preceding chapter, a step change in the set-point of the loop is followed by a short period of time in which the control algorithm attempts to bring the value of the parameter to a new steady state. Within this period, the value of the parameter deviates from the new set-point value. To avoid false alarms arising in this situation, we could define a monitoring expression that fires only when abnormal measurements persist over a period that exceeds the time within which the control algorithm would normally correct a deviation from the current set-point.
28
The duration of this period for a particular parameter would, of course, depend on the probability P[x>high_limit] for the given parameter and the sampling frequency of the monitor. If P[x>high_limit] equals 0.01, for example, then we should expect one spurious measurement in a hundred that can cause the monitor to generate a false alarm. The probability of five successive spurious readings, however, is P[x>high_limit]^5 = 10^-10. Since that is a very unlikely event, five successive abnormal readings should safely indicate the presence of a real disturbance. If the sampling period of the parameter is 1 sec, the period within which the monitor receives those readings is 5 sec.
We must point out that even if we employ this type of “timed expression”, persistent noise may still cause false alarms in certain situations, when for example the value of the parameter lies close to the thresholds beyond which the expression fires. As we have seen in chapter three, to prevent such alarms we need to consider the possible presence of noise in the interpretation of the sensor output. A simple way to achieve this is by relaxing the range of normal measurements to tolerate a reasonable level of noise.
5.2.4 Monitoring Deviations from Parameter Trends and from Complex Relationships Among Process Parameters Over Time

The expressions that we have discussed so far allow us to detect functional failures as momentary or more persistent deviations of parameters from intended values or ranges of such values. Such expressions are sufficient for detecting anomalous symptoms in systems where the value of parameters is either stable, moves from one steady state to another, or lies within well-defined ranges of normal values. This, however, is certainly not the case in any arbitrary system. Indeed, parameters in controlled processes are often manipulated in a way that forces their value to follow a particular trend over time. How could we detect violations of such trends? Consider, for example, our model fuel system. In normal operation, the control scheme of this system ensures that the level of fuel in every tank drops at the same constant rate (2x/7), which is defined by the current fuel demand (x) by the engines. A deviation from this constant rate may indicate a significant problem such as a structural leak in one of the tanks. To detect such a deviation, we would need to monitor the level of fuel in each tank, and more specifically to calculate the rate of change in the value of this parameter over time. If we assume that the level of fuel in a given tank is monitored by sensor ls, then in mathematical terms the foregoing deviation could be described as follows:
| d(ls)/dt − 2x/7 | > max_discrepancy

where:
d(ls)/dt: the rate of reduction in the tank level.
max_discrepancy: the maximum allowable deviation from the intended reduction rate (2x/7).
Notice that this expression captures an anomalous symptom, but cannot be used to relate this symptom to a particular cause. The expression captures an anomaly in the rate with which the level of fuel drops, but such a problem could be caused either by a structural leak in the tank itself or by anomalies in the incoming/outgoing flows. Perhaps a better way to detect a structural leak is to check for a violation of the mass balance equation that relates the reduction of volume and input/output flows in the tank. Consider, for example, the central tank of the model fuel system that is illustrated in Figure 5.4. In normal operation, two pumps draw fuel continuously from the tank at rate x/7 each. Sensor TC monitors the level of fuel in the tank while sensors FR3 and FL3 monitor the two outgoing flows. In this configuration, a structural leak could be detected using an expression that relates the reduction of fuel in the tank to the volume of fuel that has actually flowed through the two outputs of the tank over a certain period of time. Mathematically, this can be expressed as follows:

K{TC(0) − TC(t)} > ∫₀ᵗ {FR3(τ) + FL3(τ)} dτ + max_discrepancy

where:
TC(t): the level of fuel in the tank at time t.
K{TC(0) − TC(t)}: the reduction of fuel volume in the tank between time 0 and t.
∫₀ᵗ {FR3(τ) + FL3(τ)} dτ: the actual volume of fuel that has flowed through the two outputs of the tank between time 0 and t.
max_discrepancy: a (small) allowable discrepancy between the two above terms.
Let us now introduce a set of syntactical constructs that could allow analysts to specify such expressions in the safety case, and in real time, the event monitor to capture the effects of failures on process parameters as deviations from stable trends or from more complex relationships among process parameters over time. This set is composed of the following three syntactical constructs:
Figure 5.4. The central tank of the fuel system
Sensor_id(∆t)

where:
Sensor_id is a sensor identifier recognisable by the safety monitor, and Sensor_id(∆t) is the value returned by the sensor at time t−∆t, where t represents the present time.

D(expression, ∆t)

where:
D is a differentiation operator which, when applied to a mathematical expression over time ∆t, returns the average change in the value of the expression during an interval which extends from time t−∆t in the past to the present time t.
expression is a mathematical function (+, -, *, /, **, abs) over one or more parameter values and constants.

I(expression, ∆t)

where:
I is an integration operator which, when applied to a mathematical expression over time ∆t, returns the value of the integral of expression during an interval which extends from time t−∆t in the past to the present time t.
In real-time, an expression that falls in the first category instructs the event monitor to reference a historical value of a given parameter. On the other hand, expressions that fall in the other two categories are interpreted as commands for calculating a particular function (differential or integral) over a sequence of recent historical values. The result indicates the trend in the value of the parameter (or parameters) in consideration over a period of time that always extends from a defined moment (t-∆t) in the past to the present time (t).
5.2.5 The Syntax of Monitoring Expressions

Having examined the various facets of the monitoring problem and having defined a number of mechanisms for the detection of complex symptoms and for the filtering of spurious alarms, we are now in a position to define an appropriate syntax for monitoring expressions. Figure 5.5 provides this syntax in Extended Backus-Naur Form (E-BNF). According to the syntax, a monitoring expression can be either a constraint or a logical combination of more than one constraint. Constraints can be constructed by placing two simple expressions in an equality or inequality relationship.
expression    ::= constraint {and constraint} | constraint {or constraint}
constraint    ::= simple_exp [relational_op simple_exp]
simple_exp    ::= term {(+ | -) term}
term          ::= factor {(* | /) factor}
factor        ::= basic [** basic] | abs basic | not basic
basic         ::= constant | sensor_id [(∆t)] | event_id
                | T(expression, ∆t) | D(expression, ∆t)
                | I(expression, ∆t) | (expression)
relational_op ::= = | /= | < | <= | > | >=

where

sensor_id     ::= sensor identifier
event_id      ::= identifier of an event in the safety case
constant      ::= a constant value (integer or real)
∆t            ::= time interval (real)

Figure 5.5. The grammar of monitoring expressions
Such (simple) expressions are composed of a series of terms, which are combined with addition or subtraction operators. In turn, those terms are composed of series of factors, which are connected with multiplication or division operators. Factors contain the basic elements of the grammar in exponentiation or other relationships. Finally, those basic elements can be constants (numbers), references to the current or historical value of a monitored parameter, references to (the state of) other events in the monitoring model, timed expressions, and derivatives or integrals of expressions over time. The above syntax allows analysts to specify (and the on-line monitor to detect) a wide range of effects that failures may have on process parameters. The syntax was sufficient, for example, for describing the monitoring operations that we wished to perform on the model fuel system. However, we can envisage at least one area where this syntax could usefully be extended. Mathematical operations are currently limited to those implied by the arithmetic and exponentiation operators. Clearly though, the syntax could be extended to allow more complex statistical and mathematical functions upon current or historical values of parameters. Such functions could increase the ability to calculate complex trends and would potentially improve the capacity of the monitor to reason about the past or even the future29 of disturbances. However, we leave such issues for further work and turn now to examine how the on-line event monitor performs the primary detection of symptoms of failure and recovery in a system, by evaluating monitoring expressions that conform to the syntax that we have defined in this subsection.
5.2.6 The Architecture of the Event Monitor

Figure 5.6 gives a simplified pseudo-code representation of the event monitor and illustrates the data structures upon which the monitor operates. The event monitor is a periodic task, which is always called at the beginning of each execution cycle. In each cycle, the monitor first queries the sensory interface and gets the current values of the monitored parameters. The value of each parameter is stored either in an isolated variable or in a more complex data structure that also contains historical values for the given parameter. The form that we have used for storing such values is the ring buffer illustrated in Figure 5.7.
/* cyclic task called by the kernel every ∆t */
events *event_monitor() {
    update_sensor_buffers();
    occurred = nil;
    for each event e in active
        if evaluate(e.expr)
            add(e, occurred);
    return occurred;
}

occurred: the list of events that have occurred in the current cycle of the monitor.
active: the list of active events; this list is dynamically updated by the event_processor().
expression_buffers: ring buffers that carry the historical values of timed expressions.
sensor_buffers: the current and historical values of sensors are kept in ring buffers, which are updated in each cycle of the monitor.

Figure 5.6. The event monitor
29
That could be achieved, for example, by projection of calculated current trends.
[Figure 5.7 diagram: a ring of m+1 cells; at the end of cycle k the ring holds the current value vk and the historical values vk-1 … vk-m, and in cycle k+1 the new value vk+1 overwrites vk-m, the oldest element.]
Figure 5.7. A ring buffer
The figure shows a generic ring buffer that contains m+1 elements. At the end of an arbitrary monitoring cycle k, this data structure carries the current measurement vk and the values vk-1 .. vk-m of the parameter in the preceding m cycles of the monitor. In other words, this buffer always holds the current value of a parameter and its m most recent historical values. This is achieved because, in every cycle of measurement, the new value of the parameter replaces the oldest value in the data structure. This mechanism is illustrated in Figure 5.7, where we can see that, in cycle k+1 of the monitor, the new value vk+1 of the parameter replaces vk-m, the last (and oldest) element in the ring. The size of a ring buffer is decided during the interpretation of the monitoring model. At this stage (see Figure 5.1), the parser checks whether monitoring expressions reference historical values of the parameter, and the length of time that such references extend into the past, in order to determine the type30 and size of the data structure required for storing those values in real-time. In real-time, the event monitor updates the buffer and gains access to its contents through a pointer to the last value inserted in the ring. As the location of this value shifts along the periphery of the ring in every cycle of measurement, the pointer is redirected accordingly so that it always points to the most recently stored measurement. Having updated its current and historical view of the controlled process at the beginning of each cycle, the monitor then proceeds to its main goal: the monitoring and detection of events that represent symptoms of failure or successful recovery in the system. The set of monitored events is defined at any time by the current functional and structural state of the system, and changes as the system (or a sub-system)
30
Single variable or ring buffer.
moves from one state to another. In implementation terms, this set is represented as a linked list (list of active events in Figure 5.6). As we will see, the contents of this list are dictated by the event processor, and change every time this component of the safety monitor detects a transition into one of the mode-charts of the dynamic model. Each event in the list of active events is a structure, which, among other attributes, contains an expression that the event monitor can evaluate drawing information from the updated image of the monitored process. During the interpretation of the monitoring model, the parser translates each monitoring expression into a tree structure, which contains the components of the expression and their relationships in a way that enables the evaluation of the expression by a straightforward traversal of the tree. In real-time, the function that performs the evaluation (evaluate(expr)) receives a pointer to the root node of the tree and then initiates a depth first traversal of the given tree structure. In each step of the traversal it performs one of the specified mathematical or logical operations between components of the expression, and at the end of the process returns the result of the evaluation. If this result is true the monitor adds the event to a list of occurred events. On the other hand, if it is false the monitor simply continues and evaluates the next expression in the list of active events. Having traversed the list, the monitor terminates by returning a pointer to the list of occurred events. At this point, that list is either empty (if the monitor has not detected any events in the current cycle) or it contains one or more entries that represent detected symptoms of failure or recovery. While this operational scheme seems plausible, it relies on the assumption that an expression can always return a true or false value. 
Imagine, though, a situation where a component of an expression, an integral for example, references historical values that extend for a time ∆t in the past. Consider also that history for the monitor commences at the beginning of the first cycle when observations of the process are recorded for the first time. Inevitably, for a period that extends from time zero to time ∆t, the component references prehistorical (unrecorded) values. But when the value of a component is unknown for a period of time, how could the monitor evaluate (to true or false) an expression that takes into account this value? And would it be sensible to say that, in such circumstances, the logic of the expression simply cannot be determined?
5.2.7 Evaluating Expressions that Contain Components with Unknown Truth Values

Consider, for example, a vessel that contains gas at certain conditions of temperature and pressure. The following monitoring expression can detect a disturbance in this system from two alternative anomalous symptoms.
Temperature>High or D(Pressure,10)>Max
The first of those symptoms is a high temperature measurement, while the second is an abnormally high rate of increase in the gas pressure calculated over a period of 10 seconds. The latter, of course, can be determined only after the completion of 10 seconds from the start of monitoring. Let us now assume that the constraint Temperature>High fires 5 seconds after the first monitoring cycle, while the other constraint cannot yet produce a logical value. The question here is whether, under such circumstances, the expression should produce a true value immediately. Examining this question from a practical point of view, we would be inclined to say that, since the required response time to the disturbance may be shorter than 5 seconds, the expression should fire immediately, before its second component is in a position to produce a value. This strategy, however, raises a few theoretical issues that we need to address. Firstly, we need to define precisely what we mean by the concept of an unknown truth value and what happens if we allow components of a compound expression to have such values. Secondly, we must specify how the truth value of a compound expression depends on the (known or unknown) truth values of its components and the logical connectives that have been used in the synthesis of the expression. But let us start this discussion by clarifying the concept of an unknown truth value. In logic, a proposition is a sentence that can either be true or false. However, knowing the actual truth value of the sentence is not a prerequisite for accepting a sentence as a valid proposition. The statement “there are life forms on other planets in the universe” provides an example of a valid proposition that has an unknown truth value. Indeed, we know that the statement is either true or false, but our current knowledge prevents us from deciding which. But how is this fact relevant to the particular problem that we encounter here?
Simply, it is important in our view to clarify that when we talk about components of an expression that have unknown truth values, we
in fact refer to valid, logical propositions with a truth value that the event monitor cannot yet determine. But if such components represent valid logical propositions, then the result of logical operations among those components could be determined from the standard truth tables that describe the effects of negation, conjunction and disjunction operations on, or between, logical propositions. In Figure 5.8, we summarise the possible results of basic logical operations between known and unknown truth values: x and y here represent unknown truth values (x,y∈{true,false}), while z gives the result of the logical function (i.e. negation, conjunction or disjunction). Our first observation is that given the type of logical function and the types of the operands, it is always possible to decide whether the truth value of the result (z) can be determined or not. In the majority of cases, this evaluation is impossible, and the truth value of z remains unknown. There are two cases however (cases 3 and 6) where z has a known true or false value despite the fact that one of the operands is an unknown truth value. In Figure 5.8 we have effectively recorded a set of rules that one could use for evaluating compound expressions that contain unknown truth values. In fact, this is precisely how the event monitor evaluates monitoring expressions of this type. During the traversal of the expression, and whenever it encounters a basic logical operation in which one of the components has an unknown truth value, the monitor applies one of those rules to determine the result of this operation. In computational terms, the three possible results of a logical operation that involves an unknown truth value (true, false or unknown truth value) are represented with three different numerical values (1,0,-1 respectively).
Negation
(1) negation of an unknown truth value: z = not x ⇒ z is unknown

Conjunction
(2) unknown truth value and true: z = x and true ⇒ z = x ⇒ z is unknown
(3) unknown truth value and false: z = x and false ⇒ z = false
(4) unknown and unknown: z = x and y ⇒ z is unknown

Disjunction
(5) unknown truth value or false: z = x or false ⇒ z = x ⇒ z is unknown
(6) unknown truth value or true: z = x or true ⇒ z = true
(7) unknown or unknown: z = x or y ⇒ z is unknown

Figure 5.8. Basic logical operations between known and unknown truth values
This mechanism enables the monitor to propagate unknown truth values and, ultimately, to compute the truth value of compound expressions from the known or unknown truth values of their constituent components. As we have seen, two rules applied in the course of this computation (rules 3 and 6 in Figure 5.8) are particularly significant, since they allow compound expressions to take true or false values even when some of their components have unknown truth values31. In practice, it is precisely those rules that allow the monitor to produce early alarms on the basis of incomplete process data without violating the logic specified in monitoring expressions.
5.2.8 Evaluating Timed Expressions

One outstanding issue is the mechanism by which the monitor evaluates “timed expressions”. These are expressions of the type T(expression, ∆t), where expression represents a logical combination of constraints and ∆t is the interval for which expression has to remain true in order for T(expression, ∆t) to fire. In Figure 5.6, we can see that each monitored event that carries such an expression also has a ring buffer that the monitor can use in real-time for storing historical values of the expression. The size of this buffer depends on the length of the interval ∆t, and it is mechanically decided during the interpretation of the monitoring model. In each monitoring cycle, whenever the monitor encounters a timed expression during the traversal of the active list, it calculates the truth value of expression and stores the new value in the ring buffer. To calculate the truth value of the expression over time, the monitor simply performs a conjunction of all the values contained in the buffer. If this operation returns false or an unknown32 truth value then the monitor takes no action, while if the result is true the monitor adds the event to the list of occurred events. As we have seen, this mechanism allows the safety monitor to prevent false alarms that could otherwise be caused by spurious abnormal measurements or normal transient behaviour.
31
Those rules simply reflect two basic implication relationships in propositional logic: A∧0 ⇒ 0 and A∨1 ⇒ 1.

32
An unknown truth value will always be returned, for example, before ∆t has elapsed since the beginning of the monitoring process.
5.3 Fault Diagnosis

We have examined how the event monitor detects events that represent symptoms of failure on the controlled process. One limitation of the monitor is that those events describe disturbances in terms of their symptoms on process parameters and not in terms of the underlying causal faults. Symptoms, though, do not always provide a clear and unambiguous picture of the plant status in the presence of failures. Indeed, as we have seen in chapter 2, the effective control of hazardous failures often requires the on-line analysis of disturbances and the diagnosis of the root causes of failure. In this section we ask whether the models contained in an “electronic” safety case can assist on-line fault diagnosis, and if so, how. One of the components of the safety case is a collection of fault trees, which record in a formal, logical way the process by which combinations of low-level component failures propagate and cause hazardous events at sub-system level. In this chapter, though, we have seen that those hazardous events are precisely the symptoms of failure that the event monitor observes in real-time. The fault trees for those events, therefore, could support the diagnostic task; in other words, they could help in isolating the root causes of the anomalous symptoms detected by the event monitor. A simplified example from the model fuel system will help us demonstrate the diagnostic application of fault trees. Earlier we saw that a principal objective of the fuel system is to maintain an optimal centre of gravity, and that this objective is achieved by a control scheme which ensures that tanks always have the same level of fuel. Let us now consider two symmetrical wing tanks of the fuel system, called A and B, as illustrated in Figure 5.9. Pumps PA1 and PB1 feed the front and rear tanks of the aircraft at the same constant feed rate. Valves VA2 and VB2 are closed and prevent fuel leaking to the jettison pipes.
[Figure 5.9 diagram: tanks A and B with level sensors LA and LB; each tank feeds the front and rear tanks through a valve (VA1, VB1), a flow meter (FA1, FB1) and a pump (PA1, PB1), and has a jettison line through a valve (VA2, VB2), a flow meter (FA2, FB2) and a pump (PA2, PB2).]

Figure 5.9. Two symmetrical wing tanks of the fuel system
Let us now assume that, during a flight, the event monitor confirms that a significant asymmetry (greater than 3%) between the levels of the two tanks has persisted over a period of 5 seconds (i.e. the monitor evaluates that T(|LA−LB|>0.03, 5) is true). Part of the fault tree that corresponds to this event is illustrated in Figure 5.10. The tree models how a number of alternative root causes potentially generate an asymmetrical load condition between tanks A and B. One approach to the diagnosis of this event would be to derive the minimal cut-sets of the tree and attempt to evaluate those cut-sets in real-time. To locate the actual causes of the problem, the diagnostic system would have to serially evaluate the monitoring expressions contained in those cut-sets, until at least one cut-set returns a true value. The failure modes contained in that cut-set would represent the root causes of failure. The difficulty with this strategy, however, is that, as the size of the tree and the number of cut-sets increase, it results in a computationally very expensive (and inefficient) diagnostic process.
[Figure 5.10 (fragment): fault tree for the event "Imbalance A,B", detected as T(|LA−LB|>0.03, 5). One branch, "Fuel in B more than in A" (verified as LB>LA), is omitted for simplicity; the other, "Fuel in A more than in B" (verified as LA>LB), develops through "A overloaded" and "B under-loaded" into causes such as "More OUT flow" (verified as T(FB1>Fnormal+0.03, 2); failure mode: PB1 performing more), "No OUT flow" (failure mode: VA1 stuck closed) and "Less OUT flow".]
[Figure 5.15 (fragment): event tree for sensor validation, with yes/no branches leading to diagnoses such as "F1 indication is true, deviation verified", "F2 failure — F1 indication is true, deviation verified, schedule F2 maintenance" and "F1 failure — sensor failure, no deviation, use F2 in future measurements, schedule F1 maintenance".]
Figure 5.15. An event tree for sensor validation
The first criterion requires the values reported by F1 and F2 to be close and within a region of acceptable tolerance, while the second requires that the indications of F1 and Sp satisfy the speed/flow characteristics of the pump. The event tree which models the validation process shows that if the first criterion is satisfied, then either both sensors have failed, or both have correctly detected a deviation. Since the sensors are of different technology, we can safely assume that we have minimised the probability of a common cause failure. Thus, the monitor could safely conclude at this point that both sensors have correctly reported an abnormal event (e.g. the pump motor shaft has broken and disturbed the flow in the line). On the other hand, if the first criterion is not satisfied, then one of the sensors must have failed. The monitor would then need to examine which sensor has failed. If F1 satisfies the second criterion, we can conclude that it functions properly, and that, therefore, it has correctly detected an abnormal event. In that case F2 must have failed, and a record of this could be made for maintenance. If, however, the second criterion is not satisfied, then we must assume that F1 has failed. In that case, a hypothetical monitor would be redirected to use F2 measurements in the future, and the observability of the plant would be restored. We must point out that this type of event tree analysis does not arise naturally from the safety case. These event trees do not record the propagation or mitigation of failure following an initiating hazardous event. They are in fact diagnostic trees which model the sensor validation processes that should be executed by the safety monitor (and the control system). Thus, system analysts would not normally consider constructing such event trees during the safety assessment. In our view, though, this type of analysis could add some value to the assessment.
In the course of the analysis, one is encouraged to examine the sensory scheme for each monitoring point in the system, and to assess how sensor failures affect the integrity of the monitored data. We, therefore, believe that the analysis can help to improve the sensory interface of the system and the safety monitor. In real-time, the validation process could be executed by a module similar to the diagnostic engine. By evaluating sensor validation criteria, this module would be able to determine the path in the event tree that leads to the correct diagnosis.

At this point, we have completed the theoretical exploration of the monitoring problem and the presentation of the algorithms of the safety monitor. We shall now proceed to discuss the monitoring experiments that we have performed on the model fuel system. Drawing from those experiments, we will show how the safety monitor has helped us to improve our understanding of the mechanics of failure in this system, and
how, in real-time, this system has increased our capacity to detect and control hazardous failures. We shall also discuss the difficulties that we have encountered in the course of those experiments and, through this discussion, we will attempt to identify the limitations of our approach to safety monitoring.
5.6 Experiments

Figure 5.16 illustrates the configuration of our experimental platform.
[Figure content: the fault injector sends a schedule of injected failures (e.g. t=5 VR1 (valve) stuck closed; t=5 PL1 (pump) failed; t=10 VL1 (valve) commission of open command) to the fuel system simulator, while physical fault injection (leaks) acts on the hardware system. The simulator computes line flows as Fx = (e^(k·ΣPx) − 1)/(e^k − 1) and tank levels as Ly(t) = Ly(0) − ∫₀ᵗ ΣFx dτ. The safety monitor, comprising an event monitor, an event processor and a diagnostic engine, receives process feedback through the operator interface, authorises corrective measures, and reports alarms, diagnostics, effects of failure on system function and guidance on corrective measures. The monitoring model, i.e. the safety case as an executable specification exported from the safety analysis tool, is loaded into the monitor through a parser.]

Figure 5.16. The experimental platform
The configuration contains two secondary, but important, elements that have not yet been introduced: the fuel system simulator and the fault injector. The simulator is an executable software model of the hardware fuel system that is illustrated in the embedded photograph. This model played an important role in validating the safety case prior to its connection to the physical system. The simulator receives the current fuel demand (x), the failures injected in the simulation, and the corrective measures authorised by the safety monitor. Based on this information, the module then calculates the flows and tank levels in the system. The flow in a line is calculated by taking into account the connectivity of the line34 and the current settings of the pumps that generate the flow. Assuming that the line is connected to a tank that contains fuel resources, the value of the flow in the line is calculated as a non-linear function of the current pump setting. The non-linear character of this function causes a delay in the correction of transient disturbances of system flows and is precisely the mechanism that keeps the closed loop control algorithm in operation during the simulation35.

Having determined the values of flows in the system, the simulator can then calculate the level of fuel in each tank. This is achieved by solving the mass balance equation that relates the level in a tank to the volume of fuel that has flowed into and out of the tank since the beginning of the simulation. The model also allows the addition of noise to the calculated values of flows and levels. The magnitude of this noise follows a normal probabilistic distribution, the parameters (µ,σ) of which are defined by the user. The presence of random noise introduces an element of unpredictability into the simulation, which, in practice, we have used for testing the behaviour of the monitor in the presence of noise and its response to transient sensor failures.
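As a rough illustration, the flow and level calculations described above might be sketched as follows. The exponential form of the flow function is one plausible reading of the expression fragments visible in Figure 5.16, i.e. F = (e^(k·ΣP) − 1)/(e^k − 1); the constant k, the function names, and the Euler integration step are all assumptions.

```python
import math

def line_flow(pump_setting, k=2.0):
    """Non-linear flow in a connected line as a function of the (combined)
    pump setting, normalised so that setting 0 -> flow 0 and setting 1 ->
    flow 1.  The curve is convex, so small pump corrections produce less
    than proportional flow changes -- the delay that keeps the closed loop
    algorithm in operation.  (Hypothetical reconstruction of Figure 5.16.)
    """
    return (math.exp(k * pump_setting) - 1.0) / (math.exp(k) - 1.0)

def update_tank_level(level, net_outflow, dt):
    """One Euler step of the mass balance equation
    Ly(t) = Ly(0) - integral of the sum of outgoing flows."""
    return level - net_outflow * dt
```

The convexity of `line_flow` is what prevents a purely proportional correction term from instantly cancelling a flow deviation, as noted in footnote 35.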
The second new element in the configuration of Figure 5.16 is the fault injector. The role of this module is to disturb the simulation or the real physical process. The faults that we wish to inject, as well as the moments in time at which those faults should be introduced into the process, form a fault injection schedule, which is provided to the module in the form of a script. By translating this script at the beginning of the experiment, the module generates a set of commands, which it then sends to the simulator or
34. In other words, whether there is an open path to a tank which currently has fuel resources.
35. Indeed, if the flow value were proportional to the pump value, then the proportional correction term would be sufficient to correct instantly any deviation between the flow set-point and the actual flow.
the system. Those commands transform the behaviour of elements in the system and affect the real (or simulated) process. Table 5.1 defines the types of fault that can be injected through this mechanism. Those faults include transient and permanent component (pump or valve) failures, transient or permanent mutations of the data that control those elements, and fuel leaks. It can be noticed that the introduction of leaks through this mechanism is confined to the simulated version of the system. Leaks, however, can also be introduced into the tanks of the physical system via valves that can be mechanically operated by humans (see Figure 5.16).

Table 5.1. The failure modes that can be injected into the fuel system (and its simulator)

PUMPS
- PX fail stop (Permanent): PX stops due to electromechanical failure or because its control value is stuck at zero.
- PX reversed (Permanent): PX pumps in the opposite direction; the pump wiring (or the sign of the control value) is reversed.
- PX biased ±Y% (Permanent): The control value of PX is positively or negatively biased. Y gives the magnitude of the bias as a percentage of the maximum control value. This failure mode limits the range of control that the closed loop algorithm has over PX; the magnitude of Y defines a range of normal flow values that the algorithm will fail to achieve.
- PX stuck at Y (Permanent): The speed of PX (or its control value) is stuck at a certain value Y. The pump does not respond to the control.
- PX no start (Permanent): PX will fail to start when it first receives a start command.
- PX no stop (Permanent): PX will fail to stop when it first receives a stop command.
- PX ok (Transient): This is not a failure mode; it is a command that returns a faulty pump to its normal state. It can be used in conjunction with other commands for the injection of transient failures.

VALVES
- VX stuck open (Permanent): VX is stuck at the open position.
- VX stuck closed (Permanent): VX is stuck at the closed position.
- VX stuck half open (Permanent): VX is stuck between the open and closed positions.
- VX commission close command (Transient): Inadvertent closure of a valve due to a commission of the valve close command.
- VX commission open command (Transient): Inadvertent opening of a valve due to a commission of the valve open command.

TANKS
- TX leak Y% (Permanent): A permanent leak of fuel from TX. Y gives the rate of the leak as a percentage of the maximum normal flow that can be achieved in the system. This failure mode can only be introduced in the simulator (not the physical system).
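The translation of a fault-injection script into timed commands could be sketched as follows. The line syntax ("t=<seconds> <component> <failure mode>") is modelled on the schedule shown in Figure 5.16, but the exact format, and the function name, are assumptions for illustration.

```python
def parse_injection_schedule(script):
    """Translate a fault-injection script into a time-ordered command list.

    Each line is assumed to read 't=<seconds> <component> <failure mode>',
    e.g. 't=5 VR1 stuck closed'.  Returns (time, component, mode) tuples,
    sorted so that commands can be issued to the simulator (or the physical
    system) in time order.
    """
    schedule = []
    for line in script.strip().splitlines():
        time_field, component, *mode = line.split()
        injection_time = float(time_field.removeprefix("t="))
        schedule.append((injection_time, component, " ".join(mode)))
    return sorted(schedule)
```

At run-time, a driver loop would compare the current simulation clock against the head of this list and dispatch each command when its time arrives.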
5.6.1 Validating the Safety Case

The safety monitor, in conjunction with the simulator and the fault injector, played an important role in validating the system safety case. By connecting those three modules, we were able to inject failures in the simulation and monitor the effects of those failures. This, in turn, gave us the capacity to verify the effects predicted in the safety case. Perhaps more importantly, the discrepancies that we have identified between the observations of the monitor and the predictions encoded in the safety case have helped us to locate (and correct) omissions and subtle errors in the latter.

To illuminate this point, we will return to the fragment of the safety case that we discussed in section 4.5.5, where we examined the way in which the mode-chart of the fuel system is constructed from the mode-charts of its constituent parts. There, we saw that a transition of the engine feeding subsystem (EFCF) to its cross-feeding mode causes a temporary fuel imbalance at system level, and that this imbalance could be corrected by redirecting the two fuel flows in the central sub-system. More specifically, we saw that the action required in those circumstances is changing the set-points of the loops that control those flows from {x/7,x/7} to {–6x/7,8x/7} (see Figure 5.17). At this point, we notice that although the assessment of the initial failure condition occurs in the mode-chart of the fuel system, the suggested corrective measures also imply a change in the functionality of the central sub-system. A potential hazard here is that analysts may simply omit to assess the impact that this change has on the mode-chart of the central sub-system. Assume, for example, that this mode-chart was initially constructed on the hypothesis that the system always delivers fuel at rates {x/7,x/7}, and that the potential change in the behaviour of this system has not been registered in the mode-chart.
Imagine now that during a simulation the engine feed sub-system fails, and that in response to this failure the operator (or the monitor) changes the direction and level of flows in the central sub-system.
[Figure content: top, the initial configuration of the central sub-system, with flows x/7 and x/7 through lines FL3 and FR3, driven by pumps PL3 and PR3 from the central tank TC; bottom, the configuration after the application of corrective measures, with flows −6x/7 and 8x/7.]

Figure 5.17. The two configurations of the central sub-system
In those conditions, the safety monitor will continue to observe the sub-system using the incorrect assumptions that we have encoded in the safety case as a result of our omission to update the mode-chart of the sub-system. Inevitably, soon after the change in the functionality of the system, the monitor will raise one or more false alarms. Continuing to assume, for example, that all flows should be directed from the central tank to other parts of the system, the monitor will perceive the flow from the rear towards the central tank as abnormal (reverse flow). In response, the monitor will raise an unexpected alarm, which, in turn, will point out the problem in the safety case. Here, we have a case where the safety monitor effectively detects an incomplete (and in fact also incorrect) model.

Let us now see how that model could be corrected during the revision of the safety case. One way to register the potential behavioural change in the mode-chart of the sub-system would be to introduce a new operational mode. If we do introduce a new mode, though, then we must also analyse the sub-system in this mode. One way to avoid this complication is to broaden the definition of the function of the sub-system. Indeed, if we redefine this function as being one of “always achieving the current target flows”, then the functional state of the sub-system would not be disturbed by external changes to the set-points of its control loops. That, precisely, is the strategy that we have adopted in this (and other) fragments of the safety case. This strategy has simplified the dynamic model and reduced the volume of the underlying safety analyses. At the same time, though, the strategy entails a few complications. The newly defined functional state of the system, for example, does not require that flows have a certain direction while the system is in this state. In a system where flows are potentially bi-directional, though, the definition of “reverse flow” inevitably becomes relative.
A “reverse flow” is no longer a flow with a certain direction, but a flow the direction of which does not agree with that implied by the current flow set-point. This condition can only be detected from a discrepancy between the sign of the actual flow and the sign of its corresponding set-point. The weakness of this detection mechanism, though, is that it will simply fail to detect any problem if the reversal of the flow has been caused by a corruption of the set-point36 itself. An essential precondition for the adoption of this strategy, therefore, should be the existence of a low-level memory protection mechanism that can ensure the integrity of (at least some of) the data that we reference in the safety case.

36. The sign of the set-point is itself reversed, for example by a failure of the memory element that holds this value.
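The relative detection rule described above can be sketched as follows. The function name and the deadband threshold are illustrative assumptions; the deadband merely suppresses spurious sign comparisons near zero flow.

```python
def reverse_flow(actual_flow, set_point, deadband=0.02):
    """Relative definition of 'reverse flow' for bi-directional lines:
    the measured flow opposes the sign of its current set-point.

    Note the weakness discussed in the text: if the stored set-point is
    itself corrupted (its sign reversed), this check cannot detect the
    reversal -- hence the need for memory protection of such data.
    """
    if abs(actual_flow) <= deadband or abs(set_point) <= deadband:
        return False  # flows near zero carry no reliable sign information
    return (actual_flow > 0) != (set_point > 0)
```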
Beyond the identification of areas where the safety case was incomplete, the safety monitor and the simulator have also helped us to:
a) identify errors in some of the more complex monitoring expressions that appear in the safety case,
b) validate the structure of the diagnostic fault trees, and
c) check the correctness of the scenarios of failure and recovery that we have anticipated in the structure of the dynamic model.
The outcome of this off-line simulation and testing process was an improved and more reliable monitoring model, which provided the basis of our monitoring experiments on the physical fuel system.
5.6.2 Detection of Failures

Our first objective in those experiments was to examine the capacity of the event monitor to detect the transient and permanent failures that we were able to inject into the system (see Table 5.1). Although those failures do not include sensor failures, sensors proved to be, in practice, the most unreliable element in the system. On many occasions, therefore, we encountered unplanned transient and permanent sensor failures, which gave us the opportunity to examine, to some extent, the response of the event monitor to this class of failures.

The monitor detects failures from anomalous symptoms on process parameters. The response of the monitor to each symptom is determined by the monitoring expression that describes the symptom. A non-timed expression, for example, fires an alarm even if the anomaly lasts only for a single cycle of monitoring. On the other hand, a timed expression triggers an alarm only if the anomaly persists for a period longer than that specified in the expression. In the fuel system, timed expressions provided an essential mechanism for filtering out the spurious measurements that were frequently caused by transient sensor failures. Figure 5.18 illustrates, for example, the response of the monitor to a transient sensor failure that occurred in a typical experiment, where our aim was to test the ability of the monitor to detect a permanent pump failure on the line that feeds the starboard engine. The figure shows that the control algorithm of the fuel system cannot distinguish sensor failures from real disturbances.
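A minimal sketch of how a timed expression such as T(FL1>1.03*x, 6) might be evaluated cycle by cycle; the class, its method names, and its state handling are illustrative assumptions, not the monitor's actual code.

```python
class TimedExpression:
    """Fires only if an anomalous condition persists for `delay` seconds,
    mirroring expressions of the form T(condition, delay).  A non-timed
    expression corresponds to delay = 0."""

    def __init__(self, condition, delay):
        self.condition = condition   # predicate over the current sensor value(s)
        self.delay = delay           # seconds the anomaly must persist
        self.first_seen = None       # time at which the current anomaly began

    def evaluate(self, now, values):
        if not self.condition(values):
            self.first_seen = None   # anomaly cleared; reset the timer
            return False
        if self.first_seen is None:
            self.first_seen = now    # anomaly just appeared; start timing
        return now - self.first_seen >= self.delay
```

With a six-second delay, a spurious measurement that lasts one or two monitoring cycles resets before the timer expires, so no alarm fires; a permanent pump failure keeps the condition true and eventually triggers the alarm.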
[Figure 5.18: flow/flowmax (0 to 1) plotted against time (sec), with annotations marking the response of the closed loop algorithm to the transient failure, the permanent pump failure, and a spurious measurement caused by transient sensor failures. The associated monitoring expressions read:]
MORE-SEngine_Line = T(FL1>1.03*x, 6) false
LESS-SEngine_Line = T(FL1>0.03 and FL1