In Proceedings of DSN’2001, the International Conference on Dependable Systems and Networks, Gothenburg, Sweden, July 2001, pages 77-82, ISBN 0-7695-1101-5.
Model-Based Synthesis of Fault Trees from Matlab-Simulink Models

Yiannis Papadopoulos, Department of Computer Science, University of York, York YO10 5DD, UK
[email protected]

Matthias Maruhn, DaimlerChrysler AG, Research and Technology, Alt Moabit 96a, D-10559, Berlin, Germany
[email protected]

Abstract
In this paper, we outline a new approach to safety analysis in which concepts of computer HAZOP are fused with the idea of software fault tree analysis to enable a continuous assessment of an evolving programmable design developed in Matlab-Simulink. We also discuss the architecture of a tool that we have developed to support the new method and enable its application in complex environments. We show that the method and the tool enable the integrated hardware and software analysis of a programmable system and that, in the course of that analysis, they automate and simplify the development of fault trees for the system. Finally, we propose a demonstration of the method and the tool and we outline the experimental platform and aims of that demonstration.

1. Introduction
The processing abilities, relatively low cost, speed and flexibility of modern computers are currently leading a number of safety-related industries (railways and automobiles, for example) to an almost inevitable transition towards programmable systems. As this transition gradually takes place, there is clearly a need to ensure that new critical systems introduced in those industries will deliver safety-related services with at least the degree of reliability with which similar services have been delivered by conventional systems in the past. Naturally, safety assessment processes evolve to deal with changes in the technology of safety-critical systems. A number of international safety-related standards are emerging, for example, which define rigorous development and assessment life-cycles in which hazard and safety analysis occupy a central position. But although safety assessment is increasingly being recognised as an important component of the software life-cycle, very little guidance exists on precisely how and when to do it in programmable systems. This, we believe, can largely be attributed to the lack of mature techniques for software hazard and safety analysis.

Recent efforts to apply modifications of classical safety analysis techniques such as Hazard and Operability studies (HAZOP) and fault tree analysis to software and programmable systems have not yet achieved the degree of maturity that would enable their successful application in complex systems. In most variants of computer HAZOP [1-3], for example, the analysis remains a predominantly manual activity, as it would have been in a classical process HAZOP performed at plant level. As systems become more complex, though, manually performed analyses become tedious, error-prone, time-consuming and, beyond a certain level of complexity, practically infeasible. To overcome this problem, other techniques, such as the Leveson template approach to software fault tree analysis [4], introduce a useful degree of automation in the assessment of software. Despite some theoretical contributions to the problem, though, no technique has yet found widespread application in complex environments. In many cases, this is due to a lack of appropriate accompanying concepts for tool support that could enable the practical evaluation of those techniques in realistic contexts of application.

In this paper, we outline a new method for safety analysis in which concepts of computer HAZOP are fused with the idea of software fault tree analysis [5-6]. The aim is to create a technique that could enable a continuous assessment of an evolving programmable design that starts at the early stages of the design lifecycle. As we will show, the method potentially automates and simplifies the development of fault trees for a programmable system. However, robust tool support is required to realise this potential for automation.
In this paper, we therefore discuss the underlying principles and architecture of a tool that we have developed to support the method and enable its application in complex environments.
The tool assists a continuous HAZOP-style analysis of a programmable system. This analysis is performed on a model of the system that is being developed in Matlab/Simulink, a well-known modelling and simulation environment (see footnote 1). At each stage in the development of that model, the tool can mechanically determine the failure behaviour of the system as a set of fault trees that define potential scenarios of failure and recovery. These fault trees show whether component failures are stopped by the present error detection and recovery mechanisms in the system or whether they propagate through system processes and cause critical functional failures of the system. Analysis of those fault trees helps to eventually establish whether the current design satisfies the given safety requirements or to identify weak areas of the design and stimulate useful design revisions.

In sections two and three of this paper we outline this approach to safety analysis and the architecture of the tool respectively, while in section four we discuss our plans on how to demonstrate, through the tool, the validity and scalability of our approach. Our plans for demonstration are based on a case study that we are currently performing on a prototypical distributed brake-by-wire and adaptive cruise control system for cars, a system which is currently being developed by a consortium of automotive companies in the context of a European Commission funded project called SETTA (Systems Engineering for Time-Triggered Architectures; see footnote 2).

Footnote 1. We must point out that the method is applicable not only to Simulink models but more generally to models that record the hierarchical structure of the system and the dependencies between components of the architecture on material, energy or data. However, the reason that we have chosen Simulink as a modelling environment is that Simulink models already exist in practice and play a useful role in the design of programmable systems. At the early stages of design, for example, such models help to define and validate through simulation the functional structure of the system. At later stages, they provide a basis for modelling non-functional properties (temporal behaviour, for example) as well as for the automatic generation of code that can be usefully employed in the actual implementation of the system. But, if Simulink models already provide a useful resource in the design of the system, why not use those models as a basis for our approach to the safety analysis of a programmable system?
2. Modelling and Safety Analysis
Figure 1 provides an overview of the relationship between modelling and safety analysis in the proposed approach. The left hand side of the figure shows the model that provides the basis for the analysis of the system. This model describes the hierarchical decomposition of the system into composite and basic elements that communicate through material, energy or data flows. At the early stages of the design, this model is typically an abstract representation that provides a purely functional description of the system. Composite elements of the model at this stage typically represent functions that are decomposed into networks of lower level subfunctions. As the model is enriched with architectural information about the allocation of functions to hardware, composite elements in the model start to represent programmable entities, processors for example, enclosing the tasks running on those processors. While this decomposition process evolves, the model always remains a consistent hierarchical representation of the system that progressively records, with increasing detail, the implementation of the system.

Footnote 2. IST contract number 10043. The SETTA consortium consists of the following partners: DaimlerChrysler, Renault, Siemens, Alcatel Austria, EADS Airbus, Dependable Computer Systems, TTTech, University of York, Technical University of Vienna.

To analyse this model at a particular stage of the design, the failure behaviour of each component in the model has to be determined using a form of computer HAZOP. During the application of this technique to a component, each output of the component is systematically examined for potential deviations from the intended normal behaviour. The specific failure modes of each output are determined as the behaviour of the output is scrutinised for potential deviations that may fall into one of the following three categories of failure:
(A) service provision failures, such as the omission or commission of the output;
(B) timing failures, such as the early or late delivery of the output;
(C) failures in the value domain, such as the output value being out of range, stuck, biased, or exhibiting a linear or non-linear drift or erratic behaviour.

The result of the analysis is a model of the local failure behaviour of the component under examination. This model is represented as a table and provides a list of component failure modes as they can be observed at the component outputs (see figure 2 for an example). For each identified output failure, the analysis determines the causes of that failure as logical expressions that contain internal malfunctions of the component (see the Component Malfunction Logic column in the table of figure 2) or deviations of the component inputs (see the Input Deviation Logic column in figure 2). For each internal malfunction, an estimated or experimentally derived failure rate (λ) can also be specified and later be used for reliability evaluation purposes.

One important attribute of that analysis is its local nature. Indeed, the analysis is always confined within the component I/O interface, a characteristic which renders the results re-usable in the same application or even across different applications, perhaps after some minor modifications to reflect the effect of a different environment.

A second important attribute of the technique is that, during the analysis, designers and analysts are encouraged to examine the interactions of the component with other components in its periphery. Two important questions, for example, are typically addressed:
a) Does the component under examination respond to all the failures propagated by other components further upstream in the model?
b) Are the failures generated or propagated by the component under examination handled by other components further downstream?
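Once the local analyses are recorded, questions (a) and (b) can in principle be checked mechanically across each connection. The following Python fragment is only a minimal sketch of such a check. The `Component` layout, the port names and the failure classes are hypothetical and are not taken from the tool described in this paper.

```python
# Hypothetical data layout; component, port and failure names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    # Failure classes the component can exhibit at each output port.
    output_failures: dict = field(default_factory=dict)
    # Failure classes that its local analysis considers at each input port.
    handled_input_failures: dict = field(default_factory=dict)


def unhandled_deviations(upstream, out_port, downstream, in_port):
    """Failure classes emitted upstream that the downstream analysis never mentions."""
    emitted = upstream.output_failures.get(out_port, set())
    handled = downstream.handled_input_failures.get(in_port, set())
    return sorted(emitted - handled)


sensor = Component("sensor", output_failures={"speed": {"Omission", "Wrong", "Late"}})
controller = Component(
    "controller",
    handled_input_failures={"speed_in": {"Omission", "Wrong"}},
)

gaps = unhandled_deviations(sensor, "speed", controller, "speed_in")
print(gaps)  # ['Late']
```

In this made-up example the late delivery of the sensor output is not addressed by the controller analysis, which is the kind of gap that the two questions are meant to expose.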
By addressing such questions, the analysis effectively assists the systematic improvement of the failure detection and recovery mechanisms in the system. At the early stages of the design, for example, the global application of the technique can help analysts identify hazardous functional dependencies such as those caused by shared material, energy or data flows between different functions. Such dependencies also include cascading cause-effect relationships through which an initial functional failure causes one or more undesirable functional failures further downstream in the model. Although the early identification of such dependencies is helpful, we must also point out that the hazard analysis of a model can only be complete when information about resource allocation and the architecture of the system has been added to that model. It is only then, for example, that the analysis can capture hazardous dependencies between (seemingly independent) functions that can be caused by shared architectural resources (e.g. processors) in the system. One issue that is worth exploring further at this point is the application of that technique to programmable components. Indeed, it is perhaps easy to imagine how this type of hazard analysis could be performed on hardware components such as sensors and actuators (more details on the basic application of this technique and a case study can be found in [5]). What happens, though, when the object of the analysis is a
controller implemented as a number of tasks running on some programmable hardware? How could we determine the local failure behaviour of that component and how could we take into account the contribution of both hardware and software in this type of hazard analysis?

Figure 2. Hypothetical component and hazard analysis
Component examined: input_1 and input_2 arrive from other components further upstream in the model; the output feeds other components further downstream in the model.
Output Failure Mode | Description | Input Deviation Logic | Component Malfunction Logic
Omission-output | The component fails to generate the output | Omission-input_1 AND Omission-input_2 | Jammed OR Short_circuited
Wrong-output | The component generates wrong output | Wrong-input_1 OR Wrong-input_2 | Biased
Early-output | Early output | … | …
Failure rates λ (f/h) listed in the figure for the internal malfunctions: 5x10^-7, 6x10^-6, 6x10^-8.

Figure 3 illustrates a general concept for the representation and analysis of programmable hardware. The figure shows a programmable component modelled as a composite element that encapsulates a network of software modules (tasks). The hazard analysis of such a component could be performed as follows.

At the level where the component is represented as a composite element (higher level in the model), we examine and record the direct effect of hardware failures on the outputs of the component. This makes sense, since hardware is typically a common resource shared by all the functional (software) modules of the component, and therefore a hardware failure will typically impact all software modules. A failure of a processor, for example, will often cause an omission of all the outputs of a controller. It therefore makes sense to examine hardware failures separately, and in a direct and collective fashion.

At the level where the functional structure of the component is described (lower level in the model), we perform a hazard analysis of each task using the same technique that we have described in this section. The analysis at this level records how each task responds to omission, commission, timing or value failures propagated by other tasks, and how possible internal logical defects in the implementation of each task could affect the outputs of the task. Collectively, the analyses of all tasks show how the software of the controller responds to failures arriving at the controller inputs and how input failures or possible logical errors in the design of that software may propagate and ultimately corrupt the controller outputs.
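The two levels of analysis described above can be combined per controller output: the causes recorded at the hardware level are placed under an OR gate together with the causes found by the task-level analysis. The Python fragment below is a minimal sketch of that combination. It assumes that failure expressions are stored as nested tuples, and the function name, the event names and the dictionary layout are all hypothetical.

```python
def combine_levels(hardware_causes, task_level_causes):
    """For each controller output failure, OR the shared hardware causes
    with the expression produced by the task-level analysis."""
    combined = {}
    for output_failure, task_expr in task_level_causes.items():
        hw = hardware_causes.get(output_failure, [])
        combined[output_failure] = ("OR", *hw, task_expr) if hw else task_expr
    return combined


# Hypothetical controller: a processor failure omits every output.
hardware = {
    "Omission-brake_cmd": ["Processor_failure"],
    "Omission-status": ["Processor_failure"],
}
# Hypothetical results of the task-level analyses for the same outputs.
tasks = {
    "Omission-brake_cmd": ("OR", "Omission-pedal_in", "Control_task_defect"),
    "Omission-status": "Monitor_task_defect",
    "Wrong-brake_cmd": ("OR", "Wrong-pedal_in", "Control_task_defect"),
}

result = combine_levels(hardware, tasks)
print(result["Omission-status"])  # ('OR', 'Processor_failure', 'Monitor_task_defect')
```

Recording the processor failure once, at the level of the enclosing component, avoids repeating it in the table of every task.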
It is important to point out that the concept that we propose in figure 3 can in practice also be used for the analysis of other types of dependencies. If, for example, the enclosing component in figure 3 lies in an environment where there is excessive electromagnetic interference, then all the enclosed components are susceptible to this hazard. It makes sense, therefore, to determine the effect of this condition at the level of the enclosing component. In a similar way, we can determine the effects of other types of spatial or environmental dependencies that are not caused by shared material, energy or data flows that are explicitly represented in the model of the system. The concept of figure 3 thus provides a general mechanism for the representation and cause-effect analysis of common cause failures. Let us point out, though, that the identification of the root causes of such failures in the design, manufacturing and operating processes of the system, as well as their quantitative analysis, requires substantial additional modelling which is currently outside the scope of this work.

This approach to hazard analysis also lacks mechanisms for the representation and handling of data dependencies, that is, conditions in which erroneous effects appear only in response to certain (but not all) input conditions. Consider, for example, a data register in which we store and retrieve values, and imagine that the least significant bit of that register is stuck at 0. All odd numbers that we attempt to store in the register will be corrupted, but the fault will not affect any of the even numbers. Clearly, a value failure will be observed at the output of the register, but only for a subset of input values. Using the proposed approach, we could only pessimistically declare that a value failure can be caused by a stuck-at-zero failure in one or more bits of the register. But to represent the data-dependent manifestation of failure in those circumstances, we would clearly need a much more elaborate and complex failure expression.
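The register example can be reproduced in a few lines of Python. This is only an illustration of the data-dependent behaviour described above. The function name and the mask parameter are hypothetical, and non-negative integer values are assumed.

```python
def store_and_read(value, stuck_at_zero_mask=0b1):
    """Model a register whose masked bits always read back as 0."""
    return value & ~stuck_at_zero_mask


# Values in 0..7 that come back different from what was stored.
corrupted = [v for v in range(8) if store_and_read(v) != v]
print(corrupted)  # [1, 3, 5, 7]
```

Only the odd values are corrupted, so a single entry such as "Wrong-output caused by a stuck bit" overstates the effect of the fault for half of the input domain.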
Perhaps an alternative way of dealing with this problem would be to enable a form of non-deterministic modelling in which uncertainty can exist in the relationship between malfunctions and their effects (see, for instance, the use of unknown values in multi-valued logic approaches as, for example, demonstrated in [7]).

Figure 3. A concept for the representation and hazard analysis of programmable hardware
Model: a controller (hardware) upon which the enclosed tasks are running, and the tasks themselves (their interdependencies and how they handle data arriving at the controller inputs).
Safety analysis: a hardware analysis table shows how failures of the hardware affect the outputs of the controller. Each task analysis table shows how the given task responds to invalid data arriving at the task inputs and how internal defects in the logic may affect the task outputs. All together, the task analyses show how the software responds to failures arriving at the controller inputs and how logical errors in the software may propagate and ultimately affect the controller outputs.

Leaving those issues for further work, however, let us now assume that at a given stage of the design, the local hazard analysis of all components in the model has been performed and that the model has been annotated with the results from this analysis. At this point we can move from local analysis to global analysis; in other words, we can determine the global propagation of failure in the system using the local analyses of its components. In the proposed method, this is achieved mechanically using an algorithm that we have developed for the automatic synthesis of fault trees. The algorithm generates fault trees for hazardous functional failures as these can be observed at outputs of the system, by traversing the hierarchical model of the system and by following the propagation of failure backwards from the final elements of the design (actuators) towards the system inputs (sensors). In the course of this traversal, the algorithm identifies and records in the structure of those fault trees hazardous dependencies between components in the model caused by shared material, energy or data flows. It also takes into account any environmental or functional dependencies (between sub-systems and components) that have been recorded in the vertical axis of the hierarchy. By placing those dependencies in the context of a global view of failure in the system, the algorithm performs an important function. It helps us to identify particularly hazardous dependencies between components that we assume to be independent but that are in fact susceptible to common cause failure. The resultant fault trees can, therefore, help us to identify hazardous dependencies between replicated components in fault tolerant architectures.
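The backward traversal can be illustrated with a short Python sketch. It is not the algorithm of the tool, whose details are not given here. It assumes a flat, acyclic model, failure expressions stored as nested tuples, and deviations named "Class-port"; the `model` and `links` layouts and all component names are hypothetical.

```python
def qualify(expr, component):
    """Prefix basic events with the component name so that they stay unique."""
    if isinstance(expr, str):
        return f"{component}.{expr}"
    gate, *children = expr
    return (gate, *(qualify(c, component) for c in children))


def expand_inputs(model, links, component, expr):
    """Replace each input deviation with the fault tree of the upstream output."""
    if isinstance(expr, str):
        failure_class, port = expr.split("-", 1)
        source = links.get((component, port))
        if source is None:
            return f"{component}.{expr}"  # deviation at a system input
        upstream, out_port = source
        return synthesise(model, links, upstream, f"{failure_class}-{out_port}")
    gate, *children = expr
    return (gate, *(expand_inputs(model, links, component, c) for c in children))


def synthesise(model, links, component, failure):
    """Fault tree for `failure` at an output of `component` (acyclic model assumed)."""
    entry = model.get(component, {}).get(failure)
    if entry is None:
        return f"{component}.{failure}"  # no local analysis: undeveloped event
    input_logic, malfunction_logic = entry
    branches = []
    if malfunction_logic is not None:
        branches.append(qualify(malfunction_logic, component))
    if input_logic is not None:
        branches.append(expand_inputs(model, links, component, input_logic))
    if not branches:
        return f"{component}.{failure}"
    return branches[0] if len(branches) == 1 else ("OR", *branches)


# component -> {output failure: (input deviation logic, malfunction logic)}
model = {
    "actuator": {"Omission-force": ("Omission-cmd", "Jammed")},
    "controller": {"Omission-cmd": (("AND", "Omission-a", "Omission-b"), "Crash")},
    "sensor_a": {"Omission-reading": (None, "Broken")},
    "sensor_b": {"Omission-reading": (None, "Broken")},
}
# (component, input port) -> (upstream component, output port)
links = {
    ("actuator", "cmd"): ("controller", "cmd"),
    ("controller", "a"): ("sensor_a", "reading"),
    ("controller", "b"): ("sensor_b", "reading"),
}

tree = synthesise(model, links, "actuator", "Omission-force")
print(tree)
```

Starting from the actuator, each input deviation is replaced by the tree of the corresponding upstream output failure, so the leaves of the resulting tree are internal malfunctions or deviations at the system inputs.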
The analysis of those fault trees (cut-set analysis, for example) can also help to eventually establish whether the current design satisfies the given safety requirements or to identify weak areas of the system that need to be redesigned. Currently, such design iterations create enormous difficulties in the maintenance of large manually constructed fault trees. In contrast, design iterations would not pose problems to the proposed synthesis of the fault trees, as new fault trees could be re-constructed automatically once the model and the underlying hazard analyses have been updated.
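As an illustration of the cut-set analysis mentioned above, the following Python sketch computes the minimal cut sets of a small AND/OR tree by expanding the gates and discarding every set that contains a smaller one. In the approach described here, this step is delegated to a fault tree analysis tool. The tree, the tuple representation and the event names below are hypothetical.

```python
from itertools import product


def cut_sets(tree):
    """Return the minimal cut sets of an AND/OR tree as a set of frozensets."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    gate, *children = tree
    child_sets = [cut_sets(c) for c in children]
    if gate == "OR":
        combined = set().union(*child_sets)
    else:  # AND: one cut set from every child must occur together
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    # Keep only the sets that have no proper subset among the candidates.
    return {s for s in combined if not any(other < s for other in combined)}


# Two replicated sensor channels that share one power supply.
tree = (
    "OR",
    "actuator.Jammed",
    ("AND",
     ("OR", "sensor_a.Broken", "power.Lost"),
     ("OR", "sensor_b.Broken", "power.Lost")),
)

result = cut_sets(tree)
print(sorted(sorted(s) for s in result))
# [['actuator.Jammed'], ['power.Lost'], ['sensor_a.Broken', 'sensor_b.Broken']]
```

The single-event cut set containing the shared power supply shows how a common cause failure defeats the replication of the two sensor channels.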
3. Tool design
At this point, let us move to the design of the tool that we have developed to support the application of this approach to safety analysis. The aim of this tool is to integrate the proposed method, and the fault tree synthesis algorithm, in an environment of already established industrial tools that consists of a popular functional modelling tool (Simulink, from Mathworks) and a popular fault tree analysis tool (Fault Tree Plus, from Isograph). Figure 4 illustrates the main components of the fault tree synthesis tool and the relationships of those components with Simulink and Fault Tree Plus. The first important observation is that the tool does not have independent modelling capabilities. Indeed, the tool exploits models of the system that have been created in Simulink during functional design. An investigation of those models has shown that Simulink models can directly provide one of the two ingredients that we need for the automatic synthesis of fault trees: the topology of the system (i.e. the components of the system, their hierarchical relationships and architectural dependencies). At the same time, though, Simulink models lack the second ingredient required by the synthesis algorithm, namely any information about the local failure behaviour of the basic components of the system. To remedy this problem, we have extended Simulink with an editor that enables analysts to annotate functional models with failure information represented in the form of the hazard analysis that we introduced in the preceding
section. This editor has been built as an extension of Simulink, using the application programming interface (scripting language) of that tool. Once a Simulink model has been annotated with the local hazard analyses of its components, it is then exported from Simulink as a text file that conforms to a particular syntax. The second component of the safety analysis tool is a parser that can analyse such files, i.e. files that conform to the grammar of Simulink models. This component performs syntactical analysis and interpretation of the model file, and regenerates (in the memory of the computer) the model and the data structures required for the fault tree synthesis. Finally, the fault tree synthesis itself is performed by the third component of the tool, the fault tree synthesis algorithm. To generate the trees, the algorithm performs a backward traversal of the model, in the course of which it evaluates the failure expressions contained in the local analyses of the components encountered during the traversal. The resultant fault trees are written in the binary format of a Fault Tree Plus project file and can be imported into that tool for further analysis and reliability evaluation purposes.

Figure 4. Architecture of the Tool
SIMULINK (modelling tool): system model -> automated safety analysis tool: hazard analysis editor (SIMULINK extension) -> annotated model as text file -> parser -> annotated model in data structure -> fault tree synthesis algorithm -> synthesised fault trees -> FAULT TREE + (fault tree analysis tool).

It is perhaps important to point out that there are no restrictions imposed on the size of the model or on the type of components that could be used for the development of the model. The safety analysis tool, for example, can handle the complications caused in the traversal of the model and the fault tree synthesis by the multiplexing and de-multiplexing of flows that often exists in realistic models. It can also recognise and handle indirectly relayed control signals (triggers) and components that are not connected through explicit links, but communicate remotely using implicit communication protocols (Data-store/Data-read pairs, for example).
We hope that such features will help to deal with realistic models and render the tool useful in industrial contexts of application in the future.
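The role of the parser can be illustrated with a small recursive-descent sketch in Python. The syntax of the exported file is not reproduced in this paper, so the brace-delimited layout below, the `OmissionOutput` annotation key and the block names are assumptions made purely for illustration; malformed input is not handled beyond two basic checks.

```python
import re

# Quoted strings, single braces, or bare words.
TOKEN = re.compile(r'"[^"]*"|[{}]|[^\s{}"]+')


def parse_block(tokens, pos):
    """Parse `Kind { key value ... Kind { ... } ... }` starting at tokens[pos]."""
    kind = tokens[pos]
    if tokens[pos + 1] != "{":
        raise ValueError(f"expected '{{' after {kind}")
    pos += 2
    node = {"kind": kind, "params": {}, "children": []}
    while tokens[pos] != "}":
        if tokens[pos + 1] == "{":  # nested block
            child, pos = parse_block(tokens, pos)
            node["children"].append(child)
        else:  # key/value parameter
            node["params"][tokens[pos]] = tokens[pos + 1].strip('"')
            pos += 2
    return node, pos + 1


def parse(text):
    """Rebuild the model hierarchy in memory from its textual form."""
    tokens = TOKEN.findall(text)
    node, pos = parse_block(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected text after the top-level block")
    return node


# Hypothetical annotated model; the layout and keys are assumptions.
text = '''
System {
  Name "brake_controller"
  Block {
    Name "pedal_task"
    OmissionOutput "Omission-pedal_in OR Task_defect"
  }
}
'''

model = parse(text)
print(model["children"][0]["params"]["OmissionOutput"])
# Omission-pedal_in OR Task_defect
```

The nested dictionaries produced by `parse` play the part of the in-memory data structure on which a synthesis algorithm could then operate.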
4. Demonstration Plan
The following paragraph summarises the three aims of our demonstration at the conference. Firstly, we wish to show an example of how the proposed approach can be used for integrated hardware and software safety analysis in a programmable system. Secondly, we wish to demonstrate the applicability of the proposed approach to complex problems. Our aim, for example, will be to show how, in practice, the tool can operate on a complex Simulink model and synthesise a large fault tree. Finally, we wish to show how this approach can assist the design and simplify the assessment of the system. To achieve this, we intend to demonstrate how the synthesised fault trees can point out weak areas of the design, and how automatic fault tree synthesis simplifies the re-analysis of a system following a design iteration.

The system that we have chosen as a suitable basis for the demonstration is a prototypical brake-by-wire and adaptive cruise control system for cars. This system is currently being developed by a consortium of automotive companies in the context of a European Commission funded project called SETTA [8]. The brake-by-wire system is based on a brake pedal provided by DaimlerChrysler and a brake actuator provided by Siemens Automotive, while the adaptive cruise control system is an executable model of vehicle dynamics provided by Renault. The three components of that system run on an equal number of programmable nodes which communicate over two replicated buses using a deterministic time-triggered communication protocol. The overall architecture is safety critical and contains two distributed control loops with strict timing requirements.

For the conference demonstration, we plan to develop an executable Simulink model for that system and annotate that model with local hazard analyses. We will then use our safety analysis tool to generate a set of fault trees for the system and import those fault trees into Fault Tree Plus for further analysis and reliability evaluation. We hope that the demonstration of this process and the results from the analysis of the resultant fault trees will help us to substantiate some of the claims that we have made in this paper and to achieve the aims that we have set out at the beginning of this section.
5. References
[1] Burns D.J., Pitblado R.M., A modified HAZOP methodology for Safety Critical System Assessment, 7th meeting of the UK Safety Critical Club, Bristol, February 1993.
[2] McDermid J.A., Pumfrey D.J., A Development of Hazard Analysis to Aid Software Design, COMPASS'94, Gaithersburg MD, IEEE Computer Society Press, 1994.
[3] Yang S., Chung P.W.H., Hazard Analysis and Support Tool for Computer Controlled Processes, Journal of Loss Prevention in the Process Industries, 11:333-345, 1998.
[4] Leveson N., Cha S.S., Shimeall T.J., Safety Verification of ADA Programs Using Software Fault Trees, IEEE Software, 8(7):48-59, July 1991.
[5] Papadopoulos Y., McDermid J.A., A New Method for Safety Analysis and the Mechanical Synthesis of Fault Trees in Complex Systems, ICSSEA'99, 4(13):1-9, Paris, 1999.
[6] Papadopoulos Y., McDermid J.A., Hierarchically Performed Hazard Origin and Propagation Studies, Lecture Notes in Computer Science, 1698:139-152, Springer-Verlag, 1999.
[7] Csertán, A. Pataricza, and E. Selényi, Dependability analysis in HW-SW Co-design, IEEE Computer Performance and Dependability Symposium, IPDS'95, pages 306-315, 1995.
[8] Scheidler C., Pushner P., Boutin S., Fuchs E., Gruensteidl G., Papadopoulos Y., Pisecky M., Rennhack J., Virnich U., Systems Engineering of Time-Triggered Architectures: The SETTA Approach, DCCS-2000, Sydney, November 2000.