Experience with the application of HAZOP to computer-based systems

J. A. McDermid (1,2), M. Nicholson (2), D. J. Pumfrey (1,2) and P. Fenelon (2)

(1) British Aerospace Dependable Computing Systems Centre and (2) High Integrity Systems Engineering Group, Department of Computer Science, University of York, Heslington, York YO1 5DD, U.K.
email: jam | mark | djp | [email protected]

Abstract— This paper summarises the experience gained from application of Hazard and Operability Studies (HAZOP) and related techniques to four computer-based systems. Emphasis is placed on working practices and the integration of HAZOP-style analysis into a safety-oriented lifecycle. Two of the case studies are described in some detail. An industrial study is used to investigate working practices, highlighting a number of areas of concern with the traditional team approach. A second example is described using an alternative process known as Software Hazard Analysis and Resolution in Design (SHARD), showing its effectiveness on a technology demonstrator case study. This example also demonstrates the integration of our approach with other techniques such as our Failure Propagation and Transformation Notation (FPTN) and Software Fault Trees.

I. INTRODUCTION

At COMPASS ’94 we presented a paper [1] which described a method of software safety analysis based on ideas drawn from the process industries’ Hazard and Operability Studies (HAZOP) [2][3]. Since that conference, we have had the opportunity to attempt practical application of these ideas to a number of real systems. This paper summarises the experience we have gained. For one of the systems studied, we also show how this type of analysis is integrated into a safety-oriented development lifecycle.

HAZOP is described as a system of imaginative anticipation of hazards, which uses a set of guide words to prompt a team of analysts to consider the hazard potential of various deviations from the expected behaviour of a system. We considered that these guide words, combined with an emphasis on flows rather than processes, made HAZOP an interesting basis for developing a safety analysis technique for software. The work described in the paper we presented last year concentrated on the technical aspects of the method — in particular, the selection of guide words. The application of HAZOP to software is too recent an idea for there to be “tried and trusted” guide words, and we were unable to find any accessible incident databases which would have allowed us to develop or test sets of guide words by examination of real failures. Instead, we proposed the use of a set of guide words based on
research into the classification of software failures [4]. A further goal of our work was to show how a closer integration could be achieved between hazard and safety analysis and design. In particular, we wanted to show how early analysis could be used to drive the design process, suggesting modifications and improvements or providing justification for a particular option. In the case study work, we have sought both to test and refine the technical aspects of the technique, and also to study different ways of working, to determine whether the team approach favoured by the chemical industry is appropriate for the systems development environment. The “traditional” team HAZOP approach is summarised in figure 1. There is relatively little work carried out in advance of the study meeting except for preparation of the necessary information. The effort is concentrated in the meeting itself, where an iterative process of identifying, examining and recording deviations is carried out for every information flow in the design. Our early work led us to believe that there were two major potential drawbacks to this approach: 1. It was clear that a team approach is expensive, and we were not convinced that the cost could be justified by the potential benefits. 2. The team brief is only to identify hazards, and not to suggest specific actions to rectify problems. This is a rather narrower brief than we considered to be ideal. This led us to propose an alternative approach, in which analysis becomes an integrated part of system development. This new method was christened Software Hazard Analysis and Resolution in Design (SHARD), to avoid confusion with “traditional” HAZOP. In this new approach, the majority of the analysis becomes an individual, rather than a team, task. As each part of the system design is produced, it is the responsibility of either the designer, or a single independent assessor, to conduct an analysis, based on the principles of HAZOP. This analysis must either be shown to justify the design proposal, or impose a number of emergent requirements which must be satisfied later in design development. If it is clear from the analysis that design revisions are required, these are implemented immediately, and the analysis repeated as necessary.
[Figure 1. Team HAZOP study procedure. Before study: prepare and issue study information. Study meeting: describe complete system; select an item; establish intended behaviour; use guide words to suggest deviations; investigate effects and causes; record results, questions and recommendations; repeat for all guide words and for all items. After study: follow up actions.]
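As a reading aid only (not an artefact of the studies described in this paper), the short Python sketch below shows the combinatorial shape of the procedure in figure 1: every information flow is examined against every guide word, and each discussion produces a record. The guide word list and record fields shown are illustrative placeholders.

```python
# Illustrative sketch only: the shape of the team HAZOP loop in figure 1.
# Flow names, guide words and record fields are placeholders, not study data.

GUIDE_WORDS = ["OMISSION", "COMMISSION", "EARLY", "LATE", "VALUE"]

def hazop_worklist(flows):
    """Yield one empty study record per (flow, guide word) pair; the meeting
    fills in the deviation, its causes and effects, and any recommendations."""
    for flow in flows:                 # repeat for all items
        for word in GUIDE_WORDS:       # repeat for all guide words
            yield {"flow": flow, "guide_word": word,
                   "deviation": None, "causes": None,
                   "effects": None, "recommendations": None}

# Even a modest design produces a long agenda, which is where the cost of a
# full-team meeting comes from, e.g. 70 flows x 5 words = 350 discussions.
print(sum(1 for _ in hazop_worklist([f"flow_{i}" for i in range(70)])))
```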
Once the designer and/or assessor are satisfied with the design and accompanying analysis, it is passed to one or more reviewers. The final stage of the process is a review meeting, at which reviewers’ concerns can be discussed. If there are serious problems or disagreements, this meeting may conduct further investigations along the lines of a traditional HAZOP study. However, the large amount of preparatory work should ensure that the effort of the meeting is focused only on those areas where there are serious problems. This process is shown in figure 2.

II. CASE STUDY: ANALYSIS OF A TECHNOLOGY DEMONSTRATOR
The first large case study we conducted was on a system constructed as a technology demonstrator by one of the operating divisions of British Aerospace. This case study was carried out early in our work, before much effort had been devoted to consideration of working practices; the analyses were actually performed by individuals working alone, but attempting to follow the technical aspects of the HAZOP process. We were able to get full access to design information, which was expressed in DORIS/DIA notation [5], but development was well advanced when the case study began, so we were unable to use the results of the analysis to guide development. However, two independent analyses were conducted, and both located hazards in the design which had not been identified by the designers, or by an earlier ad-hoc hazard analysis. A member of the project team who applied the technique concluded in his report that “[HAZOP] gives a good means by which to start reasoning about failures and identify areas of concern”. Although this case study was extremely useful in providing input and encouragement to our work, more useful material can be drawn from later studies.

III. CASE STUDY: TEAM HAZOP STUDY OF AN AVIONICS SYSTEM
[Figure 2. SHARD study procedure. Starting from the system or subsystem requirements, the designer proposes a design. The designer or an independent assessor then conducts the analysis: if the design is acceptable, the justification for acceptance is recorded; if not, concerns are recorded, alternative remedial actions are proposed, a preferred action is selected and its selection justified, and the design is revised before the analysis is repeated. Reviewers then review the design and analysis, recording concerns and/or conducting new analysis where they judge it unacceptable. At the review meeting, if no reviewer has identified concerns the design is accepted; otherwise the meeting either proposes alternative remedial actions or, where further analysis is necessary, proceeds as a HAZOP.]

In the U.K. there has, over the past year, been considerable interest in the use or adaptation of HAZOP for software safety
analysis. This has been spurred, in part, by the inclusion of HAZOP in Defence Standard 00-56 [6], and the ensuing concern that no standards or guidelines existed for the application of this technique to software. In response to these concerns, the U.K. Ministry of Defence (MoD) commissioned two companies, Adelard and Cambridge Consultants Ltd., to produce studies into the feasibility of writing guidelines for HAZOP studies of programmable systems. The feasibility studies were delivered in April 1994, and were generally positive about the production of guidelines. Cambridge Consultants Ltd. were subsequently commissioned to write the first draft of the actual guidelines, which are intended to form the basis of a new Defence Standard. The guidelines proposed are specifically for the use of HAZOP as a hazard identification technique, and the approach they describe is very similar to that familiar to the process industries, i.e. a structured, team study. The guidelines also suggest the retention of the “traditional” guide words (MORE, LESS, NO etc.), supplemented by additional words to prompt consideration of absolute and relative timing of events in the system. As part of the process of developing and refining the guidelines, early drafts were released to selected groups who were able to offer trials on real systems. British Aerospace, working with the Dependable Computing Systems Centre (DCSC), offered a system for study. However, in order to integrate with the DCSC’s own work, the case study was conducted using the procedural aspects (i.e. team working and management) of the guidelines, but applying the guide words and other technical details suggested by the DCSC and reported in [1]. This case study was of considerable importance in the development of our ideas and experience, not least because of the opportunities it presented for discussion with experts and practitioners involved on a daily basis in the development and approval of safety critical systems. The team recruited for the study consisted of:
- The study leader
- A recorder, who both participated in discussions and recorded conclusions using a tool built from a standard spreadsheet package
- A member of the design / implementation team
- A representative from the Independent Verification and Validation team
- A member of the group which had specified the system

Of these team members, all but the study leader were company personnel. The system studied was an existing avionics subsystem, which provided a range of utility functions such as test and configuration management and maintenance data collection. It was not directly involved in the control or navigation of the aircraft, but could contribute indirectly to hazards through selection of test modes at inappropriate times, or incorrect maintenance or configuration data reporting. Detailed specification and design information were available, expressed in COntrolled Requirements Expression (CORE) notation [7]. As with the first case study, the design and implementation of the system were well advanced, and consequently we were not able to use results from the study to influence the development of the design. However, a complete hazard analysis for the system had already been performed using other techniques, so we were able to compare the results we achieved, and the effort expended, with the previous analysis.

Three days were available for the study, and two sessions, each of about three hours, were timetabled on each day. The first day began with a short presentation explaining the purpose of the study and basic HAZOP concepts. The team then spent approximately half a day studying an example system defined in the CORE notation. The hope was that this would enable people to become familiar with the concepts, and iron out a number of problems with procedure and interpretation before the study proper commenced. Study of the avionics system itself began mid-afternoon on the first day, and continued through both sessions on the second day. At this point it was decided that no further useful progress could be made by continuing analysis, and the team spent the morning session of the third day discussing their conclusions from the study. The actual progress made by the case study in terms of number of flows examined was relatively small. However, amongst these, the existence of a suspected hazard not described in the previous hazard analysis was confirmed. Several factors influencing the rate of progress can be identified:
- Problems with the design representation.
- Time spent discussing — and occasionally re-visiting — complex issues, which in a real HAZOP study should have been resolved off-line.
- The complexity of the system studied.

However, the major cause of slow progress was simply that the study was not conducted strictly as a “proper” HAZOP. As an experimental trial, it was considered important to discuss issues related to the method as they arose. The complexity of the design led to problems understanding design intent; in many cases, data flows were discovered to be so interrelated that it was almost impossible to undertake an analysis of any one in isolation. As well as being complex, the system had many interfaces to external systems which were not well understood by team members, and the team was attempting to study the system at a relatively low level of detail without having results of higher-level analyses on which to draw. This last point meant that the scope of consideration for every deviation inevitably grew wider and wider as people attempted to understand the implications of failures — in some cases to the extent that discussions encompassed ground crew training and maintenance procedures. Possibly as a result of the fact that contextual information was needed to complete the analysis, the study at times became more like an FMEA; certainly, there was less study of
the causes of deviations than would have been desirable, although it was pointed out that it would have been fairly reasonable to expect that these would have become evident as “upstream” flows were considered. A consequence of this was that some of the team members complained that the study was too data-oriented, to the exclusion of process considerations; of course, the data all originated or was consumed in processes, so a better investigation of causes and immediate (as opposed to system-wide) consequences would have rectified this problem. The most important outputs of the study were the conclusions reached by the team in discussions both during and after the study meetings, and these are summarised in the following subsections.

A. Working as a team

The study highlighted the benefit that a team conclusion may carry more weight than individuals’ conclusions, but also raised many concerns about the effectiveness of carrying out the analysis as a team, including:
- When the system under consideration is extremely complex, or interacts with a great number of other systems, the number of questions which must be resolved off-line becomes excessive. The study may have to be reconvened many times as the answers to queries suggest new deviations or interactions which must be explored. In the worst scenario, the team approach may be less effective, and introduce significantly greater delays, than a study carried out by one or two people who can contact others with the necessary expertise as soon as the need arises.
- In many cases, a lot of the analysis will be either repetitive, or relatively straightforward or obvious. The presence of the entire team is not necessary, and inefficiency, high costs, and possibly boredom of the team members may result.
- The study could become “adversarial”, with the designer(s) pushed into a defensive role.

The greatest concerns with the HAZOP method centre on the balance of costs and benefits of using a relatively large team for the study, as compared to an individual or very small group study. At the level of design detail at which the case study was conducted, a team HAZOP is a very expensive way of obtaining information, and this case study presented no compelling evidence that the team study actually provided a more thorough analysis than would have been possible with other techniques. This tends to suggest that the alternative process we identified in figure 2 is more appropriate.

B. Team Composition

Conclusions about the size and composition of the team included:
- The size of the team is critical; too small, and it may be prone to bias; too large, and discussions of relatively simple points may become unnecessarily protracted.
- The composition of the team in terms of designers / users / experts was good, but the complexity of the system was such that it would never be possible to assemble a team with the expertise necessary to analyse all parts of it.
- In a project situation with deadline and cost pressures, members of the project team would have a vested interest in approving a design; a HAZOP team should therefore always include at least one completely independent member.

C. The Study Leader

It is clear that the skills and authority of the leader are crucial to the success of a team HAZOP. Some areas particularly highlighted in which the leader must be active are:
- Summarising at the end of a discussion, and ensuring that the records made really reflect the important points.
- Enforcing time keeping, and deciding quickly when an issue is not going to be resolved in the meeting.
- Insisting that points already discussed are not needlessly re-visited, especially when it has already been decided to investigate these issues outside the meeting.

Implicitly, it is also necessary that the team leader has a strong personality and is able to be assertive in bringing discussions to a close.

D. Recording the results

The recording of results was one of the areas of the case study which was least successful.
- It is difficult for the recorder to participate in the discussion and record what has been said; the time taken to type (possibly large) amounts of text must be considered in deciding the recorder’s role. A solution to this might be a “two-pass” approach, where all the guide words are first discussed for each entity, with each member of the team making his own notes but no official recording being carried out; the guide words are then re-visited, and concerns recorded as appropriate on this “second pass”.
- In many instances, team members will give examples of specific types of problem, but for the records it is preferable to record the general case. This is difficult to determine “on the fly” during the meeting, and a better description or explanation of some issues may need to be written off-line.
- For large systems, the output of the HAZOP process may be a huge volume of unstructured paperwork, which it is difficult to navigate, and from which it is hard to extract essential material, or understand the structure of deviations contributing to a particular hazard. The tabular records should, where possible, be supported by an alternative representation such as Fault Trees, or Fenelon’s FPTN diagrams [8].
- It is also important to keep the analysis on-line, so that it can be searched and manipulated electronically.

E. The design representation

The design representation available for this study was found to have a number of drawbacks, and was generally less helpful than would have been wished. We concluded that:
- The effectiveness of a HAZOP study is very dependent on the information density of the design notation — the more information on each page, the better the results.
- If attempts are made to analyse structures introduced into the design to work around limitations of the design method or notation, the results will be uninformative.
- At low levels of design detail, relatively small changes can result in the need for substantial re-work. This might make the maintenance of any analysis accompanying a design an infeasibly large task, especially if the HAZOP team has to be re-convened to agree the impact of any design change and re-work the analysis.

Fenelon and Hebbron [9] make several recommendations about the need for meaningful mappings between design representations, guidewords and identified hazards.

F. Guide words

As noted in the introduction, this study was carried out with an alternative set of guide words developed by the DCSC. The original intention of this set of guide words was to find the minimum set which we could be reasonably confident should prompt consideration of the plausible failure modes of software. Had the “raw” words (i.e. OMISSION, COMMISSION, EARLY, LATE ...) been used, there would probably have been fewer difficulties with interpretation; in the event, we attempted to pre-define interpretations for the CORE context, which were found to be contentious and not generally acceptable. Also, the division of the value domain into coarse and subtle, although academically attractive, and very useful in some applications (e.g. determining the placement of defensive programming features), proved difficult to interpret and manage. The case study was therefore not able to provide any support for this alternative set over the “traditional” set contained in the guidelines. However, in the discussion after the study, the team concluded that the greater number of “traditional” words would probably have been considerably more onerous. Whatever guide words are used, their most important function is as discussion starters, and it is important not to unnecessarily restrict the freedom of the team to interpret them in novel ways. However, suggesting how the words should be interpreted in particular contexts is seen as important for ensuring uniformity across multiple studies conducted independently on component parts of a large system.
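To make the idea of pre-defined interpretations concrete, the sketch below (Python, purely illustrative and not part of the original study) shows one way a guide word set and its per-notation interpretations might be held so that multiple independent studies draw on the same prompts. The CORE wording shown is a hypothetical example, not the interpretations actually used in the case study.

```python
# Illustrative sketch only: DCSC-style guide words with generic prompts plus
# optional per-notation interpretations, so that independently conducted
# studies interpret the words uniformly without losing freedom to re-interpret.

GUIDE_WORDS = {
    "OMISSION":     "Service / value never provided when expected",
    "COMMISSION":   "Service / value provided when none was expected",
    "EARLY":        "Service / value provided before it was expected",
    "LATE":         "Service / value provided after it was expected",
    "VALUE_COARSE": "Value wrong and detectably out of range",
    "VALUE_SUBTLE": "Value wrong but plausible (within range)",
}

# Hypothetical per-notation overrides; the CORE wording here is invented for
# illustration, not the contentious interpretations mentioned in the text.
INTERPRETATIONS = {
    ("CORE data flow", "OMISSION"): "No data item placed on the flow in a processing cycle",
    ("CORE data flow", "LATE"):     "Data item placed on the flow after its consumer has read",
}

def prompt(context: str, guide_word: str) -> str:
    """Return the interpretation to table at a study meeting, falling back to
    the generic prompt so the team's freedom to re-interpret is not removed."""
    return INTERPRETATIONS.get((context, guide_word), GUIDE_WORDS[guide_word])

if __name__ == "__main__":
    print(prompt("CORE data flow", "OMISSION"))
    print(prompt("Yourdon data flow", "VALUE_SUBTLE"))
```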
IV. CASE STUDY: SHARD STUDY OF A COMPUTER ASSISTED BRAKING SYSTEM
The third case study carried out was on another industrial technology demonstrator — an experimental Computer Assisted Braking (CAB) system. In this case, only an outline of the system functionality and the proposed hardware platform were completed before the study began, and we were therefore able to test the integration of our hazard analysis techniques into the development process. For this case study, we elected to use the SHARD process, i.e. the initial analysis of each part of the design was carried out by the designer. These analyses were then passed to peer reviewers as part of the design review package, and the complete team met only to discuss those areas in which there was disagreement, or where specific recommendations for design modifications had been made. The case study was a three man-month project. The main study was carried out by a researcher who had not been involved in the investigation of HAZOP or the development of SHARD, allowing us to investigate how easily the technique could be taught and to identify particular areas of difficulty. Of the three months, approximately four weeks were spent in initial understanding and development of the system concept, two weeks on SHARD analysis and related design revisions, and a further six weeks producing supporting evidence that the SHARD analysis was correct. At the highest level of decomposition, the braking system functionality is provided by the following major components:
- Input and Output modules, which provide interfaces to the sensors and actuators
- 3 braking algorithms
- an anti-lock function
- a Pressure Control module, which converts the output of the braking algorithms into actuator pressures

The three braking algorithms are alternates, providing different levels of functionality. The highest level of functionality is provided by the Enhanced Braking Algorithm (EBA); the Alternate Enhanced Braking Algorithm (EBA’) provides a similar level of functionality, but using different sensors. Both of these algorithms utilise the anti-lock functions. The lowest functionality is that provided by the Basic Braking Algorithm (BBA); this algorithm does not incorporate the advanced feedback sensing of the other algorithms, and does not use the anti-lock function. The minimal set of components required to function to provide braking is therefore Input – BBA – Pressure Control – Output. All the components on this basic path are clearly safety critical, and failures are potentially hazardous. These functions are therefore replicated on independent processors to increase reliability. The EBA, EBA’ and Anti-lock functions are implemented on a further two processors, and all four processors communicate via duplicated buses. The system is implemented using a message-passing architecture,
which has very attractive fault detection and containment properties, and a fixed, periodic scheduling system which precludes most types of timing failure. Two additional functions, Monitor and Fault Manager implement the monitoring and switching which allow the results of the functionally more powerful, but potentially less reliable, EBA or EBA’ algorithms to be used in place of the BBA. Since this switching effectively permits components of a lower integrity to replace the high integrity BBA, the monitoring and fault management components themselves are also safety critical, and are replicated across all four processors. An initial exploratory analysis of the top-level design was undertaken, using SHARD, to investigate potential failure modes of the proposed design and to suggest remedies to them. A total of 70 data flows were identified in the top level design drawing. The HAZOP / SHARD process of using guide words to suggest deviations, and then searching for hazardous effects, plausible causes and possible remedies, was applied by the designer to each of these flows. A number of changes were made to the proposed design as a result of this initial analysis before anything was passed to the reviewers. For example, the system incorporates a fault injection facility for test purposes. This facility permits any system component to be instructed to behave as if it had failed in various ways. As a result of the initial analysis, the designer concluded that there was insufficient protection against spurious fault injection events; in the original design, certain single failures would have been sufficient to trigger an unwanted fault output. To rectify this, the designer added a boolean flag which must be set every time fault injection is required. The self checking properties of the algorithms were also strengthened. An example of the analysis produced for one of the data flows in the system is shown in table I. The format of the table is similar to a standard HAZOP log, and to the example we presented in [1]. However, development of the method has led to the introduction of an additional column (“Detection/Protection”) in which to record information about failure control mechanisms already incorporated into the design. The penultimate column (“H?”) records whether the consequences of the deviation under consideration are hazardous, regardless of whether there are protection mechanisms which can detect the failure and mitigate against its effects. In a more sophisticated example, this column might be used to record some sort of criticality for the deviation, rather than the simple Yes/No used here. The flow shown, pressure5, carries the value which indicates to the Pressure Control which of the braking algorithms is to be used to control the outputs. Since this selection can potentially cause incorrect values to be used if one of the algorithms has failed, it is a safety critical data flow. The details of the example in the table cannot be fully appreciated without more knowledge of the design; however, the spirit of the analysis should be clear from the table. The table shown highlights a problem with any analysis which is evolved with a design, namely how to manage and
record change history. In this case, there is an action (“Use enumerated type for selection value”) identified against the second guide word; examination shows that, in fact, the flow data type has already been changed to “enumerated”. In this case study, actions identified were left in the analysis tables, but ideally it should be possible to “close off” actions as they are completed, and to maintain a log such that traceability is maintained.

Once the designer had completed the analysis and revision process to his own satisfaction, a review meeting was called. The reviewers were provided with the design and analyses a week in advance of the meeting. The meeting took 80 minutes, most of which was occupied with discussions about the validity of assumptions made by the designer. For example, the designer had assumed that the message-passing architecture underlying the implementation could be relied upon to exclude certain types of failure, and to transform others into omission failures, which are relatively easy to detect and handle. The scheduling system is such that it precludes timing failures, and it has fail-silent mechanisms which prevent errors of commission (i.e. a failure which generates an output where a correctly-operating system would have generated none). For this reason, none of the analysis carried out by the designer had included timing guide words (EARLY, LATE), or the guide word COMMISSION. The reviewers accepted that this was a valid assumption, provided that later work could show that worst-case timing properties could be guaranteed. There was also considerable discussion about whether all processes and flows in the system have to be regarded as safety critical given that, if the monitoring and fault management functions are working correctly, there are some components whose failure should not cause a system failure. The reviewers accepted the designer’s arguments that the safety critical flows are those on the minimum functionality path, and the monitoring and fault management functions that can usurp the use of this path. In conclusion, the reviewers accepted that the analysis showed that no single point failure can cause a system failure. The hazardous condition in which the level of braking applied does not respond to the driver’s commands requires several coincident failures. It was not found necessary to conduct any further analyses in the meeting, and the only changes the designer was requested to make were changes of wording to improve the clarity of parts of the analysis. Following the review meeting, the reviewers prepared short documents summarising the discussion.
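The fault injection protection described above can be pictured with a small sketch. The Python fragment below is purely illustrative (the real system was not written in Python, and the message fields shown are invented): it shows the general idea of requiring an explicit arming flag in addition to the injection command, so that no single corrupted or spurious message is enough to trigger an unwanted fault output.

```python
# Illustrative sketch only: an invented message format showing the idea of an
# explicit arming flag guarding fault injection, so a single spurious or
# corrupted field cannot, on its own, trigger an unwanted fault output.

from dataclasses import dataclass

@dataclass
class FaultInjectionRequest:
    target: str          # component to be instructed to misbehave
    failure_mode: str    # e.g. "omission", "subtle_value"
    armed: bool          # must be explicitly set for every request

def should_inject(request: FaultInjectionRequest, test_mode_enabled: bool) -> bool:
    """Inject only when the request is explicitly armed AND the system is in
    test mode; either condition alone (one single failure) is not enough."""
    return test_mode_enabled and request.armed

# Example: a spurious request that is not armed is rejected.
spurious = FaultInjectionRequest(target="EBA", failure_mode="omission", armed=False)
assert should_inject(spurious, test_mode_enabled=True) is False
```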
TABLE I: SHARD EXAMPLE — ANALYSIS OF FLOW pressure5 OF THE CAB SYSTEM TOP LEVEL FUNCTIONAL DIAGRAM

Drawing Ref: 1. Drawing Name: Top level. Flow ID: 8. Flow name: pressure5. Source: Fault Manager. Destination: Pressure Control. Data Type: Enumerated.
Additional Information: This is a selection command value which defines which of the braking algorithms BBA, EBA or EBA’ Pressure Control should select to drive the outputs.

Guide Word: OMISSION
Deviation: No selection value sent to Pressure Control.
Possible Causes: Bus failure; Fault Manager failure; message corrupted.
Co-effectors: Second bus failure.
Detection / Protection: Sparse time base; dual buses; CRC; defined fallback position.
Effects: Pressure Control detects absence of message, and selects value from Basic Braking Algorithm (BBA) by default.
H?: Y
Justification / Design Recommendations: Justification: Fallback to BBA value combined with redundant BBA implementation is considered sufficient protection.

Guide Word: VALUE, COARSE
Deviation: Impossible selection value.
Possible Causes: Memory corruption; message corruption; algorithmic failure.
Detection / Protection: CRC ensures corrupted message will fail silent; clean node breaks; enumerated types.
Effects: Pressure Control will be forced to select default value (i.e. BBA output).
H?: Y
Justification / Design Recommendations: Action: Use enumerated type for selection value. Run-time checks will then ensure out-of-range value is not possible.

Guide Word: VALUE, SUBTLE
Deviation: Plausible but incorrect selection value.
Possible Causes: Memory corrupted; message corrupted.
Co-effectors: Failure of either or both of EBA and EBA’.
Detection / Protection: Pressure Control has feedback from sensors; clean node breaks; enumerated types.
Effects: Pressure Control will select inappropriate value — potentially hazardous if EBA or EBA’ have failed.
H?: Y
Justification / Design Recommendations: Justification: Use of enumerated types and sensor feedback should ensure that subtle failures do not persist.
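The protection argued for in the table can be illustrated with a short sketch. The Python fragment below is purely illustrative (parts of the demonstrator were coded in SPARK Ada, and the names used here are invented): it shows the kind of defensive selection logic the analysis recommends, in which an omitted or out-of-range selection value forces a fallback to the Basic Braking Algorithm output.

```python
# Illustrative sketch only: defensive selection of the braking algorithm output
# in the style argued for in Table I. Names and types are invented; the real
# demonstrator was implemented in SPARK Ada, not Python.

from enum import Enum
from typing import Optional

class Selection(Enum):     # enumerated type, as recommended in the analysis
    BBA = 0
    EBA = 1
    EBA_PRIME = 2

def choose_output(selection: Optional[Selection],
                  bba: float, eba: float, eba_prime: float) -> float:
    """Return the actuator demand to use.

    OMISSION (no selection message) and VALUE, COARSE (impossible value) both
    fall back to the Basic Braking Algorithm output; run-time checking of the
    enumerated type makes an out-of-range value detectable."""
    if selection is None or not isinstance(selection, Selection):
        return bba                      # defined fallback position
    if selection is Selection.EBA:
        return eba
    if selection is Selection.EBA_PRIME:
        return eba_prime
    return bba

# Example: an omitted selection value defaults to the BBA demand.
assert choose_output(None, bba=0.3, eba=0.5, eba_prime=0.4) == 0.3
```

Note that the VALUE, SUBTLE case cannot be caught this way; as the table records, it relies on the sensor feedback available to Pressure Control.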
The achievement of a satisfactory design and hazard analysis must be followed by further work to ensure that the failure properties of the implementation are no worse than those assumed in this analysis of the design. In this case study, this was achieved through the use of Fenelon’s Failure Propagation and Transformation Notation (FPTN) [10] to investigate the failure properties of the lower levels of the design, and timing and fault tree analyses of the program code.

FPTN represents a system as a number of modules, each corresponding to a functional element of the system. The representation of each module includes descriptions of the failure modes which may be generated through its own internal operation, a list of the failure modes the module is capable of handling (i.e. dealing with in such a way that no failure of any type is propagated out of the module), and a set of equations describing the output failure modes of the module in terms of its input and internal failure modes. These equations also represent the cases where a failure of one type is transformed by the action of the function into a failure of another type. For example, an out-of-range input value (i.e. a detectable value failure) used in a calculation may produce an incorrect but in-range result (i.e. an undetectable value failure) as a result of clipping. This type of information is extremely useful in determining where it is necessary to place range checkers, error handlers and other defensive programming devices. In the case study example, the message-passing architecture was found to transform many types of input failure into omission output failures, as the designer had asserted in the SHARD analysis. Each FPTN equation corresponds to the minimal-cutset form of a fault-tree (in this case a software fault-tree) derived from the implementation features of that module; output failure events of the equations correspond to top events of the trees; input failures correspond to leaf nodes. An outline of a single FPTN module is shown in figure 3. The modules are connected by the failures which may propagate between them, and may be nested hierarchically (with FPTN’s consistency laws governing the connections between nested modules). All failures in an FPTN representation of a system are labelled with their type. In Fenelon’s original work, the failure classifications used were based on work by Ezhilchelvan and Shrivastava [11]; however, for this study, we used failure classifications which reflected the SHARD guide words, e.g. x:o for omission failure x, and y:Vu for undetectable value failure y.

The FPTN process uses the failure modes identified in the SHARD analysis, and considers interactions both between coincident failures, and between failures and correctly operating elements of the system. In carrying out an FPTN analysis of this system, we identified some serious problems of comprehension and expression relating to the injection of faults. Reasoning about faults in the fault injection process is extremely difficult — for example, if some failure causes a requested fault event to be omitted, the result will (erroneously) be normal operation of the system. This is both confusing,
and hard to express in notations such as fault trees. In fact, the most manageable approach to this problem appears to be in considering the effects of failures in fault injection on the testability of the system. We believe that the addition of concepts of state and modality to FPTN will enable us to reason more effectively about such conditions.

Figure 4 shows a fragment of the FPTN model constructed for the CAB system. The diagram shown describes the failure properties of the Fault Manager module. The example appears somewhat cryptic, as very short mnemonic names have been used for the failure modes, but the important feature to note is that all the input failure modes to this module are handled within it; output failures can only be generated by internal or infrastructural failures. Flow 8, shown as one of the outputs of this module, is the flow which is analysed in table I; the FPTN shows that, in fact, only the OMISSION and SUBTLE VALUE failures are actually possible — better than was assumed in the initial hazard analysis. We were able to show that the actual failure properties of the whole system were at least as good as had been assumed in the SHARD analysis, and thus that the design meets its safety requirements.

Parts of the CAB demonstrator, including the Fault Manager, have been coded using SPARK Ada [12]. The production of the code was guided by the SHARD and FPTN analysis. The FPTN analysis in particular gives a good definition of the failure modes that must be considered and handled in a module. As was determined to be necessary at the SHARD review meeting, the code has been annotated and analysed for its worst-case execution time behaviour, using a tool developed at York [13][14], based on earlier work at the University of Washington [15]. Results show that the tasks can be guaranteed to complete each period. Schedulability analysis [16] allows us to determine the response time, in the worst case, when other tasks are resident on the same processor. The code has also been annotated and analysed using a variant of Software Fault Tree Analysis [17][18] and employing a tool produced at York by the Software Safety Assessment Procedures (SSAP) project [8]. The Monitor and Fault Manager subsystems were straightforward – the fault trees produced by SSAP indicated that the code was algorithmically very simple, and that the only credible failure modes could have arisen from either typographical errors in the coding, or omission of complete sections of program code at execution time.

This case study was extremely successful in demonstrating the integration of SHARD with the development process, and in showing that a carefully conducted SHARD analysis can provide guidance and justification for a design which can later be validated by more detailed analysis using other techniques. We also gained significant insight into the role of FPTN in an integrated development process. This study has provided us with much experience in the integration of safety analysis techniques and has assisted us in our search for a viable lifecycle model for the integrated development and assessment of programmable systems.
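As an indication of the kind of calculation involved in the schedulability analysis mentioned above, the sketch below (Python, illustrative only) applies the standard fixed-priority response-time recurrence from the real-time scheduling literature; the task parameters are invented and this is not the actual analysis performed on the CAB code.

```python
# Illustrative sketch only: the standard fixed-priority response-time
# recurrence, R = C_i + sum over higher-priority tasks j of ceil(R/T_j) * C_j,
# of the kind the schedulability analysis citation refers to. Task parameters
# here are invented, not those of the CAB demonstrator.

from math import ceil

def response_time(C_i, higher_priority):
    """Worst-case response time of a task with computation time C_i, given
    (C_j, T_j) pairs for tasks of higher priority on the same processor.
    (The iteration diverges if the task set is unschedulable; a real tool
    would bound it by the task's deadline.)"""
    R = C_i
    while True:
        R_next = C_i + sum(ceil(R / T) * C for (C, T) in higher_priority)
        if R_next == R:
            return R            # fixed point reached
        R = R_next

# Example (invented figures, in milliseconds): a 3 ms task beneath two others.
print(response_time(3, [(1, 10), (2, 25)]))   # -> 6
```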
[Figure 3. Outline of a FPTN module. The module box is labelled with the module name, the SHARD flow ID(s) of its incoming and outgoing flows, and whether it is critical. It lists the input failures (with types) and their source; a HANDLED list of failures (with types) this module can prevent from propagating; an INTERNAL list of failures (with types) which can arise from operation of this module or related infrastructural failures; the equations describing output failures in terms of input and internal failures; and the output failures (with types) and their destination.]
[Figure 4. FPTN for Fault Manager Subsystem. The FAULT MANAGER module (critical: Y) receives monitor data from the Monitor on flows (10,48), with input failures MFO:o, MFD:Vd and MFU:Vu. HANDLED: MFO:o, MFD:Vd, ID:Vd, FAD:Vd. GENERATED internally: processor failures P1:o, P2:o, P3:o, P4:o; bus failures B1:o, B2:o; by fault injection IO:o, ID:Vd, IU:Vu; by algorithm FAO:o, FAD:Vd, FAU:Vu. Output equations: SW1O == B1&B2 | P1&P2&P3&P4 | IO | FAO; SW1U == MFU | IU | FAU; SW2O == B1&B2 | P1&P2&P3&P4 | IO | FAO; SW2U == MFU | IU | FAU. Output failures on flows (8,9): SW1O:o and SW1U:Vu (switch 1 value, to ABS) and SW2O:o and SW2U:Vu (switch 2 value, to pressure control).]
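To show how the failure equations in figure 4 can be read, the sketch below (Python, illustrative only; not a tool used in the study) encodes the Fault Manager's output-failure equations as predicates over a set of failure labels, and demonstrates why flow 8 can only exhibit omission or subtle value failures.

```python
# Illustrative sketch only: the Fault Manager output-failure equations from
# figure 4, written as Python predicates over a set of failure-mode labels.
# This is a reading aid for the notation, not a tool used in the study.

def fault_manager_outputs(failures: set[str]) -> set[str]:
    """Given the input/internal failures present (by label), return the output
    failure modes of the Fault Manager according to the figure 4 equations."""
    f = failures.__contains__
    out = set()
    # Omission of a switch value: both buses, or all four processors, or an
    # omission caused by fault injection or by the algorithm itself.
    if (f("B1") and f("B2")) or (f("P1") and f("P2") and f("P3") and f("P4")) \
            or f("IO") or f("FAO"):
        out.update({"SW1O", "SW2O"})
    # Undetectable (subtle) value failure of a switch value.
    if f("MFU") or f("IU") or f("FAU"):
        out.update({"SW1U", "SW2U"})
    return out

# A single bus failure on its own produces no output failure...
assert fault_manager_outputs({"B1"}) == set()
# ...whereas loss of both buses leads to omission of both switch values.
assert fault_manager_outputs({"B1", "B2"}) == {"SW1O", "SW2O"}
```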
V. FUTURE WORK: FURTHER CASE STUDIES
We are currently undertaking a SHARD study of a naval communication system. As with the CAB system, this project is under development, so the results of the analysis can be used to drive design improvements. Unlike the CAB project, it is a commercial product and not a technology demonstrator, so this study must cope with real world budget and timing pressures. The system under study is an extension to an existing system. However, it is relatively independent, and specification and design are being developed top-down, as if it were a completely new product. The design is expressed in Yourdon / DeMarco notation [19][20], and new interpretations of the guide words have been developed for this. To date, we have analysed parts of the Essential Model, and results so far are promising. No unexpected hazards have been found, but the recommendations which have been made as a result of the analysis have been found appropriate and acceptable. We expect that the major analysis effort will concentrate on the early stages of development of the Implementation Model. The emphasis of this study is on investigating how the SHARD approach to hazard and safety analysis can be integrated into an existing process within a company, and on attempting to measure the change in the workload imposed by this new process.

We are also about to launch a further SHARD study of a civil avionics system. This study, which will be of about six man-months duration, will be conducted almost entirely by British Aerospace staff, with members of the research team at York acting as advisers. The principal aim of this study is to test and compare a number of novel techniques, to assess their potential as part of the safety critical systems development process of the future. We hope that it will also provide us with useful feedback on the strengths and shortcomings of SHARD.

In addition, we are revisiting a more sophisticated version of the computer-aided braking system, this time integrating SHARD, FPTN and software safety analyses into a formal development process, starting with a specification (in the Z language) of the system and refining it into executable Ada code with the aid of the CADiZ Z toolset developed at York [21].

Related work on hazard analysis covers the formalisation of the HAZOP method. Fenelon and Hebbron [9] have investigated a causal formalisation of HAZOP and demonstrated its relationship to other safety analysis techniques;
Hebbron and Fencott [22] have proposed methods for analysing designs expressed in Ward/Mellor essential models and Milner’s Calculus for Communicating Systems by use of HAZOP. The formalisation of HAZOP, and the establishment of a framework for systematically linking HAZOP to other design notations, provides us with increased confidence that a HAZOP study can capture all of the relevant properties of a design in a controlled fashion. The scope of future and related work illustrates that HAZOP and related techniques can be applied to many different design processes and development lifecycles; the method can be readily adapted to suit a particular process, while still retaining the features which make it useful and popular.

VI. CONCLUSIONS

From the practical experience we have gained of the application of HAZOP and SHARD to computer-based systems, we believe that the techniques have widespread applicability and acceptability, and provide a useful way of investigating the safety properties of a wide range of computer-based systems. The techniques show potential to address the perceived need (of industry) to integrate hazard and safety analyses more strongly into the development process, and to use the results of such analyses to guide design development. Related work serves to increase our confidence in the scope for use of the method in conjunction with more well-established safety analysis techniques.

This paper has described several case studies, but concentrated on two — the avionics system team HAZOP, and the SHARD study of the computer assisted braking system. The avionics HAZOP contributed significantly to our understanding of some aspects of the working procedures for hazard analyses of computer-based systems. It led us to question the team approach; one single case study cannot be regarded as conclusive, but this study provided evidence to support the view that a team HAZOP is not a cost efficient approach to low-level analysis of complex systems. From a technical viewpoint, the study served to emphasise the importance of a top-down, outputs-first order of working, to maximise working efficiency, and ensure that the results of earlier analyses can be considered and incorporated at lower levels. The study also suggested that there is a level of detailed design development below which HAZOP, and to a somewhat lesser extent SHARD, are not appropriate. We now believe that this type of analysis should only be continued down to a level where the definition of data flowing between components is still more detailed and concrete than the definition of the functionality of the components. This is a relatively high level, but at a lower level we believe it may be more useful and cost effective to carry out analyses based on function. Techniques such as Functional Failure Analysis (FFA) are already in use, but further research is needed to investigate how these can be integrated with HAZOP.

The SHARD study of the computer assisted braking system is significant because of the technical progress we were
able to demonstrate. In this study, the hazard analysis was carried out as an integrated part of the development, and we were able to use the results directly to provide real guidance to the design process. This study was also our first attempt to integrate FPTN, software fault trees and other techniques into a SHARD-based analysis as development progressed, in order to assess how well the SHARD analysis had identified the safety properties of the design. We were able to show that the initial SHARD analysis was realistic, and that the completed design (with the improvements suggested by the SHARD process) had safety properties which were no worse than those predicted by the initial analysis. This study was also significant because the developers of SHARD did not carry out the study, acting instead as reviewers. The comparative ease and success with which the study was carried out suggests that the techniques should be relatively easy to teach, and to transfer into the industrial environment.

Our work is now expected to develop in two distinct new directions. The first is an extension of the work on method integration, described in [23] and [24], with emphasis on the use of safety analyses to help design synthesis. A new project has just commenced at York which is looking at techniques for synthesising high integrity applications, and the type of analysis techniques described in this paper will be used to help assess and select between alternative designs. The second area of work, which has been particularly highlighted by these case studies, is an examination of the specification and assessment of safety properties at interfaces (i.e. system and subsystem boundaries) — particularly where sub-contractors are employed, or independent developers are required to work on different parts of a system. We are particularly interested in developing techniques for expressing the expected contributions of component and subsystem failures to hazards, and assessing these expectations against developing or completed designs to ensure that they are accurate. We hope that this will help control safety analysis for complex systems which are developed by multiple organisations.

ACKNOWLEDGMENTS

This work was supported in part by British Aerospace under the activities of the BAe Dependable Computing Systems Centre at the University of York.

REFERENCES

[1] J. A. McDermid and D. J. Pumfrey, “A development of hazard analysis to aid software design”, in COMPASS ’94: Proceedings of the Ninth Annual Conference on Computer Assurance. IEEE / NIST, Gaithersburg, MD, June 1994, pp. 17–25.
[2] CISHEC, A Guide to Hazard and Operability Studies, The Chemical Industry Safety and Health Council of the Chemical Industries Association Ltd., 1977.
[3] T. Kletz, Hazop and Hazan: Identifying and Assessing Process Industry Hazards, Institution of Chemical Engineers, third edition, 1992.
[4] A. Bondavalli and L. Simoncini, “Failure classification with respect to detection”, in First Year Report, Task B: Specification and Design for Dependability, Volume 2. ESPRIT BRA Project 3092: Predictably Dependable Computing Systems, May 1990.
[5] H. R. Simpson, “Methodological and notational conventions in DORIS real time networks”, British Aerospace Dynamics Division, Feb. 1993.
[6] MoD, Draft Defence Standard 00-56: Safety Management Requirements for Defence Systems Containing Programmable Electronics, Ministry of Defence, Feb. 1993.
[7] G. P. Mullery, “CORE — a method for COntrolled Requirements Expression”, in System and Software Requirements Engineering, R. H. Thayer and M. Dorfman, Eds., pp. 304–313. IEEE Press, 1987.
[8] P. Fenelon and J. A. McDermid, “An integrated toolset for software safety analysis”, Journal of Systems and Software, pp. 2/1–2/16, Mar. 1993.
[9] P. Fenelon and B. D. Hebbron, “Applying HAZOP to software engineering models”, in Risk Management And Critical Protective Systems: Proceedings of SARSS 1994, Altrincham, England, Oct. 1994, pp. 1/1–1/16, The Safety And Reliability Society.
[10] P. Fenelon and J. A. McDermid, “Integrated techniques for software safety analysis”, in Proceedings of the IEE Colloquium on Hazard Analysis. Nov. 1992, Institution of Electrical Engineers.
[11] P. D. Ezhilchelvan and S. K. Shrivastava, “A classification of faults in systems”, University of Newcastle upon Tyne, 1989.
[12] B. A. Carre and T. J. Jennings, “A subset of Ada for formal verification (SPARK)”, Ada User, vol. 9 (Supplement), pp. 121–126, 1989.
[13] R. Chapman, A. Burns, and A. Wellings, “Integrated program proof and worst-case timing analysis of SPARK Ada”, in Proceedings of the ACM SIGPLAN Workshop on Language, Compiler and Tool Support for Real-Time Systems, Orlando, Florida, June 1994.
[14] R. Chapman, “Worst-case timing analysis via finding longest paths in SPARK Ada basic-path graphs”, Department of Computer Science Yellow Report YCS 246, University of York, Oct. 1994.
[15] C. Y. Park and A. C. Shaw, “Experiments with a program timing tool based on source-level timing schema”, IEEE Computer, pp. 48–56, May 1991.
[16] N. C. Audsley, A. Burns, M. F. Richardson, and A. J. Wellings, “STRESS: A simulator for hard real-time systems”, Software — Practice and Experience, vol. 24, no. 6, pp. 543–564, 1994.
[17] N. H. Roberts, W. E. Vesely, D. F. Haasl, and F. F. Goldberg, Fault Tree Handbook, Systems and Reliability Research Office of U.S. Nuclear Regulatory Commission, Jan. 1981.
[18] N. G. Leveson and P. R. Harvey, “Software fault tree analysis”, Journal of Systems and Software, vol. 3, pp. 173–181, 1983.
[19] T. De Marco, Structured Analysis and System Specification, Prentice-Hall, 1978.
[20] E. Yourdon and L. L. Constantine, Structured Design: Fundamentals of a Discipline of Computer Program and Systems Design, Prentice-Hall, 1985.
[21] D. T. Jordan, C. J. Locke, J. A. McDermid, B. A. P. Sharpe, I. Toyn, and C. E. Parker, “Literate formal development from Z to Ada for safety critical applications”, in Proceedings of Safecomp 94, V. Maggioli, Ed. 1994, pp. 1–11, Instrument Society Of America.
[22] B. D. Hebbron and P. C. Fencott, “The application of HAZOP studies to integrated requirements models for control systems”, in Proceedings of SAFECOMP ’94, Oct. 1994.
[23] P. Fenelon, J. A. McDermid, M. Nicholson, and D. J. Pumfrey, “Towards integrated safety analysis and design”, ACM Applied Computing Review, Aug. 1994.
[24] A. Burns and J. A. McDermid, “Real-time safety-critical systems: analysis and synthesis”, Software Engineering Journal, vol. 9, no. 6, pp. 267–281, Nov. 1994.