2014 IEEE/IFIP Conference on Software Architecture

Lessons Learned from Safety-Critical Software-Based Automation Architectures of Nuclear Power Plants

Eero Uusitalo¹, Mikko Raatikainen¹, Mika Koskela², Varvana Myllärniemi¹, Tomi Männistö³

¹ Aalto University, Finland, [email protected]
² Radiation and Nuclear Safety Authority, Helsinki, Finland, [email protected]
³ University of Helsinki, Helsinki, Finland, [email protected]

Abstract—Engineering large software-based systems in safety-critical domains is a challenging task despite extensive research on the topic. The software technologies and development processes are established, and basic safety principles are well known. However, demonstrating the safety of a software-based automation system remains a key challenge, particularly in the nuclear domain. In this paper, we describe our experiences from current nuclear projects featuring software-based automation. We observed a number of assumptions in regulation and standards related to safety principles, such as separation and diversity, which do not apply to software systems. These assumptions result in unrealistic expectations for software-based systems, making both design and safety demonstration challenging.

Index Terms—Nuclear, automation, digital automation, instrumentation and control, software architecture, hardware architecture, safety requirements, safety principles

I. INTRODUCTION

The nuclear energy industry is one of the most safety-critical and regulated application domains. Safety means the absence of harmful consequences for the users and the environment [1]. Traditionally, mechanical and electrical components and non-programmable electronics have been preferred for realizing safety functions [2, 3]. Nuclear power plants in particular relied on analog relay technology until recently. Nowadays, both operating and newly built nuclear power plants face the same challenge: there is no viable option for building safety functions in the traditional relay manner. The reasons are threefold: non-software components are poorly available; the control benefits of networked, cost-effective software-based safety instrumentation and control are tempting; and software-based technology can be used to improve the visualization of the human-machine interface (HMI).

The transition to digital automation is also ongoing in Finland: two operable units in Loviisa are currently under renewal, another two operable units in Olkiluoto are planned to be renewed within a few years, and the Olkiluoto 3 unit is under construction. Two additional units, Olkiluoto 4 and Hanhikivi 1, are being planned, both of which are foreseen to utilize software-based control systems.

It is generally recognized that the use of software in safety-critical applications presents new challenges. When designing safety functions with software-based technology, the software systems architecture plays a major role [2, 4]. There have been several proposals to address safety in software architecture design: for example, the use of safety tactics [4] and the use of safety patterns [5]. The approach chosen also depends on the characteristics of the application domain. The IEC 61508 paradigm, widely used in machinery and the chemical industry, builds on separate safety functions. The approach works well, e.g., in machinery, where fail-safe behavior is easily reached by emergency stop functionality or an equivalent. In cases where there is no unique safe state or the control has to be continuous, i.e., emergency shutdown is impossible, safety is achieved by increasing system reliability, as in space systems [6]. Nuclear safety functions cannot rely solely on either approach. Many crucial safety functions in nuclear power plants are fail-safe, such as the reactor trip that controls the fission reaction, but there are some controls for which a safe state cannot be uniquely specified, e.g., the valves controlling containment integrity. Additionally, the current trend promotes high-output plant designs with large reactors. In such designs, many safety functions cause undesirable transients that reduce plant lifetime. Thus, robust, symptom-based fail-safe protection functions are often not preferred as the first course of action, and softer, state-based protection functions have increased the complexity of control.

However, the current design practices as well as the regulatory requirements of the nuclear industry date back to, and reflect, the era of once-dominant analog technology. The basic principles, such as SAHARA (safety as high as reasonably achievable), are general, and the requirements on plant-level safety systems architecture design are ambiguous. Regulation is more focused on device-level or single-system-level issues. The more detailed requirements and practices have certain characteristics inherited from the principles of analog technology, and new challenges emerge when incorporating software-based automation.

In this paper, we present our observations of the challenges of incorporating software-based automation into nuclear power plants from the point of view of both design and licensing, and the architectural design implications arising from these challenges. The findings are based on our first-hand experience gained both in the licensing processes of software-based automation in Finnish nuclear power plants and in the recent renewal of the Finnish nuclear safety guidelines.


II. WHAT KIND OF SAFETY-RELATED CHALLENGES DID WE OBSERVE IN INCORPORATING SOFTWARE-BASED AUTOMATION IN NUCLEAR POWER PLANTS?

We recognized five assumptions that affect the design and regulatory licensing of the automation of nuclear power plants. All of these assumptions have roots in the experience accumulated during decades of building and licensing relay-based systems for both the process and safety systems of nuclear power plants. However, these assumptions cannot be taken for granted when software-based systems and networked architectures are used.

A. Assumption 1: Frequency-based measures can be readily applied for estimating software failure rates

The safety design and assessment applied in nuclear safety rely on a structure of pre-classified initiating events, for each of which a frequency, consequence, required mitigation measures, and fault criteria are set. Events that have a higher frequency must have a lower consequence, and vice versa. The required mitigation measures depend on the severity of the event, up to a certain point where low frequency starts to dominate; e.g., the impact of a large meteorite is not mitigated. The frequencies of plant-internal events are calculated on the basis of estimated breakage, where statistical models apply well: for example, a pump is a physical component subject to material deficiencies and wear, enabling the estimation of its random failure frequency. A pump design and its physical properties, such as the quality of materials, can be thoroughly inspected in order to reduce the likelihood of common cause failure (CCF), i.e., multiple pumps failing simultaneously for a systematic reason. Thus, random failures dominate the failure types.

However, software errors are systematic, and their consequences are labeled "software common cause failures" (SW CCF). The issue is that any CCF is automatically linked with the notion of rarity of the failure, even though the rarity of software failure cannot be reliably estimated. Despite extensive research on the topic, the estimation of software error rates has been found to be error-prone and highly dependent on the qualifications of the inspectors involved [7].

There is a tendency to define failure frequencies for software based on the consequence of the software failure, due to this inherent linking of consequence with frequency. For example, a large coolant leak event has a certain predetermined acceptable frequency; hence a software failure resulting in a large coolant leak must have the same frequency. This approach has little to do with actually estimating the rate of software failure. Rather, it sets a target for the frequency, but showing that the target is achieved is uncertain due to the lack of a proper methodology. Some IAEA member countries use independent confidence-building tests based on statistically independent test cases to build reliance on system reliability, but generating statistically independent test cases remains challenging. Also, the sensibility of estimating the point probability of a single specific software-based failure scenario is questionable.
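To make the burden of this statistical route concrete, the sketch below computes the number of failure-free, statistically independent test demands needed to support a per-demand failure probability claim. It uses the standard zero-failure binomial bound; the target values and the 95% confidence level are our own illustrative assumptions, not figures taken from any guide.

```python
import math

def tests_required(pfd_target: float, confidence: float) -> int:
    """Failure-free independent test demands needed to claim that the
    probability of failure on demand is at most pfd_target at the given
    confidence level, from the zero-failure binomial bound
    (1 - pfd_target)**n <= 1 - confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - pfd_target))

# Illustrative targets only; actual claims depend on the safety class.
for pfd in (1e-3, 1e-4, 1e-5):
    print(f"pfd <= {pfd:g} at 95% confidence: {tests_required(pfd, 0.95):,} tests")
# pfd <= 0.001  at 95% confidence: 2,995 tests
# pfd <= 0.0001 at 95% confidence: 29,956 tests
# pfd <= 1e-05  at 95% confidence: 299,572 tests
```

Each order of magnitude in the claimed reliability multiplies the test burden roughly tenfold, and the bound holds only if the test demands are truly statistically independent and representative of the operational profile, which is precisely the condition noted above as hard to establish.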

Finally, the combination of simultaneous software failures with other types of events is treated using basic mathematical rules that rest on the assumption that an initiating event and a software failure are independent. For almost any initiating event, superposing an estimated per-demand frequency of software failure yields a combined frequency low enough to be beyond consideration. However, the software errors still remaining after exhaustive system testing tend to be concentrated on rare execution paths that are not necessarily activated during testing or normal operation [8]. Thus, we consider the assumption of independence between an initiating event and a simultaneous software failure to be suspect.
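A minimal numeric sketch of the screening logic questioned above; every figure in it is hypothetical and chosen only to expose the structure of the argument.

```python
# All numbers are hypothetical. Under the independence assumption, the
# combined frequency is a simple product and almost always falls below
# a screening limit.
f_initiating_event = 1e-2    # assumed frequency of an initiating event [1/year]
p_sw_failure = 1e-4          # assumed per-demand software failure probability
screening_limit = 1e-5       # hypothetical "beyond consideration" cut-off [1/year]

combined_frequency = f_initiating_event * p_sw_failure   # [1/year]
print(f"combined frequency: {combined_frequency:.1e} /year")   # 1.0e-06 /year

# Below the limit, so the scenario is screened out. But if the software
# error lives on an execution path exercised mainly *during* such an
# event, the independence assumption -- and the product rule with it --
# collapses.
```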

B. Assumption 2: The principles of analog technology are suitable for assessing separation and the spreading of failures

Relay-based analog automation systems bind hardware and the functionality implemented by that hardware together in a clear manner. A single relay can be uniquely identified as part of a specific safety function. The measurement-logic-actuation chain is sometimes called an automation "channel" or a protection "signal". The inherent level of functional separation is very high in relay-based systems due to the close relationship between hardware and functionality.

Current software-based automation systems are heavily networked, and a similar relationship between device-level and system-level functionality is no longer obvious. For instance, a general-purpose automation processor may execute input signal validation, a second one the general logic, and a third one the data exchange and interface control, each processor serving several functions at a time. Systems can be coupled in ways that were unfeasible for analog technology. A cloud-like distributed control system, and particularly its failure behavior, differs radically from relay channels. The inherent separation level of software-based automation is far lower, and the question of separation is much more complicated than in relay-based systems. Functional separation in software-based automation could be implemented by software, bringing new layers of uncertainty, as separation is not guaranteed in the same manner as in physically disconnected systems.

The effects of existing regulatory requirements also change: for analog technology, a requirement for electrical separation implies, as a side effect, the absence of communication channels. In software-based automation, a fiber optic link fulfills this separation requirement without decoupling communication. In other engineering domains, there are established practices for separation exceptions. For instance, the usual convention is that the same power supply lines (busbars) are allowed to supply systems residing on several levels of defense-in-depth, even though the levels should be separate. Currently, the IAEA guidelines concerning separation in software-based automation architecture design are ambiguous or nonexistent.
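As a minimal sketch of this shift, consider how separation, visible by construction in a relay channel, becomes a property that must be checked explicitly in a networked design. The function-to-processor allocation below is entirely invented, mirroring the processor example above.

```python
# Invented allocation: each processing block serves several safety
# functions at once, as in the processor example in the text.
allocation = {
    "input_validation": {"cpu": "P1", "functions": {"trip_A", "trip_B"}},
    "protection_logic": {"cpu": "P2", "functions": {"trip_A", "trip_B"}},
    "bus_interface":    {"cpu": "P3", "functions": {"trip_A", "trip_B", "hmi"}},
}

def shared_hardware(fn_a: str, fn_b: str) -> list[str]:
    """Processors that execute code on behalf of both functions --
    each one a potential common point of failure that the relay-era
    one-relay-one-function view of separation would never reveal."""
    return [blk["cpu"] for blk in allocation.values()
            if {fn_a, fn_b} <= blk["functions"]]

print(shared_hardware("trip_A", "trip_B"))   # ['P1', 'P2', 'P3']
```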

C. Assumption 3: The failure modes of relay automation and software-based automation are similar

There is a tendency to state that a failure of automation equals the loss of automation. Even though spurious actuations (an unintended startup or state change of a process device, also referred to as a malfunction) have been an issue even with analog systems, the complexity and networked nature of software-based automation mean that spurious actuation could occur on a much larger scale. Devices and systems could be switched on or off, or they could operate too early, too late, or out of sequence [3]. Self-diagnostics and watchdogs can help shift some of the failure modes towards the loss of automation, but the practical elimination of spurious actuations cannot be guaranteed.

To further highlight the issue, in the body of knowledge of nuclear safety, the definition of "failure" does not consistently include spurious actuations or malfunctions. In [9], failure is described as the inability of a system, structure or component to function within its acceptance criteria, and it is considered to fail at the point in time when it "becomes incapable of functioning". The term "malfunction" is used in the glossary but is not included as a glossary item. Conversely, in [10], it is accepted that software failure includes both the loss of function and malfunctions; the term "failure" is nevertheless used throughout that guide.
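The following sketch, with invented names and thresholds, illustrates why watchdogs shift failures toward the loss of automation but cannot eliminate spurious actuation: a deadline monitor detects a computation that hangs or overruns, while a wrong command that arrives on time looks healthy.

```python
import time

DEADLINE_S = 0.1   # assumed cycle deadline for one automation scan

def run_cycle(logic) -> str:
    """Deadline-based watchdog: converts a hung or slow computation
    into a detected loss of function (fail-stop)."""
    start = time.monotonic()
    command = logic()
    if time.monotonic() - start > DEADLINE_S:
        return "TRIP_TO_SAFE_STATE"   # detected: demoted to loss of function
    return command                    # an on-time output is trusted as-is

def faulty_logic() -> str:
    # Completes well within the deadline but commands the wrong action:
    # a spurious actuation the watchdog cannot recognize as a failure.
    return "OPEN_VALVE"   # correct behavior would have been "CLOSE_VALVE"

print(run_cycle(faulty_logic))   # -> OPEN_VALVE, passed through unnoticed
```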

D. Assumption 4: User interfaces and maintenance tools have smaller safety significance

Centralizing user interfaces has been an ongoing trend in the nuclear domain ever since the Three Mile Island accident of 1979, one of whose root causes was identified to be HMI design oversights. It has so far been largely accepted in different national classification practices, even if not explicitly stated in regulation, that user interfaces can be of a lower safety class (SC) than the controlled system. For example, in Finland, an SC2 (analogous to IEC 61226 category A) protection system may be controlled by an SC3 user interface, and so on. The reasons for this allowance are historical: the operator, as a human being, cannot be credited with SC2 reliability, hence an SC3 user interface suffices. Additionally, the differences between physical push buttons across safety classes are not considered major, and the potential impact of a single malfunctioning push button is limited in scope. The problem has been solvable by applying the single failure criterion.

However, the quality requirements as well as the potential complexity of software-based user interfaces are considerably different. SC2 software has requirements for well-defined quality measures, such as independent V&V, whereas SC3 software has far fewer quality requirements. Moreover, the potential effect of a fault in the software is not necessarily limited to the same scope as a fault in a single push button. In the worst case, a malfunction of a unified user interface could cause a malfunction of all the controlled process systems.

The same principles apply to maintenance units and file servers as well: they may be allowed to be of a lower safety class than the systems they are associated with. They are considered to be "auxiliary or support systems", which historically include, e.g., heating, ventilation and air conditioning (HVAC) systems. Though the failure of an HVAC system will lead to the eventual failure of the main systems, the potential safety significance of such a system malfunctioning is quite different from that of a maintenance unit, which is designed to alter the code and parameters of automation systems.

E. Assumption 5: Abstract safety principles are effective in mitigating the effects of software failures

The fundamental high-level safety principles (redundancy, diversity, and defense-in-depth) are sometimes seen as offering reliable mitigation against all software failures. While some software failures are mitigated, the characteristics of software-based automation systems described in Sections II.A and II.B, combined with the way software-based systems are designed and utilized, mean that an abstract implementation of the safety principles is not necessarily sufficient.

The implementation of the redundancy principle, and the consequent fulfillment of N+1 (or indeed N+X) fault criteria, is designed to mitigate random failures, but it is often believed, or wished, to help against software failures, too. However, redundancy without separation and diversity is not effective in mitigating systematic failures.

Diversification has been considered an effective method against common cause failure. This is true for mechanical devices, but for software failures, the effectiveness and the possible extent of diversification is a complex question. For example, the diversification of safety functions could be designed to mitigate only function specification errors, and diverse functions might run on the same CPU. Also, the effectiveness of diversity without appropriate separation is poor, which is usually forgotten in the domain literature.

The implementation of the defense-in-depth principle requires several different, independent defense lines, which use different mechanisms for mitigating the increasing consequences of an event. The current practice described in nuclear guidelines does not require actual plant-level defense-in-depth; rather, the defense model can vary between the technical domains of process systems, electrical systems, and automation. The loss of function of the automation of a defense line is usually mitigated by the other defense lines by design. However, the consequences of a spurious actuation of the automation of a given defense line vary case by case. Whether the defense-in-depth concept in use tolerates such events cannot be taken as a given.
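A back-of-the-envelope sketch of the redundancy argument above, using the common beta-factor model (the parameter values are our assumptions): redundancy suppresses the independent part of the channel failure probability, while the common-cause part, which is the whole story for identical software copies, passes through untouched.

```python
p_channel = 1e-3   # assumed failure probability of a single channel
beta = 0.1         # assumed fraction of channel failures that are common cause

def system_pfd(n_channels: int) -> float:
    """Beta-factor model: common-cause failures defeat all channels at
    once; only the independent remainder benefits from redundancy."""
    p_ccf = beta * p_channel
    p_independent = ((1.0 - beta) * p_channel) ** n_channels
    return p_ccf + p_independent

for n in (1, 2, 4):
    print(f"{n} channel(s): {system_pfd(n):.2e}")
# 1 channel(s): 1.00e-03
# 2 channel(s): 1.01e-04
# 4 channel(s): 1.00e-04   <- the common-cause floor; more copies add nothing.
# For N copies of the same software, beta is effectively 1: redundancy
# without diversity and separation gains nothing against systematic faults.
```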

III. WHAT ARE THE ARCHITECTURAL DESIGN IMPLICATIONS OF THESE CHALLENGES?

A. The systems-structures-components hierarchy is not sufficient for mapping safety principles to software

The use of safety principles is described in regulatory guides at quite an abstract level, specifically so with regard to separation. The hierarchy of systems, structures and components (SSC) is used. Additionally, safety functions are identified, and the mappings of functions to SSC define how safety functions are realized and safety principles implemented. In analog technology and physical structures, such a simple, compositional hierarchy is easily visible.

In software-based automation, several other hierarchies exist as well. Firstly, the physical architecture of software-based automation starts from the plant-level view of systems, sensors, and actuators. The systems feature networked and hardwired connections to other systems. A system can be further decomposed into subsystems.



Secondly, the software of a system is typically implemented in a layered model, consisting of application software, some sort of intermediate layers, and finally the operating system layer. The software may run in time-sharing mode, depending on the implementation. Thirdly, software has a logical structure for delivering its functionality, which is not necessarily equivalent to the other structures. The logical structure defines higher-level functionality in which several physical systems or components can participate. Consequently, it is possible that safety functions are realized by a number of interconnected elements, either physical or logical. The key point is that assumptions about the system decomposition structure or about system properties cannot be made in the same manner as with analog technology.

Regardless of whether there are specific requirements for software architecture, regulatory interest in the fulfillment of safety principles nevertheless exists. Similarly, V&V and testing activities rely on the specified characteristics of the systems. If the fulfillment (or non-fulfillment) of safety principles is documented only on the level of systems, structures, components and functions, unidentified points of uncertainty remain within the logical structures and software layers.

As an example, let us consider two functionally diverse safety functions running on the same CPU. The two functions are intended to use different inputs, and the functions are considered separate. Separation is accomplished by the function specification, and by the operating system ensuring the separation of functions that do not share inputs. However, if separation on the application level is not required, and a dependency between the functions is erroneously introduced in the application software, the dependency goes unnoticed by testing and V&V: these activities are based on the requirements, which in this case do not include separation on the application level. In order to attain better visibility into the actual structure of software-based systems, existing view-based approaches designed for software architectures [11] should be evaluated for feasibility in the domain.
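The sketch below makes this example concrete; everything in it, including the names, thresholds, and the caching flaw, is invented for illustration. Two nominally separate trip functions share an input-validation helper, a key mix-up couples them, and tests that exercise each function in isolation still pass.

```python
_validated = {}   # module-level cache shared by both functions (the flaw)

def validate(channel: str, raw: float) -> float:
    """Common input-validation helper reused by both safety functions."""
    if channel not in _validated:
        _validated[channel] = raw   # first caller wins
    return _validated[channel]

def trip_a(pressure: float) -> bool:
    # Bug: the cache key should be "ch_a"; "ch" silently couples the functions.
    return validate("ch", pressure) > 150.0

def trip_b(temperature: float) -> bool:
    return validate("ch", temperature) > 320.0   # same wrong key

# Requirements-based tests run each function in isolation and pass.
# In combined operation, whichever function runs first poisons the other:
print(trip_a(200.0))   # True: pressure trip actuates, caches 200.0
print(trip_b(400.0))   # False! reads the cached 200.0 and misses the demand
```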

B. Recognizing safety principles in architectural tradeoffs

The quality attributes set on automation architectures are often juxtaposed with safety principles. For example, unified user interfaces cater to usability but have the potential to adversely affect separation and independence. Likewise, high availability and robust safety functions can suggest opposite designs. Despite the fact that such quality attributes are inherently crosscutting architectural concerns with mutual interactions, we have so far not observed any systematic analysis or management of the tradeoffs between them in regulatory work. Rather, the quality attributes are inspected from their respective points of view (e.g., can the process be adjusted fast enough, does the operator understand what to do), and failure behavior is examined separately within the historical context of systems, structures and components, as described in this paper. Such a narrow approach contributes to the segregation of engineering disciplines and may lead to local optimizations. We believe the safety analysis of automation architectures should be performed in a plant-level context, balancing different concerns without sacrificing safety through tacit or salient compromises.

IV. DISCUSSION AND CONCLUSIONS

The current concepts used in nuclear safety include a number of firmly rooted assumptions that have emerged from principles that apply to analog, relay-based automation systems but do not apply to digital, software-based automation systems. The assumptions build on each other, creating a conceptual whole that prevents a realistic view of the characteristics of software-based systems. The inapplicability of the concepts results in challenges for design work as well as for regulatory work and safety demonstrations.

In order to help cope with the current body of regulation and the changed characteristics of the technology while striving for safe designs, we suggest that safety principles in software-based automation be examined in a context more applicable to the characteristics of software-based systems than the current systems, structures and components hierarchy. Several different architectural views may need to be used in order to understand how the safety principles are accomplished. Crosscutting architectural concerns should be managed in a systematic fashion. A great deal of further work is required in this domain to find conceptual models and safety evaluation methods that are not rooted in the technological qualities of past solutions, and that can be applied to current software-based technologies.

REFERENCES

[1] Avizienis, A.; Laprie, J.-C.; Randell, B.; Landwehr, C., "Basic concepts and taxonomy of dependable and secure computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, 2004.
[2] Vepsalainen, T.; Kuikka, S.; Eloranta, V., "Software architecture knowledge management for safety systems," 2012 IEEE 17th Conference on Emerging Technologies & Factory Automation (ETFA), pp. 1-8, 2012.
[3] Leveson, N. G., Engineering a Safer World: Systems Thinking Applied to Safety. MIT Press, 2012.
[4] Wu, W.; Kelly, T., "Safety tactics for software architecture design," Proceedings of the 28th Annual International Computer Software and Applications Conference (COMPSAC), pp. 368-375, 2004.
[5] Armoush, A., "Design Patterns for Safety-Critical Embedded Systems," RWTH Aachen University, Aachen, Germany, 2010.
[6] Klicker, M.; Putzer, H., "Toward software-based safety systems in space," 5th International Conference on Recent Advances in Space Technologies (RAST), pp. 517-521, 2011.
[7] Smidts, C. S. et al., A Large Scale Validation of a Methodology for Assessing Software Reliability (NUREG/CR-7042). U.S. Nuclear Regulatory Commission, Washington, D.C., 2011.
[8] Hecht, H., "Rare conditions - an important cause of failures," Proceedings of the Eighth Annual Conference on Computer Assurance (COMPASS '93), pp. 81-85, 1993.
[9] IAEA Safety Glossary. International Atomic Energy Agency, Vienna, 2007.
[10] Protecting against Common Cause Failures in Digital I&C Systems of Nuclear Power Plants. International Atomic Energy Agency, Vienna, 2009.
[11] ISO/IEC/IEEE 42010:2011 - Systems and software engineering - Architecture description, 2011.
