A Hazard Taxonomy for Embedded and Cyber-Physical Systems**

Bastian Tenbergen, Alexander C. Sturm, and Thorsten Weyer

paluno – The Ruhr Institute for Software Technology, University of Duisburg-Essen, Germany
{bastian.tenbergen, alexander.sturm, thorsten.weyer}@paluno.uni-due.de
Abstract. During embedded system development, safety assessment is concerned with ensuring that a system has no unsafe, hazardous effects that may lead to harm for human users and other systems in its context. This is particularly challenging for Cyber-Physical Systems (CPS), as the context of CPS may change dynamically at runtime and cannot be entirely anticipated at design time. In consequence, the effects of one individual CPS and of a collaborative group of CPS on their context can no longer be sufficiently characterized at design time. In this paper, we analyze the state of the art regarding embedded system hazards and substantiate our findings with post-hoc analyses of accidents from the avionic domain. Based on our findings, we propose a hazard taxonomy for embedded systems and extend it for CPS in order to support the systematic characterization of hazards during safety assessment.

Keywords: Cyber-Physical Systems; Embedded Systems; Safety Assessment; Hazards; Hazard Analysis; Operational Context; Taxonomy
1 Introduction
In recent years, the term "Cyber-Physical System" has been coined to describe an emerging category of software-intensive systems that dynamically combine the functionality of traditional embedded systems with highly interconnected future internet systems in dynamic contexts. Cyber-Physical Systems (CPS) observe their context by means of sensors, act upon their context by means of actuators, and communicate with other CPS in their context by means of technical infrastructure such as the Internet [2], [35]. Cyber-Physical Systems often collaboratively provide safety-critical functionality which cannot be carried out by one single CPS (cf. [2]). Thus, when developing individual CPS, particular emphasis must be placed on showing that the CPS does not cause harm to any other CPS or human users (see [2], [5]). In the development process of traditional embedded systems, this is done during safety assessment.
** This research was funded in part by the German Federal Ministry of Education and Research under grant number 01IS12005C.
© Copyright by the authors.
Safety assessment is concerned with identifying and mitigating hazards that emanate from a system [18] and thereby showing that a system is sufficiently safe when operating in its context [16]. A hazard is a condition that arises from some system functionality together with conditions in the system context and that could cause or contribute to an accident (cf. [22], [30]). In CPS, safety assessment is faced with two additional major challenges: First, the dynamic nature of the context may cause CPS to dynamically start and cease interacting with one another, making it difficult to predict which other systems will interact with the CPS during runtime. Second, since the overall functionality of a set of collaborating CPS may be unknown to the individual CPS, ensuring that the overall functionality is safe and hazard-free becomes particularly challenging. To foster the development of safe CPS, this paper provides a hazard taxonomy for CPS. The taxonomy places particular emphasis on the role of context for CPS and differentiates several context-related hazard characteristics. We base our taxonomy on a set of characteristics for hazards of embedded systems. To this end, we first elaborate on the relevant differences between CPS and embedded systems and illustrate the role of context in the development of such systems in Section 2. Subsequently, we review the relevant state of the art in Section 3 and conduct a post-hoc analysis of selected accident reports from the avionics domain in Section 4 in order to identify additional characteristics of hazards. Based thereon, we present a hazard taxonomy for embedded systems in Section 5. In Section 6, we revise the hazard taxonomy for embedded systems from Section 5 to address the specific properties of CPS. Section 7 concludes the paper and presents an outlook on future work.
2 Differences Between the Safety-relevant Context of Embedded Systems and Cyber-Physical Systems
Traditionally, the context of a system is the part of the environment that is relevant for the system under development (SUD). This comprises those entities and their corresponding properties that the SUD will interact with during operation (in the following: operational context, cf. [8], [33]) as well as additional information that constrains the development of the SUD (context of knowledge, see [7]). In context theory [13], the SUD is referred to as the context subject. Depending on the selection of the context subject, the context differs: For example, from the perspective of an aircraft, the pilot might be in the operational context, as she gives instructions to the aircraft (cf. [33]). Similarly, the passengers, other aircraft in the vicinity, Air Traffic Controllers, airports, etc. might all be relevant entities in the operational context of the aircraft. In contrast, the perimeter fence around the destination airport may not be relevant because it neither interacts with the aircraft nor constrains its development. When considering a different context subject, e.g. the Air Traffic Controller, relevant context entities include the airport, aircraft in the vicinity, and pilots of these aircraft. However, the passengers of the aircraft may no longer be relevant, as they have no influence on the development of the Air Traffic Controller. The selection of the precise scope of
the context subject is hence essential in order to determine the context that must be considered during development. In recent years, many scenarios for the application of CPS in the not too distant future have been proposed, some of them with a high degree of safety-criticality ([2], [5], [20]). From these scenarios, it becomes apparent that one of the core differences between CPS and embedded systems is that in the case of embedded systems, the external systems, human users, or functions that the embedded system interacts with are completely known at design time. In other words, embedded system development assumes a closed world in the sense of [31], which means that every entity that will possibly interact with the system during runtime is at least known on a type level [10]. In the case of CPS, this assumption no longer holds in general. When developing CPS, one is forced to assume an open world in the sense of [9], i.e. the context in which the system will operate is partially unknown at design time. CPS dynamically build functional networks of distributed systems at runtime that collaboratively fulfill a common purpose, often unbeknownst to the involved CPS themselves and sometimes by observing and acting upon shared resources within their mutual operational context (see [2], [20, 21]). In consequence, a CPS must be able to adapt to changes in its operational context during runtime which are not known at design time. This can be done by making assumptions about classes of context entities as well as their interaction with the CPS during operation, i.e. by assuming a localized closed world (cf. [9]). This poses significant new challenges for the development of safety-critical CPS, in that safety assessment must not only ensure that the individual CPS is safe, but also that their collaborative functionality does not lead to harm. This requires a clear understanding of the interaction of CPS and groups of CPS with their mutual context at runtime. Therefore, special contemplation is required when identifying and mitigating hazards during development. In the remainder of this paper, we seek to investigate context-related characteristics of hazards for CPS by building upon what is known about embedded system hazards.
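To illustrate the difference, consider the following minimal Python sketch; all class and field names are our own illustrative assumptions, not part of the paper. An embedded system's context model enumerates its context entity types exhaustively at design time (closed world), whereas a CPS can only admit runtime entities that match design-time, class-level assumptions about context entities (localized closed world); anything else is an open-world gap.

```python
from dataclasses import dataclass, field

@dataclass
class ContextEntityClass:
    name: str                       # e.g. "aircraft", "air traffic controller"
    assumed_interactions: set[str]  # interactions assumed at design time

@dataclass
class ContextModel:
    known_classes: dict[str, ContextEntityClass] = field(default_factory=dict)

    def admit(self, entity_class: str) -> bool:
        """Admit a runtime entity into the operational context only if its
        class was anticipated at design time (localized closed world)."""
        return entity_class in self.known_classes

model = ContextModel({"aircraft": ContextEntityClass("aircraft", {"position broadcast"})})
print(model.admit("aircraft"))  # True: matches a design-time class assumption
print(model.admit("drone"))     # False: unanticipated entity, i.e. an open-world gap
```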
3 Characteristics of Hazards from the Related Work
In order to build a foundation for a hazard taxonomy, this section reviews the related work on hazard-related concepts for embedded systems. We specifically adopt a broader view and include relevant fields of software engineering (e.g. reliability engineering [3] or security engineering [1]) as well as natural disaster management [4]. For example, in an attempt to compare the magnitude of environmental hazards (e.g. earthquakes, tornados, erupting volcanoes), Burton et al. [4] suggest "hazard event profiles". To do so, the authors differentiate between temporal characteristics of the hazard as it occurs (i.e. frequency, duration, speed of onset, and temporal spacing) and spatial characteristics (i.e. areal extent and spatial dispersion).

In [23], Liu, Yue, and Zhi propose a software dependability model in order to derive dependability requirements. To do so, the authors suggest three categories of threats that negatively affect dependability¹: congenital defects (i.e. defects that arise due to faulty development), input harms (i.e. hazardous events that negatively affect the system, such as faulty user interaction), and environment hazards (i.e. conditions in the environment that harm the system, such as fire, earthquakes, etc.).

¹ According to [12], [14], dependability comprises, among others, security and safety. Hence, the threat categories of [23] can be understood as hazard categories as well.

The authors' work is in line with a taxonomy for security engineering developed by Avizienis et al. [1]. In their taxonomy, the authors distinguish between faults, dependability, and security concerns and differentiate faults regarding their origin, i.e. whether a fault was caused by the system itself or by the context. Furthermore, the threat categories in [23] show that environment hazards are triggered by the operational context, while congenital defects as well as some types of input harms can only be committed by human users. This is similar to the work by Avizienis et al. [1], who attribute the occurrence of errors either to incorrect operation of a system in the environment, to some natural phenomenon (e.g. ecology or naturally occurring wear and tear of physical components), or to erroneous behavior of some human user. Similarly, in [6], Chambers notes that human users as well as inadequate safety engineering lifecycle management (i.e. a congenital defect according to [23]) are among the most common causes of disasters. In his work, Chambers provides a catalog of hazards that originate from humans (i.e. operators, developers, and managers).

Nancy Leveson discusses the nature of hazards in a number of her publications. Particularly in [19] and [22], she emphasizes that, given unfavorable conditions, hazards may potentially or inevitably lead to accidents. According to [19], this may greatly impact the choice of feasible mitigations: for example, if an accident inevitably follows some hazard, strategies that mitigate the hazard by reducing damages may be more feasible than strategies that prevent the hazard. In addition, in [22], Leveson stresses the role of causally related events that together lead to accidents. She argues that taking causality into account is, however, insufficient because "accidents are complex processes involving the entire socio-technical system" [22]. She hence makes the argument that in order to document hazards leading to an accident, the operational context of a system, along with its role in the causal chain, ought to be considered.

The field of reliability engineering has also suggested characteristics of concepts that can be related to hazards, as in many cases errors, faults, and failures may cause hazards. For example, in [26], Mukherjee suggests that errors may occur permanently (i.e. when proper functioning cannot be regained except through special maintenance) or temporarily (i.e. when, due to some change in circumstance or due to activated countermeasures, proper functioning is restored during operation). Although Mukherjee's book mainly concerns errors in electrical circuits, the same may hold true for hazards of embedded systems and CPS: If a hazard occurs due to some malfunction within the system or within a context entity of the system, it is reasonable to assume that the malfunction has some permanent or temporary quality to it (an example will be given in Section 4). Similarly, MacCollum [24] argues that malfunctions could arise spontaneously and intermittently, depending on the conditions in the operational context of the system, similar to what Mukherjee considers temporary.

A concept for differentiating between types of failures is suggested in [3]. The authors differentiate between "omission value failures", which are failures that result from a loss of the desired functionality, and "timing failures", which are failures that arise due to some functionality being delivered too early or too late. This is in line with many hazard analysis techniques (such as Hazard and Operability Studies [15]), which aim to characterize hazards based on some functional degradation of the system by making use of guide phrases (e.g. "failure to operate", "operates incorrectly", "operates inadvertently", "operates at wrong time", "unable to stop operation", or "receives/sends incorrect data", see [11]).
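Guide-phrase-driven identification of this kind is easy to illustrate. The following Python fragment is a hypothetical sketch in the style of HAZOP [15]; the function list and helper names are ours, not taken from any standard. Each combination is merely a candidate hazard that an analyst must assess and, in most cases, discard.

```python
# Guide phrases as quoted above; system functions are illustrative assumptions.
GUIDE_PHRASES = [
    "failure to operate", "operates incorrectly", "operates inadvertently",
    "operates at wrong time", "unable to stop operation",
    "receives incorrect data", "sends incorrect data",
]

SYSTEM_FUNCTIONS = ["extend slats", "deploy ground spoiler", "apply wheel brakes"]

def hazard_candidates(functions: list[str], phrases: list[str]):
    """Cross every system function with every guide phrase; each pair is a
    candidate hazard for subsequent analysis."""
    for function in functions:
        for phrase in phrases:
            yield f"{function}: {phrase}"

for candidate in hazard_candidates(SYSTEM_FUNCTIONS, GUIDE_PHRASES):
    print(candidate)  # e.g. "deploy ground spoiler: failure to operate"
```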
4 Post-hoc Analysis of Accidents
In the previous section, we reviewed the related work in order to identify possible characteristics of hazards for embedded systems. In this section, we complement this theoretical investigation with post-hoc analyses of accident reports from the avionics industry in order to analyze embedded system hazards in the respective chain of events. We specifically chose the avionics domain because accident reports are typically governmentally provided and hence publicly available (a database of accident reports is available at [27]).

Example 1: American Airlines Flight 191. On May 25, 1979, American Airlines flight 191 crashed shortly after take-off in Chicago [28]. During the initial ascent phase, the left engine of a McDonnell Douglas DC-10 broke off due to vibrations, tearing a hole in the wing. This severely damaged the fuel cell in the left wing as well as hydraulic lines, causing kerosene discharge as well as loss of hydraulic pressure in that wing. For takeoff and initial ascent, slats (i.e. aerodynamic surfaces at the leading edge of the wing, allowing for increased lift at a steeper angle of attack) were extended. On the left wing, these slats were inadvertently retracted following the loss of hydraulic pressure, while the right-hand side slats remained in place. Together with the loss of thrust due to the torn-off engine, this caused the left wing to stall. The aircraft subsequently crashed into a nearby settlement, killing two bystanders as well as all 271 souls aboard the DC-10.

Example 2: China Airlines Flight 006. On February 19, 1985, a Boeing 747 spontaneously lost thrust on one of its four engines [29] following an autopilot-induced thrust adjustment in response to unfavorable wind. The rightmost engine was slow to accelerate (as is not uncommon at high altitudes [29]) and lost thrust through a combination of a closed "bleed air" valve and a "fuel lean shift". Since the thrust was no longer equally balanced, the aircraft began rolling over the right wing. The crew tried to restart the engine, however, did not descend to the prescribed altitude for engine restart. In consequence, the crew was unable to restart the engine and did not notice that the aircraft continued to roll until the horizontal angle exceeded 60°. The crew attempted to correct the aircraft's attitude, but lost orientation due to thick cloud cover. Their efforts to correct the horizontal attitude resulted in further upsetting the aircraft until it tipped over the right wing and dived into a steep, uncontrolled descent. The aircraft dropped 30,000 ft before the crew regained control. Although the aircraft landed safely, this incident caused severe structural damage to the aircraft and injured several passengers.

Table 1. Effects and Causal Factors of Instrumental Entities in the Accident Reports
| Ex. | Instrumental Entity | ID: Hazard | Effect | Causal Factor |
|---|---|---|---|---|
| Ex. 1: American Airlines 191 | Left Engine | H1: Engine breaks off | Insufficient thrust; wing damage | Vibrations during take-off and ascent |
| | Left Wing Hydraulics | H2: Loss of hydraulic pressure | Loss of control over flight control surfaces | Torn hydraulic lines |
| | Left Wing Fuel Cell | H3: Loss of tank containment | Kerosene leak | Hole in wing fuselage |
| | Left Wing Slat | H4: Inadvertent retraction | Unilateral slat retraction | Insufficient hydraulic pressure |
| | Entire Aircraft | H5: Insufficient lift, causing stall | Uncontrolled descent and subsequent crash, killing passengers and bystanders | Unilaterally retracted slats; steep angle of attack during ascent; insufficient engine thrust |
| Ex. 2: China Airlines 006 | Rightmost Engine | H1: Sudden loss of thrust | Insufficient thrust; unbalanced thrust | Incorrect "bleed air" valve opening; fuel lean shift; wind turbulence |
| | Autopilot | H2: Overcompensation of thrust | Unbalanced thrust, causing rolling | Insufficient thrust due to rightmost engine failure |
| | Crew | H3: Failure to descend to prescribed engine restart altitude | Failure to restart engine | Inattentive crew |
| | Crew | H4: Failure to correct horizontal attitude | Uncontrolled roll of aircraft | Disorientation; preoccupation with engine restart |
| | Entire Aircraft | H5: Uncontrolled roll, causing upset aircraft | Roll into steep dive, causing injuries and aircraft damage | Unbalanced thrust; inattentive and disoriented crew |
| Ex. 3: Lufthansa 2904 | Landing Gear Wheels | H1: Unilateral compression | Insufficient compression for braking systems | Uncontrolled uplift due to windshear |
| | Landing Gear Wheels | H2: Aquaplaning, causing loss of traction | Insufficient wheel speed | Water on runway |
| | Wheel Brakes | H3: Failure to apply | Insufficient deceleration and subsequent crash, causing injuries and deaths | Insufficient landing gear compression; insufficient wheel speed |
| | Ground Spoiler | H4: Failure to deploy | Insufficient deceleration and subsequent crash, causing injuries and deaths | Insufficient landing gear compression; insufficient wheel speed |
| | Thrust Reverser | H5: Failure to deploy | Insufficient deceleration and subsequent crash, causing injuries and deaths | Insufficient compression on landing gear |
Example 3: Lufthansa Flight 2904. On September 14, 1993, Lufthansa flight 2904 overshot the runway and crashed into an earth wall following a landing in Warsaw under wind shear conditions [17]. The aircraft was unable to decelerate in time because all three braking systems, i.e. thrust reversers, wing spoilers, and wheel brakes, failed to engage. In order for all three systems to activate, a combination of conditions must hold: The spoilers deploy (a) if the left and the right shock absorbers of the main landing gear are compressed by at least 6,300 kg each or (b) if the wheel speed exceeds 72 knots for at least four seconds at both main landing gears. The wheel brakes activate when both conditions (a and b) are true. In order to engage the thrust reversers, only condition (a) must hold, i.e. sufficient landing gear compression. However, due to wind shear at the time of landing, the aircraft did not touch down evenly, causing the left landing gear to touch down nine seconds after the right gear was already on the ground. In addition, due to heavy rain, the runway was covered with water, causing the aircraft to aquaplane and hence causing the wheel speed not to exceed 72 knots. As a result, the deployment of spoilers, thrust reversers, and wheel brakes was prevented. For more than half the length of the runway, the aircraft remained at an approach velocity of over 150 knots and was unable to slow down. When the braking systems finally became effective, the remaining runway length was too short to bring the aircraft to a complete stop before impacting the boundary wall. The accident caused two fatalities and 53 injuries aboard the aircraft.

Table 1 above summarizes the incidents by illustrating the instrumental entities (i.e. embedded system, mechanical system, or human), the hazards they posed during operation, the causal factors for the hazards, and the hazardous effects.
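The interlock logic described above is compact enough to state directly. The following sketch follows the conditions as paraphrased from the accident report [17]; function and parameter names are our own. It shows why the simultaneous violation of conditions (a) and (b) locked out all three deceleration systems at once.

```python
# Condition (a): both main gear shock absorbers compressed by >= 6,300 kg each.
# Condition (b): wheel speed above 72 knots for >= 4 seconds on both main gears.

def ground_spoilers_deploy(a: bool, b: bool) -> bool:
    return a or b   # spoilers: either condition suffices

def wheel_brakes_engage(a: bool, b: bool) -> bool:
    return a and b  # wheel brakes: both conditions required

def thrust_reversers_deploy(a: bool, b: bool) -> bool:
    return a        # thrust reversers: landing gear compression alone

# Lufthansa 2904: wind shear delayed the left gear's touchdown (a = False) and
# aquaplaning kept the wheel speed below 72 knots (b = False), so every
# deceleration system was locked out simultaneously.
a, b = False, False
print(ground_spoilers_deploy(a, b), wheel_brakes_engage(a, b), thrust_reversers_deploy(a, b))
# -> False False False
```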
5 Hazard Taxonomy for Embedded Systems
In Section 3, we have seen that the related work suggests a number of characteristics by which hazards can be differentiated. Furthermore, as Section 4 has shown, there is a multitude of different causal factors for hazardous conditions that are instrumental in the causal chains leading up to accidents. In this section, we present a hazard taxonomy for embedded systems based on these findings. The taxonomy is shown in Table 2. Each characteristic therein is motivated by the related work from Section 3 and consists of several disjoint attributes which are based on the hazards in Table 1. In the following subsections, we briefly explain each characteristic as well as the different attributes and give examples from Section 4, where appropriate.

Table 2. Hazard Taxonomy for Embedded Systems

| Characteristic | Attributes | Reference |
|---|---|---|
| Trigger | Context Subject; Operational Context | [18, 19], [23] |
| Trigger Type | Human User; Software; Hardware; Context Properties | [1], [6], [23] |
| Affected Entity Type | Human; Software; Hardware | [1], [6], [23] |
| Accident Avoidability | Unavoidable; Avoidable | [19], [22] |
| Causal Chain | Antecedent; Precedent | [19], [22] |
| Occurrence | Permanent; Transient | [24], [26] |
| Functional Degradation | Omission; Erroneous Value; Incorrect Time; Commission | [3], [11], [15] |

5.1 Trigger
Hazards emanate from a system given a set of unfavorable conditions. As we have seen in Sections 2 and 3, hazards are triggered either by the system itself or by some entity within its context. Hence, the following attributes can be selected:

Context Subject: The trigger of the hazard is a state that the context subject assumes. This could be, for example, a failure of some component of the context subject, as in hazard H1 of Example 1. In this example, the hazard is triggered because a part (i.e. the engine) of the context subject (in this case: the entire aircraft) broke off of its nacelle.

Operational Context: The hazard is triggered by specific circumstances in the operational context which impact the system. This could be unfavorable weather conditions (e.g., hazard H2 of Example 3), failing systems in the operational context (e.g., hazard H2 of Example 2, when selecting the autopilot as the context subject), or incorrectly behaving humans (e.g., hazards H3 and H4 of Example 2).

5.2 Trigger Type
The characteristic trigger type specifies the type of entity that triggered the hazard. While the characteristic trigger (see Section 5.1) can be used to determine whether the triggering entity is (from the perspective of the system) the system itself or in its context, trigger type documents aspects of the triggering entity itself. Specifically, a hazard can be caused by entities belonging to one of the following types:

Human User: The hazard can be triggered by human users, e.g., due to inattention, incorrect behavior, or possibly due to malicious intent. For instance, hazards H3 and H4 of Example 2 were caused by humans. It is to note that while it is not common to consider humans part of the system (cf. [8]), depending on the specific development project, it may be sensible to do so (e.g. in the health care domain).

Software: Hazards can be caused by software components of the context subject or of some context entity. This is, for example, the case in hazard H2 of Example 2: The autopilot depends to a large extent on software that observes the aircraft's reaction and controls its flight. In contrast to hardware, software is not subject to mechanical deterioration and hence does not exhibit spontaneous failures, but may include covert defects that only become apparent during runtime when unfavorable conditions arise [18].

Hardware: Spontaneous hardware defects, such as spontaneous failures or physical deterioration, may be the trigger for a hazard. For example, in hazard H1 of Examples 1 and 2, the fact that the engine physically separated from the aircraft (Example 1) or suddenly lost thrust at altitude (Example 2) are both instances in which a hardware component did not function as required, causing a hazard.

Context Properties: In some cases, as in hazards H1 and H2 of Example 3, hazards may be caused by specific circumstances in the context of the system that cannot directly be attributed to context entities. In the example mentioned above, these properties are weather conditions that negatively impact the system.

5.3 Affected Entity Type
Similar to the type of the entity that triggers the hazard (see Section 5.2), the type of entities that are affected by the hazard can also be characterized. This is useful to ascertain the severity of a hazard [25] and hence prioritize mitigation [22] or select appropriate mitigation evidence [34]. Specifically, the following attributes can be selected:

Human: The hazard causes injury or death to humans, regardless of whether the human is a user of the system or not. For example, hazards H3 through H5 of Example 3 resulted in the death and injury of crew and passengers.

Software: The hazard prevents the correct operation of some software component. This is, for example, the case when a software component receives wrong input (i.e. input that was not expected during development, as in the case of hazard H2 of Example 2) or when the system does not provide the proper, valid reaction based on some input stimulus, as, according to [17], is the case with hazards H3 through H5 of Example 3.

Hardware: Hardware can be immediately impacted by some hazard, typically due to mechanical damage, as in hazard H2 of Example 1. It is to note that from the perspective of some system, the context entities are typically considered black boxes, so it is impossible to decide whether the hardware or software part of the context entity is impacted.

5.4 Accident Avoidability
Depending on the specific effect of a hazard, accidents may still be avoided by taking proper preemptive measures. Hence, the characteristic accident avoidability may aid developers in conceiving proper mitigation strategies. According to [19] and [22], accidents may either be:

Unavoidable: The hazard will inevitably result in an accident. This is the case with hazard H2 of Example 1: Due to the fact that the hydraulic lines were destroyed, the slats on the left wing inevitably retracted, which at low altitude and a steep ascent angle will inevitably cause the aircraft to dip over the left wing and crash. If an accident is unavoidable, mitigation strategies that reduce the accident likelihood or control the system behavior may not be sufficient. Instead, damage reduction measures ought to be taken (see [19]).

Avoidable: If an accident is not the necessary result of a hazard during operation, the accident may be avoidable. For instance, the uncontrolled steep dive of the aircraft resulting from hazard H5 of Example 2 is a highly hazardous situation, but occurred at a sufficiently high altitude, allowing the crew to regain control.

5.5 Causal Chain
As can be seen from the analysis of the accident reports summarized in Section 4, some hazards set in motion a chain of events that causes further hazards and may (or may not, see Section 5.4) result in an accident. Mitigation strategies must take into account other events in the causal chain that contribute to a potential accident [22]:

Antecedent: The hazard is the first in a chain of events that may or may not result in an accident. This is the case with hazard H1 of Example 1, hazards H1 and H4 of Example 2, and hazards H1 and H2 of Example 3. For such hazards, mitigation strategies are needed to reduce the likelihood that the hazard occurs (see [19]).

Precedent: The hazard is the effect of at least one other hazard that was triggered before. This may be the case (1) if some previous hazard immediately results in some following hazard (e.g. hazard H3 of Example 1) or (2) if at least two previous hazards must have occurred in order for some additional hazard to occur (e.g. hazard H5 of Example 3). It must be noted that, much like the selection of the context subject, the selection of proper events is inherently subjective and may yield different results depending on what events are considered (see [30]).

5.6 Occurrence
As suggested by [26], a hazard may not necessarily be permanent, but may occur sporadically. For example, the circumstances in the operational context can change such that some hazard that was initially triggered is subsequently averted. Alternatively, the effect of a hazard may not necessarily cause harm to humans or other systems. In other words, a hazard can be:

Permanent: A hazardous condition will, once triggered, exist indefinitely until an accident occurs (and hence, harm was done) or repairs have taken place (and harm was avoided). For example, mitigating hazards H1 through H4 in Example 1 requires physically fixing the damaged aircraft; these hazards are hence permanent.

Transient: A hazard can be mitigated at runtime by initiating appropriate countermeasures to prevent harm (e.g. by restoring safe functionality). For instance, this is the case in hazard H1 of Example 2, where the hazard would have been mitigated during operation if the crew had been able to restart the engine. Alternatively, it may be the case that circumstances change such that a hazard is no longer triggered. For example, in hazard H2 of Example 2, the autopilot can resume normal functionality once the engine is restarted.

5.7 Functional Degradation
As outlined in Section 3, a hazard can be characterized by the type of guide phrase that leads to its identification during hazard analyses. These guide phrases typically address the way in which the functionality of the system will be compromised. Types of functional degradation are:

Omission: A hazard results in the system being unable to maintain functionality. This is the case in hazard H1 of Example 1: When the engine physically separates from the aircraft, it can no longer provide thrust.

Erroneous Value: A hazard results in the system providing false data to another system. An example is given by hazard H2 of Example 3, where the wheel speed no longer corresponds to the aircraft's speed over ground, hence rendering both wheel brakes and spoilers inoperable.

Incorrect Time: A hazard results in some functionality of the system being provided outside of the proper time period, i.e. either too late or too early. This includes impaired real-time behavior. For example, in hazards H3 through H5 of Example 3, it could be argued that the three braking systems became effective too late to allow the aircraft to come to a complete stop.

Commission: A hazard results in some functionality being provided inadvertently, that is, without the functionality being requested by some other system. This is the case in hazard H4 of Example 1, where the slats were inadvertently retracted.
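To make the taxonomy operational, one might encode it as a simple record type. The following Python sketch is our own illustration; the class and field names are not part of the paper's contribution, and the classification of H1 below is our interpretation of Sections 5.1 through 5.7.

```python
from dataclasses import dataclass
from enum import Enum

class Trigger(Enum):
    CONTEXT_SUBJECT = "context subject"
    OPERATIONAL_CONTEXT = "operational context"

class TriggerType(Enum):
    HUMAN_USER = "human user"
    SOFTWARE = "software"
    HARDWARE = "hardware"
    CONTEXT_PROPERTIES = "context properties"

class AffectedEntityType(Enum):
    HUMAN = "human"
    SOFTWARE = "software"
    HARDWARE = "hardware"

class Avoidability(Enum):
    UNAVOIDABLE = "unavoidable"
    AVOIDABLE = "avoidable"

class CausalChain(Enum):
    ANTECEDENT = "antecedent"
    PRECEDENT = "precedent"

class Occurrence(Enum):
    PERMANENT = "permanent"
    TRANSIENT = "transient"

class FunctionalDegradation(Enum):
    OMISSION = "omission"
    ERRONEOUS_VALUE = "erroneous value"
    INCORRECT_TIME = "incorrect time"
    COMMISSION = "commission"

@dataclass
class HazardRecord:
    description: str
    trigger: Trigger
    trigger_type: TriggerType
    affected: AffectedEntityType
    avoidability: Avoidability
    causal_chain: CausalChain
    occurrence: Occurrence
    degradation: FunctionalDegradation

# Hazard H1 of Example 1 (AA 191), classified along all seven characteristics:
h1 = HazardRecord(
    description="Engine breaks off",
    trigger=Trigger.CONTEXT_SUBJECT,            # a part of the aircraft itself fails (Section 5.1)
    trigger_type=TriggerType.HARDWARE,          # physical separation of the engine (Section 5.2)
    affected=AffectedEntityType.HARDWARE,       # wing, hydraulics, and fuel cell damaged (Section 5.3)
    avoidability=Avoidability.UNAVOIDABLE,      # arguably, given the ensuing chain (cf. Section 5.4)
    causal_chain=CausalChain.ANTECEDENT,        # first event in the chain (Section 5.5)
    occurrence=Occurrence.PERMANENT,            # repair required to remove the hazard (Section 5.6)
    degradation=FunctionalDegradation.OMISSION, # thrust is no longer provided (Section 5.7)
)
```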
6 Hazard Taxonomy for Cyber-Physical Systems
We revise our taxonomy of embedded system hazards from Section 5 to reflect the differences between CPS and embedded systems. Table 3 shows the complete revised taxonomy of CPS hazards.

Table 3. Hazard Taxonomy for Cyber-Physical Systems

| Characteristic | Attributes | Reference |
|---|---|---|
| Trigger | Context Subject; Operational Context; Interaction with Context | [1], [23] |
| Trigger Type | Human User; Software; Hardware; Context Properties | [1], [6], [23] |
| Range | Context Subject; Operational Context; Miscellaneous Environment | [2], [4, 5], [20, 21], [35] |
| Affected Entity Type | Human; Software; Hardware; Context Properties | [1, 2], [5, 6], [20, 21], [23], [35] |
| Accident Avoidability | Unavoidable; Avoidable | [19], [22] |
| Causal Chain | Antecedent; Precedent | [19], [22] |
| Occurrence | Permanent; Transient | [24], [26] |
| Functional Degradation | Omission; Erroneous Value; Incorrect Time; Commission | [3], [11], [15] |
The taxonomy has been revised as follows: The characteristics trigger type (Section 5.2), accident avoidability (Section 5.4), causal chain (Section 5.5), occurrence (Section 5.6), and functional degradation (Section 5.7) hold for embedded systems as well as for CPS and have therefore been adopted from the taxonomy in Table 2. When conducting hazard analyses during safety assessment, however, particular attention must be paid to the special properties of CPS when selecting attributes. For example, the attribute context properties of the characteristic trigger type (Section 5.2) may be particularly relevant for CPS, since hazards may be caused by inoperable, congested, or otherwise unavailable network infrastructure, thereby impairing communication between CPS and hence compromising the achievement of their common functionality. Similarly, since one core aspect of CPS is that they accomplish tasks through close functional collaboration, functional degradation (Section 5.7) of one safety-critical CPS may impact the safe behavior of other CPS.
As can be seen from Table 3, the characteristic trigger (Section 5.1) has been extended by one attribute. Furthermore, one characteristic has been added (range), and the characteristic affected entity type has been revised. In the following sections, these changes are explained.

6.1 Revised Characteristic: Trigger
In Sections 3, 4, and 5.1, we have seen that hazards are triggered either by the system itself or by some entity within its context. However, due to the fact that CPS closely collaborate with one another and observe and act upon shared context resources, the interaction of a CPS with its context may also trigger specific hazards. Therefore, we have added the following attribute to the characteristic trigger:

Interaction with Context: The hazard is triggered by the specific interaction between the CPS and an entity within its context, i.e. the CPS together with some context entity induces a hazard. An example is given by hazard H5 of Example 2, where the overcorrection of the autopilot together with the disoriented and inattentive crew upset the aircraft.

6.2 Additional Characteristic: Range
The underlying premise of safety assessment is to ascertain potential unsafe effects of the embedded system on its context during operation [22]. Hence, safety assessment typically does not take into account the safety of the system itself [34], nor is there reason to consider the effects of systems that are not in interaction with the system (cf. [8]). However, when several CPS work together on a common safety-critical task, it is necessary to assume that each CPS is vital for the task to be carried out safely. Therefore, each individual CPS must be developed to preserve its own functionality under internal and external influences (a concept commonly referred to as survivability, see [5], [14]). Furthermore, since CPS share context resources with one another, it may be the case that these context resources are also accessed by other systems that typically would not be in direct interaction with the CPS and therefore not part of its operational context. It may hence be reasonable to consider the range of the hazardous effects of the CPS' behavior:

Context Subject: The hazard has an effect mainly on the CPS itself, e.g. by rendering it unable to continue normal operation. For example, hazards H3 through H5 of Example 3 render the entire aircraft unable to decelerate.

Operational Context: The hazard may impact entities of the operational context. For example, hazard H5 of Example 2 resulted in passenger injuries as well as aircraft damage. This is the typical case that safety assessment of embedded systems is concerned with.

Miscellaneous Environment: The hazard impacts entities in the environment of the CPS that are not part of the operational context under normal operation of the CPS². For example, this is the case in hazard H5 of Example 1, where the crash of the aircraft results in the death of bystanders who, had the aircraft continued to fly safely, would not have been affected by the aircraft's operation and hence would not have been part of its operational context.

² Please note the distinction between the relevant context and the irrelevant environment for system development, as explained in Section 2.

6.3 Revised Characteristic: Affected Entity Type
In Sections 6.1 and 6.2, we have argued that hazards for CPS differ from hazards of embedded systems: due to the highly interactive, collaborative, and generally context-dependent nature of CPS, the triggers of hazards as well as the hazard range must be more finely resolved than is generally done for embedded systems. For example, since the range of a CPS hazard (Section 6.2) may include entities that are part of the environment but would under normal operation not be considered part of the operational context, it must be noted that this may also include innocent bystanders. Furthermore, a CPS may cause context conditions to change for other CPS, regardless of whether these other CPS are within the CPS' operational context or not. We therefore alter the definition of the attribute human as follows and add a further attribute, context properties, to the characteristic affected entity type:

Human: The hazard causes injury or death to humans, regardless of whether the human is a user of the system (see Section 5.3) or an otherwise uninvolved bystander. For example, hazard H5 of Example 1 resulted in the death of crew, passengers, and bystanders, who would not have been affected had the aircraft not crashed.

Context Property: In some cases, a hazard can also have an effect solely on the context, i.e. when the hazard compromises the miscellaneous environment. For example, in hazard H3 of Example 1, leaking kerosene can be thought of as a hazardous impact on the environment itself (e.g. through poisoning of the water).
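The CPS-specific revisions can be layered onto the same encoding as before. The sketch below, again with our own illustrative naming, extends the previous listing with the new trigger attribute (Section 6.1), the new range characteristic (Section 6.2), and the revised affected entity type (Section 6.3), and classifies hazard H5 of Example 1 accordingly.

```python
from enum import Enum

class CpsTrigger(Enum):
    CONTEXT_SUBJECT = "context subject"
    OPERATIONAL_CONTEXT = "operational context"
    INTERACTION_WITH_CONTEXT = "interaction with context"  # added (Section 6.1)

class Range(Enum):  # added characteristic (Section 6.2)
    CONTEXT_SUBJECT = "context subject"
    OPERATIONAL_CONTEXT = "operational context"
    MISCELLANEOUS_ENVIRONMENT = "miscellaneous environment"

class CpsAffectedEntityType(Enum):  # revised (Section 6.3)
    HUMAN = "human"                 # now explicitly includes uninvolved bystanders
    SOFTWARE = "software"
    HARDWARE = "hardware"
    CONTEXT_PROPERTY = "context property"  # e.g. kerosene poisoning the water

# Hazard H5 of Example 1 under the revised taxonomy: the crash kills
# bystanders who were never part of the aircraft's operational context.
h5_classification = {
    "trigger": CpsTrigger.CONTEXT_SUBJECT,
    "range": Range.MISCELLANEOUS_ENVIRONMENT,
    "affected": CpsAffectedEntityType.HUMAN,
}
```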
7 Discussion and Conclusion
In this paper, we have argued that because safety-critical Cyber-Physical Systems operate in highly diverse contexts that cannot be completely scoped during the development of a new CPS, new and unforeseen challenges arise for their safety assessment. To support the engineering of safe Embedded and Cyber-Physical Systems, we first developed a taxonomy of embedded system hazard characteristics. We based our taxonomy on related work as well as on post-hoc analyses of well-known accidents from the avionics domain. We furthermore analyzed the differences between embedded systems and CPS and extended the hazard taxonomy with regard to specific CPS properties. The taxonomy hence allows characterizing the precise impact of a hazard during the engineering of Embedded and Cyber-Physical Systems, which may allow developers to find suitable mitigation strategies. Furthermore, the taxonomy can be used to validate hazard mitigation by checking, e.g., whether certain trigger conditions are
avoided (i.e. the hazard was prevented according to [19]), whether the causal chain of events is interrupted so that the hazard can no longer cause an accident (i.e. the hazard is controlled according to [19]), or whether assumptions about the properties of the operational context hold. It is to note that this taxonomy is not an objective assessment tool: How a hazard is characterized depends largely on the scope of the system under development as well as on the subjective knowledge of the developers (see [13], [30]). Therefore, when developing a system for operation in some assumed context, the characterization of its hazards may be different if assumptions about the context do not hold, change, or if the entire context changes. As a result, this taxonomy may not be complete, and depending on the development project, some characteristics may be irrelevant, while other characteristics (that are not part of the taxonomy) become relevant. Future work will focus on applying the taxonomy to other accident reports. Furthermore, in order to foster applicability, an existing hazard analysis technique could be extended to incorporate the systematic identification, documentation, and structured validation of hazard characteristics.
References

1. Avizienis, A.; Laprie, J.; Randell, B.; Landwehr, C.: Basic concepts and taxonomy of dependable and secure computing. In: IEEE Trans. Depend. Sec. Comp., 2004; pp. 11-33.
2. Broy, M.; Cengarle, M.; Geisberger, E.: Cyber-Physical Systems: Imminent Challenges. In: Monterey Workshop, 2012; pp. 1-28.
3. Bondavalli, A.; Simoncini, L.: Failure classification with respect to detection. In: 2nd IEEE Works. Future Trends Distrib. Comp. Sys., 1990; pp. 47-53.
4. Burton, I.; Kates, R.; White, G.: The Environment as Hazard, 2nd ed. The Guilford Press, New York/London, 1993.
5. Cárdenas, A.; Amin, S.; Sastry, S.: Secure Control: Towards Survivable Cyber-Physical Systems. In: Proc. Works. 29th Intl. Conf. Distr. Comp. Sys., 2008; pp. 495-500.
6. Chambers, L.: A hazard analysis of human factors in safety-critical systems engineering. In: Proc. 10th Australian Works. Safety Critical Sys. and Softw., 2005; pp. 27-41.
7. Daun, M.; Brings, J.; Tenbergen, B.; Weyer, T.: On the Model-based Documentation of Knowledge Sources in the Engineering of Embedded Systems. Submitted to: 4th Works. Future of Softw.-Intensive Embedded Sys., 2014.
8. Daun, M.; Tenbergen, B.; Weyer, T.: Requirements Viewpoint. In: Model-Based Engineering of Embedded Systems. Springer, Heidelberg, 2012; pp. 51-68.
9. Doherty, P.; Lukaszewicz, W.; Szalas, A.: Efficient Reasoning Using the Local Closed-World Assumption. In: Proc. 9th Intl. Conf. Artificial Intelligence: Methodology, Systems, and Applications, 2000; pp. 49-58.
10. Eiter, T.; Gottlob, G.: Propositional circumscription and extended closed-world reasoning are Π₂ᵖ-complete. In: Theoretical Comp. Sci. 114(2), 1993; pp. 231-245.
11. Ericson, C.: Hazard Analysis Techniques for System Safety. Wiley, New Jersey, 2005.
12. Firesmith, D.: Engineering safety requirements, safety constraints, and safety-critical requirements. In: J. Obj. Tech. 3(3), 2004.
13. Gong, L.: Contextual Modeling and Applications. In: Proc. IEEE Intl. Conf. Sys., Man and Cybernetics, 2005; pp. 381-386.
14. International Organization for Standardization: ISO 25010: Systems and Software Engineering - Systems and Software Quality Requirements and Evaluation, ISO/IEC Std. 25010-2011.
15. Kletz, T.: Hazop - past and future. In: Rel. Eng. & Sys. Safety 55(3), 1997; pp. 263-266.
16. Knight, J.: Safety Critical Systems: Challenges and Directions. In: Proc. 24th Intl. Conf. SE, 2002; pp. 547-550.
17. Ladkin, P.: Report on the Accident to Airbus A320-211 Aircraft in Warsaw, 1994. Available at http://goo.gl/5SgkMK, accessed on January 16, 2014.
18. Leveson, N.: Software Safety: Why, What, and How. In: ACM Comp. Surveys 18(2), 1986; pp. 125-163.
19. Leveson, N.: Safeware. Addison-Wesley, New York, 1995.
20. Lee, E.: Computing Foundations and Practice for Cyber-Physical Systems: A Preliminary Report. Tech. Report No. UCB/EECS-2007-72, Univ. of California, Berkeley, 2007.
21. Lee, E.: Cyber Physical Systems: Design Challenges. In: 11th IEEE Symp. Obj. Oriented Real-Time Distr. Comp., 2008; pp. 363-369.
22. Leveson, N.: Engineering a Safer World: Systems Thinking Applied to Safety. The MIT Press, Boston, 2011.
23. Liu, C.; Yue, W.; Zhi, J.: Elicit the Requirements on Software Dependability. In: 16th Asia-Pacific SE Conf., 2009; pp. 233-240.
24. MacCollum, D.: Construction Safety Engineering Principles: Designing and Managing Safer Job Sites. McGraw-Hill, New York, 2007.
25. McDermid, J.; Pumfrey, D.: A development of hazard analysis to aid software design. In: Proc. 9th Ann. Conf. Comp. Assurance, 1994; pp. 17-25.
26. Mukherjee, S.: Architecture Design for Soft Errors. Elsevier, Amsterdam, 2008.
27. National Transportation Safety Board: Aviation Accident Database & Synopses. Available at http://www.ntsb.gov/aviationquery/, accessed on January 16, 2014.
28. National Transportation Safety Board: Aircraft Accident Report: American Airlines, Inc. DC-10-10, N110AA, Chicago-O'Hare Intl. Airport, Chicago, Illinois. Report No. NTSB-AAR-79-17, 1979.
29. National Transportation Safety Board: Aircraft Accident Report: China Airlines Boeing 747-SP, N4522V, 300nm NW of San Francisco, California. Report No. NTSB-AAR-86-03, 1986.
30. Papadopoulos, Y.; McDermid, J.; Sasse, R.; Heiner, G.: Analysis and Synthesis of the Behavior of Complex Programmable Electronic Systems in Conditions of Failure. In: Rel. Eng. & Sys. Safety 71(3), 2001; pp. 229-247.
31. Reiter, R.: On Closed World Data Bases. In: Logic and Data Bases, Plenum Press, 1978; pp. 119-140.
32. Stoneburner, G.: Toward a Unified Security-Safety Model. In: IEEE Comp. 39(2), 2006; pp. 96-97.
33. Weyer, T.: Coherence Check of Behavioral Specifications Against Specific Properties of the Operational Context. Dissertation (in German), Univ. of Duisburg-Essen, 2010.
34. Wu, W.; Kelly, T.: Towards Evidence-Based Architectural Design for Safety-Critical Software Applications. In: Architecting Depend. Sys. IV. Springer, Berlin, 2007; pp. 383-408.
35. Wolf, W.: Cyber-Physical Systems. In: IEEE Comp. 42(3), 2009; pp. 88-89.