
Simulated Fault Injection for the Validation of Fault Tolerance Mechanisms in Dependable Time-Triggered Systems

Iban Ayestaran1, Irune Agirre1, Carlos F. Nicolas1, Jon Perez1

Abstract: The validation of fault-tolerance mechanisms in time-triggered dependable systems is usually carried out in the late stages of the development process. As a consequence, fixing the design faults found at that point is very costly. Simulated Fault Injection (SFI) enables exercising the intended fault tolerance mechanisms by injecting faults in a simulated model of a system, which is a major benefit for designers since it reduces the risk of a late discovery of design flaws. This paper presents an integral modeling and simulation environment for dependable Time-Triggered HW/SW systems. Our approach facilitates the validation of fault-tolerance mechanisms by performing non-intrusive simulated fault injection on models of the system at different levels of abstraction, from the Platform Independent Model (PIM) to the Platform Specific Model (PSM). We exemplify the feasibility of the proposed approach in a case study, where SFI is used to support the Failure Mode and Effect Analysis (FMEA) of an ETCS railway system based on the European Vital Computer (EVC).

Keywords: Simulated Fault Injection, Fault Tolerance, Time-Triggered Systems, FMEA

1 Embedded Systems Group, IK4-Ikerlan Research Center, e-mail: {iayestaran, iagirre, cfnicolas, jperez}@ikerlan.es.

I. Introduction

Safety-critical embedded systems are dependable systems that could cause loss of life, significant property damage or damage to the environment in case of failure. Therefore, the safety function shall be fault-tolerant. The selection of appropriate Fault Tolerance Mechanisms (FTMs) requires a careful analysis of the system through all the design refinement phases using techniques such as Failure Mode and Effect Analysis (FMEA), recommended by the IEC-61508 safety standard. The FMEA is a systematic technique for identifying potential faults and analyzing their effects within a system in order to detect weaknesses in the design and prevent failures. The FMEA is typically carried out by a group of safety engineers who evaluate the system and produce a subjective verdict about the hazards, failures and failure effects of the system, based on their previous knowledge and experience.

Nowadays, most FMEAs do not specify the causality chain in detail. This complicates both the reliability assessment along the system design process and the transition from FMEA to design and implementation. Moreover, the safety validation typically involves fault injection experiments at the final stage of the system development. However, the late fixing of the

detected design flaws might require a complete and expensive redesign of the system. Thus, the validation of fault-tolerance mechanisms in the early steps of the development could bring major benefits to designers. In fact, the IEC-61508 safety standard strongly recommends fault injection techniques in all steps of the development process of safety-critical systems.

Simulated Fault Injection (SFI) makes it possible to simulate the effects of faults in a simulated system. This technique allows developers to observe the behavior of the system in the presence of faults, enabling the early verification of fault-tolerance mechanisms before a system prototype is assembled. The analysis of the results obtained in SFI campaigns can be used to observe the propagation of errors in a system. Such an analysis enables the creation of so-called Swiss cheese models [1], [2]. The Swiss cheese model presents the system as a set of layers that represent the defenses of the system. In an ideal world, every defensive layer would be intact. In reality, however, the layers have many holes, which represent latent conditions for failures. A failure occurs when the holes in all the layers line up. The Swiss cheese model is widely used by accident investigation communities in order to discover weaknesses in product developments.

In order to cope with the increasing complexity of embedded systems and to analyze the diverse failures a system might suffer from, those systems are nowadays usually developed following Model Driven Development (MDD) approaches such as the Model Driven Architecture (MDA) [3]. According to the MDA, a purely functional model of the system, called Platform Independent Model (PIM), is created first. This first separation of the functionality from the HW components saves design time and cost during the development process, and eases the early verification of the functionality of the system.
Once the assessment of the PIM is finished, the model is refined into the so-called Platform Specific Model (PSM) by adding information about the target platform of the system and deploying the functional components of the PIM into platform components. This enables the introduction of HW-specific faults into the model. The Unified Modeling Language (UML) [4] and the Systems Modeling Language (SysML) [5] are compliant with the MDA. UML is an international standard that defines a modeling language for software systems design. However, UML does not support the design of HW components, so it is not well suited for the modeling of HW/SW systems. SysML was created to fill this gap. SysML is the adaptation of UML for the development of systems including combinations of HW, SW, data, people, facilities and objects. However, since it is a general purpose language, it is not restricted to a specific Model of Computation (MoC), which hinders the generation of executable models from SysML designs.

Languages like SystemC try to overcome the limitations of the languages described previously. SystemC [6] is a high-level HW/SW co-design language based on C++ that enables modeling and simulating systems at different levels of abstraction. Therefore, SystemC can be used to design, simulate and verify platform specific models of dependable embedded systems. Nowadays SystemC has become the de-facto standard in HW/SW system development.

The Executable Time-Triggered Model (E-TTM) [7] is a SystemC based extension for the modeling and simulation of real-time embedded systems based on the Time Triggered Architecture (TTA) [8]. The E-TTM meta-model is implemented as a C++ library that extends SystemC with the time-triggered MoC. The E-TTM relies on the concept of sparse time, where the continuous real-time is partitioned into a sequence of alternating intervals of activity of duration π and silence of duration ∆. However, the E-TTM is focused on the functional aspects of the systems, so it does not provide any mechanism to simulate HW components.

In this context, the aim of this research is to provide a complete modeling and simulation environment for dependable Time-Triggered HW/SW systems based on SystemC, which enables reproducible non-intrusive SFI at different abstraction levels for the assessment of fault-tolerance mechanisms.
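The sparse-time concept mentioned above can be illustrated with a minimal Python sketch. It is not part of the E-TTM library: the function name `sparse_interval` and the convention that time starts at 0 with an activity interval are assumptions made only for this illustration.

```python
def sparse_interval(t, pi, delta):
    """Classify the time instant t on a sparse time base.

    Assumption (illustration only): time starts at 0 with an activity
    interval of duration pi, followed by a silence interval of duration
    delta, and this pattern repeats with period pi + delta.
    Returns (index, kind), where kind is "activity" or "silence".
    """
    if t < 0 or pi <= 0 or delta <= 0:
        raise ValueError("t must be >= 0 and pi, delta must be > 0")
    period = pi + delta
    index = int(t // period)          # which activity/silence pair t falls in
    offset = t - index * period       # position of t inside that pair
    kind = "activity" if offset < pi else "silence"
    return index, kind
```

With `pi = 2` and `delta = 3`, for instance, the instants 0 and 5 fall in activity intervals, while 2 and 9 fall in silence intervals.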
The paper is structured as follows: Section II briefly describes the previous related work on fault injection techniques and approaches. Section III introduces our novel framework for the modeling and simulation of dependable time-triggered embedded systems, and Section IV presents a test and simulated fault injection framework for the verification of such systems. Section V describes the European Train Control System (ETCS), which is used as a case study in Section VI to evaluate our approach. Section VII briefly presents the results obtained in the case study and finally Section VIII describes the main conclusions and future work.

II. Related Work

Avizienis et al. [9] define a failure as an event that occurs when the delivered service deviates from correct service. An error is a deviation from the correct service of an external state of the system, and a fault is the cause of an error. In other words, errors are the manifestations of faults whereas failures are the consequence of errors. Therefore, injecting faults

into systems is a straightforward technique to verify that such faults do not cause failures in the system, i.e., that the system is tolerant to those faults. Thus, fault injection strategies and techniques have been widely analyzed in the past [10], [11] and several tools have been developed, most of them focused on VHDL models [12], [13], [14], [15]. However, as previously stated, SystemC is nowadays the de-facto standard in industrial HW/SW system design and simulation. Therefore, fault injection design and simulation in SystemC models has attracted increasing interest in recent years [16], [17], [18], [19], [20].

Misera et al. [16] adapt fault injection techniques and strategies from VHDL models to SystemC models in order to analyze the limitations and possibilities of the SystemC kernel. They simulate systems including saboteurs and simulator commands, and they extend the logic types of SystemC in order to model the behavior of logic components more realistically. However, since they focus on VHDL based fault models and techniques, they limit their approach to logic-level models. In [17] Shafik et al. propose an alternative technique to the one presented in [16], but the focus is still put on logic-level models.

Bolchini et al. go one step further into multiple abstraction level fault injection in [18]. The paper presents a fault injection environment for the ReSP simulation platform [21]. The approach enables injecting faults by using saboteurs and simulator commands in components defined in the ReSP platform, using a new technique called reflective wrapper. The reflective wrapper technique inserts a Python layer between the SystemC kernel and SystemC IPs, which provides access to any SystemC element. Faults can then be injected into those elements either by parsing an XML file or by directly typing commands in the console.
Injection of faults in this approach is limited to elements defined as public objects, which might require some re-design of the System Under Test (SUT). The approach does not focus on a specific MoC, so simulation is paused and resumed whenever a fault is injected.

In [19] Lu and Radetzki use the Concurrent and Comparative Simulation (CCS) technique to inject faults in SystemC models. The CCS is implemented by extending data types to enable a differential representation of signal/variable values, which are implemented as lists. The first element of the sorted list corresponds to the non-faulty signal/variable value, while all other values in the list correspond to faulty values. When an operation involving a signal/variable with extended data type is made, the operation is executed for all the values in the list. This way faults are propagated all over the system. This approach makes it possible to perform more than one fault injection experiment in each execution. The main limitations of this approach are that the developer must use a specific data type in order to inject faults in variables, and that fault libraries are not defined, so the tester must implement the fault models.
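The list-based idea behind CCS, as summarized above, can be pictured with a small Python sketch. This is not the implementation of [19]: the class name `CcsValue` and its methods are hypothetical, and the sketch stores every value explicitly rather than using any compact differential encoding.

```python
import operator


class CcsValue:
    """Hypothetical CCS-style value (illustration only).

    values[0] is the fault-free value; values[1:] are the values that
    the same signal takes under each of the injected faults.
    """

    def __init__(self, values):
        if not values:
            raise ValueError("at least the fault-free value is required")
        self.values = list(values)

    def _apply(self, other, op):
        # Execute the operation for every entry, so faulty values propagate.
        if isinstance(other, CcsValue):
            if len(other.values) != len(self.values):
                raise ValueError("operands must track the same fault list")
            return CcsValue([op(a, b) for a, b in zip(self.values, other.values)])
        return CcsValue([op(a, other) for a in self.values])

    def __add__(self, other):
        return self._apply(other, operator.add)

    def __mul__(self, other):
        return self._apply(other, operator.mul)

    def deviating_faults(self):
        """Indices (>= 1) of the faults whose value differs from the fault-free one."""
        return [i for i, v in enumerate(self.values) if i > 0 and v != self.values[0]]
```

A single expression such as `CcsValue([10, 0, 12]) * 2 + 1` then evaluates the fault-free run and two faulty runs in one pass.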

Reiter et al. [20] perform error injection in simulated HW models. First, virtual prototypes of the platform components are created using the CHESS modeling language, and then the models are extended in order to inject errors in them. Error models are limited to data corruption, timing corruption, halt and signal loss. The framework does not rely on a concrete model of computation, and the paper does not describe how timing constraints of the SUT are guaranteed.

Regarding faults and their simulation, the Model-Based Generation of Test-Cases for Embedded Systems (MOGENTES) project [22] specifies a number of HW and SW related fault and failure models and taxonomies. In addition, the international ASAM AE HIL standard [23] for hardware-in-the-loop testing defines the standard interface to perform error simulation in hardware-in-the-loop testing.

Against this background, this work focuses on the development of a modeling, simulation and fault injection framework for dependable Time-Triggered HW/SW systems. The modeling framework includes a library of SW and HW component models for the design of the system at different abstraction levels. The simulated fault injection environment includes a time-triggered Automatic Test Executor (ATE) with libraries of fault models for SW and HW components, in order to assess the robustness of the SUT in the presence of faults.

III. Platform Independent and Platform Specific Time-Triggered Models

The model based approach described in this work is based on the MDA. The system development starts by first defining a purely functional Platform Independent Model (PIM), and then sequentially refining it into a Platform Specific Model (PSM). The transformation of a PIM into a PSM involves the deployment of the functional components into HW components. At this stage the designers may examine the emerging behavior of possibly many different platform variants, e.g. with regard to their fault tolerance against potential failure modes identified in the FMEA.
A coherent meta-model is required to ensure the composability of the models. The meta-model provides a modeling framework that guarantees the syntactical correctness of the models. Typically the early validation consists of examinations of the expected system behavior by simulations. Therefore we require the transformation of the design models into executable specifications. To that end, the meta-model must guarantee that the models are unambiguous. The MoC of the meta-model is a central element for this purpose.

A. Platform Independent Time-Triggered Model (PI-TTM)

The semantics of the platform independent meta-model rely on the Logical Execution Time (LET) MoC [24]. LET is a time-triggered MoC that specifies a logical duration for each computational job, regardless of its physical duration. Jobs are atomic and communicate in zero time, while computation takes a predefined logical duration.

Relying on the LET MoC provides several benefits to the designers of dependable systems. First, since jobs are atomic there are no synchronization points during the execution of jobs, and as they are triggered by time, all jobs are executed synchronously. This reduces the number of failures that might occur in the system, because failures due to faulty synchronization points and different orders of execution are avoided. Second, relying on the LET MoC eases the deployment of functional models on time-triggered distributed platforms, which are nowadays widely used in dependable and safety-critical embedded systems. Third, the time-triggered schedule guarantees reproducibility of test cases when the test executor is synchronized in time with the System Under Test (SUT).

Since models must be executable to perform simulation and fault injection, an executable LET engine has been developed, called Platform Independent Time-Triggered Model (PI-TTM) [25]. The PI-TTM engine has been built by providing an extension that imposes LET MoC constraints to the E-TTM in SystemC. Besides enabling design and simulation of LET models in SystemC, this approach eases the transformation of LET models to E-TTM models, and furthermore, it enables co-simulation of LET and E-TTM models.

The platform independent models rely on the architectural hierarchy described in [19], where components can be defined as systems, Distributed Application Subsystems (DASes) and jobs. Systems and DASes are abstract hierarchical components, whereas jobs are atomic components, as the example in Figure 3 shows. Unlike E-TTM, all the components in PI-TTM are time-triggered.

B. Platform Specific Time-Triggered Model (PS-TTM)

Once we build, simulate and verify the PIM, it is further refined to get the PSM. This refinement consists of defining a platform model (PM) and mapping the elements of the PIM to the PM. After assembling the PSM, we simulate it to assess its functionality and reliability. Hence, the PSM must be able to model and simulate HW components. Besides, the platform specific model may require a refinement of its timing properties, due to the increase in the level of detail of the system. Therefore, we developed a new modeling environment for platform specific models, called Platform Specific Time-Triggered Model (PS-TTM) [26], which relies on the E-TTM MoC for the simulation of the PSMs. The PS-TTM has been built as an extension to the E-TTM library. We extended the E-TTM in two ways: first, we refined the HW-related component models to enable PSM design; second, we provided simulated fault injection mechanisms to enable the

reliability assessment. The PS-TTM includes HW components such as clusters, nodes, processors, cores, hypervisors or partitions. Each of these components is defined as a SystemC wrapper. This way, the platform is defined hierarchically, and jobs defined in the PIM are then mapped to cores or partitions, as the example in Figure 6 illustrates.

[Fig. 1: Test and SFI framework. Diagram labels: ATE (TCI, FIU, TPM), SUT, SUT inputs, SUT outputs, FI points, test points.]

TABLE I: Fault library for PIM models

| Data type | Fault Effect | Config. attributes | Description |
|---|---|---|---|
| Boolean | Invert | - | Boolean value is inverted |
| Boolean | Stuck At | stuck value | Signal gets stuck at a given value |
| Boolean | Stuck | - | Signal gets stuck at the actual value |
| Boolean | Stuck If | stuck value, condition | Signal gets stuck if the condition holds |
| Boolean | Open Circuit | - | Wire disconnected, signal takes an arbitrary value (noise) |
| Boolean | Delay | delay | Signal is delayed by an amount of time |
| Integer / Float | Constant | constant value | Signal gets stuck at a given constant value |
| Integer / Float | Amplification | ampl value | Signal is amplified by fixed value |
| Integer / Float | Amplification Range | min amp value, max ampl value | Signal is amplified by a randomly selected value (between min amp value and max ampl value) |
| Integer / Float | Drift | drift value | Signal drifts away from its nominal value at each time step |
| Integer / Float | Offset | offset value | A given fixed offset is added to the signal |
| Integer / Float | Offset Range | min offset value, max offset value | A randomly selected offset value is added to the signal |
| Integer / Float | Stuck | - | Signal gets stuck at the actual value |
| Integer / Float | Random | min value, max value | Signal takes an arbitrary value (between min value and max value) |
| Integer / Float | Delay | delay | Signal is delayed by an amount of time |
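To make some of the fault effects of Table I concrete, the following Python sketch models a few of them as pure functions on a signal value. It is only an illustration under stated assumptions and not the framework's fault library: all function and parameter names are hypothetical, and the drift is assumed to accumulate linearly with the number of elapsed time steps.

```python
import random


def invert(value):
    """Invert: the Boolean value is inverted."""
    return not value


def stuck_at(value, stuck_value):
    """Stuck At: the signal is replaced by a given value."""
    return stuck_value


def offset(value, offset_value):
    """Offset: a fixed offset is added to the signal."""
    return value + offset_value


def amplification(value, ampl_value):
    """Amplification: the signal is multiplied by a fixed factor."""
    return value * ampl_value


def amplification_range(value, min_ampl, max_ampl, rng=random):
    """Amplification Range: the factor is drawn from [min_ampl, max_ampl]."""
    return value * rng.uniform(min_ampl, max_ampl)


def drift(value, drift_value, steps):
    """Drift: assumed here to add drift_value once per elapsed time step."""
    return value + drift_value * steps
```

Passing a seeded `random.Random` instance as `rng` keeps the randomized effects reproducible between runs, in line with the reproducibility goal of the framework.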

TABLE II: Platform specific error library

| Fault Effect | Attrib. | Description |
|---|---|---|
| Corruption | - | The functionality is performed incorrectly. The information provided in the interface is corrupted |
| No execution | - | The functionality is not executed. No information is provided as a result |
| Out of time | Delay | Time bounds of the functionality are not respected. Information is provided later than expected |
| Babbling | Delay | Information in the interface is erroneous both in terms of content and time |

IV. Test and simulated fault injection framework

Both the PI-TTM and the PS-TTM simulation engines enable non-intrusive simulated fault injection in the models in order to perform their reliability assessment. Test-case simulations and fault injection activities are managed by the time-triggered Automatic Test Executor (ATE) included in the simulation engines. The ATE is capable of reading/writing variables and signals of the System Under Test (SUT), and shares the global notion of time with the SUT, in such a way that test experiments are reproducible. As Figure 1 shows, the ATE is composed of the following modules:

• Test Case Interpreter (TCI): a Python interpreter that feeds the inputs of the SUT according to the test case provided in a Python script.
• Fault Injection Unit (FIU): an XML interpreter that injects different types of faults in the variables of the SUT.
• Test Point Manager (TPM): the module that enables the observation of internal variables of the SUT.

Thanks to this ATE, the proposed framework allows fault injection by modifying variable/signal values of the models. This corruption of the variable/signal values is performed using saboteurs. However, in contrast to other approaches ([16], [18]), our signal sabotaging technique is not intrusive, i.e., the fault injection activity is handled by the ATE and the models are not modified for the injection. For example, if the test case requires the corruption of a variable, the FIU reads the variable, performs the required fault injection (e.g. invert-boolean), and writes the new value back into the variable in the communication channel, without making any modification to the model. The injections take place during the communication phases and are performed instantaneously (in zero simulation time). This way, the timing constraints of the model remain unchanged, and the fault is injected without the need to sleep() or restart() the simulation.

The FIU supports two different fault modes, transient and permanent. Permanent faults remain active until the simulation ends. Transient faults, in contrast, are temporary misbehaviors, so the duration of the fault has to be defined in the fault configuration.

Our framework provides two different fault libraries, one for PIMs and one for PSMs. The fault library for PIMs draws on the failure modes defined in the MOGENTES project, and compiles up to 5 different fault models for logical variables and 8 fault models for integer and floating point data-type variables, as Table I shows.

Since PSMs include HW components that are not defined in the PIMs, new fault models arise in the PSM. The physical faults in HW components might be of very diverse types, such as Single Event Upsets (SEUs) in DRAM memories caused by ionizing radiation or the complete destruction of a processor due to a large temperature increase. However, the models of HW faults directly depend on the abstraction level of the component models. Thus, since our approach models the HW components at a high abstraction level, the fault library for PSMs is composed of the effects to which HW-related faults are typically reduced in the literature [20] (Table II). However, since the models of the HW components might be extended to add more detail about their behavior, the fault library for PSMs may also be extended with new lower-level fault models.

Fault configurations for the ATE are defined in XML files. The selected XML schema for the definition of fault injection campaigns complies with the international ASAM AE HIL standard for hardware-in-the-loop testing. Although the aim of this work

Listing 1: Code extract from fault configuration
<!-- FAULT LOCATIONS -->
sut model inst Output SUT MODEL OUT IR
sut model inst Output SUT MODEL OUT IRM
sut model inst Output SUT MODEL OUT X
<!-- FAULTS -->
invert IR Boolean Invert
StuckAtTrue IRM Boolean StuckAt true
Offset X Integer Offset 12
<!-- FAULT CONFIGURATION -->
FS1 TRANSIENT 0.5 1.0
FS2 TRANSIENT 0.7 2.0
FS3 TRANSIENT 0.5 6.0

is not to perform fault injection at hardware-in-the-loop level, sticking to the standard enables forward reuse of the fault injection campaigns until the final prototyping phase. A graphical tool for fault injection configuration definition with automatic code generation has been developed. Listing 1 shows an extract of a sample fault configuration file created with the tool. In compliance with the standard, the definition of the fault configuration is made in three steps. First, the variables where the faults are going to be injected are declared in the Locations part. Second, the behavior of the faults is defined in the Faults section. Each fault refers to a single location and specifies a fault effect from the library. Finally, the Fault Configuration groups the faults in fault sets, where temporal properties are set. Each fault set may refer to one or more faults, and each fault may be referred to by more than one fault set.
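The three-step structure (locations, faults, fault sets) and the read-modify-write injection performed during the communication phase can be sketched in Python as follows. This is a hypothetical model, not the ATE or the ASAM schema: the class and field names are invented, only three fault effects are covered, and the two numbers of each fault set in Listing 1 are assumed to be a start time and a duration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fault:
    name: str
    location: str          # name of the variable to corrupt (hypothetical)
    effect: str            # "Invert", "StuckAt" or "Offset" in this sketch
    value: object = None   # configuration attribute, if the effect needs one


@dataclass
class FaultSet:
    name: str
    mode: str              # "TRANSIENT" or "PERMANENT"
    start: float           # assumed: activation time
    duration: float = 0.0  # assumed: active time span, ignored if permanent
    faults: list = field(default_factory=list)

    def is_active(self, t):
        if t < self.start:
            return False
        if self.mode == "PERMANENT":
            return True    # permanent faults stay active until the end
        return t < self.start + self.duration


def apply_effect(fault, current):
    if fault.effect == "Invert":
        return not current
    if fault.effect == "StuckAt":
        return fault.value
    if fault.effect == "Offset":
        return current + fault.value
    raise ValueError("unsupported fault effect: " + fault.effect)


def inject(channel, fault_sets, t):
    """Read each targeted variable, corrupt it and write it back.

    `channel` is a dict standing in for the communication channel; the
    model that produced the values is never modified.
    """
    for fault_set in fault_sets:
        if fault_set.is_active(t):
            for fault in fault_set.faults:
                channel[fault.location] = apply_effect(fault, channel[fault.location])
    return channel
```

For example, a transient fault set active from 0.5 for a duration of 1.0 that inverts a Boolean `IR` changes the value seen at t = 1.0 but leaves it untouched at t = 2.0.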

V. European Train Control System (ETCS)

The European Railway Traffic Management System (ERTMS) [27] is a European Union-backed initiative for the definition of a unique train signaling standard throughout Europe. The high-speed train on-board ETCS shown in Figure 2 is a safety-critical embedded system (SIL-4, Safety Integrity Level) that protects the train by supervising the traveled distance and speed, and activating the emergency brake if the authorized values are exceeded. It relies on the distance and speed measurements provided by the on-board odometry system, which performs dead reckoning based on a set of diverse sensors such as wheel angular speed encoders and Doppler radars. The railway infrastructure provides the train's absolute position whenever a new eurobalise is read, and this location is used to correct and recalibrate the odometry system.

A. Subsystems, on-board units

Figure 2 shows the ETCS on-board units as described in [27], where all subsystems are connected to the central EVC:

• The European Vital Computer (EVC) is the locomotive central safety processing unit that communicates with all subsystems and executes all safety functions associated with the traveling speed and distance supervision. The EVC executes the safety kernel and includes the odometry subsystem, which estimates the traveled distance and speed based on a set of diverse sensors.
• The Driver Machine Interface (DMI) is the driver interface, periodically updated with state parameters (e.g., traveling speed) and transmitting sporadic event information (e.g., button pressed).
• The Juridical Recorder Unit (JRU) records all relevant external events (e.g., new eurobalise message) and internal events (e.g., activate emergency brake).
• The Balise Transmission Module (BTM) receives the information sent by the eurobalises as the train passes them, and transmits it to the EVC.
• The Global System for Mobile Communications - Railway (GSM/R) interface enables the bidirectional information exchange between remote control centers and the train.
• The Train Interface Unit (TIU) reads / writes a set of input / output digital values, such as the emergency brake digital output.

B. Functionality of the European Vital Computer (EVC)

The EVC comprises 4 main tasks: the speed and position estimation, the mode control, the emergency brake control and the service brake control. The speed and position estimation is performed by an odometry subsystem, which reads the information

provided by the sensors and the BTM, and makes an estimation of the actual speed and position of the train.

The mode control unit activates the Standby or Supervision modes depending on the command received from the DMI. This information is sent both to the Emergency and Service brake control units. The Standby mode is supposed to be active when the train is stopped, and it activates the emergency brake. In Supervision mode, the EVC supervises the current speed and position of the train and activates the warning and brakes when the maximum permitted speed values are exceeded.

The emergency brake control unit implements the safety-critical (SIL-4) functionality of the system. It receives the information about the position and speed estimated by the odometry system, the Standby and Supervision activation signals from the mode control unit, and the reset command from the DMI. When the mode control unit sends the Standby activation signal, the emergency brake is activated. On the other hand, when the system is set to Supervision mode, the estimated distance and speed are compared to a pre-defined braking curve that sets a maximum speed for each point in the track. If the estimated speed is higher than the maximum authorized speed, the emergency brake is activated. The brake is released if the train is stopped and the driver sends a reset command through the DMI.

The service brake control unit implements the non-safety-critical functionality of the European Vital Computer. It receives the data about the estimated position and speed from the odometry unit and the mode activation signals from the mode control unit. If the system is in Standby mode, the service brake and the warning are deactivated. However, in Supervision mode, the warning signal and service brake are activated when the speed of the train reaches the warning activation speed and the service brake activation speed respectively. The maximum speeds are pre-defined in two braking curves that define maximum speeds for warning and service brake activation at each point in the track. Both the warning and the service brake are deactivated when the speed of the train falls below the warning activation speed.

[Fig. 2: ETCS on-board reference architecture.]

[Fig. 3: Platform Independent Model.]

VI. Case Study

We applied the approach described in Section III to the verification of the design of an on-board ETCS

for high-speed trains.

[Fig. 4: Functionality of the PIM.]

A. Platform Independent Model

For the sake of simplicity, we omit the GSM/R and JRU subsystems from the case study. Furthermore, we choose a single DMI design and we include the speed sensors, BTM and TIU in the system environment, so the system under test is composed of the driver-machine interface and the components in the EVC. We design the PIM of the system relying on the PI-TTM meta-model described in Section III. The functionality of the PIM consists of 5 jobs deployed in 4 DASes (Figures 3, 4):

• Odometry DAS: Executes the Odometry job, which implements the speed and position estimation function.
• DMI DAS: Executes the DMI job, which implements the driver-machine interface functionality.
• Mode-Control DAS: Executes the Mode-Control job.
• Brake-System DAS: Executes the Emergency Brake Control and the Service Brake Control jobs.

The selected odometry algorithm is based on the fault-tolerant sensor-fusion technique proposed by Malvezzi et al. [28], which estimates the speed and distance traveled by the train with the information provided by an accelerometer that measures the acceleration of the train and two encoders that measure the speed of a different wheel each. Once we complete the PIM we proceed to the Failure Mode and Effect Analysis (FMEA), as required for safety-critical systems.

[Fig. 5: Functionality of PSM.]

The FMEA of the PIM enumerates 109 potential failure modes, rooted in 276 different potential causes. According to the FMEA, the PIM should tolerate 36 potential failure modes, due to the fault tolerance provided by the selected odometry algorithm. We design the functions in SCADE [29] and we generate the C code with its automatic code generator. Then we integrate the resulting components into the PIM of the system to test its fault-tolerance properties.

B. Homogeneous Platform Specific Model

In this example the ERTMS system is deployed onto a Triple Modular Redundant (TMR) platform, as the international EN-50128 standard for railway applications recommends for safety-critical (SIL-4) systems. Redundancy increases the robustness of the system against the failure modes that are not masked by the PIM. Figure 5 shows the overall functionality of the PSM of the system. For the sake of simplicity, the ServiceBrake, Warning and BaliseNotDetected activation lines are omitted from the figure.

The TMR system is composed of three nodes, each of them hosting a replica of the EVC functionality. Each node is connected to its dedicated sensors and BTM, so that a fault in the sensors of one node does not affect the other nodes. Two exact 2oo3 voters handle the replicated values of the EVC nodes. The voting function has two operating modes: normal voting mode and degraded voting mode. The functionality of the voting system is described below:

• The starting operating mode of the voters is normal voting mode.
• If the three values received by a voter are equal, the voter remains in normal voting mode and forwards the input value to the output. No warning is sent to the DMI.
• If one of the replicated values received by the voter differs from the other two, the voter switches to degraded voting mode. In that case the voter behaves as a 1oo2 voter, where the distinct input value is ignored. The inputs coming from the faulty node are no longer taken into account by the voting algorithm. The result of the 1oo2 algorithm is forwarded to the output, and a warning about the faulty node is sent to the DMI.
• If there is a disagreement between the two active inputs when the voter is in degraded voting mode, the voter sends a warning to the DMI to inform about a multiple failure in the system. The EVC system is disconnected and the emergency brakes are applied to the train.

One of the voters controls the electric braking system of the train whereas the other voter controls the pneumatic braking system. This way a failure in one of the voters does not jeopardize the reliability of the braking system.

Figure 6 shows the platform specific model of the system in the PS-TTM. The components in gray are given by the PS-TTM library. The SUT is defined as a cluster containing 6 nodes. The redundant EVCs are hosted in three identical nodes, each containing a dual-core processor. One of the cores is the host for the safety-critical jobs, i.e., the odometry job, the mode control job and the emergency brake control job, whereas the other core hosts the service brake control job. The two voters and the DMI are deployed into dedicated nodes. These nodes are composed of a single-core processor.

Fig. 6: Platform Specific Model

Once the Homogeneous PS-TTM model is completed, we proceed to the Failure Mode and Effect Analysis (FMEA). In this case the FMEA enumerates 282 potential failure modes. In this model, 92% (261 out of 282) of the failure modes should be masked by the system (e.g., a faulty processor in one of the replicated EVC nodes that reacts too slowly due to a drift in its clock rate). However, 21 failure modes rooted in systematic errors in software design are not tolerated.

This PSM is homogeneous, i.e., all the replicated EVC nodes of the system contain the same functional software. In this case we design the functions with SCADE and we generate C code automatically. The C code for the voters is also automatically generated from SCADE models.

C. Heterogeneous Platform Specific Model

A new platform specific model, called Heterogeneous PSM, is designed in order to tolerate the systematic errors that are not masked by the Homogeneous PSM. The Heterogeneous PSM is based on the same architectural model as the Homogeneous PSM (Figure 6). However, in this case we integrate different functional software in each replicated EVC node. This software diversity prevents systematic software faults due to misunderstandings of the requirements, since it is highly unlikely that different programmers using different programming languages and different programming tools make the same errors. The FMEA of the heterogeneous platform specific model identifies 324 potential failure modes in the system. The implemented fault-tolerance mechanisms should make the system tolerant to all these potential failure modes.

We apply design diversity in the system as follows: we design the functions of the EVC with three different tools, SCADE, MATLAB-Simulink and IBM Rhapsody, and we generate C code with the automatic code generators of each tool. Then, we deploy the components generated by each tool in a specific node.

VII. Results

The presented modeling and simulation framework comprises the ATE introduced in section IV. Therefore, we used the ATE to perform simulations and assess the functional and non-functional properties of the models described previously. To do so, first we simulated the models against a pre-defined test case and analyzed the trace files generated by the TPM to check the functionality of the system. In a second stage, a simulated fault injection campaign was carried out with each potential failure mode identified in the FMEA. The trace files were examined in order to check the propagation of errors through the system and evaluate the correctness of the models. The error propagation was depicted as a Swiss cheese model.

A. PIM simulation results analysis

As mentioned in section VI, the FMEA expects that faults in sensors would be masked by the odometry algorithm. The results of the simulated fault injection campaigns on the PIM show the robustness of the odometry system against faults in sensors. As an example, Figure 7 shows the results provided by the odometry algorithm when a permanent fault is injected in the accelerometer (s_m' and v_m'), and compares them against the non-faulty simulation results (s_m and v_m). These results correspond to the situation in which the accelerometer permanently breaks down at the 20th second and provides random values to the odometry system. This fault injection campaign caused the highest error rate among all the simulations, which rose to 0.05% in the estimation of position and 2.24% in speed estimation. All in all, the estimation errors made by the algorithm are considered acceptable, since they always fall below 5% of the traveled distance and speed, as demanded by the requirements of the system. Hence, we can state that the potential failures identified by the FMEA for the PIM are correctly handled by the system. However, as the FMEA identified, faults in other components such as the Emergency Brake Control lead the system into an unsafe state, as the extract of the Swiss-cheese model of the PIM shows in Figure 8a.

Fig. 7: Influence of accelerometer faults in odometric estimations. (a) Traveled distance estimation in faulty (s_m') and non-faulty (s_m) conditions; axes: Time (s), Traveled Distance (m). (b) Speed estimation in faulty (v_m') and non-faulty (v_m) conditions; axes: Time (s), Speed (m/s).

B. Homogeneous PSM simulation results analysis

Compared to the FMEA of the PIM, the FMEA of the Homogeneous PSM enumerates more potential failure modes, due to HW-specific errors that were out of the scope of the platform independent model. However, the implementation of the PSM as a TMR system increases its robustness, as the results obtained in the SFI of this model show. Redundancy at PS-TTM level involves more defensive layers, as the Swiss cheese model in Figure 8b shows. For example, the simulations show that the late reading of an input value by one of the jobs deployed in a redundant EVC node is now tolerated, since it is detected and masked by the voters. Furthermore, the complete 'freezing' of one of the EVC nodes is also tolerated thanks to the voting algorithm. The SFI campaign results also show that the system is robust against errors in the voting nodes, e.g., when the processor that computes the voting algorithm stops working. However, systematic errors in software are not masked by this model.

C. Heterogeneous PSM simulation results analysis

A SW-homogeneous system design does not tolerate systematic errors in the design of the software, such as an incorrect implementation of the distance estimation algorithm of the odometry system; therefore we developed a new, heterogeneous-software design variant. The heterogeneous PS-TTM increases the robustness of the system since systematic errors would only affect one of the replicated functional jobs and these will not repeat in the others. The simulations performed with the ATE show that systematic errors are now tolerated by the system since they are detected by the voters, as Figure 8c illustrates.

Fig. 8: Extract of the Swiss-cheese models obtained by fault injection with ATE in PIM and PSMs. (a) Extract of the "Swiss cheese model" of PIM. (b) Extract of the "Swiss cheese model" of Homogeneous PSM. (c) Extract of the "Swiss cheese model" of Heterogeneous PSM.

VIII. Conclusion

A deterministic behavior is mandatory when designing safety systems based on programmable electronics. Nowadays, the design of dependable embedded systems widely relies on time-triggered distributed platforms. A characteristic of time-triggered architectures is that the software is split in atomic jobs executing synchronously, so that we can prevent failures due to faulty synchronization points or altered orders of execution. This reduces the number of system failures that safety engineers should consider.

This paper introduces a modeling and simulation environment for time-triggered dependable systems. This framework supports the design of platform independent models, based on the Logical Execution Time (LET) Model of Computation (MoC), and the refinement of the design into platform specific models based on the Executable Time-Triggered MoC (E-TTM). The purpose of the framework is to improve the comprehensibility of the system design to safety engineers by providing fault-behavior insights.

The simulation infrastructure provides a time-triggered Automatic Test Executor (ATE), which is capable of reading signals and performing simulated fault injection on the System Under Test (SUT). The ATE is synchronized with the simulation time of the SUT, in such a way that functional tests and fault injection experiments become reproducible.

We implemented a non-intrusive simulated fault injection technique in the framework, which enables injecting faults in the system model during simulations, without requiring the modification of the system models. This is achieved by monitoring the processes of writing and reading the variables of the SUT and modifying their values if required. The framework provides the user with a library of faults in order to configure the fault injection experiments. The ATE imports these configurations for carrying out the verification campaigns.

As the mapping of LET and E-TTM-based models to time-triggered architectures is straightforward, this eventually facilitates the re-usability of tests even on real prototypes, provided that we could build a test harness with equivalent real-world Fault Injection Units (FIUs).

The use of non-intrusive Simulated Fault Injection (SFI) from the early stages of the design is beneficial for designers, since they are able to detect weaknesses in the design long before assembling a costly system prototype. Non-intrusive SFI also eases doing FMEAs at the next refinement steps of the design. The results of SFI campaigns support the FMEA by benchmarking the behavior of the fault-tolerance mechanisms against the foreseen failure modes.

We evaluated the framework in a case study consisting of a railway system based on the European Vital Computer (EVC). We modeled the system at different abstraction levels, and then we checked by simulation the effectiveness of the fault tolerance mechanisms against the potential failure modes identified in the FMEA. The framework provided simulated fault injection capabilities to exercise the fault-tolerance countermeasures while the integrated ATE supported the repeatability of the tests. The case study demonstrates the suitability of the framework for the design and fault-tolerance assessment of dependable time-triggered systems.
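As an illustration of the two mechanisms summarized above, the following minimal Python sketch combines a 2oo3 voter with normal and degraded modes (as in the Homogeneous PSM) with a non-intrusive injector that alters values only while they are being read. It is a sketch under stated assumptions, not the framework's implementation: `SignalBus`, `Fault`, `Voter`, `simulate` and `EMERGENCY_BRAKE` are hypothetical names, the replicated EVC job is replaced by a trivial stand-in, and treating three mutually different inputs as a multiple failure is an assumption the text does not spell out.

```python
from typing import Callable, NamedTuple

EMERGENCY_BRAKE = "EMERGENCY_BRAKE"  # hypothetical safe-state output


class Fault(NamedTuple):
    """Hypothetical fault-library entry: corrupt `signal` from time `start` on."""
    signal: str
    start: float
    corrupt: Callable[[int], int]


class SignalBus:
    """Stores SUT signals; reads are monitored and altered if a fault is active."""

    def __init__(self, faults=()):
        self.now = 0.0
        self._values = {}
        self._faults = list(faults)

    def write(self, signal, value):
        self._values[signal] = value

    def read(self, signal):
        value = self._values[signal]
        for fault in self._faults:
            if fault.signal == signal and self.now >= fault.start:
                value = fault.corrupt(value)
        return value


class Voter:
    """Exact 2oo3 voter with a normal and a degraded voting mode."""

    def __init__(self):
        self.active = [0, 1, 2]   # replicas still taken into account
        self.warnings = []        # messages that would be sent to the DMI

    def vote(self, inputs):
        values = [inputs[i] for i in self.active]
        if all(v == values[0] for v in values):
            return values[0]      # agreement: forward the value, no warning
        if len(self.active) == 3:
            for odd in self.active:
                rest = [inputs[i] for i in self.active if i != odd]
                if rest[0] == rest[1]:
                    # One replica disagrees: exclude it, enter degraded mode.
                    self.active = [i for i in self.active if i != odd]
                    self.warnings.append(f"node {odd} faulty")
                    return rest[0]
        # Disagreement in degraded mode (or, by assumption, three distinct values).
        self.warnings.append("multiple failure")
        return EMERGENCY_BRAKE


NODES = ("evc_a", "evc_b", "evc_c")


def simulate(faults=(), steps=4):
    """Run unmodified replicas and voter; faults act only through the bus."""
    bus, voter, outputs = SignalBus(faults), Voter(), []
    for t in range(steps):
        bus.now = float(t)
        for node in NODES:
            bus.write(node, 10 * t)   # stand-in for the replicated EVC job
        outputs.append(voter.vote([bus.read(n) for n in NODES]))
    return outputs, voter.warnings
```

For example, `simulate([Fault("evc_b", 2.0, lambda v: -1)])` yields the same voted outputs as the fault-free run plus a single warning about node 1, and repeating the call gives identical results because the injection is tied to simulation time. The sketch does not latch the disconnected state after an emergency brake; that simplification is deliberate.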

Acknowledgment

This research work has been supported in part by the European FP7 DREAMS project under grant No. 610640 and the Spanish INNPACTO project VALMOD under grant number IPT-20111149-370000.

References

[1] James Reason, “The contribution of latent human failures to the breakdown of complex systems,” Philos Trans R Soc Lond B Biol Sci, vol. 327, no. 1241, pp. 475–484, 1990.
[2] James Reason, E. Hollnagel, and J. Paries, “Revisiting the ‘Swiss Cheese’ Model of Accidents,” Tech. Rep., European Organisation for the Safety of Air Navigation, 2006.
[3] Joaquin Miller and Jishnu Mukerji, “MDA Guide Version 1.0.1,” 2003-06-12.
[4] Object Management Group (OMG), “OMG Unified Modeling Language (OMG UML) 2.4.1,” 2011-08-05.
[5] Object Management Group (OMG), “OMG Systems Modeling Language (OMG SysML) 1.3,” 2012-06-01.
[6] IEEE, “IEEE Standard SystemC Language Reference Manual,” 2005.
[7] Jon Perez, Carlos Fernando Nicolas, Roman Obermaisser, and Christian El Salloum, “Modeling Time-Triggered Architecture Based Real-Time Systems Using SystemC,” in Forum on specification & Design Languages (FDL) 2010, Tom J. Kaźmierski and Adam Morawiec, Eds., Springer, 2010.
[8] Hermann Kopetz and Günther Bauer, “The Time-Triggered Architecture,” in Proceedings of the IEEE, 2003, vol. 91, p. 15.
[9] Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr, “Basic Concepts and Taxonomy of Dependable and Secure Computing,” IEEE Trans. Dependable Secur. Comput., vol. 1, no. 1, pp. 11–33, 2004.
[10] Alfredo Benso and Paolo Prinetto, Fault Injection Techniques and Tools for Embedded Systems Reliability Evaluation, Kluwer Academic Publishers, 2003.
[11] Haissam Ziade, Rafic Ayoubi, and Raoul Velazco, “A Survey on Fault Injection Techniques,” The International Arab Journal of Information Technology, vol. 1, pp. 16, 2004.
[12] Eric Jenn, Jean Arlat, Marcus Rimen, Joakim Ohlsson, and Johan Karlsson, “Fault injection into VHDL models: the MEFISTO tool,” in Fault-Tolerant Computing, 1994. FTCS-24. Digest of Papers., Twenty-Fourth International Symposium on, 1994, pp. 66–75.
[13] Joaquin Gracia, Juan Carlos Baraza, Daniel Gil, and Pedro Jose Gil, “Comparison and application of different VHDL-based fault injection techniques,” in Defect and Fault Tolerance in VLSI Systems, 2001. Proceedings. 2001 IEEE International Symposium on, 2001, pp. 233–241.
[14] Juan Carlos Baraza, Joaquin Gracia, Sara Blanc, Daniel Gil, and Pedro Jose Gil, “Enhancement of Fault Injection Techniques Based on the Modification of VHDL Code,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 16, no. 6, pp. 693–706, 2008.
[15] Shadi Moazzeni, Saadat Poormozaffari, and Amin Emami, “An Optimized Simulation-Based Fault Injection and Test Vector Generation Using VHDL to Calculate Fault Coverage,” in Microprocessor Test and Verification (MTV), 2009 10th International Workshop on, 2009, pp. 55–60.
[16] Silvio Misera, Heinrich Theodor Vierhaus, and André Sieber, “Fault Injection Techniques and their Accelerated Simulation in SystemC,” in Digital System Design Architectures, Methods and Tools, 2007. DSD 2007. 10th Euromicro Conference on, 2007, pp. 587–595.
[17] Rishad Ahmed Shafik, Paul Rosinger, and Bashir Al-Hashimi, “SystemC-based Minimum Intrusive Fault Injection Technique with Improved Fault Representation,” International On-line Test Symposium (IOLTS), p. 6, 2008.
[18] Cristiana Bolchini, Antonio Miele, and Donatella Sciuto, “Fault Models and Injection Strategies in SystemC Specifications,” in Digital System Design Architectures, Methods and Tools, 2008. DSD ’08. 11th EUROMICRO Conference on, 2008, pp. 88–95.
[19] Weiyun Lu and Martin Radetzki, “Efficient Fault Simulation of SystemC Designs,” in Digital System Design (DSD), 2011 14th Euromicro Conference on, 2011, pp. 487–494.
[20] Sebastian Reiter, Michael Pressler, Alexander Viehl, Oliver Bringmann, and Wolfgang Rosenstiel, “Reliability assessment of safety-relevant automotive systems in a model-based design flow,” in Design Automation Conference (ASP-DAC), 2013 18th Asia and South Pacific, 2013, pp. 417–422.
[21] Giovanni Beltrame, Cristiana Bolchini, Luca Fossati, Antonio Miele, and Donatella Sciuto, “ReSP: A non-intrusive Transaction-Level Reflective MPSoC Simulation Platform for design space exploration,” in Design Automation Conference, 2008. ASPDAC 2008. Asia and South Pacific, 2008, pp. 673–678.
[22] MOGENTES, “Fault Models,” Tech. Rep., MOGENTES, 2009-12-29.
[23] ASAM HIL workgroup, “ASAM AE HIL Programmers Guide,” 2009.
[24] Christoph M. Kirsch and Ana Sokolova, The Logical Execution Time Paradigm, chapter 5, pp. 103–120, Springer Berlin Heidelberg, 2012.
[25] Iban Ayestaran, Carlos Fernando Nicolas, Jon Perez, and Peter Puschner, “Modeling Logical Execution Time Based Safety-Critical Embedded Systems in SystemC,” in 3rd Mediterranean Conference on Embedded Computing (MECO), 2014.
[26] Iban Ayestaran, Carlos Fernando Nicolas, Jon Perez, Asier Larrucea, and Peter Puschner, “Modeling and Simulated Fault Injection for Time-Triggered Safety-Critical Embedded Systems,” in Object/Component/Service-Oriented Real-Time Distributed Computing (ISORC), IEEE 17th International Symposium on, Forthcoming 2014.
[27] Peter Winter, Bettina Guiot, and International Union of Railways, Compendium on ERTMS: European Rail Traffic Management System, Eurail Press, 2009.
[28] Monica Malvezzi, Benedetto Allotta, and Mirko Rinchi, “Odometric estimation for automatic train protection and control systems,” Vehicle System Dynamics, vol. 49, no. 5, pp. 723–739, 2010.
[29] Ansys Esterel, “SCADE Suite,” http://www.esterel-technologies.com/products/scade-suite/, 2014.
