A Tool Set for Integrated Software and Hardware Dependability Analysis Using the Architecture Analysis and Design Language (AADL) and Error Model Annex

Myron Hecht, Alexander Lam
Computers and Software Division, The Aerospace Corporation, El Segundo, California, USA
{myron.hecht|alexander.k.lam}@aero.org

Chris Vogl
Dept. of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, Illinois, USA
[email protected]
Abstract— Cyberphysical (embedded) computer system availability and reliability can be modeled and assessed using the Architecture Analysis and Design Language (AADL) and its Error Model Annex. AADL can represent systems at multiple levels of abstraction; therefore, analyses can be performed early and often throughout the development process, thereby minimizing the cost and schedule impact of changes. We discuss how the AADL and its Error Model Annex can be used for automated generation of a reliability/dependability model. We then describe a tool set that graphically creates AADL system architecture and error behavior files, which are then transformed into Stochastic Petri Net (SPN) and Stochastic Activity Network (SAN) representations, and demonstrate its use on a generic satellite example.
Keywords: AADL, Error Model Annex, reliability analysis, availability, modeling, stochastic activity network, stochastic Petri net

I. INTRODUCTION
This paper describes a tool set to evaluate and assess safety and reliability for systems defined in the Society of Automotive Engineers (SAE) Architecture Analysis & Design Language (SAE AS 5506, AADL) [1], [2]. With this tool set, called the Model Driven Design and Analysis (MDDA) Workbench, users can graphically draw and define a system architecture consisting of hardware and software elements; define the failure detection and recovery behavior of the system; and perform quantitative and qualitative analyses. The benefit of the tool set is to provide timely feedback to decision makers in multiple phases of development to ensure that reliability and safety requirements can be met. AADL supports qualitative and quantitative analysis of the hardware and software architecture. Annex E of the standard, the Error Model Annex, defines a textual notation for error and recovery modeling. Rugina et al. [7] provided an initial description of how the architecture and error model together can be used to generate a Stochastic Petri Net (SPN) for quantitative analysis. This paper describes (1) a new graphical error model editor implementing the textual notation in Annex E using state transition diagrams and (2) extensions of Rugina's tool chain from SPNs to Stochastic Activity Networks [8] to enable a full analysis of reliability, availability, and safety (referred to collectively in this paper as "dependability"). The remainder of this paper is organized as follows: Section II provides an overview of the MDDA Workbench, Section III provides an application example, Section IV describes the use of the results, and Section V presents conclusions.

II. MDDA WORKBENCH OVERVIEW
Figure 1 shows the top-level data flow of the MDDA Workbench, and Table 1 identifies the component tools.
Figure 1. MDDA Workbench Data Flow

The process of creating a model starts with the definition of the AADL architecture using the TOPCASED graphical editing tool [3] and the OSATE AADL code generation and validation tool [4]. Then, the error models for each of the architecture elements are developed using the error model editor (or, alternatively, a model is taken from a library of existing models). The architecture and error model editors and OSATE operate within the Eclipse development environment [5] and modeling framework [6]. The two output files are combined and routed to the ADAPT tool [7], which creates an SPN output file. The SPN file is then converted into a Stochastic Activity Network (SAN) file for use by the Mobius [8] analysis tool for quantitative reliability and availability predictions. A separate tool chain is used for generation of a Failure Modes and Effects Analysis (FMEA) and is described elsewhere [10].

Table 1. Tools in the AADL reliability, availability, and dependability tool set

Component          | Function                                                                  | Developer
TopCASED           | Graphical editor to create AADL architecture diagrams                     | TOPCASED Consortium (modifications by The Aerospace Corporation)
OSATE              | Open Source AADL Tool Environment: AADL syntax-sensitive editor, AAXL generator | Software Engineering Institute (modifications by The Aerospace Corporation)
Error Model Editor | Graphical editor to create AADL error model diagrams                      | The Aerospace Corporation
ADAPT              | AADL to Stochastic Petri Net generator                                    | LAAS/CNRS and Software Engineering Institute
ADAPT-M            | Stochastic Petri net to Mobius stochastic activity network converter      | The Aerospace Corporation
Mobius             | Quantitative dependability modeling and prediction tool [8]               | University of Illinois, Urbana-Champaign
FMEA Generator     | Generate FMEA from AADL architecture and error models                     | The Aerospace Corporation
III. APPLICATION EXAMPLE
This section describes an architecture and a dependability model for a hypothetical satellite created using the MDDA tool set. The first subsection describes the AADL architecture model, and the second describes the error model as displayed by the newly developed error model graphical editor.

A. Architecture Model
The architecture model consists of the levels shown in Table 2, together with the associated AADL component types. Because the focus of this use of AADL is on failure and recovery behavior, the only interfaces shown on these types are related to events. However, in more comprehensive architectural models, data ports and event data ports would also be included. There are multiple implementations for each type definition. All implementations share the same interface (defined in the type definition); thus, if a change is needed in the interface, it will apply to all implementations of the type.
Table 2. Component Types in Space Vehicle Model

Level              | Type            | Description and Notes
System             | Space Vehicle   | Top-level system type and component
Subsystem          | Control Unit    | Subcomponent of the Space Vehicle object containing control channels. Two implementations: one for bus and one for payload.
Subsystem          | FMS             | Failure Monitor – an artificial component
Assembly           | Control Channel | Subcomponent of the Control Unit component. Three implementations: two for bus (primary and backup) and one for payload.
Hardware Component | CPU             | Subcomponent of control channel. Two implementations: bus and payload.
Process            | Software        | Flight software running in a processor
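To illustrate the type/implementation pattern, the following is a minimal AADL sketch; the identifiers and port names are hypothetical, and the enclosing package declarations are omitted. The actual model's declarations are not reproduced in this paper.

    -- Sketch (hypothetical names): a subsystem type with event-only
    -- interfaces, and two implementations sharing that interface.
    system ControlUnit
    features
      Failed   : out event port;  -- reports loss of the unit to the FMS
      Activate : in  event port;  -- FMS command to (re)activate a channel
    end ControlUnit;

    system implementation ControlUnit.bus
    end ControlUnit.bus;

    system implementation ControlUnit.payload
    end ControlUnit.payload;

Because both implementations reference the same type, an interface change made in ControlUnit is automatically inherited by ControlUnit.bus and ControlUnit.payload.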
Figure 2. Informal Representation of Hypothetical Satellite Architecture
1) System Level Description
Figure 2 shows an informal representation of the satellite architecture using the AADL graphic components. The satellite consists of a bus and a payload, controlled by the bus and payload control units (BCU and PCU), respectively. Both the PCU and BCU have dual redundant processors; the constituent processors of the BCU are called Bus Control Processors (BCPs), and those of the PCU are called Payload Control Processors (PCPs). Each processor pair has an internal network, and a vehicle network connects the BCU and PCU. The BCU uses active or "hot" redundancy in a primary/standby configuration. That is, each BCP executes a copy of the bus control software, with one of the copies being active. The PCU uses a "cold redundancy" scheme in which only one processor is active and the other is turned off. Upon failure, the software is restarted on the backup processor in the PCU.
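The system level of such a model might be declared in AADL roughly as follows. This is a minimal sketch under assumed names (SpaceVehicle, VehicleNetwork, and the implementation labels are illustrative, not taken from the actual model), with connection declarations omitted:

    system SpaceVehicle
    end SpaceVehicle;

    system implementation SpaceVehicle.generic
    subcomponents
      BCU        : system ControlUnit.bus;      -- dual hot (primary/standby) BCPs
      PCU        : system ControlUnit.payload;  -- dual cold (active/dormant) PCPs
      VehicleNet : bus VehicleNetwork;          -- links the BCU and the PCU
    end SpaceVehicle.generic;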
2) Subsystem Level Architectural Elements
The satellite subsystems are the Bus Control Unit (BCU), the Payload Control Unit (PCU), and the Failure Management System (FMS). Figure 3 shows the BCU at the subsystem level. It consists of two channels, a primary channel and an alternate channel, together with a third component, the vehicle Failure Management System. The FMS is a virtual component that accumulates state information and determines which channel is active on the BCU as well as the PCU. Each of the BCU channels has event ports leading to and from the FMS that communicate events such as hardware or software failures.
Figure 3. Bus Control Unit
We call the FMS a virtual component because the space vehicle modeled here does not have a separate and distinct component that performs this function; the functionality is distributed and replicated in the hardware and software of the BCU. Indeed, use of a single-point detection and recovery component for the satellite would be a questionable design practice because it would be a single point of failure. However, the FMS is necessary as a modeling artifact for initiating reconfiguration or transition of the architectural components. For example, if the software on the primary channel fails, the FMS is notified of this event through the event port, and then through its error model (discussed below) it will (with a certain probability) notify the backup channel to take control. These actions are described as state transitions in the error models (described in the next section) associated with the architectural components. Prior to deciding on the use of this virtual component, we considered two alternative architectural representation approaches. The first was the use of lower granularity (i.e., larger scale) error models inside the subsystems. This alternative was discarded because the architecture would be too abstract to be usable by system engineers and other stakeholders, thereby defeating the purpose of the integrated graphical and analytical representation of the architecture. The second was the use of the AADL Behavior Annex. This alternative was rejected because, at the time this work was performed, the Behavior Annex and associated tools were not standardized. Although the payload is redundant, the backup system is in cold standby, i.e., there is only one channel active at a time. The events going to the FMS are failures of the internal processors or software. Events coming from the FMS are (1) notification that the BCU is down (in which case the PCU immediately goes down) and (2) activation of the unpowered channel after the failure of an active payload channel.

3) Assembly Level Elements
Figure 4 shows the assembly level architectural elements for the BCP (a similar diagram would be created for the PCP). At this level, the BCP (or PCP) hardware and software are considered as a single subcomponent called a channel: each Control Channel consists of one Control Software (BCS, PCS) process and one Control Processor (BCP, PCP). Events to the FMS indicate a failure, a recovery, or a termination of the software, and events from the FMS activate or deactivate the channel. Events to the CPU indicate whether the software is failed or working. If further detail were necessary, each of the processes could be further decomposed into threads for bus and payload subsystem control (e.g., attitude control, power management, thermal control, payload control, etc.), and in addition to the event ports, data and control interfaces could also be shown.
Figure 4. Primary and Backup Bus Control Channels
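A hedged sketch of how such a channel might be declared, with the software process bound to the channel's processor, is shown below. The identifiers are illustrative, and the binding property syntax follows the AADL v1 (AS 5506) style current when this work was performed:

    system ControlChannel
    features
      SWFail    : out event port;  -- software failure reported to the FMS
      Recovered : out event port;  -- software recovery reported to the FMS
      SetRole   : in  event port;  -- FMS directive (primary or backup)
    end ControlChannel;

    system implementation ControlChannel.bus_primary
    subcomponents
      BCS : process   BusControlSoftware.primary;
      BCP : processor CPU.bus;
    properties
      -- run the bus control software on this channel's processor
      Actual_Processor_Binding => reference BCP applies to BCS;
    end ControlChannel.bus_primary;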
4) Hardware Component and Process Levels
The CPU and software component types are the lowest level in the architecture. There are two implementations of the CPU type: one for the bus (BCP) and one for the payload (PCP). There is a single event port from the process component to the CPU to indicate the software status. Event ports to and from the bus software processes indicate role (i.e., whether they are primary or backup).

B. Error Models
This section describes the error models for the generic satellite using the graphical displays of the newly developed error model editor. All the models are represented as state transition diagrams. Triangles are event ports, with the point of the triangle indicating the direction. Events are generated and leave (propagate out from) the diagram when a state or transition with the matching name is passed through (for a transition) or entered (for a state).
1. CPU Error Model Implementation
Figure 5 shows the error model for the CPU component. The state transition diagram has three states: active, standby, and terminated. For this model, it is assumed that the CPU does not have a recoverable hardware failure (analytically, this is equivalent to a recoverable software failure and can be included in the failure and recovery rate of the software component). Thus, any failure in this model results in a terminated processor. For a payload processor, there is additionally the standby state, in which it is dormant until woken up by the FMS. A transition can also occur to the standby state if the FMS indicates that the processor should stop working and go to sleep (this would occur if the bus processors failed). The CPUFail transition occurrence property is Poisson (i.e., random). For the other transitions, the transition occurrence property is fixed.
Figure 5. CPU Error Model Implementation
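In the Annex E textual notation, which the graphical editor renders as Figure 5, this model might be written roughly as follows. The rate is a placeholder, the identifiers are assumptions, and the enclosing annex library wrapper is omitted:

    error model CPU_Error
    features
      Active     : initial error state;
      Standby    : error state;
      Terminated : error state;
      Fail       : error event {Occurrence => poisson 1.0e-5};   -- placeholder rate
      CPUFail    : out error propagation {Occurrence => fixed 1.0};
      Wake       : in error propagation;
      Sleep      : in error propagation;
    end CPU_Error;

    error model implementation CPU_Error.payload
    transitions
      Active     -[Fail]->        Terminated;  -- unrecoverable hardware failure
      Terminated -[out CPUFail]-> Terminated;  -- notify the FMS and the software
      Active     -[in Sleep]->    Standby;     -- FMS puts the processor to sleep
      Standby    -[in Wake]->     Active;      -- FMS wakes the dormant processor
    end CPU_Error.payload;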
The BCP processor uses only the active and terminated states; the PCP processor also uses the standby state. As opposed to the event ports in the architectural models, error models contain error propagations (the term is something of a misnomer, since such propagations can also trigger recovery and restoration actions). There is one outbound error propagation (CPUFail) triggered by the CPU fail transition, and two inbound propagations (initiated by the FMS) for Wake and Sleep.

2. Bus Control Software Error Model Implementation
Figure 6 shows the error model for the Bus Control Software (BCS) component. There are seven states in the transition model. When the software is in the active state, it is controlling the satellite bus; when it is in standby, the software is executing in shadow mode. The "down" state is a general state for a recoverable failure and is merely a pass-through to one of two more persistent states: a minor failure (DownMinor) with a short recovery time, or a major failure (DownMajor) with a long recovery time. The terminated state covers the case in which the software cannot recover at all; it can be reached from either the active or standby states. The transitions in the BCS error model are fairly clear with the exception of the "Wake", "Sleep", and "SwitchFail" transitions. For the BCS, the "Sleep" transition takes the software to the shadow execution (standby) state, and the "Wake" transition takes the standby software to the controlling execution (active) state. These names have a different meaning for the PCS (discussed below). The "SwitchFail" transition propagates a recovery failure that actually occurs in the FMS model (discussed below). The Fail, MinorRepair, and MajorRepair transitions are random and have an occurrence property of Poisson. The Fail_case_Minor and Fail_case_Major transitions are proportions and have fixed values (i.e., 99% of failures are minor and 1% are major). The error propagations are between this model and the FMS model (discussed below) as well as with the CPU hardware.
Figure 6. Bus Control Software Error Model Implementation
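The minor/major branch can be expressed in the Annex E notation with fixed-probability transitions out of the pass-through Down state. The following fragment is a sketch under assumed identifiers and placeholder repair rates (only the 0.99/0.01 split is taken from the model described above):

    error model BCS_Error_Fragment
    features
      Active          : initial error state;
      Down            : error state;   -- pass-through state
      DownMinor       : error state;   -- short recovery time
      DownMajor       : error state;   -- long recovery time
      Fail            : error event {Occurrence => poisson 1.0e-4};  -- placeholder
      Fail_case_Minor : error event {Occurrence => fixed 0.99};
      Fail_case_Major : error event {Occurrence => fixed 0.01};
      MinorRepair     : error event {Occurrence => poisson 0.5};     -- placeholder
      MajorRepair     : error event {Occurrence => poisson 0.05};    -- placeholder
    end BCS_Error_Fragment;

    error model implementation BCS_Error_Fragment.sketch
    transitions
      Active    -[Fail]->            Down;
      Down      -[Fail_case_Minor]-> DownMinor;  -- 99% of failures
      Down      -[Fail_case_Major]-> DownMajor;  -- 1% of failures
      DownMinor -[MinorRepair]->     Active;
      DownMajor -[MajorRepair]->     Active;
    end BCS_Error_Fragment.sketch;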
3. Payload Control Software Error Model Implementation
Figure 7 shows the error model for the Payload Control Software (PCS) component. The model has four states: active, down, standby, and terminated. Because the payload uses cold rather than hot standby, this model does not have the Repaired, DownMajor, and DownMinor states found in the BCS error model.
Figure 7. Payload Control Software Error Model Implementation
For the PCS, the "Sleep" transition takes the software to the dormant (standby) state, and the "Wake" transition takes the standby software to the controlling execution (active) state. The "Fail" transition takes the software to a down (recoverable) state. As was the case with the BCS, some failures are non-recoverable and result in the terminated state. The Fail and Terminate transitions are random and have an occurrence property of Poisson. As was the case with the BCS, the error propagations are between this model and the FMS model (discussed below) as well as with the CPU hardware.

4. Failure Management System
Figure 8 shows the error model for the Failure Management System (FMS) component. The model has seven states, but most of the residence time is spent in either the UsingPrimary or UsingBackup states. These states refer to the BCS computers. When a failure of the primary BCS software occurs, a transition occurs to the SwitchingPtoB state. The FMS can either successfully or unsuccessfully switch to the backup. A successful switch results in a transition to UsingBackup. An unsuccessful switch results in either a Down state or a terminated failure. A subsequent failure in the backup processor would result in a transition to the SwitchingBtoP state. If that switch is successful, the FMS reverts to the UsingPrimary state. A terminal failure of the primary or backup results in a transition to the SwitchFail state and then to the Down state. From the Down state, the FMS attempts to restore operation to either the primary or the backup. Unsuccessful restoration results in a terminated state. A terminal PCS failure would also result in a satellite-level terminated state. The inbound and outbound error propagations for the FMS have been discussed in the context of the three preceding error models.
Figure 8. Failure Management System Error Model Implementation
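A sketch of how the switchover branch might appear in the Annex E notation is shown below. The names are illustrative and the switch success probability is a placeholder (the coverage value actually used in the study is not reproduced here):

    error model FMS_Error_Fragment
    features
      UsingPrimary  : initial error state;
      SwitchingPtoB : error state;
      UsingBackup   : error state;
      Down          : error state;
      PrimaryFailed : in error propagation;  -- from the primary BCS error model
      Switch_OK     : error event {Occurrence => fixed 0.95};  -- placeholder coverage
      Switch_Fail   : error event {Occurrence => fixed 0.05};
      Wake          : out error propagation {Occurrence => fixed 1.0};
    end FMS_Error_Fragment;

    error model implementation FMS_Error_Fragment.sketch
    transitions
      UsingPrimary  -[in PrimaryFailed]-> SwitchingPtoB;
      SwitchingPtoB -[Switch_OK]->        UsingBackup;  -- successful switchover
      SwitchingPtoB -[Switch_Fail]->      Down;         -- imperfect coverage
      UsingBackup   -[out Wake]->         UsingBackup;  -- activate the backup channel
    end FMS_Error_Fragment.sketch;

It is this fixed-probability Switch_Fail branch that gives rise to the "SwitchFail" recovery failure propagated into the BCS model described above.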
IV. POSTPROCESSING ANALYSIS AND RESULTS
Definition of the architecture and error models is the first step in generating the qualitative and quantitative results that provide insight into system dependability. Table 3 lists the postprocessing steps using the tools introduced above. A more complete description of the process of generating Mobius files and the utility of their results was presented elsewhere [9]. An additional step not shown in the table, FMEA generation, was described elsewhere [10].

Table 3. Post-architecture definition processing steps

Step                                                         | Tool    | Input                                                    | Output
1. Generation of a Petri net                                 | ADAPT   | AADL and AAXL files generated during architecture definition | Petri net in XML format
2. Conversion of Petri net into a stochastic activity network | ADAPT-M | Petri net generated by ADAPT                             | Mobius input file
3. Creation of quantitative results                          | Mobius  | ADAPT-M files                                            | Quantitative results
Figure 9. Mission Duration vs. Hardware MTBF
Figure 9 shows the impact of hardware reliability (as measured in MTBF) on mission duration (the analysis was terminated at 100,000 hours, the assumed design life of the satellite). The figure shows the relatively small impact of the failure rate of the individual processors on the expected number of operational hours: for the 100,000-hour mission, a factor of 10 increase in hardware MTBF leads to only a 25% increase in mission duration. The reason for this relatively modest increase is the presence of redundancy and an effective failure detection and switchover scheme. The cost of the redundancy is additional space, weight, and power consumption. In some cases (particularly for shorter missions), an increase in reliability through the use of higher quality parts, more screening, and more software testing might be a more cost-effective alternative. Figure 10 shows the impact of software recovery times on total uptime. A decrease in the recovery time from 20 hours to 2 hours increases the uptime by approximately 15,000 hours for the bus and 23,000 hours for the payload. This should be compared with the results shown in Figure 9: whereas it might be impossible to increase the reliability of a subsystem from 100,000 to 1,000,000 hours, it is generally more feasible to decrease the recovery time, which achieves a similar benefit. Figure 10 shows the impact of changing the recovery probability from a minor failure upon uptime.
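The intuition behind this trade can be sketched with a steady-state availability approximation; this is a simplification of the SAN model actually solved by Mobius, stated in generic symbols rather than the study's parameters:

\[
A \;=\; \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}},
\qquad
\text{Uptime} \;\approx\; A \times T_{\text{mission}}.
\]

Because \(\mathrm{MTTR} \ll \mathrm{MTBF}\) in practice, the unavailability \(1 - A \approx \mathrm{MTTR}/\mathrm{MTBF}\) scales linearly in both parameters: cutting the recovery time by a factor of 10 reduces downtime by roughly the same factor as raising the MTBF by a factor of 10, and the former is usually far cheaper to achieve.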
Figure 10. Uptime vs. Software Recovery Time

V. CONCLUSION
This paper described how the AADL and its Error Model Annex are used for automated generation of a reliability/dependability model and failure modes and effects analyses. A tool set to create AADL system architecture and error behavior files that are then transformed into Stochastic Petri Net (SPN) and Stochastic Activity Network (SAN) representations was presented, and an application to a hypothetical satellite was provided. Future work will involve incorporation of the AADL Behavior Annex to provide further verification of the failure detection and recovery algorithms. The benefit of the tool set is to provide timely feedback to decision makers to ensure that reliability and safety requirements can be met. Current practice, which does not use automated tools, is to perform reliability and safety analyses only at the preliminary or detailed design milestones, when a significant amount of engineering effort has been expended and the cost of changes becomes much higher, or even infeasible under project cost and schedule constraints.

REFERENCES
[1] Society of Automotive Engineers (SAE) AS-2C Committee, SAE Architecture Analysis and Design Language (AADL) Annex Volume 1: Annex A: Graphical AADL Notation, Annex C: AADL Meta-Model and Interchange Formats, Annex D: Language Compliance and Application Program Interface, Annex E: Error Model Annex, June 2006, available online at http://standards.sae.org/as5506/1/ (charges apply).
[2] Society of Automotive Engineers (SAE) AS-2C Committee, SAE Architecture Analysis and Design Language (AADL) Annex Volume 2: Behavior, Data Modeling, and ARINC653 Annex Compendium (in progress), http://standards.sae.org/wip/as5506/2.
[3] TOPCASED consortium home page, http://www.topcased.org, last visited June 27, 2010.
[4] Software Engineering Institute, AADL OSATE home page, http://www.aadl.info/aadl/currentsite/tool/toolsets.html, last visited June 27, 2010.
[5] Eclipse Foundation Open Source Community website, http://www.eclipse.org/, last visited December 15, 2010.
[6] Eclipse Foundation Modeling Documentation – Eclipsepedia, http://wiki.eclipse.org/Modeling, last visited December 15, 2010.
[7] A. Rugina, K. Kanoun, and M. Kaaniche, "The ADAPT Tool: From AADL Architectural Models to Stochastic Petri Nets through Model Transformation," Proc. 7th European Dependable Computing Conference (EDCC), Kaunas, Lithuania, 2008.
[8] D. D. Deavours, G. Clark, T. Courtney, D. Daly, S. Derisavi, J. M. Doyle, W. H. Sanders, and P. G. Webster, "The Mobius framework and its implementation," IEEE Trans. on Software Engineering, vol. 28, no. 10, pp. 956–969, October 2002.
[9] M. Hecht, A. Lam, and C. Vogl, "Use of Test Data for Integrated Hardware/Software Reliability and Availability Modeling of a Space Vehicle," Proc. 2009 AIAA Space Conference, Pasadena, CA, September 2009, available online at http://www.aiaa.org/content.cfm?pageid=320.
[10] M. Hecht, A. Lam, R. Howes, and C. Vogl, "Automated Generation of Failure Modes and Effects Analyses from AADL Architectural and Error Models," Proc. 2010 Systems and Software Development Conference, Salt Lake City, UT, May 2010, available online at http://sstconline.org/2010/.