State/Event Fault Trees

State/Event Fault Trees: A Safety and Reliability Analysis Technique for Software-Controlled Systems (Zustands-Ereignis-Fehlerbäume: Eine Sicherheits- und Zuverlässigkeitsanalysetechnik für softwaregesteuerte Systeme)

Dissertation approved by the Department of Computer Science of the Technische Universität Kaiserslautern for the award of the academic degree Doktor-Ingenieur (Dr.-Ing.)

by Dipl.-Ing. Bernhard Kaiser

Date of the doctoral defence: 27.01.2006

Dean: Prof. Dr. Reinhard Gotzhein

Examination committee:
Chair: Prof. Dr.-Ing. Stefan Deßloch
Reviewers: Prof. Dr.-Ing. Peter Liggesmeyer, Prof. Dr. Klaus Schneider

D 386


Acknowledgements

First and foremost, I thank my student co-workers, without whom my research, and in particular the development of the tools UWG3 and ESSaRel, would not have been possible: the students of the bachelor project "UWG3", the student assistants at the Hasso-Plattner-Institut (HPI) at the University of Potsdam and at the Fraunhofer Institute for Experimental Software Engineering (IESE), and in particular the Master's students Catharina Gramlich, Antje Rogotzki, André Zocher and Marc Förster. Special thanks go to Marc Förster for creating most of the figures and for reviewing the translation algorithm, to André Zocher for compiling the bibliography, and to Sonnhild Namingha of Fraunhofer IESE for proofreading.

I thank my supervisor and reviewer, Prof. Dr.-Ing. Peter Liggesmeyer, for his expert advice, for the freedom he granted me in shaping the topic, and for the generous provision of student assistants. I thank Prof. Dr. Klaus Schneider for acting as second reviewer and for the interesting technical discussions. I thank Prof. Dr.-Ing. Stefan Deßloch for chairing the examination committee.

I am grateful to HPI and Fraunhofer IESE for making this work possible during my time as a research associate and for their manifold support. My thanks also go to Siemens, which not only provided the impetus for developing the UWG3 tool but also supported its development financially and with advice over the entire development period. At Siemens, I am particularly indebted to Oliver Mäckel, Martin Rothfelder and Reiner Heilmann of the corporate research department CT PP 2, and to Frank Renpenning of the Transportation Systems division. I also thank my colleagues at HPI and Fraunhofer IESE for the critical and constructive discussions. I thank Dr. Armin Zimmermann of TU Berlin for making the tool TimeNET available to me and for advising me on questions of DSPN analysis.

On the personal side, I thank my girlfriend Kati and all my friends for their moral support and their understanding of my workload, and my parents for laying the roots of my scientific interest.


Abstract

The ubiquitous presence of software-controlled systems in all areas of everyday life has steadily increased the demand for safety and reliability analyses. On the one hand, these require models that are capable of describing all types of safety-relevant system behaviour, and, on the other hand, analysis techniques that derive quantitative hazard or failure probabilities from these models. Among the multitude of existing techniques, the established Fault Tree Analysis (FTA) is particularly suitable for both tasks. Many researchers have claimed, however, that the expressive power of this technique is insufficient in the domain of software-controlled systems: durations, the order of events, or state interdependencies between different components cannot be expressed. Furthermore, the desirable integration of Fault Trees with software engineering models is hard to achieve due to different definitions of terms and different compositionality approaches.

Suitable expressiveness is easier to achieve with state-based modelling techniques, but these, in turn, do not represent causal chains as clearly as Fault Trees do. Moreover, Markov Chains, an elementary probabilistic and state-based model, do not provide suitable modularisation and also lack the possibility of expressing triggered events. Suitable concepts for modularisation and for triggering relations are found in state-based software engineering models such as Statecharts or ROOMcharts, but these are unsuitable for probabilistic analysis. A promising approach is consequently the combination of modelling elements from Fault Trees and from different state-based techniques.

This dissertation introduces State/Event Fault Trees (SEFTs), a new modelling technique for safety and reliability analysis of software-controlled systems with an associated evaluation technique. SEFTs are a notation that combines elements from traditional FTA with elements from state-based techniques. Their most distinctive difference from traditional Fault Trees is the visual distinction of states and events by different symbols. Besides the representation of causal chains in Fault Tree style, it is possible to express complex behaviour and state dependencies by additionally applying state-machine elements. The latter also allow the formalisation of advanced Fault Tree gates, whose semantics could not be defined consistently in the past. SEFTs (as well as the predecessor model Component Fault Trees, which was also developed in this doctoral research project) consist of self-contained and reusable components that are joined by typed ports. SEFTs are a technique that is both intuitive for practitioners and sufficiently formal for analysis. They enable tight integration with popular software design models, which has been shown by the prototypical import of ROOMchart models. Thus SEFTs constitute an important step towards an integrated design and evaluation process for safety-critical systems.

An underlying state-based model is required both for formalising the semantics of SEFTs and for analysing them probabilistically. Therefore, this dissertation proposes an algorithm that translates SEFTs into Deterministic and Stochastic Petri Nets (DSPNs), for which established analysis and simulation techniques exist. To support the construction and evaluation of SEFTs, the prototypical tool ESSaRel has been implemented, which served to validate the applicability of the SEFT technique in several case studies. Correctness and consistency of the new technique are demonstrated by proofs and test cases.

Summary (Kurzfassung)

As software-controlled systems increasingly pervade all areas of life, the safety and reliability analysis of such systems is becoming ever more important. This requires, first, models that describe the safety-relevant behaviour of systems clearly and, second, analysis techniques that derive from them quantitative statements about the probability of hazard or failure states. Among the many existing methods, the widely used Fault Tree Analysis is particularly well suited to both tasks. It has often been observed, however, that the expressive power of this technique is insufficient precisely in the domain of software-controlled systems: durations, orders of events, and dependencies between the states of different components cannot be represented. Because of the differing terminology and approaches to modularisation, integration with software engineering models is also difficult to achieve. The expressive power needed for software systems is better achieved with state-based modelling techniques; these, in turn, do not show the causal chains in hazard scenarios as clearly as Fault Trees do. Moreover, Markov Chains, as an elementary probabilistic state-based technique, offer neither a component concept nor a means of expressing triggered events. Suitable component and trigger concepts are found in state-based techniques from software engineering, such as Statecharts or ROOMcharts, but these in turn are not suited to probabilistic analysis. A promising approach is therefore the combination of modelling elements from the Fault Tree technique and from the various state-based techniques.

This dissertation presents State/Event Fault Trees (Zustands-Ereignis-Fehlerbäume, SEFTs) as a new modelling technique for safety and reliability analyses, together with an associated evaluation method. SEFTs are a notation that combines elements of the traditional Fault Tree technique with elements from state-based techniques. Their essential difference from standard Fault Trees is the visual distinction of states and events by different symbols. Besides representing causal relationships in the style of traditional Fault Trees, elements of state machines can be used to describe more complex behaviour and dependencies. State models also allow the formalisation of extended Fault Tree gates, whose semantics could not be described consistently in the previous Fault Tree technique. SEFTs (as well as the predecessor model of Component Fault Trees, which was also developed in the course of this doctoral work) consist of reusable components that are connected via typed ports. SEFTs are a technique that is intuitively understandable for practitioners and yet sufficiently formalised for analysis. They enable integration with widespread software modelling techniques, which has been shown by way of example through the import of ROOMcharts. They therefore form an important step towards an integrated system design and safety analysis process for safety-critical systems.

Formalising SEFTs semantically and evaluating them quantitatively requires an underlying state-based model. This dissertation therefore proposes an algorithm that translates SEFTs into Deterministic and Stochastic Petri Nets (DSPNs), for which proven analysis and simulation methods exist. For the construction and evaluation of SEFTs, the prototypical tool ESSaRel was implemented, with the help of which the SEFT technique was validated in case studies. Correctness and consistency of the new technique are shown by proofs and test cases.

Contents

1 Introduction
  1.1 Motivation
  1.2 Proposition of the Thesis
  1.3 Structure of the Thesis

2 Preliminaries and State of the Art
  2.1 Definitions
  2.2 Mathematical Foundations
  2.3 Safety and Reliability Analysis Techniques
    2.3.1 Situation and Classification
    2.3.2 Important Safety and Reliability Analysis Techniques
    2.3.3 Application to Embedded Systems and Research Agenda
  2.4 Fault Tree Analysis
    2.4.1 Introduction
    2.4.2 An Example
    2.4.3 Quantitative Analysis of Fault Trees
    2.4.4 FTA in Practice: Extensions, Limitations and Ambiguities
  2.5 State-Based Modelling
    2.5.1 State-Machines and State-Based Software Engineering Models
    2.5.2 Markov Chains and Other Probabilistic State-Based Modelling Techniques
    2.5.3 Petri Nets
  2.6 Structural Modelling in Software Engineering
  2.7 Recent Research and Remaining Issues

3 Component Fault Trees (CFTs)
  3.1 Motivation
  3.2 Traditional Fault Tree Decomposition by Modules
  3.3 Informal Introduction to Component Fault Trees
  3.4 CFT Model Elements and Rules for Well-Formedness
  3.5 Application Aspects
  3.6 Analysis of Component Fault Trees
    3.6.1 Introduction
    3.6.2 The BDD-Based Analysis Algorithm Adapted to CFTs
    3.6.3 Reduction of Analysis Effort by Exploiting the Component Concept
  3.7 Evaluation and Discussion

4 State/Event Fault Trees (SEFTs)
  4.1 Motivation
  4.2 Informal Introduction to SEFTs
  4.3 An Introductory Example
  4.4 SEFT Modelling Elements
  4.5 Rules for SEFT Well-Formedness
  4.6 The SEFT Gate Set
    4.6.1 Introductory Remarks
    4.6.2 The AND Gate (n State Inputs)
    4.6.3 The AND Gate (1 Event Input and n State Inputs)
    4.6.4 The OR Gate (n State Inputs)
    4.6.5 The OR Gate (n Event Inputs)
    4.6.6 The NOT Gate
    4.6.7 The Inhibit Gate
    4.6.8 The Exclusive-OR (XOR) Gate
    4.6.9 The Equal Gate
    4.6.10 The Voter Gate
    4.6.11 The History-AND Gate and its Variants
    4.6.12 The Priority-AND Gate and its Variants
    4.6.13 The Delay Gate
    4.6.14 The Conditional Probability Gate
    4.6.15 The Duration Gate
    4.6.16 State/Event Adapter Gates
    4.6.17 Extending the Gate Set
  4.7 Some SEFT Examples
  4.8 Application in the Development Process

5 Analysis by Translation into DSPNs
  5.1 Choice of an Appropriate Intermediate Model
  5.2 Deterministic and Stochastic Petri Nets
  5.3 Modularisation of DSPNs
  5.4 Translation into DSPNs
    5.4.1 Overview and Technical Remarks
    5.4.2 Initialisation and Precondition Check
    5.4.3 Translating States and Events
    5.4.4 Translating Ports
    5.4.5 Translating Gates: The Dictionary
    5.4.6 Translating Subcomponent Footprints
    5.4.7 Translating Edges
  5.5 Optional Simplification
  5.6 Flattening
  5.7 Initial Marking of the DSPN
  5.8 Analysis or Simulation with TimeNET

6 The Tool Projects UWG3 and ESSaRel
  6.1 UWG3
  6.2 ESSaRel

7 Evaluation
  7.1 Correctness, Confluence and Consistency
    7.1.1 Boolean Logic Aspects
    7.1.2 Gates Involving Memory
    7.1.3 Test Cases for Gates
    7.1.4 Test Cases for Inner Consistency
    7.1.5 Test Cases for the Component Concept
    7.1.6 Consistency with Traditional Fault Trees
    7.1.7 Consistency of Priority-AND with Traditional Approach
    7.1.8 Consistency with Dynamic Fault Trees
    7.1.9 Consistency with Markov Chains
    7.1.10 Semantical Issues
  7.2 Case Studies
    7.2.1 A Simple Fire Alarm
    7.2.2 A Motorway Safety System
    7.2.3 Inverted Pendulum: A Case Study for the Import of ROOMcharts into SEFTs
    7.2.4 Further Case Studies
    7.2.5 Observations from the Case Studies
  7.3 Comparison to the State of the Art
  7.4 Limitations and Improvement Proposals
    7.4.1 Basic Events and Solitary Events
    7.4.2 Additional Event Parameters
    7.4.3 Multiple Predecessor and Successor States
    7.4.4 Hierarchical States
    7.4.5 Spare Pools and Repair Dependencies
    7.4.6 Semantics Extensions and Further Formalisation
    7.4.7 Import of Software Design Models
    7.4.8 Software Defect Modelling
    7.4.9 Integration with other Software Engineering Approaches
    7.4.10 Other Analysis Types from Traditional FTA
  7.5 Analysis Performance
    7.5.1 Analysis Time Evaluation
    7.5.2 Performance Improvement

8 Conclusion

A The Gate Dictionary
B The Translation Algorithm as Pseudo-Code
C Proofs and Validity Arguments

Bibliography
List of Figures
Curriculum Vitae
Publication List

Chapter 1 Introduction

1.1 Motivation

Software-controlled systems, often called embedded systems, have increasingly replaced mechanical or electrical components in safety- or reliability-critical domains such as aerospace, automotive, railway or industrial automation. In these domains, safety and reliability analysis is a mandatory part of system development. Different techniques exist that help analysts to identify safety issues, explain their causes and consequences, and quantify their probabilities. Most of them stem from an era when safety-critical systems consisted mainly of mechanical or electronic parts, so their applicability to software-controlled systems is not guaranteed.

Among the modelling techniques that can both depict the causal chains leading to a hazard and calculate its probability, Fault Tree Analysis (FTA) is one of the most widely used. Fault Trees (FTs) are intuitive for practitioners due to their hierarchical structure and the familiar logical symbols. They provide a set of qualitative and quantitative analyses for system failures and hazards. FTs have been used for several decades and are required by customers, by standards, and by approval authorities in many safety-critical domains. Nowadays they are gaining more and more importance in the context of software-controlled systems. Nevertheless, some differences between Fault Trees and the models that are commonly considered appropriate for embedded system design have been identified, and these make many industrial companies reluctant to apply them.

First, most modern modelling techniques for embedded systems offer a decomposition concept, as the complexity of today's systems and the partitioning of work prohibit handling a whole system at once. Although FTs inherently describe a hierarchical breakdown of the system, this decomposition reflects the topology of the failure influence graph rather than the system architecture as defined by the designer. Modules in terms of FTA are independent sub-trees, but this independence assumption fails for the majority of technical components, because they strongly influence each other. Unlike in component-based design models, FT modules cannot be modelled as separate and reusable entities. This hampers the division of work as well as component reuse.


Moreover, being a combinatorial technique, traditional FTs cannot handle many phenomena that are important when explaining chains of events in software-controlled systems, such as state dependencies, temporal order of events, or duration of states. Safety, as will be discussed later on, is a matter of behaviour rather than a static property of a technical unit, and thus requires dynamic models. Current behavioural models that are used in the design of software-controlled systems are based on states and events. State-based models are much more suitable for describing embedded system behaviour than FTs or most other current safety analysis techniques.

The consequences of this discrepancy are twofold. First, the application of FTs to embedded systems is hampered. Many software designers consider FTs inappropriate for this class of systems. Those who do apply traditional FTs to software-controlled systems often use them beyond their intended purpose, making tacit assumptions about the semantics they really mean. This is a source of ambiguities and misinterpretation. Second, the desirable integration of system and software design techniques with safety and reliability assurance techniques into an all-encompassing development process is hard to achieve if the techniques used in each phase do not match in their semantics. It is not possible to refer to knowledge about the system behaviour during safety analysis. Instead, the safety evaluation models have to be created from scratch. This is not only an efficiency issue, but also a matter of soundness: if hazardous events are identified during safety analysis, they cannot be matched to the events considered at certain points in the design modelling phase and thus cannot be traced back to the design model to examine their occurrences.

It would be highly desirable if the states and events that an FT refers to could be identified by the same name in a design model describing system behaviour. Then it would also be possible for each technical component to be a self-contained unit that has its own safety model associated with it.

In summary, three main drawbacks of Fault Trees in the context of software-controlled systems are their lack of an adequate component concept, of semantic expressiveness and of integration into the system development process. Other safety and reliability analysis techniques do not solve these issues any better. State-based modelling techniques from software design offer better capabilities for behavioural modelling and could manage integration with system design, but lack the probabilistic aspects that are required for safety and reliability analysis. Markov Chains, which are a probabilistic state-based model, provide neither adequate compositionality nor capabilities to describe the deterministic behaviour of software. Above all, all of these state-based models have difficulty showing the causal chains leading to hazards or accidents as clearly as Fault Trees or Event Trees do.

As a contribution to overcoming these problems, this doctoral thesis proposes combining the traditional FTA technique with a state-based model. FTs are enhanced by a state/event distinction that not only extends the expressive power of FTs, but also enables their integration with state-based semantics. Additionally, an enhanced component concept for FTs is presented. The resulting analysis technique is called State/Event Fault Trees (SEFTs). An appropriate evaluation technique based on the translation to a kind of Petri Net (DSPNs) has been developed. An analysis tool for SEFTs named ESSaRel has been implemented. The correctness and applicability of SEFTs have been evaluated using different test settings and case studies, and the new technique has been compared to existing techniques.
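The limitation regarding the temporal order of events can be illustrated with a small sketch. It is only an illustration under assumptions: the event names `valve_stuck` and `pump_started` are hypothetical, and the tiny state machine is not the formal gate semantics that the thesis defines later. It merely shows that a purely combinatorial AND cannot tell two orders of occurrence apart, whereas a state-based description can.

```python
from itertools import permutations


def boolean_and(occurred, inputs):
    """Combinatorial AND: true once every input event has occurred, in any order."""
    return all(name in occurred for name in inputs)


def ordered_and(trace, first, second):
    """Tiny state machine: fires only if `first` is observed before `second`."""
    state = "idle"
    for event in trace:
        if state == "idle" and event == first:
            state = "armed"
        elif state == "idle" and event == second:
            state = "blocked"  # wrong order; this simple sketch never recovers
        elif state == "armed" and event == second:
            return True
    return False


# Hypothetical event names, used only for illustration.
EVENTS = ("valve_stuck", "pump_started")

for trace in permutations(EVENTS):
    print(
        list(trace),
        "boolean AND:", boolean_and(set(trace), EVENTS),
        "ordered AND:", ordered_and(list(trace), *EVENTS),
    )
```

For both orders the Boolean AND reports `True`, while the state machine reports `True` only when `valve_stuck` precedes `pump_started`.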

1.2 Proposition of the Thesis

This doctoral thesis introduces State/Event Fault Trees (SEFTs). SEFTs are a new safety and reliability analysis technique that combines modelling elements from Fault Trees (FTs) and from Statecharts. In comparison to traditional FTs, they introduce distinct symbols for states and events and allow modelling with deterministic and probabilistic state-machine semantics where plain Boolean logic is insufficient. With regard to their semantics and their analysis, SEFTs are a state-based model; visually, they preserve the familiar FT notation as far as possible. Like traditional FTs, SEFTs use gates to join different propositions, but these gates are not restricted to Boolean logic. The gate set is extended with respect to FTs, and new gates can be defined based on their state-machine semantics. The gate inputs and outputs are typed to enforce the distinction of states and events. SEFTs allow modelling state transitions that occur deterministically or probabilistically according to an exponential distribution. As a consequence, they subsume Markov Chains and some aspects of Statecharts and similar system design models. Therefore, they form a formal basis for the integration of FTs with other safety analysis models, as well as with system design models. SEFTs consist of components: self-contained entities that are joined by ports and represent technical components. The main advantage of SEFTs is their use of the familiar symbols of FTs and of Statecharts, which makes them directly applicable in industry.

SEFTs will be introduced and formalised in this thesis. Their practical application is demonstrated by several case studies. SEFTs are compared to traditional Fault Trees and other current safety analysis techniques, and it is argued that they constitute a better means of expressing safety-critical scenarios of software-controlled systems than previously existing techniques. SEFTs are analysed by translating them into Deterministic and Stochastic Petri Nets (DSPNs), but without confronting the user with the much more complicated Petri Net structures. The translation to DSPNs also defines the semantics of SEFTs. It is shown that SEFTs are consistent with existing modelling techniques regarding a set of essential properties.

Additionally, the following contributions have been achieved and are presented in this thesis:

1. Component Fault Trees (CFTs) are an intermediate result of the research, but constitute a usable technique in their own right. They were first published in [KLM03]. CFTs augment Fault Trees with a component concept that makes it possible to cut arbitrary parts out of an FT and put them into a reusable component with input and output ports. FT components correspond to the technical components as defined during the architectural design phase. The port concept resembles the component and port concepts of many current design models. Components can be developed in different places by different people and thus allow division of labour and reuse. At the same time, CFTs extend the tree topology to directed acyclic graphs (DAGs) and allow consistent modelling in the presence of repeated events. CFTs still have the Boolean expressive power of traditional FTs, but besides better structuring of projects, they provide efficiency improvements for the analysis. CFTs inspired the component concept of SEFTs.

2. As a proof of concept of the techniques proposed in this thesis, but also to make the techniques industrially applicable, two software tools have been implemented: UWG3, a mature and industry-proven tool that supports CFT analysis, and ESSaRel, its experimental successor, which implements the SEFT concept, including the analysis by translation into DSPNs. ESSaRel incorporates other models such as Markov Chains, ROOMcharts (a Statechart variant) and CFTs, and is meant as a model integration platform. Using these tools, several case studies have been carried out, partly as part of a Master's thesis, partly in cooperation with industrial partners.

3. Some proposals for further enhancements of the techniques and mechanisms to reduce the computational effort of the evaluation are given, although not all of them had been implemented by the time this thesis was submitted. The enhancement proposals result directly from the evaluation experiments carried out to validate the applicability of the SEFT technique and from its comparison to existing techniques. As SEFTs are a state-based model (as opposed to traditional FTs, which are a combinatorial model), calculation time is an issue because of the state-space explosion problem. Therefore, a set of methods towards faster analysis is proposed. These comprise combined usage of combinatorial and state-based techniques, some extensions to the combinatorial analysis algorithm, and some considerations on possible state-space reduction.
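The role of repeated events in a component-based fault tree can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the component `sensor_channel`, the basic events and their probabilities are hypothetical, and brute-force enumeration of all basic-event combinations stands in for the BDD-based algorithm that the thesis actually uses. The sketch shows one reusable component instantiated twice, both instances reading the same basic event, so that the model is a DAG rather than a tree.

```python
from itertools import product

# Hypothetical failure probabilities of the basic events over the mission time.
BASIC_EVENTS = {"power_loss": 0.01, "sensor_a_defect": 0.02, "sensor_b_defect": 0.02}


def sensor_channel(power_failed, sensor_defect):
    """Reusable component: its output port fails if either input port fails (OR gate)."""
    return power_failed or sensor_defect


def top_event(state):
    """System fails only if both redundant channels fail (AND gate).

    Both channel instances read the same 'power_loss' event (a repeated event).
    """
    channel_a = sensor_channel(state["power_loss"], state["sensor_a_defect"])
    channel_b = sensor_channel(state["power_loss"], state["sensor_b_defect"])
    return channel_a and channel_b


def exact_probability(structure, probabilities):
    """Sum the probability of every basic-event combination that triggers the top event."""
    names = list(probabilities)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        if structure(state):
            weight = 1.0
            for name in names:
                p = probabilities[name]
                weight *= p if state[name] else 1.0 - p
            total += weight
    return total


def naive_probability(probabilities):
    """Treats the channels as independent sub-trees, ignoring that 'power_loss' is shared."""
    p_power = probabilities["power_loss"]
    p_a = 1 - (1 - p_power) * (1 - probabilities["sensor_a_defect"])
    p_b = 1 - (1 - p_power) * (1 - probabilities["sensor_b_defect"])
    return p_a * p_b


print("exact:", exact_probability(top_event, BASIC_EVENTS))
print("naive:", naive_probability(BASIC_EVENTS))
```

With these hypothetical numbers the exact result is about 0.0104, while the independence shortcut yields about 0.0009, which illustrates why shared events must be modelled consistently. Enumeration grows exponentially with the number of basic events and is used here only because it is easy to verify.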

1.3 Structure of the Thesis

The thesis is structured as follows: In Chapter 2, domain-specific terms are defined and some mathematical background is given. Next, the state of the art in safety and reliability analysis is summarised, and Fault Tree Analysis as well as state-based modelling techniques are introduced. The existing techniques are discussed in the context of software-controlled systems and some problems are pointed out.

Chapters 3, 4 and 5 form the core part of the thesis and explain the achieved scientific results. Each of them starts with a motivation section that repeats the problems that are addressed by the specific technique. In Chapter 3, the new concept of Component Fault Trees is presented. The analysis by Binary Decision Diagrams (BDDs) is explained and the practical application is discussed. In Chapter 4, the State/Event Fault Tree concept is introduced and then formalised. The evaluation algorithm by translation into DSPNs is described separately in Chapter 5.

The remaining chapters deal with the practical application and the evaluation of the SEFT technique. Chapter 6 introduces the tools UWG3 and ESSaRel, which were developed to evaluate the results of the doctoral research. Chapter 7 shows how the new concepts have been validated and compared to existing techniques. It lists some proofs for logical properties and some test cases. A number of small case studies and an industrial application example are presented that demonstrate the applicability of the SEFT concept. Advantages and drawbacks are critically discussed, and proposals for model enhancements and analysis acceleration are given. Finally, Chapter 8 summarises the achievements in relation to the problems initially described.



Chapter 2 Preliminaries and State of the Art

2.1 Definitions

This thesis deals with safety and reliability analysis. Therefore, the definition of the terms safety and reliability is essential, but so is the definition of the related terms accident and hazard. Safety can be related to the behaviour of a component. Behaviour, in turn, is determined by events and states. As the subject of the thesis is quantitative, i.e., probabilistic analysis, the necessary measures and formulas will be given in the subsequent section.

Component is an identifiable entity with a well-defined interface and deterministically or stochastically described behaviour. In embedded systems engineering it usually designates a self-contained, i.e., separately deployable piece of hardware or software.

System is a set of components which act together as a whole, delimited by a system boundary (cf. the similar definitions in [DIN02b, CL99, Wik]). Systems of technical interest interact with their environment and often have a complex structure. A system can be a component of a higher-level system, and a component of a system can be regarded as a subsystem, allowing further hierarchical decomposition. What is a system, a subsystem and a component depends on the current context (cf. [DIN00]). Thus, system and component are not distinguished in the following. This thesis mainly deals with technical systems and components, although an extension of the scope to non-technical components (e.g., human operator, natural environment) is possible and often necessary in safety analysis.

Mission Time is the expected or assumed time period throughout which a system is in operation.

This can be the time until scheduled shut-down or, in case of repairable systems, until the component is replaced or put into its initial state again. In safety and reliability analysis, it is often necessary to restrict predictions to a given period of time, as due to wear-and-tear technical systems cannot maintain their quality properties for an arbitrary period. Some measures (e.g., expected number of failures) can be related to the mission time. The mission time must be distinguished from the life time [Bir91] or time to failure (TTF) of a non-repairable component, as the latter ends unexpectedly when the component fails. It is possible to choose a domain-specific time unit for mission time and life time (e.g., operation hours, flight hours, but also produced entities for a production plant or kilometers for a vehicle).

State is the totality of the variable properties of a component that are relevant to its behaviour and its reaction to external events. A major distinction in systems theory is between continuous state systems, i.e., systems where the state variables can take values out of a continuous domain, and discrete state systems, where there is a finite or countable number of states [CL99]. A finite state space is often appropriate to model the overall operational behaviour of software-controlled systems. For safety and reliability considerations in particular, an abstraction to a finite number of states is usually sufficient. Regarding an aircraft for instance, the exact height over ground, which is a continuous state variable, is not relevant to safety; it is sufficient to distinguish the equivalence partitions {too low, acceptable, too high} as three operational states.
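The abstraction from a continuous state variable to a few operational states can be illustrated with a minimal Python sketch. The threshold values and all identifiers below are hypothetical and chosen only for illustration; they are not taken from any standard or from this thesis.

```python
from enum import Enum


class AltitudeState(Enum):
    """The three equivalence partitions of the continuous height variable."""
    TOO_LOW = "too low"
    ACCEPTABLE = "acceptable"
    TOO_HIGH = "too high"


# Hypothetical partition boundaries in metres (illustrative assumption only).
MIN_ACCEPTABLE_HEIGHT_M = 300.0
MAX_ACCEPTABLE_HEIGHT_M = 12000.0


def classify_height(height_m: float) -> AltitudeState:
    """Map the continuous height over ground to one discrete operational state."""
    if height_m < MIN_ACCEPTABLE_HEIGHT_M:
        return AltitudeState.TOO_LOW
    if height_m > MAX_ACCEPTABLE_HEIGHT_M:
        return AltitudeState.TOO_HIGH
    return AltitudeState.ACCEPTABLE
```

Every height value falls into exactly one partition, so a model built on `AltitudeState` has a finite state space even though the underlying variable is continuous.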
Event is a term with two different meanings in safety analysis:

Event in the context of probability theory (in particular in the context of Fault Tree Analysis) means any possible outcome of a random experiment, which can be any condition or situation that can be verbally described and classified as being true or false (cf. usage in [VGR81]). For instance, when tossing a coin, the two possible events are “head is up” and “tail is up”.

Event in the context of discrete event systems modelling is a sudden phenomenon without temporal expansion [CL99]. In particular, transitions from one discrete state to another are events. In this thesis, the term event is used in the latter meaning unless otherwise noted. Moreover, the term event is used here to represent a class of similar phenomena that are only distinguished by the time of their occurrence. An event occurrence can be described as a pair of an event and a time stamp.

Behaviour of a system or component is the sequence of event occurrences that is observable at the boundary of a system or component.

Other definitions are in use, especially some that refer to a sequence of states instead of a sequence of event occurrences. For safety analysis, it is useful to concentrate on events that are externally observable, because it is these that can harm the environment.

Failure is any behaviour that diverges from the specified (desired) behaviour (cf. [DIN03a, DIN02b]). In consequence, the precision of the failure definition depends on the precision of the definition of the desired, or correct, behaviour; without a formal definition (which is often the case in practice), the definition becomes vague. A time can be associated with a failure (cf. [Lap95, Sto96, DIN93, DIN02b]): usually, this is the time when some incorrect event first occurs, or the latest time at which some required event that did not actually occur was expected. An incorrect event occurs in particular when some continuous value violates its tolerance range. The concept of events that do not occur or that occur too late is important when considering protective systems (e.g., car airbag, fire alarm system). This latter class of failures is usually called failure on demand. For different purposes in safety and reliability analysis, failures can further be classified, e.g., into safety-critical failures and non-safety-critical failures.¹

Fault is an undesired state or condition in a system, which can cause or admit a failure (cf. [DIN02b, DIN03a]). Thus, a fault can be modelled as a particular state of a component or system.²

Accident (also: harm, mishap, damage, incident) is an undesired (not necessarily unexpected) event that destroys or impairs human life or health, material or economical goods or the natural environment (cf. [Lev95]: “that causes damages or losses”). The definition of event used here implies the sudden nature of accidents, which corresponds to everyday observations.

¹ [PM01] proposes different failure domains (service provision, timing, value) and several classes of failures: value too high, value too low, timing too early, timing too late, commission (unexpected event) and omission (expected event that does not occur). Similar classifications have often been cited throughout safety analysis literature as they fit well with informal failure investigation techniques like HAZOP or FMEA. They do not interfere with the mandatory association of a failure time to each failure: in the definition of a failure time proposed in this thesis, the first case (failure time is the time of an unexpected event) can be mapped to the failure classes value, commission and too early. The second case (failure time is the latest time when an event was expected) covers omission and too late failures.

² In safety and reliability literature, there are many differing definitions for fault. Partly, the additional term error is introduced, but there is no consensus on what is fault and what is error and which one is cause and which one is effect ([Lap95, Lev95, Lig00, Sto96]). Laprie’s definitions are frequently cited. In this thesis, the term error is not used. Most important is the consensus that fault (static) must be distinguished from failure (dynamic) and that a fault can lead to a failure but does not need to.



Hazard is a state of a system and its environment where the occurrence of an accident only depends on factors that cannot be controlled by the system [Lev95]. Other relevant standards simply call it a potential source of harm [DIN02b] or state it more precisely as a physical situation with potential danger to humans [DIN00]. In practice this does not make much of a difference; however, the definition used here stresses the fact that a hazard cannot be judged without referring to the environment. On a railway level crossing for instance, it is a hazard if a train is approaching and the barriers are open. Whether or not an accident actually occurs depends on whether cars are around and whether the drivers are attentive or not, i.e., on factors that cannot be influenced by the level crossing system. This is why technical safety analysis more often deals with hazard probabilities than with accident frequencies. Note that a hazard, being a state, has a probability, whereas an accident, being an event, has a frequency.

Risk is the combination of the expected frequency of an accident with a measure for the severity of its consequences (similar to [DIN02b, DIN00]). When dealing with a hazard instead of the corresponding accident, it is the combination of the hazard probability with a severity measure. It is often more useful to work with hazard probabilities than with accident frequencies, because the latter depend on external factors that cannot easily be quantified. There is not one agreed measure for the severity of the consequences: for technical systems, repair or replacement costs can be applied; but if it comes to death or injury of humans, any measure involves ethical considerations. One possible solution is to choose a set of categories on an ordinal scale, e.g., reaching from “negligible” to “catastrophic”, as proposed by different standards [DO-92, DIN00, DIN02b]. For this reason, the definition suggests “combination of” and not “product of” probability and severity.
Acceptable Risk is the risk that is considered as acceptable according to an agreed criterion. The criteria are not discussed in detail in this thesis. These criteria can, for instance, compare a system to be assessed to existing and accepted systems (GAMAB = “Globalement au moins aussi bon”), claim that deadly accidents introduced by the technical systems should not significantly augment the natural mortality (MEM = Criterion of Minimal Endogenous Mortality), or claim that everything technically possible should be undertaken to keep the risk small (ALARP = “As low as reasonably practicable”). The main goal of safety analysis is to estimate the risk introduced by some technical system in order to compare it to the acceptable risk.

Very important in the presentation of a safety and reliability analysis technique is the definition of the terms safety and reliability. Like many of the terms presented so far, they are used with different meanings, so that an appropriate one for the purpose of this thesis had to be chosen. This applies in particular to the definition of safety, which has several meanings in the different sciences that are related to the subject

of this thesis (computer science, safety engineering and the various domain-specific engineering branches like electrical engineering or automotive engineering). In absolute terms, safety is the non-occurrence of accidents [Lev95] or the absence of hazard [MIL93]. Related to a technical system, it is the property not to cause any accidents [Bir91]. This latter definition is unsuitable for protective systems, e.g., car airbags. Moreover, all of the absolute definitions suffer from the problem that it is practically impossible to construct a technical system such that it never causes or allows an accident; this is why probabilistic definitions have gained importance in many current standards. As this thesis proposes a probabilistic safety analysis technique, it is natural to focus on the probabilistic definition of safety here. The definition given here encompasses both systems that can cause danger and systems that are designed to reduce or avoid danger caused by other systems.

Safety is the situation where the actual risk introduced or permitted by a system is lower than the acceptable risk [DIN00, DIN95, DIN02b]. Thus, safety, as it is used in this thesis, is defined via hazard probabilities related to hazards of different severity. For a new technical system, it has to be shown that its hazard probabilities are lower than the acceptable limits. This is why techniques like Fault Tree Analysis or Markov Chain Analysis are of predominant importance in the construction of these systems. The probabilistic definition of safety is nothing other than a quantification of the binary property that no hazard is present. The other way around, if the acceptable risk is set to zero, the probabilistic definition turns into the absolute one. In the various disciplines of computer science, other definitions are frequently used.
In discrete event systems theory, a safety property is usually understood as a property that always holds, in particular a negative proposition about the reachability of some undesired states: “It is always the case that the system is not in state X.” Different types of temporal logic have their specific syntax to express this fact. An application example is model checking, where the system behaviour is often specified by a state diagram and the safety property by a temporal logic expression; the purpose of model checking is to show (e.g., by exhaustive search of the state space) that the property always holds and the undesired states are never reached. This definition of safety corresponds to the absolute definition of safety given above, but works on a much more abstract level, representing the system by the mathematical notation of a finite state machine.

In software engineering, a common view on safety is to consider it as a quality property of the software system among others like usability, maintainability and efficiency [ISO01]. Sometimes, the term “non-functional properties” is used for these quality properties. This view is not wrong (and in the software engineering context certainly helpful), but can be misleading: it is important to keep in mind that safety is not a software property but a system property; on the other hand, it is certain that software can contribute to the safety or compromise the safety of the systems it controls. Therefore, good software engineering practice can be expected to increase system safety, but there is no immediate evidence. Many researchers agree

that “software safety” is a wrong term and that it is better to speak about “safety of software-controlled systems”. Also, safety is a system property and does not necessarily decompose to subsystems and components. A system is not simply as safe as its components, but its safety is determined by the behaviour of its components. It is important to understand that safety is related to system behaviour, and not simply a property that can be attributed to a component or a piece of software as a number on a scale.

Availability and Reliability are properties that are closely related to safety, not only because the same techniques can partly be applied for both safety and reliability analysis, but also because the safety of a system often decomposes to the availability or reliability of its protective subsystems (e.g., a car is safe as long as its brakes are available).³

Availability is the property of a system to fulfill its specified functions at a given moment in time, provided that defined environment conditions are met [DIN00]. This binary property is quantified by the probability A(t) that the system is available at time t [Bir91]. Average availability over a mission period is the fraction of the time during which the system is available. Availability is a measure that is particularly suited for reactive (on demand) systems (e.g., web server, car brakes), including protective systems (e.g., car airbag). Thus, availability can have a strong impact on safety.

Reliability is the property that a system fulfills its specified functions uninterruptedly over a given period of time, provided that defined environment conditions are met [DIN90]; in quantitative analysis, it is the probability thereof [Bir91, DIN00]. The probabilistic quantification of reliability is the probability that no failure occurs over a given period of time.
Another quantification is the expected time to the first or to the next failure (Mean Time To Failure, MTTF, or Mean Time Between Failures, MTBF). Another important measure is the failure rate (the term rate will be introduced in the next section). Reliability is a suitable measure for continuously working systems, in particular for systems that cause accidents or hazards if their service is interrupted (e.g., heart pacemakers). Probabilistic safety analysis techniques like FTA can basically also be used for availability and reliability analysis.

Dependability has been defined by [Lap95] as a collective term for safety, security, availability and reliability (also in [Bir91, Sto96]); these are sometimes also referred to as RAMS or, if security is included, RAMSS properties [DIN00]. In practice, the term is sometimes used synonymously with reliability.

³ There are also cases where safety and reliability / availability are independent (e.g., a car that is often out of order but never causes an accident) or even anti-correlated (e.g., systems that are not operable because safety functions shut them down). However, it is not true that there is always a possible trade-off between safety and availability: when a safe stable state is unreachable (e.g., an airplane that has left the ground), the system must be operable in order to assure safety.

In summary, safety analysis aims at finding and explaining dangerous events (accidents) or states (hazards) caused or tolerated by technical systems. Reliability analysis investigates failures that would prevent the system prematurely from providing its intended service. Quantitative safety and reliability analysis techniques deal with probability measures of hazards, accidents or critical failures. As hazards and accidents arise as a consequence of undesired behaviour (failure) of technical components, i.e., are causally linked to internal states and events, a common task of safety and reliability analysis is to explain cause-consequence chains that lead to failures, hazards, and accidents, and provide quantitative rules on the probabilities of these states and events and their propagation throughout the system. In the following, the focus is on safety analysis, but the developed techniques are also suitable for reliability and availability analysis or, in general for obtaining probabilistic quantification of any relevant behaviour of a system. There are other terms and quality properties that are often applied in the context of safety-critical or highly reliable systems, e.g., correctness, robustness, fault-tolerance or maintainability. These quality properties do have an influence on safety and reliability, but constitute research domains on their own, with corresponding analysis techniques, and will not be further discussed in this thesis.

2.2 Mathematical Foundations

This thesis mainly focusses on the quantitative, i.e., probabilistic aspects of safety and reliability analysis. The relevant phenomena in safety and reliability analysis are undesired or dangerous states (e.g., hazard, system unavailability) or events (e.g., accident, system breakdown). There are different measures to quantify these. Although not always performed in practice, a careful distinction between states and events is necessary for a correct argument. The main difference from a mathematical point of view is the fact that states have probability at each point in time, whereas events have a probability density or a rate.

A probability is a real number in the range [0,1] that indicates the expected relative number of observations of a particular outcome of some random experiment. A probability at any point in time t can be assigned to every state S of some discrete state component: the probability that the component is in S at this time t. Observing the probability of a state over time renders a function from the time values, i.e., the set of the real numbers in the case of continuous time, to the interval [0,1], denoted as $A_S(t)$. An example for a state probability measure is the availability $A(t)$ of some system, i.e., the probability of being in a working state at time t. $A(t)$ is a function over time. Its average value over the mission time $MT$ can be calculated as:

$$\bar{A} = \frac{1}{MT} \int_0^{MT} A(t)\,dt$$
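The averaging formula can be checked numerically. The following Python sketch approximates the integral with the trapezoidal rule; the availability function used in the example (an exponentially decaying A(t) of a non-repairable system with a hypothetical constant failure rate) and all names are assumptions made for illustration only.

```python
import math


def average_availability(availability, mission_time, steps=10_000):
    """Approximate (1/MT) * integral_0^MT A(t) dt with the trapezoidal rule."""
    dt = mission_time / steps
    total = 0.5 * (availability(0.0) + availability(mission_time))
    for k in range(1, steps):
        total += availability(k * dt)
    return total * dt / mission_time


# Hypothetical example: A(t) = exp(-rate * t) with an assumed failure rate.
ASSUMED_RATE_PER_HOUR = 1e-3
MISSION_TIME_HOURS = 1000.0


def example_availability(t):
    return math.exp(-ASSUMED_RATE_PER_HOUR * t)


average = average_availability(example_availability, MISSION_TIME_HOURS)
```

For this particular A(t), the integral also has the closed form (1 − e^(−λ·MT)) / (λ·MT), which can serve as a cross-check of the numerical result.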


Another relevant measure for states in this context is the average duration for which a state is maintained.⁴ If the state occurs several times over the system mission time, the average over all occurrences is calculated. An example is the Mean Down Time (MDT) or Mean Time to Repair (MTTR) of a repairable system, which is the average duration of the failed state:

$$MDT = MTTR = \frac{\sum_{i=1}^{N} \left(t_{repair,i} - t_{fail,i}\right)}{N}$$

where $N$ is the number of times the failed state is entered during the mission time $MT$.
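As a minimal illustration of the average duration of the failed state, the following Python sketch computes the mean down time from a list of (failure time, repair time) pairs. The outage log and all function names are hypothetical. The sketch also computes the fraction of the mission time spent in the failed state, which is a different quantity (the complement of the average availability).

```python
def mean_down_time(outages):
    """Arithmetic mean of (t_repair - t_fail) over all recorded outages."""
    if not outages:
        raise ValueError("at least one outage is required")
    durations = [t_repair - t_fail for t_fail, t_repair in outages]
    if any(d < 0 for d in durations):
        raise ValueError("repair time must not precede failure time")
    return sum(durations) / len(durations)


def downtime_fraction(outages, mission_time):
    """Share of the mission time spent in the failed state (dimensionless)."""
    total_down = sum(t_repair - t_fail for t_fail, t_repair in outages)
    return total_down / mission_time


# Hypothetical outage log in operation hours: (t_fail, t_repair).
outage_log = [(100.0, 104.0), (250.0, 252.0), (700.0, 709.0)]

mdt = mean_down_time(outage_log)                  # (4 + 2 + 9) / 3 hours
unavailable = downtime_fraction(outage_log, 1000.0)
```

The mean down time carries a time unit, whereas the downtime fraction is a pure number between 0 and 1.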

Other important average durations are the Mean Time to Failure (MTTF) or (applied to repairable systems) the Mean Time between Failures (MTBF).

Events (in terms of sudden phenomena) have no probability at a point in time. What they do have is a probability to occur at least once in a given interval of time, but at a point in time they rather have a probability density or frequency. Safety and reliability engineering often considers events that occur only once (e.g., the failure of a non-repairable system or a severe accident). Many traditional analysis techniques tacitly make the assumption that every event occurs only once. In some cases, however, there are recurring events. These can be dangerous events that do not end the component’s life (e.g., repairable failures), but also events that regularly occur during operation and that lead to a hazard or failure only under certain circumstances. The expected number of occurrences of an event in a given time interval is a positive real number. Dividing this number by the interval length leads to the frequency over the interval, which is a positive real number with the inverse of a time unit (e.g., 5 occurrences per hour). If the interval shrinks to a singular point in time, the probability density results; it can be calculated as:

$$f(t) = \lim_{\Delta t \to 0} \frac{n(t, t+\Delta t)}{\Delta t} = \frac{dn}{dt}$$

with $n(a, b)$ := number of occurrences in time interval $[a, b]$.

The integral of this function over a time interval gives the expectation value for the number of occurrences within this interval. For the important case of events that occur only once, e.g., the failure of a non-repairable component, the probability density is also defined. Its integral from the beginning of the life of the component to a certain point in time is the probability that the event has occurred so far. In the case of a failure event, this is the probability of the failed state of the system. This probability is often denoted as $F(t)$ and can be calculated as

$$F(t) = \int_0^t f(x)\,dx$$

For an event with a single occurrence, the following equation holds:

$$\int_0^\infty f(t)\,dt = 1$$

⁴ This time can be measured in standard time units like hours or years, but also in mission specific units like flight hours or years of operation.


Often the reliability or probability of survival $R(t)$ is requested instead. This is the complementary state to the failed state (assuming that there are no different failure modes and no degrading) and thus

$$R(t) = 1 - F(t) = 1 - \int_0^t f(x)\,dx$$

Another appropriate measure in the context of events with a single occurrence is the expected time until the occurrence. An example is the aforementioned Mean Time to Failure (MTTF). The probability density can be used to calculate this expected time. For instance, let $R(t)$ be the probability that a non-repairable system is still working at time t and $F(t)$ the probability that it has already failed at that time, then the following equation holds:

$$MTTF = E(T_{fail}) = \int_0^\infty t \cdot f(t)\,dt$$

with $f(t)$: failure density function, $T_{fail}$: time of failure.

Another important measure in the context of events is the occurrence rate, or simply rate. The rate is defined for transitions from a source state (e.g., available state) to a target state (e.g., failed state). It is defined as the conditional probability, per unit of time, of a state change within the next infinitesimal time interval, provided the system has been in the source state before. For example, the failure rate of a system is the probability, related to the interval length, that the system fails within the next time interval, provided that it is working now, or, in other words, the fraction of a large number of working systems that fails within the next time interval. The usual symbol for rates is the Greek letter λ. Like probability densities, rates are displayed as positive real numbers with the inverse of a time unit. Resuming the example of the non-repairable system with two states, the failure rate is defined as

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{R(t) - R(t+\Delta t)}{R(t) \cdot \Delta t} = -\frac{1}{R(t)} \cdot \frac{dR(t)}{dt} = \frac{f(t)}{R(t)}$$

and the relation between rate, probability density of the failure event and the probability of the failed state is

$$f(t) = \lambda(t) \cdot R(t) = \lambda(t) \cdot \bigl(1 - F(t)\bigr)$$

Probabilities of failed states are originally the quantitative values that are used in Fault Tree Analysis. Nevertheless, many FTA tools and analysts specify and calculate rates, often without being aware of the mathematical background. Markov Chain analysis, on the other hand, deals with given rates and calculates state probabilities from them. As the equations show, there actually is a relation between states (e.g., the failed state) and events (e.g., failure) that can be formalised correctly if exact terms are used and a state-transition model is assumed. This is why state-based semantics


has been chosen for the State/Event Fault Tree approach of this thesis in order to overcome the traditional inconveniences. The considerations so far did not explain what the probability functions look like, how the events are distributed over time and which standard deviations have to be expected. An MTTF of 10 years could result from the situation that all system instances deterministically die after ten years (constant life time assumption), but also from the case that from the beginning, each year 10% of the systems that are still alive fail (constant rate assumption). The difference between both cases lies in the probability distribution, which determines the shape of the function. For different distributions there are different formulas to calculate the mean value and the standard deviation. For events with a single occurrence, the most important distribution in safety and reliability engineering is the exponential distribution. It assumes a constant rate. It is often used because it reflects the failure behaviour of hardware (but not necessarily software) components quite well, and it has some nice properties that facilitate mathematical analysis. Its probability density function over time is

$$f(t) = \lambda e^{-\lambda t} \quad \text{with } \lambda = \text{const}$$

This means for the probability of the failed state, with $F(0) = 0$:

$$F(t) = \int_0^t f(x)\,dx = 1 - e^{-\lambda t}$$

and for the survival probability (or reliability):

$$R(t) = P(T_{fail} > t) = 1 - F(t) = e^{-\lambda t}$$

In the special case of an exponential distribution, the MTTF is:

$$MTTF = \frac{1}{\lambda}$$
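The relations between the measures for the exponential distribution can be verified numerically. The sketch below, written in Python with illustrative function names and an assumed rate of 0.1 per year, evaluates the four measures and approximates the MTTF integral with the trapezoidal rule on a truncated interval; it is not a general-purpose reliability library.

```python
import math


def density(t, rate):
    """Failure density f(t) = rate * exp(-rate * t)."""
    return rate * math.exp(-rate * t)


def failed_probability(t, rate):
    """Probability of the failed state F(t) = 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * t)


def reliability(t, rate):
    """Survival probability R(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


def failure_rate(t, rate):
    """lambda(t) = f(t) / R(t); constant for the exponential distribution."""
    return density(t, rate) / reliability(t, rate)


def mttf_numeric(rate, horizon_factor=50.0, steps=200_000):
    """Approximate integral_0^inf t * f(t) dt on [0, horizon_factor / rate]."""
    upper = horizon_factor / rate
    dt = upper / steps
    total = 0.5 * (0.0 + upper * density(upper, rate))
    for k in range(1, steps):
        t = k * dt
        total += t * density(t, rate)
    return total * dt


ASSUMED_RATE_PER_YEAR = 0.1  # illustrative value
mttf = mttf_numeric(ASSUMED_RATE_PER_YEAR)
```

With the assumed rate of 0.1 per year, the numerical integral should come out close to 1/λ = 10 years, and `failure_rate` should return the same value for every t.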

Figure 2.1 shows all aforementioned measures for the important case of the exponential distribution. A quantitative analysis technique should at least allow modelling the exponential distribution and should correctly deal with these measures. Other important probability distributions in Reliability Engineering are the Weibull distribution and the Normal and Log-Normal distributions; choosing an appropriate probability distribution for the problem at hand is one of the analyst’s tasks. Independently from the chosen distribution, reliability engineers usually make the assumption that every system is available in the beginning and every system eventually fails:

$$R(0) = 1 \quad \text{and} \quad R(\infty) = 0$$

In the case of events that can occur more than once, the Poisson Distribution is important, because it assumes that after each occurrence, the residual time until the next occurrence is exponentially distributed. Each occurrence is independent from the preceding ones. More details about the mathematical background and various probability distributions can be found in [CL99].
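For recurring events with a constant rate λ, the number of occurrences in an interval of length t follows the Poisson distribution with mean λ·t. The following Python sketch evaluates this distribution; the function names and the example rate are assumptions for illustration.

```python
import math


def poisson_pmf(k, rate, duration):
    """Probability of exactly k occurrences in an interval of given duration."""
    mean = rate * duration
    if mean == 0.0:
        return 1.0 if k == 0 else 0.0
    # Evaluate in the log domain to avoid overflow for larger k.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def expected_occurrences(rate, duration, max_k=200):
    """Expected number of occurrences, summed from the (truncated) pmf."""
    return sum(k * poisson_pmf(k, rate, duration) for k in range(max_k + 1))


# Hypothetical example: 2 occurrences per hour, observed for 3 hours.
ASSUMED_RATE_PER_HOUR = 2.0
DURATION_HOURS = 3.0

p_none = poisson_pmf(0, ASSUMED_RATE_PER_HOUR, DURATION_HOURS)
mean_count = expected_occurrences(ASSUMED_RATE_PER_HOUR, DURATION_HOURS)
```

The probability of no occurrence, `p_none`, equals e^(−λ·t), which is the survival probability R(t) of the exponential distribution; this reflects the exponentially distributed time until the next occurrence.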


Figure 2.1: Rate, Failure Density, Reliability, and Failed State Probability assuming exponential distribution


2.3 Safety and Reliability Analysis Techniques

2.3.1 Situation and Classification

In today’s society, humans strongly rely on technical systems and many technical systems can even cause danger to humans or goods if they fail to provide their service correctly. A certain awareness of safety and reliability issues started to emerge in the early 20th century, but it was only in the 1960s or 70s that systematic safety and reliability research began and a body of techniques, best practices, and standards began to grow. Today, many industrial customers demand quantitative reliability assurances for devices and systems that cause losses in case of failure. If these figures cannot be provided, access to the market is impossible; if systems do not comply with the promised level of reliability, high penalties and the loss of trust can cause severe difficulties to the manufacturer. In safety-critical domains such as the military, aerospace, railway, automotive or medical technology there is even a need for an operation permit from the local authorities before putting systems into service. In order to obtain this permit, quantitative predictions about hazards and accidents have to be provided. A series of accidents caused by technical systems (e.g., the deaths caused by the medical device Therac-25, the Ariane 5 accident, the space shuttle accidents, and a large number of plane and train accidents) has promoted public awareness of safety and reliability of technical systems, and thus a rising importance of safety and reliability analysis has to be expected.

Safety and reliability management deals with
• identification of potential hazards and failures when the system is defined and designed
• investigation of the risks that a system can impose on its environment and definition of tolerable risk levels and reliability goals for the system
• constructive measures to build the system in such a way that it is sufficiently safe and reliable, but yet affordable
• estimation of the residual risk imposed by the system and assurance that the goals are met.
Safety and reliability management is a process that accompanies all development phases from product definition to system deployment and operation. Often it takes an iterative approach: at first, some of the goals are missed, so the system design has to be modified and the analysis has to be repeated until the goals are fulfilled. This suggests that a close integration of system design and safety analysis techniques can make this process more efficient. The analytical tasks in this context comprise listings of potential failures and hazards and of the consequences to the environment, the explanation of hazard occurrence based on technical insights to the system, and the quantitative estimation of

hazard probabilities or failure frequencies. A broad set of techniques has been developed to support these tasks. These different techniques can be classified in different ways, e.g., by the process phases they are used in, by the formalisms they use, by empirical versus model-based techniques, or by the types of analyses that they provide. Techniques that are applied in early phases of the development help to identify coarse scenarios without referring to implementation details, whereas techniques for later phases operate on a model of the actual system. There are techniques that apply textual or tabular notations and others that use graphical representations like state-diagrams or decision trees. There are techniques that rather focus on finding and prioritising hazards and on listing them along with possible counter-measures and others that have the main purpose of calculating hazard probabilities. Another important classification can be made between human-centred investigating techniques that perform a focused search for hazards and their causes, as opposed to computer-based techniques that start from a formal system model and partly allow for automated analyses. Nevertheless, due to their complexity, all of the techniques require software programs for their application. The investigating techniques can further be divided into forward-searching or inductive techniques, which search for the consequences of a known critical event, versus backward-searching or deductive techniques, which investigate what leads to a certain critical event. In the following section, some important safety and reliability techniques are listed and briefly described; more details can be found in [RL05].
Over the years, a set of standards has been established that requires certain construction principles as well as analysis techniques for different industry domains and safety levels, e.g., [DIN02a] as a generic standard, [DIN00] in the European railway sector, [DO-92] in the civil US aerospace sector, and many standards from the military (mainly US and UK, e.g., [DEF97, MIL93]) and from regional space associations (ESA, NASA). Only a few of them are directly software-related; the majority are concerned with system safety in general, but often contain dedicated parts or amendments dealing with the controlling software (e.g., [DIN01]). In detail, these standards differ from each other, but there is an increasing consensus that safety should be managed with varying rigour for different criticality levels (often called Safety Integrity Levels, SILs), that formal and model-based techniques should be applied in combination with human-centred investigating techniques, and that quantitative goals should be established. The standards are accompanied by auxiliary standards that define vocabulary, processes, and the individual techniques, and, of course, the standards regarding software quality processes in general also apply to safety-critical software. There is a large body of literature that describes the different techniques and safety approaches (e.g., [Lap95, Lev95, Rus94, Sto96, Bir91, Vil92]).

2.3.2 Important Safety and Reliability Analysis Techniques

In the early phase of product definition, mainly tabular techniques are used to find possible hazards or the hazardous consequences of certain operation conditions or failures. Preliminary Hazard Analysis (PHA) is one of these techniques; it uses the table columns Hazard Description, Consequences, Side Conditions, Severity, Sojourn Probability of Persons in the Hazardous Area, and Possibilities for Avoidance, which are filled in for each identified hazard situation. A similar technique, which is also used for reliability analysis and for risk assessment in general, is the Failure Mode and Effect Analysis (FMEA) [IEC91], which exists in many industry-specific variants, in particular the extension Failure Mode, Effects and Criticality Analysis (FMECA). It also uses tables with similar column headings. The distinguishing feature is that measures (natural numbers) on an ordinal scale (usually 1 ... 10) are assigned to the severity, the probability, and the detection or coverage probability of each hazard. A standardised set of criteria helps the analyst to assign these numbers (e.g., Severity = 10 corresponds to "death of many persons"); the criteria are also domain-specific. The product of these numbers is called Risk Priority Number (RPN) and indicates the importance of the hazard. This helps managers to direct the effort to the most critical situations. FMEA can be applied on the system and on the component level and is frequently used in aerospace and automotive industries. A third major example of tabular techniques is Hazard and Operability Studies (HAZOP) [IEC01], a technique that originally comes from chemical industries. The focus of HAZOP is on deviations from normal conditions, especially regarding the flow of material. Guide words like "too much", "too little", "wrong direction", "no" help the analyst to think of all possible deviations. The tabular techniques are often preferred over the more formal modelling techniques in industry, because they are structured but yet flexible enough to explain situations in natural language. Exact probabilities, which are often hard to estimate, are not required. The effort for the analysis and the granularity of details can vary, which gives a lot of freedom to the analysts without violating the obligations.
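The RPN calculation described for FMEA can be sketched in a few lines. This is only an illustration: the hazards, the ratings, and the names `FmeaRow` and `risk_priority_number` are hypothetical and do not come from any standard or from the tools discussed in this thesis.

```python
from typing import NamedTuple


class FmeaRow(NamedTuple):
    """One FMEA table row; all three ratings are ordinal values from 1 to 10."""
    hazard: str
    severity: int
    occurrence: int
    detection: int


def risk_priority_number(row: FmeaRow) -> int:
    """Return the product of the three ordinal ratings of one row."""
    for rating in (row.severity, row.occurrence, row.detection):
        if not 1 <= rating <= 10:
            raise ValueError(f"rating {rating} is outside the 1..10 scale")
    return row.severity * row.occurrence * row.detection


# Hypothetical example rows (assumed values, not taken from a real analysis).
rows = [
    FmeaRow("valve stuck closed", severity=8, occurrence=3, detection=4),
    FmeaRow("sensor drift", severity=5, occurrence=6, detection=7),
    FmeaRow("operator alarm missed", severity=10, occurrence=2, detection=5),
]

# Rank the hazards so that the highest RPN comes first.
ranked = sorted(rows, key=risk_priority_number, reverse=True)
for row in ranked:
    print(row.hazard, risk_priority_number(row))
```

Because the ratings are ordinal, the resulting product is only a ranking aid and should not be read as a probability.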
On the other hand, as they are informal techniques, their applicability to automated and formally underpinned analyses is limited. In reliability engineering, the goal is often to estimate quantitative measures for availability, reliability, or time to failure. For simple hardware parts, there is often empirical failure data available, but not so for complex or new systems. Therefore, combinatorial or state-based modelling techniques are frequent in this domain. The most simplistic combinatorial approach is the parts-count-approach. A system is said to be available if all of its parts are available. If the probability of each part to be available is known and all parts are independent from each other, the availability of the system can be calculated as the product of all part availabilities. This approach is easy to handle but often too pessimistic in practice, as it does not account for the collaboration between the parts. In particular, it does not honour the reliability amelioration achieved by redundancy. A more flexible approach is the Reliability Block Diagram (RBD) technique [DIN03c]. It uses the analogy of an electrical circuit with series and parallel connections to display the logical connection (AND versus OR) between the availability of functional blocks with respect to the system availability. An example is given in Figure 2.2. The system in the example consists of three components C1, C2 and C3. It is working as long as C1 and (C2 or C3) are working, i.e., as long as there is one working path from left to right. If the availabilities of the components (i.e., the probabilities of the working state) are known and all components



are independent from each other, then the system availability can be calculated by application of the combinatorial rules.

Figure 2.2: A Reliability Block Diagram Example
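As a minimal sketch of the combinatorial rules, the following snippet evaluates the structure described for Figure 2.2 (C1 in series with the parallel pair C2, C3) and compares it with the parts-count approach. The availability values and the helper names `series` and `parallel` are assumptions made for illustration; independence of all components is assumed, as stated in the text.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All blocks must work, so the availabilities are multiplied."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """At least one block must work: one minus the chance that all are down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical component availabilities for C1, C2 and C3.
a1, a2, a3 = 0.99, 0.90, 0.90

# RBD structure from the text: C1 AND (C2 OR C3).
rbd_availability = series(a1, parallel(a2, a3))

# Parts-count approach: every part must be available, redundancy is ignored.
parts_count_availability = series(a1, a2, a3)

print(rbd_availability, parts_count_availability)
```

With these assumed numbers the parts-count figure is noticeably lower than the RBD figure, which illustrates why the parts-count approach is called pessimistic for redundant structures.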

Another combinatorial technique is the Fault Tree Analysis (FTA) technique. As this technique is one of the fundamentals for the State/Event Fault Tree technique, it will be discussed in detail in Section 2.4. In brief, a Fault Tree depicts the combined influence of the parts on the system by a tree structure, thereby focussing on failures (i.e., the unavailable state instead of the operating state that is considered in RBDs). It is used for both reliability and safety analysis. In reliability engineering, it is usually used as a combinatorial model (which it actually is), but in safety analysis, it is in many cases understood as a causal model, as will be discussed in Section 2.4.4. A simple example is given in Figure 2.3; it shows the different failure causes for some controller system and will be explained later, when the FTA technique is discussed in detail.

Figure 2.3: A Simple Fault Tree Example

A classification of the usual combinatorial techniques by their expressive power can be found in [MT94]. Note that the combinatorial techniques that originally operate with static probabilities can also model the evolution of the system over time (ageing), if the component availability is specified as a function over time (e.g., assuming



exponential or Weibull distribution of the component failures). A frequent assumption in reliability analysis is that a component that has failed stays in its unavailable state forever; however, all of these models have been extended by repair modelling.

A counterpart of the FTA technique in safety analysis is the Event Tree Analysis (ETA). It is a forward analysis, i.e., it searches the consequences of a given event or state (failure or hazard). Therefore, it complements FTA: for each identified hazard, the causes are searched with FTA and the consequences are searched with ETA. An event tree is a tree that has the original event or state as root and then branches according to side conditions that influence the continuation of the scenario. For example, if a fire breaks out, the first condition to be considered is whether or not the automatic sprinkler system works. If it works, the fire is extinguished and no further damage occurs. If not, the next question is whether or not surveillance personnel detect the fire early enough. Again, the scenario branches into more or less severe continuations. The technique helps to find all potential scenarios and to plan for protective measures that can prevent the dangerous sub-scenarios. If the probabilities for the conditions at the branching points are known, the probabilities for each sub-scenario can be calculated. Therefore, ETA can also be applied as a quantitative analysis technique.

Figure 2.4: A Simple Event Tree Example
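The quantitative use of an event tree can be illustrated with the fire scenario from the text. The sketch below assumes two branching points and independent branch conditions; the branch probabilities and the function name `scenario_probabilities` are hypothetical.

```python
def scenario_probabilities(p_sprinkler_works: float,
                           p_personnel_detects: float) -> dict[str, float]:
    """Multiply the branch probabilities along each path of the event tree.

    The probabilities are conditional on the initiating event (fire breaks out).
    """
    p_sprinkler_fails = 1.0 - p_sprinkler_works
    return {
        "extinguished by sprinkler": p_sprinkler_works,
        "sprinkler fails, fire detected early":
            p_sprinkler_fails * p_personnel_detects,
        "sprinkler fails, fire not detected":
            p_sprinkler_fails * (1.0 - p_personnel_detects),
    }


# Assumed values for illustration only.
scenarios = scenario_probabilities(p_sprinkler_works=0.95,
                                   p_personnel_detects=0.80)
for name, probability in scenarios.items():
    print(f"{name}: {probability:.4f}")
```

Since the leaves of an event tree partition all continuations of the initiating event, the scenario probabilities add up to one, which is a useful plausibility check.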

The Cause-Consequence Diagram technique [Vil92] combines elements from ETA and FTA and has its origins in nuclear and chemical industries in the early 1970s. The main structure is similar to an Event Tree, but at the decision points, top-events of Fault Trees are attached, which explain why the system is in a state that forces one or the other decision. As a final important technique, the Markov Chain (MC) analysis should be mentioned. This technique is frequently applied in quantitative reliability analysis. MC analysis is a state-based technique and does not require the stochastic independence of all components, as the combinatorial techniques do. Instead, the product state-machine of all components is formed and all probabilistic transitions between



its states are modelled explicitly. As state-based models are another important ingredient of State/Event Fault Trees, there is also a separate section (Section 2.5) where this class of techniques is explained, including a detailed introduction to Markov Chains. For now, only an introductory example is shown (Figure 2.5), which will be resumed later. It shows a pump station with two pumps that fail independently of each other. Failure and repair transitions are depicted by arrows. More details will be explained later, when the Markov Chain analysis is discussed in detail.

Figure 2.5: A Simple Markov Chain Example
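To make the idea of a product state-machine concrete, the following sketch builds the four product states of two pumps and approximates the long-run state probabilities. It is not the model from Figure 2.5: the failure and repair rates are hypothetical, each pump is assumed to be repaired independently, and the stationary distribution is approximated by uniformisation followed by repeated stepping rather than by any method prescribed in this thesis.

```python
from itertools import product

FAILURE_RATE = 0.01  # assumed failure rate per hour, per pump
REPAIR_RATE = 0.1    # assumed repair rate per hour, per pump

# Product state space: each of the two pumps is either "up" or "down".
STATES = list(product(("up", "down"), repeat=2))


def transition_rate(src, dst):
    """Rate of the transition src -> dst; exactly one pump may change."""
    changed = [i for i in range(2) if src[i] != dst[i]]
    if len(changed) != 1:
        return 0.0
    return FAILURE_RATE if src[changed[0]] == "up" else REPAIR_RATE


def stationary_distribution(steps=2000):
    """Approximate the long-run probabilities of all product states."""
    exit_rates = {s: sum(transition_rate(s, d) for d in STATES) for s in STATES}
    uniform_rate = 2.0 * max(exit_rates.values())
    # Start with both pumps working.
    pi = {s: 1.0 if s == ("up", "up") else 0.0 for s in STATES}
    for _ in range(steps):
        nxt = {}
        for d in STATES:
            inflow = sum(pi[s] * transition_rate(s, d) for s in STATES)
            stay = pi[d] * (uniform_rate - exit_rates[d])
            nxt[d] = (inflow + stay) / uniform_rate
        pi = nxt
    return pi


pi = stationary_distribution()
for state, probability in pi.items():
    print(state, round(probability, 6))
```

Under the independence assumption made here, the probability of both pumps being down equals the square of the single-pump unavailability; the explicit state space becomes worthwhile once transitions are added that couple the pumps, for example a shared repair crew.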

2.3.3 Application to Embedded Systems and Research Agenda

Most of the established techniques have been developed at a time when safety-critical tasks were exclusively performed by purely mechanical or electrical systems and consequently do not consider the new aspects introduced by software control. As more and more control functions of technical systems involve software, there is a need for appropriate safety and reliability analysis techniques, and there are a number of reasons why the existing techniques do not satisfy this need sufficiently. Software-controlled systems differ from mechanical and electronic systems in a number of aspects and have different failure modes and hazards than traditional hardware systems [Lev95]. The first difference is that embedded systems are complex. It is impossible for humans to capture their structure and the potential interaction patterns of their components as a whole. Most accidents in which modern technical systems are involved cannot be traced back to one single cause of one single part, but result from the complexity and the culmination of a lot of small misbehaviours [Per85]. The software, although correct according to its specification, can take wrong decisions when it is exposed to an unforeseeable or very improbable combination of circumstances, as



was the case in the Warsaw Airbus 320 accident [HL94]. Human errors or wrong information supplied to the software because of sensor or actuator failures play an important role in these accident or failure scenarios, so the behaviour of the complete system and its environment has to be considered for safety and reliability analysis. Simple reliability approaches like the parts-count-approach or the AND / OR combination of components as applied in the RBD and FTA techniques are not sufficient to model the different ways of component interaction. In particular, the combinatorial techniques require independence of all described components, which is an unsuitable assumption for most software-controlled systems. Other techniques, like the tabular techniques, possess no formalised notion of compositionality, even if they are performed on different modularisation levels (system, subsystem and component). In particular with regard to safety, it has to be questioned to which extent compositionality can be assumed at all: it is too simplistic to say that a system is safe as long as all of its components are safe. However, in order to cope with the complex structure of today’s embedded systems, compositional models, i.e., models that allow deriving the system properties from the component properties, or, at least, models with a hierarchical structure that is intuitive for the reader are required. As there is no simple way to describe the interaction of components and its influence on system-level safety, there is a need for research on modelling techniques providing adequate modularity or compositionality.

The second difference between traditional systems and software-controlled systems is that hardware parts constitute persisting laws of control by their mere existence and integrity (e.g., the parts of a mechanical car brake), whereas software evolves over time, recognizes events in the environment at certain points in time, and takes time to calculate and communicate results.
While static classification of a mechanical part either as working or as failed may be acceptable, the failure scenarios where software is involved usually require a dynamic description. It is of high relevance in which order or at which time events occur and for how long states last. The combinatorial techniques do not include dynamic aspects - they can probabilistically capture ageing, i.e., time in the large, but not the occurrence time of events, i.e., time in the small. Even the application of the traditional reliability models (time in the large) to software is not straightforward. There have been attempts to apply traditional techniques and assumptions (e.g., the assumption of constant failure rate) to software, but especially in the embedded software domain, acceptable predictions are still hard to obtain. Furthermore, the number of relevant states for safety or reliability considerations for software-controlled systems is much larger than just working or failed, because even states that normally occur during operation can lead to a hazard if certain events occur at this time. These drawbacks of the combinatorial models are discussed using the example of the FTA technique in Section 2.4.4. State-based probabilistic techniques, in turn, can describe the dynamic evolution of the system, but lack the required compositionality and cannot intuitively depict the chains of causes and effects that lead to a hazard or an accident. So another important research direction is the investigation of dynamic modelling techniques that are suitable for safety and reliability analyses.



Embedded systems are heterogeneous in that they consist of hardware and software, which are best described with different modelling techniques that are hard to unite (attempts to do so are currently being pushed forward in the system design context). As the original models often lack precise semantics, the integration of models from different original application domains is very difficult. Revisiting the overview of the relevant safety and reliability techniques, one finding is that the majority are informal or semi-formal techniques; it even seems that the acceptance in industry is higher, the easier and more intuitive a technique appears - even if analysts agree that compromises and a lack of exactness are the price for unclear semantics and for the restricted expressive power of these techniques. The poor success of formal methods in industry is due to the difficult notations and the need to understand the theory that lies behind them in order to apply them correctly, but also to the fact that it is practically impossible to describe all relevant system aspects formally and quantify all initial causes exactly. As formal methods require highly trained personnel and are cost-intensive, they are hardly applicable to mass-market industries like automotive and industrial control. In the military, in space and avionics industries, where the risks and the budget for safety measures are much higher than in mass markets, formal methods have a higher acceptance. Mainly applied to automatic model refinement or code generation and for proving correctness, they also serve for qualitative or probabilistic demonstration of the fulfilment of safety goals. But even there, the complexity of today’s systems makes a complete formal description of the intended behaviour hard to achieve. Hazards and failures often result from deviations from the intended behaviour, and all possible deviations are impossible to foresee and capture formally.
The problem that accidents occur although the software was correct cannot be solved by formal methods. Therefore, it is unlikely that in the near future, it will be possible to handle safety analysis entirely by formal methods and thus informal or semi-formal techniques will still be needed. It is a promising approach to search for new techniques that avoid the ambiguities and obscurities of the existing ones by introducing some more formal underpinning, but without attempting to provide models that could be constructed and interpreted entirely without human expertise. For industrial acceptance, it is important that unambiguous semantics does not come at the price of incomprehensible notations, but is brought to the analyst in an intuitive graphical representation. Another observation is that the research communities of safety engineering, reliability engineering, and software engineering have traditionally been distinct, and each community has been using its own set of techniques. However, many reliability engineering techniques can also be applied to safety, and many model-based safety and reliability techniques resemble the techniques applied in software and systems engineering. This suggests more research about the integration of the different model types. As complexity and software aspects increasingly dominate the safety and reliability relevant behaviour of technical systems, it can be expected that the different communities will have to work together and work out common techniques in the future. This collaboration and the research to formalise the model semantics form the foundation for the integration of techniques that have been designed for different purposes. This not only helps to treat safety in a holistic way



and to use the best fitting modelling techniques for the different aspects, but also generates opportunities for more appropriate analysis results and for faster design-analysis-redesign cycles, which is of increasing value as the pressure on time-to-market gets higher.

In summary, the following requirements should guide the research for enhanced safety and reliability analysis techniques for the embedded systems domain: The modelling techniques should be

• compositional
• sufficiently expressive for dynamic behaviour and causal relations that occur in the software context
• as formal as practicable, while yet intuitive
• suitable for integration with each other and with software engineering models

Some important research contributions of the last years are presented in Section 2.7. This thesis constitutes a further contribution to the formalisation, extension, and integration of existing techniques. Techniques of special interest in the following are Fault Tree Analysis, as it intuitively depicts causal relations, the state-based modelling techniques, as they are a suitable way to express evolution over time and dependencies, and the architectural models from software engineering, which introduce useful component and communication concepts. These techniques inspired the design of the new State/Event Fault Tree technique, which borrows modelling elements from all of these model classes. Therefore, these three classes of techniques are presented in more detail in the following three sections.

2.4 Fault Tree Analysis

2.4.1 Introduction

This section introduces the Fault Tree Analysis (FTA) technique, which is one of the most important techniques in practice and one of the most powerful ones [LM96, Vil92, MT94]. FTA is a combinatorial safety and reliability analysis technique that graphically shows how influence factors (generally component failures) contribute to some given hazard or accident. FTs provide logical connectives (called gates⁵) that allow decomposing the system-level hazard top-down in a recursive procedure. The two basic gates are AND (in the European notation denoted by &) and OR (denoted by ≥ 1, because at least one input must be true for the output to be true). The AND gate indicates that all influence factors must apply together⁶ to cause the hazard and the OR gate indicates that any of the influences causes the hazard alone. Additional standard gates are NOT,⁷ XOR, and Voter (n-out-of-m). The NOT gate inverts the event at its input (describes that a hazard is present if some influence is not present). The XOR, which has not much practical relevance, shows true at the output when exactly one of the inputs is true. The Voter gate corresponds to a very frequent technical structure in high-reliability systems: the majority voter, a part that generates a true signal at its output if at least n out of its m input signals are true. A frequent example is the 2-out-of-3 voter. Of course, other Boolean connectives could be used as FT gates as well, although not mentioned in the standards.

Apart from the gates that correspond to Boolean connectives, many specific extensions to the FTA technique have been proposed or are even contained in the standards, although their semantics cannot be defined by Boolean logic. The Priority-AND gate, for instance, considers the order of the input events, and the Inhibit gate makes the occurrence of an event dependent on a condition (state) and, consequently, requires a state/event distinction. How to deal with the semantics of these gates will be discussed in the following. The most frequently used gates are depicted in Figure 2.6, in the European style according to [DIN93] and in the US style from [VGR81].⁸

The logical structure of the causal chains is usually depicted as an upside-down tree with the hazard to be examined (called top-event) at its root and the lowest-level influence factors (called basic events) as the leaves. In the context of FTA the term "event" is applied in its meaning from probability theory: an event is not necessarily some sudden phenomenon, but can be any proposition that is true with a certain probability. A fundamental assumption in FTA is that all basic events are stochastically independent of each other. This is a precondition for the application of combinatorial formulas to calculate the top-event probability.

⁵ The term gate stems from digital hardware circuits, where gates are electronic parts that implement logical functions. Also the symbols in FTA are derived from the gate symbols used in digital technology.
⁶ The meaning of "together", in particular, if it encompasses simultaneous occurrence of two events, will be discussed later.
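The Boolean gates named above can be written down directly as functions over truth values. The sketch below follows the descriptions in the text (XOR true for exactly one true input, Voter true for at least n true inputs); the function names are chosen for illustration only, and the non-Boolean gates such as Priority-AND and Inhibit are deliberately left out because their semantics cannot be expressed this way.

```python
def and_gate(*inputs: bool) -> bool:
    """True only if all inputs are true."""
    return all(inputs)


def or_gate(*inputs: bool) -> bool:
    """True if at least one input is true (the '>= 1' of the European symbol)."""
    return any(inputs)


def not_gate(value: bool) -> bool:
    """Inverts its single input."""
    return not value


def xor_gate(*inputs: bool) -> bool:
    """True if exactly one input is true, as described in the text."""
    return sum(bool(i) for i in inputs) == 1


def voter_gate(n: int, *inputs: bool) -> bool:
    """n-out-of-m voter: true if at least n of the m inputs are true."""
    return sum(bool(i) for i in inputs) >= n


# 2-out-of-3 voter: two failed channels are enough to raise the output.
print(voter_gate(2, True, False, True))   # True
print(voter_gate(2, True, False, False))  # False
```

For two inputs, `voter_gate(1, a, b)` coincides with the OR gate and `voter_gate(2, a, b)` with the AND gate, which shows that the voter generalises both basic gates.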

2.4.2 An Example

Figure 2.7 shows a simple example of a controller system that is unavailable if both main and backup controller CPU are unavailable (events denoted E1 and E2) or if no power supply is available. The latter is the case if both mains power is down (E3) and the battery is empty (E4). Unavailability of the whole system has been chosen as the top-event of the FTA. The tree has four leaves, corresponding to the identified failure events E1 to E4. It has three gates (i.e., logical connectives): one OR and two ANDs. This example is just intended to give an impression of what FTs look like; industry-scale FTs can have over 1,000, in extreme cases up to 10,000 events.

⁷ Whether or not the NOT gate makes sense in FTA has been discussed [And00], but under the assumption of events being Boolean variables there is no semantic problem about it. Other possible interpretations of events require a new discussion, as will be explained in Chapter 4. Whether or not quantitative analysis is correct in the presence of NOT gates depends on the technology used; the BDD algorithm described in this chapter deals correctly with negated variables.
⁸ Although the US style symbols are usually preferred by analysts all over the world, the remainder of this thesis and the prototype tools UWG3 and ESSaRel use the European style. By their rectangular shape, the European symbols are easier to implement, and designing new gates by writing a suitable label on a rectangle is easier than in the US style, where the shape determines the semantics.



Figure 2.6: Fault Tree Gates in European and US style



Figure 2.7: Simple Fault Tree

In the example, fictitious probability values have been assigned to the basic events. They will be of interest in the next section regarding quantitative analysis.

2.4.3 Quantitative Analysis of Fault Trees

Even without subsequent analysis, the creation of an FT gives a lot of insights about critical scenarios and safety issues. Normally, drawing the tree is a team process where the experts together analyse all available technical documentation, but also profit from their experience from past projects. Once designed, the FT is a good help to visualise and explain the causal implications leading to the hazard or system failure. However, in most cases the purpose of drawing FTs is their qualitative or quantitative evaluation, which can only be done efficiently with the help of a software tool. FTA offers several qualitative and quantitative analyses. Qualitative analysis lists, for instance, all Minimal Cut Sets (MCSs) or Prime Implicants.⁹ Prime Implicants are combinations of basic events that together cause the top-event, with all of them being required to do so. So MCS or Prime Implicant analysis produces element lists or propositions about the FT. Other qualitative analyses answer the question of whether the top-event can occur at all (if not, the set of Prime Implicants is empty) or if there are single points of failure (in these cases, there are Prime Implicants that contain one single event).

⁹ The term minimal cut set is more common, but only correct if there are no negations within the tree. A set is described by listing the elements that are in it; there is no such thing as the "opposite" of an element. The term Prime Implicant, in contrast, designates a list of logical variables that can appear in positive or negated form.
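For a tree as small as the example from Section 2.4.2, the minimal cut sets can be found by brute force. The sketch below is only an illustration and not the algorithm of any FTA tool: it assumes a tree without negations, encodes the example as the hypothetical function `top_event`, and enumerates candidate sets by increasing size so that every reported cut set is minimal.

```python
from itertools import combinations

EVENTS = ("E1", "E2", "E3", "E4")


def top_event(failed) -> bool:
    """Structure function of the example: (E1 AND E2) OR (E3 AND E4)."""
    return {"E1", "E2"} <= failed or {"E3", "E4"} <= failed


def minimal_cut_sets():
    """Enumerate event combinations by size; keep those with no smaller cut set inside."""
    cut_sets = []
    for size in range(1, len(EVENTS) + 1):
        for combo in combinations(EVENTS, size):
            candidate = frozenset(combo)
            if top_event(candidate) and not any(cs <= candidate for cs in cut_sets):
                cut_sets.append(candidate)
    return cut_sets


cut_sets = minimal_cut_sets()
single_points_of_failure = [cs for cs in cut_sets if len(cs) == 1]
print([sorted(cs) for cs in cut_sets])
print(single_points_of_failure)
```

The enumeration is exponential in the number of basic events, so it only serves to make the definitions tangible; it is not usable for industry-scale trees.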



Quantitative analysis, in contrast, which will be the focus of the subsequent discussions, produces numerical results. The most important measure is the top-event probability or occurrence rate for a given system. There are still other measures that can be calculated, such as the relative importance of a given event or sub-tree to the top-event probability. For now, each fault tree event is understood as a proposition that is true with a certain probability; in later chapters, this view will be questioned, leading to the discussion of a state / event distinction. The considerations given in this chapter are confined to the basic gates AND, OR and NOT for simplicity. Other combinatorial gates such as the n-out-of-m voter can be constructed out of these. Some gates that are available in standards and many FTA tools, such as Priority-AND, should be considered with care, as their semantics is not clear under the assumption of events being Boolean propositions.

Given all probability figures, the top-event probability can be calculated in a bottom-up fashion by applying combinatorial rules that are given for each gate. For the AND gate, which indicates that the independent causes A and B must be true for the consequence to be true (i.e., C = A ∧ B), the probabilities P_A and P_B of the causes must be multiplied to produce the probability P_C of the consequence, still assuming independence:

$$P_C = P_A \cdot P_B$$

The probability of a negated variable is calculated as

$$P_{\bar{A}} = 1 - P_A$$

which is thus the formula for the NOT gate. The formula for OR can be deduced from AND and NOT applying De Morgan’s law:

$$P_C = 1 - \left((1 - P_A) \cdot (1 - P_B)\right) = P_A + P_B - P_A P_B$$

Recursive application of these formulas is sufficient for calculating the top-event probability of the small example from Figure 2.7 as

$$P_{top} = 1 - \left[(1 - P_1 P_2)(1 - P_3 P_4)\right] = 1 - \left[(1 - 0.1 \cdot 0.1)(1 - 0.05 \cdot 0.2)\right] = 0.0199$$

This evaluation method is not very efficient. [DIN81] presents a more efficient calculation algorithm, which is based on finding minimal cut sets and summing up their probabilities. This algorithm is not suitable in the presence of negations (NOT gates) and needs corrective measures if events appear in more than one MCS



or in the presence of repeated events (basic events that influence the top-event via more than one path). Today many FTA tools apply an algorithm that is based on the efficient representation of Boolean expressions by Binary Decision Diagrams (BDDs) [Bry86]. This algorithm will be revisited in Chapter 3; therefore, it is briefly summarised here and the example from above is evaluated a second time using the BDD transformation. A BDD is a directed graph where each Boolean variable v is represented by one or more nodes. Each node has two outgoing edges, one (the true branch) leading to the child variable that is checked next in case that v evaluates to true, and one (the false branch) to the child variable for the case that v evaluates to false. This corresponds to the if-then-else notation (ite-notation for short) in which any binary Boolean expression can be encoded [Bry86]. In the following figures, a dotted line signifies a false branch. The two terminal nodes (usually drawn as rectangular nodes) are the constant expressions true (denoted as 1) and false (denoted as 0)¹⁰ that are reached when no more sub-expressions have to be evaluated. For instance, the partial term x ∧ y, represented by the AND gate with the two input causes x and y, results in the BDD shown in Figure 2.8:

Figure 2.8: BDD corresponding to the AND conjunction of two variables (events) E1 and E2

The meaning is: If x is false (left branch), the output expression is false in any case; the left branch therefore leads directly to the false terminal node. If x is true, the right child node y is examined. If y is false, then again the output of the gate is false; if y is true, then the output is true and therefore the right child is the true terminal. The example FT put into a BDD looks like Figure 2.9. To generate the BDD, the Boolean variables (E1 to E4 in this example) have to be put into an arbitrary order and the order is significant for the size of the resulting BDD. To calculate the top-event probability, the outgoing true edge of each variable is annotated with the probability p that is assigned to the corresponding FT event, whereas the false edge is annotated with the complementary probability (1-p). The probability of the top-event is calculated by summing up the multiplied probabilities along each path that leads to the terminal node true (or 1). In our example, the result is therefore

¹⁰ As Boolean variables are often represented by the integer values 0 for false and 1 for true, the labels 0 and 1 are common for the terminal nodes.



Figure 2.9: BDD corresponding to the Simple FT example from Fig. 2.7

$$P_{top} = 0.1 \cdot 0.1 + 0.1 \cdot 0.9 \cdot 0.05 \cdot 0.2 + 0.9 \cdot 0.05 \cdot 0.2 = 0.0199$$

as before. Concluding the summary about the quantitative analysis of FTs, it should be pointed out that there are other analysis types that are not the subject of this thesis. The most relevant ones of these are the different kinds of importance analyses that have been proposed. They aim at quantifying the relative contribution of the probability of a basic event to the top-event probability.
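The two evaluations of the example can be cross-checked with a short script. The basic-event probabilities are the ones used in the formulas of this section. The node table `BDD` is an assumption: it is reconstructed from the three paths that appear in the path sum (variable order E1, E2, E3, E4) and may differ in layout from Figure 2.9. The recursion weights the true branch with p and the false branch with 1 - p, which is equivalent to summing the products along all paths to the true terminal.

```python
PROBABILITY = {"E1": 0.1, "E2": 0.1, "E3": 0.05, "E4": 0.2}

# node name -> (variable, child on the false branch, child on the true branch)
# True and False stand for the terminal nodes 1 and 0.
BDD = {
    "n1": ("E1", "n3", "n2"),
    "n2": ("E2", "n3", True),
    "n3": ("E3", False, "n4"),
    "n4": ("E4", False, True),
}


def probability_of_true(node) -> float:
    """Probability of reaching the true terminal when starting at `node`."""
    if node is True:
        return 1.0
    if node is False:
        return 0.0
    variable, low, high = BDD[node]
    p = PROBABILITY[variable]
    return p * probability_of_true(high) + (1.0 - p) * probability_of_true(low)


def bottom_up() -> float:
    """Gate-by-gate evaluation: two AND gates feeding one OR gate."""
    cpu_failure = PROBABILITY["E1"] * PROBABILITY["E2"]
    power_failure = PROBABILITY["E3"] * PROBABILITY["E4"]
    return 1.0 - (1.0 - cpu_failure) * (1.0 - power_failure)


p_bdd = probability_of_true("n1")
p_gates = bottom_up()
print(round(p_bdd, 4), round(p_gates, 4))
```

Both values agree with the result 0.0199 derived in the text, up to floating-point rounding. Node `n3` is shared by two parents, so its value is recomputed here; a practical implementation would cache the result per node.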

2.4.4 FTA in Practice: Extensions, Limitations and Ambiguities

Originally, Fault Trees were a model designed by practitioners without a formal underpinning. In later years, their evolution was also mainly pushed forward by practitioners and tool vendors. Even though some attempts to define Fault Trees formally have been made, there is still no commonly accepted and unambiguous semantics. The events (in FTA terms) can be failed states of certain components at a given point of time, e.g., "safety valve is stuck", but also any natural language propositions about the system, such as "operator does not detect stuck safety valve early enough". A formalisation of propositions of the latter kind is hardly achievable, since FTA tools allow description texts without any restriction. Also, the methods for estimating the probability of a proposition of the second kind can be questioned. This informal usage of FTA is certainly useful if applied by skilled experts, but relies on unspoken assumptions that could be misinterpreted by other analysts.

The terms event (as borrowed from probability theory) and event (as used in the context of state-machines) are intermixed arbitrarily, and often it is not clear whether the instant of failure or the time span after a component has failed is under consideration. Modellers often use terms describing events (e.g., "bolt breaks") in FTs when they actually mean the failed state of a component ("the bolt is broken"); this can be regarded as an implicit conversion. In many commercial tools, events can have a constant probability (which only makes sense for an enduring state) or an exponentially or Weibull distributed probability (which suggests a transition event). The early German standard [DIN81] distinguishes unavailability and failure probability density as separate measures and gives distinct formulas for each gate in a table. These considerations have not been adopted by the leading international standards [DIN93, VGR81], and the distinction between state and event has been lost in practice.

When evaluating examples from literature or the behaviour of common FTA tools like Relex, FaultTree+, Item, or Reliasoft, the following working definition of an FT event can be proposed:

Definition: A Fault Tree event represents the failed state of a two-state component, of which the transition from working to failed state obeys the specified probability distribution.

Many reliability analysts and some tools assume that a failed component stays in failed state forever; others allow the specification of a repair rate or an inspection interval, after the elapse of which the component toggles back to working state.
In other words, FT events implicitly have state-machine semantics with different kinds of transitions (in both directions) between the two states. This implicit conversion also enables the model integration between Markov Chains, Fault Trees, and Event Trees, which is offered by many tools in practice, without a documented formal underpinning. The question of what FT events really are is related to the question of whether or not they have a duration. If they do, the semantics of the AND gate is clear; the output event is true as long as all input events are true. If not, the semantics is not clear. To claim that the input events occur simultaneously is unrealistic if a continuous time scale is assumed. In practice, the semantics is usually that both of the events must have occurred once to make the output event happen. This is acceptable in reliability engineering, where events mean failures of non-repairable components because then, the implicit conversion from failure event to failed state does not cause any problems. Based on combinatorial theory, FTs can only deal with sojourn probabilities of states of independent components. The assumption in FTA is that a component has only two relevant states: failed or operable. The two-valued Boolean logic prohibits more than one failure mode, although in practice there are components that can fail in several ways or in a degrading manner. The observation that even some operable



states can, in conjunction with certain states of other components, lead to a hazard situation, is impossible to model, as a distinction of different operational states is not provided; the consideration of states that are not failures by themselves has not even been intended, as FTs by definition deal with failures only.

FTs deal with probabilities, basically probabilities at one instant of time. Modelling the evolution of a state probability over time is possible if probability functions are given for the basic events instead of simple probabilities. If the basic events are failed state probability distributions of a system's components, e.g., each obeying an exponential distribution, then the calculation rules for the gates allow calculating the system unavailability as a probability distribution over time. This top-event probability distribution does not necessarily obey an exponential distribution, even if all basic events do: for an AND gate with two independent inputs, for example, the output is the product $(1 - e^{-\lambda_1 t})(1 - e^{-\lambda_2 t})$, which is not of the form $1 - e^{-\lambda t}$. Thus the calculation of a single resulting top-event rate, as is often done in practice, is not justified mathematically.

Not only the events, but also some of the gates provided by standard FTA have unclear semantics. This especially applies to the NOT, the Priority-AND (P-AND) and the Inhibit gate. According to many reliability engineers, a NOT gate is unnecessary in FTA, as by definition all FT events constitute failures and system availability does not get higher if more components fail. Some analysis algorithms (though not the BDD-based algorithm) cannot deal with negations correctly. However, there is reason to include a NOT gate, because in many safety scenarios, a hazard occurs if some component is not in a certain state [And02]. If FT events and gate outputs are interpreted as propositions about state, there is no semantic problem with negating them.

The idea of the P-AND is that the output event only occurs if all of the input events occur in the right temporal order (from left to right in the graphical representation).
Tacitly doing the implicit conversion from an event to the following state, this means that the second failure must occur after the first component has already failed, and so on. This can be stated as a conditional probability and, provided that the probability distributions of all events are known, the probability that the gate output is in failed state can be calculated as an integral over the probability densities. In the case of two inputs, the formula is [HK92, FAR76]:

\[ F_{out}(t) = \int_0^t f_2(\tau) \cdot F_1(\tau) \, d\tau \]

where $F_i(t)$ is the failed state probability of event i and $f_i(t)$ is the probability density of the corresponding failure event and thus

\[ F_1(t) = \int_0^t f_1(\tau) \, d\tau \]

This leads to the general form for n inputs:

\[ F_{out}(t) = \int_0^t f_n(t_n) \int_0^{t_n} f_{n-1}(t_{n-1}) \cdots \int_0^{t_2} f_1(t_1) \, dt_1 \cdots dt_{n-1} \, dt_n \]
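For two inputs, the integral can be checked numerically. The sketch below assumes independent, non-repairable components with exponentially distributed failures; the rates lam1 and lam2, the mission time and the function names are hypothetical and chosen only for illustration. It compares a trapezoidal approximation of the integral with the closed form obtained by inserting f2(τ) = λ2 · exp(−λ2 τ) and F1(τ) = 1 − exp(−λ1 τ).

```python
import math


def pand_numeric(lam1, lam2, t, steps=20000):
    """Trapezoidal approximation of the integral of f2(tau) * F1(tau) over [0, t]."""
    def integrand(tau):
        f2 = lam2 * math.exp(-lam2 * tau)   # density of the second failure
        F1 = 1.0 - math.exp(-lam1 * tau)    # first component has already failed
        return f2 * F1

    h = t / steps
    total = 0.5 * (integrand(0.0) + integrand(t))
    for k in range(1, steps):
        total += integrand(k * h)
    return total * h


def pand_closed_form(lam1, lam2, t):
    """Closed form of the same integral for exponential distributions."""
    s = lam1 + lam2
    return (1.0 - math.exp(-lam2 * t)) - lam2 / s * (1.0 - math.exp(-s * t))


# Illustrative values only (rates per hour, time in hours).
lam1, lam2, t = 1e-3, 2e-3, 1000.0
numeric = pand_numeric(lam1, lam2, t)
exact = pand_closed_form(lam1, lam2, t)
and_gate = (1.0 - math.exp(-lam1 * t)) * (1.0 - math.exp(-lam2 * t))
print(numeric, exact, and_gate)
```

For these values the P-AND result lies below the result of a plain AND gate, because only one of the two possible failure orders counts; adding the P-AND probability for the reverse order gives the AND probability again.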



The P-AND formula is only valid under the assumptions that all events are stochastically independent, have an exponential failure distribution and are non-repairable [FAR76].11 This way of calculation is acceptable in reliability engineering, where failures of independent components in the long run (time in the large) have to be modelled. For safety considerations, where the order and interleaving time of events (time in the small) can decide on the further evolution of the scenario, only state-based modelling that considers dependencies is suitable [PD96]. In practice, there are even some commercial tools that make no distinction at all between the normal AND and the P-AND.

In addition, the semantics of the Inhibit gate, denoting that an event leads to a consequence only if another event is not true (this second event usually signifies the availability of some protective unit), is unclear. Basically, the Inhibit gate can be expressed as an AND gate with a NOT connected to one of its inputs. However, this gate forces the question about states and events: while the first input (it could be called initiator or trigger input) corresponds to a sudden event, the second one, which inhibits the consequence, must be a state term or a guard condition. This distinction is sometimes also relevant in situations that do not involve an Inhibit gate, because events that signify conditions that must be present when something bad happens often appear together with events that trigger the scenario. Sometimes the terms initiator events and enabler events are used for this distinction.

This discussion is related to another, more fundamental question: Do FTs represent causality or not? Basically, they are a graphical representation of Boolean logic expressions. Consequently, some researchers have claimed that FTs have no causal semantics at all. Practical usage and the examples in the standards and handbooks, however, suggest that most analysts use FTs to depict cause-consequence relations. 
This entails the question of how this type of causality can be formalised: Is the combination of input events sufficient to cause the output event, or necessary to cause the output event, or both? There is no commonly agreed formalisation yet, but there are different attempts to formalise the semantics. [Gór94] distinguishes causal and decomposition semantics and consequently suggests two sets of gates: causation gates and decomposition gates. For example, Causation-AND means that the output event is triggered by the common occurrence of the input events, while Decomposition-AND means that the output event is just another name for the situation that all of the input events are true.

A further restriction is the requirement for independent basic events, because in complex technical systems, the same root cause often affects different parts of the system. A first approach to account for dependent events was the concept of repeated events. A repeated event is an event that occurs in different positions of the tree. Depending on the applied solution algorithm, the Boolean function has to be transformed to eliminate repeated events before the calculation is started. The frequently used BDD-based algorithm has no problem with repeated events, provided they are formally marked in the FTA tool. However, in practice the components of technical systems work together in complex ways so that the independence assumption is unrealistic [And02].

11 Fussell's conditions should be rendered more precise in two points: first, the integral formulas as stated here do not require an exponential distribution and second, a "non-repairable event" means a non-repairable failure, i.e., a failed state of a component that persists forever.

The fact that one failure influences other events in the tree is captured by a number of non-standard extensions: a "Functional Dependency" gate has been proposed in [DBB92], a trigger relation in [BB03], and sometimes the concept of an event representing a "Commanded Fault" is used.

The fact that some causes do not always provoke the consequence, but only in some of the cases, is difficult to model in FTA (it can be expressed better in Event Trees or Cause Consequence Diagrams). In FTA, this fact can only be captured probabilistically, by giving a conditional probability that the event leads to a hazard. Often the complementary probability is given, namely the probability that the event is covered, e.g., disabled by some mitigation mechanism. Many FTA tools allow events to be annotated with coverage factors; others introduce a Conditional Probability or Probabilistic Dependency gate for the same purpose.

More gates have been proposed to express other correlations between events and their consequences (e.g., hot, cold and warm spare gates or a similar gate called "Reserveverknüpfung" from the early German standard [DIN81]). Some of them do not even describe an input-output relation, but a constraint on the inputs (such as the Sequence-Enforcing gate from [DBB92]).

Although many of the described extensions are undoubtedly useful in practice, it can be questioned whether extending the FT technique with unintuitive gates and notations is the best way to implement them. Often a state-based notation could explain the situation much more easily.

A problem that is not discussed here, but in Section 3.2, is the decomposition of Fault Trees. FTs are traditionally decomposed into independent sub-trees called modules. As these do not necessarily correspond to technical units, a search for a more suitable decomposition has been carried out and has led to the introduction of Component Fault Trees, which are described in Chapter 3.

The mentioned limitations and ambiguities that exist in FTA have been noticed by numerous researchers. The last decade has seen many proposals on how to improve the semantic underpinning of Fault Trees, their expressive power, and integration with other modelling techniques. A survey of these recent contributions follows in Section 2.7, but first the other important classes of modelling techniques, state-based modelling and structural modelling, are introduced.

2.5 State-Based Modelling

2.5.1 State-Machines and State-Based Software Engineering Models

Most state-of-the-art techniques for behavioural modelling of technical software are based on the theory of finite state-machines or automata.12 A literature survey reveals slight differences in the definition of a state-machine and many variants of this model have been proposed and implemented in different tools. To describe a system that communicates with its environment, state-machines with inputs and outputs or triggers and actions are necessary. Some relevant variants are Mealy-Automata, Moore-Automata, I/O-Automata [Mea55, Moo56, LT89]. The most important ones for embedded systems modelling in practice are Statecharts, ROOMcharts and UML2 State Diagrams, which extend state-machines by the concepts of hierarchy, concurrency, and communication. In the most basic case, a finite state-machine (FSM) or automaton consists of a finite set S of states, a set T of state transitions, which can be expressed as a relation in S × S, and one member s0 of S which is called the initial state or, in the case of non-deterministic automata, a set of initial states. The states or transitions may be labelled. Automata are a mathematical concept that can be applied to a variety of problems, for example, the definition of formal languages. Applied to the behaviour description of technical systems, the finite state-machine model assumes that a component is in exactly one of a finite number of states at each point in time. When it instantaneously passes to another state, this is called a transition. State-machine models are usually graphically depicted by directed graphs where the nodes (depicted as circles or rounded rectangles) correspond to the states and directed arcs correspond to transitions. The transitions are labelled with event names. In a simple automaton, no information is given about the reason for the state transition. 
The model is non-deterministic, i.e., the time of the transition is unknown and potentially, there are several possibilities for the target state of a transition. Non-deterministic state-machines are useful when a whole class of potential behaviour is to be described or if the causes for a state change are unknown or irrelevant. In embedded systems design, deterministic behaviour, i.e., the desired behaviour of the system to be constructed, has to be modelled. The state-machine variants used for this purpose usually annotate the transitions with the name of the event or the system input that triggers (causes) the transition and optionally a condition (guard) that must be true in order to allow the state transition. The triggers determine when and, in conjunction with the guards, to which target state a transition takes place.13

A small example can be seen in Figure 2.10. The semantics of the small example is that there are two states, A and B. If the component is in state A and event α occurs and at the same time condition c is true, then the component changes its state to B. If event β occurs in state B, then the component changes its state to A. In some model types, an action is also annotated to the transition. This is an event that can be perceived by other components in the system, a signal to other components, or a program function that is called each time the transition occurs. The idea of triggers and actions that are related to the environment of the component allows the extension of the state-machine by inputs and outputs. These enable the connection of several components to a system or the modelling of system-environment interactions.

12 The other major modelling paradigm is that of differential equations, which are used to describe continuous control systems. Many embedded systems contain both aspects and are therefore called hybrid systems. However, for safety issues, the discrete control part is often more relevant.

13 Normally only the system part is deterministic; the combination of system and environment is only deterministic if the environment is also described by a deterministic state-machine.

Figure 2.10: Basic State Diagram
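The behaviour of the small example with states A and B can be sketched as a transition table with triggers and guards. The sketch is an illustration only: the class name, the method names and the spelling of the events as the strings "alpha" and "beta" are hypothetical. It further assumes that A is the initial state and that an event for which no transition is enabled is simply ignored.

```python
class GuardedStateMachine:
    """Two-state machine: A --alpha [c]--> B and B --beta--> A."""

    def __init__(self):
        self.state = "A"  # assumption: A is the initial state
        # (source state, trigger) -> (guard function, target state)
        self.transitions = {
            ("A", "alpha"): (lambda env: env.get("c", False), "B"),
            ("B", "beta"): (lambda env: True, "A"),
        }

    def dispatch(self, event, env=None):
        """Process one event; env holds the current values of guard variables."""
        env = env or {}
        entry = self.transitions.get((self.state, event))
        if entry is not None:
            guard, target = entry
            if guard(env):
                self.state = target
        return self.state


m = GuardedStateMachine()
m.dispatch("alpha", {"c": False})  # trigger present, guard false: stays in A
m.dispatch("alpha", {"c": True})   # trigger and guard: changes to B
m.dispatch("beta")                 # changes back to A
```

Actions could be added as a third element of each table entry and called when the transition is taken; they are omitted here to keep the sketch short.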

For the usage in industrial practice and in CASE tools, a number of different state-machine models have evolved. They are intuitive notations rather than mathematical formalisms. These models provide features that facilitate the work of the development engineer, but in general it is still possible to map these features to the mathematical concepts as defined by basic automata. Many of them have been invented by practitioners, without starting from a formal definition. In some cases a formalisation (or several competing formalisations) has been added afterwards, but many of these modelling techniques, although successful in practice, still suffer from some semantic ambiguities, cf. [Bee94, HN96, HPSS87, FSKR05], to give a few examples. Formalisation has become an issue with the increasing desire to generate program code automatically from the models.

Statecharts are among the most popular state-machine models applied in software and systems engineering. Statecharts were introduced by [Har87] and have been copied and modified in many modelling techniques. Statecharts annotate triggers, guards and actions to the state transitions. Additionally, they introduce a concept of hierarchy by allowing two different kinds of decomposition:

• the "OR-Decomposition" (sometimes called "XOR-Decomposition") and
• the "AND-Decomposition"

The OR-Decomposition refines a state into a number of sub-states with the constraint that, if the superstate is active, exactly one of the sub-states is active. Figure 2.11 shows in its left part an example of a set of traffic lights. The light can be off or on. If it is on, it is either green or orange or red (sub-states of on). If it is not on, then none of the sub-states of on is active.

The AND-Decomposition introduces a notion of concurrency: With regard to each part of a state (delimited by dashed lines), the system is in exactly one sub-state. In the right part of the figure, the traffic lights example is modified. The system now consists of a road signal and a pedestrian signal. The pedestrian signal is either in state red or green and at the same time, the road signal is either red or orange or green. The model does not explicitly specify the constraint that the pedestrian signal must be red when the road signal is green. This behaviour is only achieved by the triggers and actions that are exchanged between the components (for simplicity, these have been omitted in the figure). The start state of each hierarchy level is denoted by a black dot that is attached to a transition arrow. There is a separate start state on each hierarchy level; it applies when the surrounding state is visited for the first time.

Figure 2.11: The AND and the OR Decomposition in Statecharts: Traffic Lights Example
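The difference between the two kinds of decomposition can be made concrete by enumerating the possible configurations of the traffic lights example. The sketch is illustrative only; the variable names are hypothetical, and for the AND-Decomposition only the two concurrent parts (road and pedestrian signal) are modelled.

```python
from itertools import product

# OR-Decomposition: if "on" is active, exactly one of its sub-states is active.
on_substates = ["green", "orange", "red"]
or_configurations = ["off"] + [("on", s) for s in on_substates]

# AND-Decomposition: one sub-state per concurrent part at the same time.
road = ["red", "orange", "green"]
pedestrians = ["red", "green"]
and_configurations = list(product(road, pedestrians))

print(len(or_configurations))    # off, or on with one of three colours
print(len(and_configurations))   # every road state with every pedestrian state

# The structure alone does not rule out the unsafe combination
# (road green, pedestrians green); only triggers and actions can do that.
print(("green", "green") in and_configurations)
```

The sub-states of an OR-Decomposition add up, whereas the parts of an AND-Decomposition multiply.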

There are other details in Statecharts such as history states (sub-states that the system returns to when a formerly active superstate has been left and is entered again) or choice-points (decisions about the target state of a transition according to the current value of some variable). Timers allow modelling transitions that deterministically occur after a state has been active for a given period of time. Statecharts do not include an explicit component concept and the communication model is quite simple: each event that occurs somewhere in the system is assumed to be visible at the same time in every part of the system (broadcast communication). As semantic details such as the model of computation have not been formalised from the beginning, Statecharts can be considered a semi-formal notation. However, different formalisations for Statecharts have been proposed (e.g., [HN96]). Statecharts are supported by the CASE tool Statemate from I-Logix.

ROOMcharts (ROOM = Real-Time Object-Oriented Modeling), introduced by [Sel94], further extended the Statechart approach by introducing the notion of components and communication. There are separate models for system structure and for behaviour. In the structural model, components are depicted as boxes that are joined by ports. Ports are typed by the messages they can exchange and appear as pairs: a base port is connected to a conjugated port, which has the same message types but with switched directions (incoming vs. outgoing). The components (called actors) on the structural level form a hierarchy that allows successive refinement during design. The OR-Decomposition of states from Statecharts still exists and allows a state hierarchy as well. The AND-Decomposition, however, has been abandoned in favour of the structural hierarchy that better captures the actual construction principle of a technical component. As information exchange is only possible via the ports, the broadcast communication assumption from Statecharts is replaced by a more realistic communication model. So-called End-Ports achieve the connection between the structural and the behavioural model: A signal arriving at an End-Port in the structural model can be referenced as a trigger for some transition in the state model. A runtime environment is assumed so that timers and other services usually provided by an operating system can be modelled as service access points. This way, ROOMcharts are closer to the actual implementation and the tool ObjecTime (later IBM Rational RoseRT) promises code generation from the model.

UML State Diagrams, similar to Harel's Statecharts, have been incorporated in the UML notation. Selic's ROOM method influenced Real Time derivatives of UML. Eventually, elements from both notations were merged into the new state diagrams in the proposed UML 2.0 version [JRH+05].

Apart from Statemate and Rational RoseRT, there are many other CASE tools that support state-based design models, such as Artisan, Stateflow or Ascet. Therefore, state-based behavioural models are state-of-the-practice and have a high acceptance in industry. All of the practically applied model variants have a semantics that is intuitively clear to practitioners. 
However, formalisation has proven to be quite difficult because it involves issues that are not visible at first glance and not relevant if the purpose of the modelling technique is just to give an idea of how the control aspects of a system work. However, if state-machine models are used formally, they can be used for several types of analysis or model-checking. These are concerned with questions such as which states are reachable or which sequences of events are possible at all or lead to a certain acceptance or end state. This can be exploited for safety-critical systems if hazard states or scenarios are formalised, e.g., using a suitable temporal logic, and model-checking techniques provide the proof that the hazard states cannot be reached under any conditions. For example, if an elevator control unit is specified by formal state-machines, then one possible proposition that must never be true is ”Door is open AND Elevator is moving up”. Note that model-checking techniques are not counted as safety analysis techniques, but as correctness proving techniques; in this context, however, correctness with respect to safety requirements is the focus. If compositional state-based techniques are needed, because the system is too complex to capture or should be built out of existing components, models with a notion of composition (”product automaton”) or with explicit input and output ports as in ROOMcharts or I/O-Automata are required. One common problem of state-based techniques is the combinatorial explosion of the state-space: as every state of the first component can be combined with every state of the second, and the same with the third and so on, the number of product states becomes quickly unmanageable. For



example, the combination of ten independent components with just 4 states each leads to over one million product states (4^10 = 1,048,576).
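The size of the product state space can be reproduced with a short calculation. The sketch below is illustrative; the function name is hypothetical, and the components are assumed to be independent so that every combination of local states is a product state.

```python
import math
from itertools import product


def product_state_count(states_per_component):
    """Number of states of the product automaton of independent components."""
    return math.prod(states_per_component)


ten_components = product_state_count([4] * 10)
print(ten_components)  # 4 ** 10 = 1048576

# Explicit enumeration is only feasible for small cases,
# e.g. three components with four local states each.
small_product = list(product(range(4), repeat=3))
print(len(small_product))  # 4 ** 3 = 64
```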

2.5.2 Markov Chains and Other Probabilistic State-Based Modelling Techniques

Apart from deterministic and non-deterministic state-machines, there is a third variant: probabilistic state-machines. Probabilistic means that the time and target of state transitions are not exactly defined, but not completely unknown either; they can be described probabilistically. In particular, the transition time can be specified by a probability density function and the next state can be specified by a probability distribution over the discrete set of states, so that the probabilities for each state to be the successor state sum up to 1.

The most basic and most important probabilistic state-based model is the Markov Chain (MC). A Markov Chain is a state-machine where the transitions are labelled with transition rates. For a small time interval, the rate multiplied by the length of the interval gives the conditional probability that the transition to the given successor state occurs within this interval, provided that the system is in the source state. Markov Chains exist in discrete-time and continuous-time variants and require different analysis techniques in each case. In Discrete Time Markov Chains (DTMCs), a transition probability is given (with respect to the next finite time interval), in Continuous Time Markov Chains (CTMCs) a rate is given. However, it is possible to approximate CTMCs by DTMCs if the sampling points are chosen close enough to each other. In standard Markov Chains, the rates are constant, i.e., do not depend on time or on the current state or on external conditions. However, a variety of extensions have been proposed that allow non-constant rates or side conditions.

An example of a Markov Chain in the context of reliability analysis is given in Figure 2.12. A pump station consisting of two electrical pumps can be in four different states: both pumps can be working, only Pump1 can be defective, only Pump2 can be defective or both pumps can be defective. To the user this means that the service of the station is unavailable if both pumps are down and degraded if one pump is down. 
If both pumps are of the same type, it is likely that they have the same failure rate $\lambda_1$. If one pump has failed, the other one is operated on a higher load level and is therefore likely to fail earlier than when both pumps are operating. This is modelled by a higher failure rate $\lambda_2$. It is also possible to model transitions back to the working state, which in practice corresponds to repair or replacement of failed components. In the example, the repair rate is assumed to be the same, denoted $\mu$, no matter if one or two pumps have failed when the service personnel arrives. Note that although each rate is constant, failure dependencies between parts can be modelled in MCs.

Figure 2.12: A Markov-Chain Example

Markov Chains are frequently applied for reliability analysis of mechanical or electronic hardware parts. Their acceptance is high because of two major advantages: They capture the failure behaviour of hardware during the normal operational phase quite well, i.e., the assumption of constant rates roughly holds, and the mathematical foundations for their analysis are simple. A Markov Chain is said to be memory-less, which means that a look at the current state is sufficient to predict its future evolution; it does not matter how long the system has been in the current state. The assumption of constant rates enables the description of the behaviour by a system of linear first-order differential equations, for which a number of symbolic or numerical solution algorithms exist. The solution for the sojourn probability function for each state is a combination of exponential functions. In the special case of a two state system (working vs. failed), this boils down to a simple exponential function, hence the frequent usage of the exponential distribution for failure states in other analysis techniques such as FTA.

In the example, state S4 (both pumps defective) has "probability flows" from S2 and S3 where only one pump is defective. Following from the definition of the rate, the probability growth in a small time interval is equal to the corresponding rate times the probability of the source state ($P_2(t)$ for S2 or $P_3(t)$ for S3). The probability loss of S4 is caused by the flow towards S1 and is the product of the repair rate $\mu$ and the probability $P_4(t)$ of state S4. This leads to the differential equation for S4:

\[ \frac{\partial}{\partial t} P_4(t) = \lambda_2 P_2(t) + \lambda_1 P_3(t) - \mu P_4(t) \]

Note that it is always the probability of the source state of the transition that is being referred to. Flows into the state count as positive, flows out of the state count as negative. In a similar way, differential equations can be formed for the probabilities of all four states. Using the side conditions that the sum of all probabilities is always 1 and that $P_1(0) = 1$ (i.e., the system is working after start-up), the system of differential equations can be solved and the functions $P_i(t)$ can be calculated. 
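For the two-state special case mentioned above (working vs. failed), the differential equation and its exponential solution can be compared directly. The sketch below is a minimal illustration under stated assumptions: a single component with a constant failure rate lam and a constant repair rate mu (both values hypothetical), a component that is working at time 0, and a simple explicit Euler scheme instead of a production-quality solver. The equation is formed by the same rule as for S4: inflow is rate times probability of the source state, outflow counts as negative.

```python
import math


def failed_probability_euler(lam, mu, t_end, steps=20000):
    """Integrate dP_failed/dt = lam * P_working - mu * P_failed with Euler steps."""
    p_failed = 0.0                 # side condition: working at start-up
    dt = t_end / steps
    for _ in range(steps):
        p_working = 1.0 - p_failed  # the probabilities always sum up to 1
        p_failed += dt * (lam * p_working - mu * p_failed)
    return p_failed


def failed_probability_exact(lam, mu, t):
    """Closed-form transient solution of the two-state chain."""
    s = lam + mu
    return lam / s * (1.0 - math.exp(-s * t))


# Illustrative values only (rates per hour, time in hours).
lam, mu, t_end = 1e-3, 1e-1, 200.0
p_euler = failed_probability_euler(lam, mu, t_end)
p_exact = failed_probability_exact(lam, mu, t_end)
p_steady = lam / (lam + mu)        # long-run (steady-state) unavailability
print(p_euler, p_exact, p_steady)
```

The transient solution is a single exponential function that approaches the steady-state value lam / (lam + mu); without repair (mu = 0) it reduces to 1 − exp(−lam · t).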
Markov Chains can thus be solved symbolically or numerically by means of these differential equations; Monte Carlo simulation (i.e., playing runs of the system where the transition times are chosen probabilistically according to the given rates) is also possible. The analyses of interest comprise steady-state analyses (i.e., analyses of the stable sojourn probabilities that evolve in the long run) and transient analyses (providing the sojourn probability functions over time).

Apart from their convenient quantitative analysis, MCs offer further advantages: They can model repair, as shown by the repair transitions in the example; they can express dependencies between components, as shown by the increased failure rate when one pump is working alone in the example; and they can express more states than just working or failed. The latter feature is useful if a component has multiple failure modes or exhibits degrading failures, e.g., a passage from fully functional to degraded and finally to failed.

On the other hand, MCs have some disadvantages: the component dependencies can only be modelled by explicitly constructing the product state space of all components (e.g., the two states that each pump possesses combined with each other result in four different states of the entire pump station). This state-explosion problem often makes practical models too complex to understand. Moreover, the assumption of constant failure rates is often justified for hardware parts that fail due to wear-and-tear, but not necessarily for software parts. It can also be doubted for the repair or recovery aspects.

To extend the modelling power, variants of the MC formalism have been proposed. To give some examples, Markov Decision Processes allow controlled Markov models, Semi-Markov Processes allow for non-exponentially distributed transition delays, and [BB03] introduce "Boolean logic Driven Markov Processes" that allow triggered Markov Chains. There are other state-based probabilistic modelling techniques that have been proposed (e.g., Probabilistic I/O Automata [WSS97]), but none of them has yet achieved broad industrial acceptance as a safety or reliability analysis technique. However, interesting lessons can be learned from these techniques, and these lessons have influenced the research work for this thesis.

2.5.3 Petri Nets

Petri Nets (PN), first introduced in 1962 by Carl Adam Petri [Pet62], are a particular kind of state-based model that is especially suitable for modelling concurrent systems. They are mainly applied for modelling material flow (e.g., in production plants), information flow, or steps of batch processes. They are not widely applied in software engineering or in safety and reliability engineering, but have been proposed there as well [KW03, DIN03b]. The reason why they are discussed here is that they have frequently been proposed as a target model into which Fault Trees can be translated, and they will also be used as a target model for State/Event Fault Trees later in this thesis. Like other state-based models, PNs exist in non-deterministic, deterministic and probabilistic variants. The variety of PN types has become vast over the years; an introduction and overview can be found in [Rei85].

A PN is a bipartite directed graph, i.e., a graph with two kinds of nodes that are connected alternately by directed arcs (also called edges). The nodes are places, denoted by circles, and transitions, denoted by bars or rectangles. A place can hold tokens, denoted as black dots. Depending on the specific PN variant, a place can hold one token or more (the maximum number of tokens allowed is called the capacity of the place) and the tokens may be typed (often called "coloured") or indistinguishable. There is an initial token distribution, i.e., some of the places are marked by tokens and others are not. A transition is generally connected to places by an arc leading from the place to the transition, or by an arc leading from the transition to the place. In the first case, the place is called predecessor place of the transition, in the second case it is called successor place.

The basic principle of how a PN evolves over time says that transitions switch (or "fire") by taking tokens from all of their predecessor places and putting tokens on all of their successor places. In some PN variants, multiplicities, i.e., natural numbers, are annotated to edges that specify how many tokens are taken or put when the transition fires. Transitions are only ready to fire if all of their predecessor places carry enough tokens to execute the firing operation.

A simple PN example is given in Figure 2.13. The initial state is shown on the left side, and on the right side the state after transition T1 has fired is shown. Note that T2 could have fired as well and the net in the example gives no indication of which one fires first. The term state of a PN refers to the whole marking, i.e., the vector containing the numbers of tokens currently present on each place. A PN place should not be confused with a state as defined in state-machines. However, it is possible to translate state-machines to PNs so that each place corresponds to a state-machine state and PNs to state-machines so that each marking becomes a state.

Figure 2.13: A Simple Petri Net Example (Initial State and State After Transition T1 Has Fired)

In this example, showing one of the basic PN variants called Place-Transition-Net, no information is given about when or why transitions fire and which one is chosen if there is more than one possibility. The net is non-deterministic, i.e., it captures a set of possible evolutions over time, and it takes a “player” or “adversary” to choose transitions arbitrarily to realise one particular run. The main types of analysis performed on this kind of PN are reachability analyses for given markings or analyses for certain properties, e.g., that a certain place never holds more than a given number of tokens (boundedness) or that the net never gets to a state where no transition can fire (deadlock freedom, a weak form of liveness).

Different PN variants incorporate an explicit notion of time that is annotated either to places or, more frequently, to transitions. In some of them, transition firing takes a certain time; in other variants firing occurs instantaneously, but there is a defined delay a transition waits before firing after becoming enabled. Time Petri Nets [MF76] assign a minimum and maximum waiting time to each transition. There is no information on when exactly the transition fires, so this class of PNs is still non-deterministic. Stochastic Petri Nets (SPNs) assign an exponentially distributed waiting time to each transition. Generalised Stochastic Petri Nets (GSPNs) [MBC+95] allow exponential and immediate transitions. Their expressive power is identical to the expressive power of Markov Chains. Deterministic and Stochastic Petri Nets (DSPNs) [MC87] additionally allow transitions with deterministic waiting time. Their expressive power is higher, but they can only be analysed under certain conditions; in other cases they have to be evaluated by Monte-Carlo Simulation. DSPNs have been chosen as an intermediate evaluation model for State/Event Fault Trees; therefore, they will be discussed again in Section 5.2. In general, for the stochastic variants of PNs, the same types of analysis (steady-state or transient) are of relevance as in other probabilistic state-based models; for an overview of analysis types, see [Mur89].
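The firing rule and the untimed analyses named in this section can be sketched in a few lines. The net below is a hypothetical place/transition net chosen for illustration (it is an assumption and not a reconstruction of Figure 2.13); the function names are likewise illustrative. The exploration enumerates all reachable markings, from which a token bound and the markings without any enabled transition can be read off.

```python
from collections import deque

# Hypothetical place/transition net (an assumption, not the net of Figure 2.13).
# Each transition maps to (tokens taken per predecessor place,
#                          tokens put per successor place).
TRANSITIONS = {
    "T1": ({"P1": 1}, {"P2": 1}),
    "T2": ({"P1": 1}, {"P3": 1}),
    "T3": ({"P2": 1}, {"P4": 1}),
    "T4": ({"P3": 1}, {"P4": 1}),
}
INITIAL = {"P1": 1, "P2": 0, "P3": 0, "P4": 0}


def enabled(marking, transition):
    """A transition is ready to fire if every predecessor place holds enough tokens."""
    pre, _ = TRANSITIONS[transition]
    return all(marking[place] >= count for place, count in pre.items())


def fire(marking, transition):
    """Return the marking reached by firing one enabled transition."""
    pre, post = TRANSITIONS[transition]
    successor = dict(marking)
    for place, count in pre.items():
        successor[place] -= count
    for place, count in post.items():
        successor[place] += count
    return successor


def reachable(initial):
    """Breadth-first exploration of all markings; terminates only for bounded nets."""
    seen = {tuple(sorted(initial.items()))}
    queue = deque([initial])
    while queue:
        marking = queue.popleft()
        for transition in TRANSITIONS:
            if enabled(marking, transition):
                successor = fire(marking, transition)
                key = tuple(sorted(successor.items()))
                if key not in seen:
                    seen.add(key)
                    queue.append(successor)
    return [dict(key) for key in seen]


if __name__ == "__main__":
    markings = reachable(INITIAL)
    bound = max(max(m.values()) for m in markings)
    dead = [m for m in markings if not any(enabled(m, t) for t in TRANSITIONS)]
    print("reachable markings:", len(markings))
    print("largest token count on any place:", bound)
    print("markings with no enabled transition:", dead)
```

In the initial marking both T1 and T2 are enabled, which mirrors the non-deterministic choice discussed above: the exploration follows both branches instead of picking one.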

2.6

Structural Modelling in Software Engineering

The science of software engineering has recognised that many of the difficulties with the construction of modern technical systems lie in their size and complexity. An appropriate way to cope with this complexity is partitioning the system into clearly separated and self-contained parts. Object-oriented techniques and the paradigm of component-based software construction have boosted the trend towards encapsulated components that are made for an exactly defined purpose and that can be constructed, examined, and reused on their own. At the same time, the awareness of software and systems architecture has grown and structural modelling techniques have evolved that describe how the components interact in the system. Several semi-formal architectural description languages (ADLs) have emerged (e.g., C2, MetaH, Rapide), as well as the so-called system description languages, which cover software and hardware components and also model the architecture (e.g., System-C, System-Verilog). ADLs appear to be quite different, but (hierarchical) components, ports, and connectors are concepts common to most of them. Many ADLs offer encapsulation, abstraction, and a typing system [HNS00, Sel94]. Typed connections or messages enforce consistency when assembling components. The state-of-the-art modelling language UML [JRH+05] has several diagram types that deal with a software system’s architecture on various abstraction levels. The upcoming version UML 2.0 has observably been influenced by Selic’s ROOM methodology, which places emphasis on a component concept where the components are called actors (in the derived technique Real Time UML they are called capsules).

A generic component concept similar to the one used in ROOMcharts or UML2 is briefly introduced here, because it serves as an archetype for the component concept in Component Fault Trees and State/Event Fault Trees. A system consists of components, each of which can consist of further components and so on, constituting a component hierarchy. The internal details of a component are hidden from the environment. The only way that information can be passed to the component or obtained from it is via the ports. Components are usually depicted as rectangles and ports as small squares on the border of the component. Ports are typed, i.e., the kind of information that can be passed from inside to outside or vice versa is defined in advance. In ROOMcharts there are base ports and conjugated ports, which support the same types of messages, but with reversed direction. Connectors are lines that connect the ports of different components; they must always connect a base port with its conjugated port. Figure 2.14 shows an example of a structure diagram in an archetypical ADL notation (quite similar to ROOM). The actor class C1 describes components that are composed of two subcomponents (called actor references, because they reference another actor class which is represented by a separate ROOMchart model). The actor references are connected at their ports: the base port is black and the conjugated port is white. At the border of C1 there are two ports that are called relay ports, as they relay messages from and to the environment. The right relay port is connected to an end port that interfaces with the behaviour description of C1. In ROOM, structure and behaviour are described by separate diagram types; the behavioural diagram is the ROOMchart, which has been introduced along with Statecharts in Section 2.5.
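The connection rule for typed ports can be expressed as a small check. The following is a minimal sketch that covers only the rule stated above (a connector joins a base port with a conjugated port of the same message type); the class `Port`, its fields, and the function `can_connect` are hypothetical names, and relay and end ports are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    protocol: str             # the type of messages the port supports
    conjugated: bool = False  # False: base port, True: conjugated port


def can_connect(a, b):
    """A connector is accepted only between a base port and its conjugated counterpart."""
    return a.protocol == b.protocol and a.conjugated != b.conjugated


if __name__ == "__main__":
    base = Port("cmd_out", "Command")
    conj = Port("cmd_in", "Command", conjugated=True)
    print(can_connect(base, conj))                   # base + conjugated, same type
    print(can_connect(base, Port("x", "Command")))   # two base ports
```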

Figure 2.14: A ROOM Structure Diagram Example

In summary, many state-of-the-art software engineering models offer a component concept and provide a suitable means to represent the system architecture graphically. Consequently, safety and reliability analysis techniques should be required to support a similar kind of structuring. This is not only necessary to understand complex analysis models, but also to allow division of labour between different analysts, reuse of partial analyses from former projects, and attachment of analyses to deliverable technical components. How a component concept can be usefully introduced to FTA is explained in Section 3.4.

2.7

Recent Research and Remaining Issues

The research challenges discussed in Section 2.3.3 have been recognised by many researchers over the last decade, and various extensions and ameliorations for safety and reliability analysis techniques have been proposed. They cover the following research issues:

• Extension of existing techniques for new applications, in particular software
• Rendering the semantics of existing techniques more precise
• Creation of new techniques that better fit the modelling needs
• Integration of safety/reliability analysis techniques with each other or with software design techniques
• Automatic generation of safety/reliability models from software design models

FTA, being a very important safety analysis technique [LM96, Vil92], has often been the subject of research. Many approaches involve combinations of FTA and state-based modelling techniques, an approach that has also been chosen for State/Event Fault Trees.

Modifications to existing techniques for the application to software have been proposed. For tabular techniques, this can be achieved by adapting table columns, keywords, and risk measures. For instance, there are several proposals for Software FMEA and Software HAZOP [RCC99, DEF04, FMPN94]. The direct application of Fault Trees to software has been proposed [LCS91], but has not been widely applied in practice. Leveson’s approach works on the source code level and attributes failure possibilities to each language primitive (e.g., if, for, while, arithmetic operations, ...). The FTs can be generated automatically, but it is not clear how to determine what could go wrong, why, and with what probability. Another code-level approach can be found in [LM01]. A technique similar to code slicing is used there to determine for each instruction in the control-flow graph which other instructions it depends on. The failures produced by an instruction are either caused by the instruction itself or by the input data, the origin of which is known.
These techniques have not yet been applied to large systems and could, in practice, lead to complex FTs. A remarkable feature of the approach of [LM01] is that the system level FTs span across software and hardware parts of the system and different techniques are applied to generate both partial FTs. Other approaches [LR98, McD02] consider software on a higher abstraction level. They model the intended behaviour of the software by state diagrams and then manually add failure states and faulty transitions to these states, ideally guided by an investigation technique such as FMEA or HAZOP. To evaluate how the identified errors together influence failures on the system level, standard FTA can be used.



Formal methods have been used in different approaches for the mapping of FTs to software or for the formalisation of the FTA semantics. [STR02] integrate FTs with formal program specifications and use Interval Temporal Logic to give formal semantics to Fault Trees, including evolution over time. In [RST00] the same researchers show how they apply FTs to specify hazard probabilities of a railway control system that is formally specified in Statemate [HN96]. Due to the formal specification, the impossibility of the hazard states can be proven automatically. [Sch03] uses Duration Calculus with Liveness (DCL) for a formalisation of Fault Trees and applies Model Checking to prove some properties of an FT. Each FT event is a DCL term, i.e., a proposition over time. Decomposition gates and cause-consequence gates are distinguished. The goal is a proof that the FT is properly constructed with respect to the system, not the calculation of the top-event probability. [Han96] also uses Duration Calculus for the formalisation of FTs, but suggests a derivation of software requirements from the FT instead of an evaluation of software safety by FTA. In the ESACS project [Boz03], FTs are automatically generated from Statecharts or other models and then evaluated by model-checking. The duration or order of events is not considered. Other approaches involving formal methods for the formalisation of FTs are [DSC00] (in this contribution, the language Z is used to specify the syntax and semantics of Dynamic Fault Trees for the tool Galileo) and [Gór94], where the ontology of FTA terms is mathematically defined. The latter approach explicitly considers time, the execution order, and the duration of events, and therefore assumes two different kinds of events: instantaneous transitions vs. enduring actions. 
Many approaches to modelling dynamic behaviour and multi-state components map FTs to state-based models, in particular Markov Chains [DBB92] or different variants of Petri Nets (PNs) [HA88, MT95, HWS+01, GMW95]. The most frequent translation to PNs is the one shown in Figure 2.15. It translates the Boolean conditions expressed by the FT into switching conditions of a PN. For the AND gate, this means that the output switches after all input events have switched, and for the OR gate, this means that the output switches as soon as one of the inputs switches. This translation is only valid for events that occur only once in the lifetime of a component, like failures of non-repairable components. The advantage of this Petri Net representation can be doubted, because it uses a state-based model that is expensive to analyse and provides merely the results that can be obtained by combinatorial analyses as well; the advantages of a state-based model are not fully exploited. However, some of the cited contributions extend the basic translation, e.g., by timed transitions, choose stochastic variants of PNs (such as Generalised Stochastic Petri Nets, GSPNs) and perform more advanced analyses on them. [Buc00] and the diploma thesis [Gra95] cited therein propose a different translation to GSPNs, which better captures the state-machine behaviour of the components and allows transitions in both directions (working to failed and failed to working). Each component is represented by a subnet consisting of two states (usually working and failed) and two events (failure and repair), as shown in Figure 2.16. The transitions are of the exponential type and are annotated with the failure and repair rates.
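A two-place subnet with one exponential failure transition and one exponential repair transition behaves like a two-state continuous-time Markov chain, for which the probability of residing in the failed place has a well-known closed form. The sketch below evaluates that closed form; the function name `unavailability` and the example rates are assumptions for illustration and are not taken from [Buc00] or [Gra95].

```python
import math


def unavailability(failure_rate, repair_rate, t=None):
    """Probability that a two-state repairable component is in the failed state.

    The component is assumed to start in the working state. With t=None the
    steady-state value lambda / (lambda + mu) is returned, otherwise the
    transient value at time t.
    """
    total = failure_rate + repair_rate
    steady = failure_rate / total
    if t is None:
        return steady
    return steady * (1.0 - math.exp(-total * t))


if __name__ == "__main__":
    lam, mu = 1e-3, 1e-1  # assumed rates per hour, purely illustrative
    print("steady state:", unavailability(lam, mu))
    for hours in (1, 10, 100):
        print("t =", hours, "h:", unavailability(lam, mu, hours))
```

The transient values rise from zero towards the steady-state value, which corresponds to the two kinds of analysis (transient and steady-state) distinguished for probabilistic state-based models.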


Figure 2.15: Mapping of FT AND (left) and OR (right) Gate into Petri Net Structures according to [HA88].

Figure 2.16: Mapping of Components to GSPN Subnets according to [Gra95]


The AND and OR gate translations are shown in Figure 2.17. The transition labelled “on” in the structure corresponding to the AND gate switches as soon as n tokens have been put into the lower place SN. This place is filled by the fail transitions of the n connected subcomponents. So if all of the n subcomponents have failed, the output of the AND gate is enabled. The inhibit arc with multiplicity n prevents the AND output from toggling back to off immediately, as long as n tokens are on the lower place SN. As soon as one of the input subcomponents switches its repair transition, one token is removed from SN and the output switches back. Similar considerations apply to the OR gate, except that one token provided by any of the n input subcomponents is enough to make the output toggle to on. In [Buc00] formal proofs are given that this translation actually preserves the logic of Boolean gates for repairable components. Translations for the k-out-of-n gate and the NOT gate are provided as well.
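The counting logic described above can be sketched without the full net. The following abstraction keeps only the number of tokens on the place SN and ignores the immediate transitions and the inhibit arc; the class `GateCounter` and its threshold parameter are hypothetical and do not reproduce the subnets of [Gra95].

```python
class GateCounter:
    """Tracks how many inputs currently contribute a token to the place SN."""

    def __init__(self, n_inputs, threshold):
        self.n_inputs = n_inputs
        self.threshold = threshold  # tokens needed for the output to be on
        self.failed = set()

    def fail(self, index):
        """Fail transition of input `index`: one token is put on SN."""
        self.failed.add(index)

    def repair(self, index):
        """Repair transition of input `index`: its token is removed from SN."""
        self.failed.discard(index)

    @property
    def tokens(self):
        return len(self.failed)

    @property
    def output_on(self):
        return self.tokens >= self.threshold


def and_gate(n_inputs):
    return GateCounter(n_inputs, threshold=n_inputs)


def or_gate(n_inputs):
    return GateCounter(n_inputs, threshold=1)


if __name__ == "__main__":
    gate = and_gate(2)
    gate.fail(0)
    print(gate.output_on)   # one of two inputs failed
    gate.fail(1)
    print(gate.output_on)   # both inputs failed
    gate.repair(0)
    print(gate.output_on)   # one input repaired again
```

In this abstraction, any other threshold between 1 and n would give a voting gate; whether that matches the k-out-of-n translation of the cited work is not claimed here.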

Figure 2.17: Mapping of AND and OR Gates to GSPN Subnets according to [Gra95]

The latter mapping approach makes better use of the state-based nature of PNs and served as the basis for the translation proposed for State/Event Fault Trees in Chapter 5. However, it has some disadvantages: the composition of systems from components is not formally defined in terms of places or transitions that serve as interface elements. Instead, PN arcs (shown as dotted lines in the figure) have to be inserted from the failure and repair transitions of each input component to the input place of the gate substructure. This becomes difficult when a subcomponent has more than two states and a potentially large number of incoming and outgoing transitions to the failed state. The translation proposed in this thesis avoids this disadvantage by introducing interface places and allows the extension to multi-state components. Nevertheless, state-based models in general offer an appropriate solution to the issue that standard FTs cannot consider multi-state components, but only the states working and failed. Several multi-state extensions have been discussed. However, when time dependencies are not relevant, the combinatorial approaches can also be extended to multi-valued logic in order to represent multi-state components, e.g., in the approach of [ZWST03]. These approaches are useful because combinatorial evaluation is much faster than state-based evaluation.



To combine the advantages of both approaches, [GD97] propose to divide the FT into static modules and dynamic modules and to solve the static modules (modules that only contain plain logic gates) by BDDs and the dynamic modules by translation to Markov Chains.

Some researchers propose safety modelling techniques that reflect the modularity of software systems and the dependencies of one component on the services provided by another. A typical example is the Failure Propagation and Transformation Notation (FPTN) proposed by [FMPN94]. FPTN models a system as a set of components with inputs and outputs and connections between them. A set of potential failure modes is attached to each component output and each component is described by a set of rules on how internal failures and failures at the component inputs are propagated or transformed to some output. This technique has been further developed in the HiP-HOPS approach [PM99]. Another example that also reuses the structural information from system design models for safety models is presented in [PM01]. In the context of an international research project named SETTA, the authors explain how to generate FT models from Matlab-Simulink models that are frequently used in systems design in automotive and aerospace industries. The structural model on different abstraction levels is enriched with failure modes found by an interface-specific HAZOP, and once the failure logic has been specified, an FT can be generated automatically.

Some contributions suggest theoretical frameworks for model integration and corresponding tools. They define a set of common terms and formalisms that apply to all models belonging to the framework. Often a notion of states, events, and interaction mechanisms is part of that common set of terms. Every model provider derives model-specific terms from these basic terms, defines the model syntax and graphical representations, and provides different kinds of analyses or simulations for the model.
This way, various models can be integrated with each other and be partly transformed into each other on a formal basis. Möbius [CCD+01] offers a framework based on states and actions that allows integration of different kinds of stochastic Petri Nets or Stochastic Activity Networks, Markov Processes, or Queueing Networks. It is targeted at availability, reliability, and performance analysis. The DEDS (Discrete Event Dynamic System) Toolbox [BBK98] achieves model integration by translating different models into an abstract Petri Net notation. SHARPE [ST87] offers a framework that integrates probabilistic modelling techniques for reliability engineering. [CSD00] continue Joanne Bechta Dugan’s work on Dynamic Fault Trees and propose an intermediate model named Failure Automata as a framework for the integration of FTA, Markov analysis, and other techniques. In the safety area, a Common Safety Description Model (CSDM) has been proposed in [BCG91]. It incorporates the main features (events, causality, timing, non-determinism, and generalisation) that are common to all integrated models. The idea of an integration framework in the safety domain is also present in [FM93]. Most of these frameworks contributed to the formalisation of the models being integrated; however, they all come from academia and have not yet provided notations and tools that have been broadly accepted by engineers in industry.



In summary, there are a lot of ongoing research activities that confirm the need for dynamic, component-oriented, and formalised safety and reliability analysis frameworks, but a solution that avoids all of the inconveniences of the traditional techniques has not been found. Regarding FTA, the formal methods and especially the integration with state-based models helped formalise the higher-level gates like P-AND, Functional Dependency, or Spare Gates. The proposed distinction between decomposition gates and causation gates and between initiator and enabler events that is referred to in many of these contributions has led to a widely accepted understanding of the causal semantics that FTs exhibit in practice. Yet the various formalisation approaches show that there is not one commonly accepted FT semantics and that FTs as defined in the standards lack the required precision when trying to put the original FTA into a formal framework. Moreover, most formal approaches are not suitable for practitioners as they require specialised expertise, and thus an applicable way to provide precise semantics to FTs is still missing. There seems to be a consensus that none of the existing techniques alone can solve all problems; promising approaches combine several existing techniques. The State/Event Fault Tree (SEFT) approach described in this thesis aims at combining advantages from many of the cited approaches to solve as many of the issues as possible while keeping the notation simple. It provides a new modelling technique, but with elements from the accepted families of the techniques FTA, state-based modelling and architectural modelling. The Component Fault Tree (CFT) approach, which was also developed during the research period of this doctoral thesis, marks a milestone on the way towards SEFTs.

Chapter 3

Component Fault Trees (CFTs)

3.1

Motivation

Component Fault Trees (CFTs) are the first usable outcome of this doctoral research project. They address the requirement of compositionality that was observed to be lacking in the FTA technique. CFTs are a means to structure FTs in a more appropriate way by introducing components that are joined by ports. They were first proposed in [KLM03]. In the context of this thesis, CFTs constitute a milestone in the development of the SEFT concept. Nevertheless, they are a useful technique on their own: they allow safety analysts to better structure and reuse FTs for technical components, and, in addition, they can reduce the computational effort for quantitative evaluation. The decomposition of complex systems into manageable parts is an essential principle when dealing with complex technical systems. In hierarchical models, a system consists of components, which are recursively refined into sub-components. A model is called compositional if a set of rules allows determining the properties of the whole system based on the properties of its components and its architecture. As pointed out in Section 2.6, many modern design models provide compositional semantics, or at least a visually represented hierarchy of components and subcomponents. A common paradigm in hierarchical models is to show the inner details of a component on its own hierarchical level, but to hide them on the next higher hierarchical level. On this higher level, all referenced subcomponents appear as ”black boxes”. The relations, i.e., communication relations, between these black boxes are depicted as edges (lines). The points where these edges are connected to a box are usually called ports. The ports of a component specify its external interface; only via these ports can internal details of a component be accessed from outside. 
In different models, ports are typed according to appropriate principles, e.g., by type and direction of the messages that can be exchanged across the port in a communication relation. What exactly is exchanged via ports depends on the semantics of the actual model. Examples are provided and required services, messages with data fields or simply signals that notify the receiver about some event. The visual hierarchical decomposition should match the component hierarchy in terms of the technical parts the system consists of. Ideally, the decomposition is also exploited by compositional analysis techniques, which analyse each component on its own and construct the system-level analysis out of the analysis results of the components.

A similar kind of decomposition would be desirable in safety and reliability models as well. In favour of an integrated development process, which was called for at the beginning of this thesis, the decomposition units in safety analysis should be the same as the decomposition units defined during system specification and design (i.e., technical subsystems). For each identified component, there should be one partial model, and the system architecture should guide the integration for analysis. Traditional FTs, although visually constituting hierarchical models, do not provide this kind of compositionality; the differences will be explained in the following section. Component Fault Trees (CFTs) have been developed as an improvement with regard to this issue.

3.2

Traditional Fault Tree Decomposition by Modules

Today, the usual principle for cutting down Fault Trees is division into independent sub-trees, called modules. A module is a sub-tree that is not influenced by other parts of the Fault Tree and that influences other parts of the tree only by its root [DIN81, KHI89]. Modularisation is a recursive process, as sub-trees might themselves contain independent sub-trees. As special cases, the whole Fault Tree and every basic event are modules. Traditionally, the top-event probability of each module is calculated and then the whole module is replaced by a virtual simple event, located at the former position of the sub-tree root. This virtual event is assigned the probability calculated for the root of the removed sub-tree. This bottom-up procedure continues until the top-event probability of the Fault Tree has been calculated. Thus, for probabilistic analysis, the output of the root gate of a module is treated like a basic event. Identification of modules in Fault Trees is a formal procedure that only refers to the tree structure (which reflects the chains of faults or hazards) and not to the system architecture (which reflects how the technical parts of the system are constructively arranged). Technical components - in terms of self-contained technical units, potentially delivered by a sub-supplier - are usually not independent from each other with respect to their failure behaviour. There are influences between them and common-cause failures. In consequence, many FT events are members of the same module, although they belong to different technical units. Other events may belong to different modules, although they belong to the same technical unit, just because they happen to have no causal relation. Thus, system architecture components do not necessarily correspond to Fault Tree modules. The refinement of a system into components by its architecture is a different kind of hierarchy than the refinement of the corresponding FT into modules.
This issue is best explained by a small example. First, the traditional decomposition by modules is applied. In Figure 3.1, a simple FT is shown. It has four basic events to which, for the sake of explanation, arbitrary (constant) failure probabilities of 0.1, 0.2, 0.3, and 0.4 have been assigned. Using the given calculation rules (in this example, only multiplication is applied), the resulting probability of the top-event is 0.0024. One of the modules of this FT has been deliberately selected and marked by a box around it.

Figure 3.1: A Fault Tree and one of its Modules

Using the triangular “transfer symbol” from standard FTA, it is possible to split the tree at the root of the module and to transfer the module to another page for better viewing and editing, as shown in Figure 3.2. This decomposition is not only applicable to the graphical view of the tree, but also to its evaluation. Assuming that all gate outputs are FT events and thus have a probability, it is possible to calculate the probability of the module root by the usual formulas. The top probability of the module is 0.006. This probability is carried to the main page and, together with the remaining event e1, results again in a top-event probability of 0.0024, as expected.

Figure 3.2: Decomposition of a Fault Tree by Modules

Now suppose that the FT from the example belongs to a technical system where some part (an example is marked in the right part of Figure 3.3) constitutes a technical component on its own, possibly delivered by an external supplier, and the rest is considered to be the main system. The marked subcomponent is obviously not a module, as it is not independent from the events at the bottom of the figure. It cannot be analysed on its own by FTA. However, as in an architecture model the subcomponent appears as a black box that is developed on a separate page, the safety analyst might want to treat it separately as well. Of course, the triangular transfer symbols that FTA provides can be used to partition the FT on different pages accordingly. However, this is a graphical issue and has no semantic meaning; in particular, these FT pages are not reusable and self-contained components as in architecture models.
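The module-wise evaluation of the example from Figures 3.1 and 3.2 can be reproduced with a few lines of arithmetic. Since 0.006 = 0.1 · 0.2 · 0.3, the sketch assumes that the module groups the three basic events with these probabilities and that the remaining event e1 carries 0.4; this assignment is inferred from the numbers in the text, not read from the figure. The helper `and_gate` is an illustrative name.

```python
from math import isclose, prod


def and_gate(*probabilities):
    """Probability of an AND gate over statistically independent inputs."""
    return prod(probabilities)


# Flat evaluation: all four basic events are combined by multiplication only.
flat_top = and_gate(0.1, 0.2, 0.3, 0.4)

# Modular evaluation: solve the module first, then treat its root
# like a basic event on the main page (assumed module membership).
module_root = and_gate(0.1, 0.2, 0.3)
top_via_module = and_gate(module_root, 0.4)

if __name__ == "__main__":
    print("module root:", round(module_root, 6))
    print("top event (flat):", round(flat_top, 6))
    print("top event (via module):", round(top_via_module, 6))
    print("results agree:", isclose(flat_top, top_via_module))
```

The replacement of a solved module by a single event is valid only because the module is independent of the rest of the tree, which is exactly the property the marked subcomponent lacks.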

Figure 3.3: Fault Tree with Subcomponent

In summary, there are two distinct refinement hierarchies:

1. the backward refinement of the cause-effect relations as indicated by the Fault Tree
2. the refinement by components as defined by the system architecture.


While the traditional FT decomposition into modules reflects the first kind of hierarchy, the new CFT approach exploits the second kind, namely, the component hierarchy.

3.3

Informal Introduction to Component Fault Trees

The CFT concept is first introduced by a continuation of the example from above: It would be desirable to cut the subcomponent off and put it into a box. The connection of the box to its environment is achieved by ports, points of information transfer. These ports appear as small symbols at the edge of the box.1 The inner details of the subcomponent are hidden on the system level and put into a separate CFT model. In this model the ports reappear, this time as triangular transfer symbols (see Figure 3.4). Of course, there can be more than one subcomponent on each level, and the hierarchy depth can be greater than two (i.e., a subcomponent can have sub-subcomponents and so on).

Figure 3.4: Component Fault Tree of the System (left) and the Subcomponent (right)

1 As shown in the figure, not only subcomponents, but also gates, are connected via ports. For gates this may appear strange. However, both formalisation and technical implementation are easier if a common connection mechanism is used. Moreover, and even more important, the later introduction of State/Event Fault Trees makes gates with ports necessary, because in this model the gate ports are typed.

In this example, neither the system nor the subcomponent is analysable on its own, because both have open ports where information is missing. Nevertheless, it is possible to store the subcomponent independently, to deliver it together with other component models, to instantiate it several times and to keep it in a repository for later reuse. There are also analysis steps that are possible on the component level that speed up the final analysis, e.g., the translation to and reduction of the BDDs, if the algorithm that will be proposed in Section 3.6.2 is applied. The quantitative result, however, can only be calculated if the components of all hierarchy levels are available. In Figure 3.5 the subcomponent from above appears twice. This means that two distinct instances of the same component type are part of the system. The internal events of both instances are distinct. Both subcomponents fail independently from each other, but with the same failure characteristics, such as probability distribution and parameters. In the following section, CFTs are defined more formally.

Figure 3.5: Multiple Use of the same Subcomponent type (two instances, SC1 : C2 and SC2 : C2)

3.4 CFT Model Elements and Rules for Well-Formedness

A CFT is a directed acyclic graph consisting of four different types of nodes and edges. All edges are directed; they lead from one node, called source, to another, called target. The nodes can be further divided into simple nodes that can directly be the source or target of edges and complex nodes, which can only be addressed by distinguishable ports. Simple nodes are
• events (in terms of traditional FTA)
• ports (of the model, triangular symbols)
while complex nodes are
• subcomponents
• gates.


Figure 3.6: The Model Elements of Component Fault Trees

The different model elements of CFTs are shown in Figure 3.6. Simple nodes is the collective name for the internal events (basic events in traditional FTA) and the input and output ports belonging to the component being modelled, as these nodes possess no internal structure and can be the source or target of edges. Complex nodes is the name used for substructures (depicted as rectangles) that possess distinguishable ports and can only be connected by their ports. In other words, complex nodes are not nodes in the sense of graph theory, but complex structures uniting several nodes that belong to the same logical unit. The ports are distinguished as input ports and output ports. Gates and subcomponents are very similar to each other, not only regarding their graphical representation as rectangular boxes. The differences are:

• a gate has exactly one output port, whereas a sub-component may have more than one,
• a gate represents a Boolean function, whereas a sub-component references another CFT model that describes the component appearing as subcomponent.

Remembering that CFTs, by their graph structure, define Boolean functions from their inputs and internal events to their outputs, we find that both gates and sub-components represent Boolean functions.

The connection between different hierarchical levels is accomplished in two steps:



1. by the reference from every sub-component to its corresponding CFT. This is technically achieved by storing the unique ID of the referenced CFT model as an attribute of the subcomponent.²
2. by the input and output ports that are joined by edges. The (triangular) input/output ports of the component currently being modelled appear as (small attached) sub-component input/output ports on the next higher hierarchical level. On both levels, the ports are joined to other model elements by edges. The matching of corresponding ports on both levels is assured by the unique port identifier.

Figure 3.7 repeats the example from above, this time with visible port IDs. For instance, as subcomponent SC1 references component model C2, it is clear that SC1.Pout1 makes the connection to C2.Pout1.

Figure 3.7: IDs as a Means to Reference Components and Ports

Edges lead from one simple node (called source) to another simple node (called target); the arrowhead points to the target. Complex nodes cannot be addressed directly, but only via their ports. Some kinds of simple nodes can only be the source, some can only be the target of edges. The nodes that can only be sources are:
• Events
• Input Ports of the model itself
• Output Ports of some substructure (gate or subcomponent)

The nodes that can only be targets are:
• Output Ports of the model itself
• Input Ports of some substructure (gate or subcomponent)

The fact that the role of ports depends on whether they belong to a substructure or to the model as a whole may not seem obvious; it is explained in Figure 3.8. The idea behind this is that regarding a gate or subcomponent, information "comes out of" the outputs, but with regard to the CFT currently being modelled, information "comes in" via the inputs.

² In the prototype tools UWG3 and ESSaRel, this reference has been implemented as a link so that referenced models can be opened on the screen by double-clicking the subcomponent symbol.

Figure 3.8: The Direction of Edges with Respect to Different Kinds of Ports
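The source and target rules above can be condensed into a small membership check. The following Python sketch is only an illustration under assumed names: `NodeKind` and `edge_allowed` are hypothetical and do not come from UWG3 or ESSaRel.

```python
from enum import Enum, auto


class NodeKind(Enum):
    """Kinds of simple nodes that can sit at the end of an edge (illustrative names)."""
    EVENT = auto()
    MODEL_INPUT = auto()          # input port of the CFT being modelled
    MODEL_OUTPUT = auto()         # output port of the CFT being modelled
    SUBSTRUCTURE_INPUT = auto()   # input port of a gate or subcomponent
    SUBSTRUCTURE_OUTPUT = auto()  # output port of a gate or subcomponent


# Nodes that can only be sources, and nodes that can only be targets.
SOURCE_KINDS = {NodeKind.EVENT, NodeKind.MODEL_INPUT, NodeKind.SUBSTRUCTURE_OUTPUT}
TARGET_KINDS = {NodeKind.MODEL_OUTPUT, NodeKind.SUBSTRUCTURE_INPUT}


def edge_allowed(source: NodeKind, target: NodeKind) -> bool:
    """Return True if an edge from source to target respects the direction rules."""
    return source in SOURCE_KINDS and target in TARGET_KINDS
```

For instance, an edge from an event to a gate input is accepted, whereas an edge that starts at an output port of the model itself is rejected.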

It is forbidden that two or more edges share the same target; in other words, several edges may start from the same point, but just one edge may end at the same point. This rule ensures an unambiguous semantics: each consequence has exactly one cause; if there are more, they have to be joined explicitly using a gate stating how.³ To express that a consequence depends on several causes, it is necessary to say how it depends on them, which is the purpose of the gates. The uniqueness of the origin of any causal influence is also technically essential when tracing the graph backwards during the analysis algorithm. The fact that two or more edges may have a common source indicates that CFTs, despite their name, which is kept for convenience, are not trees but directed acyclic graphs. The common source of two or more edges semantically corresponds to a "repeated event" in traditional FTA. Figure 3.9 summarises the difference between both cases.

As CFTs are acyclic graphs, it is forbidden that there is any set of nodes and edges that forms a directed cycle (called "shallow cycle"). This rule applies within the same component, but also across component and hierarchy borders. To avoid cycles across different hierarchy levels, edges leading (directly or indirectly) from an output port of a gate or subcomponent to an input port of the same gate or subcomponent are forbidden.⁴ This must be checked in a separate procedure before the proper analysis is started.

³ The terms "cause" and "consequence" are used here without rigour. Their validity has been discussed before.
⁴ This rule could be relaxed for subcomponents if a cycle check across different model hierarchies and additional checks for special properties of the logical structure were implemented. Furthermore, a cycle can be tolerated if it disappears when the CFT is flattened, or if the logical structure can be translated into an equivalent structure without cycles. Apart from this, the question of when cycles in logical circuits are acceptable is broadly discussed in the literature on digital electronic circuits.


Figure 3.9: Multiple Edges from one point (allowed) and to one point (forbidden)
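Both rules lend themselves to a mechanical check. The sketch below is a minimal illustration, not the procedure used in the prototype tools: edges are assumed to be pairs of port or event IDs, and the hypothetical mapping `owner` assigns each port of a gate or subcomponent to its substructure. Treating every substructure as a single vertex mirrors the strict rule stated above, under which any path from an output back to an input of the same substructure counts as a cycle.

```python
from collections import defaultdict


def shared_targets(edges):
    """Return the set of targets that are hit by more than one edge (forbidden)."""
    seen, shared = set(), set()
    for _source, target in edges:
        if target in seen:
            shared.add(target)
        seen.add(target)
    return shared


def has_shallow_cycle(edges, owner):
    """Detect a directed cycle, treating each gate or subcomponent as one vertex.

    owner maps a port ID to the gate or subcomponent it belongs to; events and
    the model's own ports are absent from owner and stand for themselves.
    """
    graph = defaultdict(set)
    for source, target in edges:
        graph[owner.get(source, source)].add(owner.get(target, target))

    in_progress, finished = set(), set()

    def visit(vertex):
        if vertex in in_progress:
            return True   # reached a vertex that is still on the current path
        if vertex in finished:
            return False
        in_progress.add(vertex)
        found = any(visit(nxt) for nxt in graph.get(vertex, ()))
        in_progress.discard(vertex)
        finished.add(vertex)
        return found

    return any(visit(vertex) for vertex in list(graph))


# Hypothetical port IDs of two gates G1 and G2.
owner = {"G1.in1": "G1", "G1.in2": "G1", "G1.out": "G1",
         "G2.in1": "G2", "G2.out": "G2"}
well_formed = [("E1", "G1.in1"), ("E2", "G1.in2"), ("E2", "G2.in1"), ("G1.out", "Pout1")]
cyclic = [("E1", "G1.in1"), ("G1.out", "G2.in1"), ("G2.out", "G1.in2")]
```

In `well_formed`, event E2 is the source of two edges, which is allowed; `cyclic` feeds the output of G1 back into G1 via G2 and is therefore rejected.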

It is forbidden that a component contains itself as a sub-component - directly or indirectly. Violations of this rule (called "deep cycles") must be checked before analysis is started. Shallow and deep cycles are shown in Figure 3.10.

Figure 3.10: Forbidden: Shallow Cycle and Deep Cycle
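A deep cycle can be detected by following the subcomponent references from the top-level model. The following sketch assumes a hypothetical dictionary that maps each component model to the models it references as subcomponents; it is one possible way to perform the check, not necessarily the one implemented in the tools.

```python
def has_deep_cycle(references, root):
    """Return True if a model reachable from root contains itself, directly or indirectly.

    references maps a component model name to the list of models it
    instantiates as subcomponents (an assumed, illustrative encoding).
    """
    on_path, checked = set(), set()

    def visit(model):
        if model in on_path:
            return True    # the model is one of its own ancestors
        if model in checked:
            return False
        on_path.add(model)
        found = any(visit(child) for child in references.get(model, ()))
        on_path.discard(model)
        checked.add(model)
        return found

    return visit(root)


direct = {"C1": ["C1"]}                                  # C1 contains an instance of C1
indirect = {"C1": ["C2"], "C2": ["C1"]}                  # C1 contains C2, which contains C1
nested = {"C1": ["C2", "C2", "C3"], "C2": ["C3"], "C3": []}
```

Instantiating the same model several times, as in `nested`, is not a cycle; only a model that appears among its own ancestors is.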

3.5 Application Aspects

CFTs are used for qualitative or quantitative safety and reliability analyses in the same way as standard FTs. One particular hazard or failure that has been identified before is refined, using the normal logical gates. The top-event appears as output port, as the whole system can later be used as a subsystem of a bigger system. In contrast to traditional FTs, the user can structure the FT by grouping graph elements into components as appropriate. This can be done arbitrarily while drawing the tree, but it is recommended to consult existing architectural models and define the involved components first, before putting in the gates and events, as this leads to a better structure. Subcomponents should be kept small for better understanding and self-contained for later reuse. The analyst must carefully check if a basic failure event really belongs to the component currently being considered, or if it is rather an external cause that could impact other components as well. In the latter case, the event should be placed on a higher hierarchy level or into a different component where it belongs, and in the current component an input port is placed to make the connection to the event. This applies in particular to external failures that might affect more than one subcomponent. Figure 3.11 shows a variation of the example, this time with a common failure that affects both subsystems. For example, in a redundant system with common power supply, the power supply is a common source of failure.

Figure 3.11: Common cause failure (system with CPU1 : CPU, CPU2 : CPU and the event "Power supply failed")
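The effect of such a shared event on the result can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each CPU fails if it has an internal failure or if the power supply has failed, and that the system fails if both CPUs fail; the internal structure of the CPU component and the probability values are hypothetical. It enumerates all event combinations and compares the result with a naive calculation that wrongly treats the power supply failure as two independent events.

```python
from itertools import product


def system_failed(cpu1_internal, cpu2_internal, power_failed):
    """Top-event under the assumed structure: both CPUs fail (AND gate)."""
    cpu1_failed = cpu1_internal or power_failed   # assumed OR gate inside CPU
    cpu2_failed = cpu2_internal or power_failed
    return cpu1_failed and cpu2_failed


def top_event_probability(p_cpu, p_power):
    """Exact probability by enumerating all states of the three basic events."""
    total = 0.0
    for c1, c2, ps in product([False, True], repeat=3):
        if system_failed(c1, c2, ps):
            weight = 1.0
            for state, p in ((c1, p_cpu), (c2, p_cpu), (ps, p_power)):
                weight *= p if state else 1.0 - p
            total += weight
    return total


def naive_probability(p_cpu, p_power):
    """Wrong result obtained if the shared event is counted as two independent ones."""
    p_single = 1.0 - (1.0 - p_cpu) * (1.0 - p_power)
    return p_single * p_single


exact = top_event_probability(p_cpu=0.1, p_power=0.01)   # hypothetical values
naive = naive_probability(p_cpu=0.1, p_power=0.01)
```

With these assumed values the exact probability is about 0.0199, while the naive calculation yields about 0.0119 and thus underestimates the risk, because the Boolean function reduces to "power supply failed OR both CPUs failed internally".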

The figure shows another interesting property of CFTs: CFTs are not trees, but directed acyclic graphs (DAGs). Repeated events, which have often been a source of misunderstanding, are obsolete in CFTs.⁵ Of course, traditional FTs could be extended to DAGs as well to avoid repeated events, but as they do not possess a component system that keeps the number of graph elements per page small, this would result in a network that is usually too complex to be understood.⁶ The fact that CFTs are DAGs also makes it possible to have not just one top-event, but several that may share parts of the graph. This allows modelling different failure modes of a component that may partly depend on each other. As top-events in CFTs actually are output ports, a multi-rooted graph can model a component with several failure modes, and on a higher level, one or more of these can be referenced.

⁵ Some tools do not respect repeated events at all, some automatically assume events to be repeated if they have the same name (which could be the case for other reasons), some require the user to check a certain field or to enter lists of repeated events manually.
⁶ The tool UWG3 offers an option to display the graph as a regular tree instead, duplicating events as necessary and showing repeated events in a different colour. This has been implemented as a help for analysts familiar with traditional FTs.

Another essential property of CFTs is the hierarchical ID system that avoids name conflicts, even if the analysts of different components use the same names: each component constitutes a namespace, and when instantiating a component several times, each local ID is prefixed with the instance ID. For example, event E1 within component model C2 becomes SC1.E1 if this component model is instantiated for the first time, SC2.E1 for the second time and so on.⁷ If supported by the tool, CFTs can, once they have been defined, be reused in different files and projects. The prototype implementation UWG3 uses an XML format to store component models. Repositories can be built and archived. The concept can further be exploited if CFTs are automatically integrated based on the system architecture model. An approach has been presented in [GK05].
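The prefixing scheme can be illustrated in a few lines. The helper `qualified_id` below is a hypothetical name introduced only for this sketch; it joins the chain of instance IDs and the local ID with dots, as in the SC1.E1 example.

```python
def qualified_id(instance_path, local_id):
    """Prefix a local ID with the chain of instance IDs that leads to it."""
    return ".".join([*instance_path, local_id])


# Event E1 of component model C2, instantiated twice as SC1 and SC2.
first = qualified_id(["SC1"], "E1")    # "SC1.E1"
second = qualified_id(["SC2"], "E1")   # "SC2.E1"
```

Because the two instances yield different qualified IDs, their internal events remain distinct even though both stem from the same local ID.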

3.6 Analysis of Component Fault Trees

3.6.1 Introduction

As CFTs are semantically the same as traditional Fault Trees, the established combinatorial analysis approaches still work. However, some misconceptions must be avoided: traditionally, an FT could be calculated bottom-up, although this was not the most efficient way to do it. The output of each gate acted like a (compound) event and had a probability assigned to it. The next step of the calculation could be performed without knowing any details of the sub-tree below. This technique is not applicable for CFTs - unless the CFT is completely flattened before analysis, which is not the purpose. In CFTs, subcomponents do not have a probability assigned to them. They rather represent a set of Boolean formulas (one for each output port) from the input ports and the internal events to the output ports.

It can be shown that a modification of the BDD-based⁸ algorithm presented in Section 2.4.3 fits this concept very well: each of the Boolean formulas is encoded in a BDD fragment and each subcomponent is represented by a multi-rooted BDD fragment. It is possible to do the translation, and even some optimisation steps, on component level, without flattening the tree. The optimised BDD fragments can be reused each time a component is instantiated somewhere. Optionally, they can be stored persistently along with the model in order to avoid translation the next time the component is referenced. If a component is a module (this is always the case if it has no input ports and exactly one output port), then the quantitative evaluation can be performed on component level and for the analysis of the super-component the numerical result is used directly. The modifications we made to the BDD-based algorithm in order to support the component concept are summarised in the next section.

⁷ The tools UWG3 and ESSaRel distinguish IDs, which are handled by the tools, and labels, which are given by the user. As the latter are not used for identification, there are no problems with name conflicts anyway.
⁸ More precisely: reduced ordered BDDs.

3.6.2 The BDD-Based Analysis Algorithm Adapted to CFTs

The proposed analysis algorithm has been described in detail in the Master's thesis [Zoc05], which was carried out in the context of this doctoral thesis.⁹ Basically, for each component a multi-rooted BDD fragment set is created (translation) and afterwards the fragment sets are cloned for all involved component instances and then integrated (flattening). Multi-rooted means that there is not one entry point as for common BDDs, but as many as there are output ports and subcomponent input ports in the CFT model being translated. The name fragment is used because the BDDs are not complete on their own, but may contain references to external BDD fragments that are only resolved during flattening. The resulting flat BDD after flattening is analysed by the BDD-based algorithm that has been presented in Section 2.4.3.

The translation of the CFT into BDD fragments consists of several steps, which are described in the following. Before starting, some preconditions must be checked:
• All referenced component models must be present at analysis time.
• They must be consistent in the sense that the component contents must fit the footprint of the black box by which the component is represented, i.e., have the same number of input and output ports.
• They must be correct, i.e., obey the rules given in Section 3.4 (e.g., be acyclic).

First, the nesting hierarchy of the components must be checked. To this end, a data structure called Containment Tree is created, which reflects the nesting of all components that are recursively referenced from the top-level system. The root of this tree is the top-level system (the component for which the user starts the analysis). The nodes immediately below are the component classes that are instantiated as immediate subcomponents of the main system. Below them, their subcomponents follow and so on. The terminal nodes (leaves of the tree) are component classes without subcomponents. When building this tree, the models of all involved component classes are loaded into memory,¹⁰ if this has not been done before, and the presence and correctness of the nesting hierarchy has to be assured. Apart from the preconditions mentioned above, it is forbidden for a component to refer to itself as a subcomponent or for a set of components to refer to each other, directly or indirectly (called deep cycle). This means, in other words, that the Containment Tree must actually be a true tree structure, in particular, an acyclic graph.

⁹ This algorithm has been implemented and successfully evaluated in our prototype tool ESSaRel. In the predecessor tool UWG3, the components only served as a means of better representing the Fault Tree to the user; the components had not been exploited for analysis and the tree was traversed and translated across component boundaries instead.
¹⁰ This is a technical necessity in the prototype tools UWG3 and ESSaRel, because each file can contain several components that are not necessarily in memory all the time, and components belonging to the same project can even spread across several files that are loaded only for analysis.


As an example, consider the analysis of a system that is an instance of component class C1. It contains two subsystems SC1 and SC2, which are both instances of component C2, plus three subsystems SC3, SC4 and SC5, all instances of component C3. Component C2, which is referenced twice as subcomponent within C1, has itself a subcomponent SC1 of component C3. This means that both instances SC1 and SC2 that appear within the system have their personal instance of C3 and that these instances do not depend on each other in any way. The fact that both are called SC1 is not a problem, because both subcomponents of the top-level system constitute different namespaces. The fully qualified ID is C1.SC1.SC1 in one case and C1.SC2.SC1 in the other. Figure 3.12 shows a glass box schematic of the nesting structure.

Figure 3.12: Hierarchical Example System: Nesting Structure

Figure 3.13 shows the corresponding Containment Tree. This tree can be consulted for two different kinds of information:
1. The component models that are involved (C3, C2, C1 in this example), and thus have to be translated
2. The flattening steps that will have to be performed (C3 as SC1 into C2, C2 as SC1 into C1, C2 as SC2 into C1, C3 as SC3 into C1, C3 as SC4 into C1, C3 as SC5 into C1)

To access both kinds of information technically, an appropriate class named ContainmentTree has been implemented, which provides access methods for traversal of the tree. The different access methods provide either a list of all components to be translated or a list of all insertion operations to be performed during flattening. It is possible to iterate over this list by a foreach loop. A standard depth-first post-order tree traversal algorithm has been used within these methods.

In the translation step, BDD fragments for all components in the containment tree are constructed by the following rules: For each component, a multi-rooted BDD fragment is generated; the entry points, i.e., the references by which the top node of the BDD can be accessed, are
• output ports of the component
• input ports of subcomponents of the component.

Figure 3.13: Hierarchical Example System: Containment Tree
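The two lists obtained from the Containment Tree of the example can be reproduced with a short depth-first post-order traversal. The sketch below is not the ContainmentTree class of the prototype tools; the dictionary `MODELS` and both function names are assumptions made for illustration, and the input is assumed to be free of deep cycles.

```python
# Hypothetical encoding of the example: model -> list of (instance ID, referenced model).
MODELS = {
    "C1": [("SC1", "C2"), ("SC2", "C2"), ("SC3", "C3"), ("SC4", "C3"), ("SC5", "C3")],
    "C2": [("SC1", "C3")],
    "C3": [],
}


def models_to_translate(models, root):
    """List every involved model once, in depth-first post-order (children first)."""
    order = []

    def visit(model):
        for _instance, child in models[model]:
            visit(child)
        if model not in order:
            order.append(model)

    visit(root)
    return order


def flattening_steps(models, root):
    """List the insertions as (model, instance ID, enclosing model) triples."""
    return [(child, instance, parent)
            for parent in models_to_translate(models, root)
            for instance, child in models[parent]]


translation_order = models_to_translate(MODELS, "C1")   # ["C3", "C2", "C1"]
insertions = flattening_steps(MODELS, "C1")
```

The first entry of `insertions` is `("C3", "SC1", "C2")`, read as "C3 as SC1 into C2"; the post-order guarantees that a model is completed before it is inserted into an enclosing model.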

As each BDD fragment can have more than one entry point, the BDD substructures belonging to one component may be unconnected; however, it may be the case that the substructures have some nodes and edges in common. In any case, identical semantical units (Fault Tree event or port) are mapped to the same BDD node; the CFT model element ID is relevant for considering units to be identical. Consequently, there is no confusion about "repeated events" as in the case of some traditional FTA algorithms. All BDD fragments are stored in a common memory space and their entry points can be accessed by named pointers indicating the CFT port ID they belong to. Basic events found in the same CFT instance, called local variables, map to standard BDD nodes. For subcomponent output ports and the component's own input ports, specially labeled BDD nodes are created, which are named external variables or reference nodes and which will be replaced by an inserted BDD fragment during flattening; these reference nodes bear the name of the port they refer to. After flattening, no more reference nodes are left in the BDD and there is only one connected BDD for the system.

Figure 3.14 shows the BDD fragments that are generated for the components of the CFT example in Figure 3.4. The first two fragments belong to C1 and are thus members of the same fragment set. In this example, the two fragments of C1 happen to be unconnected; in practice they could be connected as well, so there would be a multi-rooted decision diagram instead of a tree.¹¹ Each fragment belongs to a model output or a subcomponent input, as these are the top-events of local Fault Tree fragments: the first BDD belongs to C1.Pout1, the second to C1.SC1.Pin1 and the third to C2.Pout1. In the internal data structure, the ID of the port being represented must be stored. The nodes with IDs ending in Ei (where i is a natural number) are basic events in the sense of FTA.
Probability values for each point in time are assigned to them according to the specified distribution (as in traditional FTA, the probability distribution and its parameters must be known in advance for each basic event). The nodes with IDs ending in Pini or Pouti are reference nodes that will be resolved during flattening. Depending on the BDD library used, the BDD fragments are reduced immediately while they are built. The probabilities of basic events are annotated to the BDD edges (the probability p to the true branch and the complementary probability 1-p to the false branch). When probability distributions over time have to be considered (e.g., exponential distribution), this can be achieved by a discrete-time approach that samples the function in equidistant time intervals. In this case, a vector of probability values is annotated instead of a simple number. The translated BDD fragments are now ready for use in the BDD algorithm and can be instantiated several times, as often as the component appears as a subcomponent somewhere. Optionally, they can be serialized and stored persistently for later reuse.

¹¹ Technically speaking, all BDDs currently in use by the library are connected anyway, because the terminal nodes 0 and 1 exist just once in the whole memory space.

Figure 3.14: BDD Fragments to the Entries C1.Pout1, C1.SC1.Pin1 and C2.Pout1 of the Example in Figure 3.4

After all involved CFT models have been translated into BDD fragments, the flattening procedure takes place. Its purpose is to assemble the fragments on all hierarchy levels to the final flat system level BDD. The procedure consists of a series of clone and compose operations and is guided by the Containment Tree. BDD fragments that are instantiated several times have to be cloned for each instance, but no additional translation is necessary. During the clone procedure, the IDs have to be adapted. The component ID prefix ("C2" in the example) is replaced by the instance prefix (for the insertion of SC1 into C1, this would be "C1.SC1", because in this case, C1 references C2 as a subcomponent with the ID SC1). Thus, in the example, "C2" is replaced by "C1.SC1" each time it occurs. This procedure has to be done for each insertion step individually, if more than one insertion has to be performed. Suppose that C1 in the example had three subcomponents SC1, SC2, SC3 that all refer to C2; then C2 would have been cloned three times, and in the first clone, all ID prefixes are named C1.SC1, in the second C1.SC2 and in the third C1.SC3.
Figure 3.15 shows the cloned and renamed fragment set of subcomponent SC1 (in this example, only this fragment had to be cloned and renamed, because it is the only fragment belonging to a subcomponent; as it is referenced just one time, the same effect could have been obtained by just renaming without cloning). The label of the fragment entry point has to be adapted as well, i.e., the entry point to the clone of the third fragment is now labeled C1.SC1.Pout1 instead of C2.Pout1. Note that basic events in different instances of the same components are considered as being distinct, i.e., stochastically independent; however, they share the same probability distribution and parameters, because these are cloned as well.



Figure 3.15: BDD Fragments C1.Pout1, C1.SC1.Pin1 and C1.SC1.Pout1 after Cloning C2 to C1.SC1
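The clone-and-rename step can be sketched on a toy representation in which a BDD node is a tuple (label, false branch, true branch) and the terminals are the integers 0 and 1. This representation, the function names and the content of the C2 fragment are assumptions made for illustration; the actual fragment is the one shown in Figure 3.14, and a real BDD library would provide its own data structures.

```python
def replace_prefix(label, old_prefix, new_prefix):
    """Replace a leading component ID by the instance prefix, e.g. C2.E1 -> C1.SC1.E1."""
    if label.startswith(old_prefix + "."):
        return new_prefix + label[len(old_prefix):]
    return label


def rename(node, old_prefix, new_prefix):
    """Return a copy of a fragment with old_prefix replaced in all node labels."""
    if isinstance(node, int):   # terminal 0 or 1
        return node
    label, low, high = node
    return (replace_prefix(label, old_prefix, new_prefix),
            rename(low, old_prefix, new_prefix),
            rename(high, old_prefix, new_prefix))


def clone_fragment_set(fragments, old_prefix, new_prefix):
    """Clone all fragments of one component, relabelling nodes and entry points."""
    return {replace_prefix(entry, old_prefix, new_prefix): rename(top, old_prefix, new_prefix)
            for entry, top in fragments.items()}


# Assumed content of the C2 fragment set: Pout1 = E1 AND Pin1.
c2_fragments = {"C2.Pout1": ("C2.E1", 0, ("C2.Pin1", 0, 1))}
sc1 = clone_fragment_set(c2_fragments, "C2", "C1.SC1")
sc2 = clone_fragment_set(c2_fragments, "C2", "C1.SC2")
```

Each clone gets its own entry point label (`C1.SC1.Pout1`, `C1.SC2.Pout1`) and its own event IDs, while the original fragment set of C2 stays unchanged and can be cloned again.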

The last step is to compose the BDDs by resolving the ID labels of each fragment and replacing nodes that have the ID of a fragment with the fragment itself. This insertion is performed by the compose operation, which is a standard operation on BDDs and a built-in feature of many BDD program libraries. In the example, the left fragment has a node called C1.SC1.Pout1, which is also the label of the right fragment. So the right fragment is substituted for that node of the left fragment. The terminals 1 and 0 of the inserted fragment are connected to the true and false successors of the replaced node, respectively. Next, C1.SC1.Pin1, which was formerly in the third fragment and is now in the integrated first fragment, will have to be replaced by the second fragment, because it has the same label C1.SC1.Pin1. This way, each external variable is replaced with its corresponding sub-BDD fragment and only the real FT events survive. As a result, a standard BDD has been generated, the event probabilities are given, and the usual BDD-based computation of the top-event probability is started for the desired output port (i.e., the top-event). In the example, it is easy to see that the BDD represents a chain of AND conjunctions of four events. Applying the true-branch probabilities from the example above, the result is again 0.0024, as provided by the other algorithms discussed so far.
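The substitution and the subsequent probability calculation can be sketched as follows. Several assumptions apply: nodes are tuples (label, false branch, true branch) with terminals 0 and 1; the three fragments are plausible stand-ins for those of Figure 3.16, not a reproduction; and the four probabilities are hypothetical values chosen so that their product equals the 0.0024 quoted above. The naive `substitute` shown here is only valid when the inserted fragment shares no variables with the rest of the diagram; the compose operation of a real BDD library handles the general case and keeps the result ordered and reduced.

```python
def graft(fragment, on_false, on_true):
    """Copy fragment, redirecting its terminals 0 and 1 to on_false and on_true."""
    if fragment == 0:
        return on_false
    if fragment == 1:
        return on_true
    label, low, high = fragment
    return (label, graft(low, on_false, on_true), graft(high, on_false, on_true))


def substitute(node, reference, fragment):
    """Replace every reference node labelled reference by fragment."""
    if isinstance(node, int):
        return node
    label, low, high = node
    low = substitute(low, reference, fragment)
    high = substitute(high, reference, fragment)
    if label != reference:
        return (label, low, high)
    return graft(fragment, low, high)


def probability(node, p):
    """Top-event probability: p on the true branch, 1 - p on the false branch."""
    if isinstance(node, int):
        return float(node)
    label, low, high = node
    return p[label] * probability(high, p) + (1.0 - p[label]) * probability(low, p)


# Assumed fragments after cloning C2 to C1.SC1 (each one an AND of two inputs).
fragments = {
    "C1.Pout1": ("C1.E1", 0, ("C1.SC1.Pout1", 0, 1)),
    "C1.SC1.Pin1": ("C1.E2", 0, ("C1.E3", 0, 1)),
    "C1.SC1.Pout1": ("C1.SC1.E1", 0, ("C1.SC1.Pin1", 0, 1)),
}

flat = fragments["C1.Pout1"]
for reference in ("C1.SC1.Pout1", "C1.SC1.Pin1"):
    flat = substitute(flat, reference, fragments[reference])

# Hypothetical basic event probabilities.
p = {"C1.E1": 0.1, "C1.SC1.E1": 0.2, "C1.E2": 0.3, "C1.E3": 0.4}
result = probability(flat, p)
```

After both substitutions, `flat` is a chain of four AND-connected event nodes without reference nodes, and `result` is approximately 0.0024 for the assumed values.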

3.6.3 Reduction of Analysis Effort by Exploiting the Component Concept

When the CFT concept was first proposed in the beginning of the research period of this thesis, it was proposed as a modelling aid [KLM03]. However, later a way was found to exploit it also for a reduction of the analysis effort [KZ05]. First, as safety-critical and reliable systems often incorporate redundancy, it is likely for components to appear more than once in the system. The time to translate and reduce the BDD is spent only once; all later occurrences just require cloning the BDD. This saves a significant part of the computational effort. This idea can be driven further if the BDD fragments are not only kept in memory during one analysis, but are stored persistently. This is not difficult if the software package that does the BDD construction and simplification offers a way to serialize and read back BDD structures. The BDD fragments are stored in the model file (the UWG3 and ESSaRel tools use an XML format) and read back each time the component is referenced in some CFT model. It is important to store the date and time of the last BDD translation along with the BDD, in order to check if the translation is still valid or if the original CFT has been edited meanwhile. In the latter case, a new translation would be necessary. This procedure is similar to the build strategy for software used by a make tool: use object files, if available and if the source file has not been changed, otherwise, compile the source file. So only the components that have recently been modified would have to be translated into a BDD. If only numerical parameters have been changed, no translation would be necessary. This can be further optimised by translating CFTs in a background thread while the user edits other components. Using this strategy, there is a high chance that when the user starts the analysis, only the currently edited CFT has to be translated and all other involved models are already available as BDDs. Of course, flattening and calculation are only possible after all CFTs have been translated into BDD fragments and must be repeated after each modification.

Figure 3.16: BDD Fragments with Prefixed IDs (left) and after Composition (right)

Recently we suggested an additional promising optimisation approach in [KZ05]: the components as defined by the analyst are often a good heuristic for defining areas for variable order optimisation. A known problem with the BDD approach is that the same Boolean formula can be encoded in different reduced ordered BDDs, because the variables can be ordered differently. Depending on the structure of the Boolean formula, the size of the BDD may depend heavily on the chosen variable order. Without explicit optimisation steps, the variable order depends on the (arbitrary) order in which the different basic events are encountered when traversing the Fault Tree, e.g., depth first from left to right. This order is not necessarily a good one.
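The influence of the variable order can be demonstrated on a small formula. The sketch below counts the internal nodes of the reduced ordered BDD by brute-force Shannon expansion, which is only feasible for a handful of variables; the function `robdd_size` and the example formula (a system that fails if both parts of any of three redundant pairs fail) are illustrative assumptions, not part of the thesis tools.

```python
def robdd_size(f, order):
    """Count the internal nodes of the reduced ordered BDD of f under the given order.

    f takes a dict {variable name: bool}. Terminals have the IDs 0 and 1.
    """
    unique = {}   # (variable, low ID, high ID) -> node ID

    def build(i, env):
        if i == len(order):
            return int(bool(f(env)))          # terminal 0 or 1
        var = order[i]
        low = build(i + 1, {**env, var: False})
        high = build(i + 1, {**env, var: True})
        if low == high:                       # redundant test: no node needed
            return low
        key = (var, low, high)
        if key not in unique:                 # share isomorphic subgraphs
            unique[key] = len(unique) + 2
        return unique[key]

    build(0, {})
    return len(unique)


def pairs_fail(e):
    """Top-event: (a1 AND b1) OR (a2 AND b2) OR (a3 AND b3)."""
    return (e["a1"] and e["b1"]) or (e["a2"] and e["b2"]) or (e["a3"] and e["b3"])


pairwise = robdd_size(pairs_fail, ["a1", "b1", "a2", "b2", "a3", "b3"])
grouped = robdd_size(pairs_fail, ["a1", "a2", "a3", "b1", "b2", "b3"])
```

Keeping the two events of each pair adjacent yields 6 nodes, whereas listing all a-events before all b-events yields 14 nodes for the same function. The gap grows exponentially with the number of pairs, which illustrates why events that belong to the same component are good candidates for adjacent positions in the order.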
Finding the variable order that leads to the smallest BDD has been proven to be an NP-complete problem [BW96]. There are different proposals on how to improve the variable order with acceptable effort. Some approaches change the position of neighbouring variables, some optimise the order within one fixed-size window of adjacent variables, and some apply different heuristics to analyse the graph. These approaches do not profit from the analyst's knowledge about the system architecture. The CFT technique encourages analysts to structure the FT according to the architecture of the system; a good architecture most likely groups parts together that have strong causal connections in between them and separates parts with loose coupling, where coupling can be measured by the number of connecting edges for this purpose. Thus, one hypothesis is that it is efficient to optimise the variable order in each BDD fragment belonging to a CFT component. This is manageable, as the number of variables in each component is usually small, so the reduction is feasible. The resulting variable order for the whole system is not necessarily optimal, but can be expected to be a reasonable one in terms of effort / effect ratio. Again, the optimisation effort is only necessary once per component and pays off each time the component is used as a part of some system. First experiments have examined different variants of this approach and the initial results looked promising; however, this approach has not been fully elaborated within this thesis.
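The make-like reuse of stored BDD fragments described earlier in this section amounts to a timestamp comparison per component. The following sketch uses a hypothetical record type `StoredModel`; the field names are assumptions and do not reflect the XML format of UWG3 or ESSaRel. Only structural edits are tracked, because changes to numerical parameters do not require a new translation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StoredModel:
    """Hypothetical record of a component model and its cached BDD translation."""
    name: str
    structure_edited: datetime                  # last change to gates, events or edges
    bdd_translated: Optional[datetime] = None   # None if no BDD has been stored yet


def needs_translation(model: StoredModel) -> bool:
    """A stored BDD is reused unless the CFT structure changed after it was built."""
    return model.bdd_translated is None or model.bdd_translated < model.structure_edited


def models_to_rebuild(models):
    """Names of all models whose BDD fragments have to be translated again."""
    return [m.name for m in models if needs_translation(m)]


models = [
    StoredModel("C1", datetime(2005, 3, 2, 10, 0), datetime(2005, 3, 1, 9, 0)),  # edited later
    StoredModel("C2", datetime(2005, 2, 1, 8, 0), datetime(2005, 2, 1, 8, 30)),  # up to date
    StoredModel("C3", datetime(2005, 1, 15, 12, 0)),                             # never translated
]
stale = models_to_rebuild(models)
```

Here only C1 and C3 are translated again, while the stored fragment of C2 is reused; flattening and calculation still have to be repeated in any case.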

3.7 Evaluation and Discussion

Like ordinary FTs, CFTs are not suitable as a paper and pencil method. They require a graphical editor to draw the trees plus an analysis backend for quantitative evaluation, preferably exploiting the component concept as described above. To gain practical experience with CFTs, the tool UWG3 has been developed. For details on the history and on the features and usage of the tool, see Chapter 6. The UWG3 tool has been used to perform the experiments for this thesis and for the cited publications, but it has also been used for real industrial projects at different departments of Siemens AG in Munich and Karlsruhe. There, experienced analysts used UWG3 in the domains of power generation, railway, and military projects. The analysts took some time and some consultancy to get familiar with the new component method and with the tool. Soon they began to appreciate the component concept for its better structuring of large projects and for the reuse capabilities that save a big part of the editing time. Although at that time, many optimisation strategies described in Section 3.6.3 were not yet implemented, the analysis speed far outperformed the predecessor tool UWG2. Initial problems with the application of the CFT technique were generally due to bugs and missing features of the tool and not due to the technique itself. The fact that CFTs look slightly different from standard FTs hampered their acceptance in the beginning, but soon the technicians got used to them. The practical experience shows that the component concept that CFTs offer is useful, intuitive, and applicable in practice. It allows partitioning CFTs in order to support division of labour and component reuse on the interface-definition level, and thus complies with the current state-of-the-art approaches in software engineering. Thus, CFTs provide better integration with the software development process than standard FTs.



As CFTs semantically correspond to standard FTs and can be transformed into standard FTs, the CFT technique is applicable in domains where FTA is required. On the other hand, being semantically equivalent to standard FTs means that they are restricted to combinatorial logic as well and the weaknesses regarding the required expressive power for software-controlled systems are still the same. This suggested further research into improvements to FTA and safety analysis. CFTs are an applicable technique on their own, but with respect to the goals of this thesis, they have to be considered as an intermediate result, as they only address the issue of compositionality.

Chapter 4 State/Event Fault Trees (SEFTs)

4.1 Motivation

In Section 2.3.3 three requirements for Safety and Reliability analysis techniques in the software-controlled systems domain have been mentioned:
• Compositionality
• Fitness to express typical behavioural scenarios in software-controlled systems
• Integration into an overall system development process.

As shown so far, CFTs, which were the first outcome of the doctoral research, augment FTs by an appropriate component structure and, thus, offer a solution to the first requirement. Moreover, they partly facilitate process integration by providing the same architecture in the safety analysis as in system design. On the other hand, the last section mentioned that CFTs are still a combinatorial modelling technique. Accordingly, they do not change the semantics of FTs and leave the mentioned issues of missing expressive power open. In consequence, the full semantic integration with design models, in particular with state-based models, cannot be achieved with CFTs either. Enhancing the expressive power requires to some degree a formalisation of the present semantics. The semantics of FTs is not easy to capture, because the discussion in Section 2.4.4 revealed that traditional FT events are used with different (and sometimes fuzzy) semantics. In practice, FTA events can be failed states of components at a given point of time, e.g., ”safety valve is defective”, or sudden events like ”boiler explodes”, but also complex natural language propositions, such as ”operator does not detect stuck safety valve early enough”. Without a clear explanation of what the model elements really mean, the desired integration with formal methods is impossible to reach. Also, the semantics of gates and of the technique as a whole can be understood in different ways. Moreover, taking FTs for graphical representations of Boolean formulas, they have neither a causal semantics nor can they express state dependencies. Evolution of a system can partly be expressed, but only under the discussed side conditions. An outcome of the discussion in Section 2.7 was that an appropriate description of failure scenarios on a more formal level, including state dependencies, temporal order and duration of states, requires a state-based description technique. The usage of pure state-based modelling techniques for safety analysis has the disadvantage that models become too complex and causal chains are not as easy to follow as in FTA or ETA. The acceptance of the FT notation in practice, as well as its appropriateness to model failure and hazard stories, can be agreed upon and its advantages should be preserved. Consequently, it is a promising research project to seek an augmented notation that preserves the style of FTs, but incorporates state-based semantics. The resulting formalism would serve as a description technique for safety- or reliability-critical scenarios in the context of embedded systems. With a mapping to well-defined probabilistic models it would allow for probabilistic analysis. If it covers the semantics of traditional FTA and also probabilistic and deterministic state-based techniques, it could even be exploited as an ”umbrella technique” that enables the integration of causality-centred techniques (FTA, ETA) with state-based techniques (Markov Chains, Statecharts). Developing a technique that fulfils these requirements was the research challenge of this doctoral research. The following sections present State/Event Fault Trees (SEFTs) as a response to this challenge and constitute the main result of this thesis. In the following sections, SEFTs are first informally introduced, then their modelling elements and correctness rules are listed. Some application examples are given. The quantitative analysis of SEFTs is the subject of the separate Chapter 5.

4.2 Informal Introduction to SEFTs

State/Event Fault Trees are a visual notation that integrates elements from state-based models with FTs. The main difference to traditional FTs is the visual distinction of states and events. SEFTs are targeted at safety and reliability analysis, with the focus being on the representation of causal chains, as in standard FTs. The underlying intention is to create an intuitive, but unambiguous and analysable modelling technique for industrial practitioners, and to hide the formalism behind it. Mathematical notations that would be hard to understand for practitioners are avoided wherever possible. The modelling elements are adopted from traditional FTA and from Statecharts (and their variants like ROOMcharts or UML 2.0 State Diagrams) because these techniques are widely used in industry. Regarding the underlying semantics, SEFTs are an extended state-machine model and no longer a purely combinatorial model. The semantics is closer to the ROOM method [Sel94] than to original Statecharts [Har87], since the latter do not explicitly support a component concept with defined communication (or triggering) relations between components. In contrast to Statecharts or ROOMcharts that depict transitions by directed arcs, an explicit event symbol is introduced. This is necessary in order to allow causal edges from events or states to other events for depicting cause-effect relations (triggering and guarding). These causal edges correspond to the edges in traditional FTs and can be connected by logical gates, as usual in FTA.


The component concept developed for CFTs has been further developed for SEFTs: each system (or top-level component) may be decomposed into subcomponents. Components are self-contained and concurrently acting entities. Components are connected via ports that are typed as state ports and event ports in SEFT. This abstraction can generally be applied to both software and hardware; thanks to the probabilistic modelling elements, even system users or scenarios in the system environment could be modelled as components. There is a difference between the component class that describes all components of that same type and the individual component instances. A SEFT model refers to a component class.1 SEFTs add state-machine elements to FTs and are based upon a finite state approach. State in this context means an abstraction of all (safety or reliability) relevant variable properties of a component onto a finite number of equivalence classes. This finite state approach is usually sufficient to describe safety or reliability relevant behaviour of technical systems.2 For example, if the altitude of an aircraft is a relevant property, it is sufficient for safety analysis purposes to distinguish the states {too low, acceptable, too high}. At each point in time, a component is in exactly one of a finite set of states, called the active state, and stays in that state for some interval of time.3 So for any point in time t the proposition ”Component C is in state S at time t” is either true or false. Any such proposition is called a state expression or condition. State expressions may be atomic or composed by means of the standard Boolean connectives (AND, OR, NOT...). When performing the probabilistic analysis of a SEFT, a probability in the interval [0,1] for each given point of time can be assigned to a state expression. States are graphically represented by the customary rounded rectangles that are found in Statecharts and most other notations. 
Transitions between states are atomic and happen in zero time. The collective term event denotes all instantaneous phenomena, comprising state transitions as well as spontaneous, stochastic events in the environment that usually have to be considered in safety analysis. The term event actually means a class of similar events. This has to be distinguished from the specific occurrence, characterised by a pair of an event and a point of time. SEFTs are defined on a continuous time scale.4 Simultaneous occurrences of unrelated events are not considered, as their probability approaches zero.5

1 In the following, the term component is used for both, depending on the context. Where necessary, an explicit distinction is made by stating ”component class” or ”component instance”.
2 Of course, there are limitations: In some cases, hybrid models would be desirable but they come at a much higher cost, as will be discussed in Chapter 7.
3 Usually a non-zero interval; however, extension to zero length intervals (i.e., immediate state changes) is not a problem, as long as the number of state changes in any time interval is finite.
4 This does not exclude the option to use discrete time algorithms to evaluate them in an actual technical tool implementation.
5 Not addressing the issue of how to deal with simultaneous events is a semantic weakness in the present state of the technique, as will be discussed in the evaluation chapter. In particular, two independent events that are deterministically triggered after a given time can occur simultaneously, if the same time parameter is chosen for both events. In this case, the reaction of the system can be undefined.



There is a visual difference between SEFT state-machine elements and other notations such as Statecharts: In most graphical representations of finite state-machines, there is no explicit symbol for events; state transitions are represented by directed arcs joining two states. The trigger / action relations are annotated to these arcs in textual form. As these causal relations are the main substance of an FT, these should be represented by (a distinguishable kind of) edges as well: the causal edges. To depict causal relations between events by edges, a solid bar is introduced as a dedicated event symbol, which can serve as source or target for a causal edge. Consequently, in SEFTs there are two different and graphically distinguishable kinds of edges: temporal edges that join source state - transition event - successor state and causal edges, which lead from the triggering event or guarding state expression to the consequence event. Causal edges are denoted as lines with solid arrowheads, whereas temporal edges are lines with light arrowheads. SEFTs consider three different reasons for a state transition to occur: 1. the state changes deterministically when a certain sojourn time in a state has elapsed, or 2. it changes probabilistically according to an exponential distribution of the sojourn time, or 3. it changes because some other event caused (triggered) the state change. Deterministic timing is usually found in software programs or in processing steps in technical systems, in the safety and reliability context also in terms of periodical inspection or replacement. Probabilistic timing is used to model operational profiles, hardware or software failures,6 communication through non-deterministic channels, or spontaneous effects in the system environment, like external failures or human behaviour. Triggering is what usually happens if components request actions from other components or when a system reacts to signals from its environment. 
Causal edges denote the relation between triggering event and triggered event. Any event is allowed to trigger one or more other events. The semantics of triggering a state transition in SEFTs is that when the triggering event occurs, the triggered transition also occurs without explicit delay, provided that it is enabled. Enabled means that the predecessor state of the transition to be triggered is active and that all present guards are true.7 If the triggered transition is not enabled, the trigger signal is discarded, i.e., not stored for later use. Triggering does not incorporate synchronisation in terms of inhibiting the triggering event from occurring when the triggered transition is not feasible. As incoming causal edges depict triggering, deterministically delayed or stochastic transitions must not have incoming causal edges from other events (they may have edges from states, denoting guards). Instead, the delay or occurrence rate is written next to the event symbol. Figure 4.1 shows the three different reasons for events to occur: deterministic delay, exponentially distributed delay, and triggering.8

6 Whether or not software failures caused by faults in the implementation (bugs) should be included in a safety analysis will be discussed in the evaluation section.
7 Guards will be explained below.

Figure 4.1: The Different Ways of Event Occurrence: a) Probabilistic Event, b) Deterministic Event, c) Triggered Event (upper)
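The triggering semantics described above (a triggered transition fires only if it is enabled, otherwise the trigger signal is discarded) can be illustrated by a minimal Python sketch. The class and attribute names (`Component`, `Transition`, `guards`) are hypothetical and chosen for illustration only; they do not reflect the actual UWG3 or ESSaRel implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Transition:
    """A triggered transition: source state, target state, optional guards."""
    source: str
    target: str
    # Each guard is a state expression, modelled here as a Boolean callable.
    guards: List[Callable[[], bool]] = field(default_factory=list)


class Component:
    """A component with exactly one active state at any time."""

    def __init__(self, name: str, initial_state: str) -> None:
        self.name = name
        self.state = initial_state

    def is_enabled(self, t: Transition) -> bool:
        # Enabled: predecessor state is active and all guards are true (ANDed).
        return self.state == t.source and all(g() for g in t.guards)

    def trigger(self, t: Transition) -> bool:
        """Fire t without delay if enabled; otherwise discard the signal."""
        if self.is_enabled(t):
            self.state = t.target
            return True
        return False  # the trigger is not stored for later use


lamp = Component("lamp", "Off")
switch_on = Transition("Off", "On")
first = lamp.trigger(switch_on)   # enabled: the lamp changes to On
second = lamp.trigger(switch_on)  # not enabled any more: signal is discarded
```

In this sketch the second trigger has no effect because the lamp has already left the predecessor state; the triggering side is not blocked, which mirrors the statement that triggering does not incorporate synchronisation.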

States cannot trigger transitions as events can, but they can allow or inhibit them. A state condition allowing a transition to occur is called a guard. The influenced transition can only occur when all guard conditions evaluate to true; so if there is more than one guard condition, they are ANDed. Both causal relations and their transfer across component boundaries via typed ports are exemplified in Figure 4.2. For explanatory purposes, this figure is shown in a glass-box view where the subcomponents unveil their internals, which would normally be modelled on a separate page. The distinction between trigger and guard is made by the distinction of whether an event or a state is connected. The triggered subcomponent SC3 has no information about what element is actually the starting point of the triggering edge, as this is part of a foreign component. However, the distinction of state ports and event ports makes sure that it must be an event and thus preserves the consistency. The meaning of the example is that the state change in SC3 only occurs if the (probabilistic) state change in SC1 occurs while SC2 is in state S2 (i.e., its probabilistic state change happened to occur before). In the case of deterministically delayed transitions, time is halted and reset while the guard condition is false and the transition occurs only after the delay time when the condition becomes true. For exponential transitions, the same considerations apply, but they are of less relevance since the exponential distribution is memoryless anyway. Like triggering, guarding is graphically denoted by causal edges, with the difference that this time they lead from a state to an event. Besides state-machine elements, SEFTs offer gates similar to traditional FT gates.
The gate set includes the standard Boolean connectives AND, OR and NOT applied to state expressions, but also gates that allow connecting triggering relations and gates that involve state-based semantics, such as Priority-AND or Duration Gate. Like component inputs and outputs, gate inputs and outputs are typed as state or event ports and the semantics of a gate varies accordingly: an AND gate that joins a state and an event and has an event output is different from an AND gate that joins two states and has a state output.9 Gates are always connected by causal edges. Gates allow explaining how several causes are related to their common consequence and visually building more complex trigger and guard structures. In practice, many SEFTs consist in large parts of gates and causal edges that tell the ”story” of the hazard or accident. Thus SEFTs resemble traditional FTs. Despite this similarity the semantics of SEFT gates differs from the Boolean semantics of traditional FT gates. The example in Figure 4.3 shows an AND gate in a trigger relation; in this case, it is an AND gate of the type Event-State AND because the left input is connected to an event. The meaning of the example is exactly the same as in the one before.

Figure 4.2: Causal Edges as Trigger (left) and Guard (right) Relations

Figure 4.3: Putting a Logical Gate into a Trigger Relation

Causal loops (i.e., cycles in sequences of causal edges) are forbidden, except if some explicit delay is introduced into the cycle.10 This is not a severe restriction to expressiveness, because traditional FTs do not allow loops either, and in practical cases, some deterministic or probabilistic delay is always present. Forbidding causal loops keeps unnecessary semantic questions about the model of computation out of the SEFT model. The absence of causal loops must be checked before the analysis is started. Of course, causal edges could alternatively be replaced by textual references (as the trigger / output annotations in Statecharts) to avoid overloading the diagram, but the resulting diagram would no longer focus on visualising causal relations as FTs do. As the component concept keeps the structures small and easier to understand, it is acceptable to denote the causal relations by edges in the diagram.

As the SEFT technique is tailored for safety analysis, it abstracts from several aspects that are important during the system design phase, but not during safety analysis. There are no data types other than the Boolean type, and events and triggers are not annotated with values or messages (as they are in Statecharts or ROOMcharts).11 This means that, although the reuse of design models for safety scenarios is one main goal of the SEFT technique, not all details of these design models can be translated to SEFTs and an abstraction must take place when reusing design models. What this restriction means for practical application to real-world case studies will be discussed in Chapter 7. Finally, it should be noted that SEFTs give a lot of modelling freedom to the analyst: There are SEFT examples that are predominated by FTA elements (apart from some basic state-machines that play the roles of the ”basic events” in terms of FTA) and others that are predominated by state-machine elements (provided that at least one state or transition is connected to an output port by a causal edge, because this marks the entry point for the analysis).

8 In the triggering example, the fact that the states and events would actually belong to different components and transfer the triggering information via ports has been neglected for reasons of simplicity.
9 An analysis tool can do the distinction at runtime, as the types of elements connected to the inputs of a gate uniquely determine the type of the gate.
10 As discussed in the context of CFTs, this rule can additionally be relaxed if a detailed analysis of the relevance of causal cycles is performed. Similar considerations are known from digital circuits design. For a discussion of cycles cf. [SBST05] and the references cited there.
11 If there were data types, the semantics should be extended so that comparisons of data values can serve as guards as well.
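The rule that causal loops must be absent before analysis starts can be checked with a standard depth-first search. The following Python sketch is only one possible reading of the rule: it assumes a simplified graph in which every node is identified by a string and in which the nodes that introduce an explicit delay are passed in as a set (`delayed`, a hypothetical parameter). A cycle is reported only if it contains no delayed node.

```python
def find_undelayed_cycle(edges, delayed=frozenset()):
    """Return one delay-free cycle as a list of nodes, or None.

    edges   -- iterable of (source, target) pairs
    delayed -- nodes assumed to introduce an explicit delay; a cycle that
               passes through such a node is tolerated (assumption)
    """
    # Dropping delayed nodes leaves exactly the delay-free part of the graph.
    graph = {}
    for src, dst in edges:
        if src in delayed or dst in delayed:
            continue
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    colour = {node: WHITE for node in graph}

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:  # back edge closes a cycle
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                found = visit(nxt, path)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None
```

A tool could call such a check once per model before the analysis and present the returned node sequence to the analyst as the offending loop.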

4.3 An Introductory Example

SEFTs can best be introduced by a small case example. In the simplistic example in Figure 4.4, the accident to be analysed (”top-event”) is a road accident, in which a pedestrian is killed by a car because he is not seen by the driver when crossing. The figure shows the scenario in a glass-box view to explain the whole situation at a glance; in reality, all subcomponents would appear as black boxes and their contents would be shown in a separate model.12 The assumption in the example is that every time a pedestrian crosses the road in darkness, he is killed; for simplicity, the cars are not modelled. The analyst will usually start with the ”top-event” identified before (which is actually an event in this example; in SEFT it could be a state as well) and put an output port (of the whole system) for this event on the drawing area. This event is decomposed using SEFT gates. The story is that the accident occurs if a pedestrian crosses (event) AND it is dark (state); thus, the upper AND gate is an Event-State typed AND. It is dark if it is night and the street lamp is not lit. Both are state expressions, so this time the AND gate is a State-AND. The fact that the street lamp is not lit is expressed by a negation, a NOT gate that can only be applied to state expressions. The street lamp is a technical subsystem and thus modelled as an SEFT subcomponent. It has one state output, which makes the state in which the lamp is lit visible from outside. The street lamp actually has three states: On, Off and Failed, but when it is not lit it is not distinguishable from outside whether it has been switched off or whether it has failed. The street lamp has two triggered transitions: the transitions between the On and Off state. These transitions are triggered by another component via event ports. A third transition from On state to Failed state (the street lamp in this example does not fail when it is Off) has an exponentially distributed delay and thus needs no trigger.
This state transition corresponds to the standard Markov Chain semantics. The input ports of the street lamp subcomponent are connected to another subcomponent by causal edges that belong to the top-level system. Again, the fact that a trigger relation (as opposed to a guard relation) is expressed by these edges is made clear by the fact that their sources and targets are event ports. This allows enforcing consistency on the system level without knowing what is inside of the subcomponents. The subcomponent that triggers the street lamp is the periodical change of day and night, simply modelled by deterministic transitions with 12 hours of delay time. This example shows that not only parts of the technical system, but also scenarios in the environment can be described using the SEFT model elements. The last subcomponent to be explained is the pedestrian. He is modelled by a one-state subcomponent with an exponentially distributed probabilistic event in it. This corresponds to a Poisson process generating ”pedestrian crosses” events at random intervals.13 This small example, although far from being realistic, gives an impression of the expressive power of SEFTs. In an actual setting, it could be used for a quantitative analysis, determining the expected number of accidents in a given mission time. Modelling the same scenario in standard FTs is not readily possible; using a state-based technique, in particular a Petri Net, the example would look much more complicated to humans, as the case studies in Section 7.2 clearly demonstrate.

12 The ESSaRel tool opens the contained model when double-clicking the black box.
13 A simplified notation for such basic scenarios, similar to basic events in standard FTA, would be desirable. A proposal will be made in Section 7.4.1 under the name of ”solitary event”.

Figure 4.4: A First SEFT Example
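To give an impression of what a quantitative analysis of this example could deliver, the scenario can be simulated directly. The following Python sketch is a plain Monte Carlo simulation and not the analysis procedure of this thesis (which is the subject of Chapter 5). It assumes that the system starts at daybreak, that the lamp is never repaired, and it uses hypothetical parameters `fail_rate` (lamp failures per burning hour) and `cross_rate` (pedestrian crossings per hour).

```python
import random

HALF_DAY = 12.0  # hours of day, followed by 12 hours of night


def is_night(t):
    """Night covers the second half of every 24-hour cycle (assumption)."""
    return (t % 24.0) >= HALF_DAY


def lamp_failure_time(rng, fail_rate):
    """Absolute time at which the lamp fails.

    The lamp can only fail while it is On, i.e. during the night, so the
    exponentially distributed burning time is mapped onto night intervals.
    """
    burning_hours = rng.expovariate(fail_rate)
    nights_survived, rest = divmod(burning_hours, HALF_DAY)
    return nights_survived * 24.0 + HALF_DAY + rest


def simulate_once(rng, mission, fail_rate, cross_rate):
    """Count accidents in one run: a crossing while it is night and the lamp is failed."""
    t_fail = lamp_failure_time(rng, fail_rate)
    accidents = 0
    t = rng.expovariate(cross_rate)  # Poisson process of crossing events
    while t < mission:
        if is_night(t) and t >= t_fail:
            accidents += 1
        t += rng.expovariate(cross_rate)
    return accidents


def expected_accidents(mission, fail_rate, cross_rate, runs=2000, seed=1):
    """Estimate the expected number of accidents within the mission time."""
    rng = random.Random(seed)
    total = sum(simulate_once(rng, mission, fail_rate, cross_rate)
                for _ in range(runs))
    return total / runs
```

A call such as `expected_accidents(mission=8760.0, fail_rate=1e-4, cross_rate=0.5)` would then estimate the expected number of accidents in one year for these illustrative parameter values.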

4.4 SEFT Modelling Elements

After this initial introduction, a more formal definition of SEFTs is given. All model elements of SEFTs are listed, divided into groups. Then the rules on how to connect these elements to form a valid SEFT model are introduced. In comparison to traditional FTs, Markov Chains or standard Statecharts, SEFTs are more complex to formalise, which is the cost for their powerful domain-specific syntax. An SEFT is a graph consisting of five types of nodes and two types of edges. All edges are directed: they lead from one node, called source, to another, called target.



The nodes can be further divided into simple nodes that can directly be the source or target of edges and complex nodes which can only be addressed by typed and distinguishable ports.14 Simple nodes are
• events (depicted as thin bars)
• states (depicted as rounded rectangles)
• ports (model ports, depicted as triangles, and subcomponent or gate ports, depicted as small squares on the edge of the substructure rectangle)
while complex nodes or substructures (all depicted as rectangles) are
• subcomponents
• gates.
Events can be subdivided into three categories:
• Triggered events
• Deterministic events (with time parameter t)
• Exponentially distributed probabilistic events (with rate parameter λ)
Events model sudden phenomena, in particular state changes. Exactly one triggered event in a model with state-machine elements is labelled Init, which means that it is triggered implicitly by the start of the system operation. This event corresponds to the init point in Statecharts or ROOMcharts, which is denoted there by a black dot. The Init event qualifies its successor state as the start state. Alternatively, the start state could have been marked by a Boolean attribute ”IsStartState”, which must be true for exactly one state. The design decision to use an Init transition instead offers some more flexibility: First, the Init event can trigger other events. Second, a future SEFT modelling feature allowing component reset or replacement can easily be implemented by triggering the Init event. Third, it is possible in the future to extend the model to allow not just one start state, but an initial probability distribution over the states by adding multiple weighted edges having the Init event as their source (for proposed future extensions to SEFTs see Section 7.4). All triggered events except the Init event must have an incoming causal edge from another event that triggers them.
Deterministic or probabilistic events must not have an incoming causal edge from another event, but there may be incoming causal edges from states. The semantics of states is as in finite state-machines. SEFT states have no hierarchy (sub-state-machines), but introducing hierarchical states into SEFTs is a demand for the future (see Section 7.4).

14 Due to technical reasons, the ESSaRel tool implementation treats some of the simple nodes, namely events and states, as complex nodes as well, in order to profit from the implemented port mechanism to connect edges.

Ports are interface elements that serve as connection points of subcomponents and gates and make the connection between different hierarchy levels. They can be divided into several orthogonal categorisation schemes: • Model ports vs. subcomponent or gate ports: Model ports stand alone for themselves and are denoted by triangular symbols. They make the connection to the next higher hierarchy level. Substructure ports belong to a gate or a subcomponent. They are denoted as small squares on the edge of the rectangle and connect the subcomponent or gate to its environment. • State type vs. event type ports: As causal relations may come from events (trigger) or states (guard), this distinction must be preserved across different hierarchical levels for consistent analysis. Also, the different types of gates are distinguished by the state/event port distinction. For graphical distinction, the letter ”S” for state and ”E” for event may be written on top of the port symbol. • Input vs. output ports: As all edges (causal and temporal) are directed, ports should also be, in order to enforce consistency. For subcomponent and gate ports, the symbols are the same, but inputs are usually placed on the lower edge of the black box and outputs on the upper edge. For model ports, a closed triangle designates an output and an open triangle denotes an input port. The following Figure 4.5 compares the different categories of ports.

Figure 4.5: Different Kinds of Ports: Model Ports vs. Substructure Ports; Input Ports vs. Output Ports, State Ports vs. Event Ports

Subcomponents are ”black-boxes” (depicted as empty rectangles with an ID tag written on it) that represent instances of some other component, as introduced by the CFT concept.



Gates are logical connection symbols for causal relations. Unlike traditional FT gates, they have two distinguishing properties:
1. Gates are typed in the sense that each gate can exist in various types with different numbers and types of input ports and different types of output ports.
2. The semantics of gates is not restricted to propositional logic; instead, propositions that can be expressed by (timed or probabilistic) state-machines are also allowed (e.g., event A occurs before event B).
In consequence, gates that have existed for quite a while in some FTA packages, such as Priority-AND or Inhibit, can now be formalised in a better way. A full list of the currently implemented gates is given in the next subsection.

Edges are divided into two types:
1. causal edges, denoting a trigger or guard relation from an event or a state to an event.
2. temporal edges, denoting the predecessor/successor relation between states and events. The sequence must always be state-event-state-event etc.

Figure 4.6 summarizes all model elements of SEFTs (for gates, just one example is given; actually, there are many different gates that will be discussed in detail in Section 4.6).

Figure 4.6: Overview of SEFT Model Elements
[Figure legend: Basic Entities: (Sub-)Component, State, Event. Relations and Propositions: Gate (junction of causal chains, not restricted to binary or Boolean operators); Temporal order (predecessor / successor relation); Causal order (trigger / guard relation); Ports (state input / event input / state output / event output).]

4.5 Rules for SEFT Well-Formedness

The description of a graph model contains not only a set of elements but also a set of rules on how many elements of each type may be part of a model and how the elements may or must be connected to each other for the model to be correct. Model in terms of this section means SEFT model of one particular component. It is useful to distinguish different degrees of correctness: well-formedness and validity. Models that are well-formed must not violate some basic syntactical rules. To be valid for analysis, more rules must be enforced that can partly only be defined in relation to other models. Syntactic rules for well-formedness can be enforced on each editing step or checked on demand. When to check the syntax is partly a technical decision (some rules can be checked quickly on a single graph element, some require time-consuming checks on the whole graph), partly a logical consequence of the process of editing (e.g., the rule that one Init transition must be present is necessarily violated before the analyst draws the Init transition). The following syntax rules for well-formedness concern the presence of certain events and allowed or forbidden connections between elements:

Causal edges may lead
• from
  – a state or
  – a state input port of the model or
  – a state output port of some complex node
• to
  – a state output port of the model or
  – a state input port of some complex node or
  – some event (guard relation)
• or from
  – an event or
  – an event input port of the model or
  – an event output port of some complex node
• to
  – an event output port of the model or
  – an event input port of some complex node or
  – an event (trigger relation)



Furthermore, it is forbidden that more than one causal edge shares the same target (but they may share the same source). The meaning of the rules about causal edges should be clear from the explanations in Section 3.6. Temporal edges may lead • from an event to a state or from a state to an event. A well-formed SEFT component model is not always valid for analysis. It may either be • analysable on its own, without referring to other models, or • analysable, provided that additional models of all referenced subcomponents are available, or • not analysable (but possibly still analysable when used as a subcomponent in some higher level model). Analysable on its own means that it has no open input ports and references no subcomponents. Analysable in conjunction with additional models means that there are references to subcomponents. For each component class describing a subcomponent, the SEFT or other compatible model must be present at analysis time. A model with input ports is not analysable. It can, however, be referenced as a subcomponent of some other system and be part of the analysis of that system. Further, to be analysable, a model needs at least one output port (which corresponds to the top-event of traditional FTA). This output port and a related measure (rate or probability density for an event output, probability for a state output) marks the goal of the analysis. All output ports and all input ports of substructures must be connected in a syntactically correct way. If explicit states are present, there must be exactly one start state, marked by the presence of an Init event connected to it. Whether a model is valid for analysis is checked immediately when the analysis starts. Some validity rules can only be decided in the context of other involved models.
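The connection rules for causal and temporal edges lend themselves to a mechanical check. The following Python sketch shows how such a check could look; the node-kind names (for example `sub_state_out` for a state output port of a complex node) and the function name `check_edges` are hypothetical and introduced for this illustration only.

```python
from collections import Counter

# Allowed sources and targets of causal edges, by node kind (names assumed).
STATE_SOURCES = {"state", "model_state_in", "sub_state_out"}
STATE_TARGETS = {"model_state_out", "sub_state_in", "event"}  # event: guard
EVENT_SOURCES = {"event", "model_event_in", "sub_event_out"}
EVENT_TARGETS = {"model_event_out", "sub_event_in", "event"}  # event: trigger


def check_edges(kinds, causal, temporal):
    """Return a list of rule violations (empty list: edges are well-formed).

    kinds    -- dict mapping node id to node kind
    causal   -- list of (source, target) causal edges
    temporal -- list of (source, target) temporal edges
    """
    errors = []
    for src, dst in causal:
        s, d = kinds[src], kinds[dst]
        state_ok = s in STATE_SOURCES and d in STATE_TARGETS
        event_ok = s in EVENT_SOURCES and d in EVENT_TARGETS
        if not (state_ok or event_ok):
            errors.append(f"causal edge {src}->{dst}: {s} to {d} not allowed")
    # No two causal edges may share a target; sharing a source is fine.
    for dst, count in Counter(d for _, d in causal).items():
        if count > 1:
            errors.append(f"{count} causal edges share target {dst}")
    # Temporal edges must alternate between states and events.
    for src, dst in temporal:
        if {kinds[src], kinds[dst]} != {"state", "event"}:
            errors.append(f"temporal edge {src}->{dst} must join state and event")
    return errors
```

Rules that need the whole model or other models (presence of the Init event, open input ports, availability of subcomponent models) are deliberately left out of this sketch.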

4.6

The SEFT Gate Set

4.6.1

Introductory Remarks

To reflect the state / event distinction, all gates have input and output ports that are typed as state and event ports. Consequently, there can be more than one gate of the same kind (e.g., AND), distinguished by the type of the inputs and outputs. This is comparable to the concept of method overloading in object-oriented programming languages: There may be one method add(int, int) and another method add(float, float), which are distinguished by the type of argument fed to them and that do similar things to different data types. There are some constraints on the number and type of input and output ports:
• A gate always has exactly one output port.
• Depending on the kind of gate, the number of input ports can be variable (e.g., AND, OR) or fixed (e.g., NOT: always 1, XOR: always 2).
• The input ports can be state, event or mixed types, according to the kind of gate (e.g., AND: all state inputs or one event and one or more state; OR: all event or all state).
• The type of the output port can be uniquely determined if the types of the input ports are known (e.g., AND with all state inputs must have a state output).

In the following subsections, the SEFT gates are briefly presented, including their graphical description and application examples.¹⁵ The convention used for the port IDs is that the (only) output of any gate has the ID "Pout1" and the input ports are numbered from "Pin1" as they are shown from left to right in the figures. It is important to know the IDs of the inputs, except for the special cases when they are all commutative: when inserting the corresponding analysable structure from the gate dictionary (see Section 5.4, Appendix A), it must be clear which inserted model element belongs to which input port. An ellipsis ("...") in some of the figures means that the number of inputs of this kind is variable.
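The overloading idea can be sketched as a function that derives the output port type from the gate kind and the input port types. The function output_type and the constants STATE and EVENT are hypothetical names for this illustration, and only the AND, OR and NOT cases named in the text are covered.

```python
STATE, EVENT = "state", "event"


def output_type(kind, inputs):
    """Return the output port type for a gate kind and the list of its
    input port types (each STATE or EVENT, in the order Pin1, Pin2, ...).
    Raises TypeError for combinations that have no gate variant."""
    n_events = inputs.count(EVENT)
    n_states = inputs.count(STATE)
    if kind == "AND":
        if n_events == 0 and n_states >= 1:
            return STATE  # State-AND
        if n_events == 1 and inputs[0] == EVENT and n_states >= 1:
            return EVENT  # Event/State-AND, the event is the first input
    elif kind == "OR":
        if n_states >= 1 and n_events == 0:
            return STATE  # State-OR
        if n_events >= 1 and n_states == 0:
            return EVENT  # Event-OR
    elif kind == "NOT":
        if inputs == [STATE]:
            return STATE  # State-NOT
    raise TypeError(f"no {kind} gate for input types {inputs}")
```

For example, output_type("AND", [EVENT, STATE, STATE]) yields the event type, whereas an AND with two event inputs is rejected because no such gate exists in the SEFT gate set.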

4.6.2

The AND Gate (n State Inputs)

The State-AND Gate has a state typed output and one or more state typed inputs. Its semantics is that the output state expression is true when and as long as all input state expressions are true. The number n of inputs is arbitrary and all inputs are commutative. The symbol is a rectangle with & marked on it as given in the figure. An application example for the State-AND is the situation that a laptop computer is unavailable when both the battery is empty and external power supply is unavailable. The State-AND corresponds to the standard Boolean AND of traditional FTA.

4.6.3

The AND Gate (1 Event Input and n State Inputs)

¹⁵ In the prototype tool ESSaRel the appearance of the gates is slightly different, which will be visible in the screenshots given later.

Figure 4.7: AND Gate with n State Inputs

The Event/State-AND Gate uses the same graphical symbol as State-AND; however, because the first input is of the type event and not state, the gate can be distinguished from State-AND at analysis time. There is exactly one event input and n (one or more) state inputs and an event output called trigger. The state inputs are commutative. The semantics is that the event from the input triggers the event at the output, provided that at this moment all of the state input expressions are true.¹⁶ To give an application example for this kind of AND gate, a steam boiler explodes (event) if the pressure exceeds a certain limit (event) while the safety valve is defective (state). If the pressure exceeds the limit without the valve being defective, nothing happens. This kind of AND is not present in traditional FTA (although in practice, the traditional AND gate is sometimes used to model similar situations).

Figure 4.8: AND Gate with one Event (Trigger) and n State Inputs

There is no AND-Gate with only event inputs. As the probability that two independent events will occur within a short time decreases as the interval gets smaller, the notion of simultaneously occurring events is not useful in a continuous time domain. Instead, SEFTs provide History-AND and Priority-AND, with or without a time interval parameter, to specify cases where two events together cause an effect. These gates will be discussed below.

4.6.4

The OR Gate (n State Inputs)

The State-OR Gate has a state typed output and one or more state typed inputs. Its semantics is that the output state expression is true when and as long as one or more of the input state expressions are true. The inputs are commutative. For example, a computer might be defective if the CPU is defective or the hard disk is defective (this includes the case that both are defective).

4.6.5

The OR Gate (n Event Inputs)

¹⁶ The Inhibit gate, which is the same but with negated state inputs, appears more frequently. It is explained in a separate section.

Figure 4.9: OR Gate with n State Inputs

The Event-OR Gate has an event typed output and one or more event typed inputs. Its semantics is that the output event occurs if and as often as any of the input events occurs. The symbol is the same as for the State-OR. For example, a boiler breaks when either the temperature or the pressure exceeds a critical level.

Figure 4.10: OR gate with n Event Inputs

There is no OR-Gate mixing state with event inputs.

4.6.6

The NOT Gate

The State-NOT Gate negates a state expression, i.e., the output expression is true when and as long as the input expression is false.¹⁷ For example, it might be useful to state as a side condition for a car accident that the anti-skid braking system is not in an operating state, which can be formalised by connecting the NOT gate to the state symbol that marks the operational state.

Figure 4.11: NOT Gate with One State Input

4.6.7

The Inhibit Gate

¹⁷ The discussion about NOT in traditional FTA has been mentioned before. For SEFTs, the whole argumentation does not apply since the translation to a state-based model defines the semantics and there is a translation for the NOT gate, thus the gate is justified. However, there is no NOT gate for events in SEFTs, as the assumption of a continuous time scale makes the negation of "Event e occurs at time t" problematic. The fact that e does not occur within a given time interval, marked by its start event and end event, can, however, be expressed by using the state-machine capabilities within SEFTs.

The Inhibit Gate, which appears in most FTA tools and literature, has one or more state inputs, one event input, and one event output. It is a shorthand for an Event/State-AND with a NOT at each of the state inputs. In other words, if at least one of the (protective) units at the inputs is in its active state, no accident happens. The typical context of its usage is a protective device: the car driver gets injured in a (supposed) car crash (event), unless the airbag is available (state).

Figure 4.12: The Inhibit Gate

4.6.8

The Exclusive-OR (XOR) Gate

The meaning of the Exclusive-OR Gate (XOR for short) is that the output state expression is true when exactly one of the n input states is true. The symbol ”=1” in the European notation designates the fact that it must be exactly one input that is true. This gate is not very common in practical cases.

Figure 4.13: The Exclusive OR (XOR) Gate

4.6.9

The Equal Gate

The Equal Gate is not common in traditional FTA, but it is a logical completion of the set of Boolean junctions between state expressions. The gate has one state output and n state inputs. The output expression is true when all inputs are true or all inputs are false, which is symbolised by the "=" symbol.

Figure 4.14: The Equal Gate

4.6.10

The Voter Gate

The Voter Gate (also called K-out-of-N Gate) is a gate that is frequently used in reliability engineering. Its semantics is that the system failure state is present if at least k out of the n available subsystems have failed. Therefore, this gate belongs to the Boolean gates like AND and OR; it can be replaced by a structure of AND and OR gates. The parameter n is always equal to the number of gate inputs, while the parameter k can be selected by the user in the range from 1 to n. Special cases are the n-out-of-n setting, which corresponds to an AND gate, and the 1-out-of-n setting, which corresponds to OR. The Voter gate also exists in traditional FTA, but it has varying names and symbols in different literature and is not present in all standards.
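The following sketch illustrates the Voter semantics on Boolean input states and one possible replacement by AND and OR: an OR over the ANDs of all input subsets of size k. The function names voter and voter_as_and_or are chosen for this illustration only.

```python
from itertools import combinations


def voter(k, inputs):
    """True if at least k of the n Boolean input states are true."""
    n = len(inputs)
    if not 1 <= k <= n:
        raise ValueError("k must lie in the range 1..n")
    return sum(inputs) >= k


def voter_as_and_or(k, inputs):
    """Same function, built as an OR over the ANDs of all k-subsets."""
    indices = range(len(inputs))
    return any(all(inputs[i] for i in subset)
               for subset in combinations(indices, k))
```

With k equal to n the result coincides with all(inputs), i.e., AND, and with k equal to 1 it coincides with any(inputs), i.e., OR.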

Figure 4.15: The Voter Gate

4.6.11

The History-AND Gate and its Variants

In many situations a dangerous event occurs if a set of events (typically failures of redundant units) have occurred before. The consequence (output event) occurs at the same time as the last of the causes (input event). To model these cases, a History-AND Gate has been introduced. For the output to occur, each input event must have occurred once or more when the last one occurs and triggers the output. The order in which the inputs are triggered is arbitrary (in contrast to the Priority-AND Gate, which calls for a given order). Following the SEFT philosophy that all events can occur several times (i.e., in contrast to the traditional FTA philosophy), the design decision for History-AND was that after the output has fired, the cycle starts from the beginning (i.e., if all input events occur again, the output occurs one more time). For example, a redundant server system crashes at the moment when the last of its server units crashes.

Figure 4.16: The History-AND Gate: Standard, Variant with Reset Input, Variant with Time Parameter

A variant of the History-AND has a time parameter, indicating an interval within which all of the events must occur to trigger the output event. Therefore the time of the last occurrence of each input event has to be stored. If the tolerance time expires, the history is reset. This gate is useful in situations where two events only cause an effect if they occur close to each other. In a modification of the former example, a redundant server system consists of several servers. Each of them is rebooted upon failure and works again - but this procedure takes one minute. In consequence, a system failure occurs if all servers fail within a time interval of one minute, counted from the first failure.

Another variant is the History-AND with an additional reset input. This gate is useful in situations where some corrective action (e.g., restart or inspection) sets all components back to the initial state and thus makes the gate "forget" its history. As deterministic delay can be modelled in SEFTs, it is easy to model periodic inspection by technicians or watchdogs. For example, a redundant fire detection system fails if the last of its independent circuits fails. But when a manual inspection detects that some of the circuits have failed, they are put back into working state. As all of the History-ANDs have a state memory, they could not be expressed by standard combinatorial techniques as applied in standard FTA.
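The memory of the History-AND and its variants can be sketched as a small event-driven class. The class HistoryAnd and its methods are hypothetical names. The treatment of the time parameter is one possible reading of the description above: an occurrence counts only while it is no older than the interval at the moment the current input event occurs.

```python
class HistoryAnd:
    """Sketch of a History-AND with optional time window and reset."""

    def __init__(self, n_inputs, window=None):
        self.n_inputs = n_inputs
        self.window = window   # tolerance interval, None = unlimited
        self.last_seen = {}    # input index -> time of last occurrence

    def reset(self):
        """Reset input: the gate forgets its history."""
        self.last_seen = {}

    def on_event(self, index, time):
        """Register input `index` (0-based) at `time`.
        Returns True if the output event fires at this moment."""
        if not 0 <= index < self.n_inputs:
            raise IndexError("unknown input")
        self.last_seen[index] = time
        if self.window is not None:
            # Drop occurrences that are older than the tolerance interval.
            self.last_seen = {i: t for i, t in self.last_seen.items()
                              if time - t <= self.window}
        if len(self.last_seen) == self.n_inputs:
            self.last_seen = {}  # a new cycle begins after firing
            return True
        return False
```

In the server example with two servers and a window of 60 seconds, failures at 0 s and 61 s do not fire the output, while failures at 61 s and 100 s do.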

4.6.12

The Priority-AND Gate and its Variants

The Priority-AND Gate is similar to the History-AND Gate, except that the input events must occur in the order of ascending input IDs (in graphical representation: from left to right). The output event occurs at the same time as the last of the input events. The inputs are obviously not commutative. Each event may occur more than once; if events occur in the wrong order, they are ignored, i.e., they do not trigger the output event, but they do not reset the history to a former state either. For example, with three inputs, the occurrence sequence (3-1-2-2-1-1-3) triggers. After the output has been triggered a new cycle begins.

Figure 4.17: The Priority-AND Gate: Standard, Variant with Reset Input, Variant with Time Parameter

Similar to History-AND, Priority-AND also exists in variants with an interval parameter and with a reset input. Their semantics is analogous to that of the corresponding History-AND variants.
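The standard Priority-AND can be sketched as a counter that remembers which input ID is expected next. The class name PriorityAnd is hypothetical, and the variants with interval parameter are omitted.

```python
class PriorityAnd:
    """Sketch of the standard Priority-AND (input IDs 1..n)."""

    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.expected = 1  # ID of the input that advances the history

    def reset(self):
        """Reset input of the reset variant: start a new cycle."""
        self.expected = 1

    def on_event(self, input_id):
        """Returns True if the output event fires at this moment."""
        if input_id != self.expected:
            return False   # wrong order: ignored, history unchanged
        if self.expected == self.n_inputs:
            self.expected = 1  # a new cycle begins after firing
            return True
        self.expected += 1
        return False


gate = PriorityAnd(3)
fired = [gate.on_event(i) for i in (3, 1, 2, 2, 1, 1, 3)]
# Only the final occurrence of input 3 fires the output.
```

For the sequence from the text, the first 3 and the repeated 2 and 1 are ignored, and the output fires on the last event only.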

4.6.13

The Delay Gate

Causation in SEFT does not incorporate any delay, i.e., the cause and the effect occur simultaneously. If there is a need to model the time that elapses between cause and effect, this must be done explicitly using the Delay gate. There are two variants of delay gates:
1. Deterministic Delay
2. Probabilistic Delay

Both variants possess a delay parameter. In the case of deterministic delay, the output event occurs exactly this much time after the input event. In the probabilistic case, the delay time is exponentially distributed so that the mean delay time equals the given parameter (i.e., the rate of the exponential function is the inverse of the parameter). The Delay Gate has no event queue, which means that a further occurrence at the input is ignored if it occurs before the delay has elapsed.


Figure 4.18: The Deterministic and Probabilistic (Exponentially Distributed) Delay Gates, with and without Reset Input

An example for the usage of a Delay gate is the frequent situation that after failure of some control function of a vehicle or plant, an accident may occur some time later. There is a variant of the Delay gate (deterministic or probabilistic) with an additional event type input called Reset. This input allows inhibiting the output event from occurring, provided that the delay has not elapsed before. In other words, to have an effect, the Reset event must occur after the input event but before the output event of the gate. This gate can be used to model recovery and corrective actions after failures or dangerous incidents, e.g., a fire breaks out but is detected before it sets the whole building alight.
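The behaviour of the deterministic Delay gate with Reset input can be sketched as a function over a time-ordered list of input and reset occurrences. The function name deterministic_delay and the event encoding are assumptions made for this illustration. Two boundary choices are also assumptions: an output that is due at exactly the time of another event fires first, and a delay that is still pending at the end of the list is assumed to elapse.

```python
def deterministic_delay(events, delay):
    """events: list of (time, kind) sorted by time, kind is "in" or
    "reset". Returns the list of times at which the output occurs."""
    outputs = []
    due = None  # scheduled output time, None while the gate is idle
    for time, kind in events:
        if due is not None and due <= time:
            outputs.append(due)  # delay elapsed before this event
            due = None
        if kind == "in":
            if due is None:
                due = time + delay
            # otherwise ignored: the gate has no event queue
        elif kind == "reset":
            due = None  # inhibits a pending output, if any
        else:
            raise ValueError(f"unknown event kind: {kind}")
    if due is not None:
        outputs.append(due)
    return outputs
```

With a delay of 2, inputs at times 0, 1 and 3 produce outputs at times 2 and 5, because the input at time 1 falls into the running delay. A reset at time 1 after an input at time 0 suppresses the output entirely.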

4.6.14

The Conditional Probability Gate

In some cases, one event triggers another, but only in a certain statistical percentage of the cases. This can be modelled using the Conditional Probability Gate. It has one event input and one event output and a conditional probability parameter p, allowing for values between 0 and 1. If the input event occurs, there is a probability of p that the output event occurs as well (without delay) and a probability of 1-p that no output event occurs. There is no memory so that subsequent occurrences are independent from each other. The Conditional Probability gate is especially useful to model failure coverage or detection, if it can only be quantified statistically.
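A memoryless gate of this kind can be sketched with an independent random draw per input occurrence. The class ConditionalProbabilityGate is a hypothetical name; the random number generator is passed in so that runs can be made reproducible.

```python
import random


class ConditionalProbabilityGate:
    """Sketch: passes each input event on with probability p."""

    def __init__(self, p, rng=None):
        if not 0.0 <= p <= 1.0:
            raise ValueError("p must lie between 0 and 1")
        self.p = p
        self.rng = rng if rng is not None else random.Random()

    def on_event(self):
        """Returns True if the output event occurs for this input event.
        Each call draws independently, so the gate has no memory."""
        return self.rng.random() < self.p


gate = ConditionalProbabilityGate(0.3, random.Random(1))
hits = sum(gate.on_event() for _ in range(10_000))
# hits / 10_000 is close to 0.3 for a large number of input events.
```

If the input events happen to form a Poisson process with rate λ, the output events form a Poisson process with rate p·λ (thinning of a Poisson process).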



Figure 4.19: The Conditional Probability Gate

4.6.15

The Duration Gate

In some cases, some event occurs if a state persists for a certain time. For example, if the inlet valve of a tank remains open for too long, the tank bursts. This is the application area for the Duration Gate. The gate has one input, one output, and a time parameter. The input of the duration gate is state type; it is meant to be connected to the state term to be observed. The output is an event that occurs once when the state has persisted for the specified time parameter without interruption. If the state expression at the input becomes false and true again, the output event will occur again after the time has elapsed.
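The Duration gate can be sketched as a function over a time-ordered list of changes of the observed state expression. The function name duration_gate, the input encoding and the observation horizon are assumptions for this illustration; the state is assumed to be false before the first entry, all changes are assumed to lie within the horizon, and a state that ends exactly when the duration is reached is counted as having persisted.

```python
def duration_gate(changes, duration, horizon):
    """changes: list of (time, value) sorted by time, giving the value
    of the input state expression from `time` on. Returns the times at
    which the output event occurs up to `horizon`."""
    outputs = []
    since = None  # time at which the state last became true
    for time, value in changes:
        if value and since is None:
            since = time
        elif not value and since is not None:
            if time - since >= duration:
                outputs.append(since + duration)  # once per true period
            since = None
    # The state may still be true at the end of the observation.
    if since is not None and horizon - since >= duration:
        outputs.append(since + duration)
    return outputs
```

For the tank example with a critical duration of 4 time units, a valve that is open from 0 to 3 and again from 5 onwards produces one output event at time 9.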

Figure 4.20: The Duration Gate

4.6.16

State/Event Adapter Gates

In traditional FTs, states and events are often used synonymously. No distinction is made between the fact that a part fails and the state after it has failed. This practice carries a great risk of confusion. SEFTs introduce a clear separation between one and the other. However, in some cases the modeller may want to pass from an event to a state on purpose - especially when joining different kinds of models, which is one of the intentions of SEFTs. Therefore, adapter gates are provided with event input and state output or vice versa, in order to make that transition explicit. The gate that makes the transition from an event to the state that is present upon occurrence of the event is called UPON gate. The output state is true from the moment when the input event first occurs; all further occurrences are ignored. The counterpart is the UNTIL gate. Here the output state is initially true and remains so until the input event first occurs. Sometimes, a combination of both is useful: a gate with a set and a reset event input and a state output that can be "switched on and off" using the inputs. The state depends on which of the inputs was triggered last: it is true after the event at the set input has occurred and false after the event at the reset input has occurred. This gate is named Flip-Flop gate after the corresponding hardware circuit. There are also adapter gates that make the transition from a state to an event. The Entering and the Leaving gates both have one state input and one event output. The Entering gate creates an event occurrence each time the state expression changes from false to true and the Leaving gate creates an event occurrence each time the state expression changes from true to false.


Figure 4.21: The State/Event Adapter Gates: Entering, Leaving, Upon, Until, Flip-Flop
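The five adapter gates can be sketched in a few lines. All class and function names are chosen for this illustration. The initial output state of the Flip-Flop is not fixed by the description above; false is assumed here. The Entering and Leaving gates are shown on a sampled trace of the input state expression, which is a simplification of the continuous-time semantics.

```python
class Upon:
    """Event to state: true from the first occurrence on."""
    def __init__(self):
        self.state = False

    def on_event(self):
        self.state = True   # further occurrences change nothing


class Until:
    """Event to state: true until the first occurrence."""
    def __init__(self):
        self.state = True

    def on_event(self):
        self.state = False  # further occurrences change nothing


class FlipFlop:
    """State follows whichever input was triggered last."""
    def __init__(self, initial=False):  # initial value is an assumption
        self.state = initial

    def on_set(self):
        self.state = True

    def on_reset(self):
        self.state = False


def entering(trace):
    """Indices at which a sampled state trace changes false -> true."""
    return [i for i in range(1, len(trace))
            if trace[i] and not trace[i - 1]]


def leaving(trace):
    """Indices at which a sampled state trace changes true -> false."""
    return [i for i in range(1, len(trace))
            if not trace[i] and trace[i - 1]]
```

For the trace [False, True, True, False, True], the Entering gate produces events at indices 1 and 4 and the Leaving gate produces an event at index 3.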

4.6.17

Extending the Gate Set

The gate set presented in this thesis is the gate set that was initially proposed in [KG04] and later extended in [Kai05]. This set has been validated for the intended purposes (see Chapter 7) and forms the set of "built-in" gates of the ESSaRel Prototype tool. However, it is possible to extend this gate set in the future if useful gates appear to be missing in practice. A candidate for a future gate can be any subcomponent with one output that can be described by propositional logic or state-machine semantics and that stands for a common scenario in safety or reliability analysis. The technical procedure for creating a new gate depends mainly on the tool used; the ESSaRel gate set can be extended by defining a new graphical symbol (from existing basic shapes like rectangles or lines) and adding a new DSPN equivalent to the gate dictionary (see Appendix A). The new gate becomes a "built-in" gate as well. The translation procedure described in later sections does not have to be adapted in any way. Alternatively, every user can specify subcomponents for certain scenarios and use them like gates. So SEFTs are an extensible model, capable of describing any sort of behaviour that can be expressed in terms of state-machines, propositional logic, and deterministic or exponentially distributed delay times. Some suggestions for future gates that model frequent failure scenarios are watchdog (a failure occurs if some event does not occur for a certain time period) or failure-on-demand (a failure occurs if some initiator event is not answered by a response event within a given time).

4.7

Some SEFT Examples

As SEFTs are a new modelling technique, a few more examples can help to understand the semantic richness and the combination of state-based and FT-like elements. In particular, SEFTs subsume different accepted techniques.



Figure 4.22 shows a Markov Chain in traditional notation and as SEFT. Traditionally, this Markov Chain could be analysed for transient or steady-state probabilities of all states. For safety and reliability analysis, it is often required to calculate the maximum probability of one state (failed state or hazard state) with respect to a given mission time. This analysis can be carried out by the SEFT analysis tool ESSaRel. In Section 7 the SEFT results are compared to results of a state-of-the-art Markov analysis, and it turns out that both agree.

Figure 4.22: Traditional Markov Chain (upper) and SEFT Counterpart (lower)

The next example in Figure 4.23 shows the application of the SEFT technique to model a traditional FT. Only gates that operate on states are used in this example. An aspect that is certainly more complicated than in standard FTs (including CFTs) is the fact that each basic event has to be modelled as a separate two-state subcomponent with a working state, a failed state, and a transition rate between both. But this only makes visible what standard FTs tacitly assume. However, for ease of use it should be considered to re-introduce the basic event symbol (circle) from standard FTA or "solitary events" without visible predecessor and successor states as a shorthand for these state-machines; this is one of the improvement proposals in Section 7.4.1. On the other hand, this example shows that the state/event distinction and the behaviour that is usually hidden behind the basic events are revealed to the user. Also, the distinction between the equal and the same, which is often confused in FTA ("repeated events"), is clear: as events E2 and E3 in the traditional FT have the same rate and are (supposedly) instances of the same type of behaviour,¹⁸ they appear as instances of one component C2 in the SEFT. However, they only behave equally in the probabilistic sense, i.e., they fail with the same distribution and rate, but their failures are not related in any sense. If both E2 and E3 referred to the same failure behaviour, then there would be just one basic subcomponent. E1 in the traditional FT also fails according to an exponential distribution, like E2 and E3, but with a different rate; therefore, a separate component C1 is used to model it in SEFT.

Figure 4.23: Traditional Fault Tree (upper left) and Components of SEFT Counterpart (upper right: System, lower: Subcomponents)

An example of the combined usage of FT and state-based modelling elements has already been given in Figure 4.4. Of course, it will have to be shown that in those cases where SEFTs represent scenarios that can also be expressed by traditional techniques, they provide the same analysis results. This will be carried out by proofs and case studies in Chapter 7. The example also shows that in SEFT there are often several ways to express the same fact. The modeller can choose the most appropriate representation. The results of the quantitative analysis also have to be proven to be the same. In the evaluation chapter, some additional examples will follow in order to show the appropriateness of SEFTs for practical application.

¹⁸ Note that this is not clear from the original Fault Tree. They could either mark instances of the same component type, or two different kinds of components that just happen to show the same failure rate by chance.



4.8

Application in the Development Process

Basically, SEFTs are constructed like traditional FTs. Starting with an undesired system state (hazard) or event (accident), the analyst traces back its influences and finds out which system states or events play a role in initiating, propagating, or inhibiting the fatal behaviour. The richer variety and semantic precision of gates that has been introduced by state-based modelling allows better capturing of embedded systems behaviour. The basic events of standard FTA correspond either to failure states of components or to solitary exponentially distributed events. The project is structured hierarchically using the component concept. This allows modelling subcomponents first and instantiating them as often as necessary. Where necessary, state-based models that explain the behaviour of subcomponents can be plugged in. For instance, stochastic failures are appropriately modelled by Markov Chains, which can be expressed in the SEFT syntax. They have traditionally been used for hardware wear-and-tear, but they can also serve as stochastic models for software failures. To model software and control aspects of the system, Statecharts or similar models, which are often available from the design phase, can be put into SEFT semantics to use them as subcomponent models in an FT. In [Rog05] an automated import of state diagrams from a CASE tool (Rational Rose RT) is presented. The visible difference between the original and the SEFT syntax is that the transitions, which are originally represented by labelled edges, now appear as explicit transition symbols. Consequently, SEFTs are not only a modelling technique on their own, but, due to their expressive power, they can serve as an integration platform for different probabilistic and deterministic modelling techniques. Thus, SEFTs can be applied over large parts of the development process, as they can be continuously refined. 
In the phase of feasibility analysis, possible failure scenarios that have been identified (e.g., by FMEA) can be modelled. As the system architecture is defined and more and more refined, corresponding SEFT component models can be generated and joined by ports that reflect the interfaces between the actual components. Architectural models from Software Engineering techniques such as ROOM can be imported and reused. In early phases, the inner details of components can be approximated by probabilistic failure events, as it is done in traditional FTA. In later process phases, behavioural models of the implemented components can be introduced from Software Modelling Tools and replace the approximative models used before. Software behavioural models can be used for two different purposes in the SEFT analysis: • Models for the behaviour of the correct software can serve to check the reaction to rare or unforeseen events in the environment, such as hardware failures or faulty operation by humans. • Models that have been enriched by probabilistic transitions leading to failure states can model software failures that are introduced by faults (bugs), i.e., deviations of the actually implemented software from the modelled software. Both approaches will be discussed in the evaluation chapter.


An advantage of the component concept of SEFTs is that components can later be replaced with other components that have the same footprint (number and type of ports). This allows not only refinement along with the development process, but also the quantitative comparison of different implementation alternatives. Furthermore, the component approach helps analysts to get a good overview of the project, a feature that has often been missing for Markov Chains. Components can be reused in other projects, which is helpful in product-line approaches that are becoming more and more popular. Component models can be stored in repositories to build a safety model library and can be exchanged between different contractors working on the same project. SEFTs help the analyst to avoid misunderstandings and modelling errors, as their clear distinction of states and events, their expressive richness, and their intuitive appearance facilitate expressing chains of actions much closer to reality than traditional FTs would allow. Implicit assumptions, e.g., about the order of events or about the internal failure mechanism of some component, are not necessary. Considerations about the so-called repeated events (common cause failures) in traditional FTs are not necessary as these are modelled explicitly, and stochastic independence is no longer a precondition. The ID system allows locating events and states all over the project; an event that appears on a path to a hazard can equally be identified in the design models.¹⁹

¹⁹ To which extent these goals are actually met in practice will be discussed in detail in Chapter 7.



Chapter 5

Analysis by Translation into DSPNs

5.1

Choice of an Appropriate Intermediate Model

To analyse a traditional Fault Tree, a combinatorial analysis based on the underlying Boolean formula is performed, as described earlier. Different algorithms can be selected, such as the bottom-up approach, the Minimal-Cut-Set approach, or conversion into a Binary Decision Diagram (BDD). Due to their state-based nature, SEFTs cannot be evaluated this way - neither the state-machine aspects of SEFTs nor the gates with memory, such as Priority-AND, could be expressed correctly.¹ To cope with this fact, an underlying state-based model is required. The SEFT approach suggests a translation to Deterministic and Stochastic Petri Nets (DSPNs) [MC87], since DSPNs are a concurrent state-based model possessing all required types of modelling elements as well as analysis and simulation techniques for the relevant quantitative measures [CL93, GM95]. They offer Petri Net transitions that fire immediately, with an exponentially distributed delay, or with a deterministic delay, which matches the types of events in SEFTs. The choice of DSPNs as an intermediate model for analysis of SEFTs was made after comparison of different possible evaluation methods. It would have been possible to define and implement a new SEFT-specific analysis algorithm that allows calculating state probabilities and event probability densities. This would not only have required a lot of research and implementation work, which is time-consuming and error-prone, but could also have hampered the acceptance of the SEFT modelling technique. If the model is new, translation into an established model can increase confidence in it. There is a second important reason for providing a translation to an existing formal model: translation into an existing model with a well-defined semantics not only allows analysis with existing methods, but also defines the semantics of the new model formally. 
It was the purpose of this research project to give a formal underpinning to Fault Trees, but it would have been a difficult undertaking to define the SEFT notation with all its details directly in a formal method. Giving an operational semantics to it based on a translation is a feasible way of formalising it. So the choice of DSPNs as an intermediate model is an important part of the definition of SEFTs, even if other ways of evaluating them might exist and the applied transformation algorithm is not the only possible or the most efficient one. Finding an appropriate target model for the translation was one of the major research steps of this thesis. A model that provides the required expressiveness for stochastic behaviour is mandatory as an equivalent counterpart to a probabilistic safety analysis model that subsumes both Markov Chains and traditional FTs.² Models other than Petri Nets have been considered, but were discarded. Markov Chains, which are used in Dugan's Dynamic Fault Trees [DBB92], do not offer the notion of components that is essential to SEFTs. Although triggered Markov Chains have been proposed [BB03], Markov Chains are difficult to compose and control, and a representation for the causal chains and gates in SEFTs could not be found easily. Probabilistic I/O Automata [WSS97] are a potential alternative to the chosen DSPNs. Failure Automata [CSD00] are less formal, but could, on the other hand, be a bridge to existing safety and reliability modelling techniques. A connection to Failure Automata should be checked in the future. Non-state-based techniques like Probabilistic Duration Calculus [LRSZ93] have not been considered as they are based on a very different approach. Because of their implicit notion of concurrency, Petri Nets with their timed and stochastic extensions seemed to be a good candidate for the analysis model. The concurrency aspect of PNs essentially helped to reflect the component concept because each component can be translated to a partial net and all nets are finally combined.

¹ One may object that there are FTA tools that do provide a Priority-AND gate and yet use combinatorial analysis. In Section 7.3 the approaches are compared and an argument is given for why the traditional way of implementing this gate only works under certain restrictions.
However, as the next sections will show, it took some modifications to make them compositional in the desired way since they do not possess a notion of interfaces. GSPNs (Generalized Stochastic Petri Nets, a class of Petri Nets that is similar to DSPNs except for the lack of deterministic transitions) [MBC+95] could have been chosen and are even easier to analyse than DSPNs in some cases. However, the decision to choose DSPNs was motivated by the fact that deterministic delay is a frequent requirement when computer controlled systems are described (examples are watchdogs, periodical self-checks or other periodical tasks, or fixed-duration computing operations). As long as deterministic events do not occur in an SEFT, the resulting DSPN is a GSPN and can be analysed with the appropriate procedures. The DSPN obtained as a result of the translation serves as an intermediate model only for analysis. It is not intended for the generated DSPNs to be read or modified by humans or reverse translated. One may ask why DSPNs are not used directly to model the facts described by SEFTs. The reason is that the DSPNs obtained by translation are quite complex and unstructured. They do not explain the structure elements, nor do they allow detecting the borders of technical subsystems or the causal chains determining the scenario. This can easily be confirmed by looking at one of the generated DSPNs; some examples of SEFTs and their corresponding DSPNs are given in Section 7.2. Even if DSPNs were extended by a structuring mechanism, as, for instance, [KW01] propose, the places and transitions of a Petri Net would not reveal the significance of the model elements in the application domain.

² Originally, FTs do not deal with probability distributions at all, but with probabilities; in practice, the insertion of probability distributions over time is supported in most cases and the exponential distribution is the most important one. Besides the exponential distribution, the Weibull distribution and sometimes other distribution types are customary. Resolving Markov Chains with constant rate also leads to exponential distributions.

5.2 Deterministic and Stochastic Petri Nets

DSPNs are a timed variant of Petri Nets (cf. Section 2.5.3), i.e., the (deterministic or probabilistic) time that a transition waits before firing after becoming enabled is explicitly specified in the model. There are three kinds of transitions that differ by their way of firing: immediately after activation, after a deterministic delay (specified by an annotated time parameter), or after an exponentially distributed random delay (specified by an annotated rate parameter). Firing of transitions is atomic and takes no time. In the graphical representation, black bars depict immediate transitions, empty rectangles depict transitions with exponentially distributed firing time, and black filled rectangles depict transitions with deterministic firing time.

Figure 5.1: DSPN Example with Different Modelling Elements

Transitions and places are joined by two different kinds of arcs: standard arcs, which are either input arcs leading from a predecessor place to a transition or output arcs leading from a transition to a successor place, and inhibitor arcs, which are not available in standard PNs and lead from a place to a transition with the particularity that they forbid firing as long as the corresponding place is marked. Standard arcs have an arrowhead on the target side; inhibitor arcs have a small circle instead. Priorities can be attached to immediate transitions to resolve conflicts. They are integer numbers, and a higher number means higher priority with respect to other transitions that are enabled at the same time. Additionally, weights can be assigned to give the probability that one of several enabled



transitions fires. All weights are real numbers. The probability that one transition fires is calculated by its weight divided by the sum of the weights of all conflicting transitions. If the weights are chosen so that this sum is 1, the weight can be interpreted directly as a probability. Places can have a capacity of more than one token and arcs can have a multiplicity of greater than one. The underlying time scale is continuous. The analysis of DSPNs has been described in [CL93, GM95] and several tools are available to apply it. For the evaluation of this thesis and in the ESSaRel tool project (cf. Section 6), the tool TimeNET [ZGFH99] from Technische Universität Berlin has been used. It allows several kinds of analysis and simulation of DSPNs and other stochastic Petri Nets.
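The conflict resolution rule can be illustrated by a minimal sketch. It assumes that priorities are evaluated first and that weights are only compared among the enabled immediate transitions of the highest priority; the class and function names are hypothetical and not taken from any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImmediateTransition:
    name: str
    priority: int = 1    # higher number means higher priority
    weight: float = 1.0  # relative weight among conflicting transitions


def firing_probabilities(enabled):
    """Return {name: probability} for the enabled immediate transitions.

    Assumption: only transitions of the highest priority compete, and each
    fires with its weight divided by the sum of the competing weights.
    """
    if not enabled:
        return {}
    top = max(t.priority for t in enabled)
    candidates = [t for t in enabled if t.priority == top]
    total = sum(t.weight for t in candidates)
    return {t.name: t.weight / total for t in candidates}


if __name__ == "__main__":
    conflict = [ImmediateTransition("a", weight=1.0),
                ImmediateTransition("b", weight=3.0)]
    print(firing_probabilities(conflict))
```

If the weights of the competing transitions already sum to 1, the division leaves them unchanged, which corresponds to reading the weights directly as probabilities.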

5.3 Modularisation of DSPNs

Petri Nets were chosen as an intermediate model for analysis because they possess the required expressive power and allow for concurrency, which is introduced by the concurrent components that exist in SEFTs. However, they are not originally a compositional model and have no implicit composition semantics. There are several known approaches to make PNs modular, but most of them are unsuitable for the current problem, as discussed below.

One approach is place refinement or transition refinement: informally speaking, places or transitions on the higher hierarchy level are representative of a subnet on the lower level (the example of transition refinement is shown in Figure 5.2). This technique is not suitable, because there are no distinguishable ports, and as either only places or only transitions are refined, there is no way to have both state ports (which must be places) and event ports (which must be transitions).

Another approach, which is applied in the modelling technique ”Fundamental Modeling Concept (FMC)” [KW03], is partitioning Petri Nets by cutting edges that lead from a transition to a place. This approach is useful, e.g., for modelling production processes: Process A passes a piece of work to Process B; B can start working on that piece immediately or later. This approach is shown in the right part of Figure 5.2. Again, the semantics is not suitable for SEFT modules. SEFT composition does not incorporate the storage of events; all events must be recognised immediately or never. Also, there is no appropriate counterpart for state ports. So this approach had to be discarded as well.

An approach that fits well with the SEFT component concept with state and event ports has been proposed by [KW01]. Both places and transitions can be declared part of the external interface of a component net. In this notation, the interface is graphically shown in a separate part of the net.
When passing the interface description to somebody, only this part is passed; the rest of the net is private and not visible from the outside. As the interface elements are part of the interface and also part of the net, they graphically appear twice. In order to show to which internal net elements the interface elements belong, a new kind of edge (dashed line) is


Figure 5.2: Different Modularisation Techniques: Transition Refinement (left) and Net Cutting (right)

introduced, which connects both. An example is given in Figure 5.3. Interface elements can either be imported or exported. Exported means that the element from this component is the master element that is offered to the environment, and one or more imported elements can refer to it. During flattening, all imported elements are merged to the exported elements. An imported element means that the component is not complete until it is connected to another component net providing the required export counterpart. All net elements in the interface parts must either be annotated with the keyword export or with the keyword import. The interface elements have IDs that are necessary for flattening.

Figure 5.3: Modular Petri Nets according to [KW01]

An example for a net composed of three instances of that module is shown in Figure 5.4. Unlike in SEFT, where components on all hierarchy levels are described by a visual model, Kindler-Weber’s approach uses a textual language to instantiate Petri Net modules and combine them by assigning imported elements to exported elements by their ID. For example, the definition of the net in the figure first contains a place p (the leftmost place) that is defined on the super-component level, then three instance definitions ”m1” to ”m3” of the module ”M1”, which is known from Figure 5.3. The import place of the second one is assigned to the export place of the first by the syntax ”m2 = instance M1(p1 = m1.p2)” and so on. During flattening, the matching imported and exported elements are merged, according to their IDs.
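The merging of imported and exported elements during flattening can be sketched as a renaming of arc endpoints. The sketch assumes a hypothetical module M1 consisting of an import place p1, a transition t and an export place p2; the real module of Figure 5.3 may look different, and the binding of m1.p1 to the top-level place p is likewise an assumption made for illustration.

```python
def flatten(arcs, bindings):
    """Replace every imported element by the exported element it refers to.

    arcs:     list of (source, target) pairs over qualified names such as 'm2.p1'
    bindings: {imported_name: exported_name}
    """
    def resolve(name):
        seen = set()
        while name in bindings:
            if name in seen:
                raise ValueError(f"cyclic binding at {name}")
            seen.add(name)
            name = bindings[name]
        return name

    return [(resolve(source), resolve(target)) for source, target in arcs]


if __name__ == "__main__":
    instances = ["m1", "m2", "m3"]
    # Hypothetical module body: p1 -> t -> p2, instantiated three times.
    arcs = [arc for m in instances
            for arc in ((f"{m}.p1", f"{m}.t"), (f"{m}.t", f"{m}.p2"))]
    bindings = {"m1.p1": "p", "m2.p1": "m1.p2", "m3.p1": "m2.p2"}
    for arc in flatten(arcs, bindings):
        print(arc)
```

After flattening, no imported name remains in the arc list; each import has been replaced by its exported master element.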

Figure 5.4: Flattening of Modular Petri Nets according to [KW01]

Kindler-Weber’s modular PN approach served as a template for the modular DSPNs that are generated during SEFT translation. However, there are some differences: First, the visual notation with doubled elements and a new kind of edges was abandoned in favour of standard Petri Net accompanied by an interface list that mentions the IDs of all imported and exported elements plus the IDs of their SEFT counterpart. The latter is necessary as the components are later merged according to the SEFT port IDs and not the DSPN place or transition IDs. There is no textual language that describes how the system level net is assembled from modules. Instead, the net that corresponds to the higher level SEFT contains a so-called footprint, i.e., a set of places and transitions that correspond to the ports of the subcomponent, and during flattening after the sub-net has been inserted, its port elements are merged to the corresponding footprint elements. Regardless of the signal-flow direction (input or output port), the footprint elements are always imported elements and the port elements of the sub-net are always exported elements. After merging, the resulting elements are private elements of the higher level net and not part of its interface. Therefore, each merged pair of import/export elements is deleted from the interface list after flattening. More details can be found in Section 5.6 about Flattening and in the pseudo-code in Appendix B.

5.4 Translation into DSPNs

5.4.1 Overview and Technical Remarks

In the following sections, the translation algorithm is summarized and explained. In Appendix B, the detailed algorithm is given in pseudo-code. The major parts are initialisation, translation of each involved model, flattening of the model hierarchy, and preparation for analysis. The framework for the prototype implementation is the ESSaRel tool, which offers facilities to start the analysis


for a model from the GUI, to load other referenced components, and to call external analysis tools. There is an abstract class Analysis from which all analyses for different kinds of models inherit. They must all implement a main operation Compute() that organises the rest of the algorithm. This main function can refer to subfunctions within the model-specific analysis class, but also to other analysis classes and to translations. Translations are another principal family of classes in ESSaRel. They process a source model and produce a target model of the same or of another model type (e.g., SEFT into DSPN, hierarchical DSPN into flat DSPN). Not only translations to other models, but also flattening or simplification operations can be implemented using subclasses of Translation.

The overall structure of the main operation Compute() of the SEFT analysis that is started by the user is:

• Init
• Load and check all involved SEFT models and put them in a data structure (Containment Tree, see below)
• for each involved SEFT model:
  – translate (SEFT model -> component DSPN model)
• optionally, for each resulting DSPN model:
  – simplify
• for each pair (subcomponent, referenced model) in bottom-up order:
  – flatten (referenced DSPN model of subcomponent -> environment DSPN model)
• Export flat DSPN to TimeNET XML file format
• Call TimeNET for analysis or simulation of the requested measure
• Read TimeNET result file and display results

The translation of SEFT to DSPN that is applied to each component model comprises the following steps:

• Check of validity preconditions on model level
• Translation of states and events
• Translation of component ports
• Translation of gates
• Translation of subcomponent footprints
• Translation of temporal and causal edges

After these steps, all SEFT component models have been translated into corresponding DSPNs, but not yet integrated with each other. Note that there is always just one translation per component class, no matter how often that component class is instantiated as subcomponent of other components. The following parts (described in Section 5.6) are flattening and preparation for analysis. Their purpose is to transform the hierarchical SEFT model into one flat DSPN and to make it accessible to the analysis / simulation tool.3 This comprises the steps:

• Flattening of the component hierarchy
• Application of initial DSPN markings
• Export of the DSPN to the analyser file or data format
• Translation of the measure to be calculated
• Sending an analysis request to the analysis tool
• Capturing and displaying the results

The details of these procedure steps are described in the following sections.
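The control flow of the main operation can be sketched as follows. This is not the ESSaRel code: the class SeftAnalysis, its constructor arguments and the dictionary standing in for the Containment Tree are hypothetical, the translation and flattening steps are passed in as plain functions, and the export to TimeNET is left out.

```python
from abc import ABC, abstractmethod


class Analysis(ABC):
    @abstractmethod
    def compute(self):
        """Organise the whole analysis run."""


class SeftAnalysis(Analysis):
    def __init__(self, root, children, translate, flatten, simplify=None):
        self.root = root            # name of the top-level component class
        self.children = children    # {component class: [subcomponent classes]}
        self.translate = translate  # SEFT model -> component DSPN
        self.flatten = flatten      # (environment DSPN, sub-DSPN) -> new DSPN
        self.simplify = simplify    # optional DSPN -> DSPN

    def _involved(self):
        seen, stack = [], [self.root]
        while stack:
            component = stack.pop()
            if component not in seen:
                seen.append(component)
                stack.extend(self.children.get(component, []))
        return seen

    def _flat(self, component, nets):
        # Bottom-up: every subcomponent is flattened before it is inserted.
        net = nets[component]
        for sub in self.children.get(component, []):
            net = self.flatten(net, self._flat(sub, nets))
        return net

    def compute(self):
        # One translation per component class, however often it is instantiated.
        nets = {c: self.translate(c) for c in self._involved()}
        if self.simplify is not None:
            nets = {c: self.simplify(n) for c, n in nets.items()}
        return self._flat(self.root, nets)
```

The sketch requires flatten to return a new net rather than modifying its arguments, because the translated net of a component class is reused for every instance of that class.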

5.4.2 Initialisation and Precondition Check

Immediately after the technical initialisation of the translation class, the component hierarchy has to be analysed to determine which subcomponents, subsubcomponents and so on are nested within the component to be translated. All models are loaded into memory and the presence and correctness of all involved models has to be assured. These steps are the same as for the analysis of CFTs as described in Section 3.6.2. Also, the Containment Tree that stores the information about the nesting hierarchy is the same (cf. example there). It is built when checking and loading all involved component models.4

The following translation is performed separately for each involved model. The Containment Tree guides this procedure; it is iterated, and for each component SEFT model, the translation procedure (SEFT to DSPN) is called. As a means of analysis time reduction, it should first be verified for each component whether a translated DSPN model is readily available (either in memory from a preceding translation or loaded from file) and whether this is still valid (the last modification to the source model was not later than the last translation). If a valid translation result is found, the rest of the translation can be skipped and the available result is used instead. For more details on this kind of acceleration, refer to Section 7.5.2.

3 The tool for the validation of SEFTs was ESSaRel 1.0, the tool for DSPN analysis and simulation was TimeNET 3.0. A direct integration of both tools could not be implemented yet, because some of the technical preconditions were not given at that time; mainly, both tools required different operating systems. Thus, TimeNET had to be operated manually and the last three steps of the following list were omitted.
4 Technically, the situation is a little bit more complicated in the ESSaRel tool than in the UWG3 tool for CFTs: As UWG3 only deals with CFT models, the terms component and model could be used synonymously. ESSaRel allows different kinds of models per component and so the right model must be located within a component.

The checks for preconditions are also similar to those performed on CFTs. The rules for well-formedness and validity that must be checked for each model are different and will be discussed below. For example, in SEFTs there are two different types of edges and the prohibition of cycles applies only to causal edges. Moreover, the requirement from CFTs that each subcomponent reference must match the footprint of the corresponding component model is extended: in SEFTs, this applies not only to the number of input and output ports, but also to their type (state vs. event).

After all component models have been loaded for translation, but before actually starting the translation, some validity preconditions on the model level and between different model levels have to be checked. These are conditions that exceed the conditions for well-formedness given in Section 4.5 and could not be checked earlier, for at least one of the following reasons:

1. because they take so much effort to verify that a verification on each modification to the graph is impractical,
2. because they are usually violated during the construction phase of the model (e.g., the condition that there must be exactly one Init transition in each SEFT component is naturally violated before the analyst inserts this Init transition), or
3. because they concern more than one model and can, therefore, only be checked when all models are available (e.g., the condition that a subcomponent reference matches the referenced component model in number and type of the ports).

Some important checks to be performed are the checks for absence of untimed causal cycles and the check that all causal input ports of gates are actually connected.5 Details about the rules to be checked can be found in Appendix B.
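One of these checks, the absence of causal cycles, can be sketched as a depth-first search over the causal edges. The sketch assumes that the caller passes only untimed causal edges, i.e., that edges through timed elements have already been left out; the function name and the edge representation are hypothetical. The rules actually checked are those of Appendix B.

```python
def find_causal_cycle(causal_edges):
    """Return the elements of one causal cycle as a list, or None if there is none.

    causal_edges: iterable of (source, target) pairs of element IDs.
    """
    graph = {}
    for source, target in causal_edges:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for successor in graph[node]:
            if colour[successor] == GREY:
                # Back edge: the cycle is the tail of the current path.
                return path[path.index(successor):] + [successor]
            if colour[successor] == WHITE:
                cycle = visit(successor)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

Returning the elements of the cycle instead of a mere yes/no answer makes it possible to point out the involved graph elements to the analyst.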

5.4.3 Translating States and Events

As a first step, all states and events given in the SEFT are translated. For each state a place6 is inserted and for each event a transition. Deterministic and exponentially distributed events are translated into the corresponding DSPN transition type, and the parameter (deterministic delay, or rate in the case of a probabilistic event) is copied. A triggered SEFT event becomes an immediate DSPN transition. The translation of the triggering or guarding relation is not performed now, but later when translating causal edges; then, the transition that is generated now will be merged with another transition belonging to a subnet called ”trigger pattern” or ”guard pattern”. The distinct Init transition is not translated as a DSPN transition; instead, the start state it points to is translated to a place which is marked by a token in the beginning. The position of each state or event on the drawing area is also copied to the corresponding DSPN element. This is not really necessary because the resulting DSPN is just meant to be sent to an analysis program; however, during the debugging phase, it is likely that humans want to check the DSPN manually and therefore it must be possible to display it.

5 One remark regarding tool implementation: The user feedback on violation of any of the conditions should be provided in a way that visually points out the involved graph elements, as otherwise it can be extremely difficult for the analyst to locate the violation.
6 It is a common misunderstanding that a place of a Petri Net means a state; it is rather the complete marking of a Petri Net that determines its state. However, a Petri Net can model a state-machine if each state corresponds to one place and the whole net contains exactly one token that always marks the active state.
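The mapping of states and events can be summarised in a minimal sketch. The data classes below are hypothetical stand-ins for the SEFT and DSPN model classes; graphical positions and the later merging with trigger or guard patterns are not modelled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    name: str
    kind: str                         # "exponential", "deterministic" or "triggered"
    parameter: Optional[float] = None  # rate or delay; None for triggered events


@dataclass
class Dspn:
    places: dict = field(default_factory=dict)       # name -> initial tokens
    transitions: dict = field(default_factory=dict)  # name -> (type, parameter)


def translate_states_and_events(states, events, start_state):
    net = Dspn()
    for state in states:
        # The Init transition itself is not translated; the start state
        # it points to simply receives the initial token.
        net.places[state] = 1 if state == start_state else 0
    for event in events:
        if event.kind == "triggered":
            net.transitions[event.name] = ("immediate", None)
        else:
            net.transitions[event.name] = (event.kind, event.parameter)
    return net
```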

5.4.4 Translating Ports

Next, the ports of the component model are translated (this applies to input and output ports of both types, state and event ports). Note that at this time only the ports of the model itself, i.e., the triangular symbols in the SEFT notation, are translated, and not yet the ports of nested gates or subcomponents (i.e., the small squares on the edge of the rectangular gate or subcomponent symbol). State ports, no matter whether inputs or outputs, are translated as places and event ports as immediate transitions (again, the trigger and guard relations will be added later). All of these elements will be marked as export elements, i.e., elements that can be referenced (imported) by another component, in practice the super-component. Component DSPNs contain an additional interface list holding the information on which original port belongs to which Petri Net element as ID pairs; otherwise, later flattening would not be possible.

5.4.5 Translating Gates: The Dictionary

Next, the contained SEFT gates are translated. The semantics of all gates can be defined by a DSPN structure. Designated ”export” places and transitions correspond to the gate ports, just as the ports of subcomponents are translated as places or transitions. Sometimes, gates contain state memory, sometimes they simply map input markings to output markings. Technically, the translation is performed by means of a ”dictionary”, i.e., a virtual data structure that, for each gate from Section 4.6, contains a corresponding component DSPN to be pasted into the DSPN environment. The full dictionary can be found in Appendix A. The DSPN structures from the dictionary realise the Boolean connective or the state-machine behaviour represented by the corresponding SEFT



gate, as verbally specified in Section 4.6. Test cases that the semantics is correctly represented are given in Section 7.1, some proofs and correctness arguments are given in Appendix C. The dictionary returns DSPN structures that are immediately embedded into the environment DSPN. When requesting the counterpart DSPN to a given gate, the type of gate must be specified. The term type encompasses not only the principal class of a gate, such as AND, OR, NOT, etc., but also the type of the inputs, e.g., Event-State-AND vs. State-AND. Moreover, for some gates it is necessary to specify additional parameters. These may either be functional parameters such as the delay time of the Delay gate, or structural parameters, in particular the number of inputs that is variable for some gates. In the following, functional parameters are denoted by t for time and p for (conditional) probability, and the variable number of inputs is denoted as n.7 In order to cope with these variants, it is recommendable to implement the dictionary as an encapsulated data structure with access methods that return the desired net. This way, there is no need to literally store all variants in the dictionary (which is impossible as the number of variants is infinite), because the access method can dynamically produce the gate applying some rules instead of looking them up in a static data store. The access method in the prototype implementation ESSaRel is called Dictionary.Translate(). It takes as arguments the name of the gate (implemented as an enum type or string), the number of inputs n, the target DSPN model (environment), the functional parameters such as delay time, and some other parameters that are technically necessary, such as the graphical position where the DSPN structure should be inserted. 
To produce gates with more inputs, for instance an AND gate with 5 inputs, there are two possibilities: Preferably, the given 2-input Petri Net structure is modified by additional input places or transitions so that it fits the required number of inputs. Alternatively, all gates with n inputs can be decomposed as a cascade of gates with 2 inputs, e.g., AND(x,y,z):=AND(x,AND(y,z)). For some advanced gates, e.g., Priority-AND with time parameter, the cascade has to be augmented by some additional network elements, in particular, a timer or reset circuit. The latter approach of constructing one gate with n inputs out of n-1 gates with 2 inputs is mainly of theoretical importance, as it defines the semantics of gates with n inputs formally, so all considerations in this thesis apply without restriction to n inputs, even if only examples with 2 inputs are given in the dictionary; the extension to more than two inputs (or the reduction to one input) does not pose any substantial problems. Both possibilities of extending a two-input gate to an n-input gate are exemplified in the following figure, extending a State-AND:

Figure 5.5: The State-AND Gate with 2 Inputs (upper) and two Different Implementation Variants with 3 Inputs (middle and lower)

The approach chosen for the implementation of the ESSaRel tool was the first one. The corresponding DSPN structure for each gate with an arbitrary number of inputs has been divided into an invariant part, which contains, in particular, the output of the gate, and a building block, which can be instantiated once or more often, depending on the number of inputs. The access function that gets a gate out of the dictionary first puts the invariant part into the DSPN, then adds as many input building blocks as appropriate, plus some ”glue” arcs that put the building blocks and the invariant parts together according to a regular pattern. In the following, all input building blocks are drawn within a dashed box to show that they form a unit that can be replicated. An example can be seen in Figure 5.6, which again shows the State-AND gate as it can be found in the dictionary.

7 For compatibility reasons, we maintain this notation also for the voter gate, which is usually called n-out-of-m gate, i.e., with respect to the dictionary it should rather be called m-out-of-n.

Figure 5.6: A Generic Implementation for the State-AND Gate Using Building Blocks
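The cascade alternative, which defines the semantics of gates with n inputs, can be written down in a few lines. The sketch below represents gates as nested tuples, which is an illustrative choice only; it covers gates for which the plain cascade is sufficient and not, for example, the Priority-AND with time parameter, which needs additional network elements.

```python
def cascade(gate, inputs):
    """Decompose an n-input gate into nested 2-input gates.

    cascade("AND", ["x", "y", "z"]) corresponds to AND(x, AND(y, z)).
    """
    if len(inputs) < 2:
        raise ValueError("a cascade needs at least two inputs")
    if len(inputs) == 2:
        return (gate, inputs[0], inputs[1])
    return (gate, inputs[0], cascade(gate, inputs[1:]))


def count_gates(expression):
    """Count the 2-input gates in a nested expression."""
    if not isinstance(expression, tuple):
        return 0
    return 1 + count_gates(expression[1]) + count_gates(expression[2])


if __name__ == "__main__":
    print(cascade("AND", ["x", "y", "z"]))
    print(count_gates(cascade("AND", ["a", "b", "c", "d", "e"])))
```

For n inputs the cascade always consists of n-1 two-input gates, which is the reason why the building-block variant tends to produce smaller nets.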

All gate inputs are distinguishable, which is technically achieved by a port ID system. Inputs of the same type are sometimes, but not always, commutative. For instance, the order of the event inputs of a Priority-AND obviously matters. The IDs of the gate inputs can be found along with the graphical and semantic description in Section 4.6. In the table representation given in this thesis, all correspondences between SEFT and DSPN IDs are marked by labelling the DSPN elements with the corresponding SEFT IDs. In the technical implementation, this may be handled differently; the suggested implementation separates the graph from the correspondence list. The usage of the dictionary and the translation of SEFT gates is exemplified by two more gates, the Event/State-AND and the Priority-AND with time input. The corresponding records from the dictionary (fully given in Appendix A) are shown in Table 5.1. In the appendix of this thesis, all of the data is given as in the example; the implemented dictionary, however, only contains the structural data needed for translation; all other information is exploited in other parts of the software. For instance, the list of the allowed number and type of inputs and outputs is needed to do a validity check on the SEFT structure before the actual translation is started. The left column specifies the name of the gate. The second column specifies the allowable input and output ports in an abbreviated notation. As the number of outputs is always one for each gate, this number is not specified, but only the type of the output port (state or event). For inputs, both the (possibly variable) number and type are specified, and it is also specified whether or not the inputs are commutative and which of them have a special function, such as a Reset input. The first example can be read as: one event type output (letter ”E”), one event type input



Name             Ports                       Param.    Gate Symbol   Related DSPN
AND_Event_State  Out:E, In:1E, 1+Sc          none      (graphic)     (graphic)
Det_Delay_R(t)   Out:E, In:1E ”Reset”, 1E    Delay t   (graphic)     (graphic)

Table 5.1: Two Examples from the SEFT-to-DSPN Dictionary

(”1E”), and one or more (”1+”) state type (”S”) inputs that are commutative (”c”). In the case of the Deterministic Delay gate with Reset, there are two event inputs, one of them labelled ”Reset” in order to distinguish it from the triggering input that is also present in the case of the standard Delay gate. The third column specifies the functional parameters; additionally, the total number of inputs is always supplied to the dictionary as an implicit parameter, by convention denoted as n.8 At runtime, the parameters must be specified when calling the dictionary access method.9 In the example of the Deterministic Delay Gate with Reset, there is one functional parameter: the delay time, after which the output triggers after the input has been triggered, provided that no reset occurs in between. An interface list, which is part of the DSPN structure as provided by the dictionary, specifies which DSPN elements correspond to which ports of the original SEFT gate. Resuming the example of the Event-State-AND with two inputs (i.e., one state input), the transition in the upper part of the net in the dictionary is labelled ”Pout1” (its internal ID is T1), the transition to the left in the lower part of the net is labelled ”Pin1” (internal ID T2), and the place P1 in the first building block (dashed line rectangle) is labelled ”Pin2” (note that there is just one building block in the case of just one state input). This notation means that the actual interface list supplied with the DSPN fragment reads as in Table 5.2.10

Direction   Type    Source Domain ID   Target Domain ID
export      event   Pout1              T1
export      event   Pin1               T2
export      state   Pin2               P2

Table 5.2: Interface List, Example: Event-State-AND(2)

8 For easier implementation, the total number of inputs and not the number of variable inputs has been chosen as parameter n. So the Event-State-AND has a minimum value of 2 for the parameter n, indicating that there is one state input plus the event input that is invariable.
9 Technically, in the ESSaRel tool implementation, a reference to the gate object is passed to the dictionary object. From the gate, the dictionary can obtain all relevant parameters.

All DSPN elements labelled with a port ID are exported elements, i.e., visible to the environment. The environment, i.e., the DSPN corresponding to the model being translated, has places and transitions specified as imported that have been translated before inserting the gate. After instantiating the gate DSPN, the imported and exported elements are merged so that the gate is connected to its environment. In the example of the Event-State-AND with n=2, in an assumed environment, the elements corresponding to the (event) ports Pin1 and Pout1 are immediate transitions in the DSPN and the element corresponding to (state) port Pin2 is a DSPN place. The Translate() operation gets a target model as argument where it embeds the instantiated DSPN structure. Looking at the second example, the Priority-AND with delay with two inputs, there is an additional parameter t that has to be translated. In the DSPN fragment given in the dictionary there is one deterministic DSPN transition whose delay time is labelled t. This signifies that the delay parameter to be set is the argument t passed to the Dictionary.Translate() operation. This way, the numerical parameters are translated from the source model to the target model.

5.4.6 Translating Subcomponent Footprints

An SEFT component may contain instances of other components as subcomponents. This is graphically denoted by subcomponent references, black boxes where only the ports are visible. These subcomponents look similar to gates, except that there may be more than one output and there may be zero inputs. In fact, every SEFT gate could be realised by a subcomponent as well. Unlike gates, subcomponents refer to structures that are individual to each project and not necessarily known at translation time. Thus, they cannot be looked up in a dictionary like gates, but the corresponding component model must be translated to a DSPN and inserted in their place. The step of inserting the translated structure is called flattening and will be performed later; its description follows in Section 5.6. During translation, only the ”footprint” of the component, i.e., the DSPN counterparts to its ports, is inserted. This is necessary because these elements could be sources or targets of Petri Net arcs and these arcs could not be translated otherwise. Therefore, for each state type port (no matter if input or output), a Petri Net place is inserted and for each event type port, an immediate transition. This allows the arcs to address these elements as source or target. The flattening procedure will merge the net elements (place or transition) corresponding to the footprint within the super-component with the net elements corresponding to the ports of the component acting as a subcomponent.

10 The list is necessary because in the actual implementation it is, of course, not possible to specify ”the transition in the upper part of the picture”. Only the net element IDs indicate unambiguously which element is being referred to.

One issue that remains to be solved is to remember which DSPN element corresponds to which SEFT port. This cannot be achieved using the ID of the DSPN element, as this is a number that is arbitrarily assigned during translation and there is no relation between the footprint element in the super-component and the port element in the subcomponent. The solution is to maintain a separate data structure called Interface List where the DSPN element IDs are stored together with the original port IDs. Together with the knowledge about which subcomponent instance references which component model, this provides sufficient information for matching the ports. Figure 5.7 gives a (simplified) example: A component SC1 of component class C2 is a subcomponent of component class C1. It has two input ports Pin1 and Pin2 (one state type and one event type) and one output port (event). The port with the ID SC1.Pin1 within C1 shall later be merged with Pin1 within an instance of C2 and so on.

Figure 5.7: An SEFT Component with a Subcomponent and the Referenced Component SEFT

Figure 5.8 shows the translation of both components to DSPNs. In the left part of the figure, the footprint is the empty space in the middle with the three elements denoting the ports around it. The Interface Lists of both components are shown in Table 5.3. This example is continued in the section about flattening (Section 5.6).
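The footprint insertion described in this section can be illustrated by a minimal Python sketch. It is not the ESSaRel implementation: the names `Dspn`, `InterfaceEntry`, `add_port_element` and `add_footprint` are hypothetical, the generated element IDs are arbitrary (as the text requires) and do not reproduce the IDs of Figure 5.8, and the assignment of port kinds (Pin1 and Pout1 as event ports, Pin2 as a state port) is an assumption read from the figure labels.

```python
from dataclasses import dataclass, field


@dataclass
class InterfaceEntry:
    dspn_id: str    # arbitrary ID of the DSPN element created during translation
    direction: str  # "import" for footprint elements, "export" for a component's own ports
    source_id: str  # original SEFT port ID, e.g. "SC1.Pin1"


@dataclass
class Dspn:
    places: list = field(default_factory=list)
    transitions: list = field(default_factory=list)
    interface: list = field(default_factory=list)

    def add_port_element(self, port_kind, direction, source_id):
        """Insert the DSPN counterpart of one port and record it in the Interface List."""
        if port_kind == "state":      # state type port -> place
            dspn_id = f"P{len(self.places) + 1}"
            self.places.append(dspn_id)
        elif port_kind == "event":    # event type port -> immediate transition
            dspn_id = f"T{len(self.transitions) + 1}"
            self.transitions.append(dspn_id)
        else:
            raise ValueError(f"unknown port kind: {port_kind}")
        self.interface.append(InterfaceEntry(dspn_id, direction, source_id))
        return dspn_id


def add_footprint(dspn, instance_id, ports):
    """Insert the footprint of one subcomponent instance into a super-component DSPN."""
    for port_id, port_kind in ports.items():
        dspn.add_port_element(port_kind, "import", f"{instance_id}.{port_id}")


# Ports of component class C2 (port kinds are an assumption, see above).
C2_PORTS = {"Pout1": "event", "Pin1": "event", "Pin2": "state"}

c1 = Dspn()
add_footprint(c1, "SC1", C2_PORTS)
footprint = [(e.dspn_id, e.direction, e.source_id) for e in c1.interface]
```

After the call, `footprint` holds one entry per port of SC1, so that arcs inside C1 can address the created place or transition, while the Interface List preserves the link to the original port ID for the later flattening step.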

5.4.7

Translating Edges

Table 5.3: Interface Lists of Super-component and Subcomponent

Super-component (C1)
Direction   Source Domain ID
import      SC1.Pout1
import      SC1.Pin1
import      SC1.Pin2

Subcomponent (C2)
Direction   Source Domain ID
export      Pout1
export      Pin1
export      Pin2

Figure 5.8: The corresponding DSPN structures (left: C1.DSPN, right: C2.DSPN)

Temporal edges in SEFTs denote the predecessor/successor relation between states and events and thus only occur in parts of the model where the analyst uses state-machine elements. The translation of temporal edges is straightforward: They are translated into DSPN arcs, and the source and target of these arcs are the places and transitions that correspond to the source and target elements from the SEFT.

Causal edges denote trigger and guard relations: if they lead from one event to another event, this is a trigger relation; if they lead from a state to an event, this is a guard relation. If ports are involved, they can model triggering or guarding relations across component boundaries, i.e., between components of different hierarchy levels or between different subcomponents of the same super-component (cf. Section 5.3). Unlike temporal edges, causal edges cannot be translated by simple DSPN arcs, but require patterns that reflect their causation semantics. The SEFT causation semantics comprises two main assumptions:

1. There is no backward influence, i.e., the component (or event) being influenced cannot change the state of the component influencing it or provoke or inhibit any events there.
2. The cause follows immediately, i.e., events are not stored to entail some consequence at a later time (unless explicitly modelled). If the target event is not ready to occur at the same instant, the triggering event is ignored.

This semantics does not match the usual Petri Net semantics: First, in Petri Nets, sensing a marking (part of the state) is performed by an outgoing arc from a place, and when the following transition fires, the marking of that place is modified. Second, when a token is passed to another component, there is no guarantee that it is used immediately; it can stay on a place for a long time until it is used. The first problem can be solved by sensing the marking of a place either by an inhibit arc (which prevents the transition from firing as long as the place is marked, but does not take any token away) or by a pair of anti-parallel standard arcs (which takes a token, but immediately puts it back).
A pair of anti-parallel arcs is called a guard-pattern, because it reflects the guard relation that is denoted by a causal edge from a state to an event. It is shown in Figure 5.9. Note that the guarded transition can only fire if the guarding place is marked; the marking of the guarding place is not changed.11

11 The guard-pattern was introduced because of the lack of a positive counterpart of inhibit arcs in DSPNs. Not only does the pair of arcs look more complicated than a single arc, but it can also cause semantic problems when more than one transition senses the same place: in this case, the firing order of the conflicting transitions is undefined, which can, in some cases, lead to different runs of the system. A more elegant solution, which has been proposed for other types of Petri Nets than DSPNs, is the introduction of test arcs, which are a positive counterpart of inhibit arcs [CH93]. They join a place with a transition and allow the transition to fire only if the place is marked; in contrast to standard arcs, they do not change the marking of the place they test.

Figure 5.9: The Guard-Pattern: SEFT (left) and DSPN Translation (right)

The second problem can be solved by introducing another pattern that prevents tokens from being stored in a trigger relation and that is used as the DSPN counterpart of a causal edge from one event (or event port) to another. This trigger-pattern adds an artificial place on which the triggering transition puts a token when firing, and an additional immediate transition that is connected to this place and has lower priority than the triggered transition. When the triggering transition fires and puts a token on the place, either the triggered transition is ready and consumes that token immediately (because it has higher priority), or the triggered transition is not ready and the token is consumed by the artificial low-priority transition. In any case, the token is consumed immediately. The two cases are shown in Figure 5.10, which shows the (vanishing) state immediately after the triggering transition C1.T1 has fired. In the middle part of the figure, C2.T1 is ready and will fire next so that triggering occurs. In the right part, C2.T1 is not ready and Tx steals the token. Note that in DSPNs, priority 2 takes precedence over priority 1. The left part shows the SEFT model from which the DSPN structure was translated.

Figure 5.10: The Trigger-Pattern: SEFT (left) and Corresponding DSPN: a) Case Where the Triggered Transition Is Ready (middle) and b) Case Where It Is Not Ready (right)
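The behaviour of the guard-pattern and the trigger-pattern can be reproduced with a few lines of Python. The following is a simplified sketch that only models immediate transitions with priorities; it is not a DSPN engine and not part of ESSaRel. The class `Transition`, the function `settle` and the place names `S_guard`, `pre`, `post`, `C2.ready` and `C2.done` are hypothetical; `E`, `Tx` and `C2.T1` follow the labels of Figure 5.10.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    name: str
    inputs: tuple      # places from which one token each is consumed
    outputs: tuple     # places on which one token each is produced
    priority: int = 1  # a higher number takes precedence, as stated for DSPNs above


def enabled(t, marking):
    return all(marking.get(p, 0) >= 1 for p in t.inputs)


def fire(t, marking):
    for p in t.inputs:
        marking[p] -= 1
    for p in t.outputs:
        marking[p] = marking.get(p, 0) + 1


def settle(transitions, marking):
    """Fire enabled immediate transitions by priority until none is enabled."""
    fired = []
    while True:
        candidates = [t for t in transitions if enabled(t, marking)]
        if not candidates:
            return fired
        chosen = max(candidates, key=lambda t: t.priority)
        fire(chosen, marking)
        fired.append(chosen.name)


# Guard-pattern: anti-parallel arcs take the token from S_guard and put it back.
t_guarded = Transition("T", inputs=("S_guard", "pre"), outputs=("S_guard", "post"))
guarded = {"S_guard": 1, "pre": 1}
fired_guard = settle([t_guarded], guarded)
blocked = {"S_guard": 0, "pre": 1}
fired_blocked = settle([t_guarded], blocked)

# Trigger-pattern: the state right after C1.T1 has put a token on the artificial place E.
trigger_net = [
    Transition("C2.T1", inputs=("E", "C2.ready"), outputs=("C2.done",), priority=2),
    Transition("Tx", inputs=("E",), outputs=(), priority=1),
]
case_a = {"E": 1, "C2.ready": 1}  # triggered transition is ready
fired_a = settle(trigger_net, case_a)
case_b = {"E": 1, "C2.ready": 0}  # triggered transition is not ready
fired_b = settle(trigger_net, case_b)
```

In both trigger cases the token on `E` is gone once the vanishing markings have been resolved, which corresponds to the second assumption of the causation semantics: a trigger that cannot be used immediately is lost.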

The remaining technical problem when translating both kinds of edges is to find the DSPN elements that correspond to the SEFT elements to be joined. For instance, if an edge leads from state S1 to event E2 of an SEFT, the resulting DSPN arc must lead from the counterpart place of S1, say P5, to the counterpart transition of E2, say T7. The counterparts can be looked up in a cross-reference list that is maintained during translation. Each pair of original element and translated element is entered there so that the connection can be made. After the translation, the cross-reference list can be discarded; only the information about ports and their counterparts must be kept in the Interface List (cf. Section 5.4.6).

5.5

Optional Simplification

At this point of the algorithm, i.e., after translation of the individual components but before flattening of the component hierarchy, a simplification step is suggested. During flattening, the state explosion problem appears. It is a promising idea to reduce the state space by exploiting the component concept, in particular the fact that only those states or events that are connected to ports are externally visible. Some ideas about simplification (with or without loss of accuracy) are given in Section 7.5.2; however, a final evaluation and implementation of these techniques could not be achieved within this doctoral research work. The recommended translation and flattening procedure as implemented in the prototype tool ESSaRel is ready for future insertion of simplification steps. Therefore, it starts flattening bottom-up with regard to the component hierarchy and calls for a simplification step after each level of integration, leading to an alternating sequence of flattening and reduction steps until the top level system is reached. As the simplified DSPN models are used everywhere a component is referred to, the computational effort for the reduction pays off several times.

5.6

Flattening

Flattening produces one hierarchy-less standard DSPN from the set of DSPNs corresponding to the system and all of its subcomponents, sub-subcomponents and so on. The flattening procedure consists of copying instances of subcomponent DSPNs into the containing DSPNs and merging the elements that correspond to ports. As flattening usually follows the translation of the involved SEFTs into DSPNs, the resulting DSPNs can be assumed to be present in memory; otherwise, they have to be loaded. The flattening procedure creates a new DSPN model belonging to the top-level component. The Containment Tree (explained for CFTs in Section 3.6.2, created in the first analysis step in Section 5.4.2) guides the flattening procedure. It is traversed depth-first and the components are integrated bottom-up until the top level is reached. The flattening procedure starts from the leaves of this tree, i.e., the components that do not contain any subcomponents. These are integrated into the components referencing them as subcomponents. The result of each integration step should be subjected to a simplification step (cf. Section 5.5) and stored (cached) for later reuse (cf. Section 7.5.2); both steps are optional and serve to improve performance. The flattening ends when the top-level component (the system to be analysed) is reached. The flattening procedure returns the flat DSPN for this component. This DSPN can be viewed, stored, and exported to any DSPN tool. The functions necessary for flattening are given as pseudo-code in Appendix B.

In the example from Figure 3.13, first an instance of C3 is integrated as SC1 into C2, then an instance of the flattened C2 as SC1 into C1, then again C3 as SC1 into C2 and the flattened C2 this time as SC2 into C1, then three instances of C3 as SC3, SC4 and SC5 into C1. The flattened DSPN of C1 is finally returned. This small example shows that the algorithm in this form does not exploit the component structure efficiently: the integration of C3 into C2 is performed twice (once for the instance SC1 of C1 and once for SC2 of C1). It is clearly preferable to store the flattened version of C2 and to instantiate it as often as required. This is how the actual implementation works; the Containment Tree produces a list that contains each pair of subcomponent and super-component just once.

To show how a subcomponent DSPN is inserted into its super-component DSPN, the example from Section 5.4.6 is continued. There, a footprint, i.e., a set of DSPN elements that correspond to the subcomponent ports, had been inserted into the super-component DSPN and connected to arcs there. Now, during flattening, an instance of the subcomponent DSPN is copied into the super-component. All of its net element IDs are prefixed with the subcomponent ID (this is necessary to avoid name conflicts). So the port elements of the inserted structure are named, e.g., SC1.P1 (if it is a place) or SC1.T1 (if it is a transition). The counterparts in the Interface List have the IDs SC1.Pin1 (for an input port), SC1.Pout1 (for an output port) and the like. The Interface List of the inserted structure is merged with the Interface List of the environment. Then the list is resolved to find matching pairs, because each port element exists twice: once because of the footprint, a second time because of the inserted DSPN. The elements of matching pairs are merged and deleted from the Interface List.
Figure 5.11 shows the models from Section 5.4.6 after flattening, but before merging (left) and after merging (right).
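The insertion and merging step can be sketched in Python as follows. This is an illustration under simplifying assumptions and not the pseudo-code of Appendix B: a net is reduced to a set of element IDs and a set of arcs, the Interface List is simplified to a mapping from SEFT port IDs to DSPN element IDs, and all names (`Net`, `instantiate`, `flatten_into`) as well as the element IDs and arcs of the example are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Net:
    elements: set = field(default_factory=set)     # place and transition IDs
    arcs: set = field(default_factory=set)         # (source ID, target ID) pairs
    interface: dict = field(default_factory=dict)  # SEFT port ID -> DSPN element ID


def instantiate(sub, instance_id):
    """Copy a subcomponent net and prefix every ID with the instance ID."""
    def pre(name):
        return f"{instance_id}.{name}"

    return Net(
        elements={pre(e) for e in sub.elements},
        arcs={(pre(s), pre(t)) for s, t in sub.arcs},
        interface={pre(port): pre(elem) for port, elem in sub.interface.items()},
    )


def flatten_into(sup, sub, instance_id):
    """Insert one instance of `sub` into `sup` and merge matching port elements."""
    inst = instantiate(sub, instance_id)
    elements = sup.elements | inst.elements
    arcs = sup.arcs | inst.arcs
    interface = dict(sup.interface)
    for port, inner in inst.interface.items():
        outer = interface.pop(port)  # footprint element; KeyError if the footprint is missing
        # Merge: every arc end at the inner port element is redirected to the footprint element.
        arcs = {(outer if s == inner else s, outer if t == inner else t) for s, t in arcs}
        elements.discard(inner)
    return Net(elements, arcs, interface)


# Hypothetical nets: C2 is the component model, C1 contains a footprint for instance SC1.
c2 = Net(
    elements={"T2", "P1", "T5", "P4"},
    arcs={("T2", "P1"), ("P1", "T5"), ("P4", "T5")},
    interface={"Pin1": "T2", "Pin2": "P4", "Pout1": "T5"},
)
c1 = Net(
    elements={"P1", "T4", "P2", "T3", "P3"},
    arcs={("P1", "T4"), ("T3", "P3")},
    interface={"SC1.Pin1": "T4", "SC1.Pin2": "P2", "SC1.Pout1": "T3"},
)
flat = flatten_into(c1, c2, "SC1")
```

The example also shows why the prefix is needed: both nets contain an element named P1, and only the prefixed copy `SC1.P1` can coexist with the P1 of the super-component. After merging, the matched entries have disappeared from the interface mapping, and the arcs of the subcomponent attach directly to the former footprint elements.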

5.7

Initial Marking of the DSPN

When the DSPN is passed to an analyser or simulator, the marking of the net must describe the state of the system at time t=0, i.e., the moment of taking the system into operation. For those parts of the DSPN that correspond to state-machine aspects of the SEFT, the translation algorithm has already marked each place corresponding to the start state of some component. This was achieved in the step “Translation of the Init transition”. For those DSPN parts that correspond to SEFT gates, an initial marking is provided in the dictionary. This initial marking is designed such that it corresponds to the situation in which the events connected to event inputs have never occurred before and the state expression connected to any state input is false in the beginning. This is the regular situation in FTA. However, there may be situations where this marking does not represent the initial condition consistently. This happens, in particular, if some state input is connected to the start state of some component, as shown in Figure 5.12. This constellation requires a different initial marking, with possible consequences for further substructures: as in the example the output of the gate is marked as well in the beginning, the marking has to be changed for the gate input it is connected to, and so on. This requires a post-processing step after flattening.12

Figure 5.11: The Flattened Component DSPN Before and After Merging

Figure 5.12: SEFT Example and DSPN of the OR-Gate with Irregular Start Marking

5.8

Analysis or Simulation with TimeNET

After translation and flattening, the result is a standard DSPN that can be analysed or simulated to calculate the required measures. These measures will typically be the probability of a (hazard) state at a certain point of time, or an average over the mission time, or the frequency of some (accident) event. Several established algorithms and tools are available to perform this analysis or simulation. Instead of implementing the analysis ourselves, we decided to use the tool TimeNET from Technische Universität Berlin [ZGFH99] to do the analysis within the ESSaRel tool.13 After completion of the analysis, the results are read back and displayed. Thus, to finally start the analysis, the DSPN must be exported to the TimeNET XML file format and the requested measure must be translated into a measure that can be determined by TimeNET (e.g., the marking probability for a place that corresponds to the system state of interest). Then the TimeNET calculation engine is called for a suitable analysis or simulation procedure.

The tool TimeNET offers both transient and steady-state analysis for DSPNs or, alternatively, simulation. As in practice the analysis usually refers to a limited mission time of the system, transient analysis is used.14 Analysis is faster than simulation (for evaluation results, see Section 7.5), but due to the analysis algorithm used, it can only be applied in cases where at most one deterministic transition is enabled in any marking. If this condition is violated, simulation is still possible. The validation experiments in this thesis exploited both ways of evaluation.

12 Fortunately, this step is obsolete for the types of analyses considered in this thesis. This is explained by the fact that the marking found after flattening is unstable if it does not represent the consistent start state; in this case, the enabled transitions fire immediately when the analysis starts, until a stable and consistent start state is reached. In the figure, the immediate transition T1 is enabled from the start and fires immediately when the analysis or simulation starts (at t=0); the resulting situation is stable and represents the correct initial state of the system. As the analyses considered in this thesis only evaluate marking probabilities over time and not the number of occurrences of transitions, the result is not affected by the additional firing in the beginning of the analysis. Therefore, the post-processing step for the initial marking has not yet been implemented in the prototype tool, without this having any negative impact.

13 The tool ESSaRel is planned to support direct integration of the TimeNET calculation engine or an alternative engine in the future, so that the user will start the analysis by pushing a button on the ESSaRel GUI and the results are displayed on the SEFT. The prototype that has been used to validate the results of this thesis did not yet offer this feature, so the TimeNET tool had to be started manually.

14 In the ESSaRel tool, the mission time for each component can be specified by the user, and so can additional analysis parameters, such as the number of sampling points for discrete time value tables and the required confidence intervals for simulation.

TimeNET can only evaluate marking probabilities of Petri Net places. When steady-state analysis or simulation is performed, the steady-state probability is calculated as a single value. When, as is the case in this thesis, transient analysis or simulation is performed, the results are two-fold: first, there is one probability value that gives the marking probability at the model time where the transient analysis ends (corresponding to the end of the system mission time in safety and reliability studies) and second, there is a file with time-discrete sampling values covering the whole analysis time from zero to the end of mission time. The main purpose of this latter file is to display the evolution of the system graphically, hence the developers of TimeNET call it a curves file. To this end, TimeNET has a built-in graph viewer, and our own tool ESSaRel also has one. For transitions, TimeNET offers no measures (which could be probability density, number of occurrences in a time interval, etc.).

What the analyst is actually interested in is not a measure about places in some Petri Net, but safety or reliability figures for the technical system modelled by an SEFT. If the output port the analysis refers to is a state port, the result is a probability (usually of a hazard or failed state). The analyst will have to specify further whether the end-of-mission-time probability, the average probability over the mission time, or the maximum probability over the mission time is of interest. All of these values can be calculated by easy computations based on the data found in the curves file. Also, other statistical data such as Mean Down Time, Mean Time to Failure or Mean Time Between Failures (where the output state of the analysis is assumed to be the failure state) can be determined from the data in the file. All necessary formulas are given in Section 2.2. The measure to be transmitted to TimeNET as a request is “tell me the probability that there is one token on place Px at time t”, in TimeNET syntax written as “P{#Px=1}” with x being a natural number identifying a place. Px is the place that has been created as a counterpart of the state output port of the SEFT (top-event) during analysis.
Time t is the end of mission time or one of the individual sampling points that make up the value table in the curves file. The translation guarantees that a place that corresponds to a state or state type output port has zero or one token at any time. So the measure can alternatively be stated as “expectation value of the number of tokens on place Px at time t”, in TimeNET syntax written as “E{#Px}”, where x is a natural number. Both requests lead to the same results.

If the output of the analysis is an event, the user will either be interested in a rate (usually chosen for events that occur just once in the lifetime of a system, like accidents) or in a probability density function (pdf) over time (for events that occur frequently, such as failures of a repairable system). As TimeNET does not offer any measures based on transitions, a structure called Event Output Pattern has to be created during analysis for the event type output port of the analysis; this is shown in Figure 5.13.15 The purpose is for each occurrence of the transition (which represents the event type output port in the SEFT, i.e., the accident or failure of interest) to have an additional token put on the place on the right side. So the number of tokens there corresponds to the number of occurrences of the event. The request given to TimeNET is the expectation value for the number of tokens at each sampling point of time, in TimeNET syntax “E{#Px}”, where Px is the ID of the right place in the figure. The finally requested measure (expected pdf or rate) is then calculated by discrete-time differentiation (difference quotients); the formulas follow from the basic definitions in Section 2.2. The expected probability density function (in the real world, e.g., the failure frequency of the system) is calculated according to the formula

\[ E\{\mathit{pdf}_i\} = \frac{E\{\#Px\}_i - E\{\#Px\}_{i-1}}{t_i - t_{i-1}} \]

where the measures in the numerator are the aforementioned TimeNET measures at sampling points i and i-1 (with respect to the curves file) and the denominator is the real time difference between the two consecutive sampling points, or in other words, the sampling interval. The expected occurrence rate can be calculated according to the formula

\[ E\{\lambda_i\} = \frac{E\{\mathit{pdf}_i\}}{1 - E\{\#Px\}_{i-1}} \]

This kind of calculation of a rate is only appropriate in cases where a failure occurs only once in the lifetime (non-repairable system), which is often tacitly assumed in reliability engineering. Then E{#Px} is the failed state probability F(t) and the denominator of the fraction corresponds to the reliability R(t).16

Technically, the TimeNET curves file, which is in textual format, is read by an input parser of the ESSaRel tool and the values are copied into a table. Then the necessary calculations are performed. The final result is plotted graphically and the result value is also displayed next to the output port (top-event) of the SEFT that has been analysed.

15 At present, the prototype tool ESSaRel is still under development and the Event Output Pattern has not yet been added during translation. As a workaround, the SEFTs had to be augmented manually by a similar pattern (as an SEFT model) and the calculations described in the following had to be performed by hand. In Section 7, the corresponding test cases are shown.

Figure 5.13: Additional DSPN Pattern for Event Outputs

16 Industrial reliability analysts have sometimes reported that customers and authorities ask for a rate, even where it is not appropriate. In particular, they require one number as a rate instead of a series of rate values over time. One number is only appropriate if the rate is constant over the mission time, which is the case if the failures are exponentially distributed. The latter assumption is not true in many cases, and therefore a single rate value is useless. This problem can, of course, not be solved by SEFTs. In the prototype tool ESSaRel, the user has to specify whether the average rate or the maximum rate over the mission time is requested.
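The post-processing of the sampled values can be sketched in Python as follows. The sketch assumes that the curves file has already been parsed into two equally long lists (sampling times and measure values); the function names are hypothetical, and the trapezoidal rule used for the time average is an assumption, since the exact formulas are those of Section 2.2. The second function implements the two difference-quotient formulas for the expected pdf and the expected rate given above.

```python
def summarise_state_curve(times, probs):
    """Return end-of-mission, time-averaged and maximum probability of a state."""
    # Time average by the trapezoidal rule (an assumption, see lead-in).
    area = sum(
        (probs[i] + probs[i - 1]) / 2 * (times[i] - times[i - 1])
        for i in range(1, len(times))
    )
    return probs[-1], area / (times[-1] - times[0]), max(probs)


def event_measures(times, tokens):
    """Return expected pdf and expected rate per sampling interval.

    `tokens` holds the sampled values of E{#Px} for the counting place of the
    Event Output Pattern. The rate is only meaningful for events that occur
    at most once; it is undefined where E{#Px} has already reached 1.
    """
    pdf, rate = [], []
    for i in range(1, len(times)):
        density = (tokens[i] - tokens[i - 1]) / (times[i] - times[i - 1])
        pdf.append(density)
        rate.append(density / (1.0 - tokens[i - 1]))
    return pdf, rate


# Hypothetical sampling values, not taken from any TimeNET run.
times = [0.0, 1.0, 2.0]
tokens = [0.0, 0.2, 0.36]

end_prob, avg_prob, max_prob = summarise_state_curve(times, tokens)
pdf, rate = event_measures(times, tokens)
```

In this invented data set, the unreliability grows such that the same fraction of the surviving population fails in each interval, so both computed rate values agree (up to rounding), whereas the pdf values decrease. This illustrates the remark in footnote 16 that a single rate value is only meaningful if the rate is constant over the mission time.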

Chapter 6

The Tool Projects UWG3 and ESSaRel

During the doctoral research period, two analysis tools were implemented: UWG3 and ESSaRel. UWG3 is the older and more mature one; it has been used by different industrial companies since 2003. It implements the CFT approach. ESSaRel, later derived from UWG3, is the newer one and is in an experimental stage. It is designed to integrate many different models for safety and reliability analysis. It has been used to carry out the validation studies for this thesis (see Chapter 7) and is currently being developed towards industrial maturity. The tool development began in 2002 at the Department of Software Engineering and Quality Management at Potsdam University (Prof. Dr. Peter Liggesmeyer) / Hasso-Plattner-Institute for Software Systems Engineering. By the end of 2005, it was moved to Fraunhofer IESE Kaiserslautern and the Department for Software Engineering / Dependability at the University of Kaiserslautern. Information on the tool and a download for evaluation purposes are available at www.essarel.de and in [KFG+ 05]. So far, over 70 registered users from industry and universities have downloaded UWG3 or ESSaRel.

6.1

UWG3

The Safety and Reliability Analyser tool UWG3 (for German “Ursache-Wirkungs-Graph” = Cause-Effect-Graph) was initially developed in a co-operation between the Department of Software Engineering and Quality Management at Potsdam University and the industrial companies Siemens and DaimlerChrysler. It started as a Bachelor Project in 2002 and was originally meant as a successor for the proprietary Siemens FTA tool UWG2. Due to the promising results, the initial goals were extended and the development was kept alive after the end of the student project. In the beginning of 2005, the project was ported to Fraunhofer IESE, Kaiserslautern. UWG3 was first released in 2003, a revised version UWG3.1 was published one year later, and since then, ongoing maintenance has been carried out in parallel to the practical application in several Siemens departments. UWG3 is a non-profit development and is continuously supported by Siemens CT PP 2 (corporate technology department). It can be downloaded for free for evaluation and teaching purposes.

UWG3 was written from scratch in the C# programming language (C and C++ for analysis libraries) and is based on Microsoft’s .NET framework. A stringent development and quality assurance process was applied. Unlike many prototype tools developed at universities, UWG3 has been designed for industrial application and therefore offers a GUI and the usual features known from today’s Windows applications. It is the first tool that applies the new CFT approach. For the BDD analysis of the CFTs, the BUDDY library [LN99] is used.

Figure 6.1: UWG3 Screenshot

The screenshot in Figure 6.1 shows the tool at work. Its Windows-based GUI offers different graph windows showing one Component Fault Tree each. On the right-hand side, there is a Component Explorer window that allows navigating through all open files, all components therein, and all graph elements belonging to these components. Below, there is a properties window that allows modifying functional parameters (e.g., probabilities) and style attributes (e.g., line colours and weights). Large amounts of numerical data are more comfortably edited in tables, so UWG3 offers a table view that allows data import from and export to programs such as Excel. On the left side of the screen, there is a repository of available graph elements, such as basic events or gates. The logical symbols displayed here are IEC 61025 style; they can alternatively be shown as international (US) symbols. Note that port symbols are offered in addition to the classical Fault Tree elements. Graph elements are applied by dragging and dropping them from the repository window into the graph window. The same mechanism is used to apply components as subcomponents of higher-level components: a component is dragged from the explorer window into another component window, where it will appear as a black box, showing only the ports. Drawing edges between graph elements generates the semantic connection.

The tool uses an open XML file format to store the component models. This facilitates later integration with other tools. Each file may contain one or more components, and the components belonging to one system may be distributed across different files. This permits models for different components to be edited concurrently. Each graph element has a unique ID including the URI where the file is located. The tool automatically assigns and resolves the IDs, independently of the names that the user gives to events and components (the user can optionally display the internal IDs). This ID system also enables the detection of repeated events and serves as a cross-reference to other development documents, as proposed in [DIN93].

Targeted at industrial applications, the tool offers standard features like vector graphics export to the clipboard, a printing facility and an undo / redo stack. Documentation is available in German; the English translation is under preparation. UWG3 is deployed as a Windows set-up project. The tool has been successfully applied in various industrial projects, e.g., for the quantitative reliability evaluation of a railway drive, for the safety analysis of a braking system for a tramway, for an availability analysis of a turbine protection unit in the power generation domain, for a safety analysis of an electrical automotive steering system, and for the quantitative safety analysis of a fuel cell system.

6.2

ESSaRel

In 2003, the ESSaRel project (Embedded Systems Safety and Reliability Analyser) was started with the purpose of creating a platform for the integration of different modelling techniques. To support this goal, a more comprehensive tool was needed that allows handling different kinds of models (and even later extension to new modelling techniques), analysing models with appropriate algorithms, and translating models into other models. Due to their expressive power, which unites features from different safety and reliability modelling techniques, SEFTs were chosen as the central model within the ESSaRel project. For several reasons, it seemed useful to separate the existing UWG3 tool and the new tool that will be published under the name ESSaRel:

• The UWG3 tool has gained a certain degree of maturity and uses a commonly accepted method (CFTs, which are mathematically compatible with standard FTs). A new tool with a new modelling technique will not run as stably or gain as much acceptance in the beginning.
• The software architecture had to be changed in many details in order to enable the former FTA tool to support a variety of models.
• Some details of standard Fault Trees that do not reflect the strictly typed semantics of SEFTs should initially be kept out of ESSaRel (e.g., the basic events, which theoretically rather correspond to states).
• The UWG3 tool was inspired and commissioned by Siemens AG; the ESSaRel tool is an original development at the chair of Prof. Dr. Peter Liggesmeyer at the University of Potsdam, later at the University of Kaiserslautern and the Fraunhofer IESE (its development, however, was generously supported by Siemens AG as well).

Although the GUIs of both tools are quite similar, their internal structures differ considerably. The architecture of the ESSaRel tool was designed from the beginning for the integration of various graphical models. “Separation of concerns” between the domain aspects of the models and the GUI leads to a manageable structure of the large project and permits adding new models of different types quickly. Every model is compiled into a separate DLL (dynamic-link library). On the source code level, each model consists of

• a list of model elements including their graphical description in terms of standard figures (rectangle, ellipse, line, ...),
• a set of rules on how these elements may be connected,
• a save and load filter for the model-specific XML file format, and
• a set of analysis or translation procedures.

Figure 6.2 exemplifies with the Markov Chain model how specific models inherit from the base classes. The currently supported models are Component Fault Trees, State/Event Fault Trees, Markov Chains, ROOM Structure Diagrams, ROOM Behaviour Diagrams, and DSPNs (the latter are only intended as intermediate models for SEFT analysis). An impression of the tool is given in Figure 6.3. The overall appearance is quite similar to UWG3, but now different models can be displayed (in the example, a Markov Chain to the left and an SEFT component model to the right).
ESSaRel projects, which can spread across multiple files, are organised by components, and each component can contain one or more models. In contrast to UWG3, the modelling elements on the blue pane to the left change according to the model currently being edited. The project explorer to the right and the parameter editing area in the lower left corner are the same as in UWG3. The table area is not only capable of displaying and editing large amounts of parameters, but also shows analysis results, search results and error messages. The error messages contain a list of the affected graph elements (e.g., the nodes and edges forming a forbidden cycle in a Fault Tree). When these are double-clicked, the graphical editor opens the containing model and marks the elements in colour. Thus, problems in complex projects can be easily located.

6.2. ESSAREL


Figure 6.2: Inheritance of a Specific Model from the Built-In Base Classes

Figure 6.3: ESSaRel Screenshot



The technical integration between the different models is achieved by a namespace system that avoids conflicts between similar concepts in different models (e.g., the term ”event” in CFTs has quite a different meaning than ”event” in SEFTs). A ”built-in” namespace contains the abstract top-level classes from which all model-specific classes must inherit. On this level, concepts like graph-element, node, edge, port, component or gate are defined. A set of subclasses that applies to any kind of model is also provided there (e.g., input and output ports that are derived from port). The more specific elements are defined on the specific model level. The ID system that helps to identify model elements across different model types is defined on the top level as well. It provides hierarchical IDs whose parts are separated by dots and contain mnemonics for the kind of element (e.g., C1.M2.S3 can be interpreted as the third state in the second model of the first component). As in UWG3, the IDs are maintained by the tool to enforce consistency. The same namespace concept as in the C# source code is maintained for the XML file format that stores all ESSaRel models. A set of XML schemas (.xsd files) hierarchically defines the format of the actual model file. XML schemas allow namespaces and type inheritance similar to OO programming languages. This schema hierarchy is depicted in Figure 6.4. The top-level schema represents the namespace ”xlns:essarel”. It defines the global types that have to be used for every model schema. On the middle layers, there is a set of schemas named after the models they represent. Each constitutes a separate namespace and defines the XML records for the model-specific elements. The schema essarelfile.xsd finally assembles the specific schemas and defines the overall structure of the actual ESSaRel file (extension .esr).
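The hierarchical ID scheme can be illustrated by a small sketch. The following Python fragment is not taken from ESSaRel (which is written in C#); the mnemonic table contains only the three letters that occur in the example C1.M2.S3, and the function name parse_id is hypothetical.

```python
import re

# Mnemonics taken from the example in the text; the real tool
# presumably knows further element kinds (assumption).
MNEMONICS = {"C": "component", "M": "model", "S": "state"}

_PART = re.compile(r"([A-Z]+)(\d+)")


def parse_id(element_id):
    """Split a dotted ID such as 'C1.M2.S3' into (kind, index) pairs."""
    path = []
    for part in element_id.split("."):
        match = _PART.fullmatch(part)
        if match is None or match.group(1) not in MNEMONICS:
            raise ValueError(f"malformed ID part: {part!r}")
        path.append((MNEMONICS[match.group(1)], int(match.group(2))))
    return path


print(parse_id("C1.M2.S3"))
```

Read from left to right, the resulting list names the path from the component down to the individual element, which is what allows an element to be referenced unambiguously across model types.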
The load / save operations for ESSaRel files start with a generic filter that creates the framework DOM (document object model) structure of the file. For each model to be saved, it passes control to the model-specific filter that must be provided along with the model. Most of the DOM structure handling and OS-level file access is performed by .NET libraries or by code that is automatically generated from the schemas using the tool XML-Spy. However, the adaptation to the internal data structures of ESSaRel and the management of the different files, components, and models belonging to a project is specific to ESSaRel and has been written manually.

For each model, specific analyses or translations can be defined. Analyses perform mathematical operations on a model to obtain a list of qualitative or quantitative results (e.g., a list of Prime Implicants for a Fault Tree, a list of steady-state probabilities for the states of a Markov Chain). Translations transform ESSaRel models into other models (e.g., CFTs or Markov Chains into SEFTs, hierarchical SEFTs into flat SEFTs). As all analyses and translations inherit from an abstract class, they share the same signature and the same data structures for requests, parameters, results, and error messages. A configuration file specifies the names and entry functions for each model so that the corresponding menu items on the GUI can be generated automatically. Analyses and translations can call other analyses or translations as helpers. In this way, the step-wise translation, flattening, and evaluation of SEFTs as described in Chapter 5 can be implemented in a clearly structured way. The ESSaRel tool also offers the possibility to attach import and export filters to foreign file formats. This way, the generated DSPNs can be exported to TimeNET, and CFT models from UWG3 can be imported. The case studies and experiments described in the following chapter have all been carried out using ESSaRel.

Figure 6.4: XML Schema Hierarchy of ESSaRel



Chapter 7 Evaluation

The evaluation of a newly developed technique comprises several steps: first, it must be proven that the technique produces correct and consistent results, and second, it must be shown that the new technique is applicable to practical situations, which can, in particular, comprise compatibility with or similarity to accepted techniques. In order to claim that the new technique contributes to the scientific progress, it has to be shown that the new technique has some significant advantages in comparison to the existing techniques (or at least the potential to develop those in the future), for example, that it solves practical problems that have not been solved before or that it is more efficient, more accurate, or easier to handle. The following sections discuss these issues one after the other and in the end point out where remaining problems are and which ideas exist to overcome them.

7.1 Correctness, Confluence and Consistency

This section deals with the demonstration of correctness, inner consistency (confluence), and consistency with external techniques. The demonstration of correctness in a fully mathematical fashion is only possible if the semantics of the new technique is formally specified. SEFTs have a formal semantics on the DSPN mapping level, i.e., on the level of analysis. For some gates, a mathematical definition can also be stated on the application level (e.g., for state-related gates like AND, OR, NOT, Voter, etc., which obviously correspond to well-known Boolean functions). Moreover, the meaning of SEFTs from the modeller’s point of view has been verbally explained in Section 4.6, and many aspects can be considered implicitly clear due to the intuitive symbolism and the (intended) similarity to other modelling techniques. So correctness can be proven or argued on different levels of abstraction. SEFTs are a flexible technique that allows modelling many situations in several ways, and it is mandatory that each way of expressing the same fact produces the same analysis results. This property is called inner consistency (i.e., consistency between the different models subsumed by SEFTs) or confluence (i.e., different calculations all lead to the same result). Through their similarity to existing techniques, in particular Fault Trees, Markov Chains and Statecharts, SEFTs suggest a similar meaning as these commonly known techniques (which is intended). Consequently, it has to be shown that in those cases where SEFTs express a situation that can be modelled similarly in an existing technique, the analysis results are the same (referred to as external consistency).

7.1.1 Boolean Logic Aspects

The AND, OR, and NOT gates with only state inputs and outputs obviously define Boolean expressions over atomic propositions in the style ”Some component C is in state S at time T”. In the DSPN representations of all gates with state inputs and outputs, a token on the corresponding place means that a state expression is true. So it has to be proven that the Boolean connective that is named by the gate applies to predicates of the type ”Input place Pi is marked”. For instance, for the DSPN fragment describing a State-AND it has to be proven that the output place is marked by a token iff¹ input place 1 and input place 2 and ... and input place n, i.e., all of the input states, are marked at the same time. For the State-OR the analogous proof has to be carried out in the sense that the output place is marked iff one or more of the input places are marked. Both proofs can be carried out for the simplest case of two inputs and then be extended to n inputs. For the State-NOT, it has to be proven that the output place has a token iff the input place has no token. The proofs for AND, OR, and NOT are given in Appendix C. For the Voter gate (k-out-of-n), the XOR and the Equivalent gate, similar proofs can be carried out, but their correctness has already been shown if their equivalence to basic structures of only AND, OR, and NOT has been proven (e.g., XOR(S1,S2) = OR(AND(S1,NOT(S2)),AND(S2,NOT(S1)))). In general, all gates describing Boolean expressions about two or more state conditions can be constructed out of AND, OR, and NOT gates; thus, their correctness is validated if the direct DSPN translation given in the dictionary provably produces the same markings at the output as the alternative DSPN.
¹ ”iff” is a customary abbreviation for ”if and only if” in mathematical literature.

Besides this formal validation, all gates have been practically tested in an experimental comparison of sets of alternative representations of the same logical networks: in the test setting, the outputs of both representations must show equivalent markings for any combination of input valuations. One example of this kind of test will be shown in Figure 7.11 in Section 7.1.4; this example is only one of many, as test cases have been carried out for all of the gates introduced in this thesis. Moreover, having proven the equivalence of the markings of the corresponding DSPNs to the Boolean propositions, which are represented by the SEFT gates, it is no longer necessary to explicitly prove the validity of the laws that apply to Boolean logic (e.g., commutative law for AND and OR, associative law, distributive law, DeMorgan’s laws), because these are known to be valid for Boolean logic. With these laws taken as valid, a good deal of the inner consistency can be accepted as valid (for instance, DeMorgan’s laws, i.e., A∧B ⇔ ¬(¬A ∨ ¬B), describe different ways of expressing the same situation). However, to confirm that these laws also work in practice, many aspects of inner consistency have also been shown experimentally using test cases, as described in Section 7.1.4 and [Gra04]. One of these, the test case in Figure 7.10, confirms that DeMorgan’s laws apply. The tests have been carried out using the prototype tool ESSaRel.

The correctness arguments for gates involving events compare the behaviour of the corresponding DSPN to the verbal description in Section 4.6. In the case of the Event-State-AND with two inputs, the output transition fires iff the event input transition fires and the state input place is marked. The Event-State-AND with n inputs can be represented by the Event-State-AND with two inputs plus a State-AND with n-1 inputs. In the case of the Event-OR, the output event must fire iff one of the input events fires. All gates involving memory or time or the transformation of states into events or vice versa cannot be proven based on Boolean logic; their validation is discussed in the next section.
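The Boolean equivalences named in this section can be checked exhaustively on the level of truth values. The following Python sketch does this for the XOR decomposition and for DeMorgan’s law; it operates on plain Boolean functions only and says nothing about the DSPN translation itself. All function names are chosen for illustration.

```python
from itertools import product


def xor_direct(s1, s2):
    return s1 != s2


def xor_from_basic_gates(s1, s2):
    # OR(AND(S1, NOT(S2)), AND(S2, NOT(S1)))
    return (s1 and not s2) or (s2 and not s1)


def equivalent(f, g, arity):
    """True if f and g agree on every combination of Boolean inputs."""
    return all(f(*v) == g(*v) for v in product([False, True], repeat=arity))


xor_ok = equivalent(xor_direct, xor_from_basic_gates, 2)
de_morgan_ok = equivalent(lambda a, b: a and b,
                          lambda a, b: not ((not a) or (not b)), 2)
print(xor_ok, de_morgan_ok)
```

The same pattern (two alternative representations compared over all input valuations) is what the experimental comparison of alternative gate structures applies to the DSPN markings.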

7.1.2 Gates Involving Memory

In this section, validity arguments for some basic gates involving state memory are discussed. Unlike the Boolean gates, these gates have no evident counterpart to which the SEFT analysis results can be compared for proof. However, based on the verbal description of the gates, corresponding state-based models for some representative situations can be constructed and the consistency of the DSPN that has been generated from the SEFT to the state-based model can be shown. The gates discussed here are History-AND and Priority-AND, for simplicity restricted to two inputs. The semantics of History-AND, as expected by the user according to the verbal description in Section 4.6, is that
1. if, after the initial state, input i has been triggered once or more, then the output is triggered at the same time when input j is triggered for the first time, and
2. the gate is in its initial state afterwards, and
3. otherwise the output is not triggered.
The letters i and j can represent 1 and 2 in any order, so a case distinction is necessary (which is trivial as the corresponding DSPN structure is visibly symmetric). The case that both input events occur at the same time is not considered in the proof, as the underlying DSPN semantics is not clear on this point.² The semantics that is expressed by this verbal description is depicted by the state-machine in Figure 7.1. The initial state is labelled ”Ready”; the states after input 1 or input 2 has been triggered are labelled ”In1 Triggered” and ”In2 Triggered”, respectively.

² This is one of the semantic weaknesses that will be discussed below; however, the problem has not been found to be relevant in practice.



[State-machine diagram. States: Ready, In1 Triggered, In2 Triggered. Transition labels: In1 / -, In2 / -, In1 / Out, In2 / Out.]

Figure 7.1: State-Machine Describing the Semantics of History-AND

For Priority-AND, the inputs are obviously not commutative. It has to be shown that
1. if, after the initial state, input 1 has been triggered once or more, then the output is triggered at the same time when input 2 is triggered for the first time, and
2. the gate is in its initial state afterwards, and
3. otherwise the output is not triggered.
The semantics is depicted by the state-machine in Figure 7.2.

[State-machine diagram. States: Ready, In1 Triggered. Transition labels: In1 / -, In2 / -, In2 / Out.]

Figure 7.2: State-Machine Describing the Semantics of Priority-AND

As both arguments are strongly related (the difference is that History-AND considers two different firing sequences where Priority-AND considers just one), large parts of the argument can be reused. Both arguments can be found in Appendix C.
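The two reference state-machines can also be written down as transition tables. The following Python sketch is a reconstruction from the verbal descriptions of History-AND and Priority-AND given above, not the formal argument of Appendix C; the table layout and the function name run are illustrative. As in the text, simultaneous input events are not modelled.

```python
# Transition tables: (state, input) -> (next_state, output_fired)
HISTORY_AND = {
    ("Ready", "In1"): ("In1 Triggered", False),
    ("Ready", "In2"): ("In2 Triggered", False),
    ("In1 Triggered", "In1"): ("In1 Triggered", False),
    ("In1 Triggered", "In2"): ("Ready", True),
    ("In2 Triggered", "In2"): ("In2 Triggered", False),
    ("In2 Triggered", "In1"): ("Ready", True),
}

PRIORITY_AND = {
    ("Ready", "In1"): ("In1 Triggered", False),
    ("Ready", "In2"): ("Ready", False),          # In2 before In1 is ignored
    ("In1 Triggered", "In1"): ("In1 Triggered", False),
    ("In1 Triggered", "In2"): ("Ready", True),
}


def run(table, events):
    """Return the positions in `events` at which the gate output fires."""
    state, fired = "Ready", []
    for position, event in enumerate(events):
        state, out = table[(state, event)]
        if out:
            fired.append(position)
    return fired


sequence = ["In2", "In2", "In1", "In1"]
print(run(HISTORY_AND, sequence))   # input order does not matter here
print(run(PRIORITY_AND, sequence))  # In2 arrived before In1
```

The only difference between the two tables is the handling of In2 in the state Ready, which reflects the remark that History-AND considers two firing sequences where Priority-AND considers just one.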



For the variants with reset input, which are more complicated, manual validation is possible by ”playing the token game” on the corresponding Petri Net. Also, for the extensions of these gates with a delay parameter (and also for the delay gate itself), it is difficult to give a proof, as these involve propositions about real-time. Again, the corresponding DSPNs are simple enough to validate them manually by playing the token game. The same applies to adapter gates (e.g., Upon, Entering, Flip-Flop). According to good practice in software testing, it has to be assured that all equivalence classes of input sequences are tested. All of these gates have additionally been validated by test cases using the ESSaRel tool; some examples of them are given in the next section. In some cases it is also possible to find (by an informal argument based on the verbal specification) a corresponding state-machine or Markov Chain to which the consistency can be shown as a demonstration of correctness.

7.1.3 Test Cases for Gates

Apart from the proofs and arguments given so far, the SEFT gates have been tested qualitatively and quantitatively by manually designed test cases. In simple cases, it is possible to generate input sequences that exhaustively cover all possible input combinations; in other cases, representative test cases have been selected. Qualitative test cases check if the combinations of input state conditions (true or false) or the sequences of input events lead to the intended output. Quantitative test cases suppose selected probability parameters for the generator events and check whether the expected reference failure probabilities are obtained using ESSaRel. Reference results have been calculated manually or using other tools or other kinds of analysis (e.g., Markov Chain analysis, traditional FTA). It is not possible to show all of the test cases in this thesis; however, all of the test cases that were executed passed.

All gate tests follow similar principles that are exemplified here by the State-AND gate with two inputs (whose correctness has already been proven above). A qualitative test checks that the sequence of output values over time follows the Boolean conjunction of the sequences of input values for each point of time, where the input values are only 0 (false) and 1 (true). To generate an input test sequence that covers all possible input combinations, deterministic events are a suitable means. The setting shown in Figure 7.3 starts at t=0 with both inputs being 0. After one second, the first input changes to 1, after two seconds the second one changes to 1, after 3 seconds the first input changes back to 0 and after 4 seconds the second one goes to 0. The obvious reference result is an output sequence that has 1 values for the time interval [2s, 3s] and 0 otherwise.³ The input sequences are generated by subcomponents that contain state-machines with probabilistically or deterministically timed events, so-called generators.
In the following examples, all figures are shown as glass-box view for simplicity, i.e., the contents of subcomponents, namely of the generators, are drawn inside of the box that would normally represent the subcomponent and hide its content.

³ The points of state change are not considered in the tests.



Figure 7.3: Qualitative Test Example for State-AND: Test Setting
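The reference output of this qualitative test can be reproduced with a few lines of Python. The sketch below models only the two deterministic input signals and their conjunction; it is not a model of the DSPN, and the helper name step_signal is illustrative. Sampling at interval midpoints mirrors the fact that the switching instants themselves are not considered.

```python
def step_signal(rise, fall):
    """State signal that is 1 on the open interval (rise, fall), else 0."""
    return lambda t: 1 if rise < t < fall else 0


in1 = step_signal(1.0, 3.0)   # first input: 1 between 1 s and 3 s
in2 = step_signal(2.0, 4.0)   # second input: 1 between 2 s and 4 s


def state_and(t):
    return in1(t) & in2(t)


# Sample at interval midpoints so that the switching instants are avoided.
samples = [0.5, 1.5, 2.5, 3.5, 4.5]
print([(t, state_and(t)) for t in samples])
```

The five samples cover all four input combinations (0/0 occurs twice), which is why this short sequence suffices as an exhaustive qualitative test for a memory-less two-input gate.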

The output sequence can be exported from the DSPN tool TimeNET as a formatted list of numbers and can also be plotted graphically. The plots obtained by a continuous time transient simulation running from 0s to 5s can be seen in Figure 7.4. The result can be judged to be correct by visual inspection. Additionally, for selected test cases, the value lists have also been checked to see that the values are actually 0.0 and 1.0.⁴ Of course, this test does not prove all properties of the AND gate. To show that the gate is actually memory-less, additional test cases would be needed that permute the inputs in a different order or that use time intervals of different length. However, it is reasonable to accept that the mentioned test case shows that the AND gate actually implements the Boolean AND conjunction of state terms.

Figure 7.5 shows an example for a quantitative test setting. This time, two-state Markov Chains with exponential events are used as test sequence generators. The rate of both events is set to 1 per second and the mission time of both components to 1 second. The expected output plot looks similar to the reference curve generated with CFT analysis, and the end-of-mission-time failure probability shows the expected value of 0.135335. Figure 7.6 shows the plots for both the input and output sequences, obtained by continuous time transient analysis. A similar result could be obtained by simulation (the obtained end probability was 0.390476 by simulation, and 0.3995760102 by analysis, which is exactly the reference result). The simulation error corresponds to a relative deviation of -2.2% with parameter setting 10% relative tolerance for 95% of the results. It can be seen that the accuracy of the analysis is usually better than the accuracy of the simulation, which will be discussed later on.

⁴ For all qualitative and quantitative tests, values are considered to be correct if they do not differ from the reference by more than 10⁻⁴. This applies to analysis and to simulation results. The exception is accuracy tests, where different thresholds apply for analysis and simulation.
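A paper-and-pencil style reference for the quantitative setting can be sketched as follows. The Python fragment assumes that each generator is a two-state chain whose failed-state probability at time t is 1 - exp(-rate * t) and that the two generators are independent; under these assumptions the State-AND output probability is the product of the two. The variable names are illustrative.

```python
import math

rate = 1.0          # failure rate of each generator, per second
mission_time = 1.0  # seconds

# Failed-state probability of one two-state generator at mission end.
p_single = 1.0 - math.exp(-rate * mission_time)

# State-AND over two independent generators.
p_and = p_single ** 2

simulated = 0.390476  # simulation result quoted in the text
relative_deviation = (simulated - p_and) / p_and

print(round(p_and, 6), relative_deviation)
```

Under the stated assumptions, the product agrees with the analysis value quoted above to six decimal places, and the simulation value lies a little over two percent below it.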


Figure 7.4: Qualitative Test Example for State-AND: Test Result Plot

Figure 7.5: Quantitative Test Example for State-AND: Test Setting



Determined by equivalence class partitioning, these test cases have been repeated with other values, especially very small values (order of magnitude 10⁻⁹) and values that are close to 1, such as 0.97 and 0.99, because these extreme cases could reveal rounding errors or stability problems that make the result unacceptable. Some of the tests have been executed for gates with more inputs as well (e.g., 3 and 5 inputs).

Figure 7.6: Quantitative Test Example for State-AND: Test Result Plot

Similar tests as for the State-AND gate have been executed for all of the gates, including those that have already been validated by a mathematical proof or argument, but in particular for those that could not be validated otherwise. For example, the Delay gates (here: the Deterministic Delay gate) can be checked by a test setting as shown in Figure 7.7. A deterministic event that occurs at t = 1s triggers a Deterministic Delay gate with a delay time set to 2s. The reference output is an event occurrence at t = 3s. As singleton events are difficult to monitor in a simulation,⁵ the output of the gate triggers a two-state state-machine subcomponent with a triggered event. The probability of state S2 of this subcomponent is recorded. As expected, it changes from 0.0 to 1.0 at t = 3s.

After the fault tree gates, the state-machine elements of SEFTs were also evaluated by testing. Figure 7.8 shows a setting where a triggered event inside a state-machine subcomponent must follow the triggering exponential event. The test case showed that both SC1.S2 (state 2 of subcomponent 1) and SC2.S2 show the same probability plot, which is an exponential function, as expected. The test cases were executed using both transient continuous time analysis and simulation.

The correctness test cases especially consider situations where traditional combinatorial FTA is known to fail, for instance the case in Figure 7.9. In this case, an AND gate is applied to two different states of the same subcomponent. These are not independent (as would be required for combinatorial analysis) but mutually exclusive, because the subcomponent can only be in one state at a time. Traditional FTA, which cannot consider the state dependencies, would produce wrong results by multiplying the state probabilities under the false assumption of independence. Both the analysis and simulation using ESSaRel and TimeNET show the correct result 0.0 (as it is impossible for both state expressions to be true at the same time). Test cases like this show situations where SEFTs have advantages compared to traditional FTs.

⁵ Additionally, the structure discussed in Section 5.8 that would be necessary to analyse event outputs had not been implemented at the time when the tests were executed.

Figure 7.7: Test Setting for Delay Gate

Figure 7.8: Test Setting for Triggered State-Machine

Figure 7.9: Special Case: AND Applied to Dependent States
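The difference between the two evaluations can be made concrete with a small numerical sketch. The Python fragment below uses an invented three-state subcomponent with illustrative state probabilities (the actual model of Figure 7.9 is not reproduced here); it contrasts the naive product with an evaluation over the states the component can actually be in.

```python
# State probabilities of one subcomponent at some point in time.
# The three states and their values are illustrative assumptions.
state_probability = {"S1": 0.5, "S2": 0.3, "S3": 0.2}


def in_s2(state):
    return state == "S2"


def in_s3(state):
    return state == "S3"


# Naive combinatorial evaluation: treats "in S2" and "in S3" as independent.
naive_and = state_probability["S2"] * state_probability["S3"]

# State-aware evaluation: sum the probability of every state of the
# component in which both conditions hold at once (there is none).
exact_and = sum(p for state, p in state_probability.items()
                if in_s2(state) and in_s3(state))

print(naive_and, exact_and)
```

The naive product is positive although the two conditions can never hold together, whereas the evaluation over the component’s state space yields zero.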

7.1.4 Test Cases for Inner Consistency

Closely related to the tests for correctness are the tests for inner consistency.⁶ Inner consistency (or confluence) means that two different ways of modelling the same situation by SEFTs do not lead to contradictory results. So these test cases were designed to show that equivalences that can be expected to hold from theory actually hold in practice. Such tests and consistency considerations were one main subject of the Master thesis [Gra04]. Examples of laws to be validated are the commutative law for all applicable gates (see gate descriptions), the extension of two-input gates to n-input gates, or the associative laws applied to structures of several AND and OR gates (e.g., AND(S1, AND(S2, S3)) = AND(AND(S1, S2), S3)). Further examples are the validations of DeMorgan’s Laws (see an example case in Figure 7.10) or the equivalence of a Voter gate to the equivalent structure of AND and OR gates (see Figure 7.11, showing at the same time a convenient implementation of test cases as subcomponents that can easily be put into different test environments). All consistency test cases consist of two partial models that must provide the same result for any combination of input values. In addition to the test execution, confidence in the equivalence can be increased by performing the translation from SEFT to DSPN and the subsequent translation from DSPN to a Markov Chain by hand and comparing the resulting structures. These transformations were done for some of the test cases and the resulting Markov Chains were actually found to be isomorphic to each other (for more details see [Gra04]). After these theoretical considerations, a set of quantitative test runs was executed on the same test cases. The results were the same for both partial models in all test cases. Several sets of sample values were used as input probabilities,⁷ e.g., very small values (order of magnitude of 10⁻⁹), values in the range of 0.2 to 0.5, and values that come very close to 1.0. The tests were partly executed for the Master thesis [Gra04] using manual translation to DSPNs and analysis with TimeNET; later, when the ESSaRel tool was finished, the test cases were repeated and extended using the tool, with the results being the same. In a similar way, other theoretically proven assumptions have been validated by theoretical considerations and by tests.

⁶ In practice, the test cases for correctness, inner consistency, and consistency with existing methods cannot be clearly separated. For instance, after testing an AND gate with input probabilities 0.2 and 0.3 with an expected result of 0.06, the same test is immediately repeated with permuted probabilities, proving at the same time the correctness of the calculation and the validity of the commutative law.

⁷ Presently, only the rates of the events in the generator state-machines can be selected and the state probabilities evolve over time as a result; speaking of probabilities here refers to the end-of-mission-time probabilities.

Figure 7.10: A Test Case for Inner Consistency Demonstration: DeMorgan’s Law

Figure 7.11: A Test Case for Inner Consistency Demonstration: Voter Gate
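The Voter equivalence can be checked on the Boolean level in the same two-partial-model style. The Python sketch below compares a k-out-of-n gate with an OR over the ANDs of all k-element input subsets; this particular expansion is an assumption made for illustration, and the structure actually drawn in Figure 7.11 may be arranged differently.

```python
from itertools import combinations, product


def voter(k, inputs):
    """k-out-of-n gate: true if at least k inputs are true."""
    return sum(inputs) >= k


def voter_as_and_or(k, inputs):
    """OR over all k-element subsets of the AND of that subset."""
    return any(all(subset) for subset in combinations(inputs, k))


def same_for_all_inputs(k, n):
    """Compare both partial models for every combination of n inputs."""
    return all(voter(k, v) == voter_as_and_or(k, v)
               for v in product([False, True], repeat=n))


print(same_for_all_inputs(2, 3))
```

Exhaustive enumeration is feasible here because the number of input combinations is only 2^n; for the quantitative tests, representative probability values take the place of this enumeration.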



Apart from equivalence, which can be expected due to the logical definitions of the state gates, other inner consistency issues can be demonstrated that involve non-Boolean gates or state-machine patterns. These tests are necessarily based on the verbal description of SEFT model elements or on the intuitive analogy of SEFTs to traditional Fault Trees, Markov Chains, and state-machines, as a fully formal specification does not exist. One example is a History-AND gate in combination with an Upon gate (see left part of Figure 7.12) that intuitively describes the same situation as a triggered state-machine in the right part of the same figure (”The state upon both generator failure events having occurred at least once, in any order” is the same as ”The failed state of a state-machine that changes from good to failed at the moment when any of the generator components failed, provided that the other has failed before.”) So it should be expected that both tests lead to the same quantitative results when run with a set of representative values; this has actually been the case. This test case shows, in particular, the inner consistency between FT gates with memory and state-machine structures that can consequently be used to define the semantics of those gates. Furthermore, consistency should be expected with the test setting from Figure 7.10 if the same generator event rates are applied: The intuitive semantics of the example specifies ”the state when the first generator subcomponent is in failed state and the second subcomponent is in failed state at the same time”, which is the same as ”the state upon both generator failure events having occurred at least once, in any order”. Note that in one case the failed states of the generator subcomponents are exported by state output ports and in the other case, the failure events are exported by event output ports.
Again, consistency could be shown when the same set of rates was chosen for the events, and the underlying Markov Chains were isomorphic, too. A fourth version of an equivalent system will be presented in Figure 7.24, which shows a Markov Chain that has the same state space as the expanded state space of the former examples and consequently produces the same results with the same parameters. This test case belongs to the family of tests for consistency with existing techniques (in this case, Markov Chains) and is therefore described in Section 7.1.9. A broad set of further test cases for adapter gates, gates with memory (especially delay), and triggered state-machines was carried out.

7.1.5 Test Cases for the Component Concept

The Component Concept is a major feature that has been newly introduced with CFTs and SEFTs, so special attention must be paid to its validation. Therefore, a set of test cases with different numbers of involved component models and different depth of inheritance was carried out. Some examples are given below. The most important correctness criterion is the consistency of the hierarchical SEFT with its flat counterpart, both in terms of the flattened DSPN that is produced and in terms of numerical results for sample input values.



Figure 7.12: A Test Case for Inner Consistency: Upon-Gate and Triggered State-Machine

Although some small examples had been solved manually before, these tests cannot be carried out efficiently without the tools UWG3 or ESSaRel.⁸ Clearly, these test cases were at the same time test cases for the ESSaRel tool and for the algorithm, and thus cover error situations (e.g., cycles) and the integration of different kinds of models (e.g., a Markov Chain into an SEFT) as well. The tests that were only executed to demonstrate the correctness of the tool implementation are not described in this thesis.

An aspect of utmost importance in the context of the component concept is the distinction between equal and same. States and events that are referenced by several paths must be recognised as being the same; if, in contrast, the same component type is used several times as a subcomponent, all internal states and events must be independent from each other. A sample test case for this aspect is shown in Figure 7.13, which shows a glassbox view.⁹ In this example, the failed states of two components that are independent but of the same type (i.e., same failure distribution with the same parameters) are connected to an AND gate. The resulting failure probability is the product of the individual failure probabilities due to the independence. In the right part of the figure, the AND gate is connected twice to the failure output of the same component. This time, the output probability must be the same as for the individual component, because there actually is just that one component and x ∧ x is still x (law of idempotence). The expected results could be obtained, proving that SEFTs (and similarly, CFTs) deal correctly with interdependencies and repeated events, in contrast to some standard FTA implementations.

⁸ Examples with two or three different components and about five graph elements per model require about ten sheets of paper if the algorithm is applied step by step using paper and pencil.

⁹ This means that the model referenced by a subcomponent is drawn inside the box, which is normally not the case.

Figure 7.13: Test Case ”Equal vs. Same”, Glassbox View

An example test case for consistency of the component framework in general is given in Figure 7.14. In this example, the same scenario is depicted as a hierarchical structure and as a flattened structure where only the generating state-machines are encapsulated in separate subcomponents. Figure 7.14 again uses the glassbox view for the sake of simplicity. The expected result is that both cases evaluate to isomorphic DSPNs and that the quantitative results are the same when analysing or simulating the example. Special attention should be paid to the fact that the state-machine subcomponent in the middle influences the component with the AND gate on two different paths. The state space must be constructed in a way that considers the information from both paths to be identical; this is again a question of distinguishing equal and same. This result has actually been observed. Other test cases were carried out in this context, also including partial dependencies, i.e., subtrees that have one influence in common, but also independent influences. Many test cases for the component concept had already been developed for the CFT concept and the UWG3 tool; they could be reused for SEFTs and the ESSaRel tool.
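The distinction between equal and same can be illustrated numerically. The following Python sketch evaluates a Boolean expression by enumerating the failed/working states of component instances; it is a brute-force illustration with invented instance names (A1, A2) and an illustrative failure probability, not the state-space construction used by ESSaRel.

```python
from itertools import product


def probability(expression, failure_probability):
    """Probability that `expression` holds, by enumerating component states.

    `failure_probability` maps each component *instance* to its failure
    probability; instances are assumed to fail independently.
    """
    names = list(failure_probability)
    total = 0.0
    for failed in product([False, True], repeat=len(names)):
        world = dict(zip(names, failed))
        weight = 1.0
        for name in names:
            p = failure_probability[name]
            weight *= p if world[name] else 1.0 - p
        if expression(world):
            total += weight
    return total


p = 0.2
# Two instances of the same component type: equal, but not the same.
equal = probability(lambda w: w["A1"] and w["A2"], {"A1": p, "A2": p})
# One instance referenced twice: x AND x is x.
same = probability(lambda w: w["A1"] and w["A1"], {"A1": p})
print(equal, same)
```

Because the enumeration runs over instances rather than over references, a component that is referenced on two paths contributes one variable only, which is the property the test case in Figure 7.13 checks.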

7.1.6 Consistency with Traditional Fault Trees

In addition to the test cases for correctness in Section 7.1, test cases were carried out in order to check if the numerical analysis via DSPNs produces the same numerical results as traditional FTA. From a theoretical point of view, this should be expected, as it has been proven that the state gates of SEFTs obey the standard laws of Boolean algebra. However, the tests should confirm that this equivalence also holds in practice. The reference analyses were evaluated both using the bottom-up calculation approach from Section 2.4.3 as a paper-and-pencil method and using existing Fault Tree tools. The commercial tool FaultTree+ and the tool UWG3, which was developed for this thesis, were used for these back-to-back tests. UWG3 uses the BDD-based approach for quantitative analysis. These test cases only affect the standard Boolean gates State-AND, State-OR, NOT, XOR, and Voter. The simplest test cases just focus on a single gate and calculate one numerical value for a given point of time (end-of-mission-time). As input value generators, simple two-state Markov Chains in SEFT notation were used as in the cases before (see Figure 7.15). The rates were adjusted so that the failed-state probability at the end of the mission time (different mission times were tried) matched predefined values.


Figure 7.14: Test Case ”Hierarchical Structure vs. Flat Counterpart”, Glassbox View



Then, the correctness of the result obtained via DSPNs was checked. For example, the probabilities 0.2, 0.3, 0.5, 0.7 and 2E-09, 3E-09, 5E-09, 7E-09 were used in different combinations. This covers rather large probabilities (in terms of reliability analysis) and very low probabilities, to check whether rounding errors or saturation effects cause problems when coming close to 0 or 1. Prime numbers as factors were chosen in order to avoid values that seem correct by coincidence. The number of sampling points for the numerical simulation was normally set to 1000; only special accuracy tests were carried out with different numbers of sampling points. In these test cases, results are considered to be correct if they match the reference down to the 5th significant decimal place. All tests that compare to paper-and-pencil results were passed. The comparison to the commercial tools also succeeded, although some additional parameters had to be set because of features and assumptions built into these tools. Other test cases involve more than two generator state-machines and n-input gates, or more than one gate of the same or of different kinds. To reduce the testing effort, equivalence classes were formed and only one example of each was tested. All of the tests were passed.
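The paper-and-pencil reference for the single-gate tests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the inputs are assumed to be independent, the Voter gate is taken to be a 2-out-of-3 voter, and "5th significant decimal place" is read as a relative tolerance of 1E-05. None of this is taken from the UWG3 or ESSaRel implementations.

```python
import math

def rate_for_target(p_target, mission_time):
    """Rate of a two-state generator so that P(failed at mission_time) = p_target."""
    # 1 - exp(-rate * T) = p  =>  rate = -ln(1 - p) / T
    # log1p keeps full precision for very small p such as 2E-09.
    return -math.log1p(-p_target) / mission_time

def failed_probability(rate, t):
    # -expm1(-x) evaluates 1 - exp(-x) without cancellation near 0.
    return -math.expm1(-rate * t)

# Reference formulas for independent inputs (assumption).
def gate_and(ps):
    return math.prod(ps)

def gate_or(ps):
    return 1.0 - math.prod(1.0 - p for p in ps)

def gate_not(p):
    return 1.0 - p

def gate_xor(p1, p2):
    return p1 * (1.0 - p2) + p2 * (1.0 - p1)

def gate_voter_2oo3(p1, p2, p3):
    # At least two of three inputs failed (illustrative choice of voter).
    return p1 * p2 + p1 * p3 + p2 * p3 - 2.0 * p1 * p2 * p3

def matches(value, reference, digits=5):
    # Hypothetical acceptance criterion: relative agreement to `digits` digits.
    return math.isclose(value, reference, rel_tol=10.0 ** (-digits))

if __name__ == "__main__":
    mission_time = 10.0  # arbitrary time unit
    rates = [rate_for_target(p, mission_time) for p in (0.2, 0.3)]
    inputs = [failed_probability(r, mission_time) for r in rates]
    print(gate_and(inputs), gate_or(inputs), gate_xor(*inputs))
```

A value obtained from the DSPN analysis would then be passed to `matches` together with the corresponding reference value.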

Figure 7.15: Test Setting for Single Gate Tests (here the State-AND Gate) as FT and as SEFT in Glassbox View

In addition to the calculation of the probability at one point of time, time functions (usually combinations of exponential functions) were plotted and visually compared. The ESSaRel tool offers the export of discrete time functions to tables and so does TimeNET. Using Microsoft EXCEL, the tables can be read and compared or graphically plotted. The paper-and-pencil method that serves as a reference was also replaced by formulas in EXCEL that apply to rows of discrete time values. These test cases also showed consistency between the Fault Tree aspects of SEFTs and traditionally calculated Fault Trees.

7.1.7 Consistency of Priority-AND with Traditional Approach

Traditional FTA has allowed for a Priority-AND gate for many years - without using state-based analysis. Tools usually follow the approach proposed by [FAR76], as described in Section 2.4.4, but as the subsequent analysis shows, there are even some commercially successful FTA tools, like Isograph’s FaultTree+, that do not consider Priority-AND in any special way, but use the standard AND formula instead. The applied formula in Fussell’s approach is

\[ F_{out}(t) = \int_0^{t} f_n(t_n) \int_0^{t_n} f_{n-1}(t_{n-1}) \cdots \int_0^{t_2} f_1(t_1) \, dt_1 \cdots dt_{n-1} \, dt_n \]


which is based on the idea of conditional event probabilities. It has been discussed that this approach is only valid under the assumption that all basic events are independent and occur only once in the lifetime of the system; usually they describe long-term phenomena like wear and tear. Of course, it is desirable that in applicable scenarios, the SEFT Priority-AND gate leads to the same result as the traditional method. Therefore, tests were executed that compare a Priority-AND setting in SEFT to the reference result calculated according to the given formula. Tests were carried out with exponentially distributed events and deterministic delay events and also with a mixture of both. Only the case with two inputs was examined. As in former test cases, a triggered state-machine (or alternatively, an Upon gate) integrates the probability of the failed states, in order to avoid dealing with punctual probability densities. Figure 7.16 shows an example with one exponential event at input 1 and one deterministic event at input 2. The case with two deterministic events is quite easy to verify manually: if the delay time of the first input is lower than the delay time of the second input, then the output probability jumps to 1.0 at the time when the event at input 2 occurs. If the delay at the first input is higher, then the output probability will be 0.0 forever. The exponential and mixed cases (there are two of them: one with the exponential event to the left and the deterministic one to the right, and one vice versa) were proven and validated by test cases.

Figure 7.16: Priority AND Test Case with One Exponential and One Deterministic Event

A rigorous argument can be constructed by performing the translation to DSPN manually (see left part of Figure 7.17), then extracting the underlying Markov Chain by removing vanishing states (right part of the figure), and then forming the generator matrix of this Markov Chain. The system of linear differential equations represented by the generator matrix is then solved using the symbolic mathematics tool MuPAD [GP03] and compared to the solution of Fussell’s formula. The symbolic results were found to be the same.
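The same comparison can be repeated numerically. The following Python sketch is not the Markov Chain from Figure 7.17; it assumes a four-state chain for a two-input Priority-AND with independent exponential events (rates `lam1` and `lam2` are free parameters), integrates the differential equations given by the generator matrix with a classical Runge-Kutta scheme, and compares the result with the two-input case of the nested integral evaluated in closed form. All function names are hypothetical.

```python
import math

def pand_generator(lam1, lam2):
    # Assumed state space (not taken from the thesis figure):
    # 0 = no event yet, 1 = E1 occurred, waiting for E2,
    # 2 = E2 occurred first (gate can no longer fire), 3 = gate output occurred
    return [
        [-(lam1 + lam2), lam1, lam2, 0.0],
        [0.0, -lam2, 0.0, lam2],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ]

def transient(Q, p0, t_end, steps=2000):
    """Integrate dp/dt = p * Q from 0 to t_end with classical RK4."""
    n = len(p0)

    def deriv(p):
        return [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]

    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([p[i] + 0.5 * h * k1[i] for i in range(n)])
        k3 = deriv([p[i] + 0.5 * h * k2[i] for i in range(n)])
        k4 = deriv([p[i] + h * k3[i] for i in range(n)])
        p = [p[i] + h / 6.0 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
             for i in range(n)]
    return p

def pand_closed_form(lam1, lam2, t):
    # Integral over s in [0, t] of lam2*exp(-lam2*s) * (1 - exp(-lam1*s)) ds
    return (-math.expm1(-lam2 * t)
            + lam2 / (lam1 + lam2) * math.expm1(-(lam1 + lam2) * t))

if __name__ == "__main__":
    lam1, lam2, t = 0.2, 0.5, 3.0
    p = transient(pand_generator(lam1, lam2), [1.0, 0.0, 0.0, 0.0], t)
    print(p[3], pand_closed_form(lam1, lam2, t))
```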

Figure 7.17: A Proof Case for Priority AND: Both Events Are Exponential

To show that the Priority-AND works in practice as well, test cases were carried out. Figure 7.18 shows a TimeNET screenshot displaying the results from a simulation with ESSaRel and TimeNET for the case with the probabilistic event at input 1 and the deterministic event at input 2. The time axis is scaled in sampling points; in this setting, the end value of 600 corresponds to 10 minutes in real time.10 The mission time was set to 10 minutes, the rate of the exponential event was 1 per 5 min, and the delay of the deterministic event was 5 min. The expected behaviour is that in those cases where the probabilistic event occurs before the deterministic event, the output probability jumps to 1 at the instant of the deterministic event; in the other cases, it stays at 0 forever. With an exponential distribution with a mean delay of 5 min, the probabilistic event occurs before the deterministic event in about 63% of the cases, so the overall probability that the output event has occurred is 0 for t ≤ 5 min and 0.63 for t > 5 min.11 The expected result can be seen qualitatively in the figure; a quantitative comparison of the end values has been carried out as well. The analysis produced the expected end value of 0.6321; the simulation, with parameters set as above, resulted in 0.5913, which corresponds to a relative deviation of -6.5%.

10 Unlike TimeNET, the tool ESSaRel displays all values with their actual time unit.

11 The value for t = 5 min cannot be examined from the simulation results, and from theory it is not clear what happens if both events occur at the same time; this and other semantical weaknesses will be discussed later.
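The expected end value and the reported deviation of this mixed test case can be re-derived with a short calculation. The Python sketch below uses only the parameters given in the text (rate 1 per 5 min, delay 5 min, simulated value 0.5913); the function name is hypothetical, and returning 0 at exactly t = 5 min is an arbitrary choice, since that instant is left undefined.

```python
import math

def pand_exp_then_det(rate, delay, t):
    """Two-input Priority-AND: exponential event at input 1, fixed delay at input 2."""
    # The output can only occur at the instant of the deterministic event,
    # and only if the exponential event has occurred before that instant.
    if t <= delay:
        return 0.0
    return -math.expm1(-rate * delay)  # 1 - exp(-rate * delay)

rate_per_min = 1.0 / 5.0
delay_min = 5.0
mission_time_min = 10.0

expected = pand_exp_then_det(rate_per_min, delay_min, mission_time_min)
simulated = 0.5913  # end value reported for the simulation
relative_deviation = (simulated - expected) / expected

print(round(expected, 4), round(100.0 * relative_deviation, 1))  # 0.6321 -6.5
```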



Figure 7.18: Simulation Results of the Priority AND Test Case

Another test case has been conducted with both events being exponentially distributed with a rate of 1 per second and a mission time of 1 second.12 The expected result, according to Fussell’s formula, is 0.199788; ESSaRel produced the same result using transient continuous-time analysis on TimeNET. The FTA tool Galileo [SCD99] (which applies Dugan’s state-based DFT approach and is therefore known for correct handling of Priority-AND by underlying Markov analysis) produced the same result, and so did the commercial tool Relex [Rel]. The commercial tool FaultTree+, in contrast, produced a result of 0.399, which would only be the correct value in the case of a standard AND gate, if applied to the failed states that belong to the two basic failure events.

The consistency with Fussell’s approach has been shown by these tests; nevertheless, Fussell’s approach only works under the side condition of independent basic events. In the case of dependent events, only a state-based approach produces correct results for Priority-AND. Thus, it must additionally be demonstrated that the SEFT solution produces correct results, even in cases where the traditional approach does not. Therefore, the test case in Figure 7.19 was designed. The state-machine acts as an event generator and creates two events that appear in a guaranteed order (E1 before E2). They are connected to a Priority-AND gate by crossed edges, meaning that the gate would only trigger if E2 occurred first, which is obviously never true. As in previous test cases, a state-machine is connected to the gate output so that it "records" an occurring output event. Both simulation and continuous-time transient analysis lead to the correct output probability of 0.0. This result cannot be obtained by the integral formula, so the SEFT approach works in cases where the traditional approach does not work. This is due to its state-based analysis. In the next section it will be shown that the Priority-AND gate is also compatible with the Priority-AND from Dynamic Fault Trees, which are also evaluated by a state-based algorithm.
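The two reference values quoted for the test with two exponential events can be reproduced as follows. The Python sketch evaluates the two-input case of the integral formula by a simple midpoint rule (an arbitrary numerical choice made here, not the method used by any of the tools mentioned) and compares it with the standard AND formula applied to the failed-state probabilities. The function names are hypothetical.

```python
import math

def pand_by_quadrature(lam1, lam2, t, n=10_000):
    """Midpoint-rule evaluation of the integral of f2(s) * F1(s) over [0, t]."""
    h = t / n
    total = 0.0
    for k in range(n):
        s = (k + 0.5) * h
        density_2 = lam2 * math.exp(-lam2 * s)  # f2(s)
        cdf_1 = -math.expm1(-lam1 * s)          # F1(s) = 1 - exp(-lam1 * s)
        total += density_2 * cdf_1
    return total * h

def standard_and(lam1, lam2, t):
    """Probability that both events have occurred by t, in any order."""
    return math.expm1(-lam1 * t) * math.expm1(-lam2 * t)

pand_value = pand_by_quadrature(1.0, 1.0, 1.0)  # about 0.199788
and_value = standard_and(1.0, 1.0, 1.0)         # about 0.3996
print(round(pand_value, 6), round(and_value, 4))
```

With equal rates, both orders of occurrence are equally likely, so the Priority-AND value is half of the standard AND value; this is a quick plausibility check for the figures above.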

12 ESSaRel allows entering all time and rate parameters with value and time unit, but not so the other tools that were used for reference.



Figure 7.19: Consistency Test Priority AND with Sequential Events

7.1.8 Consistency with Dynamic Fault Trees

Dynamic Fault Trees (DFTs) [DBB92] are a dynamic extension to fault trees that has gained broad acceptance in research and in standards [SVD+02]. Some of the DFT modelling concepts have been integrated into the commercial FTA tool RELEX [Rel]. DFTs also use an underlying state-based model (Continuous Time Markov Chains) for analysis. The tool Galileo [SCD99] is available for the evaluation of DFTs. It is particularly interesting to check whether, for test cases involving dynamics, the analysis provided by SEFTs leads to the same results as DFTs. However, not all the concepts are the same in both techniques (see Section 7.3 for a comparison), so that only the gates that are present in both techniques could be tested and a suitable mapping of DFT events to states or events in SEFT had to be found. A test case involving the Priority-AND gate is shown in Figure 7.20. The experiment produced the same quantitative results in both settings when the same event rates are given. This shows that the underlying state-based concept is the same, which is not surprising, since DFTs are translated into Markov Chains for analysis.

Figure 7.20: Test Case for Comparison Priority-AND in SEFT (left) to DFT (right)

After showing the basic consistency between SEFTs and DFTs, it is interesting to see which other concepts from DFTs can be expressed in SEFTs as well and whether these lead to consistent results. These tests not only show consistency, but also judge the applicability of the technique and compare the capabilities of SEFTs to existing techniques. One gate that has been introduced in DFTs is the Functional Dependency gate (FDEP). The meaning is that some event B occurs probabilistically, but also inevitably when a triggering event A occurs. An example is given in the left part of Figure 7.21. In SEFT there is no such gate; however, the same meaning can be expressed by the state-machine structure in the right part of the same figure. A comparison test using the tools ESSaRel and Galileo shows the same quantitative results for selected rate parameters, as demonstrated in [Gra04]. A possible advantage of the SEFT representation could be its understandability to novices who have not spent time on getting familiar with the specific modelling technique, but know the popular Fault Tree and Statecharts techniques.13

Figure 7.21: The FDEP Gate in DFT (left) and an Equivalent SEFT (right)
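The FDEP semantics and the rate-sum simplification mentioned in footnote 13 can be illustrated with a small Monte Carlo experiment. The Python sketch below assumes independent exponential events that occur only once (rates `rate_a` for the trigger A and `rate_b` for the spontaneous occurrence of B); it is not the ESSaRel or Galileo evaluation, and all names and parameter values are hypothetical.

```python
import math
import random

def fdep_closed_form(rate_a, rate_b, t):
    """P(B has occurred by t) if B is replaced by one event with the summed rate."""
    return -math.expm1(-(rate_a + rate_b) * t)

def simulate_fdep(rate_a, rate_b, t, runs=100_000, seed=0):
    """Monte Carlo estimate: B occurs on its own or is forced when A occurs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        time_a = rng.expovariate(rate_a)      # triggering event A
        time_b_own = rng.expovariate(rate_b)  # spontaneous occurrence of B
        if min(time_a, time_b_own) <= t:      # B occurs at the earlier instant
            hits += 1
    return hits / runs

if __name__ == "__main__":
    print(simulate_fdep(0.3, 0.2, 2.0), fdep_closed_form(0.3, 0.2, 2.0))
```

The estimate agrees with the closed form up to sampling noise, which is the property the simplification relies on.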

Additionally, DFTs offer three types of Spare gates: Cold Spare, Warm Spare, and Hot Spare.14 A Cold Spare gate has inputs to one primary unit and to one or more alternate (or spare) units. These do (except in the special case of component dependency) not fail until they are required, i.e., until the primary unit has failed. When the first spare is activated, it has its normal failure rate; when it fails, operation passes to the second spare and so on, until all spare units have failed; then the gate output fails. Spares can be shared among several spare gates; if they are activated by one of them, they are no longer available to others. Therefore, spare gates offer a powerful modelling facility for spare pools with multiple components of the same or even different type. The Hot Spare gate works in the same way, except that a spare unit that is not currently in use fails with the same failure rate as if it was in use. The Warm Spare gate indicates that the unused spares fail with a rate that is attenuated by a dormancy factor. The gate symbols from DFTs are shown in Figure 7.22.

13 In the cited Master thesis, some additional simplifications are proposed; in particular, it is shown that under the assumption of independent events that occur only once, the triggered event can be replaced by a simple event with a rate that is calculated as the sum of its original rate and the rate of the triggering event. This simplification may be useful for reducing the analysis effort, but it does not show the semantics of the scenario to the user any longer.

14 In [DBB92] there was only Cold Spare; Hot and Warm Spare were added later and can be found in [CSD00].

Figure 7.22: The Hot Spare, Cold Spare, and Warm Spare Gates in DFT
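The relation between the three spare types can be made concrete for the smallest configuration. The Python sketch below assumes one primary unit and one spare, both with the same active failure rate `lam`, a dormancy factor `alpha` between 0 and 1, perfect switching, and no repair; these assumptions and the function name are made for this illustration and are not taken from the DFT tools. The closed form follows from the three-state Markov Chain "both units intact", "one unit intact", "both failed".

```python
import math

def spare_gate_unreliability(lam, alpha, t):
    """P(primary and single spare have both failed by t); alpha is the dormancy factor."""
    # Both units intact: primary fails with lam, dormant spare with alpha * lam.
    p_both_intact = math.exp(-(1.0 + alpha) * lam * t)
    # Exactly one unit intact; the remaining unit fails with its active rate lam.
    if alpha == 0.0:
        # Cold spare: limit of the expression below for alpha -> 0.
        p_one_intact = lam * t * math.exp(-lam * t)
    else:
        p_one_intact = ((1.0 + alpha) / alpha * math.exp(-lam * t)
                        * (1.0 - math.exp(-alpha * lam * t)))
    return 1.0 - p_both_intact - p_one_intact

if __name__ == "__main__":
    lam, t = 0.5, 2.0
    for name, alpha in (("cold", 0.0), ("warm", 0.5), ("hot", 1.0)):
        print(name, spare_gate_unreliability(lam, alpha, t))
```

For `alpha = 1` the result coincides with a standard AND of two independent units, and for `alpha = 0` with the distribution of the sum of two exponential lifetimes, so the Warm Spare case lies between the two.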

There are no equivalent gates in SEFTs. If the failure rate of one event is influenced by the failure of another event, this can be expressed by a state-machine structure in SEFTs. An example of the Warm Spare gate - which is the most general one of the three - is presented in Figure 7.23. A test provided the same results for the DFT gate and its SEFT equivalent with selected parameters. It can be argued that the SEFT model comes closer to the actual scenario and is therefore more intuitive to domain specialists without a deep understanding of reliability theory; however, discussions with practitioners revealed that many reliability engineers expect a spare gate and even more sophisticated spare modelling facilities, such as spare and repair pools (cf. the proposals for extensions in Section 7.4.5). For spares that are shared among several users there is no easy way to model the scenario in SEFTs; this deficiency has also been discovered in some of the case studies and will be discussed later on. One final gate that is available in DFTs and not in standard FTs is the Sequence Enforcing gate. This gate is not actually a gate in the sense of FTA, because it does not have an output. It merely states the assumption that a set of events can only occur in a given order (following the input numbering of the gate). Therefore, no attempt is made to create an equivalent gate in SEFTs. If events can only appear in a given order, they should be modelled in that way, using the provided state-machine elements. However, it is possible that the resulting state-machines look complicated and that practitioners prefer an FT-like style over the more exact notation as state-machines, because they are used to the traditional notation.

7.1.9 Consistency with Markov Chains

Figure 7.23: An SEFT Equivalent for the DFT Warm Spare Gate

The state-machine aspects of SEFT should also be consistent with existing state-based approaches, as far as these allow modelling the same scenario. This particularly applies to Markov Chains because SEFTs claim to integrate different safety and reliability analysis techniques, especially Markov Chains and Fault Trees. The consistency of both modelling techniques can be shown theoretically; this has been achieved for a few simple examples in the Master thesis [Gra04]. The proof idea is that an SEFT that represents a Markov Chain, i.e., that only consists of states and exponential events, is translated to a DSPN using the given algorithm. The DSPN analyser, in turn, determines the marking transition graph of this DSPN where each tangible marking represents one state and each firing of a DSPN transition represents a state transition, which inherits the distribution and parameter of the corresponding DSPN transition. This graph is again a Markov Chain, as it only consists of states and exponential transitions. If this Markov Chain is isomorphic to the original Markov Chain being modelled, then the SEFT version must lead to the same results for any set of values.

Figure 7.24 revisits the example already presented in Section 4.7, which has also served as a demonstration example in the Master thesis. The left part of the figure shows the Markov Chain notation and the right part the corresponding SEFT notation. The DSPN translation will look very similar to the SEFT notation (except for the init event that corresponds to an initial marking of the left place). The marking transition graph of that DSPN again looks like the Markov Chain in the left part of the figure.

Figure 7.24: Test Setting: A Markov Chain and its Corresponding SEFT

To check that this equivalence works in practice, some test cases were executed. They comprise the given example with different settings for the transition rates and different mission times. The numerical reference results were calculated with MuPAD, EXCEL, and the Markov analyser that is part of ESSaRel. There have also been some test cases that comprise both deterministic and exponential transitions. All test cases led to the expected results and show that SEFTs are compatible with the established Markov models.

7.1.10 Semantical Issues

Although, in general, all test cases succeeded, it became obvious that there is a set of semantical details that needs to be defined more precisely, and a set of issues where the chosen implementation does not reflect the original intention. There is no exact definition of what happens at the point of time when an event fires. In the case of an Upon gate, for instance, it is not defined whether the state output term is already true at the moment when the input event occurs, or only an infinitesimal time later. Another example is the coincidence of event occurrences. In the theoretical discussion, this aspect has been left aside, because independent events in continuous time can be assumed not to occur at exactly the same time. Moreover, in most cases the differences in the quantitative results are infinitesimal. However, in the case of the Priority-AND gate, this difference leads to qualitatively different results. An interesting test case connects both inputs of a Priority-AND to the same event. The question of interest was whether or not the gate output triggers if the input is triggered only once. Interestingly, the system behaved differently on simulation and on analysis. This test case can be modified so that two independent generator state-machines are used for the two inputs, each having a deterministic event with the same time parameter. This time the events occur simultaneously, without being identical. The behaviour was different from the former case. More details can be found in [Gra04]. In some cases, the different behaviour can be attributed to an inexact specification of the underlying DSPN model or to the actual implementation of the TimeNET tool used, but sometimes the point is that the SEFT model specification is not as precise as it should be. In some cases, the resolution of conflicts between concurrently enabled Petri Net transitions is an issue.
If, for instance, several state inputs of subsequent gates are connected to the same state or state output, they all use a guard pattern (i.e., a pair of anti-parallel arcs) to sense that state. If the state becomes enabled, the firing order of the subsequent transitions is undefined, which could, in some cases, lead to problems (which have actually never been observed). This issue could be resolved by assigning priorities to the subsequent transitions to define an arbitrary order between them. Other issues are most probably due to the implementation, e.g., the fact that the simulation does not work if states occur where no transition is enabled, although the model seems semantically correct and the analysis produces correct results. The mentioned issues do not put the whole SEFT technique into question; however, before SEFTs are recommended for industrial use, they should be addressed.
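The effect of the undefined simultaneity semantics can be made explicit for a Priority-AND fed by two deterministic events. The Python sketch below is purely illustrative: the flag `simultaneous_counts` is a hypothetical parameter that selects one of the two possible readings, and neither reading is claimed to be what ESSaRel or TimeNET implements. Whether the output is already set at the very instant of the second event is likewise an arbitrary choice here.

```python
def pand_deterministic(delay1, delay2, t, simultaneous_counts=False):
    """Output probability of a two-input Priority-AND with two fixed delays."""
    in_order = delay1 < delay2
    if simultaneous_counts and delay1 == delay2:
        # Reading A: simultaneous events are accepted as "input 1 first".
        in_order = True
    # Reading B (default): simultaneous events never trigger the gate.
    if in_order and t >= delay2:
        return 1.0
    return 0.0

# Uncontroversial cases described for two deterministic events:
print(pand_deterministic(2.0, 5.0, 10.0))  # 1.0, input 1 fires first
print(pand_deterministic(5.0, 2.0, 10.0))  # 0.0, wrong order

# Equal delays: the result depends entirely on the chosen reading.
print(pand_deterministic(5.0, 5.0, 10.0))                            # 0.0
print(pand_deterministic(5.0, 5.0, 10.0, simultaneous_counts=True))  # 1.0
```

A precise SEFT specification would have to fix one of these readings so that simulation and analysis cannot diverge.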


7.2 Case Studies

Some case studies were carried out to evaluate the applicability of the SEFT technique to more complex systems. Still, these systems were not real-sized industrial projects, but they involved a lot more aspects than the simple test cases discussed so far. Some of them are standard studies from published literature, others were invented especially to show the capabilities of SEFTs. Most of the case studies were published in research papers [KG04, Kai05], some were evaluated in the Master thesis [Gra04] and details can be found there. As a demonstration of the applicability of SEFTs, this section gives a summary of these case studies. Some are shown in more detail in order to show the SEFT technique at work, others are just briefly summarised in order to point the attention to particular features and limitations of the SEFT technique. The limitations found in these case studies inspired some of the extension proposals discussed in Section 7.4.

7.2.1 A Simple Fire Alarm

An example that shows the different modelling elements of SEFTs is the fire alarm system that was first presented in [KG04]. The system consists of two redundant fire alarm units which may fail stochastically. The system and its components are given in Figure 7.25. The hazard to be analysed is the situation when both alarm units are simultaneously in the failed state, since in this case a fire might break out without being noticed. A watchdog periodically checks the alarm units and restarts them if they are in the failed state.

The alarm units are instances of the component "Fire Alarm Unit". An alarm unit may be running properly or fail stochastically with a failure rate of 0.001 per hour.15 In order to restart a unit, an external trigger is needed. After the trigger and before returning to normal operation, some initialization steps have to be performed, with an exponentially distributed time of 0.1 hours on average.16 In order to notice when a unit is out of order, a state output port (depicted by the filled S triangle) that senses whether the unit is running is used in the model. For external triggering of the initialisation routine, an event input port (the empty E triangle) is needed. The watchdog is simply depicted as a component with only one state. The triggering event is produced once every hour (modelled by a deterministic event) and can be connected to other components via an event output port.17

Figure 7.25: The Fire Alarm Case Study

15 The tool ESSaRel allows entering all values with their physical units. This is a contribution to avoiding user errors, as numbers without units have often led to misinterpretations.

16 This example has also been conducted with a fixed delay of 0.1 hours. As several concurrent deterministic delays usually enforce quantitative evaluation by simulation, a comparison between analysis and simulation of the resulting DSPN is not possible; therefore, the variant with probabilistic delay is presented here.

17 To model events that occur once every fixed time interval, a short-hand notation named solitary event will be proposed in Section 7.4.1.

The top-level SEFT shows how the components are interconnected and models the top-level hazard (corresponding to a top-event in traditional FTA). This SEFT would normally have to be drawn by the safety analyst, whereas the subcomponent models for the alarm units and the watchdog can be provided by the system designer or even be automatically derived from the functional design models of these components. The top-level system contains two instances of the fire alarm unit and one instance of the watchdog component. The inner structure of the instances is omitted in this view. The watchdog is connected to the event input ports of both alarm units so that it can trigger a restart of a failed unit when necessary. Since the fire hazard is present when both of the redundant alarm units are not working at the same time, they are combined with NOT gates, which in turn serve as inputs for a State-AND gate. The output of the State-AND gate is connected to the system-level output port that represents the hazard situation. All preconditions for analyzing this model are fulfilled: There are no cycles in the component hierarchy or in any causal relation, and all state and event ports have been connected to their counterparts. The state output port of the AND gate does not need to be connected since it represents the hazard situation to be analyzed. The translation and evaluation process follows the procedure given in this thesis; all steps and intermediate results are described in [KG04, KGF06]. The DSPN after translation and flattening is shown in Figure 7.26. As the currently available DSPN analysis algorithms are not capable of analysing DSPNs with simultaneously enabled deterministic transitions, the results had to be obtained by transient simulation. This is a disadvantage, as simulation takes much more time than analysis (see Section 7.5 for a discussion of evaluation time and possibilities for acceleration). Figure 7.27 shows a plot from a simulation. The maximum hazard probability is 0.0103 in this simulation.
It can be seen that the fine-grained behaviour of the system is considered in the evaluation, hence the periodical growth and reset of the hazard probability. Not in every practical case is such a fine-grained evaluation desired or feasible. Sometimes, an approximative solution is sufficient. This is one of the options for evaluation time reduction that will be discussed later on. This case study shows how elements from state-based modelling techniques can enrich a Fault Tree Analysis, and that in the presence of computer-controlled systems with their deterministic delays, the availability of different kinds of events leads to more accurate modelling in comparison to traditional models that approximate every delay with an exponential distribution. The resulting model has more state-machine elements than Fault Tree elements and may perhaps not be perceived by safety analysts as an extended Fault Tree. A suggestion to users is to keep state-machine elements and Fault Tree elements in separate component models in order to avoid confusion. Moreover, the model structure needed to express events that occur periodically (the watchdog from the example) may seem unnecessarily complicated. A simplification, the periodic solitary event, is proposed in Section 7.4.1. Nevertheless, the given system could not have been modelled in the same way in standard Fault Trees, neither in terms of visible correspondence between model and reality, nor in terms of accuracy of the obtained results. Therefore, SEFTs are better suited than any of the traditional techniques alone to model this kind of system.


Figure 7.26: DSPN for the Fire Alarm Case Study



Figure 7.27: Simulation Result Plot for the Fire Alarm Case Study

7.2.2 A Motorway Safety System

Another example presented in [Kai05] is a motorway safety system that comprises hardware and software. To avoid the hazard of cars entering a motorway with separate lanes the wrong way, an alarm system is installed at every entrance. It contains two sensors mounted beneath the road surface, a controller, and a flashing red light to signal drivers to stop if they entered the wrong way. The controller is described by a state diagram (in the example, a ROOMchart) that was modelled similarly inside a CASE tool (in the example, Rational Rose RT) and then imported to the SEFT tool ESSaRel by the prototype translator implemented in [Rog05]. This case study exemplifies a richer variety of SEFT gates and a situation that is difficult to model with traditional FTs. The state diagrams in CASE tools only provide triggered or deterministically timed transitions. The feature of probabilistic state diagrams in SEFTs is helpful to model the environment (i.e., a car entering the wrong way) or component failures. The sensors are the only technical components assumed to fail in this example. The software is assumed to be correct, i.e., to behave as specified by the state diagram. Yet there may be hazards resulting from wrong system composition, inappropriate controller software, or the behavior of the environment (the car, in this case). The purpose of the SEFT analysis is to detect and quantify these hazards. As usual in FTA, the analysis concentrates on one given hazard (a car entering the wrong way without subsequent alarm) and neglects all others. In consequence, unnecessary details from system design and uncritical situations, like cars going the right way, can be omitted.


The model of the car can be found in Figure 7.28. We assume cars going the wrong way to appear stochastically. This can be modeled as if one car passed the sensor repeatedly, according to an exponential distribution with a rate of 1 per week. The car, coming from the initial state "far", first hits the second sensor (because it goes the wrong way) and thereby enters the state "between". Then, after a delay, it hits the first sensor. The delay is stochastic, because cars run at different speeds; we assume a mean delay time of 5 seconds. The failure modes of the component class Sensor are specified as an SEFT. Two different failure modes are supposed: a sensor can send an event without being hit, or it can fail to send an event when it is hit. Note that on this level of hierarchy, it is useless to classify these faults as hazardous or not, since only the controller and the environment determine what further happens with the sensor signals. The first kind of failure is modelled by a spontaneous stochastic event with the rate of 1 per year and the second one with the Conditional Probability gate, setting its parameter to 0.9 (90% probability that the event is propagated correctly or, in other words, 10% probability for a failure on demand).

Figure 7.28: Motorway Alarm Case Study: The Car and the Sensor SEFTs

The state diagram for the controller is shown in Figure 7.29. A car going the right way (sensor 1 first) sets the controller to the state "Forward" and then back to its initial state "Ready" as soon as it hits the second sensor. A car touching sensor 2 first causes a transition to the "Reverse" state and, by hitting sensor 1, a further transition to the "Alarm" state. This is the state when the red light (not shown in the example) is on. To make the controller fault-tolerant, time-outs are programmed so that 10 seconds after only one sensor (and not the other one afterwards) has been touched, the controller assumes a false detection and goes back to initial state. From Alarm state, the controller goes back to Ready state after 1 minute. The complete system (Figure 7.30) contains two instances of the component class Sensor. The sensor inputs and outputs are connected to the car and controller components by causal edges. A proposition that cannot be expressed directly by traditional FTs is the fact that the hazard is present when the alarm is not enabled a certain

166


Figure 7.29: Motorway Alarm Case Study: The Controller SEFT

time after the car hit the sensor. Admitting a certain delay for a system reaction before the dangerous situation is assumed to begin is a typical scenario in many safety critical applications. To model this scenario, the Deterministic Delay gate is used and a tolerable time of one second (before something bad happens) is deliberately assumed. At the point of time when the delay runs out, the alarm device must be found in the ON state. The AND gate is an Event-State-AND, denoting that the (delayed) event propagates to the output port if the alarm state is not active, hence the NOT gate at the right leg (an Inhibit gate could replace both gates). The UPON gate at the top of the tree allows referring to the state after the event has happened. The probability of this state is the measure of interest when carrying out the quantitative analysis. Again, translation, flattening, and evaluation are described in the original publication [Kai05] and not repeated here. However, to give an impression of the complexity of the resulting DSPN of this quite simple example, the DSPN after flattening is depicted in Figure 7.31. This picture also answers the frequently asked question of why it is not preferable to model the system directly in DSPNs instead of first modelling it with SEFTs and then translating the model into DSPNs: The DSPN lacks the component structure and the visible semantics to explain how the system actually behaves. Modelling close to reality is an obvious feature of SEFTs. Under the given assumptions, the resulting probability for the output place to be marked is calculated to 0.67 in this example, which means that there is a chance of 67% that during the assumed mission time of the system (1 week in the example), at least one car could enter the wrong way without an alarm being issued within 10 seconds.18 Figure 7.32 shows an TimeNET screenshot showing the simulation results. 
In case of a simulation, the tool also displays the error range (light lines above and below the main curve) within which 95% (according to selected parameters) of the results are expected.

18 The selected values in the example were not typical. They were chosen in order to keep the range between the shortest and the longest time constants as small as possible, because the number of sampling points and thus the simulation time depends on this range. To cope with this effect in real analyses, a solution will have to be found; this is discussed in Section 7.5.2.
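
The controller behaviour described for Figure 7.29 can be sketched as a small timed state machine. The following Python sketch is only an illustration and not part of the SEFT tooling: the class and method names are hypothetical, sensor events that the text does not mention (e.g., sensor 1 touched twice in a row, or any sensor touched during Alarm) are assumed to be ignored, and time-outs are evaluated lazily when the next sensor event arrives.

```python
from enum import Enum


class State(Enum):
    READY = "Ready"
    FORWARD = "Forward"
    REVERSE = "Reverse"
    ALARM = "Alarm"


SENSOR_TIMEOUT_S = 10.0  # from the text: false detection assumed after 10 s
ALARM_DURATION_S = 60.0  # from the text: Alarm returns to Ready after 1 minute


class Controller:
    """Hypothetical sketch of the motorway alarm controller."""

    def __init__(self):
        self.state = State.READY
        self.entered_at = 0.0

    def _goto(self, state, now):
        self.state = state
        self.entered_at = now

    def _expire(self, now):
        # Apply a pending time-out before handling the next sensor event.
        elapsed = now - self.entered_at
        if self.state in (State.FORWARD, State.REVERSE):
            if elapsed >= SENSOR_TIMEOUT_S:
                self._goto(State.READY, self.entered_at + SENSOR_TIMEOUT_S)
        elif self.state is State.ALARM:
            if elapsed >= ALARM_DURATION_S:
                self._goto(State.READY, self.entered_at + ALARM_DURATION_S)

    def on_sensor(self, sensor, now):
        """Handle a touch of sensor 1 or 2 at time `now` (seconds)."""
        self._expire(now)
        if self.state is State.READY:
            self._goto(State.FORWARD if sensor == 1 else State.REVERSE, now)
        elif self.state is State.FORWARD and sensor == 2:
            self._goto(State.READY, now)   # car passed in the right direction
        elif self.state is State.REVERSE and sensor == 1:
            self._goto(State.ALARM, now)   # car passed in the wrong direction
        # All other combinations are ignored (assumption, see lead-in).
        return self.state
```

A wrong-way car is then the event sequence `on_sensor(2, t)` followed by `on_sensor(1, t + dt)` with `dt` below the 10-second time-out.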

7.2. CASE STUDIES


Figure 7.30: The Top-Level SEFT of the Motorway Alarm System

This case study shows a greater richness of Fault Tree gates and more complex state machines than the first one. Also, the automatic generation of SEFTs from design models from a CASE tool has been demonstrated. What became clear, on the other hand, was that the more one tries to model the functional behaviour of technical systems, the more the resulting SEFTs tend to look like formal design models rather than Fault Trees. This might lower the acceptance as a safety or reliability technique.

7.2.3 Inverted Pendulum: A Case Study for the Import of ROOMcharts into SEFTs

One of the applicability studies was performed with an inverted-pendulum demonstrator featuring redundant controller units and a safety operating system / middleware developed by Fraunhofer FIRST. The pendulum is held in its unstable head-up position using a redundant set of 3 networked controllers, angle sensors, and a step motor. All components and their connections were modelled by failure rates. To obtain observable failures, a fault injection mechanism was inserted: the controllers were programmed to stop communication (fail silent) at random intervals according to an exponential distribution. Only an external reset could put the controller back to the normal state. A PC that was connected to the data bus recorded all messages and detected the up and down time of the controllers. The controllers had an additional boot time assigned to them that applies after another controller has detected a failure and sends a reset signal.



Figure 7.31: The DSPN of the Motorway Alarm System


Figure 7.32: The Result Plot from a Simulation of the Motorway Alarm System

The safety-critical scenarios to be modelled quantitatively were
1. the hazard state of the controller system being unavailable
2. the accident event of the pendulum falling down
The start state of these scenarios is the stable state of the controller with pendulum head-up, without movements, and with all components working correctly. Regarding usability, it turned out that on a coarse level, modelling each component as either working or failed with the transition rates and times as specified, the model could be entered quite easily and the component concept proved useful for structuring the project and profiting from the multiple instances of the controller. However, when trying to model the behaviour on a more detailed level using Rational Rose RT (e.g., with special attention given to the communication failure detection and handover procedure, which is important for the timing conditions), it turned out that it was very difficult to get a suitable SEFT model. This is due to several reasons: First, there was no complete and realistic Statechart model of the controller system. Some models had been drawn by the developers during system design, but the final software implementation had been done by hand without referring exactly to the models. Second, a lot of details were not designed intentionally into the system but determined by hardware or environment factors, such as detection times for communication failures or startup times of the controllers. These times, but also the qualitative behaviour in some situations, were not known and proved to be hard to determine by experiments. This is not necessarily an inconvenience of the SEFT method, but rather a question of the applied engineering skills (software could have been generated from Statecharts, communication and hardware parameters including failure rates could be known for standard parts...). However, as the engineering maturity level is similar in large parts of the industry today, this means that the SEFT technique is not yet applicable to realistic systems in many industry branches.



Third, even if an exact Statechart, ROOMchart, or Stateflow model had been used to generate the code, this model would most probably have semantic elements that cannot be translated into SEFTs by hand or by the translation algorithms implemented by [Rog05]. It turned out that the implemented model import from Rational Rose RT, which had worked well in small test cases, did not perform as well in larger applications. This is not necessarily due to a bad implementation, but to semantic details that were overlooked or that were too complicated to implement in the translation algorithm, or missing in the SEFT model itself. In particular, hierarchical states are missing in the current version of SEFTs. Also, it was hard to find adequate descriptions for the communication mechanisms between components, as ROOMcharts allow messages with typed data where SEFTs can only express event and state ports. This is a limitation of the SEFT technique and will be discussed later. Fourth, it was impossible to model the real-world accident event "pendulum falls down" accurately, as the physical laws - although generally understood - cannot be expressed by SEFTs. In reality, it takes a lot of control theory to decide on how long the pendulum stays head-up in a defined tolerance range without external control. So it is not possible to calculate, e.g., the maximum time for the hand-over from a failed controller to a spare part. It was not possible to express conditions on physical values (like the angle of the pendulum) or data of a continuous range (like internal measurement values) and to compare them to tolerance ranges for safe operation. This is a limitation of SEFTs being a discrete state space method and can only be overcome by hybrid modelling techniques. Proposals to extend SEFTs to allow integer or even float values could emerge and will be discussed in Section 7.4.6. Their feasibility can, however, be questioned.
In summary, this case study shows that SEFTs are better suited for coarse-grained models (on the abstraction level of standard FTs) or for pure discrete state-space controllers, but not so much for complete modelling of physical or continuous control systems. Regarding quantitative results, no useful results could be obtained from this experiment, since there were apparently unconsidered factors that led to frequent dropouts of the pendulum system that were not predicted by the model.

7.2.4 Further Case Studies

Further case studies have been carried out to cover additional aspects of SEFTs or to test the particular applicability to standard problems that are widely used in the literature to evaluate similar techniques. Most of them can be found in detail in the Master thesis [Gra04]. They are briefly summarised here together with their results. Communication Channel is an example of how SEFT fits the specific aspects of distributed systems and also how the component concept can be exploited for the creation of a pattern library for standard applications. The communication pattern includes deterministic plus probabilistic delay, a conditional probability for loss or corruption of data, and a probabilistic event for spontaneous,

uncommanded messages at the channel output. This example confirmed that it is possible to provide standard building blocks for recurring parts of safety analyses. The prohibition of causal cycles in SEFTs turned out to be a problem, because usually the communication channels are modelled for both directions, which leads to cycles.

Dining Philosophers is a well-known academic example that symbolises shared access to some medium or resource. It was chosen here to try out the composition of regular systems out of a number of equal components. The story is that there is a group of philosophers (usually 5) sitting around a table. There is one fork between each two of them and a bowl of noodles in the middle of the table. The philosophers are thinking for a probabilistic time, then they get hungry and try to grab both forks, the one to their left and the one to their right. As the forks are shared with their neighbours, it can happen that only one or no fork is available; the philosophers, though, can only eat if they have both forks. Eating takes a probabilistic time, after which the philosopher puts down both of his forks and starts thinking again. The worst case scenario is that each philosopher has one of the forks in his hand and is not willing to put it back until he has eaten - this is called a deadlock. In the adapted version examined for this thesis, not only the deadlock is examined, but a top-level Fault Tree uses the Duration gate to indicate a failure if any of the philosophers stays hungry for longer than ten seconds for any reason (waiting for too long or deadlock). In reality, this corresponds to the situation that a safety-critical message cannot be delivered or processed for a certain time because the bus or processor is busy. For simplicity, only two philosophers were modelled. Philosopher and fork were modelled as reusable components and instantiated twice each.
The connection pattern is regular; this suggests amending the SEFT technique by some textual architectural specification language that allows building "vectors", "rings" or "pools" of n instances of the same component, as these structures can be found in many highly reliable technical systems.

Mission Avionics System is a study that has been examined both in [DA01] and in [Buc00], where combinations of Fault Trees and state-based modelling are examined, too. This study thus helps to compare SEFT to existing proposals for state-based extensions of the FTA technique. The case study is closer to reality than the examples mentioned so far. It describes a highly redundant control system which consists of five cooperating computer subsystems, named Vehicle Management System, Crew Station, Local Path Generation and so on. The degree of redundancy and the consequences of component failures are different for each of the subsystems. The details are modelled differently in both cited references and also in the Master thesis [Gra04]. In particular, redundancy is handled differently: in [DA01], spare gates in FT notation are used whereas in [Buc00], the notion of repair groups exists. It turned out that similar modelling elements are still missing in SEFTs. The scenario could, however, be modelled in SEFTs, but less elegantly and less intuitively. A side condition was that the spares had to be put artificially into a defined order, which was not demanded by the problem description (but most probably corresponds to the actual technical implementation). The missing spare gate could be emulated by a triggered state-machine in SEFT. Again, it seems desirable to add specification elements for spare pools; this issue will be discussed later. There was also one component with degrading failure behaviour (three states: working - minimal function - completely failed); this could be captured easily by referring to the state-machine elements and the multiple output ports offered by SEFTs.

Radio-Controlled Railway Crossing [KT02] is a frequently cited case study from railway automation. It deals with the radio-controlled operation of trains, which is an upcoming technology in European railway networks. Considering a level crossing, the train must obtain the right-of-way signal after the barriers have been closed and has to stop if it does not receive this signal early enough. It is obvious that this scenario is highly safety-critical and that it involves real-time constraints, as well as many different components with different failure modes. In [KT02], the example is modelled with rich details and failure possibilities for every component involved. In [Gra04], the example has been simplified and modelled in the SEFT technique using ESSaRel. Aspects that were particularly considered included the communication channels and their failure modes, in order to evaluate the applicability of SEFTs to distributed systems. Again, it turned out that the component concept helps the analyst a lot to find an appropriate partitioning of the system. Also, the state-machine modelling elements of SEFTs were necessary to model the scenario realistically; this could not be achieved in the same way with standard FTs without temporal extension, as the authors of the original study confirm. The whole study (already largely simplified) consisted of 7 components, the most complex one with 6 states, most of them with more than two states.
Still, the correctness of the model could be verified manually thanks to the different abstraction levels of the component hierarchy. The model could only be simulated due to deterministic delays. Usable quantitative results could, however, not be obtained, because too many details had been left out, and the original study does not provide comparable numbers, either. Gas-Burner is a standard case study from safety analysis and can be found, for example, in [GMW95], where an FTA mapping to Petri Nets is also proposed. It contains timing aspects, but is much easier than the complex Railway Crossing example and is therefore interesting for the evaluation of SEFTs. A gas burner system consists of a gas tank, a valve that controls the gas flow out of the tank, and an ignition unit, which can set the gas on fire. If, for any reason, gas escapes from the tank, it must be assured that the ignition unit produces a spark; otherwise, after a certain time, there is a danger of explosion. This example deals with a failure on demand, which is an important problem class that is hard to describe with standard FTs. In comparison to the Railway Crossing example, the model that could be found for the Gas Burner looked much more like a Fault Tree than like a Statechart and showed that SEFT actually is comparable to classical FTA. Timed gates were used to model the acceptable delays until a hazard is present. The distinction of states (gas leaks from the tank) and events

(ignition spark occurs) made it easier to model the situation correctly and to communicate with other people. The model has been analysed; however, there were no reference values to compare the result to. A larger industrial case study is currently under preparation with Siemens TS in Braunschweig. The goal is to repeat a safety analysis of a radio-controlled train operation system (this time, at a realistic level of detail) that was formerly performed by traditional FTA. The analysts at Siemens complained that it was hard to capture the realistic behaviour of this system by standard FTA; after a first introduction to the SEFT technique, they expressed the hope that SEFTs would be more convenient for modelling the scenario. The results are expected in the spring of 2006 and will be published as a student's thesis. Additionally, the tool UWG3 has been used in different Siemens departments and also by a number of other analysts since 2004, and the feedback regarding the component concept has been very positive.

7.2.5 Observations from the Case Studies

In summary, the case studies proved the general applicability of SEFTs to typical safety- or reliability-critical scenarios. The applicability is comparable to other approaches for dynamic or multistate enhancements to FTA. In many cases, the new technique allows modelling of practical examples in a more intuitive way and closer to reality than standard FTs; however, some details must be modelled differently and redundancy situations in particular are more difficult to model in SEFTs. The students who were involved in the case studies confirmed that the technique and the supporting tool ESSaRel can be understood within a few days of training. Industrial users further confirmed that the inability of traditional Fault Trees to distinguish between states and events and their lack of expressiveness for temporal order and time had been leading to inappropriate safety models in the past, and that this issue was solved to a great deal by the SEFT technique. With regard to Markov Chain analysis, industry partners at Siemens Corporate Research had been complaining about models that were too complicated to be meaningful to humans and to be checked manually for correctness. The state-based components as offered by SEFTs were judged as a solution to this issue. Other industrial users at Siemens, as well as an independent safety analyst from Denmark, confirmed the advantage of the component concept that had been introduced by the Component Fault Trees, as integrated in the UWG3 tool. In general, the experience base with the UWG3 tool is much larger than with the ESSaRel tool, which is still at a prototype stage, and from the user feedback (over 60 registered users) it can be judged that CFTs have found their way into practical acceptance. However, a set of inconveniences regarding SEFTs became visible during the modelling of the case studies. Some of them are also present in alternative techniques, but some are specific to SEFTs. 
The most criticised issue in comparison to traditional Fault Trees is that there are no basic events and that each time, a state-machine model has to be created; a solution to this issue, called solitary events, is easy to implement and is presented in Section 7.4.1. Some case studies from reliability engineering claim better possibilities to model spare pools or dependent failure rates for spares (cf. the Spare gates in the DFT technique). These issues are discussed later on and some improvements are proposed; however, existing safety and reliability analysis techniques often do not deal better with these aspects. There are additional ideas for semantics extensions and state hierarchy (mainly found missing in the Master thesis [Rog05]). These aspects should be discussed very carefully, because they can introduce subsequent analytical problems and also hamper intuitiveness. A discussion of proposed extensions as a result of the case studies is given in Section 7.4. The case studies and the industrial experiences focus on the applicability of the method, not on the numerical results. As reference results for the case studies and industry examples were not available, no comparison could be made. Only the Markov Chain and Fault Tree aspects of SEFT were successfully compared with the traditional methods, as discussed above. One problem that was not unexpected was the analysis and simulation time required. As SEFTs containing deterministic transitions can only be solved by simulation, simulation time in particular is an issue. Performance is discussed in Section 7.5 and some promising approaches for improvement are given. However, it should be noted that the measured calculation times rely on a prototype tool and that both algorithm and implementation leave potential for optimisation. Also, the fact that the user had to change between a Windows PC running ESSaRel and a Linux PC running TimeNET was found particularly annoying, but cannot be attributed to the performance of the analysis algorithms.

7.3 Comparison to the State of the Art

Following the presented proofs, tests and case studies, it can be stated that the new SEFT technique presented in this thesis provides plausible and consistent quantitative results. Both inner consistency and consistency with existing analysis techniques - in so far as they are able to model the same situations - were shown successfully. It has been shown that in terms of fitness for purpose and industrial applicability, the SEFT technique exceeds the known standard techniques in many facets. SEFTs - and also CFTs, which have been proposed as an intermediate result of this thesis - offer a structuring concept that is superior to traditional Fault Trees and Markov Chains and comparable to software engineering models like ROOMcharts. This component concept is unique to SEFTs and CFTs and several practitioners have agreed that it adds valuable structuring and reuse capabilities that were missing in all existing safety analysis techniques, but are, at the same time, well-known to users familiar with software engineering models. Furthermore, SEFTs and CFTs avoid confusion about repeated events, as they extend FTs to directed acyclic graphs and give a unique identity to each event.


In comparison to traditional Fault Trees, SEFTs avoid confusion about states and events and offer modelling capabilities for event ordering and time. While the Priority-AND gate is a part of the Fault Tree standards, its traditional application is restricted and some commercial tools even ignore the ordering information. The comparison to Dynamic Fault Trees [DBB92], a model that correctly deals with ordering information and Priority-AND, showed that the results are compatible with those of SEFTs for Priority-AND, even in cases where the traditional approach from [FAR76] does not work. However, DFTs cannot deal with real time distances between events and with delays. Moreover, they lack the state / event distinction that SEFTs offer; this can, on the other hand, be an advantage for industrial acceptance, as DFTs appear to be closer to the established Fault Trees. DFTs offer some gates that are missing in SEFTs (e.g., the different Spare Gates). Some of them, however, do not match the original idea of gates (especially the Sequence Enforcing Gate) and are only necessary in DFTs because there is no capability to integrate state-machine elements as in SEFTs. Buchacker's approach [Buc00] is another approach to exploit state-based modelling for FTA. It has been shown that the results of this approach comply with the results found by SEFTs. Even the underlying Petri Nets are equivalent in most cases. SEFTs share the ability to model multi-state components with Buchacker's approach. However, the state / event distinction is missing there, too. Also, the state-machine capabilities have not been exploited for the modelling of temporal order and real-time. As GSPNs and not DSPNs are used as the underlying model, deterministic timing behaviour, which often occurs in computer-controlled systems, cannot be modelled. As SEFTs use notation elements from popular techniques, they can easily be understood by practitioners and avoid ambiguities.
However, their new appearance in comparison to standard FTs can hamper their acceptance, not only for experienced FTA users, but also for authorities that require Fault Tree Analysis for the acceptance of new technical systems. The state-machine paradigm that allows behavioural modelling on a finely granular level could mislead the user into attempting a safety model on a level that is too detailed, which will result in a safety argument that is too complex, or in a failure of the whole analysis because of unknown details. Integration of safety analysis techniques with software design models has been tried by several researchers, but the direct translation of Statecharts or similar notations into a safety modelling technique has not been achieved before, as it requires a clear notion of states and events that has been introduced by SEFTs. The general feasibility of the integration of Rational Rose RT models and SEFTs has been demonstrated by [Rog05]. Measured in terms of the initially proposed requirements
• compositionality
• fitness for the semantic aspects relevant for software-controlled systems
• good compromise of intuitiveness and formal semantics
• integration into an embedded systems development process,



the SEFT technique exceeds most of the established safety and reliability techniques. However, some drawbacks and missing features of this technique have to be pointed out. Being a new technique, the SEFT technique evidently leaves room for improvements and extensions that could not be implemented during the doctoral research phase. The following section gives an overview of some disadvantages of SEFTs in their current version that were found during the evaluation. It also discusses possible improvements, as far as propositions exist.

7.4 Limitations and Improvement Proposals

During the practical evaluation, a number of difficulties and missing features in the SEFT concept became obvious. A significant part of these issues could be fixed by changes and additions to the original proposal published in [KG04]. A number of new gates have been added since then and some semantic issues were clarified or changed, e.g., using an init transition instead of marking one state as initial state (cf. [Kai05, KGF06]). There were cases, however, where a solution to known issues would require further detailed research or where proposed solutions or extensions had to be postponed, in order to obtain a sound and usable version of the SEFT concept quickly. This section lists several problem areas and discusses the achievements of this doctoral thesis critically. Alternatively, it can be understood as a list of improvement suggestions to researchers who want to develop the SEFT concept further towards industrial applicability in the future.

7.4.1 Basic Events and Solitary Events

In discussions with experienced Fault Tree analysts from industry it turned out to be a main problem that, at first glance, there is no way to express the basic events from traditional Fault Trees. After a short explanation, most experts accepted that each basic event, as provided in traditional FTA tools, semantically represents a subcomponent described by a two-state Markov Chain with a given failure rate, of which the failed state probability is taken into account for the combinatorial analysis. In SEFTs, there is no basic event symbol (circle in traditional FTA), but a subcomponent has to be created that has two states and one transition, like the one in Figure 7.33. A similar example of this correspondence has already been given in Section 4.7. Although the SEFT notation only makes explicit what has been tacitly assumed in traditional FTA (which was one main purpose for its invention), acceptance problems are foreseeable because of the missing basic events. A solution to this issue could be the re-introduction of the basic event from traditional FTA as a shorthand for the failed state of a two-state subcomponent, as depicted. However, as these "basic events" actually represent (failed) states - as they always did in standard FTA - the name event is confusing and the choice of a different name should be taken into consideration.
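
The correspondence between a basic event and a two-state subcomponent can be illustrated numerically. The sketch below assumes a constant failure rate, no repair, and statistically independent basic events; the function names are hypothetical and not taken from any FTA tool.

```python
import math


def failed_state_probability(failure_rate, t):
    """Probability that a two-state Markov Chain (working -> failed,
    constant rate, no repair) is in the failed state at time t."""
    return 1.0 - math.exp(-failure_rate * t)


def and_gate(probabilities):
    """Combinatorial AND of independent failed-state probabilities."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result


def or_gate(probabilities):
    """Combinatorial OR of independent failed-state probabilities."""
    none_failed = 1.0
    for p in probabilities:
        none_failed *= 1.0 - p
    return 1.0 - none_failed
```

For example, `or_gate([failed_state_probability(1e-4, 1000.0)] * 2)` combines two identical subcomponents whose failed states feed an OR gate.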


Figure 7.33: Subcomponent Representing a Basic Event (left) from Traditional FTA

In a similar case, the user might want to refer to events ("events" this time in the sense of SEFT events) that are failure events of similar two-state subcomponents, without having to draw the whole component. Therefore I suggest the extension of the SEFT model repository by solitary events, i.e., events without explicit predecessor and successor states. Semantically, they are also abbreviations for subcomponents, this time with an event type output port. They could look as depicted in Figure 7.34. Four variants of them could be useful:
• Single probabilistic event: occurs once in the system's lifetime, exponentially distributed with a given rate
• Single deterministic event: occurs once in the system's lifetime, at a given, deterministic time after system-up19
• Repeated probabilistic event: occurs repeatedly at stochastic intervals according to a Poisson distribution20 with a given rate
• Repeated deterministic event: occurs repeatedly at equidistant time intervals from system-up, with the given delay (e.g., every 5 hours)
For the probabilistic types, a rate parameter (alternatively MTTF / MTBF) and for the deterministic types, a time parameter must be specified. During translation to DSPN, the solitary event would first have to be replaced by a corresponding subcomponent; this kind of translation can easily be implemented in the ESSaRel framework. The pattern for the single probabilistic event as an example can be found in the left part of Figure 7.35, the pattern for the repeated deterministic event as a second example is given in the right part of the figure.

19 The system-up event is marked by the mandatory init event of the system.
20 The Poisson distribution is closely related to the exponential distribution: if the number of events per time interval follows a Poisson distribution (i.e., the events form a Poisson process), the times between consecutive events are exponentially distributed.
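
The four solitary event variants can be illustrated by generating their occurrence times. This is a minimal sketch under stated assumptions, not the ESSaRel translation: the function names and the `horizon` parameter (the observed time span after system-up at time 0) are hypothetical, and the repeated probabilistic event is sampled by summing exponentially distributed inter-event times.

```python
import random


def single_probabilistic(rate, rng):
    """One occurrence, exponentially distributed with the given rate."""
    return [rng.expovariate(rate)]


def single_deterministic(delay):
    """One occurrence at a fixed time after system-up."""
    return [delay]


def repeated_probabilistic(rate, horizon, rng):
    """Occurrences with exponentially distributed gaps, up to the horizon."""
    times = []
    t = rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def repeated_deterministic(delay, horizon):
    """Occurrences at delay, 2*delay, 3*delay, ... up to the horizon."""
    times = []
    k = 1
    while k * delay <= horizon:
        times.append(k * delay)
        k += 1
    return times
```

A call such as `repeated_deterministic(5.0, 20.0)` corresponds to the "every 5 hours" example when time is measured in hours.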



Figure 7.34: Solitary Events: Single Exponential, Single Deterministic, Repeated Exponential, Repeated Deterministic

Figure 7.35: Subcomponent Representing a Single Probabilistic Solitary Event (left) and a Repeated Deterministic Solitary Event (right)

7.4.2 Additional Event Parameters

Another issue that was mentioned by some UWG3 users can also be solved easily: users wish, as it is partly offered by commercial FTA packages, to specify additional parameters for events. These comprise repair rate, mission time, coverage factor, and CCF (common cause failure) parameters. Regarding coverage and CCF, including them into the SEFT concept should be discarded, because SEFTs provide explicit ways of expressing the same semantics (such as Conditional Probability Gate and state models to express dependencies between components). However, repair rate and individual mission time by event21 could easily be integrated into the concept of solitary events. The semantics of a repair rate is that the corresponding two-state Markov Chain has a transition back to the working state, which could, for instance, represent a repair activity. Mission time of an event means that the component whose failure the event marks is replaced or maintained after a given time and afterwards is "as new". This is represented by the pattern in Figure 7.36, which specifies the translation of a single exponential event with failure rate λ, repair rate µ, and mission time MT. Additionally requested (and already implemented in the UWG3 tool) are different probability distributions for events, including their specific parameters. Apart from

21 The ESSaRel tool already offers one mission time for each component. The analysis ends when the mission time of the system (top-level component) elapses. If the mission time of a subcomponent elapses earlier, this subcomponent is immediately put back to its new state by triggering its init transition.



Figure 7.36: Basic Event with Repair Rate and Mission Time

the exponential distribution that is currently the only one available in SEFTs, the Weibull distribution, the exponential distribution with phase-wise constant rate, and the constant failed state probability are most frequently demanded. The first is used to model non-constant failure rates, as they appear for some components or for some phases of the life-cycle.22 The second is used when the stress on a component changes with different phases of its lifetime, and the third one is usually used for simplification in the absence of more exact data. The Weibull distribution requires an additional shape parameter; this parameter is 1 in the special case of an exponential distribution. A Weibull distribution could be provided if the Petri Net used as evaluation model is extended to allow for this distribution. This would possibly impair the analytical solution, but for simulation, it is not a serious problem. The tool TimeNET already offers an extension to DSPNs, called eDSPNs (for extended DSPNs), with general transitions. Similar considerations apply for phase-dependent rates of the exponential distribution. Here, however, an alternative (but more complicated) solution is cloning the transition with its source and target state once for each operation phase and using the available state-machine facilities to model a change between the operation phases of the component. For the constant failed state probability, finally, it first has to be discussed why it is used and on what assumptions it is based. Discussions with experts led to different explanations, mainly:

22 For instance, the non-flat parts of the frequently cited "bathtub curve".

180


• the fail-repair cycle has already reached its steady state for the relevant instances of time • the exact distribution is unknown or irrelevant, but empirical data is available that suggests a certain overall probability that the part is working (= component availability) • an instantaneous or overall setting is modelled instead of an evolution over time. For the first case it is, of course, possible to model the situation exactly in SEFTs: to model the failure and repair events and to increase the failure and repair rate so that they are very high in relation to the mission time, while keeping their ratio. This leads to a Markov process that quickly reaches its steady state. Of course, a more pragmatical approach by approximation is often favourable. In the next subsection, an extension to more than one successor state for each event, with a given probability distribution over the successor states, is proposed. If this extension is adopted, the init transition can also have more than one successor. In consequence, it is possible to model a system that goes to one (failed) state immediately after init with a given probability and to another (working) state with the complementary probability. If there are no other transitions out of these states, this leads to a constant failed state probability. These considerations apply to the first and the second case. For the third case, where only a snapshot of the system is modelled, it is suggested not to use SEFTs at all, but to use CFTs with their (much more efficient) combinatorial analysis for given probabilities of independent components.
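The steady-state argument for the first case can be checked numerically. The following is a minimal sketch, assuming a single component modelled as a two-state continuous-time Markov chain that starts in the working state. The function names and the numeric rates are illustrative assumptions and are not taken from UWG3 or ESSaRel. The sketch uses the standard closed-form solution for the failed-state probability, λ/(λ+µ) · (1 - exp(-(λ+µ)t)), and shows that multiplying both rates by a common factor leaves the steady-state value λ/(λ+µ) unchanged while the chain approaches it sooner.

```python
import math


def failed_probability(lam: float, mu: float, t: float) -> float:
    """Probability of being in the failed state at time t.

    Two-state Markov chain (working <-> failed) that starts in 'working',
    with failure rate lam and repair rate mu.
    """
    total = lam + mu
    return lam / total * (1.0 - math.exp(-total * t))


def steady_state_failed_probability(lam: float, mu: float) -> float:
    """Limit of failed_probability for t -> infinity."""
    return lam / (lam + mu)


if __name__ == "__main__":
    lam, mu = 1e-3, 1e-2      # hypothetical rates per hour, ratio 1:10
    mission_time = 100.0      # hypothetical mission time in hours
    target = steady_state_failed_probability(lam, mu)
    for scale in (1, 10, 100, 1000):
        # Scaling both rates keeps their ratio, and thus the steady state.
        p = failed_probability(scale * lam, scale * mu, mission_time)
        print(f"scale {scale:>4}: P(failed at MT) = {p:.6f}, "
              f"steady state = {target:.6f}")
```

With larger scale factors the value at the mission time becomes indistinguishable from the steady-state value, which is the behaviour that the constant failed state probability is meant to approximate.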

7.4.3 Multiple Predecessor and Successor States

There are reasons to think about more than one predecessor or successor state for events. More than one temporal edge from different states to a (failure) event could be a useful shorthand notation in many practical situations, for instance in cases where a component has several operating states, with a transition to the failed state being possible from each with the same rate. Instead of overloading the model with a lot of events that all have the same meaning, one event and a set of edges are sufficient. This would also be a preparation step for the integration of hierarchical states (cf. next subsection), as a transition from a state with several substates is nothing else than a shorthand for many transitions from each of the substates. Semantically, there is no problem with implementing this extension, because it is only a shorthand and can be translated into a standard SEFT.

The extension to more than one successor state for an event is slightly more complicated and has fewer fields of application. One possible application is the frequent case that a component, when it fails, passes into one of several failure modes according to a known probability distribution. Moreover, SEFTs should be able to cover most relevant facets of probabilistic automata. Many versions of probabilistic automata allow a distribution over several successor states instead of one single successor state. To bring this facet of probabilistic automata to SEFTs, it would be favourable to allow a set of successor states for SEFT events as well. These automata also allow an initial sojourn probability distribution over all states instead of one deterministic initial state; this is a special case of several successor states, because it corresponds to several successor states being assigned to the init event in SEFTs. So there are several reasons for allowing several successor states with a defined probability distribution over the successor states for all kinds of SEFT events - including the init event. Fortunately, this can be handled easily by the chosen translation concept, as DSPNs allow the annotation of transitions with weights, which represent the conditional probabilities that a transition is preferred over the others that are concurrently enabled. The translation algorithm could be extended so that it translates an SEFT event possessing several successor states into several transitions with the appropriate weights. As the other SEFT event parameters like rate or deterministic delay have to be dealt with as well, it is sometimes necessary to introduce an artificial transient state from which the created transitions depart. A purely technical problem during implementation has to be expected because, due to rounding errors, the specified probabilities for the choices do not necessarily sum up to 1, which could, in time, lead to a “probability leak”. Relative weights (as used in GSPNs and DSPNs) could be a solution to this issue; they refer to the sum of all weights, so it is not required that their sum is 1. However, a probably more convenient solution for practitioners is that, for exactly one of the successor states of each event, the user annotates the reserved symbol “else” instead of an explicit probability. This means that this alternative gets all the probability that is missing to make the sum equal to 1. Of course, a tool would have to check that the sum of the other probabilities is not greater than 1. An example of an event with multiple predecessor states and an event with multiple successor states is shown in Figure 7.37.

Figure 7.37: Multiple Predecessor States to an Event (left) and Multiple Probabilistically Distributed Successor States (right)
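The two ways of avoiding a “probability leak” discussed above, relative weights and a reserved “else” branch, can be sketched in a few lines. This is only an illustration under assumptions: the function name, the dictionary representation of the successor annotations, and the failure mode names are hypothetical and do not describe the ESSaRel implementation. If one branch carries the symbol "else", it receives the remaining probability mass. If no branch does, the numbers are treated as relative weights and normalised.

```python
from typing import Dict, Union

ELSE = "else"  # reserved symbol for the branch that takes the remaining mass


def resolve_branch_probabilities(
    branches: Dict[str, Union[float, str]]
) -> Dict[str, float]:
    """Map each successor state to a probability so that the sum is 1."""
    else_states = [s for s, p in branches.items() if p == ELSE]
    if len(else_states) > 1:
        raise ValueError("at most one successor may be annotated with 'else'")

    explicit = {s: float(p) for s, p in branches.items() if p != ELSE}
    if any(p < 0.0 for p in explicit.values()):
        raise ValueError("probabilities and weights must not be negative")
    total = sum(explicit.values())

    if else_states:
        # The tool-side check demanded in the text: explicit sum must be <= 1.
        if total > 1.0:
            raise ValueError("explicit probabilities sum to more than 1")
        explicit[else_states[0]] = 1.0 - total
        return explicit

    # No 'else' branch: interpret the numbers as relative weights.
    if total <= 0.0:
        raise ValueError("at least one weight must be positive")
    return {s: p / total for s, p in explicit.items()}


if __name__ == "__main__":
    # Hypothetical failure modes of one component.
    print(resolve_branch_probabilities(
        {"stuck_open": 0.3, "stuck_closed": 0.2, "degraded": ELSE}))
    print(resolve_branch_probabilities({"stuck_open": 1, "stuck_closed": 3}))
```

The resulting values could then serve as weights of the transitions generated for the successor states, since only their relative size matters there.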

Apart from probabilistically chosen successor states, there is a second case where more than one successor state could make sense: conditional or guarded transitions. A guarded choice can be imagined as a (transition) event from which multiple temporal edges lead to a set of potential successor states. Each edge is annotated with a condition (Boolean expression), such that the conditions are mutually exclusive and one of them is always true. The latter can again be assured by a mandatory “else” branch. The Boolean conditions should be restricted to state expressions about visible states, i.e., states that belong to the same component or that are accessible via state ports. Thus, everything that can be expressed in the extended notation can be translated back into the current SEFT notation - but at the cost of using more events and more causal edges expressing the guards. If added, the guarded choice notation would allow for models that are simpler in their structure and more similar to Statecharts. Moreover, if an extension of the SEFT technique to higher semantics (e.g., integer or float data types, cf. Section 7.4.6) is implemented one day, these guarded choices could also depend on conditions like “speed > 100 km/h”, which would take the SEFT notation another step closer to the physical systems being modelled. Last but not least, if the conditions with their given probabilities could be entered in a table, a model with the same expressive power as an Event Tree would be available. As SEFTs allow for several outputs anyway, the rates of different final events could be quantified. This way, SEFTs would subsume another important safety analysis technique and cover the complete “bow-tie” analysis, consisting of hazard and risk analysis.
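The two requirements on guards, mutual exclusiveness and completeness, can be checked mechanically as long as the guards only refer to a small set of Boolean state terms. The following sketch assumes that each guard is given as a Python predicate over a valuation of visible states. The helper names and the state names pump_on and valve_open are hypothetical. It enumerates all valuations and reports overlaps and gaps, and it shows how an “else” branch closes the gaps.

```python
from itertools import product
from typing import Callable, Dict, List

Valuation = Dict[str, bool]
Guard = Callable[[Valuation], bool]


def check_guards(state_names: List[str], guards: Dict[str, Guard]) -> List[str]:
    """Return a list of problems; an empty list means that exactly one
    guard holds for every valuation of the visible states."""
    problems = []
    for values in product([False, True], repeat=len(state_names)):
        valuation = dict(zip(state_names, values))
        enabled = [target for target, g in guards.items() if g(valuation)]
        if len(enabled) > 1:
            problems.append(f"overlapping guards {enabled} for {valuation}")
        elif not enabled:
            problems.append(f"no guard holds for {valuation}")
    return problems


def with_else(guards: Dict[str, Guard], else_target: str) -> Dict[str, Guard]:
    """Add an 'else' branch that holds exactly when no explicit guard does."""
    explicit = list(guards.values())
    extended = dict(guards)
    extended[else_target] = lambda v: not any(g(v) for g in explicit)
    return extended


if __name__ == "__main__":
    names = ["pump_on", "valve_open"]  # hypothetical visible states
    guards = {
        "Running": lambda v: v["pump_on"] and v["valve_open"],
        "Blocked": lambda v: v["pump_on"] and not v["valve_open"],
    }
    print(check_guards(names, guards))                      # reports the gaps
    print(check_guards(names, with_else(guards, "Idle")))   # no problems left
```

The exhaustive enumeration grows exponentially with the number of state terms, so it is only meant for the small guard expressions considered here.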

7.4.4 Hierarchical States

One major advantage of Statecharts and ROOMcharts over traditional finite state-machines is their concept of hierarchical states. States that differ only in some aspects can be subsumed in a common super-state or, vice versa, coarse-grained states can be refined into substates. An example of a set of traffic lights has been given in Figure 2.11 in Subsection 2.5.1. It shows the two kinds of state decomposition that exist in Harel’s Statecharts, AND-Decomposition and (X)OR-Decomposition. In Selic’s ROOMcharts, only the OR-Decomposition exists; the other type of decomposition is expressed by the component decomposition instead. Without these state hierarchy concepts, realistic software-controlled systems can hardly be modelled in a manageable way, and therefore, state hierarchy should be provided by every state-based modelling technique. SEFTs in their current version have no concept of hierarchical states. Throughout the case studies, this appeared to be acceptable, because in the context of safety or reliability analysis, only the global hazard or failure states are of interest, and not all operational states that exist. However, as systems get more complex or have to be modelled on a more detailed level, even the failure models are likely to become too complex without a hierarchical state concept. Moreover, when importing behavioural models from software design into safety analysis as proposed in Section 7.4.7, these design models (e.g., Statecharts, ROOMcharts) are usually hierarchical and it is desirable to translate them into SEFTs without flattening them first. Therefore, hierarchical extensions of the SEFT notation were discussed and experiments were carried out. In particular, [Rog05] prototypically imported hierarchical models in a ROOMcharts-like notation from the CASE tool IBM Rational Rose RT
into the tool ESSaRel and there translated them into SEFTs. Technically, this was achieved by extending the available subcomponent concept of ESSaRel to cover substates as well, and by adding new types of ports (so-called temporal ports), which correspond to the init and exit points in ROOMcharts. Semantically, SEFTs with hierarchical states can be mapped to standard SEFTs by flattening the state hierarchy before starting the translation to DSPNs. A procedure for flattening ROOMcharts has been described in [RM05]. Many details, however, have not been solved so far for hierarchical states, e.g., history states, choice points, or the attachment of triggers and consequences to transitions that cross state boundaries. Theoretically, the successful formalisation of Harel’s hierarchical Statecharts proves that all of these can be solved; however, there is a lot of theoretical work necessary before a solid hierarchy concept can be implemented. The existing formalisation approaches for hierarchical models like Statecharts and ROOMcharts differ from each other in many details. So adding a hierarchy concept to SEFTs implies a decision to be made in favour of one particular modelling technique. Solving these issues was outside the scope of this thesis and, in consequence, the state hierarchy aspect can be considered as a missing feature in the current version of SEFTs. However, the component hierarchy already offers one powerful complexity reduction mechanism and covers the main aspects of the AND-Decomposition in Statecharts.
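The basic idea of flattening an OR-decomposed state hierarchy can be sketched as follows. This is a simplified illustration and not the procedure from [RM05]: it assumes OR-decomposition only, a default initial substate for every composite state, and no history states, choice points, or priorities between inner and outer transitions, which are exactly the open details named above. All state and event names are hypothetical. A transition leaving a composite state is expanded into one transition per leaf substate, and a transition entering a composite state is redirected to its initial leaf.

```python
from typing import Dict, List, Tuple

Transition = Tuple[str, str, str]  # (source state, event, target state)


def leaves(state: str, children: Dict[str, List[str]]) -> List[str]:
    """All leaf substates of a state; a simple state is its own leaf."""
    subs = children.get(state, [])
    if not subs:
        return [state]
    result: List[str] = []
    for sub in subs:
        result.extend(leaves(sub, children))
    return result


def initial_leaf(state: str, initial: Dict[str, str]) -> str:
    """Follow default initial substates down to a leaf."""
    while state in initial:
        state = initial[state]
    return state


def flatten(transitions: List[Transition],
            children: Dict[str, List[str]],
            initial: Dict[str, str]) -> List[Transition]:
    """Expand transitions of a hierarchical model into flat transitions."""
    flat: List[Transition] = []
    for source, event, target in transitions:
        entry = initial_leaf(target, initial)
        for leaf in leaves(source, children):
            flat.append((leaf, event, entry))
    return flat


if __name__ == "__main__":
    children = {"Operating": ["Idle", "Busy"]}   # composite state
    initial = {"Operating": "Idle"}              # default substate
    transitions = [
        ("Idle", "start", "Busy"),
        ("Busy", "done", "Idle"),
        ("Operating", "fail", "Failed"),    # shorthand for both substates
        ("Failed", "repair", "Operating"),  # enters the default substate
    ]
    for t in flatten(transitions, children, initial):
        print(t)
```

The single "fail" transition of the composite state becomes two flat transitions, which is the shorthand reading of transitions from states with substates.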

7.4.5 Spare Pools and Repair Dependencies

In reliability analysis, a frequent assumption is that for critical subfunctions, there is a set of equivalent components (spares) available and that the mission of the system can be fulfilled as long as at least one of them is left to replace an active component when it fails. In some models these spares can be shared among several subsystems. Another frequent assumption is that there is a limited number of repair crews that repair failed components. Examples of a similar kind were examined in [Gra04] and in the aforementioned case studies. It turned out that SEFTs have difficulties in modelling these scenarios and that this is a drawback in comparison to DFTs or to Buchacker’s approach. Traditional models like queues or some of the proposed extensions to Fault Trees, e.g., the Cold/Warm/Hot-Spare-Gate in Dynamic Fault Trees [DBB92], have better means to model spare pools or repair dependencies. In some other models, these dependencies are handled by corrective factors. However, there are means to exploit the SEFT concept to model these situations (e.g., the k-out-of-n voter gate represents one kind of resource pool), and there are also different possibilities to extend SEFTs to better fit these aspects. As DSPNs allow more than one token on a place and are also an appropriate means to model interdependencies, one way to enable spare pool modelling would be to offer DSPNs directly as a modelling tool to users. As DSPNs are necessary for SEFT analysis anyway, they are available as one of the models in the ESSaRel tool. However, as Petri Nets do not offer a general port concept that would allow integration
with the SEFT models, an intuitive way of interfacing would have to be found. This could be achieved by adding port symbols and special edges to the DSPN model, so that the user can define ports and join them to places or transitions of the net; the result would look quite similar to the approach in [KW01]. The ports could be referenced from an external SEFT model when a subcomponent of the DSPN is instantiated. The disadvantage of this approach is the complicated structure of the Petri Nets - which was the reason for inventing SEFTs and not directly modelling in Petri Nets - and the non-standard extension that would be necessary to introduce the port concept. Another possibility to model structures of equal components is a ”replicated component” concept similar to the one offered in the ROOM method [Sel94], where multiple instances of the same component type are denoted by a special component frame, as shown in Figure 7.38. Also, ports and edges should then allow for higher multiplicity. An OR or Voter gate would connect the outputs of the replicated components and express the situation that at least one (or more) of them needs to be operable for the system to continue working. In this way, redundant structures consisting of several subcomponents of the same type could be modelled. A similar concept of replicated components has been added to the Dynamic Fault Tree technique, as the DFT example in [CPS03] shows.

Figure 7.38: Multiple Component Instances in ROOM
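For n replicated components, the condition expressed by such an OR or Voter gate can be quantified with the binomial distribution in the simplest case. The following is a minimal sketch under strong assumptions that are not part of the SEFT concept itself: the components fail independently and all have the same failure probability q at the time of interest. The function name is illustrative.

```python
from math import comb


def prob_fewer_than_k_operable(n: int, k: int, q: float) -> float:
    """Probability that fewer than k of n components are operable.

    Assumes independent components, each failed with probability q.
    k = 1 corresponds to 'at least one must be operable',
    larger k to a k-out-of-n voter.
    """
    p = 1.0 - q  # probability that a single component is operable
    return sum(comb(n, i) * p**i * q**(n - i) for i in range(k))


if __name__ == "__main__":
    q = 0.1  # hypothetical failure probability of one component
    for n, k in [(3, 1), (3, 2), (5, 3)]:
        print(f"{k}-out-of-{n}: system failure probability = "
              f"{prob_fewer_than_k_operable(n, k, q):.6f}")
```

As soon as the replicas share spares or repair crews, the independence assumption no longer holds and a state-based evaluation is needed instead.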

To share a set of spares among several systems, other shorthands that involve state-based behaviour could be modelled. The difficulty is how to display additional behavioural information about how the replicated components are connected, e.g., whether they start operating at the same time or one after the other, and which event is the failure event that commands the operation of the next instance. However, this could be solved in an acceptable way by providing a library of standard patterns that can be parametrised, or by integrating a simple Architecture Description Language (ADL). With a library concept, even more complex topologies, like chains or rings of equal components or redundant bus structures, could be provided as standard solutions. Some examples of Petri Net structures that describe spare pools and repair dependencies can be found in [Buc00]. Last but not least, there is the possibility to add Cold Spare, Warm Spare, and Hot Spare gates, as in DFTs, to SEFTs, because the state-based semantics of SEFTs allows adding any kind of gate that can be described by a state-machine and translated into DSPNs.
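The difference between starting replicated components at the same time and starting them one after the other can be illustrated with textbook formulas for the two extreme cases. This is a sketch under stated assumptions and not a translation of DFT spare gates into SEFTs: all n units are identical with constant failure rate λ, there is no repair, switching to the next unit is perfect and instantaneous, hot spares fail at the full rate while waiting, and cold spares cannot fail while waiting. Under these assumptions the lifetime of the cold-spare arrangement is Erlang distributed. The function names are illustrative.

```python
import math


def hot_spare_unreliability(n: int, lam: float, t: float) -> float:
    """All n units run in parallel from time 0; the system fails
    when every unit has failed."""
    return (1.0 - math.exp(-lam * t)) ** n


def cold_spare_unreliability(n: int, lam: float, t: float) -> float:
    """Units operate one after the other; the system lifetime is the
    sum of n exponential lifetimes (Erlang distribution)."""
    x = lam * t
    survival = math.exp(-x) * sum(x**i / math.factorial(i) for i in range(n))
    return 1.0 - survival


if __name__ == "__main__":
    lam, t = 1e-3, 1000.0  # hypothetical failure rate and mission time
    for n in (1, 2, 3):
        print(f"n={n}: hot = {hot_spare_unreliability(n, lam, t):.4f}, "
              f"cold = {cold_spare_unreliability(n, lam, t):.4f}")
```

A warm spare lies between these two cases, because it fails at a reduced rate while waiting.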

7.4.6 Semantics Extensions and Further Formalisation

The “inverted pendulum” case study showed that it is often necessary to model the physical environment or continuous control systems on a detailed level. Otherwise, it is not possible to explain which failures of the controller provoke which hazards in the environment at which time. To do so, it would be desirable to provide integer or even floating point values. Additionally, modelling of physical phenomena in the environment and the effects of external disturbances on the system requires further elements from hybrid theory, such as differential equations. It further became obvious that, in order to interface directly with the given software design models, structured data types would be desirable. There are certainly possibilities to extend the model step-by-step by more powerful data types, but these could reduce its intuitiveness and blow up the complexity of the analysis. Integer types and structured types, as offered by ROOMcharts, are still easier to integrate than floating point types and elements from hybrid automata. As the underlying model of SEFTs is currently based on a finite state-space, an extension towards hybrid modelling would completely change the foundations of the method. Moreover, when extending the SEFT technique by more data types, there is a danger of merely copying existing models from system design or simulation, which fit that purpose better than SEFTs do. Therefore, adding integer and floating point data types is discouraged. Rather, SEFTs should be left as a simple probabilistic finite state model. It is further suggested to work towards a common simulation framework for SEFT models and hybrid models (e.g., as provided by the Matlab/Simulink tool family, which is popular in many of the relevant industries). As SEFTs and the underlying DSPNs can be solved by simulation, the approach of an integrated simulation environment seems feasible, possibly with a set of restrictions.
The experience with the inverted pendulum also shows that an adequate model of the whole system of hardware and software becomes more necessary as the modeller tries to describe possible failures on a very fine-grained level. In consequence, and independently of integration approaches with other techniques, the analyst should consider whether a coarser representation comprising just general operation states is more appropriate for safety and reliability analysis. Another question that arose when trying to model on a fine-grained level was whether the semantics of SEFTs could be put on a more formal base. SEFTs, as they are now, have a formal semantics, which is given by the translation to DSPNs. However, in comparison to formal methods from software design, such as ROOM, ESTEREL, or the Statemate semantics of Statecharts, some details seem to be unclear and could be formalised more exactly. This further formalisation could, for instance, include a well-defined model of computation and more precise communication mechanisms. In consequence, this could lead to a more detailed and time-accurate model, avoid some of the remaining ambiguities and even overcome some restrictions, such as the prohibition of cyclical causal influences. However, the fact that SEFTs work on a more abstract level and cannot express all details of software execution and communication has not been a severe restriction in most of the case studies. SEFTs in their present version, although considered a semi-formal method, are closer to the real world than many existing safety analysis techniques, and it has not been the purpose of SEFTs to create another formal language. The restriction that causal cycles are forbidden as long as there is no explicit time delay in the loop corresponds to the structure of other popular models, such as Fault Trees or Event Trees. Introducing state-machine elements has already greatly helped to model behavioural aspects that could not be captured before, and the semantics given by the translation to DSPNs provides an adequate degree of formalisation. Further formalisation would possibly compromise the ease of use and blow up the complexity of the analysis, without really adding value for safety analysts. What might be necessary in the future is adapting the import facilities for design models from external CASE tools (see the following subsection) to cover more tools and their specific models, e.g., Statemate, Stateflow or IBM Rational Rose. This will have to be done upon customer demand when the ESSaRel tool is further developed towards industrial applicability. Of course, this may require some more detailed formalisation and entail semantic modifications or extensions to the SEFT model.

7.4.7 Import of Software Design Models

One of the design goals of SEFTs was to provide a better means for integrating software aspects into safety and reliability analysis than has been possible with the existing techniques and, in particular, to allow direct integration of software design models for safety analysis. From the beginning it was foreseeable that importing software design models into safety analyses is not straightforward, because of the different purpose of the models and the partly informal semantics. Indeed, even after the introduction of SEFTs, the level of detail and the appearance are still quite different between software design models and safety analysis models. The Master's thesis [Rog05] describes in detail how models from the CASE tool IBM Rational Rose RT, which uses a notation very close to ROOMcharts, were translated into SEFTs and imported into the tool ESSaRel. There they were used as parts of a more broadly scoped analysis model, which was, in large parts, created manually. As described before, even a rudimentary extension for hierarchical states could be added to the ESSaRel tool, although the current SEFT analysis cannot deal with hierarchical states yet (see Subsection 7.4.4). Also, for structures like timers and typed message channels, corresponding standard patterns in the SEFT model world could be defined and instantiated on demand. The working implementation shows that the goal of a direct model import was reached in principle. However, many differing details remain, and modelling a software program on a low level, i.e., considering all states and events that are necessary to create runtime code for the system, could not be managed in our experiments. Other proposals to derive safety models from low-level software code (e.g., the software fault-tree approach in [LCS91]) suffered from this problem as well. It seems hard or even impossible to find a readable safety or reliability model that, at the same time, respects all details of a software-controlled system. At the present state of research, the suggestion is that not every detail of the design can (and should) be translated into a safety analysis model. The most urgent task is to define a level of abstraction that contains all details that are necessary for safety and reliability considerations, but leaves out unnecessary details. Once the appropriate level of abstraction has been defined and the translation from design to analysis models takes place on this level, the reuse of design models becomes manageable and the resulting models fit into the context of safety analysis. This, in turn, suggests searching for integration strategies that do not merely translate formal design models mechanically, as presented in this thesis, but rather guide the analyst to select important details and abstract from the others. The remaining translation can then be performed automatically. This aspect has not been covered by the research work for this thesis. Nevertheless, after an initial evaluation of the SEFT technique, industrial users confirmed that integrating design models into safety analysis was an interesting opportunity and that the SEFT technique was one step to achieve this goal.
Not only can the safety specialist save time by reusing the models that the designer already created, but it can also be expected that the generated safety models describe the system behaviour more accurately and that the communication between different domain experts becomes easier if components, states, and events have a common name in both modelling domains. Therefore, it seems worthwhile to continue research in the area of design model / safety model integration. As stated before, the IBM Rational Rose RT models that were used to prototypically implement the model integration are not the only ones that are popular in industry. The import of models from other tools, like Statemate or Stateflow, which is part of the Matlab / Simulink tool family, should be tried as well for comparison.²³ Also, integration approaches for different kinds of models within the UML2 modelling language should be investigated.

²³ It is reasonable to speak about tools and not about modelling techniques here, because the integration is a rather technical problem and even the semantics of the used models is largely defined by the tool.

7.4.8 Software Defect Modelling

When importing models from the software design phase as practiced so far, these models only describe the intended, i.e., correct behaviour of the software, neglecting bugs and malicious manipulations. It can be argued that it is sufficient to consider the correct software and concentrate on the hazards that might nevertheless be present (due to unexpected external conditions, even correct software can still cause danger). These arguments are backed up by the existence of other means to ensure the correctness of the software with respect to its state-machine model, such as code generation from state-machines or automatic verification against state-machines. At the present state of software engineering, however, others may argue that deviations of the implemented software from its design model (i.e., bugs) do have to be included in the safety analysis as a major cause of hazards and failures. Approaches to do so have been proposed, either working on the code level (e.g., [LCS91]), or on more abstract models such as state-machines [LM01, McD02]. Code-level techniques consider every code statement or control structure and systematically investigate what could go wrong. In the model-level approaches, the analyst typically starts with a correct state-diagram and manually adds faulty states or transitions where he believes that things can go wrong. The augmented state-diagram is then used in safety and reliability analysis. These approaches work on a similar abstraction level as SEFTs do and are, thus, candidates for integration with the SEFT technique. The central problem consists of finding the faulty states and transitions. Software can fail in a lot of different ways: for instance, a calculation result can be wrong in terms of value, or it can be provided too late, or not at all; it is even possible that other tasks are disturbed by the calculation or that the whole controller system crashes when the calculation is started. It is probably not practical to consider all of these deviations and interferences for every potential program execution step or each state transition in a state diagram.
Finding out the relevant ones is a task where other established safety analysis or software validation techniques can help the analyst, such as adapted versions of FMEA, HAZOP, Data Flow Analysis, or Sneak Circuit Analysis. In addition, constructive techniques during the software development process should guarantee that certain kinds of failures can be excluded by design, e.g., by partitioning the system into independent parts, in particular into safety-critical and non-critical parts. Thus, the set of feasible and relevant deviations is reduced and a failure-enhanced state-based model becomes practicable. Once this extended software behaviour model has been created, it can be integrated smoothly into an SEFT analysis in the same way as the model for the correct software. In summary, the SEFT technique is prepared to work with bug-extended behavioural models, but finding models for deviations and bugs was not within the scope of this thesis. A different method for quantifying software defects or bugs is to build a probabilistic model for the overall number of software failures, without explicitly referring to a behavioural model [Lyu96, MIO87]. This discipline is usually called Software Reliability Engineering or, more specifically, Software Reliability Growth Modelling. The failure frequencies are estimated from empirical failure data or predicted based on measures that refer to properties of the source code, the design models, or other artefacts. In this case, the modelling of the software aspects is rather coarse and does not reflect individual operating states or failure modes. The estimated failure
frequencies for the software can be carried into an FTA by introducing separate basic events for software failures as a special class of system failures. This procedure suffers from several problems: first, without an exact model, it seems even more difficult to foresee in which cases and in which way the software will fail, and, second, it is not always possible to get credible probabilistic data to quantify these failures, as the probabilistic failure growth models rely on assumptions that are not always fulfilled in embedded systems.²⁴ So this procedure has not worked satisfactorily for the safety-critical industries to date. What can be considered is to adapt methods from Software Reliability Engineering to estimate frequencies of specific failure classes that were identified by an SEFT analysis. Research approaches towards this end and the integration of SEFTs and Software Reliability Models have not been considered in this thesis. In brief, whether or not software defects should be considered for safety analysis, there is no current need for extending the SEFT notation.

²⁴ For instance, one of the usual assumptions is instantaneous fault removal, which is hardly possible for software that runs in millions of cars. Also, the reliability growth models extrapolate from the past to the future, which assumes long-term development according to a stable process; this is often not the case in industries with short product life cycles.

7.4.9 Integration with other Software Engineering Approaches

To conclude the discussion about integration of SEFTs into the entire software and system engineering process, a few possibilities for further integrative research are proposed, although these have not been elaborated within the research phase for this dissertation.

• Model checking [CS01] is a verification technique that stems from digital hardware circuit verification and has gained importance in the area of software validation, too. A standard case is that a state-based model of the (intended) behaviour of a component is checked for certain properties that are expressed in some type of temporal logic. An issue that hampers the broad application of model checking in the software industry is that many existing tools are difficult for engineers to operate. The temporal logic expressions have to be entered in a formal notation (e.g., LTL, CTL, pCTL) and the underlying semantics has to be understood. SEFTs could come into play as a convenient graphical user interface for model checkers. The system is described by the state-machine elements of SEFTs and the safety properties to be checked are described by Fault Tree gates, which, in the case of SEFTs, include temporal relations as well. All logical conditions expressed by a Fault Tree are implicitly understood as safety properties in the sense of model checking, i.e., read as if prefixed by “On all execution paths it must never be true that...”. Apart from using SEFTs as a convenient graphical interface for model checkers, model checking can in turn allow reduction of the SEFT by removing unreachable states and event sequences before passing it to the quantitative analyser, or model checkers can be used for qualitative analysis of SEFTs.
• Testing is still one of the most important quality assurance techniques for software. Testing has been successfully combined with state-based models [RM05], where it serves to prove that all states and transitions of the specification, and none but these, are actually feasible in the implemented piece of software. State-based testing could be used in combination with the import of state-based models into SEFTs. While the SEFT analysis shows how the system reacts to spontaneous external failures, provided that it behaves as specified, the tests show that it actually behaves as specified. So both techniques in combination can make up a seamless safety argument. Further, when extended by importance measures, the SEFT analysis can show the most critical sequences of actions and thus help to direct the attention of the tester to the parts of the system with the highest risk and to reduce the risk with minimal testing effort. The other way around, statistical testing and mass testing can provide failure rates, which can be used for software reliability models, which eventually feed the basic event probabilities into SEFTs.
• As FTA (and, consequently, also SEFT analysis) is a technique that is recognised for helping humans understand critical parts and critical event sequences in complex systems, it can also be used in combination with other human-centered techniques such as FMEA (as a preparation step to identify the top-events or as a subsequent step to judge the risk priority of given events) or with inspections of code artefacts or design models (as a pointer to critical aspects that the reviewer should direct his attention to).
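The reading of a Fault Tree condition as a safety property in the model checking item can be illustrated with a very small explicit-state check. The following is a minimal sketch and not the interface of any existing model checker. It assumes a finite transition relation given as a Python dictionary and a predicate that characterises the forbidden condition (standing in for the top-level condition of a Fault Tree). All state names are hypothetical. A breadth-first search returns a shortest path to a forbidden state as a counterexample, or None if the condition is never true on any execution path.

```python
from collections import deque
from typing import Callable, Dict, Hashable, List, Optional

State = Hashable


def find_violation(initial: State,
                   successors: Dict[State, List[State]],
                   is_hazard: Callable[[State], bool]) -> Optional[List[State]]:
    """Return a shortest path from the initial state to a hazardous
    state, or None if no hazardous state is reachable."""
    parent: Dict[State, Optional[State]] = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_hazard(state):
            # Reconstruct the counterexample by following parent links.
            path = []
            current: Optional[State] = state
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt in successors.get(state, []):
            if nxt not in parent:       # visit every state only once
                parent[nxt] = state
                queue.append(nxt)
    return None


if __name__ == "__main__":
    # Hypothetical system: states are (controller, valve) pairs.
    successors = {
        ("ok", "closed"): [("ok", "open"), ("failed", "closed")],
        ("ok", "open"): [("ok", "closed"), ("failed", "open")],
        ("failed", "closed"): [],
        ("failed", "open"): [],
    }
    hazard = lambda s: s == ("failed", "open")
    print(find_violation(("ok", "closed"), successors, hazard))
```

The set of visited states collected during the search is also the set of reachable states, which corresponds to the reduction by unreachable states mentioned in the same item.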

7.4.10 Other Analysis Types from Traditional FTA

As developed so far, SEFTs are a means of modelling safety- or reliability-critical scenarios and calculating failure frequencies or hazard probabilities. Other types of analyses have not been considered in this thesis. However, as SEFTs claim to be a replacement for traditional FTs, it is interesting to investigate which other analysis types from traditional FTs apply to SEFTs as well.

An important qualitative analysis on FTs is the search for minimal cut sets or, more correctly in the general case, prime implicants. These are collections of events or conditions that must apply together in order to cause the top-event of the FT. For states within SEFTs, similar considerations are possible. For events, however, when and in which order they occur plays an important role. So instead of searching for minimal cut sets, it will probably be necessary to search for minimal event sequences or, at least, example sequences that trigger an accident. The procedure for doing so could be adapted from strategies in software testing that aim at finding minimal input sequences that bring the system into desired states.

Another qualitative analysis on FTs is the search for modules in the sense of independent subtrees. As the component concept offered by CFTs and SEFTs replaces the partition into modules by a more appropriate modularisation, finding modules is not a relevant task for SEFTs. However, different approaches to reduce the state space rely on the identification of independent subtrees [MDCS98, Buc00, För06].



Consequently, it may be important to transfer the existing definition of a module to SEFTs in the future and to find algorithms, possibly modifications of the existing efficient algorithms, in order to find these modules. Suggestions for doing so are given in [Buc00, För06].

Important quantitative analyses on FTs are importance analyses, sensitivity analyses, and uncertainty analyses. Importance measures specify how much one individual event (in terms of FTA) influences the top-event probability. Uncertainty analyses estimate the quantitative error in the top-event frequency. Uncertainty analysis is closely related to sensitivity analysis, which investigates how much deviations in the input failure frequencies of basic events influence the top-event result; and sensitivity analysis is, in turn, related to importance analysis.

There are different competing importance measures. A very popular one is the partial derivative of the top-event probability with respect to the probability of occurrence of a specific event [HK92]. Another measure is the difference between the probability of the top-event, given that a specific basic event did occur, and the probability of the top-event, given that the same event did not occur. This is a measure of the top-event probability increase due to the basic event, but it does not consider how likely the event is to occur. Therefore, the Fussell-Vesely importance measure considers the ratio of the probability of the union of all minimal cut sets containing some basic event to the probability of the union of all minimal cut sets.

A transfer of any of these measures to SEFT models is not directly possible for several reasons. Most importantly, they apply to events in terms of FTA events, which correspond to states in SEFTs. For sudden events, there is no appropriate counterpart in traditional FTA.
The problem for events is that they have not one probability, but probability densities over time, so there is no way to calculate a partial derivative with respect to the probability. Moreover, if an event occurs with the same probability, but at an earlier or later time, this can completely change the scenario. Differentiation is not possible in this case, because the effect is a qualitative one and not a quantitative one. Also, measures like the Fussell-Vesely importance, which are based on minimal cut sets, are unsuitable for SEFTs, because an equivalent to the minimal cut set concept has not been found so far. Apart from this, even in traditional FTs, these measures are only correct under certain side conditions, which are not necessarily fulfilled in all cases, and which, in particular, often do not hold for SEFTs. In summary, the issue of importance measures is completely unsolved for SEFTs.

Regarding uncertainty and sensitivity analyses, no research has been carried out, either. It is possible that similar problems arise as for importance measures. Differences in the input parameters, e.g., rates, will certainly have analysable influences on the top-level hazard or accident. However, as a time shift in the occurrence of some events can lead to completely different scenarios, it is not always possible to compute the derivative of the output with respect to changes in the basic causes, which is the principal procedure for estimating uncertainty. However, [OD00] propose an approach to extend sensitivity analysis to Dynamic Fault Trees, of which the dynamic part is defined by state-machines as well. It seems reasonable that this approach can be adapted to SEFTs.
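For traditional FTs, the importance measures discussed in this section can be made concrete with a small calculation. The following Python fragment is only a sketch: the fault tree TOP = (a AND b) OR c and all probabilities are hypothetical illustration values, and the basic events are assumed to be statistically independent. Under that assumption the top-event probability is linear in each individual basic-event probability, so the conditional difference computed here coincides with the partial derivative. The sketch says nothing about SEFT events, for which the text explains that no such measure is available.

```python
from itertools import product

# Hypothetical basic-event probabilities (illustration values only).
P = {"a": 0.01, "b": 0.02, "c": 0.001}

# Minimal cut sets of the illustrative fault tree TOP = (a AND b) OR c.
CUT_SETS = [{"a", "b"}, {"c"}]


def top_occurs(state, cut_sets):
    """The top-event occurs if all events of at least one cut set are true."""
    return any(all(state[e] for e in cs) for cs in cut_sets)


def top_probability(p, cut_sets=CUT_SETS):
    """Exact probability of the union of the cut sets, obtained by
    enumerating all combinations of independent basic events."""
    names = sorted(p)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        if top_occurs(state, cut_sets):
            weight = 1.0
            for n in names:
                weight *= p[n] if state[n] else 1.0 - p[n]
            total += weight
    return total


def conditional_difference(p, event):
    """P(top | event occurred) - P(top | event did not occur)."""
    return top_probability({**p, event: 1.0}) - top_probability({**p, event: 0.0})


def fussell_vesely(p, event):
    """P(union of cut sets containing the event) / P(union of all cut sets)."""
    containing = [cs for cs in CUT_SETS if event in cs]
    return top_probability(p, containing) / top_probability(p)


for name in sorted(P):
    print(name, conditional_difference(P, name), fussell_vesely(P, name))
```

The enumeration is exponential in the number of basic events and is chosen only for transparency; a BDD-based calculation would be used for realistic tree sizes.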



In summary, at the current state of research, it is not possible to guarantee that all analysis types that are applicable to standard Fault Trees are applicable to State/Event Fault Trees as well.25
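The search for minimal event sequences suggested in this section as a counterpart to minimal cut sets can be illustrated by a breadth-first search on a state machine. The sketch below is an assumption-laden illustration, not a method from this thesis: the state machine, its state names, and its event names are hypothetical, "minimal" is interpreted as "smallest number of events", and neither timing nor probabilities are taken into account.

```python
from collections import deque

# Hypothetical state machine: (state, event) -> successor state.
TRANSITIONS = {
    ("idle", "start"): "running",
    ("running", "stop"): "idle",
    ("running", "sensor_fail"): "degraded",
    ("degraded", "repair"): "running",
    ("degraded", "overload"): "hazard",
}


def shortest_event_sequence(initial, target, transitions=TRANSITIONS):
    """Breadth-first search for a shortest event sequence that leads from
    `initial` to `target`; returns None if the target is unreachable."""
    queue = deque([(initial, [])])
    visited = {initial}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        for (source, event), successor in transitions.items():
            if source == state and successor not in visited:
                visited.add(successor)
                queue.append((successor, path + [event]))
    return None


print(shortest_event_sequence("idle", "hazard"))
```

Because breadth-first search explores sequences in order of increasing length, the first sequence found is one of the shortest; it is an example sequence in the sense of the text, not a complete list of all hazardous sequences.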

7.5 Analysis Performance

7.5.1 Analysis Time Evaluation

As SEFT is a state-based technique, it was foreseeable that analysis and, even more so, simulation are very time-consuming in comparison to the combinatorial analysis applied to traditional FTs or CFTs. In order to evaluate actual computation time, load tests were executed. The component concept of SEFTs allows creating large regular structures very quickly. In the load test cases discussed in this section, a generator component with two states is defined as in Figure 7.33 and then instantiated several times. The output ports of all instances are connected by gates (State-AND in the easiest case). It is possible to do this in a flat component, but for large load test cases, this procedure can be cascaded: a component contains, e.g., two generators, connected to an AND gate with two inputs, and from there to an output port. This component is used again as a generator and instantiated twice on the next higher level. The resulting component is again instantiated twice on the next higher level, and so on. This generates a sequence of test cases with 2, 4, 8, 16, ... basic generators and, accordingly, a state-space of 4, 16, 256, 65536, ... states, because the product automaton of n basic two-state automata contains 2^n states.

In practice, only the smallest numbers of this sequence could be applied to SEFTs. For CFTs, however, which were used as a comparison and reference for the correctness of the results, even test cases with 1024 generators (i.e., basic events, in this case) could be analysed on a 3GHz PC in less than one minute. This is due not only to the fact that combinatorial analysis does not build the state space, but also to the efficient compositional BDD algorithm developed for the ESSaRel tool. To get usable results for SEFTs, smaller test cases had to be constructed. They consisted of a two-level hierarchy with 2, 4, 6, 8, ..., 16 generator subcomponents.
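The growth of the cascaded load test cases described above can be reproduced with a few lines of Python. This is merely an illustrative sketch; the function name is hypothetical, and the state count is the theoretical size of the product automaton of n two-state generators, not a number reported by any tool.

```python
def cascaded_load_tests(levels):
    """Return (generators, product_states) for each level of the cascade."""
    cases = []
    generators = 2                      # lowest level: two generators and one AND gate
    for _ in range(levels):
        cases.append((generators, 2 ** generators))
        generators *= 2                 # each level instantiates the previous one twice
    return cases


for generators, states in cascaded_load_tests(4):
    print(generators, states)
```

Doubling the number of generators squares the size of the product state-space, which is why only the first few members of the sequence are within reach of a state-based evaluation.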
The required computation time for both analysis and simulation using the TimeNET tool is presented in Table 7.1. The first column indicates the number of generator subcomponents (or basic events, in the case of CFT analysis). The second column gives the resulting number of states, as indicated by the tool TimeNET; it will be compared to the theoretical expectations below. The third and fourth columns give the time for transient continuous time analysis and for transient continuous time simulation with TimeNET. The fifth column indicates, for comparison, the time required for compositional BDD analysis using the ESSaRel tool. As these times are very short in comparison to the state-based evaluation, a few extra lines are appended that show CFT analysis results for similar test cases of a much bigger size.

Footnote 25: This is a significant drawback when it comes to truncation as a means of analysis performance improvement, as discussed in the next section. Truncation cuts off prime implicants if they have a low probability, a large size, or a low importance. Without being able to find prime implicants and without importance measures, this technique is not applicable in the same way. The same problem has been observed for Buchacker’s approach [Buc00].

Generators   States   Analysis   Simulation   CFT
         2      128   2.0 s      1:18 min     1.125 s
         4      192   2.5 s      3:43 min     1.094 s
         6     3072   *          6:18 min     1.328 s
         8    32786   4.3 s      5:13 min     1.328 s
        10        *   12.7 s     2:02 min     1.766 s
        12        *   101 s      *            1.750 s
        14        *   *          *            1.843 s
        16        *   *          *            1.656 s
       256                                    1.860 s
      1024                                    2.266 s
      4096                                    3.156 s
     16384                                    5.922 s

Table 7.1: Analysis and Simulation Time for Load Test Cases

Empty spaces in the table mean that the respective kind of value is not available in this case, or no attempt has been made to calculate it. Fields with asterisks mean that a value was expected, but could not be obtained (or the obtained result is believed to be incorrect). The most probable reason for not obtaining values is that TimeNET did not manage the model because its state space was too large. The fact that models with more than 12 generators could not be analysed at all (there were error messages like ”segmentation fault”) is surprising, because from literature it is known that models with a much larger state space can be analysed with the chosen algorithm. Interestingly, the model with 6 generators also produced a ”segmentation fault” repeatedly, although larger models were analysable again. Most probably, this is due to internal bugs in the TimeNET tool,26 or it is caused by an incorrect installation or an incompatibility with the operating system used (Mandrake Linux 8.0, which is not the OS that the TimeNET developers use).

Footnote 26: Potentially, there is a relation to the chosen data types (e.g., 16 bit for the number of states) or to the mechanism of memory allocation.

Another interesting observation is the decreasing simulation times for 6 and more generators. This can be explained by the nature of transient time simulation: a system run is performed until the specified end of mission time, then the next run is started, and so on. The whole procedure is repeated until the average of all time curves stabilises within the given tolerance boundaries. As regular structures with many transitions with the same probability distributions might provide an averaging effect on their own, the number of required simulation runs until the specified precision is reached is smaller for larger structures, which over-compensates for the higher effort required for a single run. It is not clear to what extent realistic models profit from this effect.

Very interesting are the values for the size of the state-space, which are given in the second column. These values rely on TimeNET’s own state-space estimation. It is remarkable that they are much bigger than the theoretical numbers, as calculated from the numbers of non-vanishing markings in the DSPN. In the case of 4 generators, for instance, the tool indicates a state-space of 192, but theoretically possible are just 2^4 = 16 states. There are many DSPN places that are caused by the AND gates in the original SEFT model, but as these are always connected to immediate transitions, which fire immediately upon input changes, they do not increase the number of stable states. This observation leads to the assumption that the implemented algorithms do not fit well to the problem at hand, probably because the generated DSPNs possess some special properties that are not common to all kinds of DSPNs, but should be exploited in this case in order to avoid unnecessary state explosion.

Regarding the CFT analysis times, it is obvious that for the small load tests there is no significant dependency between the size of the model and the analysis time. The measured analysis time is entirely determined by overhead times (library initialisation, memory allocation, file access) and the granularity of the ”stopwatch”, a time logger we built into the ESSaRel tool, which uses WindowsXP timer services to record the analysis time. For larger examples (that were far too big to be treated by a state-based analysis), a dependency between model size and analysis time becomes visible. ESSaRel’s CFT analysis performance outperforms not only the state-based analysis (which is natural), but also the performance of UWG3’s predecessor UWG2, which is reported to take several hours for analysing large Fault Trees.
The reason is, apart from the progress in hardware technology since the era of UWG2, a different algorithm that exploits the component structure and produces numerical results only for the selected output ports of each component. The comparison between the same size model in SEFT and CFT clearly shows how much faster combinatorial analysis works in cases where it can be applied. This suggests that a large optimisation potential lies in the combination of both approaches. We have ongoing research activities aimed at a combined state-based and combinatorial analysis, which will be reported in the next section.

When evaluating the results, it has to be considered that the computation times, as well as the general feasibility of analyses and simulations, depend not only on the complexity of the algorithm, but also on the computer environment and the tool implementation.27 Regarding the TimeNET tool, which is the most important part in the analysis of SEFTs, we had no insights into the source code, but we believe that there is potential for optimisation, too. The analysis and simulation results were obtained on an 800MHz Pentium4 PC with 256 MB of memory. The operating system was Mandrake 8.0, which is no longer state of the art; however, the version of TimeNET used had problems with newer OSs. TimeNET version 3.0 was used. The comparison results for CFTs were obtained using ESSaRel 0.5 on a 3GHz Pentium4 PC under WindowsXP. The algorithm used was the compositional BDD-based one, as explained in Chapter 3. The number of sampling points (important for the calculation of the sampling values; the analysis by itself is continuous time) was set to 1000 in all cases; the accuracy parameters for simulation were selected such that 95% of the results can be expected within a relative tolerance range of +/- 10%.

Footnote 27: This became visible for our own tool ESSaRel, where we could obtain savings in time as large as one order of magnitude for the translations just by changing internal data structures for faster search, avoiding nested loops and so on.

Independently of the actual computation time, the fact that currently the analysis / simulation engine is not integrated with the ESSaRel tool, but works under a separate operating system on a separate PC, requiring several file format translations, makes the technique in its current state impractical for industrial use. Developing an integrated solution is, however, just a question of some months of work. This integrated solution, based on TimeNET or not, can then also exploit additional computation time optimisation.

To conclude the section on load tests, the same load tests were used to get some figures about the accuracy of state-based analysis, state-based simulation, and, for comparison, combinatorial analysis using the CFT analysis of ESSaRel. As reference results for the case studies were not available, the load test cases served as a basis for accuracy evaluation. Due to their regular structure, their reference results are easy to calculate. The results obtained by CFT analysis have, at least for the smaller test cases, been compared to the theoretical results, which were obtained by generating the known solution (exponential function) within an Excel table covering 1000 sampling points; the results matched up to at least 8 decimal places, so the CFT results can be used as reference values. The experiments were the same ones as those used for the performance evaluation above, so the chosen accuracy parameters that led to the presented accuracy results also led to the indicated evaluation times.

Generators   CFT (Reference)   Analysis   Rel.Err.   Simulation   Rel.Err.
         2   0.39957640        0.399576   0          0.392205     -1.84%
         4   0.15966130        0.159661   0          0.154120     -3.47%
         6   0.06379688        *          *          0.065880     +3.26%
         8   0.02549173        0.025492   0          0.025250     -0.95%
        10   0.01018589        0.010186   0          0.010200     +0.14%

Table 7.2: Analysis and Simulation Results for Load Test Cases

The values show that in principle, both approaches work acceptably. The accuracy of the analysis is even perfect, up to the precision of the underlying mathematical framework (which is the same as for the reference results). Simulation is less accurate than analysis, but the accuracy is better than the selected accuracy of 10% (for 95% of the values) would intuitively suggest. Different accuracy settings lead to more or less accurate results, but possibly at the cost of higher evaluation time. To what extent the observed accuracy is acceptable is a decision that must be left to domain experts. Often the input to safety and reliability analyses in practice does not meet high accuracy demands, so it is useless to claim a very high accuracy for the analysis output.
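The relative errors in Table 7.2 can be recomputed from the tabulated values, as in the following Python sketch. The closed-form expression in `and_of_generators` is an assumption: the text only states that the reference solution is an exponential function, and the parameter value 1.0 for the product of failure rate and time is inferred from the fact that the reference column agrees numerically with (1 - e^-1)^n; it is not a setting documented in the text.

```python
import math

# Values copied from Table 7.2: generators -> (CFT reference, simulation result).
TABLE_7_2 = {
    2: (0.39957640, 0.392205),
    4: (0.15966130, 0.154120),
    6: (0.06379688, 0.065880),
    8: (0.02549173, 0.025250),
    10: (0.01018589, 0.010200),
}


def relative_error(value, reference):
    """Signed relative deviation of `value` from `reference`."""
    return (value - reference) / reference


def and_of_generators(n, rate_times_time=1.0):
    """Assumed model: probability that n independent components, each
    failed with probability 1 - exp(-rate * time), have all failed."""
    return (1.0 - math.exp(-rate_times_time)) ** n


for n, (reference, simulated) in TABLE_7_2.items():
    print(n, f"{relative_error(simulated, reference):+.2%}",
          f"{and_of_generators(n):.8f}")
```

The recomputed errors agree with the table up to the last printed digit, and all five simulation results stay well inside the configured tolerance of 10%.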

7.5.2 Performance Improvement

To cope with the computation time issue, a number of approaches have been developed during the research project, especially in the related Master theses [Zoc05] and [För06], and the implementation of these approaches is currently ongoing.

First, it should be checked if the currently used analysis and simulation techniques, which were determined by the choice of the evaluation tool TimeNET, are the most efficient ones. The results of the load tests suggest that this is not always the case. In particular, the tool does not consider certain restrictions that apply to the DSPNs that are automatically generated from SEFTs. For instance, all used substructures are (or could be, by different representation) restricted to at most one token on each place; this is a constraint that restricts the possible state-space of the model. Also, the load tests lead to the impression that the state-space that the tool actually considers is much larger than the reachable combinations of non-vanishing states. This will be discussed with the tool developers; possibly, there is potential for optimisation.

Differently from what was observed in the load tests, analysis should be much more efficient in comparison to simulation in many cases. In this context it is interesting that recently, some researchers have published results that claim to extend the analysis of DSPNs even to cases where more than one deterministic transition is enabled at one time [LRT99] and developed tools that support this approach; this approach could avoid the need for simulation in certain cases. A different approach that could allow for analysis instead of simulation consists of the replacement of deterministic events by exponentially distributed events. It would have to be investigated whether or not the error introduced by this modification is acceptable for the final result.
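The effect of replacing a deterministic delay by exponentially distributed phases can be illustrated numerically. A series of k exponential phases, each with mean delay/k, yields an Erlang-k distribution with the same mean as the deterministic delay. The Python sketch below evaluates its cumulative distribution function; the function name and the chosen parameter values are illustrative assumptions, not part of the toolchain described here.

```python
import math


def erlang_cdf(t, k, delay):
    """P(T <= t) for k exponential phases in series with total mean `delay`."""
    rate = k / delay                    # each phase has mean delay / k
    x = rate * t
    term, partial_sum = 1.0, 1.0        # term = x**i / i!, starting at i = 0
    for i in range(1, k):
        term *= x / i
        partial_sum += term
    return 1.0 - math.exp(-x) * partial_sum


# Compare a single exponential phase with longer chains around delay = 1.0.
for k in (1, 5, 50):
    print(k, [round(erlang_cdf(t, k, 1.0), 4) for t in (0.5, 0.9, 1.1, 1.5)])
```

With growing k the values before the delay drop towards 0 and the values after it rise towards 1. The price is that every additional phase has to be represented in the model, so a better approximation of the deterministic delay enlarges the state-space again.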
If a combination of several exponential transitions is used as a replacement for a single deterministic transition, the shape of the cumulative probability distribution further approaches the step function that is valid for a deterministic transition. Moreover, the tool TimeNET offers other analysis techniques that have not been considered or did not work, for technical or theoretical reasons. Examples are discrete time analysis and discrete time simulation, which could be applicable to the given setting as well. Also, there are other tools that provide the same types of analysis and simulation, possibly implemented in a more efficient fashion. Examples are DSPNexpress or SPNP. Tools other than TimeNET and analysis techniques other than the chosen ones should be evaluated for comparison.

Another issue is the usually wide range of rates, delays, and mission times that occur in software-controlled systems. Mission times, and also MTTFs of some hardware components, are often in the order of magnitude of years. Transient states in the software and in electronic parts can last for just a few milliseconds. If discrete time evaluation is chosen, it is difficult to determine an appropriate, but feasible step size. Even in continuous time evaluation techniques there exists the danger that influences of minor importance or steady-state values, which carry no new information, blow up the analysis. Research is necessary in order to find out which effects need to be considered and for which parts of the system steady-state analysis can be used instead of transient-time analysis without significant loss of precision. In cases where the steady-state result of some hazard probability is sufficient, other recently proposed state-space reduction approaches could be applied (e.g., [FB03]).

Still the biggest issue, as impressively demonstrated by the comparison to CFT analysis, is the use of state-based modelling at all. As this thesis explained and other researchers confirm, only state-based techniques cover all of the aspects that are relevant in the context of embedded system safety analysis. However, not all parts of the models really require state-based analysis. Consequently, a very powerful improvement would be the combination of state-based techniques with combinatorial techniques, where applicable. It is easy to see that gates like State-And, State-Or, Not and so on (sometimes summarised as Boolean gates) in SEFTs have the same semantics as And, Or, Not and so on in CFTs or traditional FTs. So it seems promising to represent these gates by Boolean formulas, usually encoded as BDDs, instead of Petri Net structures in order to keep the state-space smaller. The BDD technique can be applied to the evolution of a system over time if a discrete time approach is acceptable: instantaneous probabilities are sampled at (usually equidistant) time intervals and the Boolean formula is evaluated for each of these time intervals. This is much more efficient than a DSPN analysis or simulation, as the measurement results suggest. However, this approach is not possible in all cases where these gates occur.
If the Boolean gate is part of a structure that involves gates or components with state memory, the whole substructure has to be treated by state-based analysis; otherwise, dependency information is lost and the quantitative results become invalid. So only independent subtrees (modules) that pass probability information at their root to a purely Boolean environment can be separated. They must then be treated by state-based analysis, while the Boolean environment is treated by combinatorial analysis. The values are passed from the subtree to the environment as a vector of discrete-time sampling values. A similar combination of various solution techniques has been successfully applied to Dynamic Fault Trees (DFTs) in [GD97]. The combination of static (combinatorial) and dynamic (state-based) techniques in the context of SEFTs is currently under implementation and will be published in the related Master thesis [För06]. The benefit is two-fold: First, the state space is globally smaller as it contains only those substructures that need to be treated dynamically. Second, in many cases these substructures are unconnected to each other (”state-based islands”), which reduces the combinatorial state-space explosion. For instance, ten components with two states each generate a much smaller state-space (10 models times 2 states) than the product state-machine of the same ten components (one model with 2^10 states).

Another contribution to the combination of static and dynamic analysis is described in the Master thesis [Zoc05]. State-based components (in this case, Markov Chains, which are another model supported by the ESSaRel tool) are integrated as subcomponents into CFTs, which are a combinatorial model. The procedure is the same as the one described above (passing of discrete-time probability samples in vectors) and produces correct results, as long as not more than one state of each Markov Chain serves as component output port. If, however, several Markov Chain states are referenced by the CFTs as component outputs, the results obtained by BDD analysis are wrong, because the combinatorial analysis does not recognise the dependencies (Markov Chain states are dependent in a special way, as they are mutually exclusive). As an example for this case, revisit Figure 7.9. The innovative approach developed in this Master thesis consists of using Multi-Valued Decision Diagrams (MDDs)28 instead of BDDs and encoding all outputs of the same subcomponent as alternative node outputs of the same MDD node. Applying this approach, the results are correct, even if several outputs are connected to the CFT. What is applicable to Markov Chains is also applicable to the underlying state-machines of SEFTs. So this approach further extends the application area of combinatorial analysis and avoids state-based analysis in even more cases. As further analysis time reduction, the combinatorial part of the analysis can profit from the compositional algorithm, as we have described in [KZ05].

A different approach towards state-space reduction, which has not yet been implemented, is the reduction of sub-state-machines by removing unreachable or irrelevant states and transitions. States may be irrelevant if they are not connected to any output port and do not influence states or transitions that are connected to any output ports. This way, the component concept, which is unique to SEFTs, helps to hide irrelevant information and to reduce the analysis effort. Also, the state-space can be reduced by uniting symmetric branches of state-machines if their difference is not distinguishable to an outside observer. Reachability analyses or model-checking can help to find states that can be removed from the model because they are unreachable.
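The removal of unreachable states mentioned above amounts to a graph search from the initial state. The following Python sketch shows the idea on a hypothetical sub-state-machine; all state names are invented for illustration, and the sketch covers only unreachable states, not the removal of irrelevant states or the merging of symmetric branches.

```python
def reachable_states(initial, transitions):
    """Return all states reachable from `initial` via (source, target) pairs."""
    reached = {initial}
    stack = [initial]
    while stack:
        state = stack.pop()
        for source, target in transitions:
            if source == state and target not in reached:
                reached.add(target)
                stack.append(target)
    return reached


def prune(initial, states, transitions):
    """Drop unreachable states and every transition that touches one."""
    keep = reachable_states(initial, transitions)
    kept_states = [s for s in states if s in keep]
    kept_transitions = [(s, t) for s, t in transitions if s in keep and t in keep]
    return kept_states, kept_transitions


# Hypothetical example: "test_mode" can never be entered from "ok".
STATES = ["ok", "degraded", "failed", "test_mode"]
TRANSITIONS = [("ok", "degraded"), ("degraded", "failed"), ("test_mode", "ok")]
print(prune("ok", STATES, TRANSITIONS))
```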
These reduction steps should be performed after each flattening step, as after flattening, new reduction possibilities may have appeared.

The component concept further suggests caching of calculated results (where applicable) of subcomponent analyses or translated component models (into DSPNs or into BDDs, respectively) for later reuse. This avoids translation or calculation effort if the same component appears several times in the same project (e.g., for redundancy reasons), or if it is later reused for repeated analyses or in other projects. If a subcomponent has no input ports, it is possible to do the entire quantitative analysis for this subcomponent and reuse the numerical result wherever this subcomponent appears. If there are input ports, then at least the simplified structure (BDD in the case of CFT or stateless part of SEFT, DSPN in the case of SEFT) can be stored and reused.

Most of the reduction approaches discussed so far provide exact results, apart from small errors introduced by discrete time sampling when combining state-based analyses with combinatorial analyses. There are other reduction approaches that are based on simplification of the model or approximations. They make the quantitative result less accurate, but possibly still acceptable, while significantly saving calculation time.

Footnote 28: Sometimes the term Integer Decision Diagrams (IDDs) is used for the same data structure. The important aspect in comparison with BDDs is that each node can have an arbitrary number of children and that there is an arbitrary number of terminal nodes.

One important approximate solution is truncation of the model, i.e., cutting off substructures that have a small relative importance or a low probability. This has been done in a similar way in traditional FTA. Of course, this requires preceding quantitative analyses to find out the irrelevant parts, which is also time-consuming. Moreover, appropriate importance measures have not been found yet for SEFTs, and thus there is a lot of research work remaining to be done in order to exploit this approach. However, [Buc00] has suggested truncation approaches for the combination of Fault Trees and Petri Nets used there, and it seems reasonable to believe that these approaches can be transferred to the very similar Petri Net approach proposed for SEFT analysis.
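The combination of state-based islands with a purely Boolean environment, as described in this section, can be sketched as follows. Each island delivers a vector of probabilities sampled at common time points, and the Boolean gates are evaluated pointwise. In this Python sketch the island is replaced by the closed-form solution of a two-state component without repair, which is an assumption made to keep the example self-contained; in the setting of this thesis the vectors would come from a state-based evaluation. All function names and rates are hypothetical.

```python
import math


def failed_probability(rate, times):
    """Sample vector of a two-state island: probability of the 'failed'
    state at each time point, assuming exponential failure and no repair."""
    return [1.0 - math.exp(-rate * t) for t in times]


def and_gate(*inputs):
    """Pointwise AND of statistically independent inputs."""
    return [math.prod(values) for values in zip(*inputs)]


def or_gate(*inputs):
    """Pointwise OR of statistically independent inputs."""
    return [1.0 - math.prod(1.0 - v for v in values) for values in zip(*inputs)]


times = [0.1 * i for i in range(11)]          # common sampling points
island_a = failed_probability(0.5, times)
island_b = failed_probability(0.2, times)
island_c = failed_probability(0.01, times)

top = or_gate(and_gate(island_a, island_b), island_c)
print([round(p, 6) for p in top])
```

The pointwise formulas are valid only because the three islands are independent of each other. Two outputs that refer to different states of the same Markov Chain are mutually exclusive, so they must not be combined in this way; this is the case for which the MDD encoding was introduced.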



Chapter 8 Conclusion

This doctoral thesis has proposed the new techniques Component Fault Trees (CFTs) and State/Event Fault Trees (SEFTs) as safety and reliability analysis techniques especially for software-controlled systems. As the initial survey on current modelling and analysis techniques has revealed, the existing techniques have significant drawbacks when applied to this kind of system. A set of requirements was suggested: modelling techniques should be compositional, sufficiently expressive for dynamic and state-dependent behaviour, intuitive, and suitable for integration with software engineering models. The proposed techniques are based on the popular technique Fault Tree Analysis and on state-based modelling techniques.

CFTs address the issue of compositionality by extending the Fault Tree technique by hierarchical components that are connected by ports. They enable better structuring of large projects, model reuse, and division of labour. The BDD algorithm for quantitative FTA has been adapted to exploit the component concept, and the components constitute suitable frames for BDD reduction by variable reordering and therefore also help to reduce analysis time.

The main contribution of this thesis is the introduction of State/Event Fault Trees (SEFTs). These unite FT modelling elements with Statecharts modelling elements and visibly distinguish states from events. The component concept from CFTs is extended by typed ports (state ports and event ports) to enforce consistent combination of component models. Besides states, events, and FT gates, SEFTs provide two types of edges: temporal edges, which denote the predecessor-successor relation between states and events, and causal edges, which denote trigger or guard relations as in traditional FTs. Like the ports, the FT gates are also typed by states and events and, in consequence, there may be several variants of the same gate.
The gate set has been extended by gates including memory (e.g., History-AND, Duration gate), which can be modelled consistently because of the state-based semantics of SEFTs.

SEFTs allow a probabilistic analysis of hazard states or failure events. Being a state-based model, SEFTs cannot be analysed by the traditional combinatorial approaches such as the BDD approach. Instead, they are translated to Deterministic and Stochastic Petri Nets (DSPN) as an intermediate model, because DSPNs are a model that offers all the expressive power required, as well as a rich body of analysis possibilities. In the DSPN, the flattening of the component hierarchy is performed with intermittent simplification steps. The resulting flat DSPN can be exported to any available analysis or simulation tool for obtaining the quantitative solution. An algorithm has been presented that performs the translation of all graph elements of an SEFT to a DSPN. For the SEFT gates, a dictionary has been presented that provides a DSPN subnet as counterpart for each gate. The flattening algorithm has also been given and the subsequent steps to export the model to the external evaluation tool have been sketched. With several proofs and test cases, the correctness of the technique with respect to the formal or verbal specification and its confluence and consistency with existing analysis techniques have been demonstrated.

The new technique has been evaluated in several case studies, partly taken from literature or provided by industrial partners. It turned out that the new SEFT technique is acceptable in industry because of its intuitive graphical representation, and that it allows modelling most relevant aspects in safety and reliability analysis. It is claimed and agreed by industrial partners that the SEFT technique is better suited to model many aspects of embedded systems than traditional Fault Trees, Markov Chains, or other known techniques. In particular, industrial users confirm that the state/event distinction brings a lot more clarity to FTA. In a related Master thesis it has been shown by a prototype implementation that software design models such as ROOMcharts can be imported automatically and translated into SEFTs, which enables a direct link of the safety analysis to the system design process.
A comparison with other recently developed techniques, in particular Dynamic Fault Trees, shows that the techniques lead to the same results but the models look different; the different notations all have their advantages and inconveniences, which mainly depend on the actual case and on the user's expectations. However, a few details turned out to be missing in SEFTs. The integration of software design models achieved so far is quite rudimentary, and analysis or simulation time is also an issue. The inconveniences have been listed, together with improvement proposals that can serve as suggestions for other researchers. As a proof of concept, the tools UWG3 and ESSaRel have been designed and implemented to a prototype level. The tools have been written in the C# language to run under Microsoft's .NET framework and offer a convenient Windows GUI. The ESSaRel tool has been designed as an integration platform for different analysis techniques. Besides SEFTs, it currently also supports CFTs, Markov Chains, and ROOMcharts, and offers different analyses of and translations between these types of models. ESSaRel has been used to carry out the case studies described in this thesis. It produced correct results in all cases; however, it also became apparent that analysis time could be an issue in industry-size projects. ESSaRel will be further developed in cooperation with industry partners after this doctoral research has finished. Both tools are available for download at www.essarel.de. Over 70 registered customers from industry and academia have downloaded the tools for evaluation so far. UWG3 has since been used in many industrial projects, mainly by Siemens. ESSaRel is currently in the beta test phase at Siemens Transportation Systems.


In conclusion, the main goals of the doctoral research project have been met, as SEFTs constitute a new analysis that is compositional, that can express the relevant phenomena of embedded systems, that is intuitive to practitioners, and that is prepared to integrate with other design and analysis models.


Appendix A The Gate Dictionary

This appendix lists all gates that are currently implemented in the SEFT framework. The semantics of each gate is explained in Section 4.6. The first table lists all gates with their graphical representation, their (potentially variable) number of inputs and their parameters. The second table gives the DSPN counterparts that are inserted for the gate during the translation algorithm described in Chapter 5, so this is the actual "dictionary" the algorithm refers to. Technically, the SEFT to DSPN gate dictionary is implemented as a separate class named SEFT2DSPNDictionary, which refers to further helper classes ("DSPN Builders"). The builder classes copy the elements as specified in the table into a new DSPN model. They are able to adapt to varying numbers of inputs, and they can refer to the original SEFT gate to read out its parameters and to transfer them to the DSPN counterpart as appropriate. The translation of a specific gate can be requested by calling the public Translate() operation provided by the Dictionary class. For details of the translation procedure and the usage of the dictionary within it, see Appendix B.

Name                  Ports                             Parameters
AND_State(n)          Out:S, In:1+Sc                    none
AND_Event_State(n)    Out:E, In:1E,1+Sc                 none
OR_State(n)           Out:S, In:1+Sc                    none
OR_Event(n)           Out:E, In:1+Ec                    none
NOT                   Out:S, In:1S                      none
Inhibit               Out:E, In:1E,1+Sc                 none
XOR(n)                Out:S, In:2+Sc                    none
EQUAL(n)              Out:S, In:2+Sc                    none
Voter(k,n)            Out:S, In:1+Sc                    Voting Inp. k
History_AND(n)        Out:E, In:2+Ec                    none
History_AND(n,t)      Out:E, In:2+Ec                    Reset Time t
History_AND_R(n)      Out:E, In:1E "Reset",2+Ec         none
Priority_AND(n)       Out:E, In:2+E                     none
Priority_AND_R(n)     Out:E, In:1E "Reset",2+E          none
Priority_AND(n,t)     Out:E, In:2+E                     Reset Time t
Prob_Delay(t)         Out:E, In:1E                      Avg. Delay t
Prob_Delay_R(t)       Out:E, In:1E "Reset",1E           Avg. Delay t
Det_Delay(t)          Out:E, In:1E                      Delay t
Det_Delay_R(t)        Out:E, In:1E "Reset",1E           Delay t
Duration(t)           Out:E, In:1S                      Duration t
Cond(p)               Out:E, In:1E                      Cond. Prob. p
UPON                  Out:S, In:1E                      none
UNTIL                 Out:S, In:1E                      none
Flip-Flop             Out:S, In:1E "Set",1E "Reset"     none
ENTER                 Out:E, In:1S                      none
LEAVE                 Out:E, In:1S                      none

[Column "Graphical Representation": gate symbols, drawings not reproduced]

Table A.2: List of all SEFT Gates with Ports, Parameters and Graphical Representation

Some notes regarding Table A.2:
• Only the number of input ports is given, as the number of output ports is always 1; 2+ means two or more.
• S for State and E for Event specify the type of the inputs.
• A small letter c means that the inputs of this group are commutative, i.e., that their order does not matter; otherwise, the position of the input does matter.
• If a name is given for an input (e.g., "Reset"), then this input has a special meaning (see the gate description) and must be visibly distinguishable to the user.
• Only functional parameters are listed; an additional parameter is always the total number of inputs, designated n, which is of relevance if the number of inputs is variable.
• In the ESSaRel tool, all parameters are typed: probabilities are limited to the range [0,1]; time parameters consist of a floating point number and a time unit; rate parameters are floating point numbers with the inverse of a time unit (e.g., per year); parameters regarding numbers of affected inputs, as for the voter gate, are positive integers that may be subject to additional restrictions.



Example: 'Out:S, In:1 E "Reset", 1+ Sc' as an entry for the ports means that the output is of type state, there is one event input named "Reset" and one or more state inputs that are commutative.
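The port notation of Table A.2 is regular enough to be read mechanically. The following Python sketch illustrates one possible way to parse such an entry; it is not part of ESSaRel (which is written in C#), and the names `InputGroup`, `PortSpec` and `parse_port_spec` are hypothetical. It assumes straight double quotes around input names and exactly one output port, as stated in the notes above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InputGroup:
    count: int            # number of inputs in this group (a minimum if or_more is set)
    or_more: bool         # True if the count carried a "+" ("that many or more")
    kind: str             # "S" for state, "E" for event
    commutative: bool     # True if the group carried the suffix "c"
    name: Optional[str]   # e.g. "Reset" for an input with a special meaning


@dataclass
class PortSpec:
    output_kind: str      # type of the single output port, "S" or "E"
    inputs: List[InputGroup]


# One input group, e.g.  1E "Reset"  or  1+ Sc  (hypothetical grammar, derived from Table A.2)
_GROUP = re.compile(r'\s*(\d+)(\+?)\s*([SE])(c?)\s*(?:"([^"]*)")?\s*$')
_OUT = re.compile(r'\s*Out:\s*([SE])\s*,\s*$')


def parse_port_spec(spec: str) -> PortSpec:
    """Parse a 'Ports' entry such as 'Out:S, In:1 E "Reset", 1+ Sc'."""
    out_part, sep, in_part = spec.partition("In:")
    out_match = _OUT.match(out_part)
    if not sep or out_match is None:
        raise ValueError(f"malformed port entry: {spec!r}")
    groups = []
    for chunk in in_part.split(","):
        m = _GROUP.match(chunk)
        if m is None:
            raise ValueError(f"malformed input group: {chunk!r}")
        count, plus, kind, comm, name = m.groups()
        groups.append(InputGroup(int(count), plus == "+", kind, comm == "c", name))
    return PortSpec(out_match.group(1), groups)
```

Applied to the example entry, the sketch yields an output of type S, one named event input "Reset", and one group of one or more commutative state inputs.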

Name / Related DSPN

[The table pairs each gate name with a drawing of its related DSPN; the drawings are not reproduced. Gates in table order: AND_State(n), AND_Event_State(n), OR_State(n), OR_Event(n), NOT, Inhibit, XOR(n), EQUAL(n), Voter(k,n), History_AND(n), History_AND(n,t), History_AND_R(n), Priority_AND(n), Priority_AND(n,t), Priority_AND_R(n), Prob_Delay(t), Prob_Delay_R(t), Det_Delay(t), Det_Delay_R(t), Duration(t), Cond(p), UPON, UNTIL, Flip-Flop, ENTER, LEAVE]

Table A.4: Translation DSPN Counterparts for all SEFT Gates ("Gate Dictionary")

Appendix B The Translation Algorithm as Pseudo-Code

This appendix explains the translation algorithm as implemented in the prototype tool ESSaRel. The implementation has been validated and performs the translation procedure as described in Chapter 5. First, the class diagram by which the translation and flattening algorithms are integrated into the tool and the relevant data structures are described; then all preconditions and the entry function Translate() of the main translation (of the whole hierarchy of SEFT models into one flat DSPN) are given. This is followed by the different sub-functions and calls to other translation classes; the translation of a single SEFT to a DSPN (without considering the hierarchy) and the flattening of a Component DSPN are handled by separate translation classes that are invoked by a call to their respective main functions Translate().

Class Diagram As ESSaRel is a platform to integrate different modelling techniques, to analyse models in different ways and to translate models into other models, a class hierarchy of these different computations has been created to allow similar interfaces for similar tasks. Analyses and translations inherit from the same base class that prescribes common data structures for information exchange between different (possibly external) parts of ESSaRel. This includes references to the source (and possibly target) models, tables for parameters and output tables for numerical results and error entries. Some of them are briefly described in the sequel. Analyses and translations can be attached to models. The mandatory main function is necessary to attach the computation to the menu entry in the GUI. Analyses and translations can call other analyses or translations as helpers; these obey the same data exchange mechanisms. Flattening is considered as a kind of translation, so the main function is called Translate() as well. The dictionary that performs the translation of SEFT gates to DSPN substructures is implemented in a


Figure B.1: Class Diagram of Different Computations within ESSaRel


separate class that is associated with the SEFTsingle2DSPN translation class. It also has a Translate() operation as entry, but with a slightly different signature. In the pseudo-code, not all operation arguments are listed as they are in the actual implementation; in particular, passed lists for additional arguments and errors have been omitted. It is important to know that in ESSaRel each component can have several models, even of the same type. For example, component C1 has only an SEFT model in the beginning. After the translation it also has a DSPN model that refers to subcomponents, and after flattening it has another DSPN model with all related subcomponents copied into it. The latter is then used for analysis.

Data Structures

Component DSPN is a DSPN model augmented with a Crossreference List.

Crossreference List is a list that keeps records consisting of the ID of each translated DSPN element and the ID of the corresponding original SEFT element, including additional information like public visibility (import or export) and type (state or event).

Containment Tree is a tree structure that reflects the component nesting structure, as described in Section 3.6.2.

Request List is a table passed to the entry operation Translate() of all translation classes where the caller can specify additional parameters such as the IDs of models to use.

Result List is a table passed to the entry operation Translate() of all translation classes where the operation can report the results, i.e., the IDs of successfully generated models.

Error List is a table passed to the entry operation Translate() of all translation classes where the operation can report errors that might occur during translation. Its entries consist of an error message to display on the GUI and a list of the related graph elements that can then be marked in a special colour on the editor in order to help the analyst to locate the problem.¹

¹ Error handling is not shown in the pseudo-code.

Preconditions

1. The component model to be analysed and all (recursively) nested subcomponent models are available at analysis time and are valid SEFT models
2. The component nesting hierarchy is free from cycles (a component must not refer directly or indirectly to itself as a subcomponent)
3. All causal component or gate input ports are connected on the higher hierarchy level
4. All causal paths are free from untimed cycles, i.e., cycles not containing a delay gate (this must be checked across component borders and hierarchy levels)
5. Exactly one event is marked as the init event and has one successor state and no predecessor state
6. For each event e: e has either exactly one predecessor and one successor state or is the init event, which has a successor state but no predecessor state.
7. For each triggered event e: there must be exactly one causal edge c (called the triggering edge) so that c.target = e and c.source is
   • an event or
   • an event type input port or
   • an event type output port of some subcomponent or gate
8. For each event e: if e is target of a causal edge c and c is not the triggering edge then c.source is
   • a state type input port or
   • a state type output port of some subcomponent or gate
9. For each deterministic event e: the time parameter t is greater than 0
10. The interface (footprint) of each subcomponent matches the actual component model for that subcomponent

SEFThierarchy2DSPN.Translate(SEFT source, DSPN target, ...)

Public translation method according to the ESSaRel interface definition for model translations. Initialisation and overall control function to be called for the translation of an SEFT hierarchy (system with all subcomponents) to a flat DSPN; it recursively calls the translation of the system and all subcomponents and then the flattening. Takes as arguments the source model (type SEFT) for the translation (i.e., the SEFT model of the system) and as target an (empty) DSPN into which the translation result is put. Further arguments such as Request List, Result List and Error List are not discussed here in detail.

Load all referenced subcomponents of the source model (SEFT component) recursively
Check preconditions, throw exception if violated
Build Containment Tree of the main component and all of its recursive subcomponents
For each model currentSource in Containment Tree in bottom-up order // Translate all involved SEFT component classes one by one
   • Create a new model currentTarget of type DSPN
   • Invoke SEFTsingle2DSPN.Translate() with the DSPN currentTarget as the target and the SEFT currentSource as the source
   • Add currentTarget to the component to which currentSource belongs
   • Keep the IDs of the result models in a list, because they will be needed later
For each component environment which has subcomponents
   For each translated subcomponent DSPN subnet (as found in Containment Tree in bottom-up order)
      • Invoke DSPNFlattening.FlattenSubcomponent() with environment as target, the referenced subcomponent DSPN subnet as source, and the ID of the subcomponent reference (proxy) // this flattens the subcomponent hierarchy of one component and one of its immediate subcomponents by putting the elements of the referenced component model into the environment model
Put the flattened top-level DSPN into the DSPN model passed as target, attach it to the source component and report its ID in the result list
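The bottom-up traversal of the Containment Tree and the cycle check of precondition 2 can be combined in a single depth-first walk. The following Python sketch is only an illustration under stated assumptions: the hierarchy is given as a plain dictionary from a component name to the names of its subcomponents, and the function name `bottom_up_order` is hypothetical and does not appear in ESSaRel.

```python
from typing import Dict, List


def bottom_up_order(subcomponents: Dict[str, List[str]], root: str) -> List[str]:
    """Return every component reachable from root exactly once, each
    subcomponent before any component that uses it (post-order).
    Raises ValueError if the nesting hierarchy contains a cycle."""
    order: List[str] = []
    state: Dict[str, str] = {}  # "open" while on the DFS stack, "done" afterwards

    def visit(name: str) -> None:
        if state.get(name) == "done":
            return  # component class already scheduled for translation
        if state.get(name) == "open":
            raise ValueError(f"component hierarchy contains a cycle through {name!r}")
        state[name] = "open"
        for child in subcomponents.get(name, []):
            visit(child)
        state[name] = "done"
        order.append(name)

    visit(root)
    return order
```

A component class that is referenced from several places appears only once in the result, in line with translating the involved component classes one by one.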

SEFTsingle2DSPN.Translate(SEFT source, DSPN target, ...)

Transforms one single SEFT component into a Component DSPN (without including the subcomponents). Takes as arguments an empty DSPN as target and an SEFT as source.

Create a new crossrefList and attach it to target

// Translate states into places...
For each state s in SEFT source
   • Create a new place p and add it to target
   • Copy the position of s to the position of p
   • Put s.id (the SEFT state ID) and p.id (the ID of the DSPN place) in target.crossrefList

// Translate the different types of events into transitions...
For each deterministic SEFT event e in SEFT source
   • Create a new deterministic DSPN transition t and add it to target
   • Copy the time parameter of e to the time parameter of t // assert that the time parameter is not zero
   • Copy the position of e to the position of t
   • Put e.id (the SEFT event ID) and t.id (the ID of the DSPN transition) in target.crossrefList
For each exponential SEFT event e in SEFT source
   • Create a new exponential DSPN transition t and add it to target
   • Copy the inverse of the rate parameter of e to the average time parameter of t
   • Copy the position of e to the position of t
   • Put e.id (the SEFT event ID) and t.id (the ID of the DSPN transition) in target.crossrefList
For each triggered SEFT event e in SEFT source
   • Create a new immediate DSPN transition t and add it to target // The triggering mechanism will be translated later, along with the causal edge
   • Put e.id (the SEFT event ID) and t.id (the ID of the DSPN transition) in target.crossrefList
For the init event in SEFT source
   • Find its successor state in the SEFT, look up its corresponding DSPN place in crossrefList and mark that place with one token // Note that no transition is created, but the initial state is marked!

// Translate the different types of ports that belong to the SEFT model (not yet the ports of gates and subcomponents)...
For each event input port Pin in SEFT source
   • Create a new immediate DSPN transition t and add it to target
   • Copy the position of Pin to the position of t
   • Put the SEFT element ID and its DSPN counterpart in target.crossrefList and mark it as export
For each event output port Pout in SEFT source
   • Create a new immediate DSPN transition t and add it to target
   • Copy the position of Pout to the position of t
   • Put the SEFT element ID and its DSPN counterpart in target.crossrefList and mark it as export
For each state input port Pin in SEFT source
   • Create a new DSPN place p and add it to target
   • Copy the position of Pin to the position of p
   • Put the SEFT element ID and its DSPN counterpart in target.crossrefList and mark it as export
For each state output port Pout in SEFT source
   • Create a new DSPN place p and add it to target
   • Copy the position of Pout to the position of p
   • Put the SEFT element ID and its DSPN counterpart in target.crossrefList and mark it as export

// Translate gates and subcomponents...
For each gate g in SEFT source
   • Have the corresponding pattern for gate g created by the Dictionary and add it to target // Subfunction SEFT2DSPNDictionary.Translate() will be given below; the gate, the target environment and the insert position must be passed to the function as arguments
For each instance si of some subcomponent sc in SEFT source
   • Insert footprint (i.e., DSPN counterparts of ports) of sc into target at the right position // Subfunction insertFootprint() will be given below

// Translate temporal edges (i.e., edges joining events with their predecessor / successor states)
For each temporal edge e in SEFT source, except the edge from the init event
   • Create a new DSPN arc a and add it to target
   • Find e.source, look up its DSPN counterpart in crossrefList and set a.source to that element
   • Find e.target, look up its DSPN counterpart in crossrefList and set a.target to that element

// Translate causal edges (i.e., edges joining a cause - trigger event or guard state - with its effect). Respect trigger semantics!
For each causal edge ce // Source and target of ce can be looked up in the Crossreference List
   • If ce.source is event or event port and ce.target is event // This means a triggering relation
      – Instantiate Trigger Pattern and connect its source and target to the two transitions corresponding to source and target of ce // this is performed by a helper function
   • Else if ce.source is state or state port and ce.target is event // This means a guarding relation
      – Create a pair of anti-parallel DSPN arcs (guard pattern) and set their respective sources and targets to the place and the transition corresponding to source and target of ce
   • Else // This means that both source and target are ports; assert that both are of same type!
      – Merge the two net elements corresponding to source and target of ce // Subfunction Merge() will be given below

// Purge Crossreference List, so that only the imported or exported elements (the interface) remain
For each element e in Crossreference List
   • If e is non-public (neither imported nor exported)
      – Delete e from Crossreference List
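The first part of this procedure (states to places, events to transitions, marking of the initial place) can be sketched compactly. The Python fragment below is an illustration only: the data classes `Seft`, `SeftEvent` and `Dspn`, the ID scheme `p_...`/`t_...` and the function name are assumptions made for this sketch; positions, ports, gates and edges are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SeftEvent:
    id: str
    kind: str                      # "deterministic", "exponential" or "triggered"
    time: Optional[float] = None   # delay of a deterministic event
    rate: Optional[float] = None   # rate of an exponential event


@dataclass
class Seft:
    states: List[str]
    events: List[SeftEvent]
    init_successor: str            # state entered by the init event


@dataclass
class Dspn:
    places: Dict[str, int] = field(default_factory=dict)        # place ID -> initial tokens
    transitions: Dict[str, dict] = field(default_factory=dict)  # transition ID -> attributes
    crossref: Dict[str, str] = field(default_factory=dict)      # SEFT ID -> DSPN ID


def translate_states_and_events(source: Seft) -> Dspn:
    target = Dspn()
    for s in source.states:                       # states become places
        target.places[f"p_{s}"] = 0
        target.crossref[s] = f"p_{s}"
    for e in source.events:                       # events become transitions
        tid = f"t_{e.id}"
        if e.kind == "deterministic":
            assert e.time is not None and e.time > 0   # precondition 9
            target.transitions[tid] = {"type": "deterministic", "delay": e.time}
        elif e.kind == "exponential":
            assert e.rate is not None and e.rate > 0
            # the average time parameter is the inverse of the rate
            target.transitions[tid] = {"type": "exponential", "mean_delay": 1.0 / e.rate}
        elif e.kind == "triggered":
            target.transitions[tid] = {"type": "immediate"}
        else:
            raise ValueError(f"unknown event kind: {e.kind!r}")
        target.crossref[e.id] = tid
    # The init event yields no transition; its successor place gets one token.
    target.places[target.crossref[source.init_successor]] = 1
    return target
```

The cross-reference dictionary plays the role of the Crossreference List here, without the visibility and type attributes of the actual data structure.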

SEFT2DSPNDictionary.Translate(Gate gate, DSPN target, Position insertPosition)

Public member method of the gate dictionary class. Creates an instance of a DSPN pattern corresponding to an SEFT gate and puts it into the target DSPN model at the given position. Takes as arguments the gate (a complex type that includes information about all relevant parameters and the ID of the original SEFT gate), the target DSPN, and the insertion position where the inserted pattern should go.

Create new GateBuilder according to gate type // Gate builders are classes with a common interface that belong to each gate type and that generate the corresponding DSPN structure, including flexible input building blocks that allow for variable numbers of inputs
Call GateBuilder.Build() to obtain a new DSPN that is the translation of the gate in its current configuration // This includes the adaptation of the number of inputs and of the parameters, which are known to the GateBuilder by examining the parameters of the passed gate object
Insert the created DSPN into target // The function InsertSubnet() will be given below. It includes the creation of white space for the substructure and adaptation of the position
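The dispatch from gate type to builder can be pictured as a lookup table of builder functions. The following Python sketch is hypothetical throughout: the names `Gate`, `Subnet`, `BUILDERS` and `translate` are invented for illustration, and the n-input AND_State structure is an assumed generalisation of the two-input net discussed in Appendix C (one reset transition per input, one set transition), not a structure confirmed by the dictionary table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Gate:
    kind: str                      # gate type name, e.g. "AND_State"
    n_inputs: int                  # total number of inputs n
    params: Dict[str, float] = field(default_factory=dict)


@dataclass
class Subnet:
    places: List[str]
    transitions: List[str]


def build_and_state(gate: Gate) -> Subnet:
    # Assumed pattern: one input place and one reset transition per input,
    # plus an "off" place, an output place and a single set transition.
    inputs = [f"in{i}" for i in range(1, gate.n_inputs + 1)]
    resets = [f"reset{i}" for i in range(1, gate.n_inputs + 1)]
    return Subnet(places=inputs + ["off", "out"], transitions=resets + ["set"])


# Hypothetical registry: gate type name -> builder function
BUILDERS: Dict[str, Callable[[Gate], Subnet]] = {
    "AND_State": build_and_state,
}


def translate(gate: Gate) -> Subnet:
    """Select the builder for the gate type and let it build the subnet."""
    try:
        builder = BUILDERS[gate.kind]
    except KeyError:
        raise ValueError(f"no builder registered for gate type {gate.kind!r}")
    return builder(gate)
```

For two inputs, the sketch produces four places and three transitions, which corresponds in size to the AND structure with P1 to P4 and T1 to T3.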

SEFTsingle2DSPN.insertFootprint(SEFT source, DSPN target, Position insertPosition)

Puts places and transitions into a target DSPN that mark the ports of a (possibly unknown) subnet corresponding to a subcomponent in the original SEFT. Takes as arguments a DSPN target, an SEFT subcomponent sc (sometimes called proxy), the position at which the substructure shall be inserted and the ID prefix.

Create new DSPN temp
   • For each event input or output port
      – Create a new immediate DSPN transition and add it to temp
      – Mark this transition as import
      – Put the SEFT element ID (including subcomponent ID) and its DSPN counterpart in target.crossrefList
   • For each state input or output port
      – Create a new DSPN place and add it to temp
      – Mark this place as import
      – Put the SEFT element ID (including subcomponent ID) and its DSPN counterpart in target.crossrefList
   • Position all port elements on an evenly distributed basis on the top or bottom line (outputs on top, inputs on bottom)
Insert subnet temp into DSPN target at the specified position, including the creation of white space // The function InsertSubnet() will be given below



SEFTsingle2DSPN.InsertSubnet(DSPN source, DSPN target, Position insertPosition, ID idPrefix, bool CreateWhitespace)

Public operation that fits some (DSPN) substructure source into the given environment target, including the creation of white space for the new elements (on demand), copying to the right graphical position and prefixing of the ID so that name conflicts are avoided. Additional structures such as the Crossreference List are updated as well.

If CreateWhitespace
   • Create new bounding box around all graphic elements in source
   // Cut and spread environment net to create space for inserted elements at insertPosition:
   • For each graph element e in target
      – if e.x ≥ insertPosition.x, increment it by boundingbox.width
      – if e.y ≥ insertPosition.y, increment it by boundingbox.height

// Clone structure as a whole, prefix all of its elements with ID prefix, insert into target at desired position
Create a new clone of all elements in source // This is performed by the method CloneStructure() which is part of the ESSaRel Framework
For each element eNew in the cloned structure
   • eNew.position = e.position - boundingbox.upperLeft + insertPosition // Transform position of eNew using the given insertion position; e is the original element of which eNew is the clone
Add cloned structure to target
Clone Crossreference List of source
Replace in the cloned Crossreference List the original DSPN element IDs with the IDs of the cloned DSPN elements
If idPrefix is not empty
   • Prefix all SEFT ID entries in the cloned crossrefList with idPrefix
Append cloned Crossreference List to target.crossrefList
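The geometric part of InsertSubnet() (spreading the environment and shifting the clone) amounts to a few coordinate operations. The Python sketch below illustrates it under simplifying assumptions: elements are reduced to points, so the bounding box is computed from positions alone, element names are assumed to be unique already, and the function name `insert_positions` is hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def insert_positions(target: Dict[str, Point],
                     source: Dict[str, Point],
                     insert_at: Point) -> Dict[str, Point]:
    """Return the positions of all elements after inserting `source`
    into `target` at `insert_at`, with white space created for it."""
    xs = [x for x, _ in source.values()]
    ys = [y for _, y in source.values()]
    left, top = min(xs), min(ys)                     # boundingbox.upperLeft
    width, height = max(xs) - left, max(ys) - top    # bounding box size

    ix, iy = insert_at
    result: Dict[str, Point] = {}
    # Cut and spread the environment net at the insertion position.
    for name, (x, y) in target.items():
        if x >= ix:
            x += width
        if y >= iy:
            y += height
        result[name] = (x, y)
    # Shift each cloned element: position - upperLeft + insertPosition.
    for name, (x, y) in source.items():
        result[name] = (x - left + ix, y - top + iy)
    return result
```

Elements of the environment that lie to the left of and above the insertion position keep their coordinates, while the others move away by the size of the bounding box.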


DSPNFlattening.FlattenSubcomponent(DSPN source, DSPN target, ID proxyId, Position insertPosition)

Puts a subcomponent DSPN source into its environment DSPN target for a given proxy. Takes as arguments the environment DSPN, the DSPN source to put in and the proxy (subcomponent reference) as which the structure should be inserted. Assumes that a footprint is already present. Takes care of positioning, index prefixing and maintenance of the interface list. Also exists in an overloaded variant that directly takes one subcomponent/referenced component entry from the Containment Tree.

Determine insertPosition by looking up one of the ports in the interface list and determining its position
Insert subnet source into DSPN target at position insertPosition with ID prefix proxyId // see description of SEFTsingle2DSPN.InsertSubnet
// Connect inserted subnet by resolving all port entries in the interface list, as the corresponding elements currently exist twice (once from the footprint, once from the subnet just inserted)
For each pair of entries with the same SEFT ID in the Crossreference List // these are found and listed by a helper function
   • Find corresponding DSPN nodes n1 and n2
   • Merge n1 and n2 to get one single node nResult // see description of DSPNFlattening.Merge()
   • Delete both entries for n1 and n2 from target's Crossreference List // the new node no longer belongs to the external interface of the flattened component; however, the same node may be merged several times and thus there may be other lines in the Crossreference List that belong to the same DSPN node; these are not deleted!

DSPNFlattening.Merge(DSPN net, Node n1, Node n2)

Merges elements in the sense that it replaces two elements by one while preserving all arc connections and all additional attributes. Used by the translation and the flattening procedure. Called with an environment DSPN "net", a source node "n1" and a target node "n2" as arguments.

Assert that arguments n1 and n2 are valid nodes of the net, are of the same type (place or transition) and are distinct from each other
Create a list arcsFrom for all arcs that have n1 or n2 as source and fill this list by finding all applying arcs
Create a list arcsTo for all arcs that have n1 or n2 as target and fill this list by finding all applying arcs
Create a new node nResult of the more specific type of n1 or n2 // more specific means that a deterministic or exponential transition wins over an immediate transition; other constellations must not occur
Add nResult to net
Remove n1 and n2 from net
For each arc a in arcsFrom
   • Set a.source = nResult
For each arc a in arcsTo
   • Set a.target = nResult
Merge attributes of both original nodes to the new node // this is performed by a helper function. A new position will be chosen appropriately. Other attributes must not be contradictory, or it must be irrelevant which one wins. Assertions are implemented that make sure that no conflicts arise, because they would mean a program error (e.g., a place with initial marking unequal to zero must not be merged with a place with a different initial marking unequal to zero, no two transitions with contradicting delay parameters must be merged, etc.)
Adapt Crossreference List entries of net // all entries referencing one of the former nodes are updated to the ID of the new node; import / export attributes remain the same.
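The core of the merge operation (type selection, arc redirection, cross-reference update) can be sketched as follows. This Python fragment is an illustration under simplifying assumptions: the net is reduced to node types, arcs as ID pairs and a dictionary in place of the Crossreference List; attribute merging and positions are omitted; the names `Net` and `merge` and the explicit `result_id` argument are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Net:
    nodes: Dict[str, str]               # node ID -> "place", "immediate",
                                        # "deterministic" or "exponential"
    arcs: List[Tuple[str, str]]         # (source ID, target ID)
    crossref: Dict[str, str] = field(default_factory=dict)  # SEFT ID -> DSPN ID


def merge(net: Net, n1: str, n2: str, result_id: str) -> str:
    """Replace nodes n1 and n2 by one node while preserving all arcs."""
    assert n1 != n2 and n1 in net.nodes and n2 in net.nodes
    t1, t2 = net.nodes[n1], net.nodes[n2]
    assert (t1 == "place") == (t2 == "place"), "cannot merge a place with a transition"
    if t1 == "place":
        merged = "place"
    else:
        # A timed transition wins over an immediate one; two different
        # timed types are a constellation that must not occur.
        assert t1 == t2 or "immediate" in (t1, t2)
        merged = t2 if t1 == "immediate" else t1
    del net.nodes[n1], net.nodes[n2]
    net.nodes[result_id] = merged
    # Redirect every arc that started or ended at one of the old nodes.
    net.arcs = [(result_id if s in (n1, n2) else s,
                 result_id if t in (n1, n2) else t) for s, t in net.arcs]
    # Update cross-reference entries that pointed to one of the old nodes.
    for seft_id, dspn_id in net.crossref.items():
        if dspn_id in (n1, n2):
            net.crossref[seft_id] = result_id
    return result_id
```

In the usage below, an immediate port transition is merged with a deterministic transition, so the resulting node is deterministic and inherits the arcs of both.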

Appendix C Proofs and Validity Arguments

This appendix presents proofs for the Boolean gates State-AND, State-OR and State-NOT and validity arguments for History-AND and Priority-AND. First it is mathematically proven that the Boolean SEFT gates, as defined by the DSPN structures they are translated to, match the Boolean operators they claim to represent. The proof is only carried out for the basic variant with two inputs (one for the NOT gate). For variants with more than two inputs and for other Boolean gates like Exclusive OR, Voter and so on, the proofs are not presented. They can be constructed by similar patterns; alternatively, each of these gates can be replaced by a structure of AND, OR and NOT, and the equivalence to the proven gates can be shown by exhaustive back-to-back tests. For example, AND(e1, e2, e3) with three inputs can be replaced by AND(e1, AND(e2, e3)), XOR(e1, e2) can be replaced by AND(OR(e1, e2), NOT(AND(e1, e2))) and so on. This enables a proof by induction for any number of inputs.

The general proof pattern reads: "In every stable state of the net and for every possible combination of input markings, the output is marked iff (input 1 is marked) $\circ$ (input 2 is marked)", where $\circ$ represents the Boolean operator AND, OR, or NOT (of course, in the case of NOT there is only one input). An input or output place being marked by one token thereby corresponds to the logical value "true" at this input or output, an unmarked place to "false". To help formalise the proofs, a few predicates on the net and on individual transitions and places are defined. These definitions will be used throughout this appendix.

• $\mathit{Stable}(N)$ means that the Petri Net $N$ is in a stable state (also called tangible marking), i.e., no immediate transition is enabled, so that the present state persists for a non-zero time interval.
• $\mathit{Enabled}(T_i)$ means that the transition $T_i$ is ready to fire, i.e., there is the necessary number of tokens on each of its input places.
• $\mathit{Marked}(P_i)$ means that there is at least one token on place $P_i$. In the following nets, there will be at most one token on each place.
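The exhaustive back-to-back test mentioned above for the replacement structures can be illustrated on the Boolean level. The following Python sketch compares the two replacements named in the text against reference operators for all input combinations; it checks only the Boolean formulas, not the DSPN structures, and the function names are chosen for this illustration.

```python
from itertools import product


def AND(a: bool, b: bool) -> bool:
    return a and b


def OR(a: bool, b: bool) -> bool:
    return a or b


def NOT(a: bool) -> bool:
    return not a


def xor_from_basic(e1: bool, e2: bool) -> bool:
    # XOR(e1, e2) replaced by AND(OR(e1, e2), NOT(AND(e1, e2)))
    return AND(OR(e1, e2), NOT(AND(e1, e2)))


def and3_from_basic(e1: bool, e2: bool, e3: bool) -> bool:
    # AND(e1, e2, e3) replaced by AND(e1, AND(e2, e3))
    return AND(e1, AND(e2, e3))


def back_to_back(candidate, reference, arity: int) -> bool:
    """True iff both functions agree on every combination of inputs."""
    return all(candidate(*v) == reference(*v)
               for v in product([False, True], repeat=arity))
```

With `back_to_back(xor_from_basic, lambda a, b: a != b, 2)` and the corresponding call for the three-input AND, both replacements can be confirmed over their complete truth tables.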


The definition of the stability of the net, which will be important in the following proofs, can be formally rephrased as $\mathit{Stable}(N) = \nexists\, T_i \mid \mathit{Enabled}(T_i)$. The following proofs only consider stable states (also called tangible markings). Vanishing markings (markings where at least one immediate transition is enabled) will lead to a sequence of transition firings that ends up in a stable state, if it ends at all. Consequently, it has to be proven in addition that there are no infinite sequences of vanishing markings, given that the input places are not changed externally. This latter proof has not been carried out, because the present net fragments are so simple that it can be verified manually that a stable state is reached with just one transition firing in any situation. So it is sufficient to show that the Boolean formulas hold for every stable state and every possible combination of input valuations (there are just four in the case of a two-input gate).

State-AND The proof for the AND gate with two inputs refers to the place and transition IDs given in Figure C.1. In the initial state, there is one token on place P3 . The claim to be proven is "If the net is in a stable state then the output place P4 is marked iff input place P1 and input place P2 are marked." Four cases (the four combinations of input marking) will be considered separately.

Figure C.1: The DSPN Structure for the AND Gate from the Dictionary, with Visible IDs

First, the specific predicates for the AND gate are formalised:

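\[
\begin{aligned}
\mathit{Enabled}(T_1) &= \mathit{Marked}(P_4) \wedge \neg\mathit{Marked}(P_1) && \text{(C.1)}\\
\mathit{Enabled}(T_2) &= \mathit{Marked}(P_4) \wedge \neg\mathit{Marked}(P_2) && \text{(C.2)}\\
\mathit{Enabled}(T_3) &= \mathit{Marked}(P_1) \wedge \mathit{Marked}(P_2) \wedge \mathit{Marked}(P_3) && \text{(C.3)}\\
\mathit{Stable}(\mathit{AND}) &= \neg\mathit{Enabled}(T_1) \wedge \neg\mathit{Enabled}(T_2) \wedge \neg\mathit{Enabled}(T_3) && \text{(C.4)}
\end{aligned}
\]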

The latter predicate can be rephrased as:

\[
\begin{aligned}
\mathit{Stable}(\mathit{AND}) ={}& \neg(\mathit{Marked}(P_4) \wedge \neg\mathit{Marked}(P_1)) \wedge{}\\
& \neg(\mathit{Marked}(P_4) \wedge \neg\mathit{Marked}(P_2)) \wedge{}\\
& \neg(\mathit{Marked}(P_1) \wedge \mathit{Marked}(P_2) \wedge \mathit{Marked}(P_3))\\
={}& (\mathit{Marked}(P_1) \vee \neg\mathit{Marked}(P_4)) \wedge{}\\
& (\mathit{Marked}(P_2) \vee \neg\mathit{Marked}(P_4)) \wedge{}\\
& (\neg\mathit{Marked}(P_1) \vee \neg\mathit{Marked}(P_2) \vee \neg\mathit{Marked}(P_3))
\end{aligned}
\]

For the proof, an additional lemma will be needed, which states an invariant on the given net: the sum of tokens on place P3 and place P4 is always exactly one; in other words, at each time, either P3 or P4 is marked, but not both. There are formal proofs for invariants of this kind, which make use of the transition matrix of the Petri Net. In the present case, the lemma can be validated by hand, as there are only three transitions:

• Initially, there is one token on P3 and no token on P4 , so the lemma holds. • T1 adds one token to P3 and removes one token from P4 , so the sum is unchanged. • T2 adds one token to P3 and removes one token from P4 , so the sum is unchanged. • T3 adds one token to P4 and removes one token from P3 , so the sum is unchanged.

Consequently, the sum of tokens is always one and we can state the invariantly true propositions (Lemmas 1 and 2):

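\[
\begin{aligned}
&\mathit{Marked}(P_3) \vee \mathit{Marked}(P_4) && \text{(C.5)}\\
&\neg(\mathit{Marked}(P_3) \wedge \mathit{Marked}(P_4)) && \text{(C.6)}
\end{aligned}
\]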

Now the proof can be carried out for the four cases that have to be distinguished.



Case I: Input 1 = false and Input 2 = false

Assumptions: $\neg\mathit{Marked}(P_1) \wedge \neg\mathit{Marked}(P_2) \wedge \mathit{Stable}(\mathit{AND})$

Proof Obligation: $\neg\mathit{Marked}(P_4)$

$\mathit{Enabled}(T_1)$ must not be true for the net to be stable. As the right condition in (C.1) is true by assumption ($\neg\mathit{Marked}(P_1)$), the left condition, $\mathit{Marked}(P_4)$, must not be true, and therefore $\neg\mathit{Marked}(P_4)$ must be true. An alternative argumentation works with $T_2$ and (C.2).

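Case II: Input1 = true and Input2 = false

Assumptions: $\mathit{Marked}(P_1) \land \lnot\mathit{Marked}(P_2) \land \mathit{Stable}(\mathit{AND})$
Proof Obligation: $\lnot\mathit{Marked}(P_4)$

Starting with the assumption, the following implications are valid:

\begin{align*}
&\mathit{Marked}(P_1) \land \lnot\mathit{Marked}(P_2) \land \mathit{Stable}(\mathit{AND})\\
\Rightarrow\; &\mathit{Marked}(P_1) \land \lnot\mathit{Marked}(P_2)\\
&\land \underbrace{(\mathit{Marked}(P_1) \lor \lnot\mathit{Marked}(P_4))}_{\text{true because } \mathit{Marked}(P_1)}\\
&\land \underbrace{(\mathit{Marked}(P_2) \lor \lnot\mathit{Marked}(P_4))}_{\lnot\mathit{Marked}(P_4)\text{, because } \mathit{Marked}(P_2) \text{ is false}}\\
&\land \underbrace{(\lnot\mathit{Marked}(P_1) \lor \lnot\mathit{Marked}(P_2) \lor \lnot\mathit{Marked}(P_3))}_{\text{true because } \lnot\mathit{Marked}(P_2)}\\
\Rightarrow\; &\lnot\mathit{Marked}(P_4)
\end{align*}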

Case III: Input1 = false and Input2 = true

The proof is the same as for Case II, except that the indices 1 and 2 have to be exchanged.

Case IV: Input1 = true and Input2 = true

Assumptions: $\mathit{Marked}(P_1) \land \mathit{Marked}(P_2) \land \mathit{Stable}(\mathit{AND})$
Proof Obligation: $\mathit{Marked}(P_4)$

From (C.1), (C.2), (C.3), and (C.4) it follows that $\lnot\mathit{Marked}(P_3)$: stability requires that T3 is not enabled, and since P1 and P2 are marked by assumption, P3 cannot be marked. Lemma 1 states that either P3 or P4 must be marked at any time. Consequently, P4 is marked.

Note that the implications given in Cases I to IV only work in the forward direction; the single cases cannot be reversed. However, the claim was that P4 is marked if and only if P1 and P2 are marked. This has been proven as well, because all possible combinations of input markings have been explicitly considered and there was only one of them (Case IV) where P4 was marked. Altogether, it has been proven that the Petri Net substructure actually reflects the Boolean AND semantics.
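The four-case argument can be replayed as a brute-force truth table. The Python sketch below encodes the enabling predicates (C.1) to (C.3), checks that the rephrased stability predicate agrees with (C.4) on all sixteen markings, and collects the value of P4 in every stable marking that satisfies Lemmas 1 and 2. Markings are treated as Booleans here, which is an assumption of the sketch (at most one token per place); the function names are illustrative.

```python
from itertools import product

def stable_and(p1, p2, p3, p4):
    """Stable(AND) as the conjunction of negated enabling predicates (C.4)."""
    enabled_t1 = p4 and not p1
    enabled_t2 = p4 and not p2
    enabled_t3 = p1 and p2 and p3
    return not enabled_t1 and not enabled_t2 and not enabled_t3

def stable_and_rephrased(p1, p2, p3, p4):
    """The same predicate after applying De Morgan's laws."""
    return (p1 or not p4) and (p2 or not p4) and (not p1 or not p2 or not p3)

def and_gate_outputs():
    """Map each input pair to the set of P4 values seen in stable markings."""
    table = {}
    for p1, p2, p3, p4 in product((False, True), repeat=4):
        # Both formulations must agree on every marking.
        assert stable_and(p1, p2, p3, p4) == stable_and_rephrased(p1, p2, p3, p4)
        # Lemmas 1 and 2: exactly one of P3 and P4 is marked.
        if p3 != p4 and stable_and(p1, p2, p3, p4):
            table.setdefault((p1, p2), set()).add(p4)
    return table
```

Each input pair should map to a single output value, namely the conjunction of the two inputs, which is the "if and only if" claim established by the case distinction.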

State-OR

The proof for the OR gate with two inputs refers to the place and transition IDs given in Figure C.2. In the initial state, there is one token on place P3. The claim to be proven is "If the net is in a stable state then the output place P4 is marked iff input place P1 or input place P2 is marked." Of course, this includes the case that both input places are marked. Again, four cases (the four combinations of input markings) will be considered separately. Lemmas 1 and 2 from the AND gate discussion apply in a similar way as above and will be used throughout the proof. Again, the specific predicates for the OR gate are formalised:



Figure C.2: The DSPN Structure for the OR Gate from the Dictionary, with Visible IDs

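\begin{align}
\mathit{Enabled}(T_1) &= \mathit{Marked}(P_3) \land \mathit{Marked}(P_1) \tag{C.7}\\
\mathit{Enabled}(T_2) &= \mathit{Marked}(P_3) \land \mathit{Marked}(P_2) \tag{C.8}\\
\mathit{Enabled}(T_3) &= \mathit{Marked}(P_4) \land \lnot\mathit{Marked}(P_1) \land \lnot\mathit{Marked}(P_2) \tag{C.9}\\
\mathit{Stable}(\mathit{OR}) &= \lnot\mathit{Enabled}(T_1) \land \lnot\mathit{Enabled}(T_2) \land \lnot\mathit{Enabled}(T_3) \tag{C.10}
\end{align}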

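The latter predicate can be rephrased as:

\begin{align*}
\mathit{Stable}(\mathit{OR}) &= \lnot(\mathit{Marked}(P_3) \land \mathit{Marked}(P_1))\\
&\quad\land \lnot(\mathit{Marked}(P_3) \land \mathit{Marked}(P_2))\\
&\quad\land \lnot(\mathit{Marked}(P_4) \land \lnot\mathit{Marked}(P_1) \land \lnot\mathit{Marked}(P_2))\\
&= (\lnot\mathit{Marked}(P_3) \lor \lnot\mathit{Marked}(P_1))\\
&\quad\land (\lnot\mathit{Marked}(P_3) \lor \lnot\mathit{Marked}(P_2))\\
&\quad\land (\mathit{Marked}(P_1) \lor \mathit{Marked}(P_2) \lor \lnot\mathit{Marked}(P_4))
\end{align*}

In the following, the proof is carried out for the four cases that have to be distinguished.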

Case I: Input1 = false and Input2 = false

Assumptions: $\lnot\mathit{Marked}(P_1) \land \lnot\mathit{Marked}(P_2) \land \mathit{Stable}(\mathit{OR})$
Proof Obligation: $\lnot\mathit{Marked}(P_4)$

From the assumption that the net is stable and the last clause of the rephrased stability predicate, it is known that $\mathit{Marked}(P_1) \lor \mathit{Marked}(P_2) \lor \lnot\mathit{Marked}(P_4)$. From the assumption it is further known that $\lnot\mathit{Marked}(P_1) \land \lnot\mathit{Marked}(P_2)$, so the only way for this clause to be true is that $\mathit{Marked}(P_4)$ is false, which had to be proven.

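Case II: Input1 = true and Input2 = false

Assumptions: $\mathit{Marked}(P_1) \land \lnot\mathit{Marked}(P_2) \land \mathit{Stable}(\mathit{OR})$
Proof Obligation: $\mathit{Marked}(P_4)$

Starting with the assumption, the following implications are valid:

\begin{align*}
&\mathit{Marked}(P_1) \land \lnot\mathit{Marked}(P_2) \land \mathit{Stable}(\mathit{OR})\\
\Rightarrow\; &\mathit{Marked}(P_1) \land \lnot\mathit{Marked}(P_2)\\
&\land \underbrace{(\lnot\mathit{Marked}(P_3) \lor \lnot\mathit{Marked}(P_1))}_{\lnot\mathit{Marked}(P_3)\text{, because } \lnot\mathit{Marked}(P_1) \text{ is false}}\\
&\land \underbrace{(\lnot\mathit{Marked}(P_3) \lor \lnot\mathit{Marked}(P_2))}_{\text{true because } \lnot\mathit{Marked}(P_2)}\\
&\land \underbrace{(\mathit{Marked}(P_1) \lor \mathit{Marked}(P_2) \lor \lnot\mathit{Marked}(P_4))}_{\text{true because } \mathit{Marked}(P_1)}\\
\Rightarrow\; &\lnot\mathit{Marked}(P_3)\\
\Rightarrow\; &\mathit{Marked}(P_4)
\end{align*}

The last implication is true because of Lemma 1: at least one of P3 and P4 is marked.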


Case III: Input1 = false and Input2 = true

The proof is the same as for Case II, except that the indices 1 and 2 have to be exchanged.

Case IV: Input1 = true and Input2 = true

Assumptions: $\mathit{Marked}(P_1) \land \mathit{Marked}(P_2) \land \mathit{Stable}(\mathit{OR})$
Proof Obligation: $\mathit{Marked}(P_4)$

One of the terms in the definition of stability is $\lnot\mathit{Marked}(P_3) \lor \lnot\mathit{Marked}(P_1)$. As the second part is not true by assumption, the first part must be true, i.e. $\lnot\mathit{Marked}(P_3)$ (the same works with P2 instead of P1). From Lemma 1 it follows that if $\mathit{Marked}(P_3)$ is false, then $\mathit{Marked}(P_4)$ must be true, because one of the two places must be marked.

Again, by examining all four possible combinations of inputs, it has been proven that the output is true if and only if input 1 or input 2 is true.
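As a complement to the static argument, the token game of the OR net can be played until no transition is enabled. The Python sketch below does this for fixed inputs. It assumes, by analogy with the AND gate and consistent with (C.7) to (C.9), that T1 and T2 move the single token from P3 to P4 and that T3 moves it back; this arc structure is not spelled out in the text above. The function name `settle_or` is illustrative.

```python
def settle_or(p1, p2, p3=True, p4=False):
    """Fire enabled transitions of the OR-gate net until it is stable.

    Assumption: T1/T2 move the token P3 -> P4, T3 moves it P4 -> P3;
    the input places P1 and P2 are only tested, never consumed.
    Returns the marking of the output place P4 in the stable state.
    """
    while True:
        if p3 and (p1 or p2):             # T1 (C.7) or T2 (C.8) enabled
            p3, p4 = False, True
        elif p4 and not p1 and not p2:    # T3 (C.9) enabled
            p3, p4 = True, False
        else:                             # Stable(OR) holds, cf. (C.10)
            return p4

def or_gate_matches():
    """Check all inputs and both token positions against Boolean OR."""
    return all(
        settle_or(a, b, token_on_p3, not token_on_p3) == (a or b)
        for a in (False, True)
        for b in (False, True)
        for token_on_p3 in (False, True)
    )
```

Starting the game from both possible token positions covers the situation after an input change as well as the initial marking.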

State-NOT

The proof for the NOT gate refers to the place and transition IDs given in Figure C.3. Note that the output is place P2! In the initial state, there is one token on place P2. The claim to be proven is "If the net is in a stable state then the output place P2 is marked iff input place P1 is not marked." This time, only two cases are possible, because there is just one input that can either be marked or not. Lemmas 1 and 2 from the AND gate discussion apply, but this time with respect to P2 and P3, hence $\mathit{Marked}(P_2) = \lnot\mathit{Marked}(P_3)$. Again, the specific predicates for the NOT gate are formalised:

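\begin{align}
\mathit{Enabled}(T_1) &= \mathit{Marked}(P_2) \land \mathit{Marked}(P_1) \tag{C.11}\\
\mathit{Enabled}(T_2) &= \mathit{Marked}(P_3) \land \lnot\mathit{Marked}(P_1) \tag{C.12}\\
\mathit{Stable}(\mathit{NOT}) &= \lnot\mathit{Enabled}(T_1) \land \lnot\mathit{Enabled}(T_2) \tag{C.13}
\end{align}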

Figure C.3: The DSPN Structure for the NOT Gate from the Dictionary, with Visible IDs

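The latter predicate can be rephrased as:

\begin{align*}
\mathit{Stable}(\mathit{NOT}) &= \lnot(\mathit{Marked}(P_2) \land \mathit{Marked}(P_1)) \land \lnot(\mathit{Marked}(P_3) \land \lnot\mathit{Marked}(P_1))\\
&= (\lnot\mathit{Marked}(P_2) \lor \lnot\mathit{Marked}(P_1)) \land (\lnot\mathit{Marked}(P_3) \lor \mathit{Marked}(P_1))
\end{align*}

In the following, the proof is carried out for the two cases that have to be distinguished.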

Case I: Input = false

Assumptions: $\lnot\mathit{Marked}(P_1) \land \mathit{Stable}(\mathit{NOT})$
Proof Obligation: $\mathit{Marked}(P_2)$

By assumption the net is stable, and this implies $\lnot\mathit{Marked}(P_3) \lor \mathit{Marked}(P_1)$. As the second term is not true by assumption, the first one must be true to achieve stability, i.e. $\lnot\mathit{Marked}(P_3)$ is true. From Lemma 1 (applied to P2 and P3) it follows that if P3 is not marked, then P2 must be marked.

Case II: Input = true

Assumptions: $\mathit{Marked}(P_1) \land \mathit{Stable}(\mathit{NOT})$
Proof Obligation: $\lnot\mathit{Marked}(P_2)$

By assumption the net is stable, and this implies $\lnot\mathit{Marked}(P_2) \lor \lnot\mathit{Marked}(P_1)$. As the second term is not true by assumption, the first one must be true to achieve stability, i.e. $\lnot\mathit{Marked}(P_2)$ is true.

Again, the proof is valid in both directions because of the exhaustive examination of both cases.
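The two cases of the NOT gate can be tabulated in the same brute-force style. In the Python sketch below, the invariant "exactly one of P2 and P3 is marked" is built in by deriving P3 from P2, and markings are treated as Booleans; both are assumptions of the sketch, and the function names are illustrative.

```python
def stable_not(p1, p2, p3):
    """Stable(NOT) according to (C.11) to (C.13)."""
    enabled_t1 = p2 and p1
    enabled_t2 = p3 and not p1
    return not enabled_t1 and not enabled_t2

def not_gate_table():
    """Map each input value to the set of P2 values seen in stable markings."""
    table = {}
    for p1 in (False, True):
        # Invariant on the NOT net: P3 is marked exactly when P2 is not.
        table[p1] = {p2 for p2 in (False, True) if stable_not(p1, p2, not p2)}
    return table
```

For each input value exactly one output value should remain, the negation of the input.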

History-AND

For History-AND and Priority-AND, there is no Boolean logic counterpart that a proof could refer to. However, there is a precise verbal specification for these gates, and a validity argument can be carried out against this specification. Again, only the most basic variants are considered: two inputs, no reset input or time parameter. The properties that must be validated were given in Section 7.1.2. For the History-AND gate, the important properties are that

1. if, after initial state, input i has been triggered once or more, then the output is triggered at the same time when input j is triggered the first time, and
2. the gate is afterwards in its initial state, and
3. otherwise the output is not triggered (except in the complementary case where j comes first and i after).

The letters i and j can represent 1 and 2 in any order, so a case distinction is necessary (which is trivial, as the corresponding DSPN structure is visibly symmetric). The case that both input events occur at the same time is not considered in the proof. The corresponding DSPN is given in Figure C.5. The left figure shows the initial state after system start-up, before any init event has triggered. The behaviour of this DSPN is first discussed by "playing the token game"; then the marking transition graph of the DSPN is formally compared to the state-machine in Figure C.4, which captures the intended behaviour (cf. Section 7.1.2).

First, the case is considered that input 1, corresponding to T1, triggers first. The first property to be shown can be instantiated for this case and further refined into (1a) "if, after initial state, input 1 has been triggered ..., then the output is triggered ... when input 2 is triggered the first time", (1b) "...input 1 has been triggered once or more", (1c) "...the output is triggered at the same time". Looking at the left part of the figure, it can be seen that in the initial state T3 is ready to fire with priority over T2 if a token is put on P1.
If some external agent triggers firing of T1 (input 1), then this situation happens. Without a delay, T3 fires and removes tokens from P1 and P2 and puts a token on P3. This is the state in the middle of the figure. Suppose now that T1 fires again in this state. As P2 is not marked by a token, T3 is not ready to fire, and the lower-priority transition T2 fires instead and removes the token from P1. The state is unchanged afterwards. This shows (1b), the fact that firing T1 more than once makes no difference. Now, if the state in the middle is still given, suppose that T4 (input 2) fires. A token is put on P4 and, because of the token on P5, T6 takes precedence over T5 and fires immediately. This puts a token on P6. This is a vanishing state, i.e., another firing occurs immediately afterwards without time elapsing: as both P3 and P6 are marked, T7, the output transition, fires. So (1a), the fact that the firing of the second input triggers the output, is shown as well. Also (1c), the simultaneity of input 2 and the output, is shown, because no time has elapsed between the firing of T4 and the firing of T7 (all of the involved transitions have been ready to fire and are immediate transitions). The firing of T7 removes the tokens on P3 and P6 and puts tokens on P2 and P5, so the resulting state is the initial state again (property 2).

Property 3 requires that otherwise the output is not triggered. If input 1 had not prepared the state in the middle of the figure, input 2 would not have been able to trigger the output (instead, the state sequence of the other case to be examined, input 2 before input 1, would have been initiated). So without input 1 first, the output would not trigger. After input 1, it still takes a trigger on input 2 to trigger the output, because otherwise a token on P6 would be missing. Together with the simultaneity of the latter input and the output firing, it can be confirmed that the output does not fire except in the situation described by (1). So all three properties have been shown.

Figure C.4: State-Machine Describing the Semantics of History-AND (states: Ready, In1 Triggered, In2 Triggered; transition labels: In1 / -, In2 / -, In1 / Out, In2 / Out)



Figure C.5: The DSPN Structure for the History-AND Gate from the Dictionary, in initial state (upper left), in the state where input 1, but not input 2 has triggered (upper right) and in the state where only input 2 has triggered (lower)


The second case is that input 2 is triggered before input 1. In this case the output should trigger as well, because the inputs of the History-AND are commutative. The argument is the very same as in the first case, except that the indices of the inputs and of the symmetrical places and transitions have to be swapped.

The equivalence between the implementation in terms of DSPN structures and the specification in terms of the state-machine that reflects the expectation from the verbal description can be shown by constructing the marking transition graph of the DSPN, cf. Figure C.4. The initial marking, as shown in the left part of Figure C.5, corresponds to the initial state "Ready" of the state-machine. If input 1 is triggered, the next stable state of the DSPN is the one in the middle part of the figure. It corresponds to the state "In1 Triggered" of the state-machine, and the triggering event and the corresponding output (none in this case) are also the same. The transition back to the initial state is triggered by input 2, and the corresponding action is triggering the output. This corresponds to the transition from "In1 Triggered" to "Ready" in the state diagram. If, starting from the initial marking (or the state "Ready", respectively), input 2 is triggered, then the next stable state of the DSPN is the one on the right side of the figure. This corresponds to the state "In2 Triggered" in the state-machine. The transitions between both states correspond to the sequences of transitions in the Petri Net (i.e., from "Ready" to "In2 Triggered" by triggering input 2 with no related action, and the other way around by triggering input 1, with the action of triggering the output). The fact that, depending on its marking, the Petri Net ignores some inputs (because the higher-priority transitions are not ready and the token is consumed by the lower-priority transitions) is reflected by the self-transitions in the state diagram.

Altogether, it has been argued that the marking transition graph (reduced to the tangible markings) is equivalent to the state-machine that describes the intended behaviour. Therefore, the DSPN structure acts as the History-AND gate should.
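The comparison between the token game and the state-machine can be illustrated by simulating both on all short input sequences. The Python sketch below is reconstructed from the prose description only, not from Figure C.5 itself: the arcs of T2, T3, T5, T6 and T7 follow the token moves described above, markings are assumed to hold at most one token per place, and the relative priority between unrelated transitions (for example T7 versus T3) is an arbitrary choice of the sketch. The function names are illustrative.

```python
from itertools import product

# (name, consumed places, produced places), highest priority first.
# Only "T3 before T2" and "T6 before T5" are taken from the text.
IMMEDIATE = [
    ("T7", ("P3", "P6"), ("P2", "P5")),   # output transition
    ("T3", ("P1", "P2"), ("P3",)),
    ("T6", ("P4", "P5"), ("P6",)),
    ("T2", ("P1",), ()),
    ("T5", ("P4",), ()),
]
INPUT_PLACE = {1: "P1", 2: "P4"}          # T1 feeds P1, T4 feeds P4

def run_dspn(inputs):
    """Play the token game; report for each input whether T7 fired."""
    marking = {"P2", "P5"}                # initial marking
    outputs = []
    for i in inputs:
        marking.add(INPUT_PLACE[i])
        fired_output = False
        while True:                       # drain all vanishing markings
            for name, pre, post in IMMEDIATE:
                if all(p in marking for p in pre):
                    marking.difference_update(pre)
                    marking.update(post)
                    fired_output = fired_output or name == "T7"
                    break
            else:
                break                     # no transition enabled: tangible
        outputs.append(fired_output)
    return outputs

def run_state_machine(inputs):
    """The state-machine of Figure C.4 as described in the text."""
    state, outputs = "Ready", []
    for i in inputs:
        if state == "Ready":
            state, out = f"In{i} Triggered", False
        elif state == f"In{i} Triggered":
            out = False                   # repeated input is ignored
        else:
            state, out = "Ready", True    # the other input arrives
        outputs.append(out)
    return outputs

def equivalent(max_len=8):
    return all(run_dspn(s) == run_state_machine(s)
               for n in range(max_len + 1)
               for s in product((1, 2), repeat=n))
```

For example, `run_dspn([1, 1, 2, 2, 1])` reports the output on the third and fifth input only. Agreement on bounded sequences is of course only a plausibility check, not a replacement for comparing the marking transition graph with the state-machine.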

Priority-AND

For the Priority-AND, the inputs are obviously not commutative. It has to be shown that

1. if, after initial state, input 1 has been triggered once or more, then the output is triggered at the same time when input 2 is triggered the first time, and
2. the gate is afterwards in its initial state, and
3. otherwise the output is not triggered.

This behaviour is captured in the state-machine in Figure C.6. Large parts of the argument for History-AND can be reused. Of course, the second case (input 2 before input 1) does not exist here, i.e., it should not lead to firing of the output transition. So in particular the third property has to be shown separately. Again, the first property to be shown can be refined into (1a) "if, after initial state, input 1 has been triggered ..., then the output is triggered ... when input 2 is triggered the first time", (1b) "...input 1 has been triggered once or more", (1c) "...the output is triggered at the same time".

Figure C.6: State-Machine Describing the Semantics of Priority-AND (states: Ready, In1 Triggered; transition labels: In1 / -, In2 / -, In2 / Out)

The corresponding DSPN structure can be found in Figure C.7. In the left part of the figure, the initial state is shown. It can be seen that in the initial state T3 is ready to fire with priority over T2 if a token is put on P1. If some external agent triggers firing of T1 (input 1), then this situation happens. Without a delay, T3 fires and removes tokens from P1 and P2 and puts a token on P3. This is the state in the right part of the figure. Suppose now that T1 fires again in this state. As P2 is not marked by a token, T3 is not ready to fire, and the lower-priority transition T2 fires instead and removes the token from P1. The state is unchanged afterwards. This shows (1b), the fact that firing T1 more than once makes no difference. Otherwise, if the state in the right part of the figure is still given, suppose that T4 (input 2) fires. A token is put on P4 and, because of the token on P3, the output transition T6 takes precedence over T5 and fires immediately (because it is an immediate transition). So (1a) and (1c) have been shown: if input 2 is triggered after input 1 has been triggered, the output is triggered, and this happens immediately. As T6 removes the tokens from P3 and P4 and puts a token on P2, the state after firing of the output transition is the initial state, which is property 2.

Property 3 says that otherwise the output is not triggered. As there are no timed transitions, no transition fires unless some input is triggered, and none fires at any other point in time. If input 1 had not been triggered before, transition T6 would not be enabled and triggering of input 2 would be without output reaction (the token on P4 is removed immediately by the lower-priority transition T5).
If, on the other hand, input 1 has fired before, it still takes input 2 as well to trigger the output, because T6 requires a token on P4. Input 1 alone can fire arbitrarily often without output reaction if input 2 does not fire. So property 3 has been shown as well.

As in the case of the History-AND, the state-machine that describes the intended behaviour of the Priority-AND gate can be compared to the marking transition graph of the corresponding DSPN structure. Again, both state diagrams can be judged equivalent, because they match in the number of states and in all transitions with their respective triggers and actions. The DSPN marking in the left part of Figure C.7 corresponds to the state "Ready" of the state-machine in Figure C.6 and the marking in the right part to the state "In1 Triggered". Therefore, the implementation of the Priority-AND gate as a DSPN structure corresponds to the specification of its behaviour.

Figure C.7: The DSPN Structure for the Priority-AND Gate from the Dictionary, in initial state (left) and in the state where input 1, but not input 2 has triggered (right)
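The non-commutativity of the Priority-AND can be made visible with a small simulation. The Python sketch below rebuilds the net from the prose description only, not from Figure C.7: T3 and T2 drain P1, T6 and T5 drain P4, and T6 is the output transition. The priority between unrelated transitions is an arbitrary choice of the sketch, and all names are illustrative.

```python
from collections import Counter
from itertools import product

# (name, consumed places, produced places), highest priority first.
# Only "T3 before T2" and "T6 before T5" are taken from the text.
IMMEDIATE = [
    ("T6", ["P3", "P4"], ["P2"]),   # output transition
    ("T3", ["P1", "P2"], ["P3"]),
    ("T5", ["P4"], []),
    ("T2", ["P1"], []),
]
INPUT_PLACE = {1: "P1", 2: "P4"}    # T1 feeds P1, T4 feeds P4

def run_dspn(inputs):
    """Play the token game; report for each input whether T6 fired."""
    marking = Counter({"P2": 1})    # initial marking
    outputs = []
    for i in inputs:
        marking[INPUT_PLACE[i]] += 1
        out, fired = False, True
        while fired:                # fire until a tangible marking is reached
            fired = False
            for name, pre, post in IMMEDIATE:
                if all(marking[p] > 0 for p in pre):
                    for p in pre:
                        marking[p] -= 1
                    for p in post:
                        marking[p] += 1
                    out = out or name == "T6"
                    fired = True
                    break
        outputs.append(out)
    return outputs

def run_state_machine(inputs):
    """The state-machine of Figure C.6 as described in the text."""
    state, outputs = "Ready", []
    for i in inputs:
        out = state == "In1 Triggered" and i == 2
        if i == 1:
            state = "In1 Triggered"
        elif out:
            state = "Ready"
        outputs.append(out)
    return outputs

def equivalent(max_len=8):
    return all(run_dspn(s) == run_state_machine(s)
               for n in range(max_len + 1)
               for s in product((1, 2), repeat=n))
```

With these definitions, the sequence input 1 then input 2 triggers the output on the second input, while input 2 then input 1 triggers nothing, which is the behaviour required by property 3.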



Bibliography [And00]

A NDREWS, John D.: The Use of Not Logic in Fault Tree Analysis. In: 14th Advances in Reliability Technology Symposium (ARTS’00), 2000, p. 1–15

[And02]

A NDREWS, John D.: Fault Tree Analysis - Common Misconceptions. In: The 20th International System Safety Conference, 2002

[BB03]

B OUISSOU, Marc ; B ON, Jean-Louis: A New Formalism that Combines Advantages of Fault-Trees and Markov Models: Boolean Logic Driven Markov Processes. In: Reliability Engineering and System Safety 82 (2003), n. 2, p. 149–163

[BBK98]

B AUSE, Falko ; B UCHHOLZ, Peter ; K EMPER, Peter: A Toolbox for Functional and Quantitative Analysis of DEDS. In: P UIGJANER, Ramon (Ed.) ; S AVINO, Nunzio N. (Ed.) ; S ERRA, Bartomeu (Ed.): Computer Performance Evaluation: Modelling Techniques and Tools, 10th International Conference (Tools ’98) v. 1469. Palma de Mallorca, Spain : Springer, September 14-18 1998, p. 356–359

[BCG91] B LOOMFIELD, Robin E. ; C HENG, J.H. ; G ÓRSKI, Janusz: Towards a Common Safety Description Model. In: L INDEBERG, J. F. (Ed.): The 10th International Conference on Computer Safety, Reliability and Security (SAFECOMP’91), Pergamon Press, 1991, p. 1–6 [Bee94]

B EECK, Michael von d.: A Comparison of Statecharts Variants. In: Formal Techniques in Real Time and Fault Tolerant Systems (FTRTFT) v. 863, Springer Verlag Berlin, 1994, p. 128–148

[Bir91]

B IROLINI, Alessandro: Qualität und Zuverlässigkeit technischer Systeme Theorie, Praxis, Management. Springer Verlag Berlin, Heidelberg, New York, 1991 (3. Auflage)

[Boz03]

B OZZANO, Marco et al.: ESACS: An Integrated Methodology for Design and Safety Analysis of Complex Systems. In: Proceedings of ESREL 2003. Maastricht, The Netherlands : Balkema Publishers, June 15-18 2003, p. 237–245

[Bry86]

B RYANT, Randal E.: Graph-Based Algorithms for Boolean Function Manipulation. In: IEEE Transactions on Computers 35 (1986), n. 8, p. 677–691

243

244

BIBLIOGRAPHY

[Buc00]

B UCHACKER, Kerstin: Definition und Auswertung erweiterter Fehlerbäume für die Zuverlässigkeitsanalyse technischer Systeme, Friedrich-AlexanderUniversität Erlangen-Nürnberg, Ph.D. thesis, 2000

[BW96]

B OLLIG, Beate ; W EGENER, Ingo: Improving the Variable Ordering of OBDDs is NP-complete. In: IEEE Transactions on Computers 45 (1996), September, n. 9, p. 993–1002

[CCD+ 01] C LARK, Graham ; C OURTNEY, Tod ; D EVOURS, Dan ; D ERIVSAVI, Salim ; D OYLE, Jay M. ; S ANDERS, William H. ; W EBSTER, Patrick: The Moebius Modeling Tool. In: The 9th International Workshop on Petri Nets and Performance Models, 2001, p. 241–250 [CH93]

C HRISTENSEN, Søren ; H ANSEN, Niels D.: Coloured Petri Nets Extended with Place Capacities, Test Arcs and Inhibitor Arcs. In: M ARSAN, Marco A. (Ed.): Application and Theory of Petri Nets 1993, 14th International Conference v. 691. Chicago, Illinois, USA : Springer, June 21-25 1993, p. 186–205

[CL93]

C IARDO, Gianfranco ; L INDEMANN, Christoph: Analysis of Deterministic and Stochastic Petri Nets. In: The 5th International Workshop on Petri Nets and Performance Models, 1993

[CL99]

C ASSANDRAS, Christos G. ; L AFORTUNE, Stéphane ; C ASSANDRAS, Christos G. (Ed.): Introduction to Discrete Event Systems. Kluwer Academic Publishers, 1999

[CPS03]

C OPPIT, David ; PAINTER, Robert R. ; S ULLIVAN, Kevin J.: Shared Semantic Domains for Computational Reliability Engineering. In: 14th International Symposium on Software Reliability Engineering (ISSRE 2003). Denver, Colorado, USA : IEEE Computer Society, November 17-20 2003, p. 169– 180

[CS01]

C LARKE, Edmund M. ; S CHLINGLOFF, Bernd-Holger: Model Checking. In: R OBINSON, John A. (Ed.) ; V ORONKOV, Andrei (Ed.): Handbook of Automated Reasoning. Elsevier and MIT Press, 2001, p. 1635–1790

[CSD00]

C OPPIT, David ; S ULLIVAN, Kevin J. ; D UGAN, Joanne B.: Formal Semantics of Models for Computational Engineering: A Case Study on Dynamic Fault Trees. In: The 11th International Symposium on Software Reliability Engineering (ISSRE’00), IEEE Computer Society, October 2000, p. 270

[DA01]

D UGAN, Joanne B. ; A SSAF, Tariq S.: Dynamic Fault Tree Analysis of a Reconfigurable Software System. In: The 19th International System Safety Conference, 2001, p. 480–487

[DBB92]

D UGAN, Joanne B. ; B AVUSO, Salvatore J. ; B OYD, Mark A.: Dynamic Fault Tree Models for Fault Tolerant Computer Systems. In: IEEE Transactions on Reliability 41 (1992), September, n. 3, p. 363–377

BIBLIOGRAPHY

245

[DEF97]

DEF STAN 00-55: Requirements for Safety Related Software in Defence Equipment: Part 1: Requirements. Part 2: Guidance. Ministry of Defence, 1997. – Issue 2

[DEF04]

DEF STAN 00-58: Hazop Studies on Systems Containing Programmable Electronics: Part 1: Requirements. Part 2: General Application Guidance. Ministry of Defence, 2004

[DIN81]

DIN 25424: Fehlerbaumanalyse: Part 1 (09/81): Methode und Bildzeichen. Part 2 (04/90): Handrechenverfahren zur Auswertung eines Fehlerbaumes. DIN Deutsches Institut für Normung e.V. Beuth Verlag GmbH, 1981

[DIN90]

DIN 40041: Zuverlässigkeit; Begriffe. DIN Deutsches Institut für Normung e.V. Beuth Verlag GmbH, 1990

[DIN93]

DIN IEC 61025: Störungsbaumanalyse. DIN Deutsches Institut für Normung e.V. Beuth Verlag GmbH, 1993. – Identisch mit IEC 61025:1990

[DIN95]

DIN EN ISO 8402: Qualitätsmanagement und Qualitätssicherung; Begriffe. DIN Deutsches Institut für Normung e.V. Beuth Verlag GmbH, 1995

[DIN00]

DIN EN 50126: Bahnanwendungen - Spezifikation und Nachweis der Zuverlässigkeit, Verfügbarkeit, Instandhaltbarkeit, Sicherheit (RAMS). DIN Deutsches Institut für Normung e.V. Beuth Verlag GmbH, 2000

[DIN01]

DIN EN 50128: Bahnanwendungen - Telekommunikationstechnik, Signaltechnik und Datenverarbeitungssysteme - Software für Eisenbahnsteuerungs- und Überwachungssysteme. DIN Deutsches Institut für Normung e.V. Beuth Verlag GmbH, 2001

[DIN02a] DIN EN 61508-3: Funktionale Sicherheit sicherheitsbezogener elektrischer/elektronischer/programmierbarer elektronischer Systeme - Teil 3: Anforderungen an Software (IEC 61508-3:1998 + Corrigendum 1999). DIN Deutsches Institut für Normung e.V. Beuth Verlag GmbH, 2002 [DIN02b] DIN EN 61508-4: Funktionale Sicherheit elektrischer/elektronischer/programmierbar elektronischer sicherheitsbezogener Systeme - Teil 4: Begriffe und Abkürzungen (IEC 61508-4:1998 + Corrigendum 1999). DIN Deutsches Institut für Normung e.V. Beuth Verlag GmbH, 2002 [DIN03a] DIN EN 50129: Bahnanwendungen - Telekommunikationstechnik, Signaltechnik und Datenverarbeitungssysteme - Sicherheitsrelevante elektronische Systeme für Signaltechnik. DIN Deutsches Institut für Normung e.V. Beuth Verlag GmbH, 2003 [DIN03b] DIN EN 61508-7: Funktionale Sicherheit elektrischer/elektronischer/programmierbar elektronischer sicherheitsbezogener Systeme - Teil 7: Anwendungshinweise über Verfahren und Maßnahmen (IEC 61508-7:2000). DIN Deutsches Institut für Normung e.V. Beuth Verlag GmbH, 2003

246

BIBLIOGRAPHY

[DIN03c] DIN IEC 61078: Techniken für die Analyse der Zuverlässigkeit - Verfahren mit dem Zuverlässigkeitsblockdiagramm und Boole’sche Verfahren. DIN Deutsches Institut für Normung e.V. Beuth Verlag GmbH, 2003. – Norm-Entwurf [DO-92]

DO-178B: Software Considerations in Airborne Systems and Equipment Certification. Radio Technical Commission for Aeronautics, Inc., 1992

[DSC00]

D UGAN, Joanne B. ; S ULLIVAN, Kevin J. ; C OPPIT, David: Developing a Low Cost High Quality Software Tool for Dynamic Fault Tree Analysis. In: IEEE Transactions on Reliability 49 (2000), March, n. 1, p. 49–59

[FAR76]

F USSEL, J.B. ; A BER, E.F. ; R AHL, R.G.: On the Quantitative Analysis of Priority-AND Failure Logic. In: IEEE Transactions on Reliability 25 (1976), December, n. 5, p. 324–326

[FB03]

F REIHEIT, Jörn ; B ILLINGTON, Jonathan: New Developments in ClosedForm Computation for GSPN Aggregation. (2003)

[FM93]

F ENELON, Peter ; M C D ERMID, John A.: An Integrated Toolset for Software Safety Analysis. In: The Journal of Systems and Software 21 (1993), July, n. 3, p. 279–290

[FMPN94] F ENELON, Peter ; M C D ERMID, John A. ; P UMFREY, D. J. ; N ICHOLSON, M.: Towards Integrated Safety Analysis and Design. In: ACM Applied Computing Review 2 (1994), August, n. 1, p. 21–32 [För06]

F ÖRSTER, Marc: Efficient Quantitative Evaluation of State/Event Fault Trees, Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam, Master thesis, to be submitted 2006

[FSKR05] F ECHER, Harald ; S CHÖNBORN, Jens ; K YAS, Marcel ; R OEVER, WillemPaul d.: New Unclarities in the Semantics of UML 2.0 State Machines. In: L AU, Kung-Kiu (Ed.) ; B ANACH, Richard (Ed.): Formal Methods and Software Engineering: 7th International Conference on Formal Engineering Methods, ICFEM 2005 v. 3785. Manchester, UK : Springer-Verlag GmbH, November 1-4 2005, p. 52–65 [GD97]

G ULATI, Rohit ; D UGAN, Joanne B.: A Modular Approach for Analyzing Static and Dynamic Fault Trees. In: Reliability and Maintainability Symposium, 1997, p. 57–63

[GK05]

G RUNSKE, Lars ; K AISER, Bernhard: An Automated Dependability Analysis Method for COTS-Based Systems. In: The 4th International Conference on COTS-Based Software Systems (ICCBSS’05) v. 3412, 2005, p. 178– 190

[GM95]

G ERMAN, Reinhard ; M ITZLAFF, Jörg: Transient Analysis of Deterministic and Stochastic Petri Nets with TimeNET. In: Messung, Modellierung und Bewertung von Rechen- und Kommunikationssystemen (MMB), 1995, p. 209–223

BIBLIOGRAPHY

247

[GMW95] G ÓRSKI, Janusz ; M AGOTT, J. ; WARDZINSKI, A.: Modelling of Fault Trees Using Petri Nets. In: The 14th International Conference on Computer Safety, Reliability and Security (SAFECOMP’95), 1995, p. 90–100 [GP03]

G EHRS, Kai ; P OSTEL, Dr. F.: MuPAD - A Practical Guide. Mathematics made anew - Tools and Texts for Computer Aided Learning. Paderborn: SciFace Software GmbH & Co. KG, November 2003. http://edu. mupad.de

[Gór94]

G ÓRSKI, Janusz: Extending Safety Analysis Techniques with Formal Semantics. In: Technology and Assessment of Safety Critical Systems (1994), p. 147–163

[Gra95]

G RAMS, Boris: Entwurf und Implementierung von anwendungsbasierten Methoden zur Dependability-Analyse basierend auf stochastischen PetriNetzen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Diplomarbeit, April 1995

[Gra04]

G RAMLICH, Catharina: Präzisierung und Validierung eines Modells zur Sicherheitsanalyse, Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam, Master thesis, 2004

[HA88]

H URA, G. S. ; ATWOOD, J. W.: The Use of Petri Nets to Analyze Coherent Fault Trees. In: IEEE Transactions on Reliability 37 (1988), n. 5, p. 469–474

[Han96]

H ANSEN, Kirsten M.: Linking Safety Analysis to Safety Requirements - Exemplified by Railway Interlocking Systems, Technical University of Denmark, Ph.D. thesis, 1996

[Har87]

H AREL, David: Statecharts: A Visual Formalism for Complex Systems. In: Science of Computer Programming 8 (1987), June, n. 3, p. 231–274

[HK92]

H ENLEY, Ernest J. ; K UMAMOTO, Hiromitsu: Probabilistic Risk Assessment: Reliability Engineering, Design and Analysis. IEEE Press, 1992

[HL94]

H ÖHL, Michael ; L ADKIN, Peter: Report on the Accident to Airbus A320211 Aircraft in Warsaw on 14 September 1993 / Main Commission Aircraft Accident Investigation Warsaw. 1994. – Technical Report

[HN96]

H AREL, David ; N AAMAD, Amnon: The STATEMATE Semantics of Statecharts. In: ACM Transactions on Software Engineering and Methodology (TOSEM) 5 (1996), October, n. 4, p. 293–333

[HNS00] H OFMEISTER, C. ; N ORD, R. ; S ONI, D.: Applied Software Architecture. Addison-Wesley, Reading, Mass. and London, 2000 [HPSS87] H AREL, David ; P NUELI, Amir ; S CHMIDT, Jeanette P. ; S HERMAN, Rivi: On the Formal Semantics of Statecharts (Extended Abstract). In: Proceedings of the Symposium on Logic in Computer Science (LICS ’87). Ithaca, New York, USA : IEEE Computer Society, June 22-25 1987, p. 54–64

248

BIBLIOGRAPHY

[HWS+ 01] H ELMER, Guy ; W ONG, Johnny ; S LAGELL, Mark ; H ONAVAR, Les ; L UTZ, Robyn: Software Fault Tree and Colored Petri Net Based Specification, Design and Implementation of Agent-Based Intrusion Detection Systems. In: In revision. (2001) [IEC91]

IEC 60812: Analysis Techniques for System Reliability - Procedure for Failure Mode and Effect Analysis (FMEA). International Electrotechnical Commission, 1991

[IEC01]

IEC 61822: Hazard and operability studies (HAZOP studies) - Application guide. International Electrotechnical Commission, 2001

[ISO01]

ISO/IEC 9126-1:2001 ; ISO/IEC (Ed.): Software engineering - Product quality - Part 1: Quality model. 2001

[JRH+ 05] J ECKLE, Mario ; R UPP, Chris ; H AHN, Jürgen ; Z ENGLER, Barbara ; Q UEINS, Stefan ; SOPHIST GROUP (Ed.): UML 2.0 Glasklar - Praxiswissen für die UML-Modellierung und -Zertifizierung. Hanser Fachbuchverlag, 2005 [Kai05]

K AISER, Bernhard: Extending the Expressive Power of Fault Trees. In: Accepted for Publication at the 51st Annual Reliability&Maintainability Symposium (RAMS’05). Alexandria, VA, USA, January 24-27 2005

[KFG+ 05] K AISER, Bernhard ; F ÖRSTER, Marc ; G OMEZ, Carolina ; G RAMLICH, Catharina ; L AKDAWALA, Zahra ; R ICHTER, Tina ; R OZINAT, Anne ; S CIVER, Johann van ; Z OCHER, André: The ESSaRel Project - Fault Tree Analysis Tool UWG3 and Integrated Safety and Reliability Analysis Tool ESSaRel / IESE. 2005. – Technical Report [KG04]

K AISER, Bernhard ; G RAMLICH, Catharina: State-Event-Fault-Trees - A Safety Analysis Model for Software Controlled Systems. In: H EISEL, Maritta (Ed.) ; L IGGESMEYER, Peter (Ed.) ; W ITTMANN, Stefan (Ed.): The 23rd International Conference on Computer Safety, Reliability, and Security (SAFECOMP’04) v. 3219. Potsdam, Germany : Springer, September 2124 2004. – ISBN 3–540–23176–5

[KGF06] K AISER, Bernhard ; G RAMLICH, Catharina ; F ÖRSTER, Marc: State-EventFault-Trees - A Safety Analysis Model for Software Controlled Systems. (2006). – ISSN 0951–8320 [KHI89]

K OHDA, Takehisa ; H ENLEY, Ernest J. ; I NOUE, Koichi: Finding Modules in Fault Trees. In: IEEE Transactions on Reliability 38 (1989), n. 2, p. 165–176

[KLM03] K AISER, Bernhard ; L IGGESMEYER, Peter ; M ÄCKEL, Oliver: A New Component Concept for Fault Trees. In: L INDSAY, P. (Ed.) ; C ANT, T. (Ed.): Proceedings of the 8th Australian Workshop on Safety Critical Systems and Software (SCS’03) v. 33. Canberra, 2003

BIBLIOGRAPHY

249

[KT02] Klose, Jochen ; Thums, Andreas: The STATEMATE Reference Model of the Reference Case Study ’Verkehrsleittechnik’ / University Augsburg. 2002 (2002-01). – Technical Report

[KW01] Kindler, Ekkart ; Weber, Michael: A Universal Module Concept for Petri Nets - An Implementation Oriented Approach / Humboldt-Universität zu Berlin. 2001. – Technical Report

[KW03] Keller, Frank ; Wendt, Siegfried: An Approach Towards Architecture-Centric System Development. In: The 10th IEEE Symposium and Workshops on Engineering of Computer Based Systems. Huntsville Alabama USA, 2003

[KZ05] Kaiser, Bernhard ; Zocher, André: BDD Complexity Reduction by Component Fault Trees. In: ESREL - European Safety and Reliability Conference. Gdynia, Poland, 2005

[Lap95] Laprie, J.C.: Dependability - Its Attributes, Impairments and Means. In: Predictably Dependable Computing Systems (1995), p. 3–24

[LCS91] Leveson, Nancy G. ; Cha, Stephen S. ; Shimeall, Timothy J.: Safety Verification of ADA Programs using Software Fault Trees. In: IEEE Software 8 (1991), n. 4, p. 48–59

[Lev95] Leveson, Nancy G.: Safeware: System Safety and Computers. Addison-Wesley, 1995

[Lig00] Liggesmeyer, Peter: Qualitätssicherung softwareintensiver technischer Systeme. Spektrum Akademischer Verlag, Heidelberg, 2000

[LM96] Liu, Shaoying ; McDermid, John A.: A Model-Oriented Approach to Safety Analysis Using Fault Trees and a Support System. In: Journal of Systems and Software 35 (1996), n. 2, p. 151–164

[LM01] Liggesmeyer, Peter ; Mäckel, Oliver: Quantifying the Reliability of Embedded Systems by Automated Analysis. In: International Conference on Dependable Systems and Networks (DSN’01), IEEE Computer Society, 2001, p. 89–96

[LN99] Lind-Nielsen, Jørn: BuDDy - A Binary Decision Diagram Package / Department of Information Technology, Technical University of Denmark. Version: 1999. http://buddy.sourceforge.net/. – Technical Report. – Online resource

[LR98] Liggesmeyer, Peter ; Rothfelder, Martin: Improving System Reliability with Automatic Fault Tree Generation. In: The 28th Annual International Symposium on Fault-Tolerant Computing, IEEE Computer Society, 1998, p. 90–99

[LRSZ93] Liu, Zhiming ; Ravn, Anders P. ; Sorensen, Erlin V. ; Zhou, Chaochen: A Probabilistic Duration Calculus. In: Kopetz, H. (Ed.) ; Kakuda, Y. (Ed.): Proceedings of the Second International Workshop on Responsive Systems: Dependable Computing and Fault-Tolerant Systems v. 7. Saitama, Japan : Springer-Verlag, 1993, p. 30–52

[LRT99] Lindemann, Christoph ; Reuys, Andreas ; Thümmler, Axel: The DSPNexpress 2.000 Performance and Dependability Modeling Environment. In: The Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (FTCS ’99). Madison, Wisconsin, USA : IEEE Computer Society, June 15-18 1999, p. 228–231

[LT89] Lynch, Nancy A. ; Tuttle, Mark R.: An Introduction to I/O Automata. In: CWI Quarterly 2 (1989), n. 3, p. 219–246

[Lyu96] Lyu, Michael R.: Handbook of Software Reliability Engineering. IEEE Computer Society Press and McGraw-Hill Book Company, 1996

[MBC+95] Marsan, Marco A. ; Balbo, Gianfranco ; Conte, Gianni ; Donatelli, Susanna ; Franceschinis, Giuliana ; Marsan, Marco A. (Ed.): Modelling with Generalized Stochastic Petri Nets. 1. John Wiley & Sons, Inc., 1995 (John Wiley Series in Parallel Computing)

[MC87] Marsan, Marco A. ; Chiola, Giovanni: On Petri Nets with Deterministic and Exponentially Distributed Firing Times. In: The 7th European Workshop on Applications and Theory of Petri Nets v. 266, Springer Verlag, 1987, p. 132–145

[McD02] McDermid, John A.: Software Hazard and Safety Analysis. In: Damm, W. (Ed.) ; Olderog, E.-R. (Ed.): Formal Techniques in Real-Time and Fault-Tolerant Systems: 7th International Symposium, FTRTFT 2002, Co-sponsored by IFIP WG 2.2 v. 2469. Oldenburg, Germany : Springer-Verlag GmbH, September 9-12 2002, p. 23–34

[MDCS98] Manian, Ragavan ; Dugan, Joanne B. ; Coppit, David ; Sullivan, Kevin J.: Combining Various Solution Techniques for Dynamic Fault Tree Analysis of Computer Systems. In: 3rd IEEE International Symposium on High-Assurance Systems Engineering (HASE ’98). Washington, D.C, USA : IEEE Computer Society, November 13-14 1998, p. 21–28

[Mea55] Mealy, George H.: A Method for Synthesizing Sequential Circuits. In: Bell System Technical Journal 34 (1955), n. 5, p. 1045–1079

[MF76] Merlin, Philip M. ; Faber, David J.: Recoverability of Communication Protocols - Implication of a Theoretical Study. In: IEEE Transactions on Communications 24 (1976), n. 9, p. 1036–1043

[MIL93] MIL STD 882C: Standard Practice for System Safety Program Requirements. Department of Defense, Washington, D.C., 1993

[MIO87] Musa, John D. ; Iannino, Anthony ; Okumoto, Kazuhira ; Musa, John D. (Ed.): Software Reliability: Measurement, Prediction, Application. McGraw-Hill Book Company, 1987


[Moo56] Moore, E.F.: Gedanken-Experiments on Sequential Machines. In: Automata Studies (1956), n. 34, p. 129–153

[MT94] Malhotra, Manish ; Trivedi, Kishor S.: Power-Hierarchy of Dependability Model Types. In: IEEE Transactions on Reliability 43 (1994), n. 3, p. 493–501

[MT95] Malhotra, Manish ; Trivedi, Kishor S.: Dependability Modeling Using Petri Nets. In: IEEE Transactions on Reliability 44 (1995), n. 3, p. 428–440

[Mur89] Murata, Tadao: Petri Nets: Properties, Analysis and Applications. In: Proceedings of the IEEE 77 (1989), n. 4, p. 541–580

[OD00] Ou, Yong ; Dugan, Joanne B.: Sensitivity Analysis of Modular Dynamic Fault Trees. In: 4th International Computer Performance and Dependability Symposium, 2000, p. 35–43

[PD96] Pullum, Laura L. ; Dugan, Joanne B.: Fault tree Models for the Analysis of Complex Computer-Based Systems. In: Proceedings of the Reliability and Maintainability Symposium, 1996, p. 200–207

[Per85] Perrow, Charles: Normal Accidents: Living with High-Risk Technologies. Basic Books, 1985

[Pet62] Petri, Carl A.: Kommunikation mit Automaten, Institut für Instrumentelle Mathematik, University of Bonn, Ph.D. thesis, 1962

[PM99] Papadopoulos, Yiannis ; McDermid, John A.: Hierarchically Performed Hazard Origin and Propagation Studies. In: The 18th International Conference on Computer Safety, Reliability and Security (SAFECOMP’99) v. 1698, Springer Verlag, 1999, p. 139–152

[PM01] Papadopoulos, Yiannis ; Maruhn, Matthias: Model-Based Synthesis of Fault Trees from Matlab-Simulink Models. In: International Conference on Dependable Systems and Networks (DSN’01), 2001

[RCC99] Redmill, F. ; Chudleigh, M. ; Catmur, J.: System Safety: HAZOP and Software HAZOP. John Wiley & Sons, 1999

[Rei85] Reisig, Wolfgang: Petri Nets: An Introduction. Springer Verlag Berlin, Heidelberg, New York, 1985

[Rel] Relex Software Corporation: Relex Fault Tree/Event Tree. http://www.relexsoftware.com

[RL05] Chapter 12 - Sicherheits- und Zuverlässigkeitsanalysetechniken. In: Rombach, Dieter ; Liggesmeyer, Peter: Software Engineering eingebetteter Systeme: Grundlagen - Methodik - Anwendungen. Spektrum Akademischer Verlag, Heidelberg, 2005, p. 281–315


[RM05] Robinson-Mallett, Christopher: Modellbasierte Modulprüfung für die Entwicklung technischer, softwareintensiver Systeme mit Real-Time Object-Oriented Modeling, Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam, Ph.D. thesis, 2005

[Rog05] Rogotzki, Antje: Integration von zustandsbasierten Modellen eines CASE-Tools in das Fehlerbaumanalysewerkzeug ESSaRel, Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam, Master thesis, 2005

[RST00] Reif, Wolfgang ; Schellhorn, Gerhard ; Thums, Andreas: Safety Analysis of a Radio-Based Crossing Control System Using Formal Methods. In: 9th IFAC Symposium on Control in Transportation Systems 2000, 2000

[Rus94] Rushby, John: Critical System Properties: Survey and Taxonomy / Computer Science Laboratory, SRI International. 1994. – Technical Report

[SBST05] Schneider, Klaus ; Brandt, Jens ; Schuele, Tobias ; Tuerk, Thomas: Maximal Causality Analysis. In: Conference on Application of Concurrency to System Design (ACSD). St. Malo, France : IEEE Computer Society, June 2005, p. 106–115

[SCD99] Sullivan, Kevin J. ; Coppit, David ; Dugan, Joanne B.: The Galileo Fault Tree Analysis Tool. In: The Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (FTCS 1999). Madison, Wisconsin, USA : IEEE Computer Society, June 15-18 1999, p. 232–235

[Sch03] Schäfer, Andreas: Combining Real-Time Model-Checking and Fault Tree Analysis. In: The 12th International FME Symposium (FME’03): Formal Methods 2805 (2003), p. 522–541

[Sel94] Selic, Bran: Real-Time Object-Oriented Modeling. John Wiley & Sons, Inc., 1994

[ST87] Sahner, Robin A. ; Trivedi, Kishor S.: Reliability Modeling Using SHARPE. In: IEEE Transactions on Reliability R-36 (1987), n. 2, p. 186–193

[Sto96] Storey, Neil R.: Safety-Critical Computer Systems. Addison-Wesley Longman Publishing Co., Inc., 1996

[STR02] Schellhorn, Gerhard ; Thums, Andreas ; Reif, Wolfgang: Comparing Formal Fault Tree Semantics. In: The 5th Workshop on Tools for System Design and Verification (FM-TOOLS’02), 2002

[SVD+02] Stamatelatos, Michael ; Vesely, William ; Dugan, Joanne ; Fragola, Joseph ; Minarick, Joseph ; Railsback, Jan: Fault Tree Handbook with Aerospace Applications. Washington, DC 20546 : Office of Safety and Mission Assurance, NASA Headquarters, 2002


[VGR81] Vesely, William E. ; Goldberg, F. ; Roberts, D.: Fault Tree Handbook (NUREG 0492). U.S. Nuclear Regulatory Commission, 1981

[Vil92] Villemeur, Alain: Reliability, Availability, Maintainability and Safety Assessment. John Wiley & Sons, Inc., 1992

[Wik] Wikipedia, the free encyclopedia. http://www.wikipedia.org

[WSS97] Wu, Sue-Hwey ; Smolka, Scott A. ; Stark, Eugene W.: Composition and Behaviors of Probabilistic I/O Automata. In: Theoretical Computer Science 176 (1997), n. 1-2, p. 1–38

[ZGFH99] Zimmermann, A. ; German, R. ; Freiheit, J. ; Hommel, G: TimeNET 3.0 Tool Description. In: Int. Conf. on Petri Nets and Performance Models (PNPM’99). Zaragoza, Spain, 1999

[Zoc05] Zocher, André: Quantitative Auswertung von Multizustand-Komponentenfehlerbäumen durch mehrwertige Entscheidungsdiagramme, Hasso-Plattner-Institut für Softwaresystemtechnik an der Universität Potsdam, Master thesis, 2005

[ZWST03] Zang, Xinyu ; Wang, Dazhi ; Sun, Hairong ; Trivedi, Kishor S.: A BDD-Based Algorithm for Analysis of Multistate Systems with Multistate Components. In: IEEE Transactions on Computers 52 (2003), n. 12, p. 1608–1618


List of Figures

2.1 Rate, Failure Density, Reliability, and Failed State Probability assuming exponential distribution
2.2 A Reliability Block Diagram Example
2.3 A Simple Fault Tree Example
2.4 A Simple Event Tree Example
2.5 A Simple Markov Chain Example
2.6 Fault Tree Gates in European and US style
2.7 Simple Fault Tree
2.8 BDD corresponding to the AND conjunction of two variables (events) E1 and E2
2.9 BDD corresponding to the Simple FT example from Fig. 2.7
2.10 Basic State Diagram
2.11 The AND and the OR Decomposition in Statecharts: Traffic Lights Example
2.12 A Markov-Chain Example
2.13 A Simple Petri Net Example (Initial State and State After Transition T1 Has Fired)
2.14 A ROOM Structure Diagram Example
2.15 Mapping of FT AND (left) and OR (right) Gate into Petri Net Structures according to [HA88]
2.16 Mapping of Components to GSPN Subnets according to [Gra95]
2.17 Mapping of AND and OR Gates to GSPN Subnets according to [Gra95]
3.1 A Fault Tree and one of its Modules
3.2 Decomposition of a Fault Tree by Modules
3.3 Fault Tree with Subcomponent
3.4 Component Fault Tree of the System (left) and the Subcomponent (right)


3.5 Multiple Use of the same Subcomponent type
3.6 The Model Elements of Component Fault Trees
3.7 IDs as a Means to Reference Components and Ports
3.8 The Direction of Edges with Respect to Different Kinds of Ports
3.9 Multiple Edges from one point (allowed) and to one point (forbidden)
3.10 Forbidden: Shallow Cycle and Deep Cycle
3.11 Common cause failure
3.12 Hierarchical Example System: Nesting Structure
3.13 Hierarchical Example System: Containment Tree
3.14 BDD Fragments to the Entries C1.Pout1, C1.SC1.Pin1 and C2.Pout1 of the Example in Figure 3.4
3.15 BDD Fragments C1.Pout1, C1.SC1.Pin1 and C1.SC1.Pout1 after Cloning C2 to C1.SC1
3.16 BDD Fragments with Prefixed IDs (left) and after Composition (right)
4.1 The Different Ways of Event Occurrence: a) Probabilistic Event, b) Deterministic Event, c) Triggered Event (upper)
4.2 Causal Edges as Trigger (left) and Guard (right) Relations
4.3 Putting a Logical Gate into a Trigger Relation
4.4 A First SEFT Example
4.5 Different Kinds of Ports: Model Ports vs. Substructure Ports; Input Ports vs. Output Ports, State Ports vs. Event Ports
4.6 Overview of SEFT Model Elements
4.7 AND Gate with n State Inputs
4.8 AND Gate with one Event (Trigger) and n State Inputs
4.9 OR Gate with n State Inputs
4.10 OR gate with n Event Inputs
4.11 NOT Gate with One State Input
4.12 The Inhibit Gate
4.13 The Exclusive OR (XOR) Gate
4.14 The Equal Gate
4.15 The Voter Gate
4.16 The History-AND Gate: Standard, Variant with Reset Input, Variant with Time Parameter
4.17 The Priority-AND Gate: Standard, Variant with Reset Input, Variant with Time Parameter


4.18 The Deterministic and Probabilistic (Exponentially Distributed) Delay Gates, with and without Reset Input
4.19 The Conditional Probability Gate
4.20 The Duration Gate
4.21 The State/Event Adapter Gates: Entering, Leaving, Upon, Until, Flip-Flop
4.22 Traditional Markov Chain (upper) and SEFT Counterpart (lower)
4.23 Traditional Fault Tree (upper left) and Components of SEFT Counterpart (upper right: System, lower: Subcomponents)
5.1 DSPN Example with Different Modelling Elements
5.2 Different Modularisation Techniques: Transition Refinement (left) and Net Cutting (right)
5.3 Modular Petri Nets according to [KW01]
5.4 Flattening of Modular Petri Nets according to [KW01]
5.5 The State-AND Gate with 2 Inputs (upper) and two Different Implementation Variants with 3 Inputs (middle and lower)
5.6 A Generic Implementation for the State-AND Gate Using Building Blocks
5.7 An SEFT Component with a Subcomponent and the Referenced Component SEFT
5.8 The corresponding DSPN structures
5.9 The Guard-Pattern: SEFT (left) and DSPN Translation (right)
5.10 The Trigger Pattern: SEFT (left) and corresponding DSPN: a) Case where Triggered Transition is Ready (middle) and b) Case Where It Is Not Ready (right)
5.11 The flattened component DSPN before and after Merging
5.12 SEFT Example and DSPN of the OR-Gate with irregular Start Marking
5.13 Additional DSPN Pattern for Event Outputs
6.1 UWG3 Screenshot
6.2 Inheritance of a Specific Model from the Built-In Base Classes
6.3 ESSaRel Screenshot
6.4 XML Schema Hierarchy of ESSaRel
7.1 State-Machine Describing the Semantics of History-AND
7.2 State-Machine Describing the Semantics of Priority-AND
7.3 Qualitative Test Example for State-AND: Test Setting


7.4 Qualitative Test Example for State-AND: Test Result Plot
7.5 Quantitative Test Example for State-AND: Test Setting
7.6 Quantitative Test Example for State-AND: Test Result Plot
7.7 Test Setting for Delay Gate
7.8 Test Setting for Triggered State-Machine
7.9 Special Case: AND Applied to Dependent States
7.10 A Test Case for Inner Consistency Demonstration: DeMorgan’s Law
7.11 A Test Case for Inner Consistency Demonstration: Voter Gate
7.12 A Test Case for Inner Consistency: Upon-Gate and Triggered State-Machine
7.13 Test Case "Equal vs. Same", Glassbox View
7.14 Test Case "Hierarchical Structure vs. Flat Counterpart", Glassbox View
7.15 Test Setting for Single Gate Tests (here the State-AND Gate) as FT and as SEFT in Glassbox View
7.16 Priority AND Test Case with One Exponential and One Deterministic Event
7.17 A Proof Case for Priority AND: Both Events Are Exponential
7.18 Simulation Results of the Priority AND Test Case
7.19 Consistency Test Priority AND with Sequential Events
7.20 Test Case for Comparison Priority-AND in SEFT (left) to DFT (right)
7.21 The FDEP Gate in DFT (left) and an Equivalent SEFT (right)
7.22 The Hot Spare, Cold Spare, and Warm Spare Gates in DFT
7.23 An SEFT Equivalent for the DFT Warm Spare Gate
7.24 Test Setting: A Markov Chain and its Corresponding SEFT
7.25 The Fire Alarm Case Study
7.26 DSPN for the Fire Alarm Case Study
7.27 Simulation Result Plot for the Fire Alarm Case Study
7.28 Motorway Alarm Case Study: The Car and the Sensor SEFTs
7.29 Motorway Alarm Case Study: The Controller SEFT
7.30 The Top-Level SEFT of the Motorway Alarm System
7.31 The DSPN of the Motorway Alarm System
7.32 The Result Plot from a Simulation of the Motorway Alarm System
7.33 Subcomponent Representing a Basic Event (left) from Traditional FTA
7.34 Solitary Events: Single Exponential, Single Deterministic, Repeated Exponential, Repeated Deterministic


7.35 Subcomponent Representing a Single Probabilistic Solitary Event (left) and a Repeated Deterministic Solitary Event (right)
7.36 Basic Event with Repair Rate and Mission Time
7.37 Multiple Predecessor States to an Event (left) and Multiple Probabilistically Distributed Successor States (right)
7.38 Multiple Component Instances in ROOM
B.1 Class Diagram of Different Computations within ESSaRel
C.1 The DSPN Structure for the AND Gate from the Dictionary, with Visible IDs
C.2 The DSPN Structure for the OR Gate from the Dictionary, with Visible IDs
C.3 The DSPN Structure for the NOT Gate from the Dictionary, with Visible IDs
C.4 State-Machine Describing the Semantics of History-AND
C.5 The DSPN Structure for the History-AND Gate from the Dictionary, in initial state (upper left), in the state where input 1, but not input 2 has triggered (upper right) and in the state where only input 2 has triggered (lower)
C.6 State-Machine Describing the Semantics of Priority-AND
C.7 The DSPN Structure for the Priority-AND Gate from the Dictionary, in initial state (left) and in the state where input 1, but not input 2 has triggered (right)


Curriculum Vitae

Dipl.-Ing. Bernhard Kaiser, born in 1970

1990 - 1996

Studies in Electrical Engineering at Kaiserslautern University. Diploma thesis on ”Examination of the VSELP Speech Coding Algorithm and Implementation on a DSP”.

1996 - 2001

ABB Automation Products GmbH. Development engineer for microprocessor-controlled drive rectifiers (hardware and embedded software), later project manager for the development of the new DCS400 product and group leader, responsible for software quality assurance.

2001 - 2004

Hasso-Plattner-Institut für Softwaresystemtechnik GmbH at Potsdam University. Researcher, start of the doctoral studies, assistant in teaching (Software Engineering and Quality Management), project leader of the UWG3 / ESSaRel project.

since 2004

Fraunhofer Institute for Experimental Software Engineering (IESE), Kaiserslautern. Researcher, leader of the Competence Development Team ”Safety and Reliability of Embedded Systems”, later head of the Security and Safety department, project leader of the ESSaRel project and other industrial and publicly funded research projects, completion of the doctoral thesis.

Avocationally

Lecturer in Electrical and Software Engineering subjects at Fachhochschule Mannheim and Fachhochschule für Technik und Wirtschaft Berlin, Industry training in Software Engineering with Deutsche Informatik Akademie and ComCenter, Industry Consultancy


Publication List

Kaiser, B.: Studies on VSELP Speech Coding Algorithm and Implementation in a Laboratory Environment. Diploma Thesis University of Kaiserslautern / Institut Supérieur d’Electronique de Paris. 1996

Neumann R., Grunske L., Kaiser B.: Hierarchical Software Quality Models - A step towards quantifying non-functional properties, Proceedings of the 12th International Workshop on Software Measurement, Magdeburg, Shaker 2002

Kaiser, B.: Integration von Sicherheits- und Zuverlässigkeitsmodellen in den Entwicklungsprozess Eingebetteter Systeme. Softwaretechnik-Trends 22(4): Published by the Gesellschaft für Informatik 2002

Kaiser, B.: A Fault-Tree Semantics to model Software-Controlled Systems. Softwaretechnik-Trends 23(3): Published by the Gesellschaft für Informatik 2003

Kaiser, B., Liggesmeyer, P. and Mäckel, O.: A New Component Concept for Fault Trees. In Proc., Canberra, Australia. Conferences in Research and Practice in Information Technology, 33. Lindsay, P. and Cant, T., Eds., ACS. 37-46

Kaiser, B., Gramlich, C.: State-Event-Fault-Trees - A Safety Analysis Model for Software Controlled Systems. In: Computer Safety, Reliability, and Security. 23rd International Conference, SAFECOMP 2004, Potsdam, Germany, September 21-24, 2004, Proceedings. Lecture Notes in Computer Science, Vol. 3219 2004, p. 195-209

Kaiser, B.: Extending the Expressive Power of Fault Trees. Proceedings 51st Annual Reliability & Maintainability Symposium (RAMS05), January 24-27, 2005, Alexandria, VA, USA, ISSN: 0149-144X, Pages 468-474

Grunske, L., Kaiser, B.: An Automated Dependability Analysis Method for COTS-Based Systems. COTS-Based Software Systems: 4th International Conference, ICCBSS 2005, Bilbao, Spain, February 7-11, 2005. Proceedings. Lecture Notes in Computer Science, Volume 3412, Jan 2005, Pages 178-190

Papadopoulos Y., Grante C., Grunske L., Kaiser B.: Continuous Assessment of Designs and Reuse in Model-based Safety Analysis, IFAC WC 05, 16th World Congress, Intl Federation of Automatic Control, Prague, July 4-8, 2005.

Grunske L., Kaiser B., Papadopoulos Y.: Model-Driven Safety Evaluation with State-Event-Based Component Failure Annotations. Proceedings Component-Based Software Engineering, 8th International Symposium, CBSE 2005, St. Louis, MO, USA, May 14-15, 2005. Lecture Notes in Computer Science Volume 3489, Springer 2005, Pages 33-48

Kaiser, B., Zocher, A.: BDD complexity reduction by component fault trees. Proceedings ESREL2005, Tri City, Poland, June 27-30, 2005.

Mäckel, O., Kaiser, B.: Sicherheits- und Zuverlässigkeitsanalysetechniken. In: Liggesmeyer, P. and Rombach, D. (Eds.): Software Engineering eingebetteter Systeme. Elsevier, 2005.

Grunske, L., Kaiser, B., Reussner, R.: Specification and Evaluation of Safety Properties in a Component-based Software Engineering Process. In C. Atkinson, C. Bunse, H.-G. Groß, and C. Peper (eds.): Embedded System Development with Components. Lecture Notes in Computer Science, Vol. 3778, ISBN 3-540-30644-7. Springer, 2005, Pages 737-748

Kaiser, B., Gramlich, C., Förster, M.: State-Event Fault Trees - A Safety Analysis Model for Software Controlled Systems. Submitted to Reliability Engineering & System Safety Journal

Förster, M., Kaiser, B.: Increased Efficiency in the Quantitative Evaluation of State/Event Fault Trees by Combination of DSPN- and BDD-Based Techniques. Submitted to 12th IFAC Symposium on Information Control Problems in Manufacturing
