CALCULATING ARCHITECTURAL RELIABILITY VIA MODELING ...

78 downloads 3346 Views 1MB Size Report
the sacrifices you have made for me to be able to complete this dissertation. And my ..... Appendix B: Sample Matlab Code for Component Reliability Estimation .
CALCULATING ARCHITECTURAL RELIABILITY VIA MODELING AND ANALYSIS

by Roshanak Roshandel

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2006

Copyright 2006

Roshanak Roshandel

Dedication

To Ava.

ii

Acknowledgment

I wish to express my gratitude and appreciation to my advisor, Professor Nenad Medvidovic. Under his supervision, I have developed and evolved professionally. I am thankful for his guidance, support, and direction in the past few years. I will forever be grateful for all he has taught me. My special thanks to other dissertation committee members – Professors Leana Golubchik, Barry Boehm, Michal Young, Andre van der Hoek, and Najmedin Meshkati – who have provided me with excellent guidance and support.

I would also like to especially thank Professor Andre van der Hoek from University of California, Irvine for his support and mentorship. He has been a wonderful friend and colleague, and I am grateful for everything. Dr. Jafar Adibi at USC’s Information Sciences Institute has been an immense source of support, guidance, enthusiasm, and friendship throughout this journey, and I will be forever grateful.

My friends and office mates in the software architecture group at USC – Architecture Mafia – Sam Malek, Marija Mikic-Rakic, Chris Mattmann, Vladimir Jakobac, and Ebru Dincel, I thank you for your friendship, and help during all these years. My special thanks to Somo Banerjee and Leslie Cheung for reading early versions of this dissertation and providing helpful feedback.

iii

To my husband and best friend Payman Arabshahi, I am forever thankful for your endless love, support and encouragements both before and during this process, and all the sacrifices you have made for me to be able to complete this dissertation. And my dearest Ava, you are the joy of my life every day. Thank you for hugs and kisses, and smiles and giggles. And thank you for letting maman do her work!

Last but not least, to my family for their support and encouragements – my parents Parvin Samei and Jalil Roshandel, my brother Rooein, and my in-laws Azra Sadri and Samad Arabshahi – Thank you! If it were not because of you, I could not have completed this endeavor!

iv

Table of Contents Dedication .......................................................................................................... ii Acknowledgment .............................................................................................. iii List of Tables .................................................................................................... vii List of Figures ................................................................................................... viii Abstract ............................................................................................................. xii Chapter 1: Introduction ..................................................................................... 1 1.1 1.2

Reliability of Software Architectures .......................................... 7 Research Hypotheses and Validation .......................................... 11

Chapter 2: Architectural Modeling for Reliability ............................................ 15 2.1 2.2 2.3 2.4 2.5 2.6

Example ....................................................................................... 18 Component Modeling .................................................................. 20 Our Approach .............................................................................. 23 Relating Component Models ....................................................... 33 Implications of the Quartet on Reliability ................................... 42 Defect Classification and Cost Framework ................................. 43

Chapter 3: Component Reliability .................................................................... 58 3.1 3.2 3.3

Classification of the Component Reliability Modeling Problem 62 Profile Modeling .......................................................................... 66 Reliability Prediction ................................................................... 81

Chapter 4: System Reliability ........................................................................... 94 4.1 4.2 4.3 4.4

Global Behavioral Model ............................................................ 96 Global Reliability Modeling ........................................................ 101 A Bayesian Network for System Reliability Modeling ............... 107 System Reliability Analysis ........................................................ 124 v

Chapter 5: Tool Support .................................................................................... 136 5.1 5.2 5.3

Mae .............................................................................................. 137 Component Reliability Modeling ................................................ 140 System Reliability Modeling ....................................................... 141

Chapter 6: Evaluation ........................................................................................ 142 6.1 6.2 6.3

Architectural Analysis and Defect Classification ........................ 144 Component Reliability Prediction ............................................... 148 System Reliability Prediction ...................................................... 173

Chapter 7: Related Work ................................................................................... 207 7.1

Architectural Modeling ............................................................... 207

7.2 7.3

Reliability Modeling ................................................................... 211 Taxonomy of Architectural Reliability Models .......................... 216

Chapter 8: Conclusion and Future Work .......................................................... 222 8.1 8.2

Contributions ............................................................................... 223 Future Work ................................................................................. 224

References ......................................................................................................... 228 Appendix A: Mae Schemas for Quartet Models ............................................... 240 Appendix B: Sample Matlab Code for Component Reliability Estimation ...... 248 Appendix C: SCRover Bayesian Network Generated by Netica ...................... 251

vi

List of Tables

Table 2-1. Table 3-1.

Sample Instantiation of the Cost Framework

.............................. 53

Classification of the Forms of the Reliability Modeling Problem Space ............................................................. 65

vii

List of Figures

Figure 1-1.

Our Approach to Local and Global Reliability Modeling

..........11

Figure 2-1.

Software Model Relationships within a Component.

Figure 2-2.

SCRover’s Software Architecture.

Figure 2-3.

Controller component’s Interface and Static Behavior View.

..25

Figure 2-4.

Model of Controller Component’s Dynamic Behavior View

...29

Figure 2-5.

Model of Controller Component’s Interaction Protocol View

..32

Figure 2-6.

Taxonomy of Architectural Defects

Figure 2-7.

The Radar Chart View for the Cost Framework

Figure 2-8.

Graphical View of the Cost Framework Instantiation for Different Defect Types ..............................................................................57

Figure 3-1.

Component Reliability Prediction Framework

Figure 3-2.

Controller’s Dynamic Behavior Model (Guards omitted for Brevity).....................................................................................68

Figure 3-3.

Formal Definition of AHMM

Figure 3-4.

Graphical View of the Controller’s Reliability Model

Figure 3-5.

Reliability Analysis Results for the Controller Component

Figure 4-1.

Our Approach to System Reliability Prediction

Figure 4-2.

View of SCRover System’s Collective Behavior

Figure 4-3.

SCRover’s Global Behavioral View in terms of Interacting Components ...............................................................................99

Figure 4-4.

Nodes of the SCRover’s Bayesian Network (Top), and Initialization and Failure Links Extension (Bottom) .................111

.................17

.............................................20

...........................................45 ........................55

..........................60

.....................................................78 ...............82 .......92

.........................95 ......................97

viii

Figure 4-5.

Interaction Links in SCRover’s Bayesian Network

...................112

Figure 4-6.

SCRover’s Final Bayesian Network Model

Figure 4-7.

Expanded View of the SCRover’s Dynamic Bayesian Network

Figure 4-8.

Summary of the BN’s Qualitative Construction Steps

Figure 4-8.

SCRover’s Bayesian Network (top) and the Expanded Bayesian Network (bottom) .......................................................................127

Figure 4-9.

Cumulative Effect of Different Failures in SCRover

...............................114

...............117

.................131

Figure 4-10. Recovery Probability for Different Failures (Inverse of Cost) Figure 4-11. Weighted Cumulative Effect of Failures for SCRover

116

..134

..............134

Figure 5-1.

Overall View of the Required Tools for Architectural Reliability Modeling and Analysis ............................................138

Figure 5-2.

Mae’s Architecture

Figure 6-1.

Mae Defect Detection Yield by Type

Figure 6-2.

Defects Detected by UML, Acme, and Mae (by Type and Number) ..............................................................................147

Figure 6-3.

Controller Component Reliability Analysis Based on Various Probabilities of Failures to the Two Failure States ....................152

Figure 6-4.

Cost-framework Instantiation for Different Defect Types based on data in Chapter 2 ...................................................................154

Figure 6-5.

Changes to a Random Component’s Reliability based on Different Failure Probabilities ...................................................154

Figure 6-6.

Predicted Reliability for an Arbitrary Component Given a Full Failure Probability Matrix (Left), and a Sparse Failure Probability Matrix (Right) .........................................................156

Figure 6-7.

Percentage of Changes in the Reliability Value of the Controller Component (5%, 10%, and 20% Noise) ...................158

.....................................................................139 ........................................146

ix

Figure 6-8.

Percentage Change in Reliability Value of Three Arbitrary Components with 5, 10, and 20 States (5% noise, 10% noise, and 20% noise, respectively) ...................................160

Figure 6-9.

Controller Component Reliability Based on Random and Expert Instantiation ................................................................................162

Figure 6-10. Arbitrary Component’s Full (Left) and Sparse (Right) Random Instantiation for Training Data Generation ................................163 Figure 6-11. Sensitivity Analysis for the Controller Component with Different Recovery Probabilities ...............................................166 Figure 6-12. Changes to the Probability of Recovery from Various Failure Types for an Arbitrary Component ............................................167 Figure 6-13. Controller Component Reliability w.r.t. Different Failure Recovery Probabilities ...............................................................................168 Figure 6-14. Controller Component Reliability w.r.t. A Full Range of Recovery Probability Values ......................................................................170 Figure 6-15. Sensitivity Analysis for an Arbitrary 10-state Component Figure 6-16. Effect of Total Elimination of Failures Figure 6-17. SCRover’s Bayesian Network

........171

......................................172

....................................................176

Figure 6-18. Changes to the Reliability of the SCRover System at Times t=0 and t=1 ......................................................................177 Figure 6-19. SCRover’s Reliability over Time based on its Dynamic Bayesian Network ......................................................................................179 Figure 6-20.

Updated Prediction of Reliability over Time based on New Evidence .....................................................................................180

Figure 6-21. Effect of Changes to Components’ Reliabilities on System’s Reliability ...................................................................................181 Figure 6-22. Changes to System Reliability Based on Different Component Reliability Values at Time Step t=1 ...........................................183 x

Figure 6-23. The Effect of Changes to System’s Reliability as the Reliability of the Startup Process Changes ..................................................185 Figure 6-24. Effect of Elimination of Particular Failures on SCRover System’s Reliability ...................................................................................186 Figure 6-25. OODT’s High Level Architecture

..............................................188

Figure 6-26. OODT’s Global Behavioral Model (top) and Corresponding BN (bottom) ...............................................................................189 Figure 6-27. OODT Model’s Sensitivity to Different Initial Component Reliabilities ................................................................................192 Figure 6-28. Changes in the OODT System’s Reliability as Components’ Reliabilities Change ...................................................................193 Figure 6-29. Effect of Changes to Components Reliabilities on System Reliability ...................................................................................194 Figure 6-30. Reliability Prediction of the OODT System Over Three Time Periods ........................................................................................195 Figure 6-31. Changes to the OODT’s Reliability based on Different Startup Process Reliability..........................................................................196 Figure 6-32. Eliminating the Probability of different Failures and Their Impact on System Reliability.........................................................197 Figure 6-33. Modeling Redundancy in OODT

...............................................199

Figure 6-34. Impact of Different Configurations on OODT System Reliability Figure 7-1.

Taxonomy of Architecture-based Reliability Models

200 ................217

xi

Abstract

Modeling and estimating software reliability during testing is useful in quantifying the quality of the software systems. However, such measurements applied late in the development process leave too little to be done to improve the quality and dependability of the software system in a cost-effective way. Reliability, an important dependability attribute, is defined as the probability that the system performs its intended functionality under specified design limits. We argue that reliability models must be built to predict the system reliability throughout the development process, and specifically when exact context and execution profile of the system is unknown, or when the implementation artifacts are unavailable. In the context of software architectures, various techniques for modeling software systems and specifying their functionality have been developed. These techniques enable extensive analysis of the specification, but typically lack quantification. Additionally, their relation to dependability attributes of the modeled software system is unknown.

In this dissertation, we present a software architecture-based approach to predicting reliability. The approach is applicable to early stages of development when the implementation artifacts are not yet available, and exact execution profile is unknown. The approach is two fold: first, the reliability of individual components is predicted via a stochastic reliability model built using software architectural artifacts. The uncertainty associated with the execution profile is modeled using Hidden Markov Models, xii

which enable probabilistic modeling with unknown parameters. The overall system reliability is obtained compositionally as a function of the reliability of its constituent components, and their complex interactions. The interactions form a causal network that models how reliability at a specific time in a system's execution is affected by the reliability at previous time steps.

We evaluate our software architecture-based reliability modeling approach to demonstrate that reliability prediction of software systems architectures early during the development life-cycle is both possible and meaningful. The coverage of our architectural analyses, as well as our defect classification is evaluated empirically. The component-level and system-level reliability prediction methodology is evaluated using sensitivity, uncertainty, and complexity, and scalability analyses.

xiii

Chapter 1: Introduction

The field of software architecture provides high-level abstractions for representing the structure, behavior, and key properties of a software system. Architectural artifacts are critical in bridging the gap between requirement specification and implementation of the system. In general, a particular software system is defined in terms of a collection of components (loci of computation) and connectors (loci of communication) as organized in an architectural configuration. Architecture description languages (ADLs) have been developed to aid architecture-based development [77]. ADLs provide formal notations for describing and analyzing software systems. Various tools for analysis, simulation, and code generation of the modeled systems usually accompany these ADLs. Examples of ADLs include C2SADEL [76], Darwin [71], Rapide [69], UniCon [118], xADL [25], and Wright [2]. A number of these ADLs also provide extensive support for modeling behaviors and constraints on the properties of components and connectors [77]. These behaviors and constraints can be leveraged to ensure the consistency of an architectural configuration throughout a system’s lifespan (e.g., by establishing conformance between the services of interacting components). In essence, architecture is the first step in which important decisions concerning the quality of the design are made. These decisions, in turn, directly influence dependability properties of the system under the development.

1

Software reliability is defined as the probability that the system will perform its intended functionality under specified design limits. Software reliability techniques are aimed at reducing or eliminating failures of software systems. Existing software reliability techniques are often rooted in the field of reliability engineering in general, and hardware reliability in particular. Such approaches provide significant experience in building reliability models, and advanced mathematical formalisms for analytical reasoning. However, they are not properly gauged toward today’s complex software systems and their specific challenges. In particular, existing software reliability techniques mainly address reliability modeling during a system’s testing. Similar to hardware engineering, they build models of the system’s failure behavior by observing its runtime operation. The reliability of the system is then measured by building formalisms that explain the failure behavior of the system. Such treatment of reliability measurement, prediction, or estimation reveals defects late in the development process. Defects detected earlier in the development life cycle are less costly to mitigate [13]. Consequently, following traditional reliability measurement as outlined above results in an increase in the overall development costs, and prevents understanding the influence of early architectural decisions on the system’s dependability. Reliability and other quality attributes must thus be built into a software system throughout the development process, and as an innate aspect of system design. This requires developing and/or adapting reliability models to predict and measure the reliability of a software system early on. After all, “you can’t control what you can’t measure”[24].

2

To clarify upcoming discussions, let us define some basic concepts: An error is a mental mistake made by the designer or programmer. A fault or a defect is the manifestation of that error in the system. It is an abnormal condition that may cause a reduction in, or loss of, the capability of a functional unit to perform a required function; it is a requirements, design, or implementation flaw or deviation from a desired or intended state [61]. Finally software failure is defined as the occurrence of an incorrect output as a result of an input value that is received, with respect to specification [101].

There are fundamental differences between the nature of failures in software and hardware systems. Consequently the reliability methods in the two field vary accordingly to accommodate these differences [70,101]. While the failure rate in hardware systems has a bathtub curve, the failure rate in a software system is statistically nonincreasing (not considering software evolution). In other words, a software system is not expected to become less reliable as time passes. Moreover, in a software system, failures never occur if the software is not used. This is not true of hardware systems where material deterioration can cause failures even though the system is not being used. Software reliability models are often analytical models derived from assumptions about the system, and the interpretation of those assumptions and model parameters. On the other hand, hardware reliability methods are usually derived from fitting specific distributions to failure data. This is done by extensive analysis as well as the domain experience. Finally, once defects in a software system are repaired, a new 3

piece of software is obtained. This is not true of hardware repairs, which typically restore the original system.

While the cause of hardware failures may be material deterioration, design flaws, and random failures, software failures may be caused by incorrect specification or design, human errors, and incorrect data. It is estimated that 85% of software defects are introduced during analysis and design alone, of which only

1 are detected in the same 3

phase [101]. Software architecture modeling techniques are used as an abstraction for representing software systems. Analytical reasoning about these models can be used to reveal a variety of design faults. Assuming the implementation artifacts are built in a manner that preserves the architectural design properties, early detection of these faults can help prevent their propagation into the final product, reduce the probability of failures, thus improving the reliability of the system as a whole, and in turn reducing the development costs.

In order to quantify and predict the reliability of software architectural models, a reliability model is needed that combines the result of architectural analyses (as failure behavior) with the context in which the software will be used. Since the exact context and operation profile of the system may not be known in advance, the reliability model should account for this uncertainty. Stochastic approaches to reliability modeling seem to be especially appropriate for these circumstances. Probabilistic reliability 4

models are widely used in all engineering disciplines, including software reliability during testing. However, for handling architectural reliability, they need to be specifically gauged to account for uncertainties associated with unknown operation profiles. Furthermore, given all the uncertainties, a single meaningful estimation of the reliability is not feasible. Instead a reliability prediction framework offering a range of analyses and predictions is more appropriate. Such a framework can be used in conjunction with standard design tools to quantify the effects of various design decisions on the reliability of the system throughout the development process.

Complex mathematical models have been developed for modeling uncertainty in other disciplines [19,49,103,104]. Such models leverage known data about the system, and solve the model to obtain unknown information. Examples of such models include Hidden Markov Models (HMM) [63] and Bayesian Networks (BNs) [49] that combine concepts from the fields of Graph Theory and Probability Theory. Our research leverages these two methodologies, and applies them to address reliability prediction of software architectures.

Our work focuses on both structural and behavioral aspects of a software system’s architecture, with the goal of addressing the following research question: Can we use the architectural model of a software system to predict meaningfully the reliability of an individual component, and consequently the reliability of the entire system?

5

Our approach for Calculating Architectural Reliability via Modeling and Analysis (CARMA) attempts to answer this research question. We hypothesize that models of architectural structure and behavior may be used as a basis for a stochastic reliability model. We further hypothesize that this reliability model can be used to predict individual component reliability, which then can be used to predict compositionally the overall system reliability. The accuracy of the estimated reliability depends on the richness of the architectural models: the more comprehensive and extensive the architectural model, the more accurate the reliability values obtained.

Furthermore, we hypothesize that this stochastic model can be parameterized based on the type of architectural defects, (e.g., their frequency, severity, etc.). It then can be used to identify defects that are more critical and cost-effective to fix. To validate these hypotheses and evaluate our overall research, we have developed and used an architectural modeling and analysis environment in the context of several case studies. We have also developed a reliability prediction framework for both componentlevel and system-level reliability prediction and analysis, and applied them to examples and case studies to demonstrate that reliability prediction of a software architecture is both meaningful and useful.

This dissertation research takes a step in closing the gap between qualitative representation of a system’s architecture on the one hand, and quantitative prediction of the

6

system’s reliability on the other hand. Particularly the contributions of this thesis include:

1. Mechanisms to ensure intra- and inter-consistency among multiple views of system’s architectural models, 2. A formal reliability model to predict both component-level and system-level reliability of a given software system based on its architectural specification, and 3. Parameterized and pluggable defect classification and cost-framework to identify critical defects, whose mitigation is most cost-effective in improving a system’s overall reliability. In the rest of this chapter, we describe our approach at a high-level, and discuss the hypotheses upon which this research is based.

1.1 Reliability of Software Architectures A survey of related literature in the area of architectural modeling and its connection to system reliability reveals that, despite the development of sophisticated architectural modeling techniques and their related analyses, proper quantification to measure system dependability at the level of software architecture is lacking. Formal modeling of software architecture is a complex and time consuming task. If such models cannot reveal and quantify potential defects, and systematically outline the effects of these defects on the overall system dependability, then their use may be considered to be 7

cost ineffective. As an answer, this thesis proposes an effective framework for architectural reliability prediction. The developed approach leverages architectural modeling and analysis, and enables sensitivity analyses that can be used to prescribe costeffective strategies to mitigate architectural defects.

1.1.1 Problem Description The goal of this research is to find a solution to the following problem:

Given architectural models of a system’s structure and behavior, 1. Analyze the reliability of individual components, in terms of the probability that each component performs its intended functionality successfully. 2. Analyze the reliability of the overall system’s architecture in terms of the probability that the system as a whole performs its intended functionality successfully. The system reliability is estimated in terms of the composition of and interactions among its constituent components (and their reliabilities). 3. Perform analysis to rank the components according to their effect on overall system reliability.

1.1.2 Approach In this thesis, we propose and evaluate a three-part solution to the problem of modeling and quantifying architecture-level reliability of software systems:

8

I. Multi-View Models. We advocate using a multi-view modeling approach called Quartet to comprehensively model the properties of components in a software system. The interface, static behavior, dynamic behavior and the interaction protocol views each represent and help to ensure different characteristics of a component. Moreover, the four views have complementary strengths and weaknesses with respect to their ability to characterize systems.

These views can be analyzed to detect possible inconsistencies both within a component’s models and across models of communicating components. The inconsistencies signify architectural faults or defects, which may cause a failure during the system’s operations and thus adversely affect the system’s reliability. The models also can be used as a basis for generating implementation-level artifacts. In Chapter 2, we introduce the details of each modeling view, and discuss an approach by which the consistency among these view can be preserved.

II. Component Reliability. We offer a framework to predict and analyze the reliability of individual components (referred to as Local Reliability) using a stochastic model based on the Quartet. The reliability model leverages the Hidden Markov Model formalism [104], and is built using the Quartet’s dynamic behavior view. The model estimates the component’s reliability in terms of the probability of successfully recovering from a failure occurring during the component’s operation. Local reliability is estimated as a function of the inter-consistency of the component’s models, and 9

the internal behavior of the component described as a state machine. This technique is discussed in detail in Chapter 3.

III. System Reliability. The technique for predicting and analyzing a system’s overall reliability (referred to as Global Reliability) is compositional in nature: the system’s reliability is estimated in terms of the reliabilities of its constituent components and their interactions. We leverage Bayesian Networks (BNs) [19,49], and model the interactions among components in terms of the causal relationships among their reliabilities; when a change of state in a component causes a change of state in another component, then the reliability of the second component depends on the reliability of the first one. This Bayesian model is then augmented with the notion of a failure state: any state in a component may result in a failure, so unreliability at each state can affect the probability of the system’s failure (i.e., its unreliability value). The model also leverages the estimated reliability of individual components (obtained from the Local Reliability estimation step), to estimate compositionally the reliability of the system.

A high-level conceptual view of our approach is depicted in Figure 1-1. We describe each of these steps and evaluate the underlying methodology in the following chapters. The rest of this section outlines the hypotheses upon which the research is based.

10

Architecture

Q “T ua he rt et ”

Q “T u a he rt et ”

Static Dynamic Behaviors Behaviors Static Behavio Static Dynamic rsInterface Behavio Behaviors Interfac rs e Interfac e Component Protocols Protoco Component ls Protocols Component

HMM Local Reliability Local Reliability

HMM Local Reliability

BN

Global Reliability

Figure 1-1. Our Approach to Local and Global Reliability Modeling

1.2 Research Hypotheses and Validation Our research is based on the following hypotheses.

Hypothesis 1. In order to predict software reliability at the architectural level, we need rich architectural models, as well as reliability models that do not rely on a running system’s operation profile. We hypothesize that models of architectural structure and behavior alone can be used to obtain a meaningful estimate of system reliability.

Hypothesis 2. A component’s internal behavior is traditionally modeled using dynamic behavioral models. Such models offer a continuous view of the component 11

and how it arrives at certain states during its execution. Additionally, models of components’ interaction protocols abstract away the details of internal component behaviors and focus on a component’s external interactions. We hypothesize that a stochastic model constructed based on the component’s dynamic behavioral model and its interaction protocols can be used to predict both component-level and then system-level reliability.

Hypothesis 3. Different cost values are associated with different classes of defects introduced during architectural design. Consequently, different classes of defects can affect the system and reduce its overall reliability in different ways. We hypothesize that our stochastic reliability estimation model can be parameterized for a set of cost factors associated with different defect types, which then can be used to identify defects that comparatively are more critical and cost-effective to fix.

Validation. The approach is evaluated by applying our architectural modeling, analysis, and reliability prediction framework to several case studies. Using these case studies, we demonstrate that our approach to reliability prediction and analysis of software architecture is both meaningful and useful. We show that it is meaningful by demonstrating its sensitivity to various model parameters. We further demonstrate that it is useful via a set of sensitivity analyses that demonstrate our results can aid in mitigating architectural defects and enhancing the quality of design in a cost-effective

12

manner. Our approach leverages architectural models of a system to construct component-level and system-level reliability models. This will validate our hypothesis 1.

In the context of our stochastic methods, we validate the model by varying the data provided as input to the models (e.g., type and number of defects in the case of component reliability estimation, and component reliability value in the case of system reliability estimation) to evaluate our hypothesis 2. In particular we will justify defect mitigation strategies enabled by our methodology, by leveraging principles of stochastic reliability modeling, and our parameterized cost-framework.

To further evaluate hypotheses 2 and 3, we use simulations that, given (1) an architectural configuration of components and connectors, (2) an arbitrary set of defects for each component, and (3) a particular interaction protocol for the system, estimates each component’s and the entire system’s reliability, and ranks the components based on their impact on the system reliability. The latter is done by leveraging a defect classification and the cost framework.

The rest of this dissertation is organized as follows. Chapter 2 presents our work in modeling and analysis of software architectures. Chapters 3 and 4 describe the details of our technique to estimating component-level and system-level reliability, respectively. Chapter 5 presents various tools used for modeling, analysis, and reliability estimations via our approach. Chapter 6 details our evaluation strategy, as well as the 13

results obtained in evaluating our work. Chapter 7 details the related work both in architectural modeling and analysis and in reliability estimation, and presents a classification of architecture-based reliability models. We conclude by summarizing the contributions of this thesis, and discussing the future research directions.

14

Chapter 2: Architectural Modeling for Reliability

Component-based software engineering has emerged as an important discipline for developing large and complex software systems. Software components have become the primary abstraction level at which software development and evolution are carried out. We consider a software component to be any self-contained unit of functionality in a software system that exports its services via an interface, encapsulates the realization of those services, and possibly maintains internal state. In the context of this research, we further focus on components for which information on their interfaces and behaviors may be obtained. In order to ensure the desired properties of component-based systems (e.g., dependability attributes such as correctness, compatibility, interchangeability, and functional reliability), both individual components and the resulting systems’ architectural configurations must be modeled and analyzed.

The role of components as software systems’ building blocks has been studied extensively in the area of software architectures [77,100,117]. While there are many aspects of a software component worthy of careful study (e.g., modeling notations [16,77], implementation platforms [1,2], evolution mechanisms [65,76]), we restrict our study to an aspect of dependability only partially considered in existing literature, namely, consistency among different models of a component. We consider this aspect from an architectural modeling perspective, as opposed to an implementation or runtime perspective. 15

The direct motivation for this work is our observation that there are four primary functional aspects of a software component: (1) interface, (2) static behavior, (3) dynamic behavior, and (4) interaction protocol. Each of the four modeling views represents and helps to ensure different characteristics of a component. Moreover, the four views have complementary strengths and weaknesses with respect to their ability to characterize systems. As detailed in Section 2.2, existing approaches to component-based development typically select different subsets of these four views (e.g., interface and static behavior [65], or interface and interaction protocol [133]). At the same time, different approaches treat each individual view in very similar ways (e.g., modeling static behaviors via pre- and post-conditions, or modeling interaction protocols via finite state machines).

The four views’ complementary strengths and weaknesses in system modeling, as well as their consistent treatment in the literature suggest the possibility of using them in concert. However, what is missing from this picture is an understanding of the different relationships among these different models within a single component. Figure 2-1 depicts the space of possible intra-component model relationship clusters. Each cluster represents a range of possible relationships, including not only “exact” matches, but also “relaxed” matches [136] between the models in question. Of these six clusters, only the pair-wise relationships between a component’s interface and its other modeling aspects have been studied extensively (relationships 1, 2, and 3 in Figure 2-1). 16

Static Behavior 1

6

Dynamic Behavior

4

2

Interface 5 3

Protocols

Figure 2-1. Software Model Relationships within a Component.

It is our intent to focus on completing the modeling space depicted in Figure 2-1. We present and discuss extensions to commonly used modeling approaches for each aspect, relate them to each other, and ensure their compatibility. We also discuss the advantages and drawbacks inherent in modeling all four aspects (the Quartet) and six relationships shown in Figure 2-1. As part of this dissertation, we focus on providing a framework for predicting architectural reliability, by addressing all the relationships shown in Figure 2-1. In this manner, several important long-term goals may be accomplished:



Enrich, and in some respects complete, the existing body of knowledge in component modeling and analysis,



Suggest constraints on and provide guidelines to practical modeling techniques, which typically select only a subset of the Quartet,

17



Provide a basis for additional operations on components, such as retrieval, reuse, and interchange [136],



Suggest ways of creating one (possibly partial) model from another automatically, and



Provide better implementation generation capabilities from such enriched system models.

In the rest of this chapter, we introduce a simple example that will be used throughout the dissertation to clarify concepts. We will also provide an overview of existing approaches to component modeling, introduce the Quartet, and discuss the relationships among the four modeling perspectives. These relationships along with more traditional architectural analyses such as type checking and consistency checking will be used as the core of our reliability modeling approach. The result of these analyses a set of architecture-level defects which in turn, may translate into failures during components’ operations. In order to distinguish and quantify the effect of each defect, we have developed a defect classification and a cost framework presented later in this chapter. The result of this quantification is used in our reliability models presented in Chapter 3 and Chapter 4.

2.1 Example Throughout this dissertation, we use a simple example of a robotic rover to illustrate the introduced concepts. The robotic rover, called SCRover, is designed and devel18

oped in collaboration with NASA’s Jet Propulsion Laboratory, and in accordance with their Mission Data System (MDS) methodology. To avoid unnecessary complexity, we discuss a simplified version of the application. Our focus is particularly on SCRover’s “wall following” behavior. In this mode, the rover uses a laser rangefinder to determine the distance to the wall, drives forward while maintaining a fixed distance from that wall, and turns both inside and outside corners when it encounters them. This scenario also involves sensing and controlled locomotion, including reducing speed when approaching obstacles.

The system contains five main components: controller, estimator, sensor, actuator, and a database. The sensor component gathers physical data (e.g., distance from the wall) from the environment. The estimator component accesses the data and passes them to the controller for control decisions. The controller component issues commands to the actuator to change the direction or speed of the rover. The database component stores the “state” of the rover at certain intervals, as well as when a change in the values happens. Figure 2-2 shows a high-level architecture of the system in terms of the constituent components, connectors, and their associated interfaces (ports): the rectangular boxes represent components in the system; the ovals are connectors; the dark circles on a component correspond to interfaces of services provided by the component/connector, while the light circles represent interfaces of services required by the component/connector. To illustrate our approach, we will specifically define the

19

Components

Actuator

Sensor Database

Provided Port Required Port Bi-directional Connector

mqmu

e

Uni-directional Connector

u e

Controller

mq mu

u

q n

q n

Estimator

Interface types measQuery:mq Query:q measUpdate:mu Notify:n Execution:e UpdateDB:u Legends

Figure 2-2. SCRover’s Software Architecture

static and dynamic behavioral models and protocols of interactions for the controller component in Section 2.3.

2.2 Component Modeling As previously discussed, our goal is to leverage the architectural models of software components to predict their architectural reliability. We will then use this estimated reliability along with models of components’ interaction to predict the overall reliability of software systems. Architectural models form the core of our reliability modeling approach. Analyzing these models reveals faults or defects. These faults negatively affect the reliability of individual components and in turn, adversely influences the overall system reliability.

An overview of related approaches to architectural modeling and analysis is provided in Chapter 7. In this chapter we focus on multiple functional modeling aspects of a single software component. We advocate a four-view modeling approach, called the 20

Quartet. Using the Quartet, a component’s structure, behavior, and its interaction with other components in the system can be described. Moreover, analysis of these models could reveal potential problems with the design and future implementation of the system. In the rest of this section, we will first discuss the four component aspects. We will use this discussion as the basis for studying the dependencies among these models and implications of maintaining their interconsistency in Sections 2.3 and 2.4.

2.2.1. Introducing the Quartet Traditionally, functional characteristics of software components have been modeled predominantly from the following four perspectives:

Interface modeling. Component modeling has been most frequently performed at the level of interfaces. Interface models specify the points by which a component interacts with other components in the system. Interface modeling has included matching interface names and associated input/output parameter types. However, software modeling solely at this level does not guarantee many important properties, such as interoperability or substitutability of components: two components may associate vastly different meanings with identical interfaces.

Static Behavior Modeling. Approaches to static behavior modeling describe the behavioral properties of a system discretely, i.e., at specific snapshots in the system’s execution. This is done primarily using invariants on the component states and pre21

and post-conditions associated with the components’ operations. Static behavioral specification techniques are successful at describing what the state of a component should be at specific points of time. However, they are not expressive enough to represent how the component arrives at a given state.

Dynamic Behavior Modeling. The deficiencies associated with static behavior modeling have led to a third group of component modeling techniques and notations. Modeling dynamic component behavior results in a more detailed view of the component and how it arrives at certain states during its execution. It provides a continuous view of the component’s internal execution details.

Interaction Protocol Modeling. The

last

category

of

component

modeling

approaches focuses on legal protocols of interaction among components. This view of modeling provides a continuous external view of a component’s execution by specifying the allowed execution traces of its operations (accessed via interfaces).

Typically, the static and dynamic component behaviors and interaction protocols are expressed in terms of a component’s interface model. For instance, at the level of static behavior modeling, the pre- and post-conditions of an operation are tied to the specific interface through which the operation is accessed. Similarly, the protocol modeling approach [133] uses finite state machines (FSMs) in which component interfaces serve as labels on the transitions. The same is also true of UML’s use of 22

interfaces specified in class diagrams for modeling event/action pairs in the corresponding statechart models. This is why we chose to place Interface at the center of the diagram shown in Figure 2-1.

2.3 Our Approach We argue that a complete functional model of a software component can be achieved only by focusing on all four aspects of the Quartet. At the same time, focusing on all four aspects has the potential to introduce certain problems that must be carefully addressed (e.g., large number of modeling notations that developers have to master, model inconsistencies). While we use a particular notation in the discussion below, the approach is generic such that it can be easily adapted to other modeling notations. It is note worthy that the concise formulations used throughout this chapter to clarify our definitions are not meant to serve as a formal specification of our model. Similar to regular expression, in this notation x+ denotes one or more repetition of x, x* denotes zero or more repetition of x, and x? denotes optional (zero or one) instance of x, where x is a model element.

A component model is defined as follows: Component_Model: (Interface, Static_Behavior, Dynamic_Behavior, Interaction_Protocol);

23

2.3.1. Interface Interface modeling serves as the core of our component modeling approach and is extensively leveraged by the other three modeling views. A component’s interface has a type and is specified in terms of one or more interface elements. Each interface element has a direction, a name (method signature), a set of input parameters, and possibly a return type (output parameter). The direction indicates whether the component requires (+) the service (i.e., operation) associated with the interface element or provides (-) it to the rest of the system. In other words: Interface: (Type, Interface_Element+);

Interface_Element: (Direction, Method_signature, Input_parameter*, Output_parameter?);

In the context of the SCRover example discussed in Section 2.1, the controller component exposes four interfaces through its four ports: e, u, q, and n, correspond to Execution, UpdateDB, Query, and Notification interface types, respectively (recall Figure 2-2). Each of these interfaces may have several interface elements associated with them. These interface elements are enumerated in Figure 2-3. Examples include the getWallDist and executeSpeedChange interface elements defined below

24

Interface View Interface types e: Execution; q: Query; n: Notify; u: UpdateDB;

Ports: prov: {q:Query}; req: {n:Notify, u:UpdateDB, e:Execution}; Interfaces: u: + setDefaults(); e: + executeSpeedChange (speed:speedType); e: + executeDirChange (dir:DirType); n: + notifyDistChange():DistType; n: + notifySpeedChange():SpeedType; n: + notifyDirChange():DirType; q: - getWallDist():DistanceType;

Static Behavior View

StateVariable: mode:Integer; dist:DistanceType; speed:SpeedType; dir:DirType; Invariant: //off=0, on=1, halt=2, failure=3 {0 ≤ mode ≤ 3 AND 0 ≤ dist};

Operations: op_getWallDist{ preCond: {dist ≥ 0}; postCond: {result=dist}; mapped_interfaces: {getWallDist}; } op_notifyDistChange{ postCond: {result=~dist); mapped_interfaces: {notifyDistChange}; } op_notifyDirChange{ postCond: {result=~dir); mapped_interfaces: {notifyDirChange}; } op_notifySpeedChange{ postCond: {result=~speed); mapped_interfaces:{notifySpeedChange}; } op_setDefaults{ preCond: {dist > 100}; postCond: {~speed > 100 AND ~dir = 0}; mapped_interfaces: {setDefaults}; } op_executeSpeedChange{ preCond: {val speed}; mapped_interfaces: {setDefaults}; } op_executeDirChange{ preCond: {val 0}; mapped_interfaces: {setDefaults}; }

Figure 2-3. Controller component’s Interface and Static Behavior View.

25

u: +executeSpeedChange(speed: SpeedType); q: -getWallDist():DistanceType;

where executeSpeedChange() is an interface element of type UpdateDB, required by the controller component. Its input parameter speed is of user-defined type SpeedType and it has no return value. Similarly, getWallDist() is provided by the controller component, is an interface element of type Query, takes no input parameters, and returns a value of type DistanceType.

2.3.2. Static Behavior We adopt a widely used approach for static behavior modeling [65], which relies on first-order predicate logic to specify functional properties of a component in terms of the component’s state variables, invariants (constraints associated with the state variables), and operations (accessed via interfaces). Each operation is mapped to one or more interface element (as modeled in the interface model), and specifies corresponding pre- and post-conditions. Static_Behaviors: (State_Variable*, Invariant*, Operation+);

State_Variable: (Name, Type);

26

Invariant: (Logical_Expression);

Operation: (Interface_Element+, Pre_Cond*, Post_Cond*);

Pre/Post_Cond: (Logical_Expression);

Interface and static behavior views of the SCRover’s controller component are depicted in Figure 2-3. The specification details the interface types, instances, and associated operations for performing various component’s functionality to query the distance from obstacles, enacting changes in the rover’s speed or direction, and notifying other components about these changes. The pre- and post-conditions are used to specify conditions that must be true immediately prior to, or right after an operation is invoked. For instance in the case of op_setDefaults, the new value of the variables dist, speed, and dir (denoted by ~ followed by the variable name) is specified to be within a certain range.

2.3.3. Dynamic Behavior A dynamic behavior model provides a continuous view of the component’s internal execution details. Variations of state-based modeling techniques have often been used to model a component’s internal behavior (e.g., in UML). Such approaches describe the component’s dynamic behavior using a set of sequencing constraints that define 27

legal ordering of the operations performed by the component. These operations may belong to one of two categories: (1) they may be directly related to the interfaces of the component as described in both interface and static behavioral models; or (2) they may be internal operations of the component (i.e., invisible to the rest of the system such as private methods in a UML class). To simplify our discussion, we only focus on the first case: publicly accessible operations. The second case may be reduced to the first one using the concept of hierarchy in statecharts: internal operations may be abstracted away by building a higher-level state-machine that describes the dynamic behavior only in terms of the component’s interfaces.

A dynamic behavior model serves as a conceptual bridge between the component’s model of interaction protocols and its static behavioral model. On the one hand, a dynamic behavior model serves as a refinement of the static behavior model as it further details the internal behavior of the component. On the other hand, by leveraging a state-based notation, a dynamic behavior model may be used to specify the sequence by which a component’s operations get executed. A rich description of a component’s dynamic behavior is essential to achieving two key objectives. First, it provides a rich model that can be used to perform sophisticated analysis and simulation of the component’s behavior. Second, it can serve as an important intermediate level model to generate implementation level artifacts from the architectural specification.

28

executeSpeedChange/ notifySpeedChange

getWallDist [dist > 0]/ notifyDistChange /notifyDistChange

executeDirChange/ notifyDirChange

getWallDist [dist > 0]/ notifyDistChange

normal

setDefaults

[dist > 0]/ notifyDistChange setDefaults [dist>100]

init

getWallDist [dist State)+);

State: (Name);

Transition (Label);

An interaction protocol for the controller component is shown in Figure 2-5. Starting from S1, the model specifies different sequences of events that are acceptable by the component. For example, one or more getWallDist may be followed by an executeDirChange event. Note that the transition from S3 to S1 has no label. In other words, there is no event that corresponds to this transition. Similarly, the transitions from S1 32

to S2 and one of the transitions from S2 to S2 do not have an event associated with them. These transitions all correspond to a special event called True event, where no stimuli are needed for it to be triggered.

2.4 Relating Component Models Modeling components of complex software systems from multiple perspectives is essential in capturing a multitude of structural, behavioral, and interaction properties of the system under development. Emergence of dependable systems as a result of following a rigorous modeling and design phase, requires ensuring the consistency among different modeling perspectives [8,32,37,38,51]. Our approach addresses the issue of model consistency in the context of component and system modeling using the Quartet.

In order to ensure the consistency among the models, their inter-relationships must be understood. Figure 2-1 depicts the conceptual relationships among these models. We categorize these relationships into two groups: syntactic and semantic. A syntactic relationship is one in which a model (re)uses the elements of another model directly and without the need for interpretation. For instance, interfaces and their input/output parameters (as specified in the interface model) are directly reused in the static behavior model of a component (relationship 1 in Figure 2-1). The same is true for relation-

33

ships 2 and 3, where the dynamic behavior and protocol models (re)use the names of the interface elements as transition labels in their respective finite state machines.

Alternatively, a semantic relationship is one in which modeling elements are designed using the “meaning” and interpretation of other elements. That is, specification of elements in one model indirectly affects the specification of elements in a different model. For instance, an operation’s pre-condition in the static behavior model specifies the condition that must be satisfied in order for the operation to be executed. Similarly, in the dynamic behavior model, a transition’s guard ensures that the transition will only be taken when the guard condition is satisfied. The relationship between a transition’s guard in the dynamic behavior model and the corresponding operation’s pre-condition in the static behavior model is semantic in nature: one must be interpreted in terms of the other (e.g., by establishing logical equivalence or implication) before their (in)consistency can be established. Examples of this type of relationship are relationships 4 and 5 in Figure 2-1.

In the remainder of this section we focus in more detail on the six relationships among the component model Quartet depicted in Figure 2-1.

2.4.1. Relationships 1, 2, and 3 — Interface vs. Other Models The interface model plays a central role in the design of other component models. Regardless of whether the goal of modeling is to design a component’s interaction 34

with the rest of the system or to model details of the component’s internal behavior, interface models will be extensively leveraged.

When modeling a component’s behaviors from a static perspective, the component’s operations are specified in terms of interfaces through which they are accessed. As discussed in Section 2.3, an interface element specified in the interface model is mapped to an operation, which is further specified in terms of its pre- and post-conditions that must be satisfied, respectively, prior to and after the operation’s invocation.

In the dynamic behavior and interaction protocol models, activations of transitions result in changes to the component’s state. Activation of these transitions is caused by internal or external stimuli. Since invocation of component operations results in changes to the component’s state, there is a relationship between these operations’ invocations (accessed via interfaces) and the transitions’ activations. The labels on these transitions (as defined in Section 2.3) directly relate to the interfaces captured in the interface model.

The relationship between the interface model and other models is syntactic in nature. The relationship is also unidirectional: all interface elements in an interface model may be leveraged in the dynamic and protocol models as transition labels; however, not all transition labels will necessarily relate to an interface element. For example, in the controller’s dynamic behavior view, transition logState corresponds to an internal 35

event used to log various parameters in the system when the controller is in the emergency state. Our (informal) discussion provides a conceptual view of this relationship and can be used as a framework to build automated analysis support to ensure consistency among the interface and remaining three models within a component.

2.4.2. Relationship 4 — Static Behavior vs. Dynamic Behavior An important concept in relating static and dynamic behavior models is the notion of state in the dynamic model and its connection to the static specification of component’s state variables and their associated invariant. Additionally, operation pre- and post-conditions in the static behavior model and transition guards in the dynamic behavior model are semantically related. We have identified the ranges of all such possible relationships. The corresponding concepts in the two models may be equivalent, or they may be related by logical implication. Although their equivalence would ensure their inter-consistency, in some cases equivalence may be too restrictive. A discussion of such cases is given below.

Transition Guard vs. Operation Pre-Condition. At any given state in a component’s dynamic behavior model, multiple outgoing transitions may share the same label, but with different guards on the label. In order to relate an operation’s pre-condition in the static model to the guards on the corresponding transitions in the dynamic model, we first define the union guard (UG) of a transition label at a given state. UG is the disjunction of all guards G associated with outgoing transitions that 36

carry the same label: n

UG = ∨ G i i =1

where n is the number of outgoing transitions with the same label at a given state, and Gi is the guard associated with the ith transition.

As an example in Figure 2-4, the dynamic model is designed such that different states (normal and emergency) are going to be reachable as destinations of the getWallDist() transition depending on the distance of the encountered obstacle (dist variable in the transition guards). In this case at state normal we have:

UGgetWallDist = (dist > 100) OR (0 < dist ≤ 100)

Clearly, if the UG is equivalent to its corresponding operation’s pre-condition, the consistency at this level is achieved. However, if we consider the static behavior model to be an abstract specification of the component’s functionality, the dynamic behavioral model becomes a concrete realization of that functionality. In that case, if the UG is stronger than the corresponding operation’s pre-condition, the operation may still be invoked safely. The reason for this is that the UG places bounds on the operation’s (i.e., transition’s) invocation, ensuring that the operation will never be invoked under circumstances that violate its pre-condition; in other words, the UG should imply the corresponding operation’s pre-condition. This is the case for the getWallDist() operation in the rover’s controller component. 37

State Invariant vs. Component Invariant. The state of a component in the static behavior specification is modeled using a set of state variables. The possible values of these variables are constrained by the component’s invariant. Furthermore, a component’s operations may modify the state variables’ values, thus modifying the state of the component as a whole. The dynamic behavior model, in turn, specifies internal details of the component’s states when the component’s services are invoked. As described in Section 2.2, these states are defined using a name, a set of variables, and an invariant associated with these variables (called state’s invariant). It is crucial to define the states in the dynamic behavior state machine in a manner consistent with the static specification of component state and invariant.

Once again, an equivalence relation among these two elements may be too restrictive. In particular, if a state’s invariant in the dynamic model is stronger than the component’s invariant in the static model (i.e., state’s invariant implies component’s invariant), then the state is simply bounding the component’s invariant, and does not permit for circumstances under which the component’s invariant is violated. This relationship preserves the properties of the abstract specification (i.e., static model) in its concrete realization (i.e., dynamic model) and thus may be considered less restrictive than equivalence. A simple case is that of the state normal and its invariant in the controller component. Relating the invariant of the state normal, and the controller component invariant we have:

38

normalinv:(0 ≤ dir < 360) AND (0 < speed < 500) AND (dist > 100) Controllerinv:(0 ≤ dir < 360) AND (0 < speed < 1000) AND (dist ≥ 0) normalinv ⇒ Controllerinv

State Invariants vs. Operation Post-Condition. The final important relationship between a component’s static and dynamic behavior models is that of an operation’s post-condition and the invariant associated with the corresponding transition’s destination state. For example, in Figure 2-3, the post-condition of the op_setDefaults operation is specified as:

op_setDefaultsPost:(~speed > 100) AND (~dir = 0)

while state normal is a destination state for setDefaults() and we have:

normalinv:(0 ≤ dir < 360) AND (0 < speed < 500) AND (dist > 100)

In the static behavior model, each operation’s post-condition must hold true following the operation’s invocation. In the dynamic behavior model, once a transition is taken, the state of the component changes from the transition’s origin state to its destination state. Consequently, the state invariant constraining the destination state and the operation’s post-condition are related. Again, the equivalence relationship may be unnecessarily restrictive. Analogous to the previous cases, if the invariant associated with a transition’s destination state is stronger than the corresponding operation’s post-condition (i.e., destination state’s invariant implies the corresponding operation’s postcondition), then the operation may still be invoked safely. As an example consider the 39

specification of state normal and operation op_setDefaults shown above. Clearly, the appropriate implication relationship does not exist. The op_setDefaults operation may assign the value of the variable speed to be greater than 500. Such assignment could result in a fault in the component, which in turn, could negatively affect the component’s dependability.

2.4.3. Relationship 5 — Dynamic Behavior vs. Interaction Protocols The relationship between the dynamic behavior and interaction protocol models of a component is semantic in nature: the concepts of the two models relate to each other in an indirect way.

As discussed in Section 2.2, we model a component’s dynamic behavior by enhancing traditional FSMs with state invariants. Our approach to modeling interaction protocols also leverages FSMs to specify acceptable traces of execution of component services. The relationship between the dynamic behavior model and the interaction protocol model thus may be characterized in terms of the relationship between the two state machines. These two state machines are at different granularity levels however: the dynamic behavior model details the internal behavior of the component based on both internally- and externally-visible transitions, guards, and state invariants; on the other hand, the protocol model simply specifies the externally-visible behavior of the component. In the case of the SCRover models for instance, the dynamic behavior model contains a transition logState used to log the status of the 40

component while in the emergency state. This is an internal operation of the component and thus is not visible to the other components through interfaces, and as such is not modeled in the interaction protocol model.

Our goal here is not to define a formal technique to ensure the equivalence of two arbitrary state machines. This task cannot be done for models of different granularity like ours, and thus first require some calibration of the models to make them comparable. Moreover, several approaches have studied the equivalence of statecharts [6,72,133]. Instead, we provide a more pragmatic approach to ensure the consistency of the two models. We consider the dynamic behavior model to be the concrete realization of the system under development, while the protocol of interaction provides a guideline for the correct execution sequence of the component’s interfaces. For example, recall models of the controller component specified in Figure 2-4, and Figure 25. Assuming that the interaction protocol model demonstrates all the valid sequences of operation invocations of the component, it can be deduced that multiple consecutive invocations of setDefaults() are permitted. However, based on the dynamic model, only one such operation is possible. Consequently, the dynamic and protocol models are not equivalent. Since the controller component’s dynamic behavior FSM is less general than its protocol FSM, some legal sequences of invocations of the component are not permitted by the component’s dynamic behavior FSM. Such inconsistencies in the models of the components, may contribute to a fault in the implementation, which in turn may impact the component’s dependability. 41

2.4.4. Relationship 6 — Static Behavior vs. Interaction Protocol The interaction protocol model specifies the valid sequence by which the component’s interfaces may be accessed. In doing so, it fails to take into account the component’s internal behavior (e.g., the pre-conditions that must be satisfied prior to an operation’s invocation). Consequently, we believe that there is no direct conceptual relationship between the static behavior and interaction protocol models. Note, however, that the two models are related indirectly via a component’s interface and dynamic behavior models.

2.5 Implications of the Quartet on Reliability The goal of our work is to support modeling architectural aspects of complex software systems from multiple perspectives and to ensure the inter- and intra- consistency among these models. Such consistency is critical in building dependable software systems, where complex components interact to achieve a desired functionality. Dependability attributes must therefore be “built into” the software system throughout the development process, including during the architecture phase. The Quartet serves as the central piece of our reliability models. The analyses enabled by the Quartet to ensure intra- and inter-consistencies among the models reveal defects that may cause failures during components operations. These failures in turn, result in reducing the reliability of components and consequently the reliability of the system.

42

The next two chapters of this dissertation describe our approach to estimating the reliability of individual components and the overall reliability of the system. In order to incorporate the result of architectural analyses, we need to quantify the influence of each defect on component’s and system’s operations. To do this, we have developed an architectural defect classification along with a pluggable cost-framework that take domain-specific information into consideration when quantifying the defects. Our reliability models then leverage the Quartet views as well as the quantification results, to estimate both components’ and system’s reliability.

2.6 Defect Classification and Cost Framework Architectural models represent properties of the system, from its high-level structure in terms of its constituent components and their configuration, to the low-level behavior of its constituent components. Analyses of these models reveal defects that may result in component-level and system-level failures. These failures adversely affect the reliability of the system.

The nature of these defects ranges from structural issues to behavioral problems. Some may result in a catastrophic failure, while others may cause simple discrepancies in the operation of components and their interactions. For example, a unit mismatch in NASA’s JPL Mars Climate Orbiter mission in 1999, resulted in the loss of the spacecraft with the total cost of over $300 million. This mismatch is considered to 43

be a behavioral problem, where two communicating components exchanged data using two different units of measurement. While in a safety-critical system no failure may be tolerable, in a different domain the same type of failure may only cause minor problems with the system’s operation. All these factors affect how consequences of defects must be measured and incorporated when modeling the reliability of a system. We have developed a taxonomy of architectural defects that helps us classify various defects discovered during the architectural modeling and analysis phase. We use this taxonomy in conjunction with a pluggable cost framework to quantify the effect of specific defects on the reliability of a component. Both of these approaches to defect classification and quantification are pluggable in the context of our reliability models: other relevant techniques may be substituted instead. In this section, first we describe the defect classification in detail, and then introduce our cost framework.

2.6.1 Taxonomy of Architectural Defects Our experience with several architectural modeling and analysis techniques [14,110] enabled us to identify the existence of a pattern to the types of defects various modeling approaches attempt to reveal. This led us to develop a taxonomy of architectural defects that is applicable to a wide range of design and architectural problems, and is independent of the specific modeling approach adopted. The result is depicted in Figure 2-6.

44

Architectural Defect Topological Error

Directional Usage Structural Incomplete Behavioral Inconsistency

Interface

Signatures

Static Behavior

Pre/Post Conditions

Protocol

Interaction Protocols

Figure 2-6. Taxonomy of Architectural Defects

At its top level, the taxonomy classifies architectural defects as Topological errors or Behavioral inconsistencies. Topological errors tend to be global to the architecture and are concerned with aspects related to the configuration of components and connectors in the system. They are often a result of the violation of constraints imposed by architectural styles.

Some topological errors are directional in nature: the specific direction of communication required by the style is violated. An example is when in a Client-Server system, the server component requests services from the client. In our experience with modeling the SCRover system [110], an instance of this error was detected as follows. 45

Recall the SCRover example introduced in Chapter 2. The controller component issues commands to the actuator to change the direction or speed of the rover. In other words, the controller requires certain functionality that the actuator provides. A directional mismatch between the two components was revealed that reversed this relationship: the actuator relied on the controller to provide needed functionality. This directly violates the Mission Data System’s (MDS) architectural style using which SCRover is designed and developed.

Other topological errors are structural in nature and are further divided into usage violations and incompleteness of the specification. An example of a usage violation is when a communication link between two components is missing, or alternately, when a communication link between components exists where it should not be present (i.e., an incorrect use of the resources in the system). An SCRover related example is when, due to a design error, the actuator component directly modifies the values in the database.

The last type of topological error relates to the incompleteness of the specification, and manifests itself when there is insufficient information for specifying the properties of the architecture’s components and connectors.

Behavioral inconsistencies are the second category of architectural defects that are local to a component. An interface defect occurs when the signatures of the corre46

sponding provided and required services of two components are mismatched. For example, in the case of the SCRover’s controller component discussed in Chapter 2, the controller component provides an interface of type query that queries the estimator component for the distance to the wall or other obstacles. The corresponding interface element is defined as follows:

q: -getWallDist():DistanceType;

The returned value in this case is of type DistType. If the estimator component that requires this service, expects the change in the distance to an obstacle via a different user defined type such as LengthType as shown below, then there is a signature mismatch between the provided and required services. As a result of this mismatch, the query cannot be processed by the estimator component:

q: +getWallDist():LengthType;

A static behavioral inconsistency is concerned with mismatches between the pre- and post-conditions of corresponding provided and required services in two components. For example in the case of the SCRover’s controller component, the setDefaults service required by the component, requires the value of the dist state variable to be greater than 100 as its pre-condition:

op_setDefaults{ preCond: {dist > 100}; postCond: {~speed > 100 AND dir = 0}; mapped_interfaces: {setDefaults}; }

47

If the corresponding service in the database component that provides this service to the controller component can assume speed values of greater than 120, then communication between the two components may exhibit some problems.

For instance the controller component may send a setDefaults request when the distance of the rover is 105 (units) from an obstacle. The database component will not be able to process this request, because according to its specification, the dist value has to be at least 120. This would be an example of a pre- and post-condition mismatch. Principles upon which this type of analysis is based on can be found in [76, 136].

Finally, a protocol inconsistency reveals mismatched interaction protocols among communicating components. For example according to the controller component’s interaction protocol model depicted in Figure 2-5, upon instantiation the component can either react to a getWallDist event generated by the component itself, or could react to a notifyDistChange action (generated by the estimator component). According to this model, the controller will not be able to react a setDefaults event upon instantiation. This would indicate that if the database component model would only react to a setDefaults request upon instantiation, the two components may be unable to communicate as intended. This is an example of a protocol mismatch between the two components.

48

Our classification framework here is one that was developed experimentally based on our experience in a collaborative project between NASA’s JPL, University of Southern California, and Carnegie Mellon University [110]. Our reliability prediction approach leverages this specific classification, but in essence is independent from this specific taxonomy. Classifying the defects using a taxonomy can help the architect distinguish among different classes of defects (and the possible subsequent failures), and provide a basis for quantifying the influence of each defect on the component’s reliability. We have also developed a simple cost framework (described next) to quantify this influence according to cost factors applicable to the specific domain.

2.6.2 Cost Framework An architectural defect classification such as the one presented here helps the designer to distinguish among different types of defects in the system. These defects, at the level of architectural specification, may translate to failures during the component’s operation. We assume that these failures are all recoverable failures: it is possible for a component to recover from them during its operation. The recovery may be automatic (e.g., self-healing systems), or may require human intervention. This assumption does not pose any limitation on our approach, since both recoverable and non-recoverable failures may be represented in our models: a non-recoverable failure is a failure for which the probability of recovery to a non-failure state is zero. Since our approach is only concerned with the probability of occurrence or recovery from

49

failures, the specific recovery techniques and associated processes are outside the scope of this research.

The probability that a component recovers from a certain type of failure (e.g., a failure resulting from a protocol mismatch between two components) depends on many parameters. Examples include the impact of a component failure on other components’ operations and system operations, the automatic correction and adaptation mechanisms built into the system, manual error handling procedures, as well as the effort, time, and the cost associated with the recovery process. These parameters are highly domain and application dependent, and as such must be designed and adjusted for each domain, specifically by a domain expert. For example, the types of relevant parameters and their associated values in the computer games domain, would be very different from those in a safety critical system where human lives are at stake. Perhaps, the former primarily takes economical aspect into consideration (such as time and resources), whereas the latter may take a wider view and incorporate various parameters that measure the impact and risks on human lives associated with failures. We call these various parameters cost factors, and leave designation of different cost factors to the domain expert. Much research has focused on developing a comprehensive set of cost factors [12,46,116]. While we select a few for our analysis, designation of various factors, and justification of their instantiation is not an integral part of our reliability modeling, and as such is beyond the scope of this research.

50

In order to estimate the probability of recovery from each type of failures, we first estimate the cost of such recovery. Using a mathematical cost function we incorporate the values of all cost factors, and derive a single cost value associated with the recovery from each failure type (cost). The recovery probability is then calculated as the complementary probability of the cost value (1-cost): the higher the cost of recovery, the lower the probability of recovery, and vice-versa. We now present the details of our cost framework, as well as a specific adaptation of it applied to the SCRover system.

Let us assume:

G

G

θ : Set of all cost factors defined by the expert. θ = {θ1 ,...,θ n }

In our adaptation for the SCRover project, we define four cost factors that influence the probability of recovery from failures: severity of a defect, the effort required for its mitigation, the impact of a particular defect on the environment, and the development team’s expertise. In other words: G θ = {θ1 , θ 2 , θ3 ,θ 4 }

θ1 : severity of the defect θ 2 : efforts needed for mitigation of the defect θ3 : impact θ 4 : team expertise

51

Once these cost factors are defined, for each defect type, a numerical value in the range of [0,1] must be assigned to each factor by the domain expert. The domain expert may be able to obtain evidence from past experiences with earlier versions of the application, or may use her professional judgement to assign these values.

A sample instantiation of this framework in conjunction with our defect classification is shown in Table 2-1. According to this instantiation, the severities of the usage, incomplete, and signature type of defects are all considered to be the same and are

assigned a 0.9 value, while the pre-/post-condition and interaction protocol mismatches are considered to be less severe (0.7 and 0.6 respectively). This could be justified by considering that each of the usage, incomplete, and signature types of defects results in complete inability of the related components to communicate, whereas the other two defect types would only indicate that under some circumstances there may be a problem with the components’ interaction. Furthermore, while the interaction protocol defect is assumed to be the least severe type of defect, it is designated as the one that requires the most effort for mitigation. This goes back to the nature of protocol mismatches. Identifying and mitigating this type of defect can be potentially very difficult, and typically requires a lot of effort. Furthermore, this instantiation specifies the impact level for each particular defect type on the environment (other components and the system as whole). In this particular case, the impact of the pre-/post-condition and interaction protocol defects is assumed to be lower than those of usage, incomplete, and signature defects. The justification is that in the 52

Table 2-1 Sample Instantiation of the Cost Framework

Usage

Incomplete

Signature

Pre/Post Condition

Interaction Protocol

Severity

0.9

0.9

0.9

0.7

0.6

Effort

0.4

0.7

0.2

0.4

0.8

Impact

0.9

0.9

0.9

0.65

0.5

Expertise

0.75

0.75

0.75

0.75

0.75

former case, a mitigation solution may not involve other components that communicate with this component. However, in the case of usage, incomplete, and signature defects, it is more likely that the fix involves more than the defective components. Finally, we assume the expertise of the development team to have a fixed value across all defect types. This factor could vary depending on the qualification of the development team (with 1 denoting highly qualified), and could possibly be different across different defect types, if they are assigned to different development teams.

The last step in the recovery cost estimation is to define the appropriate cost function that incorporates various cost factors to calculate the recovery probability for a given defect type. Intuitively, this cost function has to be specifically designed for the domain by taking various domain-specific concerns into consideration. For instance, a cost function for a video game software is probably quite different from one used in a safety-critical system. Other socio-economical and cultural factors, such as the number of developers responsible for defect mitigation and quality assurance, the 53

time to delivery, the distribution of the development team (e.g., offshoring) would all pose special circumstances, which may prevent a single cost function to be applicable to a variety of scenarios. Research into the selection of cost functions is beyond the scope of this dissertation. Instead we offer a simple technique to incorporate all cost factors into a single number. Alternatively, other cost functions may be used at this stage. Our reliability prediction models are oblivious to the specific cost functions used for quantification.

The role of individual cost factors may vary in the overall cost estimation. For instance in our case, as the severity, effort, or the impact of a defect increases, it is expected that the overall cost of recovery would increase, resulting in a lower recovery probability. In some cases however, this relationship may be reversed. For example, an increase in the development team expertise could indicate a lower cost for recovery. In these circumstances, to avoid confusion, we suggest using the complementary value of the cost factor into consideration. In our case, we use 1-expertise value in our cost estimation.

We use a Radar chart (aka Polar chart) to plot values of various cost factors. Each cost factor is plotted along an axis. The number of axes is equal to the number of designated cost factors, and the angle between all axes are equal. Figure 2-7 depicts our instantiation of the Radar chart. Four axes represent severity, effort, impact, and expertise. Each axis has a maximum length of 1, which is consistent with cost factors

54

severity 1

1

0

expertise*

1

effort

1

impact

Figure 2-7. The Radar Chart View for the Cost Framework

taking a value between 0 and 1. A point closer to the center on any axis depicts a low value, while a point near the edge of the circle depicts a high value for the corresponding cost factor.

Radar charts are useful when incorporating several indicators (cost factors) related to one item (a defect type). The cumulative effect of the cost factors can be calculated by finding out the surface area formed by the axes. The expertise factor is marked with a * which indicates that it has an inverse effect on the cost estimation (i.e., the value of

1-expertise is used in the area calculation). This decision is made to make the influence of changes to the cost value consistent and intuitive.

We use a triangulation method to calculate this surface area. The overall area is divided into triangles that are formed between two consecutive axes and the line connecting two points on the axes. Figure 2-7 depicts the four triangles formed by the 55

four values on the axes. Assuming that the values for the four cost factors

θ1 , θ 2 ,θ3 , and θ 4 are denoted by τ 1 ,τ 2 ,τ 3 , and τ 4 respectively, the overall area is estimated using the following formula: 3 1 2π area = × sin( )[∑ (τ i × τ i +1 ) + τ 4 ×τ 1 ] num i =1 4

where num is the number of cost factors (number of axes) in the Radar chart, and the angle between each two axes is π / 2 .

Charts corresponding to the instantiation presented in Table 2-1 are shown in Figure 2-8 (top). Each chart corresponds to a defect type, and the area under the surface for each chart corresponds to the calculated cost of recovery for each defect type. The calculated recovery probability based on the cost of recovery is shown in the bottom diagram of Figure 2-8.

One particular characteristic of the Radar chart is that equal weights are assigned to all cost factors. However, under other circumstances, the importance of each cost factor may vary. It is thus debatable if it is adequate to treat all cost factors as having the same importance. In those cases, our area calculation formula may be adapted to incorporate different weight values when incorporating the cost factors.

56

Usage Defect

Incomplete Defect

Signature Defect

Severity

Severity

Severity

1

1

1 0.8 0.6 0.4 0.2 0

0.8 0.6

0.5

Expertise*

0.4

Effort

0

0.2 0

Expertise*

Expertise*

Effort

Impact

Impact

Impact Pre/Post Condition Defect

Interaction Protocol Defect Severity

Severity

0.8

0.8

0.6

0.6

0.4

0.4

0.2

0.2

Expertise*

0

Effort

Effort

Expertise*

0

Effort

Impact

Impact

0.9

Recovery Probability

0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Usage Recovery Probability for Each Defect Type

0.7075

Incomplete Signature 0.5725

0.7975

Pre/post cond

Interaction protocols

0.780625

0.71125

Figure 2-8. Graphical View of the Cost Framework Instantiation for Different Defect Types

57

Chapter 3: Component Reliability

At the architectural level, the intended functionality of a software component is captured in structural and behavioral models [77]. Analysis of these models may reveal potential design problems that can affect the component’s reliability. Our goal is to provide a framework to predict the reliability of software components based on their structural and behavioral models. Our framework can be used to provide analysis of the components’ reliability before they are fully implemented and deployed, taking into account the uncertainties associated with early reliability prediction. It can also be used later on during the implementation, when more information on the component’s operation and deployment is available, as an ongoing analysis tool aiding the process of improving the reliability of the components and consequently the reliability of the entire system.

A component’s reliability is estimated as the probability that it performs its intended functionality without failure. A failure is defined as the occurrence of an incorrect output as a result of an input value that is received, with respect to the specification [101]. Moreover, an error is a mental mistake made by the designer or programmer. A fault or a defect is the manifestation of that error in the system. It is an abnormal con-

dition that may cause a reduction in, or loss of, the capability of a functional unit to perform a required function; it is a requirements, design, or implementation flaw or

58

deviation from a desired or intended state [61. In other words, faults are causes of failures.

A highly reliable component, therefore, is a component for which the probability of the occurrence of failures is close to zero. We build our reliability model upon the notion of failure states: a component’s dynamic behavioral model is augmented with one or more failure states representing occurrence of a fault in the component’s operation. We assume that failures are recoverable [101]: the component may recover from them with or without external interventions. A component’s reliability is then predicted as the probability that it is operating normally at time tn in the future, as n approaches infinity. The assumption of recoverability from failures implies that the model does not have any absorbing state, i.e, a state where the probability of staying in it once entered is zero1.

Our approach to reliability modeling involves three phases of activities. Figure 3-1 depicts the high-level methodology. The first phase is Architectural Modeling, Analysis, and Quantification. In Chapter 2, we explained in detail how standard analysis

techniques are applied to the Quartet models of a component to reveal inconsistencies. These inconsistencies represent defects that could result in failures during the component’s operation. The failures in turn contribute to the component’s unreliabil1. An alternative approach to modeling reliability assumes failure states are absorbing states. The reliability is then calculated as the mean time required for the component to arrive at an absorbing (failure) state.

59

Architectural Modeling, Analysis, and Quantification Architectural Models

Analysis

HMM Builder

Domain Knowledge

Defects

HMM Solver

Training Data

Defect Quantification

Markov Model

Reliability Computation

Component Reliability

Reliability Prediction

Profile Modeling

Legend Artifacts

Functional blocks

Numerical Values

Figure 3-1. Component Reliability Prediction Framework

ity. However, not all defects (and subsequent failures) are “created equal”. The types of defects could determine the severity of the subsequent failures, which in turn could determine the cost required for recovering from them. In Section 2.6 of Chapter 2, we presented our defect classification and cost framework, which together enable the architect to quantify the effect of various types of defects on components’ reliability. As discussed, our reliability model is independent from the specific classification and quantification technique, and alternative approaches may be used instead.

The next phase of our component reliability prediction framework is the Operational Profile Modeling. An operational profile is a quantitative characterization of how the

60

component will be used. It is an ordered set of operations that the software component performs along with their associated probabilities. Since during the architectural phase of software development, data on a given component’s operational profile may not be available, an architecture-level reliability modeling approach must take this uncertainty into consideration. In Section 3.2, we discuss a technique that can handle this type of uncertainty under certain conditions and aid the reliability modeling process.

Finally, the last phase of our approach is the Reliability Prediction step. Given (1) the architectural models, associated defects revealed by analysis, and quantification of these defects by the cost framework (obtained from the modeling, analysis, and quantification phase), and (2) an operational profile (obtained from the profile modeling phase), our reliability prediction framework offers a range of analyses on component reliability values. A range of analyses is necessary when taking uncertainties associated with early reliability predication into consideration. Details of this phase are presented in Section 3.3.

An important observation in reliability modeling of software systems during the architecture phase is that depending on various development scenarios, the artifacts available vary significantly. The types of these artifacts influence the steps required by the reliability model. Consequently, we construct a simple classification of various forms of the reliability modeling problem (presented in Section 3.1), and use it to 61

explain which particular steps in the reliability model are applicable to a particular form of the problem.

The rest of this chapter is organized as follows. We first describe a classification of various forms of the component reliability estimation problem in Section 3.1. We then discuss our approach to component profile modeling in Section 3.2. Section 3.3 presents our reliability prediction framework.

3.1 Classification of the Component Reliability Modeling Problem When measuring the reliability after the implementation phase is complete, (e.g., during testing), components are typically deployed in a host system, and various analyses and measurements are performed in conditions that mimic the intended operational profile of the component. At the architectural level, however, relying on the availability of such information may not be a reasonable assumption, particularly because no implementation artifact may be available. Depending on the process adopted for software development (waterfall, spiral, agile, etc.), the type and the amount of information relevant to a component’s operational profile varies significantly. For example, in a spiral development process, at any time after the completion of the initial iterations, some data representative of the operational profile may be obtained from past iterations. Furthermore, in cases of product-line software development [128], when a component under development may be an upgrade to an existing version of the same 62

component, data from previous versions may be available. In other cases however, for example when architecting a brand new component given a set of requirements (e.g., UML’s use-cases as scenarios), no such data may be available.

In cases where operational profile-related data is available, an important factor is whether this data is obtained from run-time monitoring of a version of the component, or from architecture-time simulation of its architectural models. The primary difference in the two cases is the type of available data.

When gathering data from simulation of architectural models, the data could include both the sequences and the frequency of component’s interface invocations, as well as the associated sequence of states. The simulation of dynamic architectural models (e.g., dynamic behavioral model) may be based on the user’s interaction with a simulator: the user would manually control external stimuli and conditions that would determine how the component would behave under different circumstances (e.g., [36]). The order and type of stimuli in this case may be based on the user’s perception. In other cases where a run-time version of the same component is available, the simulation could also be performed by leveraging run-time observable stimuli during a software component’s execution and feeding that information as inputs to the simulator (e.g., [31]). In this case the order and type of stimuli is directly obtained from run-time monitoring of the system and does not depend on the user’s perception.

63

When monitoring (instrumenting) the run-time operation of a component, however, depending on the circumstances it is possible that only the sequences, and thus the frequency of invocation of component’s interfaces, are logged. This for instance could be due to limitations on availability of the source code (e.g., COTS-based systems), or to the use of distributed middleware technologies such as COM, CORBA, and J2EE, where traditionally Interface Definition Languages (IDLs) are used as the basis for data gathering [62]. The type of gathered data moreover depends on the goal of the runtime monitoring process. If runtime monitoring is leveraged as a tool to aid testing activities, the results are likely to contain additional debugging data. On the other hand, when instrumenting the code to identify interactions and relationships among components and subsystems (e.g., [62]), the results are unlikely to contain all the necessary data to reconstruct the states as modeled in the architectural models. This is due to the abstract nature of the notion of states: they do not exist at the implementation level with the same granularity or in the same form, and it is often very difficult, if not impossible, to keep track of all the parameter changes to reconstruct the states as specified in the architectural models.

Finally, there are cases when no data on the component’s operational profile may be obtained. Such a case could, for instance, occur when a new component is being designed and developed. This case is the main focus of this dissertation, and while our reliability model is applicable to other cases described here, our discussion will be primarily focused on this scenario. Particularly, in cases where the ability to simulate 64

Table 3-1 Classification of the Forms of the Reliability Modeling Problem Space

Source of Data

Architectural Models

Runtime

Case 1

+

+

Case 2

+

Case 3

+

Case 4

-

Cases

Simulation

Synthesis

+ +

dynamic behavioral models does not exist, a synthesis process is used to produce the data. Our synthesis process described later in this chapter relies upon input from a domain expert to synthesize the necessary data. In the worst case, the operational profile obtained via this approach would be random, and thus not reflective of the component’s actual eventual usage. However, as shown in our evaluation in Chapter 6, the synthesis process could greatly benefit from the domain knowledge of an expert.

Table 3-1 depicts a classification of various forms of the component reliability estimation problem. We will use this classification throughout this chapter to help us identify specific steps required in reliability prediction. In general, since we assume that architectural models of the system are available (hence the architectural reliability approach), case 4 falls outside the scope this research. It is noteworthy that this case is addressed by existing reliability modeling approaches applicable to the testing phase (e.g., [23,28,41,44,45,53,66,83]). The other three cases (cases 1, 2, and 3) 65

assume availability of the component’s architectural models, and leverage the data obtained from simulation of the models, runtime monitoring of the component, or a synthesis process respectively.

3.2 Profile Modeling Addressing the problem of a component’s reliability modeling requires knowledge of its operational profile. Estimating a representative operational profile for a component before the completion of the development and deployment process is a challenging problem that must be handled when predicting the component’s architectural reliability.

A representative and reasonably complete operational profile of a component can only be obtained by observing its actual operation after deployment. This profile

would include data on the order and the frequency of invocation of the component’s operations. As discussed in Chapter 2, a component’s operations are accessed via interfaces. These interfaces correspond to transitions in the component’s dynamic behavioral model. Invocation of the component’s interfaces serve as stimuli that trigger corresponding transitions in the behavioral model. Consequently, data on the frequency of the component’s interface invocations may be translated into probabilities of activation of transitions in the component’s behavioral models. In turn, these prob-

66

abilities, together with the behavioral model itself, may be used to predict the reliability of the given component.

During the architectural phase, however, it is not always reasonable to assume that such data is available. A reliability model applicable to the architectural stage thus needs to account for and handle the uncertainties associated with unknown operational profiles. The discussion on the classification of various forms of the component architectural reliability problem (Table 3-1) identified three primary cases when obtaining data associated with the operational profile: Case1 – runtime monitoring of an existing component; Case 2 – simulating component’s architectural models; or Case 3 – synthesizing data using domain information. As mentioned before, the pri-

mary focus of our research so far has been on case 3. Below we first describe how our approach addresses the problem as related to the data synthesis case, and then briefly discuss how variations of our approach relate to the other two cases.

3.2.1. Data Synthesis Approach When building a brand new component, objective data on its operational profile may not be available during the architectural phase. Moreover, it may not be always reasonable to assume that dynamic models of the component’s behavior can be simulated (e.g., waterfall development process, and scenario-based requirement development using UML’s use-cases). In these cases, we use domain knowledge to generate data representing the component’s operational profile. 67

getWallDist/ notifyDistChange

executeSpeedChange/ notifySpeedChange

/notifyDistChange

executeDirChange/ notifyDirChange

getWallDist/ notifyDistChange

normal

setDefaults /notifyDistChange setDefaults

init

getWallDist/ notifyDistChange logState

/notifyDistChange emergency

getWallDist/ notifyDistChange /notifyDistChange

changed

/notifyDistChange executeSpeedChange/ notifySpeedChange

executeDirChange/ notifyDirChange getWallDist/ notifyDistChange

executeSpeedChange/ notifySpeedChange

Figure 3-2. Controller’s Dynamic Behavior Model (Guards Omitted for Brevity)

In particular, we ask the architect to explicitly specify several valid sequences of the component’s interfaces that may be invoked to achieve various functionality. The frequency of these invocations is then statistically obtained given a set of sequences. These sequences could be inferred from the dynamic behavioral model (e.g., in our approach), by considering various paths through the corresponding statechart model. For example, recall the controller component discussed in Chapter 2. Its dynamic behavioral model is depicted in Figure 3-2. A domain expert may identify {getWallDist, getWallDist, getWallDist, executeDirChange} as a desired sequence of interfaces

to be invoked starting at state init. This sequence represents when the rover is driving in a particular direction under normal conditions, and then changes direction to avoid 68

an obstacle.Typically, the expert would identify several such sequences. The number of these sequences, along with their lengths, are critical factors for building a representative operational profile.

To build the corresponding operational profile, we need to translate the data embedded in these sequences into frequency of activation of corresponding transitions in the model. This, in turn, relates to the probability of invocation of various components’ interfaces (aka a component’s operational profile). For instance, consider a hypothetical state si in the model where two possible outgoing transitions may be activated. If the expert has identified a set of 10 sequences of interface invocations starting at si, we can obtain the frequency of activation of each transition by statistically analyzing the sequences. Let us assume that the frequencies of the two transitions are 7 and 3 respectively. That is, 7 out of the 10 sequences identify the first transition as their invoked interface. Consequently, the transition probabilities of the two transitions can be inferred to be 0.7 and 0.3 respectively. It is clear that a larger set of interface sequences results in generation of a more representative operational profile.

While this is a very simple process, in practice applying it to a simple model such as the Controller’s state machine, exhibits a problem. Given a sequence of interface invocations in this case, there exists more than one sequence of states corresponding to the interface sequence. For example, if a sequence of interface invocations given 69

by the domain expert contains {getWallDist, getWallDist, getWallDist, executeDirChange}, we cannot deterministically identify the corresponding sequence of

states associated with this sequence of transitions, by looking at the component’s dynamic behavior model (shown in Figure 3-2). One sequence of states could be {init, normal, normal, changed}, while another may be {init, normal, emergency, changed}. This lack of a one-to-one correspondence between the interfaces and the

states prevents us from directly mapping this information into an operational profile.

A formalism that can be used in this case is Hidden Markov Models (HMMs). HMMs are essentially Markov models with unknown parameters. In our case, the unknown parameters are the unknown transition probabilities between different states which correspond to the unknown operational profile of the component. Using HMMs and existing standard algorithms, we use the data on the sequences of a component’s interfaces specified by the architect (aka training data), and obtain transition probabilities corresponding to our dynamic behavior model. In Section 3.2.4 we present some background information on Markov Models as well as the Hidden Markov Model variation, and then describe how this formalism can help us address the problem of a component’s operational profile modeling.

Our current approach to training data generation relies on the domain expert knowledge and produces data that are representative of the expert’s knowledge of component’s operation. This can be done by asking the expert to manually provide a set of 70

sequences of component’s interfaces. Alternatively, we use an automated technique that requires the expert to predict probability of invocation of various interfaces at each state in the model. We then use this prediction to automatically generate valid sequences of interfaces. This approach addresses the need to synthesize HMM’s training data based on the domain knowledge. As part of our future work, we are also investigating other ways of training data generation using the dynamic behavior model of a component, e.g., statechart simulation methods, and trace assertion methods for module specification. The goal is that using these new techniques, we decrease the impact of the expert’s “judgement” and offer a more objective approach to training data generation.

3.2.2. Model Simulation Approach Case 2 of the classification of different forms of the component reliability problem

relies on the results from the simulation of architectural models as the source to build an operational profile. In this case, when data on both the sequences of states and the frequencies of transitions is available, it is possible to obtain an operational profile directly from the data. This could be done by analyzing several runs or executions of the component’s operations and identifying the frequency of activation of various transitions at each state. These frequencies are then directly translated to transition probabilities on the model using an approach similar to the one described in the previous section. Consequently, in case 2, no additional profile modeling and estimation activity would be necessary to perform the component’s reliability prediction. 71

3.2.3. Runtime Monitoring Approach In cases where the data is obtained from the component’s execution at runtime (Case 1) (or when no state information is collected from simulation), the data cannot be

directly used to build the component’s operational profile. Similar to the data synthesis approach, a one-to-one mapping between the sequence of transitions and states in the model may not exist. Once again, a Hidden Markov Model methodology may be used here to build an operational profile based on available set of (training) data obtained from runtime monitoring.

3.2.4 Background on Markov and Hidden Markov Models Informally, a Markov chain is a Finite State Machine (FSM) that is extended with transition-probability distributions. Formally, a Markov Model consists of a set of states S={S1,S2,…,Sn}, a transition probability matrix A={aij} representing the proba-

bility of transition from state Si to state Sj, and an initial state distribution vector π . The initial state distribution is defined as the probability that the state Si is an initial state: π i = Pr[ q1 = S i ], where q1 denotes the state of the model at time t1. At regular fixed intervals of time the system transfers from its state at time t, (qt) to its state at time t+1, (qt+1). The Markov property assumes that the transfer of control between states is memoryless. In other words, the probability of transition to the next state at time t+1, only depends on the system at time t and is independent from its past his-

72

tory. In other words: Pr[qt = Si | qt −1 = S j ] = aij

This assumption allows us analytical tractability, and does not pose any significant limitation2.

Markov-based reliability models (e.g., {18,106,134]) leverage the Markov property, and can be used to estimate the probability of being at given state when the model reaches a steady state. They rely on the availability of matrix A (the probability of transitions among states), to estimate the probability of being at a given state in the future, by calculating the model’s steady state. The steady state is also known as the equilibrium state, and is reached after the process passes through an arbitrary large number of steps. A Markov Model’s steady state is characterized by a steady state probability distribution vector defined as: v = lim π ( n ) n →∞

where π ( n ) = Aπ ( n −1) = Anπ (0)

where π is the initial state distribution vector, and A is the transition probability matrix.

2. This is particularly the case since our architectural models are sufficiently rich to embody memory informations using the notion of state variables and invariants at each state (Recall Chapter 2) without violating the Markov property in the corresponding reliability model.

73

Markov models have entirely observable states. A Hidden Markov Model (HMM), however, is a variation of Markov Models that assumes that some of the parameters (e.g., transitions probability) may be unknown. Particularly, HMMs assume that while the number of states in the state-based model is known, the exact sequence of states to obtain a sequence of transitions may not be known. In addition, HMMs assume that the value of the transition probability distribution may be unknown or inaccurate. The challenge is to determine the hidden parameters, from the observable parameters, based on these assumptions.

An HMM is defined by a set of states S={S1,S2,…,Sn}, a transition probability matrix A={aij} representing the probability of transition from state Si to state Sj, an initial state distribution vector π , a set of observations O={O1,O2,…,Om}, and an observation probability matrix B = {bik}, representing the probability of observing observa-

tion Ok, given that the system is in state Si.

The following three canonical problems are associated with Hidden Markov Models [103,104]. Given an output sequence:

1. What is the probability that a given HMM produced this output sequence? 2. What is the most likely sequence of state transitions that yield this output sequence? 74

3. What are estimates for the transition probabilities related to this output sequence? Later in this chapter we will discuss how the third problem is related to the component reliability prediction problem. This problem is addressed by the Baum-Welch algorithm [11]. Baum-Welch is an Expectation-Maximization algorithm, that given the number of states, number of observations, and a set of training data, approximates the best model in terms of transition and observation probability matrices A and B that represent the training data set.3

The Baum-Welch algorithm is an iterative optimization technique, which starts from a possibly random model, and leverages the training data to find the local maximum of the likelihood function. Specifically, the algorithm applies a dynamic programming technique to efficiently estimate the HMM parameters (including transition probabilities), while maximizing the likelihood that the training data is generated from the estimated model. It operates by defining a forward variable α t (i ) , and a backward variable β t (i ) as follows:

α t (i ) = ∑ α t −1 ( j ) Pr1 (qt = i | qt −1 = j ) Pr0 ( xt | qt = i ) j

β t −1 (i ) = ∑ Pr1 (qt = j | qt −1 = i) Pr0 ( xt | qt = j ) βt ( j ) j

3. Baum-Welch training is only guaranteed to converge to a local optimum. The local optimum may not always be the global optimum. While this is the only known algorithm to address this problem, its output is an approximation of the actual model. One way to mitigate this shortcoming is to execute the algorithm iteratively and obtain a statistically significant or “typical” result.

75

The forward variable determines the probability of reaching state Sj from state Si, given a sequence of transitions (t0,... tt). Conversely, the backward variable determines of the occurrence of a (future) sequence of transitions, given the current state Si.

Using the two complementary forward-backward probabilities the Baum-Welch algorithm evaluates various probabilities. These include the probability of a given observation sequence, the probability that the HMM was at a given state Si at time t, as well as the probability that the HMM was at a given state Si at time t and transitioned to state Sj at time t+1. By applying the Baum-Welch algorithm, the unknown matrices A and B are obtained. This is equivalent to obtaining the operational profile of a component given the set of data obtained by simulation, runtime monitoring, or synthesis of the component’s model.

3.2.5 Application of HMMs to the Component Reliability Problem A component’s dynamic behavior model is the heart of our component reliability modeling approach. In our approach, the graphical representation of the component’s internal behavior using states and transitions among those states is leveraged to build a Markovian reliability model. The Markov property assumption is not too restrictive in our case: it is possible to keep track of memory in a dynamic behavioral model and

76

still preserve the Markov property, by using more complex states with additional state variables (recall Chapter 2).

However, building a Markov model from a component’s architectural models not only relies on knowledge about the component’s states and transitions among those states, but also requires availability of data that helps us obtain the probability of various transitions. As previously explained, such data may be obtained from runtime monitoring or simulation of components’s model (cases 1, 2 in Section 3.1), or via an expert-driven synthesis process (case 3 in Section 3.1), and based on circumstances, a one-to-one correspondence between various observations and the transitions in the model may not exist. In such cases, a regular Markov Model is not capable of properly representing the behavior of the component. A Hidden Markov Model, however, can be formed to estimate an operational profile for the component given the available data.

The event/action interaction semantics in the dynamic behavioral model discussed in Chapter 2 require an augmentation to the basic HMM (without changing its key traits). Each transition in the component’s dynamic behavior model may have an event/action pair associated with it: invocation of an event may result in triggering of an action, which in turn may trigger another event in another component. This interaction semantics is leveraged in designing our system-level reliability model discussed in Chapter 4. We now formally define an Augmented HMM (AHMM) used to 77

Assume: S: set of all possible States, S = {S1 ,..., S N } N: number of states q t : state at time t E: set of all events, E = {E1 ,..., EM } M: number of events F: set of all actions, F :{F1 ,..., FK } K: number of actions We now define:

λ = ( A, B, π ) is a Hidden Markov Modelsuch that: A:state transition probability distribution A = {aij }, aij = Pr[qt +1 = S j | qt = Si ], 1 ≤ i, j ≤ N B: Interface probability distribution in state j B = { b j (m)} b j (m) = Pr[ Em / Fk at t | qt = S j ], 1 ≤ j ≤ N ,1 ≤ m ≤ M ,1 ≤ k ≤ K π:The initial probability distribution π = {π i }

π i = Pr[q1 = Si ],1 ≤ i ≤ N . Figure 3-3. Formal Definition of AHMM

model the operational profile of components. Once the operational profile is obtained, we use our reliability model to predict the component reliability. The formal definition of our AHMM is given in Figure 3-3. Below we describe some of its properties.

In the dynamic behavioral model serving as the basis of our AHMM, for every two states Si and Sj, there may be several transitions with different event/action pairs (Em/ Fk), for 1 ≤ k ≤ K, and 1 ≤ m ≤ M, where as shown in Figure 3-3, M is the number of 78

events and K is the number of actions. Then, the transition probability from Si to Sj by means of a given event Em via any of the possible actions Fk on Em is: K

∑P k =1

ijEm Fk

We define the probability Tij of reaching state Sj from state Si via any of the event/ action pairs E/F as: M

K

Tij = ∑∑ PijEm Fk m =1 k =1

Finally, at each state Si the following condition among all outgoing transitions exists: M

K

N

∑∑∑ P m =1 k =1 j =1

ijEm Fk

=1

For example, in the case of the controller component’s dynamic behavioral model (depicted in Figure 3-2), between the two states init (S1) and normal (S2), there are two transitions designated: getWallDist/notifyDistChange (E1/F1), and –/notifyDistChange (E2/F1). The latter is an example of a transition with a true event where no external stimuli besides a time step are necessary for the transition to be activated.

79

Using the above equations, the transition probability from S1 to S2 by means of event E1 via any of the possible actions is: K

∑P k =1

ijEm Fk

= P12 E1F1

The transition probability from state init to state normal can be formulated as the sum of the probabilities of the two transitions. M

K

T12 = ∑∑ PijEm Fk = P12 E1F1 + P12 E2 F1 m =1 k =1

Finally, at state S1 we have: M

K

N

∑∑∑ P m =1 k =1 j =1

ijEm Fk

= P12 E1F1 + P12 E2 F1 = 1

As mentioned before the important question at this point is how to obtain these individual probability values ( P12 E1F1 and P12 E2 F1 ). Section 3.2 described how these probabilities could be obtained directly from the a simulation process (case 2). In cases 1 and 3, these probabilities may be obtained by applying the Baum-Welch algorithm [11]. The Baum-Welch algorithm leverages information about the number of states and event/action pairs in our model, as well as the training data (obtained from runtime monitoring or the synthesis process for cases 1 and 3 respectively), and estimates the parameters of the AHMM in terms of matrices A and B. In the next section, we show how matrix B is used in the reliability prediction process. 80

3.3 Reliability Prediction The last phase of our component reliability modeling approach involves actual prediction and analysis of a component’s reliability. Given the operational profile of the component (Section 3.2), the aim is to build a reliability model that predicts and analyzes the probability that a component performs its operation without failure.

Our component reliability model extends the Quartet’s dynamic behavioral model with the notion of failure states. One of our goals has been to provide targeted sensitivity analyses as part of our reliability modeling, aiming at offering cost-effective strategies to defect mitigation. As a result, we model a failure state for each defect type revealed during the architectural modeling and analysis phase. Each failure state represents possible manifestation of the corresponding defect type during the component’s runtime operation. We note that other possibilities, ranging from a single failure state to multiple failure states for each type, are also enabled by the model. Once the model is augmented with failure states, two additional types of transitions must be added to the model: failure transitions and recovery transitions.

Failure transitions are arcs from a component’s states to the failure states, and represent the possibility of a failure happening while the component is in a normal operating state. Recovery transitions model the notion of recovery from failures, and are arcs from failure states to one or more “normal” component states. The designation of 81

getWallDist/ notifyDistChange

executeSpeedChange/ notifySpeedChange

/notifyDistChange

executeDirChange/ notifyDirChange

getWallDist/ notifyDistChange

normal

setDefaults /notifyDistChange setDefaults

init

getWallDist/ notifyDistChange logState

/notifyDistChange emergency

getWallDist/ notifyDistChange

/notifyDistChange

changed

/notifyDistChange executeSpeedChange/ notifySpeedChange

executeDirChange/ notifyDirChange getWallDist/ notifyDistChange

executeSpeedChange/ notifySpeedChange

F1 (sig)

F2 (prot)

Figure 3-4. Graphical View of the Controller’s Reliability Model

one or more of the component’s “normal” states as recovery states for a given failure state is performed by the architect.

As shown in Figure 3-4, the controller component’s dynamic behavioral model is augmented with two failure states, F1 and F2. F1 denotes occurrence of failures corresponding to the signature mismatch defect type, and F2 represents occurrence of failures corresponding to the protocol mismatch defect type. As discussed in Chapter 2 these two types of defects were revealed by running various analyses on the controller 82

component’s architectural models. In particular, the signature mismatch between estimator and controller components was associated with the getWallDist interface, due to an inconsistency of the return types. The setDefaults operation was determined to be the source of a protocol mismatch between the controller and database components. Dotted transitions in the diagram represent failure transitions connecting a subset of the component’s states to failure states. Bolded transitions from failure states to the init state in Figure 3-4 represent recovery transitions.

Only component states in which a particular defect type is relevant are linked to a failure state. This is decided by examining all outgoing transitions at a given state: if an outgoing transition relates to a defect detected by architectural analysis, then an arc connecting that state to the corresponding failure state is required. In the controller example, since no outgoing transition corresponding to the setDefaults operation at states init and normal exists, there is no need to model a failure transition from those states to state F2. Moreover, designation of recovery states for each failure state is an application specific task, and as such must be done by the architect. In the case of our example, the init state is assigned to be the sole recovery state once any failure happens.

The next step after augmenting the model with failure states and adding failure and recovery transitions, is to assign probabilities to all transitions in the model. As described, the output of the operational profile modeling phase depends on the spe83

cific form of the problem at hand. Our primary focus is on case 3 where a set of training data is synthesized. This data is then used by the Baum-Welch algorithm to estimate the probability of transitions among component states. Case 1 does not rely on synthesized data, but obtains training data from simulating the models, and the result of the case 2, is an estimation of the frequency (i.e., probability) of activation of various transitions in the dynamic behavioral model. Either way, by this point in the reliability modeling we have an estimate of all transition probabilities in the form of a matrix (matrix A in Figure 3-3):

S 1 S 2 ... SN S 1 ⎡ a11 S2 ⎢a A = ⎢ 21 ... ⎢ ... ⎢ SN ⎣ a N 1

a12 a22 ... aN 2

... a1 N ⎤ ... a21 ⎥⎥ ... .. ⎥ ⎥ aNN ⎦

where aij represents the probability of transition from state Si to state Sj. The next step is to incorporate these values in an extended model that includes the failure states and their associated transitions. The new model is a Markov model which is used for reliability prediction. Below, we first present the general form of the transition probability matrix of this new model, and then discuss it in detail.

84

A' = S1

S2

S1 ⎡ a11 (1 − ∑ f1i ) a12 (1 − ∑ f1i ) ⎢ i =1 i =1 ⎢ M M ⎢ S2 ⎢ a21 (1 − ∑ f 2i ) a22 (1 − ∑ f 2i ) i =1 i =1 ... ⎢ ... ... ⎢ ⎢ M M S N ⎢ aN 1 (1 − ∑ f Ni ) aN 2 (1 − ∑ f Ni ) ⎢ i =1 i =1 ⎢ r11 r12 F1 ⎢ ⎢ ⎢ ⎢ r21 r22 F2 ⎢ ... ⎢ ... ... ⎢ ⎢ rM 1 rM 2 FM ⎢ ⎣ M

...

M

SN

F1

F2

...

M

...

a1N (1 − ∑ f1i )

f11

f12

...

... a2 N (1 − ∑ f 2i )

f 21

f 22

...

...

...

...

...

f N1

fN 2

... 0

i =1 M

i =1

... M

... aNN (1 − ∑ f Ni ) i =1

N

r1N

1 − ∑ r1i

0

r2 N

0

1 − ∑ r2i

0

...

...

...

...

...

...

rMN

0

0

...

...

i =1

N

...

i =1

FM ⎤ ⎥ ⎥ ⎥ f2M ⎥ ⎥ ... ⎥ ⎥ f NM ⎥ ⎥ ⎥ ⎥ 0 ⎥ ⎥ ⎥ 0 ⎥ ⎥ ... ⎥ N ⎥ 1 − ∑ rMi ⎥ i =1 ⎦ f1M

where A’ is the new transition matrix, M is the number of failure states, N is the number of normal states, Fi is the ith failure state, and Si is the ith normal state. Moreover, rij is the probability of recovery from failure state Fi to the normal state Sj, and fij denotes the probability associated with the failure transition from normal state Si to the failure state Fj. Finally, aij is the probability value previously obtained from matrix A (see above), corresponding to the probability of transitioning to state Sj from state Si based on the results of operational profile modeling. This value is adjusted in the new matrix (A’) to incorporate failure and recovery probabilities, while ensuring that the new matrix preserves the properties of a Markov model. 85

Our approach to initializing recovery probabilities (rij) and failure probabilities (fij) is as follows. As described in Chapter 2, we use our defect quantification approach to estimate the cost of recovery from each failure given a set of domain-specific cost factors. The probability of recovery from each failure type is calculated as a function of this recovery cost. Consequently, we instantiate the value of rij directly from this estimation. For instance, as depicted in Figure 2-8, the recovery probabilities for signature and protocol defect types were calculated as 0.7975 and 0.71125, respectively. These two values were obtained from the instantiation of our cost framework (Section 2.6) as the cumulative influence of a set of domain specific cost factors. These values are assigned to the recovery probability from F1 and F2 to the init state respectively. Recall that the state init was designated as the sole recovery state for the component. The probability of remaining at the failure states F1 and F2 thus, are 1 – 0.7975= 0.2025 and 1 – 0.71125 = 0.28875 respectively.

While our cost function can be used to quantify the cost and consequently the probability of recovery from different types of failures, estimating the probability of failure occurrence can only be done given the historical failure data from the component, or by leveraging the domain expert’s knowledge. In the case of architecture-level reliability modeling when no failure data may be available, a domain expert must estimate the probability of failure occurrence at each state. At early stages of development, estimating these probabilities with any degree of certainty is very diffi86

cult. These uncertainties warrant the need for offering flexibility in the reliability model and allowing for reliability analysis based on different failure probability values. Using a range of failure probability values will result in a range of predicted component reliability values. We consider this flexibility to be a useful analysis tool that helps the architect in making important design decisions. To present the model, however, we first focus on a single failure probability value, and then extend the reliability analysis to a range of possible failure probability values.

Let us assume that the probability matrix obtained from profile modeling of the controller component is as follows. i i ⎡ 0.1503 n ⎢⎢ 0.4708 A= e ⎢ 0.2263 ⎢ c ⎣0.2204

n

e

c

0.2858 0.2653 0.0984 0.3539

0.2830 0.0099 0.3308 0.1998

0.2809 ⎤ 0.2540 ⎥⎥ 0.3445⎥ ⎥ 0.2258⎦

where i is init, n is normal, e is emergency, and c denotes the changed state. In our case, this matrix was obtained by applying the Baum-Welch algorithm to a set of synthesized training data obtained based the domain expert knowledge (an extensive study of the impact of the training data on the predicted reliability value in terms of the sensitivity of the model to random data or data based on the domain knowledge may be found in Chapter 6). As discussed, in conjunction with the results obtained

87

from the analysis of architectural models, this matrix must be augmented with two failure states F1 and F2.

Let us assume that the architect has designated a 5% probability of failure for the signature mismatch and a 2% probability of failure for the protocol mismatch at each related state. The new transition probability matrix will have two additional rows and columns corresponding to the failure states F1 and F2. i i ⎡ 0.1428 n ⎢ 0.4473 ⎢ e ⎢ 0.2105 A' = ⎢ c ⎢0.2050 F1 ⎢ 0.7975 ⎢ F2 ⎣ 0.7113

n

e

c

F1

F2

⎤ 0.2520 0.0094 0.2413 0.05 0 ⎥⎥ 0.0915 0.3076 0.3204 0.05 0.02 ⎥ ⎥ 0.3291 0.1858 0.2100 0.05 0.02 ⎥ 0 0 0 0.2025 0 ⎥ ⎥ 0 0 0 0 0.2887 ⎦ 0.2715 0.2688 0.2669

0.05

0

Examining the new matrix demonstrates that while new failure transition probabilities are incorporated into the model, the ratio between various transition probabilities has remained unchanged. For example, in the original transition probability matrix, once the component was in the init state, the ratio between the probability of remaining in init state or transitioning to the normal state was: 0.1503 = 0.52 0.2858

88

In the new transition probability matrix, the same ratio between those probabilities holds: 0.1428 = 0.52 0.2715

Once the new transition probability matrix for the model (including the failure states, and the failure and transition recovery probabilities) is constructed, we can predict the reliability of the component, by estimating the probability that it is operating normally at a time in the future when the component is in its steady state. Recall that a component is considered reliable if by time tn, where n approaches infinity, it is operating normally. That is, it has either not failed, or has recovered from the failures it may have encountered. The reliability is then predicted by estimating the probability that the component at time tn is in a non-failure state. This can be estimated by calculating the steady state vector that represent the steady state behavior (recall Section 3.2.4) of the component. The steady state distribution vector corresponding to the component’s model can be calculated by solving the following system of equations: x + y + z + ... = 1

[x

y

z ...] A′ = [ x

y

z ...]

89

where the number of unknowns is equal to the number of states in the Markov model (x, y, z, and so on). A steady state probability vector is then given by: V∞ = [ x

y

z ...]

Numerically, calculation of the steady state vector can be performed by raising the matrix A′ to increasingly higher powers until all rows in the matrix converge to the same values. Upon convergence, we obtain the steady-state vector V, whose elements represent the long-term probability of being in the corresponding state.

In the case of the controller component, the steady state vector is calculated as: i

n

e

c

F1

F2

V= [ 0.2832 0.2288 0.1767 0.2372 0.0581 0.0116]

Reliability is then the probability that the component is not in state F1 or F2. In other words: M

Reliability = 1 − ∑ V ( Fi ) i =1

where M is the number of failure states, and V(Fi) are the elements of vector V corresponding to failure states F1, F2,..., FM.

90

In the case of the controller component, the reliability is estimated as:

Reliability = 1 − (0.0581 + 0.0116) = 0.9303

For clarity, the calculation above shows a single approximation of the component’s reliability based on a single assignment of failure transitions’ probabilities by the domain expert. Given the difficulty of accurately estimating transition failure probabilities when no actual failure data is available, we opt to provide a range of analyses based on a threshold given for the transition failure probability estimation. In the example above, the expert had assigned 0.05, and 0.02 for the probability of failure of type F1 and F2 occurring at each state. Given this instantiation, the component reliability was estimated at 93.03%. Assuming that these transition failure probabilities are an estimate at best and may have uncertainties associated with them, we can predict the reliability of the component for a given threshold. Figure 3-5 (top) demonstrates the range of predicted reliability values when the probability of transition to failure state F1 varies between 0.03 and 0.07. As depicted, the original predicted reliability of 93.03% now falls at the middle of the estimated values from 95.22% to 90.94%. Similar type of analysis on the value of probability of failure transition to state F2 is shown in Figure 3-5 (bottom).

An assumption made when estimating a single component’s reliability is that the effect of a defect associated with a service is captured in the providing component’s 91

Component Reliability

Reliability Analysis [P(F2) = 0.02] 0.96 0.95 0.94 0.93 0.92 0.91 0.9 0.89 0.88 0.03 0.035 0.04 0.045 0.05 0.055 0.06 0.065 0.07 P(F1): Probability of transition to state F1

Component Reliability

Reliability Analysis [P(F1) = 0.05] 0.945 0.94 0.935 0.93 0.925 0.92 0.915 0.91 0.905 0

0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 P(F2): Probability of transition to F2

Figure 3-5. Reliability Analysis Results for the Controller Component

model. That is, when a component requires services of another component, we assume that any defect associated with that service is captured in the reliability model of the providing component. Consequently, the failure states of a given component signify the defects associated with its own functionality, while any problems with the

92

functionality provided to it would be captured in the provider component’s reliability model.

It is noteworthy that architectural analysis of software components may be applied to components and their behavior in isolation (e.g., the case of an Off-the-Shelf component that is not part of any system). It may also be applied to components that along with other (interacting) components comprise a software system. While both type of analysis can reveal defects, the defects are more useful and the analysis is more meaningful when the component is considered in the context of a system. While the former approach results in reliability values that may be reused in different software systems, the latter approach determines the fitness of the component in a particular system. Our approach in this dissertation has focused on components in the context of a software system.

93

Chapter 4: System Reliability

In Chapter 3, we described our approach to predicting the reliability of a single component (aka Local Reliability). In this chapter, we leverage the components’ reliability values and offer a compositional approach to predict the architecture-level reliability of a software system. The system reliability is predicted in terms of the reliabilities of its constituent components, and their complex interactions. Our approach involves two major steps: first we build a model representing the overall behavior of the system in terms of interactions among its components. Next, this model is used as a basis for stochastic analysis of the system’s architectural reliability.

Since our approach is intended to be applied at early stages of the software development life-cycle, lack of knowledge about the system’s operational profile poses a major challenge in building a reliability model. The operational profile is used to determine the failure behavior of the system, and is commonly obtained from runtime monitoring of the deployed system. In the absence of data representing the system’s operational profile, we use an analytical approach that relies on domain knowledge as well as the system’s architecture to predict the reliability of the system. Similar to the component-level reliability model, in cases where operational profile data is available, our reliability model can be adopted to leverage existing data and to provide a more accurate analysis of the system’s reliability.

94

Architectural Models

Global Behavioral Model

Bayesian Network System Reliability Inference

Component Reliability Values

AHMM

Legend Numerical Values

Approach Elements

Artifacts

Learning Process

Figure 4-1. Our Approach to System Reliability Prediction

Figure 4-1 shows a high-level view of the system reliability prediction process. As with our component-level reliability estimation approach, architectural models of the system serve as the core of the reliability model. In our case, the Quartet offers models of components’ interaction protocols in the form of a set of Statecharts [47]. We compose a concurrent model of components’ interaction protocol models to provide a global view of the system’s behavior. Our reliability model leverages this global view, and given components’ reliability values (obtained via our HMM methodology), provides a prediction of the system’s architecture-level reliability, based on the Bayesian Network methodology [49].

In the rest of this chapter, we first describe our approach to modeling the global behavior of a software system in terms of the Quartet models of its constituent com95

ponents. We then describe our Bayesian system-level reliability model. As part of this discussion, we provide a brief overview of Bayesian Networks. We then describe how our Bayesian reliability model is constructed, and demonstrate the analyses it enables.

4.1 Global Behavioral Model In order to estimate the overall reliability of a software system, we need to understand the nature of the complex interactions among its components. This understanding involves answering two types of questions: which interactions are allowed in this particular system, and how often do they occur. To answer the first question, we build the Global Behavioral Model (GBM) of the system. This model is then used by the architect to analytically determine the expected frequency of various interactions.

The behavior of a software system is the collective behavior of its constituent components. These components interact to achieve system-level goals. These interactions are often very complex, and capturing them requires sophisticated modeling techniques that are capable of representing request-response relations, as well as related timing issues. These interactions are often described in terms of components’ provided and required functionality, exhibited through their interfaces [1,2,102,133].

As previously described in Chapter 2, one of the views of the Quartet approach to software modeling is the model of components’ interaction protocols. Recall that a 96

SCRover Actuator

Database

Sensor Actuator

Controller

Sensor

Database

Estimator Estimator

components communication link

Controller

state concurrent state transition

Figure 4-2. View of SCRover System’s Collective Behavior

component’s interaction protocol model provides a continuous external view of the component’s execution by specifying the ordering in which component’s interfaces must be invoked. Our specific approach leverages the Statecharts methodology [47,133] with semantic extensions to model the event/action interactions between communicating components [47].

We model the collective behavior of components using a set of concurrent state machines. Each state machine within this concurrent model represents the interaction protocol of a single component. Figure 4-2 depicts the conceptual view of the interactions among components in the SCRover system. The left hand side diagram is the view of the system’s configuration in terms of its communicating components. The right hand side shows a concurrent state machine containing interaction protocol models of individual components. In the interests of clarity, labels on the transitions,

97

events, actions, parameters, and conditions have been omitted, but are described later in this chapter.

In a concurrent state machine representing the system-level behavior of n communicating components, at any point in time, the active state of the system is represented using a set of component states {S1,S2..., Sn}, where n is the number of components in the system, and Sk corresponds to the active state in the state machine corresponding to the kth component. The interactions among components are represented via event/ action pairs. Each event/action pair acts as a synchronizer among the state machines.

The event/action interaction describes how invocation of a component’s services affects another component. Figure 4-3 depicts the interaction protocols of the controller, estimator, and actuator components in the SCRover system. To avoid unnecessary complexity in the discussions, we discuss the SCRover’s system model in terms of these three components. However, the approach and techniques presented here can be applied to a greater number of components without modification.

The system’s three state machines are concurrently executed. The Statechart semantics [47] permit two types of interactions among concurrent state machines. These interactions leverage event/action semantics, and model how operations in one component affect another component’s operations. The first type of interaction concerns concurrent events. Given the appropriate active state of components, all of the transi98

SCRover System Controller /notifyDistChange

executeSpeedChange /notifyDistChange

/notifyDistChange S1

S2

S3 executeDirChange/ notifyDirChange

getWallDist/ notifyDistChange setDefaults

getWallDist/ notifyDistChange

true

Estimator

Actuator

/getWallDist

S1

tru

/getWallDist S2

S3

S2

executeDirChange e

executeSpeedChange e

S1

tru

notifyDistChange /getWallDist

S3

notifyDirChange

Legend Concurrency x/y

State

Event/Action pair Initial state

Transition

Figure 4-3. SCRover’s Global Behavioral View in terms of Interacting Components

tions with the same event are activated at the same time. For instance, in the case of

the SCRover model (Figure 4-3), if the active state of the controller component is controller.S2 and the active state of the actuator component is actuator.S1, then invo-

cation of the executeSpeedChange interface results in generation of the corresponding event, which in turn causes a change of state in both components to controller.S3, and actuator.S2 respectively. Note that generation of this event has no effect on the state of

the estimator component regardless of its active state. 99

The second type of interaction concerns the event/action pair semantics. Given the appropriate state of components, generation of an event in one of the components may result in the invocation of an action, which in turn may result in generation of another event in another (concurrent) state machine. In the SCRover system, assuming that {controller.S2, estimator.S1, actuator.S1} is the system’s active state, invocation of the executeSpeedChange interface in the actuator component results in generation of the executeSpeedChange event. This in turn results in the triggering of the corresponding

transition in the controller component, causing the notifyDistChange action. The concurrent nature of the three state machines results in triggering of the notifyDistChange transition in the estimator component (event caused by the action in the controller component), as well as the executeSpeedChange transition in the actuator component (original event). The new active state of the system will then be {controller.S3, estimator.S3, actuator.S2}.

The semantics described above [48] form the basic principles upon which the Global Behavioral Model of a system in terms of the behavior of its constituent components is built. In the rest of this chapter, we describe how this model is used to predict the architectural reliability of a software system.

100

4.2 Global Reliability Modeling The Global Behavioral Model describes the behavior of the system as intended by the architect. When the system strays from the intended behavior, it is said to demonstrate a failure. Recall that failure is defined as the occurrence of an incorrect output as a result of an input value that is received, with respect to the specification [101]. Failures are a result of faults or defects that are potentially attributed to design flaws or deviation from a desired or intended behavior as specified by the architect. System failures may be caused by components’ internal failures, or as a result of interactions among communicating components. While all failures adversely affect system reliability, different failures may contribute to system unreliability differently. The overall unreliability of a system can be formulated as the aggregate of the probabilities of occurrences of various types of failures.

While the Augmented Hidden Markov Model (AHMM) presented in Chapter 3 was effective for component reliability estimation, there are serious concerns with its ability to model reliability of a complex system. These concerns are mainly due to the lack of theoretical foundations to build Hierarchical Hidden Markov Models. While recent research has started to address the issue of concurrency and hierarchy in Markov Modeling [9], generalization of the Expectation-Maximization algorithm for Hierarchical Hidden Markov Models is still very much a topic for ongoing research.

101

Moreover, Hierarchical Hidden Markov Models have serious shortcomings with respect to scalability when modeling concurrency.

For modeling the reliability at the system level, we use a related graphical model capable of performing probabilistic inference. A Bayesian Network or Belief Network (BN) [49], is a probabilistic graphical model in the form of a directed acyclic graph. The nodes in a BN represent some variables, and the arcs (or links) connecting these nodes represent the dependency relations among those variables. A Bayesian Network represents a stochastic relationship among the nodes in the graph, in terms of the conditional probabilities of some nodes with respect to the others. Given the topology of a Bayesian Network and the probability distribution values at some of the nodes, the probability distribution value of some other nodes may be deduced. This is known as inference in Bayesian Networks. In the next subsection, we offer a basic overview of Bayesian Networks, and discuss their applicability to the reliability estimation problem.

It is worth mentioning that while theoretically Bayesian Networks may be used to perform component-level reliability analysis, based on our experience, HMMs are more appropriate to model architectural reliability of individual components. Theoretically, Bayesian Networks and Hidden Markov Models belong to a class of models known as Graphical Models. Graphical models merge concepts of Graph Theory and Probability Theory [81]. Furthermore, the two formalism are demonstrated to be iso102

morphic under certain conditions [55]. Particularly, HMMs are considered to be a special case of Dynamic Bayesian Networks. While a natural causal relations among entities in our system level models lends itself to use of BNs at the system level, HMMs are more intuitive when used in component-level reliability modeling. Consequently we opted to use HMMs for component-level and BNs for system-level reliability modeling.

4.2.1. Background on Bayesian Networks Bayesian Networks or Belief Networks have been extensively used in Artificial Intelligence and Machine Learning Decision Making, Medical Diagnosis, and Bioinformatics [49,80,39,92,89]. They have also been used to model reliability of software systems during the testing phase, based on the operational profile obtained from system monitoring [4,58,68,96]. However, little work has been done on predicting system reliability early in the development process when such information is not widely available [97].

A Bayesian Network consists of two parts: qualitative and quantitative. The qualitative part is a directed acyclic graph consisting of set of nodes, and directed arcs (links)

that connect the nodes. The arcs represent the dependency between probability distribution values represented at each node. Similar to standard graph theory concepts, if there is an arc from node A to another node B, then A is a parent of B. If a node has no parents, then it is a root node. If a node has no children then it is a leaf node. The 103

quantitative part consists of specification of the conditional probabilities among the

nodes and their parents in the network.

In probability, two events are independent when knowing whether one of them occurs makes it neither more probable nor less probable that the other occurs. In a Bayesian Network, a node is independent of its ancestors given its parents. This property is known as conditional independence. Formally, two events X and Y are conditionally independent given a third event Z, if the occurrence (or non-occurrence) of X and Y are independent events in their conditional probability distribution given Z. In other words: Pr( X ∩ Y | Z ) = Pr( X | Z ) × Pr(Y | Z )

where Pr( X | Z ) represents the conditional probability of event X given the occurrence of event Z, and Pr( X ∩ Y | Z ) represents the joint probability of events X and Y given the occurrence of event Z.

In a Bayesian Network, a node can represent any type of variable (e.g., a measurement, a parameter, etc.). We use Bayesian Networks to model the dependency between various system states. Specifically, our nodes represent the reliability values at corresponding states in the system. The arcs model how reliability at one state is affected by the reliability value at another state in the system.1 104

The probabilistic inference considers available information about the network and infers conclusions about the other parts of the model. In other words, given the individual component reliability values, and the graph representing the relationship among the reliabilities of various states in the system, we use inference to estimate the posterior probability of the occurrence of different types of failures. The posterior

probability calculation considers the known information about the network (aka evidence), and updates the conditional probability at other nodes (aka belief). The basis

for inference in Bayesian Networks is Bayes’s Theorem. Bayes’s Theorem is essentially an expression of conditional probabilities that represent the probability of an event occurring given evidence. In probability, Pr(A|B) (the conditional probability of event A given B) and Pr(B|A) are two different terms. However, there is a relationship between the two, and Bayes’s Theorem describes this relationship. Bayes’s Theorem can be derived from the definition of conditional probability of events A and B as follows: By Definition : Pr( A ∩ B) P( B) Pr( A ∩ B) Pr( B | A) = Pr( A) Pr( A | B) =

1. It is note-worthy that to remain consistent with the BN terminology and avoid confusion, we use the term state in the context of the behavioral models and state machines, and use the term node in the context of Bayesian models; conceptually the two terms are interchangeable. Moreover, the terms arcs and links are used interchangeably in this discussion.

105

So: Pr( A | B) × Pr( B ) = Pr( B | A) × Pr( A) = Pr( A ∩ B) Pr( A | B) =

Pr( B | A) × Pr( A) Pr( B)

where Pr(A|B) is the posterior probability, Pr(B|A) is known as the likelihood, Pr(A) is the prior probability, and Pr(B) is the probability of the evidence. In other words, the Bayes’s Rule can be phrased as follows: posterior probability =

likelihood × prior probability evidence

Bayesian Networks offer a robust probabilistic formalism for reasoning under uncertainty. It is relatively easy to understand and interpret a Bayesian Network as it reflects our understanding of the world within the model. Furthermore, the conditional independence between the nodes and their ancestors in the model provides a more compact probabilistic relation between the nodes in the model based on their structured representation. This property particularly helps ensure that the complexity of components’ interactions is simplified when considering the effect of these interactions on the system’s reliability.

In the next section, we describe our Bayesian reliability model in terms of its quantitative and qualitative parts.

106

4.3 A Bayesian Network for System Reliability Modeling The global behavioral model presents the interactions among components in a system. Failures may occur during system’s operation, and their cause may be rooted in defects that originate from the architecture and design phases. We build a Bayesian Network using the GBM as its core that determines the dependency between reliability values at various system states. This model is further extended to include failure nodes. Similar to our component-level reliability approach, we acknowledge that dif-

ferent types of failures may occur in the system. Furthermore, the contribution of different types of failures to the overall reliability of the system depends on many factors such as the types of failures and specific components that exhibit the failure behavior. We leverage our classification of architectural defects (presented in Chapter 2) to differentiate among different classes of failures. Our system-level reliability model explicitly represents different types of failure for each component in the system. The overall reliability of the system is then estimated in terms of the cumulative effect of different types of failure in various components. Below we describe our approach to construction of the qualitative and quantitative parts of our Bayesian Network.

4.3.1. Qualitative Representation of the Bayesian Network The dependency relationship among reliabilities at various system states is directly tied to the interactions among the states in the system’s global behavioral model. In a system’s GBM a transition (associated with an event) may result in a change in the state of the component; that is, under the correct conditions (i.e., generation of certain 107

events and given the active state of the component), a change of state to a new state may be caused. This concept serves as the core principle in converting a global behavioral model to BN’s directed graph.

As previously mentioned, the Global Behavioral Model consists of a set of concurrent state machines SM={sm1,...,smn} where n is the number of components in the system. Each state machine (smi) consists of a set of states S= {s1,...,sm}, and a set of transitions T={t1,...,tp}, where m represents the number of states in the state machine smi, and p is the number of transitions in the component corresponding to smi. Each transition has its origin and destination in S. There is either a single event or an event/action pair associated with each transition. Each event and action corresponds to a component’s interface (recall Chapter 2).

Below we describe the steps required to leverage this behavioral model and construct a Bayesian reliability model, in terms of nodes and the links representing the reliability dependency among these nodes.

Nodes. The nodes in our Bayesian Network are directly related to the states in the

behavioral model. All the states in the global behavioral model become the nodes in the BN. Moreover, a “super” node (init) is added to represent the instantiation of the system. This node will be used to model the reliability of the system’s startup process. 108

In addition, for every component, a set of failure nodes are added to the Bayesian Network. These failure nodes correspond to the different defect types revealed during architectural analysis for each component. Each failure node represents the probability of the occurrence of a specific type of failure in a component. A failure may be due to an internal fault in the component, or a result of its interaction with the rest of the system. The top part of Figure 4-4 shows the initial step of the BN construction for the SCRover system. Initialization and failure nodes are added to the basic set of nodes (corresponding to the states in the GBM).

Links. The next step involves designing the arcs (links) in the model to capture the

dependencies between reliabilities at various nodes. The links in our model can be grouped in three distinct groups: instantiation links, failure links, and dependency links.

Instantiation links are added from the init node to all nodes corresponding to initial

states of components. They model the system’s instantiation process, and signify that the reliability of the system depends on the failure-free instantiation of all of its components in the system. Note the links from the init node to controller.S1, estimator.S1, and actuator.S1 in Figure 4-4.

109

Failure links are used to determine the possibility of occurrence of various types of

failures at different states of the system. We rely on the results obtained from our architectural analysis phase to determine relevance of a particular defect type (and its subsequent failure type) for each component. Specifically, for each node in the BN, we consider the corresponding state in the global behavioral model. If in the GBM a particular defect type is associated with an interface corresponding to an outgoing transition at a given state, then a failure link from the corresponding node in the BN to the failure node associated with that type of defect is drawn. In the case of the SCRover system, recall that getWallDist() and setDefaults() were defective interfaces identified during the architectural analysis phase. The first one was shown to demonstrate a signature type mismatch with the estimator component, and the latter had both a pre/post condition mismatch and a protocol mismatch with the database component. States controller.S1 and controller.S2 both have outgoing transitions that are activated once the getWallDist event is generated, as do estimator.S2 and estimator.S3. Consequently, a link from these nodes to the controller and estimator components’ signature failure node (F4) is added. Similarly, links from controller.S2 to controller.F5 and controller.F6 represent the possibility of Pre/Post condition and Protocol

failures respectively. The actuator component does not react to any of the defective interfaces and thus no failure nodes need to be modeled for it. Figure 4-4 (bottom) depicts the initialization and failure links in the SCRover’s Bayesian Network.

110

init

Estimator

Controller

F4

S1

S1

Actuator S1

S2

S2

S2

S3

S3

S3

F5

F6

F4

Legend F1: Direction, F2: Usage, F3: Incomplete, F4: Signature, F5: Pre/Post Condition, F6: Protocol

init

Estimator

Controller

F4

F5

S1

S1

Actuator S1

S2

S2

S2

S3

S3

S3

F6

F4

Legend F1: Direction, F2: Usage, F3: Incomplete, F4: Signature, F5: Pre/Post Condition, F6: Protocol

Figure 4-4. Nodes of the SCRover’s Bayesian Network (Top), and Initialization and Failure Links Extension (Bottom)

Finally, dependency links depict the reliability relationship among various nodes in the system. There are two types of dependency links: inter-component and intra-com111

init

Estimator

Controller

F4

F5

Actuator

S1

S1

S1

S2

S2

S2

S3

S3

S3

F6

F4

Figure 4-5. Interaction Links in SCRover’s Bayesian Network

ponent links. The intra-component dependency links are directly obtained from each

component’s protocol model. For every transition in the interaction protocol model of the component, there is a directed arc from the node corresponding to its origin state, to the node corresponding to its destination in the Bayesian Network. These links signify that reliability (probability of success) at each node depends on the reliability of its parent node. For example, in the SCRover’s GBM, a transition from controller.S1 to controller.S2 signifies that the system’s reliability value at controller.S2 depends on the reliability of the system at controller.S1, justifying a link in the Bayesian Network from the latter to the former as depicted in Figure 4-5.

112

The inter-component dependency links are designed to demonstrate the relationship between reliabilities of the states among interacting components. The notion of event/ action interactions described earlier in this chapter serves as the logical core of these links. Recall that generation of an event in one component may cause a change of state in a different component. More specifically, for each el / ao pair in a component’s state machine smi, we seek all transitions in all other components’ state machines where an event matches the action ao. A link is then added from the origin node of el / ao in smi to the destination nodes of all events ao in the other state machines. The inter-component links indicate that the reliability (probability of success) in the nodes of interacting components are influenced by the reliability at the node of the component initiating the interaction.

Consider Figure 4-5 and Figure 4-6 depicting the SCRover’s BN in two stages. The first diagram shows the inter-component dependencies while the second one depicts the final Bayesian Network including all dependency, failure, and instantiation links. Upon instantiation, {controller.S1, estimator.S1, and actutator.S1} becomes the active state of the system (described in terms of the active states of each component). At this point, several scenarios may happen. As an example a getWallDist event may be generated, which results in a transfer of state in the controller component from controller.S1 to controller.S2, resulting in activation of notifyDistChange action which in turn

causes a change of state in the estimator component from estimator.S1 to estimator.S2. 113

init

Estimator

Controller

F4

F5

Actuator

S1

S1

S1

S2

S2

S2

S3

S3

S3

F6

F4

Figure 4-6. SCRover’s Final Bayesian Network Model

This particular scenario has no immediate influence on the actuator component. A change of state within a component (e.g., controller.S1 to controller.S2 in the above example) can also be interpreted in terms of the dependency among the reliability values at those states. In this case, an unreliable controller.S1 state may affect the probability of correct operations at controller.S2 state. Moreover, because triggering of this transition results in a change of state in the estimator component (from estimator.S1 to estimator.S2), the unreliability at controller.S1 could also affect the probability of suc-

cessful operation at estimator.S2 state.

A final issue concerns cycles in a Bayesian Network. By definition, a Bayesian Network is a directed acyclic graph (DAG). Following the above approach may result in 114

creation of cycles in the graph. However, it is important to note that these cycles are time sensitive: for example, while there are links in both directions between the estimator.S1 and estimator.S3 in Figure 4-6, the two links are not representing the depen-

dency between the two nodes in the same time-step. That is, since the estimator component cannot be in both estimator.S1 and estimator.S3 states at the same time, the cycle introduced in the BN graph represents reliability dependencies at different points in time. To remedy this issue of cycles, before adding a link to our Bayesian model, we check for cycles that may be created once the link is added. If by adding the link a cycle is generated, then that link is marked as a Delay Link (dashed lines in our graphs). Delay links are a standard concept in time-dependent Bayesian Networks and convert a simple Bayesian Network to a Dynamic Bayesian Network (DBN) [81].2 A Dynamic Bayesian Network is a time-sensitive Bayesian Network. Inference performed on a Bayesian Network can also be performed on a DBN. To do this, a DBN is first expanded for a period of time (e.g., 100 time-steps). The result of this expansion is a “regular” Bayesian Network that depicts the reliability dependencies over time.

An example of this expansion process over a period of three time steps for the SCRover system is depicted in Figure 4-7. As shown, expanding a BN increases the complexity of the Bayesian Network by creating more nodes. This process does affect the scalability of the model and its ability to perform inference in a reasonable time.


Figure 4-7. Expanded View of the SCRover’s Dynamic Bayesian Network

The inference problem in a BN is NP-hard in general, and the complexity of the algorithm is exponential in the number of nodes and the number of variables represented at each node (in our case, a single reliability variable per node). Approximation algorithms such as variational methods, sampling (Monte Carlo) methods, or parametric approximation methods [81] may be applied to provide a solution to the problem.

In our approach, we employ several techniques to help reduce the complexity of the problem. We estimate the reliability of the system in terms of the reliability at particular snapshots during the system's operation, or over a short time span. In doing so, we avoid creating a large network. The ramification is that we cannot provide reliability analysis over a long period of time, but we can do so for smaller time steps. The overall complexity of the approach can be further improved by using principles of hierarchy in architectural models. This helps us reduce the complexity of the components by reducing their number of states, which in turn decreases the number of nodes in the Bayesian Network and directly improves the efficiency of the inference algorithms.

1. Create a node for every state in the GBM.
2. Add an init node to represent the reliability of the system's instantiation process. Connect this node to all the nodes corresponding to the initial states of components.
3. For each component, add a component reliability node to represent the (previously predicted) component reliability value. Connect this node to the initial state of the corresponding component.
4. For each transition in a component's state machine, find the nodes corresponding to its origin and destination states, and draw a link from the origin node to the destination node. If a link produces a cycle in the BN, designate it as a delay link.
5. For each el/ao pair in a component's state machine smi, seek all transitions in other components' state machines where an event matches the action ao. Add a link from the origin node of el/ao in smi to the destination node of the matching transition in the other state machines. If a link produces a cycle in the BN, designate it as a delay link.
6. For each component, add a set of relevant failure nodes.
7. For each node, if there is a "defective" outgoing transition in the corresponding state in the GBM, add a failure link to the appropriate failure node.

Figure 4-8. Summary of the BN's Qualitative Construction Steps

Figure 4-8 provides a quick summary of all the steps required to build the qualitative part of the Bayesian reliability model for assessing the architectural reliability of software systems. Now that the graphical (qualitative) part of the Bayesian Network is constructed, we are ready to assign conditional probability values to each node in the network. These conditional probabilities are used for inference, enabling probability estimation, which in turn results in reliability prediction for the system.

4.3.2. Quantitative Representation of Our Bayesian Network

A major challenge of reliability prediction before the system's implementation phase is its unknown operational profile. If the operational profile of the system were available, the conditional probability values at the various nodes could be deduced using statistical techniques similar to those discussed in Chapter 3. The problem of estimating reliability would then be transformed into performing standard inference on the available data. However, given the uncertainties associated with early reliability estimation (e.g., during the architecture phase), the best that can be done is to offer an analytical method that, given the known information about the system (its topology, components' interactions, and individual components' reliabilities), derives the conditional probability values at each node using the knowledge of the system's architect. These conditional probability values describe the reliability (probability of successful operation) at each node, given the reliability of its parents.

Conceptually, at each node we need to define a formula that specifies the dependency between the node's reliability and the reliabilities of its parents. In other words, we need to calculate the conditional probability of successful operation at that node, given the probability of successful operation of its parents. For a node n with parents p1, ..., pn, we need to calculate Pr(n | p1, ..., pn). The reliability at node n depends on the way each parent affects n. This relationship is one that can be logically formulated by the architect. For instance, consider node controller.S3 in the SCRover system. Its reliability depends on the reliability of its parents controller.S2 and actuator.S1. We ask the architect to specify this relationship for each node in the Bayesian Network. Below we describe a few possibilities for generic formulae to calculate these probabilities and discuss the ramifications of each option.

In the field of Reliability Engineering, Reliability Block Diagrams (RBDs) [101] are used to graphically represent how the components of a system are connected reliability-wise. The configuration of a system is typically represented using a serial, parallel, or combined serial-parallel configuration. Other complex configurations (that cannot simply be classified or broken down into serial and parallel relations) can also be defined. Moreover, configurations such as k-out-of-n, which allow the analyst to specify a form of redundancy known as k-out-of-n redundancy, can be formulated. In this form of redundancy, at least k out of n elements must function correctly in order for the system to function correctly. We have taken the concepts of RBDs and applied them to our problem of formulating the relationship between the reliability of a node and that of its parents. Below we describe some of these configurations.


Serial Reliability Configuration. A node and its parents are in a serial type of relationship with respect to their reliability dependency when the node's reliability directly depends on the reliability of all of its parents. In other words, a low reliability of any one parent directly (negatively) affects the reliability of the node, regardless of the reliability of the other parents. The reliability of a node is then given by:

$R_{node} = \prod_{i=1}^{n} R_i$

where $R_{node}$ is the reliability of the given node and $R_i$ is the reliability of parent node $i$.

For example, in SCRover's Bayesian Network, the architect may observe that the reliability of the system at estimator.S2 has a serial type of relationship with the reliability values at estimator.S1 and controller.S2. This indicates that any change in the reliability at the two parent nodes directly affects the reliability at estimator.S2. Assuming $R_{12}$ and $R_{21}$ represent the reliability values at controller.S2 and estimator.S1, respectively, the reliability at the estimator.S2 node can be formulated as:

$R_{estimator.S2} = R_{12} \times R_{21}$
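As a hedged illustration, the serial formula can be evaluated directly; the numeric parent reliabilities below are hypothetical placeholders, not values taken from the SCRover models:

    # Serial configuration: the node is only as reliable as the product of its parents.
    import numpy as np

    def serial_reliability(parent_reliabilities):
        return float(np.prod(parent_reliabilities))

    R12, R21 = 0.95, 0.97   # assumed parent reliabilities, for illustration only
    print(serial_reliability([R12, R21]))  # 0.9215 -- never above the weakest parent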


In this type of relationship, the reliability value of a node is always less than or equal to the reliability of its least reliable parent. In other words, as time progresses, the reliability of the system always decreases, and at best the system remains as reliable as it was in its previous time step. Clearly, if there are design decisions built into the system's architecture to enhance its reliability (e.g., redundancy), or to build fault tolerance into the system, a serial relationship cannot describe them sufficiently. A discussion of the ramifications of this relationship, and insights into situations where it is useful, is given at the end of this section.

Parallel Reliability Configuration. In general, a parallel configuration relationship between a node and its parents can be used to represent the concept of redundancy. The node's reliability is equal to or greater than the reliability of its most reliable parent. In this case, the unreliability of a node with n statistically independent parallel parent nodes is the product of the unreliability values of all of the parents. In other words, in a parallel setting, all n parents must have very low reliability for the node to be very unreliable; if any of the n parents is highly reliable, then the node will still be very reliable. The reliability of a node in a parallel configuration is then given by:

$R_{node} = 1 - \prod_{i=1}^{n} (1 - R_i)$

where $R_{node}$ is the reliability of the node and $R_i$ is the reliability of parent node $i$.

In the real world, examples of this type of configuration include RAID-1 computer hard drive systems, standard automobile brake systems (where the front and back brakes typically act as a redundant mechanism), as well as cables supporting a floating bridge.

In the context of the SCRover system discussed throughout this dissertation, redundancy is not designed into the system. Consequently, an example of a node with this property cannot be provided. However, some analysis of this configuration and its ramifications for the reliability of a node, and thus the reliability of the system, can be found in Chapter 6.

Other Complex Configurations. The serial and parallel configurations discussed above are just two very basic forms of configuration that could determine the probabilistic relationship between a node and its parents. Other customized configurations may describe the relationship between parent and child nodes in a complex system. One example is a partial parallel configuration known as the k-out-of-n parallel configuration. In this type of configuration, at least k out of a node's n parents must function correctly in order for the node to function correctly. An example of a system where this configuration is relevant is a four-engine airplane that requires a minimum of two engines to fly and still satisfy minimal reliability requirements. From a reliability perspective, this is a case of a partial-parallel configuration: k=2 out of n=4 engines must be reliable in order to ensure system reliability.

To generalize this case, one can observe that as the number of parent nodes that are required to be reliable approaches the total number of parents, the behavior of the configuration approaches the serial reliability case, where all the parent nodes are required to be highly reliable in order for the child to be highly reliable. Detailed analysis of this case may be found in Chapter 6.
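The following sketch illustrates the parallel and k-out-of-n computations described above. The k-out-of-n form assumes statistically independent parents with a common reliability R (the classic binomial formulation, which the text does not spell out); the function names and numeric values are illustrative assumptions:

    from math import comb

    def parallel_reliability(parent_reliabilities):
        unrel = 1.0
        for r in parent_reliabilities:
            unrel *= (1.0 - r)
        return 1.0 - unrel

    def k_out_of_n_reliability(k, n, r):
        # Probability that at least k of n independent parents (each with
        # reliability r) operate correctly.
        return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

    print(parallel_reliability([0.9, 0.9]))      # 0.99 -- redundancy helps
    print(k_out_of_n_reliability(2, 4, 0.9))     # four-engine example, k = 2
    print(k_out_of_n_reliability(4, 4, 0.9))     # k = n degenerates to the serial case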

Using the above configuration scenarios, as well as customized relationships that may be relevant to specific nodes in a system, we are able to formulate the conditional probabilities at each node in the network. In the next section, we discuss how we use this and similar techniques to perform system-level reliability analysis based on the Bayesian methodology.

4.3.3. Discussion and Insights

Deciding on the particular reliability configuration at each node is a task for the architect, who has sufficient knowledge about the interactions in the system and the dependencies between its various parts. In particular, the relationships may be defined by answering some of the following questions: How do the states of components affect the reliability of one another? Would a failure of a single component result in system failure, or does it have to be combined with other components' failures to do so? Which components need to fail in order for the system to fail? Are there any redundant components, such that failure of one of them does not affect the system's reliability? Answering these questions enables the architects to understand when failures happen, and helps them formulate stochastic formulas that describe the failure behavior of components, and consequently predict the probability of system failure.

In the case of a system of components configured in a serial fashion, we noted that the least reliable component has the biggest influence on the reliability of the system. Conversely, in the case of components with parallel configuration, the component with the highest reliability has the biggest effect on the system's reliability, since the most reliable component is the one that will most likely fail last. This is an important property of the parallel configuration, specifically when making design decisions aimed at improving the system’s dependability in general. Evaluation of this relationship may be found in Chapter 6.

4.4 System Reliability Analysis

In Section 4.3 a detailed discussion of the construction of a Bayesian reliability model using the system's global behavioral models was provided. In this section, we take that discussion one step further and demonstrate how those principles can be leveraged when analyzing the reliability of systems. This is done in terms of insights and guidelines for applying those principles to systems with different characteristics, illustrated in the context of the SCRover example.

The component reliability values obtained via the AHMM modeling discussed in Chapter 3 serve as estimates of the individual component reliabilities in isolation. That is, the quantification of an individual component's failure behavior is performed by considering the component's normal and failure behaviors in isolation. (While some of the analyses we performed on architectural models consider the components within a system, the estimated reliability value does not include quantification of failures that are related to interactions among components.) We use this stand-alone measurement as the basis of the "goodness" of each component in performing its operations and call it the component's raw reliability value. We use a component's raw reliability value as a coefficient in the reliability of the component's initial state. Recall that instantiation links from the init node in the BN to all components' initial states signify the reliability of the system's startup process. The conditional probability of the node corresponding to the initial state of each component must then be formulated as a function of the reliability of the startup process and the component's raw reliability value.

For example, in the case of the controller component in the SCRover, the controller.S1 node corresponds to the component's initial state. The probabilistic formula describing the reliability at this node is thus:

$R_{controller.S1} = R_{init} \times R^{raw}_{controller}$

where $R_{init}$ is the reliability value of the startup process (assigned by the architect), and $R^{raw}_{controller}$ is the raw reliability of the controller component obtained via the AHMM approach. Assuming that the reliability of the startup process is 0.999, the reliability of the controller component's initial state (S1) in the SCRover system is calculated as follows:

$R_{controller.S1} = 0.999 \times 0.93 = 0.929$

The reliability of the startup process is a measure to be assigned by the architect. This parameter is used to represent the uncertainties associated with the system's startup process. Experience with the particular system (or functionally similar systems) is the main source for initializing this value. If desired, a value of 1 could be used, which essentially represents no uncertainty predicted for the system's startup process. We believe using a numerical value less than 1 strengthens the model by accounting for unknown circumstances at startup.



Figure 4-8. SCRover’s Bayesian Network (top) and the Expanded Bayesian Network (bottom)

An important observation is that in the SCRover's Bayesian Network (depicted in the top portion of Figure 4-8), there are delay links entering the controller.S1 node. Once the network is expanded, the delay links are translated to normal links in the subsequent time steps, while the links in the first time step include only the links from the init node and the component reliability node. Thus the calculation above applies to the first time step in the expanded network. For subsequent time steps, there are additional parent nodes to the controller.S1 node whose reliabilities must be incorporated into the probabilistic formula.

This approach is used to instantiate the reliability of the nodes corresponding to the initial state of all components in the system. The next step involves assigning probabilistic formulas to determine conditional probabilities at all other nodes, given their parents’ reliabilities. For each node the architect will need to determine how the parents’ reliability can affect the node’s reliability. For example, in cases where two or more parents serve the same purpose in the model and can be treated as redundant, a parallel configuration may be assigned. In other cases a serial configuration may be suitable when each parent has equal and direct influence on the reliability of the node, and when the node can never be more reliable than any of its parents. Still in other cases, customized relationships may be defined depending on the domain and application-specific knowledge, or on data from past experiences. For example, different weights may be given to different parent nodes to amplify the importance of the reliability of different parent nodes.

The final step involves assigning probabilistic formulas to each failure node. These formulas determine how different nodes in different components contribute to specific types of failure. There is an important difference between the failure nodes and the other nodes in the network. Throughout the network, for each non-failure node, we model the probability of success, or reliability. At each failure node, however, there is a conceptual change and the probability of failure is modeled instead. Consequently, the formulas at the failure nodes must reflect this distinction by assigning the complementary probability value (1 − R) at the node.

As an example, consider the failure node F_signature_controller in the SCRover model. This specific failure may be caused by problems at either the controller.S1 or controller.S2 nodes. Let us assume that the architect has determined that each of the parent nodes contributes equally to this signature failure, and that the unreliability at this node is equal to or greater than the unreliability of the most unreliable parent. This justifies a serial configuration for this node. The conditional probability value at this failure node is thus calculated as:

$F_{controller.F4} = 1 - \prod_{i=1}^{n} R_i = 1 - (R_{controller.S1} \times R_{controller.S2})$

where $F_{controller.F4}$ is the probability of failure at the controller.F4 node.

A similar approach to failure probability calculation must be followed for all failure nodes in the network. Once probabilistic formulas for all the nodes in the model are assigned, Bayesian inference is used to leverage the known data (in this case the components' raw reliability values and the nodes' probabilistic relationships) to infer unknown information about the system. In our case the primary relevant unknown information about the network is the probability of occurrence of each failure type. The system's failure probability (or its unreliability) may be formulated by aggregating the probabilities of occurrence of all types of failure in the system.

Devising a single generalized technique for aggregating individual failure probabilities into an overall system failure probability is impractical given the different development scenarios and domain-specific issues associated with different applications. Recall that the number and types of failure nodes are tied to our defect classification discussed in Chapter 2. Using the above approach, we obtain the failure probabilities of the various components in the system. We propose two alternative approaches for aggregating these probabilities and obtaining the system's probability of failure, or its unreliability. The first is a simple approach that results in a "basic" reliability prediction by directly combining all components' failure probabilities using a Radar Chart. The second approach incorporates the cost values assigned by the domain expert into the aggregation process and results in a "cost-based" reliability prediction. Below we discuss each technique in detail.

Basic Reliability Prediction. In this approach, we simply calculate the system's failure probability (or its unreliability) in terms of the cumulative effect of each component's failures on the system. We plot the values of the various failure probabilities using a simple Radar Chart. Each failure probability is plotted along an axis. The number of axes is equal to the number of failure nodes in the BN, and the angles between all axes are equal.

Figure 4-9. Cumulative Effect of Different Failures in SCRover

Figure 4-9 depicts our instantiation of the Radar Chart for the SCRover system. Four axes represent the controller's PrePostCondition, Protocol, and Signature failures as well as the estimator's Signature failure. Each axis has a maximum length of 1. A point closer to the center on an axis depicts a low value, while a point near the edge of the polygon depicts a high value for the corresponding failure probability. The cumulative effect of the failures is obtained by calculating the surface area enclosed by the values on all the axes. The surface area is calculated using the triangulation method discussed in Chapter 2. Using this technique, the overall system reliability is estimated at approximately 0.983:

$\text{area} = \frac{1}{2} \times \frac{1}{2} \times \sin\left(\frac{2\pi}{4}\right) \times [0.136829 \times 0.07093 + 0.07093 \times 0.07093 + 0.07093 \times 0.262404 + 0.262404 \times 0.136829]$

$\text{Reliability} = 1 - \text{area} = 0.982686738$
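The triangulation can be reproduced with a few lines of code; the sketch below simply applies the area formula above (including its normalization constant) to the four axis values given in the text:

    import math

    def radar_area(values):
        n = len(values)
        adjacent = sum(values[i] * values[(i + 1) % n] for i in range(n))
        return 0.5 * 0.5 * math.sin(2 * math.pi / n) * adjacent

    failure_probs = [0.136829, 0.07093, 0.07093, 0.262404]  # axis values from the text
    print(1 - radar_area(failure_probs))                    # ~0.9827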

The reliability value estimated using this technique does not directly incorporate the cost associated with defect recovery as specified in our cost-framework. The cost values are, however, indirectly considered as part of the initial component reliability values obtained via the HMM methodology. Calculating reliability using this approach is beneficial when considering system reliability from a purely technical point of view: the interactions among components. However, in some cases, we may be interested in performing reliability-based risk analysis and incorporating economic aspects of software development into the reliability estimation process. In those cases, one approach is to directly incorporate the cost associated with each failure into the aggregation process.

Cost-based Reliability Prediction. In this approach we leverage the cost-framework discussed in Chapter 2, and incorporate the cost of recovery for each type of failure (as assigned by the domain expert) into the cumulative failure probability calculation. Recall that a domain expert instantiates our cost-framework by assigning failure costs in the system. The cost of a failure has an inverse relationship with the probability of recovering from that type of failure: as the cost associated with the recovery increases, the probability of recovery decreases, and vice versa.

To do so, we build a Radar Chart that takes the recovery probabilities into consideration. Specifically, the value of each axis on the chart is adjusted to incorporate the cost of recovery from the related failure type. The recovery probabilities assigned by the domain expert for our cost-framework in Chapter 2 are presented again in Figure 4-10. Leveraging this data, we adjust the numerical value of each axis in the Radar Chart by multiplying the failure probability by the cost of recovery for that failure. The intuition is that the system is likely to recover from a failure with a high probability of occurrence for which a low cost of recovery (i.e., a high probability of recovery) is assumed; conversely, a failure with a very low probability of occurrence but a low probability of recovery may be considered more critical than a failure with a low probability of occurrence and a high probability of recovery. The new Radar Chart depicting the adjusted values, given the previous calculation and the data in Figure 4-10, is shown in Figure 4-11. The new system reliability value calculated using the weighted cumulative approach is:

$\text{area} = \frac{1}{2} \times \frac{1}{2} \times \sin\left(\frac{2\pi}{4}\right) \times [0.02770 \times 0.0155 + 0.0155 \times 0.02048 + 0.02048 \times 0.05314 + 0.05314 \times 0.02770]$

$\text{Reliability} = 1 - \text{area} = 0.99917$
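The same area formula applied to the adjusted (cost-weighted) axis values reproduces this result; the sketch below simply uses the four weighted values given above:

    import math
    w = [0.02770, 0.0155, 0.02048, 0.05314]   # adjusted axis values from the text
    area = 0.25 * math.sin(math.pi / 2) * sum(w[i] * w[(i + 1) % 4] for i in range(4))
    print(1 - area)                            # ~0.99917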


Recovery probability for each defect type: Usage 0.7075; Incomplete 0.5725; Signature 0.7975; Pre/Post Condition 0.780625; Interaction Protocols 0.71125.
Figure 4-10. Recovery Probability for Different Failures (Inverse of Cost)


Figure 4-11. Weighted Cumulative Effect of Failures for SCRover

Incorporating the cost values in this case resulted in an increase in the predicted reliability value. This is a result of the specific recovery probability values (the inverse of the cost) for the various failures. In particular, since the architect considers some of the failures to have a high recovery probability, the model reflects that, given the available resources, it is possible to recover from the corresponding failures. Clearly, this approach would be most beneficial if it were directly tied to an economic model of the software development process, one that explicitly identifies resources and justifications for specific cost values.

The complete model of the SCRover's Bayesian Network generated in the Netica environment can be found in Appendix C. Analysis of this model to demonstrate the properties of the reliability model is provided in Chapter 6.


Chapter 5: Tool Support

In order to support various computational tasks related to this dissertation research, we use three loosely integrated environments. Our in-house architectural modeling, analysis, and evolution environment, Mae [111], enables us to model a system and its components using the different Quartet views and to ensure consistency among the views. It provides us with a set of defects revealed by the analyses the tool enables. We then use an extension to an openly available Hidden Markov Modeling toolbox for MathWorks' Matlab [82] to perform component reliability calculations. Finally, we use Norsys's Netica environment [90], capable of performing Bayesian inference, to perform system-level reliability analysis. Figure 5-1 depicts the process view of the various tools used in this research. The numbers 1, 2, and 3 on each arc signify the phases of the process: architectural modeling and analysis, component reliability modeling, and system reliability modeling, respectively.

In this chapter, we first describe the Mae environment in terms of its architecture and its functionality. We then discuss the Matlab extension developed to perform component reliability modeling using Hidden Markov Models. Finally we introduce the Netica environment for Bayesian Modeling of system-level architectural reliability.


5.1 Mae

The Mae environment is an architectural modeling, analysis, and evolution environment that combines principles of architectural modeling with those of configuration management [111]. It leverages a rich system model to provide a novel approach for managing architecture-centered software evolution. It anchors the evolution process to the architectural concepts of components, connectors, subtypes, and interfaces, enhancing it with the power and flexibility of the configuration management concepts of revisions, variants, options, and configurations. The result is a novel architectural system model with an associated architecture evolution environment. The environment is extensible: it allows users to customize the definitions of components, connectors, and interfaces, and enables them to define additional properties of interest using a set of XML schemas. The tool seamlessly integrates the customized definitions and regenerates its graphical user interface to allow modeling of the new concepts.

The architecture of this environment is shown in Figure 5-2. It consists of four major subsystems. The first subsystem, the xADL 2.0 data binding library [25], forms the core of the environment. The data binding library is a standard part of the xADL 2.0 infrastructure that, given a set of XML schemas, provides a programmatic interface to access XML documents adhering to those schemas. In our case, the data binding library provides access to XML documents described by a set of customized XML schemas that offer a rich definition of architectural concepts in accordance with the Quartet approach.


Figure 5-1. Overall View of the Required Tools for Architectural Reliability Modeling and Analysis

Therefore, the xADL 2.0 data binding library, in essence, encapsulates the Quartet models by providing a programmatic interface to access, manipulate, and store evolving architecture specifications. The details of the XML schemas providing the definition of the Quartet approach may be found in Appendix A.

The three remaining subsystems of Mae each perform separate but complementary tasks as part of the overall process of managing the evolution of a software architecture:



Figure 5-2. Mae’s Architecture


• The design subsystem combines functionality for graphically designing and editing an architecture in terms of its structure and its behavior. This subsystem supports architects in performing their day-to-day job of defining and maintaining architectural descriptions, while also providing them with the familiar check-out/check-in mechanism to create a historical archive of all changes they make.


• The selector subsystem enables a user to select one or more architectural configurations out of the available version space.


• Finally, the analysis subsystem provides sophisticated analyses for detecting inconsistencies and defects in the architectural models.

The analysis subsystem provides vital support for our approach to reliability estimation. It offers the ability to ensure consistency among the various Quartet views of a software system. The basis for these analyses, which ensure inter- and intra-view consistency among the Quartet views, was described in Chapter 2.

Once the architectural modeling and analysis phase is complete, a set of defects are obtained. We classify these defects according to our architectural defect taxonomy discussed in Chapter 2, and use the results directly in the component-level and system-level reliability analysis.

5.2 Component Reliability Modeling

Once defects from the architectural modeling and analysis phase are obtained, the data is used to quantify the effect of each defect and obtain a quantification of the reliability of each component. Component reliability estimation is done by building an extension to the Matlab Hidden Markov Model toolbox [82]. This extension leverages the results obtained from the Expectation-Maximization algorithm to calculate the steady-state vector of the Markov model associated with each component. In doing so, it also incorporates results from the cost-framework in terms of the cost of, and probability of recovery from, various defects.

The Matlab code used for estimating component reliability values is presented in Appendix B.
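For readers without access to the Matlab toolbox, the following Python sketch illustrates the steady-state computation only: the stationary distribution of a row-stochastic transition matrix is obtained as the left eigenvector associated with eigenvalue 1. The 3-state matrix is a made-up example, not a SCRover component, and this is not the dissertation's implementation.

    import numpy as np

    def steady_state(P):
        """Stationary distribution pi with pi = pi @ P and sum(pi) = 1."""
        vals, vecs = np.linalg.eig(P.T)
        idx = np.argmin(np.abs(vals - 1.0))
        pi = np.real(vecs[:, idx])
        return pi / pi.sum()

    P = np.array([[0.90, 0.08, 0.02],
                  [0.10, 0.85, 0.05],
                  [0.30, 0.00, 0.70]])
    print(steady_state(P))   # long-run proportion of time spent in each state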


5.3 System Reliability Modeling

Once the reliabilities of individual components are obtained via our HMM-based reliability model, we use a Bayesian Network editor called Netica [90] to build a system-level Bayesian reliability model. Netica offers a graphical editor for creating Bayesian Networks and, in addition to Bayesian inference, offers sensitivity analysis functionality. It also enables the user to specify probabilistic equations at each node in the model, which we use to express the reliability of a node based on the reliability of its parents. The network is then compiled and, upon completion of the inference process, the probability of arriving at each failure node is calculated. We then use those values to obtain a measure of the system's overall reliability using the techniques described in Chapter 4.

Netica stores the models in a textual file format. The Bayesian model of the SCRover system is presented in Appendix C.


Chapter 6: Evaluation

In this chapter, we evaluate our software architecture-based reliability modeling approach to demonstrate that reliability prediction of software systems’ architectures early during the development life-cycle is both possible and meaningful. The main challenge associated with the early reliability prediction problem is the lack of implementation artifacts, and thus lack of knowledge about the systems’ operational profiles. Suitable reliability models must be able to address this challenge and the associated uncertainties.

Recall that our approach uses a set of architectural modeling views called the Quartet to compositionally model software systems in terms of their constituent components. Analysis of these models reveals potential architectural defects that could result in a reduction in the reliability of components, and thus the reliability of the entire system. We quantify the impact of these defects using an architectural defect classification and a cost-framework. An Augmented Hidden Markov Model (AHMM) leverages the quantification results, as well as the Quartet models, to predict individual components’ reliabilities. Finally, a Bayesian reliability model is constructed to compositionally predict the overall reliability of the system, given the reliabilities of individual components and their interactions.


The goal of our evaluation is to ensure that our methodology used for reliability prediction is sound, and that the results are both meaningful and useful. We evaluate the approach using the following criteria:

1. The coverage of our architectural analyses, as well as our defect classification, is evaluated empirically.
2. The component reliability prediction methodology is evaluated using sensitivity and uncertainty analyses. The goal is to show the sensitivity of the model to changes in various model parameters. Moreover, we demonstrate that our model is capable of handling uncertainties associated with the components' unknown operational profiles.
3. The complexity and scalability of our adaptation of the Expectation-Maximization algorithm is evaluated theoretically.
4. The system-level reliability model is evaluated in terms of sensitivity analyses with respect to various model parameters. These analyses also demonstrate the usefulness of the approach in terms of its ability to offer helpful insight that can aid the architect as a decision tool during the development process.

The rest of this chapter is organized as follows. In Section 6.1 we discuss the empirical evaluation of our architectural modeling, analysis, and defect classification methodology. Sections 6.2 and 6.3 present the evaluation of our component-level and system-level reliability models, respectively.

6.1 Architectural Analysis and Defect Classification

Our architectural modeling, analysis, and evolution environment Mae [111] provides utilities for modeling the structural and behavioral aspects of software systems. It also offers a suite of analysis tools for Quartet models based on the view consistency and conformance principles discussed in Chapter 2. Since the specific analysis techniques are not a contribution of this dissertation research, the evaluation of those techniques is beyond the scope of this work. However, the architectural defects revealed by the analyses are directly leveraged by our reliability models, and thus we present an empirical evaluation of our defect classification framework.

Our defect classification was developed after extensive study of the results of architectural analyses obtained from three different modeling and design methodologies. This study was done in the context of the SCRover project [52], a robot testbed based on NASA JPL's Mission Data System (MDS) [27], and in the context of NASA’s High Dependability Computing Program (HDCP) [86]. The goal of the study was to understand the tradeoffs among different architectural modeling approaches.

We extensively studied and documented our experience in using a UML-based methodology called Model-Based Architecting and Software Engineering (MBASE) [125], as well as two representative Architecture Description Languages (ADLs), Acme [40] and Mae [111], to model SCRover. Both the Acme and Mae models were derived from the initial MBASE design and SCRover documentation, but were developed independently of each other and focused on different aspects of the architecture. We studied the differences that resulted from focusing on varying aspects of the original documentation. We show how these differences led to the automatic detection of distinct, but complementary, classes of errors, and how the automatic analysis afforded by either Mae or Acme yields better results than peer reviews of the SCRover documentation for architectural defects [110].

The results of these studies reinforced the hypothesis of the benefits of multi-view modeling. The independence of the research groups performing each modeling and analysis activity enabled us to empirically validate our defect classification. In the rest of this section we first offer some statistics on the type and numbers of defects that our architectural modeling environment Mae was able to detect in comparison with the other approaches. These results were initially classified using a standard Software Development Life Cycle classification scheme [125]. Since this classification did not focus on architectural issues, we collaboratively developed a new taxonomy (presented in Chapter 2). The types and numbers of defects detected by all three approaches based on our newly developed taxonomy of architectural defects are presented later in this section.

Figure 6-1 depicts the total number of defects found by all approaches (left column) against the subset that can be captured in Mae-MDS models (middle column) and those detected by Mae (right column).


Figure 6-1. Mae Defect Detection Yield by Type

The categories in this graph correspond to a standard SDLC classification scheme [125] and include Interface, Class/Object, Logic/Algorithm, Ambiguity, Data Values, Inconsistency, and Others. We found this classification too broad and inefficient in the context of defects rooted in the architecture and design stages. For instance, the Class/Object category included both defects rooted in the architecture and implementation-level defects. Similarly, the Logic/Algorithm category contained architectural defects, some of which dealt with mismatched expectations among communicating components (i.e., pre/post-condition mismatches), while others were defects caused by violation of the MDS architectural style. Consequently, in collaboration with researchers from Carnegie Mellon University, we developed a new classification scheme (Chapter 2) that specifically focuses on defects that are architectural in nature.


Figure 6-2. Defects Detected by UML, Acme, and Mae (by Type and Number)

Figure 6-2 depicts the results of the re-classification of defects, in terms of the total number and respective types of all architectural defects detected by the three modeling approaches using our defect classification.

The three independent modeling approaches not only confirmed each other’s analysis results, but also demonstrated the value of viewing SCRover (and MDS) from different perspectives. Mae and Acme in tandem detected all architectural defects identified by the peer-review of UML models, and additionally identified previously undiscovered defects. UML peer-reviews, on the other hand, identified additional classes of defects that were not architectural in nature [14].


The results presented here show that Quartet-based modeling is capable of capturing useful information about a software architecture, and that the related analysis can reveal critical architectural defects. Classification of these defects based on our taxonomy of architectural defects is used by a cost-framework (presented in Chapter 2) to quantify the impact of defects. The quantification results are leveraged by our reliability models to predict the components’ and the system’s reliability. The next two sections evaluate the component- and system-level reliability models, respectively.

6.2 Component Reliability Prediction

The numerical results presented in this section are obtained via extensions to our Java-based architectural modeling and analysis environment, Mae [111], and Matlab simulations. Our results demonstrate that our AHMM methodology is effective in modeling the architectural reliability of software components in the presence of the uncertainties associated with early reliability prediction. Furthermore, they demonstrate that our model is useful in enhancing the development process by offering cost-effective strategies for mitigating architectural defects. The latter is done by providing sensitivity analyses aimed at identifying the most critical defects in the component's architecture. In providing these analyses, we also demonstrate that our reliability prediction approach is meaningful: changing model parameters exhibits predictable trends in the estimated component reliability values. Finally, we provide an analytical evaluation of the scalability and complexity of our approach. The results indicate that the complexity and scalability of the model are bounded by the complexity and scalability of the underlying formalisms (state machines and the Expectation-Maximization algorithm). The details of the evaluations in all three categories – uncertainty, sensitivity, and complexity – are provided in the rest of this section.

6.2.1. Uncertainty Analysis

There are multiple sources of uncertainty associated with reliability modeling during the architecture and design phases. In the context of component reliability modeling, we have identified two primary sources of uncertainty:

1. Uncertainties associated with incorrect component behavior.
2. Uncertainties associated with the unknown operational profile.

The first type of uncertainty at early stages of software development is a side-effect of the nature of architectural models. Typically these models specify only the intended (or desired) behavior of components. The undesired behaviors, which may reveal the cause of defects (and subsequent failures), are typically not modeled explicitly. Our component reliability model innately addresses this type of uncertainty. We extend the model of the desired behavior of components (the dynamic behavioral model) and enhance it with states that represent failure conditions and transitions that represent the nature of these failures. Recall the details of our component-level reliability model in terms of failure states, and failure and recovery transitions, as described in Chapter 3. While a domain expert is expected to instantiate the failure transition probabilities, the recovery transitions' probabilities are instantiated using our cost-framework. We evaluate the approach in terms of the uncertainties associated with the probabilistic instantiation of failure transitions. Sensitivity of the model to recovery probability values is evaluated later in this section.

The second type of uncertainty relates to the unknown operational profile of the component. A component's exact operational profile may only be obtained by monitoring its operation while deployed in the field. During early stages of development, however, conditions that are representative of the component in the field may not exist. We offer three different experiments to evaluate the uncertainties associated with the unknown operational profile. As discussed in Chapter 3, in the absence of operational profile data, we use a data synthesis approach to fabricate the training data needed for reliability modeling. Under ideal conditions, the data is generated using domain expertise. For non-ideal cases we generate the data randomly. We compare the results for a number of components under both conditions, and discuss how our approach addresses the uncertainties associated with the unknown operational profile.
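A minimal sketch of the data-synthesis idea: training sequences are fabricated by walking the component's state machine with transition weights supplied by the domain expert (or drawn uniformly when such expertise is unavailable). The state names, interface labels, and weights below are hypothetical, and this is only an illustration of the approach, not our Matlab implementation.

    import random

    def synthesize_traces(transitions, start, n_traces=50, length=20, seed=0):
        """transitions: dict state -> list of (next_state, interface, weight)."""
        rng = random.Random(seed)
        traces = []
        for _ in range(n_traces):
            state, trace = start, []
            for _ in range(length):
                choices = transitions[state]
                weights = [w for (_, _, w) in choices]
                nxt, interface, _ = rng.choices(choices, weights=weights, k=1)[0]
                trace.append(interface)   # the observed interface invocation
                state = nxt
            traces.append(trace)
        return traces

    sm = {"S1": [("S2", "getWallDist", 0.7), ("S1", "idle", 0.3)],
          "S2": [("S1", "notifyDistChange", 1.0)]}
    print(synthesize_traces(sm, "S1", n_traces=1, length=5)[0])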

6.2.1.1 Behavioral Uncertainties

Recall that our component reliability framework relies on the knowledge of a domain expert to specify the probabilities of various failures at each component state. We appreciate that determining these probabilities may be a challenging task. Especially in cases when no prior data from the component in operation exist, it may be difficult for the expert to estimate these probabilities with a great degree of confidence. In Chapter 3, we discussed that in order to accommodate this uncertainty, we opt to provide a range of analyses for component reliability. To do this, we calculate the reliability of a component given a threshold for each failure probability value. We demonstrated this technique for the SCRover's controller component in Chapter 3, and presented the predicted reliability values based on a threshold for failure probabilities. In this section, we demonstrate the changes to the predicted reliability values over the full range of possible failure probabilities. The goal is to show that the reliability model reacts predictably to different failure and recovery probabilities, and that it produces meaningful results.

Figure 6-3 demonstrates the changes to the controller component's predicted reliability values (y-axis) for a range of failure probabilities for the two failure states F1 and F2 (refer to Chapter 3 for the values of the other parameters in the reliability model). As expected, overall the reliability model reacts correctly and meaningfully to the changes in input parameters: an increase in the failure probabilities results in a decrease in the component reliability. An interesting conclusion based on the results depicted in Figure 6-3 is that changes to the value of the probability of failure F1 (denoted PF1) cause a sharper decrease in the component reliability than changes to the value of the probability of failure F2 (PF2). This may be explained by the specific reliability model of the controller component.


Figure 6-3. Controller Component Reliability Analysis Based on Various Probabilities of Failures to the Two Failure States

Recall that the expert instantiated the two failure probabilities as PF1 = 0.05 and PF2 = 0.02. Furthermore (as shown in Figure 3-4), all four (non-failure) states in the model may result in a failure of type F1 (each with probability PF1), whereas only two states may result in a failure of type F2 (each with probability PF2). Consequently, the probability that a failure of type F1 occurs from any of the four states is 4PF1, while the probability of a failure of type F2 occurring from either of the two states is 2PF2. In our analysis, we kept one of the parameters constant and analyzed the sensitivity of the model to changes in the second parameter. In this model, there is always a greater likelihood of a failure of type F1 as a result of changes to PF1.

To confirm these conclusions, we repeated the above experiment with an arbitrary component with 10 states and 14 interfaces. Let us assume that 4 types of defects are identified during architectural analysis. Our model is thus extended with 4 failure states F1, F2, F3, and F4, representing usage, interaction protocol, signature, and pre/post condition failure types, respectively. The probability of recovery from each type of failure is calculated using the values obtained from our cost-framework as depicted in Figure 6-4: RP(F1) = 0.7075, RP(F2) = 0.71125, RP(F3) = 0.7975, and RP(F4) = 0.7806, where RP(Fi) represents the recovery probability from state Fi. Let us assume that the other transition probabilities are instantiated randomly. Figure 6-5 shows the effect of changes to the various failure probability values (PF(Fi)) on the component's reliability. As expected, as the probability of the various types of failures increases, the overall component reliability decreases, with changes to PF(F4) (the probability of pre/post condition failures) having a slightly greater impact than the other failure types. Given the random and synthesized nature of this model, however, it is not possible to draw intuitive conclusions regarding the impact of different types of failures on the component's reliability.

In working with synthesized components and components with larger numbers of states, we soon realized the challenges involved in instantiating probability matrices for these models. Obviously, the easiest approach is to generate the necessary matrices randomly. The important question is: how do the results obtained from random and expert instantiation compare with each other?


Figure 6-4. Cost-framework Instantiation for Different Defect Types based on data in Chapter 2

In the next subsection we provide insights with respect to random and expert instantiation of the probability matrices that correspond to a component's operational profile. Here we discuss the effect of random and expert instantiation of the failure probability matrices.


Figure 6-5. Changes to a Random Component’s Reliability based on Different Failure Probabilities


Given the above model of an arbitrary component with 10 states, and a set of training data, we evaluated the sensitivity of the model to different failure probability values using two experiments. In both experiments, we randomly generated the matrix representing the failure probabilities. However, in the first experiment, we generated a full matrix in which essentially all the elements were non-zero. This indicates that there is a chance of all four types of failures happening from every state in the component's model. The probabilities of failure ranged from 0.0019 to 0.0995 (0.19% to 9.95%). The mean of the predicted reliabilities after 100 iterations of the EM algorithm was 77.92%, with the corresponding histogram depicted in Figure 6-6 (left). We then re-calculated the reliability of the component with a new instantiation of the failure probability matrix. This time, the random generation produced a sparse matrix with entries within the same range as the last experiment (0.19% to 9.95%). The sparse matrix indicates that only some types of failures are likely to occur at certain component states. The mean of the predicted reliability values after 100 EM iterations was 95.27%. The corresponding histogram is depicted on the right hand side of Figure 6-6. The results confirm the intuition and insights obtained from the controller component: a full matrix offers a greater opportunity for the occurrence of various failures, while the sparse matrix limits this possibility to only a few states; the more opportunity for failures, the lower the component reliability. The primary conclusion is that expert instantiation of failure transition probabilities is critical to obtaining an accurate prediction of component reliabilities, given that the expert's knowledge is more representative of the expected behavior of the component.
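The two matrix-generation schemes can be sketched as follows; the dimensions and probability range follow the experiment above, while the random seed and the sparsity level are assumptions made only for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_failures = 10, 4
    low, high = 0.0019, 0.0995

    # "Full" matrix: every state has a chance of every failure type.
    full = rng.uniform(low, high, size=(n_states, n_failures))
    # "Sparse" matrix: keep only ~25% of the entries (assumed sparsity level).
    sparse = full * (rng.random((n_states, n_failures)) < 0.25)

    # With more non-zero failure entries there are more opportunities to fail,
    # so the predicted component reliability is lower for the full matrix.
    print(full.sum(), sparse.sum())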


Figure 6-6. Predicted Reliability for an Arbitrary Component Given a Full Failure Probability Matrix (Left), and a Sparse Failure Probability Matrix (Right)

In the next subsection, we offer our analysis of the reliability model with regard to its ability to handle other uncertainties, particularly those associated with the component’s operational profile.

6.2.1.2 Operational Profile Uncertainty

The biggest challenge in architecture-level prediction of component reliability is the unknown nature of a component's operational profile at this stage of development. As discussed earlier, a useful reliability model must be able to handle this type of uncertainty and produce meaningful results. Our model addresses this challenge, and in this section we evaluate it using a set of analyses.

As mentioned in Chapter 3, in cases where the operational profile of the component is not available, we essentially synthesize this data. This is done by synthesizing a set of training data for the HMM-based reliability model, and by leveraging domain knowledge. The Expectation-Maximization algorithm then uses this data to estimate the best operational profile for the component. The reliability model leverages the obtained operational profile and provides a prediction of the component's reliability. Since the data synthesis process relies on the domain expert's knowledge, it is critical to analyze the ability of the model to handle uncertainties associated with this instantiation. We do so using two types of analyses. First, in cases where a domain expert instantiates the model, we want to analyze the importance of exact instantiation on the estimated reliability. In other words, we want to determine the effect of fluctuations within the Initial Transition Probabilities (ITPs) on the predicted reliability. The second set of analyses is aimed at determining the importance of expert instantiation, and the impact of random instantiation of the transition probabilities in cases where domain expertise is not available. This is particularly critical for cases when the model is too complex (too many states or interfaces) and thus the instantiation process is too tedious, or when sufficient expert knowledge is not available.

To determine the effect of fluctuations on the model's initialization parameters, we performed sensitivity analysis both on the controller component and on synthesized (arbitrary) components of various complexities (5, 10, and 20 states). We analyzed each component using an Initial Transition Probability matrix instantiated by an expert, and various levels of noise (fluctuation) in the matrix values (5%, 10%, and 20% noise). These noise levels represent a range of errors from minor to rather significant.

(Data plotted in Figure 6-7: a 0.0215% change in the controller component's reliability for 5% noise, 0.0430% for 10% noise, and 0.0968% for 20% noise.)

Figure 6-7. Percentage of changes in the reliability value of the controller component (5%, 10%, and 20% Noise)

Before presenting the results, let us explain the methodology for incorporating noise into the matrices. For each row of a matrix, one element is selected at random. Then a specific percentage of its value (e.g., 5%) is subtracted from the element's value. Finally, the same amount is added, in total, to the rest of the elements in the same row to ensure that the sum of each row still adds up to 1 (refer to Chapter 3 for the specific properties of these matrices). This methodology can represent an architect's mistake in a single probability value (and its domino effect on other related probability values).
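The noise-injection methodology just described can be summarized in a few lines of Python. This is a minimal sketch (the function name is ours): one element per row is perturbed by the given noise level, and the subtracted amount is spread evenly over the remaining elements so that each row still sums to 1.

    import numpy as np

    def add_row_noise(matrix, noise_level=0.05, seed=None):
        """Perturb one randomly chosen element per row of a row-stochastic matrix
        by noise_level (e.g., 0.05 for 5% noise) and redistribute the subtracted
        amount over the rest of the row, preserving the row sums.
        Assumes each row has at least two elements."""
        rng = np.random.default_rng(seed)
        noisy = np.asarray(matrix, dtype=float).copy()
        n_rows, n_cols = noisy.shape
        for i in range(n_rows):
            j = rng.integers(n_cols)              # element selected at random
            delta = noise_level * noisy[i, j]     # e.g., 5% of its value
            noisy[i, j] -= delta
            noisy[i, np.arange(n_cols) != j] += delta / (n_cols - 1)
        return noisy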

In the case of the controller component, the result of noise introduction into the ITPs can be seen in Figure 6-7. The three experiments (depicted along the x-axis) correspond to 5%, 10%, and 20% noise, respectively. The changes in the predicted component reliability (in terms of percentages) are depicted along the y-axis.

As can be seen, fluctuations of up to 20% have resulted in at most a 0.1% change in the predicted reliability. This shows that the model in this case is resilient to uncertainties associated with the unknown operational profile, and even significant fluctuations in the ITP (in this case up to 20%) have only a minor influence on the results.

Additional experiments resulted in similar conclusions about the ability of the model to handle this type of uncertainty. We performed similar experiments on a set of arbitrary components. We varied the number of states in each component in order to study the effect across components with different complexities. In particular, we studied the results in the context of components with 5, 10, and 20 states, with 10 interface elements and 4 failure states. The results depicted in Figure 6-8 demonstrate that noise of up to 5% induced in the components resulted in changes between –0.055% and +0.014% in the component reliability. Similarly, a 10% noise resulted in a –0.111% to +0.042% change in the estimated reliability. Finally, a 20% noise resulted in a –0.236% to +0.099% change in the component reliability value. In other words, noise of up to 20% in the ITPs resulted in a fluctuation of less than 0.24% in the component reliability value. Once again, these results confirm that the model is resilient to fluctuations in the component's transition probabilities. In other words, if the domain expert is unable to specify the "exact" operational profile for the component, the impact on the estimated reliability is not too pronounced. An interesting question at this point is, why is this the case?


(Data plotted in Figure 6-8, giving the percentage change in reliability for 5%, 10%, and 20% noise, respectively: 10-state component: –0.0557%, –0.1115%, –0.2368%; 20-state component: –0.0145%, –0.0145%, –0.0291%; 5-state component: +0.0142%, +0.0426%, +0.0995%.)

Figure 6-8. Percentage Change in Reliability Value of Three Arbitrary Components with 5, 10, and 20 States (5% noise, 10% noise, and 20% noise, respectively)

The answer lies at the heart of our approach. From Chapter 3, recall the last step of the reliability prediction process, where the reliability value is calculated as a function of the probability of not being in a failure state at time tn. In other words:

Reliability = 1 − Σ_{i=1}^{M} V(Fi)

where V(Fi) is the entry of the steady-state probability vector corresponding to failure state Fi. It can be seen that the calculated reliability value depends on the probability of being in a failure state at time tn.

Recall the Markov property, which assumes that the probability of transitioning to the next state at time t+1 depends only on the state of the system at time t and is independent of its past history. Using this assumption, the final calculated reliability primarily depends on the probability of being in a failure state and is independent of the specific path(s) taken to reach that failure state. This is consistent with the results presented earlier: the reliability model is very sensitive to changes in the values of the failure probability matrix, while changes in the values of the transition probability matrix (ITP) do not greatly impact the predicted reliability value.
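A small numerical illustration of this last step follows. The steady-state vector of a row-stochastic transition matrix is approximated here by power iteration, and the reliability is read off as one minus the total steady-state probability mass in the failure states. The matrix below is an arbitrary toy example, not one of the components evaluated in this chapter, and the power-iteration approach is only one of several ways to obtain the steady-state vector.

    import numpy as np

    def steady_state(P, n_iter=10_000):
        """Approximate the steady-state probability vector of a row-stochastic
        transition matrix P by power iteration."""
        v = np.full(P.shape[0], 1.0 / P.shape[0])
        for _ in range(n_iter):
            v = v @ P
        return v

    def component_reliability(P, failure_states):
        """Reliability = 1 - sum of steady-state probabilities of failure states."""
        v = steady_state(P)
        return 1.0 - v[list(failure_states)].sum()

    # Toy example: 3 operational states (0-2) and 2 failure states (3-4).
    P = np.array([
        [0.80, 0.10, 0.05, 0.03, 0.02],
        [0.10, 0.75, 0.10, 0.02, 0.03],
        [0.05, 0.15, 0.75, 0.03, 0.02],
        [0.60, 0.00, 0.00, 0.40, 0.00],   # recovery back to state 0
        [0.50, 0.00, 0.00, 0.00, 0.50],
    ])
    print(component_reliability(P, failure_states=[3, 4]))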

The second set of our uncertainty analyses takes this conclusion one step further. The goal is to determine the impact of random generation of the component’s operational profile. In other words, in cases when instantiating the transition probabilities by the domain expert is challenging or impossible, how is the predicted reliability affected by the random instantiation of the model?

Let us start with the SCRover’s controller component. Assuming no operational profile data is available for the component, we generate a random set of training data to predict the reliability. Given the failure and recovery probability values discussed in Chapter 3, the component reliability using random data is predicted to be 0.9295. Alternatively, we asked a domain expert to assign probability values for transition and observation matrices for the controller component. The matrices instantiated by the domain expert tend to be more sparse than the randomly generated matrices. This is 161


Figure 6-9. Controller Component Reliability Based on Random and Expert Instantiation

This is because the domain expert models the intended behavior of the component by expecting certain behavior(s) in a given state, and the matrix instantiation typically reflects this expectation. The predicted reliability of the controller component based on the expert's knowledge is calculated as 0.9304. In the context of the analysis performed on this model, the 0.09% difference is negligible. Such a determination, however, needs to be made in the context of the specific system being analyzed. Repeated experiments confirmed the same results. As discussed previously, our reliability model is much more sensitive to the recovery and failure probabilities when estimating a component's reliability. In other words, instantiating the transition probabilities based on domain expertise or randomly has comparatively little influence on the estimated reliability.


(Data plotted in Figure 6-10, giving the predicted reliability under random vs. expert instantiation: 5-state component: 0.7207 vs. 0.7214; 10-state component: 0.9998 vs. 0.9986; 20-state component: 0.6980 vs. 0.6963.)

Figure 6-10. Arbitrary Component’s Full (Left) and Sparse (Right) Random Instantiation for Training Data Generation

In summary, our model can handle uncertainties associated with the operational profile remarkably well. The results of this experiment are shown in Figure 6-9.

We again performed similar experiments with three arbitrary components with 5, 10, and 20 states, 10 interfaces, and 4 failure states. Since the components are arbitrary, a good way to "simulate" the domain knowledge in terms of the initial probabilities is to create a random sparse matrix. The results of the full matrix and sparse matrix instantiations are shown in Figure 6-10. As depicted, the source of the data instantiation had little influence on the estimated reliability. The results confirmed our hypothesis originally discussed in the case of the controller component, and demonstrated that the reliability model can innately accommodate an unknown operational profile.


6.2.2. Sensitivity Analysis

The traditional steady-state sensitivity analysis offered by Markov-based modeling provides insights into the critical elements of the model. This is done by understanding and characterizing the relationship between the parameters in the model that, together, quantify the global query on the network (in our case, the reliability). In the context of component reliability estimation, such analyses offer insights into critical states as well as critical paths within a single component. The critical states determine which states have the most influence on the component reliability value. The critical paths indicate specific paths of execution (i.e., specific series of invocations of the component's interfaces) which result in the highest reliability value for the component.

While this information may be relevant, its impact on the development process may be too limited: states are abstract concepts that are typically not treated as first-class entities during implementation. Moreover, knowledge about the paths leading to the highest or lowest reliability values, while theoretically interesting, offers little help in enhancing the software development process. However, integrating such information with a cost-framework can offer crucial help in improving the quality of the product under development. Recall the tight integration between our defect classification and cost-framework on the one hand, and our reliability model on the other. Leveraging the cost-framework together with standard sensitivity analysis enables us to provide a cost-effective approach to mitigating architectural defects.


To demonstrate the usefulness of the analysis enabled by our approach, we first need to demonstrate that our model is sensitive to various types of architectural defects. Recall that our component reliability model uses a cost-framework instantiated by the domain expert (Section 2.6) to calculate the probability of recovery from failures. We demonstrate the sensitivity of the model by studying the effect on the estimated reliability when cost values in the cost-framework change. Changes to the cost values affect the recovery probabilities. Intuitively, as the cost of recovery increases, the probability of recovery decreases.
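For illustration, the sketch below computes the surface area of a radar chart from a set of cost values, which is the quantity the cost-framework of Section 2.6 uses in deriving recovery probabilities. The specific normalization from area to probability shown here (one minus the area normalized by its maximum) is an assumption made only for the sake of the example; the actual mapping is the one defined by the cost-framework.

    import numpy as np

    def radar_area(costs):
        """Surface area of a radar chart whose k axes (one per cost factor, k >= 3)
        are spaced at equal angles, with the given cost values plotted on the axes."""
        c = np.asarray(costs, dtype=float)
        angle = 2.0 * np.pi / len(c)
        # Sum of the areas of the triangles formed between consecutive axes.
        return 0.5 * np.sin(angle) * np.sum(c * np.roll(c, -1))

    def recovery_probability(costs, max_cost=1.0):
        """ASSUMED normalization: recovery becomes less likely as the (normalized)
        radar-chart area of the recovery costs grows."""
        max_area = radar_area(np.full(len(costs), max_cost))
        return 1.0 - radar_area(costs) / max_area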

Recall the cost-framework presented in Section 2.6. The recovery probabilities were obtained by calculating the surface area under a Radar Chart constructed from different cost values. The results are shown in Figure 6-4. In the case of the controller component, based on the results of architectural analysis, only two types of failures (signature failure and interaction protocol failure) were considered relevant. Figure 6-4 sets the probability of recovery from a signature and an interaction protocol failure at 0.7975 and 0.7112, respectively. The component reliability was then estimated at 0.9303. Let us analyze the sensitivity of the model by providing a range of recovery probabilities for each of these failure types. The results are shown in Figure 6-11. As the probability of recovery for each type of failure increases from 0 to 1 (x-axis), the component reliability increases from 0.0005 to 0.9414 (y-axis). The two curves denote that the increase in component reliability in the two cases follows a similar pattern. However, changes to the probability of recovery from a protocol-type failure cause a greater increase in the component's reliability value.


Figure 6-11. Sensitivity Analysis for the Controller Component with Different Recovery Probabilities

The results here seem to indicate that the reliability model reacts to changes in its parameters in a predictable and meaningful way: an increase in the probability of recovery from failures results in an increase in the component's reliability. Moreover, in the context of this specific example, it can be concluded that in circumstances where the recovery probability is very low, it is more rewarding to ensure that the probability of recovery from a protocol-type failure is improved. However, once the recovery probabilities are at about 50%, the difference in how much recovery from each failure type affects the component's reliability becomes relatively small.

The results of applying the same principle to an arbitrary component with 10 states, 10 interfaces, and 5 failure states are shown in Figure 6-12. The results show that changes to the probability of recovery from the usage failure had the least impact on component reliability, while changes to the recovery probabilities for the protocol, signature, and incomplete failure types affected the component's reliability in very similar patterns.


Figure 6-12. Changes to the Probability of Recovery from Various Failure Types for an Arbitrary Component

Since this is an arbitrary component with random parameters, it is difficult to justify the specific behavior of the model. However, the conclusion at this point is that the reliability model reacts predictably to changes in the model's parameters.

We leverage this conclusion and apply it in the context of our final set of sensitivity analysis experiments. By tightly integrating a cost-framework with our reliability model, we are able to provide an analysis of the most cost-effective approach to defect mitigation. The results in the case of the controller component are shown in Figure 6-13.


(Data plotted in Figure 6-13. Experiment 1: signature failure recovery probability 0.7975, protocol failure recovery probability 0.71125, component reliability 0.9303. Experiment 2: signature recovery 1.0, protocol recovery 0.71125, reliability 0.9414. Experiment 3: signature recovery 0.7975, protocol recovery 1.0, reliability 0.9334.)

Figure 6-13. Controller Component Reliability w.r.t. Different Failure Recovery Probabilities

The first set of numbers (labeled 1) depicts the original reliability estimation. In the second and third experiments (labeled 2 and 3, respectively), we improved the failure recovery probability for each of the two types of failures to 1. The results suggest that an increase in the probability of recovery from the signature-type failure has the greatest effect on the overall reliability. In other words, in a decision-making situation during development, where resources must be allocated by prioritizing tasks, we can use our analysis to determine which tasks are more critical. In this case, mitigating the root cause of the signature-type failures and eliminating the associated defects has the most immediate influence on the controller component's reliability. An important observation in this example is that the changes to the component reliability values are quite small.

The question thus arises as to whether or not the architect can base his/her design decisions on these changes. The reliability values calculated here are tied to the model of the component's behavior, as well as the recovery probabilities assigned by the cost-framework. In order to determine whether a change is significant, we first need to understand how the full range of recovery probabilities affects the component's reliability. Given the range of possible values, the architect must decide whether a particular change is significant in the context of the specific component under analysis. For example, in the case of the Controller component, Figure 6-14 depicts the changes to the component reliability (y-axis) as the recovery probability of each of the two failure types changes from 0.1 to 1 (x-axis). As one parameter's value changes, the other parameter is kept constant at 0.1. It is clear that changes to the Signature Recovery Probability in general cause a greater range of change in the component reliability (from about 62% to 94%), while changes to the Protocol Recovery Probability cause a much smaller change (62% to 66%). As depicted here, the changes to the component reliability value are quite considerable, because of the low recovery probability value initially assigned to the two parameters (0.1). Given these results, the architect must then decide whether a small change in the component reliability value based on (small) changes to the recovery probability is significant for the specific software component.



Figure 6-14. Controller Component Reliability w.r.t. A Full Range of Recovery Probability Values

To generalize this type of analysis, we performed similar experiments and obtained results from an arbitrary component with 10 states, 5 failure states, and 10 interfaces. We performed a five-part experiment in which we varied the probability of recovery for each failure type. In the initial configuration, we assumed an arbitrary set of recovery probability values. In each part of the experiment, we increased the probability of recovery for one failure type to 1, while keeping the other recovery probabilities at their initial values. The results are depicted in Figure 6-15 and suggest that, for this component, ensuring that we can recover from the pre/post-condition type of failure has the biggest impact on the component's reliability. Once again, a full range of analyses based on the recovery probability values can put these results in perspective and help the architect determine the significance of the results.

(Data plotted in Figure 6-15, giving the recovery probabilities and resulting component reliability for the six experiments:)

Experiment             1         2         3         4         5         6
Usage               0.7075    1.0       0.7075    0.7075    0.7075    0.7075
Protocol            0.71125   0.71125   1.0       0.71125   0.71125   0.71125
Signature           0.7975    0.7975    0.7975    1.0       0.7975    0.7975
Incomplete          0.5725    0.5725    0.5725    0.5725    1.0       0.5725
Pre/Post Condition  0.7806    0.7806    0.7806    0.7806    0.7806    1.0
Reliability         0.843     0.8432    0.8465    0.8446    0.8521    0.8635

Figure 6-15. Sensitivity Analysis for an Arbitrary 10-state Component

The process of eliminating a particular failure type in a component is two-fold: (1) ensuring that the failure does not occur (probability of failure of zero), and (2) making certain that in the case of failure the component is able to recover from it (probability of recovery of one). In other words, both failure and recovery probabilities are critical in the component's reliability estimation. Consequently, to complete our experiment, we represent total elimination of a defect and the subsequent failure by setting the probability of failure to 0 and the probability of recovery to 1 for each type of failure. Doing so for the two types of defects in the controller component resulted in Figure 6-16.


(Data plotted in Figure 6-16: eliminating the protocol failure (F2) yields a component reliability of 0.9413; eliminating the signature failure (F1) yields 0.9871.)

Figure 6-16. Effect of Total Elimination of Failures

The diagram shows that total elimination of the signature mismatch type of failure has the greatest effect on the component's reliability.

This type of analysis can be used as a decision tool during the architecture and design phase in allocating resources for defect mitigation.

6.2.3. Complexity and Scalability Analysis

The complexity of the Baum-Welch algorithm when transitions are labeled is determined to be O(N² × M × T) [11], where N is the number of states, M is the number of events, and T is the length of the training data generated by the model. Our adaptation of this algorithm for the AHMM changes the complexity to O(N² × M × K × T), where K represents the number of actions associated with the events. In other words, the algorithm is proportional to the complexity of the component, which matches our intuition. Even though the numbers of events and actions (M and K) in our model are pre-determined by the number of the component's interfaces, the number of states N may be reduced by applying the principle of hierarchical modeling: a complex model may be abstracted to provide a higher-level view of the component. The result would be a reduction in the complexity of the algorithm.

6.3 System Reliability Prediction

We evaluate our system-level reliability model in terms of sensitivity analyses aimed at demonstrating that architecture-level reliability modeling of software systems is both possible and meaningful. Using a set of case studies, we demonstrate the above along the following three dimensions. First, we present the results of a series of sensitivity analyses that show the impact of changes to model parameters on the predicted reliability values. We then demonstrate how the model can be used to identify the critical components in a system, as well as their critical defects, whose mitigation provides a cost-effective approach to enhancing the reliability of the system under development. Finally, where appropriate, we demonstrate the effect of specific system configurations on the system's reliability. The results can be helpful to the architect in making architectural changes in the system that may help improve its reliability.

In the rest of this chapter, we describe our evaluation in the context of two case studies. Our experience with several other case studies and synthesized models confirms the conclusions presented here.

Before discussing the details of our evaluation, a brief discussion of the probabilistic instantiation of the model is necessary. The Serial, Parallel, or other customized configurations specified for each node of the Bayesian Network (recall Chapter 4) directly affect the predicted reliability values. Since the specific probabilistic relation is highly application dependent, to avoid unnecessary complexity we use a simple serial configuration for all nodes. This implies that no redundancy is exercised (unless explicitly stated otherwise), and that the reliability of each node is equally influenced by the reliability of all of its parents. While this assumption may be reasonable at the level of internal nodes in the Bayesian model, at the final stage, when the overall system reliability is calculated, special care is needed. Specifically, the aggregation formula that is used to calculate the cumulative impact of individual failures and obtain the system's reliability can greatly impact the results. For example, treating all failures similarly (by using a serial configuration assumption) to predict the system reliability in a Client-Server system can lead to conclusions that may not be explained intuitively: a client may have a similar or even greater impact on the system reliability than a server. Using a more sophisticated formula in this case, by either assigning weights to failures from different components or considering a parallel relationship between the failures, may be more reasonable.

Although we will come back to this issue in the context of a specific example later in this chapter, addressing the issues associated with architectural styles and patterns of interaction and their impact on system-level reliability is beyond the scope of this thesis. Furthermore, a detailed discussion of the ramifications of various reliability relationships is beyond the scope of our work, and can be found in [124].
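The two basic reliability relationships mentioned above can be made concrete with a short sketch. These helpers are illustrative only; they are not the exact probabilistic formulas attached to the Bayesian Network nodes, which remain application dependent as discussed.

    import numpy as np

    def serial_reliability(parent_reliabilities):
        """Serial configuration: the node fails if any parent fails, so its
        reliability is the product of the parents' reliabilities."""
        return float(np.prod(parent_reliabilities))

    def parallel_reliability(parent_reliabilities):
        """Parallel (redundant) configuration: the node fails only if all of its
        parents fail."""
        r = np.asarray(parent_reliabilities, dtype=float)
        return float(1.0 - np.prod(1.0 - r))

    # Example: two redundant elements feeding a node that also depends,
    # serially, on two other elements.
    redundant_pair = parallel_reliability([0.95, 0.90])   # 0.995
    node = serial_reliability([0.99, 0.97, redundant_pair])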

6.3.1. Case Study 1: The SCRover System

In Chapter 4, we demonstrated the steps involved in reliability modeling of the SCRover system. In this section we provide sensitivity analyses on the predicted reliability value.

For the discussion here, recall the SCRover's Bayesian Network depicted in Figure 6-17 (originally presented in Chapter 4). The system reliability was predicted at 0.9826, assuming an initial reliability of 0.93, 0.96, and 0.99 for the controller, estimator, and actuator components, respectively. The components' initial reliability values were obtained from our component-level reliability model discussed and evaluated earlier in this dissertation.

Sensitivity to component-level reliability predictions. In order to analyze the sensitivity of the model to different initial component reliability values, we repeated the prediction process for a range of initial component reliabilities.



Figure 6-17. SCRover’s Bayesian Network

Recall that a complete reliability model for the SCRover is in the form of a Dynamic Bayesian Network (DBN). Using the DBN methodology, the reliability of the system is predicted as a function of time. To show the effect of time on the predicted system reliability, we initially performed the prediction process for two consecutive time steps (t=0 and t=1). Figure 6-18 demonstrates the impact of each component's reliability on the system's reliability. As shown, changes to the reliability of the controller component greatly influence the reliability of the system, while changes in the reliabilities of the estimator and actuator components have less impact on the overall system reliability. This phenomenon is especially prominent at t=0. In fact, the impact of the reliabilities of the estimator and actuator components at t=0 on the system reliability is negligible.


Figure 6-18. Changes to the Reliability of the SCRover System at Times t=0 and t=1


However, as time goes by, the results change. More specifically, the role of the reliabilities of the estimator and actuator components becomes more significant at t=1. This is due to the nature of Dynamic Bayesian Networks and the associated delay links that model the behavior of the system in subsequent time intervals. In the context of reliability prediction, the delay links act as a feedback mechanism in the system. They incorporate the reliability of the various nodes at a given time step ti into the estimation of the reliability value at the following time step ti+1. Given the Serial relationship assumption among nodes and their parents in the system, as the number of parents of a node increases, its reliability (the product of the parents' reliabilities in the case of a Serial relationship) decreases.

Before providing additional discussion and insights about this concept, let us expand the SCRover's Dynamic Bayesian Network for a few more time steps, in order to provide a better view of the system during operation. Figure 6-19 shows the predicted system reliability in the first 5 time steps. A first glance reveals that as time passes, the system reliability decreases significantly. This is only partially correct. The reliability of each node (and thus of the entire system) at time ti depends on the reliability of the system (in terms of the reliability of its nodes) at the previous time steps t1, t2,..., ti-1. If the reliability prediction process is performed without considering knowledge about the system's operation as time passes, the above calculation is accurate.



Figure 6-19. SCRover’s Reliability over Time based on its Dynamic Bayesian Network

However, oftentimes we can infer that if no failure has occurred at time tn, the probabilities of all types of failures at this time step can be reset to zero. The fact that no failure has occurred at a particular time step is known as evidence. One of the properties of Bayesian Networks is the ability to make new inferences based on newly obtained evidence as time passes. In the case of our Dynamic Bayesian Network, this new evidence is the reliability of the system (in terms of the reliabilities of its various nodes) at previous time steps. In other words, as time goes by, if we know that a particular type of failure did not occur at a given time step ti, the model can be updated to include this new evidence when the reliability at time ti+1 is being estimated. The inference results will then be updated based on the newly available evidence, and thus the system reliability prediction is updated accordingly.
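The effect of incorporating such evidence can be illustrated with a toy calculation. The sketch below treats each time step as having a probability of surviving failure-free, and conditions away the steps already known (as evidence) to have been failure-free. The actual model performs this updating through Bayesian inference over the DBN rather than through this simple product, so the function and the numbers below are purely illustrative.

    def reliability_over_time(step_reliabilities, evidence_up_to=-1):
        """Toy illustration of evidence updating over time.

        step_reliabilities[i] is the probability that no failure occurs during
        time step i.  Without evidence, the predicted reliability at step t is
        the product over steps 0..t.  Evidence that the system ran failure-free
        through step `evidence_up_to` conditions those factors away (sets them
        to 1) before predicting the later steps.
        """
        predictions, cumulative = [], 1.0
        for i, r in enumerate(step_reliabilities):
            cumulative *= 1.0 if i <= evidence_up_to else r
            predictions.append(cumulative)
        return predictions

    # No evidence vs. evidence of failure-free operation through t=1.
    print(reliability_over_time([0.98, 0.97, 0.96, 0.95]))
    print(reliability_over_time([0.98, 0.97, 0.96, 0.95], evidence_up_to=1))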



Figure 6-20. Updated Prediction of Reliability over Time based on New Evidence

Figure 6-20 demonstrates the result of the system's reliability prediction assuming the new evidence obtained at each time step. Each curve demonstrates the system reliability at times t=0 through t=5, given the known evidence about the lack of failures in the previous time steps.

The results demonstrate that the model reacts meaningfully to changes in its parameters. An increase in component reliabilities results in an increase in system reliability, and vice versa. Moreover, as time passes, the system reliability decreases, unless previous evidence is incorporated into the estimation.



Figure 6-21. Effect of Changes to Components’ Reliabilities on System’s Reliability

The next experiment leverages the sensitivity of the model to the components' reliabilities, and shows how the architect can use it to identify critical components in the system.

In Figure 6-21, we depict the effect of changes to individual component reliabilities (depicted via the bars) on the system's reliability (depicted via the line). The x-axis depicts the four experiments performed, and the y-axis shows the components' and system's reliabilities associated with each experiment. The first experiment (labeled 1) puts the system's reliability at about 10% given individual component reliabilities of 0.5. In the subsequent experiments, we increase the reliability of each component to 0.9 to study its effect on the system's reliability. As shown in this case, improving the reliability of the controller component (experiment 2) has the biggest impact on the system's reliability.

The reason for this phenomenon lies at the heart of the DBN methodology.

Using Dynamic Bayesian Networks, reliability is predicted in terms of the probability of not reaching failure states during operation: the sooner a system fails, the lower the system reliability. If a component initiating the interaction with another component fails immediately after initiating the interaction, the system as a whole may reach a failure state faster than if multiple steps had passed before a failure occurred. Subsequently, failures in components that initiate interactions may have a greater impact on the overall system reliability. While this is relatively intuitive to observe and understand in the case of the SCRover system, as the complexity of interactions in larger systems increases, and the DBN consequently gets expanded over longer time periods, it becomes harder to analyze the impact of components without using an automated analysis process. As an example, consider Figure 6-22, where the reliability prediction of the SCRover based on changing component reliabilities is depicted for the second time step. While changes to the reliability of the controller component result in the biggest overall increase of the system's reliability, the impacts of the reliabilities of the other two components follow a similar trend up to a certain point. In particular, once the components' reliabilities are at 0.9, improving the reliability of the estimator or actuator component to 100% reverses the dominant trend observed before this point:


Figure 6-22. Changes to System Reliability Based on Different Component Reliability Values at Time Step t=1

a change in the estimator component's reliability results in a greater change to the system reliability than does a change in the actuator component's reliability.

Sensitivity to the reliability of system’s startup process. To continue our evalua-

tion of the reliability model’s sensitivity to changes in its parameters, we performed some experiments to analyze the effect of the system’s startup process and its reliability (represented via the init node) on the system. Recall that the value of the node is to be supplied by the architect. This node is specifically designated to model the uncertainties associated with the system’s startup process. In all of the calculations so far, the reliability value at this node was set at 0.999, which essentially represents a highly reliable startup process.


As discussed in Chapter 4, there is no one-size-fits-all technique for determining this value for a system. While it is possible to eliminate the use of this parameter in the reliability prediction altogether (by setting it to one), we believe it is a useful means for including a variety of factors that contribute to the system’s reliability. In general, the circumstances that affect this parameter may relate to the software development process adopted, and thus are beyond the scope of the architectural models. For example, specific development processes or component integration strategies have an impact on this value. If component integration has been performed iteratively throughout the development, there is a greater confidence in a successful final integration, and the system’s startup. On the other hand, if COTS components are used, if the development process has followed more of a waterfall approach, or if components are developed independently by different development teams, then it is reasonable to anticipate more problems during the final integration and the system startup. In any case, the value for this parameter is ultimately subjective, and techniques to obtain the value more objectively are beyond the scope of our work. Use of risk management frameworks [75] could be a reasonable approach in determining this value.

The diagram in Figure 6-23 depicts the sensitivity of the SCRover model to changes in the reliability of the startup process. Since the init node serves as the super parent of all components' initial nodes, its value has a very strong impact on the system's reliability.



Figure 6-23. The Effect of Changes to System’s Reliability as the Reliability of the Startup Process Changes

This is consistent with our intuition that as the reliability of the system's startup process decreases, the system's ability to perform its operations successfully decreases (regardless of the reliabilities of the individual components).

Sensitivity to component-level failures. The purpose of this set of experiments is to determine the sensitivity of the model to the failure probabilities. The failure probabilities are estimated using Bayesian Inference, given the individual components' reliabilities and their interactions. However, as discussed earlier, the inference can be updated using evidence that may be available. We can use this principle to speculate on the impact of each failure on the overall reliability. To do this, we can represent elimination of a particular failure by assigning its probability of occurrence to zero, and observing the effect on the system's reliability.


Figure 6-24. Effect of Elimination of Particular Failures on SCRover System’s Reliability

This type of analysis, for instance, could help us decide whether elimination of the Signature failure in the estimator component is more critical than elimination of the Protocol failure in the controller component.

The results of performing this type of analysis on SCRover are depicted in Figure 6-24. The first column shows the original reliability estimation given a set of parameters. Without changing those parameters, the prediction is repeated, with the probabilities of various instances of failures in different components changed to zero. This is effectively equivalent to repeating the inference process assuming no such failure has occurred. The next four columns correspond to the lack of the Pre/Post-condition failure, the Protocol failure, and the Signature failure in the controller component, and the Signature failure in the estimator component, respectively.

tor’s signature failure does not occur has the largest impact on the system’s reliability, improving it from the 90% original prediction to 97.7%. This type of analysis can help the architect make cost-effective defect mitigation strategies, by prioritizing failures by their impact on the system’s reliability.

In summary, in this subsection we analyzed our system-level reliability prediction approach in the context of the SCRover system via a set of sensitivity analyses. We demonstrated that the reliability prediction process is meaningful and that useful information may be obtained from our analysis. In the next subsection, we continue the evaluation using a different case study.

6.3.2. Case Study 2: The OODT System

Our second case study is based on NASA's Object Oriented Data Technology (OODT) [87]. OODT is a methodology, a middleware, and a software architecture for the development of distributed data-intensive systems. The middleware offers access to geographically distributed and heterogeneous data sources by concealing the details of mediation at each data source, and offers an extensible and flexible data sharing and transport methodology. An OODT-based system consists of a set of Clients, one or more ProfileHandlers, and a set of ProfileServers. A high-level architecture of such a system is depicted in Figure 6-25. In the OODT methodology, a Client component requests a set of services that may be provided by different ProfileServers. The Client is oblivious to the number, type, and location of these servers.


Figure 6-25. OODT’s High Level Architecture

A ProfileHandler component acts as a mediator, and routes requests and responses between Clients and Servers.

The Global Behavioral Model of one possible (very high-level) instantiation of this methodology is depicted in Figure 6-26 (top). To build the corresponding Bayesian Network, we used our methodology described in Chapter 4 to construct the qualitative part of the network. Probabilistic formulas were then assigned at each node of the BN to represent the relationship between the reliabilities at the various nodes. In the rest of this section, we evaluate our system-level reliability approach in the context of the OODT system.

Let us assume that in an adaptation of the OODT system, an application is designed to provide a single point of access (via a web page) to multiple databases maintained by different NASA centers.


Figure 6-26. OODT’s Global Behavioral Model (top) and Corresponding BN (bottom)


Each database contains mission information specific to that NASA center. Specifically, let us assume that two independent ProfileServers serving two independent datasets are designed. ProfileServer1 is used to access spacecraft identification numbers, as assigned by NASA, for those missions under the supervision of the Jet Propulsion Laboratory (JPL). ProfileServer2 is used to access spacecraft identification numbers assigned by NASA's Goddard Space Flight Center (GSFC) for the missions under its authority. The former is physically deployed on a set of servers in California, while the latter is physically located in Maryland.

Once a query is issued by a Client (from anywhere in the world), the ProfileHandler component relays the query to the appropriate server, and the server sends a response. The Client in this case is unaware of the specific server that has served the request. In this scenario, since the two servers access independent and non-identical data sources, it is crucial for both servers to operate reliably in order for the system to operate reliably. In other words, the reliability of the system depends on the reliable operation of all of its components, including the two servers. This is in contrast to cases where one server may be a backup of the other server, in which case reliable operation of at least one of the servers is sufficient for the reliable operation of the system. The failure aggregation formula incorporates the failure probability values obtained from the Bayesian inference and calculates the corresponding system reliability value, given the specific scenario discussed above. In our case study, we first focus on the case where the two servers correspond to two independent and different datasets.

Later in this section, we demonstrate the result of modeling a different scenario, where the two servers are considered to act as back-ups for identical datasets.

For our analysis, let us assume that there are two faulty services in this system: the Return interface between the Client component and the ProfileServer and ProfileHandler components has a Pre/Post-condition defect, and the Results interface in the ProfileHandler component has both a Protocol and a Signature defect with the corresponding Results interface in the ProfileServer components.

Sensitivity to component-level reliability predictions. In the first set of experiments, we studied the effect of changes to the components' initial reliabilities on the system reliability. Since the two ProfileServer components are effectively identical in their functionality (although they serve different datasets), one would expect their reliabilities to have a similar impact on the overall system reliability. We varied the initial component reliabilities to study their impact on the overall system reliability, and the result is depicted in Figure 6-27. As expected, changes to the reliabilities of the two ProfileServers show very similar trends in the changes to the system's reliability.

The calculations shown in Figure 6-27 are performed for the initial time step in the system's operation (t=0). Extending this experiment to the subsequent time interval (t=0 and t=1) demonstrates the same trend, as shown in Figure 6-28. Moreover, it can be seen that, similar to the SCRover experiment, as time goes by, the overall system reliability decreases (unless new evidence is incorporated into the reliability calculation and a new inference is made).


Figure 6-27. OODT Model’s Sensitivity to Different Initial Component Reliabilities

It can also be seen that changes in the ProfileHandler and Client components result in a greater range of predicted reliability values for the system. For example, changes to the reliability of the Client component result in estimated system reliability values ranging from 18% to 91%, while changes to the reliability of the ProfileServer components result in system reliability variations between 52% and 91%. The reasoning behind this observation is as follows.

Recall that a reliable system is one in which a long series of component interactions occurs without a failure interrupting the chain. If a series of interface invocations represents the interactions among components, a failure occurring earlier during the sequence of invocations has a greater impact on system reliability than a failure that occurs later.

Figure 6-28. Changes in the OODT System’s Reliability as Components’ Reliabilities Change



Figure 6-29. Effect of Changes to Components Reliabilities on System Reliability

This explains why a component such as the Client has a greater impact on the predicted reliability.

The expansion of the Dynamic Bayesian Network for the OODT system over a period of three time steps is shown in Figure 6-30. Similar to the results obtained from the SCRover experiment, as time passes, the system's reliability decreases. By assuming that a failure has not materialized as time progresses, we are able to update the prediction and offer a more accurate analysis of the system reliability over time. This is done by incorporating new evidence in the Bayesian Network's inference process.



Figure 6-30. Reliability Prediction of the OODT System Over Three Time Periods

The middle and top curves in Figure 6-30 demonstrate the updated knowledge at each time step ti, after the evidence (in this case, the lack of failure) at ti-1 has been incorporated. As demonstrated, the model reacts predictably to changes in its parameters.

Sensitivity to the reliability of system’s startup process. As mentioned before in

the context of the SCRover system, we model the reliability of the system startup process in order to incorporate the uncertainties associated with this process. Figure 6-31 demonstrates the results of sensitivity analysis of the OODT system by varying the reliability of the startup process. Intuitively, the system reliability has a direct relationship with the reliability of the startup process, and the results confirm this intuition.



Figure 6-31. Changes to the OODT’s Reliability based on Different Startup Process Reliability

Sensitivity to the probability of failure. The purpose of this set of experiments is to determine the sensitivity of the model to the estimated failure probabilities. The failure probabilities are obtained using Bayesian Inference, given the individual components' reliabilities and their interactions. One interesting and useful analysis of a system under development is determining the impact of the components' failures on the overall system reliability. We represent elimination of a particular failure by assigning its probability of occurrence to zero, and observing its effect on the system's reliability as the inference is updated using the new evidence.

The results of performing this type of analysis on OODT are depicted in Figure 6-32. The x-axis shows the various instances of failures in the four components, while the y-axis represents the system reliability.


Figure 6-32. Eliminating the Probability of different Failures and Their Impact on System Reliability

In each category of data (each component), the first bar represents the original reliability prediction given the parameters. All parameters remained unchanged, but in each experiment the probability of a specific failure in a component was manually set to zero. For example, eliminating the Protocol and Pre/Post-condition defects in the ProfileServers has the most influence on the system's reliability. It is important to note that the results presented here demonstrate very little change in the predicted system reliability. This is a side effect of the model and the interactions among its components. In other examples (including that of the SCRover), the change in the reliability value was more significant. It is very difficult to define a generic threshold level based on which changes in the reliability values are considered statistically significant. However, this is something that could be decided on an application-specific basis, given the results obtained from various reliability analyses. Consequently, this decision is left to the domain expert.

Analysis of components’ roles on system reliability. Recall that the results of the

analysis of the OODT system so far, assume a serial relationship among all the nodes in the system. Furthermore, the system reliability is calculated such that all components failures are similarly incorporated in the reliability prediction formula. In this experiment, we modify these assumptions and model redundancy in the system. Specifically, the reliability prediction formula is modified to consider the two ProfileServer components as the back-up for one another. In this setting, the reliability of the system depends on successful operation of at least one server component. In other words, the reliability is estimated by incorporating the reliability value of the most reliable server among the two ProfileServers. The top diagram in Figure 6-33 demonstrates the system reliability (depicted via a line) as the reliability of ProfileServer 1 increases. The diagram in the bottom shows that while increasing the reliability of ProfileServer1, the system reliability remains unchanged, if the reliability of ProfileServer2 decreases. This is because the system reliability formula only incorporates the most reliable server in its reliability calculation. Other scenarios in the case of this system, could formulate the system reliability to represent that the impact of unreliability of the server components are significantly greater than the unreliability of the client components.

This experiment demonstrates that our model is flexible, and allows changes to the roles of individual components and to the impact of their reliabilities on the system reliability.


Figure 6-33. Modeling Redundancy in OODT

Analysis of the impact of system’s configuration on its reliability. As

previously

discussed, we envision that our reliability modeling approach may be used to analyze the effect of changes to the system’s structure, on the overall system’s reliability. While a structural change is considered an addition or removal of components in the system, the impact on the interactions among components in the system, the global 199

(Configurations plotted in Figure 6-34, each evaluated at time steps t=0, t=1, and t=2: 1 Client with 1 ProfileHandler; 3 Clients with 1 ProfileHandler; 5 Clients with 1 ProfileHandler; and 5 Clients with 2 ProfileHandlers.)

Figure 6-34. Impact of Different Configurations on OODT System Reliability

For example, in the case of the OODT system, the ProfileHandler component seems to act as a bottleneck for the system. As the number of clients increases, the load on the ProfileHandler component increases, as it is required to interact with more components than previously. Intuitively, this would have an adverse effect on the system's reliability. One possible solution is to instantiate additional ProfileHandlers in the system to balance the load and eliminate the single point of failure in the system.

We demonstrate the results of such structural changes to the OODT system and their impact on the system's reliability in Figure 6-34.

The x-axis represents the various configurations, while the y- and z-axes represent time and system reliability, respectively. In every time step, as the number of clients in the system increases, the reliability decreases gradually. However, once a new instance of the ProfileHandler component is added to the system, the reliability improves. The second ProfileHandler component is set up such that it is responsible for handling communication to and from the fourth and fifth client components. The return of the system reliability to approximately the value obtained for the system with 3 Clients and a single ProfileHandler component can thus be rationalized, given the load balancing described above.

Similar to the results obtained from SCRover, the OODT results confirm that our system-level reliability modeling approach produces meaningful and useful results. Our experience with a set of synthesized models, as well as other case studies, confirms the conclusions presented here. In the next two subsections, we first offer an overview of the complexity and scalability of our approach. The uncertainties associated with system reliability prediction during the architecture phase are discussed last.

6.3.3. Complexity and Scalability

In general, the problem of Exact Inference in Bayesian Networks and Dynamic Bayesian Networks is NP-hard [21]. Efficient average-case and approximation algorithms have thus been developed to tackle the complexity problem [21,91]. In our approach, Bayesian Networks are only used in a predictive context.

That is, given the probabilistic relations among the nodes (assigned by the domain expert), we predict the probability of certain events (failures) at a future time. The complexity of our DBN is thus a function of the number of nodes in the system, as well as of the time interval over which the reliability analysis is performed. We now discuss each of these factors and their influence on our reliability modeling.

In this chapter, we discussed how evidence about a system's reliability at time ti can be used to provide a better prediction of its reliability at times ti+1, ti+2, and so on. Following this approach, the complexity of the DBN can be reduced in the subsequent time steps by incorporating the results from previous time steps. The number of nodes can thus be controlled systematically to prevent the model from growing arbitrarily complex. For example, in the context of the OODT example and its DBN expansion over 3 time steps, the total number of nodes in the model is 69. However, if it is already known that no failures have occurred at t=0 and t=1, the number of nodes in the model is effectively reduced to 21. When modeling large and complex systems over an extended period of time, it may be more effective to do so by reducing the complexity of the model and performing reliability analysis over certain time intervals.

On the other hand, the number of nodes in the DBN, in turn, depends on the number of states in the Interaction Protocol Models of the components that comprise the system. Unlike the Dynamic Behavioral Models used for component-level reliability modeling, the complexity of the Interaction Protocol Models is bounded by the number of externally visible interfaces of each component. The principles of component-based software engineering, and of encapsulation in object-oriented design, typically prevent a component from having an arbitrarily large number of interfaces. Consequently, following the best practices of software design should directly help in the creation of models with reasonable numbers of externally visible interfaces. In turn, this will curb the complexity of the models.

6.3.4. System Reliability Modeling and Handling Uncertainties

Modeling the reliability of a software system in a compositional manner early during the software development process, when an implementation is not available and the operational profile is unknown, requires dealing with various sources of uncertainty. Accommodating these uncertainties results in a more realistic prediction of the reliability of the architecture. Below we enumerate some of these sources of uncertainty and describe the ways our approach handles them.

Uncertainty of Component Reliability Values. Existing approaches to component reliability estimation typically estimate reliability in isolation. A component, whether a third-party component, an OTS component, or an in-house component, is typically designed, built, and tested either in isolation or in an environment that may not be typical of its intended use. Consequently, the reliability values associated with a component may not be accurate if the component is used in a different setting. While such a calculation of a component's reliability value is useful as an "estimate" of how it may perform in a system, depending on the specific system and the other software and hardware components interacting with it, the value cannot be treated as an absolute number.

Our decision, discussed in Chapter 4, to treat the estimated component reliability value as a node in the Bayesian Network enables us to associate a degree of uncertainty with this value, consistent with the stochastic nature of our approach. This helps us address the problem natively in our reliability modeling approach.
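As a minimal illustration of this idea (the numbers are hypothetical and do not come from our case studies), a component's estimated reliability can be represented as a small discrete prior that is marginalized out when computing the probability of a successful invocation, and that is updated as evidence accumulates:

relValues = [0.95 0.97 0.99];    % candidate reliability values (hypothetical)
prior     = [0.2  0.5  0.3];     % domain expert's belief in each value

pSuccess = sum(prior .* relValues);   % marginal probability that an invocation succeeds
fprintf('P(invocation succeeds) = %.4f\n', pSuccess);

% After observing, say, 50 consecutive successful invocations, Bayes' rule
% sharpens the belief about the component's reliability.
likelihood = relValues .^ 50;
posterior  = prior .* likelihood / sum(prior .* likelihood);
disp(posterior);

This is, in spirit, what treating the reliability estimate as a node in the Bayesian Network accomplishes.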

Uncertainty of the System Startup Process. Building a system out of fully reliable components may not result in perfect reliability of the final system. This may be due to various sources of uncertainty introduced in the integration process. Starting up a system is among the critical steps of integration that can adversely affect the system's reliability. By introducing an init node (as described in Chapter 4), we have addressed the problem of uncertainties associated with the startup process.
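A minimal sketch of the role such an init node plays (all values hypothetical): the probability that the system is operational at t = 0 combines the probability that startup succeeds with the components being ready, rather than assuming a perfect start.

pInit  = 0.98;                   % assumed probability that system startup succeeds
pReady = [0.999 0.995 0.999];    % assumed per-component readiness at startup
pOperationalAt0 = pInit * prod(pReady);
fprintf('P(system operational at t = 0) = %.4f\n', pOperationalAt0);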

Uncertainties of Human Interaction. This is an important and potentially serious source of uncertainty when dealing with software systems. As an example, a fully functionally reliable system may result in catastrophic conditions because of improper usage of the system by its operators or users. One way to eliminate such uncertainties is to design checks and balances into all parts of the system to disallow such mistakes. Using our architectural modeling approach, such checks and balances could be implemented as pre/post conditions, guards, and other types of assertions in the functional specification of the system. While such an approach can help reduce the possibility of harmful interactions, additional parameters may need to be blended into the reliability model to address this form of uncertainty. The specific issue of modeling human-computer interactions and the associated uncertainties is beyond the scope of this dissertation.


Chapter 7: Related Work

The topic of this dissertation research spans the fields of software architecture and reliability modeling. We have studied a variety of approaches in each domain, and identified a few approaches that span both domains. In this chapter, we first present a summary of related approaches to architectural modeling. We then provide an overview of existing reliability models. While extensive surveys of software reliability modeling have been provided elsewhere [34,45,130], we present an original taxonomy of reliability models with a special emphasis on architectural relevance. Finally, a discussion of Markov-based and Bayesian Network-based reliability models as related to our research is presented.

7.1 Architectural Modeling

Building good models of complex software systems in terms of their constituent components is an important step in realizing the goals of architecture-based software development [77]. Effective architectural modeling should provide a good view of the structural and compositional aspects of a system; it should also detail the system's behavior. Modeling from multiple perspectives has been identified as an effective way to capture a variety of important properties of component-based software systems [16,30,50,57,93]. A well-known example is UML, which employs nine diagrams (also called views) to model requirements, structural and behavioral design, deployment, and other aspects of a system. When several system aspects are modeled using different modeling views, inconsistencies may arise.

Ensuring consistency among heterogeneous models of a software system is a major software engineering challenge that has been studied by multiple approaches, with different foci. A small number of representative approaches are discussed here. [37] offers a model reconciliation technique particularly suited to requirements engineering. The assumption made by the technique is that the requirements specifications are captured formally. [8,38] also provide a formal solution to maintaining inter-model consistency, though one more directly applicable at the software architecture level. One criticism that could be leveled at these approaches is that their formality lessens the likelihood of their adoption. On the other hand, [32,51] provide more specialized approaches for maintaining consistency among UML diagrams. While their potential for wide adoption is aided by their focus on UML, these approaches may ultimately be harmed by UML's lack of formal semantics.

We now discuss representative approaches to modeling each of the four views of architectural models.

Interface modeling. Component modeling has been most frequently performed at the level of interfaces. This has included matching interface names and associated input/output parameter types. Component interface modeling has become routine, spanning modern programming languages, interface definition languages (IDLs) [78,94], architecture description languages (ADLs) [77], and general-purpose modeling notations such as UML [122]. However, software modeling solely at this level does not guarantee many important properties, such as interoperability or substitutability of components: two components may associate vastly different meanings with syntactically identical interfaces.

Static Behavior Modeling. Several approaches have extended interface modeling with static behavioral semantics [1,65,95,136]. Such approaches describe the behavioral properties of a system at specific snapshots in the system's execution. This is done primarily using invariants on the component states and pre- and post-conditions associated with the components' operations. Static behavioral specification techniques are successful at describing what the state of a component should be at specific points in time. However, they are not expressive enough to represent how the component arrives at a given state.
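As a small illustration (a hypothetical component and conditions, written here as executable Matlab assertions rather than in a dedicated specification notation), a static behavioral specification attaches pre- and post-conditions and an invariant to an operation, but says nothing about how the component reaches a given state:

function buffer = put(buffer, item)
% PUT  Append an item to a bounded buffer (hypothetical component operation).
% Pre-condition: the buffer is not full.
assert(numel(buffer.items) < buffer.capacity, 'pre-condition violated: buffer is full');
buffer.items(end+1) = item;
% Post-condition / invariant: the buffer never exceeds its capacity.
assert(numel(buffer.items) <= buffer.capacity, 'invariant violated');
end

A caller would create buffer = struct('items', [], 'capacity', 4) and invoke buffer = put(buffer, 7); the checks constrain individual snapshots only, which is precisely the limitation noted above.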

Dynamic Behavior Modeling. The deficiencies associated with static behavior modeling have led to a third group of component modeling techniques and notations. Modeling dynamic component behavior results in a more detailed view of the component and how it arrives at certain states during its execution. It provides a continuous view of the component's internal execution details. While this level of component modeling has not been practiced as widely as interface or static behavior modeling, there are several notable examples of it. For instance, UML has adopted a StateChart-based technique to model the dynamic behaviors of its conceptual components (i.e., Classes). Other variations of state-based techniques (e.g., FSMs) have been used for similar purposes (e.g., [30]). Finally, Wright [2] uses CSP to model dynamic behaviors of its components and connectors.

Interaction Protocol Modeling. The last category of component modeling approaches focuses on legal protocols of interaction among components. This view of modeling provides a continuous external view of a component's execution by specifying the allowed execution traces of its operations (accessed via interfaces). Several techniques for specifying interaction protocols have been developed. These techniques are based on CSP [2], FSM [133], temporal logic [1], and regular languages [102]. They often focus on detailed formal models of the interaction protocols and enable proofs of protocol properties. However, some may not scale very well, while others may be too formal and complex for routine use by practitioners.

Typically, the static and dynamic component behaviors and interaction protocols are expressed in terms of a component's interface model. For instance, at the level of static behavior modeling, the pre- and post-conditions of an operation are tied to the specific interface through which the operation is accessed. Similarly, the widely adopted protocol modeling approach [133] uses finite-state machines in which component interfaces serve as labels on the transitions. The same is also true of UML's use of interfaces specified in class diagrams for modeling event/action pairs in the corresponding StateCharts model.
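For illustration, the sketch below (hypothetical component, states, and interfaces) encodes such a protocol as a finite-state machine whose transitions are labeled with the component's interfaces, and checks whether a given trace of invocations is legal:

states = {'Idle', 'Busy'};                      % protocol states (hypothetical)
% Transitions: from-state, interface label, to-state.
proto = { 'Idle', 'start',  'Busy';
          'Busy', 'status', 'Busy';
          'Busy', 'stop',   'Idle' };

trace   = {'start', 'status', 'stop'};          % an invocation trace to check
current = 'Idle';
legal   = true;
for k = 1:numel(trace)
    row = strcmp(proto(:,1), current) & strcmp(proto(:,2), trace{k});
    if any(row)
        current = proto{find(row, 1), 3};       % follow the labeled transition
    else
        legal = false;                          % invocation not allowed in this state
        break;
    end
end
fprintf('Trace is legal: %d\n', legal);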

7.2 Reliability Modeling

Modeling, estimating, and analyzing software reliability during testing is a discipline with over 30 years of history. Many reliability models have been proposed: Software Reliability Growth Models (SRGMs) are used to predict and estimate software reliability using statistical approaches [41,53,68,85]. Extensive overviews of these approaches have been provided previously [34,42].

The major shortcoming of SRGM approaches is that they treat the software system as a monolithic entity. They ignore the internal structure of the system, and are thus called black-box approaches. Consequently, these approaches cannot be used to relate the reliability of the overall system to the reliability of its constituent components. This is a major shortcoming in the case of large and complex software systems, where decomposition, separation of concerns, and reuse play important roles in architecting and designing them. Finally, these black-box techniques directly leverage failure data, and thus cannot be applied to stages before testing. Estimating the reliability of the system during testing does little to support a cost-effective software development process. The defects detected during testing will be significantly more costly to fix than if they had been detected in earlier stages of development. Additionally, knowing the estimated reliability value at such a late stage leaves few options for meeting the reliability requirements of a software system.

Another category of software reliability modeling techniques is white-box: these techniques consider a system's internal structure in reliability estimation. They directly leverage the reliability of individual components and their configuration in order to calculate the system's overall reliability [43,56]. They usually assume that the individual component reliabilities are known or can be obtained via SRGM approaches. Goseva-Popstojanova et al. further classify white-box techniques into path-based, state-based, and additive [45]: path-based models compute software reliability based on the system's possible execution paths; state-based models use the control flow graph to represent the system's internal structure and estimate its reliability analytically; finally, additive models simply add the failure rates of the individual units to determine the overall failure rate of the application and do not consider software structure. In summary, white-box approaches leverage two independent models in calculating reliability: a structural model describing the software's internal structure, and a failure model describing its failure behavior.
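As a concrete, deliberately simplified illustration of the state-based idea, the following Matlab sketch computes the reliability of a hypothetical three-component system in the style of Cheung's model [18]: control transfers between components according to an assumed operational profile P, and a transfer out of a component succeeds only if that component does not fail.

P = [0 0.6 0.4; 0 0 1; 0 0 0];   % assumed transfer-of-control probabilities
R = [0.99 0.98 0.995];           % assumed component reliabilities

Q = diag(R) * P;                 % control reaches j from i only if i does not fail
N = inv(eye(3) - Q);             % expected visits to each component
sysR = N(1,3) * R(3);            % probability of reaching component 3 (visited at
                                 % most once here) times its own reliability
fprintf('Estimated system reliability: %.4f\n', sysR);

With these illustrative numbers the estimate is roughly 0.973; the point is only that the structural model (P) and the failure model (R) enter the calculation as the two separate models described above.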

The common theme across all of these approaches, however, is their applicability to implementation-level artifacts and reliability estimation during testing. Even those approaches assumed to be applicable in other development phases rely on estimates of the code size [23]. Those that are architectural consider only the structure of the system. The only exceptions are [45,106,134,130]. Reussner et al. [106] build architectural reliability models based on both structural and behavioral specifications of a system. Their parameterized reliability estimation technique assumes the reliability of individual component services to be known. Wang et al. [134] leverage architectural configuration while focusing on architectural styles for building a prediction model that is mostly concerned with sequential control flow across components in a system. Goseva-Popstojanova et al. [45] focus on uncertainties associated with unknown operational profiles, and provide extensive sensitivity analysis to demonstrate the effectiveness of their approach. Their architectural model represents the control flow among the components, but cannot model the concurrency and hierarchy often represented in architectural models [77]. Finally, Yacoub et al. [130] leverage a scenario-based model of the system's behavior and build component dependency graphs to perform reliability analysis.

However, none of these approaches considers the effect of a component's internal behavior on its reliability. They simply assume that the component's reliability, or the reliability of some of its elements (such as its services), is known. They then use these values to obtain the system reliability. Additionally, with the exception of [45], they rely on the availability of a running system to obtain the frequency of component service invocations (the operational profile).


When predicting software reliability at early stages of development, such as during architectural design, proper knowledge of the system's operational profile is not available. This contributes to some level of uncertainty in the parameters used for reliability estimation. In general, if there is considerable uncertainty in the estimates of the system's operational profile, that uncertainty may be propagated to the estimated reliability. Consequently, traditional approaches to software reliability estimation may not be appropriate, since they cannot take such uncertainties into consideration. A few approaches assess the uncertainties in reliability estimation heuristically, with a variable operational profile, via techniques such as the method of moments and simulation-based techniques such as Monte Carlo simulation [45]. Other techniques, however, assume a fixed (a priori known) operational profile and varying component reliabilities, and apply traditional Markov-based sensitivity analysis [18,115].
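A Monte Carlo sketch of this kind of sensitivity analysis, reusing the hypothetical three-component model above and treating one branching probability of the operational profile as uncertain (the interval is assumed purely for illustration):

R = [0.99 0.98 0.995];           % assumed component reliabilities (as above)
nSamples = 10000;
rel = zeros(nSamples, 1);
for k = 1:nSamples
    p12 = 0.5 + 0.2*rand;        % uncertain usage split out of component 1
    P = [0 p12 1-p12; 0 0 1; 0 0 0];
    Q = diag(R) * P;
    N = inv(eye(3) - Q);
    rel(k) = N(1,3) * R(3);
end
fprintf('reliability: mean = %.4f, std = %.4f\n', mean(rel), std(rel));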

Hidden Markov Models (HMMs) [103] are a formalism that leverages Markov models while assuming that some parameters may be unknown (hidden). In particular, HMMs assume that, while the number of states in the state-based model is known, the exact sequence of states that produced an observed sequence of transitions may be unknown. In addition, HMMs assume that the values of the transition probability distribution may be inaccurate. The challenge is to determine the hidden parameters from the observable ones, based on these assumptions.
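The following Matlab fragment sketches the forward recursion that underlies HMM evaluation (all matrices are hypothetical): given assumed transition, observation, and initial distributions, it computes the likelihood of an observed success/failure sequence without knowing which hidden states produced it.

A   = [0.7 0.3; 0.4 0.6];        % hidden-state transition probabilities (assumed)
B   = [0.99 0.01; 0.90 0.10];    % P(observation | hidden state); columns: ok, fail
pi0 = [0.8 0.2];                 % initial hidden-state distribution (assumed)
obs = [1 1 2 1];                 % observed outcomes (1 = ok, 2 = fail)

alpha = pi0 .* B(:, obs(1))';    % forward variable after the first observation
for t = 2:numel(obs)
    alpha = (alpha * A) .* B(:, obs(t))';
end
fprintf('Likelihood of the observation sequence: %.6f\n', sum(alpha));

Baum-Welch (EM) iterates this recursion, together with its backward counterpart, to re-estimate the transition and observation probabilities, which is how the hidden parameters are determined from the observable ones.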


With the exception of [29], previous approaches to Markovian software reliability modeling have not leveraged HMMs (the focus of [29] is on imperfect debugging during testing, and it does not relate components' interactions to reliability estimation, which is a primary goal of this dissertation research).

While HMMs have not previously been used in the context of architecture-level reliability estimation, they have produced good results in areas such as recognition of handwritten characters [12], image recognition [19], segmentation of DNA sequences and gene recognition [14], and economic data modeling [29].

Bayesian Networks have long been used in various areas of science and engineering where a flexible method for reasoning under uncertainty is needed. They have been used for data mining and knowledge discovery [5]; security, spam filtering, and intrusion detection [60,64]; and forecasting and data analysis in the medical domain [135]. They have also been applied to modeling the reliability of engineering systems [4,68,96,98,123].

Two important factors distinguish our work from all of these Bayesian-based reliability approaches: (1) our approach directly leverages architectural models of the system; and (2) our approach does not rely on the existence of the system's operational profile. The approach in [4] is a BN-based model that predicts the quality of a software product by focusing on the structure of the software development process. The quantitative part of the BN is constructed based on the various activities in the development process, and the probabilities are assigned according to the metrics obtained from these activities. The approach clearly is not applicable to early reliability assessment. Both [68,96] rely on testing data to construct the Bayesian network and perform reliability estimation; once again, such data does not exist at the architectural level. The approach in [98] calculates the quality of the development process by specifically focusing on various process activities. Our approach clearly differs from the above by relying on architectural models and domain knowledge.

7.3 Taxonomy of Architectural Reliability Models

Extensive studies of software reliability techniques are provided elsewhere [23,34,45]. For the purpose of our research, we were interested in approaches relevant to software architecture and its artifacts. In order to relate existing reliability models to software architectural artifacts, we have developed a taxonomy of reliability models along architecturally relevant dimensions. The taxonomy is depicted in Figure 7-1. We will now discuss the different dimensions of this taxonomy.

Figure 7-1. Taxonomy of Architecture-based Reliability Models

Basis. At their cores, reliability models can be grouped as those that are applicable to implementation-level artifacts (i.e., code), process-based models, or those applicable to architecture-level artifacts (i.e., specification-based models). The approaches applicable to code are further classified as black-box, where the system structure is not taken into consideration, or white-box, where the system structure is considered in the reliability model.

Traditional reliability models are implementation-based and may be black-box (e.g., SRGMs) or white-box [45]. Process-based approaches such as [79,98] consider the software development process and its various activities (such as the architecture and design stages), and measure the reliability of the process. The focus of our research is on specification-based models, where analytical reliability models may be applied to a model of the software system's architecture.

Architectural Relevance. Architectural models provide an abstraction of software system properties. In general, a particular system is defined in terms of a collection of components (loci of computation) and connectors (loci of communication) as organized in an architectural configuration. Architecture Description Languages (ADLs) [77] specify software properties in terms of a set of components that communicate via connectors through interfaces. Finally, an architectural style defines a vocabulary of component and connector types and a set of constraints on how instances of these types can be combined in a system [117]. We postulate that a useful model to quantify system reliability at the level of software architecture should consider the above modeling elements.


Model Richness. As discussed in Section 7.1, functional properties of software systems are described using one or more of the following four views: interfaces, static behaviors, dynamic behaviors, and interaction protocols. Even though other aspects of a system may be modeled using other modeling views, we believe that the above four models provide a comprehensive basis for specifying the functional properties of systems. Explicit emphasis on these views has not been a focus of existing reliability models. With the exception of Reussner et al. [106], which leverages interfaces, static behaviors (pre/post conditions), and interaction protocols, other approaches focus only on component interaction protocols to estimate system reliability. Focusing on a subset of these modeling views results in the need to make simplifying assumptions in estimating overall reliability. Such an approach inherently assumes that the values of individual component reliabilities are known a priori.

As discussed earlier, state-based approaches to reliability modeling use a control flow graph to represent the application structure, and the application reliability is estimated analytically. Path-based approaches, on the other hand, estimate reliability by considering the possible execution paths of the application. As a result, path-based approaches provide only approximate estimates for applications that have infinitely many paths due to the presence of loops. In our taxonomy, we adopt the same principle and classify models of interaction protocols as those that specify the control flow versus those that specify interactions among components. The latter enables modeling the overall system's behavior in terms of the behaviors of individual components that execute concurrently.

Overall Reliability Assessment. In general, there are two classes of approaches to estimating a system's overall reliability. The flat techniques (often seen in black-box models) take a non-compositional approach to estimating the overall reliability. Such approaches are inconsistent with software architecture and its goals of decomposition, reuse, and separation of concerns. The compositional approaches take a system's structure and its components' interactions into account when estimating overall reliability. We further classify them into those that assume a component's reliability, or the reliability of a component's constituent elements (e.g., its services), is known; such assumptions undermine the usability of these models. Alternatively, component-level reliability may be estimated given proper functional models of the component itself and based on the results of advanced analyses. Our proposed technique takes the latter approach.


Chapter 8: Conclusion and Future Work

Despite the maturity of software reliability techniques, predicting the reliability of software systems before implementation has not received adequate attention in the past. Reliability estimation techniques are often geared toward the testing phase of the development life cycle, when the system's operational profile is known. In these approaches, defects are primarily identified during testing. However, about 50% of these defects are rooted in pre-implementation phases of development, such as architecture and design [101]. Studies have shown that early discovery of defects in the software development life cycle results in a more cost-effective mitigation process [13]. Reliability prediction early in the software development life cycle is thus critical to building reliability into the software system. Given the uncertainties associated with software systems early in the development process, appropriate reliability models must be able to accommodate these uncertainties and still produce meaningful results.

The approach described in this dissertation aims to close the gap between architectural modeling on the one hand and its impact on software reliability on the other. We focused on the reliability of individual software components as the first step. Our approach leverages standard architectural models of the system, and uses an Augmented Hidden Markov Model to predict the reliability of software components.


The system’s overall reliability is then predicted as a function of individual components’ reliabilities, and their complex interactions. This is done by building a Dynamic Bayesian Network based on components’ interaction protocol models, and leveraging analytical techniques to quantify the reliability of software systems. In the rest of this chapter, we first enumerate the contributions of this dissertation research. We then conclude by offering several interesting directions this research can take in the future.

8.1 Contributions

The contributions of this work can be summarized as follows:

• Mechanisms to ensure intra- and inter-consistency among multiple views of a system's architectural models,

• A formal reliability model to predict both the component-level and system-level reliability of a given software system based on its architectural specification, and

• A parameterized and pluggable defect classification and cost-framework to identify critical defects, whose mitigation is most cost-effective in improving a system's overall reliability.

The combination of the architectural modeling and analysis technique, together with the defect classification, the cost-framework, and the reliability models of individual components and the overall system, comprises a comprehensive methodology which has been evaluated on a series of case study applications.

8.2 Future Work

In this section we describe a set of open research questions that form the various aspects of our future work.

8.2.1. Architectural Styles and Patterns and Reliability

This work has not considered the impact of specific architectural styles or patterns of interaction on reliability. Architectural styles impose constraints on the interactions among components in a system, as well as on the system's structure. Moreover, leveraging known interaction patterns can help eliminate some possible architectural defects. Relating the properties of architectural styles and patterns to our reliability model may be beneficial in two ways. First, it could enable architects to use patterns and styles as templates, where the impact of the specific constraints on a system's reliability is already quantified. Moreover, leveraging various design constraints may enable us to eliminate some of the parameters in the reliability model, and thus reduce the complexity of the underlying algorithms. This, in turn, may help improve the scalability of the approach.

The only related approach to incorporating architectural styles into a reliability model [134] simplifies the problem by modeling only the transfer of control among components based on the style characteristics. The main challenge in providing a more fine-grained incorporation of the two concepts is addressing concurrency issues in components' interactions. Another interesting problem is to formalize the notion of reliable patterns. A good place to start is to draw parallels with the research and development in the software security community [113]. We believe that our Bayesian reliability modeling approach offers a starting point for incorporating these concepts into the reliability prediction approach.

8.2.2. Reliability Modeling for Software Connectors

Software connectors are the loci of communication in a software system and act as the glue that enables the interactions among components. Our reliability model only focuses on software components and their operations, and treats connectors as special components. While software connectors have been shown to provide a suitable vehicle for modeling other dependability attributes (such as security) [105], there has not been any research on modeling the reliability of systems using software connectors. We plan to study this topic and extend our reliability model to encompass both components and connectors. The first challenge here is building appropriate abstractions for modeling relevant connector properties. Unlike for components, not much focus has been placed on developing effective techniques for modeling and analyzing software connectors. Furthermore, special attention must be given to the interaction of components and connectors. Our system-level reliability modeling approach thus must be adapted to incorporate connector models into the Global Behavioral Model, and to formalize the component-connector and connector-connector interactions.

8.2.3. Early Prediction of Other Dependability Attributes

Modeling other dependability attributes (such as availability, safety, and security) exhibits properties similar to those of reliability modeling. Building dependable software systems requires addressing these other dependability properties as well. We plan to extend our work to model other dependability aspects of a software system's architecture in the early stages of the development process.

Availability. Similar to software reliability, availability may be modeled stochastically. An interesting question is whether prediction of system availability may be performed at early stages of software development, when no implementation-level artifacts exist. Architectural models (e.g., ADLs) must thus be extended to explicitly model, analyze, and simulate the deployment conditions under which the system will be operational.

Security and safety. Recent advances in the software security community have brought architectural risk analysis and threat modeling to the forefront of the dependable software development process. The main shortcoming of these approaches, however, is their emphasis on low-level implementation issues at an early stage of development, when possibly no code has yet been developed. Higher-level abstractions are needed to describe, model, and analyze the security and safety characteristics of systems at the architectural level.

8.2.4. Extensions to Support Product Families

In the past we have done extensive work in the architectural modeling, analysis, and evolution of software systems, a natural springboard for supporting the architectural design of product families. This research could benefit from the results of relating architectural patterns and styles to their impact on software reliability. Such abstractions reveal themselves more naturally in contexts in which reuse is leveraged. We intend to expand our previous work in modeling architectural evolution, and to build reliability models applicable to product families and their associated challenges.


References

1.

N. Aguirre, T.S.E. Maibaum. A Temporal Logic Approach to Component Based System Specification and Reasoning. In Proceedings of the 5th ICSE Workshop on Component-Based Software Engineering, Orlando, FL, 2002.

2.

R. Allen, and D. Garlan. A Formal Basis for Architecture Connection. ACM Transactions on Software Engineering and Methodology, 6(3): p.213-249, 1997.

3.

R. Almond. An extended example for testing Graphical Belief. Technical Report 6, Statistical Sciences Inc. (1992).

4.

S. Amasaki, et al., Bayesian Belief Network for Assessing the Likelihood of Fault Content, in Proceedings of the 14th International Symposium on Software Reliability Engineering (ISSRE), Denver, Colorado, 2003.

5.

S. Arnborg. A Survey of Bayesian Data Mining, in John Wang’s Data Mining: Opportunities and Challenges, Montclair State University, USA, 2003.

6.

P. Ashar, A. Gupta, S.Malik. Using complete-1-distinguishability for FSM equivalence checking. ACM Transactions on Design Automation of Electronic Systems Vol. 6, No. 4, pp 569-590, October 2001.

7.

A. Azem. Software Reliability Determination for Conventional and Logic Programming, Walter de Gruyter, 1995.

8.

R. Balzer. Tolerating Inconsistency, in Proceedings of 13th International Conference on Software Engineering (ICSE-13), Austin, Texas, 1991.

9.

A. Benveniste, E. Fabre, S. Haar. Markov Nets: Probabilistic Models for Distributed and Concurrent Systems, IEEE Transactions on Automatic Control AC-48, 11, pages 1936-1950, November 2003.

10.

A. Bondavalli, et. al., Dependability Analysis in the Early Phases of UML Based System Design, Journal of Computer Systems Science and Engineering, Vol. 16, pp. 265-275, 2001

11.

L. E. Baum, An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1-8, 1972.

12.

B. Boehm. Software Engineering Economics, Prentice-Hall, Englewood Cliffs, NJ, 1981.

13.

B. Boehm. Software Risk Management: Principles and Practices, IEEE Software, January 1991.

14.

B. Boehm, J. Bhuta, D. Garlan, E. Gradman, L. Huang, A. Lam, R. Madachy, N. Medvidovic, K. Meyer, S. Meyers, G. Perez, K. Reinholtz, R. Roshandel, N. Rouquette, Using Testbeds to Accelerate Technology Maturity and Transition: The SCRover Experience, USC Technical Report USC-CSE-2003-507, (Submitted to ICSE 2004), September 2003.

15.

B. Boehm, P. Grünbacher, R. Briggs, Developing Groupware for Requirements Negotiation: Lessons Learned, IEEE Software, May/June 2001.

16.

G. Booch, I. Jacobson, J. Rumbaugh, The Unified Modeling Language User Guide, Addison-Wesley, Reading, MA.

17.

J. Chang and D.J. Richardson, Structural Specification-based Testing: Automated Support and Experimental Evaluation, ESEC/FSE’99: Proceedings of the 7th European Software Engineering Conference, Toulouse, France, September 1999.

18.

R.C. Cheung, A user-oriented software reliability model, IEEE Transactions on Software Engineering, SE-6(2):118-125, March 1980.

19.

E. Charniak, Bayesian network without tears, AI Magazine, vol. 12, no. 4, pp. 50-63, 1991.

20.

E. Cinlar, Introduction to Stochastic Processes, Englewood Cliffs, NJ, Prentice-Hall, 1975.

21.

G. F. Cooper, The Computational Complexity of Probabilistic Inference Using Bayesian Belief Networks. Artificial Intelligence, 42(2–3):393–405, March 1990.

22.

C. Courcoubetis, and M. Yannakakis, The complexity of probabilistic verification. Journal of the ACM, 42(4):857–907, 1995.

23.

S.R. Dalal, Software Reliability Models: A Selective Survey and New Directions, Handbook of Reliability Engineering, edited by H. Pham, Springer, 2003.

24.

T. DeMarco, Controlling Software Projects: Management, Measurement, and Estimation. Englewood Cliffs, NJ: Yourdon Press, 1998.

25.

E. Dashofy, A. van der Hoek, R.N. Taylor, An Infrastructure for the Rapid Development of XML-based Architecture Description Languages, In Proceedings of the 24th International Conference on Software Engineering (ICSE2002), Orlando, Florida.

26.

D. Dvorak, Challenging Encapsulation in the Design of High-Risk Control Systems. In Proceedings of the 2002 Conference on Object Oriented Programming Systems, Languages, and Applications (OOPSLA'02), Seattle, WA, November 2002.

27.

D. Dvorak, R. Rasmussen, G. Reeves, and A. Sacks, Software Architecture Themes In JPL's Mission Data System, In Proceedings of the AIAA Space Technology Conference and Exposition, Albuquerque, NM, September, 1999.

28.

J. Dolbec, T. Shepard, A Component Based Software Reliability Model, in Proceedings of the 1995 conference of the Centre for Advanced Studies on Collaborative research, Toronto, Ontario, Canada, November 1995.

29.

J.B. Durand, O. Gaudoin, Software reliability modelling and prediction with Hidden Markov chains, Technical Report Number: INRIA n°4747, February 2003.

30.

M. Dias, M. Vieira, Software Architecture Analysis based on Statechart Semantics, in Proceedings of the 10th International Workshop on Software Specification and Design, FSE-8, San Diego, USA, November 2000.

31.

A. Egyed, Architecture Differencing for Self Management, in Proceedings of the 1st ACM SIGSOFT workshop on Self-managed systems, Newport Beach, California, 2004.

32.

A. Egyed, Scalable Consistency Checking between Diagrams - The ViewIntegra Approach, in Proceedings of the 16th IEEE International Conference on Automated Software Engineering, San Diego, CA, 2001

33.

W. Everett, Software Component Reliability Analysis, in IEEE Symposium on Application - Specific Systems and Software Engineering and Technology, Richardson, Texas, 1999.

34.

W. Farr, Software Reliability Modeling Survey, Handbook of Software Reliability Engineering, M. R. Lyu, Editor. McGraw-Hill, New York, NY, 1996.

35.

A. Farías, M. Südholt, On Components with Explicit Protocols Satisfying a Notion of Correctness by Construction. in Proceedings of Confederated International Conferences CoopIS/DOA/ODBASE, 2002.

36.

T.H. Feng, A Virtual Machine Supporting Multiple Statechart Extensions, In Proceedings of Summer Computer Simulation Conference (SCSC 2003), Student Workshop. The Society for Computer Modeling and Simulation. Jul. 2003, Montreal, Canada.

37.

A. Finkelstein, D. Gabbay, A. Hunter, J. Kramer, and B. Nuseibeh, Inconsistency Handling in Multi-Perspective Specifications, IEEE Transactions on Software Engineering, 20(8): 569-578, August 1994.

38.

P. Fradet, D. Le Métayer, M. Périn, Consistency Checking for Multiple View Software Architectures, in Proceedings of the Seventh European Software Engineering Conference (ESEC) and the Seventh ACM SIGSOFT Symposium on the Foundations of Software Engineering, 1999.

39.

Y. Gal, A. Pfeffer, A Language for Modeling Agents' Decision Making Processes in Games, in Proceedings of the second international joint conference on Autonomous agents and multiagent systems, Melbourne, Australia, 2003.

40.

D. Garlan, R.T. Monroe, and D. Wile, Acme: Architectural Description of Component-Based Systems. Foundations of Component-Based Systems. Leavens, G.T., and Sitaraman, M. (eds). Cambridge University Press, 2000 pp. 47-68.

41.

A.L. Goel, K. Okumoto, Time-Dependent Error-Detection Rate Models for Software Reliability and Other Performance Measures, IEEE Transactions on Reliability, 28(3):206–211, August 1979.

42.

S. Gokhale, P.N. Marinos, and K.S. Trivedi, Important milestones in software reliability modeling, In Proceedings of the 8th International Conference on Software Engineering and Knowledge Engineering (SEKE 96), Lake Tahoe, June 1996.

43.

S. Gokhale, T. Philip, P. Marinos, K. Trivedi, Unification of finite-failure nonhomogenous Poisson process models through test coverage, in Proceedings of the 7th IEEE International Symposium on Software Reliability Engineering (ISSRE-96), November. 1996.


44.

S. Gokhale, W. E. Wong, K. S. Trivedi, and J. R. Horgan, An Analytical Approach to Architecture-Based Software Reliability Prediction, IEEE International. Computer Performance and Dependability Symposium, Durham, NC, Sept. 1998.

45.

K. Goseva-Popstojanova, A.P. Mathur, K.S. Trivedi, Comparison of Architecture-Based Software Reliability Models, in Proceedings of the 12th IEEE International Symposium on Software Reliability Engineering (ISSRE-2001), Hong Kong, November 2001.

46.

W.J. Gutjahr, Optimal Test Distributions for Software Failure Cost Estimation, IEEE Transaction on Software Engineering, V. 21, No. 3, pp. 219-228, March 1995.

47.

D. Harel, Statecharts: A visual formalism for complex systems, Science of Computer Programming, Volume 8, Issue 3, June 1987.

48.

D. Harel, A. Naamad, The STATEMATE Semantics of Statecharts. ACM Transactions on Software Engineering and Methodology, 5(4): 293-333, 1996.

49.

D. Heckerman, A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, M. Jordan, ed. MIT Press, Cambridge, MA, 1999.

50.

C. Hofmeister, R.L. Nord, and D. Soni, Describing Software Architecture with UML, In Proceedings of the TC2 First Working IFIP Conference on Software Architecture (WICSA1), San Antonio, TX, February 22-24, 1999.

51.

B. Hnatkowska, Z. Huzar, J. Magott, Consistency Checking in UML Models, in Proceedings of Fourth International Conference on Information System Modeling (ISM01), Czech Republic, 2001.

52.

Inspector SCRover Project, http://cse.usc.edu/iscr/pages/ProjectDescription/ home.htm

53.

Z. Jelinski and P. B. Moranda, Software Reliability Research, Statistical Computer Performance Evaluation, edited by W. Freigerger, Academic Press, 1972.

54.

F. Jensen, Bayesian Networks and Decision Graphs. Springer., 2001

55.

M. I. Jordan, (ed), Learning in Graphical Models, MIT Press. 1998.


56.

S. Krishnamurthy, A.P. Mathur, On the Estimation of Reliability of a Software System Using Reliability of its Components, in Proceedings of the 8th IEEE International Symposium on Software Reliability Engineering (ISSRE-97), pp.146-155, November 1997.

57.

P.B. Kruchten, The 4+1 View Model of Architecture. IEEE Software, 12(6):42-50, 1995.

58.

H. Langseth, Bayesian Networks with Application in Reliability Analysis. PhD Thesis, Dept. of Mathematical Sciences, Norwegian University of Science and Technology, 2002.

59.

J.C. Laprie and K. Kanoun, Handbook of Software Reliability Engineering, M. R. Lyu, Editor, chapter “Software Reliability and System Reliability”, pages 27–69. McGraw-Hill, New York, NY, 1996.

60.

W. Lee, Applying Data Mining to Intrusion Detection: the Quest for Automation, Efficiency, and Credibility, ACM SIGKDD Explorations Newsletter, Volume 4, Issue 2, Pages: 35 - 42, December 2002.

61.

N. Leveson, Safeware: System Safety and Computers, Addison Wesley (1995).

62.

J. Li, Monitoring and Characterization of Component-Based Systems with Global Causality Capture, In Proceedings of the 23rd IEEE International Conference on Distributed Computing Systems (ICDCS), 2003.

63.

J. Li, J. Micallef, and J. Horgan, Automatic Simulation to Predict Software Architecture Reliability, in Proceedings of Eighth International Symposium on Software Reliability Engineering (ISSRE '97), Albuquerque, NM, 1997.

64.

P. Liu, W. Zang, M Yu., Incentive-based Modeling and Inference of Attacker Intent, Objectives, and Strategies, ACM Transactions on Information and System Security, Volume 8, Issue 1, Pages: 78 - 118, 2005.

65.

B.H. Liskov, J. M. Wing, A Behavioral Notion of Subtyping, ACM Transactions on Programming Languages and Systems, November 1994.

66.

B. Littlewood, A Reliability Model for Markov Structured Software, In Proceedings of the 1975 International Conference on Reliable Software, pages 204–207, Los Angeles, CA, April 1975.


67.

B. Littlewood, A Semi-Markov Model for Software Reliability with Failure Costs, In Proceedings of Symposium on Computational Software Engineering, pp 281–300, Polytechnic Institute of New York, April 1976.

68.

B.A. Littlewood, and J.L. Verrall, A Bayesian Reliability Growth Model for Computer Software, Applied Statistics, Volume 22, pp. 332-346, 1973.

69.

D.C. Luckham, and J. Vera, An Event-Based Architecture Definition Language. IEEE Transactions on Software Engineering, vol. 21, no. 9, pp. 717-734, September 1995.

70.

M. R. Lyu, Handbook of Software Reliability Engineering, McGraw-Hill, New York, NY, 1996.

71.

J. Magee, and J. Kramer, Dynamic Structure in Software Architectures, in Proceedings of the Fourth ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp.3-13, 1996.

72.

A. Maggiolo-Schettini, A. Peron, and S. Tini, Equivalence of Statecharts, In Proceedings of CONCUR '96, Springer, Berlin, 1996

73.

D. Mason, Probabilistic Analysis for Component Reliability Composition. In 5th ICSE Workshop on Component-Based Software Engineering (CBSE’2002), Orlando, Florida, USA, May 2002.

74.

The MathWorks Matlab: http://www.mathworks.com

75.

J. McManus, Risk Management in Software Development Projects, Butterworth-Heinemann, 2003.

76.

N. Medvidovic, D.S. Rosenblum, and R.N. Taylor, A Language and Environment for Architecture-Based Software Development and Evolution, In Proceedings of the 21st International Conference on Software Engineering (ICSE'99), Los Angeles, CA, May 1999.

77.

N. Medvidovic, and R.N. Taylor, A Classification and Comparison Framework for Software Architecture Description Languages. IEEE Transactions on Software Engineering 26(1), pp. 70-93, 2000.

78.

Microsoft Developer Network Library, Common Object Model Specification, Microsoft Corporation, 1996.


79.

A. Mockus, D.M. Weiss, P. Zhang, Understanding and Predicting Effort in Software Projects, in Proceedings of the 25th International Conference on Software Engineering, Portland, Oregon, 2003.

80.

J.F. Murray, G.F. Hughes, K. Kreutz-Delgado, Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application, The Journal of Machine Learning Research, Vol 6, Pages: 783 - 816, 2005.

81.

K. Murphy, A Brief Introduction to Graphical Models and Bayesian Networks, http://www.cs.ubc.ca/~murphyk/bayes/.html, 1998.

82.

K. Murphy, Hidden Markov Model (HMM) Toolbox for Matlab, http:// www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html

83.

J.D. Musa, A Theory of Software Reliability and Its Application, IEEE Transactions on Software Engineering, 1(1975)3, pp. 312-327, 1975.

84.

J.D. Musa, A. Iannino, K. Okumoto, Software Reliability– Measurement, Prediction, Application, McGraw-Hill International Editions, 1987.

85.

J.D. Musa, and K. Okumoto, Logarithmic Poisson Execution Time Model for Software Reliability Measurement, in Proceedings of Compsac 1984, pp. 230-238, 1984.

86.

NASA High Dependability Computing Project (HDCP), http://www.hdcp.org.

87.

NASA Object Oriented Data Technology (OODT), http://oodt.jpl.nasa.gov.

88.

R. Neapolitan, Probabilistic Reasoning in Expert Systems. J. Wiley, 1990.

89.

C. Needham, J.R. Bradford, A.J. Bulpitt, D.R. Westhead, Application of Bayesian Networks to Two Classification Problems in Bioinformatics. Quantitative Biology, Shape Analysis and Wavelets, 87-90, 2005.

90.

Netica. http://www.norsys.com

91.

A. Nicholson, S. Russell, Techniques for Handling Inference Complexity in Dynamic Belief Networks, Technical Report: CS-93-31, Brown University, 1993.

92.

D. Nikovski, Constructing Bayesian Networks for Medical Diagnosis from Incomplete and Partially Correct Statistics, IEEE Transactions on Knowledge and Data Engineering, Volume 12, Issue 4, Pages: 509 - 516, 2000.

93.

B. Nuseibeh, J. Kramer, and A. Finkelstein, Expressing the Relationships Between Multiple Views in Requirements Specification, in Proceedings of the 15th International Conference on Software Engineering (ICSE-15), Baltimore, Maryland, USA, 1993.

94.

Object Management Group, The Common Object Request Broker: Architecture and Specification, Document Number 91.12.1, OMG, December 1991.

95.

The Object Constraint Language (OCL), http://www-3.ibm.com/software/ad/ library/standards/ocl.html.

96.

H. Okamura, H. Furumura, and T. Dohi, Bayesian Approach to Estimate Software Reliability in Fault-removal Environment, in Proceedings of the 15th IEEE International Symposium on Software Reliability Engineering (ISSRE 2004) (Fast Abstract), Saint-Malo, France, December 2-5, 2004.

97.

G.J. Pai, and J.B. Dugan, Enhancing Software Reliability Estimation Using Bayesian Networks and Fault Trees, in Proceedings of the 12th IEEE International Symposium on Software Reliability Engineering (ISSRE Fast Abstract Track), 2001.

98.

G.J. Pai, S.K. Donohue, and J.B. Dugan, Estimating Software Reliability from Process and Product Evidence, in Proceedings of the 6th International Conference on Probabilistic Safety Assessment and Management, Feb. 2002.

99.

J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, 1989.

100.

D.E. Perry, and A.L. Wolf, Foundations for the Study of Software Architectures, ACM SIGSOFT Software Engineering Notes, 17(4): 40-52, 1992.

101.

H. Pham, Software Reliability, Springer 2002.

102.

F. Plasil, S. Visnovsky, Behavior Protocols for Software Components, IEEE Transactions on Software Engineering 28(11), pp. 1056–1076, November 2002.

103.

L.R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In Proceedings of IEEE. Volume 77, 1989.


104.

L.R. Rabiner, B.H. Juang, and C.H. Lee, An Overview of Automatic Speech Recognition. In C. H. Lee, F. K. Soong, and K. K. Paliwal, editors, Automatic Speech and Speaker Recognition, Advanced Topics, pages 1-30. Kluwer Academic Publishers, 1996.

105.

J. Ren, R. Taylor, P. Dourish, D. Redmiles. Towards An Architectural Treatment of Software Security: A Connector-Centric Approach. In Proceedings of the Workshop on Software Engineering for Secure Systems, International Conference on Software Engineering, St. Louis, Missouri, USA, 2005.

106.

R. Reussner, H. Schmidt, I. Poernomo, Reliability Prediction for Componentbased Software Architectures, In Journal of Systems and Software, 66(3), pp. 241-252, Elsevier Science Inc, 2003.

107.

R. Roshandel, N. Medvidovic, Coupling Static and Dynamic Semantics in an Architecture Description Language, in Proceedings of the Working Conference on Complex and Dynamic Systems Architectures, Brisbane, Australia, 2001.

108.

R. Roshandel, N. Medvidovic, Modeling Multiple Aspects of Software Components, in Proceedings of the Workshop on Specification and Verification of Component-Based Systems, ESEC-FSE03, Helsinki, Finland, September 2003.

109.

R. Roshandel, N. Medvidovic, Multi-View Software Component Modeling for Dependability, In R. de Lemos, C. Gacek, and A. Romanowski, eds., Architecting Dependable Systems II, Lecture Notes in Computer Science 3069, Springer Verlag, pages 286-306, June 2004.

110.

R. Roshandel, B. Schmerl, N. Medvidovic, D. Garlan, D. Zhang, Understanding Tradeoffs among Different Architectural Modeling Approaches, in Proc. of the 4th Working IEEE/IFIP Conference on Software Architecture, WICSA 2004, Oslo, Norway, June 2004.

111.

R. Roshandel, A. van der Hoek, M. Mikic-Rakic, N. Medvidovic, Mae - A System Model and Environment for Managing Architectural Evolution, ACM Transactions on Software Engineering and Methodology, vol. 11, no. 2, pages 240-276, April 2004.

112.

G.J. Schick, and R.W. Wolverton, An Analysis of Computing Software Reliability Models, in IEEE Transactions on Software Engineering, vol. SE-4, pp. 104-120, July 1978.

113.

M. Schumacher, Security Engineering with Patterns: Origins, Theoretical Models, and New Applications, Springer, 1st edition, 2003.

114.

SCRover Project: http://cse.usc.edu/hdcp/iscr.

115.

K. Siegrist, Reliability of systems with Markov transfer of control, in IEEE Transactions on Software Engineering, 14(7):1049–1053, July 1988.

116.

M. Shaw, Cost and Effort Estimation. CPSC451 Lecture Notes. The University of Calgary, 1995.

117.

M. Shaw, D. Garlan, Software Architecture: Perspectives on an Emerging Discipline. Prentice-Hall, 1996.

118.

M. Shaw, R. DeLine, D.V. Klein, T.L. Ross, D.M. Young, G. Zelesnik, Abstractions for Software Architecture and Tools to Support Them. IEEE Transactions on Software Engineering, 21(4), 1995.

119.

M. Shooman, Software Engineering, Design, Reliability, and Management, Mc-Graw-Hill, New York, 1983.

120.

N.D. Singpurwalla, and S.P. Wilson, Statistical Methods in Software Engineering: Reliability and Risk. Springer Verlag, New York, NY, 1999.

121.

J. Solano-Soto and L. Sucar, A Methodology for Reliable System Design. In Lecture Notes in Computer Science, Volume 2070, pp. 734–745. Springer, 2001.

122.

The Unified Modeling Language (UML), http://www.uml.org.

123.

J. Torres-Toledano and L. Sucar, Bayesian Networks for Reliability Analysis of Complex Systems. In Lecture Notes in Artificial Intelligence 1484. Springer Verlag, 1998.

124.

K. Trivedi, Probability and Statistics with Reliability, Queueing, and Computer Science Applications, 2nd Edition, Wiley-Interscience, 2001.

125.

USC Center for Software Engineering, Guidelines for Model-Based (System) Architecting and Software Engineering, http://sunset.usc.edu/research/ MBASE, 2003.

126.

M. Vardi, Automatic verification of probabilistic concurrent finite-state programs. In Proceedings of FOCS’85, pages 327–338. IEEE Press, 1987.


127.

A. van der Hoek, M. Rakic, R. Roshandel, N. Medvidovic, Taming Architecture Evolution, in Proceedings of the Sixth European Software Engineering Conference (ESEC) and the Ninth ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE-9), Vienna, Austria, 2001.

128.

R. van Ommering, Building Product Populations with Software Components, in Proceedings of the 24th International Conference on Software Engineering (ICSE2002), Orlando, Florida.

129.

A.J Viterbi, Error Bounds for Convolutional Codes and An Asymptotically Optimal Decoding Algorithm, IEEE Transactions on Information Theory, 13:260–269, 1967.

130.

S. Yacoub, B. Cukic, and H. Ammar, Scenario-Based Analysis of Componentbased Software. In Proceedings of the Tenth International Symposium on Software Reliability Engineering, Boca Raton, FL, November 1999.

131.

S. Yamada, Software Reliability Models and Their Applications: A Survey, International Seminar on Software Reliability of Man-Machine Systems, Kyoto, Japan 2000.

132.

S. Yamada, M. Ohba, and S. Osaki, S-Shaped Reliability Growth Modeling for Software Error Detection. in IEEE Transactions on Reliability, R32(5):475-485, December 1983.

133.

D.M. Yellin, R.E. Strom, Protocol Specifications and Component Adaptors, ACM Transactions on Programming Languages and Systems, Vol. 19, No. 2, 1997.

134.

W. Wang, Y. Wu, M. Chen, An Architecture-based Software Reliability Model, in Proceedings of Pacific Rim International Symposium on Dependable Computing, 1999, pp. 143-150.

135.

M. West and P.J. Harrison, Bayesian Forecasting and Dynamic Models, 2nd edn. Springer-Verlag, New York, 1997.

136.

A.M. Zaremski, J.M. Wing, Specification Matching of Software Components, ACM Transactions on Software Engineering and Methodology, 6(4):333–369, 1997.


Appendix A: Mae Schemas for Quartet Models

This appendix contains three xADL schemas that describe static behaviors, dynamic behaviors, and interaction protocol views of the Quartet model.

Static Behaviors Schema

<!--
  Copyright (c) 2003-2004 University of Southern California. All rights reserved.
  This software was developed at the University of Southern California. Redistribution and
  use in source and binary forms are permitted provided that the above copyright notice and
  this paragraph are duplicated in all such forms and that any documentation, advertising
  materials, and other materials related to such distribution and use acknowledge that the
  software was developed by the University of Southern California. The name of the
  University may not be used to endorse or promote products derived from this software
  without specific prior written permission. THIS SOFTWARE IS PROVIDED ``AS IS'' AND
  WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
  WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.

  xArch Type XML Schema 1.0
  Change Log:
    2003-3-10: Roshanak Roshandel [[email protected]]
               Transiting from the C2 schema to the static behavioral schema
-->

(Schema element definitions omitted.)

Dynamic Behaviors Schema
(Same USC copyright header as above; schema element definitions omitted.)
