Verification & Validation of Stochastic Models

Verification & Validation of Stochastic Models
Module 8, ECE 590, Spring 2017
Kishor Trivedi, [email protected]
ECE Department, Duke University, Durham, NC, USA
Copyright © 2017 by K.S. Trivedi

Purpose

- To produce a model with enough fidelity to answer the questions posed
- To produce a credible model

Steps in Modeling

- Step 1: Study/understand the system being modeled
- Step 2: Develop a conceptual model
- Step 3: Translate into a computerized model
- Step 4: Parameterize the model
- Step 5: Solve the model
- Step 6: Improve the model based on the results of verification and validation
- Step 7: Use the model

Step 1: Study the system

- Problem formulation
  - Ask pertinent questions of the experts (hardware/software architects)
  - Filter out unimportant details
  - Iterate between steps 1 and 2 until agreement
  - Use visual aids in communicating with the architects
    - Flowcharts
    - Trees
    - State-charts
    - Timing diagrams

Step 2: Conceptual Model

- Formalisms
  - Non-state-space (Ftree, RBD, Relgraph; PFQN, SPDAG)
  - State-space (DTMC, CTMC, SPN, GSPN, SRN, SMP, MRGP, MRSPN, FSPN, SPA)
  - Hierarchical
  - Fixed-point iterative
  - DEVS, SPN
  - Languages: SAVE, Boeing's SDM
  - Flowchart
  - SysML, UML
  - None

Step 3: Translate into a computerized model

- Write your own code
  - NIH syndrome
- Use an existing software package (tool)
  - HARP, SHARPE, SPNP, SREPT
- A combination
  - Boeing's IRAP
  - NEC's Candy

Step 4: Model Parameterization

- Data collection and analysis
  - Gather input parameters from various sources
    - Databases (hardware MTTFs)
    - Repair delays (maintenance contract)
    - Fault/error injection experiments (coverages, recovery delays)
    - Experiments or SRGM-based estimates (software MTTFs)
    - Arrival process parameters (measurements)
    - Service time parameters (measurements)
  - Statistical inference
    - Reliability growth, reliability decay, correlations, or iid
    - Estimate parameters of an SRGM in case of a trend (use SREPT)
    - Estimate parameters of MAP/MMPP/BMAP (Okamura)
    - Determine the distributions (iid case)
    - Point and interval estimates of the parameters of the chosen distributions

Step 5: Solve the model

- Solution methods
  - Analytic (non-state-space, state-space, hierarchical, fixed-point iterative)
  - DES (discrete-event simulation)
  - Hybrid
- Solution types
  - Analytic
    - Closed-form formula by hand
    - Fully symbolic analysis via Mathematica or Matlab
    - Semi-symbolic analysis via SHARPE
    - Numerical solution via SHARPE or other packages (see the sketch below)
  - Discrete-event simulation
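
A minimal sketch (not from the slides) of what a numerical analytic solution looks like for the simplest case: the steady-state probabilities of a two-state availability CTMC, computed with NumPy. The failure and repair rates are illustrative values.

```python
import numpy as np

# Two-state availability CTMC: state 0 = up, state 1 = down.
lam, mu = 1.0 / 1000.0, 1.0 / 4.0        # illustrative: MTTF = 1000 h, MTTR = 4 h

# Infinitesimal generator Q; each row sums to zero.
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])

# Solve pi Q = 0 with sum(pi) = 1 by replacing one balance equation
# with the normalization condition.
A = np.vstack([Q.T[:-1], np.ones(2)])
b = np.array([0.0, 1.0])
pi = np.linalg.solve(A, b)

print(pi[0])                # numerical steady-state availability
print(mu / (lam + mu))      # closed-form check: mu / (lam + mu)
```

The closed-form expression serves as the "special case solved by hand" cross-check recommended later in the module.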

Step 6: Model V & V

- Verification and validation
  - Model qualification
  - Model verification
  - Model and data validation
- Model calibration

Step 7: Model Use

- Using the model
  - Sensitivity analysis (brute-force; derivatives; importance measures) for bottleneck detection (see the sketch below)
  - Optimization (static vs. dynamic; constrained vs. unconstrained; integer vs. continuous)
  - Propagation of parametric (epistemic) uncertainty through the model
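
A minimal sketch (not from the slides) of brute-force parametric sensitivity for the two-state availability model above: perturb each input by 1% and observe the change in the output. The base-point values are illustrative.

```python
def availability(mttf, mttr):
    """Steady-state availability of a two-state up/down model."""
    return mttf / (mttf + mttr)

# Illustrative base point (hypothetical values).
base = {"mttf": 1000.0, "mttr": 4.0}
a0 = availability(**base)

# Brute-force sensitivity: perturb each parameter by +1% in turn.
for name in base:
    perturbed = dict(base, **{name: base[name] * 1.01})
    delta = availability(**perturbed) - a0
    print(f"{name}: dA = {delta:+.2e} per +1% change")
```

The parameter with the largest magnitude of change is the availability bottleneck candidate.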

Overall Modeling Process

- Real system or system design
- Model qualification (conceptual model validation): real system -> conceptual model
- Conceptual model
  1. Assumptions on system components
  2. Structural assumptions, which define the interactions between system components
  3. Input parameters and data assumptions
- Model verification: conceptual model -> operational (computerized) model
- Calibration and validation: operational model -> real system

V&V

- Overall validation: Does the computerized model faithfully reflect the behavior of the real system?
- Conceptual model qualification: Does the conceptual model faithfully represent the real system? The process of assessing the degree to which a conceptual model is an accurate representation of the real world from the perspective of the model's intended applications.
- Verification: Has the conceptual model been correctly implemented? The process of determining that a computerized software implementation correctly represents the conceptual model and that the model is solved correctly.
- Model validation: The process of substantiating that the computerized model behaves with satisfactory accuracy consistent with the study objectives.
- Data validation: The process of confirming that the data used are accurate, complete, unbiased, and appropriate in their original and transformed form.

V & V relations

- Reality -> conceptual model: model qualification
- Conceptual model -> computerized model: model verification
- Computerized model -> reality (via simulation/analysis): model validation
- Together these establish model credibility

Model Verification Methods

- Checked by someone else
- Check logical flows
- Multi-version modeling (do it two or more different ways)
- Check reasonableness of model outputs
- Comments and documentation
- Interactive run control of model execution
- Animation of model execution
  - Token game for PNs
  - Examine the reachability graph and the CTMC generated for simple cases (for an SPN model)
  - Job flows for QNs
- GUI for input and output

Model Verification Methods

- If the model is solved analytically/numerically:
  - Verify whether a special case can be solved in closed form
    - For an n-processor model, see if the 2-processor case can be solved in closed form
  - Verify by simulation
  - Verify by an alternative analytic paradigm
    - CTMC, SRN, PFQN, hierarchical, fixed-point iterative
  - Verify by an alternative solution algorithm (see the sketch below)
    - Fault tree solution: BDD, SDP, or factoring
    - CTMC steady-state: power method or SOR
    - CTMC transient: uniformization, ODE solvers, semi-numerical (transform)
  - Verify by alternative software packages/tools
    - Mathematica, Matlab, SHARPE, SPNP, etc.
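
A hedged sketch of the "alternative solution algorithm" idea for a CTMC transient solution: the same two-state model is solved by uniformization and by numerical ODE integration, and the two answers are compared. This uses generic SciPy/NumPy calls, not the course tools, and the rates are illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import poisson

lam, mu, t = 1e-3, 0.25, 100.0
Q = np.array([[-lam, lam], [mu, -mu]])
p0 = np.array([1.0, 0.0])            # start in the up state

# Method 1: uniformization (randomization).
q = max(-np.diag(Q))                 # uniformization rate
P = np.eye(2) + Q / q                # embedded DTMC kernel
pt_unif, term = np.zeros(2), p0.copy()
for k in range(200):                 # truncate the Poisson series
    pt_unif += poisson.pmf(k, q * t) * term
    term = term @ P

# Method 2: numerical integration of dp/dt = p Q.
sol = solve_ivp(lambda _, p: p @ Q, (0.0, t), p0, rtol=1e-10, atol=1e-12)
pt_ode = sol.y[:, -1]

print(pt_unif, pt_ode)               # the two methods should agree closely
```

Disagreement beyond numerical tolerance would point to a bug in one of the implementations.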

Model Verification Methods

- If the model is solved using simulation:
  - Verify whether a special case can be solved analytically
  - Use an alternative solution technique: importance sampling, RESTART, regenerative simulation
  - Verify by an alternative simulation method/package
    - CSIM, Matlab, ns-2, OPNET, OMNeT, Simula, SPNP, DEMOS, ...

MODEL VALIDATION

- Three-step process outlined by Naylor and Finger
  - Face validation: discussion with the experts
  - Input-output validation: compare results obtained from the model with those from measurements on the real system
  - Validation of model assumptions: either prove that the assumptions are correct or perform statistical hypothesis testing
- Rejection of a hypothesis regarding a model assumption based on measurement data should lead to an improved model

Component Inclusion/Exclusion Errors

- Errors in selecting components (subsystems)
  - Missing or extra components (subsystems)
    - E.g., the power subsystem not being considered explicitly
  - Neglecting some failure modes
- Use face validation to avoid these errors if the error is in the conceptual model
- Use model verification if the error is in the computerized model
- Be prepared to change (improve) the model based on discussions with experts

Model Structural Errors

- Errors in model structure
  - Missing or extra arcs (or events)
  - Missing or extra states (or conditions)
    - E.g., ignoring detection delay/imperfect detection, restart delay/imperfect restart, etc.
  - Depending upon the measure of interest, some details may or may not be important (for the DPM measure, detection delay was important, but not for system availability in the SIP/WebSphere example)
- Use face validation to avoid these errors if the error is in the conceptual model
- Use model verification if the error is in the computerized model
- Be prepared to change (improve) the model based on discussions with experts

Errors Due to Non-Independence

- Sometimes possible to guarantee independence by design (fault isolation regions)
- Verify by hypothesis testing after collecting data from the (sub)system
- Be prepared to improve the model to allow for dependence
- See the following for methods of including dependence:
  - Muppala, Malhotra and Trivedi, "Markov Dependability Models of Complex Systems: Analysis Techniques," in: Reliability and Maintenance of Complex Systems, S. Ozekici (ed.), Springer-Verlag, 1996.
  - Fricks and Trivedi, "Modeling Failure Dependencies in Reliability Analysis Using Stochastic Petri Nets," Proc. European Simulation Multi-conference (ESM '97), Istanbul, June 1997.

Distributional Errors

- Sometimes possible to show distributional insensitivity
  - Time-to-failure and time-to-repair distributions don't matter in the two-state availability model (Chapter 6 of the Blue Book)
  - In a state-space model, if there is only one arc emanating from a state, then the distribution of that transition does not matter (examples in Chapter 8)
  - In a PFQN with PS, LCFS-PR, or IS stations, service time distributions don't matter (Chapter 9 of the Blue Book)
- Use goodness-of-fit tests and improve the model in case of a poor fit with data collected from the real (sub)system (see the sketch below)
- Be prepared to change the model to include more general distributions:
  - Wang, Fricks and Trivedi, "Dealing with Non-exponential Distributions in Dependability Models," in: Performance Evaluation - Stories and Perspectives, G. Kotsis (ed.), Oesterreichische Computer Gesellschaft, 2003.
  - Two recent papers with Salvatore Di Stefano; one recent paper with Frattini and Alonso
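
A hedged sketch of the goodness-of-fit step mentioned above: fit an exponential to collected times-to-failure and test the fit with a Kolmogorov-Smirnov test. The data here are synthetic (drawn from a Weibull so the exponential fit is poor); note that estimating the parameter from the same data biases the nominal p-value, so a Lilliefors-style correction or a bootstrap would be more rigorous in practice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in for times-to-failure collected from a real (sub)system.
ttf = rng.weibull(0.7, size=200) * 500.0

# Fit an exponential by maximum likelihood (scale = sample mean) and test it.
scale_hat = ttf.mean()
stat, p_value = stats.kstest(ttf, "expon", args=(0.0, scale_hat))
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")

# A small p-value suggests rejecting the exponential assumption and
# moving to a more general distribution (e.g., Weibull or phase-type).
```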

Parametric Errors

- Input data modeling: use statistical techniques (Ch. 10 of the Blue Book)
- Parametric sensitivity analysis
  - Rubens (Matos) et al., IEEE-TR 2012
- Bounds analysis
  - Blake, Reibman and Trivedi, "Sensitivity Analysis of Reliability and Performability for Multiprocessor Systems," ACM SIGMETRICS 1988.
  - Ramesh and Trivedi, "On the Sensitivity of Transient Solution of Markov Models," ACM SIGMETRICS 1993.
- Confidence interval propagation through Markov models (see the sketch below)
  - Yin, Smith and Trivedi, "Uncertainty Analysis in Reliability Modeling," RAMS 2001.
  - Mishra and Trivedi papers on uncertainty propagation in analytic availability models, ISSRE 2011 and many more
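
A hedged sketch, in the spirit of the uncertainty-propagation references (not their actual method): epistemic uncertainty in the input parameters is sampled and pushed through the availability formula by Monte Carlo, yielding an interval for the output. The input distributions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Epistemic uncertainty in the inputs, expressed as distributions
# (illustrative choices, not taken from the cited papers).
mttf = rng.lognormal(mean=np.log(1000.0), sigma=0.2, size=n)   # hours
mttr = rng.gamma(shape=4.0, scale=1.0, size=n)                  # hours

A = mttf / (mttf + mttr)          # push each parameter sample through the model

lo, hi = np.percentile(A, [2.5, 97.5])
print(f"median availability = {np.median(A):.6f}")
print(f"95% interval        = [{lo:.6f}, {hi:.6f}]")
```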

Errors due to Approximations

- Model-level decomposition
  - Flow-equivalent server method in queueing networks (Ch. 9)
  - Ciardo & Trivedi, "A Decomposition Approach for Stochastic Reward Net Models," Performance Evaluation, 1993
  - X. Yin et al. papers on VANET
  - Ghosh, Kim, Naik & Trivedi, "End-to-End Performability Analysis for Infrastructure-as-a-Service Cloud: An Interacting Stochastic Models Approach," FGCS 2012; DSN 2011
- Matrix-level decomposition
  - Courtois, Decomposability, Academic Press, 1977
  - Courtois & Semal, "Computable bounds for conditional steady-state probabilities in large Markov chains and queueing models," IEEE JSAC, 1986
  - Bobbio & Trivedi, "An aggregation technique for the transient analysis of stiff Markov chains," IEEE TC, 1986

Errors due to Approximations

- Fixed-point iteration (see the sketch below)
  - Mainkar & Trivedi, "Sufficient Conditions for the Existence of a Fixed Point in Stochastic Reward Net-Based Iterative Models," IEEE TSE, 1996
  - Haring et al., IEEE-TVT; Yin et al., Perf. Eval.; Ghosh, Kim, Naik & Trivedi, "End-to-End Performability Analysis for Infrastructure-as-a-Service Cloud: An Interacting Stochastic Models Approach," FGCS 2012; IEEE TCC 2014; IEEE TSC 2014
- State truncation
  - Muppala et al., VAXcluster model in the Avresky book
  - Muntz et al., "Bounding Availability of Repairable Computer Systems," IEEE TC, 1989; Mahevas and Rubino, IEEE TC, 2001
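
A hedged toy sketch of the fixed-point iterative scheme named above: two submodels exchange a parameter until the exchanged values stop changing. The coupling between "availability" and "effective load" here is invented purely for illustration; it is not the SRN interaction in the cited papers.

```python
# Toy fixed-point iteration between two interacting submodels.

def submodel_availability(load, mttf=1000.0, base_mttr=4.0):
    # Hypothetical coupling: higher load slows repair slightly.
    mttr = base_mttr * (1.0 + 0.5 * load)
    return mttf / (mttf + mttr)

def submodel_load(avail, offered=0.8):
    # Hypothetical coupling: only available capacity carries the offered load.
    return offered / avail

A, L = 1.0, 0.0                            # initial guess
for i in range(100):
    A_new = submodel_availability(L)
    L_new = submodel_load(A_new)
    if abs(A_new - A) < 1e-12 and abs(L_new - L) < 1e-12:
        break                              # converged fixed point
    A, L = A_new, L_new

print(i, A, L)
```

The approximation error of such a scheme is exactly what the slide warns about: convergence of the iteration does not by itself guarantee accuracy of the decomposed model.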

Numerical Solution Errors

- Discretization errors
  - A differential equation is turned into a difference equation; it is possible to estimate or bound the error
- Series truncation errors
  - It may be possible to bound the truncation error
- Round-off errors (see the sketch below)
  - Use multi-precision or variable-precision arithmetic
  - Minimize or avoid subtractions
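
A small numeric illustration of the "avoid subtractions" advice: computing a tiny failure probability 1 - e^(-lambda*t) by direct subtraction loses digits to cancellation, while expm1 keeps full precision. Values are illustrative.

```python
import numpy as np

lam, t = 1e-9, 1.0                 # very reliable component, short mission time

naive  = 1.0 - np.exp(-lam * t)    # subtraction of nearly equal numbers
stable = -np.expm1(-lam * t)       # algebraically identical, no cancellation

print(naive, stable)
# The naive result carries a relative error of roughly 1e-7 here;
# the expm1 form is accurate to machine precision.
```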

Programming Errors

- Proof of correctness
- Testing and debugging
- Code reading
- Using a good development system
- Using higher-level languages

Errors in a Simulative Solution

- Sampling error and bias (the sample size determines the confidence-interval width of the output); see the sketch below
- Correlation (particularly in the outputs; methods are known to estimate the effects of correlation on the variance of the output)
- Random numbers (the stream may not be sufficiently random!)
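
A hedged sketch of how sample size drives confidence-interval width for a simulation output: independent replications of a crude two-state availability simulation, with a normal-approximation interval on the replication means. Parameters and the simulator itself are illustrative, and independent replications are used precisely to sidestep within-run output correlation.

```python
import numpy as np

rng = np.random.default_rng(7)

def interval_availability(horizon=10_000.0, mttf=1000.0, mttr=4.0):
    """One replication: fraction of time up over [0, horizon]."""
    t, up_time = 0.0, 0.0
    while t < horizon:
        up = rng.exponential(mttf)
        up_time += min(up, horizon - t)
        t += up
        if t >= horizon:
            break
        t += rng.exponential(mttr)          # down period
    return up_time / horizon

for n in (10, 100, 1000):
    reps = np.array([interval_availability() for _ in range(n)])
    half = 1.96 * reps.std(ddof=1) / np.sqrt(n)
    print(f"n={n:5d}: A ~ {reps.mean():.5f} +/- {half:.5f}")
```

The half-width shrinks roughly as 1/sqrt(n), which is why rare-event measures need variance-reduction techniques such as importance sampling or RESTART.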

Model Calibration

- Contrast with model parameterization
- An iterative process of
  - Comparing model outputs to the real system
  - Adjusting or improving the model
  - Comparing the revised model to reality

Iterative Model Refinement with I/O Validation


Difficulties in I/O Validation

- For performance models, where time constants are relatively small, the time needed for validation is reasonable. Even so, there is a paper by an IBM TRL researcher on how to reduce this time.
- For reliability/availability models, the time needed to carry out experimental validation can be excessive
  - ALT (accelerated life testing) and ADT (accelerated degradation testing) should be employed to reduce experimentation time (see the Rivalino papers)

Making Models Credible

- Believability/understandability/usability
  - GUI, many practical examples, short courses, usable software packages, Boeing IRAP project
- Incorporation in the design process
  - VHDL -> availability model
  - C program -> performance model
  - Ada program -> SPN performance model (SPC)
  - SysML/UML/XML -> performance/availability model (Candy: Duke/NEC)
- Connection between measurements and models

BELIEVABILITY / UNDERSTANDABILITY

- Integration of measurements and models
  - Measurements provide parameters to models (input modeling)
  - Models provide guidelines for measurements
  - Models validated against measurements (input-output validation)
- Integration of different modeling tools
  - Boeing IRAP project

CASE STUDY: BOEING

- A tool being used in Boeing
- Developed a high-level modeling language (SDM)
- Designed and implemented an intelligent interpreter
- Complete reference:
  - A. V. Ramesh, D. W. Twigg, U. R. Sandadi, T. C. Sharma, K. S. Trivedi, A. K. Somani, "An Integrated Reliability Modeling Environment," Reliability Engineering and System Safety, 1999.

CASE STUDY: BOEING (Continued)

- The interpreter determines which solution method is applicable
- The translator translates the SDM input file into an input file for any of the engines
- Five different modeling engines, with totally different paradigms, different coding languages, and developed by different groups, are integrated: CAFTA, SETS, EHARP, SHARPE and SPNP

BELIEVABILITY / UNDERSTANDABILITY (Continued)

- Many case studies of validation needed
  - VAXcluster availability model: Wein & Sathaye, IEEE-TR, Oct. 1990
  - Hsueh, Iyer and Trivedi, IEEE-TC, Apr. 1988
  - Lucent validation of ESS: Veena Mendiratta, GlobeCom 2007 (D & D Forum)
- Technology transfer
  - Short courses
  - Development and dissemination of tools (SHARPE, SPNP, CSIM)

BELIEVABILITY / UNDERSTANDABILITY (Continued)

- Application of the techniques and tools
  - Motorola
  - Cisco
  - 3Com
  - GE
  - Honda
  - HP
  - Sun Microsystems
  - EMC
  - NEC
  - Boeing
  - IBM

MODELING AND MEASUREMENTS: INTERFACES

- Measurements supply input parameters to models (model parameterization)
  - Confidence intervals should be obtained
  - Boeing, Draper, Union Switch projects; NASA/JPL; EMC/Wipro
- Model structure based on measurement data: Hsueh, Iyer and Trivedi, IEEE TC, April 1988; Gokhale et al., Perf. Eval. 2004; Vaidyanathan & Trivedi, IEEE-TDSC 2005
- Model sensitivity analysis can suggest which parameters to measure more accurately: Blake, Reibman and Trivedi, SIGMETRICS 1988; Fricks and Trivedi, 1997; Sato and Trivedi, ICSOC 2007; Rubens, IEEE-TR 2012

Dependability Model Parameterization: What is λ?

- Fault model for each component (subsystem)
  - Design, manufacturing: Mandelbugs, Bohrbugs, aging-related bugs (software)
  - Operational: permanent, intermittent, transient
  - Human, upgrade
  - Malicious, natural disasters
- Fault arrival processes (PP, Weibull, NHPP)
  - Failure rates (sources: MIL-STD, other)
  - Experiments (for software); tables (for hardware)
  - ALT and ADT to reduce experimentation time

Dependability Model Parameterization: What is δ?

- Detection delay, restart delay, reboot delay
- Fault/error injection experiments (IBM SIP/WebSphere; Trivedi et al., PRDC 2008 paper)

Dependability Model Parameterization: What is c?

- Field data (Veena Mendiratta papers)
- Fault/error injection experiments (IBM SIP/WebSphere; PRDC 2008)
- Analytic coverage model (Chapter 5 of the Blue Book; Dugan & Trivedi, IEEE TC, 1989); Amari chapter in the Misra book
- Trivedi et al. paper, PRDC 2008

Dependability Model Parameterization: What is μ?

- Maintenance model
  - Corrective: dispatch time, travel time, repair time, dead on arrival, imperfect repair, deferred repair; escalated levels of recovery; hardware/software
  - Preventive: time-based, condition-based
  - Software rejuvenation: time-based, measurement-based (threshold vs. predictive)

Performability Model Parameterization: What is r?

- Binary: up and down
- Capacity-oriented: number of operational resources in each state
- Performance-oriented: evaluate performance in each degraded level of system configuration (see the sketch below)
  1. Measurements
  2. Simulative solution of a model
  3. Analytic solution of a model: PFQN, CTMC, SMP, MRGP, hierarchical, fixed-point iterative (SHARPE, SPNP)
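
A hedged sketch of combining reward rates r with state probabilities for a performability measure. The 3-state CTMC (2 processors up, 1 up, 0 up) and its rates are illustrative; the capacity-oriented reward is the number of operational processors.

```python
import numpy as np

lam, mu = 1e-3, 0.25                    # illustrative per-processor rates
Q = np.array([[-2*lam,  2*lam,     0.0],
              [    mu, -(mu+lam),  lam],
              [   0.0,     mu,     -mu]])   # single repair facility

# Steady-state probabilities: solve pi Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T[:-1], np.ones(3)])
pi = np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))

# Reward rates r_i: capacity-oriented (operational processors in each state).
r = np.array([2.0, 1.0, 0.0])

print("expected reward rate (capacity) =", r @ pi)
print("binary availability             =", pi[0] + pi[1])
```

Swapping r for per-state throughputs obtained from a performance submodel gives the performance-oriented variant listed above.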

CASE STUDIES


CASE STUDY: AT&T

- GSHARPE: a preprocessor to SHARPE, developed at Bell Labs by a Duke student
  - The user can specify Weibull failure times and lognormal and other repair time distributions
  - GSHARPE fits these to phase-type distributions and produces a Markov model that is automatically generated for solution by the SHARPE engine
- Several graduates hired by AT&T
- Many users of SHARPE at AT&T
- M. Malhotra and A. Reibman, "Selecting and implementing phase approximations for semi-Markov models," Stochastic Models, 1993.

CASE STUDY: AVAYA

- Modeling Swift: a combined hardware-software availability model of a real system being developed at Avaya Labs
- Complete reference:
  - S. Garg, Y. Huang, C. M. Kintala, K. S. Trivedi, and S. Yajnik, "Performance and reliability evaluation of passive replication schemes in application level fault tolerance," Proc. 29th Annual Int. Symp. on Fault-Tolerant Computing (FTCS), Madison, Wisconsin, June 15-18, 1999. [Example 8.26 in the Blue Book]
- Comprehensive model of 802.11
  - D.-Y. Chen, S. Garg, C. Kintala and K. S. Trivedi, "Dependability Enhancement for IEEE 802.11 Wireless LANs with Redundancy Technique," Int. Conf. on Dependable Systems and Networks, Performance and Dependability Symposium (DSN/IPDS 2003)

Modeling Hardware/Software Faults: Avaya Swift System

- Availability model with passive redundancy (warm replication) of the application; operational phase; Mandelbugs or hardware transients
- Assumptions:
  - A web server (software) that fails at rate λp, running on a machine that fails at rate λm
  - Mean times to detect a server process failure and a machine failure
  - Mean restart times of the machine and of the server
- S. Garg, Y. Huang, C. Kintala, K. S. Trivedi and S. Yajnik, "Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault-Tolerance," Proc. 29th Intl. Symp. on Fault-Tolerant Computing (FTCS-29), June 1999.

CASE STUDY: AVAYA (contd.)

- Network survivability
  - D.-Y. Chen, S. Garg, K. S. Trivedi, "Network Survivability Performance Evaluation: A Quantitative Approach with Applications in Wireless Ad-Hoc Networks," Fifth ACM Int. Workshop on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM 2002)
- Several graduate student summer interns

CASE STUDY: BELLCORE/TELCORDIA

- Architecture-based software reliability
  - Proposed an approach
  - Applied the approach to SHARPE
  - Used Bellcore's test coverage tool, ATAC, to parameterize the model
  - Bellcore was then enhancing ATAC to incorporate our approach
- Several summer interns; several graduates hired
- Complete reference:
  - S. Gokhale, W. E. Wong, J. R. Horgan and K. S. Trivedi, "An Analytical Approach to Architecture-Based Software Reliability Prediction," Performance Evaluation, 2005.

CASE STUDY: BOEING

- Boeing Integrated Reliability Modeling Environment (IRAP)
- Developed a high-level modeling language (SDM)
- Designed and implemented an intelligent interpreter
- Complete reference:
  - A. V. Ramesh, D. W. Twigg, U. R. Sandadi, T. C. Sharma, K. S. Trivedi, A. K. Somani, "An Integrated Reliability Modeling Environment," Reliability Engineering and System Safety, 1999.

CASE STUDY: BOEING (Continued)

- The interpreter determines which solution method is applicable
- The translator translates the SDM input file into an input file for any of the engines
- Five different modeling engines, with totally different paradigms, different coding languages, and developed by different groups, are integrated: CAFTA, SETS, EHARP, SHARPE and SPNP

CASE STUDY: Boeing Avionics

- Reliability analysis of each major subsystem of a commercial airplane must be carried out and presented to the Federal Aviation Administration (FAA) for certification
- Real-world example from Boeing Commercial Airplane Company

Reliability Analysis of Boeing 787

- Current return network modeled as a reliability graph
  - Consists of a set of nodes and edges
  - Edges represent components that can fail
  - Source and target nodes
  - The system fails when there is no path from source to target
  - Compute the probability of a path from source to target
- [Figure: reliability graph of the current return network, with edges labeled A1-A14, B1-B16, C1-C6, D1-D20, E1-E14, F1-F10 between source and target]

Reliability Analysis of Boeing 787 (cont'd)

- Known solution methods for relgraphs
  - Find all minpaths, followed by SDP (sum of disjoint products)
  - BDD (binary decision diagram)-based method
- The above two methods are implemented in our SHARPE software package
- Boeing tried to use SHARPE for this problem, but it was too large to solve

Reliability Analysis of Boeing 787 (contd.)

- Too many minpaths
- [Figure: the same reliability graph, annotated with the number of paths from source to target]
- Idea: compute upper and lower bounds instead of the exact reliability

Reliability Analysis of Boeing 787 (contd.)

- Our approach: developed a new, efficient algorithm for (un)reliability bounds computation and incorporated it in SHARPE (see the toy sketch below)
- A patent awarded to Boeing jointly with Duke
- Satisfying the FAA that SHARPE development followed the DO-178B software standard was the hardest part
- SHARPE: Symbolic Hierarchical Automated Reliability and Performance Evaluator, developed by my group at Duke
  - http://sharpe.pratt.duke.edu/
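
A hedged toy illustration of the bounding idea on a graph small enough to enumerate (a five-edge bridge network, not the 787 network): exact s-t reliability by enumeration, bracketed by a lower bound from a subset of minpaths and an upper bound from a subset of mincuts. The actual patented SHARPE algorithm is different; this only shows why bounds are useful when enumeration is infeasible.

```python
from itertools import product

# Bridge network (illustrative): edge name -> endpoints.
edges = {"e1": ("s", "a"), "e2": ("s", "b"), "e3": ("a", "b"),
         "e4": ("a", "t"), "e5": ("b", "t")}
p = 0.9                                   # each edge works independently

def connected(up_edges):
    """Is t reachable from s using only the working edges?"""
    reach, frontier = {"s"}, ["s"]
    while frontier:
        node = frontier.pop()
        for name in up_edges:
            u, v = edges[name]
            for x, y in ((u, v), (v, u)):
                if x == node and y not in reach:
                    reach.add(y)
                    frontier.append(y)
    return "t" in reach

# Exact reliability by enumerating all 2^5 edge states (only feasible for tiny graphs).
exact = 0.0
for states in product([0, 1], repeat=len(edges)):
    up = [n for n, s in zip(edges, states) if s]
    prob = 1.0
    for s in states:
        prob *= p if s else (1.0 - p)
    if connected(up):
        exact += prob

# Bounds from a subset of minpaths / mincuts.
q = 1.0 - p
lower = 2 * p**2 - p**4                   # minpaths {e1,e4} and {e2,e5}
upper = 1.0 - (2 * q**2 - q**4)           # mincuts  {e1,e2} and {e4,e5}
print(f"lower {lower:.5f} <= exact {exact:.5f} <= upper {upper:.5f}")
```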

CASE STUDY: CISCO

- Conducted an availability comparison of a Cisco product with that of a competitor, using analytic-numeric solution of models
- Hierarchical model, with the top level a reliability block diagram and the bottom level Markov chains
- Models solved using SHARPE
- Covered hardware, software, power supplies, fans, etc.
- A detailed report supplied to Cisco

Two Router Top-Level RBDs

- [Figure: RBD of Cisco 12000 GSR; RBD of Juniper M20]
- K. Trivedi, "Availability Analysis of Cisco GSR 12000 and Juniper M20/M40," Cisco internal report, 2000.

One of the Markov sub-models


Downtime (Cisco)

- [Figure: bar chart of component downtime (min/yr), hardware vs. software, for the line cards (in/out), switch fabric cards, GRP, chassis, and IOS]

CASE STUDY: DEC VAXCLUSTER

- Trivedi sabbatical at DEC, 1988-89
- Many sites for SHARPE and SPNP
- Developed three models of the processor subsystem:
  - Two-level decomposition
    - Inner level: 9-state Markov model; outer level: n parallel diodes
    - Ibe, Howe and Trivedi, "Approximate Availability Analysis of VAXcluster Systems," IEEE Transactions on Reliability, April 1989
  - A detailed SPN model
    - O. Ibe, A. Sathaye, R. Howe, and K. S. Trivedi, "Stochastic Petri net modeling of VAXcluster availability," Proc. Third Int. Workshop on Petri Nets and Performance Models (PNPM89), Kyoto, 1989, pp. 112-121.
  - A detailed SPN model for a heterogeneous cluster
    - Muppala, Sathaye, Howe and Trivedi, "Dependability Modeling of a Heterogeneous VAXcluster System Using Stochastic Reward Nets," in Avresky (ed.), Hardware and Software Fault Tolerance, Ellis Horwood, 1992

CASE STUDY: DEC VAXCLUSTER

- Storage subsystem model: a fixed-point iteration over a set of Markov submodels
  - Tomek and Trivedi, "Fixed-Point Iteration in Availability Modeling," Informatik-Fachberichte, Vol. 283, Springer-Verlag, 1991
- Observed that availability is maximized with 2 processors
  - Trivedi, O. Ibe, A. Sathaye and R. Howe, "Should I Add a Processor?," 23rd Annual Hawaii Conference on System Sciences, 1990
- Many interesting reliability, availability, and performability measures computed
  - Heimann, Mittal and Trivedi, "Availability and Reliability Modeling for Computer Systems," in: Advances in Computers, Vol. 31, 1990

CASE STUDY: DRAPER LAB

1. The overall aim was verification of a system with very high reliability/availability specifications
2. The prototype under consideration was the FTPP cluster
3. Hybrid approach proposed
   - Fault-injection-based measurements
   - Statistical analysis of the measured data to enable parameterization of analytical models


CASE STUDY: DRAPER LAB

- Reliability modeling of the prototype done; parameterization done with the aid of existing reliability databases
- Analytical solution provided exact closed-form expressions
- Markov model solved using SHARPE
- Petri net model solved using SPNP
- Reliability bottlenecks found

CASE STUDY: DRAPER LAB

- Software reliability growth models developed for four different large software systems developed by Draper Lab
- Found the log-logistic-based NHPP model the most suitable
- Used the SREPT tool
- Complete references:
  - S. Gokhale and K. S. Trivedi, "A Time/Structure Based Software Reliability Model," Annals of Software Engineering, Vol. 8, pp. 85-121, 1999
  - S. Ramani, S. Gokhale, and K. S. Trivedi, "SREPT: Software Reliability Estimation and Prediction Tool," Performance Evaluation, Vol. 39, pp. 37-60, 2000.

CASE STUDY: GE

- Short courses offered
- Summer interns
- Many users of SHARPE

Fault Tree Model of GE Steam Turbine Control System


Fault Tree Model of GE Equipment Ventilation System


Case Study: HP

- Short courses offered; many users of SHARPE and SPNP
- Cluster availability modeling
- Server availability
- Mass storage array availability modeling
- Started with Markov chains via SHARPE
- Progressed toward stochastic Petri nets and stochastic reward nets via SPNP

CASE STUDY: IBM

- Trivedi sabbatical in 1981 at the IBM T. J. Watson Research Center; worked with Phil Heidelberger and Phillip Yu and wrote the following papers:
  - Heidelberger and Trivedi, "Queueing Network Models for Parallel Processing with Asynchronous Tasks," IEEE-TC, 1982
  - Heidelberger and Trivedi, "Analytic Queueing Models for Programs with Internal Concurrency," IEEE-TC, 1983
  - Yu, Smith and Trivedi, "Reliability and Performance Analysis of a Ringnet," in: Local Communication Systems: LAN and PBX, 1987

CASE STUDY: IBM (contd.)

- System Availability Estimator (SAVE)
  - Duke-IBM Yorktown joint project; the initial version of the software package was delivered by Duke to IBM
  - Worked with Steve Lavenberg and Ambuj Goyal and wrote the following papers:
    - "Probabilistic Modeling of Computer System Availability," Annals of Operations Research, 1987
    - Goyal, Nicola, Tantawi and Trivedi, "Reliability Analysis of Systems with Limited Repairs," IEEE-TR, 1987
    - Goyal, Carter, de Souza e Silva, Lavenberg and Trivedi, "The System AVailability Estimator (SAVE)," FTCS, 1986
    - Heidelberger, Trivedi and Muppala, "Accelerating Mean Time to Failure Computations," Performance Evaluation, 1996
- The following Ph.D. graduates supervised by me are currently at IBM:
  - Rahul Ghosh, Steve Hunter, Bob Leech, Srini Ramani, Joe Rusnak, W. Earl Smith, Lorrie Tomek, Steve Woolet

CASE STUDY: IBM (contd.)

- Several projects in performance modeling with IBM RTP, working with Andy Rindos and Steve Woolet; wrote the following papers:
  - Haverkort, Rindos, Mainkar and Trivedi, "Techniques and Tools for Reliability and Performance Evaluation: Problems and Perspectives," Lecture Notes in Computer Science, 1994
  - Rindos, Woolet, Viniotis and Trivedi, "Exact Methods for the Transient Analysis of Nonhomogeneous Continuous-Time Markov Chains," in a book edited by W. J. Stewart
  - Wang, Rindos, Woolet, Groner and Trivedi, "Analysis of a Realistic Bulk Service System," HiPC, 1995

CASE STUDY: IBM (contd.)

- Software rejuvenation technology transferred to the IBM xServer family; the work is discussed in the following papers written jointly with IBM researchers:
  - K. Vaidyanathan, R. E. Harper, S. Hunter and K. Trivedi, "Analysis and Implementation of Software Rejuvenation in Cluster Systems," ACM SIGMETRICS 2001/Performance 2001, June 2001.
  - V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. P. Zeggert, "Proactive Management of Software Aging," IBM Journal of Research & Development, Vol. 45, No. 2, March 2001

CASE STUDY: IBM (recent)

- BladeCenter availability model: Earl Smith and K. Trivedi, IBM Systems Journal, Oct.-Dec. 2008
- Availability monitor for an appliance: Marc Haberkorn and Trivedi, HASE 2007
- Performance and reliability analysis of business processes: Naoto Sato and K. Trivedi, ICSOC 2007; SCC 2007
- Availability modeling of the SIP protocol on IBM WebSphere: Wang, Trivedi, Hunt, Rindos, PRDC 2008
- Computing the number of calls dropped due to failures: Trivedi, Wang and Hunt, ISSRE 2010
- Ghosh, Kim, Naik and Trivedi, "End-to-End Performability Analysis for Infrastructure-as-a-Service Cloud: An Interacting Stochastic Models Approach," PRDC 2010; FGCS 2012
- Longo, Ghosh, Naik and Trivedi, "A Scalable Availability Model of Infrastructure-as-a-Service Cloud," IEEE TCC 2014

Fault tree for IBM BladeCenter top-level model

- Smith, Trivedi et al., IBM Systems Journal, 2008
- [Figure: top-level fault tree with midplane, cooling, power domain 1 and 2, and k/n blade sub-fault-trees; each blade subtree covers base, CPU, memory, disk, software, FC ports/switches, and Ethernet ports/switches; repeated use of the same sub-model instance]

Availability model of SIP on IBM WebSphere

- A real problem from IBM
- SIP: Session Initiation Protocol
- Hardware platform: IBM BladeCenter
- Software platform: IBM WebSphere
- Subsystems modeled using Markov chains to capture dependence within each subsystem
- A fault tree is used at the higher level, since independence across subsystems can be assumed
- This is an example of hierarchical composition (see the sketch below)
  - A single monolithic model is not constructed, stored, or solved
  - Each submodel is built and solved separately, and the results are propagated up to the higher-level model
  - SHARPE facilitates such hierarchical model composition
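
A hedged sketch of hierarchical composition in the sense described above: each subsystem's availability comes from its own submodel (here a trivial two-state CTMC), and the results are combined at the top level assuming independence across subsystems. The structure and parameters are invented for illustration; they are not the actual SIP/WebSphere fault tree.

```python
from math import comb

def ctmc_availability(lam, mu):
    """Submodel: steady-state availability of a two-state CTMC."""
    return mu / (lam + mu)

# Lower-level submodels solved separately (illustrative parameters).
A_blade  = ctmc_availability(lam=1/5000.0, mu=1/2.0)
A_proxy  = ctmc_availability(lam=1/2000.0, mu=1/0.5)
A_appsrv = ctmc_availability(lam=1/1000.0, mu=1/0.25)

def parallel(avails):
    """1-out-of-n parallel combination of independent blocks."""
    unavail = 1.0
    for a in avails:
        unavail *= (1.0 - a)
    return 1.0 - unavail

def k_of_n(a, k, n):
    """k-out-of-n combination of identical independent blocks."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

# Top level: blade in series with a replicated proxy pair and 4-of-6 app servers.
A_system = A_blade * parallel([A_proxy, A_proxy]) * k_of_n(A_appsrv, k=4, n=6)
print(f"system availability ~ {A_system:.6f}")
```

Only the submodel results cross the level boundary, which is why no monolithic state space ever has to be built or stored.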

Example: Architecture of SIP on IBM WebSphere

- [Figure: two blade chassis, each with four blades hosting WebSphere Application Servers (AS 1-6), a SIP proxy, and the deployment manager (DM); an IBM Load Balancer (IP sprayer) and IBM PC test drivers feed SIP traffic]
- AS1 through AS6 are application servers; the Proxy 1 instances are stateless proxy servers
- AS: WebSphere Application Server (WAS)
- Replication domains and their nodes:

  Replication domain | Nodes
  1                  | A, D
  2                  | A, E
  3                  | B, F
  4                  | B, D
  5                  | C, E
  6                  | C, F

Software Fault Tolerance

- Identical copies of the SIP proxy used as backups (hot spares)
- Identical copies of the WebSphere Application Server (WAS) used as backups (hot spares)
- Traditional: design diversity
- New thinking: use identical software copies as spares or backups
  - Does it help? If yes, why?
- The recovery method for software failures is to restart a process or reboot a node
  - Does it help in dealing with failures caused by software bugs? If yes, why?

Adopted SW Fault Classification

- Mandelbug: a fault whose activation and/or error propagation are complex (e.g., a long time lag between the fault activation and the occurrence of a failure; interactions among hardware, operating system, other applications, timing and sequencing effects, etc.)
- Bohrbug: an easily isolated fault that always manifests consistently under a well-defined set of conditions, because its activation and error propagation lack "complexity" as defined above. A Bohrbug is the complementary antonym of a Mandelbug.
- Aging-related bug: a fault that causes an increased failure rate and/or degraded performance. The fault causes the accumulation of errors either inside the running application or in its system-internal environment.

Software Faults: Mitigation

- [Figure: mitigation of faults in software (OS, middleware, applications). At design/development time, test/debug targets Bohrbugs and design/data diversity targets Mandelbugs; in the operational phase, retrying the operation, restarting the application, rebooting the node, or failing over to a standby mitigates Mandelbugs, and rejuvenation mitigates aging-related bugs]

Failures Incorporated in Models

- [Figure: failure taxonomy. Physical faults: power, cooling, midplane, network, and blade faults (memory, NIC, CPU, base, I/O (RAID)). Software failures: OS, application, WAS, proxy; failure modes include process hang and process die]

Hierarchical composition

- [Figure: top-level fault tree (system failure driven by the proxies and a k-of-12 set of application servers AS 1-12), with lower-level Markov submodels for blade server (BS) and common-mode (CM) failures capturing detection, failover, restart, reboot, and repair with coverage parameters, plus a proxy submodel; blade submodels cover base, CPU, memory, RAID, Ethernet (NICs and switches), OS, midplane, cooling, and power]

Our Contributions (1)

- Developed a very comprehensive availability model
  - "Discovered" the software failure/recovery architecture
  - Hardware and software failures
  - Hardware and software failure-detection delays
  - Software detection/failover/restart/reboot delays
  - Escalated levels of recovery
    - Automated and manual restart, failover, reboot, repair
  - Imperfect coverage (detection, failover, restart, reboot)

Our Contributions (2)

- Developed a new method for calculating DPM (defects per million calls)
  - Taking into account interactions between call flow and failure/recovery, and retry of messages
- Many of the parameters collected from experiments
- Detailed sensitivity analysis to find bottlenecks and give feedback to designers
- This model made the sale of this system to the telco customer

Parameterization

- Hardware/software configuration parameters
- Hardware component MTTFs
- Software component MTTFs (experiments have started for this)
  - OS, WAS, SIP proxy
- Hardware/software detection/failover/restart/reboot times
- Repair time
  - Hot swap, multiple components at once, field-service travel time
- Coverage (success) probabilities
  - Detection, restart, failover, reboot, repair
- Validation

Case Study: JPL

- Studied failure reports from different JPL/NASA missions
- Input data analysis
  - Studied flight software fault types
    - Proportion of BOHs (Bohrbugs), NAMs (non-aging-related Mandelbugs) and ARBs (aging-related bugs)
  - Studied the nature of the times between software failures
    - Reliability growth and distributional analysis conducted
  - Studied flight software recovery/mitigation approaches
    - Proportion of each recovery mechanism across the type of software fault which caused the failure

CASE STUDY: LUCENT

- Short courses, graduates hired, summer internships, many users of SHARPE
- A validated model of hardware-software availability
  - Worked with V. Mendiratta of Naperville
  - The model is semi-Markov, solved using SHARPE
  - Parameters collected from field data
  - Model results validated against actual measurements
- ISSRE 2004 paper on survivability of POTS architectures

CASE STUDY: LUCENT/AVAYA

- Software rejuvenation
  - A technique to counter software "aging" and increase its availability to clients
  - Evaluated the optimal rejuvenation interval, which maximizes steady-state availability (minimizes expected cost)
- Subsequently collected data from real systems to show aging and to determine proactive fault-management strategies
- Complete reference:
  - S. Garg, A. Van Moorsel, K. Vaidyanathan and K. S. Trivedi, "A Methodology for Detection and Estimation of Software Aging," ISSRE 1998

CASE STUDY: MOTOROLA

- Short courses, summer internships, research contracts, graduates hired, several users of SHARPE and SPNP
- Availability and performability modeling
  - Modeled several configurations of the Communication Enterprise Common Platform
  - Practical approaches for approximating steady-state measures in large, repairable, and highly dependable systems: model decomposition, state-space truncation, etc.
  - Both SHARPE and SPNP used
- Complete reference:
  - M. Lanus, Liang Yin and K. S. Trivedi, "Hierarchical composition and aggregation of state-based availability and performability models," IEEE Transactions on Reliability, 2003, pp. 44-52

FAULT TREE MODEL, Motorola Bedrock System

- Yin, Lanus, Trivedi, IEEE TR, 2003

CASE STUDY: MOTOROLA (contd.)

- Recovery strategies in wireless handoff
  - Proposed and modeled several strategies; SPNP was used
  - A hierarchy of two-level models was used; fixed-point iteration was used
  - Y. Ma, J. J. Han, and K. S. Trivedi, "Call Admission Control for Reducing Dropped Calls in CDMA Cellular Systems," Computer Communications, May 2002
  - Ma, Han, and Trivedi, "Composite Performance and Availability Analysis of Wireless Communication Networks," IEEE Trans. on Vehicular Technology, Sept. 2001
  - Ma, Han and Trivedi, "A Method for Multiple Channel Recovery in TDMA Wireless Communications Systems," Computer Communications, July 2001
  - Ma, Han, and Trivedi, "Channel allocation with recovery strategy in wireless networks," European Trans. on Telecommunications (ETT), 2000.

CASE STUDY: MOTOROLA (contd.)

- X. Ma, Y. Liu, K. S. Trivedi, Y. Ma and J. Han, "A soft handoff scheme for improving utilization efficiency of traffic channels," IEEE Int. Conf. on CSCC, Greece, July 2001.
- X. Ma, Y. Liu, K. S. Trivedi, Y. Ma and J. J. Han, "A New Handoff Scheme for Decreasing Both Dropped Calls and Blocked Calls in CDMA System," Proc. Int. Conference on Trends in Communications (EUROCON 2001), Bratislava, Slovak Republic, July 4-7, 2001.
- Availability bounds with non-exponential distributions
  - Y. Cao, H. Sun, K. S. Trivedi and J. J. Han, "System availability with non-exponentially distributed outages," IEEE TR, 2002

CASE STUDY: MOTOROLA

- Software rejuvenation analyzed in Motorola cable modem termination systems (CMTS) as a high-availability option
- Comprehensive availability modeling
  - Overall implementation architecture proposed for adopting software rejuvenation in current CMTS
  - Modeled hardware failures, Heisenbugs, aging-related bugs, and failure-detection coverage
  - Several software rejuvenation strategies considered
  - Computed the optimal rejuvenation interval, which maximizes system availability or minimizes downtime and maintenance cost
  - SPNP used
- Complete reference:
  - Y. Liu, Y. Ma, J. J. Han, H. Levendel, and K. S. Trivedi, "Modeling and Analysis of Software Rejuvenation in Cable Modem Termination Systems," Proc. 13th Int'l Symposium on Software Reliability Engineering (ISSRE 2002)

CASE STUDY: NEC

- Availability modeling of a virtualized system
  - Modeled a virtualized system using a two-level hierarchical model
  - A fault tree is used at the upper level, and CTMCs/SRNs represent the submodels at the lower level
  - Incorporated hardware failures (CPU, memory, power, etc.) and software failures (virtual machine monitor, virtual machine, and application failures)
  - Took into account the high-availability service and VM live migration
  - Computed steady-state availability, downtime per year, and capacity-oriented availability
  - The SHARPE software package was used
- Complete references:
  - D. S. Kim, F. Machida, K. S. Trivedi, "Availability Modeling and Analysis of a Virtualized System," PRDC 2009.
  - D. S. Kim, F. Machida, K. S. Trivedi, "Availability Analysis of a Virtualized Data Center," ICVCI 2009.

CASE STUDY: NEC (contd.)

- [Figure: two hosts, each running a VMM, a VM and an application, sharing a SAN; the top-level fault tree combines hardware failures (CPU, memory, power, NIC, cooler), VMM, VM and SAN failures via AND/OR gates]

CASE STUDY: NEC (current)

- Component-based availability modeling
  - A framework to compose SRNs from a set of SysML diagrams representing system configurations and behaviors
  - SRNs are composed by assembling a set of model components generated from the SysML diagrams
  - Reference: F. Machida, D. S. Kim, K. S. Trivedi, "Component-Based Availability Modeling for Cloud Service Management," ISSRE 2010 industrial session
- Software rejuvenation for a server-virtualized system
  - SRNs are applied to model software rejuvenation in both the virtual machine (VM) and the virtual machine monitor (VMM)
  - Three rejuvenation techniques for the VMM are compared
  - The effectiveness of VM migration with VMM rejuvenation is studied
  - Reference: F. Machida, D. S. Kim, K. S. Trivedi, "Modeling and Analysis of Software Rejuvenation in a Server Virtualized System," WoSAR 2010.

CASE STUDY: NEC (contd.)

- [Figure: model-generation workflow. SysML diagrams from the system administrator (ibd/bdd for the static system configuration, stm for component failure and recovery behavior, ad for system maintenance operations) are translated via translation rules into model components kept in a model repository; model assembly and guard assignment produce a system SRN, which SHARPE/SPNP evaluates, given availability requirements and parameter values, for measures such as steady-state availability, downtime, and availability bottlenecks]

CASE STUDY: NEC (contd.)

- [Figure: SRN submodels for the server-virtualized system: (a) VMM model, (b) VMM clock model, (c) VM model, (d) VM clock model, with places and transitions for failure, detection, rejuvenation triggering, rejuvenation, restart/reset and repair, coordinated through guard functions]

CASE STUDY: NOKIA

- Short course offered
- Helped solve an interesting modeling problem
- A short paper published based on this work
- Complete reference:
  - S. Bose, V. Kumar and K. Trivedi, "The Effect of Deferring the Repair on Availability," Fast Abstract, DSN 2003

CASE STUDY: SOHAR

- Dependability evaluation GUI called SDDS; the tool has been developed
- High-level modeling language related to SDDS
- Engine used: SHARPE
- Funded by Rome Lab under SBIR
- Complete reference:
  - H. Hecht, A. T. Tai, A. J. Chruscicki and K. S. Trivedi, "A User-Friendly Dependability Evaluation Tool," Proc. IEEE NAECON, Dayton, Ohio, May 1996.

CASE STUDY: Sun Microsystems

- Short courses offered
- Helped model a fault-tolerant system
  - Hierarchical model using RBDs and Markov chains
  - Hardware, software, different types of faults, power supplies, fans, network cards, etc.
  - A paper: "Modeling High Availability Systems," PRDC 2006
- Many users of SHARPE
- Summer interns
- A graduate hired (Kalyan Vaidyanathan)
- Worked together on software aging and rejuvenation; a joint patent with Kenny Gross of Sun Microsystems

Modeling High Availability Systems

- K. Trivedi, R. Vasireddy, D. Trindade, S. Nathan and R. Castro, "Modeling High Availability Systems," Proc. IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), Dec. 2006.

CASE STUDY: ZITEL

- Comparison of two different fault-tolerant RAM disks
- The Stochastic Petri Net Package (SPNP) was used to model the two systems for their reliability

CASE STUDY: ZITEL

- Trivedi worked with the designers directly
  - Model validation was done using face validation and sanity checks
  - Parameterization was easy due to the experience of the designers
  - One difficult research problem originated from the study; it was subsequently solved and published in the Microelectronics and Reliability journal in 1998