Reliability and Availability Modeling in Practice

Alibaba, Hangzhou, China, December 8, 2017

Reliability and Availability Modeling in Practice
Kishor S. Trivedi
Duke High Availability Assurance Lab (DHAAL)
Department of Electrical and Computer Engineering
Duke University, Durham, NC 27708-0291
Phone: (919) 660-5269
E-mail: [email protected]
URL: www.ee.duke.edu/~ktrivedi
Also on ResearchGate

Outline
• Introduction
• Reliability and Availability Models
  - Model types in use
  - Illustrated through several real examples
• Conclusions
• References

Copyright © 2017 by K.S. Trivedi

[Map: Duke University and Research Triangle Park (RTP): Duke, UNC-CH, NC State; North Carolina, USA]

DHAAL
➢ Dhaal means "Shield" in Hindi/Gujarati
➢ DHAAL research is about shielding systems from various threats
➢ Cooperating with groups in USA, Germany, Italy, Japan, China, Brazil, India, New Zealand, Norway, Spain, …

DHAAL Research Triangle

Theory: Modeling paradigms & numerical solution: solution of large fault trees and networks, solution of large & stiff Markov models, new modeling paradigms of non-Markovian and fluid Petri nets

Books: Blue, Red, White, Green

Tools: HARP, SAVE, SHARPE, SPNP, SREPT

Applications: Hardware and software: Boeing, EMC, HP, Ericsson, IBM, Sun, Zitel, Cisco, GE, 3Com, Lucent, Motorola, NEC, TCS, Huawei

Recent major achievements:
• Reliability model of the CRN subsystem of the Boeing 787 for certification by the FAA
• Reliability model of SIP on WebSphere

Some Applications
• Reliability Analysis of Boeing 787 Current Return Network for FAA Certification (Boeing)
• Reliability and Availability Analysis of the SIP protocol on High Availability WebSphere (IBM)
• Software Aging and Rejuvenation (IBM, NEC, Huawei, Beihang, Hiroshima)
• Software Fault Tolerance via Environmental Diversity and failure data analytics (NASA-JPL, TCS, Univ. Naples, Beihang)
• Cloud capacity planning including performance, availability, power (IBM, NEC, BJTU, Iran)
• Survivability Analysis for Lucent POTS, for VPN and Smart Grid (Alcatel-Lucent, NTNU, Siemens, BJTU)
• Security Modeling & Analysis (DARPA, NSF, NATO, NSWC)
• Performance modeling of Blockchain Networks (IBM)
• Software Reliability Modeling & Analysis (Draper Lab, NASA)
• Prognosis and Patient Flow Models in Health Care (DUHS)

Books Update
• Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 1st ed., 1982; 2nd ed., John Wiley, 2001 (Blue book); Chinese translation, 2015; fully revised paperback, 2016
• Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer, 1996 (Red book)
• Queueing Networks and Markov Chains, John Wiley, 1st ed., 1998; 2nd ed., 2006 (White book)
• Reliability and Availability Engineering, Cambridge University Press, 2017 (Green book)

Motivation: Dependence on Technical Systems
• Communication
• Health & Medicine
• Avionics
• Entertainment
• Banking

Outline
• Introduction
• Reliability and Availability Models
• Conclusions
• References

Model analysis approaches
• Black-box (measurements + statistical inference): the system is treated as a monolithic whole, considering its input, output, and transfer characteristics without explicitly taking its internal structure into account.
• White-box (or grey-box): the internal structure of the system is explicitly considered using a probability model (e.g., RBD, fault tree, Markov chain). Used to analyze a system with many interacting and interdependent components.
• Combined approach: use the black-box approach at the subsystem/component level and the white-box approach at the system level.

Overview of Evaluation Methods

Quantitative Evaluation
• Measurement-based: error/failure data analytics
• Model-based
  - Discrete-event simulation
  - Analytic methods: closed-form solution, or numerical solution via a tool (numerical solution of analytic models is not as well utilized; there is unnecessarily excessive use of simulation)
• Hybrid

Analytic Modeling Taxonomy

Quantitative Dependability Evaluation
• Measurement-based
• Model-based
  - Discrete-event simulation
  - Analytic methods: non-state-space methods, state-space methods, hierarchical composition, fixed-point iterative methods
• Hybrid

Non-State-Space Methods: Taxonomy
• Series-parallel (SP) reliability block diagrams (RBD)
• Fault trees
• Fault trees with repeated events
• Non-SP reliability block diagrams (relgraph)
• Extensions such as multi-state components/systems, phased-mission systems, etc.

Non-state-space Methods
• Model types such as RBDs, relgraphs, and fault trees are easy to use; assuming statistical independence, they solve for system reliability, system availability, and system MTTF, and can find bottlenecks
• Relatively good algorithms are known for solving medium-sized systems
• Each component can have attached to it:
  - A probability of failure
  - A failure rate
  - A distribution of time to failure
  - Steady-state or instantaneous (un)availability
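The series-parallel composition behind RBDs can be sketched in a few lines. This is an illustrative sketch (not SHARPE syntax) assuming independent components, each characterized by a probability of being up; the component values are made up.

```python
# Minimal series-parallel RBD evaluation, assuming statistically
# independent components, each given by a probability of being up.

def series(*ps):
    # A series arrangement is up only if every block is up.
    r = 1.0
    for p in ps:
        r *= p
    return r

def parallel(*ps):
    # A parallel arrangement fails only if every block fails.
    q = 1.0
    for p in ps:
        q *= (1.0 - p)
    return 1.0 - q

# Example: two redundant processors in series with a single disk
# (probabilities are illustrative).
system = series(parallel(0.99, 0.99), 0.999)
print(system)  # ≈ 0.99890
```

The same two combinators evaluate any series-parallel block diagram by nesting; non-series-parallel structures (relgraphs) need the stronger algorithms discussed later.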

Cisco & Juniper Routers

[RBD figures: RBD of the Cisco 12000 GSR and RBD of the Juniper M20; a red-colored block denotes a sub-model]

K. Trivedi, "Availability Analysis of Cisco GSR 12000 and Juniper M20/M40," Cisco internal report, 2000.

Sun Microsystems
Trivedi et al., "Modeling High Availability Systems," PRDC 2006 Conference, Dec. 2006, Riverside, CA.

The top-level RBD consists of all the subsystems joined by series, parallel, and k-out-of-n blocks. A red-colored block denotes a sub-model.

Fault Trees
• A graphical representation of the combinations of events that can cause the occurrence of system failure.
• A non-state-space model type (state space is what we used when we employed the state enumeration method to solve a relgraph).
• Components are represented as nodes.
• Components or subsystems in series are connected together with OR gates.
• Components or subsystems in parallel are connected together with AND gates.
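The OR/AND gate rules above can be written directly as probability combinators. A hedged sketch, assuming independent basic events with given failure probabilities (the numbers are illustrative, not from any real system):

```python
# Evaluating a fault tree without repeated events, assuming independent
# basic events. In a fault tree, TRUE means "failed", so an OR gate
# models series components and an AND gate models parallel ones.

def OR(*qs):
    # Output is TRUE (failed) if any input is TRUE.
    p_ok = 1.0
    for q in qs:
        p_ok *= (1.0 - q)
    return 1.0 - p_ok

def AND(*qs):
    # Output is TRUE (failed) only if all inputs are TRUE.
    prod = 1.0
    for q in qs:
        prod *= q
    return prod

# Hypothetical top event: system fails if a single-point component
# fails OR both of two redundant units fail.
q_top = OR(0.001, AND(0.01, 0.01))
print(q_top)  # ≈ 0.0011
```

Nesting these calls mirrors the tree structure; the polynomial-time evaluation only works when no basic event feeds more than one gate (no repeated events).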

Fault Tree Model of GE Truck AC6000

[Fault tree figure: top event MF with basic events S, BR, L, O]
S = Suspension, BR = Brake Rigging, L = Liner, O = Others

Fault Trees (continued)
• Failure of a component or subsystem causes the corresponding input to the gate to become TRUE.
• Whenever the output of the topmost gate becomes TRUE, the system is considered failed.
• A fault tree is a pessimist's model, as opposed to RBDs and relgraphs, which can be considered optimists' models.
• Extensions to fault trees include a variety of gate types: NOT, EXOR, priority AND, cold-spare gate, functional-dependency gate, sequence-enforcing gate, etc. Some of these are "static" while others are "dynamic" gates.

Fault Tree Model of GE Equipment Ventilation System

Fault tree with repeated events; an inverted triangle indicates such events.

[Fault tree figure: top event "System Failure" over a k/n gate spanning the midplane, cooling, and power domain 1 and 2 subtrees; the same "Blade" sub-fault-tree instance (base, CPU, memory, DISK, software, FC ports/switches, Ethernet ports/switches) is reused in both power domains. Key: a circle is a basic event, an inverted triangle a repeated event, a square a defined sub-model.]

Smith, Trivedi et al., IBM Systems J., 2008

Software Package SHARPE
• SHARPE: Symbolic Hierarchical Automated Reliability and Performance Evaluator
• Stochastic modeling tool installed at over 1000 sites: companies and universities
• Ported to most architectures and operating systems
• Used for education, research, and engineering practice
• Users: Boeing, 3Com, EMC, AT&T, Alcatel-Lucent, IBM, NEC, Motorola, Siemens, GE, HP, Raytheon, Honda, …
• http://sharpe.pratt.duke.edu/
• It is the core of Boeing's internal tool called IRAP

"A fool with a tool is still a fool."

Fault trees
• Major characteristics:
  - Fault trees without repeated events can be solved in polynomial time.
  - Fault trees with repeated events: theoretical complexity is exponential in the number of components.
• Use factoring (conditioning) [in SHARPE: factor on and bdd off]
• Find all minimal cut-sets & then use sum of disjoint products to compute reliability [in SHARPE: factor off and bdd off]
• Use the BDD approach [in SHARPE: bdd on]
• In practice, fault trees with thousands of components can be solved
• Our brand-new algorithm can solve even larger fault trees [Xiaoyan Yin's MS thesis]
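Factoring (conditioning) on a repeated event can be illustrated on a tiny tree. A hedged sketch, not SHARPE code: the tree, the event names, and the probabilities are all invented for illustration.

```python
# Why repeated events matter: Top = OR(AND(x, a), AND(x, b)), where
# basic event x feeds both AND gates. Treating the two x inputs as
# independent copies gives the wrong answer; factoring on x (the
# total-probability theorem) gives the exact one.

def OR(*qs):
    ok = 1.0
    for q in qs:
        ok *= (1.0 - q)
    return 1.0 - ok

def AND(*qs):
    p = 1.0
    for q in qs:
        p *= q
    return p

qa, qb, qx = 0.01, 0.01, 0.05            # illustrative probabilities

# Naive evaluation: double-counts the shared event x.
naive = OR(AND(qx, qa), AND(qx, qb))

# Factoring on x: the two conditional trees no longer contain x, so
# each is a tree without repeated events (polynomial-time solvable).
q_given_x_failed = OR(qa, qb)            # x contributes TRUE to both ANDs
q_given_x_up = 0.0                       # both ANDs are FALSE
exact = qx * q_given_x_failed + (1 - qx) * q_given_x_up
print(naive, exact)                      # the naive value is too high
```

Each repeated event factored on doubles the number of conditional trees, which is where the exponential worst case comes from; BDDs often avoid that blowup in practice.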

Reliability Graph
• Consists of a set of nodes and edges
• Edges represent components that can fail
• Source and target (sink) nodes
• System fails when there is no path from source to sink
• A non-series-parallel RBD
• The s-t connectedness or network reliability problem
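The state enumeration method mentioned earlier can solve a small relgraph by brute force: enumerate all 2^m edge states and sum the probability of those with an s-t path. A sketch on a hypothetical five-edge bridge network with illustrative edge reliabilities; it is exponential in the number of edges, which motivates the smarter algorithms that follow.

```python
# Brute-force state enumeration for a small relgraph (bridge network).
from itertools import product

edges = {"a": ("s", "1"), "b": ("s", "2"), "c": ("1", "2"),
         "d": ("1", "t"), "e": ("2", "t")}   # illustrative topology
rel = {e: 0.9 for e in edges}                 # each edge up w.p. 0.9

def connected(up_edges):
    # Search from s over working (undirected) edges; is t reachable?
    reach, frontier = {"s"}, ["s"]
    while frontier:
        n = frontier.pop()
        for e in up_edges:
            u, v = edges[e]
            for a, b in ((u, v), (v, u)):
                if a == n and b not in reach:
                    reach.add(b)
                    frontier.append(b)
    return "t" in reach

names = list(edges)
R = 0.0
for states in product([True, False], repeat=len(names)):
    up = [e for e, s in zip(names, states) if s]
    if connected(up):
        p = 1.0
        for e, s in zip(names, states):
            p *= rel[e] if s else 1.0 - rel[e]
        R += p
print(R)  # ≈ 0.97848 for this bridge with all edges at 0.9
```

The same answer follows from factoring on the bridge edge c, which is how one checks a hand computation against the enumeration.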

Avionics
• Reliability analysis of each major subsystem of a commercial airplane needs to be carried out and presented to the Federal Aviation Administration (FAA) for certification
• A real-world example from Boeing Commercial Airplane Company

Reliability Analysis of Boeing 787
• Current Return Network modeled as a reliability graph:
  - Consists of a set of nodes and edges
  - Edges represent components that can fail
  - Source and target nodes
  - System fails when there is no path from source to target
  - Compute the probability of a path from source to target

[Relgraph figure: source and target nodes connected through edges labeled A1–A14, B1–B16, C1–C6, D1–D20, E1–E14, F1–F10]

Reliability Analysis of Boeing 787 (cont’d)
• Some known solution methods for relgraphs:
  - Find all minpaths followed by SDP (sum of disjoint products)
  - BDD (binary decision diagram)-based method
  - Factoring or conditioning
  - Monte Carlo method
• The first two methods have been implemented in our SHARPE software package
• Boeing tried to use SHARPE for this problem but …

Reliability Analysis of Boeing 787 (cont’d)
• Too many minpaths from source to target

[Relgraph figure repeated, annotating the number of paths from source to target]

• Idea: compute bounds instead of exact reliability
  - Lower bound by taking a subset of minpaths
  - Upper bound by taking a subset of mincuts
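The bounding idea can be sketched concretely. With independent edges, the probability that some minpath in a *subset* is fully up is a lower bound on reliability, and the system only fails through mincuts, so one minus the probability that some mincut in a *subset* is fully down is an upper bound. A sketch on the hypothetical bridge network from earlier (path/cut sets and edge reliability are illustrative, not Boeing data); inclusion-exclusion over the small subsets keeps it cheap even when the full lists are astronomically long.

```python
# Reliability bounds from subsets of minpaths and mincuts,
# assuming all edges are independent and up with probability p.
from itertools import combinations

p = 0.9
minpaths = [{"a", "d"}, {"b", "e"}]       # a subset of the minpaths
mincuts = [{"a", "b"}, {"d", "e"}]        # a subset of the mincuts

def prob_union(events, prob_each):
    # Inclusion-exclusion for P(E1 ∪ ... ∪ Ek), where Ei = "every edge
    # in set i is in the given state" and overlapping sets share edges.
    total = 0.0
    for k in range(1, len(events) + 1):
        for combo in combinations(events, k):
            shared = set().union(*combo)
            total += (-1) ** (k + 1) * prob_each ** len(shared)
    return total

lower = prob_union(minpaths, p)           # P(at least one listed path up)
upper = 1.0 - prob_union(mincuts, 1 - p)  # P(no listed cut fully down)
print(lower, upper)  # the exact reliability 0.97848 lies in between
```

Adding more minpaths raises the lower bound and more mincuts lowers the upper bound, so the gap can be tightened incrementally until it is small enough; that trade-off is the essence of the patented bounding algorithm described on the next slide.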

Reliability Analysis of Boeing 787 (cont’d)
• Our approach: developed a new efficient algorithm for (un)reliability bounds computation and incorporated it in SHARPE
  - 2011 patent for the algorithm jointly with Boeing/Duke
  - A paper appeared in EJOR, 2014
  - Satisfying the FAA that SHARPE development followed the DO-178B software standard was the hardest part

SHARPE: Symbolic Hierarchical Automated Reliability and Performance Evaluator, developed by my group at Duke. http://sharpe.pratt.duke.edu/

Non-state-space Methods (cont’d)
• Non-state-space methods provide relatively fast algorithms for system reliability, system availability, system MTTF, and finding bottlenecks, assuming stochastic independence between system components:
  - Series-parallel composition algorithm
  - Factoring (conditioning) algorithms
  - All minpaths followed by the sum of disjoint products (SDP) algorithm
  - Binary decision diagram (BDD) based algorithms
  - Bounding algorithm for relgraphs
• All of the above are implemented in SHARPE
• Solving a fault tree of a whole plane (B787) is still a challenge
• Failure/repair dependencies are often present; RBDs, relgraphs, and fault trees cannot easily handle these (e.g., shared repair, warm/cold spares, imperfect coverage, non-zero switching time, travel time of a repair person, reliability with repair)

State-space models: Markov chains
• To model complex interactions between components, use models like Markov chains or, more generally, state-space models.
• Markov reliability models have one or more absorbing states; Markov availability models have no absorbing states.
• Many examples of dependencies among system components have been observed in practice and captured by continuous-time Markov chains (CTMCs).
• Extension to Markov reward models makes the computation of measures of interest relatively easy.
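A minimal CTMC availability computation makes the mechanics concrete. This sketch (not any model from the slides; MTTF and MTTR values are assumed) solves the steady-state balance equations πQ = 0 with Σπ = 1 numerically, the same linear-algebra step tools like SHARPE and SPNP perform on much larger chains.

```python
# Two-state failure/repair CTMC: state 0 = up, state 1 = down.
import numpy as np

lam, mu = 1 / 1000.0, 1 / 4.0      # assumed MTTF = 1000 h, MTTR = 4 h
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])        # infinitesimal generator

# Solve pi @ Q = 0 with sum(pi) = 1: replace one balance equation
# by the normalization condition.
A = np.vstack([Q.T[:-1], np.ones(2)])
b = np.array([0.0, 1.0])
pi = np.linalg.solve(A, b)

availability = pi[0]                # equals mu / (lam + mu) here
print(availability)
```

For this two-state chain the closed form is A = μ/(λ+μ) ≈ 0.99602; the numerical route is what generalizes to chains with detection delays, coverage, and escalated recovery, where no closed form exists.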

Analytic Modeling Taxonomy
• Analytic methods: non-state-space methods, state-space methods

Availability model of SIP on IBM WebSphere
• Availability model of the Linux OS

[CTMC figure: states UP, DN, DT, DW, RP; failure rate lOS, detection rate dOS, coverage bOS, repair rates mOS, asp]

Detection delay, imperfect coverage, and two levels of recovery are modeled.

Markov Availability model of the WebSphere Application Server

[CTMC figure: states 1N, 2N, 1D, UP, UN, UO, UA, UR, UB, RE with detection rates d1, d2, coverage parameters d, e, q, r, b, and restart/reboot/repair rates ra, rm, bm, m, g]

• Failure detection: by WLM, by Node Agent, or manual detection
• Recovery: auto process restart by the Node Agent; manual recovery by process restart, node reboot, or repair
• Application server and proxy server (with escalated levels of recovery); delay and imperfect coverage in each step of recovery are modeled

Avaya Swift System Availability Model
• Availability model with hardware & software redundancy (warm replication) of the application server
• Assumptions:
  - A web server process fails at rate gp, running on a machine that fails at rate gm
  - Mean time to detect a server process failure 1/dp; mean time to detect a machine failure 1/dm
  - Mean restart time of the server process; mean restart time of the machine

S. Garg, Y. Huang, C. Kintala, K. Trivedi and S. Yagnik, "Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault-Tolerance," Proc. Intl. Symp. on Fault-Tolerant Computing (FTCS), June 1999.

State-Space Model Taxonomy
• (Discrete) state-space models
  - Markovian models: discrete-time Markov chains (DTMC), continuous-time Markov chains (CTMC), Markov reward models (MRM)
  - Non-Markovian models (can relax the assumption of exponential distributions): semi-Markov process (SMP), Markov regenerative process (MRGP), phase-type expansion, non-homogeneous Markov chains

Should I Use Markov Models?
+ Model fault tolerance and recovery/repair
+ Model dependencies
+ Model contention for resources and concurrency
+ Generalize to Markov reward models for degradable systems
+ Can relax the exponential assumption: SMP, MRGP, NHCTMC, PH
+ Performance, availability, performability, survivability modeling possible
- Large state space

Problems with Markov (or State-Space) Models and Their Solutions
• State-space explosion, or the largeness problem
• Stochastic Petri nets and related formalisms for ease of specification and automated generation/solution of the underlying Markov model: largeness tolerance

[Scalable] Model for IaaS Cloud Availability and Downtime [paper in IEEE Trans. On Cloud Computing 2014]


Three Pools of Physical Machines (PMs)
To reduce power usage costs, physical machines are divided into three pools [IBM Research Cloud]:
▪ Hot pool (high performance & high power usage)
▪ Warm pool (medium performance & power usage)
▪ Cold pool (lowest performance & power usage)
A similar grouping of PMs is recommended by Intel*

*Source: http://www.intel.com/content/dam/www/public/us/en/documents/guides/lenovo-think-server-smart-grid-technology-cloud-builders-guide.pdf

System Operation Details
Failure/repair (availability):
▪ PMs may fail and get repaired.
▪ A minimum number of operational hot PMs is required for the system to function.
▪ PMs in other pools may be temporarily assigned to the hot pool to maintain system operation (migration).
▪ Upon repair, PMs migrate back to their original pool.
▪ Migration creates dependence among pools.

Analytic model
• The Markov model (CTMC) is too large to construct by hand.
• We use a high-level formalism of stochastic Petri nets (the flavor known as stochastic reward nets (SRNs)).
• SRN models can be automatically converted into the underlying Markov (reward) model and solved for the measures of interest, such as DT (downtime), steady-state (instantaneous, interval) availability, reliability, and derivatives of these measures, all numerically by solving the underlying equations.
• Analytic-numeric, as opposed to simulation.
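The DT (downtime) measure mentioned above follows directly from steady-state availability. A small sketch of the conversion (the availability value is illustrative):

```python
# Downtime in minutes per year from steady-state availability:
# the fraction of time down, scaled to a year's worth of minutes.

def downtime_min_per_year(availability):
    return (1.0 - availability) * 365 * 24 * 60

print(downtime_min_per_year(0.99999))  # "five nines" ≈ 5.26 min/yr
```

The same conversion is how availability targets are usually quoted to customers: e.g., four nines (0.9999) corresponds to roughly 53 minutes of downtime per year.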

Monolithic Stochastic Reward Net Model


Other High-Level Formalisms
• Many other high-level formalisms (like SRNs) are available, and corresponding software packages exist (SAN, SPA, …)
• Can generate/store/solve moderate-size Markov models
• Have been extended to non-Markovian and fluid (continuous-state) models

Monolithic Model
• The monolithic SRN model is automatically translated into a CTMC or Markov reward model.
• However, the model is not scalable, as the state-space size of this model is extremely large:

#PMs per pool   #states      #non-zero matrix entries
3               10,272       59,560
4               67,075       453,970
5               334,948      2,526,920
6               1,371,436    11,220,964
7               4,816,252    41,980,324
8               memory overflow
10              -

Problems with Markov (or State-Space) Models and Their Solutions
• State-space explosion, or the largeness problem
• Stochastic Petri nets and related formalisms for easy specification and automated generation/solution of the underlying Markov model: largeness tolerance
• Use hierarchical (multilevel) model composition: largeness avoidance
  - e.g., upper level: fault tree or RBD; lower level: Markov chains
  - Many practical examples of the use of hierarchical models exist
  - Can also use state truncation

Analytic Modeling Taxonomy
• Non-state-space methods: efficiency, simplicity
• State-space methods: dependency capture
• Hierarchical composition: to avoid largeness

State-Space Explosion: Largeness Avoidance
• State-space explosion can be avoided by using hierarchical model composition.
• Use state-space methods for those parts of a system that require them, and use non-state-space methods for the more "well-behaved" parts of the system.
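Hierarchical composition can be sketched in miniature: solve each subsystem's (here trivial) Markov submodel for its availability, then feed the results into a non-state-space top-level model; no monolithic chain is ever built. The subsystem names and rates below are illustrative, not from the IBM study.

```python
# Two-level hierarchical composition sketch: Markov submodels at the
# bottom, a series RBD at the top, assuming independence across
# subsystems (the assumption that justifies the non-state-space top level).

def ctmc_availability(lam, mu):
    # Steady-state availability of a 2-state failure/repair submodel.
    return mu / (lam + mu)

# Lower level: each submodel is built and solved separately.
A_os = ctmc_availability(1 / 2000.0, 1 / 0.5)    # hypothetical OS rates
A_disk = ctmc_availability(1 / 5000.0, 1 / 8.0)  # hypothetical disk rates

# Top level: series RBD over the (assumed independent) subsystems.
A_system = A_os * A_disk
print(A_os, A_disk, A_system)
```

In a real study the submodels have many states (detection, coverage, escalated recovery) and are solved numerically, but the propagation of each submodel's availability up into the combinatorial top level works exactly as above.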

Availability model of SIP on IBM WebSphere
• A real problem from IBM
• SIP: Session Initiation Protocol
• Hardware platform: IBM BladeCenter
• Software platform: IBM WebSphere

Architecture of SIP on IBM WebSphere

[Architecture figure: two BladeCenter chassis, each with four blades. The blades host the WebSphere application servers AS1–AS6 (AS = WebSphere Appl. Server (WAS)), a stateless SIP proxy server per chassis, and the deployment manager (DM); an IBM Load Balancer acts as IP sprayer, and IBM PC test drivers generate SIP traffic.]

Replication domains (domain: nodes): 1: A, D; 2: A, E; 3: B, F; 4: B, D; 5: C, E; 6: C, F

Availability model of SIP on IBM WebSphere (cont’d)
• Failure modes incorporated in the models:
  - Physical faults: power, cooling, midplane, network, memory, NIC, CPU, base, I/O (RAID), and blade faults
  - Software failures: OS, WAS, application, proxy, Mandelbugs

Availability model of SIP on IBM WebSphere
• Subsystems are modeled using Markov chains to capture dependence within the subsystem
• A fault tree is used at the higher levels, as independence across subsystems can be assumed
• This is an example of hierarchical composition:
  - A single monolithic model is not constructed/stored/solved
  - Each submodel is built and solved separately, and results are propagated up to the higher-level model
• SHARPE facilitates such hierarchical model composition

Availability model of SIP on IBM WebSphere (cont’d)
• SIP application server: top level of the availability model

[Fault tree figure: top event "System Failure" over the app-server subtree (a k-of-12 gate across AS1–AS12, each an AND of the appserver submodel iX, the node hardware BSX, and the chassis hardware CMi) and the proxy subtree (PX1, PX2). Notation: iX = ith appserver on node X; Pi = ith proxy server; BSX = node X hardware; CMi = chassis i hardware]

Availability model of SIP on IBM WebSphere (cont’d)
• Availability models of a blade server and of the common BladeCenter hardware

[Fault tree figures: "BS Failure" over Base, CPU, Mem, RAID, OS, and Ethernet (eth1: nic1, esw1; eth2: nic2, esw2) submodels; "CM Failure" over midplane (MP), cooling, and power submodels. A circle as a leaf node is a basic event, an inverted triangle a shared event, a square a submodel.]

Availability model of SIP on IBM WebSphere (cont’d)
• Markov availability models of subsystems

[CTMC figures: midplane model (states UP, DN, RP; failure rate lmp with coverage cmp, repair rate mmp) and cooling subsystem model (states UP, U1, DN, DW, RP; failure rates 2lc, lc; repair rates mc, m2c)]

Availability model of SIP on IBM WebSphere (cont’d)
• Availability models of subsystems (cont.)

[CTMC figures: CPU model (UP, D1, RP; rates 2lcpu, mcpu), memory model (UP, D1, RP; rates 4lmem, mmem), power domain model (UP, U1, DN, DW, RP; rates 2lps, lps with coverage cps; repairs mps, m2ps), and base/switch/NIC model (UP, DN, RP; rates l, m)]

Availability model of SIP on IBM WebSphere (cont’d)
• Availability models of subsystems (cont.)

[CTMC figures: RAID model (states UP, U1, DNsrc, RBkp, CP, RP; disk failure rates 2lhd, lhd, lhd_src, lhd_dst; repair rates mhd, m2hd; backup rate m_bkp) and OS model (states UP, DN, DT, DW, RP; failure rate lOS, detection rate dOS, coverage bOS, repair rate mOS)]

Markov Availability model of the WebSphere Application Server

[CTMC figure: states 1N, 2N, 1D, UP, UN, UO, UA, UR, UB, RE with detection rates d1, d2, coverage parameters d, e, q, r, b, and restart/reboot/repair rates ra, rm, bm, m, g]

• Failure detection: by WLM, by Node Agent, or manual detection
• Recovery: auto process restart by the Node Agent; manual recovery by process restart, node reboot, or repair
• Application server and proxy server (with escalated levels of recovery)
• Delay and imperfect coverage in each step of recovery are modeled
• Note the use of restart and reboot as methods of mitigation

Hierarchical Composition

[Figure: the top-level fault tree ("System Failure" over the k-of-12 app-server gate and the proxies) with its square leaf nodes expanded into the blade-server (BS) and chassis (CM) fault subtrees and, below those, the Markov submodels shown on the preceding slides]

A single monolithic Markov model would have too many states.

Availability model of SIP on IBM WebSphere (contributions)
• Developed a very comprehensive availability model:
  - Hardware and software failures
  - Hardware and software failure-detection delays
  - Software failover delay
  - Escalated levels of recovery: automated and manual restart, reboot, repair
  - Imperfect coverage (detection, failover, restart, reboot)
• Many of the parameters collected from experiments, some obtained from tables; a few of them assumed
• Detailed sensitivity analysis to find bottlenecks and give feedback to designers
• Developed a new method for calculating DPM (defects per million):
  - Taking into account the interaction between call flow and failure/recovery
  - Retry of messages (this model will be published in the future)
• This model was responsible for the sale of the system by IBM

Analytic Modeling Taxonomy
• Combinatorial models: efficiency, simplicity
• State-space models: dependency capture
• Hierarchical models: largeness avoidance
• Fixed-point iteration: nearly independent submodels

Scalable Model for IaaS Cloud Availability and Downtime [paper in IEEE Trans. On Cloud Computing 2014]


Monolithic SRN Model


Monolithic Model
• The monolithic SRN model is automatically translated into a CTMC or Markov reward model.
• However, the model is not scalable, as the state-space size of this model is extremely large:

#PMs per pool   #states      #non-zero matrix entries
3               10,272       59,560
4               67,075       453,970
5               334,948      2,526,920
6               1,371,436    11,220,964
7               4,816,252    41,980,324
8               memory overflow
10              -

Decompose into Interacting Sub-models
• SRN sub-models for the cold pool, the warm pool, and the hot pool

Import graph and model outputs
• Model outputs:
  - Mean number of PMs in each pool (E[#Ph], E[#Pw], and E[#Pc])
  - Downtime in minutes per year

Many questions
• Existence of a fixed point (easy): IEEE TCC 2014 (in a more general setting: Mainkar & Trivedi, IEEE TSE, 1996)
• Uniqueness (some cases)
• Rate of convergence
• Accuracy
• Scalability
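The fixed-point scheme behind the decomposed model can be sketched generically: each sub-model's output becomes another's input parameter, and the solutions are iterated until the outputs stop changing. The update functions below are simple stand-ins for solving the hot/warm SRN sub-models, not the actual models; they merely illustrate the iteration and its convergence test.

```python
# Fixed-point iteration between two interacting sub-models.
# solve_hot / solve_warm are hypothetical stand-ins for "solve the
# sub-model's CTMC given the other sub-model's current output".

def solve_hot(warm_spares):
    # Stand-in: more warm spares -> higher mean # of operational hot PMs.
    return 3.0 - 0.5 / (1.0 + warm_spares)

def solve_warm(hot_up):
    # Stand-in: the warm pool's state depends on the hot pool's output.
    return 2.0 - 0.3 / (1.0 + hot_up)

hot, warm = 0.0, 0.0                       # initial guess
for i in range(100):
    new_hot = solve_hot(warm)
    new_warm = solve_warm(new_hot)
    if abs(new_hot - hot) < 1e-10 and abs(new_warm - warm) < 1e-10:
        break                              # outputs stabilized
    hot, warm = new_hot, new_warm
print(i, hot, warm)
```

When each update is a contraction, as here, the iteration converges geometrically to the unique fixed point; the existence/uniqueness/convergence questions listed above ask when that is guaranteed for real interacting SRN sub-models.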

Monolithic vs. interacting sub-models
• #states, #non-zero entries

Analytic-numeric vs. simulative solutions
• Downtime [minutes per year]
• Solution times [seconds]

Analytic model
• The Markov model (CTMC) is too large to construct by hand.
• We use a high-level formalism of stochastic Petri nets (the flavor known as stochastic reward nets (SRNs)).
• SRN models can be automatically converted into the underlying Markov (reward) model and solved for measures of interest such as DT (downtime).
• To improve scalability, we used decomposition and fixed-point iteration.
• For very large numbers of PMs, even decomposed models are not enough; we resort to discrete-event simulation; the same SRN model can be simulated via our software package (SPNP).

Steps for system availability modeling
1. List all possible component-level failures (hardware, software)
2. List all failure detectors & match them with failure types
3. List all recovery mechanisms & match them with failure types
4. Allocate software modules to hardware units
5. Formulate the model
6. Face validation and verification of the model
7. Parameterization of the model (tables, websites, experiments)
8. Solve the model (using SHARPE, SPNP, or similar software packages) to detect bottlenecks, carry out sensitivity analysis, and suggest parameters to be monitored
9. What-if analysis to suggest improvements
10. Validate the model

Outline
• Introduction
• Reliability and Availability Models
• Conclusions
• References

State of the Art of Reliability/Availability Models
• Techniques & software packages are available for the construction & solution of reliability and availability models of real systems
• System decomposition followed by hierarchical model composition is the typical approach
• Modeling has been used:
  - To compare alternative designs/architectures (Cisco)
  - To find bottlenecks, answer what-if questions, optimize designs, and conduct trade-off studies
  - At certification time (Boeing)
  - At design verification/testing time (IBM)
  - In the configuration selection phase (DEC)
  - In the operational phase for system tuning/on-line control

Summary: System Availability Models
• Model types in use:
  - Non-state-space: reliability block diagrams, fault trees, reliability graphs
  - State-space: Markov models & stochastic Petri nets; semi-Markov, Markov regenerative, and non-homogeneous Markov models
  - Hierarchical composition: the top level is usually an RBD or a fault tree; the bottom-level models are usually Markov chains
  - Fixed-point iterative models
• Solution types:
  - Analytic closed-form
  - Analytic numerical (using a software package)
  - Simulative
• Software packages:
  - SHARPE or similar tools are used to construct and solve such models
• Structural as well as parametric assumptions mean that the numbers produced should be taken with a grain of salt

Challenges in Dependability Models
• Model largeness (in spite of hierarchy, fixed-point iteration, approximations)
• Dealing with non-exponential distributions (in spite of SMP, MRGP, NHCTMC, PH)
• Service- (or user-) oriented measures as opposed to system-oriented measures
• Combining performance and failure/repair
• Model parameterization
• Model validation and verification
• Parametric uncertainty propagation

Model Parameterization
• Hardware/software configuration parameters
• Hardware component MTTFs
• Software component MTTFs (OS, IBM application, customer software, third party)
• Hardware/software failover times
• Restart/reboot times
• Coverage (success) probabilities: detection, location, restart, reconfiguration, repair
• Repair time: hot swap, multiple components at once, DOA (dead on arrival), shared/not shared, field-service travel time, preventive vs. corrective
• Uncertainty propagation: dealing not only with aleatory uncertainty (built into the system models) but also with epistemic (parametric) uncertainty

Outline
• Introduction
• Reliability and Availability Models
• Conclusions
• References


Selected References
• K. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 2nd ed., John Wiley, 2001; revised paperback, 2016
• L. Tomek & K. Trivedi, "Fixed-Point Iteration in Availability Modeling," Informatik-Fachberichte, M. Dal Cin (ed.), Springer-Verlag, Berlin, 1991
• V. Mainkar & K. Trivedi, "Sufficient Conditions for Existence of a Fixed Point in Stochastic Reward Net-Based Iterative Models," IEEE TSE, 1996
• Trivedi, Vasireddy, Trindade, Nathan, Castro, "Modeling High Availability Systems," PRDC 2006
• K. Trivedi & A. Bobbio, Reliability and Availability Engineering: Modeling, Analysis, and Applications, Cambridge University Press, 2017
• Trivedi, Wang, Hunt, Rindos, Smith, Vashaw, "Availability Modeling of SIP Protocol on IBM WebSphere," PRDC 2008
• Smith, Trivedi, Tomek, Ackaret, "Availability analysis of blade server systems," IBM Systems Journal, 2008
• Trivedi & Sahner, "SHARPE at the Age of Twenty-Two," ACM SIGMETRICS Performance Evaluation Review, 2008
• Trivedi, Wang & Hunt, "Computing the Number of Calls Dropped Due to Failures," ISSRE 2010
• Mishra, Trivedi & Some, "Uncertainty Analysis of the Remote Exploration and Experimentation System," Journal of Spacecraft and Rockets, 2012
• Ghosh, Longo, Frattini, Russo & Trivedi, "Scalable Analytics for IaaS Cloud Availability," IEEE Trans. on Cloud Computing, 2014

SHARPE portal
https://sharpe.pratt.duke.edu

Thank you!



Green Book Outline
Part I: Introduction (Chapters 1–3)
Part II: Non-state-space models (Chapters 4–8)
Part III: State-space models with exponential distributions (Chapters 9–12)
Part IV: State-space models with non-exponential distributions (Chapters 13–15)
Part V: Multi-level models (Chapters 16–17)
Part VI: Case studies (Chapter 18)
