Alibaba, Hangzhou, China, December 8, 2017
Reliability and Availability Modeling in Practice
Kishor S. Trivedi
Duke High Availability Assurance Lab (DHAAL)
Department of Electrical and Computer Engineering
Duke University, Durham, NC 27708-0291
Phone: (919)660-5269
E-mail: [email protected]
URL: www.ee.duke.edu/~ktrivedi
Also Researchgate.com
Outline
• Introduction
• Reliability and Availability Models
  ▪ Model Types in Use
  ▪ Illustrated through several real examples
• Conclusions
• References
Copyright © 2017 by K.S. Trivedi
Duke University
[Map: Research Triangle Park (RTP), North Carolina, USA, with Duke, UNC-CH and NC State]
DHAAL
➢ Dhaal means “Shield” in Hindi/Gujarati
➢ DHAAL research is about shielding systems from various threats
➢ Cooperating with groups in USA, Germany, Italy, Japan, China, Brazil, India, New Zealand, Norway, Spain, …
DHAAL Research Triangle
• Theory (modeling paradigms & numerical solution):
  ▪ Solution of large fault trees and networks
  ▪ Solution of large & stiff Markov models
  ▪ New modeling paradigms of non-Markovian and fluid Petri nets
• Books: Blue, Red, White, Green
• Tools: HARP, SAVE, SHARPE, SPNP, SREPT
• Applications (hardware and software): Boeing, EMC, HP, Ericsson, IBM, Sun, Zitel, Cisco, GE, 3Com, Lucent, Motorola, NEC, TCS, Huawei
• Recent major achievements:
  ▪ Reliability model of CRN subsystem of Boeing 787 for certification by FAA
  ▪ Reliability model of SIP on WebSphere
Some Applications
• Reliability Analysis of Boeing 787 Current Return Network for FAA Certification (Boeing)
• Reliability and Availability Analysis of SIP protocol on High Availability WebSphere (IBM)
• Software Aging and Rejuvenation (IBM, NEC, Huawei, Beihang, Hiroshima)
• Software Fault-Tolerance via Environmental Diversity and Failures data analytics (NASA-JPL, TCS, Uni. Naples, Beihang)
• Cloud capacity planning including performance, availability, power (IBM, NEC, BJTU, Iran)
• Survivability Analysis for Lucent POTS, for VPN and Smart Grid (Alcatel-Lucent, NTNU, Siemens, BJTU)
• Security Modeling & Analysis (DARPA, NSF, NATO, NSWC)
• Performance modeling of Blockchain Networks (IBM)
• Software Reliability Modeling & Analysis (Draper Lab, NASA)
• Prognosis and Patient Flow Models in Health Care (DUHS)
Books Update
• Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 1st ed., 1982; 2nd ed., John Wiley, 2001 (Blue book); Chinese translation, 2015; fully revised paperback, 2016
• Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer, 1996 (Red book)
• Queueing Networks and Markov Chains, John Wiley, 1st edition, 1998; 2nd edition, 2006 (White book)
• Reliability and Availability Engineering, Cambridge University Press, 2017 (Green book)
Motivation: Dependence on Technical Systems
• Communication
• Health & Medicine
• Avionics
• Entertainment
• Banking
Outline
Introduction Reliability and Availability Models Conclusions References
Model analysis approaches
• Black-box (measurements + statistical inference): the system is treated as a monolithic whole, considering its input, output and transfer characteristics without explicitly taking into account its internal structure.
• White-box (or grey-box): the internal structure of the system is explicitly considered using a probability model (e.g., RBD, fault tree, Markov chain). Used to analyze a system with many interacting and interdependent components.
• Combined approach:
  ▪ Use the black-box approach at the subsystem/component level
  ▪ Use the white-box approach at the system level
Overview of Evaluation Methods
Quantitative Evaluation
• Measurement-based: error/failure data analytics
• Model-based
  ▪ Discrete-event simulation
  ▪ Analytic methods
    - Closed-form solution
    - Numerical solution via a tool
• Hybrid
Numerical solution of analytic models is not as well utilized as it could be; simulation is used more than necessary.
Analytic Modeling Taxonomy
Quantitative Dependability Evaluation
• Measurement-based
• Model-based
  ▪ Discrete-event simulation
  ▪ Analytic methods
    - Non-state-space methods
    - State-space methods
    - Hierarchical composition
    - Fixed-point iterative methods
• Hybrid
Non-State-Space Methods: taxonomy
• SP (series-parallel) reliability block diagrams (RBD)
• Non-SP reliability block diagrams (relgraph)
• Fault trees
• Fault trees with repeated events
• Extensions such as multi-state components/systems, phased-mission systems, etc.
Non-state-space Methods
• Model types such as RBDs, relgraphs and FTs are easy to use. Assuming statistical independence, they can be solved for system reliability, system availability and system MTTF, and they can be used to find bottlenecks.
• Relatively good algorithms are known for solving medium-sized systems.
• Each component can have attached to it:
  ▪ A probability of failure
  ▪ A failure rate
  ▪ A distribution of time to failure
  ▪ Steady-state or instantaneous (un)availability
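As a rough illustration of how a series-parallel RBD is evaluated under the independence assumption, the sketch below composes component availabilities with series, parallel and k-of-n blocks. The function names and the example system (a controller, 2-of-3 line cards, duplex power) are hypothetical and are not taken from the router or Sun models on the following slides.

```python
from math import prod

def series(avails):
    """All blocks must be up: multiply the availabilities."""
    return prod(avails)

def parallel(avails):
    """At least one block must be up: one minus the product of unavailabilities."""
    return 1.0 - prod(1.0 - a for a in avails)

def k_of_n(k, avails):
    """At least k of the (possibly different) independent blocks must be up."""
    dist = [1.0]                      # dist[j] = P(exactly j blocks up so far)
    for a in avails:
        nxt = [0.0] * (len(dist) + 1)
        for j, p in enumerate(dist):
            nxt[j] += p * (1.0 - a)   # this block is down
            nxt[j + 1] += p * a       # this block is up
        dist = nxt
    return sum(dist[k:])

# Hypothetical system: controller in series with 2-of-3 line cards and duplex power.
system = series([0.999, k_of_n(2, [0.99] * 3), parallel([0.995, 0.995])])
```

The same functions work with reliabilities at a fixed time t in place of availabilities, since only the independence of the blocks is used.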
Cisco & Juniper Routers
RBD of Cisco 12000 GSR
RBD of Juniper M20
Source: K. Trivedi, “Availability Analysis of Cisco GSR 12000 and Juniper M20/M40,” Cisco internal report, 2000.
A red block denotes a sub-model.
Sun Microsystems
Source: Trivedi et al., Modeling High Availability Systems, PRDC’06 Conference, Dec. 2006, Riverside, CA
The top-level RBD consists of all the subsystems joined by series, parallel and k/n blocks. A red block denotes a sub-model.
Fault Trees
• A fault tree is a graphical representation of the combinations of events that can cause the occurrence of system failure.
• It is a non-state-space model type (a state space is what we used when we employed the state enumeration method to solve a relgraph).
• Components are represented as nodes.
• Components or subsystems in series are connected together with OR gates.
• Components or subsystems in parallel are connected together with AND gates.
Fault Tree Model of GE Truck - AC6000
[Figure: fault tree with nodes MF, S, BR, L and O]
Legend: S = Suspension, BR = Brake Rigging, L = Liner, O = Others
Fault Trees
(Continued)
Failure of a component or subsystem causes the corresponding input to the gate to become TRUE.
Whenever the output of the topmost gate becomes TRUE, the system is considered failed.
Ftree is a pessimist’s model as opposed to RBD or relgraph that can be considered optimists’ models.
Extensions to Fault-trees include a variety of different gate types: NOT, EXOR, Priority AND, cold spare gate, functional dependency gate, sequence enforcing gate, etc. Some of these are “static” while others are “dynamic” gates.
Fault Tree Model of GE Equipment Ventilation System
Fault Tree with Repeated events; inverted triangle indicates such events
[Figure: fault tree of a blade server system. Key: basic event, system failure, defined sub-model, repeated use of the same sub-model instance. Labels under the system failure event: k/n, Midplane, Cooling, Power Domain 1, Power Domain 2. The blade fault subtree in each power domain carries the labels Base, CPU, Memory, DISK, Software, FC (Port 1, Switch 1, Port 2, Switch 2) and Ethernet (Port 1, Switch 1, Port 2, Switch 2).]
Source: Smith, Trivedi et al., IBM Systems J., 2008
Software Package SHARPE
SHARPE: Symbolic-Hierarchical Automated Reliability and Performance Evaluator
• Stochastic modeling tool installed at over 1000 sites (companies and universities)
• Ported to most architectures and operating systems
• Used for education, research and engineering practice
• Users: Boeing, 3Com, EMC, AT&T, Alcatel-Lucent, IBM, NEC, Motorola, Siemens, GE, HP, Raytheon, Honda, …
• http://sharpe.pratt.duke.edu/
• It is the core of Boeing’s internal tool called IRAP
A fool with a tool is still a fool
Fault trees
Major characteristics:
• Fault trees without repeated events can be solved in polynomial time.
• Fault trees with repeated events have a theoretical complexity that is exponential in the number of components. Solution options:
  ▪ Use factoring (conditioning) [in SHARPE use factor on and bdd off]
  ▪ Find all minimal cut-sets, then use Sum of Disjoint Products to compute reliability [in SHARPE use factor off and bdd off]
  ▪ Use the BDD approach [in SHARPE use bdd on]
• In practice, fault trees with thousands of components can be solved.
• Our brand new algorithm can solve even larger fault trees [Xiaoyan Yin’s MS Thesis].
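To make the factoring (conditioning) idea concrete, here is a minimal sketch, assuming a fault tree encoded as nested ("AND"/"OR", children) tuples with independent basic events. The encoding, the function names and the tiny example tree with a shared event P are all hypothetical; this is not SHARPE's input language or its implementation. Gate-by-gate multiplication is only valid when no basic event is repeated, so the repeated events are conditioned on (failed / not failed) and the conditional results are weighted and summed.

```python
from itertools import product

def gate_prob(node, q):
    """Failure probability by gate-by-gate multiplication.
    Valid only if every basic event appears once (or has probability 0 or 1)."""
    if isinstance(node, str):
        return q[node]
    gate, children = node
    probs = [gate_prob(child, q) for child in children]
    result = 1.0
    if gate == "AND":                 # output fails only if all inputs fail
        for p in probs:
            result *= p
        return result
    if gate == "OR":                  # output fails if any input fails
        for p in probs:
            result *= 1.0 - p
        return 1.0 - result
    raise ValueError(f"unknown gate type: {gate}")

def factoring(tree, q, repeated):
    """Condition on every repeated event, then apply the total probability theorem."""
    total = 0.0
    for outcome in product((0.0, 1.0), repeat=len(repeated)):
        weight, cond = 1.0, dict(q)
        for name, failed in zip(repeated, outcome):
            weight *= q[name] if failed else 1.0 - q[name]
            cond[name] = failed       # event is now certain: probability 0 or 1
        total += weight * gate_prob(tree, cond)
    return total

# Hypothetical tree: two redundant units A and B share a power source P.
tree = ("AND", [("OR", ["P", "A"]), ("OR", ["P", "B"])])
q = {"P": 0.1, "A": 0.2, "B": 0.2}
naive = gate_prob(tree, q)            # wrong: treats the two P inputs as independent
exact = factoring(tree, q, ["P"])     # equals q_P + (1 - q_P) * q_A * q_B
```

The cost grows as 2^r in the number r of repeated events, which is consistent with the exponential worst case noted above; BDD-based methods are the usual way around this.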
Reliability Graph
Consists of a set of nodes and edges
Edges represent components that can fail
Source and target (sink) nodes
System fails when no path from source to sink
A non-series-parallel RBD
S-t connectedness or network reliability problem
Avionics
Reliability analysis of each major subsystem of a commercial airplane needs to be carried out and presented to Federal Aviation Administration (FAA) for certification
Real world example from Boeing Commercial Airplane Company
Reliability Analysis of Boeing 787
Current Return Network Modeled as a Reliability Graph
• Consists of a set of nodes and edges
• Edges represent components that can fail
• Source and target nodes
• System fails when there is no path from source to target
• Compute the probability of a path from source to target
[Figure: reliability graph of the current return network between source and target, with labels A1 to A14, B1 to B16, C1 to C6, D1 to D20, E1 to E14 and F1 to F10]
Reliability Analysis of Boeing 787 (cont’d)
Some known solution methods for Relgraph
• Find all minpaths followed by SDP (Sum of Disjoint Products)
• BDD (Binary Decision Diagrams)-based method
• Factoring or conditioning
• Monte Carlo method
The first two methods have been implemented in our SHARPE software package.
Boeing tried to use SHARPE for this problem but …
Reliability Analysis of Boeing 787 (cont’d)
Too many minpaths
[Figure: the same reliability graph, captioned “Number of paths from source to target”]
Idea: compute bounds instead of exact reliability
• Lower bound by taking a subset of minpaths
• Upper bound by taking a subset of mincuts
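The bounding idea can be sketched on a small example. The code below is an illustration only: it uses plain inclusion-exclusion over the chosen subsets (not SDP, BDDs or the patented SHARPE algorithm), and the five-edge bridge network with edge names a to e is a hypothetical stand-in for the real graph. Any subset of minpaths gives a lower bound on reliability, and any subset of mincuts gives an upper bound.

```python
from itertools import combinations

def union_prob(sets, prob):
    """P(at least one set has all of its elements 'on'), for independent elements,
    computed by inclusion-exclusion (exponential in len(sets), fine for small subsets)."""
    total = 0.0
    for r in range(1, len(sets) + 1):
        for combo in combinations(sets, r):
            term = 1.0
            for elem in frozenset().union(*combo):
                term *= prob[elem]
            total += (-1) ** (r + 1) * term
    return total

# Hypothetical bridge network: a = s-1, b = s-2, c = 1-2, d = 1-t, e = 2-t.
minpaths = [{"a", "d"}, {"b", "e"}, {"a", "c", "e"}, {"b", "c", "d"}]
mincuts = [{"a", "b"}, {"d", "e"}, {"a", "c", "e"}, {"b", "c", "d"}]
p_up = {edge: 0.9 for edge in "abcde"}             # assumed edge reliabilities
p_down = {edge: 1.0 - p for edge, p in p_up.items()}

lower = union_prob(minpaths[:2], p_up)             # some path from the subset is fully up
upper = 1.0 - union_prob(mincuts[:2], p_down)      # no cut from the subset is fully down
exact = union_prob(minpaths, p_up)                 # all minpaths: exact for this tiny graph
```

Adding more minpaths raises the lower bound and adding more mincuts lowers the upper bound, so the gap can be tightened until it is small enough for the purpose at hand.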
Reliability Analysis of Boeing 787 (cont’d)
Our approach: developed a new efficient algorithm for (un)reliability bounds computation and incorporated it in SHARPE
• 2011 patent for the algorithm jointly with Boeing/Duke
• A paper appeared in EJOR, 2014
• Satisfying the FAA that SHARPE development used the DO-178B software standard was the hardest part
SHARPE: Symbolic Hierarchical Automated Reliability and Performance Evaluator, developed by my group at Duke
http://sharpe.pratt.duke.edu/
Non-state-space Methods (cont’d)
• Non-state-space methods provide relatively fast algorithms for system reliability, system availability and system MTTF, and for finding bottlenecks, assuming stochastic independence between system components:
  ▪ Series-parallel composition algorithm
  ▪ Factoring (conditioning) algorithms
  ▪ All minpaths followed by Sum of Disjoint Products (SDP) algorithm
  ▪ Binary Decision Diagrams (BDD) based algorithms
  ▪ Bounding algorithm for relgraphs
• All of the above are implemented in SHARPE
• Solving a fault tree of a whole plane (B787) is still a challenge
• Failure/repair dependencies are often present; RBDs, relgraphs and fault trees cannot easily handle these (e.g., shared repair, warm/cold spares, imperfect coverage, non-zero switching time, travel time of repair person, reliability with repair).
State-space models : Markov chains
To model complex interactions between components, use models like Markov chains or more generally state space models.
Markov reliability models will have one or more absorbing states; Markov availability models will have no absorbing states
Many examples of dependencies among system components have been observed in practice and captured by continuous-time Markov chains (CTMCs).
Extension to Markov reward models makes computation of measures of interest relatively easy.
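A small worked example may help show what a Markov availability model buys over an RBD. The sketch below, under assumed rates, models two identical units with a single shared repair facility (a dependency the non-state-space models cannot express), solves the steady-state equations πQ = 0 with a plain Gaussian elimination, and then obtains availability as an expected reward. The solver, state numbering and rate values are illustrative assumptions; real tools use sparse and iterative methods.

```python
def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for an irreducible CTMC generator Q."""
    n = len(Q)
    # Transpose Q so each row is one balance equation, then replace the last
    # (redundant) equation by the normalization condition.
    A = [[Q[j][i] for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):              # Gaussian elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for i in reversed(range(n)):      # back substitution
        pi[i] = (b[i] - sum(A[i][j] * pi[j] for j in range(i + 1, n))) / A[i][i]
    return pi

lam, mu = 0.001, 0.1                  # hypothetical failure and repair rates (per hour)
Q = [
    [-2 * lam, 2 * lam, 0.0],         # state 0: both units up
    [mu, -(lam + mu), lam],           # state 1: one up, one in repair
    [0.0, mu, -mu],                   # state 2: both down, only one being repaired
]
pi = steady_state(Q)
rewards = [1.0, 1.0, 0.0]             # reward 1 in up states, 0 in the down state
availability = sum(r * p for r, p in zip(rewards, pi))
independent_repair_unavailability = (lam / (lam + mu)) ** 2   # what a parallel RBD would give
```

With these assumed rates the shared-repair unavailability pi[2] comes out roughly twice the independent-repair value, which is the kind of effect the dependency slides are pointing at. Changing the reward vector (for example to the number of working units) yields other measures from the same solution.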
Analytic Modeling Taxonomy
Analytic methods
• Non-state-space methods
• State-space methods
Availability model of SIP on IBM WebSphere
Availability model of the Linux OS
[Figure: Markov availability model with states UP, DN, DT, DW and RP; transition labels lOS, dOS, bOSbOS, (1-bOS)bOS, asp and mOS]
Detection delay, imperfect coverage and two levels of recovery are modeled
Markov Availability model of WebSphere AP Server
[Figure: Markov availability model with states UP, UO, UN, UA, UR, UB, RE, 1N, 2N and 1D; transition labels g, dd1, (1-d)d1, ed2, (1-e)d2, dm, qra, (1-q)ra, rrm, (1-r)rm, bbm, (1-b)bm and m]
• Failure detection: by WLM, by Node Agent, manual detection
• Recovery: Node Agent auto process restart; manual recovery (process restart, node reboot, repair)
• Application server and proxy server (with escalated levels of recovery)
• Delay and imperfect coverage in each step of recovery are modeled
Avaya Swift System Availability Model
Availability model with hardware & software redundancy (warm replication) of application server
Assumptions:
• A web server software that fails at the rate gp, running on a machine that fails at the rate gm
• Mean time to detect server process failure d-1p and mean time to detect machine failure d-1m
• The mean restart time of a server -1p
• The mean restart time of a machine -1m
Source: S. Garg, Y. Huang, C. Kintala, K. Trivedi and S. Yagnik, “Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault-Tolerance,” Proc. of Intl. Symp. on Fault-Tolerant Computing (FTCS), June 1999.
State-Space model taxonomy
(Discrete) state space models
• Markovian models
  ▪ Discrete-time Markov chains (DTMC)
  ▪ Continuous-time Markov chains (CTMC)
  ▪ Markov reward models (MRM)
• Non-Markovian models (can relax the assumption of exponential distributions)
  ▪ Semi-Markov process (SMP)
  ▪ Markov regenerative process
  ▪ Phase-type expansion
  ▪ Non-homogeneous Markov
Should I Use Markov Models?
+ Model fault-tolerance and recovery/repair
+ Model dependencies
+ Model contention for resources and concurrency
+ Generalize to Markov reward models for degradable systems
+ Can relax the exponential assumption: SMP, MRGP, NHCTMC, PH
+ Performance, availability, performability and survivability modeling possible
- Large state space
Problems with Markov (or State Space) Models and their solutions
State space explosion or the largeness problem
• Largeness tolerance: stochastic Petri nets and related formalisms for ease of specification and automated generation/solution of the underlying Markov model
[Scalable] Model for IaaS Cloud Availability and Downtime [paper in IEEE Trans. On Cloud Computing 2014]
Three Pools of Physical Machines (PMs)
To reduce power usage costs, physical machines are divided into three pools [IBM Research Cloud]
▪ Hot pool (high performance & high power usage)
▪ Warm pool (medium performance & power usage)
▪ Cold pool (lowest performance & power usage)
Similar grouping of PMs is recommended by Intel*
*Source: http://www.intel.com/content/dam/www/public/us/en/documents/guides/lenovo-think-server-smart-grid-technology-cloud-builders-guide.pdf
System Operation Details
Failure/Repair (Availability):
▪ PMs may fail and get repaired.
▪ A minimum number of operational hot PMs is required for the system to function.
▪ PMs in other pools may be temporarily assigned to the hot pool to maintain system operation (migration).
▪ Upon repair, PMs migrate back to their original pool.
▪ Migration creates dependence among pools.
Analytic model
• The Markov model (CTMC) is too large to construct by hand.
• We use a high-level formalism, the stochastic Petri net (specifically the flavor known as the stochastic reward net, SRN).
• SRN models can be automatically converted into the underlying Markov (reward) model and solved for the measures of interest, such as DT (downtime), steady-state (instantaneous, interval) availability, reliability, and derivatives of these measures, all numerically by solving the underlying equations.
• The approach is analytic-numeric, as opposed to simulation.
Monolithic Stochastic Reward Net Model
Other High Level Formalisms
• Many other high-level formalisms (like SRN) are available, and corresponding software packages exist (SAN, SPA, …)
• They can generate/store/solve moderate-size Markov models
• They have been extended to non-Markov and fluid (continuous state) models
Monolithic Model
• The monolithic SRN model is automatically translated into a CTMC or Markov reward model
• However, the model is not scalable, as its state space is extremely large

#PMs per pool   #states           #non-zero matrix entries
3               10,272            59,560
4               67,075            453,970
5               334,948           2,526,920
6               1,371,436         11,220,964
7               4,816,252         41,980,324
8               Memory overflow   Memory overflow
10              -                 -
Problems with Markov (or State Space) Models and their solutions
State space explosion or the largeness problem
• Largeness tolerance: stochastic Petri nets and related formalisms for easy specification and automated generation/solution of the underlying Markov model
• Largeness avoidance: use hierarchical (multilevel) model composition
  ▪ e.g., upper level: FT or RBD; lower level: Markov chains
  ▪ Many practical examples of the use of hierarchical models exist
• Can also use state truncation
Analytic Modeling Taxonomy
Analytic models
• Non-state-space methods: efficiency, simplicity
• State-space methods: dependency capture
• Hierarchical composition: to avoid largeness
State Space Explosion– Largeness Avoidance
State space explosion can be avoided by using hierarchical model composition.
Use state-space methods for those parts of a system that require them, and use non-state-space methods for the more “well-behaved” parts of the system.
Availability model of SIP on IBM WebSphere
Real problem from IBM
SIP: Session Initiation Protocol
Hardware platform: IBM BladeCenter
Software platform: IBM WebSphere
Architecture of SIP on IBM WebSphere
[Figure: architecture diagram showing test drivers (IBM PC and blade test drivers), an IP sprayer (IBM Load Balancer), SIP proxies, a DM, and two blade chassis whose blades host application servers AS 1 through AS 6 grouped into replication domains 1 through 6]
• AS: WebSphere Appl. Server (WAS); AS1 through AS6 are application servers
• The Proxy 1 servers are stateless proxy servers

Replication domain   Nodes
1                    A, D
2                    A, E
3                    B, F
4                    B, D
5                    C, E
6                    C, F
Availability model of SIP on IBM WebSphere (cont’d)
Failure modes incorporated in the models
• Physical faults: power faults, cooling faults, blade faults, midplane faults, network faults, memory faults, NIC faults, CPU faults, base faults, I/O (RAID) faults
• Software failures: OS, application, WAS, proxy, Mandelbugs
Availability model of SIP on IBM WebSphere
Subsystems modeled using Markov chains to capture dependence within the subsystem
Fault tree used at higher levels as independence across subsystems can be assumed
This is an example of hierarchical composition
A single monolithic model is not constructed/stored/solved
Each submodel is built and solved separately and results are propagated up to the higher level model
SHARPE facilitates such hierarchical model composition
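The propagation of sub-model results into a higher-level model can be sketched in a few lines. The example below is a simplified illustration with hypothetical subsystem names and rates, not the SIP/WebSphere model itself: each lower-level Markov sub-model is solved on its own (here via closed-form steady-state results for a two-state chain and for a duplex with shared repair), and only the resulting unavailabilities are passed up to an OR gate that assumes independence across subsystems.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def two_state_unavailability(lam, mu):
    """Lower level: 2-state CTMC (UP <-> DN), steady-state probability of DN."""
    return lam / (lam + mu)

def duplex_shared_repair_unavailability(lam, mu):
    """Lower level: two units, one repair facility; steady-state P(both down)
    from the birth-death chain with rates 2*lam, lam (failure) and mu, mu (repair)."""
    rho = lam / mu
    return 2 * rho**2 / (1 + 2 * rho + 2 * rho**2)

def or_gate(unavails):
    """Upper level: the system is down if any independent subsystem is down."""
    up = 1.0
    for u in unavails:
        up *= 1.0 - u
    return 1.0 - up

# Hypothetical rates per hour; each sub-model is solved separately.
sub_results = {
    "midplane": two_state_unavailability(1e-6, 0.5),
    "cooling": duplex_shared_repair_unavailability(1e-4, 0.25),
    "power": duplex_shared_repair_unavailability(5e-5, 0.25),
}
system_unavailability = or_gate(sub_results.values())
downtime_minutes_per_year = system_unavailability * MINUTES_PER_YEAR
```

The top level never sees the states of the sub-models, only one number per subsystem, which is why no monolithic state space has to be generated.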
Availability model of SIP on IBM WebSphere (cont’d)
SIP Application Server: top level of the availability model
[Figure: fault tree with top event System Failure, an App servers branch (k of 12 gate over AS1 through AS12) and a proxy branch (PX1, PX2)]
Legend: iX = ith app server on node X; Pi = ith proxy server; BSX = node X hardware; CMi = chassis i hardware

Event   Inputs
AS1     1A, BSA, CM1
AS2     2A, BSA, CM1
AS3     3B, BSB, CM1
AS4     4B, BSB, CM1
AS5     5C, BSC, CM1
AS6     6C, BSC, CM1
AS7     1D, BSD, CM2
AS8     4D, BSD, CM2
AS9     2E, BSE, CM2
AS10    5E, BSE, CM2
AS11    3F, BSF, CM2
AS12    6F, BSF, CM2
PX1     P1, BSG, CM1
PX2     P2, BSH, CM2
Availability model of SIP on IBM WebSphere (cont’d)
Availability models of a Blade Server and Common BladeCenter Hardware
Legend: a circle as a leaf node is a basic event; an inverted triangle is a shared event; a square indicates a submodel
[Figure: two fault trees. BS Failure (blade server failure) has inputs Base, CPU, Mem, RAID, OS and eth, where eth is built from eth1 (nic1, esw1) and eth2 (nic2, esw2). CM Failure (chassis failure) has inputs MP, Cool and Pwr.]
Availability model of SIP on IBM WebSphere (cont’d)
Markov Availability models of subsystems
[Figures: midplane model and cooling subsystem model; state labels UP, U1, DN, DW and RP; transition labels cmp×lmp, (1-cmp)×lmp, mmp, 2lc, lc, mc, m2c and asp]
Availability model of SIP on IBM WebSphere (cont’d)
Availability models of subsystems (cont.)
[Figures: CPU model, memory model, Base/Switch/NIC model and power domain model; state labels UP, D1, U1, DN, DW and RP; transition labels 2lcpu, mcpu, 4lmem, mmem, l, m, 2cps×lps, 2(1-cps)×lps, lps, mps, m2ps and asp]
Availability model of SIP on IBM WebSphere (cont’d)
Availability models for subsystems (cont.)
[Figure: RAID model; state labels UP, U1, DN, DNsrc, CP, R Bkp and RP; transition labels 2*lhd, lhd, lhd_src, lhd_dst, lsp, mhd, m2hd and m_bkp]
[Figure: OS model; state labels UP, DN, DT, DW and RP; transition labels lOS, dOS, bOSbOS, (1-bOS)bOS, asp and mOS]
Markov Availability model of WebSphere AP Server
[Figure: Markov availability model with states UP, UO, UN, UA, UR, UB, RE, 1N, 2N and 1D, annotated with failure detection (by WLM, by Node Agent, manual detection) and recovery (Node Agent auto process restart; manual recovery by process restart, node reboot, repair)]
• Application server and proxy server (with escalated levels of recovery)
• Delay and imperfect coverage in each step of recovery modeled
• Note the use of restart and reboot as a method of mitigation
Hierarchical Composition
[Figure: the top-level fault tree (System Failure; k of 12 over AS 1 through AS 12; proxies PX 1 and PX 2) shown together with the BS Failure and CM Failure fault trees and the Markov sub-models that feed them, such as the WebSphere AP server model and the power domain model]
A single monolithic Markov model will have too many states
Availability model of SIP on IBM WebSphere (contributions)
• Developed a very comprehensive availability model covering:
  ▪ Hardware and software failures
  ▪ Hardware and software failure-detection delays
  ▪ Software failover delay
  ▪ Escalated levels of recovery
  ▪ Automated and manual restart, reboot, repair
  ▪ Imperfect coverage (detection, failover, restart, reboot)
• Many of the parameters were collected from experiments, some were obtained from tables, and a few were assumed
• Detailed sensitivity analysis to find bottlenecks and give feedback to designers
• Developed a new method for calculating DPM (defects per million):
  ▪ Taking into account the interaction between call flow and failure/recovery
  ▪ Retry of messages (this model will be published in the future)
• This model was responsible for the sale of the system by IBM
Analytic Modeling Taxonomy
Analytic models
• Combinatorial models: efficiency, simplicity
• State-space models: dependency capture
• Hierarchical models: largeness avoidance
• Fixed point iteration: nearly independent
Scalable Model for IaaS Cloud Availability and Downtime [paper in IEEE Trans. On Cloud Computing 2014]
Decompose into Interacting Sub-models
SRN sub-model for cold pool
SRN sub-model for warm pool
SRN sub-model for hot pool
Import graph and model outputs
Model outputs:
• Mean number of PMs in each pool (E[#Ph], E[#Pw], and E[#Pc])
• Downtime in minutes per year
Many questions
• Existence of a fixed point (easy): IEEE TCC 2014 (in a more general setting: Mainkar & Trivedi paper in IEEE-TSE, 1996)
• Uniqueness (some cases)
• Rate of convergence
• Accuracy
• Scalability
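The mechanics of fixed-point iteration between interacting sub-models can be shown with a toy example. Everything in the sketch below is an assumption made for illustration: two two-state sub-models whose effective repair rate is reduced when the other sub-model is down. It is not the hot/warm/cold pool model of the paper. Each sub-model is solved using the other's latest output, and the outputs are exchanged until they stop changing.

```python
def submodel_unavailability(lam, mu, other_unavailability):
    """One sub-model: 2-state availability model whose repair is slowed
    (hypothetical coupling) in proportion to the other sub-model's unavailability."""
    effective_mu = mu * (1.0 - 0.5 * other_unavailability)
    return lam / (lam + effective_mu)

def fixed_point(lam1, lam2, mu, tol=1e-12, max_iter=1000):
    """Successive substitution: solve each sub-model with the other's latest output."""
    u1 = u2 = 0.0                                   # initial guess: never down
    for iteration in range(1, max_iter + 1):
        new_u1 = submodel_unavailability(lam1, mu, u2)
        new_u2 = submodel_unavailability(lam2, mu, new_u1)
        if abs(new_u1 - u1) < tol and abs(new_u2 - u2) < tol:
            return new_u1, new_u2, iteration
        u1, u2 = new_u1, new_u2
    raise RuntimeError("fixed-point iteration did not converge")

u1, u2, iterations = fixed_point(lam1=0.01, lam2=0.02, mu=1.0)
```

In this toy case the coupling is weak, so the iteration converges in a handful of steps; the questions listed above (existence, uniqueness, rate of convergence, accuracy) are exactly what cannot be taken for granted in a real model.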
Monolithic vs. interacting sub-models
[Figure: #states and #non-zero entries]
Analytic-Numeric vs. simulative solutions
[Figure: downtime (minutes per year)]
Analytic-Numeric vs. simulative solutions
[Figure: solution times (seconds)]
Analytic model
• The Markov model (CTMC) is too large to construct by hand.
• We use a high-level formalism, the stochastic Petri net (specifically the flavor known as the stochastic reward net, SRN).
• SRN models can be automatically converted into the underlying Markov (reward) model and solved for the measures of interest, such as DT (downtime).
• To improve scalability, we used decomposition and fixed-point iteration.
• For a very large number of PMs even the decomposed models are not enough; we resort to discrete-event simulation. The same SRN model can be simulated via our software package (SPNP).
Steps for system availability modeling
• List all possible component-level failures (hardware, software)
• List all failure detectors & match with failure types
• List all recovery mechanisms & match with failure types
• Allocate software modules to hardware units
• Formulate the model
• Face validation and verification of the model
• Parameterize the model (tables, websites, experiments)
• Solve the model (using SHARPE, SPNP or similar software packages) to detect bottlenecks, perform sensitivity analysis and suggest parameters to be monitored
• What-if analysis to suggest improvements
• Validate the model
Outline
Introduction Reliability and Availability Models Conclusions References
State of the Art of Reliability/Availability Models
• Techniques & software packages are available for the construction & solution of reliability and availability models of real systems
• System decomposition followed by hierarchical model composition is the typical approach
• Modeling has been used:
  ▪ To compare alternative designs/architectures (Cisco)
  ▪ To find bottlenecks, answer what-if questions, optimize designs and conduct trade-off studies
  ▪ At certification time (Boeing)
  ▪ At design verification/testing time (IBM)
  ▪ In the configuration selection phase (DEC)
  ▪ In the operational phase for system tuning/on-line control
Summary: System Availability Models
• Model types in use
  ▪ Non-state-space: reliability block diagram, fault tree, reliability graph
  ▪ State-space: Markov models & stochastic Petri nets, semi-Markov, Markov regenerative and non-homogeneous Markov models
  ▪ Hierarchical composition: the top level is usually an RBD or a fault tree; bottom-level models are usually Markov chains
  ▪ Fixed-point iterative
• Solution types
  ▪ Analytic closed-form
  ▪ Analytic numerical (using a software package)
  ▪ Simulative
• Software packages: SHARPE or similar tools are used to construct and solve such models
• Structural as well as parametric assumptions mean that the numbers produced should be taken with a grain of salt
Challenges in Dependability Models
Model Largeness (in spite of: hierarchy, fixed-point iteration, approximations)
Dealing with non-exponential distributions (in spite of SMP, MRGP, NHCTMC, PH)
Service (or user)-oriented measures as opposed to system-oriented measures
Combining performance and failure/repair
Model Parameterization
Model Validation and Verification
Parametric uncertainty propagation
Model Parameterization
• Hardware/software configuration parameters
• Hardware component MTTFs
• Software component MTTFs
  ▪ OS, IBM application, customer software, third party
• Hardware/software failover times
• Restart/reboot times
• Coverage (success) probabilities
  ▪ Detection, location, restart, reconfiguration, repair
• Repair time
  ▪ Hot swap, multiple components at once, DOA (dead on arrival), shared/not shared, field service travel time, preventive vs. corrective
• Uncertainty propagation: dealing not only with aleatory uncertainty (built into the system models) but also with epistemic (parametric) uncertainty
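Epistemic uncertainty propagation can be illustrated with a small Monte Carlo sketch. The assumptions here are loud ones: a two-state availability model, exponentially distributed times to failure and repair, and rate parameters drawn from gamma distributions whose shape is the number of observed events and whose scale is the reciprocal of the total observed time. The observation counts and times are invented for the example. Each sampled parameter set is pushed through the model, and the spread of the outputs gives an interval for availability instead of a single number.

```python
import random

def sample_availability(rng, n_failures, total_uptime, n_repairs, total_repair_time):
    """Draw one (lambda, mu) pair from the assumed gamma distributions and
    evaluate the 2-state steady-state availability mu / (lambda + mu)."""
    lam = rng.gammavariate(n_failures, 1.0 / total_uptime)
    mu = rng.gammavariate(n_repairs, 1.0 / total_repair_time)
    return mu / (lam + mu)

def propagate(n_samples=10000, seed=1):
    """Return the mean and an empirical 5%-95% interval of availability."""
    rng = random.Random(seed)
    # Hypothetical data: 5 failures in 50,000 h of operation, 5 repairs taking 10 h in total.
    samples = sorted(
        sample_availability(rng, 5, 50000.0, 5, 10.0) for _ in range(n_samples)
    )
    mean = sum(samples) / n_samples
    low = samples[int(0.05 * n_samples)]
    high = samples[int(0.95 * n_samples) - 1]
    return mean, low, high

mean_availability, low_availability, high_availability = propagate()
point_estimate = 0.5 / (1e-4 + 0.5)   # availability at the point estimates of the rates
```

With only five observed failures the interval is wide relative to the unavailability itself, which is one concrete reason the numbers produced by a model should be read with caution.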
Outline
Introduction Reliability and Availability Models Conclusions References
Selected References
• Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications, John Wiley, 2nd edition, 2001; revised paperback, 2016
• L. Tomek & K. Trivedi, Fixed-Point Iteration in Availability Modeling, Informatik-Fachberichte, M. Dal Cin (ed.), Springer-Verlag, Berlin, 1991
• Mainkar & Trivedi, Sufficient Conditions for Existence of a Fixed Point in Stochastic Reward Net-Based Iterative Models, IEEE TSE, 1996
• Trivedi, Vasireddy, Trindade, Nathan, Castro, “Modeling High Availability Systems,” PRDC 2006
• Trivedi & Bobbio, Reliability and Availability Engineering: Modeling, Analysis, and Applications, Cambridge University Press, 2017
• Trivedi, Wang, Hunt, Rindos, Smith, Vashaw, “Availability Modeling of SIP Protocol on IBM WebSphere,” PRDC 2008
• Smith, Trivedi, Tomek, Ackaret, Availability analysis of blade server systems, IBM Systems J., 2008
• Trivedi & Sahner, SHARPE at the Age of Twenty Two, ACM SIGMETRICS Performance Evaluation Review, 2008
• Trivedi, Wang & Hunt, Computing the number of calls dropped due to failures, ISSRE 2010
• Mishra, Trivedi & Some, “Uncertainty Analysis of the Remote Exploration and Experimentation System,” Journal of Spacecraft and Rockets, 2012
• Ghosh, Longo, Frattini, Russo & Trivedi, “Scalable Analytics for IaaS Cloud Availability,” IEEE Trans. on Cloud Computing, 2014
SHARPE portal
https://sharpe.pratt.duke.edu
Thank you!
Green book Outline
• Part I: Introduction (Chapters 1 to 3)
• Part II: Non-state-space Models (Chapters 4 to 8)
• Part III: State-space Models with Exponential Distributions (Chapters 9 to 12)
• Part IV: State-space Models with Non-Exponential Distributions (Chapters 13 to 15)
• Part V: Multi-Level Models (Chapters 16 to 17)
• Part VI: Case Studies (Chapter 18)