Ricky W. Butler and Sally C. Johnson. March. 1990 ..... l. 10 0. 101 l0 s. 103. 104 l0 s. Time (hrs). 10 6 l0 T. Figure. 3: Probability ...... Daniel P.; and Swarz, Robert.
NASA
Technical
Memorandum
102623
t=.
THE ART OF FAULT-TOLERANT RELIABILITY MODELING
Ricky W. Butler
March
and Sally C. Johnson
1990
(NASA-T_-1026_3) SYSTE_
SYSTEM
R_LIA_ILTTY
THL
ART
M(3,q_LING
Of:
FAULT-TOLERANT (NASA)
1._? CSCL
N90-22292 p O¢P, G3/0_
Nalional Aeronautics and Space Adminislration Langley Research Center Hampton, Virginia 23665-5225
Unclds 0280231
Contents 1
Introduction
2
Introduction
3
Modeling
4
5
1 to Markov Non-Reconfigurable
3.1
Simplex
3.2
Static
3.3
N-Modularly
3.4
Fault
Modeling
2
Modeling Systems
3
Computer
................................
4
Redundancy
................................
5
Redundant
Tree
Analysis:
System
Reconfigurable
4.1
Degradable
4.2
Triad
Triad
4.3
Degradable
4.4
Triad
with
4.5
Some
Observations
to Simplex Quad One
.........................
A Few Comments
7
.....................
7 10
Systems .................................
10
.................................
12
.................................
13
Spare
................................
About
14
Multi-Step
Fault-Error
Handling
Models
.....
14
Reliability Analysis Programs 5.1 Overview of SURE ................................
16 16
5.2
19
5.3
5.4
Model-Definition
Syntax
.............................
5.2.1
Lexical
5.2.2
Constant
5.2.3
Variable
5.2.4
Expressions
5.2.5
Slow
Transition
Description
.......................
21
5.2.6
Fast
Transition
Description
.......................
22
5.2.7 SURE
FAST Exponential Transition Commands .................................
5.3.1
READ
5.3.2
RUN
Command
..............................
24
5.3.3
LIST
Constant
..............................
24
5.3.4
START
5.3.5
TIME
5.3.6
PRUNE
Overview
Details
..............................
Definitions
...........................
Definition
19
............................
20
................................
Command
Constant Constant
of the
19
and
21
Description
...............
.............................
24
.............................
25
.............................. WARNDIG
STEM
and
Constants
PAWS
Programs
22 24
25 ................... ..................
25 25
6
Reconfiguration
by 6-plex
26
Degradable
6.2
Single-Point
.................................
6.3
Fail-stop
6.4
Dual-Dual
.....................................
6.5
Degradable
Quad
Failures Dual
Reconfiguration
8
Degradation
6.1
...............................
with
By with
Two
Cold
7.2
Triad
with
Two
Warm
The
ASSIST
_.2
32 34
Partial
Fail-stop
or Self-test
...............
35
Sparing
Triad
Model
Abstract
28
...................................
7.1
8.1
26
38
Spares
...........................
Spares
..........................
39
Language
42
Specification
Language
Syntax
38
............................
42
Statement
43
8.1.1
Constant-Definition
......................
8.1.2
SPACE
Statement
.............................
8.1.3
START
Statement
.............................
8.1.4
DEATHIF
44 44
Statement
...........................
45
8.1.5
PRUNEIF
Statement
...........................
45
8.1.6
TRANTO
Statement
...........................
45
8.1.7 Model Generation Algorithm ....................... Illustrative Example: SIFT-Like Architecture .................
Reconfigurable
Triad
Systems Spares
49
9.1
Triad
with
Cold
9.2
Triad
with
Instantaneous
9.3
Degradable
Triad
with
Non-Detectable
9.4
Degradable
Triad
with
Partial
10 Multiple 10.1 Two
47 48
.............................. Detection
50
of Warm
Spare
Spare
Detection
Failure
Failure
of Spare
..........
.............
Failure
52 53
...........
54
Triads
57
Triads
with
Pooled
(',old
Spares
10.2
Two
Triads
with
Pooled
Cold
Spares
10.3
Two
Degradable
Triads
with
Pooled
Cold
10.4
Two
Degradable
Triads
with
Pooled
Warm
10.5
Multiple
Non-degradable
10.6
Multiple
Degradable
Triads
with
Pooled
Cold
10.7
Multiple
Degradable
Triads
with
Pooled
Warm
10.8
Multiple
Competing
Recoveries
Triads
...................... Reducable
with
Spares
.........................
Triad
................
Spares
Pooled
ii
57 to One
Cold
Spares Spares
58 60
............... Spares
........
62 ...........
63
.............
64
.............
66 67
11 Transient
and
Intermittent Behavior
Faults
72
.............................
72
11.1
Transient
Fault
11.2
Modeling
Transient
11.3
Model
11.4 11.5
Degradable NMR with Transients FTP ........................................
11.6
Modeling
of Quad
Faults
Subject
Intermittent
.............................
to Transient
Faults
and
73 Permanent
Faults
........................ ..........................
84
Failure Dependency ................................ Two Triads with Three Power Supplies
12.4
Byzantine and
Sequence
13.1
Failure
13.2
Recovery
14 Sequences 14.1 Phased
15
88 91
12.2 12.3
13 Time
Rate Rate
Dependencies
113 ............................ .........
113 ..................
of Reliability Models Missions ..................................
Non-Constant
Failure
14.3
Continuously
Varying
92 107 109
Dependencies
14.2
Concluding
.....................
.................................
Dependencies
75 77 81
12 Modeling Control System Architectures 12.1 Monitored Sensors ................................
Faults
..........
114 114 114
Rates
...........................
119
Failure
Rates
123
.......................
124
Remarks
ooo
nl
1
Introduction
The
rapid
cated
growth
of digital
computer
systems
capable
requirements
for computer
range
10 -r
of 1 -
are often
per
expressed
electronics
technology
of achieving
systems
mission,
used
and
very
led
high
in military
reliability
for flight-crucial
has
to the
reliability
aircraft,
systems.
still
ensure
susceptible that
The testing
to failure.
requirements
reliability
are
analysis
approach such
is clearly
impractical,
as light
investigate
reliability
the
To achieve
system
that
estimated
systems which
manufacturing high their
enable
replacing
tolerance. them
often
utilized
most
of the
reliability.
units with
detailed Only
the
reliability
their own faults; i.e. faults, these systems must
be evaluated
to
highly
or dissimilar
spares
reliable
the process
instruction-level "macroscopic"
goals
of
methodology of the
order
generally
that
1 -
taken
to
of failing with
of removing
activities
events
faulty
Since
techniques,
ultrasuch
as
to achieve
components
and
is another
either method
Fortunately,
directly
be included
of
current
to meet
redundancy.
do not
models the system
same function,
configuration,
must
and
reliability
the
itself.
models
components.
of more
of a system
model
formulating failure
redundancy
for computing
the overhead
on the
Mathematical
adequate
use
parameters.
is complex, task.
lead to system
to an alternate
fault-related
specified
dependent
system
systems
algorithms
without
life-testing
approach
the
be a difficult
circuitry
or degrading
reliability
and
strongly
presence
produce
The
reliability
A life-
is
reliable
can
in the
cannot
of a diversity
model
the processes
operation
problem.
The
the model
highly
behavior
is a complex
of the system
of the upon
Reconfiguration,
to increase
system
is consequently
capture
requirements,
redundant fault
that
techniques
reliability
parallel
based
reliability
must
optimistic
flight
or "lifetime"
devices. with
is necessary.
model
parameters
reliability
represent
fault-tolerant capabilities
the
system
systems
reliable
of a fault-tolerant,
accurately
such
systems
reliability
electronic
approach
reliability
system
the behavior
and
the
for computer
a mathematical
3. compute
Since
batteries,
of a highly
or estimate
in the
10 -9 for a 10-hour
tolerate certain
of these
computer
to determine
an alternate
2. measure
The
used
bulbs,
hence,
reliability
of a fault-tolerant
though,
10 -7 or higher;
the
Reliability
are typically
met.
is typically
products
1. develop
Thus,
requirements.
of 1 -
goals, computer systems have been designed to recognize and fault-tolerant computer systems. Although capable of tolerating are
of sophisti-
for example,
requirements
avionics
proliferation
affect
in the
system
reliability
model. Furthermore, as much done
experimentally
experimentation
is to carefully
testing
as is required develop
the
reliability
the
correctness
for life testing! model
and
of the model Consequently, subject
would
require
the best
it to scrupulous
that
at least can be
scrutiny
by
a team best,
of experts. should
The
be called
process
an art.
of reliability
modeling
is thus
It is the goal of this paper
not
an exact
science,
to look into this "craft"
and
at
of reliability
modeling. The
paper
is structured
Consequently,
elementary
more
concepts.
complex
Markov
models.
systems
are developed.
are explored. models and
the
language. and
allows systems Then
is introduced. in more a new
This
is necessary
If they were the
reader
consisting explores
including
sensors,
dependencies,
2
used
the phased
been
the
accomplished
reliability
analysis
missions,
The time.
system
analysis
using
combinatorial
is thus
A particular
on such
attributes
are typically
number of spare more attributes typically
tries
related behavior transition time
with
the
SURE
in the
later
and
sections
ASSIST
intermittent
faults
the
components
of control
complex
are investigated.
topics,
The
next
architectures
such
as sequence
are presented.
Modeling
of a complex
system
mathematics. mathematics.
consisting The
more
powerful
of a vector
attributes
is called
of many
standard
Unfortunately,
as consisting
characteristics
ex-
language
Next,
system
models,
large
model
are presented.
specialized rate
very
of the
of reconfiguration
failure
At this ASSIST
are
manner.
and
the
language.
of the
economical
some
SURE
reliability
transitions
forms
Finally,
the
models--the
expressiveness and
non-constant
components
"fault the
tree"
fault-tree
has
method
of
approach
is
In reconfigurable systems reconfiguration process. It
Markov
modeling
of attributes a "state"
which of the
over These
as the
processors,
the
which have not been removed, etc. complex the model will be. Thus,
The one
can
step in the modeling Since this transition
of working
change
system.
such
that
number
technique.
units more
set of attributes
The next to another.
input
reliability
the states The
systems
program
to solve
various
to model
of the
smallest
of the system. from one state
presented
of
of reconfiguration--degradation
the help of the
in a succint
units, the number of faulty included in the model, the
to choose
types
for describing
models
using
represented system
the two basic
aspects
non-reconfigurable
the computer
can be used
where reconfiguration is possible. the effectiveness of the dynamic
systems
set of values
models,
[8], which
Markov
reliability
such
reconfigurable
etc.
and
to
to model
in modeling
transient
used
incapable of analyzing systems the critical factor often becomes is necessary
used
using
actuators,
is based
techniques
exhausted.
triads
to model
Introduction
Traditionally,
become
techniques buses,
of essential the
complicated
models.
by increasingly
an overview
by enumerating
to be defined
of multiple
the techniques
section
the
of reliability
followed
for modeling
detail
since
are
used
language
defined
would
models
begins
as a catalog which
with
Next,
examined
than first
techniques
Evaluator)
introduces
complex
paper
on to more Range
rather
introduced
basic
paper
complex.
haustively,
the the
pressing
Unreliability
sparing--are
style
are
the fundamental Then,
Before
numerically,
point,
Thus,
Next,
(Semi-Markov
in a tutorial concepts
accurately
describe
the
fault-
process is to characterize the time is rarely deterministic,
the
transition
times
Certain
states
behavior must
are described in the
or correct
represent
difficult
because
software
state,
system
operation
system
reliability
system
made.
of the
For example,
of two faulty faults
may
simplifies
not
actually
the model
to be modeled. conservative the
is the
primary
model
may
then
the
the
wishes
then
corrupt
in such
data
probabilities to say the
reliability
between
the
the
voter.
the faulty than
modeler
the
since This
the
two
assumption
does
value,
only
are
the presence
pair
some
assumes
or
that
assumptions
is conservative
is no better
For example,
conservative
of computers,
a way as to defeat
of collusion
is
events,
to say
conservative
This
the
transitions
behavior
reliability
a model of the
recovery
transitions.
can
be obtained
transition
rates
model
not have then
certain
be measurable.
for a system.
system,
transitions
paper
become which some
fault
the
for fault-tolerant
N-modularly-redundant
from
non-
parts
of
This often
Although
if it depends
the
a particular
upon
injection.
(NMR)
from
unmeasurable categories:
model
and/or
primary
unobservable.
and
two
are
to fault problem
prop-
217C
cal-
arrivals
and
is to properly
transitions.
If the
If the
is too
model
slow
defined
MIL-STD
responses
model
is
detailed
can be exorbitant. modeling
assumptions
have used
been
introduced.
in the
development
The of
systems.
Systems
are non-reconfigurable modeling by describing
a single
system.
of the
of these
Non-Reconfigurable
ranging
states
The
of reliability
computer
fall into
field data
determination
methods
model
to system
be measured
types of systems to model basic elements of reliability systems
If the
experimentally must
of the issues
is to explore
system
correspond
using
so as to facilitate
Modeling
reconfigurable
developing
the
experimentally
models
The simplest troduces the
in the
of a fault-tolerant
fast
of transitions
of this
when
describe
In this introduction
3
either
If one
failure.
all of the transitions
faster
system
number
reliability
value,
system failure
of external
to make
failure.
to be system
slow transitions
The
too coarse
goal
a specific
for the system
function
is considered
are made.
and
can be measured
the
is forced
is system
chosen
fault-free
it is useless.
the
transitions
culations. model
than
complex
modeler
represent
constitutes
system
consideration
Typically, then
what
model
what
(TMR)
wishes
that
elegantly
parameters,
erly,
The
exactly
others
can fail.
It is important
failure
The
about
while
redundant
assumptions
system
of faults.
an extremely
modular
since the
If one
failure,
Defining
state.
is higher
in a triple
computers
presence
is often
hardware
assumptions
distribution.
system
properly.
failure
non-conservative
a probability
represent in the
failure
system and
using
simplex
computer
systems. This section inhow to model simple nonthrough
a majority-voting
Figure 1: Model of a Simplex Computer 3.1
Simplex
The first
Computer
example
variable
is a system
representing
for T, say computers,
consisting
the time
of a single
to failure
F(t). Typically, it is assumed fail according to the exponential F(t)
The parameter modeling is the
A completely failure rate
computer.
of the computer.
First,
Next,
that electronic distribution:
= Prob[T
< t] = 1-
we let T be a random
we must
define
a distribution
components,
and
consequently
e -at
defines this distribution. An important (or hazard rate), h(t) defined as follows
concept
in reliabihty
h(t) = For
the
exponential
distribution given
in figure
simplex
distribution,
with
a constant
1. In this Markov
computer
is working,
has
and
computer
the
failed,
the
For reliability higher have
failure been
reader
rates
rate
transition
of estimating
in a computer of the individual
the chips in the of the computer, the
computer
the chips.
Fc(t)
is determined
Some before
failure
1 to state
that
product
delivery;
suppose
rate
11, _2,...,
may
Once
is simply _.
the the
represent
discussion reliabihty
as follows:
of the
the failure
T be a random variable representing the the time the ith chip fails, the distribution
of and fail
a somewhat
mature
sum
the
simplex
components
distribution
complete
is
occurrence
exhibit
however,
exponential
components. failure
the
are exponential
electronic
devices
[2] for a more
the
only
system
in which
in which
model
immature to the
of electronic
state
state
of a Markov assumed
is the this
2 represents
according
computer's
exponential
representing
the operational
handbook
To see this,
computer. Letting and Ti represent
system
217D
reliability the
state
The
model
to fail
MIL-STD
is known,
the
it is generally testing
= ._.
1 represents
from
distribution.
experimentally to the
state
h(t)
Markov
The transitions hazard rate.
due to insufficient
shown
rate The
2 represents
purposes,
exponential
is referred
problem chip
modeling
to the
rate.
model,
state
the failure of the simplex computer. thus can be labeled by the constant according
hazard
hazard
devices [1].
The
on the of each failure rates
of
time of failure of failure for
3._
C
Figure
2: Model
v (t)
If we assume
that
the
chips
is an exponential
The
above
Prob[T
=
Prob[min{T_,T2,...,Tn}
=
1-
examined
system
in the
following
3.2
Static
The
Triple-Modular
tectures. tions
The
a failed
the
computers
system
consists
the
same
computer
are assumed
single failure
failure does not does not occur state
state
to being
State
computers
In figure failure
rate
used
=
1-
]-Ii"=l exp(-Ait)
---
I-
exp(-
V TM ,--,/=1
failure
rate
for parallel
the
time
> t] Ait)
--,i=_ redundant
that
the
1 to state fail.
The
system 3, the
The affect
first
systems. chip
the
fails.
The
time
Such
of failure
systems
will be
computer
archi-
are assumed
working
failure
exactly
computer.
(not
same
assumed
included
in this
that
computa-
isolated
Mathematically,
It is further
system
the
to be physically
such
therefore,
the
outputs
model),
and
condition
of three
3)_ to represent 2 when
there
because It can
rate
one processor of the
failure
be seen
that 5
computers.
at which has
are only two working
a majority
of system
the
working
computers
computers
as a function high
failed. in the
of mission
reliability
The
any
one
The that system time
is strongly
are
thus
its erroneous value to the external world. Thus, computers fail. The model of figure 2 describes
is in state
probability
)_ is lO-4/hour.
all performing
another
initial
2A since
hull-tolerant
computers
external
2 is labeled system
simplest
computers
to fail independently.
propogate until two
3 has rate
is one of the
of three
by the
1 represents
2 to state
3 represents
I]_'=_ Prob[Ti
(TMR)
inputs.
cannot
prior
from
1 -
work
> t]
sections.
Redundant
voted
system.
not
T,
we have
=
with
merely
> t, ....
Redundancy
on exactly
that
does
< t]
> t,T,
fail independently,
is not
System
< t]
Prob[T1
distribution
technique
of a redundant
of a TMR
=
Fo(t)
which
2_
a
system such a
transition
of the
transition
from
can fail. have
three State
failed.
is given.
The
dependent
on
10o 10-I 10-2 lO-S P.f
10-4
lO-S 10-6 lO-r lO-S 10 0
101
l0 s
I
I
I
l
10 3
10 4
l0 s
10 6
Time
Figure
3: Probability
(hrs)
as a Function
6
of Mission
Time
l0 T
Figure 4: Model of 7MR System a short mission time. It should be noted that it was implicitly assumedthat the system starts with no failed components(i.e. Prob in state 1 at time 0 = 1). This is equivalent to assumingperfect maintenancebetweenmissions. 3.3
N-Modularly
The
assumptions
The
voter
processors describes
of an
used
in such not
of mission as a function
Of course,
this
3.4 Fault
trees
used
for
(This
system
will not
more
from
with
voters.
the
the
same
a majority
voter--as
operational.
The
voter.
5.
The
Figure
the
practical
6 shows
problem
as for a TMR long
the
of building hardware
of
model
4)
(figure
of system
failure
unreliability
of system
failure
an
as a
of an NMR
_
0 as N ---, _.
arbitrarily
would
system.
as a majority
following
probability
probability
additional
A the
Few
large
significantly
N-way
increase
the
Comments
reliability
before succeed
fault
analysis
Fault
fault-tree
Markov
failure
in more
of complex
Compiler
systems solver
that
whenever
which
next
with
the
modeling quite
(see [4]). possible,
approach
efficiently
using
a reliability combinatorically
because 7
overcome
section.)
a limited
Thus,
system
for many
The
years.
Basically, removal
section
could
attempts the
combinatorial
to solve
can be solved
in this
the
could
in the
only be used
examples
systems,
detail
can be calculated Tree
all of the
occurs
system
of the fault-tree method. is no reconfiguration--the
be calculated
state-space
can
Thus,
In reconfigural_e
another cannot
trees
system.
tree.
will be explored
solutions
nonreconfigurable
the
a fault
powerful
Although
cient
the
Analysis:
been
components
Langley's
in figure
in hardware,
component
modeled
natorial
a 7-way
are
it is important to recognize the limitations can be used to model a system where there
of a faulty faulty
with
is still
Theoretically,
ignores
Tree
have
Nevertheless, a fault-tree been
is usually
system
is given
of N.
system
failure rate ._. If implemented in software, the CPU overhead could be enormous, increasing the likelihood of a critical task missing a hard deadfine [3].
Fault
The
time
If implemented
processor seriously
the
system
model
redundant
a system
failed
a 7-processor
system
System
N-modularly
have
function
voter.
Redundant
capabilities
of the
probability fault
have
to remove
tree
that
the
approach.
is needed. class
of problems,
a fault-tree engineer
solver
such
be wise fault-tree
combi-
as NASA
who frequently
would
the combinatoric
these
to use
analyzes an effi-
computations
10°
Pi
....
.....................................................................
10-10
10-2o 10 °
101
10 2
10 3 Time
Figure
5: 7MR
Unreliability
10 4
I 10 5
I 10 6
(hrs)
As a Function
of Mission
Time
10 7
10 0
10 -10
• "...,
10-20
10 -30
2
I
I
I
I
I
I
4
6
8
10
12
14
N
Figure
6: NMR
Unreliability
As a Function
of N
16
are typically
4
more
efficient
Modeling
Fault-tolerant come
to repair
The
The the
are two basic
In this
in later
4.1
Degradable simplest
tion.
The The
The
that
processor
with
analyzes state reason that
a separate
are
the
from the
the
described
transition
non-exponential models. Such
not
the
1 to state
of a faulty
of a
methods
models.
There
with
component
spares. without
sel of components. and
concepts.
state.
their
They
The
replacement
with
will be explored
and
(i.e.
the
2 the
system
in
time
is not
from
2 to state
state
rate.
processor the
programs
The
with
does not fail before good
results
from
faulty
[1].
which
10
bad.
processor.
than
Thus,
a rate.
process
can be used This
is complete. there
is also
system
state
2 to
Reconfiguration
rather
t is F(t).
of each
The from
has the
is simple--the
processors.
the diagnosis
processors.
failure
transition
The
revealed transition
probability
The
model to the class pure Markov models.
two good the
three
Consequently,
meaning
system.
operational.
the
processor.
F(t))
4 is less than
will be discussed
is operational
The
reconfiguration
exponential
failure
one failed
(e.g.,
of the
processors
to represent
the
by degradatriplex
of any of the
have
of the
To increase
degradable
all three
has
function
measurement
a constant
reconfigure
the problem.
reconfiguration)
a distribution
system.
which
failure
we do not
diagnoses
triplex
of a simple 1 with
2 represents
At state
is the
designed
in state
are identical,
the voter
been
behavior
begins
of recovery by
system
distinguish
the
replacement
a degraded
voting
have
transition generalizes the mathematical models are far more difficult to solve than
long as a second could
system
with
time
4 the
removal
and
reliability
and
components these
majority
systems
experimental
section, several computer semi-Markov models. At state
with
of faulty
introduce
7 describes
removal
labelled
distribution be
to complex
removal
continues
removal
upon
processors
for this is that
cannot that
state
the
4 represents
transitions
based
triplex
the errors
the
triplex
of figure
from
since
permanent
or physical
component
Triad
degradable
transition
Note
lead
logical
faulty
Reconfiguration
sections.
system,
model
the
we will briefly
architecture
of the
can
the
the
strategies--degradation
system
both
section
detail
rehability
and
of reconfiguration.
involve
to identify
greatly
reconfigured
a strategy
always
used
involves
involves
greater
The
vary
using
but
of reconfiguration
The
method
a spare.
varieties
method
replacement.
solutions.
Systems
designed
techniques
system
types
degradation
sparing
are often
in many
component.
used
are Markov
Reconfigurable
systems
strategies faulty
than
presence
of a
of semi-Markov In a subsequent
to solve transition Otherwise, a second
Markov
and
occurs
as
the voter transition
r(t)
2_
F(t)
Figure
7: Model
of Degradable
11
Triplex
F(t)
Figure
8: Model
from
state
2 to state
3 which
rate
of this
transition
is 2_,
since
an
absorbing
3 is a death near-coincident
state (i.e. failure.
At state in the state
active 5.
The
7 and state
4.2 The
model
figuration when was
presented process
either the
one
such
is no way
self-test other
As before
of figure
processor
ending
state
where
the the
failure
processors
processors between
and
6.
is one
taking
State
this
due
to
processors
the
system
to
ending
6 is thus
another
good
processor.
of reaching
State
process
remaining
of the last
probability
fail
system
no faulty
reconfiguration
in state
there
failure
and
may
the
of the
The
fail.
processor.
At state
death
8 there
state
is often
of parts.
in the previous the
dual
with
100%
the dual
failed, is not
the horizontal
the
was system
When
unrealistic.
to be perfect. failure
mode,
8 describes
was unrealistic
simplex
accuracy.
diagnosis
to diagnose
section
to the
two processors
for this
option--avoid
good
processor. could
Simplex
an assumption
programs
model
remaining,
two
of these
of a second
two processors
represents
occurs
7 to 8 represents
by exhaustion
from
of the
faulty
processors, there
to
one
System
failure
remaining
which
with
of a second
state
processors
Triad
state)
a race
to Simplex
coincident
of the
7 is the operational
from
to as failure
the
either
Either again,
failure
state
represents
is operational
5, once
the
and
no good
referred
The
state
transition
are
system
configuration.
At
in state death
4 the
of Triplex
in a dual i.e.
degrade
in one major
assumed diagnosed
using
which
a majority
However, In a later
respect--the
to be perfect.
with
only
section
system.
In this
directly
from
of the voter
recon-
In other two
on
processors
three
two good
a triplex
or more
processors,
we will explore section,
words,
the
use of
we will explore to a simplex
an-
system.
this system.
transitions
represent 12
fault
arrivals.
The
vertical
transition
repre-
F (t)
(
2_
._
F Ct)
Figure
sents than
system recovery. a rate to indicate
1 to state active
fails,
2_ competes faulty
4.3
the
plus
one
consists
is in state
of the
only one
9 describes one
of removing three
Quad
active
two. the
processors
Before
can fail.
When
occurs,
there
are still
two
state
3 with
rate
reconfiguration
transition
transition.
that
from
state
Reconfiguration
working
processors.
processor
remains
2 to death
consists
Thus, in the
function rather rate from state
the active
one of those
of discarding
transition
rate
both from
the
state
4
configuration.
Quad
in figure When
are three
fail; thus,
recovery
Degradable
processors. of the
can
5 is _ because
model
there
system
that
with
processor
to state
The
the
processors
of a Degradable
The recovery transition is labelled with a distribution that the transition is not exponential. The transition
2 is 3_ because
processors
9: Model
remaining
the
of those faulty processors
a degrad,_ble four
quad.
processors
processor,
fails
leaving
fails (state
This (state
a triad
5), the
system 2),
starts the
of processors
reconfiguration
with
four
reconfiguration (state process
4). entails
working process When
one
removal
of the faulty processor plus one of the working processors, leaving a simplex (state 7). Note that a different function is used for the transition from state 5 to state 7 than from state 2
13
3)`
,(
2)`
,@
3)`
,@
F(t)
( Figure
to state function
4.4
4. This is necessary of the state.
Triad
with
In the previous
models
continued
operation
duction to the This technique Suppose inactive.
The
10: Model
if the
One
model
degraded
a triplex of figure
processor
state
are 2 active
2, there
and
levels
sibility
which
has
this
experiments
and
hence
its rate,
varies
as a
its replacement
working
processors
one
have
been
with that
processor
spare
a spare
that
processor.
provides with
does
the system
a brief
a spare not
intro-
processor.
fail
when
it is
can fail; thus,
performed of the overall
While
the rate
the
and one isolation
system
is in
of the transition
to
there are once again three active processors 4 to state 5 is 3)`. This model assumes the the next
Multi-Step
14
section
and
there are two good processors 4 represents the detection and
to simplex upon system failure.
About
the distribution
This
processor
system.
the situation where from state 2 to state
Some Observations Models
of measuring
Spare
a faulty
replacing a faulty in section 7.
system
system does not immediately degrade in duplex until the next failure brings
Fault-injection
removed
of redundancy.
death state 3 is 2),. After reconfiguration occurs, that can fail; thus the transition rate from state
4.5
process,
process
10 describes
As before, state 2 represents faulty processor. The transition of the faulty
One
,_@
Spare
the reconfiguration with
with
reconfiguration
technique of sparing, i.e. will be explored in detail
we have
of Triplex
2)`
failure;
but
rather
Fault-Error
operates
Handling
at NASA
Langley
demonstrating
recovery
process
[5!. Recovery
the feaprocesses
havebeenshown to be nonexponential[5, 6], and distributions used
for recoveries
to estimate
ditional
mean
and
of a recovery Before with
the
from
different
types
and
standard
deviation
mean
standard
transition. White's
deviation
For more
solution
nonexponential
recovery
of faults.
are used was
transitions
system Fault
upon
SURE
was
the
extremely
even
difficult.
different
experiments
may
and
to define
section
solution
exhibit,
recovery,
program
see [7, 8] and
developed,
may
injection
conditional
in the
information,
technique
a given
the
this
be con-
behavior
5. of semi-Markov
Some
models
modelers
have
been
willing to make the assumption that the recovery process behaves according to an exponential distribution in order to facilitate the solution of the model. However, it should be recognized that
such
an assumption
Another
approach
compose models models Some sent
the
reliability
process.
fault-error
to detect, However. directly
process
use some the overall
that than
time
isolate,
and
models,
a system
process
parameters
now that observable
handling
steps
a fault-occurrence
was designed
could that
a single are
recover
parameters,
from
not
and
models
CARE
III single-fault
to accomplish
to enable exponential
a more
the
overall
accurate
was to de-
process.
However.
observable
or measurable.
is directly
observable,
the
can
be extremely
solution
an approach
15
difficult
technique is unnecessary.
model,
repre-
reconfiguration
representation
directly
a fault
semi-Markov such
as the
performs
of reconfiguration
an efficient
such
model
quantity.
semi-Markov
handling reliability.
individual
into
of unknown
of solving
19, 10]. Coverage parameters derived from the solution of the fault-error are inserted into the fault-occurrence model in order to compute the system
reconfiguration while
model
error
the necessity
handfing
This multi-step
models
a modeling
to avoid
a set of fault-error
of these the
introduces utilized
these
individual that
multi-step
For
example,
times
required
to measure
is available
of the
accurately. requires
only
5
Reliability
Before
we proceed
analysis PAWS
further
program
as well be
into
5.1
Overview
to the
becomes
to solve
insight
Probably
This
the
the
nature
of
the easiest SURE
is desirable
to present
is used
described
reasons: them
in the (1)
model
for the
in section SURE
As the
graphically
of any
being
for the SURE
reliability STEM
input
models
and
language increase
(2) these
parameter.
and
5.4.
This
in
programs
will provide
modeled.
SURE
way to learn
program
language
will be presented
as functions
systems
language
are
for two
impractical
of the
input
programs
the models
models
the input
same
These
of this paper,
it soon
used
The
programs.
graphically.
complexity,
in the art of modeling,
analysis
remainder
as
Programs
will be presented.
reliability
In the
can
Analysis
for the
the
input
SURE
degradable
quad
values
to identifiers.
language
model
is by way of example.
in figure
The
input
9 is:
LAMBDA = IE-4; MUI = 2.7E-4; SIGMA1 = 1.3E-3; MU2 = 2.7E-_; SIGMA2 = 1.3E-3; 1,2 2,3 2,4 4,5 5,6 5,7 7,8 The the
= = = = = = =
4*LAMBDA; 3*LAMBDA; ; 3*LAMBDA; 2*LAMBDA; ; LAMBDA;
first five statements processor
standard
failure
deviation
equate rate.
The
of the
SIGMA2 are the mean and The final seven statements fault
arrival
process
then
only the
defines
a transition
transition
is a fast
recovery
be given. For example, state 2 to state 4 with
following model
is an illustrative description
has
from process
identifiers from
state
exponential state then
stored
rate
7 to state the mean
the statement mean recovery interactive
been
two time
MU1
and
2 to state
LAMBDA
SIGMA1 4.
The
are
represents
the
mean
and
MU2
and
identifiers
standard deviation of the recovery time from state 5 to state 7. define the transitions of the model. If the transition is a slow
statement must from
next
recovery
The first identifier
2,4 time
session
8 with
rate
SURE
the
A (or 1 × 10-4/hour).
The
of the recovery
last If the time
above defines a transition deviation SIGMA1. The
to process
TRIADP1.
For example,
deviation
= MU1 and standard
in a file named
16
be provided.
and standard
using
prompt.
must
the user
above input
model. follows
The the
?
$
s_re
SURE V7.4 i? read
NASA Langley
Research
Center
triadpl
2: LAMBDA = 1E-6 TO* 3: MU = 2.7E-4; 4: SIGMA = 1.3E-3; 5: 1,2 = 3*LAMBDA; 6: 2,3 = 2*LAMBDA; 7:2,4 = ; 8:4,5 = 3*LAMBDA; 9:5,6 = 2*LAMBDA; 10:5,7 = ; 11:7,8 = LAMBDA; 12: TIME = i0;
1E-2 BY
0.05 13? run
SECS.
MODEL
= _riadpl.mod
FILE
LAMBDA
TO READ
MODEL
FILE
SURE
UPPERBOUND
LOWER.BOUND
V7.4
24 Jan
10:16:21
90
COMMENTS
1.77002e-14 3.12024e-12 1.66224e-09 1.51644e-06 1.26292e-03
1.68485e-14 3.00387e-12 1.61714e-09 1.45575e-06 1.23551e-03
l.O0000e-06 1.00000e-05 1.00000e-04 1.00000e-03 1.00000e-02
10;
RUN
#I
3 PATH(S) TO DEATH STATES Q(T) ACCURACY >= 14 DIGITS 0.633 SECS. CPU TIME UTILIZED 147 157 plo_ xylog 167 exi%
The
first
noted
statement
that
program the
_ is defined
specified
range.
the program
generated
by this
the
skipped
on
clear.
The
next first
perform
Statement to plot command.
logarithmic
In
READ command
_s a variable
to automatically
directs using
uses the
over
to input
a range
a sensitivity 12 defines
the
output
The
the
on the
the model
of values
description
in this file.
analysis
as a function
mission
time
graphics
XYLOG argument
SURE
the
to plot
be
SURE over
Statement
11 shows the
the
X and
15 graph
Y axes
scales. subsections, reading
following
All reserved
directs
of this parameter
Figure
more and
used
conventions
detail as will
is presented.
a reference be used
when
These
subsections
something
to facilitate
the
words
will
be
capitalized
in typewriter-style
17
can
is encountered
description
language: 1.
This
to be 10 hours.
device.
causes
file. It should
print.
of the
probably that SURE
be is not input
le+
00
le-
10
le - 20 le - 07
Figure
I
le - 06
11: SURE
I
I
le - 05 le - 04 Lambda
program's
18
I
le - 03
plot, of output
le - 02
2. Lowercasewordswhich are in italics indicate items which are to be replacedby something definedelsewhere. 3. Items enclosedin squarebrackets r] can be omitted. 4. Items enclosedin braces{ } can be omitted or repeatedas many times as desired. 5.2
Model-Definition
Models
are
program
defined
transitions
and
standard
5.2.1
fast
and
must
means
and
standard
all of the
transitions
slow transitions.
25,000.
numbers. The
termination
If the
limit
instead
may
integers
SURE
user
are multiple
transition
can
between
be
increased
and recompihng. are floating point
be entered notation
Constant user
If there
transition
must
say p
the
competing
mean
transitions
along
with
the
is used
on a line. "(*"
Comments
The
SURE
the
the MAXSTATE
simply
by
The transition numbers. The
for statement
indicates
0 and
may
beginning
program
changing
equate
MAXSTATE
Therefore,
be included
any
of a comment the
user
more
place and
for input
means and is used for
that
"*)"
than
one
blanks
are
indicates
the
by a line number
mark.
numbers numbers.
to identifiers.
Thereafter,
these
constant
For example,
also
be defined
in terms
of previously
GAMMA = IO*LAMBDA; the
the
rates, conditional Pascal REAL syntax
termination.
prompts
implementation
Definitions
of the
may
In general,
SURE
time,
supply
probabilities
LAMBDA = 0.001; RECOVER = 1E-4; Constants
The
the transition is fast. Otherwise, it distributed by the SURE program.
The
respective
mean
model.
deviations.
semicolon
by a question
used
time.
the
be positive
of a comment.
The
transition supply
This
The
may
allowed.
5.2.2
must
in the program deviations, etc.,
statement
distribution.
of the
Details
usually
followed
of the
user
numbers
constant standard
an arbitrary
the
Lexical state
have
deviation
conditional
these
between
can
a state,
limit,
by enumerating
with respect to the mission time, i.e. p < T, then Slow transitions are assumed to be exponentially
Fast
The
in SURE
distinguishes
is small is slow.
from
Syntax
syntax
is 19
defined
constants:
identifiers
may
be
name where
name
letter,
and
= expression; is a string
czpresswn
section
entitled
5.2.3
Variable
In order
of up
to eight
is an arbitrary
letters,
digits,
mathematical
and
underscores
expression
(_) beginning
as described
with
a
in a subsequent
"Expressions". Definition
to facilitate
for this
variable.
variable.
If the
parametric
The
SURE
system
analyses, system
is run
function will be made. 0.001 to 0.009:
The
a single
variable
will compute
in graphics
following
the
mode
(to
statement
may
system
be defined.
reliability
be described
defines
A range
is given
as a function
of this
later),
LAMBDA
then
a plot
as a variable
of this
with
range
LAMBD_ = 0.001 TO 0.009; Only
one
such
of points range
variable
over
can
this
range
be either
POINTS
may
be defined.
to be computed.
"geometric"
= 4, then
A special
the
The
method
or "arithmetic"
geometric
constant,
and
POINTS,
used
is best
to vary explained
defines the
the
variable
by example.
number over
this
Suppose
range
XV = 1 TO* 1000; would
use
XV values
of 1, 10, 100, and
1000,
while
the
arithmetic
An
asterisk
range
XV = I TO+ I000; would
use
geometric One
XV
values
range,
while
additional
BY "increment", adding
of 1, 333,
TO+ or simply
option the
667,
the
of POINTS specified
1000.
TO implies
is available--the
value
or multiplying
and
an arithmetic
BY option.
set
the
TO implies
a
range.
By following
is automatically
amount.
following
such
the
that
above
the
syntax
value
with
is varied
by
For example,
LAMBDA = IE-5 TO* IE-2 BY i0; sets POINTS equal 1E-2. The statement
to 4 and
the
values
of LAMBDA
used
values
of CX used
would
would
be 1E-5,
1E-4,
1E-3,
and
C1(= 3 TO+ 5 BY 1; sets
POINTS
equal
In general, vat where
the
to 3, and syntax
= expression
vat is a string
an arbitrary a + or *. The
the
T0[c]
expression
of up to eight
mathematical BY clause
be 3, 4, and
5.
is
expression is optional;
letters
[BY increment and
digits
as described if it is used, 2O
]; beginning
in the then
next
increment
with section
a letter,
expression
and
optional
the
is any arbitrary
is c is
expression.
5.2.4
Expressions
When specifyingtransition or holding time parametersin a statement, arbitrary functions of the constants and the variable may be used. The following operators may be used: +
addition
-
subtraction
*
multiplication
/
division
**
exponentiation
The following ARCSIN(X), ing in the
standard
Pascal
ARCCOS(X), expressions.
functions
ARCTAN(X),
The
following
are
may be used: SQRT(X). permissible
EXP(X):
Both
( ) and
LN(X),
SIN(X),
I] may be used
COS(X), for group-
expressions:
2E-4 1.2*EXP(-3*ALPHA); 7*ALPHA + 12*L; ALPHA*(I+L) + ALPHA**2; 2*L + (1/ILPBI)*[L + (1/ILPHI)];
5.2.5
Slow
Transition
Description
A slow transition
is completely
and
rate.
the
transition source,
where defining
source the
The
specified syntax
by
citing
the
source
state,
the
destination
state,
is as follows:
dest = rate;
is the
source
exponential
state, rate
dest is the
of the
destination
transition.
The
PERM = 1E-4; TRANSIENT = IO*PERM; 1,2 = 5*PERM; 1,9 = 5*(TRANSIENT 2,3 = 1E-6;
+ PEKM);
21
state, following
and
rate is any
are valid
SURE
valid
expression
statements:
5.2.6
Fast
To enter
Transition
a fast
Description
transition,
source,
dest
the
=
following
< mu,
sig
syntax
[, frac
is used:
]
>;
where mu
=
an expression
defining
the
conditional
mean
sig
=
an expression
defining
the
conditional
standard
.frac
=
an expression
defining
the
transition
and
source
and
eter fast
frac is optional. If omitted, the transition transition. All of the following are valid: 2,5
=
THETA 5,7 = 7,9 =
5.2.7
FAST
Often
when
dest define
;
studies,
one must
these fast transitions necessary to supply
parameters 1 to state
and
0.5>;
Transition
simplicity, it is still these state
source
= 1E-4;
4_ (_',1)-----_
(5',0) 5_
3_
,
(5,2)----::._ (5,a) [
< m,,_
> / < m_.s_2
>
:0/4-A
(4
= NW;
failed
to keep
*)
*)
can fail
(* a spare becomes ;
(* a spare BY NS*GAMMA;
(NW,NF,NS-I)
NF
Since
This
BY
(NF > O) AND (NS = O) (* no more TRANTO (I,0,0) BY ;
keep
*)
(3,0,NSI);
=
IF NW > 0 TRANTO (NW-I,NF+I,NS) IF
of working processors *) of failed active procssors of spares *)
*)
inactive
or that
assumptions. spares
fail
(i.e.
warm)
but
configuration.
The
for the
processors.
rate)
increase
active
to see the
model
advantage
in recovery
time
modeled.
number of spares ini_ially *) failure ra_e of active processors *) failure rate of spares *) mean time to replace with spare *) start, dev. of time to replace wi_h spare
53
its
In this
*)
due
MU_DEG = 6.3E-5; SIGMA_DEG = 1.74E-5; SPACE = (NW: 0..3, NF: 0..3, NWS: O..NSI, NFS: O..NSI);
(* (* (* (* (* (*
START = (3,0,NSI,O); PRG = NWS/(NWS+NFS);
mean $ime s%an. dev. number of number of number of number of
(* probabili%y
$o degrade _o simplex *) of %ime _o degrade %0 simplex working processors *) failed active processors *) working spares *) failed spares *)
of switching
(* processor failure *) IF NW > 0 TRANTO (NW-I,NF+I,NWS,NFS)
in a good
spare
(NF > O) AND (NWS+NFS > O) THEN (* reconfigure using a spare (* a good spare becomes active *) IF NWS > 0 TRANTD (NW+I,NF-I,NWS-1,NFS) BY ; (* a failed spare becomes ac%ive *) IF NFS > 0 TRANT0 (NW,NF,NWS,NFS-I) BY ; ENDIF; (NF > O) AND (NWS+NFS = O) (* no more TRANTO (I,0,0,0 BY ;
IF NWS > 0 TKANTO (NW,NF,NWS-I,NFS+I) DEATHIF
reconfiguration
is equal
variable
to the
PRG
probability spare
degrade
(* a spare BY NS*GAMMA;
fails
the
of switching
spare
> 0 and
occurs, current
is used
when
and
> 0 check
of good
this
in a good
Conversely, is zero,
NFS
probability
proportion
to calculate
of switching
is zero.
a good
spares,
*)
%o simplex
*)
*)
NF >= NW;
When spare
*)
BY NW*LAMBDA;
IF
IF
*)
probability.
spare
all of the
the probability for these
spares
When
is one,
and
spares
have
the
and
the
the
(_ersus
in the spares
probability
in a bad
prevent
spare
spares
all of the
failed,
of switching
two cases
in a good
to failed
a failed
system. are
good,
of switching
probability
spare
the
in a bad
of switching
is one.
generation
The
The
tests
in
NWS
of a transition
when
it is inappropriate.
9.4 If the
Degradable system
is designed
model.
Two
cannot
detect
aspect Instead,
aspects
with
since
is necessary to expand or undetectable:
off-line
faults
referred
it is used
we will just
with
of an off-line
all possible
is sometimes
"coverage"
Triad
the
diagnostics diagnostic
and
to as the
in so many
call it the state
Partial
"fraction space
Detection for the spares,
must
"coverage" ways
of detectable to keep
track
54
this
be considered:
(2) a diagnostic different
of Spare
requires
of the
must
diagnostic.
faults" of whether
to execute. We will avoid
people and
be included
in the
(1) a diagnostic time
by different
Failure
and is thus
assign
a fault
usually The
first
the
term
confusing.
it an identifer,
in a spare
K. It
is detectable
SPACE
The
=
second
the
(* (* (* (* (*
(NW: 0,.3, NF: 0..3, NWS: O..NSI, NDFS: O..NSI, NUFS: O..NSI);
NDFS
aspect
requires
state-space
thal.-a
variable
ntunber number number number number
rule
according
working processors *) failed active procssors *) working spares *) detectable failed spares *) undeteczable failed spares *)
of of of of of
be
added
to
generate
to
some
fast.
general
transitions
which
recovery
decrement
distribution:
IF NDFS > 0 (* "detectable" spare-failure is detected *) TRANTO (NW,NF,NWS,NDFS-1,NUFS) BY ;
No
such The
that
transition active
the
processor
state
whether
the
is generated
space
for
failure
TRANTO
is larger.
failure
"NUFS"
The
is detectable
that
The
the
of these PRW PRD PRU
The
exist:
processor
processor
are multipled
reconfiguration
possibilities faulty
rates
the
faulty
is replaced
are
= NWS/(NWS+NDFS+NUFS); = NDFS/(NWS+NDFS+NUFS); = NUFS/(NWS+NDFS+NUFS);
reconfiguration
rule
as in the
TRANT0
rule
previous must
be
example, altered
(* a spare fails *) BY K*NS*GAMMA; (* detectable fault BY (I-K)*NS*GAMMA; (* undetectable
active
a spare PRD,
same
failure
by K and
with
with PRW,
is the
spare
rule is now more
(1)
is replaced cases
rule
and
complicated processor
a spare
containing
containing
an
PRU,
respectively,
(* (* (*
prob. prob. prob.
than
is replaced
in the with
a detectable
undetectable defined
fault.
previous
model
NSI = 3; LAMBDA = IE-4; GAMMA = 1E-6; MU = 7.9E-5; SIGMA = 2.56E-5;
include
*) faul%
*)
example.
a working
spare,
fault
(3)
The
and
Three (2) the
the
probability
faulty of each
as follows:
working spare is used *) spare w/ detectable fault spare w/ undezeczable fault
is used *) is used *)
is:
(NF > O) AND (NWS+NDFS+NUF5 > O) THEN (* a spare becomes active IF NWS > 0 TRANTO (NW+I,NF-I,NWS-I,NDFS,NUF5) BY ; IF NDFS > 0 TR4NTO (NW,NF,NWS,NDFS-I,NUFS) BY ; IF NUFS > 0 TRANTO (NW,NF,NWS,NDFS,NUFS-I) BY ; ENDIF;
complete
to
(l-K).
IF
The
except
or not:
IF NWS > 0 THEN TRANTO (NW,NF,NWS-I,NDFS+I,NUFS) TRANTO (NW,NF,NWS-I,NDFS,NUFS+$) ENDIF;
Note
faults.
*)
is: (* (* (* (* (*
number of spares ini%ially *) failure raze of active processors *) failure rate of spares *) mean _ime to replace with spare *) start, dev. of time %o replace wizh spare
55
*)
MU_DEG = 6.3E-5; SIGMA_DEG = 1.74E-5;
(* mean time ¢o degrade ¢o simplex *) (* start, dev. of time to degrade ¢o simplex
K = 0.9;
(* fraction of faults that the spare off-line diagnostic can detec¢ *) (* mean time %o diagnose a failed spare *) (* s¢andard deviation of time _o diagnose *)
MU_SPD = 2.6E-3; SIGMA_SPD = 1.2E-3; SPACE
=
(NW: 0..3, NF: 0..3, NWS: O..NSI, NDFS: O..NSI, NUFS: O..NSI);
START
=
(3,0,NSI,O,O);
(* (* (* (* (*
IF NW > 0 (* TRANTO (NW-1,NF+I_NWS,NDFS,NUI_S) IF NWS > 0 THEN TRANT0 (NW,NF,NWS-1,NDFS+I,NUFS) TRANTO (NW,NF,NWS-1,NDFS,NU?S+I) ENDIF;
number number number number number
of of of of of
working processors *) failed active procssors *) working spares .) de$ec_able failed spares *) undeeectable failed spares *)
a processor can fail BY NW*LAMBDA; (*
*)
a spare fails *) BY K*NS*GAMMA; BY (1-K)*NS*GAMMA;
*)
(* detectable fault *) (* unde%ec_able fault *)
PRW = NWS/(NWS+NDFS+NUFS); (* prob. a working spare is reconfiguzed *) PRD= NDFS/(NWS+NDFS+NUFS); (* prob. a spare w/ de%. f. is reconfigured PRU = NUFS/(NWS+NDFS+NUFS); (* prob. a spare w/ unde¢ f. is reconfigured IF (NF > O) AND (NWS+NDFS+NUFS > O) THEN (* a spare becomes active *) IF NWS > 0 TRANTO (N_+I,NF-I,NWS-I,NDFS,NUFS) BY ; IF NDFS > 0 TRANTO (NW,NF,NWS,NDFS-I,NUFS) BY ; IF NUFS > 0 TRANT8 (NW,NF,NWS,NDFS,NUFS-I) BY ; ENDIF; IF
(NF > O) AND (NWS+NDFS+NUFS = O) (* no more TRANTD (1,0,0,0,0) BY ;
spares,
degrade
*) *)
to simplex
*)
IF NDFS > 0 (* "de%ec%able" spare-failure is de%ected *) TRANTO (NW,NF,NWS,NDFS-I,NUFS) BY ; DEATHIF
In
NF
this
the following discussed.
>= NW;
section, section,
systems systems
consisting consisting
of a single of multiple
56
reconfigurable sets
of these
triad
were
reconfigurable
modeled. triads
In are
10
Multiple
This
section
models,
starts
various
These
models
concludes
are
Two
In this spares
to simplex. because
then
generalized
a general
discussion
replace system
fails
fault
occurs
define
Since track
the
of the
represented while
they
of spares
model,
triads current
the
when
by a single are inactive, available.
Thus,
competing
of two
triads
pool
of spares.
faulty
which and
two faulty
processors. processor
of spares
faulty
to replace spares
number
the
of spares
to a simplex
of this
Thus,
working.
be represented
before
there
by a single
they
system.
state
Similarly,
full model
description
(*
TWO TRIADS WITH POOL OF SPARES *)
the
need
variable--NS,
INPUT N_SPARES; LAMBDA_P = IE-4; DELTAI = 3.6E3; DELTA2 = 6.3E3;
(* (* (* (*
SPACE = (NWI, NW2, NS);
(* Number of ::orking processors in _riad I *) (* Number of working processors in _riad 2 *) (* Number of spare processors *)
Number of spaces *) Failure ra_e of ac$ive processors *) Reconfigura_ion ra_e of triad I *) Reconfigura$ion ra_e of _riad 2 *)
57
by been
inactive.
triad
to
ASSIST
generating
spares
is:
(* Active processor failure *) IF NWl > 0 TRANTO NWI = NWI-I BY NWI*LAMBDA_P; IF NW2 > 0 TRANTO NW2 = NW2-1 BY NW2*LAMBDA_P;
The
is no
is:
START = (3, 3, N_SPARES);
has are
of each
SPACE = (NWi, NW2, N_SPARES); The
of
INPUT statement
constant
the
pool degrade
can happen
processors
fail while
configuration,
in a triad.
the
do not This
ASSIST in the
inde-
can be replaced
the faulty
do not
we will use
for the value
can
section
operate
When
processors
first
space
the
recoveries.
the
number
In later included.
Spares
consists
the
are
Finally,
has
initial
state state
failures
triad
that
variable---NW, the
multiple
of cold spares.
before
of processors
so their
triads.
with
studies,
degrade
two
than
a common
interactively
not
more
from
it is assumed
number
spare
operating
supply
the
a pool
and
to model
which
either
trade-off
sharing
introduced
Cold
in a triad the
user
do
of how
Pooled
to represent
will query
to model
continuing
performing
a constant
program model.
triads
or because
For this
are
processors
The
To facilitate
of two triads
a system
faulty
the
spare
exhausted.
with
out,
a second
an available
model concepts
we will model
but
runs
a simple
Triads
section
pendently
with
reconfiguration
with
10.1
Triads
the
to keep can
do not
be fail
number
(* Replace failed processor wi_h workin E spare *) IF (NWI < 3) AND (NS > O) TRANTO NWI = NWI+I, NS = NS-I BY FAST DELTA1; IF (NW2 < 3) AND (N$ > O) TRANTO NW2 = NW2÷I, NS = N$-I BY FAST DELTA2; DEATHIF NW1 < 2; DEATHIF NW2 < 2;
The
start
state
of working rules
the
NW1
TI_NTO
is (3, 3, N.SPARES)
processors
define
either
(* Two faults in triad I is sys%em failure *) (* Two faul%s in triad 2 is system failure *)
define
is conditioned
This
can the
SURE
either
has two different
potentially
keyword
conditionaJ
occur times
triads
triad
experiences processor
with
recovery
cannot
processor
of 1 spare
experiences
with
from
a full complement
The
first
is accomplished
faulty
no spares
have
is N_SPARES. This
the
recovery
rates
in section
5.2.7.
recoveries
Two
at the
the
pool
two or more
two
TRAICr0
by decrementing
failure.
The
next
two
a spare.
Note
that
this
take
a working
the
(i.e.
place.
The
result
processor
(i.e.
NWx
NS = NS -
coincident
used,
and
the
these
faults
Since
SURE
with
this
program
in section
1 and
1).
(i.e.
The
(NWl
was
developed
0--if
÷ 1 for triad
2) or (NW2
that
depending
upon
fails
number process
recovery
of reconfiguration system
the
arrival
or NW2
rules
= NWx
and
fault
which
are
not
in which
the
section.
Reducable
to
One
Triad The
model
only
1 triad.
system
given This
faulty
one
exception,
As
until
triad
be modified was
used
by replacing is removed
however.
and When
it can no longer
before,
this
model
The
state
space
must
This
is accomplished
maintained
can
strategy
reconfigures
the triad
above
in a state
its good
processors
is only the
that
be modified space
FTMP
processor
out-vote
by setting
in the
to describe
a faulty there
assumes
easily
the
faulty
NWx
the
= 0 when NT.
system with are triad
Although
58
are
[16].
triad
can
survive
If spares
are
available,
added the i.e.
If no spares
to the
spares
system until
cold--they
notion
that
a spare.
left,
processor,
spares
to indude
variable
one
a system
do
of whether
x is inactive.
pool.
not
second fail
a triad The
this is redundant--the
the
are available,
maintains
the
with
There the
fault while
is active
number number
is
faulty arrival.
inactive. or not.
of triads
is
of active
triads
can
space
be
determined
variable
SPACE
The
greatly
=
(NW1, NW2, NT, NS);
initial
state
DEATHIF Note
that
conflict also
the
the but
collapsed
into
Next,
OMEGAI OMEGA2
The
BY
arrival
FAST
first
the last
above i.e.
TRIADS
= = = = =
processors
in
triad
I *)
working active
processors triads *)
in
triad
2 *)
(*
Number
of
spare
3.6E3; 6.3E3; 5.6E3; 8.3E3; (NWI, NW2, NT,
The
(NW1 < 2)
this
new
NWx
DEATtIIF
is:
*)
statement
becomes:
the
are the
equal
same
to 0 when This
TKANTO
= O)
AND
x becomes
because
This
inactive. This
the
would
last
Note
clause triad
could
is never
be satisfied. and
OMEGA2
which
define
the
rate
at
are available:
Collapsing Collapsin
ra_e E rate
as in the previous
> O)
be wrong.
not included.
follows
can never OMEGA1
rules
triad
= 0) is also
model.
no spares
The
OR (NW2 < 2) ; would
(NW2
condition
constants
when
defines > 0).
when
The
INPUT N_SPARES; LAMBDAI = IE-4; LAMBDA2 = 1E-4; DELTA1 DELTA2 OMEGA1 OMEGA2
working
processors
state
space
of of
change
be altered.
by (NS
triad.
TWO
state
for each NWx
(NT
_riad triad
model.
triad
I 2
*) *)
However,
the
reconfiguration
x are
= NWx+I,
> I)
of of
TRANTO
NS
= NS-1
NWx
= O,
NS
= NS+2,
NT
= NT-I
OMEGAx;
rule
are available,
of this extra the
of
(* (*
rules
must
conditioned
SPACE
two
Thus,
Number Number
= 0) AND
Thus,
(NWx = 2) AND (NS BY FAST DELTAx; (NWx = 2) AND (NS
IF
(*
not
inclusion
Number
= 5.6E3; = 8.3E3;
fault
The
spares.
NW2--the description.
(* (*
of setting
it would
and
input
(NW2 = I);
(NW1
are collapsed
specification IF
strategy
we define
triads
NW1
ASSIST
(*
DEATHIF
condition
be added
which
statement
with
at
the
is (3, 3, 2, N.SPARES).
(NWI = I) OR
the
that
by looking
simplifies
reconfiguration The
rule
defines
NS = 0. Note
that
the
complete
WITH
by replacement
second
POOL
model OF
the
condition
with
collapsing (NT
a spare.
Thus,
of a triad
when
> 1) prevents
the
is:
SPARES
-->
I TRIAD
*)
(* (* (* (* (* (* (*
Number of spares Failure rate of Failure ra_e of Reconfiguration Reconfigura_ion Collapsing rate Collapsing rate
(*
Number
of
working
processors
in
triad
I *)
(* Number (* Number
of of
working active
processors triads *)
in
triad
2
59
*) active processors *) active processors *) rate of triad I *) rate of triad 2 *) of triad i *) of _riad 2 *)
*)
this
rule
is
no spares collapse
of
NS); START = (3,
(*
3,
2,
Number
(* Active (NW1 > O)
processor failure TRANTO NW1 = NWI-1
IF
(NW2
TKANTO
IF
(* Replace failed (NW1 = 2) AND (NS
processor > O) TRANTO
IF
(NW2
= 2)
AND
> O)
IF
(NW1 BY (NW2 BY
= 2) FAST = 2) FAST
AND (NS OMEGA1; AND (NS OMEGA2;
=
DEATHIF
(NWl
=
(NW2
10.3
Two
IF
The
O)
system
unlike the
necessary
condition (i.e.
model
SPACE
The
=
state in the
= NW2÷I,
not
= NS-1
BY
FAST
DELTA1;
BY
FAST
DELTA2;
1) TRANTO NW1 = O, NS = NS+2, NT _o one _riad only -- triad 2 *) > 1) TRANTO NW2 = O, NS = NS+2, NT _o one _riad only -- triad I*)
with
consists
Pooled
of two
this system
degrade
the faulty
space
two
triad
Cold
triads
which
requires
to one
variables
triad.
= NT-1 = NT-1
Spares
can
degrade
the throughput Instead,
into a simplex.
when
The
active
indicate
space
complete
(* Number (* Number (* Number (* Number (* Number
extra
to a simplex.
of two processors. no
more
spares
non-faulty
then
pool initially. reconfiguration
NCx
The
where
processor rules
are
6O
must
the
are
processor
active each
Therefore,
NC1 and
= 3; if triad
it is
configuration state
out of three)
current
which
is a satisfies
or an operational
NC2,
are added
configuration
to the
of each
x has already
triad.
degraded
to
number
of
is:
active processors working processors active processors working processors spare processors
3, 3, 3, N_SPARES)
the
variables,
space
a simplex.
whether 1 good
in the
state of of of of of
whether
(i.e.
of processors
processors,
into
to determine
two state
Thus,
= 1. The
spares
that
be degraded
state
number
is (3,
can
x is a failed
total
three
triads
it is impossible
of 1).
the
As expected,
NS
>
example,
= 1 for triad
NCx
NW2
spare *) NS = NS-1
1);
of the
(NCl, NWI, NC2, NW2, NS);
initial
processors models.
then
*)
pool.
out
x still has
a simplex,
working = NWI+I,
section
does
state
1 good
wi_h NW1
Triads
Otherwise
to indicate
If triad
=
degrades
NWx
NW2*LAMBDA2;
TRANTO
previous
each
to add
NWI*LAMBDAI;
BY
= NW2-1
in this
spares
or simplex.
state
OR
the
system
processors
*) BY
O) AND (NT (* Degrade O) AND (NT (* Degrade
=
system
to the
In this
the
I)
the system
is added
triad
(NS
modeled
Therefore, available,
NW2
Degradable
However,
spare
N_SPARES);
IF
>
of
in _riad in _riad in _riad in _riad *)
N_SPARES failure
rules
be altered.
I *) 1 *) 2 *) 2 *)
represents are the These
same
the
as in previous
rules for each
triad
IF (NWx < 3) AND (NCx BY FAST DELTAx; IF
(NWx < 3) BY FAST
AND 0 TRANTO NW1 IF NW2 > 0 TRANTO NW2
IF
=
replacement
processor)
describes
no
the
states
SPACE
IF
rule
when
is returned
Number of spares Failure ra_e of Failure rate of Reco_figuraZion Reconfiguration Reconfigura_ion Reconfiguration
IF
(NCx simplex
second
(* (* (* (* (* (* (*
IF
condition
the
= NS+I
=
1.
0.
The
O) TRANTO NWx (* Replace failed processor
failure *) = NWI-I BY NWI*LAMBDAI; = NW2-1 BY NW2*LAMBDA2;
(* Replace (NWI < 3) BY FAST (NW2 < 3) BY FAST
failed processor AND (NCl = 3) AND DELTAI; AND (NC2 = 3) AND DELTA2;
(* Degrade (NWI < 3) BY FAST (NW2 < 3) BY FAST
to simplex *) AND (NS = O) AND OMEGA1; AND (NS = O) AND OMEGA2;
wizh working spare *) (NS > O) TRANTO NWI = NWI+I,
NS
= NS-I
(NS > O) TRANTO
NS
= NS-I
NW2
= NW2+I,
(NCI=3)
TRANTO
NC1
= I, NWI
= i, NS = NS+I
(NC2=3)
TRANTO
NC2
= I, NW2
= i, NS = NS+I
61
of the
DEATHIF DEAITIF
2*NWl 2*NW2
All of the cannot
fail
0 TRANTO
optimistic
can be generahzed
if we make
the failure of a warm spare. to the model descriptions: LAMBDA_S
Triads
assumption
configuration.
active
62
and
be modified
SPARES
two
to describe
*)
*) active processors active processors active processors rate of triad I rate of triad 2 ra_e of _riad I ra_e of triad 2 processors
distinct
(2) a working
*) *) *) *) *) *) *)
in triad
1 *)
results: warm a system
(1) a spare
is
of two
NWI, NC2, NW2, NS,
(* (* (* (* (*
NWS); START
Number of working processors in triad I *) Number of active processors in triad 2 *) Number of working processors in triad 2 *) Total number of warm spares *) Number of working warm spares*)
= (3, 3, 3, 3, N_SPARES,
IF NWI IF NW2 IF NWS
> 0 TRANTO > 0 TRANTO > 0 TRANTO
NWI NW2 NWS
= NWI-I = NW2-1 = NWS-I
N_SPARES); BY NWI*LAMBDA1; BY NW2*LAMBDA2; BY NWS*LAMBDA_S;
(* (* (*
Active processor failure Active processor failure Warm spare failure *)
IF
(NS > O) AND (NWS > O) THEN (, Replace with IF (NWl < 3) AND (NCl = 3) TRANTO NW1 = NWI+I, NS = NS-I, NWS = NWS - i BY FAST (NWS/NS)*DELTAI; IF (NW2 < 3) AND (NC2 = 3) AND (NS > O) TRANTO NW2 = NW2+I, NS = NS-1, NWS = NWS BY FAST (NWS/NS)*DELTA2; ENDIF; IF
(NS > O) IF (NWI < TRANTO IF (NW2 < TRANTO ENDIF; IF
AND 3) NS 3) NS
DEATHIF DEATHIF
2*NW1 2*NW2
10.5
Multiple
This an
section
will
number work
a specific For
also
simplify
MULTIPLE
BY FAST
Triads
development
of a generalized
first
the
active
TRIADS
or
first into
model
be
investigate
by
POOL
with
spare
*)
*)
so
simplex
*)
So
simplex
*)
and
assuming
OF COLD
the
Cold
that
creating
can
be used
a general
ASSIST
Spares to model
specification
program
prompt
for
model.
this
which
fails
cold
spares
complete SPARES
63
is incapable
system
we have The
by
having
a system
Thus,
Pooled
description
accomplished
triads
a specific
spares.
configuration. WITH
will
of initial
to generate
we will
this
This
number
in order
a simplex
into
(* Degrade 8MEGAI; (* Degrade OMEGA2;
BY FAST
Non-degradable
of triads. any
simplicity,
either
brought
for
value
into
failed
spare
0 TRANTO NW[J] = NW[J]-I
IF
(* Number (* Number
*)
BY
NW[J]*LAMBDA;
(* Replace failed processor wi_h workinK spare *) (NW[J] < 3) AND (NS > O) TRANTO NW[J] =-NW[J]+I, NS = NS-1 BY FAST DELTA'J;
DEATHIF
NWEJ]
(*
< 2;
Two
faul_s
in a _riad
is system
*)
failure
ENDFOR;
The array state space the number of working the
variable NW contains a value for each triad representing a count processors in that triad. Similarly, the FOR loop (which terminates
ENDFOR statement)
(2) the
defines
replacement
conditions
for that
of failed triad
To accommodate catenation The
feature
expression
for triad
of triads
is unknown
this
section
10.6
When
section,
are added
1 triad Since
This
that
values
all triads
model
to the
spares
the
consists
throughpu!
failures
from
the
in that
pool,
results
triad_
and
(3) the
same
triad
of only one
the
processor.
the
using
This must
the
INPUT
by editing in
rates.
to allow can
of triads.
the non-faulty operate
words,
can still maintain
Spares
degradation
up and
In other
since the
of
of the models
Cold
system
rate
be done
the rest
con-
statement.
Unfortunately,
is specified
Pooled
is broken
that
the system
2, etc.
reconfiguration
will be modified
triads,
TRAFr0
in a reconfiguration
For simplicity,
with
faulty
for different
identifiers.
time.
the
rates
reconfiguration
H_TRIADS
to these
It is assumed triads,
spares
for triad
(i.e.
above
each
pool.
of mutiple
of the
Triads given
with
FOR loop
run
have
Degradable
are available,
the time
at SURE
spares
triad
processor
failure.
DELTA2,"
run
active
reconfiguration
within
until
the
expression
1, "FAST
them
the simple
with
configuration only
will assume
no more
performance
rate
is no way to assign
Multiple
In this sors
there
file or entering
in that
differing
BY FAST DELTA_J
DELTAI,"
output
with
(t)
in system
in the
number the
result
systems was used
triad
processors
that
"FAST statement),
for each
of at
with
although
its vital
procesdegraded the
initial
functions
with
remaining.
we are going
will be done
to be breaking
by adding
an array
up triads,
we need
state-space 64
variable
a way of deciding NP to keep
track
if a triad of the
is active. number
of
actzve be
processors
set
to zero
in a triad. for each
count
of the
number
used
to keep
track
the
number
the
specification The
(*
triad
of how
specification
it is broken
many
= (N_TRIADS
FOR
up.
are
in array
The
it is in
including
each
state
triad.
in operation.
Thus, by
for
array
in each
still
NP.
of three
triad
space
variable
The
state
This
variable
some
sense
it in the
initially
space
and
NFP
keeps
a
variable
NT
is
will
always
redundant.
SPACE
will
equal
However,
statement.
is:
WITH
POOL
OF WARM (* (* (* (* (*
START
value
active
triads
INPUT N_TRIADS; INPUT N_SPARES; LAMBDA = IE-4; DELTA1 = 3.6E3; DELTA2 = 5.1E3; =
the
TKANTO is simplified
TRIADS
SPACE
have
processors
entries
of the
will
when
of failed
of nonzero
MULTIPLE
This
SPARES
*)
Number of %riads iniZially *) Number of spares *) Failure ra%e of ac%ive processors *) Reconfiguration ra%e to switch in spare Reconfiguration ra%e %o break up a %riad
(NP: ARRAY[I..N_TRIADS] 0F 0..3, (* Number of processors NFP : ARRAY [I..N_TRIADS] Of 0..3, (* Num. failed active NS, (* Number of spare processors *) NT: O..N_TRIADS) ; (* Number of non-failed %riads *) OF 3, N_TRIADS
OF O, N_SPARES,
*) *)
per %riad procs/%riad
*) *)
N_TRIADS);
J = I, N_TRIADS; IF NP[J] > NFP[J] TRANT0 NFP[J] BY (NP[J]-NFP[J])*LAMBDA;
= NFP[J]+I (* Ac%ive processor
failure
IF NFP[J] > 0 THEN IF NS > 0 THEN TRANTO NFP[J] = NFP[J]-I, NS = NS-I BY FAST NFP[J]*DELTAI; (* Replace failed processor wi%h working spare
*)
*)
ELSE IF NT > I TRANTO NP[J]=O, NFP[J]=O, NS = NS + (NP[J]-NFP[J]), BY FAST DELTA2; (* Break up a failed triad when no spares available *) ENDIF; ENDIF; DEATHIF 2 * NFP[3] (* Two faul%s in
>= NP[J] AND NP[J] > O; a_n ac%ive _riad is sys%em
NT = NT-I
*)
failure
ENDFOR;
As
before,
they
are
repeated
for
When
there
processor. replaces then
the
all of the
that faulty
failed triad
and
TRANT0 each
triad.
is an
processor is broken
active with up
DEATHIF The
first
failed one and
statements TKANT0
processor
from its
the
pool
working
fi5
are statement
in a triad,
set
inside
of a FOR loop
defines the
failure
second
of spares.
If there
processors
are
put
of an
TKANT0
are
no
into
spares the
so that active
statement available,
spares
pool.
(This assumesthat the systemcan determinewhich processorhas failed with 100lasttriad in the system(i.e., NT (3,3,3,0,0,1,1,3)
->
(3,3,3,1,0,0,0,3) (0,3,3,0,0,0,1,2) (0,3,3,0,1,0,0,2)
-> -> ->
-> -> ->
10.7'
Multiple
The
model
given
(*
MULTIPLE
(0,3,3,0,0,0,2,2) (0,3,3,1,0,0,1,2) (0,0,3,0,0,0,2,2)
Degradable above
TRIADS
can
WITH
INPUT N_TRIADS; INPUT N_SPARES; LAMBDA_P = IE-4; LAMBDA_S = 1E-5; DELTA1 = 3.6E3; DELTA2 = 5.1E3;
be
Triads
easily
POOL
(* (* (* (* (* (*
START
=
IF NS
> NFS
FOR
OF 3, N_TRIADS NFS
= NFS+I
to include
spare
Warm
Spares
failures:
ini%ially *) *) active processors *) warm spare processors *) ra_e to swi%ch in spare *) ra%e %0 break up a %riad *)
(* Number of processors per %riad (* Num. failed active procs/triad spare processors *) failed spare processors *) non-failed _riads *)
OF O, N_SPARES, BY
-> -> ->
*)
Number of triads Number of spares Failure rate of Failure ra%e of Reconfiguration Reconfigura%ion
= (NP: AKRAY[I..N_TRIADS] OF 0..3, NFP: ARRAY[1..N_TRIADSJ Of 0..3, NS, (* Number of NFS, (* Number of NT: O..N_TRIADS); (* Number of
TRANTO
Pooled
SPARES
SPACE
(N_TRIADS
with
modified
OF WARM
(3,3,3,0,0,0,0,3) (0,3,3,0,1,0,2,2) (0,3,3,0,0,0,0,2)
O, N_TRIADS);
(NS-NFS)*LAMBDA_S;
(* Spare
failure
J = I, N_TRIADS; IF NP[J] > NFP[J] TRANTO NFP[J] BY (NP[JJ-NFP[J])*LAMBDA_P;
= NFP[J]+I (* At%ire
processor
failure
IF NFP[J] > 0 THEN IF NS > 0 THEN IF NS > NFS TRANTO NFP[J] = NFP[JJ-I, NS = NS-I BY FAST (I- (NFS/NS)),NFF [J] *DELTA1 ; (* Replace failed processor wi_h working spare IF NFS > 0 TRANTO NS = NS-I, NFS = NFS-I BY FAST (NFS/NS)*NFP[J]*DELTAI; (* Replace failed processor wi_h failed
66
spare
*)
*)
*)
*)
*) *)
ELSE IF NT > I TRANTO NP[J]=O, NFP[J]=O, NS = NP[J]-NFP[J], NT = NT-I BY FAST DELTA2; (* Break up a failed triad when no spares available *) ENDIF; ENDIF; DEATHIF 2 * NFP[J] >= NP[J] AND NP[J] > O; (* Two faults in an active triad is system failure *) ENDFOR;
The
additional
are
in the
the
placement
placed
state
spares
of this
inside
plied
space
pool.
of the
processor
working
spare,
replacement
This with
The
processor
two
first
with
of how
first
this
to having
the
replacement
existence
a failed
spare
of the
spares
failed
rate
multi-
replacement processor
The
on the
Note
incorrectly
failure
spares.
conditioned
failed
were
to define
of non-failed
spare,
many
TRANT0 statement.
statement
TRANT0 statements
defines
on the
track
by the
FOR loop--if
be equivalent
includes
a spare.
to keep
is defined
of the
it would
model
it is conditioned
of a failed
is needed
of spares
outside
F0R loop,
and
NFS,
failure
statement
by N_TRIADS.
a failed
variable,
The
of with
second
existence
a
defines of failed
spares.
10.8
Multiple
In the
preceding
example This two
when
cause
triads
state
examples
of a model
occurs
do not
Competing
(2,2,7)
multiple
containing multiple
system
which
(i.e.
states
faults
failure.
Recoveries
The
both
have
with
diagram
in figure
we encountered
processes
parts
at rate
active
leaving
of the
25 illustrates
first
processor
spares),
recovery
in different
faults--the
a faulty
pooled
multiple
accumulate
are accumulating
they
triads
with
this
_1 and
at the
system
same
which
concept:
the
first state.
together
Here we have
second
time.
the
a single
at rate
This
A2. In
is not system
failure since the two failures are in separately voted triads. There are two possible recoveries from this state--triad 1 recovers first then triad 2, or vice versa. Which case occurs depends on how long each recovery takes. In some systems, the recovery
process
also
ongoing
even
each
other
in the in the
specification, deviations
since for the
system.
system, the
the
minimum
specification
SURE
competing
recovery distributions half of the time triad of the
But the
of the
presence
of a third
when
take the
requiies
recovery
processes.
On average, first. The
competing parameter--the
recoveries
when
recovery
conditional
still
means
Consider
there
is another
processes
recoveries the
have
impacts
and
recovery
no effect
on
the
transition
conditional
standard
simple
case
where
the
two
half of the time triad 1 will recover first, and conditional mean recovery time is the mean times.
transition
67
longer
two
of competing
program
are identical. 2 will recover two
may
The
SURE
probability.
program This
is the
also
requires
probability
3X2
3X2 "__
_ Figure
25:
Model
With
...
__... Multiple
68
Competing
Recoveries
NN__FIRS
T 1
2)_2
Figure
that
this transition
state.
The
state
must In the
tributed,
sum
of the
equal
one.
and
the
5.2.7,
conditional the
exist..
of other
(2) the
peting
for
recovery
R1..FIRST
and
rates,
given
was the
unconditional
used
fast
recoveries,
the
problem
of the redundancy
possible
repairs
systems
both
rates
or in which
at the
same
a dis-
As discussed calculate
the
times
are affected
by
recoveries
in state
(2,2,7)
system.
(1) the system time,
leaving
given. recovery
management
are discussed:
triads
transitions.
of competing
behaves
this
exponentially
will automatically
exponential
actually
leaving
transitions
to be
these
program
transitions
fast
assumed
to specify
times
fast
for all of the were
SURE
recovery
First
one of the other
processes
How the system
three
Repaired
and
can be dif-
of the Many
always
two-triad
possibilities repairs
(3) two independent
triad repair
place.
model
two
case
the design
system
take
The
special
competing
upon
To illustrate,
processes
recovery
FAST keyword the
than
probabilities
the
accurately.
depends
1 first,
the
from
1 is Always
rather
with nonexponential
to model
example
above, SURE
rates
presence
transition
for this
For systems ficult
will be traversed
examples
in section
26: Triad
case
(1),
triad
transitions
no longer
R2_SECOND,
R1 and
R2.
1 alway_
This
may
occur
or may
depends
repaired
first,
is shown
simultaneously, not
on the
have
the
system,
the same
and
may
in figure recovery
distribution
26.
Although
transition as the
be determined
rates, noncom-
by analysis
or experimentation. The
case
(2) mode]
of both
triads
being
repaired 69
at the same
time
is shown
in figure
27.
2AI
__
2A2 °°*
• Figure
The
mean
and standard
27:
Both
deviation
Triads
third
recoveries
are
case,
two independent
truly
independent
Repaired
of the multiple
perimentally. This can be accomplished the time to recovery completion. The
are
and
recovery
by injecting repair not
at the
processes, competing
Same
transition
Time
must
two simultaneous is shown
faults
in figure
for resources,
be determined
the
and
28. Even transition
ex-
measuring if the
two
rates
will
still be different from the noncompeting rates competition. There is no way to analytically
because they are conditioned on "winning" the determine the conditional means and standard
deviations
distributions
distributions
from must
the
unconditional
be measured
recovery experimentally.
7O
in general;
therefore,
these
four
7_
2A1
__
2A2
( __
Figure
28: Two
2_
°°°
Independent,
71
Repair
Processes
11
Transient
Computer
systems
nent
faults.
time
and
Since
significant is that left
can
than
restarted, faults
system--if
tend
to occur
deceive
the
and
much returned
for a much
the
more
a solid
to operation.
would
depends
11.1
Transient
A transient system's transient
Case
upon
fault voters. fault:
may The
1: Reconfiguration
I s
careful
of these
Fault
Behavior
or
not
may
does
I el
errors graphs
2: Reconfiguration
I
system
fault
fault,
I
I
e2 e3
e4
which
are
illustrate
I
fault
may
be
removed,
to near-coincident
and
also
detectable the
I
e5 e6
Z
may
increasing designed of such
by
two
the
possible
operating effects
of a
-->
t
-->
t
I ...
en
.....
>1
occurs
I
I
I
I 72
a
faults
is transient
be repeatedly
vulnerable
have
occur
t
I
R
[= NP;
Vo_er
(* Processor failures *) IF NP > NFP TRANTO NFP = NFP+I
IF
that
example
Dependencies
voltage
are
Dependencies
system.
in this
Failure
regulators.
rate
failure
history are
Consider voltage
Sequence
experience
failure
dependencies
the
and
*)
of active processors *) of failed active processors *) of working voltage regulators *)
BY
3 working
processors,
2 regs.
*)
*)
(NP-NFP)*LAMBDA_P;
(* Failure of processor due to voltage surge *) (NR = O) AND (NP > NFP) TRANTO NFP _ NFP+i BY (NP-NFP)*LAMBDA_V;
(* Voltage regulator failures *) IF NR > 0 TRANTO NR = NR-I BY NR*LAMBDA_R; (* Reconfiguration *) IF NFP > 0 TRANTO (NP-I,NFP-I,NR)
Since bilities of the the
this will
one
system
be
model
above
could
failure
DEATHIF
contains
lumped
capture
condition
(NR=O)
only
together.
AND
BY
one
The the
;
statement,
DEATHIF
addition
of another
probability
of having
is reached: (2 * NFP
>= NP);
113
all of the DEATHIF both
system
statement
voltage
failure
proba-
placed
regulators
in front fail
before
13.2
Recovery
In this
system,
Rate
the
active
components.
space
variable
speed
(* RECOVERY
of the
This
NP,
the
BATE
=
START
recovery
is accomplished
number
process
is significantly
by making
of active
DEPENDENCIES
the
affected
recovery
rate
a function
of
of the state
processors.
*)
(NP: O..N_PROCS, NFP: 0..N_PR0CS);
(* Number (* Number
2 * NFP
>= NP;
(* Vo_er
(* Processor failures ") IF NP > NFP TRANTO NFP = NFP+I
of active of failed
*)
processors ,) active processors
defeated
*)
BY
*)
(NP-NFP)*LAMBDA_P;
(* Reconfiguration where rate is a function of NP *) IF NFP > 0 TRANTO (NP-1,NFP-I) BY ;
Models
provides the user with the in each of the operational
probabilities.
operational state models to model
final
number
= (N_PROCS,0);
DEATHIF
state
by the
(* Initial number of processors *) (* Permanen_ failure rate of processors
N_PROCS = 10; LAMBDA_P = 8E-3; SPACE
Dependencies
considerably environment
only during first
to calculate
phase
the initial
state
dif-
higher of space.
a specific
of the
model. (The second model usually differs from the first model is repeated for as many phases as there are in the mission.
114
during
mission.
phase, The
probabilities
in some
manner.)
The
SURE
for
the
death
are
usually
fast
transitions
usually
program
leaving
in the
operational
time for the
final
death
will
usually
always
The
the goes
machines
into
each
phase.
spend subsequent not
words,
the
crude
separation
of the
may
sometimes
bounds
small
of
separation state
obtained
mission
apart,
they
which
a landing
operates a triad of
phase
be
model
this
The
following
lasts
two-phased
the
cruise
and
(2) landing.
warm
spares.
For
simplicity,
the
cruise
phase
which
lasts
After During that
system
the
this
cruise
phase,
would
be
phase, the
needed
is designed
to
input
different
describes
models
a model
for
must
"turn
the
be
cruise
off"
created--one phase:
MU = 7.9E-5; SIGMA = 2.56E-5;
(* Mean time to replace with spare *) (* Start. dev. of time to replace with spare
*)
MU_DEG = 6.3E-5; SIGMA_DEG = 1.74E-5;
(* (*
*)
SPACE
(* Number (* Number (* Number
=
(NW: 0..3, NF: 0..3, NS: O..NSI);
Number of spares initially *) Failure rate of active processors Failure rate of spares *) Mission time *)
*)
Mean time to degrade so simplex *) S%an. dev. of time to degrade to simplex of working processors *) of failed active procssors of spares *)
*)
(3,0,NSI);
LIST=3; IF NW > 0 TRANTO IF
(* A processor (NW-I,NF+I,NS)
(NF > O) AND (NS > O) TBANTO (NW+I,NF-I,NS-I)
can
fail
*)
BY NW*LAMBDA; (* BY
;
115
A spare
becomes
active
*)
the
workload
phase. two
for
to perform
(* (* (* (*
START
will
two
degradation.
processing
mission,
ASSIST
and
3 minutes.
Therefore,
this
phases--(1)
During
and
additional
tolerated. during
failure.
sparing
which the
basic
of processors
spare
by
that
in two
NSI = 2; LAMBDA = IE-4; GAMMA = IE-6; TIME = 2.0;
=
the
of the
probabilities
in phased far
the
of their
crudeness
recovery
unacceptably
states
than
percentage the
with
these lower
to an excessive
bounds
be
as but
states
Fortunately,
operational final
(i.e.
of magnitude
phases
lead
just
probabilities_
states
a very
do
using
is so high cannot
for
typically
orders
states
state
together.
several
in
reconfigures
processes to
close
operational
death
recovery
Thus,
detection
reconfiguration order
on
phases
a small
a system
reconfiguration
In
very
the
the
correct.
system
on
bounds
are
on
as
in previous other
the
perfect
system the
in only
is implemented
assume
2 hours,
In
mathematically
system will
not
tight
recoveries.
states
Although
we have
usually
systems
bounds
as
lower
which
because
bounds.
result
Suppose
we
model
lower
not
and
probabilities
performing
state
be
are
and are
upper
them)
recovery
calculations.
upper bounds
The
operational
states
bounds
These
acceptable.
have
other
reports
states.
the
IF
(NF > O) AND (NS = O) TRANT0 (I,0,0) BY ;
IF NS > 0 TRANTO
(NW,NF,NS-1)
(* No more
spares,
(* A spare
fails
degrade
and
_o simplex
is deSecSed
*)
*)
BY NS*GAMMA;
DEAI'HIF NF >= NW;
The
ASSIST
program
generates
the following
SURE
model.
NSI = 2; LAMBDA = 1E-4; GAMMA = 1E-6; TIME = 2.0; MU = 7.9E-5; SIGMA = 2.56E-5; MU_DEG = 6.3E-5; SIGMA_DE = 1.74E-5; LIST = 3;
2(* 2(* 3(* 3(* 3(* 4(* 4(* 5(* S(* 5(* 6(* 7(* 7(* 8(*
3,0,2 3,0,2 2,1,2 2,1,2 2,1,2 3,0,1 3,0,1 2,1,1 2,1,1 2,1,1 3,0,0 2,1,0 2,1,0 1,0,0
*), *) *) *) *) *) *) *), *) *) *) *) *) *)
3(* 4(* 1(* 4(* 5(* 5(* 6(* 1(* 6(* 7(* 7(* 1(* 8(* 1(*
2,1,2 3,0,1 1,2,2 3,0,1 2,1,1 2,1,1 3,0,0 1,2,1 3,0,0 2,1,0 2,1,0 1,2,0 1,0,0 0,1,0
*) *) *) *) *) *) *) *) *) *) *) *) *) *)
= = = = = = = = = = = = = =
3*LAMBDA; 2*GAMMA; 2*LAMBDA; ; 2*GAMMA; 3*LAMBDA; 1*GAMMA; 2*LAMBDA; ; 1*GAMMA; 3*LAMBDA; 2*LAMBDA; ; I*LAMBDA;
(* NUMBER OF STATES IN MODEL = 8 *) (* NUMBER OF TRANSITIONS IN MODEL = 14 *) (* 4 DEATH STATES AGGREGATED STATES 1 - 1 *)
The deleting resulting
model the
for the
second
reconfiguration
phase
(call
transitions
it "phaz2.mod") and
changing
file is:
NSI = 2; LAMBDA = 1E-4; GAMMA = 1E-6; TIME = 0.05; LIST = 3; 2(*
3,0,2
*),
3(*
2,1,2
*)
= 3*LAMBDA;
116
is easily the
mission
created time
with to 0.05
an editor hours.
by The
2(* 3(*
3,0,2 2,1,2
*), *),
4(* 1(*
3,0,1 1,2,2
*) *)
= 2*GAMMA; = 2*LAMBDA;
3(* 4(* 4(* 5(*
2,1,2 3,0,1 3,0,1 2,1,1
*), *), *), *),
5(* 5(* 6(* 1(*
2,1,1 2,1,1 3,0,0 1,2,1
*) *) *) *)
= = = =
5(* 6(* 7(*
2,1,1 3,0,0 2,1,0
*), *), *),
7(* 7(* 1(*
2,1,0 2,1,0 1,2,0
*) *) *)
= 1*GAMMA; = 3*LAMBDA; = 2*LAMBDA;
8(*
1,0,0
*),
1(*
0,1,0
*)
= I*LAMBDA;
The
SURE
the LIST
= 3 option.
probabities
SURE
is then This
executed
causes
as well as the death
V7.2
1? readO 317
program
NASA
Langley
the
2*GAMMA; 3*LAMBDA; 1*GAMMA; 2*LAMBDA;
on the SURE
first model program
state
probabilities.
Research
Center
(stored
to output This
in file "phaz.mod"), all of the
is illustrated
operational
below:
phaz
run
MODEL
FILE
DEATHSTATE
SURE
= phaz.mod
LOWERBOUND
UPPERBOUND
COMMENTS .................................
................................
1 TOTAL
OPER-STATE 2 3 4 5 6 7 8
9.35692e-12
9.48468e-12
9.35692e-12
9.48468e-12
LOWERBOUND 9.99396e-01 0 O0000e+O0 6 02277e-04 0 O0000e+O0 i 80332e-07 0 O0000e+O0 3 57995e-ll
20 PATH(S) PROCESSED 0.617 SECS. CPU TIME 32? exit
V7.2
UPPERBOUND 9.99396e-01 1.53952e-06 6.03819e-04 1.43291e-09 1.81768e-07 5.59545e-13 3.63591e-ll
UTILIZED
117
11 Jan
90
13:56:49
RUN
#1
using state
The can
SURE
be
used
program
also
to initialize
"phaz.ini", above is:
i.e.
adds
creates
the ".ini"
INITIAL_PROBS( 1: ( 9.35692e-12, 2: ( 9.99396e-01, 3: ( O.O0000e+O0, 4: ( 6.02277e-04, S: ( O.O0000e+O0, 6: ( 1.80332e-07, 7: ( O.O0000e+O0, 8: ( 3.57995e-11,
a file
states to
the
for
the
containing next
file name.
these
phase.
The
probabilities
The
contents
SURE of this
in
a format
program file
names
generated
by
that the
the
file run
9.48468e-12) 9.99396e-01) 1.53952e-06) 6.03819e-04) 1.43291e-09) 1.81768e-07) 5.59545e-13) 3.63591e-11)
);
Next,
the
initialized states
SURE
using
the
in an equivalent
format
for the
SURE
program SURE manner
is executed
on
INITIAL_PROB$ to the
first
the
second
model.
command. model.
Note
The that
second ".ini"
The
state
probabilities
model
must
file output
number
is in the
program:
sure
SURE V7.2
NASA
17 readO
Langley
read
32: 33: 34: 35: 36: 37: 38: 39: 40: 41:
INITIAL_PROBS( 1: ( 9.35692e-12, 2: ( 9.99396e-01, 3: ( O.O0000e+O0, 4: ( 6.02277e-04, 5: ( O.O0000e+O0, 6: ( 1.80332e-07, 7: ( O.O0000e+O0, 8: ( 3.57995e-11, );
42?
run
1 TOTAL
phaz.ini
FILE
DEATHSTATE
Cen_er
phaz2
317
MODEL
Research
9.48468e-12), 9.99396e-01), 1.53982e-06), 6.03819e-04), 1.43291e-09), 1.81768e-07), 5.59545e-13), 3.63591e-11)
SURE V7.2
= phaz.ini
LOWERBOUND
UPPERBOUND
8.43564e-11
9. 98944e-II
8. 43564e- 11
9. 98944e- 11
COMMENTS
118
11JaJ_
90
13:58:i2
RUN
are
#1
its
correct
OPER-STATE
UPPER3OUND
LOWERBOUND
2 3 4 5 6 7 8
9 I 6 I I 3 3
9.99381e-01 1.49908e-05 6.02368e-04 9.03554e-09 1.80359e-07 2.70540e-12 3.57993e-ll
99381e-01 65304e-05 03910e-04 04918e-08 81795e-07 28658e-12 63589e-ll
9 PATH(S) PRUNED AT LEVEL 1.49540e-16 SUM OF PRUNED STATES PROBABILITY < 5.04017e-18 9 PATH(S) PROCESSED 0.417 SECS. CPU TIME 43?
14.2
UTILIZED
Non-Constant
In the
previous
for
each
the
same,
section,
of the but
Consider
Failure a two-phased
phases. some
a triad
Rates
A related
system
situation
was occurs
analyzed when
parameters,
such
as the
with
spares
that
experiences
x 10 -4 ,
"r=
10 -4
warm
failure
which
the
rates,
required
structure
change
different
from
failure
of the one
different
models
model
remains
phase
rates
to another.
for each
of the
phases: , phasel(fimin):
_=2
• phase
2 (2 hours):
• phase3
(3rain):
The
SURE
for the INPUT
The
same
parameter LAMBDA,
A=10 A=10
model values
GAMMA,
full SURE
-3 ,
"_=10 "_=10
can be used
using
the
SURE
-s -4
for all of the phases, INPUT command:
TIME;
model,
INPUT LAMBDA, GAMMA, NSI = 2; MU = 7.9E-5; SIGMA = 2.56E-5; MU_DEG = 6.3E-5; SIGMA_DE = 1.74E-5; LIST = 3; QTCALC = 1;
-4 ,
stored
in file "phase.mod,"is:
TIME;
119
and
the
user
can
be prompted
2(* 2(* 3(* 3(* 3(* 4(* 4(* 5(* 5(* S(* 6(* 7(* 7(* 8(*
3,0,2 3,0,2 2,1,2 2,1,2 2,1,2 3,0,1 3,0,1 2,1,1 2,1,1 2,1,1 3,0,0 2,1,0 2,1,0 1,0,0
The
0TCALC
numerical sions. The SURE
*), *), *), *), *), *), *), *), *), *), *), *), *), *),
1? readO
2,1,2 3,0,1 1,2,2 3,0,1 2,1,1 2,1,1 3,0,0 1,2,1 3,0,0 2,1,0 2,1,0 1,2,0 1,0,0 0,1,0
= 1 command
routines. interactive
V7.2
3(* 4(* 1(* 4(* S(* 5(* 6(* 1(* 6(* 7(* 7(* 1(* 8(* 1(*
NASA
*) *) *) *) *) *) *) *) *) *) *) *) *) *)
causes
= = = = = = = = = = = = = =
3*LAMBDA; 2*GIMMA; 2*LAMBDA; ; 2*GAMMA; 3*LAMBDA; I*GJLMMA; 2*LAMBDA; ; 1*GAMMA; 3*LAMBDA; 2*LAMBDA; ; I*LAMBDA;
SURE
the
This increased accuracy session follows: Langley
Research
program
is often
to use
necessary
more when
accurate analyzing
(but phased
Cen_er
phase
LAMBDA? GAMMA? TIME?
2e-4 le-4 .1
30? run MODEL
TIME
FILE
=
= phase.mod
1.000o-01,
DEATHSTATE
GAMMA
SURE
=
1.000o-04,
V7.2
LAMBDA
=
LOWERBOUND
UPPERBOb_D
COMMENTS
1.78562e-12
1.89600e-12
1.78562e-12
1.89600o-12
#1
slower) mis-
7 8
O.O0000e+O0 5.08706e-14
5.17358e-15 5.60442e-14
10 PATH(S) PRUNED AT LEVEL 4.75740e-20 SUM OF PRUNED STATES PROBABILITY < 6.11113e-20 Q(T) ACCURACY >= 14 DIGITS 10 PATH(S) PROCESSED 2.867 SECS. CPU TIME 31? readO phase LAMBDA? GAMMA? TIME?
UTILIZED
le-4 Ie-5 2.0
60?
read
61: 62: 63: 64: 65: 66: 67: 68: 69: 70:
INITIAL_PROBS( 1: ( 1.78562e-12 2: ( 9.99920e-01 3: ( O.O0000e+O0 4: ( 7.89960e-05 5: ( O.O0000e+O0 6: ( 2.67751e-09 7: ( O.O0000e+O0 8: ( 5.08706e-14 );
71?
0.07 run
MODEL
TIME
phase.ini
FILE
=
SECS.
89600e-12), 99920e-01), 98043e-07), 99941e-05), 14966e-10), 80076e-09), 17358e-15), 60442e-14)
MODEL
FILE
SURE
= phase.ini
2.000e+O0,
DEATHSTATE
TO READ
1 9 9 7 1 2 5 S
GAMMA
LOWERBOUND
=
1.000e-05,
LAMBDA
UPPERBOUND
TOTAL
0PER-STATE
=
1.13950e-li
1.11438e-ll
1.13950e-ll