THE ART OF FAULT-TOLERANT SYSTEM RELIABILITY MODELING ...

4 downloads 0 Views 5MB Size Report
Ricky W. Butler and Sally C. Johnson. March. 1990 ..... l. 10 0. 101 l0 s. 103. 104 l0 s. Time (hrs). 10 6 l0 T. Figure. 3: Probability ...... Daniel P.; and Swarz, Robert.
NASA

Technical

Memorandum

102623

t=.

THE ART OF FAULT-TOLERANT RELIABILITY MODELING

Ricky W. Butler

March

and Sally C. Johnson

1990

(NASA-T_-1026_3) SYSTE_

SYSTEM

R_LIA_ILTTY

THL

ART

M(3,q_LING

Of:

FAULT-TOLERANT (NASA)

1._? CSCL

N90-22292 p O¢P, G3/0_

Nalional Aeronautics and Space Adminislration Langley Research Center Hampton, Virginia 23665-5225

Unclds 0280231

Contents 1

Introduction

2

Introduction

3

Modeling

4

5

1 to Markov Non-Reconfigurable

3.1

Simplex

3.2

Static

3.3

N-Modularly

3.4

Fault

Modeling

2

Modeling Systems

3

Computer

................................

4

Redundancy

................................

5

Redundant

Tree

Analysis:

System

Reconfigurable

4.1

Degradable

4.2

Triad

Triad

4.3

Degradable

4.4

Triad

with

4.5

Some

Observations

to Simplex Quad One

.........................

A Few Comments

7

.....................

7 10

Systems .................................

10

.................................

12

.................................

13

Spare

................................

About

14

Multi-Step

Fault-Error

Handling

Models

.....

14

Reliability Analysis Programs 5.1 Overview of SURE ................................

16 16

5.2

19

5.3

5.4

Model-Definition

Syntax

.............................

5.2.1

Lexical

5.2.2

Constant

5.2.3

Variable

5.2.4

Expressions

5.2.5

Slow

Transition

Description

.......................

21

5.2.6

Fast

Transition

Description

.......................

22

5.2.7 SURE

FAST Exponential Transition Commands .................................

5.3.1

READ

5.3.2

RUN

Command

..............................

24

5.3.3

LIST

Constant

..............................

24

5.3.4

START

5.3.5

TIME

5.3.6

PRUNE

Overview

Details

..............................

Definitions

...........................

Definition

19

............................

20

................................

Command

Constant Constant

of the

19

and

21

Description

...............

.............................

24

.............................

25

.............................. WARNDIG

STEM

and

Constants

PAWS

Programs

22 24

25 ................... ..................

25 25

6

Reconfiguration

by 6-plex

26

Degradable

6.2

Single-Point

.................................

6.3

Fail-stop

6.4

Dual-Dual

.....................................

6.5

Degradable

Quad

Failures Dual

Reconfiguration

8

Degradation

6.1

...............................

with

By with

Two

Cold

7.2

Triad

with

Two

Warm

The

ASSIST

_.2

32 34

Partial

Fail-stop

or Self-test

...............

35

Sparing

Triad

Model

Abstract

28

...................................

7.1

8.1

26

38

Spares

...........................

Spares

..........................

39

Language

42

Specification

Language

Syntax

38

............................

42

Statement

43

8.1.1

Constant-Definition

......................

8.1.2

SPACE

Statement

.............................

8.1.3

START

Statement

.............................

8.1.4

DEATHIF

44 44

Statement

...........................

45

8.1.5

PRUNEIF

Statement

...........................

45

8.1.6

TRANTO

Statement

...........................

45

8.1.7 Model Generation Algorithm ....................... Illustrative Example: SIFT-Like Architecture .................

Reconfigurable

Triad

Systems Spares

49

9.1

Triad

with

Cold

9.2

Triad

with

Instantaneous

9.3

Degradable

Triad

with

Non-Detectable

9.4

Degradable

Triad

with

Partial

10 Multiple 10.1 Two

47 48

.............................. Detection

50

of Warm

Spare

Spare

Detection

Failure

Failure

of Spare

..........

.............

Failure

52 53

...........

54

Triads

57

Triads

with

Pooled

(',old

Spares

10.2

Two

Triads

with

Pooled

Cold

Spares

10.3

Two

Degradable

Triads

with

Pooled

Cold

10.4

Two

Degradable

Triads

with

Pooled

Warm

10.5

Multiple

Non-degradable

10.6

Multiple

Degradable

Triads

with

Pooled

Cold

10.7

Multiple

Degradable

Triads

with

Pooled

Warm

10.8

Multiple

Competing

Recoveries

Triads

...................... Reducable

with

Spares

.........................

Triad

................

Spares

Pooled

ii

57 to One

Cold

Spares Spares

58 60

............... Spares

........

62 ...........

63

.............

64

.............

66 67

11 Transient

and

Intermittent Behavior

Faults

72

.............................

72

11.1

Transient

Fault

11.2

Modeling

Transient

11.3

Model

11.4 11.5

Degradable NMR with Transients FTP ........................................

11.6

Modeling

of Quad

Faults

Subject

Intermittent

.............................

to Transient

Faults

and

73 Permanent

Faults

........................ ..........................

84

Failure Dependency ................................ Two Triads with Three Power Supplies

12.4

Byzantine and

Sequence

13.1

Failure

13.2

Recovery

14 Sequences 14.1 Phased

15

88 91

12.2 12.3

13 Time

Rate Rate

Dependencies

113 ............................ .........

113 ..................

of Reliability Models Missions ..................................

Non-Constant

Failure

14.3

Continuously

Varying

92 107 109

Dependencies

14.2

Concluding

.....................

.................................

Dependencies

75 77 81

12 Modeling Control System Architectures 12.1 Monitored Sensors ................................

Faults

..........

114 114 114

Rates

...........................

119

Failure

Rates

123

.......................

124

Remarks

ooo

nl

1

Introduction

The

rapid

cated

growth

of digital

computer

systems

capable

requirements

for computer

range

10 -r

of 1 -

are often

per

expressed

electronics

technology

of achieving

systems

mission,

used

and

very

led

high

in military

reliability

for flight-crucial

has

to the

reliability

aircraft,

systems.

still

ensure

susceptible that

The testing

to failure.

requirements

reliability

are

analysis

approach such

is clearly

impractical,

as light

investigate

reliability

the

To achieve

system

that

estimated

systems which

manufacturing high their

enable

replacing

tolerance. them

often

utilized

most

of the

reliability.

units with

detailed Only

the

reliability

their own faults; i.e. faults, these systems must

be evaluated

to

highly

or dissimilar

spares

reliable

the process

instruction-level "macroscopic"

goals

of

methodology of the

order

generally

that

1 -

taken

to

of failing with

of removing

activities

events

faulty

Since

techniques,

ultrasuch

as

to achieve

components

and

is another

either method

Fortunately,

directly

be included

of

current

to meet

redundancy.

do not

models the system

same function,

configuration,

must

and

reliability

the

itself.

models

components.

of more

of a system

model

formulating failure

redundancy

for computing

the overhead

on the

Mathematical

adequate

use

parameters.

is complex, task.

lead to system

to an alternate

fault-related

specified

dependent

system

systems

algorithms

without

life-testing

approach

the

be a difficult

circuitry

or degrading

reliability

and

strongly

presence

produce

The

reliability

A life-

is

reliable

can

in the

cannot

of a diversity

model

the processes

operation

problem.

The

the model

highly

behavior

is a complex

of the system

of the upon

Reconfiguration,

to increase

system

is consequently

capture

requirements,

redundant fault

that

techniques

reliability

parallel

based

reliability

must

optimistic

flight

or "lifetime"

devices. with

is necessary.

model

parameters

reliability

represent

fault-tolerant capabilities

the

system

systems

reliable

of a fault-tolerant,

accurately

such

systems

reliability

electronic

approach

reliability

system

the behavior

and

the

for computer

a mathematical

3. compute

Since

batteries,

of a highly

or estimate

in the

10 -9 for a 10-hour

tolerate certain

of these

computer

to determine

an alternate

2. measure

The

used

bulbs,

hence,

reliability

of a fault-tolerant

though,

10 -7 or higher;

the

Reliability

are typically

met.

is typically

products

1. develop

Thus,

requirements.

of 1 -

goals, computer systems have been designed to recognize and fault-tolerant computer systems. Although capable of tolerating are

of sophisti-

for example,

requirements

avionics

proliferation

affect

in the

system

reliability

model. Furthermore, as much done

experimentally

experimentation

is to carefully

testing

as is required develop

the

reliability

the

correctness

for life testing! model

and

of the model Consequently, subject

would

require

the best

it to scrupulous

that

at least can be

scrutiny

by

a team best,

of experts. should

The

be called

process

an art.

of reliability

modeling

is thus

It is the goal of this paper

not

an exact

science,

to look into this "craft"

and

at

of reliability

modeling. The

paper

is structured

Consequently,

elementary

more

concepts.

complex

Markov

models.

systems

are developed.

are explored. models and

the

language. and

allows systems Then

is introduced. in more a new

This

is necessary

If they were the

reader

consisting explores

including

sensors,

dependencies,

2

used

the phased

been

the

accomplished

reliability

analysis

missions,

The time.

system

analysis

using

combinatorial

is thus

A particular

on such

attributes

are typically

number of spare more attributes typically

tries

related behavior transition time

with

the

SURE

in the

later

and

sections

ASSIST

intermittent

faults

the

components

of control

complex

are investigated.

topics,

The

next

architectures

such

as sequence

are presented.

Modeling

of a complex

system

mathematics. mathematics.

consisting The

more

powerful

of a vector

attributes

is called

of many

standard

Unfortunately,

as consisting

characteristics

ex-

language

Next,

system

models,

large

model

are presented.

specialized rate

very

of the

of reconfiguration

failure

At this ASSIST

are

manner.

and

the

language.

of the

economical

some

SURE

reliability

transitions

forms

Finally,

the

models--the

expressiveness and

non-constant

components

"fault the

tree"

fault-tree

has

method

of

approach

is

In reconfigurable systems reconfiguration process. It

Markov

modeling

of attributes a "state"

which of the

over These

as the

processors,

the

which have not been removed, etc. complex the model will be. Thus,

The one

can

step in the modeling Since this transition

of working

change

system.

such

that

number

technique.

units more

set of attributes

The next to another.

input

reliability

the states The

systems

program

to solve

various

to model

of the

smallest

of the system. from one state

presented

of

of reconfiguration--degradation

the help of the

in a succint

units, the number of faulty included in the model, the

to choose

types

for describing

models

using

represented system

the two basic

aspects

non-reconfigurable

the computer

can be used

where reconfiguration is possible. the effectiveness of the dynamic

systems

set of values

models,

[8], which

Markov

reliability

such

reconfigurable

etc.

and

to

to model

in modeling

transient

used

incapable of analyzing systems the critical factor often becomes is necessary

used

using

actuators,

is based

techniques

exhausted.

triads

to model

Introduction

Traditionally,

become

techniques buses,

of essential the

complicated

models.

by increasingly

an overview

by enumerating

to be defined

of multiple

the techniques

section

the

of reliability

followed

for modeling

detail

since

are

used

language

defined

would

models

begins

as a catalog which

with

Next,

examined

than first

techniques

Evaluator)

introduces

complex

paper

on to more Range

rather

introduced

basic

paper

complex.

haustively,

the the

pressing

Unreliability

sparing--are

style

are

the fundamental Then,

Before

numerically,

point,

Thus,

Next,

(Semi-Markov

in a tutorial concepts

accurately

describe

the

fault-

process is to characterize the time is rarely deterministic,

the

transition

times

Certain

states

behavior must

are described in the

or correct

represent

difficult

because

software

state,

system

operation

system

reliability

system

made.

of the

For example,

of two faulty faults

may

simplifies

not

actually

the model

to be modeled. conservative the

is the

primary

model

may

then

the

the

wishes

then

corrupt

in such

data

probabilities to say the

reliability

between

the

the

voter.

the faulty than

modeler

the

since This

the

two

assumption

does

value,

only

are

the presence

pair

some

assumes

or

that

assumptions

is conservative

is no better

For example,

conservative

of computers,

a way as to defeat

of collusion

is

events,

to say

conservative

This

the

transitions

behavior

reliability

a model of the

recovery

transitions.

can

be obtained

transition

rates

model

not have then

certain

be measurable.

for a system.

system,

transitions

paper

become which some

fault

the

for fault-tolerant

N-modularly-redundant

from

non-

parts

of

This often

Although

if it depends

the

a particular

upon

injection.

(NMR)

from

unmeasurable categories:

model

and/or

primary

unobservable.

and

two

are

to fault problem

prop-

217C

cal-

arrivals

and

is to properly

transitions.

If the

If the

is too

model

slow

defined

MIL-STD

responses

model

is

detailed

can be exorbitant. modeling

assumptions

have used

been

introduced.

in the

development

The of

systems.

Systems

are non-reconfigurable modeling by describing

a single

system.

of the

of these

Non-Reconfigurable

ranging

states

The

of reliability

computer

fall into

field data

determination

methods

model

to system

be measured

types of systems to model basic elements of reliability systems

If the

experimentally must

of the issues

is to explore

system

correspond

using

so as to facilitate

Modeling

reconfigurable

developing

the

experimentally

models

The simplest troduces the

in the

of a fault-tolerant

fast

of transitions

of this

when

describe

In this introduction

3

either

If one

failure.

all of the transitions

faster

system

number

reliability

value,

system failure

of external

to make

failure.

to be system

slow transitions

The

too coarse

goal

a specific

for the system

function

is considered

are made.

and

can be measured

the

is forced

is system

chosen

fault-free

it is useless.

the

transitions

culations. model

than

complex

modeler

represent

constitutes

system

consideration

Typically, then

what

model

what

(TMR)

wishes

that

elegantly

parameters,

erly,

The

exactly

others

can fail.

It is important

failure

The

about

while

redundant

assumptions

system

of faults.

an extremely

modular

since the

If one

failure,

Defining

state.

is higher

in a triple

computers

presence

is often

hardware

assumptions

distribution.

system

properly.

failure

non-conservative

a probability

represent in the

failure

system and

using

simplex

computer

systems. This section inhow to model simple nonthrough

a majority-voting

Figure 1: Model of a Simplex Computer 3.1

Simplex

The first

Computer

example

variable

is a system

representing

for T, say computers,

consisting

the time

of a single

to failure

F(t). Typically, it is assumed fail according to the exponential F(t)

The parameter modeling is the

A completely failure rate

computer.

of the computer.

First,

Next,

that electronic distribution:

= Prob[T

< t] = 1-

we let T be a random

we must

define

a distribution

components,

and

consequently

e -at

defines this distribution. An important (or hazard rate), h(t) defined as follows

concept

in reliabihty

h(t) = For

the

exponential

distribution given

in figure

simplex

distribution,

with

a constant

1. In this Markov

computer

is working,

has

and

computer

the

failed,

the

For reliability higher have

failure been

reader

rates

rate

transition

of estimating

in a computer of the individual

the chips in the of the computer, the

computer

the chips.

Fc(t)

is determined

Some before

failure

1 to state

that

product

delivery;

suppose

rate

11, _2,...,

may

Once

is simply _.

the the

represent

discussion reliabihty

as follows:

of the

the failure

T be a random variable representing the the time the ith chip fails, the distribution

of and fail

a somewhat

mature

sum

the

simplex

components

distribution

complete

is

occurrence

exhibit

however,

exponential

components. failure

the

are exponential

electronic

devices

[2] for a more

the

only

system

in which

in which

model

immature to the

of electronic

state

state

of a Markov assumed

is the this

2 represents

according

computer's

exponential

representing

the operational

handbook

To see this,

computer. Letting and Ti represent

system

217D

reliability the

state

The

model

to fail

MIL-STD

is known,

the

it is generally testing

= ._.

1 represents

from

distribution.

experimentally to the

state

h(t)

Markov

The transitions hazard rate.

due to insufficient

shown

rate The

2 represents

purposes,

exponential

is referred

problem chip

modeling

to the

rate.

model,

state

the failure of the simplex computer. thus can be labeled by the constant according

hazard

hazard

devices [1].

The

on the of each failure rates

of

time of failure of failure for

3._

C

Figure

2: Model

v (t)

If we assume

that

the

chips

is an exponential

The

above

Prob[T

=

Prob[min{T_,T2,...,Tn}

=

1-

examined

system

in the

following

3.2

Static

The

Triple-Modular

tectures. tions

The

a failed

the

computers

system

consists

the

same

computer

are assumed

single failure

failure does not does not occur state

state

to being

State

computers

In figure failure

rate

used

=

1-

]-Ii"=l exp(-Ait)

---

I-

exp(-

V TM ,--,/=1

failure

rate

for parallel

the

time

> t] Ait)

--,i=_ redundant

that

the

1 to state fail.

The

system 3, the

The affect

first

systems. chip

the

fails.

The

time

Such

of failure

systems

will be

computer

archi-

are assumed

working

failure

exactly

computer.

(not

same

assumed

included

in this

that

computa-

isolated

Mathematically,

It is further

system

the

to be physically

such

therefore,

the

outputs

model),

and

condition

of three

3)_ to represent 2 when

there

because It can

rate

one processor of the

failure

be seen

that 5

computers.

at which has

are only two working

a majority

of system

the

working

computers

computers

as a function high

failed. in the

of mission

reliability

The

any

one

The that system time

is strongly

are

thus

its erroneous value to the external world. Thus, computers fail. The model of figure 2 describes

is in state

probability

)_ is lO-4/hour.

all performing

another

initial

2A since

hull-tolerant

computers

external

2 is labeled system

simplest

computers

to fail independently.

propogate until two

3 has rate

is one of the

of three

by the

1 represents

2 to state

3 represents

I]_'=_ Prob[Ti

(TMR)

inputs.

cannot

prior

from

1 -

work

> t]

sections.

Redundant

voted

system.

not

T,

we have

=

with

merely

> t, ....

Redundancy

on exactly

that

does

< t]

> t,T,

fail independently,

is not

System

< t]

Prob[T1

distribution

technique

of a redundant

of a TMR

=

Fo(t)

which

2_

a

system such a

transition

of the

transition

from

can fail. have

three State

failed.

is given.

The

dependent

on

10o 10-I 10-2 lO-S P.f

10-4

lO-S 10-6 lO-r lO-S 10 0

101

l0 s

I

I

I

l

10 3

10 4

l0 s

10 6

Time

Figure

3: Probability

(hrs)

as a Function

6

of Mission

Time

l0 T

Figure 4: Model of 7MR System a short mission time. It should be noted that it was implicitly assumedthat the system starts with no failed components(i.e. Prob in state 1 at time 0 = 1). This is equivalent to assumingperfect maintenancebetweenmissions. 3.3

N-Modularly

The

assumptions

The

voter

processors describes

of an

used

in such not

of mission as a function

Of course,

this

3.4 Fault

trees

used

for

(This

system

will not

more

from

with

voters.

the

the

same

a majority

voter--as

operational.

The

voter.

5.

The

Figure

the

practical

6 shows

problem

as for a TMR long

the

of building hardware

of

model

4)

(figure

of system

failure

unreliability

of system

failure

an

as a

of an NMR

_

0 as N ---, _.

arbitrarily

would

system.

as a majority

following

probability

probability

additional

A the

Few

large

significantly

N-way

increase

the

Comments

reliability

before succeed

fault

analysis

Fault

fault-tree

Markov

failure

in more

of complex

Compiler

systems solver

that

whenever

which

next

with

the

modeling quite

(see [4]). possible,

approach

efficiently

using

a reliability combinatorically

because 7

overcome

section.)

a limited

Thus,

system

for many

The

years.

Basically, removal

section

could

attempts the

combinatorial

to solve

can be solved

in this

the

could

in the

only be used

examples

systems,

detail

can be calculated Tree

all of the

occurs

system

of the fault-tree method. is no reconfiguration--the

be calculated

state-space

can

Thus,

In reconfigural_e

another cannot

trees

system.

tree.

will be explored

solutions

nonreconfigurable

the

a fault

powerful

Although

cient

the

Analysis:

been

components

Langley's

in figure

in hardware,

component

modeled

natorial

a 7-way

are

it is important to recognize the limitations can be used to model a system where there

of a faulty faulty

with

is still

Theoretically,

ignores

Tree

have

Nevertheless, a fault-tree been

is usually

system

is given

of N.

system

failure rate ._. If implemented in software, the CPU overhead could be enormous, increasing the likelihood of a critical task missing a hard deadfine [3].

Fault

The

time

If implemented

processor seriously

the

system

model

redundant

a system

failed

a 7-processor

system

System

N-modularly

have

function

voter.

Redundant

capabilities

of the

probability fault

have

to remove

tree

that

the

approach.

is needed. class

of problems,

a fault-tree engineer

solver

such

be wise fault-tree

combi-

as NASA

who frequently

would

the combinatoric

these

to use

analyzes an effi-

computations

10°

Pi

....

.....................................................................

10-10

10-2o 10 °

101

10 2

10 3 Time

Figure

5: 7MR

Unreliability

10 4

I 10 5

I 10 6

(hrs)

As a Function

of Mission

Time

10 7

10 0

10 -10

• "...,

10-20

10 -30

2

I

I

I

I

I

I

4

6

8

10

12

14

N

Figure

6: NMR

Unreliability

As a Function

of N

16

are typically

4

more

efficient

Modeling

Fault-tolerant come

to repair

The

The the

are two basic

In this

in later

4.1

Degradable simplest

tion.

The The

The

that

processor

with

analyzes state reason that

a separate

are

the

from the

the

described

transition

non-exponential models. Such

not

the

1 to state

of a faulty

of a

methods

models.

There

with

component

spares. without

sel of components. and

concepts.

state.

their

They

The

replacement

with

will be explored

and

(i.e.

the

2 the

system

in

time

is not

from

2 to state

state

rate.

processor the

programs

The

with

does not fail before good

results

from

faulty

[1].

which

10

bad.

processor.

than

Thus,

a rate.

process

can be used This

is complete. there

is also

system

state

2 to

Reconfiguration

rather

t is F(t).

of each

The from

has the

is simple--the

processors.

the diagnosis

processors.

failure

transition

The

revealed transition

probability

The

model to the class pure Markov models.

two good the

three

Consequently,

meaning

system.

operational.

the

processor.

F(t))

4 is less than

will be discussed

is operational

The

reconfiguration

exponential

failure

one failed

(e.g.,

of the

processors

to represent

the

by degradatriplex

of any of the

have

of the

To increase

degradable

all three

has

function

measurement

a constant

reconfigure

the problem.

reconfiguration)

a distribution

system.

which

failure

we do not

diagnoses

triplex

of a simple 1 with

2 represents

At state

is the

designed

in state

are identical,

the voter

been

behavior

begins

of recovery by

system

distinguish

the

replacement

a degraded

voting

have

transition generalizes the mathematical models are far more difficult to solve than

long as a second could

system

with

time

4 the

removal

and

reliability

and

components these

majority

systems

experimental

section, several computer semi-Markov models. At state

with

of faulty

introduce

7 describes

removal

labelled

distribution be

to complex

removal

continues

removal

upon

processors

for this is that

cannot that

state

the

4 represents

transitions

based

triplex

the errors

the

triplex

of figure

from

since

permanent

or physical

component

Triad

degradable

transition

Note

lead

logical

faulty

Reconfiguration

sections.

system,

model

the

we will briefly

architecture

of the

can

the

the

strategies--degradation

system

both

section

detail

rehability

and

of reconfiguration.

involve

to identify

greatly

reconfigured

a strategy

always

used

involves

involves

greater

The

vary

using

but

of reconfiguration

The

method

a spare.

varieties

method

replacement.

solutions.

Systems

designed

techniques

system

types

degradation

sparing

are often

in many

component.

used

are Markov

Reconfigurable

systems

strategies faulty

than

presence

of a

of semi-Markov In a subsequent

to solve transition Otherwise, a second

Markov

and

occurs

as

the voter transition

r(t)

2_

F(t)

Figure

7: Model

of Degradable

11

Triplex

F(t)

Figure

8: Model

from

state

2 to state

3 which

rate

of this

transition

is 2_,

since

an

absorbing

3 is a death near-coincident

state (i.e. failure.

At state in the state

active 5.

The

7 and state

4.2 The

model

figuration when was

presented process

either the

one

such

is no way

self-test other

As before

of figure

processor

ending

state

where

the the

failure

processors

processors between

and

6.

is one

taking

State

this

due

to

processors

the

system

to

ending

6 is thus

another

good

processor.

of reaching

State

process

remaining

of the last

probability

fail

system

no faulty

reconfiguration

in state

there

failure

and

may

the

of the

The

fail.

processor.

At state

death

8 there

state

is often

of parts.

in the previous the

dual

with

100%

the dual

failed, is not

the horizontal

the

was system

When

unrealistic.

to be perfect. failure

mode,

8 describes

was unrealistic

simplex

accuracy.

diagnosis

to diagnose

section

to the

two processors

for this

option--avoid

good

processor. could

Simplex

an assumption

programs

model

remaining,

two

of these

of a second

two processors

represents

occurs

7 to 8 represents

by exhaustion

from

of the

faulty

processors, there

to

one

System

failure

remaining

which

with

of a second

state

processors

Triad

state)

a race

to Simplex

coincident

of the

7 is the operational

from

to as failure

the

either

Either again,

failure

state

represents

is operational

5, once

the

and

no good

referred

The

state

transition

are

system

configuration.

At

in state death

4 the

of Triplex

in a dual i.e.

degrade

in one major

assumed diagnosed

using

which

a majority

However, In a later

respect--the

to be perfect.

with

only

section

system.

In this

directly

from

of the voter

recon-

In other two

on

processors

three

two good

a triplex

or more

processors,

we will explore section,

words,

the

use of

we will explore to a simplex

an-

system.

this system.

transitions

represent 12

fault

arrivals.

The

vertical

transition

repre-

F (t)

(

2_

._

F Ct)

Figure

sents than

system recovery. a rate to indicate

1 to state active

fails,

2_ competes faulty

4.3

the

plus

one

consists

is in state

of the

only one

9 describes one

of removing three

Quad

active

two. the

processors

Before

can fail.

When

occurs,

there

are still

two

state

3 with

rate

reconfiguration

transition

transition.

that

from

state

Reconfiguration

working

processors.

processor

remains

2 to death

consists

Thus, in the

function rather rate from state

the active

one of those

of discarding

transition

rate

both from

the

state

4

configuration.

Quad

in figure When

are three

fail; thus,

recovery

Degradable

processors. of the

can

5 is _ because

model

there

system

that

with

processor

to state

The

the

processors

of a Degradable

The recovery transition is labelled with a distribution that the transition is not exponential. The transition

2 is 3_ because

processors

9: Model

remaining

the

of those faulty processors

a degrad,_ble four

quad.

processors

processor,

fails

leaving

fails (state

This (state

a triad

5), the

system 2),

starts the

of processors

reconfiguration

with

four

reconfiguration (state process

4). entails

working process When

one

removal

of the faulty processor plus one of the working processors, leaving a simplex (state 7). Note that a different function is used for the transition from state 5 to state 7 than from state 2

13

3)`

,(

2)`

,@

3)`

,@

F(t)

( Figure

to state function

4.4

4. This is necessary of the state.

Triad

with

In the previous

models

continued

operation

duction to the This technique Suppose inactive.

The

10: Model

if the

One

model

degraded

a triplex of figure

processor

state

are 2 active

2, there

and

levels

sibility

which

has

this

experiments

and

hence

its rate,

varies

as a

its replacement

working

processors

one

have

been

with that

processor

spare

a spare

that

processor.

provides with

does

the system

a brief

a spare not

intro-

processor.

fail

when

it is

can fail; thus,

performed of the overall

While

the rate

the

and one isolation

system

is in

of the transition

to

there are once again three active processors 4 to state 5 is 3)`. This model assumes the the next

Multi-Step

14

section

and

there are two good processors 4 represents the detection and

to simplex upon system failure.

About

the distribution

This

processor

system.

the situation where from state 2 to state

Some Observations Models

of measuring

Spare

a faulty

replacing a faulty in section 7.

system

system does not immediately degrade in duplex until the next failure brings

Fault-injection

removed

of redundancy.

death state 3 is 2),. After reconfiguration occurs, that can fail; thus the transition rate from state

4.5

process,

process

10 describes

As before, state 2 represents faulty processor. The transition of the faulty

One

,_@

Spare

the reconfiguration with

with

reconfiguration

technique of sparing, i.e. will be explored in detail

we have

of Triplex

2)`

failure;

but

rather

Fault-Error

operates

Handling

at NASA

Langley

demonstrating

recovery

process

[5!. Recovery

the feaprocesses

havebeenshown to be nonexponential[5, 6], and distributions used

for recoveries

to estimate

ditional

mean

and

of a recovery Before with

the

from

different

types

and

standard

deviation

mean

standard

transition. White's

deviation

For more

solution

nonexponential

recovery

of faults.

are used was

transitions

system Fault

upon

SURE

was

the

extremely

even

difficult.

different

experiments

may

and

to define

section

solution

exhibit,

recovery,

program

see [7, 8] and

developed,

may

injection

conditional

in the

information,

technique

a given

the

this

be con-

behavior

5. of semi-Markov

Some

models

modelers

have

been

willing to make the assumption that the recovery process behaves according to an exponential distribution in order to facilitate the solution of the model. However, it should be recognized that

such

an assumption

Another

approach

compose models models Some sent

the

reliability

process.

fault-error

to detect, However. directly

process

use some the overall

that than

time

isolate,

and

models,

a system

process

parameters

now that observable

handling

steps

a fault-occurrence

was designed

could that

a single are

recover

parameters,

from

not

and

models

CARE

III single-fault

to accomplish

to enable exponential

a more

the

overall

accurate

was to de-

process.

However.

observable

or measurable.

is directly

observable,

the

can

be extremely

solution

an approach

15

difficult

technique is unnecessary.

model,

repre-

reconfiguration

representation

directly

a fault

semi-Markov such

as the

performs

of reconfiguration

an efficient

such

model

quantity.

semi-Markov

handling reliability.

individual

into

of unknown

of solving

19, 10]. Coverage parameters derived from the solution of the fault-error are inserted into the fault-occurrence model in order to compute the system

reconfiguration while

model

error

the necessity

handfing

This multi-step

models

a modeling

to avoid

a set of fault-error

of these the

introduces utilized

these

individual that

multi-step

For

example,

times

required

to measure

is available

of the

accurately. requires

only

5

Reliability

Before

we proceed

analysis PAWS

further

program

as well be

into

5.1

Overview

to the

becomes

to solve

insight

Probably

This

the

the

nature

of

the easiest SURE

is desirable

to present

is used

described

reasons: them

in the (1)

model

for the

in section SURE

As the

graphically

of any

being

for the SURE

reliability STEM

input

models

and

language increase

(2) these

parameter.

and

5.4.

This

in

programs

will provide

modeled.

SURE

way to learn

program

language

will be presented

as functions

systems

language

are

for two

impractical

of the

input

programs

the models

models

the input

same

These

of this paper,

it soon

used

The

programs.

graphically.

complexity,

in the art of modeling,

analysis

remainder

as

Programs

will be presented.

reliability

In the

can

Analysis

for the

the

input

SURE

degradable

quad

values

to identifiers.

language

model

is by way of example.

in figure

The

input

9 is:

LAMBDA = IE-4; MUI = 2.7E-4; SIGMA1 = 1.3E-3; MU2 = 2.7E-_; SIGMA2 = 1.3E-3; 1,2 2,3 2,4 4,5 5,6 5,7 7,8 The the

= = = = = = =

4*LAMBDA; 3*LAMBDA; ; 3*LAMBDA; 2*LAMBDA; ; LAMBDA;

first five statements processor

standard

failure

deviation

equate rate.

The

of the

SIGMA2 are the mean and The final seven statements fault

arrival

process

then

only the

defines

a transition

transition

is a fast

recovery

be given. For example, state 2 to state 4 with

following model

is an illustrative description

has

from process

identifiers from

state

exponential state then

stored

rate

7 to state the mean

the statement mean recovery interactive

been

two time

MU1

and

2 to state

LAMBDA

SIGMA1 4.

The

are

represents

the

mean

and

MU2

and

identifiers

standard deviation of the recovery time from state 5 to state 7. define the transitions of the model. If the transition is a slow

statement must from

next

recovery

The first identifier

2,4 time

session

8 with

rate

SURE

the

A (or 1 × 10-4/hour).

The

of the recovery

last If the time

above defines a transition deviation SIGMA1. The

to process

TRIADP1.

For example,

deviation

= MU1 and standard

in a file named

16

be provided.

and standard

using

prompt.

must

the user

above input

model. follows

The the

?

$

s_re

SURE V7.4 i? read

NASA Langley

Research

Center

triadpl

2: LAMBDA = 1E-6 TO* 3: MU = 2.7E-4; 4: SIGMA = 1.3E-3; 5: 1,2 = 3*LAMBDA; 6: 2,3 = 2*LAMBDA; 7:2,4 = ; 8:4,5 = 3*LAMBDA; 9:5,6 = 2*LAMBDA; 10:5,7 = ; 11:7,8 = LAMBDA; 12: TIME = i0;

1E-2 BY

0.05 13? run

SECS.

MODEL

= _riadpl.mod

FILE

LAMBDA

TO READ

MODEL

FILE

SURE

UPPERBOUND

LOWER.BOUND

V7.4

24 Jan

10:16:21

90

COMMENTS

1.77002e-14 3.12024e-12 1.66224e-09 1.51644e-06 1.26292e-03

1.68485e-14 3.00387e-12 1.61714e-09 1.45575e-06 1.23551e-03

l.O0000e-06 1.00000e-05 1.00000e-04 1.00000e-03 1.00000e-02

10;

RUN

#I



3 PATH(S) TO DEATH STATES Q(T) ACCURACY >= 14 DIGITS 0.633 SECS. CPU TIME UTILIZED 147 157 plo_ xylog 167 exi%

The

first

noted

statement

that

program the

_ is defined

specified

range.

the program

generated

by this

the

skipped

on

clear.

The

next first

perform

Statement to plot command.

logarithmic

In

READ command

_s a variable

to automatically

directs using

uses the

over

to input

a range

a sensitivity 12 defines

the

output

The

the

on the

the model

of values

description

in this file.

analysis

as a function

mission

time

graphics

XYLOG argument

SURE

the

to plot

be

SURE over

Statement

11 shows the

the

X and

15 graph

Y axes

scales. subsections, reading

following

All reserved

directs

of this parameter

Figure

more and

used

conventions

detail as will

is presented.

a reference be used

when

These

subsections

something

to facilitate

the

words

will

be

capitalized

in typewriter-style

17

can

is encountered

description

language: 1.

This

to be 10 hours.

device.

causes

file. It should

print.

of the

probably that SURE

be is not input

le+

00

le-

10

le - 20 le - 07

Figure

I

le - 06

11: SURE

I

I

le - 05 le - 04 Lambda

program's

18

I

le - 03

plot, of output

le - 02

2. Lowercasewordswhich are in italics indicate items which are to be replacedby something definedelsewhere. 3. Items enclosedin squarebrackets r] can be omitted. 4. Items enclosedin braces{ } can be omitted or repeatedas many times as desired. 5.2

Model-Definition

Models

are

program

defined

transitions

and

standard

5.2.1

fast

and

must

means

and

standard

all of the

transitions

slow transitions.

25,000.

numbers. The

termination

If the

limit

instead

may

integers

SURE

user

are multiple

transition

can

between

be

increased

and recompihng. are floating point

be entered notation

Constant user

If there

transition

must

say p

the

competing

mean

transitions

along

with

the

is used

on a line. "(*"

Comments

The

SURE

the

the MAXSTATE

simply

by

The transition numbers. The

for statement

indicates

0 and

may

beginning

program

changing

equate

MAXSTATE

Therefore,

be included

any

of a comment the

user

more

place and

for input

means and is used for

that

"*)"

than

one

blanks

are

indicates

the

by a line number

mark.

numbers numbers.

to identifiers.

Thereafter,

these

constant

For example,

also

be defined

in terms

of previously

GAMMA = IO*LAMBDA; the

the

rates, conditional Pascal REAL syntax

termination.

prompts

implementation

Definitions

of the

may

In general,

SURE

time,

supply

probabilities

LAMBDA = 0.001; RECOVER = 1E-4; Constants

The

the transition is fast. Otherwise, it distributed by the SURE program.

The

respective

mean

model.

deviations.

semicolon

by a question

used

time.

the

be positive

of a comment.

The

transition supply

This

The

may

allowed.

5.2.2

must

in the program deviations, etc.,

statement

distribution.

of the

Details

usually

followed

of the

user

numbers

constant standard

an arbitrary

the

Lexical state

have

deviation

conditional

these

between

can

a state,

limit,

by enumerating

with respect to the mission time, i.e. p < T, then Slow transitions are assumed to be exponentially

Fast

The

in SURE

distinguishes

is small is slow.

from

Syntax

syntax

is 19

defined

constants:

identifiers

may

be

name where

name

letter,

and

= expression; is a string

czpresswn

section

entitled

5.2.3

Variable

In order

of up

to eight

is an arbitrary

letters,

digits,

mathematical

and

underscores

expression

(_) beginning

as described

with

a

in a subsequent

"Expressions". Definition

to facilitate

for this

variable.

variable.

If the

parametric

The

SURE

system

analyses, system

is run

function will be made. 0.001 to 0.009:

The

a single

variable

will compute

in graphics

following

the

mode

(to

statement

may

system

be defined.

reliability

be described

defines

A range

is given

as a function

of this

later),

LAMBDA

then

a plot

as a variable

of this

with

range

LAMBD_ = 0.001 TO 0.009; Only

one

such

of points range

variable

over

can

this

range

be either

POINTS

may

be defined.

to be computed.

"geometric"

= 4, then

A special

the

The

method

or "arithmetic"

geometric

constant,

and

POINTS,

used

is best

to vary explained

defines the

the

variable

by example.

number over

this

Suppose

range

XV = 1 TO* 1000; would

use

XV values

of 1, 10, 100, and

1000,

while

the

arithmetic

An

asterisk

range

XV = I TO+ I000; would

use

geometric One

XV

values

range,

while

additional

BY "increment", adding

of 1, 333,

TO+ or simply

option the

667,

the

of POINTS specified

1000.

TO implies

is available--the

value

or multiplying

and

an arithmetic

BY option.

set

the

TO implies

a

range.

By following

is automatically

amount.

following

such

the

that

above

the

syntax

value

with

is varied

by

For example,

LAMBDA = IE-5 TO* IE-2 BY i0; sets POINTS equal 1E-2. The statement

to 4 and

the

values

of LAMBDA

used

values

of CX used

would

would

be 1E-5,

1E-4,

1E-3,

and

C1(= 3 TO+ 5 BY 1; sets

POINTS

equal

In general, vat where

the

to 3, and syntax

= expression

vat is a string

an arbitrary a + or *. The

the

T0[c]

expression

of up to eight

mathematical BY clause

be 3, 4, and

5.

is

expression is optional;

letters

[BY increment and

digits

as described if it is used, 2O

]; beginning

in the then

next

increment

with section

a letter,

expression

and

optional

the

is any arbitrary

is c is

expression.

5.2.4

Expressions

When specifyingtransition or holding time parametersin a statement, arbitrary functions of the constants and the variable may be used. The following operators may be used: +

addition

-

subtraction

*

multiplication

/

division

**

exponentiation

The following ARCSIN(X), ing in the

standard

Pascal

ARCCOS(X), expressions.

functions

ARCTAN(X),

The

following

are

may be used: SQRT(X). permissible

EXP(X):

Both

( ) and

LN(X),

SIN(X),

I] may be used

COS(X), for group-

expressions:

2E-4 1.2*EXP(-3*ALPHA); 7*ALPHA + 12*L; ALPHA*(I+L) + ALPHA**2; 2*L + (1/ILPBI)*[L + (1/ILPHI)];

5.2.5

Slow

Transition

Description

A slow transition

is completely

and

rate.

the

transition source,

where defining

source the

The

specified syntax

by

citing

the

source

state,

the

destination

state,

is as follows:

dest = rate;

is the

source

exponential

state, rate

dest is the

of the

destination

transition.

The

PERM = 1E-4; TRANSIENT = IO*PERM; 1,2 = 5*PERM; 1,9 = 5*(TRANSIENT 2,3 = 1E-6;

+ PEKM);

21

state, following

and

rate is any

are valid

SURE

valid

expression

statements:

5.2.6

Fast

To enter

Transition

a fast

Description

transition,

source,

dest

the

=

following

< mu,

sig

syntax

[, frac

is used:

]

>;

where mu

=

an expression

defining

the

conditional

mean

sig

=

an expression

defining

the

conditional

standard

.frac

=

an expression

defining

the

transition

and

source

and

eter fast

frac is optional. If omitted, the transition transition. All of the following are valid: 2,5

=

THETA 5,7 = 7,9 =

5.2.7

FAST

Often

when

dest define

;

studies,

one must

these fast transitions necessary to supply

parameters 1 to state

and

0.5>;

Transition

simplicity, it is still these state

source

= 1E-4;



4_ (_',1)-----_

(5',0) 5_

3_

,

(5,2)----::._ (5,a) [

< m,,_

> / < m_.s_2

>

:0/4-A

(4

= NW;

failed

to keep

*)

*)

can fail

(* a spare becomes ;

(* a spare BY NS*GAMMA;

(NW,NF,NS-I)

NF

Since

This

BY

(NF > O) AND (NS = O) (* no more TRANTO (I,0,0) BY ;

keep

*)

(3,0,NSI);

=

IF NW > 0 TRANTO (NW-I,NF+I,NS) IF

of working processors *) of failed active procssors of spares *)

*)

inactive

or that

assumptions. spares

fail

(i.e.

warm)

but

configuration.

The

for the

processors.

rate)

increase

active

to see the

model

advantage

in recovery

time

modeled.

number of spares ini_ially *) failure ra_e of active processors *) failure rate of spares *) mean time to replace with spare *) start, dev. of time to replace wi_h spare

53

its

In this

*)

due

MU_DEG = 6.3E-5; SIGMA_DEG = 1.74E-5; SPACE = (NW: 0..3, NF: 0..3, NWS: O..NSI, NFS: O..NSI);

(* (* (* (* (* (*

START = (3,0,NSI,O); PRG = NWS/(NWS+NFS);

mean $ime s%an. dev. number of number of number of number of

(* probabili%y

$o degrade _o simplex *) of %ime _o degrade %0 simplex working processors *) failed active processors *) working spares *) failed spares *)

of switching

(* processor failure *) IF NW > 0 TRANTO (NW-I,NF+I,NWS,NFS)

in a good

spare

(NF > O) AND (NWS+NFS > O) THEN (* reconfigure using a spare (* a good spare becomes active *) IF NWS > 0 TRANTD (NW+I,NF-I,NWS-1,NFS) BY ; (* a failed spare becomes ac%ive *) IF NFS > 0 TRANT0 (NW,NF,NWS,NFS-I) BY ; ENDIF; (NF > O) AND (NWS+NFS = O) (* no more TRANTO (I,0,0,0 BY ;

IF NWS > 0 TKANTO (NW,NF,NWS-I,NFS+I) DEATHIF

reconfiguration

is equal

variable

to the

PRG

probability spare

degrade

(* a spare BY NS*GAMMA;

fails

the

of switching

spare

> 0 and

occurs, current

is used

when

and

> 0 check

of good

this

in a good

Conversely, is zero,

NFS

probability

proportion

to calculate

of switching

is zero.

a good

spares,

*)

%o simplex

*)

*)

NF >= NW;

When spare

*)

BY NW*LAMBDA;

IF

IF

*)

probability.

spare

all of the

the probability for these

spares

When

is one,

and

spares

have

the

and

the

the

(_ersus

in the spares

probability

in a bad

prevent

spare

spares

all of the

failed,

of switching

two cases

in a good

to failed

a failed

system. are

good,

of switching

probability

spare

the

in a bad

of switching

is one.

generation

The

The

tests

in

NWS

of a transition

when

it is inappropriate.

9.4 If the

Degradable system

is designed

model.

Two

cannot

detect

aspect Instead,

aspects

with

since

is necessary to expand or undetectable:

off-line

faults

referred

it is used

we will just

with

of an off-line

all possible

is sometimes

"coverage"

Triad

the

diagnostics diagnostic

and

to as the

in so many

call it the state

Partial

"fraction space

Detection for the spares,

must

"coverage" ways

of detectable to keep

track

54

this

be considered:

(2) a diagnostic different

of Spare

requires

of the

must

diagnostic.

faults" of whether

to execute. We will avoid

people and

be included

in the

(1) a diagnostic time

by different

Failure

and is thus

assign

a fault

usually The

first

the

term

confusing.

it an identifer,

in a spare

K. It

is detectable

SPACE

The

=

second

the

(* (* (* (* (*

(NW: 0,.3, NF: 0..3, NWS: O..NSI, NDFS: O..NSI, NUFS: O..NSI);

NDFS

aspect

requires

state-space

thal.-a

variable

ntunber number number number number

rule

according

working processors *) failed active procssors *) working spares *) detectable failed spares *) undeteczable failed spares *)

of of of of of

be

added

to

generate

to

some

fast.

general

transitions

which

recovery

decrement

distribution:

IF NDFS > 0 (* "detectable" spare-failure is detected *) TRANTO (NW,NF,NWS,NDFS-1,NUFS) BY ;

No

such The

that

transition active

the

processor

state

whether

the

is generated

space

for

failure

TRANTO

is larger.

failure

"NUFS"

The

is detectable

that

The

the

of these PRW PRD PRU

The

exist:

processor

processor

are multipled

reconfiguration

possibilities faulty

rates

the

faulty

is replaced

are

= NWS/(NWS+NDFS+NUFS); = NDFS/(NWS+NDFS+NUFS); = NUFS/(NWS+NDFS+NUFS);

reconfiguration

rule

as in the

TRANT0

rule

previous must

be

example, altered

(* a spare fails *) BY K*NS*GAMMA; (* detectable fault BY (I-K)*NS*GAMMA; (* undetectable

active

a spare PRD,

same

failure

by K and

with

with PRW,

is the

spare

rule is now more

(1)

is replaced cases

rule

and

complicated processor

a spare

containing

containing

an

PRU,

respectively,

(* (* (*

prob. prob. prob.

than

is replaced

in the with

a detectable

undetectable defined

fault.

previous

model

NSI = 3; LAMBDA = IE-4; GAMMA = 1E-6; MU = 7.9E-5; SIGMA = 2.56E-5;

include

*) faul%

*)

example.

a working

spare,

fault

(3)

The

and

Three (2) the

the

probability

faulty of each

as follows:

working spare is used *) spare w/ detectable fault spare w/ undezeczable fault

is used *) is used *)

is:

(NF > O) AND (NWS+NDFS+NUF5 > O) THEN (* a spare becomes active IF NWS > 0 TRANTO (NW+I,NF-I,NWS-I,NDFS,NUF5) BY ; IF NDFS > 0 TR4NTO (NW,NF,NWS,NDFS-I,NUFS) BY ; IF NUFS > 0 TRANTO (NW,NF,NWS,NDFS,NUFS-I) BY ; ENDIF;

complete

to

(l-K).

IF

The

except

or not:

IF NWS > 0 THEN TRANTO (NW,NF,NWS-I,NDFS+I,NUFS) TRANTO (NW,NF,NWS-I,NDFS,NUFS+$) ENDIF;

Note

faults.

*)

is: (* (* (* (* (*

number of spares ini%ially *) failure raze of active processors *) failure rate of spares *) mean _ime to replace with spare *) start, dev. of time %o replace wizh spare

55

*)

MU_DEG = 6.3E-5; SIGMA_DEG = 1.74E-5;

(* mean time ¢o degrade ¢o simplex *) (* start, dev. of time to degrade ¢o simplex

K = 0.9;

(* fraction of faults that the spare off-line diagnostic can detec¢ *) (* mean time %o diagnose a failed spare *) (* s¢andard deviation of time _o diagnose *)

MU_SPD = 2.6E-3; SIGMA_SPD = 1.2E-3; SPACE

=

(NW: 0..3, NF: 0..3, NWS: O..NSI, NDFS: O..NSI, NUFS: O..NSI);

START

=

(3,0,NSI,O,O);

(* (* (* (* (*

IF NW > 0 (* TRANTO (NW-1,NF+I_NWS,NDFS,NUI_S) IF NWS > 0 THEN TRANT0 (NW,NF,NWS-1,NDFS+I,NUFS) TRANTO (NW,NF,NWS-1,NDFS,NU?S+I) ENDIF;

number number number number number

of of of of of

working processors *) failed active procssors *) working spares .) de$ec_able failed spares *) undeeectable failed spares *)

a processor can fail BY NW*LAMBDA; (*

*)

a spare fails *) BY K*NS*GAMMA; BY (1-K)*NS*GAMMA;

*)

(* detectable fault *) (* unde%ec_able fault *)

PRW = NWS/(NWS+NDFS+NUFS); (* prob. a working spare is reconfiguzed *) PRD= NDFS/(NWS+NDFS+NUFS); (* prob. a spare w/ de%. f. is reconfigured PRU = NUFS/(NWS+NDFS+NUFS); (* prob. a spare w/ unde¢ f. is reconfigured IF (NF > O) AND (NWS+NDFS+NUFS > O) THEN (* a spare becomes active *) IF NWS > 0 TRANTO (N_+I,NF-I,NWS-I,NDFS,NUFS) BY ; IF NDFS > 0 TRANTO (NW,NF,NWS,NDFS-I,NUFS) BY ; IF NUFS > 0 TRANT8 (NW,NF,NWS,NDFS,NUFS-I) BY ; ENDIF; IF

(NF > O) AND (NWS+NDFS+NUFS = O) (* no more TRANTD (1,0,0,0,0) BY ;

spares,

degrade

*) *)

to simplex

*)

IF NDFS > 0 (* "de%ec%able" spare-failure is de%ected *) TRANTO (NW,NF,NWS,NDFS-I,NUFS) BY ; DEATHIF

In

NF

this

the following discussed.

>= NW;

section, section,

systems systems

consisting consisting

of a single of multiple

56

reconfigurable sets

of these

triad

were

reconfigurable

modeled. triads

In are

10

Multiple

This

section

models,

starts

various

These

models

concludes

are

Two

In this spares

to simplex. because

then

generalized

a general

discussion

replace system

fails

fault

occurs

define

Since track

the

of the

represented while

they

of spares

model,

triads current

the

when

by a single are inactive, available.

Thus,

competing

of two

triads

pool

of spares.

faulty

which and

two faulty

processors. processor

of spares

faulty

to replace spares

number

the

of spares

to a simplex

of this

Thus,

working.

be represented

before

there

by a single

they

system.

state

Similarly,

full model

description

(*

TWO TRIADS WITH POOL OF SPARES *)

the

need

variable--NS,

INPUT N_SPARES; LAMBDA_P = IE-4; DELTAI = 3.6E3; DELTA2 = 6.3E3;

(* (* (* (*

SPACE = (NWI, NW2, NS);

(* Number of ::orking processors in _riad I *) (* Number of working processors in _riad 2 *) (* Number of spare processors *)

Number of spaces *) Failure ra_e of ac$ive processors *) Reconfigura_ion ra_e of triad I *) Reconfigura$ion ra_e of _riad 2 *)

57

by been

inactive.

triad

to

ASSIST

generating

spares

is:

(* Active processor failure *) IF NWl > 0 TRANTO NWI = NWI-I BY NWI*LAMBDA_P; IF NW2 > 0 TRANTO NW2 = NW2-1 BY NW2*LAMBDA_P;

The

is no

is:

START = (3, 3, N_SPARES);

has are

of each

SPACE = (NWi, NW2, N_SPARES); The

of

INPUT statement

constant

the

pool degrade

can happen

processors

fail while

configuration,

in a triad.

the

do not This

ASSIST in the

inde-

can be replaced

the faulty

do not

we will use

for the value

can

section

operate

When

processors

first

space

the

recoveries.

the

number

In later included.

Spares

consists

the

are

Finally,

has

initial

state state

failures

triad

that

variable---NW, the

multiple

of cold spares.

before

of processors

so their

triads.

with

studies,

degrade

two

than

a common

interactively

not

more

from

it is assumed

number

spare

operating

supply

the

a pool

and

to model

which

either

trade-off

sharing

introduced

Cold

in a triad the

user

do

of how

Pooled

to represent

will query

to model

continuing

performing

a constant

program model.

triads

or because

For this

are

processors

The

To facilitate

of two triads

a system

faulty

the

spare

exhausted.

with

out,

a second

an available

model concepts

we will model

but

runs

a simple

Triads

section

pendently

with

reconfiguration

with

10.1

Triads

the

to keep can

do not

be fail

number

(* Replace failed processor wi_h workin E spare *) IF (NWI < 3) AND (NS > O) TRANTO NWI = NWI+I, NS = NS-I BY FAST DELTA1; IF (NW2 < 3) AND (N$ > O) TRANTO NW2 = NW2÷I, NS = N$-I BY FAST DELTA2; DEATHIF NW1 < 2; DEATHIF NW2 < 2;

The

start

state

of working rules

the

NW1

TI_NTO

is (3, 3, N.SPARES)

processors

define

either

(* Two faults in triad I is sys%em failure *) (* Two faul%s in triad 2 is system failure *)

define

is conditioned

This

can the

SURE

either

has two different

potentially

keyword

conditionaJ

occur times

triads

triad

experiences processor

with

recovery

cannot

processor

of 1 spare

experiences

with

from

a full complement

The

first

is accomplished

faulty

no spares

have

is N_SPARES. This

the

recovery

rates

in section

5.2.7.

recoveries

Two

at the

the

pool

two or more

two

TRAICr0

by decrementing

failure.

The

next

two

a spare.

Note

that

this

take

a working

the

(i.e.

place.

The

result

processor

(i.e.

NWx

NS = NS -

coincident

used,

and

the

these

faults

Since

SURE

with

this

program

in section

1 and

1).

(i.e.

The

(NWl

was

developed


0--if

÷ 1 for triad

2) or (NW2

that

depending

upon

fails

number process

recovery

of reconfiguration system

the

arrival

or NW2

rules

= NWx

and

fault

which

are

not

in which

the

section.

Reducable

to

One

Triad The

model

only

1 triad.

system

given This

faulty

one

exception,

As

until

triad

be modified was

used

by replacing is removed

however.

and When

it can no longer

before,

this

model

The

state

space

must

This

is accomplished

maintained

can

strategy

reconfigures

the triad

above

in a state

its good

processors

is only the

that

be modified space

FTMP

processor

out-vote

by setting

in the

to describe

a faulty there

assumes

easily

the

faulty

NWx

the

= 0 when NT.

system with are triad

Although

58

are

[16].

triad

can

survive

If spares

are

available,

added the i.e.

If no spares

to the

spares

system until

cold--they

notion

that

a spare.

left,

processor,

spares

to indude

variable

one

a system

do

of whether

x is inactive.

pool.

not

second fail

a triad The

this is redundant--the

the

are available,

maintains

the

with

There the

fault while

is active

number number

is

faulty arrival.

inactive. or not.

of triads

is

of active

triads

can

space

be

determined

variable

SPACE

The

greatly

=

(NW1, NW2, NT, NS);

initial

state

DEATHIF Note

that

conflict also

the

the but

collapsed

into

Next,

OMEGAI OMEGA2

The

BY

arrival

FAST

first

the last

above i.e.

TRIADS

= = = = =

processors

in

triad

I *)

working active

processors triads *)

in

triad

2 *)

(*

Number

of

spare

3.6E3; 6.3E3; 5.6E3; 8.3E3; (NWI, NW2, NT,

The

(NW1 < 2)

this

new

NWx

DEATtIIF

is:

*)

statement

becomes:

the

are the

equal

same

to 0 when This

TKANTO

= O)

AND

x becomes

because

This

inactive. This

the

would

last

Note

clause triad

could

is never

be satisfied. and

OMEGA2

which

define

the

rate

at

are available:

Collapsing Collapsin

ra_e E rate

as in the previous

> O)

be wrong.

not included.

follows

can never OMEGA1

rules

triad

= 0) is also

model.

no spares

The

OR (NW2 < 2) ; would

(NW2

condition

constants

when

defines > 0).

when

The

INPUT N_SPARES; LAMBDAI = IE-4; LAMBDA2 = 1E-4; DELTA1 DELTA2 OMEGA1 OMEGA2

working

processors

state

space

of of

change

be altered.

by (NS

triad.

TWO

state

for each NWx

(NT

_riad triad

model.

triad

I 2

*) *)

However,

the

reconfiguration

x are

= NWx+I,

> I)

of of

TRANTO

NS

= NS-1

NWx

= O,

NS

= NS+2,

NT

= NT-I

OMEGAx;

rule

are available,

of this extra the

of

(* (*

rules

must

conditioned

SPACE

two

Thus,

Number Number

= 0) AND

Thus,

(NWx = 2) AND (NS BY FAST DELTAx; (NWx = 2) AND (NS

IF

(*

not

inclusion

Number

= 5.6E3; = 8.3E3;

fault

The

spares.

NW2--the description.

(* (*

of setting

it would

and

input

(NW2 = I);

(NW1

are collapsed

specification IF

strategy

we define

triads

NW1

ASSIST

(*

DEATHIF

condition

be added

which

statement

with

at

the

is (3, 3, 2, N.SPARES).

(NWI = I) OR

the

that

by looking

simplifies

reconfiguration The

rule

defines

NS = 0. Note

that

the

complete

WITH

by replacement

second

POOL

model OF

the

condition

with

collapsing (NT

a spare.

Thus,

of a triad

when

> 1) prevents

the

is:

SPARES

-->

I TRIAD

*)

(* (* (* (* (* (* (*

Number of spares Failure rate of Failure ra_e of Reconfiguration Reconfigura_ion Collapsing rate Collapsing rate

(*

Number

of

working

processors

in

triad

I *)

(* Number (* Number

of of

working active

processors triads *)

in

triad

2

59

*) active processors *) active processors *) rate of triad I *) rate of triad 2 *) of triad i *) of _riad 2 *)

*)

this

rule

is

no spares collapse

of

NS); START = (3,

(*

3,

2,

Number

(* Active (NW1 > O)

processor failure TRANTO NW1 = NWI-1

IF

(NW2

TKANTO

IF

(* Replace failed (NW1 = 2) AND (NS

processor > O) TRANTO

IF

(NW2

= 2)

AND

> O)

IF

(NW1 BY (NW2 BY

= 2) FAST = 2) FAST

AND (NS OMEGA1; AND (NS OMEGA2;

=

DEATHIF

(NWl

=

(NW2

10.3

Two

IF

The

O)

system

unlike the

necessary

condition (i.e.

model

SPACE

The

=

state in the

= NW2÷I,

not

= NS-1

BY

FAST

DELTA1;

BY

FAST

DELTA2;

1) TRANTO NW1 = O, NS = NS+2, NT _o one _riad only -- triad 2 *) > 1) TRANTO NW2 = O, NS = NS+2, NT _o one _riad only -- triad I*)

with

consists

Pooled

of two

this system

degrade

the faulty

space

two

triad

Cold

triads

which

requires

to one

variables

triad.

= NT-1 = NT-1

Spares

can

degrade

the throughput Instead,

into a simplex.

when

The

active

indicate

space

complete

(* Number (* Number (* Number (* Number (* Number

extra

to a simplex.

of two processors. no

more

spares

non-faulty

then

pool initially. reconfiguration

NCx

The

where

processor rules

are

6O

must

the

are

processor

active each

Therefore,

NC1 and

= 3; if triad

it is

configuration state

out of three)

current

which

is a satisfies

or an operational

NC2,

are added

configuration

to the

of each

x has already

triad.

degraded

to

number

of

is:

active processors working processors active processors working processors spare processors

3, 3, 3, N_SPARES)

the

variables,

space

a simplex.

whether 1 good

in the

state of of of of of

whether

(i.e.

of processors

processors,

into

to determine

two state

Thus,

= 1. The

spares

that

be degraded

state

number

is (3,

can

x is a failed

total

three

triads

it is impossible

of 1).

the

As expected,

NS

>

example,

= 1 for triad

NCx

NW2

spare *) NS = NS-1

1);

of the

(NCl, NWI, NC2, NW2, NS);

initial

processors models.

then

*)

pool.

out

x still has

a simplex,

working = NWI+I,

section

does

state

1 good

wi_h NW1

Triads

Otherwise

to indicate

If triad

=

degrades

NWx

NW2*LAMBDA2;

TRANTO

previous

each

to add

NWI*LAMBDAI;

BY

= NW2-1

in this

spares

or simplex.

state

OR

the

system

processors

*) BY

O) AND (NT (* Degrade O) AND (NT (* Degrade

=

system

to the

In this

the

I)

the system

is added

triad

(NS

modeled

Therefore, available,

NW2

Degradable

However,

spare

N_SPARES);

IF

>

of

in _riad in _riad in _riad in _riad *)

N_SPARES failure

rules

be altered.

I *) 1 *) 2 *) 2 *)

represents are the These

same

the

as in previous

rules for each

triad

IF (NWx < 3) AND (NCx BY FAST DELTAx; IF

(NWx < 3) BY FAST

AND 0 TRANTO NW1 IF NW2 > 0 TRANTO NW2

IF

=

replacement

processor)

describes

no

the

states

SPACE

IF

rule

when

is returned

Number of spares Failure ra_e of Failure rate of Reco_figuraZion Reconfiguration Reconfigura_ion Reconfiguration

IF

(NCx simplex

second

(* (* (* (* (* (* (*

IF

condition

the

= NS+I

=

1.

0.

The

O) TRANTO NWx (* Replace failed processor

failure *) = NWI-I BY NWI*LAMBDAI; = NW2-1 BY NW2*LAMBDA2;

(* Replace (NWI < 3) BY FAST (NW2 < 3) BY FAST

failed processor AND (NCl = 3) AND DELTAI; AND (NC2 = 3) AND DELTA2;

(* Degrade (NWI < 3) BY FAST (NW2 < 3) BY FAST

to simplex *) AND (NS = O) AND OMEGA1; AND (NS = O) AND OMEGA2;

wizh working spare *) (NS > O) TRANTO NWI = NWI+I,

NS

= NS-I

(NS > O) TRANTO

NS

= NS-I

NW2

= NW2+I,

(NCI=3)

TRANTO

NC1

= I, NWI

= i, NS = NS+I

(NC2=3)

TRANTO

NC2

= I, NW2

= i, NS = NS+I

61

of the

DEATHIF DEAITIF

2*NWl 2*NW2

All of the cannot

fail

0 TRANTO

optimistic

can be generahzed

if we make

the failure of a warm spare. to the model descriptions: LAMBDA_S

Triads

assumption

configuration.

active

62

and

be modified

SPARES

two

to describe

*)

*) active processors active processors active processors rate of triad I rate of triad 2 ra_e of _riad I ra_e of triad 2 processors

distinct

(2) a working

*) *) *) *) *) *) *)

in triad

1 *)

results: warm a system

(1) a spare

is

of two

NWI, NC2, NW2, NS,

(* (* (* (* (*

NWS); START

Number of working processors in triad I *) Number of active processors in triad 2 *) Number of working processors in triad 2 *) Total number of warm spares *) Number of working warm spares*)

= (3, 3, 3, 3, N_SPARES,

IF NWI IF NW2 IF NWS

> 0 TRANTO > 0 TRANTO > 0 TRANTO

NWI NW2 NWS

= NWI-I = NW2-1 = NWS-I

N_SPARES); BY NWI*LAMBDA1; BY NW2*LAMBDA2; BY NWS*LAMBDA_S;

(* (* (*

Active processor failure Active processor failure Warm spare failure *)

IF

(NS > O) AND (NWS > O) THEN (, Replace with IF (NWl < 3) AND (NCl = 3) TRANTO NW1 = NWI+I, NS = NS-I, NWS = NWS - i BY FAST (NWS/NS)*DELTAI; IF (NW2 < 3) AND (NC2 = 3) AND (NS > O) TRANTO NW2 = NW2+I, NS = NS-1, NWS = NWS BY FAST (NWS/NS)*DELTA2; ENDIF; IF

(NS > O) IF (NWI < TRANTO IF (NW2 < TRANTO ENDIF; IF

AND 3) NS 3) NS

DEATHIF DEATHIF

2*NW1 2*NW2

10.5

Multiple

This an

section

will

number work

a specific For

also

simplify

MULTIPLE

BY FAST

Triads

development

of a generalized

first

the

active

TRIADS

or

first into

model

be

investigate

by

POOL

with

spare

*)

*)

so

simplex

*)

So

simplex

*)

and

assuming

OF COLD

the

Cold

that

creating

can

be used

a general

ASSIST

Spares to model

specification

program

prompt

for

model.

this

which

fails

cold

spares

complete SPARES

63

is incapable

system

we have The

by

having

a system

Thus,

Pooled

description

accomplished

triads

a specific

spares.

configuration. WITH

will

of initial

to generate

we will

this

This

number

in order

a simplex

into

(* Degrade 8MEGAI; (* Degrade OMEGA2;

BY FAST

Non-degradable

of triads. any

simplicity,

either

brought

for

value

into

failed

spare

0 TRANTO NW[J] = NW[J]-I

IF

(* Number (* Number

*)

BY

NW[J]*LAMBDA;

(* Replace failed processor wi_h workinK spare *) (NW[J] < 3) AND (NS > O) TRANTO NW[J] =-NW[J]+I, NS = NS-1 BY FAST DELTA'J;

DEATHIF

NWEJ]

(*

< 2;

Two

faul_s

in a _riad

is system

*)

failure

ENDFOR;

The array state space the number of working the

variable NW contains a value for each triad representing a count processors in that triad. Similarly, the FOR loop (which terminates

ENDFOR statement)

(2) the

defines

replacement

conditions

for that

of failed triad

To accommodate catenation The

feature

expression

for triad

of triads

is unknown

this

section

10.6

When

section,

are added

1 triad Since

This

that

values

all triads

model

to the

spares

the

consists

throughpu!

failures

from

the

in that

pool,

results

triad_

and

(3) the

same

triad

of only one

the

processor.

the

using

This must

the

INPUT

by editing in

rates.

to allow can

of triads.

the non-faulty operate

words,

can still maintain

Spares

degradation

up and

In other

since the

of

of the models

Cold

system

rate

be done

the rest

con-

statement.

Unfortunately,

is specified

Pooled

is broken

that

the system

2, etc.

reconfiguration

will be modified

triads,

TRAFr0

in a reconfiguration

For simplicity,

with

faulty

for different

identifiers.

time.

the

rates

reconfiguration

H_TRIADS

to these

It is assumed triads,

spares

for triad

(i.e.

above

each

pool.

of mutiple

of the

Triads given

with

FOR loop

run

have

Degradable

are available,

the time

at SURE

spares

triad

processor

failure.

DELTA2,"

run

active

reconfiguration

within

until

the

expression

1, "FAST

them

the simple

with

configuration only

will assume

no more

performance

rate

is no way to assign

Multiple

In this sors

there

file or entering

in that

differing

BY FAST DELTA_J

DELTAI,"

output

with

(t)

in system

in the

number the

result

systems was used

triad

processors

that

"FAST statement),

for each

of at

with

although

its vital

procesdegraded the

initial

functions

with

remaining.

we are going

will be done

to be breaking

by adding

an array

up triads,

we need

state-space 64

variable

a way of deciding NP to keep

track

if a triad of the

is active. number

of

actzve be

processors

set

to zero

in a triad. for each

count

of the

number

used

to keep

track

the

number

the

specification The

(*

triad

of how

specification

it is broken

many

= (N_TRIADS

FOR

up.

are

in array

The

it is in

including

each

state

triad.

in operation.

Thus, by

for

array

in each

still

NP.

of three

triad

space

variable

The

state

This

variable

some

sense

it in the

initially

space

and

NFP

keeps

a

variable

NT

is

will

always

redundant.

SPACE

will

equal

However,

statement.

is:

WITH

POOL

OF WARM (* (* (* (* (*

START

value

active

triads

INPUT N_TRIADS; INPUT N_SPARES; LAMBDA = IE-4; DELTA1 = 3.6E3; DELTA2 = 5.1E3; =

the

TKANTO is simplified

TRIADS

SPACE

have

processors

entries

of the

will

when

of failed

of nonzero

MULTIPLE

This

SPARES

*)

Number of %riads iniZially *) Number of spares *) Failure ra%e of ac%ive processors *) Reconfiguration ra%e to switch in spare Reconfiguration ra%e %o break up a %riad

(NP: ARRAY[I..N_TRIADS] 0F 0..3, (* Number of processors NFP : ARRAY [I..N_TRIADS] Of 0..3, (* Num. failed active NS, (* Number of spare processors *) NT: O..N_TRIADS) ; (* Number of non-failed %riads *) OF 3, N_TRIADS

OF O, N_SPARES,

*) *)

per %riad procs/%riad

*) *)

N_TRIADS);

J = I, N_TRIADS; IF NP[J] > NFP[J] TRANT0 NFP[J] BY (NP[J]-NFP[J])*LAMBDA;

= NFP[J]+I (* Ac%ive processor

failure

IF NFP[J] > 0 THEN IF NS > 0 THEN TRANTO NFP[J] = NFP[J]-I, NS = NS-I BY FAST NFP[J]*DELTAI; (* Replace failed processor wi%h working spare

*)

*)

ELSE IF NT > I TRANTO NP[J]=O, NFP[J]=O, NS = NS + (NP[J]-NFP[J]), BY FAST DELTA2; (* Break up a failed triad when no spares available *) ENDIF; ENDIF; DEATHIF 2 * NFP[3] (* Two faul%s in

>= NP[J] AND NP[J] > O; a_n ac%ive _riad is sys%em

NT = NT-I

*)

failure

ENDFOR;

As

before,

they

are

repeated

for

When

there

processor. replaces then

the

all of the

that faulty

failed triad

and

TRANT0 each

triad.

is an

processor is broken

active with up

DEATHIF The

first

failed one and

statements TKANT0

processor

from its

the

pool

working

fi5

are statement

in a triad,

set

inside

of a FOR loop

defines the

failure

second

of spares.

If there

processors

are

put

of an

TKANT0

are

no

into

spares the

so that active

statement available,

spares

pool.

(This assumesthat the systemcan determinewhich processorhas failed with 100lasttriad in the system(i.e., NT (3,3,3,0,0,1,1,3)

->

(3,3,3,1,0,0,0,3) (0,3,3,0,0,0,1,2) (0,3,3,0,1,0,0,2)

-> -> ->

-> -> ->

10.7'

Multiple

The

model

given

(*

MULTIPLE

(0,3,3,0,0,0,2,2) (0,3,3,1,0,0,1,2) (0,0,3,0,0,0,2,2)

Degradable above

TRIADS

can

WITH

INPUT N_TRIADS; INPUT N_SPARES; LAMBDA_P = IE-4; LAMBDA_S = 1E-5; DELTA1 = 3.6E3; DELTA2 = 5.1E3;

be

Triads

easily

POOL

(* (* (* (* (* (*

START

=

IF NS

> NFS

FOR

OF 3, N_TRIADS NFS

= NFS+I

to include

spare

Warm

Spares

failures:

ini%ially *) *) active processors *) warm spare processors *) ra_e to swi%ch in spare *) ra%e %0 break up a %riad *)

(* Number of processors per %riad (* Num. failed active procs/triad spare processors *) failed spare processors *) non-failed _riads *)

OF O, N_SPARES, BY

-> -> ->

*)

Number of triads Number of spares Failure rate of Failure ra%e of Reconfiguration Reconfigura%ion

= (NP: AKRAY[I..N_TRIADS] OF 0..3, NFP: ARRAY[1..N_TRIADSJ Of 0..3, NS, (* Number of NFS, (* Number of NT: O..N_TRIADS); (* Number of

TRANTO

Pooled

SPARES

SPACE

(N_TRIADS

with

modified

OF WARM

(3,3,3,0,0,0,0,3) (0,3,3,0,1,0,2,2) (0,3,3,0,0,0,0,2)

O, N_TRIADS);

(NS-NFS)*LAMBDA_S;

(* Spare

failure

J = I, N_TRIADS; IF NP[J] > NFP[J] TRANTO NFP[J] BY (NP[JJ-NFP[J])*LAMBDA_P;

= NFP[J]+I (* At%ire

processor

failure

IF NFP[J] > 0 THEN IF NS > 0 THEN IF NS > NFS TRANTO NFP[J] = NFP[JJ-I, NS = NS-I BY FAST (I- (NFS/NS)),NFF [J] *DELTA1 ; (* Replace failed processor wi_h working spare IF NFS > 0 TRANTO NS = NS-I, NFS = NFS-I BY FAST (NFS/NS)*NFP[J]*DELTAI; (* Replace failed processor wi_h failed

66

spare

*)

*)

*)

*)

*) *)

ELSE IF NT > I TRANTO NP[J]=O, NFP[J]=O, NS = NP[J]-NFP[J], NT = NT-I BY FAST DELTA2; (* Break up a failed triad when no spares available *) ENDIF; ENDIF; DEATHIF 2 * NFP[J] >= NP[J] AND NP[J] > O; (* Two faults in an active triad is system failure *) ENDFOR;

The

additional

are

in the

the

placement

placed

state

spares

of this

inside

plied

space

pool.

of the

processor

working

spare,

replacement

This with

The

processor

two

first

with

of how

first

this

to having

the

replacement

existence

a failed

spare

of the

spares

failed

rate

multi-

replacement processor

The

on the

Note

incorrectly

failure

spares.

conditioned

failed

were

to define

of non-failed

spare,

many

TRANT0 statement.

statement

TRANT0 statements

defines

on the

track

by the

FOR loop--if

be equivalent

includes

a spare.

to keep

is defined

of the

it would

model

it is conditioned

of a failed

is needed

of spares

outside

F0R loop,

and

NFS,

failure

statement

by N_TRIADS.

a failed

variable,

The

of with

second

existence

a

defines of failed

spares.

10.8

Multiple

In the

preceding

example This two

when

cause

triads

state

examples

of a model

occurs

do not

Competing

(2,2,7)

multiple

containing multiple

system

which

(i.e.

states

faults

failure.

Recoveries

The

both

have

with

diagram

in figure

we encountered

processes

parts

at rate

active

leaving

of the

25 illustrates

first

processor

spares),

recovery

in different

faults--the

a faulty

pooled

multiple

accumulate

are accumulating

they

triads

with

this

_1 and

at the

system

same

which

concept:

the

first state.

together

Here we have

second

time.

the

a single

at rate

This

A2. In

is not system

failure since the two failures are in separately voted triads. There are two possible recoveries from this state--triad 1 recovers first then triad 2, or vice versa. Which case occurs depends on how long each recovery takes. In some systems, the recovery

process

also

ongoing

even

each

other

in the in the

specification, deviations

since for the

system.

system, the

the

minimum

specification

SURE

competing

recovery distributions half of the time triad of the

But the

of the

presence

of a third

when

take the

requiies

recovery

processes.

On average, first. The

competing parameter--the

recoveries

when

recovery

conditional

still

means

Consider

there

is another

processes

recoveries the

have

impacts

and

recovery

no effect

on

the

transition

conditional

standard

simple

case

where

the

two

half of the time triad 1 will recover first, and conditional mean recovery time is the mean times.

transition

67

longer

two

of competing

program

are identical. 2 will recover two

may

The

SURE

probability.

program This

is the

also

requires

probability

3X2

3X2 "__

_ Figure

25:

Model

With

...

__... Multiple

68

Competing

Recoveries

NN__FIRS

T 1

2)_2

Figure

that

this transition

state.

The

state

must In the

tributed,

sum

of the

equal

one.

and

the

5.2.7,

conditional the

exist..

of other

(2) the

peting

for

recovery

R1..FIRST

and

rates,

given

was the

unconditional

used

fast

recoveries,

the

problem

of the redundancy

possible

repairs

systems

both

rates

or in which

at the

same

a dis-

As discussed calculate

the

times

are affected

by

recoveries

in state

(2,2,7)

system.

(1) the system time,

leaving

given. recovery

management

are discussed:

triads

transitions.

of competing

behaves

this

exponentially

will automatically

exponential

actually

leaving

transitions

to be

these

program

transitions

fast

assumed

to specify

times

fast

for all of the were

SURE

recovery

First

one of the other

processes

How the system

three

Repaired

and

can be dif-

of the Many

always

two-triad

possibilities repairs

(3) two independent

triad repair

place.

model

two

case

the design

system

take

The

special

competing

upon

To illustrate,

processes

recovery

FAST keyword the

than

probabilities

the

accurately.

depends

1 first,

the

from

1 is Always

rather

with nonexponential

to model

example

above, SURE

rates

presence

transition

for this

For systems ficult

will be traversed

examples

in section

26: Triad

case

(1),

triad

transitions

no longer

R2_SECOND,

R1 and

R2.

1 alway_

This

may

occur

or may

depends

repaired

first,

is shown

simultaneously, not

on the

have

the

system,

the same

and

may

in figure recovery

distribution

26.

Although

transition as the

be determined

rates, noncom-

by analysis

or experimentation. The

case

(2) mode]

of both

triads

being

repaired 69

at the same

time

is shown

in figure

27.

2AI

__

2A2 °°*

• Figure

The

mean

and standard

27:

Both

deviation

Triads

third

recoveries

are

case,

two independent

truly

independent

Repaired

of the multiple

perimentally. This can be accomplished the time to recovery completion. The

are

and

recovery

by injecting repair not

at the

processes, competing

Same

transition

Time

must

two simultaneous is shown

faults

in figure

for resources,

be determined

the

and

28. Even transition

ex-

measuring if the

two

rates

will

still be different from the noncompeting rates competition. There is no way to analytically

because they are conditioned on "winning" the determine the conditional means and standard

deviations

distributions

distributions

from must

the

unconditional

be measured

recovery experimentally.

7O

in general;

therefore,

these

four

7_

2A1

__

2A2

( __

Figure

28: Two

2_

°°°

Independent,

71

Repair

Processes

11

Transient

Computer

systems

nent

faults.

time

and

Since

significant is that left

can

than

restarted, faults

system--if

tend

to occur

deceive

the

and

much returned

for a much

the

more

a solid

to operation.

would

depends

11.1

Transient

A transient system's transient

Case

upon

fault voters. fault:

may The

1: Reconfiguration

I s

careful

of these

Fault

Behavior

or

not

may

does

I el

errors graphs

2: Reconfiguration

I

system

fault

fault,

I

I

e2 e3

e4

which

are

illustrate

I

fault

may

be

removed,

to near-coincident

and

also

detectable the

I

e5 e6

Z

may

increasing designed of such

by

two

the

possible

operating effects

of a

-->

t

-->

t

I ...

en

.....

>1

occurs

I

I

I

I 72

a

faults

is transient

be repeatedly

vulnerable

have

occur

t
I

R

[= NP;

Vo_er

(* Processor failures *) IF NP > NFP TRANTO NFP = NFP+I

IF

that

example

Dependencies

voltage

are

Dependencies

system.

in this

Failure

regulators.

rate

failure

history are

Consider voltage

Sequence

experience

failure

dependencies

the

and

*)

of active processors *) of failed active processors *) of working voltage regulators *)

BY

3 working

processors,

2 regs.

*)

*)

(NP-NFP)*LAMBDA_P;

(* Failure of processor due to voltage surge *) (NR = O) AND (NP > NFP) TRANTO NFP _ NFP+i BY (NP-NFP)*LAMBDA_V;

(* Voltage regulator failures *) IF NR > 0 TRANTO NR = NR-I BY NR*LAMBDA_R; (* Reconfiguration *) IF NFP > 0 TRANTO (NP-I,NFP-I,NR)

Since bilities of the the

this will

one

system

be

model

above

could

failure

DEATHIF

contains

lumped

capture

condition

(NR=O)

only

together.

AND

BY

one

The the

;

statement,

DEATHIF

addition

of another

probability

of having

is reached: (2 * NFP

>= NP);

113

all of the DEATHIF both

system

statement

voltage

failure

proba-

placed

regulators

in front fail

before

13.2

Recovery

In this

system,

Rate

the

active

components.

space

variable

speed

(* RECOVERY

of the

This

NP,

the

BATE

=

START

recovery

is accomplished

number

process

is significantly

by making

of active

DEPENDENCIES

the

affected

recovery

rate

a function

of

of the state

processors.

*)

(NP: O..N_PROCS, NFP: 0..N_PR0CS);

(* Number (* Number

2 * NFP

>= NP;

(* Vo_er

(* Processor failures ") IF NP > NFP TRANTO NFP = NFP+I

of active of failed

*)

processors ,) active processors

defeated

*)

BY

*)

(NP-NFP)*LAMBDA_P;

(* Reconfiguration where rate is a function of NP *) IF NFP > 0 TRANTO (NP-1,NFP-I) BY ;

Models

provides the user with the in each of the operational

probabilities.

operational state models to model

final

number

= (N_PROCS,0);

DEATHIF

state

by the

(* Initial number of processors *) (* Permanen_ failure rate of processors

N_PROCS = 10; LAMBDA_P = 8E-3; SPACE

Dependencies

considerably environment

only during first

to calculate

phase

the initial

state

dif-

higher of space.

a specific

of the

model. (The second model usually differs from the first model is repeated for as many phases as there are in the mission.

114

during

mission.

phase, The

probabilities

in some

manner.)

The

SURE

for

the

death

are

usually

fast

transitions

usually

program

leaving

in the

operational

time for the

final

death

will

usually

always

The

the goes

machines

into

each

phase.

spend subsequent not

words,

the

crude

separation

of the

may

sometimes

bounds

small

of

separation state

obtained

mission

apart,

they

which

a landing

operates a triad of

phase

be

model

this

The

following

lasts

two-phased

the

cruise

and

(2) landing.

warm

spares.

For

simplicity,

the

cruise

phase

which

lasts

After During that

system

the

this

cruise

phase,

would

be

phase, the

needed

is designed

to

input

different

describes

models

a model

for

must

"turn

the

be

cruise

off"

created--one phase:

MU = 7.9E-5; SIGMA = 2.56E-5;

(* Mean time to replace with spare *) (* Start. dev. of time to replace with spare

*)

MU_DEG = 6.3E-5; SIGMA_DEG = 1.74E-5;

(* (*

*)

SPACE

(* Number (* Number (* Number

=

(NW: 0..3, NF: 0..3, NS: O..NSI);

Number of spares initially *) Failure rate of active processors Failure rate of spares *) Mission time *)

*)

Mean time to degrade so simplex *) S%an. dev. of time to degrade to simplex of working processors *) of failed active procssors of spares *)

*)

(3,0,NSI);

LIST=3; IF NW > 0 TRANTO IF

(* A processor (NW-I,NF+I,NS)

(NF > O) AND (NS > O) TBANTO (NW+I,NF-I,NS-I)

can

fail

*)

BY NW*LAMBDA; (* BY

;

115

A spare

becomes

active

*)

the

workload

phase. two

for

to perform

(* (* (* (*

START

will

two

degradation.

processing

mission,

ASSIST

and

3 minutes.

Therefore,

this

phases--(1)

During

and

additional

tolerated. during

failure.

sparing

which the

basic

of processors

spare

by

that

in two

NSI = 2; LAMBDA = IE-4; GAMMA = IE-6; TIME = 2.0;

=

the

of the

probabilities

in phased far

the

of their

crudeness

recovery

unacceptably

states

than

percentage the

with

these lower

to an excessive

bounds

be

as but

states

Fortunately,

operational final

(i.e.

of magnitude

phases

lead

just

probabilities_

states

a very

do

using

is so high cannot

for

typically

orders

states

state

together.

several

in

reconfigures

processes to

close

operational

death

recovery

Thus,

detection

reconfiguration order

on

phases

a small

a system

reconfiguration

In

very

the

the

correct.

system

on

bounds

are

on

as

in previous other

the

perfect

system the

in only

is implemented

assume

2 hours,

In

mathematically

system will

not

tight

recoveries.

states

Although

we have

usually

systems

bounds

as

lower

which

because

bounds.

result

Suppose

we

model

lower

not

and

probabilities

performing

state

be

are

and are

upper

them)

recovery

calculations.

upper bounds

The

operational

states

bounds

These

acceptable.

have

other

reports

states.

the

IF

(NF > O) AND (NS = O) TRANT0 (I,0,0) BY ;

IF NS > 0 TRANTO

(NW,NF,NS-1)

(* No more

spares,

(* A spare

fails

degrade

and

_o simplex

is deSecSed

*)

*)

BY NS*GAMMA;

DEAI'HIF NF >= NW;

The

ASSIST

program

generates

the following

SURE

model.

NSI = 2; LAMBDA = 1E-4; GAMMA = 1E-6; TIME = 2.0; MU = 7.9E-5; SIGMA = 2.56E-5; MU_DEG = 6.3E-5; SIGMA_DE = 1.74E-5; LIST = 3;

2(* 2(* 3(* 3(* 3(* 4(* 4(* 5(* S(* 5(* 6(* 7(* 7(* 8(*

3,0,2 3,0,2 2,1,2 2,1,2 2,1,2 3,0,1 3,0,1 2,1,1 2,1,1 2,1,1 3,0,0 2,1,0 2,1,0 1,0,0

*), *) *) *) *) *) *) *), *) *) *) *) *) *)

3(* 4(* 1(* 4(* 5(* 5(* 6(* 1(* 6(* 7(* 7(* 1(* 8(* 1(*

2,1,2 3,0,1 1,2,2 3,0,1 2,1,1 2,1,1 3,0,0 1,2,1 3,0,0 2,1,0 2,1,0 1,2,0 1,0,0 0,1,0

*) *) *) *) *) *) *) *) *) *) *) *) *) *)

= = = = = = = = = = = = = =

3*LAMBDA; 2*GAMMA; 2*LAMBDA; ; 2*GAMMA; 3*LAMBDA; 1*GAMMA; 2*LAMBDA; ; 1*GAMMA; 3*LAMBDA; 2*LAMBDA; ; I*LAMBDA;

(* NUMBER OF STATES IN MODEL = 8 *) (* NUMBER OF TRANSITIONS IN MODEL = 14 *) (* 4 DEATH STATES AGGREGATED STATES 1 - 1 *)

The deleting resulting

model the

for the

second

reconfiguration

phase

(call

transitions

it "phaz2.mod") and

changing

file is:

NSI = 2; LAMBDA = 1E-4; GAMMA = 1E-6; TIME = 0.05; LIST = 3; 2(*

3,0,2

*),

3(*

2,1,2

*)

= 3*LAMBDA;

116

is easily the

mission

created time

with to 0.05

an editor hours.

by The

2(* 3(*

3,0,2 2,1,2

*), *),

4(* 1(*

3,0,1 1,2,2

*) *)

= 2*GAMMA; = 2*LAMBDA;

3(* 4(* 4(* 5(*

2,1,2 3,0,1 3,0,1 2,1,1

*), *), *), *),

5(* 5(* 6(* 1(*

2,1,1 2,1,1 3,0,0 1,2,1

*) *) *) *)

= = = =

5(* 6(* 7(*

2,1,1 3,0,0 2,1,0

*), *), *),

7(* 7(* 1(*

2,1,0 2,1,0 1,2,0

*) *) *)

= 1*GAMMA; = 3*LAMBDA; = 2*LAMBDA;

8(*

1,0,0

*),

1(*

0,1,0

*)

= I*LAMBDA;

The

SURE

the LIST

= 3 option.

probabities

SURE

is then This

executed

causes

as well as the death

V7.2

1? readO 317

program

NASA

Langley

the

2*GAMMA; 3*LAMBDA; 1*GAMMA; 2*LAMBDA;

on the SURE

first model program

state

probabilities.

Research

Center

(stored

to output This

in file "phaz.mod"), all of the

is illustrated

operational

below:

phaz

run

MODEL

FILE

DEATHSTATE

SURE

= phaz.mod

LOWERBOUND

UPPERBOUND

COMMENTS .................................

................................

1 TOTAL

OPER-STATE 2 3 4 5 6 7 8

9.35692e-12

9.48468e-12

9.35692e-12

9.48468e-12

LOWERBOUND 9.99396e-01 0 O0000e+O0 6 02277e-04 0 O0000e+O0 i 80332e-07 0 O0000e+O0 3 57995e-ll

20 PATH(S) PROCESSED 0.617 SECS. CPU TIME 32? exit

V7.2

UPPERBOUND 9.99396e-01 1.53952e-06 6.03819e-04 1.43291e-09 1.81768e-07 5.59545e-13 3.63591e-ll

UTILIZED

117

11 Jan

90

13:56:49

RUN

#1

using state

The can

SURE

be

used

program

also

to initialize

"phaz.ini", above is:

i.e.

adds

creates

the ".ini"

INITIAL_PROBS( 1: ( 9.35692e-12, 2: ( 9.99396e-01, 3: ( O.O0000e+O0, 4: ( 6.02277e-04, S: ( O.O0000e+O0, 6: ( 1.80332e-07, 7: ( O.O0000e+O0, 8: ( 3.57995e-11,

a file

states to

the

for

the

containing next

file name.

these

phase.

The

probabilities

The

contents

SURE of this

in

a format

program file

names

generated

by

that the

the

file run

9.48468e-12) 9.99396e-01) 1.53952e-06) 6.03819e-04) 1.43291e-09) 1.81768e-07) 5.59545e-13) 3.63591e-11)

);

Next,

the

initialized states

SURE

using

the

in an equivalent

format

for the

SURE

program SURE manner

is executed

on

INITIAL_PROB$ to the

first

the

second

model.

command. model.

Note

The that

second ".ini"

The

state

probabilities

model

must

file output

number

is in the

program:

sure

SURE V7.2

NASA

17 readO

Langley

read

32: 33: 34: 35: 36: 37: 38: 39: 40: 41:

INITIAL_PROBS( 1: ( 9.35692e-12, 2: ( 9.99396e-01, 3: ( O.O0000e+O0, 4: ( 6.02277e-04, 5: ( O.O0000e+O0, 6: ( 1.80332e-07, 7: ( O.O0000e+O0, 8: ( 3.57995e-11, );

42?

run

1 TOTAL

phaz.ini

FILE

DEATHSTATE

Cen_er

phaz2

317

MODEL

Research

9.48468e-12), 9.99396e-01), 1.53982e-06), 6.03819e-04), 1.43291e-09), 1.81768e-07), 5.59545e-13), 3.63591e-11)

SURE V7.2

= phaz.ini

LOWERBOUND

UPPERBOUND

8.43564e-11

9. 98944e-II

8. 43564e- 11

9. 98944e- 11

COMMENTS

118

11JaJ_

90

13:58:i2

RUN

are

#1

its

correct

OPER-STATE

UPPER3OUND

LOWERBOUND

2 3 4 5 6 7 8

9 I 6 I I 3 3

9.99381e-01 1.49908e-05 6.02368e-04 9.03554e-09 1.80359e-07 2.70540e-12 3.57993e-ll

99381e-01 65304e-05 03910e-04 04918e-08 81795e-07 28658e-12 63589e-ll

9 PATH(S) PRUNED AT LEVEL 1.49540e-16 SUM OF PRUNED STATES PROBABILITY < 5.04017e-18 9 PATH(S) PROCESSED 0.417 SECS. CPU TIME 43?

14.2

UTILIZED

Non-Constant

In the

previous

for

each

the

same,

section,

of the but

Consider

Failure a two-phased

phases. some

a triad

Rates

A related

system

situation

was occurs

analyzed when

parameters,

such

as the

with

spares

that

experiences

x 10 -4 ,

"r=

10 -4

warm

failure

which

the

rates,

required

structure

change

different

from

failure

of the one

different

models

model

remains

phase

rates

to another.

for each

of the

phases: , phasel(fimin):

_=2

• phase

2 (2 hours):

• phase3

(3rain):

The

SURE

for the INPUT

The

same

parameter LAMBDA,

A=10 A=10

model values

GAMMA,

full SURE

-3 ,

"_=10 "_=10

can be used

using

the

SURE

-s -4

for all of the phases, INPUT command:

TIME;

model,

INPUT LAMBDA, GAMMA, NSI = 2; MU = 7.9E-5; SIGMA = 2.56E-5; MU_DEG = 6.3E-5; SIGMA_DE = 1.74E-5; LIST = 3; QTCALC = 1;

-4 ,

stored

in file "phase.mod,"is:

TIME;

119

and

the

user

can

be prompted

2(* 2(* 3(* 3(* 3(* 4(* 4(* 5(* 5(* S(* 6(* 7(* 7(* 8(*

3,0,2 3,0,2 2,1,2 2,1,2 2,1,2 3,0,1 3,0,1 2,1,1 2,1,1 2,1,1 3,0,0 2,1,0 2,1,0 1,0,0

The

0TCALC

numerical sions. The SURE

*), *), *), *), *), *), *), *), *), *), *), *), *), *),

1? readO

2,1,2 3,0,1 1,2,2 3,0,1 2,1,1 2,1,1 3,0,0 1,2,1 3,0,0 2,1,0 2,1,0 1,2,0 1,0,0 0,1,0

= 1 command

routines. interactive

V7.2

3(* 4(* 1(* 4(* S(* 5(* 6(* 1(* 6(* 7(* 7(* 1(* 8(* 1(*

NASA

*) *) *) *) *) *) *) *) *) *) *) *) *) *)

causes

= = = = = = = = = = = = = =

3*LAMBDA; 2*GIMMA; 2*LAMBDA; ; 2*GAMMA; 3*LAMBDA; I*GJLMMA; 2*LAMBDA; ; 1*GAMMA; 3*LAMBDA; 2*LAMBDA; ; I*LAMBDA;

SURE

the

This increased accuracy session follows: Langley

Research

program

is often

to use

necessary

more when

accurate analyzing

(but phased

Cen_er

phase

LAMBDA? GAMMA? TIME?

2e-4 le-4 .1

30? run MODEL

TIME

FILE

=

= phase.mod

1.000o-01,

DEATHSTATE

GAMMA

SURE

=

1.000o-04,

V7.2

LAMBDA

=

LOWERBOUND

UPPERBOb_D

COMMENTS

1.78562e-12

1.89600e-12



1.78562e-12

1.89600o-12



#1

slower) mis-

7 8

O.O0000e+O0 5.08706e-14

5.17358e-15 5.60442e-14



10 PATH(S) PRUNED AT LEVEL 4.75740e-20 SUM OF PRUNED STATES PROBABILITY < 6.11113e-20 Q(T) ACCURACY >= 14 DIGITS 10 PATH(S) PROCESSED 2.867 SECS. CPU TIME 31? readO phase LAMBDA? GAMMA? TIME?

UTILIZED

le-4 Ie-5 2.0

60?

read

61: 62: 63: 64: 65: 66: 67: 68: 69: 70:

INITIAL_PROBS( 1: ( 1.78562e-12 2: ( 9.99920e-01 3: ( O.O0000e+O0 4: ( 7.89960e-05 5: ( O.O0000e+O0 6: ( 2.67751e-09 7: ( O.O0000e+O0 8: ( 5.08706e-14 );

71?

0.07 run

MODEL

TIME

phase.ini

FILE

=

SECS.

89600e-12), 99920e-01), 98043e-07), 99941e-05), 14966e-10), 80076e-09), 17358e-15), 60442e-14)

MODEL

FILE

SURE

= phase.ini

2.000e+O0,

DEATHSTATE

TO READ

1 9 9 7 1 2 5 S

GAMMA

LOWERBOUND

=

1.000e-05,

LAMBDA

UPPERBOUND

TOTAL

0PER-STATE

=

1.13950e-li



1.11438e-ll

1.13950e-ll

Suggest Documents