context of closed-form functional representation of probabilities as utilized in the .... in the context of numerical methods instead of closed-form solutions is a ...
Computers Math. Applic. Vol. 23, No. 12, pp. 43-~7, 1992 Printed in Great Britain
0097-4943/92 $5.00 + O.00 Pergamon Press Ltd
C L O S E D - F O R M S O L U T I O N OF DECOMPOSABLE STOCHASTIC MODELS .]ON A . SJOGB.EN U.S. Army Avionics Research and Development Activity, Langley Research Center Hampton, VA 23665-5225, U.S.A.
(Received February 1991) A b s t r a c t - - M a r k o v a n d semi-Markov processes are increasingly being used in the modeling of complex reconfigurable systems (fault-tolerant computers). The estimation of the reliability (or some measure of performance) of the system reduces to solving the process for its state probabilities. Such a model may exhibit numerous states a n d complicated transition distributions, contributing to a n expensive a n d numerically delicate solution procedure. Thus, when a system exhibits a decomposition property, either structureaUy (autonomous subsystems), or behaviorally (component failure versus reconiiguration), it is desirable to exploit this decomposition in the reliability calculation. In interesting cases there can be failure states which arise from non-failure states of the subsystems. We present equations which allow the computation of failure probabilities of the total (combined) model without requiring a complete solution of the combined model. This material is presented within the context of closed-form functional representation of probabilities as utilized in the Symbolic Hierarchical A u t o m a t e d Reliability and Performance Evaluator (SHARPE) tool. T h e techniques adopted enable one to compute such probability functions for a much wider class of systems at a reduced computational cost. Several examples show how the m e t h o d is used, especially in enhancing the versatility of the S H A R P E tool.
1. I N T R O D U C T I O N Modern reliability modeling practice involves several techniques, including Fault-Tree analysis and (semi)-Markov processes. Fault-Trees enable one to evaluate the impact of certain dependencies within the entire system. For instance, the fact that the failure of a certain component results in the failure of higher-level component (sub-system) may be modeled. Markov and semiMarkov processes (chains) can depict more general types of dependency For example, consider a triplex system consisting of 3 identical components A1, A2,As, each with failure rate A. Then, system failure can be represented within the domain of Fault-Trees by a "2 out of 3 gate" as in Figure 1.1a, or by a set of AND and OR gates (Figure 1.1b). Failure may also be depicted by a Markov process (Figure 1.2). With the rates indicated, the probability of failure at time t is the probability that the process is in state F at t. Another "fault-tolerant architecture" is a triplex which operates by (instantly) detecting a first fault and then reconfiguring to a simplex system. This is accomplished by unplugging the defective component and a randomly selected "good" component. There is no way, using only the three components in a logic gate (Fault-Tree) arrangement, to represent the event of system failure. But it is easy to give the corresponding Markov process (Figure 1.2b). Thus the necessity for Markov models tends to come about when system reconfiguration is a characteristic feature. For mission-critical systems found in process-control, avionics, and so on, one is concerned with the reliability at some particular time (mission time). Given a Markov process that models a system, the unreliability or failure probability can be found by using a numerical differential equation solver [1]. Alternatively, one may wish to have this quantity in closed form as a function of time. Large models (with many states and transitions) arise naturally in the study of complex reconfigurable systems. The solution of such a model can be both expensive and time-consuming. However, when the model can be decomposed hierarchically into smaller models, the process of Typeset by .AA4~TEX 43
44
J.A. S $OGREN
/ R3
AI
AI A2 A3
(a) 2-of-3 gate
A2
A3
A2
f13
RI
(b) Implemented with And/.Or gates Figure 1.1.
(a) Triplex system
(b) Recontlgurable-to-sirnplex Figure 1.2.
solving the smaller models is generally less expensive than solving the large one. The desired reliability number associated with the large number can be computed from quantities derived in the solution of the smaller ones. Difficulties with a straightforward decomposition approach were discussed by P. Protzel [2]. A contribution toward a "standard" for an input description language for Markov models is made in [3]. The major purpose of this article is to increase our understanding of hierarchical modeling and to increase our capabilities for solving such models. Two equations, presented as (5.6) and (5.7), govern the method of solution by decomposition. In order to write down these decomposition relationships we must present the ChapmanKolmogorov equations somewhat differently than in previous literature. However, [4] gives a related treatment. These C-K equations involve quantities (probability density functions) which are used in a new integral equation, (5.6). The solution of the integral equation is a function that expresses part of the failure probability present in a "combined model" that results from two smaller models. The failure mode involved here (in the combined model) does not generally arise from the failure modes of the smaller models. Thus we have developed a technique for combining models accurately, even though there may be some interaction between them, leading to new failure modes. The obvious advantage of the closed-form framework is that once the solution is found (presumably at significant computational cost), the reliability is easily calculated for any desired value of time t. In addition, it is easier to find sensitivity functions (with respect to failure rates or other parameters). The Symbolic Hierarchical Automated Reliability and Performance Evaluator (SHARPE) program takes this closed-form approach [5]. The reliability functions encountered in SHARPE are the so-called "exponomial" functions and can be easily represented "symbolically." They consist of probability distribution functions of a particular algebraic form to be described shortly. The SHARPE modeling framework is amenable to the use of hierarchical modeling techniques. SHARPE was intended to promote hierarchical decomposition. Much of the information needed
Closed-form solution
45
in applying our decomposition method can be obtained from SHARPE, either directly as output from the program or after a modest amount of additional computation. The examples (in Section 7) that illustrate the use of our method are made much clearer by adhering to the closed-form (SHARPE) framework; the reader can observe functions arising in the solution process explicitly. For the remainder of the introduction we examine the SHARPE methodology, closed-form solution, and hierarchical modeling in greater detail. Section 2 is a review of the basic concepts of stochastic process theory needed for the finite-state semi-Markov processes that we deal with. Section 3 gives a form of the Chapman-Kolmogorov integral equations for a chain, in a manner that provides quantities necessary in the study of hierarchical decomposition. Section 4 presents the basic facts of the SHARPE solution method applied to certain types of chain. This is not to be interpreted as a literal description of code (the author is not a developer of SI-IARPE), but as background useful in judging what the program's capabilities are, and what enhancements might be desirable. Section 5 presents an introductory example which shows how simple models can be built up into larger ones, and how repair complicates the decomposition issue. The problem is stated, of how to compute probabilities in a "combined model" (which may not even be semi-Markov). An integral equation is given whose solution answers this question. Section 6 is a digression on constructing certain exponomial distributions, and Section 7 provides two examples that illustrate the power of our method.
Exponomial Distributions To construct these distributions in an algebraic manner, we take as base field R, the real numbers (an ideafization of computer floating-point numbers). The set of functions {e ~a, sin at, cosat,t} where ~ E R, generate a ring of functions, which is extended by linearity over R to form an algebra Exp. The subset of this algebra consisting of distribution is called Dexp, the
exponomial distributions. A function F E Dexp has the properties that (1) F(O) = O, (2) th__.mF(t) = 1, (3) F(t) is non-decreasing for t > 0. Condition (2) can be relaxed to lim F(t) < I for certain apphcations. In that case we say that F t--~oO
is defective, or incomplete, and has mass = 1 - lim F(t) at oo. It is true that, given a finite ~--*OO
state Markov process (chain) M, its unreliability function is a complete exponomial distribution provided that a certain technical condition holds. The condition is that every absorbing state is a failure state, and that whenever a state is exited, there is a non-zero probability that it will never be visited again. Hence the "operational" states are transient: call this the "transient state condition." If we weaken this condition to say merely that every failure state is an absorbing state, then the unreliability function is a defective exponomial distribution. This defines a larger class of functions, which we call Dexp +. The SHARPE program calculates such an unreliability function in closed form for any Markov process satisfying the "transient state" criterion. In addition, SHARPE can find the unreliability distribution of certain semi-Markov processes. The process must be acyclic and the transition functions should be in Dexp. In a semi-Markov process, transition rates from one state (the present state) to another (the receiving state) are not constant, but depend upon the time elapsed since the system entered the "present state." We will later show (in Section 4) how to use SHAR.PE to obtain more information about the chain (Markov or semi-Markov process) than is simply provided by the output of the program. For example, given a Markov process with cycles, SHARPE does not furnish the probability function (not a distribution) of a transient state. We show how to modify the chain, and then apply the C-K equations to find this. It is necessary only to solve one convolutional integral equation, not a system of equations, and to perform a few simple manipulations using output generated by SHAR.PE. The integral equation is generally easy to solve using the Laplace transform. This approach was first used by Lotka in the theory of industrial replacement, where similar renewal-type integral equations also arise.
46
J.A.
S JOGREN
See [6]. The rigorous foundations of this solution method were brought together in [7], using analytic techniques developed in [8].
Hierarchical Modeling As already indicated, hierarchical modeling methods are built into SHARPE. For instance, one way to construct a semi-Markov process is by means of state transition functions (distributions) given by the failure distribution of a (constant-rate) Markov chain. T h a t is, in order to know explicitly a transition function of the semi-Markov chain, the "low-level" constant-rate chain must be solved. One process is in a sense embedded in the other. For examples of this type see [9]. A "full model" could be constructed by expanding the states in the higher-level semi-Markov chain into several states of a Markov chain. In many cases this capability of solving the higher- and lower-level models separately results in less computation and greater numerical robustness. Another common class of decomposable systems consists of the Cartesian products. Given M1,M2 Markov chains, then N = M1 x Ms is Markov. Selecting a state A = ( A , B ) E N, A E M1, B E M2, one may modify N by removing all exit transitions from A to form N. Thus, A is now an absorbing state. Define PA to be the distribution function (possibly defective) corresponding to A. This function could be interpreted as "the probability that by time T we have simultaneously been in state A of M1 and state B of Ms." We give several examples of how this scenario can arise in practice. The method we present allows one to find PA without having to solve the large chain N. It is only necessary to obtain certain information about the smaller chains M1 and Ms, which can be found for example by using SHARPE. Then, another integral equation, (5.6), must be solved, leading to the desired probability function PA. This integral equation is similar to the Chapman-Kolmogorov forward equation (see [10, p. 458]), and may be solved by the Lotka-Feller method. This decomposition approach has several advantages. Work involved in solving an arbitrary chain with n states increases as the cube (n3). In the Markov case, this is essentially the work involved in finding the eigenvalues of an n x n matrix. Hence, where decomposition is possible, much less work is needed to find the desired probabilities. Furthermore, the Cartesian product of even a Markov chain M1 with St, a semi-Markov chain, need not be semi-Markov. In general, it is a stochastic process where transition rates depend on the time elapsed since entry into the state previous to the present state. Provided that one does not wish to deal with the theory of such general processes, the decomposition method is essentially the only viable approach. S H A R P E is a powerful tool for obtaining exponomial solutions of reliability models. The techniques suggested in this paper greatly expand the computational horizons of S H A R P E in directions consistent with the hierarchical modeling philosophy, and thus have more than theoretical value. But the techniques may also be used without recourse to SHARPE, and using them in the context of numerical methods instead of closed-form solutions is a promising possibility as well. 2. S T O C H A S T I C P R O C E S S E S A N D D I S T R I B U T I O N S We give a brief review of stochastic processes, with emphasis on the ones that are of greatest interest to us, namely Markov and semi-Markov processes (or chains). Although the term chain is sometimes used for discrete-time systems, we use the term to denote a continuous-time process that is either Markov or semi-Markov. The probabilistic definition of a Markov chain proceeds as follows. One starts with a space of outcomes, or sample space S. A continuous-time stochastic process is, for each t > 0, a function X(t). The domain of the function is the sample space, and each image X(t) a, where a E S, is an integer j from 1, . . . , N, identifying a state. So one may say that at time t, the process or outcome ~r results in state j. It is most common to classify outcomes into events, and to assign probabilities to the various events. For a finite sample space where all outcomes are equally probable, of course
P(E) -- # of ~r in E # ofer in S '
and the expression (which is a slight abuse of notation) P[X(t) = j] is the probability of the event consisting of all outcomes a such that vft~ c~ = j. The quantity P[X(t) =. j], called the
Closed-form solution
47
state probability for state j. One can define a finite Markov chain to be a finite-state stochastic process X such that, for any set of times to < tl < ... < tn < t, the conditional probability that X ( t ) = x given that X(tn) = Xn, X(tn-1) = X n - l , . . . , X ( t o ) = x0, where x, z 0 , . . . , a : n are certain states, is equal to the conditional probability that X(t) = z given that X(tn) = Zn. This is the characteristic memoryless property. This finite Markov chain has the additional property of time-homogeneity if the quantity pij(t) = P[X(u + t) =
jlX(u)
= i]
depends only on t and not on u, for all u > 0, i, j such that i ¢ j. As usual, PIE[F] means the probability of Z given F, or P[E f) F]/P[F]. Now, if one defines d =
p
j(t)l,=o,
this quantity may be interpreted as follows. The increase over a short interval dr, of the probability of being in state j, due to the original probability of being in i, is equal to P[X(t + dr) = j and X(t) = i]. From the definition of ,~i~, this equals )tijP[X(t) = i]. dr. By a suitable modification of what we have just done, we may obtain the definition and some properties of a semi-Markov chain. We use the idea of conditional probability density function, for which see [11]. Say that the process X enters state zj at time tj if X ( t j ) = xj, and if there is an ~ > 0 such that X ( t j - &) ~ xj, for all 0 < 8 < c. Also, if X(0) = z0, the process X entered x0 at t = 0 (x0 was the initial state). Given a set of times to < tl < • -- < tn < t, consider the conditional density function of X entering x at t given that X entered zn at tn, X entered xn_l at t n _ l , . . . , X entered z0 at to. If this is always equal to the density of X entering x at t given that X entered xn at tn, the process is said to be semi-Markov. We have for each pair i , j of states a density function
Gij (t, u) = density of X entering j at u + t
(2.1)
conditional on X entering i at u. The time-homogeneous case occurs when Gij is independent of u > 0 for all i, j, t > 0. We shall henceforth refer to a time-homogeneous Markov or semi-Markov process as a chain. Such a chain, since Markov implies semi-Markov, is characterized by the functions
(2.2)
Gij(t) = density of X entering j at t,
given that X entered i at time 0. It has been shown (see [4, p. 89]) that certain other sets of functions serve to characterize a chain. Consider the (possibly defective) distribution Fij = P[X enters j at some time r, 0 < r < t, and X does not enter any state at any time to, 0 < ~ < f i X entered i at 0]. In words, Fij is the probability, conditional on entering i at 0, of ending the sojourn in i by a jump to j before time t. In his 1964 study [12] of the C-K equations, Feller makes use of F/j. Another class of functions which determine a chain is referred to as the transition distributions Cij. Described in words, Cij(t) is the probability that X will jump to j (first entry) by time t, given that X entered i at 0 and assuming that Cij is the only transition out of state i. Thus, Cij is a distribution valid in the absence of competing transitions. The distributions {Cij }, over all j , are assumed to correspond to the independent events of jumping from i to the various states j . The functions {Cij } are what is supplied to SHARPE when it is desired to "solve" a chain (determine the time-dependent probability functions of its states). A relation between Fij and Cij will be given subsequently. For instance, when the chain is Markov, we have
Cij(t) = 1 - e -x',t
and Fij(t) = ~ X i ~ k
1- e
.
(2.3)
J.A. S JOOREN
48
T h e Cij functions can be given in various ways. For certain purposes, it is not necessary to define them completely but rather it is enough to give their mean and variance as distributions. This approach is used by the SURE [13] package to find upper and lower bounds on the reliability of a system whose reconfiguration times are not exponentially distributed. The S H A R P E package, on the other hand, expects to be provided with Cij as a function in the class Dexp. In other words, m
c,j (t)
=
(2.4)
where kv is a non-negative integer and ar and br are real or complex numbers. Cij should be a complete distribution function, in particular real-valued; from this it is not hard to show that the terms with non-real coefficients at, can be matched in pairs with indices r, r', such that kr -- kr,, a~ -- ~ , , br -- [~,, where the bar denotes complex conjugation. This property will be referred to as the "conjugacy condition." A typical expression would be
1-e-t+
( 2 ) [e-(1-i)t ]
(2.5)
or 1 - e - t - a e - t s i n t . Having made the requisite definitions, we introduce standard terminology relating to the classification of chains in order to simplify later exposition. DEFINITION 2.6. A chain is ergodic if, given that it is in state j at time t, i l k is another state, then there is a later time t~ such that P[X(tk) = k] > O. DEFINITION 2.7. A state k is absorbing if, given X ( t ) = k, then P[X(t') = j] = O, for all t' > t and j ~ k. Thus an absorbing state, once entered, can never be left. Similarly, a subset of the set o f states could form an absorbing subchain if once entered, it is never left. Clearly, a chain with an absorbing subchain that is not the whole chain cannot be ergodic. DEFINITION 2.8. A state k is transzent if there is a state j ~ k such that given X ( t ) = k, then for some t' > t, P[X(t') = j] > O, but given X ( t ) = j, then for ali t' > t, P[X(t') = k] = O. DEFINITION 2.9. A chain is irreducible if it has no absorbing subchain, other than itself. Consider an absorbing state A in a chain M. If M is not Markov, S H A R P E requires that M be acyclic, and in any case M must have the property that all of its states are either absorbing or transient. The distribution of time until A is reached, conditional upon A eventually being reached, and denoted by PA, is provided as output by SHARPE. The output appears in symbolic "exponomial" form. A sample input and output format for a constant-rate chain is shown in Figure 2.1. The class of exponomial distribution functions is a natural one for the study of chains. In fact, in the Markov case, the function PA as above is exponomial as can be seen from the differential theory of constant-rate chains. Let Q be the "infinitesimal generator" matrix, qij = )qj for i ¢ j, qii = - ~ Aij. Then, if i s a r o w vector of functions [ P 1 , . . . , P i , . . . , P,], where the n states j#i of the chain are numbered 1 , . . . , n and Pi(t) is the probability of being in state i P[X(t) = i] at time t, we have [1]
fi(t)
fi'(t) = / 3 ( t ) Q,/~(0)= P0.
(A1)
Here, fi0 is the vector of initial probabilities. Then,
fi(t) = floeQt, where we use the matrix exponential eQ' =
(A2) Putting O into Jordan normal form
i=0
Q = S J S -1, it follows from [14, p. 381] that .P(t) = fioSeJt S -1.
(A3)
Closed-form solution
49
bind lam .022 mu .004 end markov ty2 I 2 2*mu 2 3 mu+lam I 31am end
i I. end
cdf(ty2) end
CDF for system ty2:
ezp( O.O000e+O0 t) ezp(-2.eOOOe-02 t) exp(-3.0OOOe-02 ~)
1.O000e+O0 t ( 0 ) + -2.0000e+O0 t ( O )
1.0000e+00 t ( O )
+
mean:
4.3590e+01
variance:
1.7949e+03 Figure 2.1.
I f J = diag(J1, . . . , Jp), then e Jt = d i a g ( e ' h t , . . . , e l , t ) .
(A4)
If 0
Ji 0
Ai J
and mi × mi complex matrix, then
e JJ
FeA,t teA,t i eMt L
tm,-leXJ (A5) te'X,~ e x,~
0
From formulas A1-A5 it follows that any Pi(t) can be written ra p , - 1 j=l
k=0
where m is the number of distinct eigenvalues of Q, Aj is the jth distinct eigenvalue of Q and pj is the multiplicity of the factor (x - Aj) in the minimum polynomial of Q. Since Pi(t) must be a real function, the conjugacy condition must hold, and we have an exponomial function• If i is an absorbing state of a Markov chain, Pi must be a distribution function, but may be defective. If i is the only absorbing state, then P, is a complete distribution, under the assumption made above that all states are either absorbing or transient. 3. C H A P M A N - K O L M O G O R O V E Q U A T I O N S We present a form of the Chapman-Kolmogorov equations for a semi-Markov process which will be convenient for our purposes. These are a form of the backwards equations in that they give probabilities (and densities) by summing over all epochs (times of a jump), and all results of a jump out of a given state, for the first jump from that state. Similar equations are stated,
50
J . A . S JOGREN
with proof, in [4, p. 93]. We deal with state probability functions, and their "densities," which integrate to give the probability function. Thus, a density does not have to be the derivative of
distribution. We recall some of our semi-Markov terminology from Section 2. In this section, i, j, k are states; then Ci~(T) = probability that a jump from i to k would be made by time T, given that i was entered at time 0, in the absence of competing transitions. These are the transition distributions, and they are assumed to be independent and competing. (They are distributions of the independent events resulting in jumps to the different states.) This is the same as the definition from Section 2 provided that C gives a complete distribution. Recall that the unconditional transition function Fik(T) is the possibly defective distribution of a jump from i to k by time T, given that i was entered at time 0. The two distributions are related by f
Fib(T) = / ^ JU
T
t
e l k ( t ) H [ 1 - Cij(t)]dt. j#k
(3.1)
In words, the probability of leaving i for k by T is the integral of the density of jumping from i to k at t, times the probability of not having jumped to any other state by t. Next let Eik(t) = density of (first) entry time from i to k, given that i was entered at 0. This is the density corresponding to a possibly defective probability distribution. We have
Eik(T) = dFik + y ~ j#k Ekk(T) = ~
dF,j(r) . Ejk(T - v),
i ¢ k,
(3.2) ~oTdFkj(r)" E j k ( T - v).
In words, for the first equation, the density of first arrival in k is the density of jumping to k plus the density which results from jumping to a third state at a time r < T, followed by a subsequent first arrival in k at T. The density Gik(t) defined above in (2.2) will be used in forming state probability functions, and is necessary to compute probabilities of "combined states" in hierarchical, or Cartesian product, models. We recall that it is the density of entering k at t, given that you entered i at 0. This differs from Eik in that state k may previously have been visited (after leaving state i). Then, we have
Gkk(T) = Ekk(r) + Gik(T) = Eik(T) +
Ekk(r) Gkk(T - v) dr,
//
(3.3) Eik(r) Gkk(T - r) dr
The first expression is a Volterra integral equation of the second kind, for which [15] is a clear introductory source. It is the same type of equation discussed in Feller's article on renewal theory [7] and which was previously treated as part of the theory of industrial replacement in [2]. The equation is also pivotal in later studies of population and economic growth, as more recent references from [16, Chapter 2] indicate. The verbal description of the equation and formula above are now given. For the equation, "given that you start in k, the density of entering k at T is the density of a first entry, plus the density of having entered k at a previous time 1- and subsequently entering at T, summed over all v." For the formula, "starting in j, the density of arriving in k is the sum of the density of first arrival, plus that of a previous first arrival followed by a subsequent arrival from k to k." These quantities can be used to express the state probabilities. That is, Pik(T) is the probability of being in k at T, given that you entered i at 0. We let Sk = ~ Fkj be the holding time j;tk distribution in state k. Then, k(T) =
f
Pkk(T) = 1 -- Sk(T) +
-
Sk(T
-
d,-,
(3.4) Gkk(r)[1 -- Sk(T - r)] dr.
Closed-form solution
51
For the first formula, a description in words reads: "the probability of being in k equals the density of arriving in k, and subsequently not leaving k, integrated up to the present time." An alternative formulation of state probabilities, using only the Ei.i functions can be derived from the above equations. In fact, Pi~:(T) = ~0 T E/~(T) Pkk(T - r) dr,
(3.5) e
k(T) = 1 - & ( T )
+
Ekk(") e k k ( T -- ,')
Again, the advantages to the approach indicated by the above formulas can be summarized: (1) The method encompasses both Markov and semi-Markov chains, (2) Eik, i i~ k can be found by SHARPE, (3) after which only one integral equation need be solved to determine a probability function, (4) then Feller's method of solving the renewal equation can be applied; (5) the approach is well adapted for hierarchical modeling as will be seen. 4. T H E S H A R P E SOLUTION M E T H O D Acyclic Chains (Markov and Semi-Markov) The method adopted by SHARPE in this case is equivalent to an analysis of paths from "initial states" to absorbing states. An initial state can be defined as one that has a non-zero probability at time 0. That is, if i is the state, then the vector /~0 has a positive i th component. We discuss this by examining each path separately, as in a "depth-first search," whereas SHARPE is actually programmed to compute probabilities at states as they are successively reached in a "breadth-first" search. The difference is one of form. For the system to traverse a particular path in the (acyclic, directed) graph representing the semi-Markov process is an event, which is disjoint from the other events corresponding to the other paths. Therefore, to get the distribution of an absorbing state, the traversal distributions of all paths leading from some initial state must be added, weighted by their probabilities of occurrence. We must find a traversal distribution from a given path, and a probability. Suppose the initial state is i0 and the final state is ira. The path of concern er can be written io, il, . . . , ira. We define recursively B ' n ( T ) = 1,
B J ( T ) "-
foT F[~,G+,(t) B J + I ( T -
t) dr,
(4.1)
where j = 0 , . . . , m - 1. Then, define Da = B ° ( T ) . The derivative and the convolutions are performed "symbolically" by SHARPE, within the class of functions Exp. Let po = Pio,il " Pia,i2 "...Pi,,,_x,i,~ • Pio(O) • Here, Pi,,i,+l =
F/ . (t) dt, f0 °° I1 i$J-I-1
j=0,...,m-
1.
Then, P k ( T ) = ~'~p,,Do(T), where ~r runs over all paths ending in k. a Cyclic Chains (Markov) In the (cyclic) Markov case, SHARPE uses both matrix analysis and estimation in the transform domain to find the distribution of an absorbing state. There is no reason why this method could not be used to give probability functions at transient states, but at present, SHARPE does not do this. Such information may be useful, however, as the example of a phased mission points up. Here the "mission" proceeds in two phases, the second commencing at time T1. The models for the two phases are the same, but certain failure and reconfiguration distributions have changed
52
J . A . S JOGREN
due to a maintenance action [17, personal communication]. The initial state probabilities for the model of the second phase are given by the state probabilities of the first phase at time T1. We indicate how to use SHARPE to find this information, however. Recalling the infinitesimal generator matrix Q of Section 2, it is clear from (2.10) that its eigenvalues and their multiplicities are of great importance in finding the probability functions. Any real matrix has a Schur decomposition Q = U H U T,
(4.2)
where U T denotes the transpose of U, U is orthogonal (UU T = I). The n x n matrix H is to have a nearly upper triangular form. T h a t is, it is block upper triangular, with diagonal blocks either of size 1 x 1 or 2 x 2. In particular, H is an upper Hessenberg matrix: it is upper triangular except for possible non-zero entries on the diagonal i = j + 1 (just below the main diagonal). Then, the eigenvalues of H, and hence of Q are the 1 x 1 real scalars and the complex-conjugate pairs arising from the 2 x 2 blocks. In order to take Q to this form, one may first find G = LkLk-1...Lo •Q. Lo...Lk, where G is in upper Hessenberg form, and Li is a "Householder matrix." A Householder matrix represents a reflection through an (n-1)-dimensional hyperplane orthogonal to a certain vector ~ . T h e details of this algorithm can be found in [14, p. 222]. It is due to Wilkinson who exposited it in his book [18]. Now G has a Q - R decomposition G = W R , where W is orthogonal (essentially a product of rotations) and R is upper triangular. Another algorithm implicitly finds G ~ = R W . Using G ~ as the new Hessenberg matrix G, we repeat this process until the real Schur form is attained. Then, the eigenvalues (which have not changed through any of these transformations) may be read off. The Q - R method is compared with other "exponentiation" techniques in the well-known article [19]. Next, SHAR.PE must determine the coefficients aijk of formula (2.10). By transforming the differential equation in Formula A1 of Section 2, one obtains s P ( s ) - P0 = P(s) Q, o r / 5 ( s . I - Q) = Po. Each Pi(s) is of the form
j = l k=0
(s+)~J)k'
determining the flijk is equivalent to finding aijk. But for a particular choice of s, say ~1, we get /5((1) T - P0, T = (~, I - Q). Thus we have n equations for the n 2 unknowns {/~ijk}. Similarly, setting s - ~ , . . . ,~n, for suitably chosen values, will give enough equations to determine the coefficients we seek. We now indicate how to use the SHARPE approach to determine transient state probabilities. This will work for any Markov chain; if general (semi-Markov) transitions are allowed, the technique is only good for states (if any) that satisfy the following. "If all transitions out of the state of interest are removed, that which remains is a chain that is (1) pure Markov, possibly with cycles, or (2) acyclic semi-Markov." Call this condition Condition Q. It is rather remarkable that SHARPE, with some additional calculation, can treat certain semi-Markov chains with cycles. The computational techniques involved are amply illustrated by the examples at the end of the paper. Now the method is described in general terms. Given a (non-absorbing) state r, we are interested in P r ( T ) as a function. If there is a single initial state j, Pj(0) = 1, this is the same as Pit(T). But using SHARPE, for any state i # r, one may find Eir(T). This is done by describing the chain to SHARPE, giving the transition rates and distributions as usual, but omitting any transitions out of r. This makes r into an
Closed-form solution
53
absorbing state r'. We assign in the input to SHARPE, Pi(0) = 1, and all other initial state probabilities zero. Given that Condition Q holds, SHARPE can find the cumulative distribution function Hr,(t) of arrival into r', as well as the overall probability Pr, of reaching r'. Consider
Lit, (T) = Hr,(T) . Pr,. This is the unconditional distribution of entering r'. The derivative of Lit, with respect to time is just Eir(T), since arrival and first arrival are identical for an absorbing state. Thus SHARPE has found the functions Eir, i ~ r. These are then used by means of the second part of (3.2) to find Err(T). The second part of (3.5) is an integral equation for the unknown function Prr- Once this has been solved, we need only perform the convolution integration of (3.5), first equation, to obtain Pjr which was the desired state probability function. 5. D E C O M P O S I T I O N M E T H O D S Given a complex failure-repair-reconfiguration system, it is tempting to decompose it into independent subsystems, or ones that are nearly independent. Independence allows one to compute the probability of being in a given state (for each subsystem) by using the product formula. Since the computational cost of analyzing a system model increases geometrically with size, significant savings can be had if the system is decomposable in this manner. As an example, consider two triplex systems attached to a voter as shown in Figure 5.1. Both are required to be functioning for system viability. In each subsystem, Sa, Sb, the failure of two components causes a triplex failure. If the component failure probabilities are p~ and Pb, we have
Ps = Ps. + Psb - P s . "Psi,
(5.1)
Ps. = 3-p~(1 - pa) -t- p3, PSb = 3" p~(1 -- Pb) + p3.
(5.2)
where
Thus, we see that in forming system failure probability, we certainly do not need to consider separately all failure modes, such as "one unit in Sa has failed, together with 2 units in Sb." s1
s2
Fig-are 5.1. Independent voted triplexes.
On the other hand, in Figure 5.2. the same failure conditions apply for Sa and Sb, but their "failures" are not independent events. We say that unit B1 is "isolated" when al has failed (the voter has no access to it). To be isolated is as bad as failed. Thus when al and b~ are failed, the system has failed, since the voter can see neither bl nor b2. The "failure conditions" need to be explicitly analyzed. If we take the simplest Markov chain representation of Sa and Sb, we get Figure 5.3. At a given time t we again have ps.(t)
= 3. P2(t) -
2. Pg(t)
(5.3)
for the failure probability. But it is not clear how to obtain Ps(t) for the combined system. Figure 5.4a gives an "equivalent" Markov chain; its failure probability function is the same as for Figure 5.3. The corresponding chain for subsystem Sb is shown in Figure 5.4b. In the "combined" model (not shown), certainly when one of the subsystems is in a failed state, the system is failed. Thus (Fa, *) and (*, Fb) give system failure, where * is any non-failed state of the appropriate
54
J.A. SJOGREN
subsystem. But also for example (011, ioi) is a failed state. Due to independence of unit failures, this state's probability is Potl • Pioi = Pa(t) • Pb(t). Examination gives 6 of these states so we finally get Ps(t) = P s , ( t ) . (1 - Psb(t)) + (1 -- Ps,(t)) " Psb + 6" Pa(t)" Pb(t). (5.4) Therefore, the combined model is the Cartesian product Sa x Sb with certain transitions modified. For example, the definition of Cartesian product implies that (101, ioi) is a state with a transition of rate 2Aa to (F~, ioi) and a transition of rate 2)~b to (101, Fb). Next, all "combined states" satisfying the failure condition are made absorbing. Thus the transitions of rate 2Aa from (110, ioi) to (Fa, ioi) and rate 2Ab from (110, ioi) to (110, Fb) are deleted. This reliability problem, of coupled nodes and sensors, can be solved in two ways: first by forming and solving the combined Markov model in the manner we have just indicated, and second by solving each of the two models S~ and Sb, not only for their failure probability distributions, but also for their state probability functions. These functions are then combined in some way, similar to (5.4), to give the distribution of the entire system S. The situation becomes more involved when the components admit of repair. The Markov model for system S~ is then shown in Figure 5.5, and the model for Sb is similar. At time T, if we are in a failed state of S~ or of Sb, the system S has certainly failed. If we are in a state such as 011 in Sa and ioi in Sb, the system is failed as well. But we may well be in an "up" state, such as 111 in Sa and iii in Sb and still have to consider that we are failed. This is because at some previous time t < T, we may have been in 011 and ioi simultaneously which would have brought down the system. The combined model, called in, accurately reflects this state of affairs: the state (011, ioi) has been made absorbing, so the transitions (named after their numerical rate):
#a :(011, ioi) --, (111, ioi), Pb :(011, ioi) ---* (111, ioi)
(5.5)
do not exist. The problem is how to compute the failure contribution of combined states such as (011, ioi) without solving the combined model m. This is analogous to the non-repair situation, with the difference that we cannot simply use the expression P011(T) • Pioi(T) as we did there. The expression we seek could be expressed in words as "the probability that at some time prior to T, the Sa state was 011 and the Sb state was ioi." In the semi-Markov case it is not feasible to find these "combined state" probabilities by using a Cartesian product model. If M and N are semi-Markov chains, the Cartesian product M x N will generally not have the semi-Markov property. For example, Figure 5.6 depicts a two state Markov chain and a two state semi-Markov chain with hypoexponential distribution C(t) = 1 - 2 . e - t + e -2t. The combined model (Cartesian product) is then shown with distributions indicated. Since C(t) is not exponential and hence not memoryless, the density function of the transition (b, x) ---* (b, y) depends not only on the "local" time spent in state (b, x), but also on the entry time into (b, x), or the time spent in (a, x). This violates the semi-Markov property. For this reason we do not work explicitly with the "combined model." But we still consider ordered pairs of states, and say, informally, "the system is in state (A, X), where A is a state of M, and X is a state of N." We a consider failure condition given by a pair ( B , Y ) , B E M , Y E N. Let Z A B , x y ( T ) -- the probability that M has entered state B while N was in state Y, or N entered state Y while M was in state B, at a time t, 0 < t < T, given that M was in A at 0, and N was in X at 0. It should be helpful to look ahead to Figure 7.1 which gives a good illustration of this situation. In case A # B or X # Y, one can also interpret ZAB,X Y (T) as follows: the probability, given that M started in A and N started in X, that M has been in B simultaneous with N being in Y. The quantities are determined by means of two fundamental equations. The first is:
Z B B , y y ( T ) = ~oT (GBB(T)' P y y ( r ) + P B B ( r ) " G y y ( r ) ) [1 - Z B B , v y ( T -- r)] dr.
(5.6)
Closed-form solution
55
U
bZ
2
Figure 5.2. Dependent processor-node system.
0
F
Figure 5.3. Voted triplex system.
iii
x/11 x
011
] 01
oii
] }0
ioi
Fa
llo
Fb (b)
(~) Figure 5.4. Elaborated triplex models.
IJ a
011
101
110
Fa Figure 5.5. Repairable triplex model, subsystem a.
c(t~~) Elementary models
Combined model Figure 5.6.
In words, the right-hand expression is the integral over 7" of the density of entering into the "state" (B, Y), and subsequently never arriving again (to avoid counting arrivals twice). This is similar to a Chapman-Kolmogorov forward equation in that we integrate over densities of
56
J.A.
S JOGREN
the last jump into (B, Y). The quantities GBB, PBB, Gyy, PrY are found as in Section 3 from the separate models M and N. Then, (5.7) is an integral equation to be solved. Note that if we use a Laplace transform method, the expression GBB(r) • P r y ( r ) must be multiplied in the time domain, and then transformed, or else GBB(S) and/T'Yy(s) convolved before proceeding further. This is illustrated in the subsequent examples. Given ZBB,Yy, one can find ZAB,XY by integrations:
ZAB,xy(T) = Jo T (GAB(r)" Pxy(r) + PAB(r) • Gxy(r)) [1 -- ZBB,yy(T -- r)] dr.
(5.7)
The verbal interpretation of the right-hand expression is left to the reader. 6. D I S T R I B U T I O N S F R O M M E A N AND V A R I A N C E Modern fault-tolerant computers, as used in high-reliability applications such as aerospace and nuclear plant control, employ architectural features beyond simple majority voting of independent processors. Instead, faulty components may be switched off, and spares activated; the system is changed upon detection of a fault. A simple system with such dynamic reconfiguration is shown in Figure 6.1. This depicts the triplex degradable to a simplex mentioned in Section 2. Practice has generally borne out the constant failure rate assumption for electronic components during their active life span. But the "reconfiguration distribution" w(t) has been observed not to be exponential, as in [20]. See also [21]. This transition includes the time necessary for the system to detect the presence of single fault, isolate the two components (one good and one bad), and remove them from service.
(0
Figure 6.1. Reconllgurable triplex.
It has been shown in [13] that giving the mean M and variance V of w(t) is sufficient to determine Ps(T) to within a few percent, assuming that M is much smaller than the reciprocal of the largest failure rate in the system S. Here, T is the mission time. T h a t is, the system is assumed to fail slowly and reconfigure quickly. To check such reliability results, obtained by the SURE program package, one might use SHARPE on the same example. To do so would necessitate presenting w(t) in exponomial form. Our goal in the present section is simply to give a way of determining w(t) explicitly, knowing that it is a distribution with mean M and variance V. We utilize the method of Cox from his classic paper [22]. Three cases exhaust the possibilities. CASE 1. Suppose M 2 = V. Then, take w(t) = 1 - e -At, where A = 1/M. CASE 2. If M 2 > V, let k = [M2/V] be the greatest integer less than M2/V. Consider the linear chain in Figure 6.2, consisting of k stages with rate kA and a final stage of rate 7.
k~
kk
k~
kA, , ~
Figure 6.2. k stages.
Closed-formsolution
57
Since the random variable "time to failure" is the sum of the independent transition times of the stages, the mean and variance are additive (respect the summation). See [11, p. 192]. Thus,
M=k
+7=~+-, 1
V = k.
+
1
7 1
=
+
(6.1)
1
From this one obtains the formula
1
M-~/M2-(I+~)(M2-V) =
(1-1- ~;)
(6.2)
The practical way to get w(t) in closed form is to find k,A, 7 and enter a SHARPE file for the linear chain. SHARPE will then find the desired distribution w(t). Figure 6.3 shows the input and output formats. In a hierarchical fashion, SHARPE allows the cumulative distribution function of this chain to be used in a "higher" system, eliminating the need ever to write the exponomial form of w(t) explicitly. bind Zambda .4167 g~mma . 16667
end markov l i n e a r 0 1 4*lambda 1 2 4*lambda 2 3 4*lambda 3 4 4*lambda 4 5 g~mr.a end 0 1. end cdf (linear) end
CDF for system linear:
8.5749e-02 + 3.2582e-01 + 6.1957e-01 + 1.0000e+O0 + -1.5241e+00 + 5.2412e-01
t(3)
t(2) t(1) t(O) t(O) t(O)
exp(-1.6668e+O0 exp(-1.6668e+O0 exp(-1.6668e+O0 exp( O.O000e+O0 exp(-1.6667e-01 exp(-1.6668e+O0
t)
t) t) t) t) t)
mean: 8.3997e+00 variance : 3. 7438e+01
Figure 6.3. A complete explanation of SHARPE input and output formats should be found in [5]. Certain symbolic variable names such as "lambda" are bound to a numerical value. The system is described by type (Markov) and given a name (linear). The states and transitions, with rates, are given in the following lines; after an "end," state 0 is assigned initial probability 1. Then, the cumulative distribution function (cdf) is requested of SHARPE. The cdf then appears in the output, using mantissa and exponent notation to describe floating point numbers, and "exp" to denote the exponential function. Hence, the meaning of the first line of the output is 8.5749 x 10-2 tSe-1"666a~. Finally the mean and variance of the cdf are given. CASE 3. If M 2 < V, we take
w(t) to be hyperexponential with distribution =
1 -
-
qe
(8.3)
58
J.A. SJOOR~N
where p + q = 1. According to [11, p. 212],
M=P
+ q 2p
(6.4)
2q _ MS"
An effective iterative procedure to find p, q,/~, and A is to set p = q = .5, 1.1 a = -~-, 1
/z - 2M _--~_1 Then, let (Step 1) X = ( M s + V)/2 - q/A s. This should approximate p//~s. Then, set a new/J value (Step 2) equal to (M - q/X)/X. Multiplying out the following expression shows (Step 4) t h a t / ~ A ( M - 1/A)/(A - p) should equal p, so we take this value as our new p. Finally, (Step 5), take q - l - p , and begin again at (Step 1), repeating until the computed mean M and variance V are as close to the given values as needed. This method is used in the next section to construct a distribution. 7. E X A M P L E S As our first example we consider the two models I and II depicted in Figure 7.1. Model I is a transient-fanlt detection mechanism. In state A the mechanism is functioning normally; in state B transient faults are incorrectly diagnosed as being permanent. See [23, p. 20]. In state C, a rare kind of error causes spurious signals to be sent to external parts of the system, causing an overall crash. Ct
r
f'---"x
f'----'x
©
S
Model I (Transient error detection)
Model II (Error arrival and recovery) Figure 7.1.
Model II represents the arrival of, and recovery from, transient faults over the entire system. State X represents the active presence of a fault, and Y the disappearance (absence), of faults. A similar model could be used to depict the error-producing and benign phases of a single "intermittent" fault. Several other reliability estimation packages besides S H A R P E provide a capability for modeling the arrival and detection of permanent, transient, and intermittent faults to the system. See [24,25]. We wish to consider the system as being up when the two "subsystems" I and II are in states (A, X), (A, Y), or (B, Y), respectively. Whenever model I is in state C, the system is down, but also whenever I is in state B at the same time as model II is in state X, we must consider the system to have crashed, since a transient fault is present but is incorrectly diagnosed (as permanent). The "combined model," with states { 1 , . . . , 5 } is shown in Figure 7.2. The correspondence between the states of the combined model and the Cartesian product I x I I is indicated. The system failure states are absorbing. Note that there is no state corresponding to ( C , X ) : it is superfluous. Thus, to find the failure probability at time T, one may take
P4(T) + Ps(T).
(7.1)
59
Closed-form solution r
Q s© © State Correspondence: 1 ~A,X 2~A,Y 3~B,Y 4~B,X 5~C,Y
Figure 7.2. Combined model.
We are most interested in P4(T), the probability of failure due to being in "state" ( B , X ) at some time t < T. For simplicity, suppose that the system starts life with model I in state A and model II in state X. In the notation of Section 5, we see that P4(T) -- ZAB,xx(T).
(7.2)
The total failure probability can also be obtained from the solution of the individual models I and II. The remaining part to be considered is for state C to be entered while model II is in state Y. Since C is absorbing, we know that the density of entering C in model I is GAG = EAC, and we obtain
fo r EAC(r) " P y ( r ) dr.
(7.3)
Then, adding expressions (7.2) and (7.3) givesthe total failureprobability, and should he equal to the distribution obtained from considering the absorbing states 3 and 5 of the combined model. In finding Z A S , x x ( T ) as a closed-form exponomial function, we will need to know GBB, PBB, GAB, PAB, G x x , and P x x , as indicated by Equations (5.6) and (5.7). We set coefficients in system I as ~ .3,
/~= .5, •=.1. We have dFBa(t) = .5e- st, and since there is only one transition into B, it also follows that
EAB = dFaB = .3e -'st. Thus, by (3.2),
EBB(T) = ~0 T dFBA(r) . EAB(T -- r ) d r . Transforming according to the construction in [7] gives .15
~BB~') = (, + .3)(s + .6)" By (3.3) we know that
/~BB .15 GTBB(S) = 1 -- J~SB = S2 + .9S + .03"
(7.4)
J . A . SJOfiREN
60
Applying (3.3) also yields OBB(s) = [~AB + E'AB "GBB
. 3 s + .18 s ~ + . 9 s +.03"
=
Next, note that SB is the "reliability" function of state B, given that the state was B at T = 0. Thus, SB(t) = e - 6 ' , so by (3.4), we obtain s + .3 PAB (s) -- ,~ + .9s + .03"
From model II, we require G x x and P x x . We take r = 2 and s = 3 which are of course intended to have didactic value if not realism. In a manner similar to that made in the computation for model I is obtained
P xx(,) = Gxx(s)
6
(, + 3)(, + 2)' 6
=
(7.5)
+ 5s"
Then, we have from (3.5), Pxx(s)=
1 s+2
(1+
6)
s+3
s2-~ 5s
- s~ + hs"
Now, define HB(t) = G B B ( t ) . P x x ( t ) + PBB(t)" G x x ( t ) . We are in fact interested in agB(s). To find this, one must invert the transforms GnB (s), P x x (s) and so on, perform multiplication and addition in the time domain, and then re-transform. A numerical mathematics package is helpful here. By this means one obtains 3
E ~i ,3-i ~IB(8)----.
i=1
E Vi$ - i i----1
where g = [6.1500
34.6350
13.0995],
~'--[1.0000
11.8000
39.3700
26.9040
0.8859].
By (5.6) we have
ZBB,XX(') = fiB(')" I 1 - - 2BB,XX ] . Writing ZBB,XX in rational form as ZtB°P(s)/Zb°t(s) yields ztop
B
= /-_/-top "'B , z b o t = s . (H~°p -4- Hb°t),
(7.6)
6
giving ZB = ~ wis 6-i. Here, u~ = [1.0000 i----1
11.8000
45.5200
we require HA(t) = GAB(t) • P x x ( t ) + PAB(t) • a x x ( t ) . that 4
A = ~, ~ U",S 4-i , HA
HtOp
i----1
61.5390
0.00]. Next
Setting/-)A(s) = "ut°P/~rb°t A ~ " A , we find 5
~-~ Vi S 5-i ,
= ~ /=1
where ~=[0.3000 ~'=[1.0000
13.9854
2.8500 10.7010 13.8294], 11.8000 39.3700 26.9040 0.8859].
(7.7)
Closed-form solution
61
Next, we apply (5.7) and obtain using a similar notation ztop(8) = r..rtop/~bot ~top h IJ A k~B -- 8"~,B 1, 7bot • = , . , , T_irbot a
(7.8)
The numerator of ZA has degree 7 in s, and the denominator has degree 9, but they have 5 roots in common. When the corresponding factors have been canceled, what remains is 3
E ai8 -I
2AB,XX(8)- i =5l bi sS-i i=1
where ~ = [1.0000
6.9000
17.7300],
b = [1.0000
9.2000
21.6000
5.3790
0.0000].
Now, ZAB,XX(S) has distinct poles, and its partial fraction expansion corresponds to the explicit exponomial form of ZAB,XX(t). Writing simply 2 and Z, we have 4 2(s)
:
" i = l S ~- pi '
where we write ~ and ff in column form
4
~=
~=
0.98884551031790 -0.05848988319612 0.08378848442687 -1.01414411154865
0 -5.35172855471732 -3.56645190997164 -0.28181953531104
Then, of course Z(t) = ~ aie p,t. i=l Consider the SHARPE input file for the combined model, together with the output information about node 4 in Figure 7.3. The distribution given is conditional upon entering the absorbing state 4. When multiplied by the given entrance probability, this gives the unconditional distribution, which is seen to agree with ZAB,XX(t) to 9 digits of accuracy. In the second example we depict several physical components and their failure modes hierarchically. New features which were not present in the first example include (1) determination of a simple exponential form of a distribution given its mean and variance, (2) semi-Markov transitions, (3) double poles in certain transition transforms, (4) trigonometric solutions, (5) neither coincident state is an initial state. The example is a simplification of one aspect of the Integrated Airframe/Propulsion Control System Architecture (IAPSA). See [26, p. 71]. The nodes (sensor-processor pairs) form a reeonfigurable duplex. The failure rate of each component is ¢ = .003, the resulting model is shown in Figure 7.4. Here C and E are failure states, but as in the previous example we are concerned with failure modes arising from coincident conditions on separate structural levels. The transition function c(t) represents the distribution of system reconfiguration time. It is the distribution of the random variable which is the sum of the times taken by the duplex operating system to detect an error, isolate the faulty unit, and configure to a simplex system. Experimentation with
62
J.A. S JOGREN
bind alph .S bet .5 lam . I r 2. s 3. end
markov death 1 4 alph 12r 21s 2 3 alph 3 2 bet 34s
351am end
I I. end
cdf (death, 4) end
information about system death node 4 probability of entering node:
9.88845510e-01
conditional CDF for time of reaching this absorbing state 1.00000000e+00 t ( 0 ) + -I.02558398e+00 t ( 0 ) + 8.47336450e-02 t ( 0 ) + -5.91496676e-02 t ( 0 )
exp( 0.00000000e+00 exp(-2.81819538e-01 exp(-3.56645191e+00 exp(-S.35172855e+00
t) ~) t) ~)
mean: 3.62644539e+00 variance: 1.26658132e+01 Figure 7.3.
Q ¢
Figure 7.4. Duplex Node System. faults injected into the system has yielded a mean time of .01 sec with a variance of .001 sec 2. According to Section 6, a hyper-exponomial distribution can be used for c(t). A S H A R P E model, and o u t p u t realizing this are given in Figure 7.5, model "reconfig." The other hierarchical component of the system is a dual partition network to which the nodes are attached. For simplicity we assume that either of two states can hold: both partitions are functioning, or else one partition is functioning and the other is undergoing repair (by configuring in a spare communication link). The "degraded" network is fully functional when the "node" system I is in either a stable duplex or simplex mode. However, the overall system cannot tolerate a simultaneous partition repair and duplex-to-simplex reconfiguration. The two-state model in Figure 7.6 illustrates the communication network, model II.
Closed-form solution
63
bind p . 025 mu7. lam 150 end
markov reconfig I 3mu 231am end
lp 2 1.-p end
cdf (reconfig) end
CDF f o r system r e c o n f i g : l.O0000000e+O0 t ( O ) exp( O.O0000000e+O0 t ) + -2.50000000e-02 ¢ ( O ) exp(-7.0OOOOOOOe+O0 t) + -9.TSOOOOOOe-01 t ( O ) exp(-1.SOOOOOOOe+02 t)
1.00714286e-02 variance: 1.00564116e-03
mean:
Figure 7.5. Ct
©
b(t)
Figure 7.6. Repairable network. T h e partition failure rate is taken as a constant a = .01; due to a rather complete understanding of the link repair mechanism, the repair distribution b(t) is precisely known and is shown in Figure 7.7 (model net-repair). As indicated in the S H A R P E output, the mean and variance of repair are roughly .02 sec and .0003 sec 2, respectively. Note the factor of t in one of the terms
of b(t). In the notation of the last example we are concerned with the function ZAs,xy(T). This is the probability given that model I begins (at t = 0) in A and model II begins in X, that before the time t = T model I has been in B simultaneous with model II being in state Y. To this end one must find, for model I, the quantities GBB, PBB, GAS, and PAB. For model II, one seeks
Gyy, Pyy, Gxy, and PxY. Since B is not a recurrent state, we immediately obtain Gns = 0 and thus PnB(T) = 1-SB(T) from (3.4), second equation. A calculation of the distributions FBc and FBD yields SB = FBC + FBD = 1 - p e - ( ~ + ~ ' ) ' - qe - ( ~ + x ) t .
Since GAB -'- EAB = dFAB = 2¢e -2¢t, we also have
PAB(T) = ~0 T 2¢e-2¢r[1 -- S ( T - v)] dr. We could also obtain PAB directly from the SHARPE model in Figure 7.8. The quantity PAA is obviously e -2@t, and the S H A R P E output gives the total failure distribution, that is PAc + PAD, SO PAB is 1 minus the sum of these two quantities. Converting to the s-domain, one has GAS(S) = 2@/(s + 2¢) and
PAB(s) =
p
+
q]
s + 4,+,k
"
64
J.A. SJOGREN
bind p .02 mu 16. lam 100. end markov net-repair 1 2mu 341am 4 5 lain end 1p 3 1.-p end cdf ( n e t - r e p a i r ) end CDF f o r system n e t - r e p a i r :
+
1.00000000e+O0 ¢ ( O ) -9.80000000e+01 t ( 1 )
exp( O.O0000000e+O0 t ) exp(-1.0OOOOOOOe+02 t )
+ -9.80000000e-01 t ( O )
exp(-1.0OOOOOOOe+02 t )
+ -2.00000000e-02 t ( O )
exp(-1.6OOOOOOOe+01 t )
mean: 2. 08500000e-02 variance: 3.09527500e-04 Figure 7.7.
bind phi .003 pC .025
qC .975 mu 7. la.= 150. end semimark nodes 1 2 exp(2*phi)
2 3 exp(phi) 2 4 genT 1,0,OT -pC,O,-muT -qC,O,-lam end 11. end cdf(nodes) end CDF for system nodes: l.O0000000e+O0 t ( O ) + -1.00006044e+00 t ( O ) + 2.14377590e-05 t ( O ) + 3.90007800e-05 t ( O )
exp( O.O0000000e+O0 exp(-6.0OOOOOOOe-03 exp(-7.003OOOOOe+O0 exp(-1.5OOO3OOOe+02
Figure7.8. Notehowinthe modd, thesemi-M~kovtransit~nisente~d~ agener~ ~stdbution.
t) ¢) t) t)
Closed-form solution
._ Z y y
Also, eyy ~top yy
~bot yy
-.}- r_cyy . V y y
65
from (2.3). In fact Gyy(8)
~top "- U y y / v yl~bot y
where
= 10s x (0.00000320s 2 % 0.09864000s + 1.6000) =
i05 x (O.O0001000s4 + 0.0021601000s 3 + 0.13202156800s 2 + 1.60033360000s + 0).
Next, from (3.3), second equation, we have
Gxy(T) = E,xy(T) + f [ ~Exy(v)G y y ( T - v) dr,
so
.01
r ] t~,bot ryy ~ "I- g%_top ~.~yy ( ~ r X y ( 8 ) - - 8 -t" .01 ! 7~_-6-o~L "~YY
Next, from (3.4), letting L, denote Laplace transform IOOV
Pxy(s) = Gxy(s). L,[I - b(t)] = Gxy(x).
V
(s ÷ i00) 2 ~" - s + 100
+
1-V
Here, V = .98 as indicated in Figure 7.7 (model repair-net). We begin computing the quantities that govern the coincident states. Firstly, since GBB(t) = 0 we have from (5.6) ZBB,yy(T)
=
tI~y(O
. [1 - Z s s , Y Y ( O ]
dr,
where HBy(t) = PsB(t). Gyy(t). Solving yields ZBB,YY
~--
HBt°P Y 8( ~rbot
fstop~ " ~,'" B Y -~- "" B Y )
After some simplification, one arrives at 10
2B,,rr(s)
=
s + j----1
-~= lO-Sx
~*=
4.1360951 2.6757974 2.6757974 + 0.1175371 -6.5008554 0.2214496 0.2214496 ÷ 0.0217117 -3.5689825
0 -2.500031 + 0.001565/ -2.500031 0.001565/ -1.660038 -1.500127 - 1.070078 + 0.009775/ - 1.070078 - 0.009775/ -0.230031 -0.070032
1220.9445298i 1220.9445298i
-
Ii.7051909i 11.7051909i
Manipulation of(5.7) yields the formulas ~top AB,XY
~bot
AB,XY
~Ttop ---- " ' A X
~rbot
- bot " k~BY
o~top -- °'-'BY]
F/bot
~ ~ x A X " "-JBY"
Here Hxx(t) = GAB(t) Pxy(t) -I- PAB(t) Gxy(t). We present flax explicitly; its value when s = 0 is of interest in that it represents the long-term or steady-state arrival density. Since in practice Z~B,yr(t) is very small, formula (5.7) shows that the long-term probability of ending up in our coincident failure state (B, Y), instead of one of the other failure states, should be very close to this number which is 3.09 x 10 -4. Approximating our semi-Markov models by constant rate models gives an estimate of 2.99 x 10 -4 for this probability. We do not give ZAB,XY(t) explicitly,
J.A. SJOGREN
66
since there are many terms. There is a strong temptation to simplify ,~(s) by canceling roots in numerator and denominator which seem equal or very close, but this is a numerically delicate procedure. Instead we give HAX (t) explicitly from the partial fraction expansion. ~top
AX(8)
i0
"'AX
_
"'AX
"3 = j=l
(s +
We obtain = 10 -6 x
-0.000038 -0.000038 -0.389927 -0.214333 -0.596018 -0.596018 -0.074953 1.854995 0.016351
+ 0.001930/ - 0.001930i
-2.500079 -2.500079 -1.500030 -0.070030 -1.000109 -1.000109 -0.160062 -0.000060 -1.000063
+ 29.687109i - 29.687109/
+ 0.009899i - 0.009899i
+ 0.009899/ - 0.009899/
Then, writing 8
AX =-
lO
ai8
,
aaAX =
bis i - l ,
i=1
i=1
we get
0 1.2000e 1.0796e 3.8301e 6.7934e 6.2567e 2.7433e 4.2901e 1.9499e
+ + + + + +
04 01 01 03 05 07 08 09
1.0000e 9.7306e 3.8452e 7.9588e 9.2674e 6.0156e 1.9872e 2.6289e 1.0529e
Letting crj = dj + ej i, pj = uI + vj i, where i = ~ ,
+ + + + + + + + +
00 02 05 07 09 11 13 14 15
we get
H A X ( t ) = 0"3 ep3t + o'se p~t + crZe p't -{- 6rsePst -[- cr9ePgt
+ 2eU"(dl cos v,t + ez sin vlt) + 2eU't(d4 cos v4t + e4 sin v4t).
(7.9)
8. C O N C L U S I O N S
An important recent approach in reliability (and performance) theory is found in the use of closed-form, analytical solutions. One advantage is that this approach lends itself very well to models which are built up of smaller submodels in a hierarchical fashion. In this manner fault arrival behavior, system response, architectural fault-tolerance features, and operating system features can be analyzed separately. Each model yields an analytic expression, which can then be put together according to formulas valid for the underlying stochastic process. In practice, closed-form hierarchical solution of dependability problems has seen limited use. One limitation is that in combining two models, new failure states may have to be considered, which do not arise naturally from any particular failure state of either constituent submodel. We have presented a method for resolving such a situation. Using our formulas, it would seem feasible to incorporate the possibility of failure arising from the interaction of different hierarchical levels into a solution package such as SHARPE. The point of view we have presented emphasizes
Closed-form solution
67
:ertain density functions and distributions arising in the study of semi-Markov processes. These luantities shed new light even on constant-rate processes, and are the key to solving models )y decomposition. Large classes of (cyclic) semi-Markov chains can now be solved using the oundations laid in this article. The question of the numerical robustness of the closed-form Lpproach is still an open one. This does not detract from the fact that "exponomial" methods Lre of great potential value in solving the problems of reliability modeling, which remain of both )ractical and theoretical interest. REFERENCES 1. A. Reibman and K. Trivedi, Numerical transient analysis of Markov models, Computers and Operations Research (to appear). 2. P. Protzel, Problems with decomposition, Talk at Briefing on Reliability Modeling, NASA Langley Rcs. Center, Institute for Computer Applications in Science and Engineering notes, (November 1987). 3. Ft.W. Butler, An abstract language for specifying Markov reliability modeLs, IEEE Transactions on Reliability EL-35 (5) (December 1986). 4. S.M. Ross, Applied Probability Models with Optimization Applications, Holden-Day, San Francisco, (1983). 5. R.A. Sahner and K.S. Trivedi, A hierarchical, combinatorial-Markov method of solving complex reliability models, In Proc. Fall Joint Computer Conference, Dallas, (November 1986). 6. A.J. Lotka, A contribution to the theory of serf-renewing aggregates, with special reference to industrial replacement, Annals of Math. Stat., 1-25 (1939). 7. W. Feller, On the integral equation of renewal theory, Annals of Math. Star. 12, 243-267 (1941). 8. R.V. Churchill, The inversion of the Laplace transformation by a direct expansion in series, Math. Zeitschrifl 42,567-579 (1937). 9. R.A. Sahner and K.S. Trivedi, SHARPE: Symbolic hierarchical automated reliability and performance evaluator, In Guide for Users, Duke Univ., Dept. of Computer Science, (September 1986). 10. W. Feller, An Introduction to Probability Theory and Its Applications, Vol. LI, John Wiley, New York, (1971). 11. K.S. Trivedi, Probability ~ Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, Englewood Cliffs, (1982). 12. W. Feller, On semi-Markov processes, Proc. Natl. Acad. Sci. 51,653-659 (1964). 13. Ft. Butler and A. White, SURE reliability analysis, program and mathematics, NASA Tech. Paper No. 2764, (March 1988). 14. G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins Univ. Press, Baltimore, (1983). 15. T.A. Burton, Volterra Integral and Di.~erential Equations, Academic Press, New York, (1983). 16. J. Kohlas, Stochastic Methods oJ Operations Research, Cambridge University Press, (1982). 17. G. Baker, Charles Stark Draper Laboratories, Cambridge, Mass., (personal communication). 18. J.H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, (1965). 19. C.B. Moler and C.F. Van Loan, Nineteen dubious ways to compute the exponential of a matrix, SIAM Review 20, 801-836 (1978). 20. G.B. Finelli, Characterization of fault recovery through fault injection of FTMP, IEEE Trans. Reliability R-36 (2), 164-170 (June 1987). 21. E. Ellis and Ft. Butler, Estimating the distribution of fault latency in a digital processor, NASA Tech. Memorandum No. 0521, Langley Res. Center, (November 1987). 22. D.Ft. Cox, A use of complex probabilities in the theory of stochastic processes, Proceedings Camb. Phil. Soc. 51,313-319. 23. P.K. Lala, Fault Tolerant ~ Fault Testable Hardware Design, Prentice-Hall, Englewood Cliffs, (1985). 24. K. Trivedi, J. Dugan, R. Grist and M. Smotberman, Modeling imperfect coverage in fault-tolerant systems, FTCS-14 Digest 84CH2050-3, IEEE, 77-82 (1984). 25. S. Bavnso and P. Peterson, CARE lIl Model Overview and User's Guide, NASA Tech. Memorandum No.
86404, (1985). 26. G. Cohen, C. Lee, L. Brock and 2. Allen, Design/validation concept for an integrated airframe/propulsion control system architecture (IAPSA II), NASA Contractor Fteport No. 178084.