Frequentist and Bayesian Coverage Estimations for Stratified Fault-Injection*

Michel Cukier, Jean Arlat, David Powell
LAAS-CNRS, 7 Avenue du Colonel Roche, 31077 Toulouse, France

Research Report 96336, Revised January 2, 1997

Abstract. This paper addresses the problem of estimating the coverage of fault tolerance through statistical processing of observations collected in fault-injection experiments. In an earlier paper, we studied various frequentist estimation methods based on simple sampling in the whole fault/activity input space and on stratified sampling in a partitioned space. In this paper, Bayesian estimation methods are introduced for stratified sampling. Two methods are presented to obtain an approximation of the posterior distribution of the coverage by calculating its moments. The moments are then used to identify the type of the distribution in the Pearson distribution system, to estimate its parameters and to obtain the coverage confidence limit. Two hypothetical example systems are used to compare the validity and the conservatism of the Bayesian and frequentist estimations. It is shown that one Bayesian estimation method is valid for both examples and that its estimations are much less conservative than the frequentist ones. However, the Bayesian estimations for stratified sampling are still conservative compared to estimations for simple sampling.

Keywords: Fault-Injection, Coverage Estimation, Sampling Method, Stratification, Bayesian Estimation

1 Introduction

Computer systems have to be designed to be fault-tolerant if they are to meet the stringent dependability requirements imposed by critical applications. The degree of dependability that can be achieved relies heavily on the efficiency of the fault tolerance mechanisms — the fault tolerance coverage — which is usually defined as the probability of system recovery given that a fault exists [4]. Fault-injection experiments are commonly carried out to assess the fault tolerance mechanisms [3, 5, 6, 8, 12, 15, 17]. A single fault-injection experiment consists of injecting a fault condition into a simulation or a prototype of a fault-tolerant system and observing the behavior of the system to determine whether or not the injected fault has been properly handled by the system's fault tolerance mechanisms. In practical systems, it is not possible to inject all fault conditions that could possibly occur during the system's operational life, so coverage evaluation is usually achieved by statistical estimation.

In this paper, we address the problem of processing the observations collected in a set of fault-injection experiments for estimating the fault tolerance coverage. Since the effect of a fault is dependent on system activity at the moment the fault is sensitized, we consider a sampling space consisting of the combination of the fault space and the set of system "activities" at the moment of fault sensitization.

* This work was partially supported by the DEVA project (ESPRIT n°20072).


Many papers have been dedicated to estimation methods of the frequentist school when sampling in the complete "fault/activity" sample space (simple sampling), but only a few papers have studied estimation methods when the sample space is partitioned into classes or strata (stratified sampling). These methods were first applied to fault-injection in [14]. Stratification offers several practical advantages. It can be used to re-interpret the results of a fault-injection campaign should it be necessary, for example, to use test data for coverage estimation or should new knowledge be acquired about the actual distribution of real faults. Stratified sampling allows, under certain conditions, for the composition of results of fault-injection experiments. Sometimes, stratification simplifies the fault-injection process, for example for physical fault injection.

Wang et al. [18] used stratified sampling to introduce equivalence classes, defined as sets of faults that can be determined to have the same effect on system behavior. Their aim was to reduce the number of fault-injection experiments necessary to obtain a statistically significant estimate of coverage by injecting only one fault in each equivalence class. However, it was concluded that no benefit could be obtained from this approach in practical systems.

In [14], several coverage estimation techniques based on simple sampling and on stratified sampling were considered. It was concluded that stratified sampling techniques were of particular interest. The coverage confidence limit estimations presented in that paper were based on a normal distribution approximation obtained by applying the central limit theorem. This approximation led to the well-known result that stratification improves the precision of the estimation. It has been shown in [13] that the approximation leading to a normal distribution is not usually valid for coverage confidence limit estimations.
First, most fault tolerance mechanisms are characterized by very high coverage factor values. Second, the number of faults injected during a campaign is often relatively low (on the order of thousands). Both aspects restrict the validity of the normal distribution approximation. The confidence region theory was therefore applied for stratified sampling. Confidence regions obtained by two point estimators and a vectorial statistic were introduced. For more than three strata, the coverage confidence limit estimations using the point estimators become computationally intractable, so most of the time only the vectorial statistic can be applied. When applying the confidence region theory, the result that stratification improves the precision of the estimation is no longer correct. In fact, with the vectorial statistic, stratification leads to very conservative upper confidence limits on non-coverage.

This paper investigates whether coverage estimation methods from the Bayesian school can be used with stratified sampling to obtain less conservative yet tractable upper confidence limits. In Bayesian estimation, the coverage confidence limit depends directly on the overall non-coverage posterior distribution. Since the analytical expression of this posterior distribution can be easily obtained only for partitions into a small number of classes, we will consider methods giving an approximation of the posterior distribution for any number of classes. Two methods are presented to obtain the first moments of the distribution. The first method is based on the moment generating function of the posterior distribution. The second method introduces independence assumptions between the coverages of the various classes. The moment values are then used to identify, by using the Pearson distribution system, the type of the posterior distribution, to estimate its parameters, and, finally, to obtain the coverage confidence limit.


One way to assess the validity and the precision of an estimation method is to use experimental data obtained from a simulation of a system with known characteristics. In this way, it is possible to compare the estimates obtained from the simulated data with the parameters of the distributions used to obtain the data. Therefore, to compare the Bayesian estimates with the frequentist ones, we will use two hypothetical systems whose characteristics are chosen to aggressively test the developed statistical methods.

The remainder of the paper is organized as follows. Section 2 recapitulates some notions from [14] concerning basic definitions and the two categories of considered sampling techniques: simple sampling and stratified sampling. In order to distinguish the Bayesian methods from the frequentist ones, Section 3 summarizes the frequentist estimation methods introduced in [13]. Section 4 is dedicated to the Bayesian methods and describes, in particular, how to obtain an approximation of the posterior distribution of the overall non-coverage by its moments and how to identify the type of the approximated distribution in the Pearson distribution system. Section 5 compares the validity and the precision of the Bayesian and frequentist estimations for two hypothetical examples with both stratified and simple sampling. Finally, Section 6 concludes the paper.

2 Definitions

The definitions presented in this section recapitulate the concepts introduced in [14].

2.1 Coverage factor

The effect of a given fault is dependent on system activity at the moment of, and following, the sensitization of the fault, so the complete input space of a fault-tolerance mechanism consists not only of the considered fault space F but also of the activity space A due to the system's functional inputs. We can thus formally define the coverage factor of a fault tolerance mechanism in terms of the complete input space defined as the Cartesian product G = F × A. Let H be a variable characterizing the handling of a particular fault, such that H = 1 if the mechanism correctly handles the fault and H = 0 otherwise. The coverage is then defined as:

c = Pr{H = 1 | g ∈ G}    (1)

i.e., the conditional probability of correct fault handling, given the occurrence of a fault/activity pair g ∈ G. H is a discrete random variable that can take the values 0 or 1 for each element of the fault/activity space G. Let h(g) denote the value of H for a given point g ∈ G and let p(g) be the relative probability of occurrence of g. Expression (1) can then be rewritten as:

c = Σ_{g∈G} h(g) p(g)    (2)

The coverage factor c can be viewed as E{H}, the expected value of H. It should be stressed that the distribution p(g) is an inherent part of the very definition of coverage as a conditional probability parameter. Without knowledge about p(g), one cannot make any meaningful statements about the coverage factor of a system.
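Concretely, definition (2) is just a probability-weighted sum of the outcomes h(g). A minimal sketch (the three-point fault/activity space and all its values below are invented for illustration):

```python
# Toy fault/activity space: each pair g has an occurrence probability p(g)
# and an outcome h(g) (1 = fault correctly handled, 0 = deficiency).
# All values here are hypothetical.
space = {
    "g1": (0.5, 1),
    "g2": (0.3, 1),
    "g3": (0.2, 0),  # the only pair the mechanism fails to handle
}

# Coverage factor: c = sum over g of h(g) * p(g), i.e. E{H}.
c = sum(h * p for (p, h) in space.values())
print(c)  # coverage 0.8, so the non-coverage is 1 - c = 0.2
```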


The most accurate way to determine c would be to submit the system to all g ∈ G and to observe all outcomes h(g). However, such exhaustive testing is only possible under very restrictive hypotheses. For this reason, coverage evaluation is in practice carried out by submitting the system to a subset of fault/activity occurrences G* ⊂ G obtained by random sampling in the space G and then using statistics to estimate c. We recall the theory of coverage factor estimation by sampling first in a non-partitioned sampling space and then in a partitioned sampling space.

2.2 Sampling in a non-partitioned space

We consider sampling with replacement of a subset G* of n fault/activity pairs in G. To each element of G is assigned a selection probability t(g), such that ∀g ∈ G, t(g) > 0 and Σ_{g∈G} t(g) = 1. Note that the experiments are Bernoulli trials with outcome H = 1 with probability θ and H = 0 with probability 1 − θ, where θ = Σ_{g∈G} h(g) t(g). By setting t(g) equal to p(g), we obtain a representative sample and θ = c, the coverage factor. With representative sampling, it is shown in [13] that an unbiased estimator of the system non-coverage c̄ (= 1 − c) is given by:

c̄̂(X) = X / n    (3)

where X is the number of fault-tolerance deficiencies observed for n injected fault/activity pairs. In the sequel, we refer to the random variable X as the deficiency number and an observation of X is noted x.

2.3 Sampling in a partitioned space

For the sampling techniques that follow, the sampling space G is considered as partitioned into M classes or strata:

G = ∪_{i=1}^M G_i  such that  ∀i, j, i ≠ j: G_i ∩ G_j = ∅

We can rewrite the coverage factor definition (2) as follows:

c = Σ_{i=1}^M Σ_{g∈G_i} h(g) p(g) = Σ_{i=1}^M p(g ∈ G_i) Σ_{g∈G_i} h(g) p(g | g ∈ G_i) = Σ_{i=1}^M p_i c_i

where p(g ∈ G_i) ≡ p_i is the relative probability of fault/activity occurrences in class G_i and c_i is the corresponding coverage factor:

c_i = Σ_{g∈G_i} h(g) p(g | g ∈ G_i)

A target system can be characterized by:
• p = [p_1, ..., p_M], the vector of fault/activity occurrence probabilities in each class of the fault/activity space G,
• c̄ = [c̄_1, ..., c̄_M], the (unknown) vector of class non-coverages,


such that the overall non-coverage is given by the scalar product:

c̄ = Σ_{i=1}^M p_i c̄_i = p · c̄^T

In a stratified fault-injection campaign, a fixed number of experiments is carried out in each class (using representative sampling within each class, cf. Section 2.2). The random variables characterizing the deficiency number X_i for each class together constitute a random deficiency vector X. A stratified fault-injection campaign is therefore defined by:
• n = [n_1, ..., n_M], the number of fault/activity pairs injected per class;
• x = [x_1, ..., x_M], the observed value of the deficiency vector X.

We consider two stratified allocations of n experiments, corresponding respectively to a representative and a homogeneous allocation:

n_R = [p_1 n, ..., p_M n]        n_H = [n/M, ..., n/M]

With a representative allocation, the number of faults injected into each class is proportional to the fault/activity occurrence probability in that class, whereas a homogeneous allocation results in the injection of an identical number of faults in each class. The next two sections are dedicated to the frequentist and Bayesian confidence limit estimations with stratified sampling.
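A stratified campaign as defined above can be sketched with invented system parameters (the p_i and c̄_i below are hypothetical; normally the class non-coverages are unknown and are exactly what the campaign estimates):

```python
import random

random.seed(0)

# Hypothetical system: M = 4 classes with occurrence probabilities p_i
# and (normally unknown) class non-coverages cbar_i.
p    = [0.4, 0.3, 0.2, 0.1]
cbar = [0.01, 0.02, 0.05, 0.10]

# Overall non-coverage: scalar product p . cbar^T
cbar_overall = sum(pi * ci for pi, ci in zip(p, cbar))

n = 1000  # total number of injected fault/activity pairs

# Representative allocation n_R = [p_1 n, ..., p_M n] (rounded)
# and homogeneous allocation n_H = [n/M, ..., n/M].
n_R = [round(pi * n) for pi in p]
n_H = [n // len(p)] * len(p)

def run_campaign(alloc):
    """Draw the deficiency vector X, with X_i ~ Binomial(n_i, cbar_i)."""
    return [sum(random.random() < ci for _ in range(ni))
            for ni, ci in zip(alloc, cbar)]

x = run_campaign(n_R)
print(cbar_overall, n_R, n_H, x)
```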

3 Frequentist confidence limits

In frequentist theory, the upper 100γ% confidence limit estimator for c̄ is defined by:

c̄_γ↑(X) :  Pr[ c̄ ≤ c̄_γ↑(X) | c̄ ] = γ    (4)

i.e., if a large number of individual values x of the deficiency vector X are considered, c̄_γ↑(x) will be greater than the unknown value c̄ 100γ% of the time.

Several frequentist estimation methods were presented in [13]. That study included approximated estimations based on the central limit theorem for two coverage point estimators, as well as estimations obtained without approximation using either the point estimators or a vectorial statistic. We concluded that the approximated estimations are not valid when few faults are injected or when high-coverage fault-tolerant systems are considered. Therefore, we will consider hereafter only the frequentist estimation methods without approximation. To make this paper self-contained, the remainder of this section gives a summary of the statistical theory developed in [14].

3.1 Coverage estimators

We consider two point estimators, defined respectively by the arithmetic and weighted averages of the elements of the deficiency vector X:

Y_A(X) = (1/n) Σ_{i=1}^M X_i        Y_W(X) = Σ_{i=1}^M (p_i / n_i) X_i
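With invented campaign data, the two point estimators are one-liners; note that under a representative allocation (n_i = p_i n, as in this hypothetical example) the weighted average reduces to the arithmetic one:

```python
# Hypothetical campaign: class probabilities, per-class sample sizes
# (representative: n_i = p_i * n), and an observed deficiency vector x.
p   = [0.5, 0.3, 0.2]
n_i = [500, 300, 200]
x   = [3, 2, 1]
n   = sum(n_i)

# Arithmetic average: Y_A(X) = (1/n) * sum_i X_i
y_a = sum(x) / n

# Weighted average: Y_W(X) = sum_i (p_i / n_i) * X_i
y_w = sum(pi / ni * xi for pi, ni, xi in zip(p, n_i, x))

print(y_a, y_w)  # equal here because the allocation is representative
```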


It has been shown in [13] that the computation of the upper 100γ% confidence limit c̄_γ↑(x) can be expressed as the following maximization problem:

Maximization of c̄ = p · c̄^T under the constraints:

• c̄ :  Σ_{x′ : Y(x′) ≤ Y(x)}  Π_{i=1}^M  C(n_i, x′_i) c̄_i^{x′_i} (1 − c̄_i)^{n_i − x′_i}  =  1 − γ    (5)

• ∀i ∈ {1..M}, c̄_i ∈ [0, 1]

where C(n, k) denotes the binomial coefficient. The non-linear constraint defines a confidence region frontier for the vector c̄ around the origin of the parameter space, which is composed of the non-coverages of each class in the partitioned fault/activity space, c̄_i (i = 1, ..., M). The upper 100γ% confidence limit on the overall non-coverage, c̄_γ↑(x), is then given by p · c̄^T at the most distant point from the origin on the confidence region frontier. It was found in [7] that this maximization problem could be solved numerically only when considering partitions into few classes and when injecting a relatively small number of faults. Therefore, we introduced a vectorial statistic which provides a numerical solution for a large range of fault-injection experiments.

3.2 Vectorial statistic

The vectorial statistic is defined by the deficiency vector X itself. The upper 100γ% confidence limit c̄_γ↑(x) is then given by the solution of:

Minimization of c̄ = p · c̄^T under the constraints:

• c̄ :  Π_{i=1}^M [ 1 − Σ_{x′_i=0}^{x_i} C(n_i, x′_i) c̄_i^{x′_i} (1 − c̄_i)^{n_i − x′_i} ]  =  γ    (6)

• ∀i ∈ {1..M}, c̄_i ∈ [0, 1]

The non-linear constraint defines the locus of the apexes of hypercubic confidence regions for c̄ around the origin of the parameter space. The value of p · c̄^T at the apex of each such confidence region defines an upper 100γ% confidence limit. The least conservative upper 100γ% confidence limit c̄_γ↑(x) is thus given by p · c̄^T at the nearest point to the origin on the locus of these apexes. This minimization problem can be solved numerically for partitions into many classes and for large fault-injection campaigns. Therefore, all frequentist estimations considered in Section 5 will be based on this vectorial statistic.
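For illustration only, the minimization over the constraint locus can be solved for M = 2 classes by a crude grid-plus-bisection search; this is a sketch under invented campaign numbers, not the paper's method (the paper solves the general problem with a NAG optimization routine). The per-class factor 1 − Σ_{x′=0}^{x_i} C(n_i, x′) c̄^{x′}(1 − c̄)^{n−x′} is the upper tail probability of a Binomial(n_i, c̄_i) at x_i, which is increasing in c̄_i:

```python
from math import comb

def binom_cdf(x, n, q):
    """P[X <= x] for X ~ Binomial(n, q)."""
    return sum(comb(n, k) * q**k * (1 - q)**(n - k) for k in range(x + 1))

def solve_tail(x, n, t, tol=1e-12):
    """Find q such that 1 - P[X <= x] = t (bisection; tail increases in q)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 1 - binom_cdf(x, n, mid) < t:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def upper_limit_2classes(p, x, n, gamma=0.99, steps=200):
    """Least conservative p . cbar over the locus where the product of the
    per-class tail probabilities equals gamma (vectorial statistic, M = 2)."""
    best = float("inf")
    for i in range(1, steps):
        t1 = gamma + (1 - gamma) * i / steps  # tail share given to class 1
        t2 = gamma / t1                       # enforces the product constraint
        c1 = solve_tail(x[0], n[0], t1)
        c2 = solve_tail(x[1], n[1], t2)
        best = min(best, p[0] * c1 + p[1] * c2)
    return best

# Invented campaign: 500 faults per class, 0 and 1 deficiencies observed.
limit = upper_limit_2classes(p=[0.6, 0.4], x=[0, 1], n=[500, 500])
print(limit)
```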

4 Bayesian confidence limits

In Bayesian theory, the overall non-coverage c̄ and the class non-coverages c̄_i are no longer considered as parameters but as random variables. Accordingly, they will be noted respectively C̄ and C̄_i. When applying the Bayesian theory, an upper 100γ% confidence limit is directly defined by the distribution of the random variable C̄:

c̄_γ↑(x) :  Pr[ C̄ ≤ c̄_γ↑(X) | X = x ] = γ    (7)

i.e., the value c̄_γ↑(x) such that, for the observed value x of the deficiency vector X, the probability that C̄ is less than c̄_γ↑(x) is equal to γ.

4.1 Considered distributions

To calculate the confidence limit defined by (7), we must first obtain the posterior distribution of C̄, noted f_{C̄}(c̄ | X = x). With stratified sampling, we seek to obtain this distribution from the posterior distributions of the class non-coverages C̄_i, noted f_{C̄_i}(c̄_i | X_i = x_i). The posterior distributions combine the prior distributions, which represent the knowledge or belief about the distributions of the C̄_i before carrying out the experiments, with the extra information about these distributions obtained during the experiments, through the observed number of deficiencies. It is of course essential to choose the prior distributions for the class non-coverages C̄_i very carefully. There are several reasons why a beta distribution is a good choice as a prior distribution for the C̄_i. First, since the deficiency number of each class X_i has a binomial distribution, the beta prior belongs to a family of conjugate prior distributions, which ensures that the prior and the posterior distributions are both from the same family. The posterior distribution will then also have an analytical expression. Secondly, when both parameters of the beta distribution are equal to one, the obtained distribution is a uniform distribution over the interval [0, 1], which means that all values c̄_i have the same weight. We therefore define the prior distributions for the C̄_i as beta distributions:

f_{C̄_i}(c̄_i) = c̄_i^{k_i − 1} (1 − c̄_i)^{l_i − 1} / B(k_i, l_i),    0 ≤ c̄_i ≤ 1,  k_i > 0,  l_i > 0    (8)

where B(k_i, l_i) is a Beta function with parameters k_i and l_i [1, p. 258]. Since the deficiency number of each class X_i has a binomial distribution:

f_{X_i}(x_i | C̄_i = c̄_i) = C(n_i, x_i) c̄_i^{x_i} (1 − c̄_i)^{n_i − x_i}    (9)

where C(n, k) denotes the binomial coefficient, the posterior distribution for C̄_i is given by [2, p. 28]:

f_{C̄_i}(c̄_i | X_i = x_i) = c̄_i^{x_i + k_i − 1} (1 − c̄_i)^{n_i − x_i + l_i − 1} / B(x_i + k_i, n_i − x_i + l_i)    (10)

Let k′_i = x_i + k_i and l′_i = n_i − x_i + l_i, so that:

f_{C̄_i}(c̄_i | X_i = x_i) = c̄_i^{k′_i − 1} (1 − c̄_i)^{l′_i − 1} / B(k′_i, l′_i)

which is a beta distribution with parameters k′_i and l′_i, beta(k′_i, l′_i).

Since C̄ = Σ_{i=1}^M p_i C̄_i, we will look for the posterior distribution of C̄ by combining the posterior distributions of the various C̄_i. It has been shown in [7] that obtaining an analytical expression of the posterior distribution of C̄ is far too complicated for partitions into more than about three classes. Therefore, in general, the posterior distribution of C̄ can only be obtained by applying approximate methods.

When a distribution cannot be calculated exactly, then for most ordinary purposes, knowledge of the moments is equivalent to knowledge of the distribution function, in the sense that it should be possible theoretically to exhibit all the properties of the distribution in terms of the moments [16, pp. 108-109]. It is further stressed in [16, pp. 109-110] that distributions that have a finite number of the lower moments in common will, in a sense, be approximations to one another. In practice, approximations of this kind often turn out to be remarkably good even when only the first two, three or four moments are equated.

In Section 4.4, the Pearson distribution system will be used to identify the type of the posterior distribution of C̄. This system uses the values of the second, third and fourth central moments of the distribution to identify its type. With the addition of the mean, the parameters of the identified distribution can be estimated. Sections 4.2 and 4.3 are dedicated to two different methods for calculating the first four moments of the posterior distribution of C̄. The first method is based on the moment generating function of C̄. The second method introduces independence assumptions between the coverages of the various classes.

4.2 Moment generating function

Since the moment generating functions of the C̄_i are well known for beta distributions, the moment generating function of C̄ can easily be obtained by combining the moment generating functions of the C̄_i. The derivatives of the moment generating function of C̄ define the moments (about zero) of C̄.
The second, third and fourth central moments of C̄ are then obtained by applying the classic expressions that relate the central moments to the moments about zero. The moment generating function [9, p. 20] of a random variable which has a beta distribution can be expressed as a confluent hypergeometric function [1, chap. 13; 11, p. 40]. So, when C̄_i has a beta distribution beta(k′_i, l′_i), the moment generating function of C̄_i is given by:

φ_{C̄_i}(t) = ₁F₁(k′_i; k′_i + l′_i; t)    (11)

where ₁F₁(k′_i; k′_i + l′_i; t) is the confluent hypergeometric function. The moment generating function of p_i C̄_i is equal to φ_{p_i C̄_i}(t) = ₁F₁(k′_i; k′_i + l′_i; p_i t). Since the moment generating function of a sum of independent random variables is equal to the product of the moment generating functions of the individual random variables [9, p. 21], the moment generating function of C̄ = Σ_{i=1}^M p_i C̄_i is given by:

φ_{C̄}(t) = Π_{i=1}^M ₁F₁(k′_i; k′_i + l′_i; p_i t)    (12)

Since the r-th derivative of the moment generating function of C̄ at t = 0 defines the r-th moment µ′_r of C̄, the various moments can easily be obtained. The expressions of the first four moments are presented in Figure 1. For the sake of conciseness, we only detail the derivation of the first moment.

µ′_1 = Σ_i p_i k′_i / (k′_i + l′_i)

µ′_2 = Σ_i p_i² k′_i(k′_i+1) / [(k′_i+l′_i)(k′_i+l′_i+1)]
       + Σ_i [p_i k′_i / (k′_i+l′_i)] Σ_{j≠i} [p_j k′_j / (k′_j+l′_j)]

µ′_3 = Σ_i p_i³ k′_i(k′_i+1)(k′_i+2) / [(k′_i+l′_i)(k′_i+l′_i+1)(k′_i+l′_i+2)]
       + 3 Σ_i {p_i² k′_i(k′_i+1) / [(k′_i+l′_i)(k′_i+l′_i+1)]} Σ_{j≠i} [p_j k′_j / (k′_j+l′_j)]
       + Σ_i [p_i k′_i / (k′_i+l′_i)] Σ_{j≠i} [p_j k′_j / (k′_j+l′_j)] Σ_{u≠i,j} [p_u k′_u / (k′_u+l′_u)]

µ′_4 = Σ_i p_i⁴ k′_i(k′_i+1)(k′_i+2)(k′_i+3) / [(k′_i+l′_i)(k′_i+l′_i+1)(k′_i+l′_i+2)(k′_i+l′_i+3)]
       + 4 Σ_i {p_i³ k′_i(k′_i+1)(k′_i+2) / [(k′_i+l′_i)(k′_i+l′_i+1)(k′_i+l′_i+2)]} Σ_{j≠i} [p_j k′_j / (k′_j+l′_j)]
       + 3 Σ_i {p_i² k′_i(k′_i+1) / [(k′_i+l′_i)(k′_i+l′_i+1)]} Σ_{j≠i} {p_j² k′_j(k′_j+1) / [(k′_j+l′_j)(k′_j+l′_j+1)]}
       + 6 Σ_i {p_i² k′_i(k′_i+1) / [(k′_i+l′_i)(k′_i+l′_i+1)]} Σ_{j≠i} [p_j k′_j / (k′_j+l′_j)] Σ_{u≠i,j} [p_u k′_u / (k′_u+l′_u)]
       + Σ_i [p_i k′_i / (k′_i+l′_i)] Σ_{j≠i} [p_j k′_j / (k′_j+l′_j)] Σ_{u≠i,j} [p_u k′_u / (k′_u+l′_u)] Σ_{v≠i,j,u} [p_v k′_v / (k′_v+l′_v)]

(all sums run from 1 to M; the constraints j ≠ i, u ≠ i, j and v ≠ i, j, u exclude repeated indices)

Figure 1 - The first four moments

The first derivative of φ_{C̄}(t) at t = 0 leads to:

µ′_1 = dφ_{C̄}(t)/dt |_{t=0} = Σ_{i=1}^M [ (d ₁F₁(k′_i; k′_i+l′_i; p_i t)/dt) Π_{j≠i} ₁F₁(k′_j; k′_j+l′_j; p_j t) ] |_{t=0}

By the properties of the confluent hypergeometric function [1, pp. 504-515], we have:

₁F₁(a; b; 0) = 1
d ₁F₁(a; b; z)/dz = (a/b) ₁F₁(a+1; b+1; z)

So µ′_1 finally becomes:

µ′_1 = Σ_{i=1}^M p_i k′_i / (k′_i + l′_i)
The second, third and fourth central moments µ_r (r = 2, 3, 4) can be obtained from the moments µ′_r by the expressions [9, p. 18, eq. 100]:

µ_2 = µ′_2 − µ′_1²;   µ_3 = µ′_3 − 3µ′_2 µ′_1 + 2µ′_1³;   µ_4 = µ′_4 − 4µ′_3 µ′_1 + 6µ′_2 µ′_1² − 3µ′_1⁴    (13)

4.3 Independence assumptions

Simpler expressions for the moments of C̄ can be obtained by making some assumptions on the independence between powers of the C̄_i. The first moment of C̄ is directly obtained as a linear combination of the first moments of the C̄_i. By making assumptions on powers of the non-coverages C̄_i, we can obtain linear expressions between the second, third and fourth central moments of C̄ and, respectively, the second, third and fourth central moments of the C̄_i. By using the well-known relationship between moments about zero and central moments, we will derive the second, third and fourth moments of each C̄_i. Since the C̄_i have beta distributions, the central moments of C̄ can easily be obtained from the moments of the C̄_i.

It can be shown that the second, third and fourth central moments of a weighted sum of independent random variables Z_i are equal to a weighted sum of the central moments of these random variables if:

∀i, j and i ≠ j:  Z_i and Z_j;  Z_i² and Z_j;  Z_i² and Z_j²;  Z_i³ and Z_j  are independent    (14)

Applied to the overall non-coverage C̄ = Σ_{i=1}^M p_i C̄_i, if µ_ri is the r-th central moment of C̄_i, and if (14) is satisfied for the C̄_i taken pair by pair, then the second, third and fourth central moments of the overall non-coverage C̄ can simply be expressed in terms of the central moments of the non-coverages C̄_i:

µ_2 = Σ_{i=1}^M p_i² µ_2i;   µ_3 = Σ_{i=1}^M p_i³ µ_3i;   µ_4 = Σ_{i=1}^M p_i⁴ µ_4i

For each class, the three central moments µ_ri (r = 2, 3, 4) can be obtained from the moments µ′_ri by expressions identical to (13):

µ_2i = µ′_2i − µ′_1i²;   µ_3i = µ′_3i − 3µ′_2i µ′_1i + 2µ′_1i³;   µ_4i = µ′_4i − 4µ′_3i µ′_1i + 6µ′_2i µ′_1i² − 3µ′_1i⁴    (15)

Finally, the r-th moment µ′_ri (r = 2, 3, 4) of C̄_i, which has a beta distribution beta(k′_i, l′_i), is given by:

µ′_ri = B(k′_i + r, l′_i) / B(k′_i, l′_i) = [k′_i (k′_i + 1) ⋯ (k′_i + r − 1)] / [(k′_i + l′_i)(k′_i + l′_i + 1) ⋯ (k′_i + l′_i + r − 1)]

where B(k′_i, l′_i) is a Beta function with parameters k′_i and l′_i [1, p. 258].

4.4 The Pearson distribution system

It is noted in [10, p. 9] that "some families of distributions have been constructed which are intended to provide approximations to a variety of observed distributions as wide as possible. Such families are called systems of frequency curves. For most practical purposes, it is sufficient to use four parameters to describe a distribution." Since we are looking for an approach which ascertains "a family of distributions that will satisfactorily represent observed data" [16, p. 210], we will consider the Pearson distribution system. For every member of the system, the probability density function f(z) satisfies a differential equation of the form [10, p. 9]:

(1/f) df/dz = − (a + z) / (b_0 + b_1 z + b_2 z²)    (16)

The shape of the distribution depends on the values of the four parameters a, b_0, b_1 and b_2. These parameters are related to the first four moments of the distribution f(z) by the expressions [10, p. 13]:

b_0 = µ_2 (4µ_2 µ_4 − 3µ_3²) / (10µ_2 µ_4 − 12µ_3² − 18µ_2³)
a = b_1 = µ_3 (µ_4 + 3µ_2²) / (10µ_2 µ_4 − 12µ_3² − 18µ_2³)    (17)
b_2 = (2µ_2 µ_4 − 3µ_3² − 6µ_2³) / (10µ_2 µ_4 − 12µ_3² − 18µ_2³)

The Pearson distribution system contains seven types of distribution (I to VII), which include, for example:
• the beta distribution (Type I),
• the gamma distribution (Type III),
• the inverse Gaussian (Wald) distribution (Type V),
• the t distribution (Type VII).

The seven distributions can be represented in a plane (β_1, β_2), where β_1 is the skewness coefficient (β_1 = µ_3² / µ_2³) and β_2 is the kurtosis coefficient (β_2 = µ_4 / µ_2²). Figure 2 presents the regions corresponding to the various distributions in the plane (β_1, β_2).


[Figure 2 plots the (β_1, β_2) plane, with β_1 from 0 to 1.8 on the horizontal axis and β_2 from 1 to 8 on the vertical axis; Types I, IV and VI appear as regions, while Types II, III, V and VII lie on boundary lines.]

Figure 2 - The Pearson distribution system [10, p.13]

Only Types I, IV and VI correspond to areas in the (β_1, β_2) diagram. The remaining types correspond to lines and are sometimes called transition types. In practice, the values of β_1 and β_2 will lead to Type I, IV or VI. The parameters of the identified distribution are estimated from the values of the first four moments. For the Bayesian theory, the confidence limit estimation c̄_γ↑(x) corresponds to the inverse function of the posterior distribution of C̄ (7). This estimation is then obtained numerically, either by using a routine from a mathematical library (for the classic distributions of the Pearson system) or by successive approximations (for the less common distributions).
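The moment computation of Section 4.3 and the type identification of Section 4.4 can be sketched end to end. This is only an illustration: the campaign numbers in the test are invented, the classification below uses Pearson's κ criterion (κ < 0: Type I; 0 < κ < 1: Type IV; κ > 1: Type VI) while ignoring the boundary (transition) cases, and the paper itself proceeds through the full system (16)-(17):

```python
from math import prod

def beta_raw_moments(kp, lp, rmax=4):
    """Raw moments mu'_r of beta(kp, lp): prod over j < r of (kp+j)/(kp+lp+j)."""
    return [prod((kp + j) / (kp + lp + j) for j in range(r))
            for r in range(1, rmax + 1)]

def central_from_raw(m1, m2, m3, m4):
    """Central moments mu_2..mu_4 from raw moments, as in (13)/(15)."""
    mu2 = m2 - m1**2
    mu3 = m3 - 3*m2*m1 + 2*m1**3
    mu4 = m4 - 4*m3*m1 + 6*m2*m1**2 - 3*m1**4
    return mu2, mu3, mu4

def moments_of_cbar(p, kp, lp):
    """Mean and central moments of Cbar = sum_i p_i Cbar_i under the
    independence assumptions (14): mu_r = sum_i p_i**r * mu_{r,i}."""
    mean = mu2 = mu3 = mu4 = 0.0
    for pi, a, b in zip(p, kp, lp):
        m1, m2, m3, m4 = beta_raw_moments(a, b)
        c2, c3, c4 = central_from_raw(m1, m2, m3, m4)
        mean += pi * m1
        mu2 += pi**2 * c2
        mu3 += pi**3 * c3
        mu4 += pi**4 * c4
    return mean, mu2, mu3, mu4

def pearson_type(mu2, mu3, mu4):
    """beta_1, beta_2 and the kappa criterion; transition types ignored."""
    b1 = mu3**2 / mu2**3
    b2 = mu4 / mu2**2
    kappa = b1 * (b2 + 3)**2 / (4 * (4*b2 - 3*b1) * (2*b2 - 3*b1 - 6))
    return b1, b2, kappa, ("I" if kappa < 0 else "IV" if kappa < 1 else "VI")
```

As a sanity check, for a single class (M = 1) the moments reduce to those of the class posterior itself, and a genuine beta posterior should be identified as Type I.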

5 Results and comparisons

In this section, two hypothetical systems (which we will call System 1 and System 2)¹ are used to compare the frequentist and the Bayesian confidence limit estimation methods defined in Sections 3 and 4. For the Bayesian estimations, we consider uniform prior distributions (parameters of the beta distributions equal to one), which minimize the subjectivity in the Bayesian approach.

5.1 Definition of the systems

The characteristics of the two example systems are defined in Figures 3a and 3b. Each system is partitioned into M = 50 classes. The figures show the non-coverage c̄_i and the fault/activity occurrence probability p_i of each class, as well as the values of the system non-coverage c̄ and the mean class non-coverage c̃ = (1/M) Σ_{i=1}^M c̄_i. Each system is also characterized by the correlation factor ρ between the p_i and c̄_i [14]. The characteristics of the two example systems are quite different and have been chosen to aggressively test the coverage estimation methods in extreme situations. System 1 (Figure 3a) features a relative homogeneity among the classes regarding the non-coverage, a relatively low variability of the fault/activity occurrence probabilities p_i and a slight (positive) correlation factor (≈14%). System 2 (Figure 3b) is a near-perfect system with a very low non-coverage. It has quite a high variability over the classes and a large positive correlation between the p_i and c̄_i (≈90%).

¹ These systems correspond respectively to System A and System C in [13]. Since no result could be obtained for System B in [13] with the frequentist estimation method using the vectorial statistic, its analysis is not included here.

[Figures 3a and 3b plot, for each class, the class non-coverage c̄_i (vertical log scale, with the levels of c̄ and c̃ marked) against the occurrence probability p_i (horizontal log scale).]

a) System 1 (c̄ = 2.20E-02, ρ = 13.5%)
b) System 2 (c̄ = 3.72E-06, ρ = 89.1%)

Figure 3 - System characteristics

5.2 Comparison method

We compare the various confidence limit estimation methods by simulating a large number (25,000) of fault-injection campaigns carried out on each of these two systems, with a total sample size (the total number of injected faults in each campaign) taking six values ranging from 500 to 50,000,000. For each simulated campaign, we draw a random deficiency vector using the known class coverages of our hypothetical systems. We then compute the 99% confidence limits that would be obtained with each considered method.

Since we have total knowledge about our two hypothetical systems, we can verify whether the confidence we claim for the various methods is justified by calculating the proportion of simulated fault-injection campaigns for which the confidence limit statement c̄ < c̄_γ↑(x) is true. We call this proportion the success proportion. With the prescribed 99% confidence level, the success proportion should also be about 99%. If this is not so for a given system and sample size, then the corresponding method is inadequate for that situation and should not have been used.

To assess the conservatism of the confidence limits for each method, we would also like to evaluate the expected values of the confidence limits for each total sample size. The expected confidence limits can be calculated by summing the corresponding confidence limit expressions over the whole sample space, weighted by the (known) sample distribution. However, this summation would be much too time-consuming, so we resort to the simulated fault-injection campaigns and use the average of all the simulated confidence limits as an approximation of the expected value. The average is finally divided by the value of the system non-coverage. We call this measure the normalized expected 99% upper confidence limit. The nearer this normalized limit is to one, the less conservative is the corresponding estimation method.

5.3 Presentation of the results

Each estimation method will be identified by a mnemonic using the notation defined in the table of Figure 4.

Theory

Method

Sample allocation

F/H

Frequentist (F)*

BM/H

Bayesian (B)

Moment generating func. (M)

Homogeneous (H)

BI/H

Bayesian (B)

Independence assump. (I)

Homogeneous (H)

F/R

Frequentist (F)*

BM/R

Bayesian (B)

BI/R

Bayesian (B)

Homogeneous (H)

Representative (R) Moment generating func. (M)

Independence assump. (I) * using the vectorial statistic [13]

Representative (R) Representative (R)

Figure 4 - Notation
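The comparison procedure of Section 5.2 can be illustrated with a minimal Monte Carlo sketch. For simplicity, the sketch below uses simple (non-stratified) binomial sampling and a Clopper-Pearson-style upper confidence limit rather than the stratified estimators compared in the paper; the true non-coverage c_true, the sample size and the number of campaigns are hypothetical values chosen only for illustration.

```python
import random

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed term by term."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    term = (1.0 - p) ** n          # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= (n - i + 1) / i * p / (1.0 - p)
        total += term
    return min(total, 1.0)

def upper_limit(k, n, gamma=0.99):
    """Clopper-Pearson-style 100*gamma% upper confidence limit for the
    non-coverage, found by bisection on the binomial CDF."""
    lo, hi = k / n, 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if binom_cdf(k, n, mid) > 1.0 - gamma:
            lo = mid
        else:
            hi = mid
    return hi

random.seed(1)
c_true = 0.01                      # hypothetical true non-coverage
n, campaigns = 1000, 2000          # faults per campaign, simulated campaigns
hits = 0
for _ in range(campaigns):
    k = sum(random.random() < c_true for _ in range(n))  # unhandled faults
    if c_true < upper_limit(k, n):                       # is the claim true?
        hits += 1
success_proportion = hits / campaigns
print(f"success proportion: {success_proportion:.3f}")   # at least ~0.99
```

Because the limit is exact (Clopper-Pearson), the success proportion comes out at or slightly above the prescribed 99%, i.e., the method is valid but somewhat conservative, which is exactly the behavior the comparison in this section is designed to expose.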

The success proportions obtained for each system are presented in Figures 5a and 6a. Figures 5b and 6b show the normalized expected 99% upper confidence limits for each system. As explained below, it sometimes happens that a confidence limit estimation cγ↑(x) cannot be obtained for a given fault-injection campaign. When this occurred in more than 10% of the simulated campaigns, no result is given. When the confidence limit estimations could be calculated for more than 90% of all simulated campaigns, the presented results are normalized by the number of simulated fault-injection campaigns leading to a result.

The optimization problem obtained with the frequentist estimation method based on the vectorial statistic is solved numerically using the e04vcf routine of the NAG library. For high total sample sizes, this routine can fail to find a solution. In [13], we indicated for which sample sizes the routine could find a solution for more than 90% of all simulated campaigns.

Concerning the Bayesian estimations, the posterior distribution of C was identified as a distribution of Type I, IV or VI. For both considered systems, in most cases, the posterior distribution was identified as a distribution of Type I, which corresponds to a beta distribution. When the values of the central moments lead to a distribution of Type I, the inverse function of the beta distribution, which gives the confidence limit estimation cγ↑(x), is easily obtained numerically using the g01fef routine of the NAG library. Of the two other possible types, Type VI is a less common distribution: its inverse function can only be obtained numerically by successive approximations. Since this distribution is seldom used, we do not present the corresponding confidence limit estimations cγ↑(x) in this paper. Finally, no common statistical distributions are of Type IV form [10, p. 12]. It is also stressed in [10] that its probability density function "leads to intractable mathematics if one attempts to calculate values of the cumulative distribution function" (which must be calculated to obtain the inverse function of the distribution by successive approximations). Therefore, when the values of β1 and β2 lead to a distribution of Type IV, we cannot calculate the confidence limit estimation cγ↑(x).

For both systems, no results are shown for the Bayesian estimation method based on the moment generating function with a homogeneous allocation, since it leads to a distribution of Type VI in more than 2,500 simulated fault-injection campaigns. Similarly, for System 1, a Type IV distribution is obtained in more than 2,500 campaigns with the moment generating function for large values of n. For System 2, some results (n=500) are not presented for a representative stratification with small total sample sizes, since this combination leads to a null allocation (ni = 0) for some classes and is therefore meaningless.

5.4 Discussion of the results

5.4.1 System 1

In Figure 5a, the success proportion for the available frequentist estimations is always 100%. This underlines the fact that these estimations are conservative. The success proportion for the Bayesian estimation method based on the moment generating function decreases from 100% to a value of about 99% as the number of injected faults increases. These estimations are thus less conservative than the frequentist ones (corresponding to the vectorial statistic), as confirmed by Figure 5b. Since, for both methods, the success proportion is higher than 99%, both methods are valid. Finally, the success proportion for the Bayesian estimation method using the independence assumptions decreases from 100% to a value less than 99% as the number of injected faults increases. This method is therefore not valid for System 1, and we can expect this to be the case for any system with modest coverage. Figure 5b shows that the frequentist estimations are always more conservative than the Bayesian estimations based on the moment generating function. However, this conservatism decreases as the number of injected faults increases.
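The identification of the Pearson type from β1 and β2, used throughout these results, relies on Pearson's κ criterion. A minimal sketch of the classification follows (boundary and transition types are ignored here; the criterion itself is the standard one tabulated in [10]):

```python
def pearson_type(mu2, mu3, mu4):
    """Classify the Pearson main type from the central moments mu2..mu4
    using the kappa criterion (transition types at boundaries ignored)."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    kappa = (beta1 * (beta2 + 3.0) ** 2
             / (4.0 * (4.0 * beta2 - 3.0 * beta1)
                    * (2.0 * beta2 - 3.0 * beta1 - 6.0)))
    if kappa < 0.0:
        return "Type I (beta)"
    if kappa < 1.0:
        return "Type IV"
    return "Type VI"

# Sanity check with the known central moments of Beta(a, b): a beta
# distribution must come out as Type I.
a, b = 2.0, 5.0
s = a + b
mu2 = a * b / (s ** 2 * (s + 1))
mu3 = 2 * a * b * (b - a) / (s ** 3 * (s + 1) * (s + 2))
mu4 = 3 * a * b * (a * b * (s - 6) + 2 * s ** 2) / (s ** 4 * (s + 1) * (s + 2) * (s + 3))
print(pearson_type(mu2, mu3, mu4))  # -> Type I (beta)
```

Once the type is known, the four moments fix the parameters of the corresponding density, whose inverse CDF then yields the confidence limit.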


Figure 5 - System 1: a) Success proportion; b) Normalized expected 99% upper confidence limits
(curves F/H, BM/H, BI/H, F/R, BM/R and BI/R, plotted against the total sample size n from 500 to 5×10^7)

5.4.2 System 2

In Figure 6a, the success proportion for the frequentist estimations is always 100%. The success proportion for both Bayesian estimation methods decreases from 100% to a value close to 100% (about 99.8%) as the number of injected faults increases. Since the success proportion is higher than 99%, the frequentist estimation method and both Bayesian methods are valid. The high success proportion indicates that these estimations will be conservative, even if, for large samples, the Bayesian estimations might become less conservative. Figure 6b shows that the frequentist estimations are always more conservative than the Bayesian estimations. Again, this conservatism decreases as the number of injected faults increases.

Figure 6 - System 2: a) Success proportion; b) Normalized expected 99% upper confidence limits
(curves F/H, BM/H, BI/H, F/R, BM/R and BI/R, plotted against the total sample size n from 500 to 5×10^7)


5.4.3 Summary

For both System 1 and System 2, the Bayesian estimations are less conservative than the frequentist ones. Although both Bayesian methods are valid for System 2 (a system with very high coverage), only the Bayesian estimation method based on the moment generating function is valid for System 1 (a system with modest coverage). In practice, the coverage of a system is unknown. Therefore, from now on we discuss only the Bayesian estimation method based on the moment generating function.

With this Bayesian method and a homogeneous allocation, the posterior distribution of C is often identified, in the Pearson system of distributions, as a distribution of Type VI. With a representative allocation, the posterior distribution is always identified as a distribution of Type I. The confidence limit estimation cγ↑(x) can thus be calculated more easily for this type of allocation. These observations lead us to recommend using the Bayesian estimation method based on the moment generating function with a representative allocation to estimate the 100γ% upper non-coverage confidence limit cγ↑(x) when stratification is used.

5.5 Stratified sampling vs. simple sampling

We have shown in [13] that the frequentist 100γ% upper non-coverage confidence limit with the vectorial statistic is more conservative for stratified sampling than for simple sampling. Since, for stratified sampling, the Bayesian estimations are less conservative than the frequentist ones, it is interesting to study how well they fare compared to the estimations obtained with simple sampling. For simple sampling, frequentist and Bayesian estimations are practically identical for the considered sample sizes [7]. Figures 7 and 8 present the normalized expected 99% upper confidence limits for System 1 and System 2 respectively. The simple sampling estimations are denoted "NS" (Non-Stratified) in these figures.
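As a complement to the moment-based machinery discussed above, the behavior of a Bayesian estimation under the independence assumption can also be explored by direct Monte Carlo sampling of the class posteriors, in place of the Pearson moment matching used in the paper. The class probabilities, per-class sample sizes and deficiency counts below are hypothetical, chosen purely for illustration; with uniform priors, each class non-coverage has an independent Beta(ki+1, ni−ki+1) posterior.

```python
import random

random.seed(2)
# Hypothetical stratified campaign (illustrative numbers only):
p = [0.40, 0.35, 0.25]     # known class probabilities
n = [2000, 2000, 1000]     # faults injected in each class
k = [3, 1, 0]              # unhandled faults observed in each class

# Sample the posterior of the overall non-coverage C = sum_i p_i * C_i,
# with independent per-class posteriors Beta(k_i + 1, n_i - k_i + 1).
draws = 20000
samples = sorted(
    sum(pi * random.betavariate(ki + 1, ni - ki + 1)
        for pi, ni, ki in zip(p, n, k))
    for _ in range(draws)
)
posterior_mean = sum(samples) / draws
upper_99 = samples[int(0.99 * draws)]  # empirical 99% upper confidence limit
print(f"posterior mean: {posterior_mean:.5f}, 99% upper limit: {upper_99:.5f}")
```

Sampling sidesteps the Type IV/VI inversion difficulties entirely, at the cost of Monte Carlo noise; the moment-based identification remains preferable when a closed-form quantile routine (such as the beta inverse) is available.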

Figure 7 - Normalized expected limits for stratified and non-stratified sampling (System 1)
(curves F/R, BM/R and NS, plotted on a logarithmic scale against the total sample size n)


Figure 8 - Normalized expected limits for stratified and non-stratified sampling (System 2)
(curves F/R, BM/R and NS, plotted on a logarithmic scale against the total sample size n from 500 to 5×10^7)

The figures show that the Bayesian estimations based on the moment generating function are more conservative than those obtained with simple sampling. However, the Bayesian estimations are much less conservative than the frequentist estimations based on the vectorial statistic.

6 Conclusion

This paper has presented Bayesian estimation methods applied to fault injection for assessing coverage figures of fault-tolerant systems. In particular, we have focused on the statistical methods that can be applied to the results of a stratified fault-injection campaign on a system with high coverage. As a complement to our previous work on stratified sampling, which presented frequentist estimation methods, this paper was dedicated to Bayesian estimation methods.

In the Bayesian theory, the coverage confidence limits depend directly on the posterior distribution of the overall non-coverage. For partitions into more than three classes, the analytical expression of the posterior distribution becomes too complicated, so two methods have been presented to approximate the posterior distribution by calculating its first four moments. The first method is based on the moment generating function of the distribution, while the second uses assumptions on the independence between the coverages of the classes. The values of the second, third and fourth central moments are used to identify the type of the distribution in the Pearson system. Together with the mean, they then serve to estimate the parameters of the identified distribution (the approximation of the posterior distribution of the overall non-coverage). Finally, the non-coverage upper confidence limit estimation is obtained by calculating the inverse function of the identified distribution.

Two hypothetical examples of fault-tolerant systems have been used to compare the validity and the precision of the Bayesian estimations and the frequentist estimations (with the vectorial statistic) when stratification is used. To minimize the subjectivity of the Bayesian approach, uniform prior distributions were used. It has been shown that the Bayesian estimation method based on the moment generating function is always valid and leads to more precise estimations. With a representative allocation, this method leads to the identification of the posterior distribution as a beta distribution, for which the inverse function can easily be obtained. Compared to simple sampling estimations, the Bayesian estimations are more conservative. However, the Bayesian methods have removed a large part of the conservatism of the frequentist estimations obtained with the vectorial statistic. This paper has thus developed a Bayesian estimation method that is easy to apply and less conservative than the previous frequentist methods.
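When the identified type is a beta distribution, the last two steps — parameter estimation and inversion — admit a compact sketch. The version below matches only the mean and variance instead of the four moments used in the paper, and replaces the NAG g01fef routine with a naive Simpson-rule CDF inverted by bisection; the posterior moment values are hypothetical.

```python
import math

def beta_cdf(x, a, b, steps=4000):
    """P(X <= x) for X ~ Beta(a, b), via Simpson's rule (assumes a, b >= 1)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def f(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log1p(-t))

    h = x / steps
    total = f(0.0) + f(x)
    for i in range(1, steps):
        total += f(i * h) * (4 if i % 2 else 2)
    return min(total * h / 3.0, 1.0)

def beta_quantile(gamma, a, b):
    """Inverse beta CDF by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < gamma:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical posterior mean and variance of the overall non-coverage:
m, v = 0.002, 1.5e-6
t = m * (1 - m) / v - 1            # moment matching: Beta(a, b) with
a, b = m * t, (1 - m) * t          # a = m*t and b = (1-m)*t
c99 = beta_quantile(0.99, a, b)    # 99% upper non-coverage confidence limit
print(f"matched Beta({a:.2f}, {b:.1f}), 99% upper limit = {c99:.5f}")
```

In a real campaign the four-moment Pearson identification described above should be preferred, since matching only two moments cannot distinguish Type I from the other candidate types.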

References

[1] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables, Dover Publications, New York, 1972.
[2] J. Aitchison and I. R. Dunsmore, Statistical Prediction Analysis, Cambridge University Press, Cambridge, UK, 1975.
[3] J. Arlat, A. Costes, Y. Crouzet, J.-C. Laprie and D. Powell, "Fault Injection and Dependability Evaluation of Fault-Tolerant Systems", IEEE Trans. Computers, 42 (8), pp. 913-923, August 1993.
[4] W. G. Bouricius, W. C. Carter and P. R. Schneider, "Reliability Modeling Techniques for Self-Repairing Computer Systems", in Proc. 24th National Conference, pp. 295-309, ACM, 1969.
[5] R. Chillarege and N. S. Bowen, "Understanding Large System Failures — A Fault Injection Experiment", in Proc. 19th Int. Symp. on Fault-Tolerant Computing (FTCS-19), (Chicago, IL, USA), pp. 356-363, IEEE Computer Society Press, June 1989.
[6] G. S. Choi, R. K. Iyer, R. Saleh and V. Carreno, "A Fault Behavior Model for an Avionic Microprocessor: a Case Study", in Dependable Computing for Critical Applications, (A. Avizienis and J.-C. Laprie, Eds.), pp. 171-195, Springer-Verlag, Vienna, Austria, 1991 (Proc. IFIP 10.4 Work. Conf. held in Santa Barbara, CA, USA, February 1989).
[7] M. Cukier, Estimation of the Coverage of Fault-Tolerant Systems, Doctoral Dissertation, National Polytechnic Institute of Toulouse, France, July 1996 (in French).
[8] U. Gunneflo, J. Karlsson and J. Torin, "Evaluation of Error Detection Schemes using Fault Injection by Heavy-ion Radiation", in Proc. 19th Int. Symp. on Fault-Tolerant Computing (FTCS-19), (Chicago, IL, USA), pp. 340-347, IEEE Computer Society Press, June 1989.
[9] N. L. Johnson and S. Kotz, Distributions in Statistics — Discrete Distributions, John Wiley & Sons, New York, 1969.
[10] N. L. Johnson and S. Kotz, Distributions in Statistics — Continuous Univariate Distributions - 1, John Wiley & Sons, New York, 1970.
[11] N. L. Johnson and S. Kotz, Distributions in Statistics — Continuous Univariate Distributions - 2, John Wiley & Sons, New York, 1970.
[12] G. A. Kanawati, N. A. Kanawati and J. A. Abraham, "FERRARI: A Flexible Software-Based Fault and Error Injection System", IEEE Trans. Computers, 44 (2), pp. 248-260, February 1995.
[13] D. Powell, M. Cukier and J. Arlat, "On Stratified Sampling for High Coverage Estimations", in Proc. 2nd European Dependable Computing Conference (EDCC-2), (A. Hlawiczka, J. G. Silva and L. Simoncini, Eds.), (Taormina, Italy), Lecture Notes in Computer Science, 1150, pp. 37-54, Springer-Verlag, October 1996.
[14] D. Powell, E. Martins, J. Arlat and Y. Crouzet, "Estimators for Fault Tolerance Coverage Evaluation", in Proc. 23rd Int. Symp. on Fault-Tolerant Computing (FTCS-23), (Toulouse, France), pp. 228-237, IEEE Computer Society Press, June 1993 (an extended version of this paper appears in IEEE Trans. Computers, 44 (2), pp. 261-274, 1995).
[15] Z. Segall, D. Vrsalovic, D. Siewiorek, D. Yaskin, J. Kownacki, J. Barton, D. Rancey, A. Robinson and T. Lin, "FIAT — Fault Injection based Automated Testing Environment", in Proc. 18th Int. Symp. on Fault-Tolerant Computing (FTCS-18), (Tokyo, Japan), pp. 102-107, IEEE Computer Society Press, June 1988.
[16] A. Stuart and J. K. Ord, Distribution Theory, Kendall's Advanced Theory of Statistics, 1, Edward Arnold, London, 1987.
[17] C. J. Walter, "Evaluation and Design of an Ultra-Reliable Distributed Architecture for Fault Tolerance", IEEE Trans. Reliability, 39 (4), pp. 492-499, October 1990.
[18] W. Wang, K. S. Trivedi, B. V. Shah and J. A. Profeta III, "The Impact of Fault Expansion on the Interval Estimate for Fault Detection Coverage", in Proc. 24th Int. Symp. on Fault-Tolerant Computing (FTCS-24), (Austin, TX, USA), pp. 330-337, IEEE Computer Society Press, June 1994.
