IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 19, NO. 11, NOVEMBER 1993


Modeling Correlation in Software Recovery Blocks

Lorrie A. Tomek, Member, IEEE, Jogesh K. Muppala, Member, IEEE, and Kishor S. Trivedi, Fellow, IEEE

Abstract- This paper considers the problem of accurately modeling the software fault-tolerance technique based on recovery blocks. Models of such systems have been criticized for their assumptions of independence. Analyses of some systems have considered the correlation between software modules. This correlation may be due to a portion of the functional specification that is common to all software modules or due to the inherent hardness of some problems. We consider three types of dependence which can be captured using measurements: correlation between software modules for a single input, correlation between successive acceptance tests on correct and incorrect module outputs, and correlation between successive inputs. The technique we use is quite general and can be applied to other types of correlation. In accounting for dependence, we use the intensity distribution introduced by Eckhardt and Lee. We consider a new method of generating the intensity distribution which is based on the pairwise correlation between modules. This method provides us with a pessimistic result and a probability-based approximation. We contrast this method with the assumption of independent modules as well as the use of the beta-binomial density which was introduced by Nicola and Goyal. For the purpose of obtaining numerical results, we use stochastic reward nets (SRNs) that incorporate all of the above dependencies and then use a modeling tool called the Stochastic Petri Net Package (SPNP).

Index Terms- Correlation, Markov models, software fault tolerance, software recovery blocks, software reliability, stochastic modeling, stochastic Petri nets.

Manuscript received January 12, 1993; revised June 4, 1993. L. A. Tomek was supported by the IBM Corporation, Research Triangle Park, NC, through the IBM Resident Study Program. K. S. Trivedi was supported in part by the National Science Foundation under Grant CCR-9108114, and by the Naval Surface Warfare Center under Contract N60921-92-C-0161. Recommended by F. Bastani. L. A. Tomek is with the Department of Computer Science, Duke University, Durham, NC 27708. J. K. Muppala is with the Department of Computer Science, The Hong Kong University of Science and Technology, Kowloon, Hong Kong. K. S. Trivedi is with the Department of Electrical Engineering, Duke University, Durham, NC 27708. IEEE Log Number 9213115.

I. INTRODUCTION

Many applications demand high reliability and availability from computer systems. Hardware redundancy is routinely used to enhance reliability/availability. Software is a major component in computer systems and thus tends to become the reliability bottleneck. Design diversity as a means of achieving fault-tolerance in software has been suggested by several authors. Fault-tolerant software techniques include N-version programming [1] and recovery blocks [2]. The former uses voting on the results of various versions for error detection and the latter uses an acceptance test (AT) and rollback recovery. While these are the two major approaches to software fault-tolerance, several hybrid methods have also been proposed [2], [3].

We analyze one kind of software fault-tolerance scheme known as recovery blocks (RB's). It consists of a primary module, one or more alternate modules, and an AT. The primary and the alternate modules are based on different algorithms for the same problem and may be implemented by different programmers. On a given set of data inputs, the primary is executed first and the results are checked using the AT. Should the AT fail to accept the results, a rollback recovery is attempted and this process is repeated for each alternate module in succession until either the rollback recovery fails, a module is found to produce results that are accepted by the AT, or all modules have failed to satisfy the AT. In the last case, the RB is said to have failed on this input data set. The pseudocode for an RB with N modules (a primary and N - 1 alternate modules) is shown below:

    ensure acceptance test
        by primary module (#1)
        else by alternate module (#2)
        else by alternate module (#3)
        ...
        else by alternate module (#N)
        else error
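As a concrete illustration only (the paper specifies the scheme in pseudocode, not in any particular language), this control structure can be sketched in Python; the function and parameter names (run_recovery_block, checkpoint, restore) are hypothetical:

    def run_recovery_block(modules, acceptance_test, checkpoint, restore, data):
        # modules: callables, primary first then alternates, each mapping data -> result
        # acceptance_test: callable(result) -> bool (the AT)
        # checkpoint()/restore(state): save state and roll it back after a rejected attempt
        state = checkpoint()
        for module in modules:
            result = module(data)
            if acceptance_test(result):
                return result      # AT accepts this module's output: the RB succeeds
            restore(state)         # AT rejects: rollback recovery, then try the next alternate
        raise RuntimeError("RB failure: no module satisfied the acceptance test")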

The development of multiple versions of software modules to improve system reliability is a controversial topic. Models of multiversion software reliability often assume that the modules fail independently. However, experiments have shown that even independently developed modules are prone to exhibit the same types of errors when operating on the same input [4]-[7]. Analysis based upon the independence of software modules consistently overestimates reliability. The correlation between module outputs may be due to the inherent difficulty associated with some inputs or the portion of the functional specification common to all modules. In an experiment by Scott et al. [7], once the joint (or conditional) probabilities needed to capture the dependence between modules, acceptance tests, and recovery attempts were determined, the reliability was accurately determined to a 99% confidence level.

Pucci [8], [9] points out some of the difficulties in estimating the parameters used in earlier models. He classifies events occurring in an RB into four distinct categories based on the behavior of the alternate modules and the AT. Four different events can occur:

1) Module i produces correct output which the AT accepts.
2) Module i produces correct output which the AT rejects.
3) Module i produces incorrect output which the AT rejects.
4) Module i produces incorrect output which the AT accepts.
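Purely as an illustration (not from the paper), this four-way classification can be written as a small classifier keyed on the correctness of the module output and the AT verdict; the names below are ours:

    from enum import Enum

    class RBEvent(Enum):
        CORRECT_ACCEPTED = 1    # correct output, AT accepts
        CORRECT_REJECTED = 2    # correct output, AT rejects
        INCORRECT_REJECTED = 3  # incorrect output, AT rejects
        INCORRECT_ACCEPTED = 4  # incorrect output, AT accepts (undetected failure)

    def classify(output_correct, at_accepts):
        # Map one (module output, AT verdict) pair to one of the four event classes.
        if output_correct:
            return RBEvent.CORRECT_ACCEPTED if at_accepts else RBEvent.CORRECT_REJECTED
        return RBEvent.INCORRECT_ACCEPTED if at_accepts else RBEvent.INCORRECT_REJECTED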




We consider a similar event classification to simplify the task of estimating parameters which correspond to these events. We consider the probability that module i produces a correct/incorrect output, the probability that the AT accepts/rejects module i's output, and the probability that the rollback recovery attempt following the execution and negative AT diagnosis is successful/unsuccessful. In this paper the probabilities attached to the above events are determined assuming that three types of correlation may exist.

The first is correlation between outputs of distinct modules operating on a single input. The correlation between module outputs on a single input implies that the probability that module i produces a correct/incorrect output is dependent upon the execution results of previously executed modules in the recovery block. The greater the number of previous modules which have failed, the more difficult the input is likely to be, causing module i to be more likely to produce an incorrect output.

The second is correlation between successive acceptance test results on correct/incorrect module outputs. This correlation implies that the AT diagnosis following module i's correct output is dependent upon the AT diagnosis on all correct module outputs prior to module i, and correspondingly for incorrect module outputs. Modules operating on the same input may produce similar outputs, and AT diagnoses based on these outputs may therefore be correlated.

The third is the correlation between successive inputs; that is, inherently difficult inputs are clustered in the input stream. The correlation between successive inputs implies that the likelihood that the primary module produces a correct output on the next input is dependent upon previous executions of the recovery block.

In determining these probabilities, we make two important assumptions. The first is that the dependence between the software module's execution and acceptance test execution is due only to the correctness of the software module's output. The second is that the rollback recovery attempts are independent of both the software modules' outputs and AT results.

This paper first presents the theoretical background for the analysis of multiversion software in Section II. This background is based on the ground-breaking work of Eckhardt and Lee [10]. Correlation between software modules operating on a single input is developed in Section III. The theory requires the determination of the intensity of coincident failures across modules for each possible input. Several methods for determining the intensity distribution are examined. These include the assumption of module independence, the use of the beta-binomial density which was suggested by Nicola and Goyal [11], and two new methods based on the pairwise correlation of modules. The development of the intensity distribution using these methods is shown in Section IV. Section V applies the same techniques to account for correlation between successive acceptance tests. Clustering of the input data is then developed in Section VI. In order to numerically evaluate the recovery block, a stochastic reward net (a variation of a stochastic Petri net) model which incorporates all types of correlation considered is developed in Section VII. Several measures of interest for such a system are described in Section VIII. Numerical results for the recovery block, including the impact of the dependencies considered, are shown in Section IX. Concluding remarks are given in Section X. Proofs of results used in the paper are given in Appendix A and details of stochastic reward nets are provided in Appendix B.

II. THEORETICAL BACKGROUND

A theoretical basis for correlated failures in the analysis of multiversion software was developed by Eckhardt and Lee [10]. Using their notation, let the intensity function θ(x) represent the proportion of all possible software modules that produce an incorrect output on input x ∈ R, the set of all possible inputs. Other required information includes Q, the usage distribution (the distribution of the inputs), and N, the number of software modules. The intensity distribution is labeled G_{θ(X)}(y) and is interpreted as the probability that at most a fraction y of all possible modules would produce an incorrect output on a randomly chosen input X. In other words, G_{θ(X)}(y) is the distribution function of the random variable θ(X):

$$G_{\theta(X)}(y) = P(\theta(X) \le y) = \int_{x:\,\theta(x) \le y} dQ.$$
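For a finite input space, this distribution can be evaluated directly from θ(x) and the usage probabilities Q(x). The sketch below is ours, not the paper's, and assumes both are already known; the data are illustrative:

    def intensity_cdf(theta, usage):
        # theta: dict mapping input x -> fraction of modules failing on x, i.e. theta(x)
        # usage: dict mapping input x -> probability Q(x); values sum to 1
        def G(y):
            # G_{theta(X)}(y) = P(theta(X) <= y): total usage probability of inputs with theta(x) <= y
            return sum(q for x, q in usage.items() if theta[x] <= y)
        return G

    # Illustrative data: three inputs, the last one "hard" for most modules.
    theta = {"x1": 0.0, "x2": 0.1, "x3": 0.8}
    usage = {"x1": 0.5, "x2": 0.4, "x3": 0.1}
    G = intensity_cdf(theta, usage)
    print(G(0.1))   # 0.9: with probability 0.9 the chosen input is one on which at most 10% of modules fail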

Eckhardt and Lee based their work on the following assumptions [10]:

1) repetitions of the process of developing component versions follow statistically independent trials;
2) the development process gives identically distributed score functions, which are vectors of binary variables that designate whether component versions fail on various inputs;
3) each component of the system is required to operate on a common series of inputs which is stationary and independent.

Based on these assumptions, the probability that exactly l of the N modules fail on a randomly chosen input is

$$E\left[ \binom{N}{l} \theta(X)^{l} \left(1 - \theta(X)\right)^{N-l} \right].$$

This probability can be written in terms of G_{θ(X)}(y) as

$$\binom{N}{l} \int_{0}^{1} y^{l} (1 - y)^{N-l} \, dG_{\theta(X)}(y).$$
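As a numerical sketch (ours, with illustrative data), the integral reduces to a weighted sum over a finite input space:

    from math import comb

    def prob_l_of_n_fail(l, n, theta, usage):
        # Discrete form of comb(n, l) * integral of y^l (1 - y)^(n - l) dG_{theta(X)}(y):
        # average the binomial term over the usage distribution Q.
        return sum(q * comb(n, l) * theta[x] ** l * (1.0 - theta[x]) ** (n - l)
                   for x, q in usage.items())

    theta = {"x1": 0.0, "x2": 0.1, "x3": 0.8}   # same illustrative inputs as above
    usage = {"x1": 0.5, "x2": 0.4, "x3": 0.1}
    print(prob_l_of_n_fail(3, 3, theta, usage))  # probability that all 3 versions fail on the same input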

In the above formulation, Eckhardt and Lee assumed that θ(x) was the proportion of all possible software modules among a continuum that produce an incorrect output on input x. Consider the discrete analog where θ_N(x) is the proportion of the N developed software modules that produce an incorrect output on input x. In Section IV-C, when the pairwise correlation between modules is given, the discrete version of this function is a more convenient representation.


The range of the function θ_N(x) is {0, 1/N, 2/N, ..., 1}. The intensity distribution can be defined correspondingly as

$$G_{\theta_N(X)}(y) = P(\theta_N(X) \le y) = \int_{x:\,\theta_N(x) \le y} dQ.$$
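A brief sketch of this discrete version (ours, with hypothetical measurement data): θ_N(x) is computed from the failure indicators of the N developed modules, and its distribution places mass only on the values {0, 1/N, ..., 1}:

    def discrete_intensity(fail_matrix, usage):
        # fail_matrix: dict mapping input x -> list of N booleans (True = module failed on x)
        # usage: dict mapping input x -> Q(x)
        theta_n = {x: sum(fails) / len(fails) for x, fails in fail_matrix.items()}
        pmf = {}                                  # P(theta_N(X) = v) for each observed value v
        for x, q in usage.items():
            pmf[theta_n[x]] = pmf.get(theta_n[x], 0.0) + q
        return theta_n, pmf

    # Illustrative: N = 3 modules and the same three hypothetical inputs as before.
    fails = {"x1": [False, False, False], "x2": [False, False, True], "x3": [True, True, False]}
    theta_n, pmf = discrete_intensity(fails, {"x1": 0.5, "x2": 0.4, "x3": 0.1})
    print(theta_n)   # per-input fraction of the N modules that fail
    print(pmf)       # usage-weighted mass on each observed value of theta_N(X)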
