
IEEE TRANSACTIONS ON RELIABILITY, VOL. 50, NO. 1, MARCH 2001

Stochastic Bayes Measures to Compare Forecast Accuracy of Software-Reliability Models

Mehmet Sahinoglu, Senior Member, IEEE, John J. Deely, and Sedat Capar

Abstract—ARE (absolute relative error) and SqRE (squared relative error) are random variables suggested as measurements of forecast accuracy of the total number of estimated software failures at the end of a mission time. The purpose is to compare the predictive merit of competing software-reliability models, an important concern to software-reliability analysts. This technique calculates the Bayes probability of how much better the prediction accuracy of one method is relative to a competitor's. This approach is more realistic, in the assessment of predictive merit, than a) merely comparing the average values of ARE and SqRE, as conventionally done; and b) conducting statistical hypothesis tests of pair-wise means of ARE and SqRE, an approach somewhat more sensible than a) because it incorporates the variability of predicted values, which a) does not. To implement this technique, noninformative (vague) priors are used first, and then informative (specified) priors. For the informative case, half-normal priors are placed on the means of the ARE or SqRE random variables, because these means are hypothesized to remain peaked around zero relative error (the ideal error percentage). This problem is related to the general problem of ranking usual means discussed by Berger and Deely (1988), and is a follow-up to an invited research paper presented at ISI-97 by Sahinoglu and Capar (1997).

Index Terms—Bayes, forecast accuracy, informative, noninformative, pairwise comparison, relative error, software-reliability model.

ACRONYMS
pdf     probability density function
r.v.    random variable
MLE     maximum likelihood estimate
RE      relative error
ARE     absolute RE
AvRE    average RE: arithmetic average of ARE
AvSqRE  average SqRE: arithmetic average of SqRE
CPMLE   compound Poisson MLE
CPNLR   compound Poisson nonlinear regression
MO      Musa–Okumoto logarithmic Poisson (method)
SqRE    squared RE
SSqRE   sum of SqRE over sampled checkpoints
(The singular and plural of an acronym are always spelled the same.)

Notation:
i       checkpoint, i = 1, ..., m
n       true number of software failures
X̂_i     forecast value of the total number of software failures estimated at time point i
Y_j     error r.v., j = 1, 2, for the two methods being compared

Manuscript received February 28, 1998; revised July 28, 2000. M. Sahinoglu is with the Department of Computer and Information Science, Troy State University, Montgomery, AL 36103-4419 USA (e-mail: [email protected]). J. J. Deely is with the Department of Statistics, Purdue University, W. Lafayette, IN 47907 USA (e-mail: [email protected]). S. Capar is with the Department of Statistics, Dokuz Eylul University, Kaynaklar Kampusu, Buca-Izmir, Turkey. Publisher Item Identifier S 0018-9529(01)06806-3.

I. INTRODUCTION

THERE IS increasing pressure to develop and quantify measures of computer-software reliability [8], [18]. With the ascent of software-reliability models, there is even more pressure to assess the predictive quality of these measures, both in their "goodness of fit" and in "pairwise comparisons" [6], [9], [14]–[17]. However, the current methods for comparing these software-reliability models use constant measures, and hence their results do not reflect the variability inherent in the observations. In particular, forecast accuracy of various methods is compared through measures such as AvRE and MSE (mean square error), both of which are constant measures that do not consider the effect of stochastic variability. An earlier suggestion was to devise and study more precise methods for choosing the best predictive procedure through frequentist methods, such as two-sample t-tests of equality of means, which do consider this inherent variability. In addition to assessing the quality of fit to zero RE of an individual model, comparisons between competing models were conducted by t-tests. Such research was necessary to choose among the many new and old reliability models [15]. The research in this paper proposes and studies several new, data-supported Bayes methods of assessment, which acknowledge the presence of stochastic variation in the observed sequence of failure data, assumed or selected to be statistically independent [16], [17]. The authors have already compared pairs of certain reliability models' forecast accuracy using statistical hypothesis tests in the frequentist sense. It was observed that a constant difference between the means of the r.v. ARE, i.e., the AvRE of any two methods, did not necessarily prove statistically significant as to which of two competing estimation procedures was better. An alternative measurement through a more severe squared penalty, reflected in the r.v. SqRE, is also considered in all calculations in this study.
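The frequentist comparison discussed above can be sketched as follows. This is an illustrative Python example, not the authors' code: the ARE samples are hypothetical, the statistic is Welch's two-sample t, and (since the paper's data sets have many checkpoints) a large-sample Gaussian approximation stands in for the t distribution when computing the p-value.

```python
import math

def welch_t(sample1, sample2):
    """Welch's two-sample t statistic for unequal variances."""
    m1, m2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / m1
    mean2 = sum(sample2) / m2
    var1 = sum((x - mean1) ** 2 for x in sample1) / (m1 - 1)
    var2 = sum((x - mean2) ** 2 for x in sample2) / (m2 - 1)
    return (mean1 - mean2) / math.sqrt(var1 / m1 + var2 / m2)

def two_sided_p_normal(t):
    """Large-sample two-sided p-value via the standard Gaussian CDF."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))

# Hypothetical ARE values for two competing methods at four checkpoints:
t = welch_t([0.05, 0.08, 0.04, 0.07], [0.09, 0.12, 0.10, 0.11])
p = two_sided_p_normal(t)   # small p: the AvRE difference is significant
```

As the paper argues, a significant t-test still only says *that* the means differ, not by how much in probability, which motivates the Bayes criterion developed below.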
This paper brings a new dimension to the comparative assessment of the predictive accuracy of two competing methods. In developing Bayes methods, an innovative approach is proposed, not only to decide which method is better, but also to describe quantitatively how much better one is than the other. This is done by experimenting with

0018–9529/01$10.00 © 2001 IEEE

prior distributions, noninformative and informative, for the unknown parameters, in light of a priori software-engineering field experience. Results show a satisfactory improving trend of accuracy in probability from noninformative to informative priors, in terms of tables and related graphs, using one data set for brevity.

II. DEFINITIONS

\mathrm{ARE}_i = \frac{|n - \hat{X}_i|}{n} \qquad (1)

\mathrm{SqRE}_i = \left( \frac{n - \hat{X}_i}{n} \right)^2 \qquad (2)

The popularly used AvRE is the arithmetic average of the ARE; similarly, AvSqRE is the arithmetic average of the SqRE over the m sampled checkpoints, and SSqRE is the sum of the SqRE over the m sampled checkpoints, as follows. SSqRE can be likened to a chi-square distributed r.v., although related research continues.

\mathrm{AvRE} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{ARE}_i \qquad (3)

\mathrm{AvSqRE} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{SqRE}_i \qquad (4)

\mathrm{SSqRE} = \sum_{i=1}^{m} \mathrm{SqRE}_i \qquad (5)
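As a minimal sketch (in Python rather than the authors' FORTRAN 77 code), the error measures of (1)–(5) can be computed as follows. The true failure count and the forecasts below are hypothetical stand-ins, not values from the paper's data sets.

```python
def error_measures(n, forecasts):
    """Return (ARE list, SqRE list, AvRE, AvSqRE, SSqRE) per (1)-(5).

    n is the true total number of failures; forecasts holds the forecast
    X_hat_i of that total at each sampled checkpoint i.
    """
    are = [abs(n - x) / n for x in forecasts]         # (1)
    sqre = [((n - x) / n) ** 2 for x in forecasts]    # (2)
    m = len(forecasts)
    avre = sum(are) / m                               # (3)
    avsqre = sum(sqre) / m                            # (4)
    ssqre = sum(sqre)                                 # (5)
    return are, sqre, avre, avsqre, ssqre

# Example with n = 131 (the WD1 total) and three hypothetical forecasts:
are, sqre, avre, avsqre, ssqre = error_measures(131, [120, 128, 135])
```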

III. MODEL AND COMPUTATIONAL FORMULAS

This paper compares one method of predicting a software-reliability index against another, based on the observed data and the predictions obtained by these methods. One of the authors [14] has dealt with 5 data sets (time-base simulated at JPL) in which software failures are clumped or clustered in weekly intervals:
1) WD1: 131 failures within 64 weeks (VOYAGER SAF);
2) WD2: 213 failures within 224 weeks (GALILEO SAF S/C);
3) WD3: 340 failures within 41 weeks (GALILEO CDS FSW Phase3/1);
4) WD4: 197 failures within 114 weeks (MAGELLAN S/C);
5) WD5: 366 failures within 50 weeks (ALASKA SAR).
A detailed explanation of the competing methods CPMLE, CPNLR, and MO is in [10]–[15]. A frequentist treatment of the same problem was examined earlier using only hypothesis tests; it was shown that a numerical difference in the mean values of the ARE or SqRE of two competing models was not necessarily statistically significant. In this study, however, we seek the probability that one method's mean is higher or lower than another's by a specified margin, in proportion to the difference between them. We approach this problem using Bayes noninformative and informative prior distributions, and compute their posterior distributions [1]–[5], [7], [19].

With each method and its prediction we associate an error r.v. In this paper, the error r.v. are the means of the competing ARE_1 and ARE_2 (or SqRE_1 and SqRE_2).

Assumptions:
1) Y_j is Gaussian distributed with unknown mean θ_j and known standard deviation σ_j.
2) The number of checkpoints, m, is large enough to facilitate a large-sample approach to the problem, so that one can monitor the extent of checkpoints.
3) Even though θ_1, θ_2 are unknown quantitatively, method 1 is better than method 2 if θ_1 < θ_2 in probability.
4) The quantitative measure of "how much better" method 1 is than method 2 is θ_2 − θ_1. This difference is unknown and can only be estimated, but the Bayes model here produces a probability assessment of the magnitude of this difference.
5) The criterion function is (SM = Sample Mean):

\Pr\{\theta_2 - \theta_1 > kd \mid \text{data}\}, \quad d = (\text{Greater SM of } Y) - (\text{Smaller SM of } Y) \qquad (6)

where k ≥ 0 is an arbitrary multiplier for d, and kd is the tolerance; 0 ≤ k ≤ 1 suggests that one is comparing between θ_1 and θ_1 + d.

A casual perusal of criterion (6) indicates why it can be used to make realistic and quantitative comparisons between any two methods being studied. If comparisons among a group of more than 2 methods are desired, then (6) can be suitably altered; thus the posterior probability that any one of the several methods is sufficiently smaller than all of the others can be computed. These details are discussed extensively in [2] for the general problem of ranking Gaussian means. This paper restricts the problem to comparing only 2 methods at a time. Section IV introduces the Bayes model with the relevant formulas to compute (6).

IV. PRIOR-DISTRIBUTION APPROACHES

Notation:
τ²      variance of the conditional prior on θ_j
μ       unknown prior mean
θ       vector (θ_1, θ_2)
A. A Noninformative Case

For the development of the prior distribution of θ_1, θ_2, use a hierarchical (embedded or nested) Bayes model, which assumes a priori that the unknown means are exchangeable. This has the desirable property that knowledge of one mean gives some information about the other. More discussion of this general model is in [1]. Let θ_1, θ_2 have the Gaussian distribution N(μ, τ²), with mean μ and variance τ², where μ and τ² are "hyper-parameters" and have "hyper-prior" distributions π(μ) and π(τ²), respectively. Thus the prior distribution on θ = (θ_1, θ_2) is

\pi(\theta) = \int\!\!\int \Big[ \prod_{j=1}^{2} N(\theta_j \mid \mu, \tau^2) \Big]\, \pi(\mu)\, \pi(\tau^2)\, d\mu\, d\tau^2 \qquad (7)

TABLE I 2-SAMPLE t-TESTS AND NON-INFORMATIVE PROBABILITIES OF COMPARISONS FOR COMPETING METHODS

Choices for π(μ) and π(τ²) depend on the type of prior information available in the given problem. A choice other than Gaussian for the conditional prior might also be indicated by the prior information. Even so, when the conditional prior is Gaussian, the closed form of the prior π(θ) is not available. But this is not necessary for computing the criterion function; rather, only the conditional distributions are used, as shown here. The computational formulas required to obtain (6) are derived using

\Pr\{\theta_2 - \theta_1 > kd \mid \text{data}\} = E_{\mu,\tau^2 \mid \text{data}} \big[ \Pr\{\theta_2 - \theta_1 > kd \mid \text{data}, \mu, \tau^2\} \big] \qquad (8)

The conditional distribution of θ_j, given the data and (μ, τ²), is Gaussian with mean

E(\theta_j \mid \cdot) = \frac{\tau^2 \bar{y}_j + (\sigma_j^2/m)\,\mu}{\tau^2 + \sigma_j^2/m} \qquad (9)

and variance

\mathrm{Var}(\theta_j \mid \cdot) = \frac{\tau^2\,(\sigma_j^2/m)}{\tau^2 + \sigma_j^2/m} \qquad (10)
A truly noninformative case for π(μ) would have been the improper choice π(μ) ∝ 1. However, this does not lead to a proper posterior; hence (11) is used. The same situation arises if one takes the limit of uniform distributions on very large intervals for μ; see [2]. If μ and τ² were known, there would be no need for the hierarchical model. But we have assumed that μ, τ² are unknown; thus we put hyper-prior distributions on these variables in Section IV-B. Table I shows results of two-sample t-tests and noninformative Bayes probabilities of comparisons between several competing models.

B. An Informative Case

This section is an example of how an informative approach can incorporate the available prior information. To accomplish this, assume a certain form of the prior information, and note that there are many facets to the informative problem, which must be handled on a case-by-case basis. The specifics in each case dictate the particular form for the distributions used in evaluating (6), but the calculation is basically the same as in this section. Let the prior information imply that θ_1, θ_2 are positive and small. Use a half-Gaussian distribution for the conditional prior on θ_j, and describe its scale parameter v by a uniform distribution on a suitable interval.

The first factor in (8) becomes

\Pr\{\theta_2 - \theta_1 > kd \mid \text{data}, \mu, \tau^2\} = \Phi\!\left( \frac{E(\theta_2 \mid \cdot) - E(\theta_1 \mid \cdot) - kd}{\sqrt{\mathrm{Var}(\theta_1 \mid \cdot) + \mathrm{Var}(\theta_2 \mid \cdot)}} \right) \qquad (11)

where Φ is the standard Gaussian Cdf. Eq. (11) allows numerical calculation of this probability, given μ, τ²; it would not be true if Y_j were not Gaussian, though even then the Monte Carlo evaluation of (6) is straightforward. We now give details for the noninformative and informative cases; see (6) for the definition of the criterion. For the noninformative case, only vague values for μ, τ² are available; thus π(μ) is a Gaussian distribution whose variance approaches infinity, and the criterion is obtained by averaging (11) over the resulting posterior of the hyper-parameters:

\Pr\{\theta_2 - \theta_1 > kd \mid \text{data}\} = E_{\mu,\tau^2 \mid \text{data}} [\, \Phi(\cdot) \,] \qquad (12)

The exact expression is complicated, but can be numerically evaluated by Monte Carlo techniques. For the informative case, place a half-Gaussian prior on each θ_j,

\pi(\theta_j \mid v) = c\, e^{-\theta_j^2/(2v)}, \qquad \theta_j > 0 \qquad (13)

where c = 2/\sqrt{2\pi v} is the normalizing constant, and describe v by a uniform distribution. Then

\Pr\{\theta_2 - \theta_1 > kd \mid \text{data}\} = E_{\theta \mid \text{data}} [\, 1\{\theta_2 - \theta_1 > kd\} \,] \qquad (14)

This is easily evaluated by Monte Carlo methods, provided that draws from the posterior are easily generated.
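The Monte Carlo evaluation of (14) can be sketched as follows. The Gaussian posteriors below are hypothetical stand-ins for the actual half-Gaussian/uniform model, chosen only to show the counting step: draw (θ_1, θ_2) pairs and record how often the criterion holds.

```python
import random

random.seed(1)

def mc_probability(post1, post2, k, d, n_draws=100_000):
    """Estimate Pr{theta2 - theta1 > k*d} from posterior draws, per (14).

    post1 and post2 are (mean, sd) pairs of stand-in Gaussian posteriors.
    """
    hits = 0
    for _ in range(n_draws):
        t1 = random.gauss(*post1)
        t2 = random.gauss(*post2)
        if t2 - t1 > k * d:
            hits += 1
    return hits / n_draws

# Hypothetical posteriors; d = 0.04 is the observed sample-mean gap:
p = mc_probability((0.06, 0.01), (0.10, 0.01), k=0.0, d=0.04)
```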

The basic idea is to use the accept/reject method of sampling, which is easily implemented in this model; see [7] for an elementary exposition. Let the posterior be

\pi(\theta \mid \text{data}) = \frac{L(\text{data} \mid \theta)\, \pi(\theta)}{c} \qquad (15)

where the denominator c is just the normalizing constant, and the likelihood function L(data | θ) does not depend on v. Apply the accept/reject algorithm:
1) Generate v from its Uniform distribution.
2) Put this value in (13) and generate θ_1, θ_2.
3) Accept (θ_1, θ_2) as coming from the posterior with probability L(data | θ)/M, where M is the maximum of the likelihood as a function of θ_1, θ_2, i.e., the value of the likelihood evaluated at the MLE of θ_1, θ_2.
In this model with Gaussian likelihood, the M in step 3 is easily ascertained; hence the form of the complete term in step 3 is

\frac{L(\text{data} \mid \theta)}{M} = \exp\!\left( -\frac{m(\bar{y}_1 - \theta_1)^2}{2\sigma_1^2} - \frac{m(\bar{y}_2 - \theta_2)^2}{2\sigma_2^2} \right) \qquad (16)

Applying this algorithm gives a subset of "keepers," from which the estimate of (6) is

\Pr\{\theta_2 - \theta_1 > kd \mid \text{data}\} \approx \frac{\#\{\text{keeper pairs with } \theta_2 - \theta_1 > kd\}}{\#\{\text{keepers}\}} \qquad (17)

i.e., the fraction of all pairs (θ_1, θ_2) in the set of keepers satisfying the criterion.

Fig. 1. Noninformative probabilities from Table II for ARE.

Fig. 2. Informative probabilities from Table III for ARE of WD1.
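The accept/reject steps above can be sketched in Python. All constants here are hypothetical, and for simplicity the half-Gaussian scale is fixed rather than drawn from its uniform hyper-prior (step 1 of the algorithm); with the standardized Gaussian likelihood below, the maximum M at the MLE is 1, so the ratio in step 3 is the likelihood itself.

```python
import math, random

random.seed(2)

def likelihood_ratio(theta, ybar, se):
    """L/M for a Gaussian likelihood of mean theta, per (16); M = 1 here."""
    return math.exp(-0.5 * ((ybar - theta) / se) ** 2)

def accept_reject(ybar1, ybar2, se, prior_sd, n_prop=50_000):
    """Collect posterior 'keepers' (theta1, theta2) via accept/reject."""
    keepers = []
    for _ in range(n_prop):
        t1 = abs(random.gauss(0.0, prior_sd))   # half-Gaussian draw, step 2
        t2 = abs(random.gauss(0.0, prior_sd))
        accept = likelihood_ratio(t1, ybar1, se) * likelihood_ratio(t2, ybar2, se)
        if random.random() < accept:            # step 3
            keepers.append((t1, t2))
    return keepers

keepers = accept_reject(ybar1=0.06, ybar2=0.10, se=0.02, prior_sd=0.05)
# (17): the estimate of (6) is the fraction of keepers meeting the criterion
p = sum(1 for t1, t2 in keepers if t2 - t1 > 0.0) / len(keepers)
```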

V. COMPUTATIONS AND APPLICATIONS TO DATA SETS

Table I provides an overall comparison of previous methods [14], [15] and our Bayes approach [16], [17] for a noninformative prior, from (11). Tables II and III give results for data set WD1 with noninformative and informative priors, respectively. Figs. 1 and 2 indicate the trend. The results were computed using FORTRAN 77 code and verified using Microsoft Excel simulation procedures. In Figs. 1 and 2, the plotted series are ARE(CPNLR), ARE(CPMLE), ARE(MO), SqRE(CPNLR), SqRE(CPMLE), and SqRE(MO). Table II contains the comparison probabilities for ARE and for SqRE; sample means and standard deviations are also given, taken between roughly the 10th and 95th percentiles of the checkpoints. Predictions of the total number of failures are converted to ARE and SqRE by using the equations in Section II. The means and standard deviations of competing methods for WD1 are listed in Table II.

Tables I–III and Figs. 1 and 2 show the improvement in moving from the noninformative prior of Section IV-A to the informative prior of Section IV-B, which uses a priori testing data. The probability that the mean prediction error of a method such as CPMLE is greater than that of CPNLR is 0.77535, as in Table II, when no information on the variability is known and k = 0 for the default case; see Fig. 1 on ARE for varying values of k in (11) of the noninformative approach. However, as variability is reduced with more informative priors as in Table III, the probability drops, as anticipated, to 0.64455 (the lowest it can be is 0.5); see Fig. 2. This means we are using the information in the data and producing more secure results. On the other hand, as illustrated in the equations, as k increases, the effective difference between the two sample means (beyond the tolerance kd) decreases; thus the probability of one mean exceeding the other by the tolerance decreases. This way of quantifying whether one method is better (less AvRE or AvSqRE) or worse (higher AvRE or AvSqRE) than the other is far more realistic than: a) deciding deterministically that one method is better by merely comparing the AvRE or AvSqRE [14]; and b) deciding stochastically by performing statistical hypothesis tests of pairwise means, an approach more realistic than the former [15]. However, the quantification in (6) cannot itself be tested, and statistical hypothesis tests, as in Table II, can then only be used to assist decision-makers in rejecting, or failing to reject, the equality.

TABLE II ‘NON-INFORMATIVE PRIOR’ RESULTS OF COMPARISON PROBABILITY IN DATA-SET WD1

TABLE III INFORMATIVE PRIOR RESULTS OF COMPARISON PROBABILITY Pr{Y > X} IN WD1

In this work, the half-Gaussian distribution is used for the prior distribution of the mean of ARE and SqRE because ARE and SqRE are positive quantities whose idealized values are peaked around zero. An absolute penalty for deviation of prediction from the true value can be attributed to inadequate testing before the release of software, as in ARE. The more severe squared penalty for deviation of prediction from the true value can be attributed to testing after the release of software, because it is more costly to redeem the software after it is released to the end-user, as in SqRE. The correlation between the ARE measurements at checkpoints from 1 to m can be resolved by considering the covariance when needed. For large m (e.g., as in the example), the asymptotic Gaussian assumption by the CLT holds and allows the use of Gaussian theory. This method of testing and verification for software-reliability measurement, exemplifying a variety of software-reliability models, is important. It opens a new avenue for comparing and contrasting the predictive accuracy of competing methods' mean values of ARE (absolute penalty) and SqRE (squared penalty) in probabilistic quantities of how much better or worse, rather than in a "better or worse" style of qualitative comparison based on an arithmetic difference. The newly proposed quantitative ways of comparison are far superior to the simple arithmetical comparisons of AvRE usually performed by software analysts [9]. With the rising number of software-reliability estimation models, it is important to assess and compare their predictive accuracy and quality. Our research is a novel attempt to quantify the probability of how much one method's prediction ability is better than another's, rather than only to qualify that one is superior to the other through hypothesis testing or a mere arithmetic difference.

ACKNOWLEDGMENT

A NATO Research Fellowship from TUBITAK, Ankara, and a sabbatical leave (1997–1999) for research from the Department of Statistics, Dokuz Eylul University, Izmir, Turkey, are gratefully acknowledged. M. Sahinoglu is pleased to thank the Departments of Statistics at Purdue University and CWRU, Cleveland, for support and financial assistance in the presentation of this research at the International Symposia at DeKalb, IL and Paderborn, Germany in 1998, respectively. J. and N. Sedransk at CWRU are acknowledged for their proof-reading and suggestions that contributed to this manuscript. The authors' thanks go to F. Bastani and F. Belli, ISSRE'98 Committee Chairmen, for their encouragement to participate at ISSRE'98. Contributions by the referees and expert guidance by Associate Editor M. Vouk are acknowledged with gratitude.

REFERENCES

[1] J. O. Berger, Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, 1985.
[2] J. O. Berger and J. J. Deely, "A Bayesian approach to ranking and selection of related means and alternatives to AOV methodology," JASA, vol. 83, pp. 364–373, 1988.
[3] J. J. Deely and J. B. Keats, "Bayes stopping rules for reliability testing with the exponential distribution," IEEE Trans. Reliability, vol. 43, no. 2, pp. 288–293, 1994.
[4] J. J. Deely and A. F. M. Smith, "Quantitative refinements for comparisons of institutional performance," JRSS A, 1998.
[5] J. J. Deely and W. J. Zimmer, "Choosing a quality supplier—A Bayesian approach," in Bayesian Statistics 3. Oxford University Press, 1988, pp. 585–592.
[6] T. Downs and A. Scott, "Evaluating the performance of software quality from software measures," IEEE Trans. Reliability, vol. 41, pp. 533–538, 1992.
[7] A. E. Gelfand and A. F. M. Smith, "Bayesian statistics without tears: A sampling–resampling perspective," The American Statistician, vol. 46, pp. 84–88, 1992.
[8] A. L. Goel, "Software reliability models: Assumptions, limitations, and applicability," IEEE Trans. Software Engineering, vol. 11, no. 12, pp. 1411–1423, 1985.
[9] T. Khoshgoftaar, J. C. Munson, B. B. Bhattacharya, and G. D. Richardson, "Predictive modeling techniques of software quality from software measures," IEEE Trans. Software Engineering, vol. 18, no. 11, pp. 979–987, 1992.
[10] P. Randolph and M. Sahinoglu, "A stopping rule for a compound Poisson random variable," Applied Stochastic Models and Data Analysis, vol. 11, pp. 135–143, 1995.
[11] M. Sahinoglu, "The limit of sum of Markov Bernoulli variables in system reliability evaluation," IEEE Trans. Reliability, vol. 39, pp. 46–50, 1990.
[12] M. Sahinoglu, "Negative binomial density of the software failure count," in Proc. Fifth Int'l Symp. Computer and Information Sciences (ISCIS), vol. 1, 1990, pp. 231–239.

[13] M. Sahinoglu, "Compound Poisson software reliability model," IEEE Trans. Software Engineering, vol. 18, no. 7, pp. 624–630, 1992.
[14] M. Sahinoglu and Ü. Can, "Alternative parameter estimation methods for the compound Poisson software reliability model with clustered failure data," Software Testing, Verification and Reliability, vol. 7, no. 1, pp. 35–57, 1997.
[15] M. Sahinoglu and S. Capar, "Statistical measures to evaluate and compare predictive quality of software reliability estimation methods," in IP-46, Proc. ISI-97, 1997, pp. 525–528.
[16] J. J. Deely and M. Sahinoglu, "Bayesian measures to compare predictive quality of software reliability methods," in Software Reliability Session (Invited), Int'l Conf. Reliability and Survival Analysis, Book of Abstracts, 1998, p. 43.
[17] J. J. Deely and M. Sahinoglu, "Bayesian measures to assess predictive accuracy of software reliability methods," in Proc. Ninth Int'l Symp. Software Reliability Engineering (ISSRE'98), 1998, pp. 139–148.
[18] M. Xie, "Software reliability models—A selected annotated bibliography," Software Testing, Verification and Reliability, vol. 3, pp. 3–28, 1993.
[19] W. J. Zimmer and J. J. Deely, "A Bayesian ranking of survival distributions using accelerated or correlated data," IEEE Trans. Reliability, vol. 45, no. 3, pp. 499–504, 1996.

Mehmet Sahinoglu obtained his B.S. from METU, Ankara, and his M.S. from UMIST, England, both in electrical engineering, and his Ph.D. from Texas A&M jointly in electrical engineering and statistics. Prior to joining the TSUM CIS Department in 1999 as its first Eminent Scholar and Chairman, he worked for 20 years at METU and as a reliability consultant to the Turkish Electricity Company and the Defense Industry (1976–1992), all in Ankara, Turkey, in the capacity of a certified professional engineer, and served for 5 years as founding Dean of Science and founding Chairman of the Department of Statistics at Izmir Dokuz Eylul University (1992–1997). He served at Purdue (1989–1990, 1997–1998) and Case Western Reserve University (1998–1999) as a visiting Fulbright and a NATO research scholar, respectively, and retired in 2000, after 26 years of civil service in Turkey, as a Professor Emeritus. He published extensively in electric power reliability earlier in his career, and in computer software reliability and testing in later years. Dr. Sahinoglu, a Senior Member of the IEEE, a member of the ASA, and an elected member of the ISI, is credited with the original findings of the "Compound Poisson Software Reliability Model," which accounts for multiple (clumped) failures in predicting the total number of failures at the end of a mission time, and the "Compound Poisson Stopping Rule Algorithm" in the software-testing literature. He is responsible, jointly with Dr. David Libby, for the "FOR (Forced Outage Ratio) probability density function," or "G3B (Generalized 3-parameter Beta) pdf," denoted the Sahinoglu–Libby pdf, useful in availability evaluations of integrated software systems (composed of software and hardware components) such as the Internet. He is writing a textbook on Internet Reliability Evaluation. http://cis.tsum.edu/mesa


John J. Deely is Professor Emeritus of the University of Canterbury in Christchurch, New Zealand, and is a Continuing Lecturer in the Department of Statistics at Purdue University, where he had spent 3 years as a Visiting Professor. Prior to visiting Purdue, he occupied the Chair of Statistics at Canterbury University for 25 years. He has a B.E.E. from Georgia Tech, and an M.Sc. in Mathematics and a Ph.D. in Statistics, both from Purdue University. He was a Senior Research Scientist at Sandia Corporation in Albuquerque for 3 years before taking a position as Senior Lecturer in Statistics at Canterbury University in 1968. Professor Deely has authored over 50 technical papers and has widespread consulting experience in many commercial areas, including engineering, TV, radio, farming, medicine, and transportation. He is a member of the American Statistical Association, the Institute of Mathematical Statistics, the International Statistical Institute, the International Society for Bayesian Analysis, and the Royal Statistical Society.

Sedat Capar obtained his B.S. (1994) and M.S. (1996) in Statistics from METU, Ankara, and DEU, Izmir, respectively. His M.S. thesis, "The stochastic comparison of reliability models," was conducted under his supervisor, Dr. M. Sahinoglu. He then completed his M.S. studies in computer engineering at DEU. He is now the head system administrator of the Computing Services at the College of Science and Arts at DEU, while studying for a joint Ph.D. in Computer Software Science and Statistics. He has been credited with bringing a wide range of Internet applications to the city of Izmir (population 4 million) for the first time, under the guidance of his supervisor and then his Dean, M. Sahinoglu. DEU's entry was recognized as the third node in October 1993, after the pioneering entries of Middle East Technical and Bilkent Universities, all under the auspices of TUBITAK (Turkish Scientific and Technical Research Foundation) in Turkey's Internet struggle.