A non-parametric order statistics software reliability model

May Barghout and Bev Littlewood
Centre for Software Reliability, City University,
Northampton Square, London EC1V 0HB, England
E-mail: [email protected], [email protected]
Telephone: 0171-477 8420; Fax: 0171-477 8585

Abdalla Abdel-Ghaly
Department of Statistics, Faculty of Economics and Political Science,
Cairo University, Cairo, Egypt
E-mail: [email protected]
Telephone: 00202 386 0596; Fax: 00202 506 2797
Contact author: Bev Littlewood

Abstract: This paper addresses a family of probability models for the failure time process known as order statistics models. Conventional order statistics models make rather strong distributional assumptions about the fault detection times: typically they assume that these come from some parametric family of distributions. In this paper a new model is presented that relaxes these distributional assumptions and, in the tradition of non-parametric statistics generally, 'allows the data to speak for themselves'. The accuracy of the new model is compared on some real data sets with the predictions that come from several of the better parametric reliability growth models.

Key words: software reliability growth, non-parametric estimation, kernel density estimation
1. Introduction and background

The problem with which this paper deals is that of measuring and predicting the reliability of a program, on the assumption that its behaviour during the times of interest is similar to its behaviour during the period when past failure data were collected. An example would be predicting operational reliability using the failure data obtained from testing the software in an environment that accurately emulates the operational environment.

A program starts its life with a deterministic but unknown number of faults $N$. These faults can be seen as competing risks: each fault will eventually manifest itself by causing a failure, and the chance that the next failure is caused by a particular fault is determined by the rate of that fault. When a failure occurs, its time of occurrence is noted (Abdel-Ghaly et al., 1986). It is assumed that time is continuous and truly represents the extent to which the software is used (see, for example, (Musa et al., 1987; Musa, 1993) for discussions of the use of 'execution time'). At each failure it is assumed that an attempt is made to remove the cause of the failure (the fault), whereupon the program is put back into operation.

Let fault $i$ (in an arbitrary labelling) be detected after a time $x_i$, which is a realisation of a random variable $X_i$. The $\{X_i\}$ are assumed to be independent, identically distributed (i.i.d.) random variables. The order statistics $0 \le X_{(1)} \le X_{(2)} \le X_{(3)} \le \ldots$ of the $\{X_i\}$ represent the successive times of fault detection. The times of fault detection $\{X_i\}$ are assumed to be conditionally independent exponential random variables with rates represented by independent, identically distributed random variables $\varphi_i$. Different distributions for $\varphi_i$ can be assumed. For example, the Littlewood model (Littlewood, 1981)
assumes that the $\varphi_i$ are gamma distributed, which implies that the unconditional distribution of $X_i$ is Pareto and that the $X_{(i)}$ are order statistics of i.i.d. Pareto random variables.

An exponential distribution for the times to detection of faults, conditional on their rates, seems a plausible assumption: see, for example, (Miller, 1986). However, the particular parametric form assumed for the distribution of $\varphi_i$, and thus the form of the distribution of $X_i$, is less easy to justify. In the work described here the data are allowed to speak for themselves: i.e. non-parametric estimates of the distribution of $X_i$ are obtained.

The inter-failure times $T_1, T_2, \ldots$ are the spacings between the order statistics, i.e.

$$T_1 = X_{(1)}, \qquad T_2 = X_{(2)} - X_{(1)}, \qquad \ldots, \qquad T_i = X_{(i)} - X_{(i-1)}.$$

The distributions of the inter-failure times can be expressed in a general form as follows. Let the probability density function (pdf) and the cumulative distribution function (cdf) of $X$ be denoted by $g(x)$ and $G(x)$ respectively, and let $T_i$ be the random variable representing the $i$th inter-failure time. The conditional distribution of $T_i$ given $\tau_{i-1}$ is (Miller, 1986)
$$
f_i(t \mid \tau_{i-1}) = \frac{(N-i+1)\, g(t+\tau_{i-1})\, \left[1 - G(t+\tau_{i-1})\right]^{N-i}}{\left[1 - G(\tau_{i-1})\right]^{N-i+1}}
$$

where $\tau_{i-1} = \sum_{j=1}^{i-1} t_j$.
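As an illustration of this general form, the following is a minimal sketch of the conditional density as a function. The function name, the use of SciPy, and the example parameter values are illustrative assumptions rather than anything prescribed by the model.

```python
import numpy as np
from scipy import stats

def conditional_interfailure_pdf(t, tau_prev, i, N, g, G):
    """Density of the i-th inter-failure time T_i, given tau_{i-1} = tau_prev,
    for an order statistics model with N initial faults and detection-time
    pdf g and cdf G."""
    numerator = (N - i + 1) * g(t + tau_prev) * (1.0 - G(t + tau_prev)) ** (N - i)
    denominator = (1.0 - G(tau_prev)) ** (N - i + 1)
    return numerator / denominator

# For example, exponential detection times (the Jelinski-Moranda case) with an
# illustrative rate, 50 initial faults, and 9 faults already found.
rate = 0.01
g = lambda x: stats.expon.pdf(x, scale=1.0 / rate)
G = lambda x: stats.expon.cdf(x, scale=1.0 / rate)
print(conditional_interfailure_pdf(t=5.0, tau_prev=120.0, i=10, N=50, g=g, G=G))
```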
The best known order statistics models are the Jelinski and Moranda (JM) model (Jelinski and Moranda, 1972), the Littlewood (L) model, and the Weibull model (Abdel-Ghaly et al., 1986). All of these are parametric models: they assume a parametric family of distributions for the fault detection times. Thus the JM model assumes an exponential distribution, the L model, as previously mentioned, assumes a Pareto distribution, and the Weibull model assumes a Weibull distribution. The distribution of the inter-failure times for these models can be obtained from the general form above by substituting for the pdf and cdf of the detection times the distributional form assumed by the underlying model. This distribution will depend on unknown parameters. Estimates of these parameters, and of the unknown $N$, are made (commonly via maximum likelihood) at each stage $i$ using the previous failure data $t_1, t_2, \ldots, t_{i-1}$. These estimates are then substituted into the cdf and pdf in order to make predictions about the yet unobserved $T_i$:
$$
\hat{f}_i(t \mid \tau_{i-1}) = \frac{(\hat{N}_i - i + 1)\, g(t+\tau_{i-1})\, \left[1 - G(t+\tau_{i-1})\right]^{\hat{N}_i - i}}{\left[1 - G(\tau_{i-1})\right]^{\hat{N}_i - i + 1}}
$$
with predictive cdf

$$
\hat{F}_i(t \mid \tau_{i-1}) = 1 - \left[\frac{1 - G(t+\tau_{i-1})}{1 - G(\tau_{i-1})}\right]^{\hat{N}_i - i + 1}.
$$
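The current median time to next failure used in the plots later in the paper follows directly from this predictive cdf by solving $\hat{F}_i(t) = 0.5$ for $t$. A minimal sketch is given below, assuming some fitted cdf $G$ and estimate $\hat{N}_i$ are already available; the helper names and example values are hypothetical.

```python
import numpy as np
from scipy import optimize, stats

def predictive_cdf(t, tau_prev, i, N_hat, G):
    """Plug-in predictive cdf of T_i from the general order statistics form."""
    return 1.0 - ((1.0 - G(t + tau_prev)) / (1.0 - G(tau_prev))) ** (N_hat - i + 1)

def predicted_median(tau_prev, i, N_hat, G, upper=1e9):
    """Current median time to next failure: the root of F_i(t) - 0.5 = 0."""
    return optimize.brentq(
        lambda t: predictive_cdf(t, tau_prev, i, N_hat, G) - 0.5, 0.0, upper)

# Illustrative fitted exponential cdf and values for N_hat, i and tau_{i-1}.
G = lambda x: stats.expon.cdf(x, scale=100.0)
print(predicted_median(tau_prev=120.0, i=10, N_hat=50, G=G))
```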
2. New non-parametric models

The importance of the above general form is that it expresses the distribution of the independent, but not identically distributed, inter-failure times as a function of the distribution of the i.i.d. times to detection of faults and of $N$, the total initial number of faults. All the conventional models
make rather strong distributional assumptions about the
detection times: typically they assume that these come from some parametric family of distributions. The aim of the present work is to relax the parametric assumptions about the detection time distribution: instead of assuming a distribution for the detection times, a non-parametric kernel density estimation method is used to estimate this distribution.

Details of kernel estimation can be found elsewhere (Silverman, 1986), but the basic idea is a simple one. Assume that we have a sample of $n$ real observations $x_1, x_2, \ldots, x_n$ whose underlying density is to be estimated. The kernel estimate centres a kernel function $K_h(x - x_i)$ around each observation $x_i$. For any general kernel $K(x)$ define
$$K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right).$$
The kernel density estimator is then obtained by averaging these kernel functions over the observations, i.e.
$$\hat{g}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i)$$
where $h$ is a smoothing parameter. Whilst many other methods could be used to estimate the distribution of the $X_i$, the kernel method has been chosen here for several reasons. It is flexible and mathematically tractable, allowing a large class of density estimates to be defined by appropriate choice of the kernel function. It has been widely used in statistical applications for density estimation. The kernel estimator, being a simple linear function of its constituent kernels, inherits their properties of continuity and differentiability. This will be important when examining the accuracy of reliability predictions via the PLR (prequential likelihood ratio) (Lyu, 1995), since this requires continuous density predictions.
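The following is a minimal sketch of the estimator itself with a gaussian kernel; the function name, data values and bandwidth are purely illustrative.

```python
import numpy as np

def gaussian_kernel_density(x, data, h):
    """Kernel density estimate g_hat(x) = (1/n) * sum_i K_h(x - x_i),
    with the gaussian kernel K(u) = exp(-u**2 / 2) / sqrt(2*pi)."""
    u = (np.atleast_1d(x)[None, :] - data[:, None]) / h   # shape (n, len(x))
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return K.mean(axis=0) / h

# Hypothetical fault-detection times and bandwidth, for illustration only.
detection_times = np.array([12.0, 35.0, 60.0, 81.0, 140.0, 260.0])
grid = np.linspace(0.0, 400.0, 5)
print(gaussian_kernel_density(grid, detection_times, h=40.0))
```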
Clearly, the choice of the mathematical form of the kernel function will be very important here. In the statistical literature the gaussian kernel is by far the most commonly used and is therefore the first choice of kernel function.

In conventional statistics, non-parametric density estimation is restricted to fitting distributions and does not address the problem of prediction. In the case of software reliability prediction, the problem is to predict the yet unobserved $T_i$ based on the previous $t_1, t_2, \ldots, t_{i-1}$. The general form above suggests that this prediction will depend on how close the model's estimate $(\hat{N}, \hat{G}(t))$ is to the real $(N, G(t))$. The estimate $\hat{G}(t)$ is a statement concerning the $i$th order statistic $X_{(i)}$, having observed the first $(i-1)$ order statistics $x_{(1)}, x_{(2)}, \ldots, x_{(i-1)}$. This implies that the predictions will mainly concern the right hand tail of the distribution, and the estimate will depend mainly on the shape of the right hand tail of the kernel function used. (A similar observation was made in (Kaufman et al., 1997), where an extreme value distribution is used instead of a kernel estimator.)

In the software reliability case it might be reasonable to expect that the true distribution is at least approximately exponential: a double-exponential kernel might therefore seem plausible. One disadvantage of the gaussian and double-exponential kernels is that they are both defined on the whole real line, and thus give an estimated density on the whole real line, whereas the true density is known to be defined on the positive real line. One way of overcoming this difficulty would be to use a lognormal kernel, with median equal to the observation. Lognormal distributions have often been used in socio-economic studies to investigate phenomena with long right tails. The use of all three kernels will be illustrated in the following examples.

Finally, it is necessary to consider the estimation of the parameter $h$ in the kernel estimator. There are two approaches. The first is the likelihood cross-validation method. This
is a natural development of the idea of using likelihood to judge the adequacy of fit of a statistical model, and is of general applicability in density estimation. The rationale behind the method, as applied to density estimation, is as follows. One observation $x_{(j)}$ is omitted from the sample used to construct the density estimate. The non-parametric estimate of $g(x)$, $\hat{g}_{-j}(x)$, is the density estimate constructed from all the $(i-1)$ data points except $x_{(j)}$, that is to say

$$\hat{g}_{-j}(x) = (i-2)^{-1} h^{-1} \sum_{k \neq j} K\{h^{-1}(x - x_{(k)})\}.$$
The non-parametric estimate of $f_j(t)$, $\hat{f}_{-j}(t)$, is given by

$$
\hat{f}_{-j}(t \mid \tau_{j-1}) = \frac{(N - j + 1)\, \hat{g}_{-j}(t+\tau_{j-1})\, \left[1 - \hat{G}_{-j}(t+\tau_{j-1})\right]^{N-j}}{\left[1 - \hat{G}_{-j}(\tau_{j-1})\right]^{N-j+1}}.
$$
Since there is nothing special about the choice of which observation to leave out, the log-likelihood is averaged over each choice of omitted $x_{(j)}$, to give

$$CV = (i-1)^{-1} \sum_{j=1}^{i-1} \log \hat{f}_{-j}(t_j).$$
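A minimal sketch of this criterion for a gaussian kernel is given below. The helper and function names are my own, and using the integral of the kernel density as the estimate of $\hat{G}_{-j}$ is an implementation assumption rather than something specified in the text.

```python
import numpy as np
from scipy.stats import norm

def kde_pdf(s, pts, h):
    # Gaussian kernel density estimate evaluated at the scalar s.
    return norm.pdf((s - pts) / h).mean() / h

def kde_cdf(s, pts, h):
    # Corresponding estimate of the cdf (integral of the kernel density).
    return norm.cdf((s - pts) / h).mean()

def cross_validation_score(interfailure_times, h, N):
    """CV = (i-1)^-1 * sum_j log f_{-j}(t_j), with g and G replaced by
    leave-one-out kernel estimates built from the detection times."""
    t = np.asarray(interfailure_times, dtype=float)
    x = np.cumsum(t)                         # detection times x_(1), ..., x_(i-1)
    i = len(t) + 1
    total = 0.0
    for j in range(1, i):                    # leave out x_(j)
        keep = np.delete(x, j - 1)
        tau = x[j - 1] - t[j - 1]            # tau_{j-1}
        s = x[j - 1]                         # tau_{j-1} + t_j = x_(j)
        g = kde_pdf(s, keep, h)
        f = ((N - j + 1) * g * (1 - kde_cdf(s, keep, h)) ** (N - j)
             / (1 - kde_cdf(tau, keep, h)) ** (N - j + 1))
        total += np.log(f)
    return total / (i - 1)
```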
The likelihood cross-validation choice of the parameters is the value of the parameters that maximises the function $CV$ for the given data.

Non-parametric density estimation has mainly been used to estimate densities in conventional situations where there is no element of prediction. In such cases it seems logical to estimate parameters by maximising a goodness-of-fit criterion of this kind. The software reliability problem, however, is one of prediction in the presence of reliability growth: what is needed are good estimates of future behaviour, and there is less interest in the quality of estimates of past behaviour. The second inference procedure, based on the prequential likelihood, is designed to give good predictions. At any stage $i$ we want to predict the behaviour of the yet unobserved $X_{(i)}$ using the previously observed data $x_{(1)}, x_{(2)}, \ldots, x_{(i-1)}$. For any earlier value $j$ ($2 \le j \le i-1$) the previously observed data $x_{(1)}, x_{(2)}, \ldots, x_{(j-1)}$ can be used to make a one-step-ahead prediction for $X_{(j)}$ by applying

$$\tilde{g}_j(x) = (j-1)^{-1} h^{-1} \sum_{k=1}^{j-1} K\{h^{-1}(x - x_{(k)})\}.$$
This is done sequentially for $j = 2, 3, \ldots, i-1$, leading to a sequence of predictions $\tilde{f}_j(t)$:

$$
\tilde{f}_j(t \mid \tau_{j-1}) = \frac{(N - j + 1)\, \tilde{g}_j(t+\tau_{j-1})\, \left[1 - \tilde{G}_j(t+\tau_{j-1})\right]^{N-j}}{\left[1 - \tilde{G}_j(\tau_{j-1})\right]^{N-j+1}}
$$
for given $h$ and $N$. These parameters are estimated by finding the value $(h, N)$ that maximises the prequential log-likelihood

$$PL = \sum_{j=2}^{i-1} \log \tilde{f}_j(t_j).$$
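A sketch of this second procedure follows, reusing the same gaussian kernel helpers as in the cross-validation sketch. The crude grid search over $(h, N)$ is an assumption made purely for illustration, not the inference machinery actually used by the authors.

```python
import numpy as np
from scipy.stats import norm

def kde_pdf(s, pts, h):                      # as in the cross-validation sketch
    return norm.pdf((s - pts) / h).mean() / h

def kde_cdf(s, pts, h):
    return norm.cdf((s - pts) / h).mean()

def prequential_log_likelihood(interfailure_times, h, N):
    """PL = sum_{j=2}^{i-1} log f~_j(t_j); each term uses only x_(1..j-1)."""
    t = np.asarray(interfailure_times, dtype=float)
    x = np.cumsum(t)
    pl = 0.0
    for j in range(2, len(t) + 1):           # j = 2, ..., i-1
        past = x[: j - 1]                    # x_(1), ..., x_(j-1)
        tau, s = x[j - 2], x[j - 1]          # tau_{j-1} and tau_{j-1} + t_j
        g = kde_pdf(s, past, h)
        f = ((N - j + 1) * g * (1 - kde_cdf(s, past, h)) ** (N - j)
             / (1 - kde_cdf(tau, past, h)) ** (N - j + 1))
        pl += np.log(f)
    return pl

def estimate_h_and_N(interfailure_times, h_grid, N_grid):
    """Pick the (h, N) pair maximising PL over a grid (illustrative only)."""
    best = max(((prequential_log_likelihood(interfailure_times, h, N), h, N)
                for h in h_grid for N in N_grid), key=lambda r: r[0])
    return best[1], best[2]
```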
Typically the models are used to make a sequence of predictions as $i$ increases; thus a sequence of successive one-step-ahead predictions, $\hat{F}_i(t)$, of the random variables $T_i$ is generated. The accuracy of a model's predictions depends solely on how close its estimate $\hat{F}_i(t)$ is to the true $F_i(t)$. The difficulty of analysing the closeness of $\hat{F}_i(t)$ to the true $F_i(t)$ arises from
never knowing, even at later stages of the analysis, the true $F_i(t)$. However, $t_i$, the realisation of $T_i$, is later observed, and all analysis of the predictive quality of a model is based upon the pairs $\{\hat{F}_i(t), t_i\}$. There has been considerable progress in using the $\{\hat{F}_i(t), t_i\}$ sequence to assess the accuracy of a particular method of prediction upon a particular data source (Abdel-Ghaly et al., 1986; Brocklehurst and Littlewood, 1992). These tools will be used to compare the accuracy of the new model with that of the older parametric models.

Basically there are two main tools: the u-plot and the prequential likelihood ratio (PLR) (Dawid, 1984). The PLR is a very general and powerful means of comparing the accuracy of sequences of predictions coming from two competing prediction systems, while the u-plot detects 'bias' in a series of model predictions. In fact it does this in a very general way which has been used successfully to improve the predictions coming from a reliability model (Brocklehurst et al., 1990). The idea is to use the u-plot to 'recalibrate' the predictions of a model by taking account of the detailed nature of its past inaccuracies on the data source: the model is allowed to 'learn' from its past mistakes. It should be emphasised that this procedure is still genuinely predictive: only past data are used to predict the future. This means that the same tools, u-plots and PLR, can be used to evaluate its accuracy. The details of how this analysis of predictive accuracy is carried out can be found elsewhere (Abdel-Ghaly et al., 1986; Brocklehurst and Littlewood, 1992).

In fact, earlier studies have shown that the accuracy of the predictions arising even from existing parametric models can be very variable. Some models sometimes give good results, some models often perform badly, but no model can be trusted to be accurate in all circumstances. Worse than that, there is no way in which that model (if any) which will give accurate results for a particular new data source can be identified a priori: those attributes of
the data sources that cause the models to vary in their accuracy have not been identified. There seems no alternative, therefore, to analysing the predictive accuracy of each model on the specific data source that is under examination. The model (or models) that has shown itself to be satisfactory for predictions before stage $i$ would then be used to make the prediction at stage $i$. This approach has two important aspects. In the first place, it tailors the analysis of predictive accuracy to the particular data source that is being studied: there are no claims to universality of accuracy for a model. Secondly, it is dynamic in operation: if the fortunes of a model should wane in later predictions, this can be detected and a better model used.

3. Application of the model to some real software failure data

The predictive accuracy of the new model will be examined here on four real data sets using the techniques previously described. To use this model to predict the reliability of the program at stage $i$, based upon observations $t_1, t_2, \ldots, t_{i-1}$, a kernel estimator is used. As previously mentioned, the kernel functions applied are the gaussian (gau), double-exponential (dexp) and lognormal (log) functions. The smoothing parameter $h$ used by the kernel estimator is estimated by the likelihood cross-validation method (CV) and the prequential likelihood method (PL). This results in six different predictors. The notation used is as follows: the first two letters in a predictor's name denote the inference procedure used and the rest of the name indicates the kernel function used. For example, pldexp means that the double-exponential function is applied and the smoothing parameter is estimated using prequential likelihood.

The predictive performance of some of the popular parametric models (Jelinski-Moranda (JM), Littlewood (L), Littlewood-Verrall (LV), Goel-Okumoto (GO), Musa-Okumoto (MO), Duane (Du), Keiller-Littlewood (KL) and the Littlewood Non-Homogeneous Poisson Process (LNHPP)) has been previously analysed on these data sets. The
full analysis can be found elsewhere (Brocklehurst, 1995). The analysis carried out on these data sets with the parametric models suggests a best parametric model for each data set, the judgement as to which is best being based on the log(PLR) plots. The predictive accuracy of the new predictors is compared with one another and with this best parametric model. Due to space constraints it is necessary to limit the number of plots shown: a detailed analysis will be carried out on the first data set, while the results on the other three data sets will be summarised.

The first and second data sets are part of a larger data set collected at the Centre for Software Reliability at City University. The original data set, USBAR, consists of 397 inter-failure times recorded on a single-user workstation. The set was subdivided according to different categories, such as the usage under which the failures occurred or the type of fault that caused the failure. The first data set to which the new models are applied, USPSCL, consists of 104 failures which occurred when compiling and running Pascal programs. The second data set, TSW, includes 129 software failures which occurred when the system software did not behave as required, e.g. incorrect output, operating system crash, etc. The models will then be applied to the original data set USBAR. The last data set, Musa system 1, is a set of 136 failures collected by Musa (Musa, 1979); this data set has been widely used in the software reliability community for model comparison.

Figure 1 shows plots of the successive current median times to next failure for the USPSCL data as calculated by the new models. Thus, in the plot at stage $i$ the predicted median of $T_i$ is calculated based upon all the data observed prior to this stage, i.e. inter-failure times $t_1, t_2, \ldots, t_{i-1}$. Since the best parametric model for this data set, according to the earlier
analysis, is the KL model, the median predictions obtained by this model are included in the figure.

Figure 1

Figure 1 shows that there is great disagreement between the different predictors in how they predict the medians. The median predictions from all the models (excluding the pldexp model) exhibit a high degree of noise and are in general higher than the KL predictions. This suggests that the predictions of the new predictors are more optimistic than those from the KL model. The fact that the KL model is the best of the parametric models does not necessarily mean that it is producing accurate predictions, however: in fact all the parametric models are biased in their predictions.

Figure 2 shows the u-plots of the new predictors and the KL model. The u-plots indicate that although all the models are optimistic for predictions associated with the left hand tail of the distribution of time to next failure, they differ when predicting for the right hand tail: the PL (gau and dexp) predictions seem to be quite accurate, while the CV (gau and dexp), like the KL, are pessimistic, as is evident from their plots crossing the line of unit slope from the right. The u-plots for the log models are everywhere above the line of unit slope, i.e. the predictions are too optimistic, as suspected from the earlier median plots. As was the case for the parametric models, all the new predictors are biased in their predictions, with u-plots significant at 1%.

Figure 2

Figure 3 shows the PLR plots for the new models' predictions versus the KL predictions. The drops exhibited by the CV (gau and dexp) predictors occur when predicting for an inter-failure time which is far higher than the highest previously observed inter-failure time. Even if these drops are ignored, the plots indicate that the predictions of the PL (gau and dexp) models are
more accurate than their CV variants. In general, excluding the earlier stages of analysis, all the plots show a downward trend in the PLR, indicating the superiority of the KL model (although it is known from previous analysis that none of the models is giving accurate results), this trend being more pronounced for the log models.

Figure 3

All the models, because of their poor u-plots, are candidates for recalibration. The effect of recalibration on the new predictors is examined in Figure 4, which shows the PLR of the predictions after recalibration against the raw predictions for the same predictor. The plots show an obvious upward trend indicating
that recalibration has greatly improved the
predictions of all the models. The greatest improvement arising from recalibration is in the log and cvdexp models, because these models were originally the worst.

Figure 4

Figure 5 shows how recalibration has made adjustments for optimism in the raw median predictions. Comparison of Figure 1 and Figure 5 shows that the recalibrated medians are lower than the raw medians and in closer agreement with one another. Furthermore, the median predictions are also in closer agreement with the median predictions produced by the Du model after recalibration. Here comparison is with the recalibrated Du model since this is the best parametric model after recalibration.

Figure 5

The u-plots, Figure 6, support the previous analysis. The u-plots are all in close agreement with one another and with the line of unit slope. Their K-S distances, which were all significant at the 1% level, are after recalibration not significant even at the 20% level.
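For reference, the quantities underlying these comparisons can be computed from the sequence of predictive distributions and the subsequently observed inter-failure times. The following is a minimal sketch; the function names and the interface of lists of callables are my own, and the significance levels quoted in the text come from standard Kolmogorov-Smirnov tables, which are not reproduced here.

```python
import numpy as np

def u_plot_ks_distance(predictive_cdfs, observed):
    """u_i = F_i(t_i); the u-plot is the empirical cdf of the u_i, and the
    K-S distance is its maximum deviation from the line of unit slope."""
    u = np.sort([F(t) for F, t in zip(predictive_cdfs, observed)])
    n = len(u)
    above = np.max(np.arange(1, n + 1) / n - u)
    below = np.max(u - np.arange(0, n) / n)
    return max(above, below)

def log_plr(pdfs_a, pdfs_b, observed):
    """Running log(PLR) of prediction system A against prediction system B."""
    terms = [np.log(fa(t)) - np.log(fb(t))
             for fa, fb, t in zip(pdfs_a, pdfs_b, observed)]
    return np.cumsum(terms)
```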
Figure 6

Recalibration has not only adjusted for optimism in the medians and u-plots of the new models, but has genuinely improved their predictive accuracy, as is evident in their PLR plots, Figure 7. If the drops in the plots were ignored there would be no evidence to discredit any of the models in favour of the Du model. That is, the predictions from any of these models are as accurate as the best of the parametric models.

Figure 7

Figure 8 shows the PLR plots for the new predictors against the LV model (the best of the parametric models after recalibration), all after recalibration, for the TSW data set. The upward jumps seen in the CV (gau and dexp) plots occur at points where recalibration has greatly improved the predictions of these models while less improvement occurred in the LV predictions. A mild upward trend is evident during the earlier stages, where there is no reliability growth. In general, if the drops and jumps in the plots were ignored, all the non-parametric order statistics (os) predictions would be comparable with the LV predictions and there would be nothing to choose between the models. Like the LV model, the log models and the pldexp model have u-plot K-S distances that are not significant at the 20% level, whilst they were significant at the 1% level before recalibration.

Figure 8

Figure 9 shows the PLR plots for the new predictors against the Du model, all after recalibration, for the USBAR data set. If the jump and drops, which coincide with particularly large inter-failure times, were ignored, the performance of the log models and the PL (gau and dexp) models would be similar to the Du model and there would be no evidence to discredit one in favour of the other. The other models reveal a downward trend indicating the superiority of
the Du model. The u-plot K-S distances for the log models are not significant at the 20% level, while that of the Du model is non-significant only at the 1% level.

Figure 9

Figure 10 shows the PLR plots for the new predictors against the L model, all after recalibration, for the Musa system 1 data set. The plots show that although there is a mild upward trend in the gau and dexp plots between stages 61 and 104, the L model seems to be producing more accurate predictions in the later stages. If the drops in the plots that coincide with inter-failure times equal to zero were ignored, on the whole there would be nothing to choose between these models and the L model.

Figure 10

4. Discussion

Previous order statistics models, indeed almost all software reliability growth models, have been parametric models in which quite strong assumptions are made about underlying distributions. It is difficult to confirm these assumptions directly, because of the comparative complexity of the models. In this paper an alternative class of models has been proposed which 'allows the data to speak for themselves', in the tradition of non-parametric statistics, instead of relying on these strong assumptions.

The main structural assumption in the new models is that fault discovery in software is of a competing risks nature: i.e. that the times to detection of the faults can be regarded as independent, identically distributed random variables. Notice that this does not mean that it is believed that all faults are identical in their impact upon the unreliability of a program. Instead it is assumed that the times to detection are conditionally exponentially distributed, with the
(different) rates treated as i.i.d. random variables; the unconditional times to detection are then independent and identically (but not exponentially) distributed.

In the results here, kernel estimates of the probability density of the time to fault detection are used. The results from the basic models are not very good: they are worse than the best of the conventional models. The raw predictions from the lognormal models are discredited in favour of the best of the parametric models, and are often discredited in favour of the other (gaussian and double-exponential) functions. This is disappointing, given that the lognormal gives the only predictions that are defined on the positive real line and has the desired long right hand tail: the lognormal kernel was expected to provide better estimates of $G(x)$, the time to fault detection distribution, than the other kernel functions. The closeness of the estimate $\hat{F}(t)$ to the true $F(t)$ will depend on the closeness of $(\hat{N}, \hat{G}(x))$ to the truth. It might be, therefore, that the better performance of the gaussian and double-exponential models is not due to their estimates of $G(x)$, but rather that the accuracy of the estimates of $N$ plays a crucial role.

The disappointing performance of the raw lognormal models warrants further investigation, and the following investigation was conducted to provide more insight into what might be occurring. In this example 200 observations, $x_1, x_2, \ldots, x_{200}$, were generated from a Pareto(1,1000) distribution. The observations were sorted in ascending order and the first 150 observations used to obtain a sample of 150 inter-failure times, which constitute the data set. For simplicity only the cvgau and pllog models will be considered. The analysis is conducted in the same manner as in section 3 using the cvgau, pllog and L models. The predictive accuracy of the cvgau and pllog models is compared with the L model in Figure 11.
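A sketch of how such a data set can be generated is given below. It assumes that 'Pareto(1,1000)' denotes the Pareto distribution of the second kind (Lomax), with cdf $G(x) = 1 - (b/(b+x))^a$, shape $a = 1$ and scale $b = 1000$, as arises for the unconditional detection times in the Littlewood model; that parameterisation is an assumption on my part.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, n_faults = 1.0, 1000.0, 200

# Inverse-cdf sampling from G(x) = 1 - (b / (b + x))**a.
x = b * (rng.uniform(size=n_faults) ** (-1.0 / a) - 1.0)

detection_times = np.sort(x)[:150]                   # first 150 order statistics
interfailure_times = np.diff(np.concatenate(([0.0], detection_times)))
```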
Figure 11

A clear downward trend is exhibited by both non-parametric models, discrediting these models in favour of the L model. More importantly, this downward trend is more pronounced for the pllog model, indicating that this model is producing poorer predictions not only than the L model but also than the cvgau model. Clearly, the closeness of the estimate $\hat{F}(t)$ to the true $F(t)$ will depend on the closeness of $(\hat{N}, \hat{G}(x))$ to the true $(N, G(x))$. The data used in this example are generated from a Pareto distribution; thus the true $(N, G(x))$ are known, as are their estimates using both the cvgau and pllog models. This provides an opportunity to compare these estimates with the truth.

In the following example the ability of the two non-parametric models to estimate the true Pareto $G(x)$ is examined. What is of interest is the distribution of the $i$th failure time, given that it exceeds $x_{(i-1)}$:

$$g(x \mid x_{(i-1)}) = g(x) / (1 - G(x_{(i-1)})), \qquad x > x_{(i-1)}.$$

Figure 12 shows the probability that $X_{(100)}$ exceeds $x$.

Figure 12
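The following is a sketch of the kind of comparison plotted in Figure 12: the true Pareto conditional density against a gaussian kernel estimate built from the first 99 order statistics. The bandwidth and the Lomax parameterisation of 'Pareto(1,1000)' are, again, illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a, b = 1.0, 1000.0
x = np.sort(b * (rng.uniform(size=200) ** (-1.0 / a) - 1.0))[:150]

def kde_pdf(s, pts, h):
    return norm.pdf((s[:, None] - pts) / h).mean(axis=1) / h

def kde_cdf(s, pts, h):
    return norm.cdf((s - pts) / h).mean()

# Conditional density g(x | x_(99)) = g(x) / (1 - G(x_(99))) for x > x_(99):
# Pareto truth versus the gaussian kernel estimate built from x_(1..99).
past, x99 = x[:99], x[98]
grid = np.linspace(x99, x99 + 500.0, 200)
true_cond = (a * b ** a / (b + grid) ** (a + 1)) / (b / (b + x99)) ** a
h = 0.2 * past.std()                                  # illustrative bandwidth
est_cond = kde_pdf(grid, past, h) / (1.0 - kde_cdf(x99, past, h))
```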
The rapid convergence to zero of the gaussian function is evident in this figure. The gaussian function is over-estimating the conditional distribution, i.e. it is over-estimating the probability of failure, indicating that the gaussian predictions are too pessimistic. The lognormal function converges to zero at a slower rate. Although it also produces pessimistic predictions they are generally more accurate than those obtained by the gaussian function. It
might be, therefore, that the better performance of the gaussian model is not due to its estimate of $G(x)$, but rather that the accuracy of the estimate of $N$ plays a crucial role. Indeed, further investigation of the estimates $\hat{N}$ shows that both models are optimistic in their estimates of $N$: $\hat{N} = 151$ for the cvgau model and 153 for the pllog model, while the true $N$ is 200.

Informally, what seems to be occurring is that the cvgau estimates of both $N$ and $G(x)$ are in error, but the nature of the errors is different: the model is too pessimistic in its estimate of $G(x)$ while it is too optimistic in its estimate of $N$. It is essentially the contrast in the nature of these errors that results in accurate final predictions: in a sense a compensation occurs between these different kinds of errors. On the other hand, the pllog model is also optimistic in its estimate of $N$, but more accurate in its estimate of $G(x)$. It is essentially this optimism in the model's estimate of $N$ that is leading to the optimism in the final predictions of the lognormal model, revealed in the unacceptably high median plot, the highly significant u-plot K-S distance and the poor PLR plot.

Figure 13 shows the plot of the PLR of the non-parametric models against the L model, after recalibration. While the cvgau plot shows a slight downward trend, discrediting it in favour of the L model, the pllog plot exhibits no evident trend, implying that the predictions produced by the pllog model are similar in accuracy to those produced by the L model.

Figure 13

This is, of course, just an informal explanation of what might be causing the poor performance of the lognormal function in comparison with the gaussian and double-exponential functions. It is not certain that the pllog estimates of $G(x)$ are always reasonably
accurate, but it seems likely that the cvgau will always behave badly. Furthermore, the degree of compensation that occurs in the gaussian (and double-exponential) models will presumably differ from one stage to another and from one data set to another; thus the final predictions will not always be accurate.

5. Summary and conclusions

The performance of the 'raw' models here is not very impressive. However, the results from the recalibrated versions are very encouraging: the accuracy of the recalibrated lognormal models is as good as the best of the recalibrated conventional models. This consistency of performance is a considerable achievement, since it is known that the accuracy of the conventional models varies greatly from one data set to another. Current practice suggests that recalibration should be applied as a matter of course to all software reliability predictions (Brocklehurst et al., 1990; Brocklehurst and Littlewood, 1992), even those based upon parametric models. It should also be emphasised that the recalibration procedure is itself a non-parametric procedure that does not rely upon any distributional assumptions. It also allows the data to speak for themselves: in this case the evidence from previous predictive errors is used to adjust the current prediction.

The results presented here involve only one of a whole class of non-parametric order statistics models: clearly other kernel functions could be tried, as could different inference procedures. The authors intend to investigate such variants in future work.

References

Abdel-Ghaly, A.A., Chan, P.Y. and Littlewood, B. (1986) 'Evaluation of Competing Software Reliability Predictions', IEEE Transactions on Software Engineering, 12 (9), 950-967.
Brocklehurst, S. (1995) Software Reliability Predictions: A Multi-Modelling Approach, Ph.D. dissertation, City University, London.

Brocklehurst, S., Chan, P.Y., Littlewood, B. and Snell, J. (1990) 'Recalibrating Software Reliability Models', IEEE Transactions on Software Engineering, 16 (4), 458-470.

Brocklehurst, S. and Littlewood, B. (1992) 'New Ways to get Accurate Reliability Measures', IEEE Software, 9 (4), 34-42.

Dawid, A.P. (1984) 'Statistical Theory: The Prequential Approach', Journal of the Royal Statistical Society, Series A, 147, 278-292.

Jelinski, Z. and Moranda, P.B. (1972) 'Software Reliability Research', Proceedings of the Statistical Methods for the Evaluation of Computer System Performance, Academic Press, New York, U.S.A., 465-484.

Kaufman, L.M., Smith, D.T., Dugan, J.B. and Johnson, B.W. (1997) 'Software Reliability Analysis Using Statistics of the Extremes', Proceedings Annual Reliability and Maintainability Symposium, IEEE, 175-180.

Littlewood, B. (1981) 'Stochastic Reliability Growth: A Model for Fault Removal in Computer Programs and Hardware Designs', IEEE Transactions on Reliability, 30, 313-320.

Lyu, M. (1995) Handbook of Software Reliability Engineering, McGraw-Hill, New York, U.S.A., 131-132.

Miller, D.R. (1986) 'Exponential Order Statistics Models of Software Reliability Growth', IEEE Transactions on Software Engineering, 12 (1), 12-24.

Musa, J.D. (1979) 'Software Reliability Data', Technical Report, Data and Analysis Center for Software, Rome Air Development Center, Griffiss AFB, New York, U.S.A.

Musa, J.D. (1993) 'Operational Profiles in Software Reliability Engineering', IEEE Software, 10 (2), 14-32.

Musa, J.D., Iannino, A. and Okumoto, K. (1987) Software Reliability: Measurement, Prediction, Application, McGraw-Hill, New York, U.S.A.

Silverman, B.W. (1986) Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, U.K., 13-19.
Figure 1. Successive one-step-ahead median predictions from the raw non-parametric os models of the time to next failure $T_i$, $i = 36$ to 104, plotted against $i$, for data set USPSCL.
Figure 2. u-plots for raw predictions of $T_i$, $i = 36$ to 104, from the non-parametric os models for data set USPSCL.
Figure 3. Log(PLR) plots for predictions $T_i$, $i = 36$ to 104, from the raw non-parametric os models versus the raw KL model, for data set USPSCL.
Figure 4. Log(PLR) plots for the recalibrated non-parametric os models versus the raw non-parametric os models for data set USPSCL.
Figure 5. Successive one-step-ahead median predictions from the recalibrated non-parametric os models of the time to next failure $T_i$, $i = 36$ to 104, plotted against $i$, for data set USPSCL.
Figure 6. u-plots for the recalibrated non-parametric os models for data set USPSCL.
Figure 7. Log(PLR) plots for predictions $T_i$, $i = 36$ to 104, from the recalibrated non-parametric os models versus the recalibrated Du model for data set USPSCL.
Figure 8. Log(PLR) plots for predictions $T_i$, $i = 36$ to 129, from the recalibrated non-parametric os models versus the recalibrated LV model for data set TSW.
Figure 9. Log(PLR) plots for predictions $T_i$, $i = 36$ to 397, from the recalibrated non-parametric os models versus the recalibrated Du model for data set USBAR.
Figure 10. Log(PLR) plots for predictions $T_i$, $i = 36$ to 136, from the recalibrated non-parametric os models versus the recalibrated L model for data set Musa system 1.
Figure 11. Log(PLR) plots for predictions $T_i$, $i = 21$ to 150, from the raw cvgau and pllog models versus the raw L predictions.
Figure 12. Real and estimated conditional distributions for $X > x_{(100)}$ for the simulated data.
Figure 13. Log(PLR) plots for recalibrated predictions $T_i$, $i = 21$ to 150, from the cvgau and pllog models versus the recalibrated L predictions.