IADIS International Conference Informatics 2008
AN ALGORITHM FOR SOFTWARE RELIABILITY GROWTH MODEL SELECTION
Hakan Burak Duygulu
Department of Computer Engineering, Boğaziçi University, Bebek, Istanbul, 34342 TURKEY
Oğuz Tosun
Department of Computer Engineering, Boğaziçi University, Bebek, Istanbul, 34342 TURKEY
ABSTRACT
Software reliability is an important attribute of software quality. It is of prime importance for mission-critical systems, which demand high reliability. Reliability modeling of software products, and predicting reliability via such models at different phases of the software life cycle, is therefore in demand. To predict reliability, a large number of software reliability models have been proposed in the literature. Among those there is no single model that fits best in all cases and can be universally recommended. Therefore, recent work has focused on selecting the reliability model that best describes the collected failure data. In this study, an algorithm is proposed to serve as a guideline for software reliability engineers in selecting the most appropriate software reliability growth model. The proposed algorithm combines the strengths of two previously published algorithms (Kharchenko, 2002), (Stringfellow, 2002) and further incorporates novel ideas to enhance the model selection process. The algorithm is tested with publicly available data and a performance analysis is provided.
KEYWORDS
Software Reliability, Model Selection, Release Decisions.
1. INTRODUCTION
Software plays a critical role in our daily lives since it is embedded in widely used appliances such as computers, automobiles and televisions. As software controls entire system functions, faults in the software may cause critical problems such as death, injury or financial loss. It is therefore important to use methods that measure and control the reliability of the software in use. To measure and control reliability, more than a hundred models have been proposed in the literature (Lyu, 2007), but no model exists that can be used in all cases and universally recommended to users. In this study, a software reliability model selection algorithm is proposed. The proposed algorithm is intended as a guideline for the potential user who wants to evaluate software reliability, and it is tested with publicly available data. Section 1.1 gives the background information. Section 2 discusses the proposed algorithm, the associated performance study and the test results. Section 3 presents concluding remarks on the proposed algorithm.
1.1 Background
Today more than a hundred software reliability models exist and more are being developed every year. Detailed discussions of the existing reliability models can be found in (Lyu, 1996) and (Farr, 1983). Although there are so many models, no single model can be applied in all cases. Since we cannot rely on a single model, we must have a procedure to compare the success of candidate models and choose the one that is superior to the others for the given failure data. Goodness-of-fit tests, the prequential likelihood ratio, and the u-plot and y-plot methods are a few such procedures.
Goodness-of-fit (GOF) tests are used to test how well a statistical model fits a set of observations. Two important GOF tests are the Kolmogorov-Smirnov and Chi-Square tests. Prequential likelihood is a measure that indicates how much more appropriate one model is than another, and it can be used to discredit one model in favor of another for a particular set of failure data (Nikora, 2000). The u-plot method is based on a generalization of the simple median check and detects systematic objective differences between predicted and observed failure behavior (Lyu, 1996). It is used to determine whether the predictions are on average close to the true distributions; that is, the u-plot can be used to find the bias from reality. There are other departures from reality that cannot be detected by the u-plot. For example, a data set may lead to optimistic early predictions and pessimistic late predictions, and these deviations can be averaged out in the u-plot (Lyu, 1996). The y-plot can then be drawn to detect such cases. In the following sections, two software reliability model selection methods proposed in the literature are discussed.
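Before turning to those methods, the u-plot check described above can be made concrete with a small sketch. It assumes, purely for illustration, that each one-step-ahead prediction is an exponential distribution with a hypothetical estimated rate for the corresponding inter-failure time; the times and rates below are made up and are not data from this study.

```python
# A minimal sketch of the u-plot check, assuming exponential one-step-ahead
# predictions with hypothetical estimated rates lambda_hat[i] for the i-th
# observed inter-failure time t[i].
import numpy as np

def u_plot_distance(t, lambda_hat):
    """Kolmogorov distance between the u_i = F_i(t_i) values and the
    uniform distribution; small values suggest unbiased predictions."""
    t = np.asarray(t, dtype=float)
    lam = np.asarray(lambda_hat, dtype=float)
    u = 1.0 - np.exp(-lam * t)            # predicted CDF evaluated at the observation
    u_sorted = np.sort(u)
    n = len(u_sorted)
    ecdf_hi = np.arange(1, n + 1) / n     # empirical CDF just after each point
    ecdf_lo = np.arange(0, n) / n         # empirical CDF just before each point
    return max(np.max(ecdf_hi - u_sorted), np.max(u_sorted - ecdf_lo))

# Illustrative data: observed inter-failure times and the rates a model
# predicted for them one step ahead.
times = [12.0, 8.5, 20.1, 31.0, 27.4]
rates = [0.06, 0.07, 0.05, 0.04, 0.035]
print(u_plot_distance(times, rates))
```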
1.1.1 The Method of Software Reliability Growth Models Choice Using Assumptions Matrix
This method relies on how well the assumptions of the reliability models represent reality. As the first step, the assumptions of the software reliability models are analyzed and a matrix, called the Assumptions Matrix, is constructed. The Assumptions Matrix is composed of rows and columns, where rows correspond to the various assumptions and columns to the software reliability growth models (SRGMs). Entries of the Assumptions Matrix are either 1 or empty; an entry of "1" means that the model in the corresponding column uses the assumption in the corresponding row. Then, to choose the best-fit SRGM, the assumption set of the software development process under consideration is matched against those of the reliability models (Kharchenko, 2002).
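The matching step can be pictured with a small sketch: each model column is represented as a set of assumption identifiers, and a model remains a candidate only if its assumptions are contained in the set judged valid for the project. The model names and assumption pairings below are illustrative placeholders, not the published matrix.

```python
# A minimal sketch of candidate selection from an assumptions matrix.
# Model names and their assumption sets are hypothetical stand-ins.
ASSUMPTIONS_MATRIX = {
    # model name -> assumptions it relies on (a "1" in its column)
    "Model A": {"I.1", "I.2", "II.2.1.1", "V.1"},
    "Model B": {"I.1", "I.2", "II.1.1", "V.2"},
    "Model C": {"I.1", "I.3", "II.2.1.1", "V.1"},
}

def candidate_models(valid_assumptions):
    """Keep the models whose assumption sets are covered by the
    assumptions considered valid for the software under test."""
    valid = set(valid_assumptions)
    return [m for m, needed in ASSUMPTIONS_MATRIX.items() if needed <= valid]

print(candidate_models({"I.1", "I.2", "I.3", "II.2.1.1", "V.1"}))
# -> ['Model A', 'Model C']
```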
1.1.2 An Empirical Method for Choosing a Software Reliability Growth Model
This is an empirical method for selecting SRGMs to make release decisions (Stringfellow, 2002). It provides guidelines for selecting an SRGM to be used as failures are reported during the test phase. The method evaluates various SRGMs iteratively during system test: they are fitted to weekly cumulative failure data and used to estimate the expected number of failures remaining in the software after release. SRGMs that pass the proposed criteria may then be used to make release decisions.
2. PROPOSED ALGORITHM FOR SOFTWARE RELIABILITY GROWTH MODEL SELECTION
The proposed algorithm combines the advantages of the algorithms named "The Method of Software Reliability Growth Models Choice Using Assumptions Matrix" (Kharchenko, 2002) and "An Empirical Method for Selecting Software Reliability Growth Models" (Stringfellow, 2002), which were discussed in 1.1.1 and 1.1.2. We found the approach of iteratively collecting data and evaluating the models in the Empirical Method (Stringfellow, 2002) useful. However, we believe that some parts of this method should be replaced to improve its success. First of all, the method selects the initial set of models intuitively; an intuitive selection may not always include the correct set of models and may cause the best models never to be evaluated. Secondly, the method selects the most pessimistic model from the pool of remaining models, i.e. the one that predicts the maximum number of remaining failures in the software. Selecting the pessimistic model, however, may increase the testing time unnecessarily. The method also considers only Failure Count (FC) models, which predict the number of remaining failures. Sometimes, estimating the next time of failure after release, for example to check whether the software is likely to survive a mission (Schneidewind, 1997), is more important than the number of remaining failures; this information can be obtained from Times Between Failures (TBF) models. Besides removing that restriction, we propose the "local threshold" concept as a novel approach to be used in deciding on the release time of the software. The method described in (Kharchenko, 2002) is useful if the time to select a reliability model is limited and the reliability requirement is not very high. But to achieve high reliability and find a model that represents the test data best, the Assumptions Matrix alone is not adequate, since assumptions are often violated in practice. The use of some assumptions may also not be practical; for example, one cannot easily assume that the number of failures in the system is finite or infinite. Despite these disadvantages, the Assumptions
Matrix can be used to select an initial model set by using only those assumptions that are valid for the software under test.
2.1 Modified Assumptions Matrix
We use the Assumptions Matrix approach (Kharchenko, 2002) to choose the initial set of reliability models. Before using the Assumptions Matrix, however, we had to modify it, since some of the models in the Assumptions Matrix are not supported by CASRE, the tool we used in our studies, and some of the models CASRE supports do not appear in the Assumptions Matrix. We therefore rearranged the Assumptions Matrix to cover the models that the tool supports. The modified Assumptions Matrix is shown in Table 1. We also added new assumptions to the Assumptions Matrix and updated the models accordingly. The new assumptions are listed below:
V. Failure Data Format Layer
V.1 Failure data format is time between failures (TBF).
V.2 Failure data format is failure counts (FC) per test interval.
Table 1. Modified Assumptions Matrix
2.2 Proposed Algorithm
Fig. 1 shows the proposed algorithm, which has 17 main parts, described in detail in the following steps:
1. Determine Static Usable Assumptions: Analyze the test and development environment. Then select the static usable assumptions that are most suitable for the software environment. Among all the assumptions listed in (Kharchenko, 2002) and in part 2.1, the static assumptions are I.1, I.2, I.3, I.4, II.1.1, II.1.2, III.3, IV.2.2, IV.2.3, IV.3.2, IV.4.2, IV.5.3, V.1 and V.2.
2. Determine Failure Data Format: Failure data can be in either Failure Counts (FC) or Time Between Failures (TBF) format.
3. Determine Local & Release Thresholds: The local threshold is checked before the release threshold. The idea behind this threshold is to check the release threshold only if the last estimated failure time (number of failures for FC data) is close to the one just observed; otherwise testing continues.
4. Collect Failure Data: Record the failure counts per test interval for FC data or the times between failures for TBF data.
5. Is Testing Adequate to Apply Models?: Determine whether the collected data is adequate to apply software reliability models. Studies indicate that software reliability growth models typically do not become stable until 60 percent of testing is complete (Stringfellow, 2002); this is the minimum amount of time to collect data. If there is an obvious reason, one can continue collecting data before applying the models.
6. Initial Model Selection Already Done?: Check whether the potential model selection has been done before. If not, continue with steps 7 and 8; otherwise continue with step 9.
Figure 1. Proposed algorithm
7. Determine Dynamic Usable Assumptions: Analyze the collected data and select suitable dynamic assumptions. Among all the assumptions listed in (Kharchenko, 2002) and in part 2.1, the dynamic assumptions are II.2.1, II.2.1.1, II.2.1.2, II.2.1.3, II.2.2, II.2.2.1, II.2.2.2, II.3.1, II.3.2, III.1, III.4, III.6, III.7, IV.1.2, IV.4.3, IV.4.4, IV.6.2 and IV.6.3. To determine the suitable dynamic assumptions, we used a data analysis tool and tried to fit the collected data to a known distribution, such as the Binomial or Poisson distribution.
8. Select Potential Models: Using the determined static and dynamic assumptions, select reliability growth model(s) from the modified Assumptions Matrix shown in Table 1.
9. Apply Models, Estimate Model Parameters: Estimate model parameters and evaluate the models on the test data. In this study, we use a CASE tool named CASRE to determine model parameters and evaluate models. For parameter estimation, we use the Maximum Likelihood method for TBF data and the Least Squares method for FC data; other methods, such as regression methods (Stringfellow, 2002), may also be used. The data collected from the beginning of testing up to the time at which the potential model selection is done is used to estimate model parameters.
10. Do Models Converge?: Continue to collect failure data unless at least one model converges. Sometimes, due to the nature of the collected data, model parameters cannot be obtained and models diverge. If all the selected models diverge, continue testing to collect more data.
11. Take Next Predictions: Calculate the next-step predictions of the models. For FC data, this is the predicted number of failures in the next test interval; for TBF data, it is the predicted time of the next failure.
12. Rank Models: We applied the following procedure, suggested in (Lyu, 1995), to rank the models:
• Apply a goodness-of-fit test to determine whether the model results fit the input data at a specified significance level.
• If a number of models pass the fitness test:
o Choose the most appropriate model(s) based on the prequential likelihood.
o In the event of a tie, use the model bias, then the model bias trend, to break the tie.
• If only one model provides a good fit to the data, choose that model.
• If no model provides a good fit to the data:
o Choose the most appropriate model(s) based on the prequential likelihood.
o In the event of a tie, use the model bias, then the model bias trend, to break the tie.
13. Is FC Data?: If the failure data is in FC format, continue with step 14; otherwise continue with step 15.
14. Is Estimate < Actual?: Check whether a model's estimate is lower than the actual total number of failures. If any selected reliability model estimates fewer total failures than have actually been observed, that model is underestimating the remaining number of failures and would give a false sense of security; eliminate it and continue with the others.
15. Check Local Threshold: The estimate for the current failure/interval was calculated in the previous step. Now compare the first-ranked model's previous estimate with the actual collected data: for FC data, compare the estimated number of failures with the failures actually found; for TBF data, compare the estimated time to the next failure with the actual time to failure. If the error is less than the local threshold, apply step 16; otherwise continue testing.
16. Check Release Threshold: A release threshold was set at the beginning. Now compare the first-ranked model's estimate with the actual collected data. If the difference is lower than the release threshold, release may be decided; if it is higher, testing must continue (see the sketch after this list).
17. Make Release Decision: At this step, both the local and release thresholds are satisfied. With this information in hand, the test manager may decide to release the software.
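To make steps 14 through 17 concrete, the following sketch shows one way the checks could be wired together for FC data. The threshold values, variable names and the exact form of the release criterion are illustrative assumptions, not the paper's definitive implementation.

```python
# A hedged sketch of steps 14-17 for FC data. Threshold values and the
# form of the release criterion are illustrative assumptions.
def release_check(est_total, actual_total, est_current, actual_current,
                  local_threshold, release_threshold):
    """Return True when a release decision may be considered."""
    # Step 14: an estimate of total failures below what has already been
    # observed would give a false sense of security.
    if est_total < actual_total:
        return False
    # Step 15: local threshold - was the last one-step-ahead prediction
    # for the current interval close to what was actually observed?
    if abs(est_current - actual_current) > local_threshold:
        return False          # recent predictions unreliable: keep testing
    # Step 16: release threshold - is the estimated number of remaining
    # failures small enough?
    if est_total - actual_total > release_threshold:
        return False          # too many failures expected to remain
    # Step 17: both thresholds satisfied; the test manager may decide to release.
    return True

# Example: the model expects 85 failures in total, 81 have been found so
# far, it predicted 4 failures for the current interval and 5 occurred.
print(release_check(85, 81, 4, 5, local_threshold=2, release_threshold=5))  # True
```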
2.3 Performance Study
The modified Assumptions Matrix shown in Table 1 is used to select the initial set of reliability models, considering both static and dynamic assumptions. When determining the static assumptions, only those assumptions valid for the software under test are used. To obtain the appropriate dynamic assumptions, a data analysis tool is used to determine which data sets follow a Poisson or a Binomial distribution. The Schneidewind models require detailed knowledge about the software to determine the cutoff interval. Since we do not have such detailed information on the public data sets, the Schneidewind models were not included in the tests, although they appear in Table 1. The Schneidewind models are finite-failure category, Poisson type, Exponential class, NHPP group models; since the Goel-Okumoto model is in the same category, type, class and group, it is used instead. Also, in our tests the Generalized Poisson and Schick-Wolverton models gave identical results; since they are in the same category, type, class and group, only one of them is included in the test results. The Kolmogorov-Smirnov GOF test is used for TBF data, whereas the Chi-Square GOF test is used for FC data. The threshold for the Kolmogorov-Smirnov test is set to five percent. In the Chi-Square test, the significance level is set to five percent and the cell combination frequency is set to five, where the number of degrees of freedom is the number of cells after combination minus the number of parameters in the model. We set the parameters of these tests subjectively. In (Stringfellow, 2002) the threshold for the R2 GOF is set to 0.90 and is accepted as a very high confidence level. The thresholds for the GOF tests can be varied for different software environments, and one can choose other values.
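As a rough illustration of how an FC-format data set can be fitted by least squares (step 9) and the remaining failures estimated, the sketch below fits the Goel-Okumoto NHPP mean value function mu(t) = a(1 - e^{-bt}) to cumulative weekly failure counts. The weekly counts and starting values are made up for illustration; CASRE was used for the actual experiments reported here.

```python
# A minimal sketch of least-squares fitting of the Goel-Okumoto NHPP
# mean value function to cumulative failure-count data. The weekly
# counts and starting values are hypothetical, not the study's data.
import numpy as np
from scipy.optimize import curve_fit

def goel_okumoto(t, a, b):
    """Expected cumulative number of failures observed by time t."""
    return a * (1.0 - np.exp(-b * t))

weeks = np.arange(1, 13)                              # test weeks 1..12
cum_failures = np.array([5, 11, 18, 24, 29, 33,
                         36, 39, 41, 43, 44, 45])     # cumulative failures

# a ~ eventual total number of failures, b ~ per-fault detection rate
(a_hat, b_hat), _ = curve_fit(goel_okumoto, weeks, cum_failures, p0=[60.0, 0.1])

remaining = a_hat - cum_failures[-1]
print(f"estimated total failures a = {a_hat:.1f}, "
      f"detection rate b = {b_hat:.3f}, "
      f"estimated remaining failures = {remaining:.1f}")
```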
2.3.1 Test Case 1: Real-Time Command & Control Application
We chose a data set from a real-time command & control application named SYS1, provided by (Musa, 1979). The application has 21,700 delivered object code instructions, and 136 failures were recorded in the test phases. This data set was also used in (Kharchenko, 2002) to test the Assumptions Matrix approach
discussed in 1.1.1. The same data set was used here in order to compare the results of the proposed algorithm with those of the Assumptions Matrix approach. According to (Kharchenko, 2002), the following assumptions are valid for this data set:
• I.1, I.2, I.3 and I.4 of the first layer assumptions.
• II.1.1, II.2.1.1, II.2.1.2, II.2.1.3 and II.3.1 of the second layer assumptions.
The first four assumptions are general assumptions applicable to all of the models. II.1.1 assumes that the total number of failures is finite. We believe that assuming the number of failures in the system to be either finite or infinite is not reasonable, so we decided not to use this assumption. Assumptions II.2.1.1 through II.3.1 are applicable to Poisson type and Exponential class models. Besides those assumptions, since the failure format is of TBF type, assumption V.1 is also valid for this data set. Therefore, the assumptions valid for this data set are:
• I.1, I.2, I.3 and I.4 of the first layer assumptions.
• II.2.1.1, II.2.1.2, II.2.1.3 and II.3.1 of the second layer assumptions.
• V.1 of the data format category.
Based on these assumptions, the Goel-Okumoto NHPP, Musa Basic, Geometric and Musa-Okumoto models are chosen from Table 1 as the initial set of models. This data set was also used in (Kharchenko, 2002), where the Musa Basic model was selected from the Assumptions Matrix. In our test, the Musa Basic model was always ranked third, after the Musa-Okumoto and Geometric models. From Table 1, these three models are Poisson type and Exponential class models; the main difference is that the Musa Basic model assumes a finite number of failures whereas the other two assume an infinite number. We recommend that practitioners not use the assumption on the number of failures unless there is an obvious reason, and the test result shows that not using this assumption led to more accurate predictions. Fig. 2 plots the cumulative failures of the actual data, the Musa-Okumoto model and the Musa Basic model. As can be observed from the plot, the Musa-Okumoto model fits the cumulative failures of the actual data better than the Musa Basic model.
Figure 2. Cumulative failures of actual data, Musa-Okumoto model and Musa Basic model
2.3.2 Test Case 2: Large Medical Record System Data Set
The next data set comes from a large medical record system consisting of 188 software components. Initially, the software was composed of 173 components, and 15 components were added to the software after three releases (Stringfellow, 2002). However, many of the other components were modified in all three releases as a side effect of the added functionality. In this part, the test results of release 3 are presented. We identified the assumptions below as valid for this data set, which is in FC data format:
• I.1, I.2, I.3 and I.4 of the first layer assumptions.
• II.2.1.1, II.2.1.2, II.2.1.3 of the second layer assumptions.
• V.2 of the data format category.
The first four assumptions are general assumptions applicable to all of the models. Assumption V.2 is applicable since the failure format is of FC type. We analyzed the data after week 10 to determine whether it follows a Poisson or a Binomial distribution; the test data fit both distributions, and using the Kolmogorov-Smirnov goodness-of-fit test we identified that the Poisson distribution fits the actual data better. Therefore, assumptions II.2.1.1, II.2.1.2 and II.2.1.3 are also applicable for this data set. Based on these assumptions, the Goel-Okumoto NHPP and Yamada S-Shaped models are applicable according to Table 1. We also included the Generalized Poisson model, which is a Binomial type model, to test the validity of our assumptions. The Yamada S-Shaped, Generalized Poisson and Goel-Okumoto NHPP models were ranked first, second and third, respectively, in the tests. The first-ranked model predicted the number of remaining failures as just three. If we were selecting the best model as the most pessimistic one, as in the Empirical Method (Stringfellow, 2002), we would have chosen the Goel-Okumoto NHPP model, because it predicted the maximum number of remaining failures. Fig. 3 displays the cumulative failures plot of the actual data, the Yamada S-Shaped model and the Goel-Okumoto NHPP model. The curves show that the Yamada S-Shaped model fits the actual data better than the Goel-Okumoto NHPP model. Also note that, from (Stringfellow, 2002), the total number of failures found in post-release is 83, which is slightly more than the predicted number.
Figure 3. Cumulative failures of actual data, NHPP model and Yamada S-Shaped model
3. CONCLUSION
In this study, we have proposed an algorithm for software reliability growth model selection. The proposed algorithm helps to analyze, manage and improve the reliability of software products. With the help of correctly decided reliability objectives, one can determine when the software is good enough to release and also minimize the risks of deploying software with serious failures. Moreover, by using the proposed algorithm one can avoid excessive test time and release the product at the right time with the required reliability. The proposed algorithm takes advantage of the algorithms "The Method of Software Reliability Growth Models Choice Using Assumptions Matrix" (Kharchenko, 2002) and "An Empirical Method for Selecting Software Reliability Growth Models" (Stringfellow, 2002): we have eliminated the weaknesses and combined the strengths of those algorithms to introduce a new and powerful algorithm. The proposed algorithm is tested with publicly available data sets, which were also used in (Kharchenko, 2002) and (Stringfellow, 2002). Software environments may have different reliability objectives, such as estimating the number of remaining failures, achieving a specified failure intensity level or estimating the time to the next failure. The method proposed in (Kharchenko, 2002) does not discuss how it could be used to achieve a reliability objective, whereas the method discussed in (Stringfellow, 2002) proposes to use the number of remaining failures for
determining the release decision. In the proposed algorithm we support all three of the software reliability objectives discussed above, to obtain a more general algorithm. The method proposed in (Stringfellow, 2002) decides to keep or discard models according to the results of the GOF tests, and then from the remaining models selects the most pessimistic one, i.e. the one that predicts the maximum number of remaining failures. However, as discussed in (Lyu, 1995), the GOF test is not sensitive enough to make fine distinctions among models. Therefore, we changed the approach to selecting the best model and propose the use of the ranking procedure discussed in (Lyu, 1995). Also, the method proposed in (Stringfellow, 2002) eliminates models if they do not converge or do not pass the GOF tests. In our proposed algorithm such models are not discarded immediately but kept and evaluated, because in our tests we observed that although some models may diverge or fail the GOF test in early test intervals, they might converge or pass the GOF tests later and give successful estimations. We also propose a novel approach, called the local threshold, to help with the release decision. The local threshold is used before applying the release threshold and ensures that the best model gives meaningful local estimations. It suggests that, if the previously estimated value (number of failures for FC data, time of the next failure for TBF data) for the current interval/instance is not satisfactory, then one should not test for the release threshold and should continue testing. In some cases we found that, although the release threshold was met, the most recent estimation was very different from the actual results. For example, in the case of FC data, the difference between the estimated number of failures per interval and the actual failures in that interval can be too high. The difference may be due to several reasons: the test team might have changed, new code might have been deployed, the environment might have changed, etc. Whatever the reason, it may be a good idea to continue testing until the model results are satisfactory for both the local and release thresholds. The test results showed that checking the local threshold before the release threshold helps to make more accurate release decisions. The tests demonstrated that the proposed algorithm is more successful at selecting the best model than using the Assumptions Matrix alone, as proposed in (Kharchenko, 2002). Although we expected the proposed algorithm to produce better estimations than (Stringfellow, 2002), we observed that both algorithms produced close predictions. In the tests, we used the CASE tool CASRE, which includes a few widely known models; we chose CASRE since it is the most popular CASE tool that automates the reliability modeling process. The method discussed in (Stringfellow, 2002) used almost the same widely known models, which explains the close predictions. We found that testing the proposed algorithm with CASRE limits the prediction performance of the proposed approach. We believe that starting with a larger number of models would result in better estimations. Therefore, as future work, we plan to extend the Assumptions Matrix and develop a new software reliability tool to be used in place of CASRE, covering all the models in the extended Assumptions Matrix, in order to improve the selection performance of the proposed algorithm.
REFERENCES
Farr, W.H., 1983. A Survey of Software Reliability Modeling and Estimation. NAVSWC Technical Report TR 82-171.
Kharchenko, V.S. et al, 2002. The Method of Software Reliability Growth Models Choice Using Assumptions Matrix. Proceedings of the 26th Annual International Computer Software and Applications Conference (COMPSAC'02).
Lyu, M.R., 1996. Handbook of Software Reliability Engineering. IEEE Computer Society Press and McGraw-Hill Book Company.
Lyu, M.R. and Nikora, A., 1995. An Experiment in Determining Software Reliability Model Applicability. Proceedings of the Sixth International Symposium on Software Reliability Engineering, Toulouse, France, pp. 304-313.
Lyu, M.R., 2007. Software Reliability Engineering: A Roadmap. Future of Software Engineering (FOSE'07), pp. 153-170.
Musa, J.D., 1979. Software Reliability Data. Technical Report, Data and Analysis Center for Software, Rome Air Development Center, Griffiss AFB, New York.
Nikora, A., 2000. Computer Aided Software Reliability Estimation Tool User's Guide.
Schneidewind, N.F., 1997. Reliability Modeling for Safety Critical Software. IEEE Transactions on Reliability, Vol. 46, No. 1.
Stringfellow, C. and Amschler, A.A., 2002. An Empirical Method for Selecting Software Reliability Growth Models. Empirical Software Engineering, 7, pp. 319-343.