Empirical Software Engineering, 7, 257–284, 2002. © 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Using a Reliability Growth Model to Control Software Inspection

STEFAN BIFFL Stefan.Biffl@tuwien.ac.at
Research Group Industrial Software Engineering, Institute for Software Technology and Interactive Systems, Vienna University, Vienna, Austria

WALTER J. GUTJAHR [email protected]
Department of Statistics and Decision Support Systems, University of Vienna, Vienna, Austria

Abstract. After a software inspection the project manager has to decide whether he can pass a product on to the next software development stage or whether it still contains a substantial number of defects and should be reinspected to further improve its quality. While a substantial number of defects remaining in a product after inspection is a reasonable precondition to schedule a reinspection, it is also important to estimate whether the likely number of defects to be found with a reinspection will lower the defect density under the target threshold. In this work we propose a reliability growth model and two heuristic linear models for software inspection, which estimate the likely number of additional defects to be found during reinspection. We evaluate the accuracy of these models with time-stamped defect data from a large-scale controlled inspection experiment on reinspection. Main findings are: (a) The two best models estimated the defect detection capability for reinspection with good accuracy: over 80% of the estimates had an absolute relative error of under 10%; (b) The reinspection decision correctness based on the estimates of all investigated models, overall around 80% correct decisions, was much better than the trivial models to always or never reinspect; the latter is the default decision in practice.

Keywords: Software inspection, defect detection capability estimation, reliability growth model, controlled experiment, empirical software engineering

Abbreviations: ARE – absolute relative error; DCET – defect content estimation technique; DDE – defect detection effectiveness; EDDE – estimated DDE; MRE – mean relative error; RE – relative error; RGM – reliability growth model; TDDE – true DDE: TDDE1 (TDDE2) after first (second) inspection cycle

1. Introduction

The basis for improvement of software reliability is to understand software defects and our ability to detect them with different activities (Wohlin and Wesslén, 1998). Software inspection is a defect detection activity that can be applied effectively as soon as a software artifact is developed (Gilb and Graham, 1993; Laitenberger and DeBaud, 2000).


After a software inspection the project manager has to decide whether a product has sufficient quality, in terms of defect density, to pass it on to the next software development stage or whether it should be reinspected to further improve its quality (see Note 1). He can use the information on defects found during inspection and an estimate on the number of defects originally in the document to calculate the defect density and an estimate of the defect detection effectiveness of the inspection (El Emam et al., 2000). Defect detection effectiveness is defined in this work as the ratio of the number of defects found in an inspection to the number of defects originally in the product. Based on the estimate on the number of defects remaining in the product a rule for the reinspection decision can be formulated: e.g., reinspect, if the number of defects found in the inspection was lower than a target threshold for the remaining defect density, determined by the project manager in the project context.

While a substantial number of defects remaining in a product after inspection is a reasonable precondition to schedule a reinspection, it is also important to estimate whether the likely number of defects found with a reinspection will lower the defect density under the target threshold. The actual number of defects a reinspection will likely find depends on the inspection process applied, the type and quality of the inspected artifact, the team involved, and the time they invest for defect detection.

Recent reports on defect estimation in the software inspection context concentrate on the estimation of defect content after inspection with objective (Briand et al., 1998; Wohlin and Runeson, 1998; Briand et al., 2000) and subjective approaches (El Emam et al., 2000; Biffl, 2000a). So far there are very few empirical reports on the estimation of the defect detection effect of an actual reinspection, which consider the time invested.
The defect content measures the defect density after inspection, while the defect detection capability estimates the number of defects likely to be found, if a reinspection is conducted. In this work we propose that the reinspection decision should be based on the defect detection capability of a team and the inspection process, in addition to the defect content of the software product. A reinspection decision model, which uses an estimate of defects that are likely to be found by a given team and process in a given time period, is supposed to make the correct reinspection decision more often than the trivial model of always or never to reinspect, if the initial inspection cycle fails to lower the defect density under the target threshold.

Reliability growth models (RGMs) are used in software testing to determine the number of faults a testing process is likely to uncover in a given time frame. We propose to adapt a RGM to the inspection process to determine whether a reinspection is likely to achieve its goal, i.e., lower the defect density to a certain level. In this paper we describe an initial RGM for software inspection, which assumes the same inspection process for reinspection that was used for the first inspection cycle.

We evaluate the relative error of the proposed model for the defect detection capability of inspection teams and processes with data from a controlled experiment on the inspection of a requirements document in two inspection cycles. The estimation accuracy is evaluated based on the true number of defects in the inspection object, which is known in the experiment and would in practice be estimated with an appropriate defect content estimation technique (Briand et al., 2000). Further we compare the correctness of the reinspection decision taken with estimates from the RGM to the correctness of heuristic estimation models.

For a software product that was found to still contain a substantial number of defects after the first inspection cycle, this additional information gives the project manager feedback to schedule a reinspection if the reinspection goal is likely to be reached. If defect analysis shows that there are still a lot of defects in the product, but a reinspection is not likely to find enough additional defects in the given context, the project manager can pass the product on to the next development step with appropriate plans for additional development and quality assurance activities. In this case the project manager saves precious time and resources for development, which would otherwise have been wasted for an ineffective reinspection. This decision is not easy to take as it involves estimating the likely performance of a reinspection.

Section 2 summarizes related work on defect detection capability estimation with a RGM that calculates the estimated defect detection effectiveness of a team in a second inspection cycle from inspection data. Section 3 describes the experiment settings and the inspection object. Section 4 presents experiment results on the performance of the RGM. Section 5 discusses the significance of results for development practice and suggests further work.

2. Defect Detection Capability Estimation for Software Inspection

This section discusses parameters of the software inspection process, introduces criteria for deciding whether to reinspect the document, presents a RGM for the defect detection capability estimation of a reinspection, and finally describes evaluation criteria for this estimation model as well as for the correctness of the reinspection decision.

2.1. Parameters of the Software Inspection Process

The inspection process consists of management steps for inspection planning and preparation before the actual inspection, and for inspection assessment to decide on further activities, which may be another inspection cycle or appropriate quality assurance activities during further development. The actual inspection includes steps for individual defect detection and for defect collection, which may be a group activity, and results in a team defect list for subsequent defect correction (Gilb and Graham, 1993; Laitenberger and DeBaud, 2000).

During inspection planning the inspection manager has to decide on several parameters, which influence the effectiveness and cost of the inspection (Freimut et al., 2001): the inspection steps, the defect detection procedures and aids applied in a step, and the composition of the inspection team. The inspection steps (individual reading, team meeting, and defect correction) can be varied in number and sequence. The defect detection procedures, e.g., checklists (Gilb and Graham, 1993), have to be selected. Further, the number of inspectors has to be planned as well as the individual inspectors on the team and their responsibilities during inspection.

2.2. Deciding Whether to Reinspect the Document

After the inspection, the list of defects found by the individual inspectors of the team has to be analyzed to assess inspection effectiveness and number of defects remaining in the product. Recent studies have proposed and evaluated a number of defect content estimation techniques (DCETs) to determine the number of defects in a software product: Objective DCETs are capture–recapture models (Eick et al., 1992; Runeson and Wohlin, 1998; Briand et al., 2000; Biffl, 2000b) and the graphical defect profile method (Wohlin and Runeson, 1998), which is based on a reliability growth approach; further promising subjective estimation approaches based on individual (El Emam et al., 2000) and team estimation (Biffl, 2000a) have been reported.

Software development products that are suspected to contain a substantial number of defects may be subject to a second inspection cycle (reinspection) with the primary objective to reduce the number of remaining defects to an acceptable quality level (Laitenberger, 2000). Based on the number of defects remaining in the product after inspection and assumptions for the effectiveness of a reinspection, several reinspection decision models have been suggested, which are either based on the defect density after inspection alone (El Emam et al., 2000) or assume the same efficiency for the second inspection cycle (Adams, 1999).

As for an initial inspection, the actual properties of a reinspection depend on the parameters listed above. Depending on the effectiveness and efficiency of an inspection a project manager has to choose a design for the second inspection cycle. If the manager assumes that the inspection process is optimal to uncover target defects and the inspection team just needs more time to thoroughly apply this process to the inspection object, he may keep the inspection design. Another option is to vary one or more inspection parameters to change the focus of the process.

Changes of the inspection parameters may lead the inspection team to look for the same defect classes as in the first inspection cycle with more or less different probabilities to find a given defect, or to look for entirely different defect classes, e.g., different defect types or defects in different document locations. In the latter case the second inspection cycle is less an extension of the first inspection and more like a new inspection of the original product. In this work we investigate reinspection with the same process and the inspection team that executed the initial inspection to build on their understanding of the inspection object from the previous reading.


2.3. Defect Detection Capability Estimation of a Reinspection with a RGM

Software RGMs provide statistical descriptions of the process of fault recognition (usually by testing) and fault removal with the effect of reliability improvements. In Appendix A we provide a short recapitulation of the history of such models; for more information, the reader is referred to surveys on the classical models (Ramamoorthy and Bastani, 1982; Xie, 1991). Besides other applications, RGMs are applied during software testing to estimate the number of failures likely to occur in a given period of testing, which can be used to decide when to stop testing.

The basic intuition behind RGMs is that a large number of failures occur in early testing phases while the fault detection frequency decreases with each further testing period. A (parameterized) stochastic process is used to model the fault detection process, and based on this process and failure data from previous test periods, the mean time between failure events or the likely number of detected faults in a given time period can be estimated. We adopt this idea to document inspection in an analogous way, replacing program faults by document defects, and a failure (i.e., a fault detection event) by the event where an inspector detects a defect.

Figure 1 shows the scheme of a RGM: The horizontal axis consists of time units, the vertical axis shows the number of defects found so far. The software product contains a certain number of defects, N0. Based on time-stamped defect detection data from a given inspection team with a given inspection process the model determines how many of these defects they are likely to find in a given time period, up to unlimited time (dashed line), which is their maximal potential for reinspection.

Figure 1. Example of a RGM.

For our purpose we decided to start with one of the simplest RGMs, the Jelinski–Moranda model (JMM; Jelinski and Moranda, 1972), analyze its applicability to the inspection context and adjust it to this context. The basic JMM is the oldest and one of the most elementary software reliability models. As mentioned above, in this model, the failure rate of the program is assumed to be proportional to the number of remaining faults at any time, so that each fault is assumed to make the same contribution to the failure rate.

While in applications of RGMs to software testing it is not usual to keep track of which tester has found a specific fault, inspection offers the opportunity to exploit the additional information contained in the records on which defect has been detected by which inspector during the inspection cycle. The reason is that the defect detection ability of inspectors varies to a large degree, and since the defect detection data of each individual inspector during the (first) inspection cycle are readily available, it would be a waste to neglect this specific information by taking account of the defect reports as if they were anonymous, i.e., as provided by the overall team. This has led us to an extension of the original JMM which considerably improves the predictive power in our application (Table 1).

Our assumptions are the following: Inspectors 1, ..., k work separately from each other on the detection of defects in the given software document. Each inspector i has a certain intensity (marginal probability per time unit) to detect a next defect in the document. In accordance with the basic assumption of JMM, this intensity is assumed to be proportional to the number of defects not yet found by the inspector (see Equation (1)). As in JMM, the resulting times between defect detection are exponentially distributed (Equation (2)). Defect removal, as assumed in the JMM, corresponds to the fact that a fixed considered inspector only reports the same defect once. Different inspectors, however, may report the same defect. The proportionality constants can differ for different inspectors reflecting their varying defect-detecting competences. The reader should notice that this exceeds the assumptions of JMM, so we have to extend JMM for our purpose.
The resulting maximum-likelihood estimates for the inspector-specific proportionality constants are shown in Equation (3). In Appendix A, we present the formal details of our model, together with the derivations and a simulation procedure to estimate the relevant parameter values.

\[ \lambda_{ij} = \phi_i \, (N_0 - j + 1), \tag{1} \]

\[ \text{probability density in } t_{ij} = \lambda_{ij} \exp(-\lambda_{ij} t_{ij}) \quad (i = 1, \ldots, k;\ j = 1, \ldots, n_i), \tag{2} \]

\[ \phi_i = n_i \left( \sum_{j=1}^{n_i} (N_0 - j + 1) \, t_{ij} \right)^{-1}. \tag{3} \]

Table 1. Variables for the extended JMM.

Variable   Description
N_0        Total number of defects in the document
k          Inspection team size, 4–6 inspectors in this experiment
i          Inspector identifier in a team
\phi_i     JMM proportionality constant for inspector i
n_i        Number of (true) defects reported by inspector i
t_{ij}     Length (in time units) of the time interval between the (j-1)th and the jth defect found by inspector i
ATD        The number of all true defects found by a team
DTD        The number of different true defects found by a team. DTD is less than or equal to ATD
OF         Overlap factor, OF = ATD/DTD
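Equation (3) and the mean time to the next defect report can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the sample intervals are hypothetical, and the mean waiting time is taken as the reciprocal of the rate in Equation (1) for the next defect (j = n_i + 1), which follows from the exponential distribution in Equation (2).

```python
from typing import Sequence


def estimate_phi(n0: int, intervals: Sequence[float]) -> float:
    """Equation (3): phi_i = n_i / sum_{j=1..n_i} (N0 - j + 1) * t_ij.

    `intervals` holds t_i1, ..., t_in_i, the times between successive
    defect reports of one inspector (hypothetical input format).
    """
    n_i = len(intervals)
    if n_i == 0:
        raise ValueError("no defect reports: phi_i cannot be estimated")
    if n_i > n0:
        raise ValueError("an inspector cannot report more than N0 defects")
    weighted_time = sum(
        (n0 - j + 1) * t for j, t in enumerate(intervals, start=1)
    )
    if weighted_time <= 0:
        raise ValueError("intervals must sum to a positive weighted time")
    return n_i / weighted_time


def mean_time_to_next_defect(n0: int, n_found: int, phi: float) -> float:
    """Mean of an exponential waiting time with rate phi * (N0 - n_found)."""
    remaining = n0 - n_found
    if remaining <= 0:
        return float("inf")  # nothing left for this inspector to find
    return 1.0 / (phi * remaining)


# Hypothetical data: N0 = 10, three reports after 2, 3 and 5 time units.
phi = estimate_phi(10, [2.0, 3.0, 5.0])
next_gap = mean_time_to_next_defect(10, 3, phi)
```

With these invented numbers the weighted time is 10·2 + 9·3 + 8·5 = 87, so the estimate is 3/87 per time unit and the expected gap before the fourth report is 87/21 time units.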

The RGM can be used to calculate the following measures for both individuals and inspection teams: the mean time to the next defect report in a following inspection cycle and the expected number of defect reports in a given time period. The estimates are derived by the following steps:

1. Determine N0, the number of defects originally in the document. In practice N0 can be estimated with an appropriate defect content estimation technique (see Section 2.2 for examples). As we focus on the evaluation of the potential of a RGM for predicting the defect detection process, we want to eliminate as many sources of error as possible outside the RGM. Therefore we take the true number of defects originally in the document (known in the experimental setting).
2. Based on the N0 estimate, the proportionality constant for each inspector and the mean time to next defect report can be calculated (see Equation (3) and Appendix A).
3. Then we estimate the number of defects found during reinspection by each individual inspector for a certain time with a simulation approach (see Appendix A for a detailed description of the simulation algorithm).
4. The team estimate is the sum of the team members’ individual estimates, as all inspectors work independently during the defect detection phase. However, to determine the net effectiveness of a team during reinspection the defect overlap factor among team members, that is the quotient of the sum of the numbers of reported defects and the total number of different defects in the team, must be considered. As a first-order approximation, we assume a linear relationship between the number of reference defects found and the OF (see Equation (4)).

\[ OF = a \cdot ATD + b. \tag{4} \]
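Equation (4) is a straight line, so two (ATD, OF) points determine a and b. The Python sketch below shows only this two-point calibration; the function names and all numbers are hypothetical. Following the text, the example uses the first-cycle observation as one point and the limiting case (ATD set to N0, OF set to the team size) as the other, but the limiting point is passed in as a parameter so that a different choice can be tried.

```python
from typing import Tuple


def fit_overlap_line(
    atd_obs: float, of_obs: float, atd_limit: float, of_limit: float
) -> Tuple[float, float]:
    """Solve OF = a * ATD + b through two points (Equation (4))."""
    if atd_limit == atd_obs:
        raise ValueError("the two points need different ATD values")
    a = (of_limit - of_obs) / (atd_limit - atd_obs)
    b = of_obs - a * atd_obs
    return a, b


def overlap_factor(a: float, b: float, atd: float) -> float:
    """Evaluate Equation (4) for a given ATD."""
    return a * atd + b


# Hypothetical first cycle: 5 inspectors, N0 = 100, ATD = 40, DTD = 25.
n0, team_size = 100, 5
atd_1, dtd_1 = 40, 25
of_1 = atd_1 / dtd_1  # overlap factor as defined in Table 1
a, b = fit_overlap_line(atd_1, of_1, atd_limit=n0, of_limit=team_size)
```

The fitted line reproduces both calibration points by construction; how well a linear form describes the overlap between them is the first-order approximation the text refers to.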

The parameters a and b are estimated by solving a system of two linear equations: The first equation is Equation (4) with the empirical values for OF and ATD observed in the first inspection cycle. The second equation is Equation (4) for the limiting case of all defects having been found by all inspectors; in this case, OF has its largest possible value, which equals the number of inspectors in the team, and ATD is (the estimate for) N0. After having determined a and b in this way, we compute the OF for the reinspection cycle from Equation (4), using the RGM estimate computed in step 3.

2.4. Evaluation Criteria

For the evaluation of the suitability of a RGM to determine the reinspection effectiveness of an inspection team, we exploit the information known in the experiment environment: We determine N0 from the known number of defects in the inspection object and we estimate the team defect detection effectiveness for the time the team actually invested during reinspection. We use the following criteria to evaluate the performance of the RGM.

2.4.1. Mean Relative Error

The relative error (RE, see Equation (5)) characterizes the relative over- or underestimation of a defect estimate. We base the RE of the number of defects a team is estimated to have found after both inspection cycles on the true N0, as this number is equal for all teams in the experiment. We use the following acronyms: DDE for defect detection effectiveness, EDDE for the estimated DDE, TDDE for the true DDE.

\[ RE = EDDE - TDDE. \tag{5} \]

We use the mean relative error to describe the central tendency of the estimates yielded by all teams using the RGM.

2.4.2. ‘Traffic Light’ Model for Estimation Accuracy

The credibility of the estimate is a major aspect to project managers, who have to determine the weight of a given estimate for their reinspection decision process. We propose a model for the credibility of a group of estimates with the shares of most likely defect estimates that fall into intervals with good, sufficient, and poor accuracy (Briand et al., 2000; Biffl and Grossmann, 2001). The limits of these intervals depend on the judgment of the project manager. For the experiment purpose we use the following limits: Good estimates exhibit less than 10% absolute RE (ARE, see RE in Equation (5)), sufficient estimates 10–20% ARE, poor estimates lie outside the sufficient range (see Note 2).

2.4.3. Correctness of the Reinspection Decision

For the project manager in practice the correctness of a reinspection decision taken based on the RGM estimate is a crucial performance criterion.
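The accuracy measures of Sections 2.4.1 and 2.4.2 can be sketched in a few lines of Python. This is an illustration under assumptions rather than the evaluation code used in the study: DDE values are expressed as fractions of N0 (0.60 for 60%), the 10% and 20% limits are the ones stated for the experiment, and the function names and team data are hypothetical.

```python
from statistics import mean
from typing import Dict, Iterable, Tuple


def relative_error(edde: float, tdde: float) -> float:
    """Equation (5): RE = EDDE - TDDE, both given as fractions of N0."""
    return edde - tdde


def accuracy_class(
    re: float, good_limit: float = 0.10, sufficient_limit: float = 0.20
) -> str:
    """Traffic-light class from the absolute relative error (ARE)."""
    are = abs(re)
    if are < good_limit:
        return "good"
    if are <= sufficient_limit:
        return "sufficient"
    return "poor"


def summarize(pairs: Iterable[Tuple[float, float]]) -> Tuple[float, Dict[str, float]]:
    """Return the mean relative error and the share of each accuracy class."""
    errors = [relative_error(edde, tdde) for edde, tdde in pairs]
    shares = {"good": 0.0, "sufficient": 0.0, "poor": 0.0}
    for re in errors:
        shares[accuracy_class(re)] += 1.0 / len(errors)
    return mean(errors), shares


# Hypothetical (EDDE, TDDE) pairs for three teams.
mre, shares = summarize([(0.65, 0.60), (0.45, 0.60), (0.95, 0.60)])
```

Keeping the sign in the mean relative error shows whether a model tends to over- or underestimate, while the class shares use the absolute value so that errors in opposite directions do not cancel.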


Figure 2. Success or failure of a reinspection to meet its goals.

Table 2. Reinspection decision correctness table.

[Table 2 body not recoverable from the source; only the label EDDE2 > 70% survives.]

    ... 0) {
        // Generate an exponentially distributed
        // random variable for the amount of time
        // between this defect finding and the previous one.
        X = exponential(phi_i * numberDefects(i));
        consumedTime(i) = consumedTime(i) + X;
        // Check if there is still time left for reinspection.
        if (consumedTime(i) <= T_i) {
            // Increase the number of detected defects
            // and decrease the number of remaining defects.
            numberFoundDefects(i)++;
            numberDefects(i)--;
        }
        // If time for reinspection is up, exit loop.
        else break;
    }

An exponentially distributed random variable X with parameter \lambda is generated from a uniformly distributed random variable U on the interval [0,1] by the transformation

\[ X = -\ln(1 - U) / \lambda. \]

This simulation algorithm must be executed at least a few hundred times resulting in a data series of defects found during reinspection for an individual inspector. The mean of this series was used as the actual estimate for the inspector. Additionally the empirical 95% confidence interval was derived.
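The simulation loop and the inverse-transform sampling described above can be written as a runnable Python sketch. Several points are assumptions, since the start of the listing is not available here: the loop is taken to run while the inspector still has undetected defects, the initial count of remaining defects is supplied by the caller (for example N0 minus the defects the inspector already reported), the 95% interval is read off the sorted results at the 2.5% and 97.5% positions, and all function names and parameter values are hypothetical.

```python
import math
import random
from statistics import mean
from typing import Tuple


def exponential(rate: float, rng: random.Random) -> float:
    """Inverse transform: X = -ln(1 - U) / rate with U uniform on [0, 1)."""
    return -math.log(1.0 - rng.random()) / rate


def simulate_reinspection(
    phi: float, remaining_defects: int, time_budget: float, rng: random.Random
) -> int:
    """One simulated reinspection of a single inspector; returns defects found."""
    found = 0
    consumed = 0.0
    remaining = remaining_defects
    while remaining > 0:  # assumed loop condition
        # Waiting time to the next report; the rate shrinks as defects are found.
        consumed += exponential(phi * remaining, rng)
        if consumed <= time_budget:
            found += 1
            remaining -= 1
        else:
            break  # time for reinspection is up
    return found


def estimate_found(
    phi: float, remaining_defects: int, time_budget: float,
    runs: int = 1000, seed: int = 0,
) -> Tuple[float, Tuple[int, int]]:
    """Mean and empirical 95% interval over repeated simulation runs."""
    rng = random.Random(seed)
    results = sorted(
        simulate_reinspection(phi, remaining_defects, time_budget, rng)
        for _ in range(runs)
    )
    lower = results[int(0.025 * (runs - 1))]
    upper = results[int(0.975 * (runs - 1))]
    return mean(results), (lower, upper)


# Hypothetical inspector: phi = 0.05, 10 defects left, 10 time units.
estimate, interval = estimate_found(0.05, 10, 10.0, runs=2000, seed=42)
```

A fixed seed makes the sketch reproducible. Under the model's assumptions each remaining defect is found within time T with probability 1 − exp(−φT), so the simulated mean can be checked against remaining × (1 − exp(−φT)).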

Notes

1. Reinspection is used in this paper to find more defects in the product rather than to check the rework on defects found in a previous inspection (El Emam et al., 2000).
2. For defect content estimation models we used 20 and 40% as threshold for good and sufficient accuracy (Biffl and Grossmann, 2001). For reinspection the number of defects can be expected to be 30–70% lower than for inspection, which warrants lower accuracy thresholds for reinspection accuracy.
3. The threshold parameters, 50 and 70%, are based on inspection exit criteria proposed in Gilb and Graham (1993) and the defect density in the inspection object.

References

Abdel-Ghaly, A. A., Chan, P. Y., and Littlewood, B. 1986. Evaluation of competing software reliability predictions. IEEE Transactions on Software Engineering 12(9): 950–967.
Adams, T. 1999. A formula for the re-inspection decision. Software Engineering Notes 24(3): 80.
Basili, V. R., Green, S., Laitenberger, O., Lanubile, F., Shull, F., Soerumgaard, S., and Zelkowitz, M. 1996. The empirical investigation of perspective-based reading. Empirical Software Engineering: An International Journal 1(2): 133–164.
Biffl, S. 2000a. Using inspection data for defect estimation. IEEE Software (special issue on recent project estimation methods) 17(6): 36–43.
Biffl, S. 2000b. Evaluating defect estimation models with major defects. Technical Report 00-38, Dept. Software Engineering, Vienna University of Technology, Austria, Journal of Systems and Software (accepted for publication).
Biffl, S., and Grossmann, W. 2001. Evaluating the accuracy of objective estimation models based on inspection data from multiple inspection cycles. Proceedings of the ACM/IEEE ICSE’01, Toronto, May 2001, pp. 145–154.
Briand, L., El Emam, K., and Freimut, B. 1998. A comparison and integration of capture–recapture models and the detection profile method. Proceedings of the Ninth International Symposium on Software Reliability Engineering, IEEE Computer Society.
Briand, L., El Emam, K., Freimut, B., and Laitenberger, O. 2000. A comprehensive evaluation of capture–recapture models for estimating software defect content. IEEE Transactions on Software Engineering 26(6): 518–540.
Curtis, B. 1986. By the way, did anyone study any real programmers? Empirical Studies of Programmers: First Workshop, pp. 256–262. Ablex Publishing Corporation.
Eick, S. G., Loader, C., Long, M. D., Votta, L. G., and Vander Wiel, S. 1992. Estimating software fault content before coding. Proceedings of the 14th International Conference on Software Engineering, pp. 59–65.
El Emam, K., Laitenberger, O., and Harbich, T. 2000. The application of subjective estimates of effectiveness to controlling software inspections. Journal of Systems and Software 54(2): 119–136.
Freimut, B., Laitenberger, O., and Biffl, S. 2001. Investigating the impact of reading techniques on the accuracy of different defect content estimation techniques. IEEE International Software Metrics Symposium, London.
Gilb, T., and Graham, D. 1993. Software Inspection. Reading: Addison-Wesley.
Höst, M., Regnell, B., and Wohlin, C. 2000. Using students as subjects – a comparative study of students and professionals in lead-time impact assessment. Empirical Software Engineering 5: 201–214.
Jelinski, Z., and Moranda, P. B. 1972. Software Reliability research. In W. Freiberger (ed.): Statistical Computer Performance Evaluation. New York: Academic Press, pp. 465–497.
Laitenberger, O. 2000. Cost-effective Detection of Software Defects through Perspective-based Inspections. PhD Thesis, University of Kaiserslautern, Germany.
Laitenberger, O., and DeBaud, J.-M. 2000. An encompassing life cycle centric survey of software inspection. Journal of Systems and Software 50(1): 5–31.
Littlewood, B., and Verall, J. L. 1973. A Bayesian reliability growth model for computer software. Applied Statistics 22: 332–346.
Miller, J., Wood, M., and Roper, M. 1998. Further experiences with scenarios and checklists. Empirical Software Engineering: An International Journal 3(1): 37–64.
Porter, A., and Votta, L. 1998. Comparing detection methods for software requirements inspection: A replication using professional subjects. Empirical Software Engineering: An International Journal 3(4): 355–380.
Ramamoorthy, C. V., and Bastani, F. B. 1982. Software reliability – status and perspectives. IEEE Transactions on Software Engineering SE-8: 354–371.
Robson, C. 1993. Real World Research. Blackwell.
Runeson, P., and Wohlin, C. 1998. An experimental evaluation of an experience-based capture–recapture method in software code inspections. Empirical Software Engineering: An International Journal 3(4): 381–406.
Sandahl, K., Blomkvist, O., Karlsson, J., Krysander, C., Lindvall, M., and Ohlsson, N. 1998. An extended replication of an experiment for assessing methods for software requirements inspections. Empirical Software Engineering: An International Journal 3(4): 327–354.
Schick, G. J., and Wolverton, R. J. 1973. Assessment of software reliability. Proceedings of the Operations Research. Würzburg: Physica Verlag, pp. 395–422.
Schneidewind, N. F. 1975. Analysis of error processes in computer software. Sigplan Notices 10: 337–346.
Shooman, M. L. 1972. Probabilistic models for software reliability prediction. In W. Freiberger (ed.): Statistical Computer Performance Evaluation. New York: Academic Press, 485–502.
Tichy, W. 2001. Hints for reviewing empirical work in software engineering. Empirical Software Engineering: An International Journal 5: 309–312.
Wohlin, C., and Runeson, P. 1998. Defect content estimations from review data. Proceedings of the International Conference on Software Engineering, Los Alamitos, CA, USA, pp. 400–409.
Wohlin, C., and Wesslén, A. 1998. Understanding software defect detection in the personal software process. Proceedings of the 9th International Symposium on Software Reliability Engineering, Los Alamitos, CA, USA, pp. 49–58.
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., and Wesslén, A. 2000. Experimentation in Software Engineering – An Introduction. The Kluwer International Series in Software Engineering, The Netherlands: Kluwer Academic Publishers.
Xie, M. 1991. Software Reliability Modeling. Singapore: World Scientific.


Stefan Biffl is an associate professor of software engineering at the Vienna University of Technology. He received MS and PhD degrees in computer science from the Vienna University of Technology and an MS degree in social and economic sciences from the University of Vienna. He is a member of the ACM and IEEE. His research interests include project and quality management in software engineering.

Walter J. Gutjahr received his MSc and PhD degrees in mathematics from the University of Vienna, Austria, in 1980 and 1985, respectively. From 1980 to 1988 he was with Siemens Corporate, working in technical and in management positions on diverse software development projects. Since 1988, he has been at the University of Vienna, currently as an associate professor of computer science and applied mathematics. His research interests include analysis of algorithms, optimization, software engineering, and project management.
