Using Inspection Data for Defect Estimation

To control projects, managers need accurate and timely feedback on the quality of the software product being developed. I propose subjective team estimation models calculated from individual estimates and investigate the accuracy of defect estimation models based on inspection data.
Stefan Biffl, Vienna University of Technology
Product quality directly relates to project cost and schedule estimation; for example, undetected defects in a key work product—such as a requirements document—might lead to time-consuming adjustments. Thus, from the early project stages on, developers and project managers need to update their estimates of software project schedules and activities with feedback from the development process. A major aspect of this feedback is data on work-product quality levels (number and type of deviations from specified quality goals).1,2 Project managers can compare that data to the quality goals in the project plan, which were the basis for the initial estimate.
Software product inspection is effective for determining the quality level. To assess the credibility of the initial cost and schedule estimates, the project manager must know how many defects the software product contains (almost always more than the number of defects found during inspection). One way to assess product quality is to monitor product defects throughout development and operation. However, because of this process's time lag, the information is not available in time for developers to decide on appropriate development and quality assurance activities. Another approach uses estimation models based on the number of defects found after a work product inspection to estimate a product's total number of defects. In his recent software engineering textbook,3 Watts Humphrey proposes a simple objective defect estimation model (DEM) to control project activities. Empirical evidence on the applicability and accuracy of product quality estimation models based on sound project data can encourage wider use of such models.
In this article, I investigate the accuracy of objective and subjective DEMs based on inspection data. Objective DEMs do not depend on personal opinion but require input from a high-quality data collection process. Subjective estimates, conversely, are relatively cheap and easy to obtain but, by definition, depend on the knowledge and capability of the individual estimator (in this case, an inspector who read the inspection object carefully and reported a list of defects). I propose models to calculate most likely team defect estimates and a confidence interval from subjective individual defect estimates. I report on data from an experiment to evaluate the accuracy of subjective team estimates and objective DEMs on the number of defects in a requirements document. Because the number of defects in the inspection object is known, we can objectively assess the accuracy of the DEMs—that is, their usability in providing feedback on product quality for project control.

Project Control with Product Quality Assessment
A large part of project management is risk management. Therefore, project cost estimation strategies must consider risk assessment to yield realistic results.1 The quality of work products and the effectiveness of quality assurance methods are important risk factors for project planning and control. Estimating software project costs and schedules anticipates quality goals for the software product to be delivered. The quality levels of the final software product depend on the respective quality levels of development work products.
Key requirements for well-founded estimates are the availability and use of sound data as well as some documented rationale for deriving the estimate. If the project-estimation approach allows developers to define software quality levels for those work products and link these to results of the development process, an assessment of the quality levels of these work products can become a basis to adjust initial cost and schedule estimates.
Cost estimation without credible data is widespread in practice, often for lack of a data basis on properties of past projects.4 Creating an adequate data basis as required by sophisticated cost estimation models (for example, Cocomo II5) becomes even more difficult with changing development paradigms, new development methods, and new computing environments that quickly make historical data obsolete. Feedback on the quality levels of development products is crucial to assess a given estimate, particularly in projects for which no historical data exists to support a current cost and schedule estimate.
Product Quality Assessment
Along the development life cycle, particularly in the early stages of software development, inspection of software documents is an effective quality assurance measure to detect defects and provide timely feedback on quality to developers and managers. Inspection denotes the verification of software documentation by a team of inspectors2 with defect-detection, meeting, and repair steps. The defect detection step is an individual activity, implying no interaction among the inspection team. After inspection, developers and project managers can analyze the retrieved data to evaluate the quality of the work product and of the development and quality assurance processes.
The number of defects found during inspection is not a sufficient criterion for these quality evaluations, because a bad product combined with a bad inspection process might remain undetected, while a very good product might needlessly be selected for another inspection cycle. An approach to overcome this problem is to estimate the number of defects originally present in the document. With this estimate, the project manager can identify relationships between product quality, inspection process quality, and invested time and costs.
Similar to traditional cost estimation strategies, defect content estimation approaches based on historical project data exist.6 Project managers must use these approaches with care because such historical data is often unavailable or inapplicable to the project at hand. In this article, I address this problem with DEMs that use data from inspection of development work products. These models yield an estimate for the most likely number of defects and a confidence interval to give the project manager information on the defect estimate's probability distribution.
Objective Defect Estimation Models
The investigation employs capture–recapture (CR) models and the detection profile method (DPM).7,8 CR models, originally developed to estimate the size of closed animal populations,9 are based on the set of defects found by a team of inspectors. There are four CR models, which differ in their assumptions on the probability of defects to be found and of inspectors to find defects. Each model has at least one estimator (a formula based on the model's assumptions) to calculate the defect estimate. In the experiment, the estimator calculated the defect estimate's most likely value and a 95% confidence interval (see Table 1).
Similar to CR models, the DPM builds on the number of inspectors who found a given defect. The DPM sorts the set of defects descending by the number of inspectors who found a given defect. The DPM uses linear regression to fit an exponential function to these data points. The estimate is the point at which the function falls below a given threshold.7
According to Lionel Briand, Khaled El Emam, and Bernd Freimut,7,8 objective DEMs tend to underestimate and yield extreme outliers for special cases. The authors suggested an approach to combine both CR and DPM DEMs, which resulted in improved estimation performance.
Table 1. Capture–Recapture Models
Model | Model assumptions | Estimators
M0  | Defects have equal probability of being detected. Inspectors have equal ability to detect defects. | M0: maximum likelihood9
Mt  | Defects have equal probability of being detected. Inspectors have different abilities to detect defects. | Mt: maximum likelihood;9 Mtc: Chao's estimator8
Mh  | Defects have different probabilities of being detected. Inspectors have equal ability to detect defects. | Mh: Jackknife;9 Mhc: Chao's estimator8
Mth | Defects have different probabilities of being detected. Inspectors have different abilities to detect defects. | Mthc: Chao's estimator8
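To make the objective DEMs concrete, the following sketch (in Python, which the article itself does not use) implements the DPM procedure described above and, for comparison, Chao's estimator for model Mh as given in the capture–recapture literature.8,9 The exponential fit on the logarithm of the counts and the threshold value of 0.5 are my assumptions; the article specifies only a descending sort, a regression-fitted exponential, and "a given threshold."

```python
import numpy as np

def dpm_estimate(detection_counts, threshold=0.5):
    """Detection profile method: detection_counts[i] is the number of
    inspectors who found defect i (only defects found by at least one
    inspector appear in the list)."""
    y = np.sort(np.asarray(detection_counts, dtype=float))[::-1]  # descending detection profile
    x = np.arange(1, len(y) + 1)                                  # defect rank 1..D
    slope, intercept = np.polyfit(x, np.log(y), 1)                # fit log(y) = intercept + slope*x
    # Estimated defect content: the rank at which a*exp(slope*x) falls below the threshold.
    return (np.log(threshold) - intercept) / slope

def chao_mh_estimate(detection_counts):
    """Chao's estimator for CR model Mh: N = D + f1^2 / (2*f2), where f1 and f2
    count the defects found by exactly one and exactly two inspectors."""
    counts = np.asarray(detection_counts)
    D = len(counts)                       # unique defects found by the team
    f1 = int(np.sum(counts == 1))
    f2 = int(np.sum(counts == 2))
    if f2 == 0:                           # degenerate case; practice uses a bias-corrected variant
        return float(D)
    return D + f1 ** 2 / (2 * f2)

# Example: 20 unique defects found by a five-person team, with how many inspectors found each.
counts = [5, 5, 4, 4, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(round(dpm_estimate(counts)), round(chao_mh_estimate(counts)))
```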
Subjective Defect Estimation Models
Subjective defect estimation strategies ask knowledgeable people to supply a number or set of numbers that describe a software product's set of defects. These estimates can easily be obtained from an inspection team—even from a single inspector. An important advantage of subjective estimation is that it does not require complex data measurement. Others10,11 have reported encouraging results from individual subjective estimation of defects in code modules.
The basis for a serious estimate should be well-structured information on the product and the set of defects to be estimated. So, successful inspection is an excellent preparation for estimating the number of defects remaining in the inspection object.2 Khaled El Emam, Oliver Laitenberger, and Thomas Harbich were the first to evaluate an experiment on subjective defect estimation with data from an inspection.11
In the work reported here, I start with the independent estimates of individual inspectors who read the requirements documents by themselves. Then, I combine these individual estimates with a weighted average model. This has the advantage that such team estimates are less susceptible to extreme outliers, because they can compensate for individual extreme estimates and might favor inspectors with a better estimation basis.
I use an interval estimate to give the inspector the opportunity to express his or her confidence as a range of values: the most likely value, and a minimum and maximum value.12 If an estimator uses a point estimate, he or she might bias the most likely estimate to be on the safe side regarding a given decision.
To calculate team estimates from the individual three-point estimates, I propose three combination models and three kinds of weights.
The Largest Interval (LI) model calculates the minimum and maximum team estimate from the most extreme individual values. The most likely value is the average of the extreme team values. This model does not use weights.
The Weighted Average of Individual Estimates (WAE) model calculates the weighted average for the team estimate directly from the individual estimates:

N_j = \frac{\sum_{k=1}^{S} \eta_{kj}\,\omega_k}{\sum_{k=1}^{S} \omega_k}    (1)

where
■ S is the inspection team size (four to six inspectors in this experiment),
■ k is the inspector identifier in a team,
■ ηkj is the estimate from inspector k of the total defects present in the document before inspection, where the index j indicates whether the minimum (min), most likely (ml), or maximum (max) number of defects is estimated,
■ Nj is the team estimate of the total defects present in the document before inspection, and
■ ωk is the weight of inspector k's estimate.
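As an illustration of the LI and WAE models, here is a small sketch; Python is my choice of language and the data values are invented for the example. Each inspector supplies a (min, most likely, max) triple; the weight schemes themselves are discussed with the WAO model below.

```python
def li_estimate(estimates):
    """Largest Interval model: estimates is a list of (min, ml, max) triples,
    one per inspector. Team min/max are the most extreme individual values;
    the team's most likely value is the average of those two extremes."""
    team_min = min(e[0] for e in estimates)
    team_max = max(e[2] for e in estimates)
    return team_min, (team_min + team_max) / 2, team_max

def wae_estimate(estimates, weights):
    """Weighted Average of Individual Estimates (Equation 1), applied to each
    of the min, most likely, and max components."""
    total_w = sum(weights)
    return tuple(
        sum(e[j] * w for e, w in zip(estimates, weights)) / total_w
        for j in range(3)   # j = 0 (min), 1 (ml), 2 (max)
    )

# Hypothetical team of four inspectors with (min, ml, max) estimates of total defects.
estimates = [(40, 60, 80), (50, 70, 100), (30, 55, 75), (60, 90, 120)]
print(li_estimate(estimates))                         # (30, 75.0, 120)
print(wae_estimate(estimates, weights=[1, 1, 1, 1]))  # uniform weights (model U)
```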
The Weighted Average of Individual Offsets (WAO) model uses our knowledge of the total number of defects that the team found when combining individual estimates. When the inspectors made their estimates, they only knew the list of defects they had found themselves. So, the model calculates a weighted average of their offsets—that is, the difference between an inspector's estimate of the number of defects in the document and the number of defects he or she reported—and adds this team offset to the number of unique defects found by the team:

N_j = D + \frac{\sum_{k=1}^{S} (\eta_{kj} - n_k)\,\omega_k}{\sum_{k=1}^{S} \omega_k}    (2)
where, in addition to the definitions identified for the same variables in Equation 1,
■ nk is the number of defects that k reported, and
■ D is the number of unique true defects found by a team.
The WAE and WAO models use weights (ωk) for individual estimation contributions. I investigated the following approaches (a short sketch follows this list):
■ Model U: uniform weights (ωk = 1). Each individual estimate weighs 1.
■ Model DR: number of defects reported (ωk = nk). This model assumes that inspectors who reported more defects estimate better, because they made a particular effort to contribute to the inspection result.
■ Model TDF: number of true defects found (ωk = dk, where dk is the number of defects that k reported that match a true defect). This model assumes better estimation accuracy from highly effective inspectors, because they demonstrated more intensive knowledge of the product and defects present. This weight is only known in the experimental environment.
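A sketch of the WAO model (Equation 2) with the three weight schemes might look as follows. The inspector data is invented, and in a real project the TDF weights would not be available, as noted above.

```python
def wao_estimate(estimates, n_reported, unique_true_defects, weights):
    """Weighted Average of Individual Offsets (Equation 2): add the weighted
    average of the inspectors' offsets (estimate minus own reported defects)
    to D, the number of unique true defects the team found."""
    total_w = sum(weights)
    return tuple(
        unique_true_defects
        + sum((e[j] - n) * w for e, n, w in zip(estimates, n_reported, weights)) / total_w
        for j in range(3)   # j = 0 (min), 1 (ml), 2 (max)
    )

# Hypothetical team data.
estimates = [(40, 60, 80), (50, 70, 100), (30, 55, 75), (60, 90, 120)]  # (min, ml, max) per inspector
n_reported = [18, 25, 12, 30]   # defects each inspector reported (n_k)
true_found = [15, 22, 10, 27]   # of those, how many matched true defects (d_k)
D = 55                          # unique true defects found by the team

weight_models = {
    "U":   [1] * len(estimates),  # uniform weights
    "DR":  n_reported,            # number of defects reported
    "TDF": true_found,            # number of true defects found (experiment only)
}
for name, w in weight_models.items():
    print(name, wao_estimate(estimates, n_reported, D, w))
```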
Evaluation Criteria
A project manager uses the estimate of defects in the product to assess product quality with regard to his or her project plan. Therefore, I propose three criteria to evaluate how well a given DEM yields accurate most likely defect estimates and defect estimation intervals.

Mean relative error of estimation models
The relative error characterizes a defect estimate's relative over- or underestimation:

RE = \frac{N_{ml} - \text{true number of defects}}{\text{true number of defects}}

The mean relative error (MRE) describes the central tendency of a group of estimates—for example, of the estimates yielded by all teams.

Defect estimation model credibility
The credibility of estimates is important to project managers, who must determine a given estimate's weight for their decision process. I model the credibility of a group of estimates with the shares of most likely defect estimates that fall into intervals with good, sufficient, and poor accuracy. The limits of these intervals depend on the project manager's judgment. For the experiment, I use the following limits: Good estimates exhibit less than 20% absolute RE (ARE), sufficient estimates exhibit 20% to 40% ARE, and poor estimates lie outside the sufficient range.

Confidence interval accuracy
I define a defect-estimation confidence interval, minimum and maximum, to be accurate if it contains the true number of defects. For one estimate, this is a binary value. For a set of estimates that a given estimation process yields, this is the share of estimates that contain the true number of defects.
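To show how these three criteria could be computed over a set of team estimates, here is a short sketch. The example numbers are invented; the 20% and 40% limits are those used in the article, and the treatment of values exactly on a boundary is my assumption.

```python
def relative_error(ml_estimate, true_defects):
    """RE: relative over- or underestimation of the most likely estimate."""
    return (ml_estimate - true_defects) / true_defects

def evaluate_teams(team_estimates, true_defects):
    """team_estimates: list of (min, ml, max) team estimates from one estimation model."""
    res = [relative_error(ml, true_defects) for (_, ml, _) in team_estimates]
    mre = sum(res) / len(res)                       # mean relative error
    ares = [abs(re) for re in res]                  # absolute relative errors
    credibility = {                                 # shares of good/sufficient/poor estimates
        "good (<20%)": sum(a < 0.20 for a in ares) / len(ares),
        "sufficient (20-40%)": sum(0.20 <= a <= 0.40 for a in ares) / len(ares),
        "poor (>40%)": sum(a > 0.40 for a in ares) / len(ares),
    }
    interval_accuracy = sum(lo <= true_defects <= hi
                            for (lo, _, hi) in team_estimates) / len(team_estimates)
    return mre, credibility, interval_accuracy

teams = [(60, 75, 95), (50, 62, 70), (70, 88, 110), (40, 48, 66)]  # invented team estimates
print(evaluate_teams(teams, true_defects=86))
```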
Experiment Description
The experiment was part of a university software development workshop that teaches 200+ undergraduate students to develop a real medium-size software product. All students knew how to develop small programs; approximately 10% had professional development experience.
In the experiment, individuals from development teams of four to six people independently inspected a requirements specification, which they later had to use as input for their development activities. The developers were randomly assigned to a defect detection approach (that is, to use a checklist or one of three scenario-based viewpoints—user, designer, or tester).13 Developers were balanced among development teams according to their development and problem-solving qualifications as graded in an entry test. In a training phase, the subjects learned inspection methods, their assigned reading technique, and the experimental procedures; they also inspected a sample software requirements specification.
During individual reading, each subject applied the assigned reading technique independently to the inspection object. Results from each subject were a defect list, an inspection protocol, and an individual three-point estimate for the number of defects remaining in the document after inspection. Inspection supervisors performed an initial data analysis: They checked the completeness and validity of the collected defect and effort data. Then, they analyzed the individual defect lists and determined whether reported defects corresponded to true defects.
The experiment design reduced the impact of threats to experiment validity, as follows:14
■ We took a specification from a real application context to deal with an inspection object that was representative of real development specifications.
■ We used a classroom setting in order to control the experiment environment.
■ We applied inspection activities that had been used in a professional development environment15 to work with an inspection process that was representative of software development practice.
■ We used students as the sample, so the results could reasonably be generalized to people with a comparable background—possibly novice rather than professional developers.
The inspection object was a 35-page requirements document containing 9,000 words and six diagrams. The document described a distributed administrative information system for managing ticket sales, administration, and location management. The system comprised four parts:
■ context information in natural language text and illustrating diagrams,
■ business functions and nonfunctional requirements in structured natural language and tables,
■ a relational database model in entity-relationship notation, and
■ an object-oriented class model and class description in Unified Modeling Language notation.
From historical project data on the workshop, we estimated the effort for development after inspection at 30 to 36 person-months (approximately 5,000 person-hours) over five months. The document contained 86 reseeded defects—48 minor and 38 major—that had been found during development. All defects in the requirements models and business functions—that is, wrong, missing, unclear, or inconsistent information—could be found without needing to refer to external documentation.

Experiment Results
In the experiment, we collected individual estimates from 169 inspectors and combined them into 31 team estimates. Team size varied from four to six inspectors. For data analysis, we calculated the team estimates according to the models described in earlier sections and evaluated them using the criteria already discussed.
Mean Relative Error of Estimation Models
Both objective and subjective DEMs tended to underestimate in general. Table 2 presents the mean relative error of the objective and subjective DEMs for all 31 development teams, sorted descending by the MRE.

Table 2. Mean Relative Error of Estimation Models
Objective model | MRE (%) | Standard deviation (%) | Subjective model (combination model–weight model) | MRE (%) | Standard deviation (%)
Mh   | –19.6 | 22.5 | WAO–TDF | –1.6  | 26.1
Mhc  | –20.1 | 23.9 | WAO–DR  | –1.7  | 26.0
Mthc | –20.8 | 23.6 | WAO–U   | –3.4  | 24.4
Mtc  | –30.4 | 18.8 | LI      | –5.8  | 20.3
DPM  | –33.8 | 14.0 | WAE–DR  | –16.9 | 24.4
M0   | –35.4 | 15.1 | WAE–TDF | –17.0 | 24.6
Mt   | –37.1 | 14.6 | WAE–U   | –20.5 | 21.1

Objective DEMs that assume a variation of the probability for a defect to be found (Mh, Mhc, and Mthc) performed significantly (p < 0.01) better than the other CR models and the DPM. These results confirm other findings.7,8
Subjective team estimation models based on the number of defects the team found (WAO) performed very well, with a MRE of –1.6% to –3.4%, followed by the LI model, with a MRE of –5.8%. Both models estimated significantly (p < 0.01) more accurately than the WAE models, which are based only on the number of defects estimated by the individual inspectors. The variation of weights for the combination models did not result in a significantly different MRE. According to the data, the top four subjective DEMs performed significantly better than all objective DEMs. The worst subjective DEMs estimated as accurately as the best objective DEMs.
Figure 1 shows box plots of the distribution of the RE for the three best subjective and objective DEMs. For both objective and subjective DEMs, the worst outliers in our dataset underestimated the number of defects by 67%, yet there were no outliers that overestimated by more than 53%. Outliers of the best subjective DEMs were teams of which more than half of the members were inspectors with particularly poor inspection effectiveness and accordingly low individual defect estimates.
Figure 1. Distribution of the RE for the top three subjective and objective DEMs.
Defect Estimation Model Credibility
Table 3 analyzes the shares of most likely value estimates that fall into intervals with good (ARE < 20%), sufficient (ARE 20% to 40%), and poor (ARE > 40%) accuracy, sorted ascending by the share of poor estimates.

Table 3. Distribution of Estimates on Absolute Relative Error Intervals (percentage of team estimates)
Objective model | ARE < 20% | ARE 20%–40% | ARE > 40% | Subjective model | ARE < 20% | ARE 20%–40% | ARE > 40%
DPM  | 35.5 | 48.4 | 16.1 | LI      | 71.0 | 19.4 | 9.7
Mh   | 41.9 | 38.7 | 19.4 | WAO–U   | 61.3 | 29.0 | 9.7
Mhc  | 38.7 | 35.5 | 25.8 | WAO–DR  | 54.8 | 32.3 | 12.9
Mthc | 32.3 | 38.7 | 29.0 | WAO–TDF | 54.8 | 32.3 | 12.9
M0   | 9.7  | 61.3 | 29.0 | WAE–DR  | 51.6 | 29.0 | 19.4
Mtc  | 19.4 | 48.4 | 32.3 | WAE–U   | 48.4 | 32.3 | 19.4
Mt   | 6.5  | 54.8 | 38.7 | WAE–TDF | 48.4 | 32.3 | 19.4

The best objective models were the DPM and Mh. They yielded more than 30% good and fewer than 20% poor estimates. CR models—with the assumption that each defect has the same probability to be found (M0, Mtc, and Mt)—performed considerably worse.
All subjective DEMs yielded at least 48% good estimates—a higher share than the best objective model—and less than 20% poor estimates. The WAO–U and LI models performed much better than the WAE models.
Confidence Interval Accuracy
Table 4 shows the percentage of interval estimates that contain the true number of defects, sorted descending.

Table 4. Confidence Interval Accuracy
Objective model | Interval contains true number of defects (%) | Subjective model | Interval contains true number of defects (%)
Mthc | 74.2 | LI      | 87.1
Mhc  | 71.0 | WAO–U   | 58.1
Mh   | 58.1 | WAO–DR  | 58.1
Mtc  | 48.4 | WAO–TDF | 58.1
M0   | 12.9 | WAE–DR  | 48.4
DPM  | 12.9 | WAE–TDF | 48.4
Mt   | 9.7  | WAE–U   | 38.7

Mthc and Mhc performed significantly better than the other objective DEMs. For the objective models, the interval tested is the estimator's 95% confidence interval for the total number of defects in the requirements document. With the experiment data, these intervals contained the true number of defects significantly less often than the nominal 95% level. As expected, the LI model was the best subjective estimation approach, because it maximizes interval size. The WAO and WAE models contained the true value for under 60% of all teams. Team interval estimates that missed the true number of defects had a much shorter interval span than the successful cases.
These interval accuracy results suggest considerable caution when using such intervals in project monitoring. An expedient approach for the project manager to construct more accurate intervals is to select an interval with sufficient tolerance around the most likely defect estimate (see Table 3).
All the models tested tended to underestimate the true number of defects in the document. Model Mh, with a MRE of –19.6%, performed better than the other objective approaches and showed sufficient performance regarding all evaluation criteria. Subjective DEMs generally estimated more accurately than objective DEMs. Estimates based on team data and the LI model, with a MRE of –2% to –6%, performed significantly better than subjective estimates based on individual defect data. The postinspection data analysis effort to determine the number of unique true defects for the estimating team paid off and, at the same time, did not complicate data collection during inspection. The weight models with variables available in a project context (U, DR) were as accurate as the models with knowledge only available in the experiment (TDF). The confidence interval accuracy of all but the three best DEMs was much lower than 95% and must be viewed with caution.
The experiment's subjects were mostly novice software developers with little experience in inspection and estimation. Yet, the inspection process proved to be a good preparation for defect estimation; reasonable to good individual inspection effectiveness was the key to good team estimates. So, further work should investigate the influence of the defect detection methods used and of inspection team composition (that is, team size and inspector capability) on the accuracy of team estimates. The experimental results suggest studying the best estimation models for data on product quality when assessing the credibility of project estimates regarding a product's specified quality levels. Project managers should consider including a procedure for inspection and estimation in their project control process.

Acknowledgments
I thank the 169 experiment participants, 28 inspection supervisors, and the experiment support team, Michael Halling, Barbara Tappeiner, and Wolfgang Lautner, for their contributions to preparing and executing the experiment.
References
1. B.W. Boehm and T. DeMarco, "Software Risk Management," IEEE Software, Vol. 14, No. 3, May/June 1997, pp. 17–19.
2. T. Gilb and D. Graham, Software Inspection, Addison-Wesley, Reading, Mass., 1993.
3. W.S. Humphrey, Introduction to the Team Software Process, Addison-Wesley, Reading, Mass., 2000.
4. J. Hihn and H. Habibagahi, "Cost Estimation of Software Intensive Projects: A Survey of Current Practices," Proc. 13th Int'l Conf. Software Eng., IEEE Computer Soc. Press, Los Alamitos, Calif., 1991, pp. 276–287.
5. B. Boehm et al., "Cost Models for Future Software Life Cycle Processes: COCOMO 2.0," Annals Software Eng., Vol. 1, 1995, pp. 57–94.
6. S. McConnell, "Gauging Software Readiness with Defect Tracking," IEEE Software, Vol. 14, No. 3, May/June 1997, pp. 136–139.
7. L. Briand, K. El Emam, and B. Freimut, "A Comparison and Integration of Capture–Recapture Models and the Detection Profile Method," Proc. Ninth Int'l Symp. Software Reliability Eng., IEEE Computer Soc. Press, Los Alamitos, Calif., 1998.
8. L. Briand, K. El Emam, and B. Freimut, A Comprehensive Evaluation of Capture–Recapture Models for Estimating Software Defect Content, Tech. Report ISERN-98-31, Int'l Software Eng. Research Network, Germany, 1998.
9. D.L. Otis et al., "Statistical Inference from Capture Data on Closed Animal Populations," Wildlife Monographs, No. 62, 1978.
10. R.W. Selby, Evaluations of Software Technologies: Testing, Cleanroom and Metrics, doctoral dissertation, Dept. Computer Science, University of Maryland, 1985.
11. K. El Emam, O. Laitenberger, and T. Harbich, The Application of Subjective Estimates of Effectiveness to Controlling Software Inspections, Tech. Report ISERN-99-09, Fraunhofer Inst. for Experimental Software Eng., Int'l Software Eng. Research Network, Germany, 1999.
12. S. Grey, Practical Risk Assessment for Project Management, John Wiley & Sons, New York, 1995.
13. V. Basili et al., "The Empirical Investigation of Perspective-Based Reading," Empirical Software Eng.: An Int'l J., Vol. 1, No. 2, Apr. 1996, pp. 133–164.
14. J. Miller, M. Wood, and M. Roper, "Further Experiences with Scenarios and Checklists," Empirical Software Eng. J., Vol. 3, No. 1, Jan. 1998, pp. 37–64.
15. A. Porter and L. Votta, "An Experiment to Assess Different Defect Detection Methods for Software Requirements Inspections," Proc. 16th Int'l Conf. Software Eng., IEEE Computer Soc. Press, Los Alamitos, Calif., 1994, pp. 103–112.
About the Author
Stefan Biffl is an assistant professor of software engineering at the Vienna University of Technology. His research interests include project and quality management in software engineering. He received an MS and PhD in computer science from the Vienna University of Technology and an MS in social and economic sciences from the University of Vienna. He is a member of the ACM and IEEE. Contact him at Institute for Software Technology, Technische Universität Wien, Karlsplatz 13, A-1040 Vienna, Austria; [email protected].