Comparing Methods to Identify Defect Reports in a Change Management Database

Elaine J. Weyuker, Thomas J. Ostrand
AT&T Labs - Research
180 Park Avenue
Florham Park, NJ 07932
(weyuker,ostrand)@research.att.com

ABSTRACT

A key problem when doing automated fault analysis and fault prediction from information in a software change management database is how to determine which change reports represent software faults. In some change management systems, there is no simple way to distinguish fault reports from changes made to add new functionality or perform routine maintenance. This paper describes a comparison of two methods for classifying change reports for a large software system, and concludes that, for that particular system, the stage of development when the report was initialized is a more accurate indicator of its fault status than the presence of certain keywords in the report's natural language description.

1. INTRODUCTION

A central problem in testing large software systems is to make testing as automatic and repeatable as possible, and to ensure that the substantial costs associated with testing are well spent. Automation helps make the testing phase cost-effective and helps mitigate the variation between different testers' skill levels. It also enables the development of substantial test suites to validate large software systems that are often required to run continuously and are mission-critical for an organization. For these reasons, we have been developing algorithms that automatically predict which files of a large software system are likely to contain the largest numbers of faults in the next release, and implementing a tool to make those predictions. This tool requires no data mining or analysis to be performed by the user, and requires no particular expertise of the user. By helping to localize where faults are most likely to occur, its output helps testers prioritize their efforts and allocate the time available for testing more effectively.

The study described in this paper is about determining when we have encountered a fault. It is not about predictive models per se; rather, it considers two different ways of automatically deciding whether or not a change has been made in response to an identified fault. For clarification purposes, we include some terminology. If the software is run on an input and the output is determined to be incorrect, we say that a failure has occurred. A fault is the problem in the software that caused the failure to occur. Faults are also known as defects. The removal of a fault is called debugging.

The prediction algorithms used by our tool have been developed by analyzing historical fault and change information, as well as characteristics of the source code, for several large software systems, listed in Table 1. The algorithms rely on a set of factors that include the file's size, the programming language used, and the number of faults and changes made to the file in earlier releases. This information is extracted from the database of the commercially available change management/version control tool that records all changes made to the system under development. Changes are described in modification requests (MRs), which include information about the changed software: the dates of MR creation and of the software changes, the id numbers of the developers and testers, the stage of the development process when the MR was created, and natural language text that describes the reason for requesting a change and, sometimes, an additional description of the actual change. The size and programming language of a given file are obtainable directly from the code itself.

MRs are written to change software for a variety of reasons that are not fault-related, such as adding new functionality to the system, performing maintenance updates, enhancing performance, or applying requirements changes. Surprisingly, the change management tool used by all the systems we have analyzed does not provide a simple way to differentiate the fault-related MRs from those submitted for other reasons.

In order to validate the effectiveness of our prediction models, we have done a series of empirical studies on the systems in Table 1. With one exception, the subject systems have had thousands of MRs written, too many to permit careful reading of each one to determine whether or not it represents a fault. In these studies, we have tried a number of different ways of determining whether an MR is fault-related, to serve as proxies for complete knowledge of the actual fault reports. This paper compares the latest method we have used, which is based on the development stage during which the change was made, to a method based on analysis of the natural language text of the MRs.
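To make the shape of this data concrete, the following is a minimal sketch of an MR record and of the per-file factors described above. The class and field names are our own illustration; they are not the schema of the actual change management tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

# Hypothetical shape of a modification request (MR) as described above.
# The real change management tool's schema is not public, so the field
# names here are illustrative only.
@dataclass
class ModificationRequest:
    mr_id: str
    created: date
    development_stage: str          # e.g. "system test"; exact strings assumed
    creator_id: str                 # developer or tester id number
    description: str                # natural language reason for the change
    changed_files: List[str] = field(default_factory=list)
    change_dates: List[date] = field(default_factory=list)

# Per-file factors used by the prediction algorithms: size, language,
# and fault/change history from earlier releases.
@dataclass
class FileFactors:
    path: str
    kloc: float                     # size in thousands of lines of code
    language: str                   # programming language of the file
    prior_faults: int               # faults attributed to the file in earlier releases
    prior_changes: int              # changes made to the file in earlier releases
```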
System                No. of Releases   Years   KLOC   % of Faults in Top 20% of Files   MR Fault Criterion
Inventory             17                4       538    83%                               1 or 2 changed files
Provisioning          9                 2       438    83%                               manual reading
Voice Response                          2+      329    75%                               created by tester
Maintenance Support   35                9       668    84%                               development stage

Table 1: Information for Empirical Studies
In our first empirical study [2], we used an algorithm proposed by project personnel of the inventory system that was the subject of that study. They suggested that if just one or two files were changed as the result of an MR, then the MR likely represented a fault, while if many files were changed, then it was more likely to be a change to some interface or a wholesale introduction of many files implementing new functionality. This simple rule of thumb performed surprisingly well. In a small informal experiment, we randomly selected about 50 MRs and manually read their natural language descriptions to determine whether or not they represented faults. Almost all of the 50 were correctly characterized by the "1 or 2 files" approach. Although this approximation seemed satisfactory for the inventory system, it seemed too rigid for general use, and we continued looking for a better method.

The second system we studied, a service provisioning system, contained relatively few modification requests, although it was made up of over 400 thousand lines of code. Because of the relatively small number of MRs, we were able to read every one manually and classify each as being either a fault MR or not. The other systems we have encountered all contained far too many changes for manual categorization to be feasible, and in any case, manual reading cannot be made part of an automated tool.

For the third system we studied, a voice response system, we used two different methods to determine which MRs represented faults. After we began our study, we were able to convince the administrators of the change management system to add a specific bug identification field to the standard MR template. In this field, users could classify each newly created MR as one of bug, no-bug, or don't know. The field was available during approximately the last third of the system's lifespan, and provided accurate MR classification for that period. Unfortunately, this field was not provided for any of the systems we have since analyzed. For historic MRs that had already been created for the voice response system when we began our study, we identified fault MRs using a heuristic proposed by the project's system testers. In retrospect, it is difficult to understand why we did not think of their suggestion originally. The testers pointed out that any change proposed by a tester is by definition a fault, since their sole job function is to test the software in order to identify faults. Testing personnel do not propose new functionality. Of course, a tester may sometimes incorrectly believe that an output is wrong and submit an MR that does not in fact describe a failure situation. In such a case, once it is determined that the MR does not really represent a fault, no changes will be made to the software for that MR.

In all our studies, we have counted one fault for each file changed in an MR that meets the fault identification criterion being used.
Regardless of the criterion that identifies fault MRs, there must always be changed files recorded for an MR to contribute to the count of faults for the system. Thus, if a tester creates an MR for which no files are ever changed, it will have no effect on the fault counts.

Faults might be observed in any of the following testing stages: integration test, system test, load test, operations readiness test, end to end test, and user acceptance test. Each testing stage is uniquely identified in the MR database. Again, testers responsible for these phases do nothing but test software. They cannot initiate new functionality, and so any MRs created during these development stages are considered to be faults if they lead to some software file being changed. Each MR includes a listing of every file changed.

It is also possible that a fault might escape detection during all testing phases and cause a failure that is first noticed by a customer. This is relatively rare for the systems we have studied, with the vast majority of faults being identified during some pre-release testing phase. Faults that do escape to the field are identified in our MR database with an indication that the MR was initiated by a customer, or more often by a support person contacted by a customer.

We used a fourth software system, a maintenance support system, as the subject of the empirical study described in this paper. For this system there was a new category that had not been used by the earlier subject systems: the post-release category mentioned above, in which the MR is designated in the MR database as having been found by support personnel. We therefore defined a fault to be any change to a software file made as the result of an MR initiated during integration testing, system testing, load testing, operations readiness testing, end to end testing, or user acceptance testing, or by a support person once the software is in the field; a small sketch of this criterion is given at the end of the introduction.

Table 1 shows system information and prediction results for the four systems mentioned above, and characterizes the method used to identify fault-related MRs. Column 5 shows the percentage of the faults actually contained in the 20% of the files that the model predicted would contain the largest numbers of faults. For each of the systems, the models correctly identified the files containing the vast majority of the faults. An explanation of the slightly less accurate results for the voice response system is presented in [1].

In the next section we describe an empirical study that compares the MRs identified when two different definitions are used to decide which modification requests represent faults. The purpose of this study is not a competition. Instead, the goal is to determine the most effective way of identifying faults, and then to use that algorithm to improve our prediction results. Since our ultimate goal is to find a way of assisting practitioners in their development of large software systems with high reliability requirements, this is an important issue.
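The sketch below illustrates the development stage criterion just described: an MR counts as a fault MR only if it was initiated in one of the listed testing stages (or post-release by support personnel) and at least one file was actually changed, and each changed file of such an MR contributes one fault. The exact stage strings stored by the change management tool are an assumption made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Stages whose MRs are treated as fault reports under the development stage
# definition described above. The precise strings are assumed, not taken from
# the actual change management database.
FAULT_STAGES = {
    "integration test",
    "system test",
    "load test",
    "operations readiness test",
    "end to end test",
    "user acceptance test",
    "post-release (customer/support)",
}

def is_fault_mr(stage: str, changed_files: List[str]) -> bool:
    """An MR is a fault MR only if it was initiated in a testing or
    post-release stage AND at least one file was changed as a result."""
    return stage in FAULT_STAGES and len(changed_files) > 0

def fault_counts(mrs: Iterable[Tuple[str, List[str]]]) -> Dict[str, int]:
    """Count one fault per changed file for every MR that meets the
    development stage criterion. `mrs` is an iterable of
    (development_stage, changed_files) pairs."""
    counts: Dict[str, int] = {}
    for stage, files in mrs:
        if is_fault_mr(stage, files):
            for f in files:
                counts[f] = counts.get(f, 0) + 1
    return counts
```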
Release   Development Stage Def Only   Keyword Def Only   Both   Total MRs
A         29                           16                 19     270
B         47                           28                 32     426
C         79                           2                  36     519
D         34                           1                  20     254

Table 2: Faults According to Different Definitions
2. AN EMPIRICAL STUDY

In the current study, we compare the results of using two different definitions to identify fault MRs. The first is the one we used when we studied the maintenance support system while doing fault prediction. It is based on the development stage at which the modification request was first initiated, and it has also been the basis for determining how effective our prediction algorithms are. The second definition is based on scanning the English description in the MR for the keywords "bug", "defect", "patch" and "fix". By scanning for these strings, we identify any variant of a keyword, such as "bugs", "fixes" or "fixed". Each keyword must be either at the beginning of the string or preceded by at least one whitespace character. This prevents the identification of words that coincidentally include one of the keywords as a substring, such as "dispatch", "suffix", or "ladybug". On the other hand, there might be other MRs containing words that would be matched but probably should not be; one example would be an MR including the word "fixate". However, in our study, we have not noticed any instances of undesired words being selected. This keyword method is very similar to the one used by Zimmermann et al. [4] to analyze change reports for the open-source Eclipse system. Their goal was to make fault predictions for the Eclipse code, and they were also faced with deciding which changes in the repository represented faults.

The study in this paper was done using the fourth system listed in Table 1, a maintenance support system that was described in detail in [3]. This large-scale system has been in the field for over nine years, and provided us with data for 35 consecutive releases. The most recent release contains well over half a million lines of code. During those 35 releases, 8568 MRs were written, some of which represented faults; many more did not represent faults, but rather enhancements or modifications made for some other reason.

For the current study we selected four of the 35 releases to consider in detail, using a program to separate MRs into four categories. The first three of these categories were considered faults by at least one of the definitions: those identified as faults because of their development stage only, those identified because of the presence of at least one keyword but not an appropriate development stage, and those identified for both reasons. The remaining MRs had neither characteristic, and were therefore not considered faults. For each MR in those four subject releases that was identified by one of the definitions but not by the other, we manually read its description to see whether we could determine whether or not it represented a fault. Our goal was to determine which type of definition was more effective both at correctly identifying faults and at not identifying non-faults as faults. Since the ultimate goal is to have the most accurate fault-identification algorithm possible, we also plan to determine whether combining the two definitions will improve the accuracy of identifying the set of fault MRs.
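A minimal sketch of the keyword rule just described is shown below. Case-insensitive matching is an assumption, since the paper does not state how case was handled, and the example strings are purely illustrative rather than taken from the studied system.

```python
import re

# Keyword rule described above: the keyword must start the description or be
# preceded by whitespace, so variants like "fixes" or "bugs" match while
# "dispatch", "suffix", or "ladybug" do not. Case-insensitivity is assumed.
KEYWORD_PATTERN = re.compile(r"(?:^|\s)(?:bug|defect|patch|fix)", re.IGNORECASE)

def keyword_fault_mr(description: str) -> bool:
    """Return True if the MR's natural language description satisfies the
    keyword definition of a fault MR."""
    return KEYWORD_PATTERN.search(description) is not None

# Illustrative behaviour:
assert keyword_fault_mr("Fixed null pointer in billing module")
assert keyword_fault_mr("this change patches the timeout bug")
assert not keyword_fault_mr("dispatch new orders to the ladybug suffix table")
```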
3. FINDINGS

We identify the four releases studied as Releases A, B, C, and D. Table 2 summarizes our observations. Column 2 shows the number of MRs classified as faults because of their development stage, but not containing one of the designated keywords. Column 3 shows the number of MRs classified as faults by keyword, but not by development stage. Column 4 shows the number of MRs classified as faults using both definitions, and Column 5 shows the total number of MRs for that release.

Using either or both of the fault definitions, it is clear that the majority of all MRs written for this project were classified as non-fault MRs, and are believed to have been written to make other types of modifications. For this reason, it is essential to have a definition that can correctly identify the relatively small number of fault MRs. The table indicates that, for each of the four releases studied, the development stage definition identifies more modification requests as faults than the keyword definition. Of all MRs identified as fault MRs by at least one of the definitions, the percentage identified by the development stage definition only ranged from 43.9% to 67.5%, while the percentage identified only by keyword ranged from 1.7% to 26.2%. For the four releases studied, only about one third of the MRs identified as faults by either definition were identified by both; a short calculation illustrating these percentages is sketched below.

To determine which fault definition yielded the most accurate results, we manually read the natural language descriptions of each MR that met one or the other, but not both, of the two fault criteria. Based on our understanding of the descriptions, we then classified each MR as either a fault or a non-fault. The results are shown in Table 3, which splits the MRs identified by only one of the two definitions into those correctly identified as faults and those incorrectly identified. The table's first row shows that for Release A, 25 of 29, or 86%, of the MRs that were classified as faults by the development stage definition, but not the keyword definition, were actually faults, and hence correctly classified. In contrast, only 7 of 16, or 44%, of the MRs that were classified as faults by the keyword definition, but not by the development stage definition, were correctly classified. For Release B, 96% of the MRs classified as faults by the development stage definition, but not the keyword definition, were correctly classified, while none of the MRs classified as faults by the keyword definition, but not by the development stage definition, were correctly classified. For Releases C and D, none of the very few keyword-only MRs represented faults. In contrast, for Releases C and D respectively, 95% and 100% of the MRs identified by the development stage definition, but not the keyword definition, were correctly classified as faults.
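The percentages quoted above follow directly from the counts in Table 2. As a simple illustrative calculation (ours, not part of the original tooling), the denominator in each case is the number of MRs flagged by at least one of the two definitions:

```python
# Counts from Table 2: (dev stage only, keyword only, both, total MRs).
table2 = {
    "A": (29, 16, 19, 270),
    "B": (47, 28, 32, 426),
    "C": (79, 2, 36, 519),
    "D": (34, 1, 20, 254),
}

for release, (dev_only, kw_only, both, _total) in table2.items():
    identified = dev_only + kw_only + both   # MRs flagged by at least one definition
    print(release,
          f"dev-only {dev_only / identified:.1%}",    # ranges from 43.9% (B) to 67.5% (C)
          f"keyword-only {kw_only / identified:.1%}", # ranges from 1.7% (C) to 26.2% (B)
          f"both {both / identified:.1%}")            # roughly one third for each release
```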
          Classified as Faults by                   Classified as Faults by
          Development Stage Only                    Keyword Only
Release   Fault   Non-Fault   Percent Correct       Fault   Non-Fault   Percent Correct
A         25      4           86%                   7       9           44%
B         45      2           96%                   0       28          0%
C         75      4           95%                   0       2           0%
D         34      0           100%                  0       1           0%

Table 3: Correct and Incorrect Fault Classifications
          Type 1 Misclassification Rate        Type 2 Misclassification Rate
Release   Development Stage   Keyword          Development Stage   Keyword
A         8.3%                25.7%            3.2%                10.6%
B         2.5%                46.7%            0.0%                12.3%
C         3.5%                5.3%             0.0%                15.6%
D         0.0%                4.8%             0.0%                14.6%

Table 4: Misclassification Rates

Extrapolating from our study of the four releases in Table 2 to the 35 releases in the system's complete history, we would expect that somewhere between 20% and 25% of the total 8568 MRs, or roughly 2,000 MRs, would be identified as faults by at least one of the definitions. Because of the large number of MRs involved, we did not attempt to read every one of these 2,000 to determine whether or not it was correctly categorized by one or both of the definitions, but limited ourselves to the four releases as representative.

We now assess the numbers of misclassifications observed during a release for each of the two definitions. It is important to recognize that there may be some actual faults that are not identified by either the development stage definition or the keyword definition; they are not considered in this study because we are unaware of them. In addition, there may be faults in the software that have not been identified at all and therefore do not correspond to any modification request in the change database. Throughout our discussion we assume that MRs not identified by either fault definition do not represent faults, and similarly that MRs identified by both definitions really do represent faults.

What we have observed so far is that for the four releases studied for this subject system, the development stage definition identified significantly more MRs as faults, and those MRs were much more likely to be correctly identified as faults. For the four releases, while an average of 95% of the MRs identified as faults by the development stage definition alone turned out to actually be faults, less than 15% of the MRs identified by the keyword definition alone actually were faults. This considers only MRs identified by one definition but not the other.

Type 1 or false positive misclassifications are those instances in which an MR that is not a fault MR is incorrectly classified as being a fault MR. Type 2 or false negative misclassifications are those instances in which an MR that really is a fault MR is classified as not being a fault MR. The numbers of Type 1 misclassifications for each definition appear in the columns labelled "Non-Fault" in Table 3: Column 3 for the development stage definition, and Column 5 for the keyword definition. Type 2 misclassifications for a given definition appear in the "Fault" column of the other definition, since these represent MRs determined to be faults by manual reading, but not identified as such by the given definition.

A fault criterion can be considered effective if Type 1 and Type 2 misclassifications are low, but both numbers have to be considered. For example, if a criterion identified no MRs as faults, then there would be no Type 1 misclassifications, regardless of the number of actual faults; if there were in fact many faults, then the Type 2 number would be high. Similarly, if a criterion identified every MR as a fault, then it would yield no false negatives, but the number of false positives would be high.

Because the four releases in this study have different total counts of MRs, and different fault counts, we use rates of Type 1 and Type 2 misclassifications to compare them uniformly. We calculate the Type 1 misclassification rate for a definition by dividing the number of Type 1 misclassifications by the total number of items classified as a fault by the definition. The denominator is thus the number of MRs classified as faults by the definition alone, plus the number classified as faults by both definitions. For example, 79 MRs in Release C were classified as faults by the development stage definition (and not by the keyword definition), and 36 were classified as faults by both definitions. Four of the development-only MRs were misclassified because they really were not faults, for a Type 1 misclassification rate of 4/(79+36) = 3.5%. Calculated in this way, for the development stage definition, the Type 1 misclassification rates were 8.3%, 2.5%, 3.5% and 0% for Releases A, B, C and D respectively. In contrast, the Type 1 misclassification rates for the keyword definition were 25.7%, 46.7%, 5.3%, and 4.8% for Releases A, B, C and D respectively.

The Type 2 misclassification rates for the development stage definition are very low. For Release A, 7 faults were correctly identified by the keyword definition but were missed by the development stage definition, giving a Type 2 rate of 3.2% for the development stage definition for this release. For Releases B, C, and D, there were no false negatives, and hence the Type 2 rate is 0% for those three releases. In contrast, the keyword definition produced a total of 179 false negatives over the four releases, giving Type 2 rates of 10.6%, 12.3%, 15.6%, and 14.6%.
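A small sketch of these rate calculations, using the counts from Tables 2 and 3, is shown below. The Type 1 denominator is the one stated above; the paper does not spell out the Type 2 denominator, so the sketch assumes it is the number of MRs the definition classifies as non-faults, an inference chosen because it reproduces the rates in Table 4.

```python
# Per release: MRs identified by each definition (Table 2) and the manual
# classification of the singly-identified MRs (Table 3).
# release -> (dev_only, kw_only, both, total_mrs, dev_only_true_faults, kw_only_true_faults)
counts = {
    "A": (29, 16, 19, 270, 25, 7),
    "B": (47, 28, 32, 426, 45, 0),
    "C": (79, 2, 36, 519, 75, 0),
    "D": (34, 1, 20, 254, 34, 0),
}

for rel, (dev, kw, both, total, dev_true, kw_true) in counts.items():
    # Type 1 rate: misclassified non-faults / all MRs the definition calls faults.
    type1_dev = (dev - dev_true) / (dev + both)
    type1_kw = (kw - kw_true) / (kw + both)
    # Type 2 rate: known faults the definition missed / MRs it calls non-faults.
    # (This denominator is our inference; it reproduces the rates in Table 4.)
    type2_dev = kw_true / (total - dev - both)
    type2_kw = dev_true / (total - kw - both)
    print(rel, f"{type1_dev:.1%} {type1_kw:.1%} {type2_dev:.1%} {type2_kw:.1%}")
    # e.g. Release C prints 3.5% 5.3% 0.0% 15.6%, matching Table 4.
```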
Table 4 shows the Type 1 and Type 2 misclassification rates for the two definitions, for each of the four releases. The two sorts of misclassification have very different consequences. Too many false positive classifications will lead to an inaccurate assessment that some files historically contained more faults than they really did. Since the fault count of a file in previous releases is a factor in predicting its future fault-proneness, this can result in an overly pessimistic (and incorrect) prediction that the file will contain future faults, leading testers to pay unwarranted extra attention to that part of the code. In contrast, when Type 2 misclassifications occur, actual faults are not correctly identified, and so a file or other software artifact that is likely to contain faults might not get the scrutiny it deserves.

The discussion above indicates that the keyword definition has relatively large numbers of both Type 1 and Type 2 misclassifications, while the development stage definition has relatively few of each. This indicates to us that, at least for this subject system, using keywords to identify which MRs represent faults does not yield acceptable results, while the use of the development stage yields far more accurate results. Initially we had thought that we might be able to improve our fault identification algorithm by including MRs that were identified by either the development stage or the keyword definition. The high number of Type 1 misclassifications we observed convinced us that this would yield poorer results than we have seen using the development stage alone; a short calculation illustrating this follows below.
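To make the last point concrete, the following rough calculation (ours, not the paper's, and resting on the same assumption that MRs flagged by both definitions are true faults) estimates the Type 1 rate of a combined "either definition" rule for Release B from the counts in Tables 2 and 3:

```python
# Release B counts: 47 dev-stage-only MRs (45 true faults), 28 keyword-only MRs
# (0 true faults), and 32 MRs flagged by both (assumed to be true faults).
dev_only, dev_true = 47, 45
kw_only, kw_true = 28, 0
both = 32

identified_by_either = dev_only + kw_only + both               # 107 MRs
false_positives = (dev_only - dev_true) + (kw_only - kw_true)  # 2 + 28 = 30
combined_type1_rate = false_positives / identified_by_either

print(f"{combined_type1_rate:.1%}")  # about 28%, versus 2.5% for development stage alone
```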
4. EXAMPLES

We were surprised that so many MRs were incorrectly identified as faults or non-faults using the keyword definition, and we were interested in seeing why that was the case. In this section we consider some examples of phrases in modification requests that included one of the keywords, but where reading the MR description showed that the MR was not a fault MR.

In several cases, the MR was written to initialize all new files into the release. In that case there was often a phrase that spoke of doing the "patch build" or some similar wording. Another case that occurred multiple times was a phrase indicating that something was "fixed elsewhere" or that nothing needed to be fixed. One interesting example used the word "defective" in the description, but it was referring not to code but to data; the MR was written to change the contents of a database, and was not written in response to an observed failure. A different MR used the word "fix" when changing or reinitializing some data. Another set of MRs used the word "fix" in the description when referring to the manual or documentation; again, this did not represent a fault but rather a change in the requirements. One final example of a non-fault MR that used the word "patch" spoke about writing a patch to change timeout variables during testing. This would allow testers to do certain types of performance, stress or load tests, and the patch was to be reversed once testing was complete. This example was especially interesting because, although the MR referred to testing, it was not initiated by testers but rather during a rarely-used category called "controlled intro".

In addition to the above examples, one could easily imagine encountering phrases such as "this is not a bug", "this is not a defect", or "nothing needs to be patched", although we did not encounter them in any of the four releases we studied. As indicated above, we did encounter related phrases that said something to the effect that nothing needed to be fixed.

5. CONCLUSIONS

Based on this single case study, it appears that the development stage fault classification algorithm correctly identifies a far larger percentage of actual faults than the keyword definition approach. Augmenting the development stage approach with the keyword method does identify a small number of faults missed when using the development stage by itself, but it also incorrectly identifies far more MRs. We expect to repeat this experiment on other systems, and to study the fault MRs not identified by the current set of keywords for possible additional keywords that could be used to augment the keyword set. Because open-source systems like the Eclipse system studied in [4] do not follow a set of standard development stages, and do not have professional developer/testers with specific roles, the development stage method is probably not applicable to them. However, it has been highly effective for our subject system.

6. REFERENCES

[1] R.M. Bell, T.J. Ostrand, and E.J. Weyuker. Looking for Bugs in All the Right Places. Proc. ACM International Symposium on Software Testing and Analysis (ISSTA 2006), Portland, Maine, July 2006, pp. 61-71.
[2] T.J. Ostrand, E.J. Weyuker, and R.M. Bell. Predicting the Location and Number of Faults in Large Software Systems. IEEE Trans. on Software Engineering, Vol. 31, No. 4, April 2005.
[3] T.J. Ostrand, E.J. Weyuker, and R.M. Bell. Automating Algorithms for the Identification of Fault-Prone Files. Proc. ACM International Symposium on Software Testing and Analysis (ISSTA 2007), London, England, July 2007.
[4] T. Zimmermann, R. Premraj, and A. Zeller. Predicting Defects for Eclipse. Proc. Third International Workshop on Predictor Models in Software Engineering (PROMISE 2007), May 2007.