Defect Categorization: Making Use of a Decade of Widely Varying Historical Data

Carolyn Seaman (A,B), Forrest Shull (A), Myrna Regardie (A), Denis Elbert (A), Raimund L. Feldmann (A), Yuepu Guo (B), Sally Godfrey (C)

(A) Fraunhofer Center for Experimental Software Engineering, College Park, Maryland 20742, USA, +1-301-403-8970
(B) Department of Information Systems, University of Maryland Baltimore County, Baltimore, Maryland 21250, USA, +1-410-455-3937
(C) NASA Goddard Space Flight Center, Greenbelt, Maryland 20771, USA, +1-301-286-5706
ABSTRACT
This paper describes our experience in aggregating a number of historical datasets containing inspection defect data using different categorization schemes. Our goal was to make use of the historical data by creating models to guide future development projects. We describe our approach to reconciling the different choices used in the historical datasets to categorize defects, and the challenges we faced. We also present a set of recommendations for others involved in classifying defects.

Categories and Subject Descriptors
D.2.0 [Software Engineering]: General

General Terms
Management, Experimentation.

Keywords
Defects, historical data, defect categories.

1. INTRODUCTION
Aggregating heterogeneous data from different sources is always difficult, especially when the data definitions use different terminology or formats. However, such an aggregation has the potential to be very useful in building descriptive or prescriptive models based on historical data. Such models are essential to properly manage the major software development outcomes (e.g., cost, schedule, quality) and to build up organizational knowledge. To be effective, however, the models must be built using a significant base of historical data that describes numerous variations within the domain of the model (e.g., an organizational unit). Often, such data, collected over a period of time by different people and in different projects, do not have identical structures. Thus, some sort of mapping must be done to convert the data into a common format that still preserves as much of the meaning of the original data as possible.

An example situation in which such a mapping is necessary is when defect data from a variety of projects is aggregated to build a quality model that relates some characteristics of software development to resulting quality in terms of defects. In particular, a major source of variation in the collection of defect data is the various categorization schemes used to characterize defects. There are a number of such schemes in the literature, and even more in use in practice. Some categorizations are orthogonal, i.e., they address different dimensions of interest (e.g., severity vs. technical cause), so combining them results in categories that are not mutually exclusive. Instead, one must choose a set of discrete categories and devise mapping rules to convert the data consistently and coherently into the new categories.

This paper describes our experience in aggregating a number of historical datasets containing inspection defect data that used different categorization schemes. We have collected data from 2,529 inspections from 81 projects across 5 NASA (US National Aeronautics and Space Administration) Centers. Each Center used different defect taxonomies, and in some cases multiple taxonomies were used at the same Center. Our goal was to develop a unified defect categorization scheme that is completely "backwards compatible" with existing data, that captures domain-specific elements of interest to NASA, and that has other desirable attributes of good defect taxonomies [1]. Such a taxonomy would then help us make use of the historical data by creating models to guide future development projects.

Because the defect categorization schemes used in the different datasets varied widely, and no one scheme appeared preferable to any other, we constructed a new set of defect categories at a slightly higher level of abstraction than the historical ones. We then devised mapping rules to convert the historical data into our new categorization scheme.

Section 2 further explains the motivation for this research initiative. In Section 3, in order to provide a larger context for our work, we present an overview of the defect categorization approaches in the literature. Section 4 gives an overview of our approach to the mapping exercise. Section 5 then outlines the various challenges we faced in mapping heterogeneous defect data that used different categorization schemes. Section 6 evaluates the resulting categorization scheme, and we end with some conclusions, as well as a summary of recommendations for those engaged in similar activities, in Section 7.

2. MOTIVATION
The project we describe is a NASA-funded research project carried out by the Fraunhofer USA Center for Experimental Software Engineering in College Park, Maryland. One of the objectives of the research project is to optimize the planning of early life-cycle validation and verification (V&V) activities, such as requirements and design inspections, and to demonstrate the tradeoffs with respect to later V&V activities, such as code inspections and testing, in terms of total effort and the types of defects that can be detected. This is accomplished by providing practical guidance to projects in fine-tuning their inspection processes, and by analyzing the types of defects that are most effectively found by various V&V activities throughout the lifecycle.

An initial milestone in this work was the development of a preliminary model of inspection effectiveness across multiple NASA Centers. This model allows an examination of inspection parameters, across different types of projects and different work products, for an analysis of the factors that impact defect detection effectiveness.

The data used to build the model came from several NASA Centers where we gained access to inspection data collected by projects. Data elements include the product types inspected; defect numbers, types, and severity; inspection effort (broken down in several ways); the type of software (e.g., flight vs. ground-support software, as well as the NASA-specific designations for mission and safety criticality); and characteristics of the project (e.g., development language, in-house vs. contracted out, etc.). Data models were built for each source (i.e., Center) as the data became available, and then mappings were created so that data from different contexts, but related to inspections of the same work product, could be compared across Centers. In this paper we discuss the mapping between different defect categorization schemes. Unlike the quantitative data for aspects such as inspection effort, combining the counts of defect types proved problematic due to the wide variety of categorization schemes used by the historical projects.

3. RELATED WORK
Having different objectives, researchers in different areas have developed various defect taxonomies. Vijayaraghavan and Kaner summarized existing taxonomies in terms of their objectives in [2].

The most widely used classification is Orthogonal Defect Classification (referred to as ODC hereafter), presented by Chillarege et al. in [3]. ODC bridged the gap between statistical defect models and qualitative defect analyses [4]. In [5], Chillarege et al. analyzed defect data that included both the symptoms of the defects and a cause analysis. Defects were categorized in terms of their causes. After several rounds of refinement, they determined 5 types of defects. The number of defect types was then extended to 8 in [3]. This classification is one of the most commonly used in current research regarding software faults.

A number of researchers have used ODC as a starting point for developing new defect taxonomies for particular purposes, as we have. For example, Leszak et al. [6] based their taxonomy on the ODC framework. They applied root-cause analysis to the classification of defects by relating the defects to their underlying causes, or defect triggers. Freimut et al. [1] also used ODC as a starting point for building and validating a defect categorization scheme.

Another example is a series of studies on web application faults by Li et al. in [4], [7], and [8]. Based on the ODC framework, they identified web error attributes by analyzing web server logs and associating errors with the characteristics that distinguish web applications from traditional software systems. Also related to web application defects, Sampath et al. [9] applied a web fault classification. This classification extended the work in [10], where web application faults were categorized as scripting faults, database query faults, and forms faults.

Nikora et al. developed a new fault classification in [11]. This taxonomy differs from others in that "it does not seek to identify the root cause of the fault. Rather, it is based on the types of changes made to the software to repair the faults" [12]. Mariani classified known faults according to their causes and effects. Causes are related either to the technology or to a particular scenario, e.g., system maintenance. Effects are the failures caused in the system [13]. A number of taxonomies have been developed specifically for security concerns (e.g., [14], [15], and [16]).

Of the above classifications, [4, 8-10, 13-16] are system-specific, while [5, 6, 12] are generic and form a basis on which a more specific classification can be developed. Most of them were developed in an iterative way. While many classifications exist, there is clearly precedent for developing new taxonomies for particular purposes and environments.
4. OVERVIEW OF OUR APPROACH
Our goal was to make use of several extensive historical datasets by using them to create a model describing the behavior of software inspections at NASA. In order to combine the datasets in a useful way, mappings had to be defined between the different defect categorization schemes used in the different historical datasets. While the next section outlines many of the obstacles we encountered in creating these mappings, this section outlines our overall approach. This approach emerged from our efforts to combine the historical datasets we had available, and thus is grounded in that specific experience.

We began by looking at defect categorization schemes in the literature, in particular Orthogonal Defect Classification [3]. While we wanted to facilitate mapping the historical data to the new sets of categories, we also wanted to create a scheme that was small, simple, and easy to apply in future data collection efforts. For these reasons we started with an industry standard that was well documented in the literature. However, we also wanted to understand the major concerns of the historical categorization schemes, so that our resulting scheme would remain NASA-specific and address NASA-specific concerns. So we then looked at the defect categorization schemes used in each NASA dataset in turn. For each one, we made decisions about which categories to include, combine, or partition.

Once the initial draft of the defect categories for each work product (e.g., requirements, design, code, and test plan artifacts, shown in Tables 1-3) was created, we began mapping the historical data. We proceeded by analyzing and mapping one dataset at a time. For each one, a member of our research team took a first cut at mapping the defects from the categories used in the historical dataset to our draft set of categories, using the following procedure (illustrated in Figure 3).

For each category in the historical dataset:
1. Determine whether this category already exists in the new categorization scheme, possibly under another name. If so, assign all the historical defects to this category. If not:
2. Determine whether defects in this historical category can semantically be included under any existing categories in the new scheme. If there is one such category, assign all the historical defects to it. If there are multiple categories, use the assignment algorithm that we discuss in Section 5.1. If there are no such categories:
3. Add this category to the new categorization scheme, along with all its defects.
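As a rough illustration, this per-category decision could be sketched as follows. This is our own minimal sketch, not code from the tool described below; the lookup tables are small stand-ins for the human judgments involved (a few entries echo Section 5.2, and the "security" category is invented), and distributing defects across multiple target categories is left to the procedure discussed in Section 5.1.

```python
# Hypothetical sketch of the per-category mapping procedure described above.
# The synonym and subsumption tables stand in for the human judgments our team
# made; they are illustrative, not taken from the NASA datasets.
SYNONYMS = {"anomaly management": "checking", "control": "logic"}          # step 1
SUBSUMED_BY = {"performance": ["timing / optimization"],                   # step 2
               "interface": ["internal interface", "external interface"]}

def map_dataset(historical_categories, new_scheme):
    """Return {old_category: [new_category, ...]}; may grow new_scheme (step 3).
    When several target categories apply, the actual defect counts still have to
    be distributed among them (see Section 5.1)."""
    mapping = {}
    for old_cat in historical_categories:
        if old_cat in new_scheme or old_cat in SYNONYMS:
            mapping[old_cat] = [SYNONYMS.get(old_cat, old_cat)]            # step 1
        elif old_cat in SUBSUMED_BY:
            mapping[old_cat] = SUBSUMED_BY[old_cat]                        # step 2
        else:
            new_scheme.add(old_cat)                                        # step 3
            mapping[old_cat] = [old_cat]
    return mapping

scheme = {"logic", "checking", "data", "timing / optimization",
          "internal interface", "external interface", "other"}
print(map_dataset(["control", "performance", "interface", "security"], scheme))
```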
Then the entire team reviewed the categorization as a group, discussing each of the issues described in Section 5 as they came up. Whenever possible, we consulted defect category definitions and defect logs from the historical dataset. Again, categories were included, merged, and partitioned as we went along, and terminology and definitions were modified and refined as well. Once one historical dataset had been mapped, we repeated the process with the next dataset. The resulting set of categories for all work products is shown in Tables 1-3. Note that, although we started with the ODC defect categories, the end result looks very different, especially for the requirements and test plan defect categories. This is due to the process we used, as described above, as well as the situations we encountered, described in the next section.

All the mappings were encoded in an automated tool that we were developing in parallel. The tool aids us in our work by managing the historical datasets, automatically mapping the historical defect data into our common defect categorization scheme using the mapping rules we specify, analyzing the data in various ways (including the construction of statistical models), and displaying the results of those analyses using various visualizations. One such visualization, shown in Figure 1, shows the distribution of the defects in the entire combined historical dataset over the new set of defect categories. The tool also gives feedback to current projects on how their inspection parameters (inspection team size, inspection rate, etc.) compare to historical baselines. Such an analysis would not be possible without first being able to map the defects in the historical datasets into a unified defect categorization scheme. The tool can also be used to enter new inspection data, as well as to access our experience base of inspection process checklists.
Table 1. Requirements inspection defect types

clarity: A problem in the wording or organization of the document that makes it difficult to understand.
completeness: A missing requirement or other piece of information.
compliance: A problem with compliance to any relevant standard.
consistency: Two or more statements in the document that are not consistent with each other, e.g., requirements that are mutually exclusive.
correctness: Any statement in the document that is incorrect.
testability: A requirement that is not stated in a way that makes it clear how it can be tested.
other: Anything that does not fit any of the above categories that is logged during a requirements inspection.

Table 2. Design and Source Code inspection defect types

algorithm / method: An error in the sequence or set of steps used to solve a particular problem or computation, including mistakes in computations, incorrect implementation of algorithms, or calls to an inappropriate function for the algorithm being implemented.
assignment / initialization: A variable or data item that is assigned a value incorrectly, is not initialized properly, or whose initialization scenario is mishandled (e.g., incorrect publish or subscribe, incorrect opening of a file, etc.).
checking: Inadequate checking for potential error conditions, or an inappropriate response specified for error conditions.
data: An error in specifying or manipulating data items, an incorrectly defined data structure, pointer or memory allocation errors, or incorrect type conversions.
external interface: Errors in the user interface (including usability problems) or in the interfaces with other systems.
internal interface: Errors in the interfaces between system components, including mismatched calling sequences and incorrect opening, reading, writing, or closing of files and databases.
logic: Incorrect logical conditions on if, case, or loop blocks, including incorrect boundary conditions ("off by one" errors are an example) or incorrect expressions (e.g., incorrect use of parentheses in a mathematical expression).
non-functional defects: Includes non-compliance with standards, failure to meet non-functional requirements such as portability and performance constraints, and lack of clarity of the design or code to the reader, both in the comments and in the code itself.
timing / optimization: Errors that will cause timing problems (e.g., potential race conditions) or performance problems (e.g., an unnecessarily slow implementation of an algorithm).
other: Anything that does not fit any of the above categories that is logged during an inspection of a design artifact or source code.

Table 3. Test Plan inspection defect types

clarity: A problem in the wording or organization of the document that makes it difficult to understand.
completeness: A missing test case or other piece of information.
compliance: A problem with compliance to any relevant standard.
correctness: Any statement in the document that is incorrect, including incorrect expected output for a test case.
testability: An infeasible test case (e.g., one too costly to test).
redundancy: Test cases or other information that is not necessary because it appears more than once.
other: Anything that does not fit any of the above categories that is logged during a test plan inspection.
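For use in automated mapping or data collection tooling, a scheme like the one in Tables 1-3 can be captured as plain data. The sketch below is our own abbreviated illustration, not the data model of the tool described in Section 4.

```python
# Illustrative encoding of the unified scheme from Tables 1-3, keyed by work product.
DEFECT_CATEGORIES = {
    "requirements": ["clarity", "completeness", "compliance", "consistency",
                     "correctness", "testability", "other"],
    "design":       ["algorithm / method", "assignment / initialization", "checking",
                     "data", "external interface", "internal interface", "logic",
                     "non-functional defects", "timing / optimization", "other"],
    "test plan":    ["clarity", "completeness", "compliance", "correctness",
                     "testability", "redundancy", "other"],
}
# Design and source code share one category set (see Section 5.4).
DEFECT_CATEGORIES["source code"] = DEFECT_CATEGORIES["design"]

def validate(work_product: str, category: str) -> bool:
    """Reject a logged defect whose category is not defined for its work product."""
    return category in DEFECT_CATEGORIES.get(work_product, [])

assert validate("requirements", "consistency")
assert not validate("test plan", "consistency")   # consistency is not used for test plans
```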
Figure 1. Example from prototype tool: Comparing one inspection’s defect distribution to all others from the same Center.
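As a hint of what the comparison in Figure 1 involves, the sketch below computes the defect-type distribution for a single inspection and for a baseline of other inspections from the same Center. It is a simplified illustration with made-up data, not the analysis implemented in the prototype tool.

```python
from collections import Counter

def distribution(defect_categories):
    """Fraction of defects per category, e.g. ['data', 'logic', 'data'] -> {'data': 0.67, ...}."""
    counts = Counter(defect_categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# Hypothetical example data: one inspection vs. the rest of the Center's inspections.
one_inspection = ["data", "data", "logic", "checking"]
center_baseline = ["data", "logic", "logic", "logic", "external interface", "checking"]

current = distribution(one_inspection)
baseline = distribution(center_baseline)
for cat in sorted(set(current) | set(baseline)):
    print(f"{cat:20s} this inspection: {current.get(cat, 0):.0%}  baseline: {baseline.get(cat, 0):.0%}")
```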
5. CHALLENGES
As outlined in Section 1, our objective was to create a mapping between the defect categorization schemes used in different datasets describing the outcomes of software inspections from a variety of NASA software development projects. The motivation for this exercise was that we wanted to make the large amount of available historical data useful through the creation of models that would guide and inform software inspection activities on future projects. Our approach was to create our own defect categorization scheme, using as many categories as possible from the historical datasets, and then to map the historical defect data to our new categories (shown in Tables 1-3). We aimed to preserve as much information as possible in the mappings, but we encountered a number of challenges in attempting to do this. Each of these challenges required us to make tradeoffs in order to be able to use the data in our models. Although we describe and illustrate each of these challenges in terms of our specific context and the data we had, they are illustrative of situations that arise in any effort to convert historical data to a usable form. These challenges, and the tradeoff decisions made in response, are elaborated in the subsections below.

5.1 Orthogonality
The problem of orthogonality arose because some defect categorizations in our historical datasets used categories that were not mutually exclusive. This must have made it difficult to put a defect consistently into just one category at the time of data collection; for us, as later users of the data, it made it impossible to assume that the data had been categorized consistently. In particular, there seemed to be two different sets of orthogonal concerns reflected in the defect categories. One dimension of defect categories was concerned more with the qualities of the document being inspected (e.g., categories such as "completeness", "consistency", and "clarity"). The other dimension was more technical and was concerned with problems affecting the behavior of the system (e.g., categories such as "algorithm", "data", and "logic"). When categories from these two dimensions were combined into one defect categorization scheme, the result was a set of categories that are not mutually exclusive. An inconsistency defect, for example, could also be a data or an interface defect. Our solution in designing our own defect categorization scheme was to pick just one of these dimensions for each set of categories (i.e., for a work product). As outlined above, we chose to focus on the first dimension (i.e., "completeness", "consistency", etc.) for requirements and test plan artifacts, and the second dimension (i.e., "algorithm", "data", etc.) for design and code artifacts.

This seemed intuitively appealing, but it did not solve the problem of mapping defects from the historical, non-mutually-exclusive defect categories. Our approach to this issue is illustrated with a simplified example, shown in Figure 2.

Figure 2. Mapping orthogonal categories

One of our historical datasets used a set of categories for code inspections that included more technical, structural categories similar to our second dimension above ("data", "control", "computation", etc.) as well as the more document-oriented categories ("completeness", "consistency", and "correctness"). We could map the defects in the first set of categories more or less straightforwardly to our code inspection defect categories, but the second group was much more challenging. We had no information that could tell us how and why the original defect categories were chosen. For example, how was it determined to place a defect concerning an incorrect data definition into the "correctness" category as opposed to the "data" category? So we decided to make the simplifying assumption that the defects categorized as "completeness", "consistency", or "correctness" had the same distribution of technical concerns as the rest of the defects logged. Thus we first mapped all the defects in the technical categories ("data", "control", and "computation" shown in Figure 2) into our code inspection defect categories, and calculated the percentage of defects in each category (54%, 31%, and 15%, respectively). Then we used those percentages to assign all the "completeness", "consistency", and "correctness" defects to the technical categories. So, for example, we mapped 54% of the 23 "completeness" defects, or 12 defects, into the "data" category.

One effect of this procedure is that the resulting data can no longer be said to be primary data, but derived data. As such, any associated information about individual defects is lost, and any insights gained from examination of the data are valid only at an aggregate level. In our case, there was no information available about individual defects, so no information was lost. Further, our objective was analysis of the aggregated data, not of individual defects.
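The proportional redistribution described above is easy to express in code. The sketch below reproduces the simplified example from Figure 2; the percentages and the count of 23 "completeness" defects come from the text, while the rounding rule (a largest-remainder apportionment so the parts sum to the total) is our own assumption.

```python
import math

# Shares of the technical categories, from the simplified example (Figure 2).
technical_share = {"data": 0.54, "control": 0.31, "computation": 0.15}

def redistribute(count, shares):
    """Split `count` defects across categories in proportion to `shares`, using a
    largest-remainder rule so the parts sum to `count` (our choice; the paper does
    not say how rounding was handled)."""
    raw = {cat: count * share for cat, share in shares.items()}
    assigned = {cat: math.floor(v) for cat, v in raw.items()}
    leftovers = count - sum(assigned.values())
    for cat in sorted(raw, key=lambda c: raw[c] - assigned[c], reverse=True):
        if leftovers <= 0:
            break
        assigned[cat] += 1
        leftovers -= 1
    return assigned

# 23 "completeness" defects: 54% (about 12) go to "data", as in the text.
print(redistribute(23, technical_share))   # {'data': 12, 'control': 7, 'computation': 4}
```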
Our recommendation for designers of defect categories is to pick one dimension for each set of categories (i.e., for each work product or quality assurance activity). If it is desired to capture more than one dimension, then more than one set of categories should be used. This is commonly done to capture severity, where a separate set of categories ("major" vs. "minor", or "critical" vs. "cosmetic", etc.) describes that aspect of the defect. For those trying to map historical data that used orthogonal categories, our approach to distributing defects, as described above, seems to have worked well, but of course we cannot verify how accurately it reflects the nature of the defects logged.

5.2 Terminology
One of the most commonly occurring problems we encountered was inconsistent terminology. In many cases, different datasets used similar but different names for what appeared to be the same category. For example, defects categorized as "anomaly management" fit into our "checking" category. Similarly, there was a "control" category in one of the historical datasets that was analogous to our "logic" category.

We attempted to make our defect categorization scheme somewhat higher-level than the historical schemes because we thought that would make the mapping task easier. As a result, some of the terminology uncertainties had to do with understanding subcategories. For example, defects labeled "performance" in one of the historical datasets were included in our "timing/optimization" category, because performance is one issue addressed by optimization mechanisms. As another example, we assigned "usability" defects to our "external interface" category, because usability issues are a subset of the types of problems that might occur at the interface between the application and the user.

In yet other cases, higher-level data had to be mapped to more detailed data. For example, our defect categories included both "internal interface" and "external interface" defects. One of our historical datasets, however, had only a single "interface" defect category. So the historical defects had to be divided up and assigned to each of the new "interface" defect types.

Our solution to the terminology problem was to rely on defect categorization definitions and actual defect descriptions wherever they existed, and to make the best guess we could in other cases. The key to increasing confidence in our guesses was, first, to scrupulously document our categorization decisions. This helped ensure that the categorization decisions we made for different historical datasets were consistent. Second, we subjected our categorization decisions to constant peer review. All categorizations were made by an individual member of our team, then presented, discussed, and edited in group sessions with the whole team. Also, previously categorized datasets were revisited after categorizing each subsequent dataset. Careful documentation of the categorization decisions facilitated the review process.

Clearly, resolution of terminology differences between categorization schemes would have been greatly facilitated by well-written definitions for all the defect categories used in the historical datasets. Further, actual defect logs would also have been useful, either in the absence of category definitions or to make sure that defects were categorized correctly according to the definitions. So a recommendation for those developing and using defect categorization schemes is to develop and maintain careful definitions of all category names, even for those that appear obvious. We also recommend training the teams on the definitions to help ensure they can more easily assign the correct defect category.

5.3 The Big Other
After mapping a dataset, we sometimes ended up with a very large number of defects in the "other" category. Usually, this was because the original data collection and categorization in the historical dataset had overused the "other" category. In these cases, we usually found that "other" defects were logged at a much higher rate early in the project, and that their rate decreased significantly as the project progressed. We interpreted this to mean that there was a learning curve in using the defect categories and that developers got better and more precise at it as the project progressed. Our solution was either to eliminate the data from the early inspections of such a project, or to re-categorize the defects from that early period according to the distribution of defects over categories in the rest of the project.

Our recommendation for categorizing defects is to reserve the "other" category for defects that are truly one-of-a-kind and should not impact model building or analysis. A large "other" category usually signals some kind of phenomenon that should be investigated. Analyzing data quality as it is collected helps ensure that inspectors understand the definitions of the defect categories. When defects are assigned to the "other" category, it is helpful to require a textual description of the defect, along with the categorization, to allow identification of new categories emerging within the "other" category.
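A simple check of this kind can be automated. The sketch below flags inspections whose "other" share exceeds a threshold and re-categorizes early-phase "other" defects according to the distribution seen in the rest of the project; the threshold and data layout are our own assumptions, not values from the NASA datasets.

```python
from collections import Counter

OTHER_SHARE_THRESHOLD = 0.30          # our assumption, not a value from the paper

def flag_big_other(defects_by_inspection):
    """Return inspection ids whose share of 'other' defects looks suspiciously large."""
    flagged = []
    for insp_id, categories in defects_by_inspection.items():
        share = categories.count("other") / max(len(categories), 1)
        if share > OTHER_SHARE_THRESHOLD:
            flagged.append(insp_id)
    return flagged

def recategorize_early_other(early_other_count, later_defects):
    """Spread early 'other' defects over the categories observed later in the project."""
    counts = Counter(c for c in later_defects if c != "other")
    total = sum(counts.values())
    return {cat: round(early_other_count * n / total) for cat, n in counts.items()}

# Hypothetical example data.
inspections = {"insp-01": ["other", "other", "data", "other"],
               "insp-17": ["logic", "data", "checking"]}
print(flag_big_other(inspections))                                   # ['insp-01']
print(recategorize_early_other(10, ["data"] * 6 + ["logic"] * 4))    # {'data': 6, 'logic': 4}
```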
5.4 Handling different artifacts
The historical data we collected included data on inspections of numerous different types of work products. Categorizing the work products inspected actually presented a categorization problem in and of itself. To simplify our analysis, we attempted to categorize all work products into four categories: requirements artifacts, design artifacts, source code, and test plans. This, however, was not completely straightforward, as the terminology describing the work products differed considerably within and among the historical datasets. For example, the names of work products we categorized as "design" included "detailed software design specification", "interface control document", and "structure chart". For our analysis, we combined all of these into the design category. Similarly, in another case, a data source distinguished between software requirements and subsystem-level requirements. For our analysis, we combined both requirements types into one requirements category.

Different datasets also used different strategies when it came to categorizing defects from different work products. Some datasets used completely different sets of categories for different products, while others used overlapping sets. In deciding how much the defect categories for different work products should differ in our defect categorization scheme, our main criteria were 1) to facilitate and minimize the mapping that had to be done from the historical datasets and 2) to create defect categories that made sense, for each work product, from the point of view of the person logging the defect. An implication of criterion 2 is that the defect categories for a particular work product should require only information that is available in that work product. This meant, for example, that categories such as "assignment/initialization" were not included in the set of categories for requirements documents, because the information required to determine whether or not a variable was initialized properly would not be available in a requirements document.

In our solution, we ended up using the same set of defect categories for design documents and source code. This was primarily because that was the case in most of the historical datasets we were using. This set included categories that required technical details about the structural characteristics of the software that could be expected to be decipherable from either the design or the code.

We also used similar, but not identical, sets of categories for requirements documents and test plans. These categories, rather than requiring a technical understanding of the structure of the software (as one would get from a design or code document), focused more on the literary or editorial aspects of the work product. For example, concerns such as "clarity" and "completeness" affect the user of the document more than the technical behavior of the system. A "consistency" category was included for requirements but not for test plans, as we could not think of an example of a consistency problem that would apply to a test plan. Also, a "redundancy" category was included for test plans to signal instances where functionality was unnecessarily tested multiple times. This category was not included for requirements because we believed redundancy in the requirements document would not be a problem, and would sometimes even be desired for clarity, as long as the redundant portions were consistent.

Our recommendation for designers of defect categories is to use categories that make sense from the point of view of the person logging the defect, and that require only the level of information contained in the work product. Otherwise, the data quality issues will be insurmountable.
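Normalizing work product names was itself a small mapping exercise. The sketch below shows how it might be automated; the "design" and "requirements" entries come from the examples above, while the remaining entries are hypothetical placeholders, and in practice we made these decisions manually.

```python
# Illustrative normalization of work product names to the four canonical types.
WORK_PRODUCT_MAP = {
    "detailed software design specification": "design",
    "interface control document": "design",
    "structure chart": "design",
    "software requirements": "requirements",
    "subsystem level requirements": "requirements",
    "unit test plan": "test plan",          # hypothetical
    "source file": "source code",           # hypothetical
}

def normalize_work_product(name: str) -> str:
    """Map a dataset-specific work product name to one of the four canonical types."""
    key = name.strip().lower()
    if key not in WORK_PRODUCT_MAP:
        raise ValueError(f"Unrecognized work product: {name!r}; extend the map manually")
    return WORK_PRODUCT_MAP[key]

print(normalize_work_product("Interface Control Document"))   # design
```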
5.5 Functional vs. Non-functional defects
We define "functional defects" as those that affect the observable behavior of the software by causing undesirable outcomes or incorrect output, while "non-functional" defects do not generally have directly observable effects. Examples of non-functional defects include non-compliance with documentation standards, unclear variable names, performance issues, etc. The problem we encountered in mapping defect categories from our historical datasets was that different datasets treated non-functional defects very differently.

While one of our datasets seemed to ignore non-functional defects altogether, two others included numerous detailed non-functional categories. One dataset included five different non-functional defect categories: "compliance to standards", "portability", "maintainability", "functionality", and "clarity". Our solution was to create just one "non-functional" defect category in our new categorization scheme. However, for some of the analysis, we had to exclude this category because it dominated the combined dataset.

Ideally, we believe defect categorization schemes should include multiple categories for non-functional defects, that such defects should be logged, and that analysis of those defect categories should be used to improve the development process. These non-functional defect categories should be used similarly to the way functional defect data is used. This would be our recommendation for future defect data collection efforts. However, in the absence of a consistent way of treating such defects in the historical datasets, we cannot at this time perform such a detailed analysis.

6. EVALUATION
A traditional empirical evaluation of our defect categorization scheme is not possible because there is no known "ground truth" or representation of "correctness" against which to compare our results. Therefore, we must use other criteria to judge our scheme.

Freimut et al. [1] present a set of criteria that can be used to judge defect categorization schemes, including:
• Clear meanings and definitions for all categories defined;
• Examples given for each category;
• A complete and orthogonal scheme;
• A small number of choices (between 5 and 9);
• Allows useful analysis of defect data.

One way to determine whether or not a categorization scheme has been made sufficiently clear, through its definitions and examples, is to observe it being used by different people to classify the same set of defects, and then to compare their results. In our project, we did not have new defects available to classify, but were re-classifying existing defects, for which we had no information except their original classifications. Therefore, we could only perform this type of evaluation partially, by reaching decisions about how to classify defects by consensus among the members of our project team, and by continually checking each other's classifications. We also reviewed our category definitions extensively, including with personnel outside our project team. It should be noted that, because our defect categorization scheme includes definitions and examples, it is automatically an improvement over some of the historical categorization schemes it replaces.
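Where independent classifications of the same defects are available, the comparison mentioned above can be quantified. A minimal sketch follows, assuming two raters' labels are held in parallel lists; it computes simple percent agreement and Cohen's kappa, neither of which is prescribed by our scheme.

```python
from collections import Counter

def agreement(rater_a, rater_b):
    """Percent agreement and Cohen's kappa for two raters' category labels."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

# Hypothetical labels for six defects classified independently by two people.
a = ["data", "logic", "other", "checking", "data", "logic"]
b = ["data", "logic", "data",  "checking", "data", "external interface"]
print(agreement(a, b))   # about 0.67 raw agreement, kappa about 0.56
```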
We placed a special emphasis on making the defect categorization scheme both complete and orthogonal. Lack of orthogonality was a particular problem, as noted earlier, in translating our data from the historical defect categories, so the orthogonality problem was made clear to us early on. While any defect categorization scheme can be made complete, by definition, by including an "other" category, and we did include such a category, we also attempted to minimize the number of defects classified as "other". We were somewhat limited in the extent to which we could do this, as many defects in our datasets were already classified as "other", without further information.

The size of our defect categorization scheme, in terms of the number of categories, comes close to the 5-9 range recommended by Freimut et al. [1]. We have 7 categories for classifying requirements and test plan defects, but 10 for classifying code defects.

For our purposes, the most important criterion by which to judge our categorization scheme is the degree to which it enables useful analysis. This is, in fact, the major motivation behind developing the scheme. We believe that we have already demonstrated useful analysis results, e.g., as shown in Figure 1, that are of direct utility to future NASA development projects. In addition, our future research plans include the development of more sophisticated models of inspection effectiveness based on this historical data. Such models could not even be contemplated without the defect categorization scheme we have developed.

Further evaluation of our scheme will occur naturally as we use it to classify new defects (and determine whether it can be used consistently by multiple people) and to incorporate heterogeneous data into our models.

7. CONCLUSIONS
This paper describes our experience working with a variety of defect categorization schemes in an effort to map diverse historical data into a consistent set of defect categories. This was part of a larger project aiming to build a tool and an experience base to encapsulate historical data. The larger project was also concerned with using the data to model relationships between inspection factors (e.g., number of inspectors) and inspection outcomes (e.g., numbers and types of defects found). We encountered a number of obstacles in this process, which we have outlined along with our solutions.

One result of our work is a general process that we have abstracted from our experience, shown in Figure 3. Although it is based on our own limited experience in one project in one large organization, we believe it can be useful to others faced with similar problems and objectives concerning historical data. The process involves iterating through each historical dataset, and through each defect category in each of those datasets. Each defect category is compared to the current set of categories that have been chosen thus far for inclusion in the new combined defect categorization scheme (which, in our case, initially consisted of the ODC scheme). For each defect category, a determination must be made as to whether it already exists in the set of categories chosen thus far (possibly under a different name), whether it is a subcategory of an existing category (i.e., the defects can be subsumed into a broader category), whether it represents an overlap between several existing categories (i.e., the defects must be distributed among these several existing categories), or whether it constitutes an entirely new category that must be added as is to the new set of defect categories. This distribution and re-categorization of the old data takes place in the shaded part of the flow chart (i.e., the inner loop).

Figure 3. Defect category mapping process

We also cull from our experience a number of recommendations, for those conducting similar analyses as well as for those defining and collecting primary data. These recommendations are summarized below.

When designing a new defect categorization scheme, we recommend that one coherent dimension of interest be chosen, so that the categories defined will be mutually exclusive. The categories should also be chosen and named so as to make sense from the point of view of the person examining the work product in question, not the developer or user of the work product. The categories should be applicable using only the information available at the time the product is being inspected. This means, for example, that requirements document defects should be able to be categorized without knowledge of the design or code structure. Of course, it is essential to carefully document the definitions of all category names, even (or especially) those that appear obvious. While it makes sense to include an "other" category, it should be reserved for truly unique and rarely occurring types of defects.

When categorizing defect data, once the categories have been defined, it is important to apply quality assurance checks on the data, and to document the quality assurance procedures that have been used. Data owners should also continually monitor the "other" category. When it becomes large, it should be examined to determine whether there are significant categories of defects classified as "other" that should constitute a new category.

Historical data is indispensable for creating models of software development factors and for capturing knowledge that can be reused to improve current and future practice. Historical data is also often rare, so opportunities to use it must not be wasted. Inconsistent categorization schemes used in different datasets are often an obstacle to making use of historical data. If the inconsistencies cannot be resolved, then the potential usefulness of the data is wasted. This paper provides some strategies for overcoming this obstacle, which we hope will make future efforts at leveraging historical data more successful.
8. ACKNOWLEDGMENTS
This work was sponsored by NASA grant NNG05GE77G, "Full-Lifecycle Defect Management Assessment." The authors also wish to thank our collaborators at NASA for helping us to compile this dataset from across many projects and Centers, without which this work would not have been possible.

9. REFERENCES
[1] B. Freimut, C. Denger, and M. Ketterer, 2005. An Industrial Case Study of Implementing and Validating Defect Classification for Process Improvement and Quality Management. In Proceedings of the 11th IEEE International Software Metrics Symposium (METRICS 2005) (Como, Italy, 2005).
[2] G. Vijayaraghavan and C. Kaner, 2003. Bug Taxonomies: Use Them to Generate Better Tests. In the Software Testing Analysis & Review Conference (STAR) East (Orlando, Florida, USA, 2003).
[3] R. Chillarege, I. S. Bhandari, J. K. Chaar, M. J. Halliday, D. S. Moebus, B. K. Ray, and M.-Y. Wong, 1992. Orthogonal Defect Classification - A Concept for In-Process Measurements. IEEE Transactions on Software Engineering, vol. 18, pp. 943-956.
[4] M. Li and J. Tian, 2007. Web error classification and analysis for reliability improvement. Journal of Systems and Software, vol. 80, pp. 795-804.
[5] R. Chillarege, W.-L. Kao, and R. G. Condit, 1991. Defect Type and its Impact on the Growth Curve. In Proceedings of the 13th International Conference on Software Engineering (Austin, Texas, USA, 1991).
[6] M. Leszak, D. E. Perry, and D. Stoll, 2002. Classification and evaluation of defects in a project retrospective. The Journal of Systems and Software, vol. 61, pp. 173-187.
[7] M. Li and J. Tian, 2003. Analyzing errors and referral pairs to characterize common problems and improve web reliability. In Proceedings of the 3rd International Conference on Web Engineering (Oviedo, Spain, 2003).
[8] M. Li, L. Zhao, and J. Tian, 2003. Analyzing web logs to identify common errors and improve web reliability. In Proceedings of the IADIS International Conference on e-Society (Lisbon, Portugal, 2003).
[9] S. Sampath, S. Sprenkle, E. Gibson, L. Pollock, and A. S. Greenwald, 2007. Applying Concept Analysis to User-Session-Based Testing of Web Applications. IEEE Transactions on Software Engineering, vol. 33, pp. 643-657.
[10] S. Elbaum, G. Rothermel, S. Karre, and M. Fisher II, 2005. Leveraging User-Session Data to Support Web Application Testing. IEEE Transactions on Software Engineering, vol. 31, pp. 1-16.
[11] A. P. Nikora, 1998. Software System Defect Content Prediction From Development Process And Product Characteristics. Doctoral Dissertation, Dept. of Computer Science, University of Southern California.
[12] A. P. Nikora and J. C. Munson, 1998. Software Evolution and the Fault Process. In Proceedings of the 23rd Annual Software Engineering Workshop, NASA Goddard Space Flight Center (Greenbelt, Maryland, USA, 1998).
[13] L. Mariani, 2003. A Fault Taxonomy for Component-Based Software. International Workshop on Test and Analysis of Component-Based Systems, vol. 83, pp. 55-65.
[14] T. Aslam and E. H. Spafford, 1996. Use of a taxonomy of security faults. In the 19th National Information Systems Security Conference (Baltimore, Maryland, USA, 1996).
[15] C. E. Landwehr, A. R. Bull, J. P. McDermott, and W. S. Choi, 1994. A taxonomy of computer program security flaws. ACM Computing Surveys, vol. 26.
[16] S. Weber, P. A. Karger, and A. Paradkar, 2005. A software flaw taxonomy: aiming tools at security. ACM SIGSOFT Software Engineering Notes, vol. 30, pp. 1-7.