Politecnico di Milano | Dipartimento di Elettronica e Informazione
Measurement Driven Testing Process: Further Empirical Studies
Giovanni Denaro, Sandro Morasca, Mauro Pezzè
Internal report n. 2001.52
June 2001
Measurement Driven Testing Process: Further Empirical Studies
Giovanni Denaro, Sandro Morasca, Mauro Pezzè
June 26, 2001
1. Introduction
Demand for quality in software applications has grown rapidly during the last years, and testing-related issues have become more and more crucial for software producers. Testing activities require a huge amount of resources, and thus it is very important to be able to steer the testing process in the right directions. An important step towards the improvement of the testing process is the ability to estimate software fault-proneness, i.e., to estimate to what extent a software module is expected to be faulty at given moments. A realistic estimation of software fault-proneness before testing allows the testing activities to be better focused and can improve cost estimation. Estimation of software fault-proneness after a testing session can provide feedback on the testing activity and help in defining delivery and maintenance procedures.

Software fault-proneness, defined as the probability of the presence of faults in the software, cannot be directly measured on software [FN99]. However, fault-proneness can be estimated based on software attributes directly measurable on software, if relations are found between these attributes and fault-proneness. Research in this field has been directed in two main directions: the definition of metrics to capture software complexity and testing thoroughness, and the identification and experimentation of models that relate software metrics to fault-proneness.

Several software metrics have been proposed. These metrics characterize software both statically and dynamically. Static metrics measure characteristics of the code structure. Well-known static metrics are cyclomatic complexity [McC76], number of lines of code, number of operators and operands [Hal77], and number of knots [WHH79]. Dynamic metrics measure the thoroughness of testing. Common dynamic metrics are based on structural and data-flow coverage [Mye79, FW93]. The existence of a correlation between software metrics and fault-proneness, as well as many other non-directly measurable software attributes, has been empirically proved by many authors [GK91, BH83, HFGO94, SKV94, LC87, KAKG96, SB91, RAv96, LPR98, FI98]. Unfortunately, the attempts at defining models for computing fault-proneness indicators based on software metrics did not succeed in producing convincing general models.
Instead, encouraging results have been achieved by building prediction models based on sets of metrics tailored to specific application domains or processes. Rather than finding a general solution to the problem, these works aim at investigating how to find specific solutions based on the available domain knowledge. To this end, many methods have been explored, based on machine learning principles such as decision trees [SP88, PS90] or neural networks [KLP94], probabilistic approaches such as Bayesian Belief Networks [FN99], statistical techniques such as discriminant analysis [MK92] and regression [MR00], or mixed techniques such as optimized set reduction [BBT92]. Some of the proposed methods give only a discrete characterization of potentially faulty modules, i.e., they classify modules as fault-prone or non-fault-prone, while others, and logistic regression in particular, produce a continuous fault-proneness indicator that allows modules to be sorted according to their fault-proneness. We argue that a continuous fault-proneness indicator can be more valuable for planning test and analysis activities, since it provides better support for defining testing and analysis strategies.

Previous works on logistic regression provide a preliminary set of data for evaluating the approach [KM90, MR00]. These works select valid models using methods that require large sets of experimental data. In particular, they use classic data splitting techniques that partition the available data into two sets, used for building and evaluating the model, respectively. This method is sound and can be used to prove the validity of the approach. However, in many practical cases, partitioning the amount of available data may prevent statistical significance of the results.

This paper exploits logistic regression [HL89] as a technique for correlating module fault-proneness to software metrics. In particular, it proposes and evaluates a new method for computing the quality of prediction of the models, discusses test strategies that can take advantage of the produced models for tuning the testing processes, and provides additional experimental data supporting the validity of logistic regression for this purpose. We propose cross-validation [WF00] as the technique for evaluating the quality of prediction provided by the models. Cross-validation allows the use of all available data for building the model, while evaluating its quality of prediction. Cross-validation can thus provide statistically relevant results also in the presence of relatively small amounts of data. We conducted our experiments on an industrial-size application that has been used as a benchmark in other scientific works [PCM96, FI98, KPR00]. We use Alberg diagrams [OA96] to show the advantages of tuning the testing process based on the produced models.

The work described in this paper is part of an ongoing larger project that studies relationships and dependencies among the statically measurable attributes of software modules (i.e., static metrics), testing thoroughness (i.e., testing coverage) and fault-proneness, before and after testing, of the corresponding software. In this paper, we investigate the possibility of finding valid prediction models for specific software application domains. Other experiments to validate the suitability of the models to correctly estimate the fault-proneness of new software are in progress on new classes of software.
The research also aims at producing a tuning mechanism to be used to adapt the models to the evolution of software within the same class of products and thus maintain their validity.
In the rest of the paper, Section 2 gives the technical background necessary to understand the presentation. Section 3 defines the hypotheses and settings of the experiments we have been performing, and Section 4 reports on the results. Section 5 discusses similarities and differences with related works. Finally, Section 6 draws some conclusions and briefly describes the global project this work is part of.

2. Logistic Regression and Cross-Validation
This section briefly presents logistic regression and cross-validation, which we use for computing and evaluating fault-proneness models. Detailed presentations of logistic regression and cross-validation can be found in [HL89] and [WF00]. A logistic regression model estimates the probability that an object belongs to a specific class, based on the values of some numerically quantified attributes of the object. The variable describing the classes to be estimated is called the dependent variable of the model. The variables that quantify the object attributes are called the explanatory (or independent) variables of the model. For example, in our experiments, we instantiate logistic regression as follows:
the objects to be classified are software modules;
the explanatory variables are software metrics, which quantify the attributes of the software modules;
the classes that classify modules are faulty and non-faulty; the dependent variable of the models is a variable Y which assumes value 1 for faulty modules and 0 for non-faulty ones.
A logistic regression model having Y as dependent variable estimates the probability of the event [Y = 1] as defined by the following equations:

p(Y = 1) = π(Linear) = e^Linear / (1 + e^Linear)

Linear = C0 + C1 X1 + C2 X2 + ... + Cn Xn
where X1, X2, ..., Xn are n explanatory variables and the coefficients C0, C1, ..., Cn identify a linear combination (Linear) of them. Logistic regression is based on maximum likelihood and does not assume any strict functional form to link the explanatory variables and the probability function. Instead, this functional correspondence has a flexible shape that can adjust itself to several different situations. The regression coefficients C1, C2, ..., Cn are estimated by optimizing the likelihood function on the set of available observations (which have to be independent by hypothesis). In our experiments, the observations are vectors that collect the metrics and the value of Y for software modules. Observations are based on historical data. For each traced module, the value of Y is 1 if at least one fault is historically known and 0 otherwise. The metrics are directly measured by means of commercial and prototype tools. We rely on the approximation that the subject application is stable enough to have no faults but the historically known ones [PCM96].

A logistic regression model is preliminarily assessed based on the following statistics:
pi, the statistical significance of each estimated coefficient, which provides the probability that the corresponding coefficient Ci is different from zero by chance. In other words, pi measures the probability that we erroneously believe that the metric Xi has an impact on the estimated probability that a module is faulty. A significance threshold of 0.05 is commonly used to assess whether an explanatory variable is a significant predictor;
R2, the goodness of fit, which provides the percentage of uncertainty that is attributed to the model fit. This coefficient is not to be confused with the least-square regression R2: they are built upon very different formulae, even though they both range between 0 and 1. The higher the R2, the higher the effect of the explanatory variables and the more accurate the model. As opposed to the R2 of least-square regression, high values of R2 are rare for logistic regression, for technical reasons. Values of R2 greater than 0.3 can be considered good in the context of logistic regression.
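As an illustration of the model form and the two assessment statistics just described, the following sketch fits a logistic regression model on synthetic data using Python with statsmodels. The dataset, metric names and coefficient values are hypothetical placeholders, not the ones used in our experiments; the pseudo R2 reported by statsmodels (McFadden's) is one common definition and may differ from the R2 used in this paper.

```python
# Sketch (not the paper's tooling): fitting a logistic regression
# fault-proneness model and reading the assessment statistics.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_modules = 136  # same size as the dataset used in this paper

# Hypothetical explanatory variables (software metrics) for each module.
data = pd.DataFrame({
    "eLOC": rng.integers(10, 500, n_modules),
    "Comments": rng.integers(0, 200, n_modules),
    "FP": rng.integers(0, 10, n_modules),
})
# Dependent variable Y: 1 if the module contains at least one known fault.
linear = -4.0 + 0.01 * data["eLOC"] - 0.005 * data["Comments"] + 0.2 * data["FP"]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-linear)))

X = sm.add_constant(data)           # adds the C0 intercept term
model = sm.Logit(y, X).fit(disp=0)  # maximum-likelihood estimation

print(model.params)     # estimated coefficients C0, C1, ..., Cn
print(model.pvalues)    # significance p_i of each coefficient (0.05 threshold)
print(model.prsquared)  # McFadden pseudo R2 (one possible goodness-of-fit index)
```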
A logistic regression model can be evaluated by applying the model to a new set of observations (test set) for which we know the actual value of the dependent variable. Once a threshold is fixed, we can classify as faulty (resp. non-faulty) the modules with a fault-proneness value higher (resp. lower) than the threshold. The quality of prediction of the model is measured as the success rate in classifying the faulty and non-faulty modules in the test set. This evaluation method is known as data splitting because it assumes that part of the observations is used as a test set. Unlike data splitting, cross-validation assesses the quality of prediction referring to the same data used to build the model, and thus produces statistically valid indicators of quality also for relatively small amounts of data. Given a set of explanatory variables, the cross-validation of the corresponding logistic regression model consists of the following steps:
partition the observations into n approximately equal-size sets; compute n logistic regression models for the given explanatory variables. Each model is computed based on the observations belonging to n - 1 sets of the partition, while the observations belonging to the remaining set are used as test set;
cumulate the results achieved for the n test sets in order to measure the success rate in classifying the faulty and non-faulty modules of the whole initial data set;
iterate the procedure a number of times using different partitions and average the obtained results.
In common practice, the observations are partitioned into 10 sets and the partitions are stratified, i.e., in each partition the distribution of the objects belonging to each class is approximately the same as in the whole dataset (stratified ten-fold cross-validation). Stratified ten-fold cross-validation is used throughout this work for evaluating the quality of prediction provided by logistic regression models.
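A compact sketch of this procedure follows, assuming scikit-learn is available (the tooling actually used in our experiments is not specified here). X stands for the modules-by-metrics matrix and y for the faulty/non-faulty indicator built from historical data; the threshold and the repetition over different partitions follow the description above.

```python
# Sketch: stratified ten-fold cross-validation of a logistic regression
# model, repeated over several different partitions of the observations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, n_splits=10, n_repeats=5, threshold=0.5, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    success_rates = []
    for r in range(n_repeats):  # iterate with different partitions
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + r)
        correct = 0
        for train_idx, test_idx in skf.split(X, y):
            # build the model on n - 1 folds ...
            clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
            # ... and classify the held-out fold with the fixed threshold
            p = clf.predict_proba(X[test_idx])[:, 1]
            correct += np.sum((p > threshold).astype(int) == y[test_idx])
        success_rates.append(correct / len(y))  # cumulate over all folds
    return float(np.mean(success_rates))        # average over the repetitions
```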
3. The Experiment
The data analysis we carried out was intended to address a number of research questions about the existence of suitable fault-proneness models and the capability of logistic regression to identify such models. These questions include:
Is it possible to identify suitable fault-proneness models valid for specific classes of software products and software environments? I.e., are we able to find a correlation between software metrics and the probability that the software is faulty, based on the knowledge of software process historical data?
Is automatic generation of logistic regression models a feasible approach to identify fault-proneness models? I.e., can we proceed in an automatic way to generate sets of logistic regression models and then choose the best performing ones?
How do highly statistically significant models perform for prediction purposes? I.e., what is the quality of prediction, computed by means of cross-validation, of the most statistically significant models?
To what extent are the models that best perform for prediction purposes statistically significant? I.e., if we select the sets of metrics for the models according to the goodness of the cross-validation results, to what extent are these models statistically significant?
How can the identified models help in planning the testing strategy? I.e., if we consider the yielded probability that modules are faulty, how much can we benefit from planning the testing accordingly?
The experiments were conducted on a program derived from an antenna configuration system developed by professional programmers. This software system, which is greater than 10,000 lines of code, provides a language-oriented user interface for the configuration of antenna arrays. Users enter a high-level description and the program computes the antenna orientations. The availability of detailed data on the faults revealed during testing makes the program extremely useful for studies on fault-proneness. The same program has been previously used in the experiments described in [PCM96, FI98, KPR00] and thus it is a well-recognized benchmark for scientific experiments. The dataset contains 136 modules, of which 109 do not contain faults and 27 contain at least one fault. The term module is used to refer to subprograms, as already done in [FI98, SP88, PS90]. The total number of documented faults is 39. For each module in the dataset we considered 33
different software metrics collected with commercial and prototype tools. Appendix A lists these metrics for the sake of completeness.

In the data analysis, we built models that classify modules as either faulty or non-faulty, aiming at estimating software fault-proneness before testing, based on the values obtained for the static metrics. We associated to each module in the dataset a variable Y that assumes value 1 if the module contains at least one fault and value 0 otherwise. Considering Y as the dependent variable, we ran logistic regression with different sets of software metrics as explanatory variables. In particular, we considered all the possible sets of metrics, with cardinality ranging between 1 and 6, selectable among the 33 available software metrics. This yielded more than 1,300,000 models (the reader can easily verify the actual number as the sum of the first six binomial coefficients on a total of 33 elements). Since models with too many explanatory variables lead to less credible increments in the model goodness of fit (in practice, when the number of explanatory variables reaches the number of observations, the R2 of the model becomes 1 by construction), the threshold of 6 metrics seems to us a good compromise between the number of models to be computed and the ability of interpreting the models.

Once all the models were automatically computed, only the ones with good statistical significance of the explanatory variables (all pi < 0.05) and acceptable goodness of fit (R2 > 0.3) have been considered for further analysis. For such models, we evaluated the quality of prediction they provide by means of cross-validation. The fault-proneness threshold for classifying modules as faulty or non-faulty was fixed to 0.5 (50% probability). In particular, the following three parameters have been considered to evaluate the success rate of a model:

1. Overall completeness: the proportion of modules that have been classified correctly. This parameter measures the ability of a model to correctly classify the target software modules, regardless of the category in which the modules have been classified.

2. Faulty module completeness: the proportion of faulty modules that have been correctly classified as faulty. This parameter measures the help that a model provides in identifying the most fault-prone part of the modeled software system.

3. Faulty module correctness: the proportion of modules identified as faulty that were indeed faulty. This parameter measures the efficiency of a model, in terms of the percentage of actually faulty modules among the ones that are candidates for further verification.

The second and third of these parameters focus on evaluating the performance of models with respect to the identification of faulty modules. As already noticed by other authors [MR00], faulty modules can be considered more relevant than non-faulty ones as far as verification issues are concerned, and in software engineering in general. The early identification of faulty modules is likely to augment the efficacy of the testing process and to improve the overall quality of the delivered software, which are the aims of this study.
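The following sketch (illustrative only, not the scripts used for the experiments) shows how the three success-rate parameters above can be computed from cross-validated fault-proneness estimates with the 0.5 threshold, and verifies the number of candidate models as the sum of the first six binomial coefficients on 33 elements.

```python
# Sketch: evaluation parameters for a fault-proneness model, plus a check
# of the number of candidate metric subsets of size 1..6 out of 33 metrics.
import math
import numpy as np

def evaluation_parameters(y_true, p_estimated, threshold=0.5):
    # assumes at least one faulty module and at least one module predicted faulty
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(p_estimated) > threshold).astype(int)
    overall_completeness = np.mean(y_pred == y_true)         # all modules classified correctly
    faulty_completeness = np.mean(y_pred[y_true == 1] == 1)  # faulty modules actually found
    faulty_correctness = np.mean(y_true[y_pred == 1] == 1)   # "faulty" predictions that were right
    return overall_completeness, faulty_completeness, faulty_correctness

n_models = sum(math.comb(33, k) for k in range(1, 7))
print(n_models)  # 1,391,841 -- "more than 1,300,000 models"
```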
Number of expl. variables    Number of models    Best R2
1                                      33          0.1735
2                                     528          0.2692
3                                   5,456          0.3241
4                                  40,920          0.4314
5                                 237,336          0.4884
6                               1,107,568          0.5270

Table 1: Summary of the computed prior testing fault-proneness models

4. Characterizing Prior Testing Fault-Proneness
This section discusses the results of the experiments we have been performing. Table 1 presents a summary of the models for estimating prior testing fault-proneness that have been computed during the experiments. In this table, the number of models (column 2) is related to the number of explanatory variables on which the models are based (column 1). Column 3 reports the best values of goodness of fit (R2) for the models in each set. It can be noticed that good values of R2 can be achieved by building models based on more than two explanatory variables.

Since several models with good significance indices exist, a strategy is needed for selecting the most suitable ones, i.e., the candidates for being used on future projects. Four possible selection strategies have been investigated. First, one can select the models that present the best statistical significance, i.e., the highest R2. Second, one can select the models that perform best in terms of quality of prediction. In this latter case, the selection strategy varies according to how the quality of prediction is measured, i.e., based on either the overall completeness, the faulty module completeness or the faulty module correctness. For testing such selection strategies, we have pruned from the set of computed models the ones with R2 less than 0.3 or based on at least one explanatory metric with associated pi greater than 0.05. During the selection, R2 was used as discriminator whenever the applied strategy selected more than one model.

Figure 1 plots the parameters of the four selected models against the selection strategy that has been applied. Plots a, b, c and d of this figure report, respectively, the value of R2, the overall completeness parameter, and the faulty module completeness and correctness parameters, for each selected model. It can be noticed that all the strategies select models with values of R2 that are good for logistic regression. In the context of this experiment, and without claims of generality, the strategy based on the overall completeness seems to be the winning one. The selected model has a very good value of R2. As far as faulty module completeness is concerned, it performs as well as the model selected by the strategy based on the faulty module completeness. Finally, it is a good second in the faulty module correctness plot, just after the model selected by the corresponding strategy (which is the best by construction). The strategies based on the best R2 and on the overall completeness yield the following two models, respectively:
Figure 1: Parameters of the best models selected according to different strategies. Panels (a)-(d) report, for the models selected by the best squared R, best overall completeness, best faulty module correctness, and best faulty module completeness strategies: (a) squared R, between 0.3685 and 0.5272; (b) overall completeness, between 86.03% and 88.97%; (c) faulty module completeness, between 44.44% and 62.96%; (d) faulty module correctness, between 70.00% and 80.00%.
selection strategy: best R2
number of explanatory variables: 6
R2 = 0.5272
p(Y = 1) = π(Linear) = e^Linear / (1 + e^Linear)
Linear = 4.63 + 0.82 eLOC + 0.44 Comments - 0.48 Lines - 0.44 FP + 0.18 LFC + 0.33 EXEC

selection strategy: best overall completeness
number of explanatory variables: 5
R2 = 0.4795
p(Y = 1) = π(Linear) = e^Linear / (1 + e^Linear)
Linear = 3.23 + 0.36 eLOC + 0.37 Comments - 0.34 Lines - 0.38 IC + 0.95 NION
As for model interpretation, the metrics on which these models are based quantify the following characteristics of a software module:

Size: the metrics eLOC, i.e., the number of lines of code without considering blank lines, comment lines and lines containing only brackets, and Lines, i.e., the total number of lines of code, quantify the size characteristic of a module. The cumulative impact of these metrics is positive in both models. This seems to confirm the intuitive belief that the likelihood of having a fault increases with the size of the software code.

Comments: it appears that the number of comment lines in a module does influence the likelihood of having a fault. The impact is in the expected direction, i.e., the higher the number of comment lines the lower the likelihood of having a fault. We interpret this variable as an indicator of the pressure during development: in a controlled development environment, a low number of comments usually denotes high pressure on the programmers. However, further investigations will be required to explain this effect in more detail.

Interface complexity: the metric FP, i.e., the number of formal parameters that have to be passed when the module is invoked, and IC, i.e., the interface complexity computed as the sum of the fan-in and fan-out of a module, quantify the complexity of the interface of a module. Modules with more complex interfaces also tend to be more specialized, less likely to be reusable, and more fault-prone.

Internal complexity: the metrics LFC, i.e., the logic flow complexity, EXEC, i.e., the number of executable instructions, and NION, i.e., the number of input and output nodes, quantify structural characteristics of a module. This suggests that measures of the structure of a module, in terms of testable paths or number of control statements, provide a meaningful way to predict the existence of faults in the software code.
Although we have already discussed the use of these models for classification purposes, i.e., for classifying modules as faulty or non-faulty, it is also possible to directly use the continuous values they provide. Logistic regression models provide estimates of the probability that modules in the system are faulty. In fact, the higher the probability of being faulty, the more critical the module and the more testing resources should be allocated. Alberg diagrams [OA96] plot the cumulative percentage of faults of the modules in the dataset, sorted according to specific criteria. Figure 2 (a) shows this diagram for the two models we have discussed. In this figure, the highest solid line represents the cumulative percentage of faults against given amounts of modules, taken according to the decreasing number of faults they contain. This line represents the place to be for the models, i.e., the best sorting they could ever provide on the set of system modules. The dashed and thin solid lines are obtained by sorting the modules according to the probabilities provided by the two models, respectively the ones selected by the best R2 and the best overall completeness strategies. Since we aim at early identification of the most fault-prone part of the system, the left part of the diagram is the most important one. Figure 2 (b) shows the same Alberg diagram limited to the first 30 modules of the system. It can be noticed that the two models have very similar performances, with the overall completeness based model winning for smaller amounts of modules (up to 15) and the R2 based model performing well for bigger sets. This suggests that the former model could be useful when, due to lack of time, only few modules can be selected for further verification.

Figure 2: Alberg diagram. The cumulative percentage of faults (0-100%) is plotted against the number of modules (1 to 136), with modules sorted by decreasing actual number of faults and by the fault-proneness estimated by the two selected models.
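A minimal sketch of how such a diagram can be produced follows, assuming matplotlib is available; the arrays of actual fault counts and model estimates are placeholders for the dataset values, not data from this paper.

```python
# Sketch: drawing an Alberg diagram for a fault-proneness model.
import numpy as np
import matplotlib.pyplot as plt

def alberg_curve(faults, order):
    # cumulative percentage of faults when modules are taken in the given order
    faults = np.asarray(faults, dtype=float)
    return 100.0 * np.cumsum(faults[order]) / faults.sum()

def alberg_diagram(module_faults, fault_proneness):
    x = np.arange(1, len(module_faults) + 1)
    best = np.argsort(module_faults)[::-1]     # ideal: sort by actual faults
    model = np.argsort(fault_proneness)[::-1]  # sort by estimated fault-proneness
    plt.plot(x, alberg_curve(module_faults, best), label="actual faults (upper bound)")
    plt.plot(x, alberg_curve(module_faults, model), "--", label="model estimate")
    plt.xlabel("number of modules")
    plt.ylabel("faults (%)")
    plt.legend()
    plt.show()
```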
5. Related Works
Several studies address the relationships between software metrics and software fault-proneness. For a comprehensive, critical survey of the literature, the reader is referred to [FN99]. For instance, studies have tried to investigate the usability of classification trees for identifying the most fault-prone part of software systems [SP88, PS90]. Discriminant analysis and neural networks have also been investigated for the same purposes [KLP94]. Despite the good results obtained with such techniques, there are no theoretical or practical reasons for believing that the proposed models reflect the true structure of the underlying phenomenon (software fault-proneness). Furthermore, the output achievable with such techniques is intrinsically discrete: usually software modules are classified into two classes, fault-prone or non-fault-prone, while modules classified within the same class are practically indistinguishable from each other. Conversely, the logistic regression models investigated in this paper yield estimates of the probability that software modules are faulty. In this respect, the greater the estimated probability of being faulty, the more fault-prone a software module. As with the other techniques, this allows for solving two-class problems (once a threshold for the probabilities of fault-prone and non-fault-prone modules is fixed), but more articulated strategies may be applied, as we have shown.

Logistic regression has been used in other works in the field of software engineering [KAH+98]. Morasca and Ruhe [MR00] have compared logistic regression with rough set analysis, finding that logistic regression performs better than rough set analysis in identifying the most fault-prone part of a software system. In particular, our work is related to [KM90], where logistic regression is used in a systematic way for identifying
models for software fault-proneness. In that work, the authors use stepwise regression and backward elimination for finding logistic regression models. Stepwise regression (resp. backward elimination) is a model generation procedure based on adding (resp. eliminating) explanatory variables to (resp. from) the model while they significantly contribute (resp. do not contribute) to improving the model. Since it has been shown [MN89] that both stepwise regression and backward elimination can lead to suboptimal models, we preferred to use the combinatorial approach based on building all the models for fixed-size subsets of metrics. Moreover, in [KM90], in order to test the quality of prediction provided by the models, part of the data is excluded from the model generation procedure. We claim that this approach is feasible only when very large amounts of data are available, which is not generally the case for real-life software processes. To tackle this problem, we propose the cross-validation approach, which allows for evaluating the quality of prediction without excluding data from the model generation procedure.
6. Conclusions and Future Work
The technique proposed in this paper uses logistic regression and cross-validation for understanding whether a correlation exists between software metrics, testing coverage and software fault-proneness. Models built on process historical data can be used on new projects for estimating software fault-proneness and tuning the testing activities accordingly. Once a project is completed, new models can be built and compared with the former ones to maintain the models. We do not aim at providing a general model of fault-proneness, but rather a methodology for building several models, for classes of homogeneous products, that can be of practical use if included in the testing processes of groups working on specific product lines. Continuous validation of the current models on the data that become available while progressing in the development is required to limit the divergence of the models from the development practice, which may vary for many reasons, including beneficial feedback from the models themselves.

The technique proposed in this paper requires some preliminary work on historical data for building the models of software fault-proneness and tuning the testing process. However, it results in a minimal overhead on the processes themselves, being easily implemented with simple elaborations of data that can be automatically collected with commercial tools. The proposed model selection strategy based on cross-validation also makes it possible to automate the data analysis. The main requirement on the processes for the application of this technique is mature documentation, i.e., a detailed description of failures and faults, and their relation with the testing process.

The technique investigated so far aims at using logistic regression for estimating software fault-proneness before the testing phase. Fault-proneness indicators are derived for software modules based on static measures. Once instantiated, the technique will include a main phase, executed within the process, for computing fault-proneness indicators for the current project, and a tuning phase, executed after the process, for tuning the fault-proneness models according to the new data.

The technique is being extended to provide indicators of the thoroughness of the
testing as well. Such indicators are built by investigating the correlation between software metrics and the probability of revealing faults once a specific coverage has been achieved. We propose to measure the ratio between the test cases that exercise a module and reveal its fault, and all the test cases that exercise the module. This ratio is used as a measure of the probability of revealing a fault once the module has been covered. We are currently investigating whether a correlation exists between static metrics and this ratio. Such a correlation, for specific classes of faults, may be used to estimate the degree of assurance provided by a given coverage for the considered class of faults. This work is partly related to some works concerned with testability [VM93, MMN+92].

Up to now, we have performed some preliminary experiments where the goal of the data analysis was to build models for estimating the probability of discovering faults in a software module once a given coverage has been achieved, e.g., once the module has been executed a given number of times. We built different versions of the software system, each of which contained a single fault. Therefore, we were able to examine each faulty module in isolation and avoid possible interactions among faults. For each version of the software system, we ran 10,000 test cases. These test cases are the same ones automatically generated for the experiments described in [FI98], and we simply relied on them. The reader interested in understanding how this test pool has been obtained may consult that reference. Once all test cases had been executed, we were able to identify, for each faulty module in the dataset, the actual ratio between the number of test cases that reveal the contained faults, i.e., result in a failure of the system, and the number of test cases that pass through the module, i.e., make the module execute. We refer to this number as the Reveal-Pass-Through ratio (RPT for short) of the module and interpret it as an estimate of the probability that a fault contained in a module gets discovered when the module has been executed just once. If we were able to estimate the RPT ratio in advance for modules that are likely to be faulty, we could use this number for estimating the probability that a module is still faulty after a number of test cases executed on it have not revealed any failure. However, even if we were only able to know in advance whether the RPT ratio associated with a module is likely to be high or low, we could accordingly plan different levels of coverage for each module of the system.

For preliminary investigations, the values of RPT achieved for the modules in the dataset have been divided into five classes: one class (corresponding to the value 0 of RPT) associated with non-faulty modules only, and four other classes, identified as four intervals in the range between 0 and 1, among which the faulty modules are uniformly divided. The idea behind building a model for a discretized RPT is that we can identify whether the static metrics are significant predictors of the degrees of fault-proneness without having to specify a fixed functional form (e.g., linear) for the dependence. To build models, we used the general formula of logistic regression, which can handle discrete dependent variables with more than two values. We also attempted to build linear regression models to estimate continuous values of RPT. We built many models considering different sets of metrics as explanatory variables, but the results of this analysis were not satisfactory.
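The following sketch illustrates the RPT computation and its discretization on a hypothetical data layout; it is not the scripts used in our experiments, and the choice of quantile boundaries for the four intervals is an assumption, since the exact interval construction is not spelled out above.

```python
# Sketch: Reveal-Pass-Through (RPT) ratio per module from test execution
# counts, and discretization into the five classes described in the text.
import numpy as np

def rpt_ratio(executed, revealed):
    """executed[i]: test cases that exercise module i;
    revealed[i]: those test cases that reveal the module's fault."""
    executed = np.asarray(executed, dtype=float)
    revealed = np.asarray(revealed, dtype=float)
    return np.where(executed > 0, revealed / executed, 0.0)

def rpt_class(rpt, is_faulty):
    """Class 0 for non-faulty modules; classes 1..4 are four intervals of (0, 1]
    chosen so that the faulty modules are roughly uniformly divided (assumed
    here to be quartiles of the faulty modules' RPT values)."""
    rpt = np.asarray(rpt)
    faulty = np.asarray(is_faulty, dtype=bool)
    classes = np.zeros(len(rpt), dtype=int)
    q = np.quantile(rpt[faulty], [0.25, 0.5, 0.75])
    classes[faulty] = 1 + np.searchsorted(q, rpt[faulty], side="left")
    return classes
```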
As an example, one model we obtained contains 13 independent variables, which include those used in the above model for the faulty/non-faulty case. This seems to confirm that the four dimensions of size, comments, interface complexity, and internal complexity are influential factors on fault-proneness, even at finer granularity levels. The statistical significance of the coefficients in this model is below 0.05, i.e., each coefficient can be adequately trusted. The value R2 = 0.4 we obtained is quite good in the context of logistic regression, but because of the number of metrics involved, other experiments are needed to assess the physical role that each metric plays in the model. With linear regression, we built a model that includes a complexity metric and a coupling metric for the calls that a module makes to other modules. The value of the linear regression R2 is 0.36, which cannot be considered satisfactory in the context of linear regression, so further investigations will be required.

We do not make any claims about the generality of the results of these experiments. However, we include such results in this report because we are convinced that, even though preliminary, the underlying idea is interesting and can represent a good basis for further investigations. Furthermore, we are well aware of the current limitations of our experiments related to post-testing fault-proneness. First, the reduced number of faulty modules in the dataset (there are only 27 faulty modules in our dataset) is a threat to the statistical validity of the achieved models. Second, the low number of faults (the maximum is 2) available for the faulty modules can affect the hypothesis that RPT is an intrinsic characteristic of a module (rather than of the pair module-fault) and can thus be estimated based on static metrics. For further studies, we aim at considering case studies where several faults for each faulty module are available: this will allow us to compute the RPT ratio of each fault in isolation, and then average over the achieved values. This would result in values of RPT that depend less on the specific locality of the faults.

Acknowledgments
We would like to thank Alex Orso, Sergio Scarpati, Maurilio Zuccala and Adam Porter for their important contribution to the experiments. This research is partially supported by the European Commission within the Best Practice Project Promote.

References
[BBT92] Lionel Briand, Victor Basili, and William Thomas. A pattern recognition approach for software engineering data analysis. IEEE Transactions on Software Engineering, 18(11):931-942, November 1992.

[BH83] Victor R. Basili and David H. Hutchens. An empirical study of a syntactic complexity family. IEEE Transactions on Software Engineering, 9(6):664-672, November 1983. Special Section on Software Metrics.

[FI98] Phyllis G. Frankl and Oleg Iakounenko. Further empirical studies of test effectiveness. ACM SIGSOFT Software Engineering Notes, 23(6):153-162, November 1998. Proceedings of the ACM SIGSOFT Sixth International Symposium on the Foundations of Software Engineering.

[FN99] Norman E. Fenton and Martin Neil. A critique of software defect prediction models. IEEE Transactions on Software Engineering, 25(5):675-689, September/October 1999.

[FW93] Phyllis G. Frankl and Elaine G. Weyuker. Provable improvements on branch testing. IEEE Transactions on Software Engineering, 19(10):962-975, October 1993.

[GK91] Geoffrey K. Gill and Chris F. Kemerer. Cyclomatic complexity density and software maintenance productivity. IEEE Transactions on Software Engineering, 17(12):1284-1288, December 1991.

[Hal77] Maurice H. Halstead. Elements of Software Science. Elsevier North-Holland, New York, 1st edition, 1977.

[HFGO94] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand. Experiments on the effectiveness of dataflow- and controlflow-based test adequacy criteria. In Bruno Fadini, editor, Proceedings of the 16th International Conference on Software Engineering, pages 191-200, Sorrento, Italy, May 1994. IEEE Computer Society Press.

[HL89] D. Hosmer and S. Lemeshow. Applied Logistic Regression. Wiley-Interscience, 1989.

[KAH+98] Taghi M. Khoshgoftaar, Edward B. Allen, Robert Halstead, Gary P. Trio, and Ronald M. Flass. Using process history to predict software quality. Computer, 31(4):66-72, April 1998.

[KAKG96] Taghi M. Khoshgoftaar, Edward B. Allen, Kalai S. Kalaichelvan, and Nishith Goel. Early quality prediction: a case study in telecommunications. IEEE Software, 13(1):65-71, January 1996.

[KLP94] T. M. Khoshgoftaar, D. L. Lanning, and A. S. Pandya. A comparative study of pattern-recognition techniques for quality evaluation of telecommunications software. IEEE Journal on Selected Areas in Communications, 12(2):279-291, 1994.

[KM90] T. M. Khoshgoftaar and J. C. Munson. The lines of code metric as a predictor of program faults: a critical analysis. In Proceedings of the Fourteenth Annual International Computer Software and Applications Conference (COMPSAC 90), pages 408-413, 1990.

[KPR00] Jung-Min Kim, Adam Porter, and Gregg Rothermel. An empirical study of regression test application frequency. In Proceedings of the 22nd International Conference on Software Engineering, pages 126-135, Limerick, Ireland, June 2000.

[LC87] H. F. Li and W. K. Cheung. An empirical study of software metrics. IEEE Transactions on Software Engineering, SE-13(6):697-708, June 1987.

[LPR98] M. M. Lehman, D. E. Perry, and J. F. Ramil. Implications of evolution metrics on software maintenance. In T. M. Khoshgoftaar and K. Bennett, editors, Proceedings of the International Conference on Software Maintenance, pages 208-217. IEEE Computer Society Press, 1998.

[McC76] Thomas J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308-320, December 1976.

[MK92] J. C. Munson and T. M. Khoshgoftaar. The detection of fault-prone programs. IEEE Transactions on Software Engineering, 18(5):423-433, 1992.

[MMN+92] Keith W. Miller, Larry J. Morell, Robert E. Noonan, Stephen K. Park, David M. Nicol, Branson W. Murrill, and Jeffrey M. Voas. Estimating the probability of failure when testing reveals no failures. IEEE Transactions on Software Engineering, 18(1):33-43, January 1992.

[MN89] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, second edition, 1989.

[MR00] Sandro Morasca and Gunther Ruhe. A hybrid approach to analyze empirical software engineering data and its application to predict module fault-proneness in maintenance. The Journal of Systems and Software, 53(3):225-237, September 2000.

[Mye79] Glenford Myers. The Art of Software Testing. John Wiley and Sons, 1979.

[OA96] Niclas Ohlsson and Hans Alberg. Predicting fault-prone software modules in telephone switches. IEEE Transactions on Software Engineering, 22(12):886-894, December 1996.

[PCM96] A. Pasquini, A. N. Crespo, and P. Matrella. Sensitivity of reliability-growth models to operational profile errors vs. testing accuracy. IEEE Transactions on Reliability, 45(4):531-540, December 1996.

[PS90] Adam A. Porter and Richard W. Selby. Empirically guided software development using metric-based classification trees. IEEE Software, 7(2):46-54, March 1990.

[RAv96] Jan Rooijmans, Hans Aerts, and Michiel van Genuchten. Software quality in consumer electronics products. IEEE Software, 13(1):55-64, January 1996.

[SB91] Richard W. Selby and Victor R. Basili. Analysing error-prone system structure. IEEE Transactions on Software Engineering, 17(2):141-152, 1991.

[SKV94] George E. Stark, Louise C. Kern, and C. W. Vowell. A software metric set for program maintenance management. The Journal of Systems and Software, 24(3):239-249, March 1994.

[SP88] Richard W. Selby and Adam A. Porter. Learning from examples: Generation and evaluation of decision trees for software resource analysis. IEEE Transactions on Software Engineering, 14(12):1743-1757, December 1988. Special Issue on Artificial Intelligence in Software Applications.

[VM93] J. M. Voas and K. W. Miller. Semantic metrics for software testability. Journal of Systems and Software, 20(3):207-216, 1993.

[WF00] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 2000.

[WHH79] Martin R. Woodward, Michael A. Hennell, and David Hedley. A measure of control flow complexity in program text. IEEE Transactions on Software Engineering, 5(1):45-50, January 1979.
A. Collected Metrics

The metrics we collected for the experiments described in this paper are listed here for the sake of completeness.
LOC: lines of code that are not just comment lines or blank lines.

eLOC: effective lines of code. All lines of code that are not just comments, blanks or stand-alone braces or parentheses.

lLOC: logical lines of code. Logical lines of code represent a metric for those lines of code which form code statements, i.e., terminate with a semicolon. In the case of for loops, the control contains two semicolons, but it accounts for only one lLOC.

Comments: comment lines. Comments can occur by themselves on a physical line or be co-mingled with source code. This metric counts all physical lines that contain a comment.

Lines: total all-inclusive count of lines of code.

FP: function formal parameters. This metric provides insight into the complexity of the function's interface. Functions with a large number of input parameters are difficult to use and are subject to parameter ordering errors.

FR: function return points. Multiple function return points can be problematic. Ideally a function should have a single entry point and a single exit point. Multiple exit points can increase maintenance costs.

IC: interface complexity. This metric is comprised of the number of function parameters and the number of function return points, i.e., IC = FP + FR.

V(g): cyclomatic complexity. Cyclomatic complexity is the degree of logical branching within a function. Logical branching occurs when the "while", "for", "if", "case" and "goto" keywords appear within the function. Cyclomatic complexity is the count of these constructs.

CO: number of comparison operators.

LFC: linear flow complexity. This metric is computed as CO + V(g).

OC: operational complexity. This metric assigns weights to operations which can occur in expressions and sums the values of the weights for all the expressions in a module. This metric is complementary to V(g) since it looks at the complexity of the expressions which are being evaluated rather than the number of decision points in the method.

V'(G): enhanced cyclomatic complexity. This metric brings together the notions of cyclomatic and operational complexity, by dividing OC by V(g). This provides the average operational complexity for decision points in a method and is a strong indicator of the overall cognitive complexity of the method.

eV(G): essential complexity. This metric is a measure of the structure of testable paths in a component. High values of eV(G) usually signal unstructured implementations of control logic. eV(G) is calculated by (1) developing a directed graph that represents the control structure of the module (this graph contains only the four basic and simple structured constructs: sequence, selection, iteration and case), (2) keeping the structures which contain occurrences of unstructured constructs (break, continue, goto) and then (3) recalculating complexity on this graph. This is intended to measure the difficulty of testing, as the presence of unstructured constructs increases the needed testing effort, as well as future code modifications.

NEST: maximum number of levels. This metric counts the number of if-then or if-then-else in a nest. Logical units with a large number of nested levels are likely to be more complex.

Halstead's software science: set of metrics based on the count of operators and operands in a program.
- N1: total number of operators;
- N2: total number of operands;
- n1: number of unique operators;
- n2: number of unique operands;
- N: program length. This metric is calculated as N1 + N2 and is a general measure of program length for a given module;
- n: program vocabulary. This metric is calculated as n1 + n2 and is a measure of the number of unique operands and operators used in a module;
- V: program volume. This metric is calculated as V = N log2 n;
- D: difficulty. This metric is calculated as D = (n1/n2) * (N2/n2);
- E: effort. This metric is calculated as E = D * V.

CALLS: number of function calls. This metric is the number of calls from a module to subordinate logical units.

BRANCH: number of branching nodes.

OAC: operation argument complexity. This metric is defined as the sum of the values of each argument in each operation, according to the following correspondence: boolean=1, integer=1, char=1, real=2, array=3-4, pointer=5, struct=6-9, file=10.

NION: number of input/output nodes. This metric is a measure of the number of input/output nodes in a given module. Programming practices today state that there should be one way into a module and one way out. NION can be interpreted as a measure of the difficulty of testing the control logic of the software.

ANION: adjusted number of input/output nodes. This metric corresponds to the value of NION adjusted to behave intelligently where redundant return statements exist.

CONTROL: number of control statements. This metric measures the number of control statements (selection, iteration) in a module.

EXEC: number of executable statements. This metric measures the number of executable statements in a module.

NSTAT: number of statements. This metric measures the total number of statements in a module. It is calculated as CONTROL + EXEC.

CDENS: control density. This metric measures the richness of decision nodes in a module. High density modules are likely to be more difficult to maintain than less dense ones. Control density is the ratio of control statements to the total number of statements.
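To make the composite quantities above concrete, the following small sketch (hypothetical code, not part of the measurement tools used for the experiments) computes some of the derived metrics from the primitive counts; the Halstead difficulty and effort formulas are omitted.

```python
# Sketch: derived metrics of Appendix A expressed over the primitive counts
# (the counts themselves would come from a measurement tool).
import math
from dataclasses import dataclass

@dataclass
class ModuleCounts:
    FP: int        # formal parameters
    FR: int        # function return points
    CO: int        # comparison operators
    Vg: int        # cyclomatic complexity V(g)
    CONTROL: int   # control statements
    EXEC: int      # executable statements
    N1: int        # total operators
    N2: int        # total operands
    n1: int        # unique operators
    n2: int        # unique operands

def derived_metrics(m: ModuleCounts) -> dict:
    N = m.N1 + m.N2              # Halstead program length
    n = m.n1 + m.n2              # Halstead vocabulary
    V = N * math.log2(n)         # Halstead program volume
    NSTAT = m.CONTROL + m.EXEC   # total statements
    return {
        "IC": m.FP + m.FR,       # interface complexity
        "LFC": m.CO + m.Vg,      # linear flow complexity
        "NSTAT": NSTAT,
        "CDENS": m.CONTROL / NSTAT if NSTAT else 0.0,  # control density
        "N": N, "n": n, "V": V,
    }
```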