Multiple Imputation of Missing Values in Software Measurement Data

Taghi Khoshgoftaar, Andres Folleco, Jason Van Hulse, Lofton Bullard

The authors are with Florida Atlantic University's Department of Computer Science and Engineering.
Keywords: Data Quality, Missing Numeric Data, Multiple Imputation, Dependent/Independent Attributes, Software Metrics

Abstract — The value of knowledge inferred from information databases is critically dependent on the quality of data. We present multiple imputation as a reliable and consistent imputation technique for handling missing data in a numeric dependent variable in software metrics data sets. Experiments were conducted using multiple, mean, k-Nearest Neighbors, regression, and REPTree imputation to impute missing values in two case studies from a military command, control, and communications system (CCCS). One case study contained the original noisy continuous dependent variable representing the number of faults, nfaults, while the second case study had the dependent variable significantly cleansed of noise. All experiments were conducted with six different percentages, or levels, of simulated missing values in the dependent variable. The average and standard deviation of the imputation errors obtained from the case studies were used to measure the accuracy of each imputation technique across all missingness levels. In many cases, multiple imputation was found to have the smallest imputation error and lowest variability, particularly at higher missingness levels.
I. INTRODUCTION

Data quality is of great importance for the validity of any decision-making process. One common factor that negatively impacts data quality is missing data. Missing data can be observed in dependent as well as independent variables and occurs often in data sets originating from many sources and applications. In this study, multiple [20], mean, k-Nearest Neighbors, regression, and REPTree decision tree imputation methods were used to impute missing values in a numeric dependent variable of the CCCS software engineering metrics data set. Multiple imputation (MI) proved to be the most reliable and consistent technique in this study. Relatively little previous work has explored multiple imputation as a technique for handling missing data in the software engineering domain.

Although we do not directly address the issue of data quality in this study, it makes little sense to carry out any imputation study with data sets of unknown or suspicious quality. To our knowledge, no studies have properly addressed the quality of software engineering metrics data before embarking on an imputation analysis using data sets with incomplete or missing data. Additionally, very few studies have referenced or applied multiple imputation, initially proposed by Rubin [20], as a reliable and consistent technique for imputing missing values in software metrics data sets.

There are two types of noise identified within the domain of software engineering metrics [22]. Class noise is also known as a labeling error [4], while attribute noise represents incorrect values for the attributes of the data set. Our empirical case study used a software quality estimation data set called CCCS [14]. CCCS contained a significant amount of noise in the dependent variable nfaults. Our experiments, therefore, considered two different case studies using CCCS. The first case study used the original (i.e., noisy) values for nfaults, while the second considered a cleansed value for nfaults for some of the instances in CCCS. Even though the original CCCS data set did not have missing data, we simulated missing values in the dependent variable nfaults while retaining the true observed values a priori. In this way, we could accurately and efficiently judge the performance of the imputation techniques used in the empirical analysis.

The purpose of this study was to present multiple imputation as a reliable and consistent prediction method and to compare its performance to several other well-known imputation techniques using two CCCS case studies [10][15]. The imputation techniques selected for this study were mean, regression, k-Nearest Neighbors, REPTree, and multiple imputation. After a review of missing data mechanisms, a brief description of each imputation technique is presented below.

A. Missing Data Mechanisms

There are several well-defined and established missing data mechanisms [20]. The accuracy of an imputation technique can be directly correlated to the missing data mechanism. Whether or not the missing data can be predicted from the observed attribute values may directly affect how missing values are handled and whether they can be ignored. Little and Rubin [19] defined three such mechanisms: missing completely at random (MCAR), missing at random (MAR), and non-ignorable (NI) missing data. In this study, the missing data was artificially induced in a completely random fashion.
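To make the distinction between the first two mechanisms concrete, the following sketch (not from the study; the data, column names, and probabilities are hypothetical) generates an MCAR mask, where missingness is independent of all values, and an MAR mask, where missingness depends only on an observed attribute:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 282
sloc = rng.integers(10, 500, size=n)          # hypothetical size metric
nfaults = rng.poisson(sloc / 100.0).astype(float)

# MCAR: every instance has the same chance of losing its nfaults value,
# independent of any observed or unobserved quantity.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance of missingness depends only on an observed attribute
# (module size here), not on the missing value itself.
mar_prob = np.clip(sloc / sloc.max(), 0.02, 0.30)
mar_mask = rng.random(n) < mar_prob

y_mcar = np.where(mcar_mask, np.nan, nfaults)
y_mar = np.where(mar_mask, np.nan, nfaults)
```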
B. Imputation Techniques

In addition to multiple imputation, we selected four well-known imputation techniques for our comparative study: mean, regression, k-NN, and REPTree imputation.

1) Mean Imputation: Each missing value is imputed with the mean of the observed values for the corresponding attribute. This technique is fast and simple but underestimates the sample variance. The results obtained from mean imputation in our case study demonstrated that it is not an accurate imputation technique and should not be used for software engineering metrics data.

2) Regression Imputation: Regression imputation uses multivariate linear regression models, as explained in [6] and implemented within Weka [8]. This technique replaces missing values with values predicted from the complete data observations. One disadvantage of this technique is that it underestimates sample variances. In this study, the regression model was built using the observed data of all independent variables from CCCS. Regression imputation demonstrated good imputation results in our case study.

3) k-Nearest Neighbors: The k-NN imputation technique implemented in Weka [8] searches for the k cases most similar to each instance with missing values and replaces each missing value with the mean computed over those complete observations. k-NN is known to preserve the population sample distribution by substituting the average of several of the most similar observed values [8]. The value of the parameter k has been the subject of considerable study because it can critically affect the model's performance [16]. Jonsson [16] recommended that k should be no larger than the square root of the number of observations with complete data. Our study utilized the instance-based classifier implemented in Weka [8] to 'automatically' determine an optimal value for k from a range of possible values; Weka uses leave-one-out cross-validation [8] to determine the optimal value for k before the k-NN imputation takes place. Results obtained from experiments using the k-NN imputation technique were also satisfactory.

4) REPTree Decision Tree: As implemented in Weka [8], a decision tree is constructed by evaluating the predictor attributes against a quantitative target attribute. This building process is applied recursively to the data set, using variance reduction to derive balanced tree splits. Reduced error pruning is then used to simplify and prune the tree.

5) Multiple Imputation: The version of multiple imputation used in this study was designed by Schafer [21]. Multiple imputation was first proposed by Rubin [20] and can produce estimates that are consistent, asymptotically efficient, and asymptotically normal when the data is MAR or MCAR. Multiple imputation can be used with virtually any kind of data, and the analysis can be done with available software; input preparation and pre-processing requirements are minimal compared to other techniques. The specific implementation of multiple imputation used in this study was from SAS [9] and is based on a multivariate normal model. Although the model formally assumes multivariate normality, in practice it works very well even if some of the variables are far from normal [21]. For variables that do not have normal distributions, normalizing transformations can be used to improve the imputation quality.
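As a rough illustration of the non-MI baselines, the sketch below uses scikit-learn estimators as stand-ins for the Weka implementations used in the study (LinearRegression for regression imputation, KNeighborsRegressor for k-NN, and DecisionTreeRegressor as a loose proxy for REPTree, which additionally applies reduced-error pruning); X and y are placeholders for the CCCS metrics and nfaults:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def impute_dependent(X, y, method="regression", k=5):
    """Impute missing values of a numeric dependent variable y.

    X: (n, m) array of fully observed independent variables.
    y: (n,) float array with np.nan marking the missing values.
    """
    miss = np.isnan(y)
    y_hat = y.copy()
    if method == "mean":
        # Mean imputation: observed-case mean for every missing value.
        y_hat[miss] = np.nanmean(y)
    elif method == "regression":
        # Fit a linear model on the complete cases, predict the rest.
        model = LinearRegression().fit(X[~miss], y[~miss])
        y_hat[miss] = model.predict(X[miss])
    elif method == "knn":
        # Mean of the k most similar complete observations.
        model = KNeighborsRegressor(n_neighbors=k).fit(X[~miss], y[~miss])
        y_hat[miss] = model.predict(X[miss])
    elif method == "reptree":
        # Variance-reduction splits; no reduced-error pruning here.
        model = DecisionTreeRegressor().fit(X[~miss], y[~miss])
        y_hat[miss] = model.predict(X[miss])
    # Imputations are rounded, and negatives set to zero (Section IV-C).
    return np.maximum(np.round(y_hat), 0)
```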
Unfortunately, with the exception of multiple imputation, the other four imputation methods share a fundamental problem [1][2]: treating imputed data as if it were complete data produces standard errors that are underestimated and test statistics that are overestimated. Typical analytic methods do not adjust for the fact that the imputation process involves uncertainty about the missing values. Furthermore, the other methods have difficulties when multivariate missing data is present, as experienced during our more recent experiments. This important topic will be considered in our future research.

II. RELATED WORK

Published work on the imputation of missing software engineering metrics data is very limited. In contrast, software cost prediction studies using imputation techniques are found more frequently. For example, in [13][18], sampling-based hot-deck and other imputation techniques were used for software cost estimation. A combination of hot-deck and single imputation was presented in [16], also for cost estimation. Only a few studies, including [7], have applied model-based imputation methods in software engineering cost prediction. The methods included list-wise deletion, pairwise deletion, mean imputation, hot-deck imputation, and similar response pattern imputation, the last of which was found to be the most robust.

Strike et al. [13], in their work on building software cost estimation models with incomplete data, compared list-wise deletion, mean imputation, and eight different types of hot-deck imputation. Their results indicated that all the techniques performed well, with little bias and very good accuracy (including mean imputation). Myrtveit et al. [7] presented an empirical evaluation of imputation methods and likelihood-based methods for software cost modeling. The imputation techniques included list-wise deletion, mean imputation, similar response pattern imputation, and full information maximum likelihood (FIML). Their evaluation suggested that FIML was the only appropriate technique when the data is not missing completely at random; they further reported that the other imputation techniques generated biased results if the missing data was not MCAR.

El Emam and Birk [12] validated the emerging ISO/IEC 15504 measure of software requirements and analysis process capability. ISO/IEC 15504 is an international standard on software process assessment; it defines several software engineering processes and a flexible scale for quantifying and measuring their value. Their empirical work evaluated the predictive ability of software requirements
analysis using a hot-deck imputation method with a conditional Bayesian technique. Their method consisted of building a response propensity model followed by an approximate Bayesian multiple imputation model. The technique was interesting and could be applied to multivariate missing data.

Song et al. [18] proposed a new imputation method for effort prediction in small software data sets. They justified the use of simple imputation methods by the very small size of the data sets used in their study. Their proposed method was a combination of class mean imputation and k-NN hot-deck imputation, used to impute both continuous and nominal missing data in small data sets. Their results suggested that the proposed ensemble method outperformed both class mean imputation and k-NN imputation alone, and they therefore recommended its use for software effort prediction in very small software data sets. Twala and Cartwright [3] investigated the ensemble strategy in the context of incomplete data and the prediction of software development effort. They proposed an ensemble of Bayesian multiple imputation and nearest neighbor single imputation, and their results, on two benchmark data sets, strongly supported the proposed method.

Song et al. [17], in a short note on missingness mechanisms and common assumptions, compared class mean imputation and k-NN imputation. They generated missing data under the MCAR and MAR conditions and compared the results obtained under each paradigm. Two main conclusions were drawn from their experiments: first, class mean imputation was the preferred method because of its accuracy; second, the impact of the missingness mechanism on the accuracy of the imputation methods was not statistically significant. They also used small software data sets for their experiments on software effort prediction. Jonsson and Wohlin [16] reported an evaluation of k-NN imputation using Likert data and recommended the use of the k-NN imputation method with such data. One important parameter of concern in their work was the value of the parameter k; they suggested that a suitable value of k is approximately the square root of the number of complete cases, rounded to the nearest odd integer [16].

III. METHODOLOGY

The purpose of this study was to introduce multiple imputation and to compare its performance to several other imputation techniques for imputing a missing numeric dependent variable in software engineering metrics data. Multiple imputation is one of the least known techniques in this application domain. This section provides some basic conditions, parameters, limitations, and references related to multiple imputation and its proper usage. Schafer [21] utilized data augmentation (DA), which is based on Markov chain Monte Carlo processes, to generate proper multiple imputations. The expectation maximization (EM) algorithm [21] was used to generate initial values for the parameters required by multiple imputation. A brief overview of EM and multiple imputation is presented here; detailed explanations can be found in [2][11][19][20][21].

A. Expectation Maximization

The EM algorithm is a method for obtaining maximum-likelihood (ML) estimates when missing values are present [2]. EM consists of an Expectation (E) step and a Maximization (M) step. These steps are repeated in an iterative fashion that eventually converges to the ML estimates.
In a multivariate normal distribution environment, the E step reduces to regression imputation of the missing values [21]. If all variables in a data set had missing values in no particular pattern, the algorithm would select starting values for the unknown parameters based on the means and covariance matrix of the original data. Given starting values for the parameters, coefficients can be calculated for the regression of any of the variables. If the missing data were univariate, regression imputation based on the remaining complete variables would take place. Once the missing data have been imputed, the M step calculates new values for the means and covariance matrix using the new complete data set. Means are calculated using standard formulas, but variances and covariances require special expressions, e.g., residual variances and residual covariances based on regression [21]. Once the new estimates for the means and covariance matrix have been calculated, the E step is executed again, and the new estimates produce new regression imputations for the missing values. The E and M steps continue iteratively until the estimates converge; convergence can be declared when there is little or no difference between successive estimates. In this study, the EM algorithm was used to obtain the initial parameter values needed for the process of multiple imputation.

B. Multiple Imputation

MI was formally proposed by Rubin [20]. MI creates s imputed data sets, which can be combined with the observed data to create complete data sets. The MI recommended in our study for software engineering metrics imputation is the version in [9][21]. Imputation can be done using DA with Markov chain Monte Carlo (MCMC) methods, which inject a random error at each iteration to account for uncertainty about the imputed value. Using the notation of [21], a data set X with missing values can be split into $X_{obs}$ and $X_{mis}$, where $X_{obs}$ is the completely observed part of X and $X_{mis}$ is the missing data. By using iterative MCMC methods, MI generates s complete data sets $X_{mis}^1, \ldots, X_{mis}^s$, which, combined with $X_{obs}$, can be properly analyzed with standard methods [21].
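The following is a deliberately simplified sketch of the core MI idea — s stochastic draws per missing value, to be analyzed and later pooled — assuming univariate missingness in a dependent variable y. Unlike Schafer's DA/MCMC implementation, it draws only residual noise and omits parameter (posterior) uncertainty:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def multiple_impute(X, y, s=5, rng=None):
    """Generate s plausible draws for each missing value of y.

    A different random error term is added in each draw, so the spread
    across the s imputed data sets reflects uncertainty about the
    missing values. NOTE: a full DA/MCMC implementation would also draw
    the model parameters from their posterior; that is omitted here.
    """
    if rng is None:
        rng = np.random.default_rng()
    miss = np.isnan(y)
    model = LinearRegression().fit(X[~miss], y[~miss])
    resid_sd = np.std(y[~miss] - model.predict(X[~miss]))
    mu = model.predict(X[miss])
    draws = mu + rng.normal(0.0, resid_sd, size=(s, miss.sum()))
    return draws  # shape (s, n_missing): one row per imputed data set
```

In the study, the s values for each missing data point were ultimately averaged into a single imputed value (see Section IV).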
TABLE I
CCCS Software Metrics Description

Independent Variables:
- Unique Operators
- Total Operators
- Unique Operands
- Total Operands
- Cyclomatic Complexity
- Logical Operators
- Total Lines of Code
- Executable LOC

Dependent Variable:
- Number of Faults (nfaults)
The multiple imputation software tool [9] used in this study is based on a multivariate normal model. This model has been found to work well even if the data set is not normally distributed. The initial parameters for the mean and covariance matrix were obtained using EM. MI was executed using sequential chains to generate a sequence of complete imputed data sets $X_{mis}^i$. Sequential execution has the disadvantage that $X_{mis}^i$ and $X_{mis}^{i+1}$ may not be independent, invalidating the assumption needed to generate unbiased estimates. This problem can be avoided by allowing a large number of iterations between estimations.

In multiple imputation, missing data are imputed s > 1 times, with a different randomly chosen error term added in each imputation. Each missing value is thus replaced by a set of s plausible values drawn from its predictive distribution. The s complete imputed data sets contain the same observed values, but the missing values are filled in with different imputations that reflect the uncertainty about the missing data [21]. It is customary to combine inferences from the imputed complete data sets to obtain a predicted value for each missing attribute value [20]. In our study, the imputed values from the s complete data sets were averaged to obtain a single imputed value for each missing data point. Rubin [5][20], Dempster [2], Schafer [21], Allison [1], and others have provided excellent guidelines, references, and examples on multiple imputation, maximum likelihood, expectation maximization, data augmentation, multivariate normal models, and related topics.

IV. EMPIRICAL CASE STUDIES

A. CCCS Data Set

The CCCS data set consisted of measurements recorded for 282 software modules during the system's implementation phase, testing phase, and the first year of deployment [14]. Each instance in the CCCS data set represented a software module from a large military command, control, and communication system. The system was written in the Ada high-level language. The CCCS data set contained 8 independent variables (software metrics) along with a dependent variable labeled nfaults. This continuous attribute recorded the number of faults for each module during the system integration and test phases, including the first year of deployment. Of the 282 modules, 136 contained at least one known fault. The software metrics in CCCS utilized in this study are presented in Table I.

The first data set considered in our case study was the original CCCS data set, which contained noise in the dependent variable nfaults; this data set is denoted in the empirical case study as Original or O. The second data set consisted of CCCS cleansed of a significant amount of noise in the dependent variable nfaults; the cleansed CCCS data set is denoted Clean or C. Note that the original CCCS data set contained inherent noise in the dependent variable; in particular, noise was not injected into the data set. The interested reader can find a brief review of the methodology employed to cleanse CCCS in the Appendix (Section VI).

B. Experimental Setting

For experimental evaluation, simulated missingness was injected into the dependent variable in both Original and Clean. Our experiments considered 6 levels or percentages (5%, 10%, 15%, 20%, 30% and 40%) of simulated missing values. The number of instances with missing values was 14 at the 5% level, 28 at 10%, 42 at 15%, 56 at 20%, 85 at 30%, and 113 at 40%.
Missingness was introduced only in the dependent variable, and instances were selected randomly (i.e., MCAR). In addition, 5 different data sets were generated at each missingness level to avoid any potential bias due to the random selection process. More specifically, 5 different versions with missing values in α% of the instances were created for both Original and Clean, where α = 5%, 10%, 15%, 20%, 30% or 40%. Further, for a given missingness level and version, the same selected set of instances was used for both Original and Clean.
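A sketch of this data-set generation (the seed is illustrative) reproduces the instance counts quoted above and reuses each selection for Original and Clean:

```python
import numpy as np

LEVELS = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40]
VERSIONS = 5

def make_missing_versions(n=282, seed=0):
    """Return {(alpha, beta): indices} of instances whose nfaults value
    is set to missing. Reusing each index set for Original and Clean
    guarantees O^m_(alpha,beta) == C^m_(alpha,beta)."""
    rng = np.random.default_rng(seed)
    masks = {}
    for alpha in LEVELS:
        for beta in range(1, VERSIONS + 1):
            k = round(alpha * n)  # 14, 28, 42, 56, 85, 113 for n = 282
            masks[(alpha, beta)] = rng.choice(n, size=k, replace=False)
    return masks

masks = make_missing_versions()
assert len(masks) == 30  # x2 (Original, Clean) = 60 data sets in total
```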
TABLE II
Imputation Comparison, 5% Missing

Technique    Case      AAE     SDAE    minAE   maxAE   medianAE   P-value
k-NN         Original  2.886   4.904   0       26      1          0.46
k-NN         Clean     1.186   2.189   0       13      0          0.04
Regression   Original  2.586   5.539   0       30      1          0.39
Regression   Clean     0.729   0.700   0       2       1          0.45
Multiple     Original  2.814   4.444   0       22      1          –
Multiple     Clean     0.714   0.745   0       2       1          –
Mean         Original  4.157   7.502   0       40      2          0.10
Mean         Clean     4.100   5.746   0       27      2          0.00
REPTree      Original  3.014   5.526   0       30      1          0.41
REPTree      Clean     1.543   2.717   0       14      1          0.01
Denote the version of Original with α% randomly chosen missing instances and version β by $O_{(\alpha,\beta)}$. Further, denote the set of instances with missing values for nfaults by $O_{(\alpha,\beta)}^m$ and the set of instances with observed values for nfaults by $O_{(\alpha,\beta)}^o$. For Clean, these data sets would be denoted $C_{(\alpha,\beta)}$, $C_{(\alpha,\beta)}^m$, and $C_{(\alpha,\beta)}^o$, respectively. Of course $C_{(\alpha,\beta)} = C_{(\alpha,\beta)}^m \cup C_{(\alpha,\beta)}^o$ and $O_{(\alpha,\beta)} = O_{(\alpha,\beta)}^m \cup O_{(\alpha,\beta)}^o$. Then, for given values of α and β, $O_{(\alpha,\beta)}^m = C_{(\alpha,\beta)}^m$.

A total of 30 data sets (6 missingness levels and 5 versions per level) were generated for each of the two case studies. More specifically, the data sets used in the empirical study can be labeled $\{C_{(5\%,1)}, \ldots, C_{(5\%,5)}, \ldots, C_{(40\%,1)}, \ldots, C_{(40\%,5)}\}$ for Clean and $\{O_{(5\%,1)}, \ldots, O_{(5\%,5)}, \ldots, O_{(40\%,1)}, \ldots, O_{(40\%,5)}\}$ for Original. Therefore, a total of 60 data sets were generated with simulated missing values in nfaults.

The fact that $O_{(\alpha,\beta)}^m = C_{(\alpha,\beta)}^m$ (i.e., for given values of α and β, the same set of instances was used for both Original and Clean) was very important in our empirical study. Since the same set of instances was selected for both Clean and Original, it was possible to directly compare the imputation performance of each technique on clean and noisy data for the same observations.

As mentioned previously, each data set Original and Clean was partitioned into the instances with observed values for nfaults and the instances with missing values for nfaults. For Clean, these subsets are denoted $C^o$ and $C^m$, respectively, while for Original they are denoted $O^o$ and $O^m$ (we omit α and β here for simplicity). A model was constructed using regression, k-NN, or REPTree with the observations in either $C^o$ or $O^o$ and applied to the observations in $C^m$ or $O^m$ (depending on whether Clean or Original was used), thereby imputing the missing values. With mean imputation, missing values for the instances in $O^m$ (or $C^m$) were imputed with the average value of nfaults in $O^o$ (or $C^o$). Note that multiple imputation does not require an explicit partitioning of the data set into subsets, and hence this preprocessing step was unnecessary.

C. Imputation Accuracy

The results of the imputation experiments are presented in Tables II - VII. The first column, Technique, contains the names of the imputation methods. The second column, Case, denotes the version of the CCCS data set, i.e., Original or Clean: Original means that the original CCCS data set with noisy instances was used to generate the data sets used in the experiments, while Clean indicates the cleansed version was used. The next five columns, AAE, SDAE, minAE, maxAE, and medianAE, report the average absolute error (AE) over all 5 subsets of imputed values at each missingness level, the standard deviation of the AE, and the minimum, maximum, and median of the AE. More specifically, these error statistics are calculated over the 5 subsets, for example $C_{(\alpha,1)}^m, \ldots, C_{(\alpha,5)}^m$ for Clean with α% missingness. The last column, P-value, contains the p-values obtained by performing a t-test comparing the AAE of multiple imputation with that of each of the other four imputation techniques; since it is not sensible to compare MI with itself, the p-value in this case is shown as '–'. The AE was selected as the evaluation metric for imputation accuracy because it provides a simple and unbiased measure of the magnitude of the differences between the imputed values and the original data values.
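The paper does not spell out which t-test variant was applied; one plausible reading, sketched below with hypothetical per-version AAE values, is a two-sample t-test on the five per-version errors of MI versus a competitor:

```python
import numpy as np
from scipy import stats

# Hypothetical per-version AAE values (5 versions at one missingness
# level) for multiple imputation and one competing technique.
aae_mi = np.array([0.70, 0.74, 0.69, 0.73, 0.71])
aae_regression = np.array([0.72, 0.75, 0.71, 0.74, 0.73])

t_stat, p_value = stats.ttest_ind(aae_mi, aae_regression)
# A p-value above 0.10 means the competitor's AAE is statistically
# indistinguishable from MI's at the 10% level used in the discussion.
print(f"p = {p_value:.2f}")
```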
The rows in Tables II - VII list each imputation method and its corresponding case, Original or Clean, at each missingness level. The average absolute errors were calculated from the combination of all 5 subsets of instances with missing values at each missingness level. The average absolute error AAE was calculated as follows:

$$AAE = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i| \qquad (1)$$
where $Y_i$ is the original software metric value for instance i, $\hat{Y}_i$ is the rounded imputed value for the simulated missing dependent variable value, and n is the number of observations with missing data. Any negative imputed value $\hat{Y}_i$ was rounded to zero.
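A direct transcription of Eq. (1), including the rounding rules just stated:

```python
import numpy as np

def aae(y_true, y_imputed):
    """Average absolute error of Eq. (1): imputations are rounded to
    integers and any negative imputation is set to zero."""
    y_hat = np.maximum(np.round(y_imputed), 0)
    return np.mean(np.abs(y_true - y_hat))
```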
Table II shows the results at the 5% missingness level for all imputation methods. Regression imputation had p-values exceeding 10% on both Original and Clean, while REPTree and k-NN had p-values exceeding 10% only on Original.
TABLE III
Imputation Comparison, 10% Missing

Technique    Case      AAE     SDAE    minAE   maxAE   medianAE   P-value
k-NN         Original  1.371   2.193   0       12      1          0.48
k-NN         Clean     0.557   0.908   0       6       0          0.42
Regression   Original  1.300   2.266   0       14      1          0.42
Regression   Clean     0.529   0.800   0       4       0          0.47
Multiple     Original  1.357   2.192   0       12      1          –
Multiple     Clean     0.536   0.772   0       4       0          –
Mean         Original  2.421   2.228   0       18      2          0.00
Mean         Clean     2.179   2.497   0       18      2          0.00
REPTree      Original  1.486   2.483   0       17      1          0.32
REPTree      Clean     0.671   1.166   0       7       0          0.13
TABLE IV
Imputation Comparison, 15% Missing

Technique    Case      AAE     SDAE    minAE   maxAE   medianAE   P-value
k-NN         Original  1.962   3.151   0       23      1          0.38
k-NN         Clean     0.729   1.344   0       8       0          0.12
Regression   Original  1.919   3.575   0       31      1          0.46
Regression   Clean     0.590   0.791   0       4       0          0.45
Multiple     Original  1.948   3.147   0       23      1          –
Multiple     Clean     0.600   0.790   0       4       0          –
Mean         Original  3.024   4.563   0       40      2          0.00
Mean         Clean     2.733   3.616   0       27      2          0.00
REPTree      Original  2.086   3.869   0       26      1          0.34
REPTree      Clean     0.967   1.734   0       10      0          0.00
The AAE for regression was lower than that obtained by multiple imputation on Original (2.586 vs. 2.814) but was larger on Clean (0.729 vs. 0.714). The SDAE for multiple imputation was in all cases lower than that of the competing techniques, except when regression was used on Clean: there, the SDAE for regression was 0.700 while MI had an SDAE of 0.745.

Table III shows the results at the 10% missingness level. k-NN, regression, and REPTree had p-values greater than 10% on both Original and Clean. Of these methods, only regression had a lower AAE than multiple imputation. On Original, the AAE for regression was 1.300 while MI obtained an AAE of 1.357. On Clean, regression and MI obtained average absolute errors of 0.529 and 0.536, respectively. k-NN performed slightly worse, with an AAE of 1.371 on Original and 0.557 on Clean. The best SDAE value on both data sets was obtained by MI, although the SDAE for k-NN on Original was nearly identical.

Table IV shows the results at the 15% missingness level. Both k-NN and regression had p-values greater than 10% on both Original and Clean, while REPTree had a p-value greater than 10% on Original. The AAE obtained by regression was smaller than that of MI by 1.5% for Original and 1.7% for Clean. All other imputation methods obtained higher AAEs than MI for both Original and Clean. The best SDAE was obtained by MI for both Original and Clean, with values of 3.147 and 0.790, respectively.

Table V shows the results at the 20% missingness level. Regression obtained a slightly lower AAE than MI on both Original and Clean, although the differences were very small; MI, for example, obtained an AAE of 0.511 for Clean while the AAE for regression was 0.504. The SDAE for MI, however, was lower than that of regression by 11.0% and 2.8% for Original and Clean, respectively.

TABLE V
Imputation Comparison, 20% Missing

Technique    Case      AAE     SDAE    minAE   maxAE   medianAE   P-value
k-NN         Original  1.689   2.711   0       24      1          0.47
k-NN         Clean     0.711   1.654   0       18      0          0.03
Regression   Original  1.646   3.080   0       28      1          0.45
Regression   Clean     0.504   0.718   0       4       0          0.45
Multiple     Original  1.675   2.742   0       24      1          –
Multiple     Clean     0.511   0.698   0       3       0          –
Mean         Original  2.989   3.810   0       40      2          0.00
Mean         Clean     2.654   3.230   0       27      2          0.00
REPTree      Original  1.779   3.537   0       31      1          0.35
REPTree      Clean     0.964   2.021   0       18      0          0.00
TABLE VI
Imputation Comparison, 30% Missing

Technique    Case      AAE     SDAE    minAE   maxAE   medianAE   P-value
k-NN         Original  1.894   2.872   0       23      1          0.41
k-NN         Clean     0.706   1.176   0       10      0          0.01
Regression   Original  3.586   4.879   0       39      2          0.00
Regression   Clean     0.619   0.952   0       8       0          0.14
Multiple     Original  1.941   3.190   0       22      1          –
Multiple     Clean     0.553   0.760   0       4       0          –
Mean         Original  3.005   3.746   0       27      2          0.00
Mean         Clean     2.722   3.311   0       27      2          0.00
REPTree      Original  2.047   3.223   0       19      1          0.32
REPTree      Clean     0.706   1.176   0       10      0          0.00
TABLE VII
Imputation Comparison, 40% Missing

Technique    Case      AAE     SDAE    minAE   maxAE   medianAE   P-value
k-NN         Original  1.929   3.373   0       27      1          0.17
k-NN         Clean     0.781   1.655   0       16      0          0.00
Regression   Original  1.805   3.657   0       31      1          0.39
Regression   Clean     0.940   2.112   0       23      0          0.00
Multiple     Original  1.750   2.935   0       27      1          –
Multiple     Clean     0.540   0.758   0       4       0          –
Mean         Original  3.193   4.900   0       40      2          0.00
Mean         Clean     2.977   3.709   0       27      2          0.00
REPTree      Original  1.735   3.745   0       34      1          0.47
REPTree      Clean     0.901   1.850   0       16      0          0.00
k-NN and REPTree both produced imputation results similar to (although slightly worse than) those of MI on Original, as measured by the AAE, but both techniques performed significantly worse on the cleansed data. Therefore, only one method (regression) had a p-value greater than 10% with 20% missing data when the cleansed data set was considered.

For the higher missingness levels of 30% and 40%, shown in Tables VI and VII respectively, none of the imputation methods had p-values greater than 10% on both Original and Clean. Clearly, the trend observed at these higher missingness levels implies that the accuracy of the other techniques degraded as the missingness level increased. Using the cleansed data, only one technique obtained a p-value greater than 10% (regression imputation with 30% missingness). In all other cases with 30% or 40% missingness using Clean, the p-values were significantly less than 5%, and often less than 1%. Notice further that the SDAE obtained by MI was lower than that of the competing techniques in almost all cases. With 40% missingness, the SDAE on Clean was 54.2% smaller than that of the nearest competitor, k-NN imputation. These statistics demonstrate the robustness of the imputations calculated by MI; it is critical that an imputation technique not only have a low average absolute error, but also that the variability of the absolute error be as small as possible.

Mean imputation performed very poorly in all cases, producing the worst p-values (mostly 0.00), regardless of the missingness level or CCCS data set version used.

The results shown in Tables II - VII demonstrate that multiple imputation was a reliable and consistent imputation technique when applied to software engineering metrics data sets with missing values in a dependent variable. Further, a comparison of the imputation accuracy between Clean and Original clearly demonstrates that data quality had a significant impact on the effectiveness of each imputation technique (although the effect on mean imputation was much less dramatic than on the other techniques). Regression, MI, REPTree, and k-NN all performed significantly better when the data was relatively clean. In other words, the imputation performance of each technique as measured by the AAE was always better on $C^m$ than on $O^m$, regardless of the missingness level. On the other hand, the extremely poor results produced by mean imputation lead us to conclude that this technique should not be considered a reliable imputation method in the software engineering metrics domain.

D. Imputation Variability Across All Missingness Levels

The variability of the AAE and SDAE values across all 6 missingness levels for the Original and Clean case studies was calculated and plotted in Figures 1 - 4. The x-axis, labeled Imputation Methods, identifies each method in the following order from left to right: k-NN, regression, multiple imputation, mean, and REPTree. The y-axis, labeled AAE Variability or SDAE Variability, shows the calculated variability of the AAE or SDAE across all missingness levels for each imputation method, separately for Original and Clean. The plotted variability of each method is connected with single lines to its closest neighbors on the x-axis for visual clarity purposes only.
Fig. 1. Original Case Study AAE Logarithmic Variability

Fig. 2. Original Case Study SDAE Logarithmic Variability
These points are independent of each other; the peaks and valleys in the graphs merely emphasize the differing variability of the imputation methods. The variability of the AAE and SDAE across all missingness levels was obtained by transforming the variance of the respective values to a logarithmic scale for presentation clarity in Figures 1 - 4. The logarithmic variability across all 6 missingness levels was calculated as follows:

$$Var = 10 \log \sqrt{\frac{1}{6} \sum_{\alpha=1}^{6} (x_\alpha - \bar{x})^2} \qquad (2)$$

where $x_\alpha$, α = 1, ..., 6, is the AAE or SDAE value from either case study at missingness level α, and $\bar{x}$ is the corresponding mean.

The variability of the AAE and SDAE values across all missingness levels for Original is shown in Figures 1 and 2, respectively. In both figures, the method with the smallest variability was multiple imputation. Similarly, the variability of the AAE and SDAE values across all 6 missingness levels for Clean is shown in Figures 3 and 4, respectively. Again, multiple imputation had the smallest variability in both the AAE and SDAE amongst all the imputation methods considered in this work. As demonstrated in Figures 1 - 4, the average absolute error and the standard deviation of the absolute error for MI did not vary significantly across the 6 missingness percentages for either Clean or Original. These results were another strong indication of the superior reliability of MI and the stability of its imputed values.
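A transcription of Eq. (2), assuming the base-10 logarithm (consistent with the log-scale axes of Figures 1 - 4); the example values are MI's AAE on Original at the six levels, taken from Tables II - VII:

```python
import numpy as np

def log_variability(x):
    """Eq. (2): 10*log10 of the root-mean-square deviation of the six
    per-level values (AAE or SDAE) around their mean."""
    x = np.asarray(x, dtype=float)
    return 10 * np.log10(np.sqrt(np.mean((x - x.mean()) ** 2)))

# MI's AAE on Original at the six missingness levels (Tables II-VII):
print(log_variability([2.814, 1.357, 1.948, 1.675, 1.941, 1.750]))
```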
Fig. 3. Clean Case Study AAE Logarithmic Variability

Fig. 4. Clean Case Study SDAE Logarithmic Variability

V. CONCLUSION

The quality of data is a critical component of any data-dependent decision-making process. The study of accurate, efficient, and responsive techniques to ensure data quality is of great importance to data analysts and users in any application field. The primary motivation for this study was to present multiple imputation as a viable procedure for handling missing data in a software engineering metrics data set. Further, we presented a comparison of MI to other well-known and widely used imputation techniques: mean, k-NN, regression, and REPTree imputation. The results demonstrated that multiple imputation was a reliable and consistent imputation technique when missing data is present in a dependent variable of a software engineering metrics data set.
Indeed, multiple imputation often generated the best results amongst all the imputation techniques considered, according to the average absolute error and the other statistics examined.

The two case studies presented in this work utilized the CCCS software metrics data set. The dependent variable, nfaults, contained noisy values in the first case study. These noisy instances represented inherent noise that existed in the data set; i.e., noise was not injected into CCCS. In the second case study, nfaults was significantly cleansed of its original noisy values. The purpose of this second case study was to evaluate the imputation techniques using a relatively clean software metrics data set. Multiple imputation obtained remarkably low variability in the AAE and SDAE, performing better than the other imputation techniques.

Each case study contained 6 levels of simulated missing values in nfaults and 5 versions of the data sets for each level. The levels were set at 5%, 10%, 15%, 20%, 30% and 40% for each case study. Five independent data set versions were built for each missingness level to avoid any potential bias due to the random selection process for the simulated missing data. Multiple imputation often obtained the lowest error rates and smallest variability when compared to the other imputation techniques, especially at the higher levels of missingness. This conclusion was supported by the p-values obtained when comparing multiple imputation against the other techniques.

Our empirical results also demonstrated that the error rates obtained by mean imputation were significantly higher than those of the other imputation techniques. The results of mean imputation also exhibited the largest variability amongst all the techniques, and its p-values were the worst of all the techniques. Therefore, mean imputation should not be considered a reliable imputation method for handling missing data in software engineering metrics data sets.

In realistic settings, missing data is often observed in both the dependent and independent variables within a data set. Our preliminary results from recent experiments identified multiple imputation as the only technique among those used in this study that can effectively handle this type of missing data scenario. Our current and future studies concentrate on these important missing data scenarios.

REFERENCES

[1] P. D. Allison. Missing Data. Sage University Press, Thousand Oaks, CA, 2002.
[2] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood estimates from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
[3] B. Twala and M. Cartwright. Ensemble imputation methods for missing software engineering data. In Proceedings of the 11th IEEE Intl. Software Metrics Symposium, page 30. IEEE Computer Society, 2005.
[4] C. Brodley and M. Friedl. Identifying and eliminating mislabeled training instances. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 799-805, 1996.
[5] D. B. Rubin. Multiple imputation after 18+ years. Journal of the American Statistical Association, 91:473-489, 1996.
[6] Y. Haitovsky. Missing data in regression analysis. Journal of the Royal Statistical Society, Series B, 30:67-81, 1968.
[7] I. Myrtveit, E. Stensrud, and U. H. Olsson. Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods. IEEE Transactions on Software Engineering, 27(11):999-1013, Nov. 2001.
[8] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, CA, 2nd edition, 2005.
[9] SAS Institute. SAS/STAT User's Guide, 2004.
[10] J. Van Hulse, T. M. Khoshgoftaar, and H. Huang. The pairwise attribute noise detection algorithm. Knowledge and Information Systems, Special Issue on Mining Low-Quality Data, 11(2):171-190, February 2007.
[11] J. L. Schafer and M. K. Olsen. Multiple imputation for multivariate missing data problems: A data analyst's perspective. Multivariate Behavioral Research, 33(4):545-571, 1998.
[12] K. El Emam and A. Birk. Validating the ISO/IEC 15504 measure of software requirements analysis process capability. IEEE Transactions on Software Engineering, 26(6):541-566, June 2000.
[13] K. Strike, K. El Emam, and N. Madhavji. Software cost estimation with incomplete data. IEEE Transactions on Software Engineering, 27(10):890-908, Oct. 2001.
[14] T. M. Khoshgoftaar and E. B. Allen. Classification of fault-prone software modules: Prior probabilities, costs and model evaluation. Empirical Software Engineering, 3:275-298, 1998.
[15] T. M. Khoshgoftaar and J. Van Hulse. Determining noisy instances relative to attributes of interest. Intelligent Data Analysis: An International Journal, 10(3):251-268, 2006.
[16] P. Jonsson and C. Wohlin. An evaluation of k-nearest neighbour imputation using Likert data. In Proceedings of the 10th IEEE Intl. Symposium on Software Metrics (METRICS'04), pages 108-118, 2004.
[17] Q. Song, M. Shepperd, and M. Cartwright. A short note on safest default missingness mechanism assumptions. Empirical Software Engineering, 10(2):235-243, 2005.
[18] Q. Song, M. Shepperd, M. Cartwright, and B. Twala. A new imputation method for small software project data sets. Technical report, Bournemouth University, 2005.
[19] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. John Wiley and Sons, New York, NY, 2nd edition, 2002.
[20] D. B. Rubin. Multiple Imputation for Nonresponse in Surveys. John Wiley and Sons, New York, NY, 1987.
[21] J. L. Schafer. Analysis of Incomplete Multivariate Data. Chapman and Hall/CRC, Boca Raton, FL, 2000.
[22] X. Zhu and X. Wu. Class noise vs. attribute noise: A quantitative study of their impacts. Artificial Intelligence Review, 22(3):177-210, 2004.
VI. APPENDIX: DATA CLEANSING METHODOLOGY

This appendix describes our hybrid procedure for cleansing a quantitative dependent variable and its application to the CCCS data set, which is known to contain noise in the dependent variable. Our hybrid technique relies on two components, described individually in VI-A and VI-B. In VI-C, the hybrid procedure itself is explained along with its application to the CCCS data set. The objective of the procedure is to create a relatively cleaner data set from CCCS for use in further experiments (in the context of this work, a comparative study of imputation techniques).

The first component, described in VI-A, uses our proposed technique for identifying noise in a user-specified attribute of interest (AOI) [15], which can be any attribute in the data set. A relative ranking of the instances is produced based on the level of noise contained in the AOI, and alternative values are then calculated for those instances determined to be noisy relative to the AOI. Independent of the AOI technique, we have proposed another novel procedure, the multiple imputation (MI) quantitative noise detector (VI-B). This procedure calculates an alternative value for the dependent variable of each observation in the data set; noisy instances are identified when the average imputed value for the dependent variable differs significantly from the actual value. Using the alternative values suggested by these techniques, the outcome variable can be cleansed based on expert input. The combination of these procedures into a hybrid technique is presented in VI-C.

Expert input was a critical component when cleansing the CCCS data set. The input of a software engineering domain expert ensured that the alternative values suggested by our procedure were sensible given the values of the independent variables. Cleansing occurred only if the expert agreed that the instance was noisy and that the estimated values were sensible. Therefore, a conservative approach to noise cleansing was taken in this case study, and only indisputable noise was cleansed.

A. Determine Noisy Instances Relative to an Attribute of Interest

A brief overview of our procedure for ranking noisy instances relative to an attribute of interest (AOI) is presented here. The technique utilizes a procedure called PANDA (Figure 5), which ranks instances from most to least noisy based on the noise factor $S_i$. For each observation in the data set, PANDA examines each pair of attributes and computes the deviation of the second attribute from its mean value given the partitioned value of the first attribute. If, for a given instance, these deviations occur often and severely enough compared to the remainder of the data set, that instance will appear more noisy. Additional details on PANDA can be found in [10].

Using PANDA, our procedure for detecting noise relative to an attribute of interest works as follows, with further details available in [15]. Suppose there are m attributes in the data set, one of which the user selects as the AOI. The noise ranking of the instances is first calculated by PANDA using all m attributes. This ranking is denoted rank, and the rank of instance $x_i$ is denoted rank($x_i$). Values of rank($x_i$) closer to one represent the most noisy instances in the data set relative to the AOI.
PANDA Algorithm
Input: data set $X = [x_{ij}]_{n \times m}$ with n observations and m attributes, where $x_{ij}$ is the value of the j-th attribute for the i-th observation; $x_{*j}$ denotes the j-th attribute.
Output: noise factor $S_i$ for every observation in X.
1) Partition each attribute $x_{*j}$ into disjoint bins $\{1, \ldots, L\}$, where L is the number of partitions. Denote the partitioned attribute by $\hat{x}_{*j}$ and the value of the j-th partitioned attribute for instance i by $\hat{x}_{ij}$. Note in particular that $\hat{x}_{ij} \in \{1, \ldots, L\}$ for all i.
2) Calculate $Mean(x_{*k} \mid \hat{x}_{*j} = l)$ and $Std(x_{*k} \mid \hat{x}_{*j} = l)$ for each pair of attributes $(\hat{x}_{*j}, x_{*k})$, $1 \le j \ne k \le m$, and for l = 1, ..., L.
3) For each instance $i \in X$, calculate
$$S_i = \sum_{k=1}^{m} \sum_{\substack{j=1 \\ j \ne k}}^{m} \frac{|x_{ik} - Mean(x_{*k} \mid \hat{x}_{*j} = \hat{x}_{ij})|}{Std(x_{*k} \mid \hat{x}_{*j} = \hat{x}_{ij})}.$$
4) Sort the instances by $S_i$ from the largest to the smallest value.
Fig. 5. Pairwise Attribute Noise Detection Algorithm
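A compact sketch of the PANDA noise factor from Figure 5; equal-frequency binning is one reasonable choice here, as the paper defers the binning scheme to [10]:

```python
import numpy as np

def panda_noise_factors(X, L=4):
    """Compute the PANDA noise factor S_i for each instance (Fig. 5)."""
    n, m = X.shape
    # 1) Partition each attribute into L bins (equal-frequency here).
    edges = [np.quantile(X[:, j], np.linspace(0, 1, L + 1)[1:-1])
             for j in range(m)]
    B = np.column_stack([np.digitize(X[:, j], edges[j]) for j in range(m)])
    S = np.zeros(n)
    for j in range(m):            # conditioning (binned) attribute
        for k in range(m):        # deviating attribute
            if j == k:
                continue
            for l in range(L):
                idx = B[:, j] == l
                if idx.sum() < 2:
                    continue
                # 2) Conditional mean and std of attribute k given bin l.
                mu = X[idx, k].mean()
                sd = X[idx, k].std()
                if sd > 0:
                    # 3) Accumulate standardized deviations per instance.
                    S[idx] += np.abs(X[idx, k] - mu) / sd
    # 4) Larger S_i => the instance looks noisier.
    return S
```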
Multiple Imputation Quantitative Noise Detector
1) For λ = 1, ..., Λ do:
2)   Randomly divide D into disjoint, equal-sized partitions $P_1, \ldots, P_L$.
3)   Create additional data sets $M_1, \ldots, M_L$, where $M_l = P_l$ with Y missing for all $x_i \in P_l$.
4)   $D_l = M_l \cup \bigcup_{j=1, j \ne l}^{L} P_j$, l = 1, ..., L.
5)   For ω = 1, ..., Ω do:
6)     Execute Impute($D_l$, chain, Γ, burn, lag) for all l = 1, ..., L.
7)     For all $x_i \in M_l \subset D_l$, Impute() calculates Γ imputed values for the dependent variable Y of instance $x_i$, denoted $\hat{Y}^{(\lambda,\omega,\gamma)}(x_i)$, γ = 1, ..., Γ.
8)   End
9) End
10) $\hat{Y}(x_i) = (\Lambda \times \Omega \times \Gamma)^{-1} \sum_{\lambda=1}^{\Lambda} \sum_{\omega=1}^{\Omega} \sum_{\gamma=1}^{\Gamma} \hat{Y}^{(\lambda,\omega,\gamma)}(x_i)$.
11) For all $x_i \in D$, calculate the absolute error ($\epsilon_a$) and relative error ($\epsilon_r$) between $\hat{Y}$ and Y:
$$\epsilon_a(x_i) = |\hat{Y}(x_i) - Y(x_i)|, \qquad \epsilon_r(x_i) = \left| \frac{\hat{Y}(x_i) - Y(x_i)}{Y(x_i) + 1} \right|. \qquad (3)$$
12) Instances with larger values for either $\epsilon_a$ or $\epsilon_r$ contain a relatively larger amount of noise.
Fig. 6. Algorithm for Detecting Noise in a Quantitative Outcome Variable Using Multiple Imputation
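Lines 10 - 11 of Figure 6 reduce to a few lines of code; y_hat below stands for the averaged imputation $\hat{Y}(x_i)$:

```python
import numpy as np

def noise_errors(y, y_hat):
    """Absolute and relative errors of Eq. (3), where y_hat is the
    averaged imputation and y the recorded value; the +1 in the
    denominator guards against division by zero when y == 0."""
    eps_a = np.abs(y_hat - y)
    eps_r = np.abs((y_hat - y) / (y + 1))
    return eps_a, eps_r
```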
The AOI is then removed from the data set, and PANDA calculates the instance noise ranking using the remaining m − 1 attributes; denote the resulting instance ranking $rank_{AOI}$. The difference in ranking before and after the AOI is removed is calculated for all instances: $\Delta(x_i) = rank_{AOI}(x_i) - rank(x_i)$. Instances with the largest (positive) difference in ranking are considered the most noisy relative to the AOI. Therefore, our procedure provides a ranking of instances relative to the amount of noise contained in the AOI.

B. Multiple Imputation Quantitative Noise Detector

Our technique for the detection of noise in a quantitative outcome using multiple imputation is presented in Figure 6. The input data set D contains a quantitative dependent variable Y. Given a user-defined parameter L, D is randomly partitioned into disjoint, equal-sized data sets $P_1, \ldots, P_L$. L additional data sets are created, where $M_l = P_l$ with Y set to missing for all instances. Finally, the data set $D_l$ is created on Line 4. Impute(·), on Line 6, imputes Y for those instances in $D_l$ with a missing value for Y. The first parameter to Impute() specifies the input data set. Next, the type of chain (either single or multiple) is specified. The third parameter, Γ, is the total number of values for Y retained from a single execution of Impute(); lag is the number of cycles in a chain between imputed values, and burn is the number of burn-in iterations. For each data set $D_l$, Impute() executes several times, as determined by the user-defined parameter Ω. Due to the random nature of the MI procedure, each execution generates slightly different imputed values. Each execution ω calculates Γ imputed values for the instances with a missing value for Y. Additionally, the entire process (Lines 2 to 8) is executed multiple times, as determined by the user-specified parameter Λ; these multiple executions are used to reduce any potential bias introduced by one particular random selection. The imputed value of the outcome variable for instance $x_i$ is denoted by $\hat{Y}^{(\lambda,\omega,\gamma)}(x_i) \in \mathbb{R}$. Each instance has Λ × Ω × Γ imputed values for Y. On Line 10, the average imputed value $\hat{Y}(x_i)$ over all executions of the entire procedure is computed for each instance. Line 11 computes the absolute ($\epsilon_a$) and relative ($\epsilon_r$) errors for each instance. Instances where $\epsilon_a$
or $\epsilon_r$ are small are less noisy than instances where the error is large. This procedure, therefore, returns a ranking of instances relative to the amount of noise contained in the quantitative dependent variable.

C. A Hybrid Approach to Quantitative Outcome Correction

The hybrid technique proposed for the correction of a quantitative outcome variable is based on the components discussed in VI-A and VI-B. The input data set to be cleansed is denoted D, and the dependent variable is denoted Y. Our procedure utilizes two different approaches to detect and correct noise in the dependent variable, and the output of these two components is combined to calculate a cleansed value for the dependent variable.

By setting the AOI equal to nfaults, our procedure ranks instances relative to the noise contained in a single attribute, which in this case is the dependent variable. In particular, $\Delta(x_i) = rank_Y(x_i) - rank(x_i)$ is calculated for each instance $x_i \in D$. Alternative values for Y were derived for each instance with $\Delta(x_i) > 0$ using imputation and regression. Denote the estimated values for those instances determined to contain noise relative to the AOI by $\hat{Y}_{AOI}$.

The second component in our hybrid approach uses the MI quantitative noise detector proposed in VI-B. The alternative value calculated by this technique is denoted $\hat{Y}_{MI}$ (rather than $\hat{Y}$ as in Figure 6) to distinguish it from $\hat{Y}_{AOI}$. Note that $\hat{Y}_{MI}$ was computed for each instance in the data set D; this differs from $\hat{Y}_{AOI}$, which was only computed for those instances with $\Delta(x_i) > 0$. Based on the alternative value $\hat{Y}_{MI}$ and the original value Y, $\epsilon_a$ and $\epsilon_r$ were computed for each instance in D. Lastly, noisy instances are identified and the corrected value $Y^c$ of the dependent variable Y is derived from the information provided by these two components.

In the case study considered in this work, CCCS was the input data set to be cleansed. The MI noise detector used a sequential chain with parameters Γ = 15, lag = 100, burn = 100, Λ = 3 and Ω = 11. Based on previous experiments, these parameters were deemed reasonable for the CCCS data set. After applying our hybrid procedure, the values of the 8 independent variables, the original value of Y, the estimated values $\hat{Y}_{AOI}$ and $\hat{Y}_{MI}$, and the statistics Δ, $\epsilon_a$ and $\epsilon_r$ were presented to a software engineering domain expert. After considering the alternative values, nfaults was cleansed for 81 instances in the data set. The expert maintained a conservative approach to cleansing, correcting only those instances deemed indisputable cases of noise.

A hybrid approach to data cleansing is proposed due to the difficulty of the problem. Generally speaking, hybrid approaches may obtain better performance than their individual components, hence our reliance on multiple procedures for data cleansing. Any data cleansing procedure should be careful not to modify data that is a priori relatively clean: the inadvertent injection of noise into previously clean instances during the cleansing process is obviously undesirable. By relying on multiple techniques for both the detection and correction of noise, confidence in the cleansing process increases. Furthermore, cleansing of the CCCS data set was performed under the careful guidance of a domain expert. Using 10-fold cross validation, regression models were constructed using both the original and cleansed versions of the CCCS data set.
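The decision rule for combining the two detectors is left to the domain expert in the paper; purely as an illustration, candidates for expert review might be surfaced as follows (the thresholds are hypothetical):

```python
import numpy as np

def cleansing_candidates(delta, eps_a, eps_r, a_thresh=3.0, r_thresh=0.5):
    """Flag instances both detectors consider noisy, for expert review.

    delta: rank_AOI(x_i) - rank(x_i) from the AOI component.
    eps_a, eps_r: errors from the MI noise detector (Eq. 3).
    The thresholds here are hypothetical; the study relied on a domain
    expert rather than fixed cutoffs.
    """
    flagged = (delta > 0) & ((eps_a > a_thresh) | (eps_r > r_thresh))
    return np.flatnonzero(flagged)
```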
The error statistics for the regression model constructed using the cleansed data were significantly lower than those of the regression model constructed on the original CCCS data set. Based on the involvement of the domain expert and the results obtained from the linear regression models, we can therefore conclude that the data cleansing process applied to the CCCS data set was carried out properly and that the resulting data is indeed relatively cleaner than the original.