Dealing with Missing Software Project Data

M.J. Shepperd and M.H. Cartwright
Empirical Software Engineering Research Group
School of Design, Engineering & Computing, Bournemouth University, Bournemouth, BH1 3LT, UK
{mshepper, mcartwri}@bmth.ac.uk

Abstract

Whilst there is a general consensus that quantitative approaches are an important adjunct to successful software project management, there has been relatively little research into many of the obstacles to data collection and analysis in the real world. One feature that characterises many of the data sets we deal with is missing or highly questionable values. Naturally this problem is not unique to software engineering, so in this paper we explore the application of various data imputation techniques that have been used to good effect elsewhere. In order to assess the potential value of imputation we use two industrial data sets containing a number of missing values. The relative performance of effort prediction models derived by stepwise regression on the raw data, and on data sets with values imputed by various techniques, is compared. In both data sets we find that k-Nearest Neighbour (kNN) and sample mean imputation (SMI) significantly improve effort prediction accuracy, with the kNN method giving the best results.

Keywords: project effort estimation, imputation, data analysis.

1. Introduction

Over the past 30 years or more there has been substantial interest by software engineering researchers in the analysis and use of empirical data. There has, however, been rather less work targeted at the problems of collecting the data in the first place. One manifestation of this situation is the tendency for research groups, including our own, to carry out analysis on a small number of what might be termed "classic" public domain data sets. This has the effect of somewhat distancing researchers from the day-to-day difficulties of collecting and validating industrial data.

The focus of this paper is on the commonplace but somewhat overlooked problem of missing data. There are various reasons why data may be missing. Within the domain of software engineering they include lack of time, cost, lack of commitment, lack of training, problems applying counting rules to a particular situation (especially where that situation has not been anticipated), and political reasons (refusal to release figures which "look bad"). Of course the problem of missing data observations is not unique to software engineering, so it is no surprise to find that a wide range of techniques has been developed to try to fill in the gaps in data sets. These are known as data imputation techniques. Data imputation is in widespread use, yet the first paper of which we are aware exploring the application of imputation within the field of empirical software engineering is the recent work by Myrtveit et al. [1]. There the authors evaluate two imputation methods and compare them with the listwise deletion technique for dealing with missing data. Unfortunately their results are rather inconclusive.

In this paper we ask the question: do imputation methods allow us to improve the usefulness of software engineering data sets that contain missing values? For the purposes of this analysis we focus upon software project effort prediction. This is because it is an important problem domain, and because many of its data sets are characterised by being both small, in terms of the number of cases, and incomplete, in terms of missing observations. For this study we use two real world data sets that were collected by two different organisations with the goal of project effort prediction. Each data set contained a number of missing values, 15% and 18% respectively. We also had an advantage in that one of the authors had been involved with their initial analysis, hence we were confident that they are representative of "as is" industrial data sets, as opposed to some of the highly "scrubbed" and sanitised data sets upon which a considerable amount of software metrics research is based.

We compare prediction models derived from the raw data sets with models derived from data sets completed using two common imputation methods. The modelling technique we use is stepwise regression, although to some extent this is incidental since our primary concern is the practical utility of data imputation.

The remainder of the paper is organised as follows. The next section briefly reviews some of the more widely used imputation techniques and examines their application by Myrtveit et al. to a software project data set [1]. Then we describe our method, the two data sets and our two sets of results. We conclude by considering the wider implications for software engineering data analysis.
Another commonly used imputation technique is based on k-Nearest Neighbours (kNN), where k is the number of cases or analogies sought. kNN works by finding the k complete cases most similar to the target case to be imputed, where similarity is measured by Euclidean distance. The known values of the most similar cases are then used to derive a value for the missing datum. Where more than one case is used there are various possibilities, the simplest being a straightforward arithmetic mean; an alternative is an inverse-distance-weighted average. For categorical features vote counting is used. Typically k is 1 or 2, but as k tends to n, where n is the sample size, the method converges on sample mean imputation. The obvious advantage of this technique is that it enables the most similar cases to be the most influential. In addition, many distance metrics allow the inclusion of categorical features, for example, development method or type of programming language. Since both data sets in our case study comprise a mixture of continuous and categorical features this is a distinct advantage. A number of studies have reported good results using kNN, including the investigation by Troyanskaya et al. [3] comparing imputation techniques on DNA data. They found that the kNN method using Euclidean distance gives the best results. A more complex method is Multiple Imputation [4]. At its simplest, some technique (for example, Rubin describes a Monte Carlo simulation approach) is used to complete the data set with p > 1 simulated values (where typically 3>p
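To make the kNN procedure concrete, the following is a minimal sketch in Python, not the implementation used in this study: it imputes one missing numeric value from the unweighted arithmetic mean of the k nearest complete cases under Euclidean distance. The project data and feature names are invented purely for illustration.

```python
import math

def knn_impute(cases, target, feature, k=2):
    """Impute cases[target][feature] from the k nearest complete cases.

    `cases` is a list of dicts; similarity is Euclidean distance over
    the numeric features for which the target case has known values.
    """
    # Features the target case has values for (excluding the one to impute).
    known = [f for f, v in cases[target].items()
             if v is not None and f != feature]
    # Candidate donors: other cases complete on `feature` and the known features.
    donors = [c for i, c in enumerate(cases)
              if i != target and c.get(feature) is not None
              and all(c.get(f) is not None for f in known)]

    def dist(c):
        return math.sqrt(sum((c[f] - cases[target][f]) ** 2 for f in known))

    nearest = sorted(donors, key=dist)[:k]
    # Simplest combination rule: unweighted arithmetic mean of the donors;
    # an inverse-distance-weighted mean is the common alternative.
    return sum(c[feature] for c in nearest) / len(nearest)

# Hypothetical project data: size (KLOC), team size, effort (person-months).
projects = [
    {"size": 10, "team": 3, "effort": 24},
    {"size": 12, "team": 4, "effort": 30},
    {"size": 50, "team": 9, "effort": 120},
    {"size": 11, "team": 3, "effort": None},  # missing effort to impute
]
print(knn_impute(projects, target=3, feature="effort", k=2))  # -> 27.0
```

Note that, consistent with the convergence remark above, setting k equal to the number of complete cases makes this reduce to sample mean imputation; in practice the features would usually also be standardised so that no single feature dominates the distance.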