Olga Fedotova et al. / International Journal of Engineering Science and Technology (IJEST)
SOFTWARE EFFORT PREDICTION: AN EMPIRICAL EVALUATION OF METHODS TO TREAT MISSING VALUES WITH RAPIDMINER® OLGA FEDOTOVA Department of Economics, Management and Industrial Engineering, University of Aveiro Campus Universitário de Santiago, Aveiro, 3810-193, Portugal
[email protected]
GLADYS CASTILLO Department of Mathematics, University of Aveiro Center for Research and Development in Mathematics and Applications, University of Aveiro, Campus Universitário de Santiago, Aveiro, 3810-193, Portugal
[email protected]
LEONOR TEIXEIRA Department of Economics, Management and Industrial Engineering, University of Aveiro Governance, Competitiveness and Public Politics (GOVCOPP), University of Aveiro Institute of Electronics and Telematics Engineering of Aveiro (IEETA) Campus Universitário de Santiago, Aveiro, 3810-193, Portugal
[email protected]
HELENA ALVELOS Department of Economics, Management and Industrial Engineering, University of Aveiro, Governance, Competitiveness and Public Politics (GOVCOPP), University of Aveiro Campus Universitário de Santiago, Aveiro, 3810-193, Portugal
[email protected]

Abstract: Missing values are a common problem in data analysis in all areas, and software engineering is no exception. In particular, missing data is a widespread phenomenon observed during the elaboration of the effort prediction models (EPMs) required for budget, time and functionality planning. This work presents the results of a study carried out in a Portuguese medium-sized software development organization in order to obtain a formal method for EPM elicitation in development processes. The study focuses on the performance evaluation of several regression-based EPMs induced from data after applying three different methods to treat missing values. Results show that regression imputation offers substantial improvements over traditional techniques (case deletion and mean substitution). All the machine learning methods were implemented in RapidMiner®, one of the leading open-source data mining applications.

Keywords: Software effort prediction; missing values treatment; multiple linear regression.

1. Introduction

Software effort prediction (SEP) is the forecast of the amount of effort required to build a software system and an estimate of its duration [Singh, Bhatia et al. (2008)]. SEP first appeared in the 1950s and has attracted the attention of software community specialists ever since, with the aim of developing useful models that constructively explain the development life-cycle and accurately predict the development cost of a software product [Boehm, Abts et al. (2000)].

ISSN : 0975-5462
Vol. 3 No. 7 July 2011
6064

Since then, many models for effort and cost estimation have been elaborated. Indeed, the diversity of models reported in the literature can be considered an indicator of the problem's complexity. Nevertheless, no single model completely satisfies the need for objective, fast and accurate predictions in all circumstances. The predictive quality of estimates determines the success of a project and helps to avoid the risks related to cost and schedule overruns. SEP is usually required at the very beginning of the project's life-cycle, which makes the task of effort estimation even more complex. The ability to deliver software on time, within budget, and with the expected functionality and quality is a challenge for all software development organizations. Inaccurate estimates are among the most serious causes of software project failure. Poor estimates are often caused by the occurrence of missing values in the datasets used to learn the effort prediction models (EPMs) [Myrtveit, Stensrud et al. (2001)]. Missing data is a common problem in statistical analysis. In the context of SEP, missing data generally arises from [Myrtveit, Stensrud et al. (2001)]: (i) the high costs of, and unestablished practices for, data collection and reporting; (ii) incorrect data (wild data) that has to be replaced by missing values because it does not make sense; and (iii) respondents failing to report some variables for various reasons. This article presents the results of a study carried out in a Portuguese medium-sized software development organization in order to obtain a formal method for the elicitation of EPMs for development processes.
The study focuses on the performance evaluation of several regression-based EPMs induced from data after applying three techniques to treat missing data: case deletion, mean substitution and regression imputation. The implementation was carried out using RapidMiner® 5.0 [Rapid-I (2010)], one of the world's leading open-source applications for data mining. Results show that regression imputation provides considerable improvements in comparison with the traditional techniques of case deletion and mean substitution. The rest of this paper is organized as follows. In Section 2, we summarize the main aspects concerning the construction of a regression-based EPM and provide a description of the dataset used. In Section 3, we briefly describe several methods to treat missing values and give some details about their implementation in RapidMiner®. In Section 4, we present the results of experiments with real data aimed at evaluating the performance of the EPMs obtained with the three missing data treatments. Finally, in Section 5, we present our conclusions and propose some future work.

2. Effort Prediction Models

There is no consensus in the literature about the effectiveness of one or another SEP method. In spite of the great variety of existing models, there is no generally applicable method that produces more accurate and precise results than the others for all projects. According to Galorath and Evan [(2006)], the main reasons for SEP failures are: (i) the lack or misuse of historical data; (ii) over-optimistic leadership or management; (iii) failure to use the estimate; or (iv) failure to keep the estimate current. As pointed out in [Jorgensen (2004)], the low popularity of EPMs may be explained by the discomfort felt by software development organizations when using models that they do not fully understand, together with the lack of substantial evidence that their use produces more accurate estimates than those provided by experts.
However, there are several reasons that can lead software companies to implement EPMs [Jorgensen, Kirkeboen et al.]: (i) the good performance they have shown in less predictable environments; (ii) independence from the presence and experience of experts; and, finally, (iii) satisfaction of the requirements imposed by software process improvement frameworks (e.g. CMMI and SPICE BPG), which demand rational effort/cost estimation methods that support the organization's evolution to a more advanced capability/maturity level. A great number of techniques are used to elicit formal EPMs from databases in which critical information about past projects has been gathered over time. Multiple linear regression (MLR) is pointed out as one of the most popular among these techniques. MLR belongs to the algorithmic group of techniques [Shepperd, Schofield et al. (1996)], which requires data from past projects in order to evaluate the current one. MLR aims to find out how n independent variables Xi are related to the dependent variable Y [Anderson, Sweeney et al. (2009)]. The MLR model is defined as:
Y = β0 + β1X1 + β2X2 + … + βnXn + ε    (1)
where X1, X2, …, Xn are the regressors; β0 is the intercept parameter; β1, β2, …, βn are the regression coefficients; and ε is the random error component.

The dataset used to elicit the EPMs described in this work was collected by the project managers of a Portuguese medium-sized software development organization during the planning and effort estimation phase of several software projects. Project planning and subsequent monitoring and control are carried out by means of change set management. A change set (CS) is the element of the work breakdown structure considered to be the work unit grouping a set of requirements. For this reason, project effort and cost estimation is performed at the CS level, which provides the possibility of more detailed cost and effort control. Information about all closed change sets (prototyped, developed, tested and documented) was added to the dataset. In total, 144 CSs from 5 projects of different sizes were processed. The independent variables used to characterize a CS unit are summarized in Table 1.

Table 1 - Variables Characterization.

Dev Frc: Hours estimated by the expert to program the CS and perform unitary tests.
CS size: Size of the CS in terms of its number of requirements.
Nr Modules: Number of modules (logic units of code) that the CS impacts.
Nr CRs: Number of change requests (development tasks) per CS.
Code Complexity: Ordinal variable that expresses the complexity of CS programming, varying from low to high (1-3).
Code Reuse: Ordinal variable that indicates to what extent existing code will be reused, varying from wide to no reuse (1-3).
Business Acquaintance: Ordinal variable that indicates whether the team members are familiar with the business rules and laws involved in the CS programming, varying from large to small (1-3).
Technical Experience: Ordinal variable that expresses the team's experience with the technology, programming language and module, varying from high to low (1-3).
Impact: Ordinal variable that indicates the volume of changes to be made to the code, as well as the number of processes to change, varying from small to significant (1-5).
Dev Eff is the dependent variable (the response) that we want to predict; it represents the effective hours spent on programming the CS and performing unitary tests.

3. Methods to Treat Missing Values in RapidMiner®

Knowledge discovery in databases (KDD) is a process that consists in "the nontrivial extraction of implicit, previously unknown, and potentially useful information from the data" [Frawley, Piatetsky-Shapiro et al. (1992)]. KDD is an interactive and iterative process involving the following steps [Fayyad, Piatetsky-Shapiro et al. (1996)]: (i) understanding of the application domain and goal(s) identification; (ii) data selection; (iii) data cleaning and pre-processing; (iv) data reduction and projection; (v) modeling (in essence, the selection of the data mining algorithm(s) and method(s) used for pattern searching); (vi) evaluation and interpretation; and (vii) decision making based on the discovered knowledge.
Fig. 1 - A RapidMiner® Process Graph.
To perform our study we used RapidMiner® 5.0 [Rapid-I (2010)], an open-source application for data mining that combines the power of development environments with the simplicity of visual programming. KDD processes in RapidMiner® are implemented through more than 500 nestable operators (for input/output, pre-processing, modeling, evaluation, etc.). As depicted in Figure 1, the operators are combined into a process graph by means of a powerful but intuitive user interface. In this work we focus on the treatment of missing values in the pre-processing phase. According to Little and Rubin [(2002)], the methods to treat missing data can be categorized into three groups: (i) case/pairwise deletion; (ii) imputation methods, in which missing values are replaced with estimates derived from applying statistical methods to the available data; and (iii) model-based methods, variants of the iterative expectation-maximization algorithm for finding maximum likelihood estimates (MLE) in parametric models from incomplete data. In practice, case deletion is the most widely used method due to the simplicity of its implementation, especially for large datasets. However, the deletion of all the instances with missing values can lead to a loss of useful information, thus introducing some bias into the data. As a result, the induction of EPMs from an incomplete or biased database can result in incorrect or biased forecasts. Model-based methods, on the other hand, suffer from some drawbacks: (i) they are parametric and hence require a model distribution (usually, a multivariate normal distribution is assumed in order to find the MLE of a mean vector and a covariance matrix); and (ii) they are iterative, and can therefore incur a higher computational cost than the alternative methods. For the induction of regression-based EPMs we compared and evaluated case deletion (serving as a baseline) against two imputation methods: mean substitution and regression imputation.
We now briefly describe how these three methods can be implemented in RapidMiner®.

3.1. Case/pairwise deletion

RapidMiner® enables the deletion of all examples with missing values by means of the Filter Examples operator. This operator takes a dataset as input and returns a new dataset that contains only the examples that fulfill a certain condition (Figure 2).
Fig. 2 - Database filtering in RapidMiner®.
To set the condition that acts as a filter we need to select a condition class. We can choose one of the pre-defined conditions or define our own. As shown in Figure 3, among the pre-defined conditions we can select only examples with complete attribute values (no_missing_attributes) or with complete labels (no_missing_labels), or, conversely, only examples with missing attribute values (missing_attributes) or missing labels (missing_labels). Alternatively, as shown in Figure 4, we can specify our own condition by
selecting the attribute_value filter and writing the condition (a parameter string with the form attribute op value, where op is a binary logical operator).
Fig. 3 - Filter Examples: parameters.
Fig. 4 - Filter Examples: attribute value filter.
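Outside RapidMiner®, the same filtering step can be sketched with pandas. The column names and values below are invented for illustration; the two filters mirror the no_missing_attributes condition and an attribute_value condition of the form attribute op value.

```python
import numpy as np
import pandas as pd

# Toy change-set data; columns follow Table 1, values are invented.
df = pd.DataFrame({
    "CS_size":    [4.0, np.nan, 7.0, 2.0],
    "Nr_Modules": [1.0, 3.0, np.nan, 2.0],
    "Dev_Eff":    [40.0, 55.0, 120.0, 18.0],
})

# Equivalent of Filter Examples with the no_missing_attributes condition:
complete = df.dropna()

# Equivalent of an attribute_value filter such as "Dev_Eff > 20":
filtered = df[df["Dev_Eff"] > 20]
```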
3.2. Imputation methods

Imputation is the process of filling in the missing values of an attribute. The imputation may be carried out using statistics derived from historical data (average, maximum, minimum, mode, etc.) or pre-defined values. In RapidMiner® we can implement imputation methods using two operators: Replace Missing Values and Impute Missing Values. The Replace Missing Values operator (Figure 5) enables the substitution of missing values by the minimum, maximum or average statistic calculated on the basis of the existing values, for all or selected attributes. Moreover, we can also replace the missing values by pre-defined values (e.g. zero, or values that we consider to provide a better fit to the data).
Fig.5 - Missing values replacement in RapidMiner®.
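As a rough equivalent of the Replace Missing Values operator, mean/mode substitution can be sketched in pandas. Column names and values are invented; the mode is used for the nominal attribute, as described for the imputation step in Section 4.2.

```python
import numpy as np
import pandas as pd

# Toy data; column names follow Table 1, values are invented.
df = pd.DataFrame({
    "CS_size":         [4.0, np.nan, 7.0, 2.0, np.nan],
    "Code_Complexity": ["1", "3", None, "3", "2"],
})

# Numerical attribute: fill with the column average, mirroring the
# Replace Missing Values operator configured with the "average" function.
df["CS_size"] = df["CS_size"].fillna(df["CS_size"].mean())

# Nominal attribute: fill with the mode (the most frequent value).
df["Code_Complexity"] = df["Code_Complexity"].fillna(df["Code_Complexity"].mode()[0])
```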
The Impute Missing Values operator (Figure 6) allows the estimation of missing data for all or selected attributes. To achieve this, a predictive model (a regression or classification model) is learned from a complete dataset composed of all the attribute values except those of the attribute whose missing values we want to impute (taken as the target). The induced predictive model is then applied to fill in the missing values of the target. More precisely, this operator divides the dataset into a training set containing the examples with no missing values, and then applies the model to the rest of the data, where the attribute values are missing, in order to estimate them. The operator that implements the algorithm for model induction from the training set must be specified as an inner operator. Depending on the type of the target, as well as the attribute types, a great variety of predictive models can be used (e.g. linear regression, decision trees, k-NN classifiers, Bayesian classifiers, etc.).
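The train-on-complete-cases mechanism just described can be sketched outside RapidMiner® with an ordinary least-squares model. All column names and values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy data: impute missing CS_size values from Nr_Modules and Nr_CRs.
df = pd.DataFrame({
    "Nr_Modules": [1.0, 2.0, 3.0, 4.0, 2.0],
    "Nr_CRs":     [2.0, 1.0, 4.0, 3.0, 2.0],
    "CS_size":    [5.0, 6.0, 11.0, 12.0, np.nan],
})

target, predictors = "CS_size", ["Nr_Modules", "Nr_CRs"]
mask = df[target].isna()

# Training set: the complete cases only.
train = df[~mask]
X = np.column_stack([np.ones(len(train)), train[predictors].to_numpy()])
beta, *_ = np.linalg.lstsq(X, train[target].to_numpy(), rcond=None)

# Apply the induced model to estimate the missing target values.
Xm = np.column_stack([np.ones(mask.sum()), df.loc[mask, predictors].to_numpy()])
df.loc[mask, target] = Xm @ beta
```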
Fig.6 - Missing values estimation in RapidMiner® using MLR as the inner operator.
Fig.7 - Impute Missing Values operator: parameters.
As depicted in Figure 7, the operator interface allows the user to configure several parameters: (i) attribute filter type – selects the attributes for which the missing values will be imputed; (ii) invert selection – inverts the selection of attributes specified in the attribute filter type; (iii) include special attributes – decides whether special attributes, such as ID, should also be included; (iv) iterate – imputes missing values immediately after having learned the predictive model, or at the end of the process; (v) learn on complete cases – defines whether to learn predictive models using only complete cases; (vi) order – establishes the order (chronological, random, number of missing values or information gain) of the attributes for which missing values are estimated; (vii) sort – selects the direction (ascending or descending) for the order strategy; and, finally, (viii) use local random seed.

4. Experimental Study

The main goal of the experimental study was to evaluate the results of several linear regression-based EPMs induced from data after applying the three different methods to treat missing values. We were interested in measuring the goodness of fit to the data, as well as the predictive capability, of the induced EPMs. To assess the goodness of fit we used the coefficient of determination R2, which measures the proportional reduction in variability about the mean resulting from fitting the multiple regression model [Freund, Wilson et al. (2006); Montgomery, Peck et al. (2006)]. To measure the accuracy of the predicted results, the mean absolute error (MAE) was employed. The MAE [Willmott and Matsuura (2005)] expresses the average model-prediction error in the units of the variable of interest and is defined as follows:
MAE = (1/N) Σ |υi − υ̂i|, i = 1, …, N    (2)
where N is the number of examples, υi is the real value of the dependent variable for the i-th example and υ̂i is its respective estimated value. We applied a bootstrap validation scheme in order to obtain estimates of the R2 and MAE values. The RapidMiner® operator Linear Regression was used to induce the regression models. This operator implements the well-known least squares algorithm. For embedded feature selection we used the stepwise method. The bootstrap validation method is implemented in RapidMiner® by means of the operator with the same name.

4.1. Descriptive Statistics

The descriptive statistics obtained with RapidMiner® are shown in Table 2. As observed, all variables except Dev Eff and Dev Frc have missing values.
Table 2 - Descriptive statistics of the CS database.
CS size is the attribute with the largest number of missing values (approximately 38%). Moreover, further analysis revealed 19 cases with no value at all for most attributes, except Dev Eff and Dev Frc.

4.2. Pre-processing Methodologies

Before the induction of the linear regression EPMs, we removed all duplicates from the database, as well as the CSs with Dev Eff equal to zero. The initial dataset was thus reduced from 144 to 81 CSs. After that we applied the following methods to treat missing values:
• Case deletion: we removed all examples with missing values using the Filter Examples operator, as described in Section 3.1. As a result the dataset was reduced to 38 CSs.
• Imputation methods: first, the examples where most attribute values are missing were excluded from the dataset using the Filter Examples operator (see Figure 10), as they do not bring any useful information for model construction. As a result, the dataset was reduced from 81 to 70 CSs. After that, one of the following imputation methods was applied:
– Mean substitution: missing values were replaced by the average values (for nominal attributes, by the mode) using the Replace Missing Values operator, as explained in Section 3.2.
– Regression imputation: missing values were replaced by the values predicted by the induced regression models using the Impute Missing Values operator, as explained in Section 3.2.
Figure 8 illustrates an example of the missing data treatment in RapidMiner® using regression imputation. Once complete data was obtained with each method, we used the Detect Outlier operator (see Figure 8) to identify 5 outliers based on the Euclidean distance to their k (in our case 10) nearest neighbors. As seen in Figure 8, the Filter Examples and Select Attributes operators were then applied for outlier removal and for the exclusion of the Dev Frc variable from the regression analysis, respectively.
Fig.8 - EPM with missing data estimation by regression using RapidMiner®.
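The outlier detection step of Figure 8 can be sketched with a plain distance computation. The sketch below scores each example by its distance to the k-th nearest neighbour, which is one common variant of the k-NN Euclidean distance criterion; the data and the shift applied to the first rows are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy feature matrix: 70 change sets with 3 numeric attributes (invented),
# where the first five rows are shifted far away from the bulk of the data.
X = rng.normal(0.0, 1.0, (70, 3))
X[:5] += 12.0

k = 10
# Pairwise Euclidean distances between all examples.
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)  # an example is not its own neighbour

# Outlier score: distance to the k-th nearest neighbour.
kth_distance = np.sort(d, axis=1)[:, k - 1]

# Flag the 5 highest-scoring examples and drop them (the Filter Examples step).
outliers = np.argsort(kth_distance)[-5:]
X_clean = np.delete(X, outliers, axis=0)
```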
Since code complexity, code reuse, impact, business acquaintance and technical experience are nominal variables, and the regression method works only with numerical ones, in all three cases we needed to transform each nominal variable into dummy (binary) variables before proceeding with the analysis. To this end we included the Nominal to Binominal and Nominal to Numerical transformation operators (see Figure 8).

4.3. Results Analysis

Since the regression models were induced from datasets of different sizes due to the diverse missing data treatments, the precision of the R2 metric is reduced, as it requires samples of equal size to produce reliable comparisons. In spite of this, R2 remains an informative indicator. Table 3 summarizes the results of the EPMs obtained after applying the three missing values treatments.

Table 3 - Results of the evaluation of the EPMs induced after three missing data treatments.
Data processing method | N | MAE | Coefficient of determination (R²) | Regressors
Case deletion | 33 | 43.256 +/- 15.346 | 0.253 +/- 0.264 | CS size; Code Complexity=3; Technical Experience=2; Impact=5
Mean Imputation | 65 | 38.591 +/- 3.690 | 0.283 +/- 0.16 | CS size; Nr CRs; Code Complexity=3; Technical Experience=2; Impact=5
Regression Imputation | 65 | 27.333 +/- 7.515 | 0.608 +/- 0.128 | CS size; Code Complexity=3; Business Acquaintance=2; Technical Experience=2; Impact=5
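The MAE and R2 figures above are bootstrap estimates of the form mean +/- standard deviation. A minimal sketch of such a bootstrap validation scheme, with synthetic data standing in for the CS database, is:

```python
import numpy as np

rng = np.random.default_rng(0)

def mae(y_true, y_pred):
    # Eq. (2): mean absolute error.
    return np.mean(np.abs(y_true - y_pred))

# Synthetic stand-in for the CS database: effort driven by CS size plus noise.
n = 80
cs_size = rng.uniform(1.0, 20.0, n)
effort = 10.0 + 6.0 * cs_size + rng.normal(0.0, 8.0, n)
X = np.column_stack([np.ones(n), cs_size])

# Bootstrap validation: fit on a resample, evaluate on the out-of-bag examples.
scores = []
for _ in range(100):
    idx = rng.integers(0, n, n)               # sample n examples with replacement
    oob = np.setdiff1d(np.arange(n), idx)     # examples left out of the resample
    if oob.size == 0:
        continue
    beta, *_ = np.linalg.lstsq(X[idx], effort[idx], rcond=None)
    scores.append(mae(effort[oob], X[oob] @ beta))

print(f"MAE = {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```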
The results obtained after applying case deletion show that the resulting model does not explain the data variation sufficiently, as R2 is equal to 0.253 +/- 0.264. Moreover, this estimate has a large variance, which does not allow the resulting model to be considered a valid one. In addition, a MAE of 43.256 +/- 15.346 indicates that the obtained model is rather inaccurate. The failure of the EPM obtained with data removal may be explained by the loss of information, since the dataset was reduced from 81 to 38 CSs. The EPM resulting from mean imputation included the variables code complexity=3, technical experience=2, impact=5, CS size and nr of CRs. The coefficient of determination of the obtained model is 0.283 +/- 0.16. As in the case of the EPM resulting from case deletion, this model is not able to explain a sufficient amount of the data variation. The MAE of the model is 38.591 +/- 3.690, lower than for case deletion but still considerable: it indicates that, on average, the predicted software development values differ from the real ones by approximately 39 hours. The EPM obtained with regression imputation is based on the variables code complexity=3, business acquaintance=2, technical experience=2, impact=5 and CS size. The R2 associated with this model is 0.608 +/- 0.128 and the MAE is 27.333 +/- 7.515. These results are the best when compared with the previous cases. The resulting model explains approximately 60% of the data variation, but the predicted values still differ from the real ones, on average, by approximately 27 hours.

5. Conclusions

The study described in this work was carried out in a medium-sized Portuguese software development organization. The main aim was to evaluate effort estimation model performance according to the type of missing data treatment. To conduct the study, the open-source data mining software RapidMiner® was employed.
The stepwise MLR method was selected to obtain the EPMs because of the existence of historical data and its relative simplicity of use. The results obtained with data removal and with replacement by average values are not considered good enough, as the R2 in both cases is rather low with a great variance, and the MAE demonstrates a significant deviation between the predicted and actual values.
Referring to the EPM obtained with regression imputation, we can state that it showed a considerable improvement in terms of both explanatory and predictive capability. On the other hand, this approach has the disadvantage that imputed values are usually better-behaved than real values and may lead to over-optimistic results that do not correspond to reality. In further work, we plan to explore other missing values techniques and other regression models for EPM induction, in order to investigate their response to different missing values treatments.

References
[1] Anderson, D.R., D.J. Sweeney, and T.A. Williams, eds., Statistics for Business and Economics, 10th ed. Thomson South-Western, 2009.
[2] Boehm, B., C. Abts, and S. Chulani, Software development cost estimation approaches - A survey. Annals of Software Engineering, 2000. 10: p. 177-205.
[3] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, From Data Mining to Knowledge Discovery in Databases. AI Magazine, 1996. 17(3): p. 37-54.
[4] Frawley, W.J., G. Piatetsky-Shapiro, and C.J. Matheus, Knowledge Discovery in Databases: An Overview. AI Magazine, 1992. 13(3): p. 57-70.
[5] Freund, R.J., W.J. Wilson, and P. Sa, eds., Regression Analysis: Statistical Modeling of a Response Variable, 2nd ed. Elsevier, 2006.
[6] Galorath, D.D. and M.W. Evan, eds., Software Sizing, Estimation, and Risk Management: When Performance is Measured Performance Improves. Taylor & Francis Group, 2006.
[7] Jorgensen, M., A review of studies on expert estimation of software development effort. The Journal of Systems and Software, 2004. 70: p. 37-60.
[8] Jorgensen, M., et al., Human Judgement in Effort Estimation of Software Projects.
[9] Little, R.J. and D.B. Rubin, eds., Statistical Analysis with Missing Data, 2nd ed. John Wiley and Sons: New York, 2002.
[10] Montgomery, D.C., E.A. Peck, and G.G. Vining, Introduction to Linear Regression. John Wiley & Sons, Inc., 2006.
[11] Myrtveit, I., E. Stensrud, and U.H. Olsson, Analyzing Data Sets with Missing Data: An Empirical Evaluation of Imputation Methods and Likelihood-Based Methods. IEEE Transactions on Software Engineering, 2001. 27(11): p. 999-1013.
[12] Rapid-I, 2010. Available from: www.rapid-i.com. Visited 8/04/2011.
[13] Shepperd, M., C. Schofield, and B. Kitchenham, Effort Estimation Using Analogy. Paper presented at the International Conference on Software Engineering, Berlin, 1996.
[14] Singh, Y., et al., A Review of Studies on Effort Estimation Techniques of Software Development. Paper presented at Mathematical Techniques: Emerging Paradigms for Electronics and IT Industries, 2008.
[15] Willmott, C.J. and K. Matsuura, Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research, 2005. 30: p. 79-82.