Software Qual J (2011) 19:537–552 DOI 10.1007/s11219-010-9112-9
A comparative study for estimating software development effort intervals

Ayşe Bakır • Burak Turhan • Ayşe Bener
Published online: 9 September 2010
© Springer Science+Business Media, LLC 2010
Abstract Software cost/effort estimation is still an open challenge. Many researchers have proposed various methods that usually focus on point estimates. Until today, software cost estimation has been treated as a regression problem. However, in order to prevent overestimates and underestimates, it is more practical to predict the interval of estimations instead of the exact values. In this paper, we propose an approach that converts cost estimation into a classification problem and that classifies new software projects into one of the effort classes, each of which corresponds to an effort interval. Our approach integrates cluster analysis with classification methods. Cluster analysis is used to determine effort intervals, while different classification algorithms are used to find the corresponding effort classes. The proposed approach is applied to seven public datasets. Our experimental results show that the hit rates obtained for effort estimation are around 90–100%, which is much higher than those obtained in related studies. Furthermore, in terms of point estimation, our results are comparable to those in the literature although a simple mean/median is used for estimation. Finally, the dynamic generation of effort intervals is the most distinctive part of our study, and it saves time and effort for project managers by removing the need for human intervention.

Keywords Software effort estimation · Interval prediction · Classification · Cluster analysis · Machine learning
A. Bakır (✉)
Department of Computer Engineering, Boğaziçi University, 34342 Bebek, Istanbul, Turkey
e-mail: [email protected]

B. Turhan
Department of Information Processing Science, University of Oulu, 90014 Oulu, Finland
e-mail: [email protected]

A. Bener
Ted Rogers School of Information Technology Management, Ryerson University, Toronto M5B 2K3, Canada
e-mail: [email protected]
1 Introduction

As software becomes more important in many domains, the focus on its overall quality in terms of technical product quality and process quality also increases. As a result, software is blamed for business failures and the increased cost of business in many industries (Lum et al. 2003). The underestimation of software effort causes cost overruns that lead to cost cutting. Cost cutting means that some of the life cycle activities either are skipped or cannot be completed as originally planned, which causes a drop in software product quality. To avoid the cost/quality death spiral, accurate cost estimates are vital (Menzies and Hihn 2006).

Software cost estimation is one of the critical steps in the software development life cycle (Boehm 1981; Leung and Fan 2001). It is the process of predicting the effort required to develop a software project. Such predictions assist project managers when they make important decisions such as bidding for a new project, or planning and allocating resources. Inaccurate cost estimations may cause project managers to make wrong decisions. As Leung and Fan (2001) state, underestimations may result in approving projects that exceed their budgets and schedules. Overestimations, on the other hand, may result in rejecting other useful projects and wasting resources.

Point estimates are generally used for project staffing and scheduling (Sentas et al. 2005). However, managers may easily make wrong decisions if they rely only on point estimates and the associated error margins generated by cost estimation methods. Although most methods proposed in the literature produce point estimates, Stamelos and Angelis (2001) state that producing interval estimates is safer. They emphasize that point estimates carry a high level of uncertainty, stemming from unclear requirements and their implications for the project, and may therefore mislead project managers into wrong decisions. Interval estimates may be used for predicting the cost of a current project in terms of completed ones. In addition, while bidding for a new project, an interval estimate can easily be converted to a point estimate by evaluating the values that fall into the same interval.

Up to now, interval estimation has consisted of finding either the confidence intervals for point estimates or the posterior probabilities of predefined intervals and then fitting regression-based methods to these intervals (Angelis and Stamelos 2000; Jorgensen 2002; Sentas et al. 2003, 2005; Stamelos and Angelis 2001; Stamelos et al. 2003). However, none of these approaches addresses the problem of cost estimation as a pure classification problem.

In this paper, we aim to convert cost estimation into a classification problem by using interval estimation as a tool. The proposed approach integrates classification methods with cluster analysis, which, to the best of our knowledge, is applied for the first time in the software engineering domain. In addition, by using cluster analysis, effort classes are determined dynamically instead of using manually predefined intervals. The approach uses historical data of completed projects, including their effort values, and consists of three main phases: (1) clustering effort data so that each cluster contains similar projects; (2) labeling each cluster with a class number and determining the effort intervals for each cluster; and (3) classifying new projects into one of the effort classes.
We used various datasets to validate our approach, and our results revealed much higher estimation accuracies than those reported in the literature. In particular, we obtained higher hit rates for effort interval estimation. We also obtained point estimates with simple approaches such as taking the mean/median of an interval, and their performance is comparable to that reported in the literature.
The rest of the paper is organized as follows: Sect. 2 discusses related work from the literature. Section 3 describes the proposed approach in detail, while Sect. 4 presents the experiments conducted. Section 5 presents the results and discussions. Finally, conclusions and future work are given in Sect. 6.
2 Related work

Previous work on software cost estimation has mostly produced point estimates by using regression methods (Baskeles et al. 2007; Boetticher 2001; Briand et al. 1992; Draper and Smith 1981; Miyazaki et al. 1994; Shepperd and Schofield 1997; Srinivasan and Fisher 1995; Tadayon 2005). According to Boehm, the two most popular regression methods are ordinary least squares regression (OLS) and robust regression (Boehm et al. 2000). OLS is a general linear model that uses least squares, whereas robust regression is an improved version of OLS (Draper and Smith 1981; Miyazaki et al. 1994). Besides regression, various machine learning methods have been used for cost estimation. For example, back-propagation multilayer perceptrons and support vector machines (SVM) have been used for effort estimation in Baskeles et al. (2007) and Boetticher (2001), and Briand et al. (1992) introduce a cost estimation method based on optimized set reduction. Other methods for point estimation include estimation by analogy and neural networks: in Shepperd and Schofield (1997), high accuracies are obtained by using analogy-based prediction, whereas in Tadayon (2005), a significant improvement is achieved on large datasets through the use of an adaptive neural network model.

Fewer studies focus on interval estimation. They can be grouped into two main categories: (1) those that produce confidence intervals for point estimates and (2) those that produce probabilities of predefined intervals. In category 1, interval estimates are generated during the estimation process, whereas in category 2, intervals are predefined before the estimation process.

The first study to empirically evaluate effort prediction interval models is Angelis and Stamelos (2000). It compares the effort prediction intervals derived from a bootstrap-based model with the prediction intervals derived from regression-based effort estimation models. However, that study displays a confusion of terms, and a critique was consequently made to clarify the ambiguity (Jorgensen 2002; Jorgensen and Teigen 2002). In another study, an interval estimation method based on expert judgment is proposed (Jorgensen 2003). Statistical simulation techniques for calculating confidence intervals for project portfolios are presented in Stamelos and Angelis (2001).

Two important studies for category 2 are Sentas et al. (2005), in which ordinal regression is used to model the probabilities of both effort and productivity intervals, and Sentas et al. (2003), which uses multinomial logistic regression for modeling productivity intervals. Both studies also include point estimate results of the proposed models. Also, in Stamelos et al. (2003), predefined intervals of productivity are used in a Bayesian belief network to support expert opinion. An empirical comparison of models that produce point estimates and predefined interval estimates is given in Bibi et al. (2004).

Our work differs from these studies in several ways. Firstly, effort intervals are not predefined manually in this paper; instead, they are determined by cluster analysis. Secondly, instead of using regression-based methods, we use classification algorithms that
originate from the machine learning domain. Thirdly, point estimates can still be derived from these intervals, as we show in the following sections.

NASA's Software Engineering Laboratory also specified some guidelines for the estimation of effort prediction intervals (NASA 1990). However, these guidelines may affect the external validity of the results since they do not reflect the characteristics of projects in other organizations.

Cluster analysis is not a new concept in the software cost estimation domain. Lee et al. (1998) integrate clustering with neural networks in order to estimate development cost: they find similar projects with clustering and use them to train the network. In Gallego et al. (2007), the cost data are clustered, and then different regression models are fitted to each cluster. Similar to these studies, we also use cluster analysis for grouping similar projects. The difference of our work in comparison to these studies is that we combine clustering with classification methods for effort estimation.
3 The approach

There are three main steps in our approach: (1) grouping similar projects together by cluster analysis; (2) determining the effort intervals for each cluster and specifying the effort classes; and (3) classifying new projects into one of the effort classes. The assumption behind applying cluster analysis to effort data is that similar projects require similar development effort. The class-labeled clusters then become the input data for the classification algorithm, which converts cost estimation into a classification process.

3.1 Cluster analysis

Cluster analysis is a technique for grouping data and finding similar structures in data. In the software cost estimation domain, clustering corresponds to grouping projects into clusters based on their attributes: similar projects are assigned to the same cluster, whereas dissimilar projects belong to different clusters. In this study, we use an incremental clustering algorithm called the leader cluster algorithm (Alpaydin 2004). In this algorithm, the number of clusters is not predefined; instead, the clusters are generated incrementally. Since one of our main objectives is to generate the effort intervals dynamically, this algorithm is selected to group similar software projects. Other clustering techniques that generate clusters dynamically could also be used, but this is out of the scope of this work. The pseudocode of the leader cluster algorithm is given in Fig. 1 (Bakar et al. 2005). In order to determine the similarity between two projects, the Euclidean distance is used; it is a widely preferred distance metric for software engineering datasets (Lee et al. 1998).

3.2 Effort classes

After the clusters and their centers are determined, the effort intervals are calculated for each cluster. In order to specify the effort intervals and classes, firstly, the minimum and maximum effort values of the projects residing in the same cluster are found. Secondly, these minimum and maximum values are selected as the lower and upper bounds of the interval that will represent that cluster. Finally, each cluster is given a class label, which will be used for classifying new projects.
1. Assign the first data item to the first cluster.
2. Consider the next data item:
   Find the distances between the new item and the existing cluster centers.
   If (distance < threshold) {
       Assign this item to the nearest cluster
       Recompute the value for that cluster center
   } Else {
       Assign it to a new cluster
   }
3. Repeat step 2 until the total squared error is small enough.
Fig. 1 Pseudocode for leader cluster algorithm (Bakar et al. 2005)
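For concreteness, the following is a minimal Python/numpy sketch of the leader cluster pseudocode in Fig. 1. It is our own illustration, not the original implementation (the study used MATLAB): the single-pass structure and the externally supplied threshold are assumptions, since Sect. 4.3 additionally tunes the number of clusters against the total squared error.

```python
import numpy as np

def leader_cluster(X, threshold):
    """Single-pass leader clustering; returns cluster labels and centers."""
    centers = [X[0].astype(float)]      # the first item starts the first cluster
    members = [[0]]                     # project indices assigned to each cluster
    labels = np.zeros(len(X), dtype=int)
    for i in range(1, len(X)):
        dists = [np.linalg.norm(X[i] - c) for c in centers]   # Euclidean distances
        j = int(np.argmin(dists))
        if dists[j] < threshold:        # close enough: join the nearest cluster
            members[j].append(i)
            labels[i] = j
            centers[j] = X[members[j]].mean(axis=0)   # recompute that cluster center
        else:                           # too far: open a new cluster
            centers.append(X[i].astype(float))
            members.append([i])
            labels[i] = len(centers) - 1
    return labels, np.asarray(centers)
```

In our setting, each row of X would be a normalized project vector; the threshold controls how many clusters emerge.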
3.3 Classification

The class of a new project is estimated by using the class-labeled data generated in the previous step. The resulting class corresponds to the effort interval that contains the effort value of the new project. We use three different classification algorithms for this step: one is parametric (linear discrimination) and the others are non-parametric (k-nearest neighbor and decision tree). These three algorithms are chosen to show how our approach performs with algorithms of different complexities. Linear discrimination is the simplest, whereas the decision tree is the most complex one; k-nearest neighbor has moderate complexity depending on the size of the training set.

3.3.1 Linear discrimination

Linear discrimination (LD) is a discriminant-based approach that tries to fit a model directly for the discriminant between the class regions, without first estimating the likelihoods or posteriors (Alpaydin 2004). It assumes that the projects of a class are linearly separable from the projects of other classes and requires no knowledge of the densities inside the class regions. The linear discriminant function is

g_i(x \mid w_i, w_{i0}) = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}    (1)
where g_i is the model, w_i and w_{i0} are the model parameters, and x is the software project with d attributes. The discriminant is used to separate two or more classes. Learning involves the optimization of the model parameters to maximize the classification accuracy on a given set of projects. Because of its simplicity and comprehensibility, linear discrimination is frequently used before trying a more complicated model.

3.3.2 k-nearest neighbor

The k-nearest neighbor (k-NN) algorithm is a simple but powerful learning method that is particularly suited for classification problems.
k-NN assumes that all projects correspond to points in the n-dimensional Euclidean space R^n, where n is the number of project attributes. The algorithm's output is the class that has the most examples among the k nearest neighbors of the input project. The neighbors are found by calculating the Euclidean distance from each project to the input project. The selection of k is very important; it is generally set to an odd number to minimize ties, as confusion generally appears between two neighboring classes (Alpaydin 2004). Although the algorithm is easy to implement, the amount of computation increases as the training set grows in size.

3.3.3 Decision tree

Decision trees (DT) are hierarchical data structures that are based on a divide-and-conquer strategy (Quinlan 1993). They can be used for both classification and regression and require no assumptions concerning the data. In the case of classification, they are called classification trees. The nodes of a classification tree correspond to the attributes that best split the data into disjoint groups, while the leaves correspond to the effort classes of those splits. The quality of a split is determined by an impurity measure. The tree is constructed by partitioning the data recursively until no further partitioning is possible, choosing at every step the split that minimizes the impurity (Alpaydin 2004). Concerning the estimation of software effort, the effort class of a new project can be determined by traversing the tree from top to bottom along the appropriate paths.
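To tie steps (2) and (3) together, here is a small Python/numpy sketch, purely our own illustration, of how a cluster labeling is turned into effort intervals and how a new project is assigned to an effort class; the function names and the choice of the nearest neighbor (k = 1, as later used in Sect. 4.3) are assumptions.

```python
import numpy as np

def effort_intervals(efforts, labels):
    """Effort interval (min, max) for each cluster label, as in Sect. 3.2."""
    return {c: (float(efforts[labels == c].min()), float(efforts[labels == c].max()))
            for c in np.unique(labels)}

def classify_1nn(X_train, labels, x_new):
    """Assign x_new to the effort class of its nearest training project."""
    d = np.linalg.norm(X_train - x_new, axis=1)     # Euclidean distances
    return int(labels[np.argmin(d)])

# Example usage (X, efforts: historical data; labels: output of the clustering step):
# c = classify_1nn(X, labels, x_new)                # predicted effort class
# low, high = effort_intervals(efforts, labels)[c]  # predicted effort interval
# point = np.median(efforts[labels == c])           # optional point estimate (or mean)
```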
4 Experimental study

Our purpose in this study is to convert the effort estimation problem into a classification problem through the following phases: (1) clustering the effort data; (2) labeling each cluster with a class number and determining the effort intervals for each cluster; and (3) classifying the new projects. In addition, the point estimation performance of the approach is tested by taking either the mean or the median of the effort values of the projects included in the estimated class. In this section, we describe the validation of our approach on a number of datasets. MATLAB is used as the tool for all the analyses reported in this study.

4.1 Dataset description

In our experiments, data from two different sources are used: the Promise Data Repository and the Software Engineering Research Laboratory (SoftLab) Repository (Boetticher et al. 2007; SoftLab 2009). Seven datasets are used in this study. Four of them, cocomonasa_v1, coc81, desharnais_1_1 and nasa93, are taken from the Promise Data Repository. The others, sdr05, sdr06 and sdr07, are taken from the SoftLab (2009) Repository; they contain data from different local software companies in Turkey, collected by using the COCOMO II Data Collection Questionnaire (Boehm 1999). The datasets include a number of nominal attributes and two real-valued attributes: Lines of Code (LOC) and Actual Effort. An example dataset is given in Table 1. Each row in Table 1 corresponds to a different project. These projects are represented by the nominal attributes from the COCOMO II model along with their size in terms of LOC and the actual effort spent for completing the projects.
Table 1 An example dataset

Project  Nominal attributes (as defined in COCOMO II)                              LOC     Effort
P1       1.00,1.08,1.30,1.00,1.00,0.87,1.00,0.86,1.00,0.70,1.21,1.00,0.91,1.00,1.08   70      278
P2       1.40,1.08,1.15,1.30,1.21,1.00,1.00,0.71,0.82,0.70,1.00,0.95,0.91,0.91,1.08   227     1,181
P3       1.00,1.08,1.15,1.30,1.06,0.87,1.07,0.86,1.00,0.86,1.10,0.95,0.91,1.00,1.08   177.9   1,248
P4       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08   115.8   480
P5       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08   29.5    120
P6       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08   19.7    60
P7       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08   66.6    300
P8       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08   5.5     18
P9       1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08   10.4    50
P10      1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08   14      60
P11      1.00,1.00,1.15,1.11,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00   16      114
P12      1.15,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00   6.5     42
P13      1.00,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00   13      60
P14      1.00,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00   8       42

Table 2 An overview of the datasets

Data source       Dataset name                        # of Projects
Promise           cocomonasa_v1                       60
Promise           coc81                               63
Promise           desharnais_1_1 (updated version)    77
Promise           nasa93                              93
SoftLab (2009)    sdr05                               25
SoftLab (2009)    sdr06                               24
SoftLab (2009)    sdr07                               40
We have used several datasets in the same format as shown in Table 1 in order to validate our approach on a wide range of effort estimation data and to generalize our results as much as possible. A list of all the datasets used in this study is given in Table 2.

4.2 Design

Before applying any method, all of the datasets are normalized in order to remove the scaling effects of the different dimensions. By using min–max normalization, project attribute values are converted into the [0, 1] interval (Shalabi and Shaaban 2006). After normalization, a dimension reduction technique is needed to extract the relevant features. In this paper, principal component analysis (PCA) is used (Alpaydin 2004). The main purpose of PCA is to reduce the dimensions of the dataset so that it can still be efficiently represented without losing much information. Specifically, PCA seeks the dimensions in which the variances are maximized. By applying PCA to each cluster after clustering, the model shown in Fig. 2 is developed.
Fig. 2 Our proposed model: Data → Min–Max Normalization → Normalized Data → Leader Cluster → PCA on each cluster → Find effort intervals for each cluster → 10×10 cross-validation with k-NN, LD and DT → Calculate accuracy measures
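As an illustration of the first preprocessing boxes in Fig. 2, the sketch below applies min–max normalization and per-cluster PCA with a 0.90 proportion of variance. It is written in Python with scikit-learn purely for exposition (the original analyses were carried out in MATLAB); the function names are ours, and X is assumed to hold the project attributes without the Effort column.

```python
import numpy as np
from sklearn.decomposition import PCA

def min_max_normalize(X):
    """Scale each attribute into [0, 1] (min-max normalization)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns

def pca_per_cluster(X, labels, variance=0.90):
    """Reduce each cluster separately, keeping 90% of its variance.

    Assumes every cluster contains at least two projects.
    """
    reduced = {}
    for c in np.unique(labels):
        pca = PCA(n_components=variance, svd_solver="full")  # float => variance fraction
        reduced[c] = pca.fit_transform(X[labels == c])
    return reduced
```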
Our aim in applying PCA separately to each cluster is to extract separate features for each cluster so that we can obtain better results for both classification and point estimation. The dataset given in Table 1 is used as an example in Fig. 2 to show how our cost data are processed. In Fig. 2, the projects in the cost dataset are denoted P1…P14; after normalization, they are denoted P1′…P14′. The four clusters generated are named C1, C2, C3 and C4, and they correspond to effort interval classes. As described earlier, the lower and upper bounds of an effort interval class are determined dynamically by the minimum and the maximum effort values of the projects that reside in the corresponding cluster.

4.3 Model

Normalized effort estimation data are given as input to this model. Firstly, the leader cluster algorithm is applied to the normalized data to obtain project groups. Here, we select the number of clusters that minimizes the total squared error while keeping the distance below the defined threshold value; the optimum number of clusters is found by testing all possibilities and calculating the total squared error. Secondly, with PCA, each cluster's dimensions are reduced individually by using its own covariance matrix (the proportion of variance is set to 0.90). The aim here is to prevent data loss within the clusters.
PCA is applied to the entire data except the Effort column, which holds the value that we want to estimate. Thirdly, each cluster is assigned a class label, and the effort intervals for each cluster are determined: as stated in Sect. 3.2, the minimum and maximum effort values are selected as the interval bounds. Then, the effort data containing the projects with their corresponding class labels are given to each of the classification algorithms described in Sect. 3. For the k-nearest neighbor algorithm, the nearest neighbor is selected (k = 1). For the linear discrimination and decision tree algorithms, the built-in implementations of MATLAB are used.

Since separate training and test sets do not exist, the classification process is performed in a 10×10 cross-validation loop. The data are shuffled 10 times into random order and then divided into 10 bins in the cross-validation loop. The training set is built from nine of the bins, and the remaining bin is used as the validation set. The classification algorithms are first trained on the training set, and then estimations and error calculations are made on the validation set. The errors are collected over the 100 cross-validation iterations, and then the MMRE, MdMRE and PRED values are calculated. Since we have three classification methods, we have three sets of measurements.

In addition, point estimates are calculated at the classification stage in order to determine our point estimation performance. For this process, we use the mean and the median as point estimators since they have been used in other studies in the literature; for example, Sentas et al. (2003) represent each interval by a single representative value, the mean point or the median point. At the classification step, when the correct effort class is estimated, the mean and the median of the effort values of the projects belonging to that class are calculated.

4.4 Accuracy measures

Although our aim is to convert cost estimation into a classification problem, we also report the point estimate results of the proposed approach in order to allow a comparison with other studies. Thus, we employ two types of accuracy measures in our experimental study: (1) the misclassification rate for classification and (2) MeanMRE (MMRE), MedianMRE (MdMRE) and PRED(25) for point estimates.

4.4.1 Misclassification rate

The misclassification rate is simply the proportion of misclassified software projects in a test set to the total number of projects to be classified in the same test set. It is calculated for each classification algorithm in each model. The misclassification rate is calculated as

MR = \frac{1}{N_t} \sum_{n=1}^{N_t} e_n, \qquad e_n = \begin{cases} 1 & \text{if } y_n \neq y'_n \\ 0 & \text{otherwise} \end{cases}    (2)

where N_t is the total number of projects in the test set, y'_n is the estimated effort class and y_n is the actual effort class of project n. The misclassification rate can be thought of as the complement of the hit rate reported in interval prediction studies; thus, our results remain comparable to those studies:

100\% = \text{Misclassification Rate} + \text{Hit Rate}    (3)
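A compact sketch of the evaluation protocol (the 10×10 cross-validation of Sect. 4.3) together with Eqs. (2)–(3) is given below. It is our own simplification: it uses scikit-learn's KFold and a 1-NN classifier as a stand-in for the three classifiers, and the function name is illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def cv_misclassification_rate(X, y_class, shuffles=10, folds=10):
    """10x10 cross-validation; returns misclassification rate and hit rate."""
    errors = []
    for s in range(shuffles):                               # 10 random shuffles
        kf = KFold(n_splits=folds, shuffle=True, random_state=s)
        for train_idx, val_idx in kf.split(X):              # 9 bins train, 1 bin validates
            clf = KNeighborsClassifier(n_neighbors=1)
            clf.fit(X[train_idx], y_class[train_idx])
            pred = clf.predict(X[val_idx])
            errors.extend(pred != y_class[val_idx])         # Eq. (2): 1 where classes differ
    mr = float(np.mean(errors))
    return mr, 1.0 - mr                                     # Eq. (3): hit rate = 1 - MR
```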
4.4.2 MMRE, MdMRE and PRED(25)

These measures are calculated from the relative error, i.e. the relative difference between the actual and the estimated value. The magnitude of relative error (MRE) is calculated by the following formula:

MRE = \frac{|\text{predicted} - \text{actual}|}{\text{actual}}    (4)
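The point-estimate measures built on Eq. (4) and defined in the remainder of this subsection (MMRE, MdMRE and PRED(25)) can be computed as in the following minimal numpy sketch; the array and function names are our own.

```python
import numpy as np

def mre(actual, predicted):
    """Magnitude of relative error, Eq. (4)."""
    return np.abs(predicted - actual) / actual

def mmre(actual, predicted):
    return float(np.mean(mre(actual, predicted)))           # mean of the MRE values

def mdmre(actual, predicted):
    return float(np.median(mre(actual, predicted)))         # median of the MRE values

def pred(actual, predicted, level=25):
    """PRED(N), Eq. (5): percentage of estimates with MRE <= N/100."""
    return 100.0 * float(np.mean(mre(actual, predicted) <= level / 100.0))
```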
The mean magnitude of relative error (MMRE) is the mean of the MRE values, and the median magnitude of relative error (MdMRE) is the median of the MRE values. Prediction at level N, or PRED(N), is used to examine the cumulative frequency of MRE for a specific error level. For T estimations, the formula is as follows:

PRED(N) = \frac{100}{T} \sum_{i=1}^{T} \begin{cases} 1 & \text{if } MRE_i \le \frac{N}{100} \\ 0 & \text{otherwise} \end{cases}    (5)

In this study, we take the desired error level as N = 25. PRED(25) is preferred over MMRE and MdMRE for evaluating the stability and robustness of the estimations (Conte et al. 1986; Stensrud et al. 2003). For a model to be considered to perform well, the MMRE and MdMRE values should be low and the PRED(25) values should be high.

4.5 Scope and limitations

In this paper, we address the cost estimation problem as a classification problem and propose an approach that integrates classification methods with cluster analysis. This approach uses historical cost data and different machine learning techniques in order to make predictions. Although our main aim is to predict effort intervals, we also demonstrate that point estimates can be obtained through our approach. Therefore, the scope of our work is relevant for practitioners who employ cost estimation practices.

One limitation of our approach is that we test only one clustering method to obtain the effort classes; other clustering techniques that create clusters dynamically could be used instead of the leader cluster. A second limitation is that we obtain point estimates through simple approaches such as taking the mean/median of an interval; regression-based models could be used to increase the point estimation performance. However, our aim is not to demonstrate the superiority of one algorithm over the others; instead, we provide an implementation of our ideas on public datasets in order to demonstrate the applicability of our approach.

We address the threats to the validity of our work under three categories: (1) internal validity, (2) external validity and (3) construct validity. Internal validity fundamentally questions to what extent the cause–effect relationship between dependent and independent variables exists. To address the threats to the internal validity of our results, we used seven datasets and applied 10×10 cross-validation to overcome ordering effects. External validity, i.e. generalizability, addresses the extent to which the findings of a particular study are applicable outside the specifications of that study. To ensure the generalizability of our results, we paid extra attention to including as many datasets from various sources as possible and used seven datasets from two different sources. Our datasets contain a wide diversity of projects in terms of their sources, their domains and the time period during which they were developed; they are composed of software development projects from different organizations around the world.
Construct validity (i.e. face validity) assures that we are measuring what we actually intend to measure. In our research, we use MR, MMRE, MdMRE and PRED(25) for measuring and comparing the performance of the model. The majority of effort estimation studies use estimation-error-based measures for measuring and comparing the performance of different methods. We also used error-based measures in our study since they are a practical option for the majority of researchers; moreover, using error-based measures enables our study to be benchmarked against previous effort estimation research.
5 Results and discussions

The proposed approach is applied to and validated on all seven datasets. The results are given in terms of the accuracy measures described in Sect. 4. The effort clusters created for each dataset are given in Table 3. In order to show the clustering efficiency, the minimum and maximum numbers of projects assigned to a cluster are also given.

The classification results for effort interval estimation are given in Fig. 3. k-NN and LD perform similarly for coc81, desharnais_1_1, nasa93 and sdr05; they both give a misclassification rate of 0% for coc81 and sdr05. For cocomonasa_v1 and sdr06, k-NN outperforms the others, whereas LD is the best one for sdr07. In total, the proposed model gives a misclassification rate of 0% for five cases in the best case and 17% in the worst case.
Table 3 Effort clusters for each dataset

Dataset          # of Clusters    # of Projects (Min)    # of Projects (Max)
coc81            4                2                      44
cocomonasa_v1    5                3                      36
desharnais_1_1   9                2                      21
nasa93           6                3                      44
sdr05            3                3                      16
sdr06            3                2                      12
sdr07            4                6                      16
Fig. 3 Effort misclassification rates for each dataset
Table 4 Comparison of the results

                 Hit rate (%), Min    Hit rate (%), Max
Sentas et al.    60.38                79.24
Our model        97                   100
The outcomes concerning effort interval estimation yield some important results. Considering the classifiers, k-NN is the best performing one and LD follows it with a slight difference, whereas DT is the worst performing one. Since our main aim is effort interval classification, we focus on the misclassification rate to measure how good our classification performance is. The misclassification rates are 0% for most cases and around 17% in the worst case.

There are not many studies in the literature that investigate effort interval classification. The most recent study on this topic is that of Sentas et al., in which ordinal regression is used to model the probabilities of both effort and productivity intervals (Sentas et al. 2005). In that study, hit rates of around 70% are obtained for productivity interval estimation on the coc81 dataset. In our study, however, the hit rates for all datasets are between 90 and 100%. The main reason for this is that we use similar projects in order to predict the project cost, which is achieved by clustering the projects according to their attributes. Furthermore, the intervals in the above-mentioned study are manually predefined, whereas we create them dynamically by clustering. In Table 4, we compare our results with those of Sentas et al.

We also analyzed our results in terms of point estimation, using a simple approach based on the means and medians of the intervals. We should once again note that our main aim is to determine the effort intervals; however, we also show how our results can easily be converted to point estimates that are comparable to previous ones. In Table 5, we present the point estimation results in terms of the three measures mentioned in the previous section. Point estimates are determined by taking either the mean or the median of the effort values of the projects in the estimated class.

In terms of point estimation performance, k-NN and LD perform nearly the same and better than DT for all datasets. The performance of all classifiers improves for all measures when the median is used for point estimation; especially for the MMRE and MdMRE measures, the improvement is obvious. MMRE and MdMRE results decrease to 13%, and PRED results increase to 86% for some datasets. Note that a PRED(25) value of 86% means that 86% of all estimations are within 25% of the actual effort, which shows the stability and robustness of the model we propose.

Combining clustering with classification methods has helped us to achieve favorable results by eliminating the effects of unrelated data. Our experimental results show that we achieved much higher hit rates than those of previous studies. Although we simply use the mean and the median of the effort interval values, the point estimation results are also comparable to those in the literature. If a different model were fitted to each interval separately, we expect that our estimation results would further improve.
Table 5 Point estimation results (%)

                                 Using the mean of projects      Using the median of projects
Dataset          Classifier      MMRE    MdMRE    PRED           MMRE    MdMRE    PRED
coc81            LD              189     183      33             131     131      33.6
                 k-NN            189     183      33             131     131      33.6
                 DT              192     190      29.6           134     131      30.2
cocomonasa_v1    LD              69      45       42.2           51      32       54.8
                 k-NN            69      45       42             51      32       54.6
                 DT              76      50       26.8           58      40       39.4
desharnais_1_1   LD              13      12       84.14          13      12       86.42
                 k-NN            13      12       84.14          13      12       86.71
                 DT              16      15       79             15      15       81.85
nasa93           LD              70      52       55.5           52      40       57.7
                 k-NN            69      52       55.5           52      40       57.7
                 DT              72      52       51.2           55      41       53.4
sdr05            LD              45      28       45.5           37      26       52
                 k-NN            45      28       45.5           37      26       52
                 DT              59      44       28.5           52      38       35
sdr06            LD              31      31       50.5           25      23       67
                 k-NN            30      31       50.5           24      23       67
                 DT              34      36       44.5           27      25       61
sdr07            LD              14      14       84.66          14      14       79.6
                 k-NN            14      13       81.33          14      14       76.3
                 DT              14      13       81.33          14      14       76.3
6 Conclusions and future work

Although various methods have been proposed in the literature, in this paper we handle the cost estimation problem in a different manner. We treat cost estimation as a classification problem rather than a regression problem and propose an approach that classifies new software projects into one of the dynamically created effort classes, each of which corresponds to an effort interval. Preventing overestimation and underestimation is more practical when intervals are predicted instead of exact values. The approach integrates classification methods with cluster analysis, which is, to the best of our knowledge, performed for the first time in the software engineering domain. In contrast to previous studies, the intervals are not predefined but dynamically created through clustering.

The proposed approach is validated on seven datasets taken from public repositories, and the results are presented in terms of widely used performance measures. These results point out three important advantages that our approach offers:

1. We obtain much higher effort estimation hit rates (around 90–100%) in comparison to other studies in the literature.
2. The point estimation results show that the MdMRE, MMRE and PRED(25) values are comparable to those in the literature for most of the datasets, although we use simple methods such as the mean and the median of the intervals.
3. Effort intervals are generated dynamically according to historical data. This removes the need for project managers to specify effort intervals manually and hence prevents the waste of time and effort.

Future work includes the use of different clustering techniques to find effort classes and the fitting of probabilistic models to the intervals. Also, regression-based models can be used for
point estimation instead of taking the mean and the median of interval values, which would enhance the point estimation performance.

Acknowledgments This research is supported in part by Tubitak under grant number EEEAG108E014.
References

Alpaydin, E. (2004). Introduction to machine learning. Cambridge: The MIT Press.
Angelis, L., & Stamelos, I. (2000). A simulation tool for efficient analogy based cost estimation. Journal of Empirical Software Engineering, 5(1), 35–68.
Bakar, Z. A., Deris, M. M., & Alhadi, A. C. (2005). Performance analysis of partitional and incremental clustering. Seminar Nasional Aplikasi Teknologi Informasi (SNATI).
Baskeles, B., Turhan, B., & Bener, A. (2007). Software effort estimation using machine learning methods. In Proceedings of the 22nd international symposium on computer and information sciences (ISCIS 2007), Ankara, Turkey, pp. 126–131.
Bibi, S., Stamelos, I., & Angelis, L. (2004). Software cost prediction with predefined interval estimates. In First Software Measurement European Forum, Rome, Italy, January 2004.
Boehm, B. W. (1981). Software engineering economics. Advances in computer science and technology series. Upper Saddle River, NJ: Prentice Hall PTR.
Boehm, B. W. (1999). COCOMO II and COQUALMO data collection questionnaire. University of Southern California, Version 2.2.
Boehm, B., Abts, C., & Chulani, S. (2000). Software development cost estimation approaches—A survey. Annals of Software Engineering.
Boetticher, G. D. (2001). Using machine learning to predict project effort: Empirical case studies in data-starved domains. In First international workshop on model-based requirements engineering, pp. 17–24.
Boetticher, G., Menzies, T., & Ostrand, T. (2007). PROMISE repository of empirical software engineering data. West Virginia University, Department of Computer Science. http://www.promisedata.org/repository.
Briand, L. C., Basili, V. R., & Thomas, W. M. (1992). A pattern recognition approach for software engineering data analysis. IEEE Transactions on Software Engineering, 18(11), 931–942.
Conte, S. D., Dunsmore, H. E., & Shen, V. Y. (1986). Software engineering metrics and models. Menlo Park, CA: Benjamin-Cummings.
Draper, N., & Smith, H. (1981). Applied regression analysis. London: Wiley.
Gallego, J. J. C., Rodriguez, D., Sicilia, M. A., Rubio, M. G., & Crespo, A. G. (2007). Software project effort estimation based on multiple parametric models generated through data clustering. Journal of Computer Science and Technology, 22(3), 371–378.
Jorgensen, M. (2002). Comments on 'a simulation tool for efficient analogy based cost estimation'. Empirical Software Engineering, 7, 375–376.
Jorgensen, M. (2003). An effort prediction interval approach based on the empirical distribution of previous estimation accuracy. Information and Software Technology, 45, 123–126.
Jorgensen, M., & Teigen, K. H. (2002). Uncertainty intervals versus interval uncertainty: An alternative method for eliciting effort prediction intervals in software development projects. In International conference on project management (ProMAC), Singapore, pp. 343–352.
Lee, A., Cheng, C. H., & Balakrishnan, J. (1998). Software development cost estimation: Integrating neural network with cluster analysis. Information and Management, 34, 1–9.
Leung, H., & Fan, Z. (2001). Software cost estimation. Handbook of software engineering and knowledge engineering. ftp://cs.pitt.edu/chang/handbook/42b.pdf.
Lum, K., Bramble, M., Hihn, J., Hackney, J., Khorrami, M., & Monson, E. (2003). Handbook for software cost estimation. NASA Jet Propulsion Laboratory, JPL D-26303.
Menzies, T., & Hihn, J. (2006). Evidence-based cost estimation for better-quality software. IEEE Software, 23(4), 64–66.
Miyazaki, Y., Terakado, M., Ozaki, K., & Nozaki, H. (1994). Robust regression for developing software estimation models. Journal of Systems and Software, 1, 3–16.
NASA. (1990). Manager's handbook for software development. Goddard Space Flight Center, Greenbelt, MD: NASA Software Engineering Laboratory.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Sentas, P., Angelis, L., & Stamelos, I. (2003). Multinomial logistic regression applied on software productivity prediction. In 9th Panhellenic conference in informatics, Thessaloniki.
Sentas, P., Angelis, L., Stamelos, I., & Bleris, G. (2005). Software productivity and effort prediction with ordinal regression. Information and Software Technology, 47, 17–29.
Shalabi, L. A., & Shaaban, Z. (2006). Normalization as a preprocessing engine for data mining and the approach of preference matrix. In IEEE proceedings of the international conference on dependability of computer systems (DEPCOS-RELCOMEX'06).
Shepperd, M., & Schofield, M. (1997). Estimating software project effort using analogies. IEEE Transactions on Software Engineering, 23(12), 736–743.
SoftLab. (2009). Software Research Laboratory, Department of Computer Engineering, Bogazici University. http://www.softlab.boun.edu.tr.
Srinivasan, K., & Fisher, D. (1995). Machine learning approaches to estimating software development effort. IEEE Transactions on Software Engineering, 21(2), 126–137.
Stamelos, I., & Angelis, L. (2001). Managing uncertainty in project portfolio cost estimation. Information and Software Technology, 43(13), 759–768.
Stamelos, I., Angelis, L., Dimou, P., & Sakellaris, E. (2003). On the use of Bayesian belief networks for the prediction of software productivity. Information and Software Technology, 45, 51–60.
Stensrud, E., Foss, T., Kitchenham, B., & Myrtveit, I. (2003). A further empirical investigation of the relationship between MRE and project size. Empirical Software Engineering.
Tadayon, N. (2005). Neural network approach for software cost estimation. International Conference on Information Technology: Coding and Computing, 2, 815–818.
Author Biographies

Ayşe Bakır received her MSc degree in computer engineering from Bogazici University in 2008 and her BSc degree in computer engineering from Gebze Institute of Technology in 2006. Her research interests include software quality modeling and software cost estimation.
Burak Turhan received his PhD in computer engineering from Bogazici University. After his postdoctoral studies at the National Research Council of Canada, he joined the Department of Information Processing Science at the University of Oulu. His research interests include empirical studies on software quality, cost/defect prediction models, test-driven development and the evaluation of new approaches for software development.
Ayşe B. Bener is an associate professor in the Ted Rogers School of Information Technology Management. Prior to joining Ryerson, Dr. Bener was a faculty member and Vice Chair in the Department of Computer Engineering at Boğaziçi University. Her research interests are software defect prediction, process improvement and software economics. Bener has a PhD in information systems from the London School of Economics. She is a member of the IEEE, the IEEE Computer Society and the ACM.