Roheet Bhatnagar et. al. / International Journal of Engineering Science and Technology Vol. 2(7), 2010, 2950-2956
Software Development Effort Estimation – Neural Network Vs. Regression Modeling Approach Roheet Bhatnagar* Associate Professor, Department of Computer Engineering, Sikkim Manipal Institute of Technology, Majitar, Rangpo, East Sikkim, 737 136 INDIA.
[email protected]
Vandana Bhattacharjee Associate Professor, Department of Computer Science and Engineering, Birla Institute of Technology, Mesra, Ranchi, 835 215 INDIA.
[email protected]
Mrinal Kanti Ghose Professor & Head, Department of Computer Engineering, Sikkim Manipal Institute of Technology, Majitar, Rangpo, East Sikkim, 737 136 INDIA.
[email protected]

Abstract: The global software development industry has become more mature and complex, and it makes use of newer tools and approaches to software development. The challenge then lies in accurately modeling and predicting the software development effort, and then creating the project development schedule. This work employs a neural network (NN) approach and a multiple regression modeling approach to model and predict the software development effort based on an available real-life dataset prepared by Lopez-Martin et al. [1, 2]. A comparison between the results obtained by the two approaches is presented. It is concluded that the NN is able to successfully model the complex, non-linear relationship between a large number of effort drivers and the software maintenance effort, with results closely matching the effort estimated by experts.

Keywords: Software Development, Software Development Effort, Project Development Schedule, Neural Network, Regression Modeling.

1. Introduction

Developing a software project with acceptable quality, within budget and on the planned schedule, is the main goal of every software development firm. Schedule estimation has historically been, and continues to be, a major difficulty in managing software development projects [3]. Project failure is mostly attributed to a failure to fulfill customers' quality expectations or to budget and schedule overruns. It is essential for a project manager to know the effort, schedule and functionality of a project in advance. There is little point in starting a project when there is not enough time to finish it, not enough money to fund it, or when the quality is so inadequate that the end product will be useless and unmarketable. However, the project factors change over the duration of the project, and they may change a lot. Worse, one can seldom predict how they will change, yet all of this needs to be known before the project starts.

There is no way to calculate in advance and expect the initial values to be correct. This does not render the estimates useless. On the contrary, it calls for better estimation techniques, which will yield more accurate early results and guide us to more targeted and effective contingency plans. Software estimation is the act of predicting the duration and cost of a project. It is a complex process with errors built into its very fabric; however, it is very rewarding when done the right way. The estimation
process does not finish until the project finishes. This is the project manager's answer to the ever-changing conditions of the project. An accurate estimate is a critical part of the foundation of an efficient software project. In this paper we discuss and evaluate two different approaches to estimating the effort of developing software, using a standard dataset.

The paper is organised into four sections. The first section is the Introduction, where estimation and its importance are discussed. Section-2 briefly discusses the working methodology and the effort estimation using the NN soft computing approach; under the respective headings of that section we also describe the experimentation steps and the findings of the experiment on the standard dataset. Section-3 presents the results and a discussion of the findings of the experimentation. Section-4 summarises the results obtained with the two approaches and provides a conclusion as to which one is the better technique.

2. Working Methodology

In the present work we have tried to predict the Development Time (DT') by applying first the Feed Forward Backpropagation Neural Network model and then Regression Analysis. The following methodology was adopted to carry out the effort estimation using the NN and Regression Analysis approaches.

2.1. Data Collection

The standard dataset proposed by Lopez-Martin et al. has been used for the experimentation. They used sets of system development projects in which the Development Time (DT), Dhama Coupling (DC), McCabe Complexity (MC) and Lines of Code (LOC) metrics were registered for 41 modules. Since all the programs were written in Pascal, the module categories mostly belong to procedures and functions.
The development time of each of the forty-one modules was registered, including five phases: requirements understanding, algorithm design, coding, compiling and testing [1, 2]. Table I shows the dataset used for carrying out the experimentation.

| No. | Module Description | MC | DC | LOC | DT in minutes |
|---|---|---|---|---|---|
| 1 | Calculates t value | 1 | 0.25 | 4 | 13 |
| 2 | Inserts a new element in a linked list | 1 | 0.25 | 10 | 13 |
| 3 | Calculates a value according to normal distribution equation | 1 | 0.333 | 4 | 9 |
| 4 | Calculates the variance | 2 | 0.083 | 10 | 15 |
| 5 | Generates range square root | 2 | 0.111 | 23 | 15 |
| 6 | Determines both minimum and maximum values from a stored linked list | 2 | 0.125 | 9 | 15 |
| 7 | Turns each linked list value into its z value | 2 | 0.125 | 9 | 16 |
| 8 | Copies a list of values from a file to an array | 2 | 0.125 | 14 | 16 |
| 9 | Determines parity of a number | 2 | 0.167 | 7 | 16 |
| 10 | Defines segment limits | 2 | 0.167 | 8 | 18 |
| 11 | From two lists (X and Y), returns the product of all xi and yi values | 2 | 0.167 | 10 | 15 |
| 12 | Calculates a sum from a vector and its average | 2 | 0.167 | 10 | 15 |
| 13 | Calculates q values | 2 | 0.167 | 10 | 18 |
| 14 | Generates the sum of a vector components | 2 | 0.2 | 10 | 13 |
| 15 | Calculates the sum of a vector values square | 2 | 0.2 | 10 | 14 |
| 16 | Calculates the average of the linked list values | 2 | 0.2 | 10 | 15 |
| 17 | Counts the number of lines of code including blanks and comments | 2 | 0.2 | 15 | 13 |
| 18 | Prints values non zero of a linked list | 2 | 0.25 | 10 | 12 |
| 19 | Stores values into a matrix | 2 | 0.25 | 10 | 12 |
| 20 | Generates range square root | 3 | 0.083 | 17 | 22 |
| 21 | Returns the number of elements in a linked list | 3 | 0.125 | 11 | 19 |
| 22 | Calculates the sum of odd segments (Simpson's formula) | 3 | 0.125 | 15 | 18 |
| 23 | Calculates the sum of pair segments (Simpson's formula) | 3 | 0.125 | 15 | 19 |
| 24 | Generates the standard deviation of the linked list values | 3 | 0.143 | 13 | 21 |
| 25 | Returns the sum of square roots of a list values | 3 | 0.143 | 14 | 20 |
| 26 | Prints a matrix | 3 | 0.143 | 14 | 21 |
| 27 | Calculates the sum of odd segments (Simpson's formula) | 3 | 0.143 | 15 | 19 |
| 28 | Calculates the sum of pair segments (Simpson's formula) | 3 | 0.143 | 15 | 20 |
| 29 | Calculates the average of linked list values | 3 | 0.167 | 13 | 15 |
| 30 | Returns the sum of a list of values | 3 | 0.167 | 14 | 13 |
| 31 | Generates the standard deviation of linked list values | 3 | 0.2 | 18 | 19 |
| 32 | Prints a linked list | 3 | 0.25 | 9 | 13 |
| 33 | Calculates gamma value (G) | 3 | 0.25 | 12 | 12 |
| 34 | Calculates the average of vector components | 3 | 0.25 | 17 | 12 |
| 35 | Calculates the range standard deviation | 4 | 0.077 | 16 | 21 |
| 36 | Calculates beta 1 value | 4 | 0.077 | 31 | 21 |
| 37 | Returns the product between values of two vectors and the number of these pairs | 4 | 0.111 | 16 | 19 |
| 38 | Counts commented lines | 4 | 0.2 | 24 | 18 |
| 39 | Reduces final matrix (according to Gauss method) | 5 | 0.143 | 22 | 24 |
| 40 | Reduces a matrix (according to Gauss method) | 5 | 0.143 | 22 | 25 |
| 41 | Counts blank lines | 5 | 0.2 | 22 | 18 |

Table I. MC: McCabe Complexity, DC: Dhama Coupling, LOC: Lines of Code, DT: Development Time (minutes)
2.2. Neural Network Modeling

Artificial neural networks (ANNs) are used in effort estimation because of their ability to learn from previous data [4][5]. They are also able to model complex relationships between the dependent variable (effort) and the independent variables (cost drivers). In addition, they can generalize from the training data set, which enables them to produce acceptable results for previously unseen data. Most work applying neural networks to effort estimation has used a feed-forward multi-layer perceptron, the back-propagation algorithm and a sigmoid activation function. However, many researchers avoid neural networks because they behave as "black boxes": determining why an ANN makes a particular decision is a difficult task. Even so, many different neural network models have been proposed for solving complex real-life problems, and in this paper we also discuss the application of an NN model to effort estimation [6].

A simplified NN architecture, shown in Figure-1, was designed using the Matlab NN Toolbox. It has one input layer (3 neurons, one for each of the inputs MC, DC and LOC), one hidden layer (with a minimum of 3 neurons) and one output layer (a single output, DT).
Figure-1 NN Architectural Model
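To make the 3-3-1 structure concrete, the following minimal sketch computes one forward pass through such a network. It is an illustration only: the logistic hidden activation, the linear output neuron, the function names and the random placeholder weights are all assumptions, and none of them are taken from the trained Matlab model.

```python
import math
import random


def sigmoid(x):
    """Logistic activation for the hidden layer (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a 3-3-1 network: 3 inputs, 3 hidden neurons, 1 output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    # Linear output neuron (assumed), so the prediction is not confined to (0, 1).
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Placeholder weights drawn at random; these are NOT the trained weights of the paper.
rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
b_hidden = [rng.uniform(-1, 1) for _ in range(3)]
w_out = [rng.uniform(-1, 1) for _ in range(3)]
b_out = rng.uniform(-1, 1)

# One module from Table I: MC = 2, DC = 0.125, LOC = 9.
dt_pred = forward([2, 0.125, 9], w_hidden, b_hidden, w_out, b_out)
print(dt_pred)
```

With untrained weights the printed value is meaningless as an estimate; back-propagation training would adjust `w_hidden`, `b_hidden`, `w_out` and `b_out` so that the output approaches the recorded DT.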
The model was then trained using 25 modules (60% of the dataset) from Table I. Of the remaining modules, 8 (20% of the dataset) were used to validate the model and another 8 (20% of the dataset) to test it. The modules for all three subsets were selected randomly by the NN tool. The resulting plot is given in Figure-2 below.
Figure-2 NN plot for Training, Validation and Testing data
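The random 25/8/8 partition described above can be sketched as follows. The helper name `split_indices` and the fixed seed are hypothetical; the actual assignment of modules to subsets was made by the Matlab toolbox and is the one listed in Table II.

```python
import random


def split_indices(n_samples, n_train, n_val, seed=None):
    """Randomly partition range(n_samples) into training, validation and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# 41 modules: 25 for training, 8 for validation, the remaining 8 for testing.
train_idx, val_idx, test_idx = split_indices(41, 25, 8, seed=42)
print(len(train_idx), len(val_idx), len(test_idx))  # 25 8 8
```

Each index appears in exactly one subset, so no module is used both to fit the weights and to assess the fitted model.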
Table II shows the Actual Effort and Feed Forward NN Predicted development time (DT') and the relative errors.

TRAINING DATA SET

| MC | DC | LOC | DT | NN prediction (DT') | Error % |
|---|---|---|---|---|---|
| 1 | 0.25 | 10 | 13 | 12.43 | 4.38 |
| 1 | 0.333 | 4 | 9 | 9.35 | -3.89 |
| 2 | 0.083 | 10 | 15 | 18.84 | -25.60 |
| 2 | 0.125 | 9 | 15 | 17.18 | -14.53 |
| 2 | 0.2 | 10 | 13 | 14.19 | -9.15 |
| 2 | 0.2 | 10 | 14 | 14.19 | -1.36 |
| 2 | 0.2 | 10 | 15 | 14.19 | 5.40 |
| 2 | 0.2 | 15 | 13 | 15.56 | -19.69 |
| 2 | 0.25 | 10 | 12 | 12.53 | -4.42 |
| 2 | 0.25 | 10 | 12 | 12.53 | -4.42 |
| 3 | 0.083 | 17 | 22 | 19.89 | 9.59 |
| 3 | 0.125 | 11 | 19 | 18.18 | 4.32 |
| 3 | 0.125 | 15 | 18 | 19.08 | -6.00 |
| 3 | 0.125 | 15 | 19 | 19.08 | -0.42 |
| 3 | 0.143 | 13 | 21 | 18.05 | 14.05 |
| 3 | 0.143 | 14 | 20 | 18.32 | 8.40 |
| 3 | 0.143 | 15 | 19 | 18.56 | 2.32 |
| 3 | 0.143 | 15 | 20 | 18.56 | 7.20 |
| 3 | 0.167 | 13 | 15 | 17.02 | -13.47 |
| 3 | 0.2 | 18 | 19 | 16.75 | 11.84 |
| 3 | 0.25 | 9 | 13 | 13.32 | -2.46 |
| 3 | 0.25 | 17 | 12 | 12.74 | -6.17 |
| 4 | 0.111 | 16 | 19 | 20.63 | -8.58 |
| 4 | 0.2 | 24 | 18 | 18.54 | -3.00 |
| 5 | 0.143 | 22 | 25 | 23.36 | 6.56 |

VALIDATION DATA SET

| MC | DC | LOC | DT | NN prediction (DT') | Error % |
|---|---|---|---|---|---|
| 2 | 0.111 | 23 | 15 | 17.65 | -17.66 |
| 2 | 0.167 | 7 | 16 | 14.67 | 8.31 |
| 3 | 0.143 | 14 | 21 | 18.32 | 12.76 |
| 3 | 0.167 | 14 | 13 | 15.33 | -17.92 |
| 3 | 0.25 | 12 | 12 | 13.08 | -9.00 |
| 4 | 0.077 | 16 | 21 | 20.86 | 0.67 |
| 4 | 0.077 | 31 | 21 | 20.59 | 1.95 |
| 5 | 0.2 | 22 | 18 | 21.40 | -18.89 |

TESTING DATA SET

| MC | DC | LOC | DT | NN prediction (DT') | Error % |
|---|---|---|---|---|---|
| 1 | 0.25 | 4 | 13 | 12.29 | 5.46 |
| 2 | 0.125 | 9 | 16 | 17.18 | -7.38 |
| 2 | 0.125 | 14 | 16 | 18.52 | -15.75 |
| 2 | 0.167 | 8 | 18 | 15.54 | 13.66 |
| 2 | 0.167 | 10 | 15 | 15.58 | -3.87 |
| 2 | 0.167 | 10 | 15 | 15.58 | -3.87 |
| 2 | 0.167 | 10 | 18 | 15.58 | 13.44 |
| 5 | 0.143 | 22 | 24 | 23.36 | 2.67 |

Table II – Actual Effort (DT) and NN predicted Efforts (DT')
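The Error % column of Table II is consistent with the signed relative error (DT - DT') / DT x 100, which is positive when the network under-predicts. The short sketch below recomputes it for the eight testing rows; the function name is hypothetical, and because the tabulated predictions are rounded to two decimals, the recomputed values may differ from the table in the last digit.

```python
def relative_error_pct(actual, predicted):
    """Signed relative error in percent: positive when the model under-predicts."""
    return (actual - predicted) / actual * 100.0


# (actual DT, NN-predicted DT') pairs for the testing rows of Table II.
testing = [(13, 12.29), (16, 17.18), (16, 18.52), (18, 15.54),
           (15, 15.58), (15, 15.58), (18, 15.58), (24, 23.36)]

errors = [relative_error_pct(dt, dt_hat) for dt, dt_hat in testing]
for (dt, dt_hat), err in zip(testing, errors):
    print(f"DT={dt:2d}  DT'={dt_hat:5.2f}  error={err:7.2f}%")

# Smallest and largest signed error over the testing subset.
print(f"range: {min(errors):.2f}% to {max(errors):.2f}%")
```

The same function applied to the training and validation rows gives the error ranges for those subsets.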
2.3. Statistical Analysis and Regression Modeling

Before conducting the regression analysis, we checked whether the data were normally distributed. Figure-3 shows a histogram of the development time data, which follows a normal distribution.
Figure-3 Histogram showing normal distribution of development time data (DT)
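A rough, descriptive version of this check can be sketched with the standard library alone. The snippet below bins the 41 DT values of Table I into a text histogram and reports the mean, standard deviation and skewness. The 2-minute bin width is an arbitrary assumption, and this is not a formal normality test nor the procedure used to produce Figure-3.

```python
import statistics

# Development times (minutes) of the 41 modules, in Table I order.
dt = [13, 13, 9, 15, 15, 15, 16, 16, 16, 18, 15, 15, 18, 13, 14, 15, 13, 12, 12,
      22, 19, 18, 19, 21, 20, 21, 19, 20, 15, 13, 19, 13, 12, 12, 21, 21, 19, 18,
      24, 25, 18]

mean = statistics.mean(dt)
sd = statistics.pstdev(dt)
# Skewness (population form): close to 0 for a symmetric distribution.
skew = sum(((x - mean) / sd) ** 3 for x in dt) / len(dt)

# Text histogram with 2-minute bins starting at 8 minutes (assumed binning).
lo, width = 8, 2
counts = {}
for x in dt:
    start = lo + ((x - lo) // width) * width
    counts[start] = counts.get(start, 0) + 1
for start in sorted(counts):
    print(f"{start:2d}-{start + width - 1:2d} {'#' * counts[start]}")

print(f"mean={mean:.2f} sd={sd:.2f} skewness={skew:.2f}")
```

A skewness near zero and a single central peak in the bars are what one would expect from roughly normal data; with only 41 observations such a check is indicative at best.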
From the dataset, MC, DC and LOC were taken as inputs and DT as the output. A linear regression model was obtained by stepwise regression using the commercial package STATISTICA. Table III lists the DT predicted through the regression analysis.
| Actual (DT) | Predicted by Regression Analysis (DT') | Error % |
|---|---|---|
| 13.00000 | 10.85161 | 16.52607692 |
| 13.00000 | 10.85161 | 16.52607692 |
| 9.00000 | 8.18266 | 9.081555556 |
| 15.00000 | 18.09036 | -20.6024 |
| 15.00000 | 17.18999 | -14.59993333 |
| 15.00000 | 16.73981 | -11.59873333 |
| 16.00000 | 16.73981 | -4.6238125 |
| 16.00000 | 16.73981 | -4.6238125 |
| 16.00000 | 15.38925 | 3.8171875 |
| 18.00000 | 15.38925 | 14.50416667 |
| 15.00000 | 15.38925 | -2.595 |
| 15.00000 | 15.38925 | -2.595 |
| 18.00000 | 15.38925 | 14.50416667 |
| 13.00000 | 14.32810 | -10.21615385 |
| 14.00000 | 14.32810 | -2.343571429 |
| 15.00000 | 14.32810 | 4.479333333 |
| 13.00000 | 14.32810 | -10.21615385 |
| 12.00000 | 12.72030 | -6.0025 |
| 12.00000 | 12.72030 | -6.0025 |
| 22.00000 | 19.95905 | 9.277045455 |
| 19.00000 | 18.60849 | 2.060578947 |
| 18.00000 | 18.60849 | -3.3805 |
| 19.00000 | 18.60849 | 2.060578947 |
| 21.00000 | 18.02969 | 14.14433333 |
| 20.00000 | 18.02969 | 9.85155 |
| 21.00000 | 18.02969 | 14.14433333 |
| 19.00000 | 18.02969 | 5.106894737 |
| 20.00000 | 18.02969 | 9.85155 |
| 15.00000 | 17.25794 | -15.05293333 |
| 13.00000 | 17.25794 | -32.75338462 |
| 19.00000 | 16.19679 | 14.75373684 |
| 13.00000 | 14.58899 | -12.223 |
| 12.00000 | 14.58899 | -21.57491667 |
| 12.00000 | 14.58899 | -21.57491667 |
| 21.00000 | 22.02067 | -4.860333333 |
| 21.00000 | 22.02067 | -4.860333333 |
| 19.00000 | 20.92737 | -10.14405263 |
| 18.00000 | 18.06548 | -0.363777778 |
| 24.00000 | 21.76707 | 9.303875 |
| 25.00000 | 21.76707 | 12.93172 |
| 18.00000 | 19.93417 | -10.74538889 |

Table III – Actual Effort (DT) and Regression Analysis Predicted Efforts (DT')
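For readers who want to experiment with a linear model on the same data, the sketch below fits DT on MC, DC and LOC by ordinary least squares using the normal equations. It is an assumption-laden illustration: it keeps all three predictors and does not reproduce the stepwise selection performed in STATISTICA, so its coefficients and predictions need not match Table III. The helper names `solve` and `fit_ols` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


# Table I columns, in module order 1..41.
mc = [1] * 3 + [2] * 16 + [3] * 15 + [4] * 4 + [5] * 3
dc = [0.25, 0.25, 0.333, 0.083, 0.111, 0.125, 0.125, 0.125, 0.167, 0.167, 0.167,
      0.167, 0.167, 0.2, 0.2, 0.2, 0.2, 0.25, 0.25, 0.083, 0.125, 0.125, 0.125,
      0.143, 0.143, 0.143, 0.143, 0.143, 0.167, 0.167, 0.2, 0.25, 0.25, 0.25,
      0.077, 0.077, 0.111, 0.2, 0.143, 0.143, 0.2]
loc = [4, 10, 4, 10, 23, 9, 9, 14, 7, 8, 10, 10, 10, 10, 10, 10, 15, 10, 10, 17,
       11, 15, 15, 13, 14, 14, 15, 15, 13, 14, 18, 9, 12, 17, 16, 31, 16, 24, 22,
       22, 22]
dt = [13, 13, 9, 15, 15, 15, 16, 16, 16, 18, 15, 15, 18, 13, 14, 15, 13, 12, 12,
      22, 19, 18, 19, 21, 20, 21, 19, 20, 15, 13, 19, 13, 12, 12, 21, 21, 19, 18,
      24, 25, 18]

coef = fit_ols(list(zip(mc, dc, loc)), dt)
pred = [coef[0] + coef[1] * m + coef[2] * d + coef[3] * l
        for m, d, l in zip(mc, dc, loc)]
print("intercept, MC, DC, LOC coefficients:", [round(c, 3) for c in coef])
```

A stepwise procedure would additionally drop predictors whose contribution is not statistically significant, which is one reason the fitted model here may differ from the one behind Table III.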
3. Result and Discussion

A comparison of the 3-3-1 NN output with the measured values of effort shows the % error varying from +14.05 to -25.60, +12.76 to -18.89 and +13.66 to -15.75 for the training dataset (25 modules), validation dataset (8 modules) and testing dataset (8 modules), respectively. For comparison, the regression errors in Table III range from +16.53 to -32.75 over all 41 modules. A much simplified NN architecture was thus able to model the non-linear relationship between the 3 input variables and the single output parameter. The performance of the NN may be improved further by increasing the number of neurons in the hidden layer and retraining the model with the data. The performance is also expected to improve with larger datasets.

4. Conclusion

In this paper, the effectiveness of the NN modeling approach to effort estimation on a standard dataset was presented. The NN model trained using the experimental data was found to have good generalization capabilities and was able to predict the effort, closely matching the experimental observations. Since the effect of the various cost drivers on effort is often quite complex, an ANN can be used as an effective tool to model and predict the development effort. However, the models should also be evaluated on a variety of historical and unseen input data, and the model can be adapted and tested for early effort estimation in software development.

5. Running Heads

SDEENNRMA
6. References

[1] C. Lopez-Martin, C. Yanez-Marquez, A. Gutierrez-Tornes, "Predictive accuracy comparison of fuzzy models for software development effort of small programs", The Journal of Systems and Software, Vol. 81, Issue 6, 2008, pp. 949-960.
[2] C.L. Martin, J.L. Pasquier, M.C. Yanez, T.A. Gutierrez, "Software Development Effort Estimation Using Fuzzy Logic: A Case Study", IEEE Proceedings of the Sixth Mexican International Conference on Computer Science (ENC'05), 2005, pp. 113-120.
[3] Steve McConnell, Rapid Development: Taming Wild Software Schedules, Microsoft Press, 1996.
[4] A. Idri, T.M. Khoshgoftaar, A. Abran, "Can neural networks be easily interpreted in software cost estimation?", IEEE Trans. Software Engineering, Vol. 2, 2002, pp. 1162-1167.
[5] A. Idri, A. Abran, T.M. Khoshgoftaar, "Estimating software project effort by analogy based on linguistic values", in Proceedings of the Eighth IEEE Symposium on Software Metrics, 4-7 June 2002, pp. 21-30.
[6] H. Park, S. Baek, "An empirical validation of a neural network model for software effort estimation", Expert Systems with Applications, 2007.