International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 3, March 2012)

Incorporating Data Mining Techniques on Software Cost Estimation: Validation and Improvement

¹Narendra Sharma, ²Ratnesh Litoriya

Department of Computer Science and Engineering, Jaypee University of Engineering & Technology, Guna, India

¹[email protected], ²[email protected]

Abstract— Generally, data mining is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. We use the Weka data mining tool to identify the important and common cost drivers that are used to generate the estimate of a project. Cost drivers are multiplicative factors that determine the effort required to complete a software project. In analogy-based estimation models, the cost drivers are the basis of the estimate: a new project is estimated by comparing it with past project data and setting the values of its cost drivers accordingly. The aim of this research work is to identify the important cost drivers in past project data with the help of the Weka data mining tool. This knowledge is then applied in cost estimation models to generate an approximate estimate on the basis of past project data. In this research we try to identify the common cost drivers that affect the cost of a project, and we use the Agile COCOMO model to estimate the cost of the new project [2].

This paper investigates the systemic cost estimation issues that have been identified and the best performing machine learning techniques. We have found that Agile COCOMO II, a software estimation model with publicly available algorithms developed by Barry Boehm et al. [9], is a very robust model: it generates more accurate results when the past project data are very similar to the new project. However, these results were only internally validated, using leave-one-out cross validation, with the historical data within the data mining system. We seek to find the prediction accuracy of the new model developed by the data mining system against new external data, in order to evaluate the true effectiveness of these models in comparison with standard cost models that do not use machine learning techniques. In this research we use the Weka tool for data mining. The main aim of the research is to increase the efficiency of software cost estimation with the help of data mining techniques [1,3].

Keywords— Data mining, Agile COCOMO software estimation tool, Weka data mining tool, software engineering.

I. INTRODUCTION
Cost estimation is the process of approximating the probable cost of a product, program, or project, computed on the basis of available information. Accurate cost estimation is very important for every kind of project: if we do not estimate a project properly, its cost can run very high, sometimes reaching 150-200% more than the original estimate. It is therefore very necessary to estimate the project correctly. In this research we work with two different fields, software engineering and data mining. Data mining helps us classify past project data and generate valuable information from it.

II. INTRODUCTION OF DATA MINING AND WEKA TOOLS
We know that software cost estimation models are often unable to produce accurate estimates: estimates can be off by more than 50% from the actual cost, and sometimes by as much as 150-200%. We therefore need new methods or models that can bring us closer to the actual costs, and their accuracy is being investigated. Even methods that show a small improvement are considered a great success in the field of software estimation [2].


With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for the analysis and interpretation of such data and for the extraction of interesting knowledge that could help in decision-making. Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and knowledge discovery in databases (KDD) are frequently treated as synonyms, data mining is actually part of the knowledge discovery process [5,7]. Data mining, at its core, is the transformation of large amounts of data into meaningful patterns and rules. Further, it can be broken down into two types: directed and undirected. In directed data mining, you are trying to predict a particular data point: the sale price of a house given information about other houses for sale in the neighborhood, for example. In undirected data mining, you are trying to create groups of data, or find patterns in existing data: creating the "Soccer Mom" demographic group, for example. In effect, every U.S. census is data mining, as the government looks to gather data about everyone in the country and turn it into useful information. Today data mining is used in every type of application, such as banking, insurance, medicine, and education.

WEKA
Data mining is not solely the domain of big companies and expensive software. In fact, there is a piece of software that does almost all the same things as these expensive packages: it is called Weka. Weka is a product of the University of Waikato (New Zealand) and was first implemented in its modern form in 1997. It is released under the GNU General Public License (GPL). The Weka interface is shown in Figure 1. The software is written in the Java™ language and contains a GUI for interacting with data files and producing visual results (think tables and curves). It also has a general API, so you can embed Weka, like any other library, in your own applications for things such as automated server-side data mining tasks. We use the k-means clustering algorithm to group the data. Working with Weka does not require deep knowledge of data mining, which is one reason it is such a popular data mining tool; it also provides a graphical user interface and many other facilities [4, 7].

A. Some Basic Operations of Data Mining

Regression –
Regression is the oldest and most well-known statistical technique that the data mining community utilizes. Basically, regression takes a numerical dataset and develops a mathematical formula that fits the data. When you are ready to use the results to predict future behavior, you simply take your new data, plug it into the developed formula, and you have a prediction. The major limitation of this technique is that it only works well with continuous quantitative data (like weight, speed or age). If you are working with categorical data where order is not significant (like color, name or gender), you are better off choosing another technique [2,7].
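To make the idea concrete, the following is a minimal pure-Python sketch of fitting a least-squares line that relates project size to effort and using it to predict a new project; the size/effort pairs are made-up illustrative values, not the PROMISE data used later in this paper.

```python
# Minimal sketch of the regression idea: fit a least-squares line relating
# project size (KSLOC) to effort (person-months), then predict a new project.
# The numbers below are made-up illustrative values.

sizes   = [10.0, 25.0, 46.0, 80.0, 120.0]    # KSLOC of past projects
efforts = [24.0, 62.0, 110.0, 215.0, 330.0]  # person-months actually spent

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(efforts) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, efforts)) \
        / sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

def predict(ksloc):
    return intercept + slope * ksloc

print(f"effort ~= {intercept:.1f} + {slope:.2f} * KSLOC")
print(f"predicted effort for a 60 KSLOC project: {predict(60.0):.1f} person-months")
```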

B. Classification –
Working with categorical data, or a mixture of continuous numeric and categorical data? Classification analysis might suit your needs well. This technique is capable of processing a wider variety of data than regression and is growing in popularity. You will also find the output much easier to interpret: instead of the complicated mathematical formula given by the regression technique, you receive a decision tree that requires only a series of binary decisions. A related and very popular technique, described next, is the k-means clustering algorithm.

C. The k-means Algorithm –
K-means clustering is a data mining/machine learning algorithm used to cluster observations into groups of related observations without any prior knowledge of those relationships. It is one of the simplest clustering techniques and is commonly used in medical imaging, biometrics and related fields. The k-means algorithm is an iterative algorithm that gains its name from its method of operation. It clusters observations into k groups, where k is provided as an input parameter. It assigns each observation to a cluster based upon the observation's proximity to the mean of that cluster; the cluster's mean is then recomputed and the process begins again. Here is how the algorithm works [7]:
1. The algorithm arbitrarily selects k points as the initial cluster centres ("means").
2. Each point in the dataset is assigned to the closest cluster, based upon the Euclidean distance between each point and each cluster centre.
3. Each cluster centre is recomputed as the average of the points in that cluster.
4. Steps 2 and 3 repeat until the clusters converge. Convergence may be defined differently depending upon the implementation, but it normally means that either no observations change clusters when steps 2 and 3 are repeated or that the changes do not make a material difference in the definition of the clusters.
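The sketch below is a minimal pure-Python rendering of these four steps, run on toy two-dimensional points; in our actual analysis the clustering is done in Weka, so this is only meant to illustrate the loop.

```python
# Minimal k-means sketch following steps 1-4 above, on toy 2-D points.
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    random.seed(seed)
    centres = random.sample(points, k)                  # step 1: arbitrary initial means
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                # step 2: assign to closest centre
            distances = [math.dist(p, c) for c in centres]
            clusters[distances.index(min(distances))].append(p)
        new_centres = [
            tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else centres[i]
            for i, cl in enumerate(clusters)            # step 3: recompute each mean
        ]
        if new_centres == centres:                      # step 4: stop when converged
            break
        centres = new_centres
    return centres, clusters

pts = [(1, 1), (1.5, 2), (5, 8), (8, 8), (1, 0.6), (9, 11)]
centres, clusters = kmeans(pts, k=2)
print(centres)
```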





III. INTRODUCTION OF COST ESTIMATION
In recent years, software has become the most expensive component of computer system projects. The bulk of the cost of software development is due to human effort, and most cost estimation methods focus on this aspect and give estimates in terms of person-months [9].


Accurate software cost estimates are critical to both developers and customers. They can be used for generating requests for proposals, contract negotiations, scheduling, monitoring and control. Underestimating the costs may result in management approving proposed systems that then exceed their budgets, with underdeveloped functions and poor quality, and failure to complete on time. Overestimating may result in too many resources being committed to the project, or, during contract bidding, in not winning the contract, which can lead to loss of jobs [6].

Figure 1: Front view of Weka

IV. WHY WE NEED THIS STUDY
There are many techniques available for software cost estimation, but they are not very effective. More work needs to be done on using data mining in software engineering, and we try to predict better results by combining both fields.
V. EXISTING METHODS FOR ESTIMATION
Estimation is the process of determining the amount of effort, money, resources and time needed to build a software project with the help of the available quality information. Many estimation methods have been proposed in the last 30 years, and almost all of them require quantitative information about productivity, project size and other important factors that affect the project. There are various practices of software estimation, such as analogy, expert opinion and empirical practices [Jones, 2007]. Analogy-based practices require historical project data as an input for comparison, whereas expert opinion is intuition based [Jorgenson and Sheppard, 2007]. The empirical way is the practice of deriving the cost of software using a mathematical/algorithmic model; examples of methods that use such practices are the FP-based method and the COCOMO II method. Traditional software development methods mostly follow either COCOMO II or FP-based estimation successfully because a complete set of requirement specifications is available.

Data mining techniques are being used extensively in a variety of fields. They have been frequently applied in the business arena for customer relationship management and market analysis. In addition to the multitude of applications of data mining, there has been parallel research in improving data mining algorithms. While data mining techniques have been applied across broad domains, they have rarely been applied in the field of software cost estimation, a subfield of software engineering [4].


The figure below shows the methodology of this research: we apply the k-means clustering algorithm and classify the data.


Figure 2: Functional diagram of the existing methodology

VI. 2CEE COST ESTIMATION TOOLS


2CEE (21st Century Effort Estimation) is a cost estimation tool that can be used in both the data mining and software engineering fields. It was developed for NASA and is copyrighted by NASA. It uses a variety of data mining and machine learning techniques (nearest neighbour, feature subset selection, bootstrapping, local calibration) to propose the most accurate software cost model. It is designed to explore the uncertainty in the model and in the estimate, to allow estimates early in the lifecycle by representing new projects as ranges of values, and to provide numerous calibration options. 2CEE has been encoded in a Windows-based tool that can be used both to generate an estimate and to let the model developer calibrate and develop models using various machine learning, data mining, and statistical techniques. By automating many tasks for the user, it improves cost analyst efficiency. 2CEE uses leave-one-out cross validation as a measure of model performance.

2CEE is the first model to combine machine learning algorithms and cost estimation algorithms for generating project cost estimates, but it was specially designed for NASA, so we cannot use it publicly; it nevertheless gives important guidelines for new researchers [2, 4].

Agile COCOMO model -- A COCOMO™ tool that is very simple to use and easy to learn. It incorporates the full COCOMO™ parametric model and uses analogy-based estimation to generate accurate results for a new project. Estimation by analogy is one of the most popular ways to estimate software cost and effort. While comparing similarities between the new and old projects provides a good way to estimate, results can still be inaccurate if differences between the two projects are overlooked, especially if the grounds of dissimilarity are important. To build on the estimation-by-analogy approach while accounting for differences between projects, USC-CSE created Agile COCOMO II, a cost estimation tool based on COCOMO II. It uses analogy-based estimation to generate accurate results while being very simple to use and easy to learn. It can estimate a project in various ways, as shown in Figure 5: in terms of person-months, dollars, object points, function points, and so on. In this paper, we discuss the motivation for the program, the program's structure, the results of our research, and provide insight into the future direction of this tool [10].
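As a rough illustration of the analogy idea (not the published Agile COCOMO II algorithm), the sketch below picks the most similar past project by its cost-driver ratings and scales its actual effort by the size ratio; all the numbers are hypothetical.

```python
# Rough sketch of analogy-based estimation: find the nearest past project by
# its cost-driver ratings, then scale its actual effort by the size ratio.
# The data and the distance measure are illustrative assumptions only.

past_projects = [
    # (cost-driver ratings as multipliers, size in KSLOC, actual effort in PM)
    ([1.00, 1.08, 1.15, 0.91], 32.0, 110.0),
    ([1.10, 1.00, 1.00, 1.00], 12.0, 35.0),
    ([0.88, 1.16, 1.30, 1.10], 55.0, 260.0),
]

def estimate_by_analogy(new_drivers, new_size):
    def distance(project):
        drivers, _, _ = project
        return sum((a - b) ** 2 for a, b in zip(drivers, new_drivers))
    best = min(past_projects, key=distance)       # most similar past project
    _, best_size, best_effort = best
    return best_effort * (new_size / best_size)   # scale its effort by size

print(round(estimate_by_analogy([1.05, 1.00, 1.10, 0.95], 20.0), 1))
```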

VII. AN INTRODUCTION OF SCALE FACTORS AND COST DRIVERS
A. The Scale Drivers
In the COCOMO II model, some of the most important factors contributing to a project's duration and cost are the Scale Drivers. You set each Scale Driver to describe your project; these Scale Drivers determine the exponent used in the Effort Equation. There are five Scale Drivers in the COCOMO model, and each plays an important role in the estimation [5,9]. The five Scale Drivers are:




1. Precedentedness
2. Development Flexibility
3. Architecture / Risk Resolution
4. Team Cohesion
5. Process Maturity

B. Cost Drivers
COCOMO II has 17 cost drivers that you rate for your project, development environment, and team. The cost drivers are multiplicative factors that determine the effort required to complete your software project. For example, if your project will develop software that controls an airplane's flight, you would set the Required Software Reliability (RELY) cost driver to Very High. That rating corresponds to an effort multiplier of 1.26, meaning that your project will require 26% more effort than a typical software project. In the COCOMO model the cost drivers are divided into the four groups shown below, and some of them are briefly introduced afterwards [5].

Product factors:
1. Required Software Reliability
2. Data Base Size
3. Required Reusability
4. Documentation Match to Life-Cycle Needs, etc.

Platform factors:
1. Execution Time Constraint
2. Platform Volatility

Personnel factors:
1. Analyst Capability
2. Programmer Capability
3. Applications Experience
4. Platform Experience
5. Personnel Continuity
6. Use of Software Tools

Project factors:
1. Required Development Schedule
2. Multisite Development, etc.

C. Introduction of Some Cost Drivers
1. Required Software Reliability (RELY)
This is the measure of the extent to which the software must perform its intended function over a period of time. If the effect of a software failure is only slight inconvenience, then RELY is low. If a failure would risk human life, then RELY is very high.

2. Data Base Size (DATA)
This measure attempts to capture the effect that large data requirements have on product development. The rating is determined by calculating D/P, the ratio of the database size to the program size. The reason the size of the database is important to consider is the effort required to generate the test data that will be used to exercise the program.

3. Product Complexity (CPLX)
Complexity is divided into five areas: control operations, computational operations, device-dependent operations, data management operations, and user interface management operations. Select the area or combination of areas that characterize the product or a sub-system of the product. The complexity rating is the subjective weighted average of these areas.

4. Required Reusability (RUSE)
This cost driver accounts for the additional effort needed to construct components intended for reuse on the current or future projects. This effort is consumed in creating a more generic design of the software, more elaborate documentation, and more extensive testing to ensure the components are ready for use in other applications.

5. Execution Time Constraint (TIME)
This is a measure of the execution-time constraint imposed upon a software system. The rating is expressed in terms of the percentage of available execution time expected to be used by the system or subsystem consuming the execution-time resource. The rating ranges from Nominal, less than 50% of the execution-time resource used, to Extra High, 95% of the execution-time resource consumed.


6. Analyst Capability (ACAP)
Analysts are personnel who work on requirements, high-level design and detailed design. The major attributes that should be considered in this rating are analysis and design ability, efficiency and thoroughness, and the ability to communicate and cooperate. The rating should not consider the level of experience of the analyst; that is rated with AEXP. Analysts who fall in the 15th percentile are rated very low and those who fall in the 95th percentile are rated very high.

7. Programmer Capability (PCAP)
Current trends continue to emphasize the importance of highly capable analysts. However, the increasing role of complex COTS packages, and the significant productivity leverage associated with programmers' ability to deal with these COTS packages, indicate a trend toward higher importance of programmer capability as well. Evaluation should be based on the capability of the programmers as a team rather than as individuals. Major factors to consider in the rating are ability, efficiency and thoroughness, and the ability to communicate and cooperate. The experience of the programmer should not be considered here; it is rated with AEXP. A very low rated programmer team is in the 15th percentile and a very high rated programmer team is in the 95th percentile.

8. Applications Experience (AEXP)
This rating is dependent on the level of applications experience of the project team developing the software system or subsystem. The ratings are defined in terms of the project team's equivalent level of experience with this type of application. A very low rating is for application experience of less than 2 months; a very high rating is for experience of 6 years or more.

9. Platform Experience (PEXP)
The Post-Architecture model broadens the productivity influence of PEXP, recognizing the importance of understanding the use of more powerful platforms, including more graphical user interface, database, networking, and distributed middleware capabilities.

10. Use of Software Tools (TOOL)
Software tools have improved significantly since the 1970s projects that were used to calibrate COCOMO™. The tool rating ranges from simple edit and code (very low) to integrated lifecycle management tools (very high) [5].

VIII. COCOMO II EFFORT EQUATION
The COCOMO II model makes its estimates of required effort (measured in person-months, PM) based primarily on your estimate of the software project's size (measured in thousands of source lines of code, KSLOC):

Effort = 2.94 * EAF * (KSLOC)^E

where EAF is the Effort Adjustment Factor derived from the cost drivers and E is an exponent derived from the five Scale Drivers. As an example, a project with all Nominal cost drivers and Scale Drivers would have an EAF of 1.00 and an exponent E of 1.0997. Assuming that the project is projected to consist of 8,000 source lines of code, COCOMO II estimates that 28.9 person-months of effort are required to complete it [1,9]:

Effort = 2.94 * (1.0) * (8)^1.0997 = 28.9 person-months.
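The effort equation and the role of the cost-driver multipliers can be captured in a few lines. The sketch below assumes the constant 2.94 and the nominal exponent E = 1.0997 quoted above, and simply treats EAF as the product of the chosen multipliers; the RELY multiplier of 1.26 is the value given earlier in the text.

```python
# Minimal sketch of the COCOMO II effort equation: Effort = A * EAF * KSLOC^E,
# with A = 2.94 and the nominal exponent E = 1.0997 quoted above. The cost-driver
# multipliers passed in are illustrative, not a calibrated rating table.

from math import prod

def cocomo2_effort(ksloc, cost_driver_multipliers=None, exponent=1.0997, a=2.94):
    """Return estimated effort in person-months."""
    eaf = prod(cost_driver_multipliers) if cost_driver_multipliers else 1.0
    return a * eaf * ksloc ** exponent

# Worked example from the text: all-nominal drivers (EAF = 1.0), 8 KSLOC.
print(round(cocomo2_effort(8), 1))          # ~28.9 person-months

# Setting RELY to Very High (multiplier 1.26) raises the estimate by 26%.
print(round(cocomo2_effort(8, [1.26]), 1))  # ~36.5 person-months
```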


Methodology
Our methodology is simple: we combine two different fields, data mining and software engineering, and try to generate an accurate cost estimate for a new project with the help of past project data whose cost or effort is known, identifying the common cost factors along the way. We used the Weka tool for data mining, the Agile COCOMO tool for software estimation, and the PROMISE data set for the analysis.


IX. DATASET
We use a PROMISE Software Engineering Repository data set, made publicly available in order to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering. The data files are in the .arff and .csv formats, so they can be loaded directly into Weka and the various algorithms applied to them; the results from Weka are then applied in the Agile COCOMO model.
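Before loading a PROMISE file into Weka, it can be inspected with a few lines of Python. The file name cocomo_nasa.csv and the column name rely below are assumptions for illustration; the actual repository files vary.

```python
# Minimal sketch of inspecting one of the PROMISE effort data sets (CSV form)
# before loading it into Weka. File and column names are assumed for illustration.

import csv
from collections import Counter

with open("cocomo_nasa.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(f"{len(rows)} past projects, attributes: {list(rows[0].keys())}")

# Quick look at how one cost driver (e.g. RELY) is rated across the projects.
print(Counter(row.get("rely", "?") for row in rows))
```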



Result – The Agile COCOMO model is an analogy model: we estimate the new project by comparing it with past project data, where the features of the new project are very similar to those of the past projects. With the help of Weka and Agile COCOMO we predicted some useful results. In this research we took 60 past NASA project records whose efforts are already known; the list of projects is shown in Figure 3. We searched for the common cost drivers and scale factors that mainly affect the project estimate. With the help of the Agile COCOMO model we changed one of the values of the cost drivers or scale factors and predicted the resulting estimate. Figure 4 shows the classification obtained after applying the k-means clustering algorithm. With the help of clustering we grouped similar cost drivers together; these cost drivers are very helpful for estimating new projects.

Weka provides facilities to classify the data; we used the Apriori algorithm. It also offers both a graphical user interface and a command-line interface. Tables 1 and 2 show the cost drivers found after the analysis of the past project data; these cost drivers are used in every type of project.

Figure 4: Result of k-means clustering

The next figure shows the front view of the Agile COCOMO model. It provides the facility to estimate the project in various ways: in terms of dollars, person-months, function points, object points, and so on.

Figure 3: Past project data set in Weka

Figure 3 shows the different cost drivers used in the various past projects. We used data from 60 past NASA projects and loaded them into Weka; the figure shows the actual effort of each past project. We take these as base values in the Agile COCOMO model and then set new values for the cost drivers. After applying k-means clustering we obtain clusters that store similar cost drivers; the result of the k-means clustering is shown in Figure 4. With the help of clustering we grouped instances with similar behaviour into clusters.

Figure 5: Front view of the Agile COCOMO model


The next figure shows the various cost factors. We set new values for the cost factors and change them with respect to the past project cost drivers. We found some important and useful cost drivers that can be used in every project and that are responsible for increasing or decreasing the cost of a project; these cost drivers are shown in Tables 1 and 2.

Figure 6: The various cost drivers

Table 1: Cost drivers whose values are increased (increase these to decrease effort)
ACAP: analyst capability
PCAP: programmer capability
AEXP: applications experience
MODP: modern programming practices
TOOL: use of software tools
LEXP: language experience, etc.

Table 2: Cost drivers whose values are decreased (decrease these to decrease the cost of the project)
STOR: main memory constraint
DATA: database size
TIME: execution time constraint (CPU)
VIRT: machine volatility
RELY: required software reliability, etc.

X. CONCLUSION
These results suggest that building data mining and machine learning techniques into existing software estimation techniques such as COCOMO can effectively improve the performance of a proven method. We used the Weka tool for data mining because it contains many different machine learning algorithms that help us classify the data easily. We understand that there is a lack of serious research in this field; our main aim is to show that data mining is also very useful for the field of software engineering. Not all data mining techniques performed better than the traditional method of local calibration. However, a couple of techniques used in combination did provide more accurate software cost models than the traditional technique. While the best combination of data mining techniques was not consistent across the different stratifications of data, this shows that there are different populations of software projects and that rigorous data collection should be continued to improve the development of accurate cost estimation models.

On the basis of this research we can say that cost drivers and scale factors play an important role in estimation whenever an analogy model is used. We found some common cost drivers that can be used for all projects. Future work will need to investigate more data mining algorithms that can help improve the process of software cost estimation and are easy to use. The main reason for choosing the COCOMO model for this research is that it is one of the best software cost estimation models and it is easily available publicly.



References
[1] "COCOMO II Model Definition Manual", Version 1.4, University of Southern California.
[2] Karen T. Lum, Daniel R. Baker, and Jairus M. Hihn, "The Effects of Data Mining Techniques on Software Cost Estimation", IEEE, 2009.
[3] Zhihao Chen, Tim Menzies, Dan Port, "Feature Subset Selection Can Improve Software Cost Estimation Accuracy", Center for Software Engineering, University of Southern California.
[4] Jairus Hihn, Karen Lum, "2CEE, A Twenty First Century Effort Estimation Methodology", Lane Dept. CSEE, West Virginia University, ISPA/SCEA 2009 Joint International Conference.
[5] Oscar Marbán, Antonio de Amescua, Juan J. Cuadrado, Luis García, "Cost Drivers of a Parametric Cost Estimation Model for Data Mining Projects", Notes, vol. 30, no. 4, pp. 1-6, 2005.
[6] Oscar Marbán, Antonio de Amescua, Juan J. Cuadrado, Luis García, "A Cost Model to Estimate the Effort of Data Mining Projects", Universidad Carlos III de Madrid (UC3M).
[7] Alassane Ndiaye and Dominik Heckmann, "Weka: Practical Machine Learning Tools and Techniques with Java Implementations", AI Tools Seminar, University of Saarland, WS 06/07.
[8] S. Chandrasekaran, R. Lavanya and V. Kanchana, "Multicriteria Approach for Agile Software Cost Estimation Model".
[9] Capers Jones, "Estimating Software Costs", Tata McGraw-Hill Edition, 2007.
[10] http://sunset.usc.edu/cse/pub/research/AgileCOCOMO/AgileCOCOMOII/Main.html
